[Beowulf] cluster deployment and config management

psc pscadmin at avalon.umaryland.edu
Tue Sep 5 11:19:42 PDT 2017


Hey everyone .. any idea what happened with Perceus?
http://www.linux-mag.com/id/6386/
https://github.com/perceus/perceus

.. yeah; whatever happened to Arthur Stevens (Perceus, GravityFS/OS Green 
Provisioning, etc.)?  Where is he now, and who is maintaining Perceus, 
if anyone?

.. and come on, Greg K. ... we know you are lurking there somewhere, 
busy with Singularity
http://singularity.lbl.gov/ (kudos .. great job as always !!!)
.. wasn't Perceus originally your baby?
https://gmkurtzer.github.io/
.. can you shed some light on what happened with the Perceus project? .. 
I'd love to see it integrated with Singularity -- that would make my 
day/month/year  !!!!!!

thanks!
cheers,
psc

p.s. .. there also used to be Rocks Clusters (not sure about its 
status these days)
http://www.rocksclusters.org/wordpress/

p.p.s. .. I'd say Warewulf is the "best" bet in most cases .. why keep 
reinventing the wheel ?


On 09/05/2017 01:43 PM, beowulf-request at beowulf.org wrote:
> Message: 1
> Date: Tue, 5 Sep 2017 08:20:03 -0400
> From: Joe Landman <joe.landman at gmail.com>
> To: beowulf at beowulf.org
> Subject: Re: [Beowulf] cluster deployment and config management
>
> Good morning ...
>
>
> On 09/05/2017 01:24 AM, Stu Midgley wrote:
>> Morning everyone
>>
>> I am in the process of redeveloping our cluster deployment and config
>> management environment and wondered what others are doing?
>>
>> First, everything we currently have is basically home-grown.
> Nothing wrong with this, if it adequately solves the problem.  Many of
> the frameworks people use for these things are highly opinionated, and
> often you'll find their opinions grate on your expectations.  At
> $dayjob-1, I developed our own kit precisely because so many of the
> other toolkits got things wrong, little and big; not simply wrong as a
> matter of opinion, but with specific errors the developers glossed over
> because those aspects were unimportant to them ... while being of
> critical importance to me and my customers at the time.
>
>> Our cluster deployment is a system that I've developed over the years
>> and is pretty simple - if you know BASH and how PXE booting works.  It
>> has everything from setting the correct parameters in the BIOS, ZFS
>> RAM disks for the OS, Lustre for state files (usually in /var) - all
>> in the initrd.
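>>
>> The interesting part of the initrd is only a handful of lines.  A rough
>> sketch of the shape of it (sizes, pool name and Lustre target are
>> illustrative, not our exact layout):
>>
>>    # OS root lives on a ZFS pool backed by a kernel ram block device,
>>    # with Lustre mounted over /var for anything stateful
>>    modprobe brd rd_size=8388608          # 8 GB ram disk (size in KB)
>>    zpool create -f rpool /dev/ram0
>>    zfs create -o mountpoint=/sysroot rpool/os
>>    mkdir -p /sysroot/var
>>    mount -t lustre mds01@tcp0:/state /sysroot/var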
>>
>> We use it to boot cluster nodes, lustre servers, misc servers and
>> desktops.
>>
>> We basically treat everything like a cluster.
> The most competent baked distro out there for this was (in the past;
> I haven't used it recently) Warewulf.  See https://github.com/warewulf/ .
> It is still under active development, and Greg and team do a generally
> great job.  It is the least opinionated of the bunch, the most
> flexible, and has some of the best tooling.
>
>> However... we do have a proliferation of images... and all need to be
>> kept up-to-date and managed.  Most of the changes from one image to
>> the next are config files.
> Ahhh ... One of the things we did with our toolchain (it is open source,
> I've just never pushed it to github) was to completely separate booting
> from configuration.  That is, units booted to an operational state
> before we applied configuration.  This was in part due to long
> experience with nodes hanging during bootup with incorrect
> configurations.  If you minimize the chance for this, your nodes
> (barring physical device failure) always boot.  The only specific
> opinion we had w.r.t. this system was that the nodes had to be bootable
> via PXE, and therefore a working DHCP server needed to exist on the
> network.
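>
> That requirement is small enough to satisfy with a few lines of dnsmasq
> on the head node; a minimal sketch (interface, range and paths are
> placeholders):
>
>     # /etc/dnsmasq.conf -- minimal DHCP + TFTP service for PXE
>     interface=eth1
>     dhcp-range=10.1.0.100,10.1.0.250,12h
>     # boot filename handed to PXE clients
>     dhcp-boot=pxelinux.0
>     # serve pxelinux.0, vmlinuz and initrd.img from here
>     enable-tftp
>     tftp-root=/srv/tftp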
>
> Post-boot configuration we drove via a script that downloaded and
> launched other scripts.  Since we PXE booted, network addresses were
> fine.  We didn't even enforce final network address determination at
> PXE startup.
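>
> The launcher itself was nothing fancy; think of something in this
> spirit (the config_srv kernel parameter and URL layout here are
> hypothetical stand-ins for ours):
>
>     #!/bin/bash
>     # post-boot bootstrap: find the config server from the kernel
>     # command line, pull this node's driver script, hand control to it
>     set -eu
>     srv=$(sed -n 's/.*config_srv=\([^ ]*\).*/\1/p' /proc/cmdline)
>     node=$(hostname -s)
>     curl -fsS "http://${srv}/bootstrap/${node}.sh" -o /run/bootstrap.sh
>     bash /run/bootstrap.sh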
>
> We looked at the booting process as a state machine.  The lowest level
> was raw hardware, no power.  Subsequent levels were BIOS POST, PXE load
> of the kernel, and the configuration phase.  During the configuration
> phase *everything* was on the table w.r.t. changes.  We could (and did)
> alter networking, using programmatic methods, databases, etc. to
> determine and apply final network configs.  Same for disks and other
> resources.
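>
> Concretely, the configuration phase could throw the PXE-assigned
> address away entirely, along these lines (cfgdb is a stand-in for
> whatever database answers the lookup):
>
>     # final identity is keyed on a hardware fact: the boot NIC's MAC
>     mac=$(cat /sys/class/net/eth0/address)
>     read -r addr prefix gw < <(curl -fsS "http://cfgdb/net?mac=${mac}")
>     ip addr flush dev eth0
>     ip addr add "${addr}/${prefix}" dev eth0
>     ip route replace default via "${gw}"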
>
> Configuration changes could be rolled out post-boot by updating a
> script and then either pushing it (not normally recommended for
> clusters of reasonable size) or triggering a pull/run cycle for that
> script and its dependencies.
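>
> The pull side can be as simple as a cron entry driving a tiny script,
> something like (interval, server and paths illustrative):
>
>     #!/bin/bash
>     # /usr/local/sbin/config-pull -- run from cron, e.g.
>     #   */10 * * * * root /usr/local/sbin/config-pull
>     sleep $((RANDOM % 60))   # jitter so the cluster doesn't stampede
>     curl -fsS "http://cfgdb/drive/$(hostname -s).sh" -o /run/drive.sh &&
>         bash /run/drive.sh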
>
> This allowed us to update images and configuration asynchronously.
>
> We had to manage images, but this turned out to be generally simple.  I
> was in the midst of putting image mappings into a distributed object
> store when the company died.  The config store is similarly simple,
> again using the same mechanisms, and could be driven entirely
> programmatically.
>
> Of course, for the chef/puppet/ansible/salt/cloudformation/... people,
> we could drive their process as well.
>
>
>> We don't have good config management (which might, hopefully, reduce
>> the number of images we need).  We tried puppet, but it seems everyone
>> hates it.  Is it too complicated?  Not the right tool?
> Highly opinionated config management is IMO (and yes, I am aware this is
> redundant humor) generally a bad idea.  Config management that gets out
> of your way until you need it is the right approach. Which is why we
> never tried to dictate what config management our users would use.  We
> simply handled getting the system up to an operational state, and they
> could use ours, theirs, or Frankensteinian kludges.
>
>> I was thinking of using git for config files, dumping a list of RPMs,
>> dumping the active services from systemd and somehow munging all that
>> together in the initrd.  i.e. git checkout the server to get config
>> files and systemctl enable/start the appropriate services etc.
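>>
>> e.g. on the capture side, something like (repo layout invented for
>> illustration):
>>
>>    #!/bin/bash
>>    # snapshot what makes this node different from the base image
>>    cd /srv/nodecfg/$(hostname -s) || exit 1
>>    rpm -qa --qf '%{NAME}\n' | sort > packages.list
>>    systemctl list-unit-files --state=enabled --no-legend |
>>        awk '{print $1}' > services.list
>>    mkdir -p etc && cp -a /etc/fstab /etc/sysconfig etc/
>>    git add -A && git commit -m "snapshot $(date -Is)"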
>>
>> It started to get complicated.
>>
>> Any feedback/experiences appreciated.  What works well?  What doesn't?
> IMO things that tie together config and booting are problematic at
> scale.  They lead to nearly unmanageable piles of images, as you've
> experienced.  Booting to an operational state, and applying all config
> post-boot (ask me about my fstab replacement some day), makes for a
> very nice operational solution that scales wonderfully ... you can
> replicate images to local image servers if you wish, replicate config
> servers, and load balance the whole thing to whatever scale you need.
>
>
>> Thanks.
>>
>>
>>
>> -- 
>> Dr Stuart Midgley
>> sdm900 at gmail.com