[Beowulf] Fault tolerance & scaling up clusters (was Re: Bright Cluster Manager)

Sat May 19 01:46:44 PDT 2018

>>>>> "J" == Lux, Jim (337K) <james.p.lux at jpl.nasa.gov> writes:

    J> On May 17, 2018, at 06:01, Roland Fehrenbacher <rf at q-leap.de>
    J> wrote:

    >>>>>>> "J" == Lux, Jim (337K) <james.p.lux at jpl.nasa.gov> writes:
    >>
    J> The reason I hadn't looked at "diskless boot from a server" is
    J> the size of the image - assume you don't have a high bandwidth or
    J> reliable link.
    >>
    >> This is not something to worry about with Qlustar. A (compressed)
    >> Qlustar 10.0 image containing e.g. the core OS + slurm + OFED +
    >> Lustre is a mere 165 MB to be transferred (eating 420 MB of
    >> RAM

    J> 165 MB = 1.3 Gbit. At 64 kbps that's about 6 hrs.

Ouch. Sure, with 64 kbps you've had it. I wouldn't have expected that
kind of throughput at NASA in 2018. Or are these compute nodes in space
that you want to boot from a head node in Houston? :)
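
For anyone who wants to redo the image-transfer arithmetic quoted above for their own link, here is a rough sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 kbps = 10^3 bit/s), a fully utilized link, and no protocol overhead or retransmissions, so real boot times would be longer. The function name and the list of link speeds other than 64 kbps are made up for illustration.

```python
def transfer_time_hours(size_mb: float, link_kbps: float) -> float:
    """Idealized hours needed to move size_mb megabytes over a link_kbps link.

    Assumptions (not from the thread): decimal units, no protocol
    overhead, no retransmissions, link fully available.
    """
    bits = size_mb * 1e6 * 8              # megabytes -> bits
    seconds = bits / (link_kbps * 1e3)    # kbit/s -> bit/s
    return seconds / 3600.0


if __name__ == "__main__":
    image_mb = 165  # compressed image size mentioned in the thread
    # 64 kbps is the figure from the thread; the others are illustrative.
    for kbps in (64, 1_000, 100_000, 1_000_000):
        hours = transfer_time_hours(image_mb, kbps)
        print(f"{kbps:>9} kbps: {hours:.4f} h ({hours * 3600:.1f} s)")
```

At 64 kbps this gives a little under 6 hours, in line with the "about 6 hrs" estimate; at LAN speeds the same image moves in seconds.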