[Beowulf] Fault tolerance & scaling up clusters (was Re: Bright Cluster Manager)

Lux, Jim (337K) james.p.lux at jpl.nasa.gov
Sat May 19 18:36:18 PDT 2018

These are compute nodes in space, but they would be booting over a wireless link from another node many km away.

The data rates for space are surprisingly low - it's a "joules/bit" kind of thing and power is precious.

The relay link between rovers on the surface of Mars and orbiters overhead is a few Mbps at the fastest. Links between Earth and Mars are usually a few kbps from Earth to spacecraft (uplink) and a few Mbps back to Earth (downlink).

In Earth orbit, there are some spacecraft with high rate "crosslinks" - Iridium NEXT is a good example - there's a 12.5 Mbps half duplex link between the satellites.
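
To put those rates next to the 165 MB image from the quoted thread below, here is a rough back-of-envelope sketch. It assumes an ideal link (no protocol overhead, no retries, decimal megabytes). The function name is made up for illustration, and the 2 Mbps figure is just an arbitrary pick for "a few Mbps".

```python
def transfer_time_hours(size_mb: float, rate_bps: float) -> float:
    """Ideal transfer time in hours, ignoring protocol overhead and retries."""
    bits = size_mb * 1e6 * 8  # decimal megabytes -> bits
    return bits / rate_bps / 3600


IMAGE_MB = 165  # compressed image size quoted in the thread

# Rates are illustrative picks within the ranges mentioned above.
for label, rate_bps in [
    ("64 kbps", 64e3),
    ("2 Mbps (assumed relay rate)", 2e6),
    ("12.5 Mbps crosslink", 12.5e6),
]:
    hours = transfer_time_hours(IMAGE_MB, rate_bps)
    print(f"{label}: {hours:.2f} h ({hours * 3600:.0f} s)")
```

On a half duplex link the usable rate in one direction would be lower than the nominal figure, so these are best-case numbers.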

The other problem is that most "remote" protocols are fairly error intolerant: the usual strategy is that if you get an error, you just retry the whole thing.
TFTP (used by netboot) uses UDP, so it must have some way to deal with dropped blocks.

For instance, FTP doesn't have a good way to "restart" a file transfer in the middle (although good old Zmodem does <grin>).
I don't know about scp.  I think rsync can do restarts.
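
To illustrate why "retry the whole thing" hurts on a lossy link, here is a toy simulation, not a model of how TFTP, FTP, Zmodem or rsync actually behave. It assumes each block fails independently with a fixed probability; the function name, block count and error rate are all made up for the sketch.

```python
import random


def blocks_sent(total_blocks: int, p_fail: float, resume: bool, seed: int = 0) -> int:
    """Count blocks put on the link until the whole file has arrived.

    Each block transmission fails independently with probability p_fail.
    resume=False: any failure discards all progress (retry the whole thing).
    resume=True:  only the failed block is sent again.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    sent = 0
    done = 0
    while done < total_blocks:
        sent += 1
        if rng.random() < p_fail:
            if not resume:
                done = 0  # start over from the first block
        else:
            done += 1
    return sent


# Hypothetical numbers: a 50-block file over a link that drops 2% of blocks.
restart_cost = blocks_sent(50, 0.02, resume=False, seed=1)
resume_cost = blocks_sent(50, 0.02, resume=True, seed=1)
print(f"restart from scratch: {restart_cost} blocks sent")
print(f"resume at failed block: {resume_cost} blocks sent")
```

With the same seed the restart strategy can never send fewer blocks than the resume strategy, and the gap grows quickly as the file gets longer or the link gets worse.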

On 5/19/18, 1:47 AM, "Beowulf on behalf of Roland Fehrenbacher" <beowulf-bounces at beowulf.org on behalf of rf at q-leap.de> wrote:

    >>>>> "J" == Lux, Jim (337K) <james.p.lux at jpl.nasa.gov> writes:
        J> On May 17, 2018, at 06:01, Roland Fehrenbacher <rf at q-leap.de>
        J> wrote:
        >>>>>>> "J" == Lux, Jim (337K) <james.p.lux at jpl.nasa.gov> writes:
        J> The reason I hadn't looked at "diskless boot from a server" is
        J> the size of the image - assume you don't have a high bandwidth or
        J> reliable link.
        >> This is not something to worry about with Qlustar. A (compressed)
        >> Qlustar 10.0 image containing e.g. the core OS + slurm + OFED +
        >> Lustre is just a mere 165MB to be transferred (eating 420MB of
        >> RAM
        J> 165 MB = 1.3 Gbit. At 64 kbps that's about 6 hrs.
    Ouch. Sure, with 64 kbps you've had it. Wouldn't have expected that kind
    of throughput at NASA in 2018, or are these compute nodes in space that
    you want to boot from a head-node in Houston :)
