[Beowulf] Remote console management
Bruce Allen ballen at gravity.phys.uwm.edu
Sun Sep 25 18:24:48 PDT 2005
> An alternative approach could be to deshuffle the money of the
> distributed local storage that you sketched out, and have cheaper
> diskless (and therefore almost stateless) compute nodes (or with a
> non-raided single drive for scratchspace of intermediate results) plus a
> gang of storage nodes that are dedicated access points to a bunch of
> iscsi or fibre attached drive enclosures.

Nope -- not enough bandwidth to the data. With our current plan our
bandwidth to the data will be 400 x 100 MB/sec = 40 GB/sec. This is enough
to read ALL 400 TB of data on the cluster in 10000 sec, or about three
hours. You can't even come close with centralized (non-distributed)
storage systems.

Cheers,
	Bruce

> Bruce Allen wrote:
>
>> Doug,
>>
>> Good to "see you" in this discussion -- I think this thread would be the
>> basis for a nice article.
>>
>> Spending the $$$ to buy some extra nodes won't work in our case. We don't
>> just use the cluster for computing, we also use it for data storage. Each
>> of the 400+ nodes will have four 250GB disks and a hardware RAID controller
>> (3ware 9500 or Areca 1110). If a node is acting odd, we'd like to be able
>> to diagnose/fix/reboot/restore it quickly if possible. To replicate the
>> data from a distant tape-backed repository will take many hours. So having
>> some 'extra' machines doesn't help us so much, since we wouldn't know what
>> data to keep on them, and moving the data onto them when needed would
>> normally take much longer than bringing back to life the node that's gone
>> down.
>>
>> Cheers,
>> Bruce
>>
>>
>> On Sat, 24 Sep 2005, Douglas Eadline wrote:
>>
>>>
>>>> We're getting ready to put together our next large Linux compute cluster.
>>>> This time around, we'd like to be able to interact with the machines
>>>> remotely. By this I mean that if a machine is locked up, we'd like to be
>>>> able to see what's on the console, power cycle it, mess with BIOS
>>>> settings, and so on, WITHOUT having to drive to work, go into the cluster
>>>> room, etc.
>>>>
>>> This brings up an interesting point and I realize this does come down to
>>> a design philosophy, but cluster economics sometimes create non standard
>>> solutions. So here is another way to look at "out of band monitoring".
>>> Instead of adding layers of monitoring and control, why not take that
>>> cost and buy extra nodes. (but make sure you have a remote hard power
>>> cycle capability). If a node dies and cannot be rebooted, turn it off, and
>>> fix it later. Of course monitoring fans and temperatures is a good thing
>>> (tm), but if node will not boot, and you have to play with the BIOS, then
>>> I would consider it broken.
>>>
>>> Because you have "over capacity" in your cluster (you bought extra nodes)
>>> this does not impact the amount work that needs to get done. Indeed, prior
>>> to the failure you can have the extra nodes working for you. You fully
>>> understand that at various time one or two nodes will be off line. They
>>> are taken out of the scheduler and there is no need to fix them right
>>> away.
>>>
>>> This approach also depends on what you are doing with your
>>> cluster and the cost of nodes etc. In some cases out-of-band access
>>> is a good thing. In other cases, the "STONIH-AFIT" (shoot the other node
>>> in the head and fix it tomorrow" approach is also reasonable.
>>>
>>>
>>> --
>>> Doug
>>>
>>> check out http://www.clustermonkey.net
>>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
>
> --
> Michael Will
> Penguin Computing Corp.
> Sales Engineer
> 415-954-2822
> 415-954-2899 fx
> mwill at penguincomputing.com
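
The bandwidth estimate in Bruce's reply (400 x 100 MB/sec = 40 GB/sec, 400 TB read in about 10000 sec) can be checked with a few lines of Python. This is only a sketch and not part of the original message: the function name `full_scan_time` is hypothetical, decimal units (1 GB = 1000 MB, 1 TB = 1000 GB) are assumed, and the 400 TB total is taken as raw disk capacity (400 nodes with four 250GB disks each), ignoring any RAID overhead.

```python
def full_scan_time(nodes, node_mb_per_s, disks_per_node, disk_gb):
    """Return (aggregate GB/sec, total TB, seconds to read everything).

    Assumes every node reads its own local disks in parallel and that
    decimal units apply (1 GB = 1000 MB, 1 TB = 1000 GB).
    """
    aggregate_gb_per_s = nodes * node_mb_per_s / 1000   # MB/sec -> GB/sec
    total_tb = nodes * disks_per_node * disk_gb / 1000  # GB -> TB (raw capacity)
    seconds = total_tb * 1000 / aggregate_gb_per_s      # TB -> GB, then divide
    return aggregate_gb_per_s, total_tb, seconds


# Figures quoted in the thread: 400 nodes, 100 MB/sec per node,
# four 250GB disks per node.
agg, total, secs = full_scan_time(400, 100, 4, 250)
print(f"{agg:.0f} GB/sec aggregate, {total:.0f} TB total")
print(f"{secs:.0f} sec = {secs / 3600:.2f} hours for a full read")
```

Because both the aggregate bandwidth and the total data scale with the node count, the full-read time under these assumptions depends only on the per-node figures: 1 TB per node at 100 MB/sec is 10000 sec regardless of how many nodes there are.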
