[Beowulf] 512 nodes Myrinet cluster Challanges
Mark Hahn hahn at physics.mcmaster.ca
Tue May 2 14:20:26 PDT 2006
> > moving it, stripped them out as I didn't need them. (I _do_ always require
> > net-IPMI on anything newly purchased.) I've added more nodes to the cluster
>
> Net-IPMI on all hardware? Why? Running a second (or 3rd) network isn't
> a trivial amount of additional complexity, cables, or cost. What do

I really like being able to reset remotely, as well as power up/down,
fetch temperatures and fan speeds, etc.

> you figure you pay extra on the nodes (many vendors charge to add IPMI,
> sun, tyan, supermicro, etc), cables, switches, etc. As a data point on
> a x2100 I bought recently the IPMI card was $150.

the IPMI add-in for many Tyan boards is a lot less than that ($50?),
and quite a few servers already have it built in (such as the HP DL145 G2).
and it's not a "real" extra network, since each rack's worth of IPMI
net ports can just go to an in-rack switch. if you have 32-40 nodes/rack
with a better-than-ethernet interconnect, then you've probably already
got another switch (gigabit) in the rack, so all the extra stuff is in-rack.

> Seems like collecting fan speeds and temperatures in-band seems reasonable,
> after all much of the data you want to collect isn't available via IPMI
> anyways (cpu utilization, memory, disk I/O, etc.).

true, though it's not clear to me how important those extras are to
the kind of HPC cluster I run. a job gets complete ownership of its
CPUs (and usually multiple whole nodes), so it's quite unlike a
load-balancing cluster, where you actually want realtime info on
cpu or memory utilization. doing load-balanced clusters is not
unreasonable for more cores-per-node, or perhaps for strictly serial
workloads. for anything that's nontrivially parallel, the job _must_
completely own all its resources, so there's really no reason to worry
about unused memory on an already occupied node...

> Upgrading a 208 3phase PDU to a switched PDU seems like it costs on the
> order of $30 per node list. As a side benefit you get easy to query
> load per phase.

that's nice, but it only lets you power up/down. you can't do a warm
reset, only hard ones that limit the hardware's life.

> After dealing with a few clusters with PDUs in the airflow blocking
> airflow and physical access to parts of the node I now specify the
> zero-u variety that are outside the airflow.

that's nice. HP's PDUs have a breaker section which consumes about 1u each,
and a set of outlet bars which mount zero-u (but which have far too
many (or too low-power) outlets). interestingly, our racks are bayed
together, which means that there's enough space for some airflow between
racks. unfortunately, Quadrics switches are fairly narrow, so there's
enough room for a noticeable counter-circulation.
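
as a rough illustration of the out-of-band operations mentioned above (reset, power up/down, reading temperatures and fan speeds), here is a minimal sketch that builds command lines for the stock ipmitool CLI. it assumes an IPMI 2.0 BMC reachable over the "lanplus" interface; the host naming scheme, the user name, and the password file path are hypothetical placeholders, and the action names in the table are just labels chosen for this example. it only prints the commands rather than running them.

```python
import shlex

# Illustrative action labels mapped to ipmitool subcommands.
ACTIONS = {
    "status": ["chassis", "power", "status"],
    "on": ["chassis", "power", "on"],
    "off": ["chassis", "power", "off"],
    "reset": ["chassis", "power", "reset"],  # resets without cutting power
    "temps": ["sdr", "type", "Temperature"],
    "fans": ["sdr", "type", "Fan"],
}


def ipmi_command(host, action, user="admin", password_file="/etc/ipmi.pass"):
    """Build the argv for one ipmitool call against a node's BMC.

    host, user and password_file are placeholders (assumptions), not
    values from any real cluster.
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    base = ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-f", password_file]
    return base + ACTIONS[action]


if __name__ == "__main__":
    # Print (not run) the sensor query for a hypothetical rack of 4 nodes.
    for n in range(1, 5):
        print(shlex.join(ipmi_command(f"node{n:03d}-ipmi", "temps")))
```

to actually execute one of these, the returned list could be handed to subprocess.run(cmd, check=True) on a machine that can reach the in-rack IPMI switch.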
