[Beowulf] why we need cheap, open learning clusters
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Sun May 12 10:55:31 PDT 2013
I just ran across an interesting anecdote (in Malcolm Gladwell's "Outliers"). It's in the context of Bill Joy, who commented that using timesharing and interactive systems compared to traditional batch/card deck submission was like speed chess vs chess by mail. That interactivity facilitated his spending thousands of hours working with software.
I see parallels to the cluster world. The original Beowulfs were essentially owned by one person, who could do what they wanted with them, when they wanted to. If they wanted to reload all the software with a new version and try something, they could do that. If they wanted to rearrange the network switches, they could. It's very interactive.
Compare to the current large clusters. When you have 1000 nodes, you're not going to say "hmm, if we rearrange the interconnect, I wonder what happens". And that million dollar machine is going to have a batch queue, a scheduler, etc. I am amused to read all the stuff on this list over the years, as we've moved from "bunch of boxes on shelves in my office" to things very reminiscent of my early days in big iron. Sure, the modern user of a cluster doesn't have to punch a deck of cards and hike down to the computer center (or, if lucky, to the RJE station in their building); they can submit the job by a few keystrokes online. But it's still "submit and wait", as opposed to "type line, press enter, and get results immediately".
Gladwell and Joy talk about this interactivity in the context of the famous 10,000 hour thing (it takes 10,000 hours of doing something to become proficient). If your "cycle time" for a job is an hour, it takes a LONG time to accumulate the 10,000 hours, especially if most of the time is spent doing things like punching cards or reading greenbar output listings. That's not part of the "learning computer" stuff. On the other hand, if you can make a change in a few lines with a text editor (SOS on a DECwriter, I came to love you after I cast off the shackles of an 029), run the program, and see what happens, proficiency comes that much faster.
This is why I think things like ArduWulf or, more particularly, LittleFE are valuable. And it's also why nobody should start packaging LittleFE clusters in an enclosure. Once all those mobos are in a box with walls, it starts to discourage random and rapid experimentation. If you put a LittleFE in a sealed box with an inventory tag and a "breaking this seal voids warranty" sticker, and the only interface is the network jack or keyboard/monitor, you might as well put a modern multicore mobo in there and spin up VM instances. In this case, it's the very "assembled in a garage" kind of look that prompts the willingness of someone to go in and make some unauthorized changes, from which comes learning.
The learning cluster has to be cheap enough (and, I think, physically portable) to be "owned" by a single person. Otherwise, it starts to be "community, shared property", and subject to access restrictions. It starts to look like significant capital equipment, with only authorized service and compliance with corporate/institutional IT security rules: Do you have all your patches up to date? Are you running the institutional virus checker? Do you have full disk encryption?
Nobody is going to try to connect an ArduWulf to the "internet" and thus provide a vector for penetration. A LittleFE can run stand alone. It doesn't need to be connected to the internet to work. An institutional requirement "thou shalt not connect a LittleFE to the internal network" is not a big deal and doesn't decrease the pedagogical value. This is a surprisingly useful aspect. At JPL, I have to jump through many hoops to order a bare motherboard, much less something in a box with a power supply, because of the (legitimate) concerns about IT security. Enough people have bought computers, calling them "laboratory instrument controllers" and the like, claiming that they weren't interconnected and therefore didn't require all the stuff typically required, and then connected them to the institutional network, inadvertently providing a pathway for evil.
When I bought those Arduino Uno Ethernets (they ARE going to be instrument controllers, after a fashion), the word "Ethernet" on the order form triggered a whole list of reviews from the netops people, the IT security people, etc., all of whom had legitimate concerns (was I going to be connecting something weird to the network that Net Ops needed to be aware of, was I going to be providing a vector for attack from Advanced Persistent Threats, etc.). It would not be unreasonable to say that JPL spent more money dealing with those issues (or showing that they aren't an issue) than we did on the actual hardware.
So, it's useful to have small scale toy clusters for development. They facilitate getting that 10,000 hours. And we NEED people who have that 10,000 hours, because they're the ones who will revolutionize HPC.
I also realized that's why I hated MIXAL and the MIX machine, and why I don't think it contributed to my skills as much as my using that PDP-11/20 or the DECsystem-10. The MIX machine was run as a batch job against my limited account. The PDP-11 was single user RT-11, the DEC-10 was running TOPS-10 with a terminal, and at UCSD we were running the p-System on LSI-11/3s. In my first real paying development effort, I was running FORTRAN on a Z80 machine, even though the eventual target was a CDC mainframe. Get it working on the Z80 locally, then convert it to card images and transfer to the CDC.
I am a BIG believer in personal computing…