[Beowulf] Computation on the head node
Perry E. Metzger perry at piermont.com
Mon May 19 09:44:42 PDT 2008
"Jeffrey B. Layton" <laytonjb at charter.net> writes: >> Third, you could be doing lots of file i/o to legitimate data >> files. Here again, it is possible that if the files are small enough >> and your access patterns are repetitive enough that increasing your >> RAM could be enough to make everything fit in the buffer cache and >> radically lower the i/o bandwidth. On the other hand, if you're >> dealing with files that are tens or hundreds of gigabytes instead of >> tens of megabytes in size, and your access patterns are very >> scattered, that clearly isn't going to help and at that point you need >> to improve your I/O bandwidth substantially. > > It's never this simple - never :) Sometimes it is this simple. Indeed, often it is. > Plus, different file systems will impact the IO performance in > different ways. Well, of course. > It's never as simple, as "add more memory" or "need more bandwidth". Sometimes it *is* as simple as "add more memory". I remember one particular problem I dealt with once where adding about 30% more memory for file cache nearly eliminated disk i/o, at which point it was no longer necessary to optimize the i/o subsystem. If you don't believe that's ever happened, well, fine by me. It won't hurt me either way. :) > You need to understand your IO pattern and what the code is doing. Naturally, but sometimes the solution is as easy as "add more memory". The best way to improve i/o performance possible is to eliminate i/o if you can. If you're just spewing data out really, really fast, memory won't help. If you're reading and writing the same data, or you're reading a reasonable sized working set, memory helps. >>> The best way I've found is to look a the IO pattern of your >>> code(s). The best I've found to do this is to run an strace against >>> the code. I've written an strace analyzer that gives you a >>> higher-level view of what's going on with the IO. 
>>
>> That will certainly give you some idea of access patterns for case 3
>> (above), but on the other hand, I've gotten pretty far just glancing
>> at the code in question and looking at the size of my files.
>
> But what if you don't have access to the source or can't share the
> source with vendors (of the data set)?

Very often you can figure out what the code is doing just by looking
at things like page hit rates from the various status programs.
They'll tell you what your I/O pattern is like.

The vendor sharing issue is, of course, far more complicated.
Everything that involves people and not machines is pretty much by
definition more complicated. :)

>> I have to say, though, that really dumb measures (like increasing the
>> amount of RAM available for buffer cache -- gigs of memory are often a
>> lot cheaper than a fiber channel card -- or just having a fast and
>> cheap local drive for intermediate data i/o) can in many cases make
>> the problem go away a lot better than complicated storage back end
>> hardware can.
>
> IMHO and experience many times just adding memory can't make things
> go away.

If you don't know how to tune the file cache usage, that's certainly
true -- without tuning, you'll never use the extra RAM. I've seen
people who have added more memory and then said "well, see, that did
no good" but they didn't know how to tune their OS correctly for their
job load so of course it wasn't going to do them any good. (There are
also systems where you just can't tune for what you want -- try tuning
a Windows 2000 Server box to use more file cache, for example, and
you'll spend your time tearing your hair out.)

I've found that, remarkably often, more memory *can* make the problem
go away, but only in cases where keeping most files hot in cache can
eliminate the i/o entirely. If you're writing out tens or hundreds of
gigs of generated data, memory alone is clearly not going to fix the
problem.
If you are reading primarily, but your working set is much larger than
the amount of memory you could possibly afford, then more memory is
clearly not going to help. However, if you are hitting a hot half gig
or gig of file data for read and write, memory makes all the
difference in the world.

>> If you really are hitting disk and can't help it, a drive on every
>> node means a lot of spindles and independent heads, versus a fairly
>> small number of those at a central storage point. 200 spindles always
>> beat 20.
>
> What if you need to share data across the nodes?

Again, it depends. If everyone's hitting a bunch of hot data in one
spot, clearly you're going to lose (though even then, maybe having a
ridiculously large ramdisk from a commercial supplier can help). If,
on the other hand, data access patterns are fairly randomly spread,
then it might be a big win to spread the data across the nodes, just
as one can win big by slicing up database table rows across lots of
servers in some applications. "It depends on what you are doing."

> Having data spread out all over creation doesn't help.

It can. Look at how Google does things. They spread their data out
"all over creation", and they win big. They don't use giant file
servers at all -- they spread disk i/o out to hundreds of thousands of
spindles on hundreds of thousands of nodes.

> In addition, I like to get drives out of the nodes if I can to help
> increase MTTI.

That's also a consideration. As always, it is a tradeoff, and
understanding the particular application the cluster is being used for
is key to knowing what the right thing is.

>> In any case, let me note the most important rule: if your CPUs aren't
>> doing work most of the time, you're not allocating resources
>> properly. If the task is really I/O bound, there is no point in having
>> more CPU than I/O can possibly manage.
>> You're better off having 1/2
>> the number of nodes with gargantuan amounts of cache memory than
>> having CPUs that are spending 80% of their time twiddling their
>> thumbs. The goal is to have the CPUs crunching 100% of the time, and
>> if they're not doing that, you're not doing things as well as you can.
>
> I absolutely disagree.

Chacun à son goût.

I was under the impression that in scientific computing the name of
the game was having your computation done as fast as possible at the
lowest possible cost. If your CPU is idle, why did you pay for it?
There's a huge cost differential these days between fast and slow
CPUs. Why didn't you buy a much cheaper CPU that would remain nearly
100% busy while keeping the I/O subsystem as fast? You would have
saved lots of cash, your job would be done just as fast, and probably
(in a modern system) you would have saved a whole lot of electricity
because slower CPUs eat fewer Watts.

> I can name many examples where the code has to do a fair amount of
> IO as part of the computation so you have to write data.

Sure, but the name of the game is to wait for I/O as little as
possible. Every moment the CPU is idle it could be doing something
else instead. A precious resource is being wasted.

> Doing this in an efficient manner is pretty damn important.

I believe that's more or less what I've said, yes?

> Understanding the IO pattern of your code to help you choose the
> underlying hardware and file system is absolutely critical.

I can't say that I disagree there; however, the reason for that is so
that your CPU can spend more of its time working and less twiddling
its virtual thumbs waiting for data to work on.

> I can also think of examples where you can't stuff enough memory
> in a box so you will have to consider IO as part of the computation.

I fully agree. It is trivial to think of examples where you have to
flush out so much data that it is simply impossible to alleviate the
problem with RAM.
If your working set is 100G of data, you're not going to fix that with
RAM on individual nodes. However, when you can fix things with RAM, it
is a wonderfully simple and elegant solution.

> I believe you're thinking of local IO - like a desktop.

No, really, I'm not.

>>> I'm also working on a tool that can take the strace output and
>>> create a "simulator" that will run in a similar manner to the
>>> original code but actually perform the IO of the original code using
>>> dummy data. This allows you to "give" away a simple dummy code to
>>> various HPC storage vendors and test your application. This code is
>>> taking a little longer than I'd hoped to develop :(
>>
>> It sounds cool, but I suspect that with even simpler tools you can
>> probably deduce most of what is going on and get around it.
>
> If you know a better way - let's hear it!

Well, as always, it depends on your particular cluster issue, and I'm
not privy to your job load. :)

> I haven't seen one yet and having worked for an HPC storage company
> I haven't seen one from them either. I'm always looking for better
> techniques but I have to tell, I'm really skeptical of your ideas.

You needn't listen to me, then. It is a free country, and I won't be
insulted in the least if you ignore me. I'm pretty laid back about
that sort of thing. :)

Perry
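
The rule of thumb argued in this message (more RAM helps when a reused working set fits in cache, and does not help for streaming output or for working sets far larger than any affordable amount of memory) can be written down as a small decision function. This is only a sketch of that reasoning: the function name, its parameters, and the example sizes are hypothetical, and a real decision would also depend on OS cache tuning and on measured access patterns.

```python
def more_ram_likely_helps(working_set_gb, affordable_cache_gb, data_is_reused):
    """Crude check of whether extra file cache could remove most disk i/o.

    working_set_gb      -- size of the data the job actually touches
    affordable_cache_gb -- RAM that could realistically be devoted to file cache
    data_is_reused      -- True if the same data is read or rewritten repeatedly
    """
    if not data_is_reused:
        # Write-once streaming output: cache cannot eliminate the i/o.
        return False
    # Reused data stays hot in cache only if it fits.
    return working_set_gb <= affordable_cache_gb

# Illustrative cases (the cache sizes are assumptions, not from the thread):
hot_small = more_ram_likely_helps(1, 4, True)      # hot gig of file data
huge_set = more_ram_likely_helps(100, 16, True)    # 100G working set
streaming = more_ram_likely_helps(1, 4, False)     # spewing out generated data
```

Only the first case comes out in favor of buying memory; the other two point toward improving I/O bandwidth instead.
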
