Linux magic wand - was Re: [Beowulf] Re: "hobbyists"
Lawrence Stewart larry.stewart at sicortex.com
Tue Jun 24 18:11:06 PDT 2008
Mark Hahn wrote:
>
> so the question is, if you had a magic wand, what would you change in
> the kernel (or perhaps libc or other support libs, etc)? most of the
> things I can think of are not clear-cut. I'd like to be able to give
> better info from perf counters to our users (but I don't think Linux
> is really in the way). I suspect we lose some performance due to jitter
> injected by the OS (and/or our own monitoring) and would like to improve,
> but again, it's hard to blame Linux. I'd love to have better options
> for cluster-aware filesystems. kernel-assisted network shared memory?

There's a good rant to be written for Usenix or the Ottawa Linux Symposium, I suspect.

VM - 4096 is small now. In 1976 a page was 512 bytes. It moved to 4096 in the mid '90s? I forget. Since then computers and memory bandwidths are much bigger and faster. The telling point for me was that I took a look at a running system and there were only a couple of <hundred> VM areas in service, so page breakage amounts to almost nothing. We run with 64K pages and plan to experiment with much larger ones.

One could argue about thread stacks, but I think that threads and HPC don't mix well, so there won't be that many. I am aware of the great debate about the right way to program high core-count nodes, but I doubt that more threads than processors is the right answer.

Linux also has pretty poor mechanisms for keeping physical memory contiguous. The slabs tend to get fragmented, which is why the big page stuff and things like bigphysarea get preallocated. There's no good reason why you couldn't compact memory on the fly.

The VM system is also in the way of OS bypass RDMA NICs - you either get large kernel patches like Quadrics to let virtual RDMA work, or you get pinning and registration and other performance-sapping cruft. The new external-pager stuff may help a lot here; I haven't looked at it yet.
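The page-breakage argument above can be put in rough numbers. This is a minimal sketch, not anything from the original post: it assumes each VM area wastes, on average, half of its last page, and it takes "a couple of hundred" to mean 200 areas. The 2 MiB entry is only a stand-in for the "much larger" pages mentioned, and the helper names are hypothetical. On Linux, each line of /proc/<pid>/maps describes one mapping, so counting lines gives a figure for a live process.

```python
import mmap
import os


def expected_breakage_bytes(num_vmas, page_size):
    """Estimated internal fragmentation, ASSUMING each VM area
    wastes half a page on average (an assumption, not a measurement)."""
    return num_vmas * page_size // 2


def count_vm_areas(pid="self"):
    """Count mappings of a process by counting lines in /proc/<pid>/maps.
    Returns None where that file does not exist (non-Linux systems)."""
    path = f"/proc/{pid}/maps"
    if not os.path.exists(path):
        return None
    with open(path) as maps:
        return sum(1 for _ in maps)


if __name__ == "__main__":
    print("page size on this machine:", mmap.PAGESIZE)
    print("VM areas in this process:", count_vm_areas())
    # 200 areas is the assumed reading of "a couple of hundred".
    for page_size in (4096, 64 * 1024, 2 * 1024 * 1024):
        waste = expected_breakage_bytes(200, page_size)
        print(f"{page_size:>8}-byte pages: ~{waste / 2**20:.2f} MiB of breakage")
```

Under those assumptions, 200 areas with 64K pages come to about 6.25 MiB of breakage, which is small next to the memory of a typical HPC node.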
I/O system

The block device layer has 512 byte sectors wired in, and is solely useful for devices that you own exclusively. You've got queueing going on at multiple levels, I think because the architecture has assumptions about CPU/disk performance ratios baked in. And the segments of a bio have to complete in order, what's that about?

A little one we ran into here is that the block I/O system doesn't know if an I/O is to satisfy an I stream page fault or a D stream page fault. Consequently if your L1 Icache is not coherent (and few are) you have to flush it on all read completions. A little bookkeeping would solve that. (I hope I am wrong about this one!)

File systems

Agree completely about cluster aware FS. We struggle with the Lustre patch sets, which may be an extreme case.

Performance stuff

We are big users of the PAPI infrastructure, which is pretty good, but once you step off that train you have to deal with things like sysfs. So we're trying to read hardware counters without undue disturbance to running HPC applications, and the advice of Linux is to make a system call for each value, converted to ASCII. This makes sense for slow admin stuff but not for performance data. At least it isn't XML.

Runtime system

I tend towards thinking we would be better off without shared libraries. Memory is big, programs are generally small. There is a lot of complexity here, to which I am allergic. To the extent that shared libraries make the program slower (due to separate segments for library data, for example), let's get rid of them. Two arguments in favor are when the library is implementing a system service chosen by the admin, rather than the programmer (PAM modules), and there is this talk about MPI ABIs, so applications can use alternate packages without relinking. I think that is a bad idea too, but it is off-topic.

OS noise

This becomes a big issue in large systems.
There's way too much stuff running in Linux, each piece separately designed, each thread with its own notions of timing and periodic wakeups. Maybe the OS should run on a separate node altogether, and you communicate with it via RDMA. All that is left behind is maybe memory management.

-L
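A rough way to see the OS noise described above is to time the same fixed amount of work many times and look at the spread. The sketch below is an illustration only, not a method from the original post: the function names, sample counts, and loop sizes are all hypothetical, and the Python interpreter adds jitter of its own, so the numbers show the idea rather than a clean measurement of kernel interference.

```python
import time


def fixed_work(iterations):
    """A fixed quantum of CPU work; it should take the same time on a quiet core."""
    total = 0
    for i in range(iterations):
        total += i * i
    return total


def sample_quanta(samples=200, iterations=20000):
    """Time the same work quantum repeatedly; returns durations in nanoseconds."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        fixed_work(iterations)
        durations.append(time.perf_counter_ns() - start)
    return durations


def summarize(durations):
    """Fastest run approximates the undisturbed cost; slower runs absorbed some interruption."""
    fastest = max(min(durations), 1)  # guard against a zero reading from a coarse clock
    slowest = max(durations)
    return {
        "min_ns": fastest,
        "max_ns": slowest,
        "worst_slowdown": slowest / fastest,
    }


if __name__ == "__main__":
    print(summarize(sample_quanta()))
```

On a large machine the concern is that every node hits its worst case at a different moment, so a tightly synchronized job tends to run at the pace of the slowest quantum anywhere in the system.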
