[Beowulf] Differenz between a Grid and a Cluster???
Robert G. Brown rgb at phy.duke.edu
Thu Sep 22 08:04:45 PDT 2005
- Previous message: [Beowulf] Differenz between a Grid and a Cluster???
- Next message: [Beowulf] Differenz between a Grid and a Cluster???
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Joe Landman writes:

> In a nutshell, a grid defines a virtualized cloud of processing/data
> motion across one or more domains of control and
> authentication/authorization, while a cluster provides a virtualized
> cloud of processing/data motion across a single domain of control and
> authentication/authorization. Clusters are often more tightly coupled
> via low latency network or high performance fabrics than grids. Then
> there is the relative hype and the marketing/branding ...

Agreed.

>> To be really fair, one should note that tools have existed to manage
>> moderate cluster heterogeneity for single applications since the very
>> earliest days of PVM. The very first presentation I ever saw on PVM in
>> 1992 showed slides of computations parallelized over a cluster that
>> included a Cray, a pile of DEC workstations, and a pile of Sun
>> workstations. PVM's aimk and arch-specific binary path layout was
>
> aimk is IMO evil. Not PVM's in particular, but aimk in general. The
> one I like to point to is Grid Engine. It is very hard to adapt to new
> environments.

<warning> patented rgb ramble follows, hit d now or waste your day...:-) </warning>

Oh, I agree, I agree. aimk and I got to be very good -- "friends", I suppose -- some years ago. It was a godawful hack, and it caused me significant disturbance to learn that it is the basis of SGE even today. But of course, you also have to look at the magnitude of the task it was trying to accomplish -- its complexity was just a mirror of the incredible mix of Unices and hardware architectures available at the time. You had your bsd-likes (SunOS), your sysv-likes (Irix), your let's-make-up-our-own-likes (NeXT OS, anyone?:-), your unix-doesn't-have-enough-management-commands-likes (AIX of Evil). These changed everything from basic paths to /etc layout to the way flags worked on common commands, to using nonstandard commands and nonstandard tools to do nearly ANYTHING.
aimk used rampant screen scraping just to identify the architecture it was running on, since there was nothing like a standard tool for doing so. And it was just as bad for the users -- who would e.g. try using ps aux on a sysv box when they should have been using ps -ef. This problem hasn't really gone away; it's just one that we don't usually face any more.

Once upon a time SunOS was "the" standard unix, such as it was, by virtue of selling more workstation seats than everybody else put together and by being the basic platform upon which open source software was built (it would nearly always build under SunOS if it built on anything, and maybe it would build under Irix, or CrayOS, or AIX -- but no guarantees). Now Linux is "the" standard unix, by virtue of "selling" more server AND workstation seats than everybody else put together AND by being the basic platform upon which "most" open source software is built, with its primary friendly open source competitor (FreeBSD) maintaining a fairly deliberate compatibility level to make it easy to port the vast open source library back and forth between them.

In addition, the emergence and moderate standardization of the lsb_release command and the uname -a command have made it possible to determine distribution, release, kernel, hardware architecture, and base architecture without resorting to wildass looking for a file in some particular path that is only to be found there in version 7.1 of widgetronic's Eunix, if it is installed on a Monday. It's a moderate shame that they didn't just name it the release command so that the BSDs, Solaris, and others could play too, but you can't have everything.

The modern solution to this problem at the build level in the Open Source world is to use Gnu's autoconf stuff. This operates at a level of complexity that puts aimk to shame, and it hides most of that complexity from the user as long as it works -- which it does for precisely those architectures (e.g. linux) that are all homogeneously built anyway and don't generally much need it. If nothing else, it does provide one with a sound way to build a Makefile or package that should "work" (build, install, function) as far as e.g. include paths and so on across a wider range of architectures than would likely otherwise be the case -- sometimes without actually having to instrument the code with all sorts of #ifdefs.

My own solution to all of this is to just say screw it. I write software that does not use autoconf, for the most part -- numerical code just isn't that complex. GUI code, sure. Code that uses a lot of high end libraries, maybe. But numerical code that links -lm and the GSL and perhaps a network library or pvm, why bother? I create rpms that will (re)build cleanly and easily on all the linux platforms I have access to. If anybody wants to use the (GPL) tools elsewhere, somewhere that is really incompatible with this (Solaris, Irix, AIX, WinXX), or even on a debian system -- well, that's why the GPL is there. They can; they just have to do any repackaging or porting required themselves. I may "help" them, I may even swallow any #ifdef'd patches back into the program if they don't bother the linux build, but I reject BOTH aimk AND Makefile.am, Makefile.in, etc. I like to have just one Makefile, human readable and human editable, instead of e.g. Makefile.irix, Makefile.linux, Makefile.aix or Makefile.am and Makefile.in, where much of the Makefile you use is completely beyond your ability to control.

So I'm atavistic and curmudgeonly. I also think that while diversity is important for evolution to occur, deliberate diversity of proprietary source material that seeks to "lock in" users to a nonstandard environment where the nonstandardness is all in >>places that do not matter<< tends to SUPPRESS evolution, by eliminating the sharing of memes. Microsoft is famous for this, but go down the list -- Sun, IBM, DEC, Cray, SGI, they ALL do this.
By investing energy to make a tool portable across their deliberate code-breaking differentiation, you only encourage them and permit them to continue a crepuscular existence where they can present to their waning body of users the illusion that they too can provide the full range of open source packages that are out there and are hence "as good as" (or even better than) plain old linux (any flavor) or freebsd. They are incorrect, of course -- the number of packages, and the level of sophistication of the packages, available in e.g. FC4 + extras + livna + ... is staggering. Even commercial software vendors, if they ever come to understand how, could use tools like yum to COMPLETELY AUTOMATE the distribution and updating of their properly packaged software in a totally reliable way to the licensed desktop or cluster node -- it merely requires the right combination of handshaking, encryption, and keys.

So the sooner open source package maintainers stop expending significant resources to ensure buildability across the commercial unices (except where the commercial vendors directly support the port with real money), the sooner the commercial incompatible unices will just go away. Evolution involves BOTH the sharing of genes AND the elimination of the less fit. Right now linux and freebsd are enjoying the fruits of robust gene-sharing and an INTERNAL culling of the crap -- applications are constantly being deprecated and dying away, with the ultimately democratic process used to determine what lives: ANYTHING can live as long as somebody loves it and will invest the time required to maintain it, but the last person out please turn out the lights and close the door. Nothing is wasted -- the code base persists and can be looted for new projects.
There is an interesting sort of "competition" between linux distributions themselves -- made interesting because they all share well over 90% of their code base, so it is a competition among siblings pursuing a lover (the "user"), not a war between two species seeking to take over an ecological niche.

It is this that makes Microsoft's recent MPI announcement so very interesting. If history is any predictor of the future, Microsoft's business strategy is perfectly clear. They've identified clusters as a real market with both direct profit potential and "prestige" (market clout) associated with it. Even Apple gets systems into the Top 500, where Microsoft gets laughed at. So: a) co-opt the technology that created the market ("Microsoft introduces its own compatible version of MPI"). b) sell lots of systems (they hope) into Microsoft-branded clusters. Invest any and all resources required to help users port their MPI applications into Microsoft-based compilers, toolsets, and so on. Develop windows-based cluster management tools so that users can comfortably manage their clusters from the Windows desktop without having to really "know" anything. c) when they have what they deem to be an adequate market share, introduce the first INcompatibilities. "Extend" MPI. Get users to use their proprietary or at least nonstandard extensions. Rely on the immense cost of de-porting the applications out of their development environment and use this to leverage the death of their competitors. Take over the cluster marketplace when people who use alternative MPIs or cluster OS's are no longer able to run the key applications built to use WinMPI.

Of course, step c) (which in the past has led to the DEATH of their competitors) won't work against an open source competitor that doesn't run on their platform in any significant numbers anyway. That is, they stand to gain or lose all ten of the people using an open source MPI on Windows clusters, or something like that, up front.
The one place where they might be able to make inroads is in clusters that use proprietary tools to e.g. do genetics or other kinds of work. Even that will depend on some armtwisting of those vendors into dropping their support for linux-based platforms. I can see no good reason for them ever doing so as long as linux clusters continue to be strongly represented in the marketplace, and I see no reason that MOST users will be convinced to use Microsoft based clusters given that they will be much, much more expensive and fundamentally less portable.

> When you run on multiple heterogeneous platforms and you are dealing with
> floating point codes, you need to be very careful with a number of
> things, including rounding modes, precision, sensitivity of the
> algorithm to roundoff error accumulation at different rates, the fact
> that PCs are 80 bit floating point units, and RISC/Cray machines use
> 32/64 bits and 64/128 for doubles. It could be done, but if you wanted
> reliable/reasonable answers, you had to be aware of these issues and
> make sure your code was designed appropriately.

Right. Again, in the pre-posix days this was essential. Even today it doesn't hurt to be aware. However, it is now "possible" to write code that works pretty well across even various hardware architectures. Usually the issue is one of efficiency, not actually "getting the wrong answers" -- unless one does something really stupid and presumes that e.g. an int is 32 bits, no more, no less. If one uses sizeof() and friends, or the various macros/functions intended to yield precision information, one can write portable code that is quite reliable.

I think the more interesting issue IS efficiency. ATLAS stands as a shining example of what can be gained by paying careful attention to cache sizes and speeds and the bottlenecks between the various layers of the memory hierarchy as one switches block sizes/strides and algorithms around.
There are really significant differences in BLAS efficiency between a "generic" blas and one that is atlas-tuned to your particular hardware. I think that this is what Mark was referring to -- if somebody writes their linear algebra application with a stride suitable for a 2 MB cache and runs it on a CPU with a 512K cache, it will run, but it won't run efficiently. If you do the vice versa, the same is true. The efficiency differential CAN be as much as a factor of 3, from what ATLAS teaches us, so one can WASTE part of the potential of a node if one is mistuned -- not so much get a wrong answer as waste the resource.

To the user this may or may not be a problem. If I have a problem that will take me a week to run on a really large grid, but that might run in only three days if I spend three weeks retuning it for all the architectures represented (presuming I know HOW to retune it), well, the economics of the alternatives are obvious. Similarly, if I have a job that will run for a year on the grid if I don't spend three weeks retuning it per architecture and in six months if I do, well, those economics are obvious too. And in fact, most grid users will just run their code in blissful ignorance of the multiarchitecture problem and of the fact that their code might well run 3x faster on the ORIGINAL architecture if it were rewritten by a competent programmer. In my direct experience, whole blocks of physics code are written by graduate students who had maybe two whole courses in programming in their lives, who wrote in Fortran 77 (or even Fortran IV), and who thought that the right way to program a special function was to look up a series in Abramowitz and Stegun and implement it in a flat loop. As in, sometimes they even get the right answer (eventually, after a bit of rewriting and testing)...;-)

> [...]
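(To illustrate the cache-blocking idea behind ATLAS with a toy example -- my own sketch, not ATLAS code: the same loop nest restructured so the inner loops touch an NB x NB working set, with NB as the knob ATLAS effectively searches over. The NB value below is an illustrative guess, not a measured optimum.)

```c
/* Toy cache-blocked matrix multiply, C += A*B for n x n doubles.
 * NB is the tuning knob: a block sized for a 2 MB cache still runs
 * on a 512K-cache CPU, just inefficiently -- the "waste the
 * resource" failure mode, not a wrong answer. */
#include <stddef.h>

#define NB 64 /* block edge; assumption -- tune per cache, as ATLAS does */

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t jj = 0; jj < n; jj += NB)
                /* work on one NB x NB block triple at a time, so the
                   inner loops reuse data while it is still in cache */
                for (size_t i = ii; i < ii + NB && i < n; i++)
                    for (size_t k = kk; k < kk + NB && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

(Real ATLAS additionally tunes register blocking, copy vs no-copy kernels, and so on -- the point here is only that the efficiency lever is a compile-time parameter tied to the cache hierarchy, which is exactly what cannot be set once for a heterogeneous grid.)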
>> Some of the gridware packages do exactly this -- you don't distribute
>> binaries, you distribute tarball (or other) packages and a set of rules
>> to build and THEN run your application. I don't think that any of these
>> use rpms, although they should -- a well designed src rpm is a nearly
>
> RPM is not a panacea. It has lots of problems in general. The idea is
> good; just the implementation ranges from moderately ok to absolutely
> horrendous, depending upon what you want to do with it. If you view RPM
> as a fancy container for a package, albeit one that is slightly brain
> damaged, you are least likely to be bitten by some of its more
> interesting features. What features? Go look at the RedHat kernels
> circa 2.4 for all the work-arounds they needed to do to deal with its
> shortcomings.
>
> I keep hearing how terrible tarballs and zip files are for package
> distribution. But you know, unlike RPMs, they work the same,
> everywhere. Sure they don't have versioning and file/package registry
> for easy installation/removal. That is an easily fixable problem IMO.
> Sure they don't have scripting of the install. Again, this is easily
> fixable (jar files and par files are examples that come to mind for
> software distribution from java and perl respectively).

Sure, and when you're done you'll have reinvented an rpm, or close to it. Without doubt, a reinvention can avoid some problems with the original, whether it is a version bump or a completely new start. Subversion vs CVS comes to mind (because I'm trying to make that transition myself at this moment in time). It also generally introduces new problems, or reveals problems that the reinvention could've/should've fixed but didn't. As in, subversion has some very annoying features as well -- features that actually COMPLICATE maintaining a personal repository of a large base of packages compared to the much simpler but feature-poor CVS.
I'd have much rather seen an update of CVS to fix its flaws than a completely new paradigm with new flaws of its own that fixes USER LEVEL features more than anything else. Introducing e.g. cvs move and cvs remove seems like it is a lot simpler than creating an entire berkeley db back end.

In this particular case, if you want to see dark evil, check out pacman (the ATLAS solution to this problem -- that's the HEP/DOE ATLAS grid, not the linear algebra ATLAS):

http://physics.bu.edu/~youssef/pacman/

Note well the "linux-redhat-7.1" in their examples. Yes, they mean it. Telling you why would only end up with all of us having a headache and being nasty to children and pets for the rest of the day. That is, most reinventions of rpms are going to not only reinvent wheels, they will reinvent wheels BADLY (and of course have most of the problems with permissions and location and so on that rpms will have, as those problems are fundamental to the nature of the grid and have nothing to do with the packaging mechanism per se).

To be blunt, a good packaging mechanism will have the following: a compressed archive of the sources; a patching mechanism; an automated build/rebuild process; versioning; dependencies and dependency resolution; metadata; pre and post install scripts; both install and deinstall mechanisms. rpms have all of this but the dependency resolution part; first yup and then yum added dependency resolution (and better handling of metadata). Other packaging products either will have these features or they will be distinctly inferior. Everything else is just how the features are IMPLEMENTED. Using a tarball as the toplevel wrapper vs a cpio archive is an irrelevant change, for example.
rpms could very definitely manage things at the rpm db level better, and it is still POSSIBLE to build in circular dependencies and stuff like that, and there are always questions about how to obsolete things and whether or not it is possible to remove something and put something in that eventually replaces it without forcing the removal of its entire dependency tree and ITS reinstallation as well. However, most of the design decisions behind these things are conservative and defensive -- if you follow their rules (or yum's rules) you make it DIFFICULT to break your system, where rpm --force or the unbridled installation of tarballs onto an rpm-based system will INEVITABLY break your system. Frankly, having managed systems and wrestled with this for decades now, I think rpms (augmented by yum) are a little miracle. Not perfect, but for a bear they dance damn gracefully.

I am absolutely certain that one could replace pacman with rpms plus yum on top of any of the conservative distros (e.g. RHEL and derivatives such as CentOS, SuSE, probably even FC, CaOSity, Scientific Linux and less conservative derivatives) and end up with something far, far more robust. Just look at the example in e.g.

http://physics.bu.edu/~youssef/pacman/htmls/Language_overview.html

and see how many places there are where you see functionality that is redundant with rpms, only less robust. Hell, you could make a toolset called "ms_pacman" (sorry:-) that replaced the entire thing WITH a yum repository or a gentoo-like rpm build mechanism, if only the sandbox issue were worked out. And there are multiple solutions -- it is only peripherally an rpm issue. For yet another "solution", make /usr/local the binary sandbox for user-based rpms -- all OS rpms comply with the FHS, so /usr/local is completely free. At the end of a computation, blank /usr/local. Strictly control / otherwise -- program dependencies to be filled only from the approved/controlled grid distribution repos.
Make sure /usr/local is off of root's path (or use some other path for the sandbox). It just isn't that hard, except when you use systems that don't really comply with rpm's standards. If somebody wants to give me a grant to write it, I'd cheerily do so and contribute it to the public weal:-).

> We have seen up to a factor of 2 on chemistry codes. If your run takes
> 2 weeks (a number of our customers take longer than 2 weeks), it
> matters. If your run takes 2 minutes, it probably doesn't matter unless
> you need to do 10000 runs.

Exactly. Might even get a factor of 3. Sometimes it matters, more often it doesn't, and usually there are other factors of ~2 to be had from truly optimizing even on i386...

> It is not hard to manage binaries in general with a little thought and
> design. It is not good to purposefully run a system at a lower speed as
> a high performance computational resource unless the cost/pain of
> getting the better binaries is too large or simply impossible (e.g. some
> of the vendor code out there is still compiled against RedHat 7.1 on
> i386, makes supporting it ... exciting ... and not in a good way)

Ahh, then you'll REALLY like pacman and the ATLAS grid. Clearly an "exciting" project -- largely because they've been very, very slow to recognize that in the long run it costs FAR MORE to run on obsolete operating systems than it does to bite the bullet and port their (admittedly vast) code base to something approximating modern compilers and posix compliance, let alone hardware optimization. So multiarchitecture gridware isn't that hard -- even ad hoc hacked wheel-reinventing crap can be made to function. It could be (and probably is) also done well, of course, if anybody ever bothers.

>> For most of
>> the (embarrassingly parallel) jobs that use a grid in the first place,
>> the point is the massive numbers of CPUs with near perfect scaling, not
>> how much you eke out of each CPU.
>
> Grids are used not just for embarrassingly parallel jobs.
> They are also
> used to implement large distributed pipeline computing systems (in bio
> for example). These systems have throughput rates governed in large
> part by the performance per node. Running on a cluster would be ideal
> in many cases, as you will have that nice high bandwidth network fabric
> to help move data about (gigabit is good, IB and others are better for
> this).

Agreed. However, those grids are a bit different in architecture. There the grid is a union of clusters (basic definition), but each cluster is actually architected for the problem at hand. ATLAS actually was in this category -- there too, data storage issues were as or more important than raw processing power. The nature of the HEP world is to process massive data sets from runs, intermixing (for example) monte carlo simulations and other data transformations. Hundreds of terabytes (you'll recall from our discussions back then;-) was the STARTING point for a participating cluster center, with a clear scaling pathway to petabytes as the technology evolved.

So perhaps what I should have said (and did say, elsewhere) is that for most grids there is an immediate benefit to having more nodes/correctly architected clusters, even if you don't fully optimize per node. There is MORE of a benefit if you DO correctly optimize, where "correctly optimize" means at least, but is not limited to, rebuilding for the hardware architecture at hand (remake for x86_64 vs reuse i386 binaries). This is the kind of optimization that should quite reasonably be handled automagically by the gridware package. And it MAY be much more -- instrument the code to take advantage of cache sizes and relative memory bandwidths, use specific constellations of supported but nonstandard hardware instructions. You've already pointed out the basic economics of whether or not it is worthwhile to reach for this sort of optimization, which cannot be done, in general, by a gridware package management system except where it e.g. decides to install the correct (optimized) version of libraries such as ATLAS (the linear algebra system).

> Rapidly emerging from the pipeline/grid world for bio computing is
> something we have been saying all along: that the major pain is (apart
> from authentication, scheduling, etc.) data motion. There, CPU
> speed/type doesn't matter all that much. The problem is trying to force
> fit a steer through a straw. There are other problems associated with
> this as well, but the important aspect of these systems is measured in
> throughput (which is not number of jobs of embarrassingly parallel work
> per unit time, but how many threads and how much data you can push
> through per unit time). To use the steer and straw analogy, you can
> build a huge pipeline by aggregating many straws. Just don't ask the
> steer how he likes having parts of him being pushed through many straws.
> The pipeline for these folks is the computer (no, not the network).
> Databases factor into this mix. As do other things. The computations
> are rarely floating point intensive.
>
> Individual computation performance does matter, as pipelines do have
> transmission rates at least partially impacted by CPU performance. In
> some cases, long pipelines with significant computing tasks are CPU
> bound, and can take days/weeks. These are cases prime for acceleration
> by leveraging the better CPU technology.

Yes yes yes.

>>> in that way of thinking, grids make a lot of sense as a
>>> shrink-wrap-app farm.
>>
>> Sure. Or farms for any application where building a binary for the 2-3
>> distinct architectures takes five minutes per and you plan to run them
>> for months on hundreds of CPUs. Retuning and optimizing per
>> architecture being strictly optional -- do it if the return for doing so
>> outweighs the cost. Or if you have slave -- I mean "graduate student"
>> -- labor with nothing better to do:-)
>
> Heh...
> I remember doing empirical fits to energy levels and band
> structures and other bits of computing as an integral part of the
> computing path for my first serious computing assignment in grad school.
> I seem to remember thinking it could be automated, and starting to
> work on the Fortran code to do so. Perl was quite new then, not quite to
> version 3.
>
> Pipelines are set up and torn down with abandon. They are virtualized,
> so you never know which bit of processing you are going to do next, or
> where your data will come from, or where it is going to, until you get
> your marching orders. It is quite different from Monte Carlo. It is
> not embarrassingly parallel per node, but per pipe, which may use one
> through hundreds (thousands) of nodes.
>
> Most parallelization on clusters is the wide type: you distribute your
> run over as many nodes as practical for good performance.
> Parallelization on grids can either be trivial ala Monte Carlo, or
> pipeline based. Pipeline based parallelism is getting as much work done
> by creating the longest pipeline path practical, keeping as much work
> done per unit time as possible (and keeping the communication costs
> down). Call this deep-type parallelism. On some tasks, pipelines are
> very good for getting lots of work done. For other tasks they are not
> so good. There is an analogy with current CPU pipelines if you wish to
> make it.

Very interesting. So perhaps for certain CLASSES of task one can create automagical optimizers, either for the build process or the execution process, provided that you have tools that can directly measure or extract critical information about bandwidths, latencies, and bottlenecks along the pipeline pathway, both within nodes and between nodes. Which I completely believe, of course, and which is the basis for the xmlbenchd project that I've started and have had NO time for, for five months or so.
I did get the core benchmark code (ex cpu_rate) to spit out xml-wrapped results in time to show Jack Dongarra when he visited, and there are a few other humans in the world working on this at the computer science level. It does seem to be an essential component of building "portably efficient" grid applications, if not general cluster applications, where one CAN optimize on YOUR cluster but might prefer to use generic and portable tools to do so.

rgb

>
> Joe
>
>> rgb
>>
>>> regards, mark hahn.
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 734 786 8452
> cell : +1 734 612 4615