From deadline at eadline.org Tue Sep 14 17:34:44 2021 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 14 Sep 2021 13:34:44 -0400 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> Message-ID: <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> Here are the questions I am curious about. High core count is great, if it works for your performance goals. 1. Clock speed: All the turbo stuff is great for a low number of processes, but if you load all the cores, then you are now running at base clock speeds, which due to the large number of cores and the thermal envelope is often not that fast. 2. Memory BW: Take the number of memory channels multiply by the memory speed BW and divide by the number of cores. That is of course a worst case BW/core, however, memory hungry apps may not be able to use all the cores. 3. Cache Size: same idea as Memory, why do you think things like AMD 3D V-Cache will be landing very soon. IMO fat-core processors are designed and work best for cloud applications; shared use, bursty applications, containers. HPC tends to light everything up at once for long periods of time. -- Doug > Not anymore, at least not in the HPC realm.?? We recently purchased > quad-socket systems with a total of 96 Intel cores/node, and dual socket > systems with 128 AMD cores/node. > > With Intel now marking their "highly scalable" (or something like that) > line of processors, and AMD, who was always pushing highr core-counts, > back in the game, I think numbers like that will be common in HPC > clusters puchased in the next year or so. > > But, yeah, I guess 28 physical cores is more than the average desktop > has these days. > > > Prentice > > On 8/24/21 6:42 PM, Jonathan Engwall wrote: >> EMC offers dual socket 28 physical core processors. That's a lot of >> computer. >> >> On Tue, Aug 24, 2021, 1:33 PM Lux, Jim (US 7140) via Beowulf >> > wrote: >> >> Yes, indeed.. I didn't call out Limulus, because it was mentioned >> earlier in the thread. >> >> And another reason why you might want your own. >> Every so often, the notice from JPL's HPC goes out to the users - >> "Halo/Gattaca/clustername will not be available because it is >> reserved for Mars {Year}"?? While Mars landings at JPL are a *big >> deal*, not everyone is working on them (in fact, by that time, >> most of the Martians are now working on something else), and you >> want to get your work done.?? I suspect other institutional >> clusters have similar "the 800 pound (363 kg) gorilla has >> requested" scenarios. >> >> >> ???On 8/24/21, 11:34 AM, "Douglas Eadline" > > wrote: >> >> >> ?? ?? Jim, >> >> ?? ?? You are describing a lot of the design pathway for Limulus >> ?? ?? clusters. The local (non-data center) power, heat, noise are >> all >> ?? ?? minimized while performance is maximized. >> >> ?? ?? A well decked out system is often less than $10K and >> ?? ?? are on par with a fat multi-core workstations. >> ?? ?? (and there are reasons a clustered approach performs better) >> >> ?? ?? Another use case is where there is no available research data >> center >> ?? ?? hardware because there is no specialized >> sysadmins/space/budget. >> ?? ?? (Many smaller colleges and universities fall into this >> ?? ?? group). Plus, often times, dropping something into a data >> center >> ?? ?? means an additional cost to the researchers budget. >> >> ?? ?? -- >> ?? ?? Doug >> >> >> ?? 
?? > I've been looking at "small scale" clusters for a long time >> (2000?)?? and >> ?? ?? > talked a lot with the folks from Orion, as well as on this >> list. >> ?? ?? > They fit in a "hard to market to" niche. >> ?? ?? > >> ?? ?? > My own workflow tends to have use cases that are a big >> "off-nominal" - one >> ?? ?? > is the rapid iteration of a computational model while >> experimenting - That >> ?? ?? > is, I have a python code that generates input to Numerical >> ?? ?? > Electromagnetics Code (NEC), I run the model over a range of >> parameters, >> ?? ?? > then look at the output to see if I'm getting what what I >> want. If not, I >> ?? ?? > change the code (which essentially changes the antenna >> design), rerun the >> ?? ?? > models, and see if it worked.?? I'd love an iteration time >> of >> "a minute or >> ?? ?? > two" for the computation, maybe a minute or two to plot the >> outputs >> ?? ?? > (fiddling with the plot ranges, etc.).?? For reference, for >> a >> radio >> ?? ?? > astronomy array on the far side of the Moon, I was running >> 144 cases, each >> ?? ?? > at 380 frequencies: to run 1 case takes 30 seconds, so >> farming it out to >> ?? ?? > 12 processors gave me a 6 minute run time, which is in the >> right range. >> ?? ?? > Another model of interaction of antnenas on a spacecraft >> runs about 15 >> ?? ?? > seconds/case; and a third is about 120 seconds/case. >> ?? ?? > >> ?? ?? > To get "interactive development", then, I want the "cycle >> time" to be 10 >> ?? ?? > minutes - 30 minutes of thinking about how to change the >> design and >> ?? ?? > altering the code to generate the new design, make a couple >> test runs to >> ?? ?? > find the equivalent of "syntax errors", and then turn it >> loose - get a cup >> ?? ?? > of coffee, answer a few emails, come back and see the >> results.?? I could >> ?? ?? > iterate maybe a half dozen shots a day, which is pretty >> productive. >> ?? ?? > (Compared to straight up sequential - 144 runs at 30 seconds >> is more than >> ?? ?? > an hour - and that triggers a different working cadence that >> devolves to >> ?? ?? > sort of one shot a day) - The "10 minute" turnaround is also >> compatible >> ?? ?? > with my job, which, unfortunately, has things other than >> computing - >> ?? ?? > meetings, budgets, schedules.?? At 10 minute runs, I can >> carve out a few >> ?? ?? > hours and get into that "flow state" on the technical >> problem, before >> ?? ?? > being disrupted by "a person from Porlock." >> ?? ?? > >> ?? ?? > So this is, I think, a classic example of?? "I want local >> control" - sure, >> ?? ?? > you might have access to a 1000 or more node cluster, but >> you're going to >> ?? ?? > have to figure out how to use its batch management system >> (SLURM and PBS >> ?? ?? > are two I've used) - and that's a bit different than "self >> managed 100% >> ?? ?? > access". Or, AWS kinds of solutions for EP problems. >> ??There's something >> ?? ?? > very satisfying about getting an idea and not having to "ok, >> now I have to >> ?? ?? > log in to the remote cluster with TFA, set up the tunnel, >> move my data, >> ?? ?? > get the job spun up, get the data back" - especially for >> iterative >> ?? ?? > development.?? I did do that using JPLs and TACCs clusters, >> and "moving >> ?? ?? > data" proved to be a barrier - the other thing was the >> "iterative code >> ?? ?? > development" in between runs - Most institutional clusters >> discourage >> ?? ?? > interactive development on the cluster (even if you're only >> sucking up one >> ?? 
?? > core).?? ??If the tools were a bit more "transparent" and >> there were "shared >> ?? ?? > disk" capabilities, this might be more attractive, and while >> everyone is >> ?? ?? > exceedingly helpful, there are still barriers to making it >> "run it on my >> ?? ?? > desktop" >> ?? ?? > >> ?? ?? > Another use case that I wind up designing for is the "HPC in >> places >> ?? ?? > without good communications and limited infrastructure" -?? >> The notional >> ?? ?? > use case might be an archaeological expedition wanting to >> use HPC to >> ?? ?? > process ground penetrating radar data or something like >> that.?? ??(or, given >> ?? ?? > that I work at JPL, you have a need for HPC on the surface >> of Mars) - So >> ?? ?? > sending your data to a remote cluster isn't really an >> option.?? And here, >> ?? ?? > the "speedup" you need might well be a factor of 10-20 over >> a single >> ?? ?? > computer, something doable in a "portable" configuration >> (check it as >> ?? ?? > luggage, for instance). Just as for my antenna modeling >> problems, turning >> ?? ?? > an "overnight" computation into a "10-20 minute" computation >> would change >> ?? ?? > the workflow dramatically. >> ?? ?? > >> ?? ?? > >> ?? ?? > Another market is "learn how to cluster" - for which the RPi >> clusters work >> ?? ?? > (or "packs" of Beagleboards) - they're fun, and in a >> classroom >> ?? ?? > environment, I think they are an excellent cost effective >> solution to >> ?? ?? > learning all the facets of "bringing up a cluster from >> scratch", but I'm >> ?? ?? > not convinced they provide a good "MIPS/Watt" or >> "MIPS/liter" metric - in >> ?? ?? > terms of convenience.?? That is, rather than a cluster of 10 >> RPis, you >> ?? ?? > might be better off just buying a faster desktop machine. >> ?? ?? > >> ?? ?? > Let's talk design desirements/constraints >> ?? ?? > >> ?? ?? > I've had a chance to use some "clusters in a box" over the >> last decades, >> ?? ?? > and I'd suggest that while power is one constraint, another >> is noise. >> ?? ?? > Just the other day, I was in a lab and someone commented >> that "those >> ?? ?? > computers are amazingly fast, but you really need to put >> them in another >> ?? ?? > room". Yes, all those 1U and 2U rack mounted boxes with tiny >> fans >> ?? ?? > screaming is just not "office compatible"?? ??And that kind >> of >> brings up >> ?? ?? > another interesting constraint for "deskside" computing - >> heat.?? Sure you >> ?? ?? > can plug in 1500W of computers (or even 3000W if you have >> two circuits), >> ?? ?? > but can you live in your office with a 1500W space heater? >> ?? ?? > Interestingly, for *my* workflow, that's probably ok - *my* >> computation >> ?? ?? > has a 10-30% duty cycle - think for 30 minutes, compute for >> 5-10.?? But >> ?? ?? > still, your office mate will appreciate if you keep the >> sound level down >> ?? ?? > to 50dBA. >> ?? ?? > >> ?? ?? > GPUs - some codes can use them, some can't.?? They tend, >> though, to be >> ?? ?? > noisy (all that air flow for cooling). I don't know that GPU >> manufacturers >> ?? ?? > spend a lot of time on this.?? Sure, I've seen charts and >> specs that claim >> ?? ?? > <50 dBA. But I think they're gaming the measurement, >> counting on the user >> ?? ?? > to be a gamer wearing headphones or with a big sound >> system.?? I will say, >> ?? ?? > for instance, that the PS/4 positively roars when spun up >> unless you????????ve >> ?? ?? > got external forced ventilation to keep the inlet air temp >> low. >> ?? ?? > >> ?? ?? 
> Looking at GSA guidelines for office space - if it's >> "deskside" it's got >> ?? ?? > to fit in the 50-80 square foot cubicle or your shared part >> of a 120 >> ?? ?? > square foot office. >> ?? ?? > >> ?? ?? > Then one needs to figure out the "refresh cycle time" for >> buying hardware >> ?? ?? > - This has been a topic on this list forever - you have 2 >> years of >> ?? ?? > computation to do: do you buy N nodes today at speed X, or >> do you wait a >> ?? ?? > year, buy N/2 nodes at speed 4X, and finish your computation >> at the same >> ?? ?? > time. >> ?? ?? > >> ?? ?? > Fancy desktop PCs with monitors, etc. come in at under $5k, >> including >> ?? ?? > burdens and installation, but not including monthly service >> charges (in an >> ?? ?? > institutional environment).?? If you look at "purchase >> limits" there's some >> ?? ?? > thresholds (usually around $10k, then increasing in factors >> of 10 or 100 >> ?? ?? > steps) for approvals.?? So a $100k deskside box is going to >> be a tough >> ?? ?? > sell. >> ?? ?? > >> ?? ?? > >> ?? ?? > >> ?? ?? > ??????On 8/24/21, 6:07 AM, "Beowulf on behalf of Douglas >> Eadline" >> ?? ?? > > on behalf of >> deadline at eadline.org > wrote: >> ?? ?? > >> ?? ?? >?? ?? ??Jonathan >> ?? ?? > >> ?? ?? >?? ?? ??It is a real cluster, available in 4 and 8 node >> versions. >> ?? ?? >?? ?? ??The design if for non-data center use. That is, local >> ?? ?? >?? ?? ??office, lab, home where power, cooling, and noise >> ?? ?? >?? ?? ??are important. More info here: >> ?? ?? > >> ?? ?? > >> https://urldefense.us/v3/__https://www.limulus-computing.com__;!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XDpKU4fOA$ >> >> ?? ?? > >> https://urldefense.us/v3/__https://www.limulus-computing.com/Limulus-Manual__;!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XD7eWwVuM$ >> >> ?? ?? > >> ?? ?? >?? ?? ??-- >> ?? ?? >?? ?? ??Doug >> ?? ?? > >> ?? ?? > >> ?? ?? > >> ?? ?? >?? ?? ??> Hi Doug, >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> Not to derail the discussion, but a quick question >> you >> say desk >> ?? ?? > side >> ?? ?? >?? ?? ??> cluster is it a single machine that will run a vm >> cluster? >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> Regards, >> ?? ?? >?? ?? ??> Jonathan >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> -----Original Message----- >> ?? ?? >?? ?? ??> From: Beowulf > > On Behalf Of Douglas >> ?? ?? > Eadline >> ?? ?? >?? ?? ??> Sent: 23 August 2021 23:12 >> ?? ?? >?? ?? ??> To: John Hearns > > >> ?? ?? >?? ?? ??> Cc: Beowulf Mailing List > > >> ?? ?? >?? ?? ??> Subject: Re: [Beowulf] List archives >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> John, >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> I think that was on twitter. >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> In any case, I'm working with these processors >> right now. >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> On the new Ryzens, the power usage is actually >> quite >> tunable. >> ?? ?? >?? ?? ??> There are three settings. >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> 1) Package Power Tracking: The PPT threshold is the >> allowed socket >> ?? ?? > power >> ?? ?? >?? ?? ??> consumption permitted across the voltage rails >> supplying the >> ?? ?? > socket. >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> 2) Thermal Design Current: The maximum current >> (TDC) >> (amps) that can >> ?? ?? > be >> ?? ?? >?? ?? ??> delivered by a specific motherboard's voltage >> regulator >> ?? ?? > configuration in >> ?? ?? >?? ?? ??> thermally-constrained scenarios. >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? 
??> 3) Electrical Design Current: The maximum current >> (EDC) (amps) that >> ?? ?? > can be >> ?? ?? >?? ?? ??> delivered by a specific motherboard's voltage >> regulator >> ?? ?? > configuration in a >> ?? ?? >?? ?? ??> peak ("spike") condition for a short period of >> time. >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> My goal is to tweak the 105W TDP R7-5800X so it >> draws >> power like >> ?? ?? > the >> ?? ?? >?? ?? ??> 65W-TDP R5-5600X >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> This is desk-side cluster low power stuff. >> ?? ?? >?? ?? ??> I am using extension cable-plug for Limulus blades >> that have an >> ?? ?? > in-line >> ?? ?? >?? ?? ??> current meter (normally used for solar panels). >> ?? ?? >?? ?? ??> Now I can load them up and watch exactly how much >> current is being >> ?? ?? > pulled >> ?? ?? >?? ?? ??> across the 12V rails. >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> If you need more info, let me know >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> -- >> ?? ?? >?? ?? ??> Doug >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??>> The Beowulf list archives seem to end in July >> 2021. >> ?? ?? >?? ?? ??>> I was looking for Doug Eadline's post on limiting >> AMD >> power and >> ?? ?? > the >> ?? ?? >?? ?? ??>> results on performance. >> ?? ?? >?? ?? ??>> >> ?? ?? >?? ?? ??>> John H >> ?? ?? >?? ?? ??>> _______________________________________________ >> ?? ?? >?? ?? ??>> Beowulf mailing list, Beowulf at beowulf.org >> sponsored by Penguin >> ?? ?? >?? ?? ??>> Computing To change your subscription (digest mode >> or >> unsubscribe) >> ?? ?? >?? ?? ??>> visit >> ?? ?? >?? ?? ??>> >> https://urldefense.us/v3/__https://link.edgepilot.com/s/9c656d83/pBaaRl2iV0OmLHAXqkoDZQ?u=https:*__;Lw!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XDvUGSdHI$ >> >> ?? ?? >?? ?? ??>> /beowulf.org/cgi-bin/mailman/listinfo/beowulf >> >> ?? ?? >?? ?? ??>> >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> -- >> ?? ?? >?? ?? ??> Doug >> ?? ?? >?? ?? ??> >> ?? ?? >?? ?? ??> _______________________________________________ >> ?? ?? >?? ?? ??> Beowulf mailing list, Beowulf at beowulf.org >> sponsored by Penguin >> ?? ?? > Computing >> ?? ?? >?? ?? ??> To change your subscription (digest mode or >> unsubscribe) visit >> ?? ?? >?? ?? ??> >> https://urldefense.us/v3/__https://link.edgepilot.com/s/9c656d83/pBaaRl2iV0OmLHAXqkoDZQ?u=https:**Abeowulf.org*cgi-bin*mailman*listinfo*beowulf__;Ly8vLy8v!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XDUP8JZUc$ >> >> ?? ?? >?? ?? ??> >> ?? ?? > >> ?? ?? > >> ?? ?? >?? ?? ??-- >> ?? ?? >?? ?? ??Doug >> ?? ?? > >> ?? ?? >?? ?? ??_______________________________________________ >> ?? ?? >?? ?? ??Beowulf mailing list, Beowulf at beowulf.org >> sponsored by Penguin >> ?? ?? > Computing >> ?? ?? >?? ?? ??To change your subscription (digest mode or >> unsubscribe) >> visit >> ?? ?? > >> https://urldefense.us/v3/__https://beowulf.org/cgi-bin/mailman/listinfo/beowulf__;!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XDv6c1nNc$ >> >> ?? ?? > >> ?? ?? > >> >> >> ?? ?? -- >> ?? ?? 
Doug
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>

--
Doug

From pbisbal at pppl.gov  Fri Sep 17 17:51:09 2021
From: pbisbal at pppl.gov (Prentice Bisbal)
Date: Fri, 17 Sep 2021 13:51:09 -0400
Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters
In-Reply-To: <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org>
References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org>
 <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu>
 <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org>
Message-ID:

We ran a number of apps on evaluation systems before determining that the 96-core Intel systems provided the best results. I was responsible for running the HPL and HPCG benchmarks. We had researchers run their various simulation codes.

I agree with just about everything you said, especially (1) - I think turbo frequencies are irrelevant for HPC, since all the cores will typically be pinned during an HPC job. When I calculated the theoretical FLOPS for the evaluation systems, I had to look at the AVX-512 frequencies for the Intel processors, since that is yet another operating frequency. To Intel's credit, the different CPU frequency information, and when those frequencies would be invoked, was available online for all but their newest processors (probably just hadn't been published yet), whereas I couldn't find any frequency-stepping information for the AMDs.

I often point out to people that clock frequencies have been decreasing as core counts go up. I remember in the early 2000s that CPUs with a clock of ~3.5 GHz were pretty common. Now it seems most processors have a base clock below 3 GHz, and only go above that in "turbo mode".

Where I disagree with you is (3). Whether or not cache size is important depends on the size of the job. If you're iterating through data-parallel loops over a large dataset that exceeds cache size, the opportunity to reread cached data is probably limited or nonexistent. As we often say here, "it depends". I'm sure someone with better low-level hardware knowledge will pipe in and tell me why I'm wrong (Cunningham's Law).

Prentice

On 9/14/21 1:34 PM, Douglas Eadline wrote:
> Here are the questions I am curious about. High core count is great,
> if it works for your performance goals.
>
> 1. Clock speed: All the turbo stuff is great for a low
> number of processes, but if you load all the cores, then you are
> now running at base clock speeds, which due to the large number
> of cores and the thermal envelope is often not that fast.
>
> 2. Memory BW: Take the number of memory channels, multiply by
> the memory speed BW, and divide by the number of cores. That
> is of course a worst case BW/core; however, memory hungry apps
> may not be able to use all the cores.
>
> 3. 
Cache Size: same idea as Memory, why do you think > things like AMD 3D V-Cache will be landing very soon. > > IMO fat-core processors are designed and work best > for cloud applications; shared use, bursty applications, > containers. HPC tends to light everything up at once for > long periods of time. > > -- > Doug > > > >> Not anymore, at least not in the HPC realm.?? We recently purchased >> quad-socket systems with a total of 96 Intel cores/node, and dual socket >> systems with 128 AMD cores/node. >> >> With Intel now marking their "highly scalable" (or something like that) >> line of processors, and AMD, who was always pushing highr core-counts, >> back in the game, I think numbers like that will be common in HPC >> clusters puchased in the next year or so. >> >> But, yeah, I guess 28 physical cores is more than the average desktop >> has these days. >> >> >> Prentice >> >> On 8/24/21 6:42 PM, Jonathan Engwall wrote: >>> EMC offers dual socket 28 physical core processors. That's a lot of >>> computer. >>> >>> On Tue, Aug 24, 2021, 1:33 PM Lux, Jim (US 7140) via Beowulf >>> > wrote: >>> >>> Yes, indeed.. I didn't call out Limulus, because it was mentioned >>> earlier in the thread. >>> >>> And another reason why you might want your own. >>> Every so often, the notice from JPL's HPC goes out to the users - >>> "Halo/Gattaca/clustername will not be available because it is >>> reserved for Mars {Year}"?? While Mars landings at JPL are a *big >>> deal*, not everyone is working on them (in fact, by that time, >>> most of the Martians are now working on something else), and you >>> want to get your work done.?? I suspect other institutional >>> clusters have similar "the 800 pound (363 kg) gorilla has >>> requested" scenarios. >>> >>> >>> ???On 8/24/21, 11:34 AM, "Douglas Eadline" >> > wrote: >>> >>> >>> ?? ?? Jim, >>> >>> ?? ?? You are describing a lot of the design pathway for Limulus >>> ?? ?? clusters. The local (non-data center) power, heat, noise are >>> all >>> ?? ?? minimized while performance is maximized. >>> >>> ?? ?? A well decked out system is often less than $10K and >>> ?? ?? are on par with a fat multi-core workstations. >>> ?? ?? (and there are reasons a clustered approach performs better) >>> >>> ?? ?? Another use case is where there is no available research data >>> center >>> ?? ?? hardware because there is no specialized >>> sysadmins/space/budget. >>> ?? ?? (Many smaller colleges and universities fall into this >>> ?? ?? group). Plus, often times, dropping something into a data >>> center >>> ?? ?? means an additional cost to the researchers budget. >>> >>> ?? ?? -- >>> ?? ?? Doug >>> >>> >>> ?? ?? > I've been looking at "small scale" clusters for a long time >>> (2000?)?? and >>> ?? ?? > talked a lot with the folks from Orion, as well as on this >>> list. >>> ?? ?? > They fit in a "hard to market to" niche. >>> ?? ?? > >>> ?? ?? > My own workflow tends to have use cases that are a big >>> "off-nominal" - one >>> ?? ?? > is the rapid iteration of a computational model while >>> experimenting - That >>> ?? ?? > is, I have a python code that generates input to Numerical >>> ?? ?? > Electromagnetics Code (NEC), I run the model over a range of >>> parameters, >>> ?? ?? > then look at the output to see if I'm getting what what I >>> want. If not, I >>> ?? ?? > change the code (which essentially changes the antenna >>> design), rerun the >>> ?? ?? > models, and see if it worked.?? I'd love an iteration time >>> of >>> "a minute or >>> ?? ?? 
> two" for the computation, maybe a minute or two to plot the >>> outputs >>> ?? ?? > (fiddling with the plot ranges, etc.).?? For reference, for >>> a >>> radio >>> ?? ?? > astronomy array on the far side of the Moon, I was running >>> 144 cases, each >>> ?? ?? > at 380 frequencies: to run 1 case takes 30 seconds, so >>> farming it out to >>> ?? ?? > 12 processors gave me a 6 minute run time, which is in the >>> right range. >>> ?? ?? > Another model of interaction of antnenas on a spacecraft >>> runs about 15 >>> ?? ?? > seconds/case; and a third is about 120 seconds/case. >>> ?? ?? > >>> ?? ?? > To get "interactive development", then, I want the "cycle >>> time" to be 10 >>> ?? ?? > minutes - 30 minutes of thinking about how to change the >>> design and >>> ?? ?? > altering the code to generate the new design, make a couple >>> test runs to >>> ?? ?? > find the equivalent of "syntax errors", and then turn it >>> loose - get a cup >>> ?? ?? > of coffee, answer a few emails, come back and see the >>> results.?? I could >>> ?? ?? > iterate maybe a half dozen shots a day, which is pretty >>> productive. >>> ?? ?? > (Compared to straight up sequential - 144 runs at 30 seconds >>> is more than >>> ?? ?? > an hour - and that triggers a different working cadence that >>> devolves to >>> ?? ?? > sort of one shot a day) - The "10 minute" turnaround is also >>> compatible >>> ?? ?? > with my job, which, unfortunately, has things other than >>> computing - >>> ?? ?? > meetings, budgets, schedules.?? At 10 minute runs, I can >>> carve out a few >>> ?? ?? > hours and get into that "flow state" on the technical >>> problem, before >>> ?? ?? > being disrupted by "a person from Porlock." >>> ?? ?? > >>> ?? ?? > So this is, I think, a classic example of?? "I want local >>> control" - sure, >>> ?? ?? > you might have access to a 1000 or more node cluster, but >>> you're going to >>> ?? ?? > have to figure out how to use its batch management system >>> (SLURM and PBS >>> ?? ?? > are two I've used) - and that's a bit different than "self >>> managed 100% >>> ?? ?? > access". Or, AWS kinds of solutions for EP problems. >>> ??There's something >>> ?? ?? > very satisfying about getting an idea and not having to "ok, >>> now I have to >>> ?? ?? > log in to the remote cluster with TFA, set up the tunnel, >>> move my data, >>> ?? ?? > get the job spun up, get the data back" - especially for >>> iterative >>> ?? ?? > development.?? I did do that using JPLs and TACCs clusters, >>> and "moving >>> ?? ?? > data" proved to be a barrier - the other thing was the >>> "iterative code >>> ?? ?? > development" in between runs - Most institutional clusters >>> discourage >>> ?? ?? > interactive development on the cluster (even if you're only >>> sucking up one >>> ?? ?? > core).?? ??If the tools were a bit more "transparent" and >>> there were "shared >>> ?? ?? > disk" capabilities, this might be more attractive, and while >>> everyone is >>> ?? ?? > exceedingly helpful, there are still barriers to making it >>> "run it on my >>> ?? ?? > desktop" >>> ?? ?? > >>> ?? ?? > Another use case that I wind up designing for is the "HPC in >>> places >>> ?? ?? > without good communications and limited infrastructure" -? >>> The notional >>> ?? ?? > use case might be an archaeological expedition wanting to >>> use HPC to >>> ?? ?? > process ground penetrating radar data or something like >>> that.?? ??(or, given >>> ?? ?? > that I work at JPL, you have a need for HPC on the surface >>> of Mars) - So >>> ?? ?? 
> sending your data to a remote cluster isn't really an >>> option.?? And here, >>> ?? ?? > the "speedup" you need might well be a factor of 10-20 over >>> a single >>> ?? ?? > computer, something doable in a "portable" configuration >>> (check it as >>> ?? ?? > luggage, for instance). Just as for my antenna modeling >>> problems, turning >>> ?? ?? > an "overnight" computation into a "10-20 minute" computation >>> would change >>> ?? ?? > the workflow dramatically. >>> ?? ?? > >>> ?? ?? > >>> ?? ?? > Another market is "learn how to cluster" - for which the RPi >>> clusters work >>> ?? ?? > (or "packs" of Beagleboards) - they're fun, and in a >>> classroom >>> ?? ?? > environment, I think they are an excellent cost effective >>> solution to >>> ?? ?? > learning all the facets of "bringing up a cluster from >>> scratch", but I'm >>> ?? ?? > not convinced they provide a good "MIPS/Watt" or >>> "MIPS/liter" metric - in >>> ?? ?? > terms of convenience.?? That is, rather than a cluster of 10 >>> RPis, you >>> ?? ?? > might be better off just buying a faster desktop machine. >>> ?? ?? > >>> ?? ?? > Let's talk design desirements/constraints >>> ?? ?? > >>> ?? ?? > I've had a chance to use some "clusters in a box" over the >>> last decades, >>> ?? ?? > and I'd suggest that while power is one constraint, another >>> is noise. >>> ?? ?? > Just the other day, I was in a lab and someone commented >>> that "those >>> ?? ?? > computers are amazingly fast, but you really need to put >>> them in another >>> ?? ?? > room". Yes, all those 1U and 2U rack mounted boxes with tiny >>> fans >>> ?? ?? > screaming is just not "office compatible"?? ??And that kind >>> of >>> brings up >>> ?? ?? > another interesting constraint for "deskside" computing - >>> heat.?? Sure you >>> ?? ?? > can plug in 1500W of computers (or even 3000W if you have >>> two circuits), >>> ?? ?? > but can you live in your office with a 1500W space heater? >>> ?? ?? > Interestingly, for *my* workflow, that's probably ok - *my* >>> computation >>> ?? ?? > has a 10-30% duty cycle - think for 30 minutes, compute for >>> 5-10.?? But >>> ?? ?? > still, your office mate will appreciate if you keep the >>> sound level down >>> ?? ?? > to 50dBA. >>> ?? ?? > >>> ?? ?? > GPUs - some codes can use them, some can't.?? They tend, >>> though, to be >>> ?? ?? > noisy (all that air flow for cooling). I don't know that GPU >>> manufacturers >>> ?? ?? > spend a lot of time on this.?? Sure, I've seen charts and >>> specs that claim >>> ?? ?? > <50 dBA. But I think they're gaming the measurement, >>> counting on the user >>> ?? ?? > to be a gamer wearing headphones or with a big sound >>> system.?? I will say, >>> ?? ?? > for instance, that the PS/4 positively roars when spun up >>> unless you????????ve >>> ?? ?? > got external forced ventilation to keep the inlet air temp >>> low. >>> ?? ?? > >>> ?? ?? > Looking at GSA guidelines for office space - if it's >>> "deskside" it's got >>> ?? ?? > to fit in the 50-80 square foot cubicle or your shared part >>> of a 120 >>> ?? ?? > square foot office. >>> ?? ?? > >>> ?? ?? > Then one needs to figure out the "refresh cycle time" for >>> buying hardware >>> ?? ?? > - This has been a topic on this list forever - you have 2 >>> years of >>> ?? ?? > computation to do: do you buy N nodes today at speed X, or >>> do you wait a >>> ?? ?? > year, buy N/2 nodes at speed 4X, and finish your computation >>> at the same >>> ?? ?? > time. >>> ?? ?? > >>> ?? ?? > Fancy desktop PCs with monitors, etc. 
come in at under $5k, >>> including >>> ?? ?? > burdens and installation, but not including monthly service >>> charges (in an >>> ?? ?? > institutional environment).?? If you look at "purchase >>> limits" there's some >>> ?? ?? > thresholds (usually around $10k, then increasing in factors >>> of 10 or 100 >>> ?? ?? > steps) for approvals.?? So a $100k deskside box is going to >>> be a tough >>> ?? ?? > sell. >>> ?? ?? > >>> ?? ?? > >>> ?? ?? > >>> ?? ?? > ??????On 8/24/21, 6:07 AM, "Beowulf on behalf of Douglas >>> Eadline" >>> ?? ?? > >> on behalf of >>> deadline at eadline.org > wrote: >>> ?? ?? > >>> ?? ?? >?? ?? ??Jonathan >>> ?? ?? > >>> ?? ?? >?? ?? ??It is a real cluster, available in 4 and 8 node >>> versions. >>> ?? ?? >?? ?? ??The design if for non-data center use. That is, local >>> ?? ?? >?? ?? ??office, lab, home where power, cooling, and noise >>> ?? ?? >?? ?? ??are important. More info here: >>> ?? ?? > >>> ?? ?? > >>> https://urldefense.us/v3/__https://www.limulus-computing.com__;!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XDpKU4fOA$ >>> >>> ?? ?? > >>> https://urldefense.us/v3/__https://www.limulus-computing.com/Limulus-Manual__;!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XD7eWwVuM$ >>> >>> ?? ?? > >>> ?? ?? >?? ?? ??-- >>> ?? ?? >?? ?? ??Doug >>> ?? ?? > >>> ?? ?? > >>> ?? ?? > >>> ?? ?? >?? ?? ??> Hi Doug, >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> Not to derail the discussion, but a quick question >>> you >>> say desk >>> ?? ?? > side >>> ?? ?? >?? ?? ??> cluster is it a single machine that will run a vm >>> cluster? >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> Regards, >>> ?? ?? >?? ?? ??> Jonathan >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> -----Original Message----- >>> ?? ?? >?? ?? ??> From: Beowulf >> > On Behalf Of Douglas >>> ?? ?? > Eadline >>> ?? ?? >?? ?? ??> Sent: 23 August 2021 23:12 >>> ?? ?? >?? ?? ??> To: John Hearns >> > >>> ?? ?? >?? ?? ??> Cc: Beowulf Mailing List >> > >>> ?? ?? >?? ?? ??> Subject: Re: [Beowulf] List archives >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> John, >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> I think that was on twitter. >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> In any case, I'm working with these processors >>> right now. >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> On the new Ryzens, the power usage is actually >>> quite >>> tunable. >>> ?? ?? >?? ?? ??> There are three settings. >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> 1) Package Power Tracking: The PPT threshold is the >>> allowed socket >>> ?? ?? > power >>> ?? ?? >?? ?? ??> consumption permitted across the voltage rails >>> supplying the >>> ?? ?? > socket. >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> 2) Thermal Design Current: The maximum current >>> (TDC) >>> (amps) that can >>> ?? ?? > be >>> ?? ?? >?? ?? ??> delivered by a specific motherboard's voltage >>> regulator >>> ?? ?? > configuration in >>> ?? ?? >?? ?? ??> thermally-constrained scenarios. >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> 3) Electrical Design Current: The maximum current >>> (EDC) (amps) that >>> ?? ?? > can be >>> ?? ?? >?? ?? ??> delivered by a specific motherboard's voltage >>> regulator >>> ?? ?? > configuration in a >>> ?? ?? >?? ?? ??> peak ("spike") condition for a short period of >>> time. >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> My goal is to tweak the 105W TDP R7-5800X so it >>> draws >>> power like >>> ?? ?? > the >>> ?? ?? >?? ?? ??> 65W-TDP R5-5600X >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? 
??> This is desk-side cluster low power stuff. >>> ?? ?? >?? ?? ??> I am using extension cable-plug for Limulus blades >>> that have an >>> ?? ?? > in-line >>> ?? ?? >?? ?? ??> current meter (normally used for solar panels). >>> ?? ?? >?? ?? ??> Now I can load them up and watch exactly how much >>> current is being >>> ?? ?? > pulled >>> ?? ?? >?? ?? ??> across the 12V rails. >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> If you need more info, let me know >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> -- >>> ?? ?? >?? ?? ??> Doug >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??>> The Beowulf list archives seem to end in July >>> 2021. >>> ?? ?? >?? ?? ??>> I was looking for Doug Eadline's post on limiting >>> AMD >>> power and >>> ?? ?? > the >>> ?? ?? >?? ?? ??>> results on performance. >>> ?? ?? >?? ?? ??>> >>> ?? ?? >?? ?? ??>> John H >>> ?? ?? >?? ?? ??>> _______________________________________________ >>> ?? ?? >?? ?? ??>> Beowulf mailing list, Beowulf at beowulf.org >>> sponsored by Penguin >>> ?? ?? >?? ?? ??>> Computing To change your subscription (digest mode >>> or >>> unsubscribe) >>> ?? ?? >?? ?? ??>> visit >>> ?? ?? >?? ?? ??>> >>> https://urldefense.us/v3/__https://link.edgepilot.com/s/9c656d83/pBaaRl2iV0OmLHAXqkoDZQ?u=https:*__;Lw!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XDvUGSdHI$ >>> >>> ?? ?? >?? ?? ??>> /beowulf.org/cgi-bin/mailman/listinfo/beowulf >>> >>> ?? ?? >?? ?? ??>> >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> -- >>> ?? ?? >?? ?? ??> Doug >>> ?? ?? >?? ?? ??> >>> ?? ?? >?? ?? ??> _______________________________________________ >>> ?? ?? >?? ?? ??> Beowulf mailing list, Beowulf at beowulf.org >>> sponsored by Penguin >>> ?? ?? > Computing >>> ?? ?? >?? ?? ??> To change your subscription (digest mode or >>> unsubscribe) visit >>> ?? ?? >?? ?? ??> >>> https://urldefense.us/v3/__https://link.edgepilot.com/s/9c656d83/pBaaRl2iV0OmLHAXqkoDZQ?u=https:**Abeowulf.org*cgi-bin*mailman*listinfo*beowulf__;Ly8vLy8v!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XDUP8JZUc$ >>> >>> ?? ?? >?? ?? ??> >>> ?? ?? > >>> ?? ?? > >>> ?? ?? >?? ?? ??-- >>> ?? ?? >?? ?? ??Doug >>> ?? ?? > >>> ?? ?? >?? ?? ??_______________________________________________ >>> ?? ?? >?? ?? ??Beowulf mailing list, Beowulf at beowulf.org >>> sponsored by Penguin >>> ?? ?? > Computing >>> ?? ?? >?? ?? ??To change your subscription (digest mode or >>> unsubscribe) >>> visit >>> ?? ?? > >>> https://urldefense.us/v3/__https://beowulf.org/cgi-bin/mailman/listinfo/beowulf__;!!PvBDto6Hs4WbVuu7!f3kkkCuq3GKO288fxeGGHi3i-bsSY5P83PKu_svOVUISu7dkNygQtSvIpxHkE0XDv6c1nNc$ >>> >>> ?? ?? > >>> ?? ?? > >>> >>> >>> ?? ?? -- >>> ?? ?? 
Doug >>> >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org >>> sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >>> >>> >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> > > -- > Doug > From deadline at eadline.org Fri Sep 17 20:24:41 2021 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 17 Sep 2021 16:24:41 -0400 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> Message-ID: <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> --snip-- > > Where I disagree with you is (3). Whether or not cache size is important > depends on the size of the job. If your iterating through data-parallel > loops over a large dataset that exceeds cache size, the opportunity to > reread cached data is probably limited or nonexistent. As we often say > here, "it depends". I'm sore someone with better low-level hardware > knowledge will pipe in and tell me why I'm wrong (Cunningham's Law). > Of course it all depends. However, as core counts go up, a fixed amount of cache must get shared. Since the high core counts are putting pressure on main memory BW, cache gets more important. This is why AMD is doing V-cache for new processors. Core counts have outstripped memory BW, their solution seems to be big caches. And, cache is only good the second time :-) -- big snip-- -- Doug From lohitv at gwmail.gwu.edu Sat Sep 18 17:20:51 2021 From: lohitv at gwmail.gwu.edu (Lohit Valleru) Date: Sat, 18 Sep 2021 13:20:51 -0400 Subject: [Beowulf] [beowulf] nfs vs parallel filesystems Message-ID: Hello Everyone, I am trying to find answers to an age old question of NFS vs Parallel file systems. Specifically - Isilon oneFS vs parallel filesystems.Specifically looking for any technical articles or papers that can help me understand what exactly will not work on oneFS. I understand that at the end - it all depends on workloads. But at what capacity of metadata io or a particular io pattern is bad in NFS.Would just getting a beefy isilon NFS HDD based storage - resolve most of the issues? I am trying to find sources that can say that no matter how beefy an NFS server can get with HDDs as backed - it will not be as good as parallel filesystems for so and so workload. If possible - Can anyone point me to experiences or technical papers that mention so and so do not work with NFS. Does it have to be that at the end - i will have to test my workloads across both NFS/OneFS and Parallel File systems and then see what would not work? I am concerned that any test case might not be valid, compared to real shared workloads where performance might lag once the storage reaches PBs in scale and millions of files. Thank you, Lohit -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From blomqvist.janne at gmail.com Sat Sep 18 19:09:18 2021 From: blomqvist.janne at gmail.com (Janne Blomqvist) Date: Sat, 18 Sep 2021 22:09:18 +0300 Subject: [Beowulf] [beowulf] nfs vs parallel filesystems In-Reply-To: References: Message-ID: On Sat, Sep 18, 2021 at 8:21 PM Lohit Valleru via Beowulf wrote: > > Hello Everyone, > > I am trying to find answers to an age old question of NFS vs Parallel file systems. Specifically - Isilon oneFS vs parallel filesystems.Specifically looking for any technical articles or papers that can help me understand what exactly will not work on oneFS. > I understand that at the end - it all depends on workloads. > But at what capacity of metadata io or a particular io pattern is bad in NFS.Would just getting a beefy isilon NFS HDD based storage - resolve most of the issues? > I am trying to find sources that can say that no matter how beefy an NFS server can get with HDDs as backed - it will not be as good as parallel filesystems for so and so workload. > If possible - Can anyone point me to experiences or technical papers that mention so and so do not work with NFS. > > Does it have to be that at the end - i will have to test my workloads across both NFS/OneFS and Parallel File systems and then see what would not work? > > I am concerned that any test case might not be valid, compared to real shared workloads where performance might lag once the storage reaches PBs in scale and millions of files. For one thing NFS is not cache coherent, but rather implements a looser form of consistency called close-to-open consistency. See e.g. the spec at https://datatracker.ietf.org/doc/html/rfc7530#section-1.4.6 One case in which this matters is if you have a workload where multiple nodes concurrently write to a shared file. E.g. with the ever-popular IOR benchmarking tool, a slurm batch file like #SBATCH -N 2 # 2 nodes #SBATCH --ntasks-per-node=1 # 1 MPI task per node SEGMENTCOUNT=100 #Offset must be equal to ntasks-per-node OFFSET=1 srun IOR -a POSIX -t 1000 -b 1000 -s $SEGMENTCOUNT -C -Q $OFFSET -e -i 5 -d 10 -v -w -r -W -R -g -u -q -o testfile This should fail due to corruption within minutes if the testfile is on NFS. Not saying any parallel filesystem will handle this either. Some will. -- Janne Blomqvist From hearnsj at gmail.com Sun Sep 19 08:54:35 2021 From: hearnsj at gmail.com (John Hearns) Date: Sun, 19 Sep 2021 09:54:35 +0100 Subject: [Beowulf] [beowulf] nfs vs parallel filesystems In-Reply-To: References: Message-ID: Lohit, good morning. I work for Dell in the EMEA HPC team. You make some interesting observations. Please ping me offline regarding Isilon. Regarding NFS we have a brand new Ready Architecture which uses Poweredge servers and ME series storage (*) It gets some pretty decent performance and I would honestly say that these days NFS is a perfectly good fit for small clusters - the clusters which are used by departments or small/medium sized engineering companies. If you want to try out your particular workloads we have labs available. You then go on to talk about petabytes of data - that is the field where you have to look at scale out filesystems. (*) I cannot find this on public webpages yet, sorry On Sat, 18 Sept 2021 at 18:21, Lohit Valleru via Beowulf < beowulf at beowulf.org> wrote: > Hello Everyone, > > I am trying to find answers to an age old question of NFS vs Parallel file > systems. 
Specifically - Isilon oneFS vs parallel filesystems.Specifically > looking for any technical articles or papers that can help me understand > what exactly will not work on oneFS. > I understand that at the end - it all depends on workloads. > But at what capacity of metadata io or a particular io pattern is bad in > NFS.Would just getting a beefy isilon NFS HDD based storage - resolve > most of the issues? > I am trying to find sources that can say that no matter how beefy an NFS > server can get with HDDs as backed - it will not be as good as parallel > filesystems for so and so workload. > If possible - Can anyone point me to experiences or technical papers that > mention so and so do not work with NFS. > > Does it have to be that at the end - i will have to test my workloads > across both NFS/OneFS and Parallel File systems and then see what would not > work? > > I am concerned that any test case might not be valid, compared to real > shared workloads where performance might lag once the storage reaches PBs > in scale and millions of files. > > Thank you, > Lohit > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Sun Sep 19 11:02:10 2021 From: hearnsj at gmail.com (John Hearns) Date: Sun, 19 Sep 2021 12:02:10 +0100 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> Message-ID: Eadline's Law : Cache is only good the second time. On Fri, 17 Sep 2021, 21:25 Douglas Eadline, wrote: > --snip-- > > > > Where I disagree with you is (3). Whether or not cache size is important > > depends on the size of the job. If your iterating through data-parallel > > loops over a large dataset that exceeds cache size, the opportunity to > > reread cached data is probably limited or nonexistent. As we often say > > here, "it depends". I'm sore someone with better low-level hardware > > knowledge will pipe in and tell me why I'm wrong (Cunningham's Law). > > > > Of course it all depends. However, as core counts go up, a > fixed amount of cache must get shared. Since the high core counts > are putting pressure on main memory BW, cache gets more > important. This is why AMD is doing V-cache for new processors. > Core counts have outstripped memory BW, their solution > seems to be big caches. And, cache is only good the second time :-) > > > -- big snip-- > > -- > Doug > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Mon Sep 20 08:10:37 2021 From: hearnsj at gmail.com (John Hearns) Date: Mon, 20 Sep 2021 09:10:37 +0100 Subject: [Beowulf] [beowulf] nfs vs parallel filesystems In-Reply-To: References: Message-ID: This talk by Keith Manthey is well worth listening to. 
Vendor neutral as I recall, so don't worry about a sales message bein gpushed HPC Storage 101 in this series https://www.dellhpc.org/eventsarchive.html On Sat, 18 Sept 2021 at 18:21, Lohit Valleru via Beowulf < beowulf at beowulf.org> wrote: > Hello Everyone, > > I am trying to find answers to an age old question of NFS vs Parallel file > systems. Specifically - Isilon oneFS vs parallel filesystems.Specifically > looking for any technical articles or papers that can help me understand > what exactly will not work on oneFS. > I understand that at the end - it all depends on workloads. > But at what capacity of metadata io or a particular io pattern is bad in > NFS.Would just getting a beefy isilon NFS HDD based storage - resolve > most of the issues? > I am trying to find sources that can say that no matter how beefy an NFS > server can get with HDDs as backed - it will not be as good as parallel > filesystems for so and so workload. > If possible - Can anyone point me to experiences or technical papers that > mention so and so do not work with NFS. > > Does it have to be that at the end - i will have to test my workloads > across both NFS/OneFS and Parallel File systems and then see what would not > work? > > I am concerned that any test case might not be valid, compared to real > shared workloads where performance might lag once the storage reaches PBs > in scale and millions of files. > > Thank you, > Lohit > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcownie at gmail.com Mon Sep 20 10:35:55 2021 From: jcownie at gmail.com (Jim Cownie) Date: Mon, 20 Sep 2021 11:35:55 +0100 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> Message-ID: <7BF6BE74-204F-4E77-9F5A-DA5E814C9C36@gmail.com> > Eadline's Law : Cache is only good the second time. Hmm, that?s why they have all those clever pre-fetchers which try to guess your memory access patterns and predict what's going to be needed next. (Your choice whether you read ?clever? in a cynical voice or not :-)) *IF* that works, then the cache is useful the first time. If not, then they can mess things up royally by evicting stuff that you did want there. > On 19 Sep 2021, at 12:02, John Hearns wrote: > > Eadline's Law : Cache is only good the second time. > > On Fri, 17 Sep 2021, 21:25 Douglas Eadline, > wrote: > --snip-- > > > > Where I disagree with you is (3). Whether or not cache size is important > > depends on the size of the job. If your iterating through data-parallel > > loops over a large dataset that exceeds cache size, the opportunity to > > reread cached data is probably limited or nonexistent. As we often say > > here, "it depends". I'm sore someone with better low-level hardware > > knowledge will pipe in and tell me why I'm wrong (Cunningham's Law). > > > > Of course it all depends. However, as core counts go up, a > fixed amount of cache must get shared. Since the high core counts > are putting pressure on main memory BW, cache gets more > important. 
This is why AMD is doing V-cache for new processors. > Core counts have outstripped memory BW, their solution > seems to be big caches. And, cache is only good the second time :-) > > > -- big snip-- > > -- > Doug > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -- Jim James Cownie Mob: +44 780 637 7146 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stewart at serissa.com Mon Sep 20 16:17:20 2021 From: stewart at serissa.com (Lawrence Stewart) Date: Mon, 20 Sep 2021 12:17:20 -0400 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: <7BF6BE74-204F-4E77-9F5A-DA5E814C9C36@gmail.com> References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> <7BF6BE74-204F-4E77-9F5A-DA5E814C9C36@gmail.com> Message-ID: Well said. Expanding on this, caches work because of both temporal locality and spatial locality. Spatial locality is addressed by having cache lines be substantially larger than a byte or word. These days, 64 bytes is pretty common. Some prefetch schemes, like the L1D version that fetches the VA ^ 64 clearly affect spatial locality. Streaming prefetch has an expanded notion of ?spatial? I suppose! What puzzles me is why compilers seem not to have evolved much notion of cache management. It seems like something a smart compiler could do. Instead, it is left to Prof. Goto and the folks at ATLAS and BLIS to figure out how to rewrite algorithms for efficient cache behavior. To my limited knowledge, compilers don?t make much use of PREFETCH or any non-temporal loads and stores either. It seems to me that once the programmer helps with RESTRICT and so forth, then compilers could perfectly well dynamically move parts of arrays around to maximize cache use. -L > On 2021, Sep 20, at 6:35 AM, Jim Cownie wrote: > >> Eadline's Law : Cache is only good the second time. > > Hmm, that?s why they have all those clever pre-fetchers which try to guess your memory access patterns and predict what's going to be needed next. > (Your choice whether you read ?clever? in a cynical voice or not :-)) > *IF* that works, then the cache is useful the first time. > If not, then they can mess things up royally by evicting stuff that you did want there. > >> On 19 Sep 2021, at 12:02, John Hearns > wrote: >> >> Eadline's Law : Cache is only good the second time. >> >> On Fri, 17 Sep 2021, 21:25 Douglas Eadline, > wrote: >> --snip-- >> > >> > Where I disagree with you is (3). Whether or not cache size is important >> > depends on the size of the job. If your iterating through data-parallel >> > loops over a large dataset that exceeds cache size, the opportunity to >> > reread cached data is probably limited or nonexistent. As we often say >> > here, "it depends". I'm sore someone with better low-level hardware >> > knowledge will pipe in and tell me why I'm wrong (Cunningham's Law). >> > >> >> Of course it all depends. 
However, as core counts go up, a >> fixed amount of cache must get shared. Since the high core counts >> are putting pressure on main memory BW, cache gets more >> important. This is why AMD is doing V-cache for new processors. >> Core counts have outstripped memory BW, their solution >> seems to be big caches. And, cache is only good the second time :-) >> >> >> -- big snip-- >> >> -- >> Doug >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > > -- Jim > James Cownie > > Mob: +44 780 637 7146 > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedmon at cfa.harvard.edu Mon Sep 20 17:05:14 2021 From: pedmon at cfa.harvard.edu (Paul Edmon) Date: Mon, 20 Sep 2021 13:05:14 -0400 Subject: [Beowulf] Open Positions at FASRC Message-ID: https://www.rc.fas.harvard.edu/about/employment/ -Paul Edmon- From james.p.lux at jpl.nasa.gov Mon Sep 20 17:41:24 2021 From: james.p.lux at jpl.nasa.gov (Lux, Jim (US 7140)) Date: Mon, 20 Sep 2021 17:41:24 +0000 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> <7BF6BE74-204F-4E77-9F5A-DA5E814C9C36@gmail.com> Message-ID: <939CE0BF-D7C8-4A22-9422-30DA0EF7107A@jpl.nasa.gov> From: Beowulf on behalf of Lawrence Stewart Date: Monday, September 20, 2021 at 9:17 AM To: Jim Cownie Cc: Lawrence Stewart , Douglas Eadline , "beowulf at beowulf.org" Subject: Re: [Beowulf] [EXTERNAL] Re: Deskside clusters Well said. Expanding on this, caches work because of both temporal locality and spatial locality. Spatial locality is addressed by having cache lines be substantially larger than a byte or word. These days, 64 bytes is pretty common. Some prefetch schemes, like the L1D version that fetches the VA ^ 64 clearly affect spatial locality. Streaming prefetch has an expanded notion of ?spatial? I suppose! What puzzles me is why compilers seem not to have evolved much notion of cache management. It seems like something a smart compiler could do. Instead, it is left to Prof. Goto and the folks at ATLAS and BLIS to figure out how to rewrite algorithms for efficient cache behavior. To my limited knowledge, compilers don?t make much use of PREFETCH or any non-temporal loads and stores either. It seems to me that once the programmer helps with RESTRICT and so forth, then compilers could perfectly well dynamically move parts of arrays around to maximize cache use. -L I suspect that there?s enough variability among cache implementation and the wide variety of algorithms that might use it that writing a smart-enough compiler is ?hard? and ?expensive?. 
Leaving it to the library authors is probably the best ?bang for the buck?. -------------- next part -------------- An HTML attachment was scrubbed... URL: From james.p.lux at jpl.nasa.gov Mon Sep 20 18:27:52 2021 From: james.p.lux at jpl.nasa.gov (Lux, Jim (US 7140)) Date: Mon, 20 Sep 2021 18:27:52 +0000 Subject: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. Message-ID: The recent comments on compilers, caches, etc., are why HPC isn?t a bigger deal. The infrastructure today is reminiscent of what I used in the 1970s on a big CDC or Burroughs or IBM machine, perhaps with a FPS box attached. I prepare a job, with some sort of job control structure, submit it to a batch queue, and get my results some time later. Sure, I?m not dropping off a deck or tapes, and I?m not getting green-bar paper or a tape back, but really, it?s not much different ? I drop a file and get files back either way. And just like back then, it?s up to me to figure out how best to arrange my code to run fastest (or me, wall clock time, but others it might be CPU time or cost or something else) It would be nice if the compiler (or run-time or infrastructure) figured out the whole ?what?s the arrangement of cores/nodes/scratch storage for this application on this particular cluster?. I also acknowledge that this is a ?hard? problem and one that doesn?t have the commercial value of, say, serving the optimum ads to me when I read the newspaper on line. Yeah, it?s not that hard to call library routines for matrix operations, and to put my trust in the library writers ? I trust them more than I trust me to find the fastest linear equation solver, fft, etc. ? but so far, the next level of abstraction up ? ?how many cores/nodes? is still left to me, and that means doing instrumentation, figuring out what the results mean, etc. From: Beowulf on behalf of "beowulf at beowulf.org" Reply-To: Jim Lux Date: Monday, September 20, 2021 at 10:42 AM To: Lawrence Stewart , Jim Cownie Cc: Douglas Eadline , "beowulf at beowulf.org" Subject: Re: [Beowulf] [EXTERNAL] Re: Deskside clusters From: Beowulf on behalf of Lawrence Stewart Date: Monday, September 20, 2021 at 9:17 AM To: Jim Cownie Cc: Lawrence Stewart , Douglas Eadline , "beowulf at beowulf.org" Subject: Re: [Beowulf] [EXTERNAL] Re: Deskside clusters Well said. Expanding on this, caches work because of both temporal locality and spatial locality. Spatial locality is addressed by having cache lines be substantially larger than a byte or word. These days, 64 bytes is pretty common. Some prefetch schemes, like the L1D version that fetches the VA ^ 64 clearly affect spatial locality. Streaming prefetch has an expanded notion of ?spatial? I suppose! What puzzles me is why compilers seem not to have evolved much notion of cache management. It seems like something a smart compiler could do. Instead, it is left to Prof. Goto and the folks at ATLAS and BLIS to figure out how to rewrite algorithms for efficient cache behavior. To my limited knowledge, compilers don?t make much use of PREFETCH or any non-temporal loads and stores either. It seems to me that once the programmer helps with RESTRICT and so forth, then compilers could perfectly well dynamically move parts of arrays around to maximize cache use. -L I suspect that there?s enough variability among cache implementation and the wide variety of algorithms that might use it that writing a smart-enough compiler is ?hard? and ?expensive?. Leaving it to the library authors is probably the best ?bang for the buck?. 
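[A rough, hedged illustration of the "leave it to the library authors" point above: a textbook triple-loop matrix multiply in pure Python timed against numpy.dot, which dispatches to whatever optimized BLAS (OpenBLAS, MKL, BLIS or similar) the local NumPy was built against. Most of the gap below is interpreter overhead, but the same pattern holds, with a smaller ratio, for a plain C loop versus a cache-blocked GotoBLAS/ATLAS/BLIS kernel; the matrix size of 128 is an arbitrary choice so the naive version finishes quickly.]

```python
import time
import numpy as np

def matmul_naive(A, B):
    """Textbook triple loop: no tiling, no prefetch hints, no vectorisation."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[p][j]
            C[i][j] = s
    return C

n = 128
A = np.random.rand(n, n)
B = np.random.rand(n, n)

t0 = time.perf_counter()
matmul_naive(A.tolist(), B.tolist())
t1 = time.perf_counter()
np.dot(A, B)   # dispatches to the optimized BLAS NumPy was built against
t2 = time.perf_counter()
print(f"naive loops: {t1 - t0:.3f}s   BLAS via np.dot: {t2 - t1:.5f}s")
```

[The exact numbers are not the point; the point is that the blocking, register tiling and prefetching discussed in this thread live inside the library, not in anything the compiler derives from the naive loop.]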
-------------- next part -------------- An HTML attachment was scrubbed... URL: From pbisbal at pppl.gov Mon Sep 20 20:58:49 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Mon, 20 Sep 2021 16:58:49 -0400 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: <7BF6BE74-204F-4E77-9F5A-DA5E814C9C36@gmail.com> References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> <7BF6BE74-204F-4E77-9F5A-DA5E814C9C36@gmail.com> Message-ID: <572bd005-e3cf-80c7-5b9f-81c9175df7ec@pppl.gov> On 9/20/21 6:35 AM, Jim Cownie wrote: >> Eadline's Law : Cache is only good the second time. > > Hmm, that?s why they have all those clever pre-fetchers which try to > guess your memory access patterns and predict what's going to be > needed next. > (Your choice whether you read ?clever? in a cynical voice or not :-)) > *IF* that works, then the cache is useful the first time. > If not, then they can mess things up royally by evicting stuff that > you did want there. > I thought about prefetching, but deliberately left it out of my original response because I didn't want to open that can of worms... or put my foot in my mouth. Prentice From peter.st.john at gmail.com Mon Sep 20 21:42:57 2021 From: peter.st.john at gmail.com (Peter St. John) Date: Mon, 20 Sep 2021 17:42:57 -0400 Subject: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. In-Reply-To: References: Message-ID: My dream is to use some sort of optimization software (I would try Genetic Programing say) with a heterogeneous cluster (of mixed fat and light nodes, even different network topologies in sub-clusters) to determine the optimal configuration and optimal running parameters in an application domain for itself. I have published (to a limited audience) on Genetic Algorithms optimizing themselves recursively (there is converge behaviour). Peter On Mon, Sep 20, 2021 at 2:28 PM Lux, Jim (US 7140) via Beowulf < beowulf at beowulf.org> wrote: > The recent comments on compilers, caches, etc., are why HPC isn?t a bigger > deal. The infrastructure today is reminiscent of what I used in the 1970s > on a big CDC or Burroughs or IBM machine, perhaps with a FPS box attached. > > I prepare a job, with some sort of job control structure, submit it to a > batch queue, and get my results some time later. Sure, I?m not dropping > off a deck or tapes, and I?m not getting green-bar paper or a tape back, > but really, it?s not much different ? I drop a file and get files back > either way. > > > > And just like back then, it?s up to me to figure out how best to arrange > my code to run fastest (or me, wall clock time, but others it might be CPU > time or cost or something else) > > > > It would be nice if the compiler (or run-time or infrastructure) figured > out the whole ?what?s the arrangement of cores/nodes/scratch storage for > this application on this particular cluster?. > > I also acknowledge that this is a ?hard? problem and one that doesn?t have > the commercial value of, say, serving the optimum ads to me when I read the > newspaper on line. > > > Yeah, it?s not that hard to call library routines for matrix operations, > and to put my trust in the library writers ? I trust them more than I trust > me to find the fastest linear equation solver, fft, etc. ? but so far, the > next level of abstraction up ? ?how many cores/nodes? 
is still left to me, > and that means doing instrumentation, figuring out what the results mean, > etc. > > > > > > *From: *Beowulf on behalf of " > beowulf at beowulf.org" > *Reply-To: *Jim Lux > *Date: *Monday, September 20, 2021 at 10:42 AM > *To: *Lawrence Stewart , Jim Cownie < > jcownie at gmail.com> > *Cc: *Douglas Eadline , "beowulf at beowulf.org" < > beowulf at beowulf.org> > *Subject: *Re: [Beowulf] [EXTERNAL] Re: Deskside clusters > > > > > > > > *From: *Beowulf on behalf of Lawrence > Stewart > *Date: *Monday, September 20, 2021 at 9:17 AM > *To: *Jim Cownie > *Cc: *Lawrence Stewart , Douglas Eadline < > deadline at eadline.org>, "beowulf at beowulf.org" > *Subject: *Re: [Beowulf] [EXTERNAL] Re: Deskside clusters > > > > Well said. Expanding on this, caches work because of both temporal > locality and > > spatial locality. Spatial locality is addressed by having cache lines be > substantially > > larger than a byte or word. These days, 64 bytes is pretty common. Some > prefetch schemes, > > like the L1D version that fetches the VA ^ 64 clearly affect spatial > locality. Streaming > > prefetch has an expanded notion of ?spatial? I suppose! > > > > What puzzles me is why compilers seem not to have evolved much notion of > cache management. It > > seems like something a smart compiler could do. Instead, it is left to > Prof. Goto and the folks > > at ATLAS and BLIS to figure out how to rewrite algorithms for efficient > cache behavior. To my > > limited knowledge, compilers don?t make much use of PREFETCH or any > non-temporal loads and stores > > either. It seems to me that once the programmer helps with RESTRICT and so > forth, then compilers could perfectly well dynamically move parts of arrays > around to maximize cache use. > > > > -L > > > > I suspect that there?s enough variability among cache implementation and > the wide variety of algorithms that might use it that writing a > smart-enough compiler is ?hard? and ?expensive?. > > > > Leaving it to the library authors is probably the best ?bang for the > buck?. > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From james.p.lux at jpl.nasa.gov Mon Sep 20 22:05:55 2021 From: james.p.lux at jpl.nasa.gov (Lux, Jim (US 7140)) Date: Mon, 20 Sep 2021 22:05:55 +0000 Subject: [Beowulf] [EXTERNAL] Re: Rant on why HPC isn't as easy as I'd like it to be. In-Reply-To: References: Message-ID: <22594F5A-5D5D-40CE-A9E1-6173D134A002@jpl.nasa.gov> Maybe a starting point would be to define a standardized interface that would interact with a suite of tools that helps you determine optimized strategies for that cluster (automatically) ? (and by tools, I don?t mean that they all support a text editor and a scripting tool). So a sort of ?autoconfigure? process. From: "Peter St. John" Date: Monday, September 20, 2021 at 2:43 PM To: Jim Lux Cc: "beowulf at beowulf.org" Subject: [EXTERNAL] Re: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. My dream is to use some sort of optimization software (I would try Genetic Programing say) with a heterogeneous cluster (of mixed fat and light nodes, even different network topologies in sub-clusters) to determine the optimal configuration and optimal running parameters in an application domain for itself. 
I have published (to a limited audience) on Genetic Algorithms optimizing themselves recursively (there is converge behaviour). Peter On Mon, Sep 20, 2021 at 2:28 PM Lux, Jim (US 7140) via Beowulf > wrote: The recent comments on compilers, caches, etc., are why HPC isn?t a bigger deal. The infrastructure today is reminiscent of what I used in the 1970s on a big CDC or Burroughs or IBM machine, perhaps with a FPS box attached. I prepare a job, with some sort of job control structure, submit it to a batch queue, and get my results some time later. Sure, I?m not dropping off a deck or tapes, and I?m not getting green-bar paper or a tape back, but really, it?s not much different ? I drop a file and get files back either way. And just like back then, it?s up to me to figure out how best to arrange my code to run fastest (or me, wall clock time, but others it might be CPU time or cost or something else) It would be nice if the compiler (or run-time or infrastructure) figured out the whole ?what?s the arrangement of cores/nodes/scratch storage for this application on this particular cluster?. I also acknowledge that this is a ?hard? problem and one that doesn?t have the commercial value of, say, serving the optimum ads to me when I read the newspaper on line. Yeah, it?s not that hard to call library routines for matrix operations, and to put my trust in the library writers ? I trust them more than I trust me to find the fastest linear equation solver, fft, etc. ? but so far, the next level of abstraction up ? ?how many cores/nodes? is still left to me, and that means doing instrumentation, figuring out what the results mean, etc. From: Beowulf > on behalf of "beowulf at beowulf.org" > Reply-To: Jim Lux > Date: Monday, September 20, 2021 at 10:42 AM To: Lawrence Stewart >, Jim Cownie > Cc: Douglas Eadline >, "beowulf at beowulf.org" > Subject: Re: [Beowulf] [EXTERNAL] Re: Deskside clusters From: Beowulf > on behalf of Lawrence Stewart > Date: Monday, September 20, 2021 at 9:17 AM To: Jim Cownie > Cc: Lawrence Stewart >, Douglas Eadline >, "beowulf at beowulf.org" > Subject: Re: [Beowulf] [EXTERNAL] Re: Deskside clusters Well said. Expanding on this, caches work because of both temporal locality and spatial locality. Spatial locality is addressed by having cache lines be substantially larger than a byte or word. These days, 64 bytes is pretty common. Some prefetch schemes, like the L1D version that fetches the VA ^ 64 clearly affect spatial locality. Streaming prefetch has an expanded notion of ?spatial? I suppose! What puzzles me is why compilers seem not to have evolved much notion of cache management. It seems like something a smart compiler could do. Instead, it is left to Prof. Goto and the folks at ATLAS and BLIS to figure out how to rewrite algorithms for efficient cache behavior. To my limited knowledge, compilers don?t make much use of PREFETCH or any non-temporal loads and stores either. It seems to me that once the programmer helps with RESTRICT and so forth, then compilers could perfectly well dynamically move parts of arrays around to maximize cache use. -L I suspect that there?s enough variability among cache implementation and the wide variety of algorithms that might use it that writing a smart-enough compiler is ?hard? and ?expensive?. Leaving it to the library authors is probably the best ?bang for the buck?. 
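[A toy sketch of the kind of self-tuning Peter and Jim describe: a small evolutionary search over a job layout (nodes, MPI ranks per node, threads per rank). The search space and the synthetic run_job objective are invented for illustration; in real use run_job would submit the candidate layout through the batch system and return measured wall-clock time.]

```python
import random

# Hypothetical search space for laying a fixed-size job out on a small cluster.
NODES   = [1, 2, 4, 8]
RANKS   = [1, 2, 4, 8, 16, 32]   # MPI ranks per node
THREADS = [1, 2, 4, 8]           # threads per rank

def run_job(nodes, ranks, threads):
    """Placeholder objective: a synthetic runtime model so the sketch runs
    stand-alone. A real version would launch the job and time it."""
    cores = nodes * ranks * threads
    work = 1000.0 / cores                            # ideal scaling term
    overhead = 0.02 * nodes * ranks                  # communication penalty
    contention = 0.5 * max(0, ranks * threads - 32)  # oversubscription penalty
    return work + overhead + contention + random.uniform(0.0, 0.5)

def mutate(cfg):
    nodes, ranks, threads = cfg
    choice = random.randrange(3)
    if choice == 0:
        nodes = random.choice(NODES)
    elif choice == 1:
        ranks = random.choice(RANKS)
    else:
        threads = random.choice(THREADS)
    return (nodes, ranks, threads)

def evolve(generations=30, population=8):
    pop = [(random.choice(NODES), random.choice(RANKS), random.choice(THREADS))
           for _ in range(population)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda c: run_job(*c))     # fastest first
        parents = scored[: population // 2]
        pop = parents + [mutate(random.choice(parents)) for _ in parents]
    return min(pop, key=lambda c: run_job(*c))

if __name__ == "__main__":
    print("best layout (nodes, ranks/node, threads/rank):", evolve())
```

[Even a crude search like this, left to run against the real scheduler overnight, can often beat a hand-me-down job script tuned for a previous generation of hardware.]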
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Tue Sep 21 07:05:30 2021 From: hearnsj at gmail.com (John Hearns) Date: Tue, 21 Sep 2021 08:05:30 +0100 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: <572bd005-e3cf-80c7-5b9f-81c9175df7ec@pppl.gov> References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> <7BF6BE74-204F-4E77-9F5A-DA5E814C9C36@gmail.com> <572bd005-e3cf-80c7-5b9f-81c9175df7ec@pppl.gov> Message-ID: Yes, but which foot? You have enough space for two toes from each foot for q taste, and you then need some logic to decide which one to use. On Mon, 20 Sept 2021 at 21:59, Prentice Bisbal via Beowulf < beowulf at beowulf.org> wrote: > On 9/20/21 6:35 AM, Jim Cownie wrote: > > >> Eadline's Law : Cache is only good the second time. > > > > Hmm, that?s why they have all those clever pre-fetchers which try to > > guess your memory access patterns and predict what's going to be > > needed next. > > (Your choice whether you read ?clever? in a cynical voice or not :-)) > > *IF* that works, then the cache is useful the first time. > > If not, then they can mess things up royally by evicting stuff that > > you did want there. > > > > I thought about prefetching, but deliberately left it out of my original > response because I didn't want to open that can of worms... or put my > foot in my mouth. > > Prentice > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Tue Sep 21 07:08:32 2021 From: hearnsj at gmail.com (John Hearns) Date: Tue, 21 Sep 2021 08:08:32 +0100 Subject: [Beowulf] [EXTERNAL] Re: Deskside clusters In-Reply-To: <939CE0BF-D7C8-4A22-9422-30DA0EF7107A@jpl.nasa.gov> References: <98e386fdb767d715af9ea327f7deecba.squirrel@mail.eadline.org> <1FDA1ABD-BC3F-4DC3-BBB8-9957CAA1C443@jpl.caltech.edu> <2f54412c6e92f6b1c04ca9a0784539b1.squirrel@mail.eadline.org> <4ae0d1f17920bd44f5ba763745c3281d.squirrel@mail.eadline.org> <7BF6BE74-204F-4E77-9F5A-DA5E814C9C36@gmail.com> <939CE0BF-D7C8-4A22-9422-30DA0EF7107A@jpl.nasa.gov> Message-ID: Over on the Julia discussion list there are often topics on performance or varying performance - these often turn out to be due to the BLAS libraries in use, and how they are being used. I believe that there is a project for pureJulia BLAS. On Mon, 20 Sept 2021 at 18:41, Lux, Jim (US 7140) via Beowulf < beowulf at beowulf.org> wrote: > > > > > *From: *Beowulf on behalf of Lawrence > Stewart > *Date: *Monday, September 20, 2021 at 9:17 AM > *To: *Jim Cownie > *Cc: *Lawrence Stewart , Douglas Eadline < > deadline at eadline.org>, "beowulf at beowulf.org" > *Subject: *Re: [Beowulf] [EXTERNAL] Re: Deskside clusters > > > > Well said. Expanding on this, caches work because of both temporal > locality and > > spatial locality. 
Spatial locality is addressed by having cache lines be > substantially > > larger than a byte or word. These days, 64 bytes is pretty common. Some > prefetch schemes, > > like the L1D version that fetches the VA ^ 64 clearly affect spatial > locality. Streaming > > prefetch has an expanded notion of ?spatial? I suppose! > > > > What puzzles me is why compilers seem not to have evolved much notion of > cache management. It > > seems like something a smart compiler could do. Instead, it is left to > Prof. Goto and the folks > > at ATLAS and BLIS to figure out how to rewrite algorithms for efficient > cache behavior. To my > > limited knowledge, compilers don?t make much use of PREFETCH or any > non-temporal loads and stores > > either. It seems to me that once the programmer helps with RESTRICT and so > forth, then compilers could perfectly well dynamically move parts of arrays > around to maximize cache use. > > > > -L > > > > I suspect that there?s enough variability among cache implementation and > the wide variety of algorithms that might use it that writing a > smart-enough compiler is ?hard? and ?expensive?. > > > > Leaving it to the library authors is probably the best ?bang for the > buck?. > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Tue Sep 21 11:24:45 2021 From: hearnsj at gmail.com (John Hearns) Date: Tue, 21 Sep 2021 12:24:45 +0100 Subject: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. In-Reply-To: References: Message-ID: Some points well made here. I have seen in the past job scripts passed on from graduate student to graduate student - the case I am thinking on was an Abaqus script for 8 core systems, being run on a new 32 core system. Why WOULD a graduate student question a script given to them - which works. They should be getting on with their science. I guess this is where Research Software Engineers come in. Another point I would make is about modern processor architectures, for instance AMD Rome/Milan. You can have different Numa Per Socket options, which affect performance. We set the preferred IO path - which I have seen myself to have an effect on latency of MPI messages. IF you are not concerned about your hardware layout you would just go ahead and run, missing a lot of performance. I am now going to be controversial and common that over in Julia land the pattern seems to be these days people develop on their own laptops, or maybe local GPU systems. There is a lot of microbenchmarking going on. But there seems to be not a lot of thought given to CPU pinning or shat happens with hyperthreading. I guess topics like that are part of HPC 'Black Magic' - though I would imagine the low latency crowd are hot on them. I often introduce people to the excellent lstopo/hwloc utilities which show the layout of a system. Most people are pleasantly surprised to find this. On Mon, 20 Sept 2021 at 19:28, Lux, Jim (US 7140) via Beowulf < beowulf at beowulf.org> wrote: > The recent comments on compilers, caches, etc., are why HPC isn?t a bigger > deal. The infrastructure today is reminiscent of what I used in the 1970s > on a big CDC or Burroughs or IBM machine, perhaps with a FPS box attached. 
> > I prepare a job, with some sort of job control structure, submit it to a > batch queue, and get my results some time later. Sure, I?m not dropping > off a deck or tapes, and I?m not getting green-bar paper or a tape back, > but really, it?s not much different ? I drop a file and get files back > either way. > > > > And just like back then, it?s up to me to figure out how best to arrange > my code to run fastest (or me, wall clock time, but others it might be CPU > time or cost or something else) > > > > It would be nice if the compiler (or run-time or infrastructure) figured > out the whole ?what?s the arrangement of cores/nodes/scratch storage for > this application on this particular cluster?. > > I also acknowledge that this is a ?hard? problem and one that doesn?t have > the commercial value of, say, serving the optimum ads to me when I read the > newspaper on line. > > > Yeah, it?s not that hard to call library routines for matrix operations, > and to put my trust in the library writers ? I trust them more than I trust > me to find the fastest linear equation solver, fft, etc. ? but so far, the > next level of abstraction up ? ?how many cores/nodes? is still left to me, > and that means doing instrumentation, figuring out what the results mean, > etc. > > > > > > *From: *Beowulf on behalf of " > beowulf at beowulf.org" > *Reply-To: *Jim Lux > *Date: *Monday, September 20, 2021 at 10:42 AM > *To: *Lawrence Stewart , Jim Cownie < > jcownie at gmail.com> > *Cc: *Douglas Eadline , "beowulf at beowulf.org" < > beowulf at beowulf.org> > *Subject: *Re: [Beowulf] [EXTERNAL] Re: Deskside clusters > > > > > > > > *From: *Beowulf on behalf of Lawrence > Stewart > *Date: *Monday, September 20, 2021 at 9:17 AM > *To: *Jim Cownie > *Cc: *Lawrence Stewart , Douglas Eadline < > deadline at eadline.org>, "beowulf at beowulf.org" > *Subject: *Re: [Beowulf] [EXTERNAL] Re: Deskside clusters > > > > Well said. Expanding on this, caches work because of both temporal > locality and > > spatial locality. Spatial locality is addressed by having cache lines be > substantially > > larger than a byte or word. These days, 64 bytes is pretty common. Some > prefetch schemes, > > like the L1D version that fetches the VA ^ 64 clearly affect spatial > locality. Streaming > > prefetch has an expanded notion of ?spatial? I suppose! > > > > What puzzles me is why compilers seem not to have evolved much notion of > cache management. It > > seems like something a smart compiler could do. Instead, it is left to > Prof. Goto and the folks > > at ATLAS and BLIS to figure out how to rewrite algorithms for efficient > cache behavior. To my > > limited knowledge, compilers don?t make much use of PREFETCH or any > non-temporal loads and stores > > either. It seems to me that once the programmer helps with RESTRICT and so > forth, then compilers could perfectly well dynamically move parts of arrays > around to maximize cache use. > > > > -L > > > > I suspect that there?s enough variability among cache implementation and > the wide variety of algorithms that might use it that writing a > smart-enough compiler is ?hard? and ?expensive?. > > > > Leaving it to the library authors is probably the best ?bang for the > buck?. 
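[A minimal sketch of the topology and pinning checks Hearns mentions above (lstopo/hwloc, NUMA per socket, CPU pinning). It relies on Linux-only interfaces, os.sched_getaffinity/os.sched_setaffinity and sysfs, so treat it as an illustration under those assumptions; numactl, taskset, srun --cpu-bind or the hwloc C API are the usual production routes.]

```python
import os
from pathlib import Path

# What the OS or scheduler actually handed this process (Linux-only call).
allowed = sorted(os.sched_getaffinity(0))
print(f"online CPUs: {os.cpu_count()}, CPUs this process may use: {allowed}")

# NUMA layout as exposed by sysfs; lstopo/hwloc present the same data graphically.
for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    print(f"{node.name}: CPUs {cpulist}")

# Pin this process to the first two allowed CPUs, the sort of binding that
# numactl/taskset or the MPI launcher would normally apply for you.
os.sched_setaffinity(0, set(allowed[:2]))
print("now pinned to:", sorted(os.sched_getaffinity(0)))
```

[Running something like this inside a job script is a quick way to see whether the binding you think you asked for is the binding you actually got.]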
> > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tjrc at sanger.ac.uk Tue Sep 21 12:02:08 2021 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Tue, 21 Sep 2021 12:02:08 +0000 Subject: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. [EXT] In-Reply-To: References: Message-ID: I think that?s exactly the situation we?ve been in for a long time, especially in life sciences, and it?s becoming more entrenched. My experience is that the average user of our scientific computing systems has been becoming less technically savvy for many years now. The presence of the cloud makes that more acute, in particular because it makes it easy for the user to effectively throw more hardware at the problem, which reduces the incentive to make their code particularly fast or efficient. Cost is the only brake on it, and in many cases I?m finding the PI doesn?t actually care about that. They care that a result is being obtained (and it?s time to first result they care about, not time to complete all the analysis), and so they typically don?t have much time for those of us who are telling them they need to invest in time up front developing and optimising efficient code. And cost is not necessarily the brake I thought it was going to be anyway. One recent project we?ve done on AWS has impressed me a great deal. It?s not terribly CPU efficient, and would doubtless, with sufficient effort, run much more efficiently on premise. But it?s extremely elastic in its nature, and so a good fit for the cloud. Once a week, the project has to completely re-analyse the 600,000+ COVID genomes we?e sequenced so far, looking for new branches in the phylogenetic tree, and to complete that analysis inside 8 hours. Initial attempts to naively convert the HPC implementation to run on AWS looked as though they were going to be very expensive (~$50k per weekly run). But a fundamental reworking of the entire workflow to make it as cloud native as possible, by which I mean almost exclusively serverless, has succeeded beyond what I expected. The total cost is <$5,000 a month, and because there is essentially no statically configured infrastructure at all, the security is fairly easy to be comfortable about. And all of that was done with no detailed thinking about whether the actual algorithms running in the containers are at all optimised in a traditional HPC sense. It?s just not needed for this particular piece of work. Did it need software developers with hardcore knowledge of performance optimisation? No. Was it rapid to develop and deploy? Yes. Is the performance fast enough for UK national COVID variant surveillance? Yes. Is it cost effective? Yes. Sold! The one thing it did need was knowledgeable cloud architects, but the cloud providers can and do help with that. Tim -- Tim Cutts Head of Scientific Computing Wellcome Sanger Institute On 21 Sep 2021, at 12:24, John Hearns > wrote: Some points well made here. I have seen in the past job scripts passed on from graduate student to graduate student - the case I am thinking on was an Abaqus script for 8 core systems, being run on a new 32 core system. Why WOULD a graduate student question a script given to them - which works. They should be getting on with their science. 
I guess this is where Research Software Engineers come in. Another point I would make is about modern processor architectures, for instance AMD Rome/Milan. You can have different Numa Per Socket options, which affect performance. We set the preferred IO path - which I have seen myself to have an effect on latency of MPI messages. IF you are not concerned about your hardware layout you would just go ahead and run, missing a lot of performance. I am now going to be controversial and common that over in Julia land the pattern seems to be these days people develop on their own laptops, or maybe local GPU systems. There is a lot of microbenchmarking going on. But there seems to be not a lot of thought given to CPU pinning or shat happens with hyperthreading. I guess topics like that are part of HPC 'Black Magic' - though I would imagine the low latency crowd are hot on them. I often introduce people to the excellent lstopo/hwloc utilities which show the layout of a system. Most people are pleasantly surprised to find this. -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ascheinine at acm.org Tue Sep 21 19:09:05 2021 From: ascheinine at acm.org (ascheinine at acm.org) Date: Tue, 21 Sep 2021 14:09:05 -0500 Subject: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. In-Reply-To: References: Message-ID: A few years ago I wrote a summary of software for supercomputing with a focus on near-future developments. The Dept. of Energy had a new round of funding for libraries and frameworks for supercomputers. The description of the projects sounded like they could be flexible and efficient w.r.t. computer architecture and the hardware implementation, and various tools could be integrated. In my limited experience I did not see this ideal universe arise. -- Alan Scheinine 200 Georgann Dr., Apt. E6 Vicksburg, MS 39180 Email: ascheinine at acm.org If above Email bounces, try: alscheinine at fieldknot.net Mobile phone: 225 288 4176 From james.p.lux at jpl.nasa.gov Tue Sep 21 20:56:52 2021 From: james.p.lux at jpl.nasa.gov (Lux, Jim (US 7140)) Date: Tue, 21 Sep 2021 20:56:52 +0000 Subject: [Beowulf] [EXTERNAL] Re: Rant on why HPC isn't as easy as I'd like it to be. [EXT] Message-ID: This is interesting, because the figure of merit is ?time to first answer? ? that?s much like my iterative antenna modeling task ? The cost of the computing is negligible compared to the cost of the engineer(s) waiting for the results. Running the AWS calculator, a EC instance is $74/month when used 2hrs/day, so let?s call that about $1/hr. With an engineer billing out at $100/hr, spinning up 50 instances isn?t all that unreasonable. From: Beowulf on behalf of Tim Cutts Date: Tuesday, September 21, 2021 at 5:02 AM To: John Hearns Cc: "beowulf at beowulf.org" Subject: [EXTERNAL] Re: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. [EXT] I think that?s exactly the situation we?ve been in for a long time, especially in life sciences, and it?s becoming more entrenched. My experience is that the average user of our scientific computing systems has been becoming less technically savvy for many years now. 
The presence of the cloud makes that more acute, in particular because it makes it easy for the user to effectively throw more hardware at the problem, which reduces the incentive to make their code particularly fast or efficient. Cost is the only brake on it, and in many cases I?m finding the PI doesn?t actually care about that. They care that a result is being obtained (and it?s time to first result they care about, not time to complete all the analysis), and so they typically don?t have much time for those of us who are telling them they need to invest in time up front developing and optimising efficient code. And cost is not necessarily the brake I thought it was going to be anyway. One recent project we?ve done on AWS has impressed me a great deal. It?s not terribly CPU efficient, and would doubtless, with sufficient effort, run much more efficiently on premise. But it?s extremely elastic in its nature, and so a good fit for the cloud. Once a week, the project has to completely re-analyse the 600,000+ COVID genomes we?e sequenced so far, looking for new branches in the phylogenetic tree, and to complete that analysis inside 8 hours. Initial attempts to naively convert the HPC implementation to run on AWS looked as though they were going to be very expensive (~$50k per weekly run). But a fundamental reworking of the entire workflow to make it as cloud native as possible, by which I mean almost exclusively serverless, has succeeded beyond what I expected. The total cost is <$5,000 a month, and because there is essentially no statically configured infrastructure at all, the security is fairly easy to be comfortable about. And all of that was done with no detailed thinking about whether the actual algorithms running in the containers are at all optimised in a traditional HPC sense. It?s just not needed for this particular piece of work. Did it need software developers with hardcore knowledge of performance optimisation? No. Was it rapid to develop and deploy? Yes. Is the performance fast enough for UK national COVID variant surveillance? Yes. Is it cost effective? Yes. Sold! The one thing it did need was knowledgeable cloud architects, but the cloud providers can and do help with that. Tim -- Tim Cutts Head of Scientific Computing Wellcome Sanger Institute On 21 Sep 2021, at 12:24, John Hearns > wrote: Some points well made here. I have seen in the past job scripts passed on from graduate student to graduate student - the case I am thinking on was an Abaqus script for 8 core systems, being run on a new 32 core system. Why WOULD a graduate student question a script given to them - which works. They should be getting on with their science. I guess this is where Research Software Engineers come in. Another point I would make is about modern processor architectures, for instance AMD Rome/Milan. You can have different Numa Per Socket options, which affect performance. We set the preferred IO path - which I have seen myself to have an effect on latency of MPI messages. IF you are not concerned about your hardware layout you would just go ahead and run, missing a lot of performance. I am now going to be controversial and common that over in Julia land the pattern seems to be these days people develop on their own laptops, or maybe local GPU systems. There is a lot of microbenchmarking going on. But there seems to be not a lot of thought given to CPU pinning or shat happens with hyperthreading. 
I guess topics like that are part of HPC 'Black Magic' - though I would imagine the low latency crowd are hot on them. I often introduce people to the excellent lstopo/hwloc utilities which show the layout of a system. Most people are pleasantly surprised to find this. -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdm900 at gmail.com Wed Sep 22 07:01:10 2021 From: sdm900 at gmail.com (Stu Midgley) Date: Wed, 22 Sep 2021 15:01:10 +0800 Subject: [Beowulf] [EXTERNAL] Re: Rant on why HPC isn't as easy as I'd like it to be. [EXT] In-Reply-To: References: Message-ID: We actually help people use our commercial HPCaaS (dug.com/hpcaas) and work with them to get great performance on our platform. With a little help, they get way better performance and save money compared with the public clouds. Interestingly, it is the bioinformatics groups that benefit the most. We get their workflows running really well, using all threads on a node, making good use of the IO (our all flash VAST system helps a lot)... and they get a LOT of data and work done. @Tim - we even provide the HPC for Imperial College's Masters in Advanced computing. Supporting the users goes a long way. Stu. On Wed, Sep 22, 2021 at 4:57 AM Lux, Jim (US 7140) via Beowulf < beowulf at beowulf.org> wrote: > This is interesting, because the figure of merit is ?time to first answer? > ? that?s much like my iterative antenna modeling task ? The cost of the > computing is negligible compared to the cost of the engineer(s) waiting for > the results. > > Running the AWS calculator, a EC instance is $74/month when used 2hrs/day, > so let?s call that about $1/hr. With an engineer billing out at $100/hr, > spinning up 50 instances isn?t all that unreasonable. > > > > > > *From: *Beowulf on behalf of Tim Cutts < > tjrc at sanger.ac.uk> > *Date: *Tuesday, September 21, 2021 at 5:02 AM > *To: *John Hearns > *Cc: *"beowulf at beowulf.org" > *Subject: *[EXTERNAL] Re: [Beowulf] Rant on why HPC isn't as easy as I'd > like it to be. [EXT] > > > > I think that?s exactly the situation we?ve been in for a long time, > especially in life sciences, and it?s becoming more entrenched. My > experience is that the average user of our scientific computing systems has > been becoming less technically savvy for many years now. > > > > The presence of the cloud makes that more acute, in particular because it > makes it easy for the user to effectively throw more hardware at the > problem, which reduces the incentive to make their code particularly fast > or efficient. Cost is the only brake on it, and in many cases I?m finding > the PI doesn?t actually care about that. They care that a result is being > obtained (and it?s time to first result they care about, not time to > complete all the analysis), and so they typically don?t have much time for > those of us who are telling them they need to invest in time up front > developing and optimising efficient code. > > > > And cost is not necessarily the brake I thought it was going to be > anyway. One recent project we?ve done on AWS has impressed me a great > deal. It?s not terribly CPU efficient, and would doubtless, with > sufficient effort, run much more efficiently on premise. 
But it?s > extremely elastic in its nature, and so a good fit for the cloud. Once a > week, the project has to completely re-analyse the 600,000+ COVID genomes > we?e sequenced so far, looking for new branches in the phylogenetic tree, > and to complete that analysis inside 8 hours. Initial attempts to naively > convert the HPC implementation to run on AWS looked as though they were > going to be very expensive (~$50k per weekly run). But a fundamental > reworking of the entire workflow to make it as cloud native as possible, by > which I mean almost exclusively serverless, has succeeded beyond what I > expected. The total cost is <$5,000 a month, and because there is > essentially no statically configured infrastructure at all, the security is > fairly easy to be comfortable about. And all of that was done with no > detailed thinking about whether the actual algorithms running in the > containers are at all optimised in a traditional HPC sense. It?s just not > needed for this particular piece of work. Did it need software developers > with hardcore knowledge of performance optimisation? No. Was it rapid to > develop and deploy? Yes. Is the performance fast enough for UK national > COVID variant surveillance? Yes. Is it cost effective? Yes. Sold! The > one thing it did need was knowledgeable cloud architects, but the cloud > providers can and do help with that. > > > > Tim > > > > -- > > Tim Cutts > Head of Scientific Computing > Wellcome Sanger Institute > > > > On 21 Sep 2021, at 12:24, John Hearns wrote: > > > > Some points well made here. I have seen in the past job scripts passed on > from graduate student to graduate student - the case I am thinking on was > an Abaqus script for 8 core systems, being run on a new 32 core system. Why > WOULD a graduate student question a script given to them - which works. > They should be getting on with their science. I guess this is where > Research Software Engineers come in. > > > > Another point I would make is about modern processor architectures, for > instance AMD Rome/Milan. You can have different Numa Per Socket options, > which affect performance. We set the preferred IO path - which I have seen > myself to have an effect on latency of MPI messages. IF you are not > concerned about your hardware layout you would just go ahead and run, > missing a lot of performance. > > > > I am now going to be controversial and common that over in Julia land the > pattern seems to be these days people develop on their own laptops, or > maybe local GPU systems. There is a lot of microbenchmarking going on. But > there seems to be not a lot of thought given to CPU pinning or shat happens > with hyperthreading. I guess topics like that are part of HPC 'Black Magic' > - though I would imagine the low latency crowd are hot on them. > > > > I often introduce people to the excellent lstopo/hwloc utilities which > show the layout of a system. Most people are pleasantly surprised to find > this. > > > > -- The Wellcome Sanger Institute is operated by Genome Research Limited, a > charity registered in England with number 1021457 and a company registered > in England with number 2742969, whose registered office is 215 Euston Road, > London, NW1 2BE. 
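[The back-of-envelope economics quoted above ($74/month for an instance used two hours a day, an engineer billed at $100/hr, a 50-instance burst) are worth making explicit. A minimal sketch of the arithmetic, treating the thread's figures as rough assumptions:]

```python
# Figures quoted in the thread; both are rough assumptions, not measured rates.
instance_per_hour = 74 / (30 * 2)        # ~ $1.23/h for 2 h/day over a month
engineer_per_hour = 100.0
n_instances = 50

burst_cost_per_hour = n_instances * instance_per_hour   # ~ $62/h
print(f"instance: ~${instance_per_hour:.2f}/h, 50-instance burst: ~${burst_cost_per_hour:.0f}/h")

# The burst pays for itself if it saves more than this much engineer time
# per hour of burst (~0.6 h, i.e. roughly 37 minutes).
break_even = burst_cost_per_hour / engineer_per_hour
print(f"break-even: {break_even:.2f} engineer-hours saved per burst-hour")
```

[Which is the point being made: a 50-instance burst is cheap next to an hour of engineering time, provided it actually shortens the wait for a usable answer.]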
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -- Dr Stuart Midgley sdm900 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From guy.coates at gmail.com Thu Sep 23 12:45:37 2021 From: guy.coates at gmail.com (Guy Coates) Date: Thu, 23 Sep 2021 13:45:37 +0100 Subject: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. [EXT] In-Reply-To: References: Message-ID: Out of interest, how large are the compute jobs (memory, runtime etc)? How easy to get them to fit into a serverless environment? Thanks, Guy On Tue, 21 Sept 2021 at 13:02, Tim Cutts wrote: > I think that?s exactly the situation we?ve been in for a long time, > especially in life sciences, and it?s becoming more entrenched. My > experience is that the average user of our scientific computing systems has > been becoming less technically savvy for many years now. > > The presence of the cloud makes that more acute, in particular because it > makes it easy for the user to effectively throw more hardware at the > problem, which reduces the incentive to make their code particularly fast > or efficient. Cost is the only brake on it, and in many cases I?m finding > the PI doesn?t actually care about that. They care that a result is being > obtained (and it?s time to first result they care about, not time to > complete all the analysis), and so they typically don?t have much time for > those of us who are telling them they need to invest in time up front > developing and optimising efficient code. > > And cost is not necessarily the brake I thought it was going to be > anyway. One recent project we?ve done on AWS has impressed me a great > deal. It?s not terribly CPU efficient, and would doubtless, with > sufficient effort, run much more efficiently on premise. But it?s > extremely elastic in its nature, and so a good fit for the cloud. Once a > week, the project has to completely re-analyse the 600,000+ COVID genomes > we?e sequenced so far, looking for new branches in the phylogenetic tree, > and to complete that analysis inside 8 hours. Initial attempts to naively > convert the HPC implementation to run on AWS looked as though they were > going to be very expensive (~$50k per weekly run). But a fundamental > reworking of the entire workflow to make it as cloud native as possible, by > which I mean almost exclusively serverless, has succeeded beyond what I > expected. The total cost is <$5,000 a month, and because there is > essentially no statically configured infrastructure at all, the security is > fairly easy to be comfortable about. And all of that was done with no > detailed thinking about whether the actual algorithms running in the > containers are at all optimised in a traditional HPC sense. It?s just not > needed for this particular piece of work. Did it need software developers > with hardcore knowledge of performance optimisation? No. Was it rapid to > develop and deploy? Yes. Is the performance fast enough for UK national > COVID variant surveillance? Yes. Is it cost effective? Yes. Sold! The > one thing it did need was knowledgeable cloud architects, but the cloud > providers can and do help with that. > > Tim > > -- > Tim Cutts > Head of Scientific Computing > Wellcome Sanger Institute > > > On 21 Sep 2021, at 12:24, John Hearns wrote: > > Some points well made here. 
I have seen in the past job scripts passed on > from graduate student to graduate student - the case I am thinking on was > an Abaqus script for 8 core systems, being run on a new 32 core system. Why > WOULD a graduate student question a script given to them - which works. > They should be getting on with their science. I guess this is where > Research Software Engineers come in. > > Another point I would make is about modern processor architectures, for > instance AMD Rome/Milan. You can have different Numa Per Socket options, > which affect performance. We set the preferred IO path - which I have seen > myself to have an effect on latency of MPI messages. IF you are not > concerned about your hardware layout you would just go ahead and run, > missing a lot of performance. > > I am now going to be controversial and common that over in Julia land the > pattern seems to be these days people develop on their own laptops, or > maybe local GPU systems. There is a lot of microbenchmarking going on. But > there seems to be not a lot of thought given to CPU pinning or shat happens > with hyperthreading. I guess topics like that are part of HPC 'Black Magic' > - though I would imagine the low latency crowd are hot on them. > > I often introduce people to the excellent lstopo/hwloc utilities which > show the layout of a system. Most people are pleasantly surprised to find > this. > > > -- The Wellcome Sanger Institute is operated by Genome Research Limited, a > charity registered in England with number 1021457 and a company registered > in England with number 2742969, whose registered office is 215 Euston Road, > London, NW1 2BE. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -- Dr. Guy Coates +44(0)7801 710224 -------------- next part -------------- An HTML attachment was scrubbed... URL: From engwalljonathanthereal at gmail.com Thu Sep 23 18:43:01 2021 From: engwalljonathanthereal at gmail.com (Jonathan Engwall) Date: Thu, 23 Sep 2021 11:43:01 -0700 Subject: [Beowulf] Neocortex webinar Message-ID: Hello Beowulf The next Neocortex Webinar has been announced, anyone who might not know can email them at: neocortex at psc.edu I think it is an exciting project. October 4 falls on a Monday only days away, so register now. Jonathan Engwall -------------- next part -------------- An HTML attachment was scrubbed... URL: From pizarroa at amazon.com Thu Sep 23 14:37:06 2021 From: pizarroa at amazon.com (Pizarro, Angel) Date: Thu, 23 Sep 2021 14:37:06 +0000 Subject: [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. [EXT] In-Reply-To: References: Message-ID: DISCLOSURE: I work for AWS HPC Developer Relations in the services team. We developer AWS Batch, AWS ParallelCluster, NICE DCV, etc. Lambda?s limits today are 128MB to 10,240MB (~10GB) and billed in 1MB per ms increments. 15 minute max runtime for the function invocation. Would you all be interested in a hands-on self-paced workshop on creating (or porting) an application to serverless environment? E.g. Monte-Carlo simulation, a genome alignment or variant call, or some other problem? We have some basic data processing documentation but nothing that speaks to real-world HPC use case and that is a something I want to fill the gap on if folks are interested in it. Dr. Denis Bauer at CSIRO is also doing interesting things with serverless. 
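[To make the serverless shape concrete, here is a minimal sketch of the kind of function such a workshop might start from: an AWS Lambda-style handler that runs one small, stateless Monte Carlo slice per invocation, so a large simulation becomes many parallel invocations plus an aggregation step. The handler(event, context) signature is the standard Lambda convention; the event field names and the surrounding fan-out machinery are assumptions for illustration, and a slice this size sits comfortably inside the limits quoted above (10 GB memory, 15-minute runtime).]

```python
import json
import random

def handler(event, context):
    """Estimate pi by Monte Carlo; one small, stateless slice of a larger run."""
    n = int(event.get("samples", 1_000_000))   # "samples" is an assumed event field
    inside = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return {
        "statusCode": 200,
        "body": json.dumps({"samples": n, "pi_estimate": 4.0 * inside / n}),
    }

if __name__ == "__main__":
    # Local smoke test; on AWS the platform supplies event and context.
    print(handler({"samples": 100_000}, None))
```

[The per-invocation compute stays embarrassingly parallel; the real design work is in the fan-out and the aggregation of results, which is exactly where the cloud-native rework described earlier in the thread happens.]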
-angel -- Angel Pizarro | Principal Developer Advocate, HPC @ AWS From: Beowulf on behalf of Guy Coates Date: Thursday, September 23, 2021 at 8:46 AM To: Tim Cutts Cc: Beowulf Subject: RE: [EXTERNAL] [Beowulf] Rant on why HPC isn't as easy as I'd like it to be. [EXT] CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Out of interest, how large are the compute jobs (memory, runtime etc)? How easy to get them to fit into a serverless environment? Thanks, Guy On Tue, 21 Sept 2021 at 13:02, Tim Cutts > wrote: I think that?s exactly the situation we?ve been in for a long time, especially in life sciences, and it?s becoming more entrenched. My experience is that the average user of our scientific computing systems has been becoming less technically savvy for many years now. The presence of the cloud makes that more acute, in particular because it makes it easy for the user to effectively throw more hardware at the problem, which reduces the incentive to make their code particularly fast or efficient. Cost is the only brake on it, and in many cases I?m finding the PI doesn?t actually care about that. They care that a result is being obtained (and it?s time to first result they care about, not time to complete all the analysis), and so they typically don?t have much time for those of us who are telling them they need to invest in time up front developing and optimising efficient code. And cost is not necessarily the brake I thought it was going to be anyway. One recent project we?ve done on AWS has impressed me a great deal. It?s not terribly CPU efficient, and would doubtless, with sufficient effort, run much more efficiently on premise. But it?s extremely elastic in its nature, and so a good fit for the cloud. Once a week, the project has to completely re-analyse the 600,000+ COVID genomes we?e sequenced so far, looking for new branches in the phylogenetic tree, and to complete that analysis inside 8 hours. Initial attempts to naively convert the HPC implementation to run on AWS looked as though they were going to be very expensive (~$50k per weekly run). But a fundamental reworking of the entire workflow to make it as cloud native as possible, by which I mean almost exclusively serverless, has succeeded beyond what I expected. The total cost is <$5,000 a month, and because there is essentially no statically configured infrastructure at all, the security is fairly easy to be comfortable about. And all of that was done with no detailed thinking about whether the actual algorithms running in the containers are at all optimised in a traditional HPC sense. It?s just not needed for this particular piece of work. Did it need software developers with hardcore knowledge of performance optimisation? No. Was it rapid to develop and deploy? Yes. Is the performance fast enough for UK national COVID variant surveillance? Yes. Is it cost effective? Yes. Sold! The one thing it did need was knowledgeable cloud architects, but the cloud providers can and do help with that. Tim -- Tim Cutts Head of Scientific Computing Wellcome Sanger Institute On 21 Sep 2021, at 12:24, John Hearns > wrote: Some points well made here. I have seen in the past job scripts passed on from graduate student to graduate student - the case I am thinking on was an Abaqus script for 8 core systems, being run on a new 32 core system. Why WOULD a graduate student question a script given to them - which works. 
They should be getting on with their science. I guess this is where Research Software Engineers come in. Another point I would make is about modern processor architectures, for instance AMD Rome/Milan. You can have different Numa Per Socket options, which affect performance. We set the preferred IO path - which I have seen myself to have an effect on latency of MPI messages. IF you are not concerned about your hardware layout you would just go ahead and run, missing a lot of performance. I am now going to be controversial and common that over in Julia land the pattern seems to be these days people develop on their own laptops, or maybe local GPU systems. There is a lot of microbenchmarking going on. But there seems to be not a lot of thought given to CPU pinning or shat happens with hyperthreading. I guess topics like that are part of HPC 'Black Magic' - though I would imagine the low latency crowd are hot on them. I often introduce people to the excellent lstopo/hwloc utilities which show the layout of a system. Most people are pleasantly surprised to find this. -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -- Dr. Guy Coates +44(0)7801 710224 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pbisbal at pppl.gov Mon Sep 27 15:11:35 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Mon, 27 Sep 2021 11:11:35 -0400 Subject: [Beowulf] [External] Re: Rant on why HPC isn't as easy as I'd like it to be. [EXT] In-Reply-To: References: Message-ID: <61ee4384-af84-6353-d264-8f45f9cb02af@pppl.gov> I'd be interested Prentice On 9/23/21 10:37 AM, Pizarro, Angel via Beowulf wrote: > > DISCLOSURE: I work for AWS HPC Developer Relations in the services > team. We developer AWS Batch, AWS ParallelCluster, NICE DCV, etc. > > Lambda?s limits today are 128MB to 10,240MB (~10GB) and billed in 1MB > per ms increments. 15 minute max runtime for the function invocation. > > Would you all be interested in a hands-on self-paced workshop on > creating (or porting) an application to serverless environment? E.g. > Monte-Carlo simulation, a genome alignment or variant call, or some > other problem? We have some basic data processing documentation but > nothing that speaks to real-world HPC use case and that is a something > I want to fill the gap on if folks are interested in it. > > Dr. Denis Bauer at CSIRO is also doing interesting things with > serverless. > > -angel > > -- > > Angel Pizarro | Principal Developer Advocate, HPC @ AWS > > *From: *Beowulf on behalf of Guy Coates > > *Date: *Thursday, September 23, 2021 at 8:46 AM > *To: *Tim Cutts > *Cc: *Beowulf > *Subject: *RE: [EXTERNAL] [Beowulf] Rant on why HPC isn't as easy as > I'd like it to be. [EXT] > > *CAUTION*: This email originated from outside of the organization. Do > not click links or open attachments unless you can confirm the sender > and know the content is safe. > > Out of interest, how large are the compute jobs (memory, runtime > etc)?? How easy to get them to fit into a serverless environment? 
> > Thanks, > > > Guy > > On Tue, 21 Sept 2021 at 13:02, Tim Cutts > wrote: > > I think that?s exactly the situation we?ve been in for a long > time, especially in life sciences, and it?s becoming more > entrenched.? My experience is that the average user of our > scientific computing systems has been becoming less technically > savvy for many years now. > > The presence of the cloud makes that more acute, in particular > because it makes it easy for the user to effectively throw more > hardware at the problem, which reduces the incentive to make their > code particularly fast or efficient.? Cost is the only brake on > it, and in many cases I?m finding the PI doesn?t actually care > about that.? They care that a result is being obtained (and it?s > time to first result they care about, not time to complete all the > analysis), and so they typically don?t have much time for those of > us who are telling them they need to invest in time up front > developing and optimising efficient code. > > And cost is not necessarily the brake I thought it was going to be > anyway.? One recent project we?ve done on AWS has impressed me a > great deal.? It?s not terribly CPU efficient, and would doubtless, > with sufficient effort, run much more efficiently on premise.? But > it?s extremely elastic in its nature, and so a good fit for the > cloud. ? Once a week, the project has to completely re-analyse the > 600,000+ COVID genomes we?e sequenced so far, looking for new > branches in the phylogenetic tree, and to complete that analysis > inside 8 hours. ? Initial attempts to naively convert the HPC > implementation to run on AWS looked as though they were going to > be very expensive (~$50k per weekly run).? But a fundamental > reworking of the entire workflow to make it as cloud native as > possible, by which I mean almost exclusively serverless, has > succeeded beyond what I expected.? The total cost is <$5,000 a > month, and because there is essentially no statically configured > infrastructure at all, the security is fairly easy to be > comfortable about. And all of that was done with no detailed > thinking about whether the actual algorithms running in the > containers are at all optimised in a traditional HPC sense.? It?s > just not needed for this particular piece of work.? Did it need > software developers with hardcore knowledge of performance > optimisation? No.? Was it rapid to develop and deploy?? Yes.? Is > the performance fast enough for UK national COVID variant > surveillance?? Yes.? Is it cost effective? Yes.? Sold!? The one > thing it did need was knowledgeable cloud architects, but the > cloud providers can and do help with that. > > Tim > > -- > > Tim Cutts > Head of Scientific Computing > Wellcome Sanger Institute > > > > On 21 Sep 2021, at 12:24, John Hearns > wrote: > > Some points well made here. I have seen in the past job > scripts passed on from graduate student to graduate student - > the case I am thinking on was an Abaqus script for 8 core > systems, being run on a new 32 core system. Why WOULD a > graduate student question a script given to them - which > works. They should be getting on with their science. I guess > this is where Research Software Engineers come in. > > Another point I would make is about modern processor > architectures, for instance AMD Rome/Milan. You can have > different Numa Per Socket options, which affect performance. > We set the preferred IO path - which I have seen myself to > have an effect on latency of MPI messages. 
IF you are not > concerned about your hardware layout you would just go ahead > and run, missing? a lot of performance. > > I am now going to be controversial and common?that?over in > Julia land the pattern seems to be these days people develop > on their own laptops, or maybe local GPU systems. There is a > lot of microbenchmarking going on. But there seems to be not a > lot of thought given to CPU pinning or shat happens with > hyperthreading. I guess topics like that are part of HPC > 'Black Magic' - though I would imagine the low latency crowd > are hot on them. > > I often introduce people to the excellent lstopo/hwloc > utilities which show the layout of a system. Most people are > pleasantly surprised to find this. > > -- The Wellcome Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose > registered office is 215 Euston Road, London, NW1 2BE. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > > > > -- > > Dr. Guy Coates > +44(0)7801 710224 > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedmon at cfa.harvard.edu Wed Sep 29 13:58:58 2021 From: pedmon at cfa.harvard.edu (Paul Edmon) Date: Wed, 29 Sep 2021 09:58:58 -0400 Subject: [Beowulf] Data Destruction Message-ID: Occassionally we get DUA (Data Use Agreement) requests for sensitive data that require data destruction (e.g. NIST 800-88). We've been struggling with how to handle this in an era of distributed filesystems and disks.? We were curious how other people handle requests like this?? What types of filesystems to people generally use for this and how do people ensure destruction?? Do these types of DUA's preclude certain storage technologies from consideration or are there creative ways to comply using more common scalable filesystems? Thanks in advance for the info. -Paul Edmon- From e.scott.atchley at gmail.com Wed Sep 29 14:06:53 2021 From: e.scott.atchley at gmail.com (Scott Atchley) Date: Wed, 29 Sep 2021 10:06:53 -0400 Subject: [Beowulf] Data Destruction In-Reply-To: References: Message-ID: Are you asking about selectively deleting data from a parallel file system (PFS) or destroying drives after removal from the system either due to failure or system decommissioning? For the latter, DOE does not allow us to send any non-volatile media offsite once it has had user data on it. When we are done with drives, we have a very big shredder. On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf wrote: > Occassionally we get DUA (Data Use Agreement) requests for sensitive > data that require data destruction (e.g. NIST 800-88). We've been > struggling with how to handle this in an era of distributed filesystems > and disks. We were curious how other people handle requests like this? > What types of filesystems to people generally use for this and how do > people ensure destruction? 
Do these types of DUA's preclude certain > storage technologies from consideration or are there creative ways to > comply using more common scalable filesystems? > > Thanks in advance for the info. > > -Paul Edmon- > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedmon at cfa.harvard.edu Wed Sep 29 14:15:20 2021 From: pedmon at cfa.harvard.edu (Paul Edmon) Date: Wed, 29 Sep 2021 10:15:20 -0400 Subject: [Beowulf] Data Destruction In-Reply-To: References: Message-ID: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> The former.? We are curious how to selectively delete data from a parallel filesystem.? For example we commonly use Lustre, ceph, and Isilon in our environment.? That said if other types allow for easier destruction of selective data we would be interested in hearing about it. -Paul Edmon- On 9/29/2021 10:06 AM, Scott Atchley wrote: > Are you asking about selectively deleting data from a parallel file > system (PFS) or destroying drives after removal from the system either > due to failure or system decommissioning? > > For the latter, DOE does not allow us to send any non-volatile media > offsite once it has had user data on it. When we are done with?drives, > we have a very big shredder. > > On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf > > wrote: > > Occassionally we get DUA (Data Use Agreement) requests for sensitive > data that require data destruction (e.g. NIST 800-88). We've been > struggling with how to handle this in an era of distributed > filesystems > and disks.? We were curious how other people handle requests like > this? > What types of filesystems to people generally use for this and how do > people ensure destruction?? Do these types of DUA's preclude certain > storage technologies from consideration or are there creative ways to > comply using more common scalable filesystems? > > Thanks in advance for the info. > > -Paul Edmon- > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From e.scott.atchley at gmail.com Wed Sep 29 14:32:32 2021 From: e.scott.atchley at gmail.com (Scott Atchley) Date: Wed, 29 Sep 2021 10:32:32 -0400 Subject: [Beowulf] Data Destruction In-Reply-To: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> References: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> Message-ID: For our users that have sensitive data, we keep it encrypted at rest and in movement. For HDD-based systems, you can perform a secure erase per NIST standards. For SSD-based systems, the extra writes from the secure erase will contribute to the wear on the drives and possibly their eventually wearing out. Most SSDs provide an option to mark blocks as zero without having to write the zeroes. I do not think that it is exposed up to the PFS layer (Lustre, GPFS, Ceph, NFS) and is only available at the ext4 or XFS layer. On Wed, Sep 29, 2021 at 10:15 AM Paul Edmon wrote: > The former. We are curious how to selectively delete data from a parallel > filesystem. 
For example we commonly use Lustre, ceph, and Isilon in our > environment. That said if other types allow for easier destruction of > selective data we would be interested in hearing about it. > > -Paul Edmon- > On 9/29/2021 10:06 AM, Scott Atchley wrote: > > Are you asking about selectively deleting data from a parallel file system > (PFS) or destroying drives after removal from the system either due to > failure or system decommissioning? > > For the latter, DOE does not allow us to send any non-volatile media > offsite once it has had user data on it. When we are done with drives, we > have a very big shredder. > > On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf < > beowulf at beowulf.org> wrote: > >> Occassionally we get DUA (Data Use Agreement) requests for sensitive >> data that require data destruction (e.g. NIST 800-88). We've been >> struggling with how to handle this in an era of distributed filesystems >> and disks. We were curious how other people handle requests like this? >> What types of filesystems to people generally use for this and how do >> people ensure destruction? Do these types of DUA's preclude certain >> storage technologies from consideration or are there creative ways to >> comply using more common scalable filesystems? >> >> Thanks in advance for the info. >> >> -Paul Edmon- >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renfro at tntech.edu Wed Sep 29 14:32:52 2021 From: Renfro at tntech.edu (Renfro, Michael) Date: Wed, 29 Sep 2021 14:32:52 +0000 Subject: [Beowulf] Data Destruction In-Reply-To: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> References: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> Message-ID: I have to wonder if the intent of the DUA is to keep physical media from winding up in the wrong hands. If so, if the servers hosting the parallel filesystem (or a normal single file server) is physically secured in a data center, and the drives are destroyed on decommissioning, that might satisfy the requirements. From: Beowulf on behalf of Paul Edmon via Beowulf Date: Wednesday, September 29, 2021 at 9:15 AM To: Scott Atchley Cc: Beowulf Mailing List Subject: Re: [Beowulf] Data Destruction External Email Warning This email originated from outside the university. Please use caution when opening attachments, clicking links, or responding to requests. ________________________________ The former. We are curious how to selectively delete data from a parallel filesystem. For example we commonly use Lustre, ceph, and Isilon in our environment. That said if other types allow for easier destruction of selective data we would be interested in hearing about it. -Paul Edmon- On 9/29/2021 10:06 AM, Scott Atchley wrote: Are you asking about selectively deleting data from a parallel file system (PFS) or destroying drives after removal from the system either due to failure or system decommissioning? For the latter, DOE does not allow us to send any non-volatile media offsite once it has had user data on it. When we are done with drives, we have a very big shredder. On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf > wrote: Occassionally we get DUA (Data Use Agreement) requests for sensitive data that require data destruction (e.g. NIST 800-88). 
We've been struggling with how to handle this in an era of distributed filesystems and disks. We were curious how other people handle requests like this? What types of filesystems to people generally use for this and how do people ensure destruction? Do these types of DUA's preclude certain storage technologies from consideration or are there creative ways to comply using more common scalable filesystems? Thanks in advance for the info. -Paul Edmon- _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From skylar.thompson at gmail.com Wed Sep 29 14:35:43 2021 From: skylar.thompson at gmail.com (Skylar Thompson) Date: Wed, 29 Sep 2021 07:35:43 -0700 Subject: [Beowulf] Data Destruction In-Reply-To: References: Message-ID: <20210929143543.sk2qf7qmpl2dqiop@angmar.local> We have one storage system (DDN/GPFS) that is required to be NIST-compliant, and we bought self-encrypting drives for it. The up-charge for SED drives has diminished significantly over the past few years so that might be easier than doing it in software and then having to verify/certify that the software is encrypting everything that it should be. On Wed, Sep 29, 2021 at 09:58:58AM -0400, Paul Edmon via Beowulf wrote: > Occassionally we get DUA (Data Use Agreement) requests for sensitive data > that require data destruction (e.g. NIST 800-88). We've been struggling with > how to handle this in an era of distributed filesystems and disks.? We were > curious how other people handle requests like this?? What types of > filesystems to people generally use for this and how do people ensure > destruction?? Do these types of DUA's preclude certain storage technologies > from consideration or are there creative ways to comply using more common > scalable filesystems? > > Thanks in advance for the info. > > -Paul Edmon- > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -- Skylar From pedmon at cfa.harvard.edu Wed Sep 29 14:49:53 2021 From: pedmon at cfa.harvard.edu (Paul Edmon) Date: Wed, 29 Sep 2021 10:49:53 -0400 Subject: [Beowulf] Data Destruction In-Reply-To: References: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> Message-ID: Yeah, that's what we were surmising.? But paranoia and compliance being what it is we were curious what others were doing. -Paul Edmon- On 9/29/2021 10:32 AM, Renfro, Michael wrote: > > I have to wonder if the intent of the DUA is to keep physical media > from winding up in the wrong hands. If so, if the servers hosting the > parallel filesystem (or a normal single file server) is physically > secured in a data center, and the drives are destroyed on > decommissioning, that might satisfy the requirements. > > *From: *Beowulf on behalf of Paul Edmon > via Beowulf > *Date: *Wednesday, September 29, 2021 at 9:15 AM > *To: *Scott Atchley > *Cc: *Beowulf Mailing List > *Subject: *Re: [Beowulf] Data Destruction > > *External Email Warning* > > *This email originated from outside the university. 
Please use caution > when opening attachments, clicking links, or responding to requests.* > > ------------------------------------------------------------------------ > > The former.? We are curious how to selectively delete data from a > parallel filesystem.? For example we commonly use Lustre, ceph, and > Isilon in our environment.? That said if other types allow for easier > destruction of selective data we would be interested in hearing about it. > > -Paul Edmon- > > On 9/29/2021 10:06 AM, Scott Atchley wrote: > > Are you asking about selectively deleting data from a parallel > file system (PFS) or destroying drives after removal from the > system either due to failure or system decommissioning? > > For the latter, DOE does not allow us to send any non-volatile > media offsite once it has had user data on it. When we are done > with?drives, we have a very big shredder. > > On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf > > wrote: > > Occassionally we get DUA (Data Use Agreement) requests for > sensitive > data that require data destruction (e.g. NIST 800-88). We've been > struggling with how to handle this in an era of distributed > filesystems > and disks.? We were curious how other people handle requests > like this? > What types of filesystems to people generally use for this and > how do > people ensure destruction?? Do these types of DUA's preclude > certain > storage technologies from consideration or are there creative > ways to > comply using more common scalable filesystems? > > Thanks in advance for the info. > > -Paul Edmon- > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedmon at cfa.harvard.edu Wed Sep 29 14:52:33 2021 From: pedmon at cfa.harvard.edu (Paul Edmon) Date: Wed, 29 Sep 2021 10:52:33 -0400 Subject: [Beowulf] Data Destruction In-Reply-To: References: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> Message-ID: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> I guess the question is for a parallel filesystem how do you make sure you have 0'd out the file with out borking the whole filesystem since you are spread over a RAID set and could be spread over multiple hosts. -Paul Edmon- On 9/29/2021 10:32 AM, Scott Atchley wrote: > For our users that have sensitive data, we keep it encrypted at rest > and in movement. > > For HDD-based systems, you can perform a secure erase per NIST > standards. For SSD-based systems, the extra writes from the secure > erase will contribute to the wear on the drives and possibly their > eventually wearing out. Most SSDs provide an option to mark blocks as > zero without having to write the zeroes. I do not think that it is > exposed up to the PFS layer (Lustre, GPFS, Ceph, NFS) and is only > available at the ext4 or XFS layer. > > On Wed, Sep 29, 2021 at 10:15 AM Paul Edmon > wrote: > > The former.? We are curious how to selectively delete data from a > parallel filesystem.? For example we commonly use Lustre, ceph, > and Isilon in our environment.? 
That said if other types allow for > easier destruction of selective data we would be interested in > hearing about it. > > -Paul Edmon- > > On 9/29/2021 10:06 AM, Scott Atchley wrote: >> Are you asking about selectively deleting data from a parallel >> file system (PFS) or destroying drives after removal from the >> system either due to failure or system decommissioning? >> >> For the latter, DOE does not allow us to send any non-volatile >> media offsite once it has had user data on it. When we are done >> with?drives, we have a very big shredder. >> >> On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf >> > wrote: >> >> Occassionally we get DUA (Data Use Agreement) requests for >> sensitive >> data that require data destruction (e.g. NIST 800-88). We've >> been >> struggling with how to handle this in an era of distributed >> filesystems >> and disks.? We were curious how other people handle requests >> like this? >> What types of filesystems to people generally use for this >> and how do >> people ensure destruction?? Do these types of DUA's preclude >> certain >> storage technologies from consideration or are there creative >> ways to >> comply using more common scalable filesystems? >> >> Thanks in advance for the info. >> >> -Paul Edmon- >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org >> sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) >> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From skylar.thompson at gmail.com Wed Sep 29 14:57:17 2021 From: skylar.thompson at gmail.com (Skylar Thompson) Date: Wed, 29 Sep 2021 07:57:17 -0700 Subject: [Beowulf] Data Destruction In-Reply-To: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> References: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> Message-ID: <20210929145717.of2td6sm2abnnk4c@angmar.local> In this case, we've successfully pushed back with the granting agency (US NIH, generally, for us) that it's just not feasible to guarantee that the data are truly gone on a production parallel filesystem. The data are encrypted at rest (including offsite backups), which has been sufficient for our purposes. We'll then just use something like GNU shred(1) to do a best-effort secure delete. In addition to RAID, other confounding factors to be aware of are snapshots and cached data. On Wed, Sep 29, 2021 at 10:52:33AM -0400, Paul Edmon via Beowulf wrote: > I guess the question is for a parallel filesystem how do you make sure you > have 0'd out the file with out borking the whole filesystem since you are > spread over a RAID set and could be spread over multiple hosts. > > -Paul Edmon- > > On 9/29/2021 10:32 AM, Scott Atchley wrote: > > For our users that have sensitive data, we keep it encrypted at rest and > > in movement. > > > > For HDD-based systems, you can perform a secure erase per NIST > > standards. For SSD-based systems, the extra writes from the secure erase > > will contribute to the wear on the drives and possibly their eventually > > wearing out. Most SSDs provide an option to mark blocks as zero without > > having to write the zeroes. I do not think that it is exposed up to the > > PFS layer (Lustre, GPFS, Ceph, NFS) and is only available at the ext4 or > > XFS layer. > > > > On Wed, Sep 29, 2021 at 10:15 AM Paul Edmon > > wrote: > > > > The former.? 
We are curious how to selectively delete data from a > > parallel filesystem.? For example we commonly use Lustre, ceph, > > and Isilon in our environment.? That said if other types allow for > > easier destruction of selective data we would be interested in > > hearing about it. > > > > -Paul Edmon- > > > > On 9/29/2021 10:06 AM, Scott Atchley wrote: > > > Are you asking about selectively deleting data from a parallel > > > file system (PFS) or destroying drives after removal from the > > > system either due to failure or system decommissioning? > > > > > > For the latter, DOE does not allow us to send any non-volatile > > > media offsite once it has had user data on it. When we are done > > > with?drives, we have a very big shredder. > > > > > > On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf > > > > wrote: > > > > > > Occassionally we get DUA (Data Use Agreement) requests for > > > sensitive > > > data that require data destruction (e.g. NIST 800-88). We've > > > been > > > struggling with how to handle this in an era of distributed > > > filesystems > > > and disks.? We were curious how other people handle requests > > > like this? > > > What types of filesystems to people generally use for this > > > and how do > > > people ensure destruction?? Do these types of DUA's preclude > > > certain > > > storage technologies from consideration or are there creative > > > ways to > > > comply using more common scalable filesystems? > > > > > > Thanks in advance for the info. > > > > > > -Paul Edmon- > > > > > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > sponsored by Penguin Computing > > > To change your subscription (digest mode or unsubscribe) > > > visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -- Skylar From sassy-work at sassy.formativ.net Wed Sep 29 15:41:08 2021 From: sassy-work at sassy.formativ.net (=?ISO-8859-1?Q?J=F6rg_Sa=DFmannshausen?=) Date: Wed, 29 Sep 2021 16:41:08 +0100 Subject: [Beowulf] Data Destruction In-Reply-To: <20210929145717.of2td6sm2abnnk4c@angmar.local> References: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> <20210929145717.of2td6sm2abnnk4c@angmar.local> Message-ID: <3450927.8eZotpTBpl@deepblue> Dear all, interesting discussion and very timely for me as well as we are currently setting up a new HPC facility, using OpenStack throughout so we can build a Data Safe Haven with it as well. The question about data security came up too in various conversations, both internal and with industrial partners. Here I actually asked one collaboration partner what they understand about "data at rest": - the drive has been turned off - the data is not being accessed For the former, that is easy we simply encrypt all drives, one way or another. However, that means when the drive is on, the data is not encrypted. For the latter that is a bit more complicated as you need to decrypt the files/folder when you want to access them. This, however, in addition to the drive encryption itself, should give you potentially the maximum security. When you want to destroy that data, deleting the encrypted container *and* the access key, i.e. the piece you need to decrypt it, like a Yubikey, should in my humble opinion being enough for most data. 
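A minimal sketch of the "delete the container and the key" approach described above, using LUKS on a plain block device (illustrative only; device and mapping names are placeholders, and a parallel filesystem adds layers this does not address):

    cryptsetup luksFormat /dev/sdX          # encrypt the device, set a passphrase/keyslot
    cryptsetup open /dev/sdX secure_vol     # unlock for use (mkfs/mount as usual)
    # ... use the volume, then at end of life:
    cryptsetup close secure_vol
    cryptsetup erase /dev/sdX               # destroy every keyslot; without a header or
                                            # key backup the data is cryptographically gone

The same idea scales down to per-project containers: lose the key material (or the token holding it) and what remains on disk is ciphertext.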
If you need more, shred the drive and don't use fancy stuff like RAID or PFS. If you still need more, don't store the data at all but print it out on paper and destroy it by means of incineration. :D How about that? All the best from a sunny London J?rg Am Mittwoch, 29. September 2021, 15:57:17 BST schrieb Skylar Thompson: > In this case, we've successfully pushed back with the granting agency (US > NIH, generally, for us) that it's just not feasible to guarantee that the > data are truly gone on a production parallel filesystem. The data are > encrypted at rest (including offsite backups), which has been sufficient > for our purposes. We'll then just use something like GNU shred(1) to do a > best-effort secure delete. > > In addition to RAID, other confounding factors to be aware of are snapshots > and cached data. > > On Wed, Sep 29, 2021 at 10:52:33AM -0400, Paul Edmon via Beowulf wrote: > > I guess the question is for a parallel filesystem how do you make sure you > > have 0'd out the file with out borking the whole filesystem since you are > > spread over a RAID set and could be spread over multiple hosts. > > > > -Paul Edmon- > > > > On 9/29/2021 10:32 AM, Scott Atchley wrote: > > > For our users that have sensitive data, we keep it encrypted at rest and > > > in movement. > > > > > > For HDD-based systems, you can perform a secure erase per NIST > > > standards. For SSD-based systems, the extra writes from the secure erase > > > will contribute to the wear on the drives and possibly their eventually > > > wearing out. Most SSDs provide an option to mark blocks as zero without > > > having to write the zeroes. I do not think that it is exposed up to the > > > PFS layer (Lustre, GPFS, Ceph, NFS) and is only available at the ext4 or > > > XFS layer. > > > > > > On Wed, Sep 29, 2021 at 10:15 AM Paul Edmon > > > > > > wrote: > > > The former. We are curious how to selectively delete data from a > > > parallel filesystem. For example we commonly use Lustre, ceph, > > > and Isilon in our environment. That said if other types allow for > > > easier destruction of selective data we would be interested in > > > hearing about it. > > > > > > -Paul Edmon- > > > > > > On 9/29/2021 10:06 AM, Scott Atchley wrote: > > > > Are you asking about selectively deleting data from a parallel > > > > file system (PFS) or destroying drives after removal from the > > > > system either due to failure or system decommissioning? > > > > > > > > For the latter, DOE does not allow us to send any non-volatile > > > > media offsite once it has had user data on it. When we are done > > > > with drives, we have a very big shredder. > > > > > > > > On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf > > > > > > > > > wrote: > > > > Occassionally we get DUA (Data Use Agreement) requests for > > > > sensitive > > > > data that require data destruction (e.g. NIST 800-88). We've > > > > been > > > > struggling with how to handle this in an era of distributed > > > > filesystems > > > > and disks. We were curious how other people handle requests > > > > like this? > > > > What types of filesystems to people generally use for this > > > > and how do > > > > people ensure destruction? Do these types of DUA's preclude > > > > certain > > > > storage technologies from consideration or are there creative > > > > ways to > > > > comply using more common scalable filesystems? > > > > > > > > Thanks in advance for the info. 
> > > > > > > > -Paul Edmon- > > > > > > > > _______________________________________________ > > > > Beowulf mailing list, Beowulf at beowulf.org > > > > sponsored by Penguin Computing > > > > To change your subscription (digest mode or unsubscribe) > > > > visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf From ellis at ellisv3.com Wed Sep 29 15:42:46 2021 From: ellis at ellisv3.com (Ellis Wilson) Date: Wed, 29 Sep 2021 11:42:46 -0400 Subject: [Beowulf] Data Destruction In-Reply-To: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> References: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> Message-ID: <47febea6-efec-1fac-584f-29bbf67e4395@ellisv3.com> Apologies in advance for the top-post -- too many interleaved streams here to sanely bottom-post appropriately. SED drives, which are a reasonably small mark-up for both HDDs and SSDs, provide full drive or per-band solutions to "wipe" the drive by revving the key associated with the band or drive. For enterprise HDDs the feature is extremely common -- for enterprise SSDs it is hit or miss (NVMe tend to have it, SATA infrequently do). This is your best bet for a solution where you're a-ok with wiping the entire system. Note there's non-zero complexity here usually revolving around a non-zero price KMIP server, but it's (usually) not terrible. My old employ (Panasas) supports this level of encryption in their most recent release. Writing zeros over HDDs or SSDs today is an extremely dubious solution. SSDs will just write the zeros elsewhere (or more commonly, not write them at all) and HDDs are far more complex than the olden days so you're still given no hard guarantees there that writing to LBA X is actually writing to LBA X. Add a PFS and then local FS in front of this and forget about it. You're just wasting bandwidth. If you have a multi-tenant system and cannot just wipe the whole system by revving encryption keys on the drives, you're options are static partitioning of the drives into SED bands per tenant and a rather complex setup with a KMIP server and parallel parallel file systems to support that, or client-side encryption. Lustre 2.14 provides this via fsencrypt for data, which is actually pretty slick. This is your best bet to cryptographically shred the data for individual users. I have no experience with other commercial file systems so cannot comment on who does or doesn't support client-side encryption, but whoever does should allow you to fairly trivially shred the bits associated with that user/project/org by discarding/revving the corresponding keys. If you go the client-side encryption route and shred the keys, snapshots, PFS, local FS, RAID, and all of the other factors here play no role and you can safely promise the data is mathematically "gone" to the end-user. Best, ellis On 9/29/21 10:52 AM, Paul Edmon via Beowulf wrote: > I guess the question is for a parallel filesystem how do you make sure > you have 0'd out the file with out borking the whole filesystem since > you are spread over a RAID set and could be spread over multiple hosts. 
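To make the whole-device options in this sub-thread concrete, a hedged sketch for drives that are already out of service (not an answer to selective deletion on a live parallel filesystem; device and file names are placeholders):

    # NVMe: cryptographic erase, if the drive reports support for it
    nvme format /dev/nvme0n1 --ses=2
    # SATA: ATA secure erase
    hdparm -I /dev/sdX | grep -i erase                    # check support and timings first
    hdparm --user-master u --security-set-pass p /dev/sdX
    hdparm --user-master u --security-erase p /dev/sdX
    # per-file overwrite on a PFS is best-effort at most, for the reasons given above
    shred -v -n 1 -z -u /path/to/sensitive_file

None of this solves the selective-deletion problem by itself; it only covers the decommissioning end of it.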
> > -Paul Edmon- > > On 9/29/2021 10:32 AM, Scott Atchley wrote: >> For our users that have sensitive data, we keep it encrypted at rest >> and in movement. >> >> For HDD-based systems, you can perform a secure erase per NIST >> standards. For SSD-based systems, the extra writes from the secure >> erase will contribute to the wear on the drives and possibly their >> eventually wearing out. Most SSDs provide an option to mark blocks as >> zero without having to write the zeroes. I do not think that it is >> exposed up to the PFS layer (Lustre, GPFS, Ceph, NFS) and is only >> available at the ext4 or XFS layer. >> >> On Wed, Sep 29, 2021 at 10:15 AM Paul Edmon > > wrote: >> >> The former.? We are curious how to selectively delete data from a >> parallel filesystem.? For example we commonly use Lustre, ceph, >> and Isilon in our environment.? That said if other types allow for >> easier destruction of selective data we would be interested in >> hearing about it. >> >> -Paul Edmon- >> >> On 9/29/2021 10:06 AM, Scott Atchley wrote: >>> Are you asking about selectively deleting data from a parallel >>> file system (PFS) or destroying drives after removal from the >>> system either due to failure or system decommissioning? >>> >>> For the latter, DOE does not allow us to send any non-volatile >>> media offsite once it has had user data on it. When we are done >>> with?drives, we have a very big shredder. >>> >>> On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf >>> > wrote: >>> >>> Occassionally we get DUA (Data Use Agreement) requests for >>> sensitive >>> data that require data destruction (e.g. NIST 800-88). We've >>> been >>> struggling with how to handle this in an era of distributed >>> filesystems >>> and disks.? We were curious how other people handle requests >>> like this? >>> What types of filesystems to people generally use for this >>> and how do >>> people ensure destruction?? Do these types of DUA's preclude >>> certain >>> storage technologies from consideration or are there creative >>> ways to >>> comply using more common scalable filesystems? >>> >>> Thanks in advance for the info. >>> >>> -Paul Edmon- >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org >>> sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) >>> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >>> >>> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > From ellis at ellisv3.com Wed Sep 29 15:46:53 2021 From: ellis at ellisv3.com (Ellis Wilson) Date: Wed, 29 Sep 2021 11:46:53 -0400 Subject: [Beowulf] Data Destruction In-Reply-To: <3450927.8eZotpTBpl@deepblue> References: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> <20210929145717.of2td6sm2abnnk4c@angmar.local> <3450927.8eZotpTBpl@deepblue> Message-ID: On 9/29/21 11:41 AM, J?rg Sa?mannshausen wrote: > If you still need more, don't store the data at all but print it out on paper > and destroy it by means of incineration. :D I have heard stories from past colleagues of one large US Lab putting their HDDs through wood chippers with magnets on the chipped side to kill the bits good and dead. As a storage fanatic that always struck me as something I'd have loved to see. 
Best, ellis From james.p.lux at jpl.nasa.gov Wed Sep 29 19:00:08 2021 From: james.p.lux at jpl.nasa.gov (Lux, Jim (US 7140)) Date: Wed, 29 Sep 2021 19:00:08 +0000 Subject: [Beowulf] [EXTERNAL] Re: Data Destruction In-Reply-To: References: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> <20210929145717.of2td6sm2abnnk4c@angmar.local> <3450927.8eZotpTBpl@deepblue> Message-ID: <64B7F07B-36DC-4589-BE5C-8DE72AF644C4@jpl.nasa.gov> There are special purpose drive shredders - they'll even come out to your facility with such a device mounted on a truck. It's not as exciting as you might think. Throw stuff in, makes horrible noise, done. For entertainment, the truck sized wood chipper used when clearing old orchards is much more fun. Another unit rips the tree out of the ground and drops it into the giant feed hopper on top. Entire lemon or avocado trees turned into mulch in a very short time. This isn't the one I saw, but it's the same idea, except horizontal feed. https://www.youtube.com/watch?v=BCdmO6WvBYk We had a lengthy discussion on Slack at work (JPL) about ways to destroy drives - melting them in some sort of forge seems to be the most fun, unless you have a convenient volcano with flowing lava handy. ?On 9/29/21, 8:47 AM, "Beowulf on behalf of Ellis Wilson" wrote: On 9/29/21 11:41 AM, J?rg Sa?mannshausen wrote: > If you still need more, don't store the data at all but print it out on paper > and destroy it by means of incineration. :D I have heard stories from past colleagues of one large US Lab putting their HDDs through wood chippers with magnets on the chipped side to kill the bits good and dead. As a storage fanatic that always struck me as something I'd have loved to see. Best, ellis _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://urldefense.us/v3/__https://beowulf.org/cgi-bin/mailman/listinfo/beowulf__;!!PvBDto6Hs4WbVuu7!Yrfky03OZZ3wo1YOCW77WLjeKXbglr4IVVHlYwAqOLxxgkK95rTaAlheRxZy6Ww0M92RGxs$ From lohitv at gwmail.gwu.edu Wed Sep 29 21:26:26 2021 From: lohitv at gwmail.gwu.edu (Lohit Valleru) Date: Wed, 29 Sep 2021 17:26:26 -0400 Subject: [Beowulf] RoCE vs. InfiniBand In-Reply-To: References: <4143909.Z0dSWV6CRx@deepblue> <3249207.C5SgPnEN8U@deepblue> Message-ID: Hello Everyone.. I am now at a similar confusion between what to choose for a new cluster - ROCE vs Infiniband. My experience with ROCE when I tried it recently was that it was not easy to set it up. It required me to set up qos for lossless fabric, and pfc for flow control. On top of that - it required me to decide how much buffer to dedicate for each class. I am not a networking expert, and it has been very difficult for me to understand how much of a buffer would be enough, in addition to how to monitor buffers, and understand that I would need to dedicate more buffers. Following a few mellanox documents - I think i did enable ROCE however i am not sure if i had set it up the right way. Because I was not able to make it reliably work with GPFS or any MPI applications. In the past, when I worked with Infiniband, it was a breeze to set it up, and make it work with GPFS and other MPI applications. I did have issues with Infiniband errors, which were not easy to debug - but other than that, management and setup of infiniband seemed to be very easy. 
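As an aside on the "how do I even tell whether it is working" problem: a few read-only checks go a long way before touching buffer sizes (a sketch assuming Mellanox/NVIDIA ConnectX NICs with MLNX_OFED installed; interface and device names are placeholders):

    ibv_devinfo                                      # RDMA devices and port state seen by verbs
    mlnx_qos -i eth0                                 # current trust mode, PFC and buffer settings
    ethtool -S eth0 | grep -Ei 'pause|discard|ecn'   # pause frames, drops and ECN marks counted by the NIC
    # plus the switch-side PFC/ECN and buffer-drop counters from the switch CLI

Watching those counters while a GPFS or MPI job runs is usually the quickest way to tell whether the fabric is actually lossless or quietly dropping and retransmitting.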
Currently - We have a lot of ethernet networking issues, where we see many discards and retransmits of packets, leading to GPFS complaining about the same and remounting the filesystem or expelling the filesystem. In addition, I see IO performance issues. The reason I was told that it might be because of a low buffer 100Gb switch, and a deep buffer 100Gb switch might solve the issue. However - I could not prove the same with respect to buffers. We enabled ECN and it seemed to make the network a bit stable but the performance is still lacking. Most of the issues were because of IO between Storage and Clients, where multiple Storage servers are able to give out a lot more network bandwidth than a single client could take. So I was thinking that a better solution, with least setup issues and debugging would be to have both ethernet and infiniband. Ethernet for administrative traffic and Infiniband for data traffic. However, the argument is to why not use ROCE instead of infiniband. When it comes to ROCE - I find it very difficult to find documentation on how to set things up the correct way and debug issues with buffering/flow control. I do see that Mellanox has some documentation with respect to ROCE, but it has not been very clear. I have understood that ROCE would mostly be beneficial when it comes to long distances, or when we might need to route between ethernet and infiniband. May I know if anyone could help me understand with their experience, on what they would choose if they build a new cluster, and why would that be. Would ROCE be easier to setup,manage and debug or Infiniband? As of now - the new cluster is going to be within a single data center, and it might span to about 500 nodes with 4 GPUs and 64 cores each. We might get storages ( with multiple storage servers containing from 2 - 6 ConnectX6 per server) that can do 420GB/s or more, and clients with either a single ConnectX6 - 100G or 8 ConnectX6 cards. For ROCE - May i know if anyone could help me point to the respective documentation that could help me learn on how to set it up and debug it correctly. Thank you, Lohit On Sun, Jan 17, 2021 at 8:07 PM Stu Midgley wrote: > Morning (Hi Gilad) > > We run RoCE over Mellanox 100G Ethernet and get 1.3us latency for the > shortest hop. Increasing slightly as you go through the fabric. > > We run ethernet for a full dual-plane fat-tree :) It is 100% possible > with Mellanox :) > > We love it. > > > On Fri, Jan 15, 2021 at 8:40 PM J?rg Sa?mannshausen < > sassy-work at sassy.formativ.net> wrote: > >> Hi Gilad, >> >> thanks for the feedback, much appreciated. >> In an ideal world, you are right of course. OpenStack is supported >> natively on >> InfiniBand, and you can get the MetroX system to connect between two >> different >> sites (I leave it open of how to read that) etc. >> >> However, in the real world all of that needs to fit into a budget. From >> what I >> can see on the cluster, most jobs are in the region between 64 and 128 >> cores. >> So, that raises the question for that rather small amount of cores, do we >> really need InfiniBand or can we do what we need to do with RoCE v2? >> >> In other words, for the same budget, does it make sense to remove the >> InfiniBand part of the design and get say one GPU box in instead? >> >> What I want to avoid is to make the wrong decision (cheap and cheerful) >> and >> ending up with a badly designed cluster later. >> >> As you mentioned MetroX: remind me please, what kind of cable does it >> need? 
Is >> that something special or can we use already existing cables, whatever is >> used >> between data centre sites (sic!)? >> >> We had a chat with Darren about that which was, as always talking to your >> colleague Darren, very helpful. I remember very distinct there was a >> reason >> why we went for the InfiniBand/RoCE solution but I cannot really remember >> it. >> It was something with the GPU boxes we want to buy as well. >> >> I will pass your comments on to my colleague next week when I am back at >> work >> and see what they say. So many thanks for your sentiments here which are >> much >> appreciated from me! >> >> All the best from a cold London >> >> J?rg >> >> Am Donnerstag, 26. November 2020, 12:51:55 GMT schrieb Gilad Shainer: >> > Let me try to help: >> > >> > - OpenStack is supported natively on InfiniBand already, >> therefore >> > there is no need to go to Ethernet for that >> >> > - File system wise, you can have IB file system, and connect >> > directly to IB system. >> >> > - Depends on the distance, you can run 2Km IB between >> switches, or >> > use Mellanox MetroX for connecting over 40Km. VicinityIO have system >> that >> > go over thousands of miles? >> >> > - IB advantages are with much lower latency (switches alone >> are 3X >> > lower latency), cost effectiveness (for the same speed, IB switches are >> > more cost effective than Ethernet) and the In-Network Computing engines >> > (MPI reduction operations, Tag Matching run on the network) >> >> > If you need help, feel free to contact directly. >> > >> > Regards, >> > Gilad Shainer >> > >> > From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of John >> Hearns >> > Sent: Thursday, November 26, 2020 3:42 AM >> > To: J?rg Sa?mannshausen ; Beowulf >> Mailing >> > List >> Subject: Re: [Beowulf] RoCE vs. InfiniBand >> > >> > External email: Use caution opening links or attachments >> > >> > Jorg, I think I might know where the Lustre storage is ! >> > It is possible to install storage routers, so you could route between >> > ethernet and infiniband. >> It is also worth saying that Mellanox have Metro >> > Infiniband switches - though I do not think they go as far as the west >> of >> > London! >> > Seriously though , you ask about RoCE. I will stick my neck out and say >> yes, >> > if you are planning an Openstack cluster >> with the intention of having >> > mixed AI and 'traditional' HPC workloads I would go for a RoCE style >> setup. >> > In fact I am on a discussion about a new project for a customer with >> > similar aims in an hours time. >> > I could get some benchmarking time if you want to do a direct >> comparison of >> > Gromacs on IB / RoCE >> >> > >> > >> > >> > >> > >> > >> > >> > >> > On Thu, 26 Nov 2020 at 11:14, J?rg Sa?mannshausen >> > > >> > wrote: >> Dear all, >> > >> > as the DNS problems have been solve (many thanks for doing this!), I was >> > wondering if people on the list have some experiences with this >> question: >> > >> > We are currently in the process to purchase a new cluster and we want to >> > use >> OpenStack for the whole management of the cluster. Part of the cluster >> > will run HPC applications like GROMACS for example, other parts typical >> > OpenStack applications like VM. We also are implementing a Data Safe >> Haven >> > for the more sensitive data we are aiming to process. Of course, we >> want to >> > have a decent size GPU partition as well! >> > >> > Now, traditionally I would say that we are going for InfiniBand. 
>> However, >> > for >> reasons I don't want to go into right now, our existing file storage >> > (Lustre) will be in a different location. Thus, we decided to go for >> RoCE >> > for the file storage and InfiniBand for the HPC applications. >> > >> > The point I am struggling is to understand if this is really the best of >> > the >> solution or given that we are not building a 100k node cluster, we >> > could use RoCE for the few nodes which are doing parallel, read MPI, >> jobs >> > too. I have a nagging feeling that I am missing something if we are >> moving >> > to pure RoCE and ditch the InfiniBand. We got a mixed workload, from >> ML/AI >> > to MPI applications like GROMACS to pipelines like they are used in the >> > bioinformatic corner. We are not planning to partition the GPUs, the >> > current design model is to have only 2 GPUs in a chassis. >> > So, is there something I am missing or is the stomach feeling I have >> really >> > a >> lust for some sushi? :-) >> > >> > Thanks for your sentiments here, much welcome! >> > >> > All the best from a dull London >> > >> > J?rg >> > >> > >> > >> > _______________________________________________ >> > Beowulf mailing list, Beowulf at beowulf.org >> > sponsored by Penguin Computing >> To change your subscription (digest mode or >> > unsubscribe) visit >> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf< >> https://nam11.safelink >> > >> s.protection.outlook.com/?url=https%3A%2F%2Fbeowulf.org%2Fcgi-bin%2Fmailman% >> > 2Flistinfo%2Fbeowulf&data=04%7C01%7CShainer%40nvidia.com >> %7C8e220b6be2fa48921 >> > >> dce08d892005b27%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637419877513157 >> > >> 960%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1h >> > >> aWwiLCJXVCI6Mn0%3D%7C1000&sdata=0NLRDQHkYol82mmqs%2BQrFryEuitIpDss2NwgIeyg1K >> > 8%3D&reserved=0> >> >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> > > > -- > Dr Stuart Midgley > sdm900 at gmail.com > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sassy-work at sassy.formativ.net Wed Sep 29 21:31:02 2021 From: sassy-work at sassy.formativ.net (=?ISO-8859-1?Q?J=F6rg_Sa=DFmannshausen?=) Date: Wed, 29 Sep 2021 22:31:02 +0100 Subject: [Beowulf] [EXTERNAL] Re: Data Destruction In-Reply-To: <64B7F07B-36DC-4589-BE5C-8DE72AF644C4@jpl.nasa.gov> References: <64B7F07B-36DC-4589-BE5C-8DE72AF644C4@jpl.nasa.gov> Message-ID: <5014979.ZkpcWPKftn@deepblue> Dear all, back at a previous work place the shredder did come to our place and like Jim said: loud, not much to look at other than a heap of shredded metal and plastic the other end. There is an active volcano around right now: https://www.spiegel.de/wissenschaft/natur/vulkanausbruch-auf-la-palma-lavastrom-fliesst-in-den-atlantik-a-9b86cd64-d9b6-4566-9a41-06a1c0ac4a47? jwsource=cl (Sorry, German text but the pictures are quite spectacular) I think the only flipside is that usually the volcano does not come to you! :D Melting it all down is a good idea from the data destruction point of view. 
However, it makes it much more difficult to separate the more precious elements later on. Here taking out the platters first and manually disassemble the rest might be a better way forward. You can shredder the platters and recycle them and recycle the rest as require. Ok, I take off my chemists hat. :D All the best J?rg Am Mittwoch, 29. September 2021, 20:00:08 BST schrieb Lux, Jim (US 7140) via Beowulf: > There are special purpose drive shredders - they'll even come out to your > facility with such a device mounted on a truck. It's not as exciting as you > might think. Throw stuff in, makes horrible noise, done. > > For entertainment, the truck sized wood chipper used when clearing old > orchards is much more fun. Another unit rips the tree out of the ground and > drops it into the giant feed hopper on top. Entire lemon or avocado trees > turned into mulch in a very short time. > > This isn't the one I saw, but it's the same idea, except horizontal feed. > https://www.youtube.com/watch?v=BCdmO6WvBYk > > > We had a lengthy discussion on Slack at work (JPL) about ways to destroy > drives - melting them in some sort of forge seems to be the most fun, > unless you have a convenient volcano with flowing lava handy. > > ?On 9/29/21, 8:47 AM, "Beowulf on behalf of Ellis Wilson" > wrote: > On 9/29/21 11:41 AM, J?rg Sa?mannshausen wrote: > > If you still need more, don't store the data at all but print it out > > on paper and destroy it by means of incineration. :D > > I have heard stories from past colleagues of one large US Lab putting > their HDDs through wood chippers with magnets on the chipped side to > kill the bits good and dead. As a storage fanatic that always struck me > as something I'd have loved to see. > > Best, > > ellis > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://urldefense.us/v3/__https://beowulf.org/cgi-bin/mailman/listinfo/beo > wulf__;!!PvBDto6Hs4WbVuu7!Yrfky03OZZ3wo1YOCW77WLjeKXbglr4IVVHlYwAqOLxxgkK95r > TaAlheRxZy6Ww0M92RGxs$ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf From sassy-work at sassy.formativ.net Wed Sep 29 21:51:40 2021 From: sassy-work at sassy.formativ.net (=?ISO-8859-1?Q?J=F6rg_Sa=DFmannshausen?=) Date: Wed, 29 Sep 2021 22:51:40 +0100 Subject: [Beowulf] Data Destruction In-Reply-To: <47febea6-efec-1fac-584f-29bbf67e4395@ellisv3.com> References: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> <47febea6-efec-1fac-584f-29bbf67e4395@ellisv3.com> Message-ID: <1694605.ttdIU95dJi@deepblue> Hi Ellis, interesting concept. I did not know about the Lustre fsencrypt but then, I am less the in-detail expert in PFS. Just to make sure I get the concept of that correct: Basically Lustre is providing projects which itself are encrypted, similar to the encrypted containers I mentioned before. So in order to access the project folder, you would need some kind of encryption key. Without that, you only have meaningless data in front of you. Is that understanding correct? Does anybody happen to know if a similar system like the one Lustre is offering is possible on Ceph? The only problem I have with all these things is: at one point you will need to access the decrypted data. 
Then you need to make sure that this data is not leaving your system. So for that reason we are using a Data Safe Haven where data ingress and egress is done via a staging system. Some food for thought. Thanks All the best J?rg Am Mittwoch, 29. September 2021, 16:42:46 BST schrieb Ellis Wilson: > Apologies in advance for the top-post -- too many interleaved streams > here to sanely bottom-post appropriately. > > SED drives, which are a reasonably small mark-up for both HDDs and SSDs, > provide full drive or per-band solutions to "wipe" the drive by revving > the key associated with the band or drive. For enterprise HDDs the > feature is extremely common -- for enterprise SSDs it is hit or miss > (NVMe tend to have it, SATA infrequently do). This is your best bet for > a solution where you're a-ok with wiping the entire system. Note > there's non-zero complexity here usually revolving around a non-zero > price KMIP server, but it's (usually) not terrible. My old employ > (Panasas) supports this level of encryption in their most recent release. > > Writing zeros over HDDs or SSDs today is an extremely dubious solution. > SSDs will just write the zeros elsewhere (or more commonly, not write > them at all) and HDDs are far more complex than the olden days so you're > still given no hard guarantees there that writing to LBA X is actually > writing to LBA X. Add a PFS and then local FS in front of this and > forget about it. You're just wasting bandwidth. > > If you have a multi-tenant system and cannot just wipe the whole system > by revving encryption keys on the drives, you're options are static > partitioning of the drives into SED bands per tenant and a rather > complex setup with a KMIP server and parallel parallel file systems to > support that, or client-side encryption. Lustre 2.14 provides this via > fsencrypt for data, which is actually pretty slick. This is your best > bet to cryptographically shred the data for individual users. I have no > experience with other commercial file systems so cannot comment on who > does or doesn't support client-side encryption, but whoever does should > allow you to fairly trivially shred the bits associated with that > user/project/org by discarding/revving the corresponding keys. If you > go the client-side encryption route and shred the keys, snapshots, PFS, > local FS, RAID, and all of the other factors here play no role and you > can safely promise the data is mathematically "gone" to the end-user. > > Best, > > ellis > > On 9/29/21 10:52 AM, Paul Edmon via Beowulf wrote: > > I guess the question is for a parallel filesystem how do you make sure > > you have 0'd out the file with out borking the whole filesystem since > > you are spread over a RAID set and could be spread over multiple hosts. > > > > -Paul Edmon- > > > > On 9/29/2021 10:32 AM, Scott Atchley wrote: > >> For our users that have sensitive data, we keep it encrypted at rest > >> and in movement. > >> > >> For HDD-based systems, you can perform a secure erase per NIST > >> standards. For SSD-based systems, the extra writes from the secure > >> erase will contribute to the wear on the drives and possibly their > >> eventually wearing out. Most SSDs provide an option to mark blocks as > >> zero without having to write the zeroes. I do not think that it is > >> exposed up to the PFS layer (Lustre, GPFS, Ceph, NFS) and is only > >> available at the ext4 or XFS layer. > >> > >> On Wed, Sep 29, 2021 at 10:15 AM Paul Edmon >> > >> > wrote: > >> The former. 
We are curious how to selectively delete data from a > >> parallel filesystem. For example we commonly use Lustre, ceph, > >> and Isilon in our environment. That said if other types allow for > >> easier destruction of selective data we would be interested in > >> hearing about it. > >> > >> -Paul Edmon- > >> > >> On 9/29/2021 10:06 AM, Scott Atchley wrote: > >>> Are you asking about selectively deleting data from a parallel > >>> file system (PFS) or destroying drives after removal from the > >>> system either due to failure or system decommissioning? > >>> > >>> For the latter, DOE does not allow us to send any non-volatile > >>> media offsite once it has had user data on it. When we are done > >>> with drives, we have a very big shredder. > >>> > >>> On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf > >>> > >>> > wrote: > >>> Occassionally we get DUA (Data Use Agreement) requests for > >>> sensitive > >>> data that require data destruction (e.g. NIST 800-88). We've > >>> been > >>> struggling with how to handle this in an era of distributed > >>> filesystems > >>> and disks. We were curious how other people handle requests > >>> like this? > >>> What types of filesystems to people generally use for this > >>> and how do > >>> people ensure destruction? Do these types of DUA's preclude > >>> certain > >>> storage technologies from consideration or are there creative > >>> ways to > >>> comply using more common scalable filesystems? > >>> > >>> Thanks in advance for the info. > >>> > >>> -Paul Edmon- > >>> > >>> _______________________________________________ > >>> Beowulf mailing list, Beowulf at beowulf.org > >>> sponsored by Penguin Computing > >>> To change your subscription (digest mode or unsubscribe) > >>> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > >>> > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf From ellis at ellisv3.com Thu Sep 30 00:30:07 2021 From: ellis at ellisv3.com (Ellis Wilson) Date: Wed, 29 Sep 2021 20:30:07 -0400 Subject: [Beowulf] Data Destruction In-Reply-To: <1694605.ttdIU95dJi@deepblue> References: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> <47febea6-efec-1fac-584f-29bbf67e4395@ellisv3.com> <1694605.ttdIU95dJi@deepblue> Message-ID: <5e925b0c-dd4b-fcee-7513-86eac1a087d0@ellisv3.com> On 9/29/21 5:51 PM, J?rg Sa?mannshausen wrote: > interesting concept. I did not know about the Lustre fsencrypt but then, I am > less the in-detail expert in PFS. > > Just to make sure I get the concept of that correct: Basically Lustre is > providing projects which itself are encrypted, similar to the encrypted > containers I mentioned before. So in order to access the project folder, you > would need some kind of encryption key. Without that, you only have > meaningless data in front of you. Is that understanding correct? The lustre kernel client module won't even permit open. There may be a way around that with a hacked kernel module, but even then if you don't have the key, you don't have the data. 
And it's not explicitly "project" based, in that the key really just applies to directories and all of their children (recursively). Last, apologies, but I typo'd the name. It's fscrypt, not fsencrypt. Was typing too quickly. See section 30.5 of the Lustre manual for details. At present no directories end up encrypted, so if you have *nix perms you'll be able to traverse everything, but you can't open anything (or again, even if you could, nonsense comes out). Full directory (including metadata) encryption is slated for the next release of Lustre. > The only problem I have with all these things is: at one point you will need > to access the decrypted data. Then you need to make sure that this data is not > leaving your system. So for that reason we are using a Data Safe Haven where > data ingress and egress is done via a staging system. I think this is orthogonal to the issue in question. For sure having a form of air gap to control the flow of data in/out is very useful, but in multi-tenant PFS you still need to provide some protections against malicious tenants convolving other people's data with their own and purporting it's a legitimate export. Client-side encryption (when managed appropriately) provides fairly decent protection against this form of the problem. Best, ellis From sdm900 at gmail.com Thu Sep 30 01:31:43 2021 From: sdm900 at gmail.com (Stu Midgley) Date: Thu, 30 Sep 2021 09:31:43 +0800 Subject: [Beowulf] RoCE vs. InfiniBand In-Reply-To: References: <4143909.Z0dSWV6CRx@deepblue> <3249207.C5SgPnEN8U@deepblue> Message-ID: Morning I strongly suggest you get Mellanox to come in help with the initial config. Their technical teams are great and they know what they are doing. We run OpenMPI with UCX on top of Mellanox Multi-Host ethernet network. The setup required a few parameters on each switch and away we went. At least with RoCE you can have ALL your equipment on the 1 network (storage, desktops, cluster etc). You don't need to have dual setups for storage (so you can access it via infiniband for the cluster and ethernet for other devices). We run a >10k node cluster with a full L3 RoCE network and it performs wonderfully and reliably. On the switches we do roce lossy interface ethernet * qos trust both interface ethernet * traffic-class 3 congestion-control ecn minimum-relative 75 maximum-relative 95 and on the hosts (connectX5) we do /bin/mlnx_qos -i "$eth" --trust dscp /bin/mlnx_qos -i "$eth" --prio2buffer=0,0,0,0,0,0,0,0 /bin/mlnx_qos -i "$eth" --pfc 0,0,0,0,0,0,0,0 /bin/mlnx_qos -i "$eth" --buffer_size=524160,0,0,0,0,0,0,0 # mlxreg requires the existence of /etc/mft/mft.conf (even though we don't need it) mft_config="/etc/mft/mft.conf" mkdir -p "${mft_config%/*}" && touch "$mft_config" source /sys/class/net/${eth}/device/uevent mlxreg -d $PCI_SLOT_NAME -reg_name ROCE_ACCL --set "roce_adp_retrans_en=0x1,roce_tx_window_en=0x1,roce_slow_restart_en=0x0" --yes echo 1 > /proc/sys/net/ipv4/tcp_ecn Stu. On Thu, Sep 30, 2021 at 5:26 AM Lohit Valleru wrote: > Hello Everyone.. > > I am now at a similar confusion between what to choose for a new cluster - > ROCE vs Infiniband. > My experience with ROCE when I tried it recently was that it was not easy > to set it up. It required me to set up qos for lossless fabric, and pfc for > flow control. On top of that - it required me to decide how much buffer to > dedicate for each class. 
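A handful of read-only checks can tell you whether settings like the ones Stu lists above actually took effect. This is only a sketch: the interface and device names (eth0, mlx5_0, the peer host) are placeholders, show_gids is the helper script shipped with MLNX_OFED, and counter names vary by ConnectX generation and driver.

# current trust mode, PFC and buffer configuration on the port
mlnx_qos -i eth0
# which GIDs exist and whether they are RoCE v1 or v2
show_gids mlx5_0
# ECN/CNP counters, to confirm congestion marking is really happening
ethtool -S eth0 | grep -iE 'ecn|cnp'
# end-to-end RDMA check with perftest over RDMA-CM:
# run "ib_write_bw -d mlx5_0 -R" on the peer first, then point the client at it
ib_write_bw -d mlx5_0 -R <peer>
ib_write_lat -d mlx5_0 -R <peer>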
> I am not a networking expert, and it has been very difficult for me to > understand how much of a buffer would be enough, in addition to how to > monitor buffers, and understand that I would need to dedicate more buffers. > Following a few mellanox documents - I think i did enable ROCE however i > am not sure if i had set it up the right way. Because I was not able to > make it reliably work with GPFS or any MPI applications. > > In the past, when I worked with Infiniband, it was a breeze to set it up, > and make it work with GPFS and other MPI applications. I did have issues > with Infiniband errors, which were not easy to debug - but other than that, > management and setup of infiniband seemed to be very easy. > > Currently - We have a lot of ethernet networking issues, where we see many > discards and retransmits of packets, leading to GPFS complaining about the > same and remounting the filesystem or expelling the filesystem. In > addition, I see IO performance issues. > The reason I was told that it might be because of a low buffer 100Gb > switch, and a deep buffer 100Gb switch might solve the issue. > However - I could not prove the same with respect to buffers. > We enabled ECN and it seemed to make the network a bit stable but the > performance is still lacking. > Most of the issues were because of IO between Storage and Clients, where > multiple Storage servers are able to give out a lot more network bandwidth > than a single client could take. > > So I was thinking that a better solution, with least setup issues and > debugging would be to have both ethernet and infiniband. Ethernet for > administrative traffic and Infiniband for data traffic. > However, the argument is to why not use ROCE instead of infiniband. > > When it comes to ROCE - I find it very difficult to find documentation on > how to set things up the correct way and debug issues with buffering/flow > control. > I do see that Mellanox has some documentation with respect to ROCE, but it > has not been very clear. > > I have understood that ROCE would mostly be beneficial when it comes to > long distances, or when we might need to route between ethernet and > infiniband. > > May I know if anyone could help me understand with their experience, on > what they would choose if they build a new cluster, and why would that be. > Would ROCE be easier to setup,manage and debug or Infiniband? > > As of now - the new cluster is going to be within a single data center, > and it might span to about 500 nodes with 4 GPUs and 64 cores each. > We might get storages ( with multiple storage servers containing from 2 - > 6 ConnectX6 per server) that can do 420GB/s or more, and clients with > either a single ConnectX6 - 100G or 8 ConnectX6 cards. > > For ROCE - May i know if anyone could help me point to the respective > documentation that could help me learn on how to set it up and debug it > correctly. > > Thank you, > Lohit > > On Sun, Jan 17, 2021 at 8:07 PM Stu Midgley wrote: > >> Morning (Hi Gilad) >> >> We run RoCE over Mellanox 100G Ethernet and get 1.3us latency for the >> shortest hop. Increasing slightly as you go through the fabric. >> >> We run ethernet for a full dual-plane fat-tree :) It is 100% possible >> with Mellanox :) >> >> We love it. >> >> >> On Fri, Jan 15, 2021 at 8:40 PM J?rg Sa?mannshausen < >> sassy-work at sassy.formativ.net> wrote: >> >>> Hi Gilad, >>> >>> thanks for the feedback, much appreciated. >>> In an ideal world, you are right of course. 
OpenStack is supported >>> natively on >>> InfiniBand, and you can get the MetroX system to connect between two >>> different >>> sites (I leave it open of how to read that) etc. >>> >>> However, in the real world all of that needs to fit into a budget. From >>> what I >>> can see on the cluster, most jobs are in the region between 64 and 128 >>> cores. >>> So, that raises the question for that rather small amount of cores, do >>> we >>> really need InfiniBand or can we do what we need to do with RoCE v2? >>> >>> In other words, for the same budget, does it make sense to remove the >>> InfiniBand part of the design and get say one GPU box in instead? >>> >>> What I want to avoid is to make the wrong decision (cheap and cheerful) >>> and >>> ending up with a badly designed cluster later. >>> >>> As you mentioned MetroX: remind me please, what kind of cable does it >>> need? Is >>> that something special or can we use already existing cables, whatever >>> is used >>> between data centre sites (sic!)? >>> >>> We had a chat with Darren about that which was, as always talking to >>> your >>> colleague Darren, very helpful. I remember very distinct there was a >>> reason >>> why we went for the InfiniBand/RoCE solution but I cannot really >>> remember it. >>> It was something with the GPU boxes we want to buy as well. >>> >>> I will pass your comments on to my colleague next week when I am back at >>> work >>> and see what they say. So many thanks for your sentiments here which are >>> much >>> appreciated from me! >>> >>> All the best from a cold London >>> >>> J?rg >>> >>> Am Donnerstag, 26. November 2020, 12:51:55 GMT schrieb Gilad Shainer: >>> > Let me try to help: >>> > >>> > - OpenStack is supported natively on InfiniBand already, >>> therefore >>> > there is no need to go to Ethernet for that >>> >>> > - File system wise, you can have IB file system, and connect >>> > directly to IB system. >>> >>> > - Depends on the distance, you can run 2Km IB between >>> switches, or >>> > use Mellanox MetroX for connecting over 40Km. VicinityIO have system >>> that >>> > go over thousands of miles? >>> >>> > - IB advantages are with much lower latency (switches alone >>> are 3X >>> > lower latency), cost effectiveness (for the same speed, IB switches are >>> > more cost effective than Ethernet) and the In-Network Computing engines >>> > (MPI reduction operations, Tag Matching run on the network) >>> >>> > If you need help, feel free to contact directly. >>> > >>> > Regards, >>> > Gilad Shainer >>> > >>> > From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of John >>> Hearns >>> > Sent: Thursday, November 26, 2020 3:42 AM >>> > To: J?rg Sa?mannshausen ; Beowulf >>> Mailing >>> > List >>> Subject: Re: [Beowulf] RoCE vs. InfiniBand >>> > >>> > External email: Use caution opening links or attachments >>> > >>> > Jorg, I think I might know where the Lustre storage is ! >>> > It is possible to install storage routers, so you could route between >>> > ethernet and infiniband. >>> It is also worth saying that Mellanox have Metro >>> > Infiniband switches - though I do not think they go as far as the west >>> of >>> > London! >>> > Seriously though , you ask about RoCE. I will stick my neck out and >>> say yes, >>> > if you are planning an Openstack cluster >>> with the intention of having >>> > mixed AI and 'traditional' HPC workloads I would go for a RoCE style >>> setup. >>> > In fact I am on a discussion about a new project for a customer with >>> > similar aims in an hours time. 
>>> > I could get some benchmarking time if you want to do a direct >>> comparison of >>> > Gromacs on IB / RoCE >>> >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > On Thu, 26 Nov 2020 at 11:14, J?rg Sa?mannshausen >>> > > >>> > wrote: >>> Dear all, >>> > >>> > as the DNS problems have been solve (many thanks for doing this!), I >>> was >>> > wondering if people on the list have some experiences with this >>> question: >>> > >>> > We are currently in the process to purchase a new cluster and we want >>> to >>> > use >>> OpenStack for the whole management of the cluster. Part of the cluster >>> > will run HPC applications like GROMACS for example, other parts typical >>> > OpenStack applications like VM. We also are implementing a Data Safe >>> Haven >>> > for the more sensitive data we are aiming to process. Of course, we >>> want to >>> > have a decent size GPU partition as well! >>> > >>> > Now, traditionally I would say that we are going for InfiniBand. >>> However, >>> > for >>> reasons I don't want to go into right now, our existing file storage >>> > (Lustre) will be in a different location. Thus, we decided to go for >>> RoCE >>> > for the file storage and InfiniBand for the HPC applications. >>> > >>> > The point I am struggling is to understand if this is really the best >>> of >>> > the >>> solution or given that we are not building a 100k node cluster, we >>> > could use RoCE for the few nodes which are doing parallel, read MPI, >>> jobs >>> > too. I have a nagging feeling that I am missing something if we are >>> moving >>> > to pure RoCE and ditch the InfiniBand. We got a mixed workload, from >>> ML/AI >>> > to MPI applications like GROMACS to pipelines like they are used in the >>> > bioinformatic corner. We are not planning to partition the GPUs, the >>> > current design model is to have only 2 GPUs in a chassis. >>> > So, is there something I am missing or is the stomach feeling I have >>> really >>> > a >>> lust for some sushi? :-) >>> > >>> > Thanks for your sentiments here, much welcome! 
>>> > >>> > All the best from a dull London >>> > >>> > Jörg >>> > >>> > >>> > >>> > _______________________________________________ >>> > Beowulf mailing list, Beowulf at beowulf.org >>> > sponsored by Penguin Computing >>> To change your subscription (digest mode or >>> > unsubscribe) visit >>> > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >>> >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >>> >> >> >> -- >> Dr Stuart Midgley >> sdm900 at gmail.com >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> > -- Dr Stuart Midgley sdm900 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at gmail.com Thu Sep 30 07:51:40 2021 From: hearnsj at gmail.com (John Hearns) Date: Thu, 30 Sep 2021 08:51:40 +0100 Subject: [Beowulf] Data Destruction In-Reply-To: References: <940bd926-ea59-49c8-6fa2-2c4dcf61ef32@cfa.harvard.edu> <20210929145717.of2td6sm2abnnk4c@angmar.local> <3450927.8eZotpTBpl@deepblue> Message-ID: I once had an RMA case for a failed tape with Spectralogic. To prove it was destroyed and not re-used I asked the workshop guys to put it through a bandsaw, then sent off the pictures. On Wed, 29 Sept 2021 at 16:47, Ellis Wilson wrote: > On 9/29/21 11:41 AM, Jörg Saßmannshausen wrote: > > If you still need more, don't store the data at all but print it out on > paper > > and destroy it by means of incineration. :D > > I have heard stories from past colleagues of one large US Lab putting > their HDDs through wood chippers with magnets on the chipped side to > kill the bits good and dead. As a storage fanatic that always struck me > as something I'd have loved to see. > > Best, > > ellis > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.wender at web.de Thu Sep 30 08:07:13 2021 From: j.wender at web.de (Jan Wender) Date: Thu, 30 Sep 2021 10:07:13 +0200 Subject: [Beowulf] Data Destruction In-Reply-To: References: Message-ID: One of our customers is giving the disks to the on-site firefighters to be used in training exercises. On 30 September 2021 at 09:51:40, John Hearns (hearnsj at gmail.com) wrote: > I once had an RMA case for a failed tape with Spectralogic. To prove it was destroyed and not re-used I asked the workshop guys to put it through a bandsaw, then sent off the pictures.
> > On Wed, 29 Sept 2021 at 16:47, Ellis Wilson wrote: > > On 9/29/21 11:41 AM, J?rg Sa?mannshausen wrote: > > > If you still need more, don't store the data at all but print it out on paper > > > and destroy it by means of incineration. :D > > > > I have heard stories from past colleagues of one large US Lab putting > > their HDDs through wood chippers with magnets on the chipped side to > > kill the bits good and dead. As a storage fanatic that always struck me > > as something I'd have loved to see. > > > > Best, > > > > ellis > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org(mailto:Beowulf at beowulf.org) sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -- Jan Wender - j.wender at web.de(mailto:j.wender at web.de) - Signal/Mobile: +4915780949428(tel:+4915780949428) - Threema EPD4T5B4 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pbisbal at pppl.gov Thu Sep 30 15:59:08 2021 From: pbisbal at pppl.gov (Prentice Bisbal) Date: Thu, 30 Sep 2021 11:59:08 -0400 Subject: [Beowulf] [External] Re: Data Destruction In-Reply-To: References: <0ae72287-b539-5c60-006e-8bbf843c9936@cfa.harvard.edu> Message-ID: Don't forget about tape backups. Was any of the data covered by the DUA backed up to tape? How do you deal with that? Prentice On 9/29/21 10:49 AM, Paul Edmon via Beowulf wrote: > > Yeah, that's what we were surmising.? But paranoia and compliance > being what it is we were curious what others were doing. > > -Paul Edmon- > > On 9/29/2021 10:32 AM, Renfro, Michael wrote: >> >> I have to wonder if the intent of the DUA is to keep physical media >> from winding up in the wrong hands. If so, if the servers hosting the >> parallel filesystem (or a normal single file server) is physically >> secured in a data center, and the drives are destroyed on >> decommissioning, that might satisfy the requirements. >> >> *From: *Beowulf on behalf of Paul Edmon >> via Beowulf >> *Date: *Wednesday, September 29, 2021 at 9:15 AM >> *To: *Scott Atchley >> *Cc: *Beowulf Mailing List >> *Subject: *Re: [Beowulf] Data Destruction >> >> *External Email Warning* >> >> *This email originated from outside the university. Please use >> caution when opening attachments, clicking links, or responding to >> requests.* >> >> ------------------------------------------------------------------------ >> >> The former.? We are curious how to selectively delete data from a >> parallel filesystem.? For example we commonly use Lustre, ceph, and >> Isilon in our environment.? That said if other types allow for easier >> destruction of selective data we would be interested in hearing about it. >> >> -Paul Edmon- >> >> On 9/29/2021 10:06 AM, Scott Atchley wrote: >> >> Are you asking about selectively deleting data from a parallel >> file system (PFS) or destroying drives after removal from the >> system either due to failure or system decommissioning? >> >> For the latter, DOE does not allow us to send any non-volatile >> media offsite once it has had user data on it. When we are done >> with?drives, we have a very big shredder. 
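Where shredders and bandsaws are not an option, NIST 800-88 "purge" can usually be done in software with the drive's own erase commands, which for self-encrypting drives is essentially the key-revving operation Ellis describes earlier. A rough sketch only: device names are placeholders, option support differs per drive and per hdparm/nvme-cli version, and none of this addresses selective deletion on a live parallel filesystem.

# SATA drives: ATA security erase (enhanced erase where supported)
hdparm --user-master u --security-set-pass tmppass /dev/sdX
hdparm --user-master u --security-erase-enhanced tmppass /dev/sdX
# NVMe drives: cryptographic erase (--ses=2) or block erase (--ses=1)
nvme format /dev/nvme0n1 --ses=2
# spot-check that old data no longer reads back
dd if=/dev/sdX bs=1M count=16 status=none | hexdump -C | head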
>> >> On Wed, Sep 29, 2021 at 9:59 AM Paul Edmon via Beowulf >> > wrote: >> >> Occassionally we get DUA (Data Use Agreement) requests for >> sensitive >> data that require data destruction (e.g. NIST 800-88). We've >> been >> struggling with how to handle this in an era of distributed >> filesystems >> and disks.? We were curious how other people handle requests >> like this? >> What types of filesystems to people generally use for this >> and how do >> people ensure destruction?? Do these types of DUA's preclude >> certain >> storage technologies from consideration or are there creative >> ways to >> comply using more common scalable filesystems? >> >> Thanks in advance for the info. >> >> -Paul Edmon- >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org >> sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) >> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf >> >> >> >> _______________________________________________ >> Beowulf mailing list,Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visithttps://beowulf.org/cgi-bin/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL:
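On the tape question, one approach that keeps a DUA satisfiable without recalling media is to make sure only ciphertext is ever handed to the backup system, so that destroying the key destroys every copy at once. A minimal illustration with stock tools; the paths are made up, and a production setup would more likely rely on the backup software's key management or LTO hardware encryption.

# stage the sensitive project as an encrypted archive before the backup run picks it up
tar cf - /scratch/project_x | gpg --symmetric --cipher-algo AES256 -o /staging/project_x.tar.gpg
# destroying the passphrase/key later renders every tape copy of the archive unreadable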