<div dir="ltr"><div><div><div><div><div><div>Regarding storage, Chris Dagdigian comments:<br><br>>And you know what? After the Isilon NAS was deployed the management of 

*many* petabytes of single-namespace storage was now handled by the IT 

Director in his 'spare time' -- And the five engineers who used to do 

nothing but keep ZFS from falling over were re-assigned to more 

impactful and presumably more fun/interesting work.<br>

<br></div>The person who runs the huge JASMIN climate research project in the UK makes the same comment, only with Panasas storage.<br></div>He is able to manage petabytes of Panasas storage with himself and one other person. A lot of that storage was installed by my fair hands.<br></div>To be honest, though, installing Panasas is a matter of how fast you can unbox the blades  (*)<br><br></div>(*) Well, that is not so in real life! During that install we had several 'funnies' - all of which were diagnosed, with a fix supplied, by the superb Panasas support.<br></div>Including the shelf where, after replacing every component over a period of two weeks - something like Trigger's Broom <a href="http://foolsandhorses.weebly.com/triggers-broom.html">http://foolsandhorses.weebly.com/triggers-broom.html</a><br></div><div>we at last found the bent pin in the multiway connector (ahem)<br></div><div><div><br><br><div><br><br></div></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 3 May 2018 at 09:23, John Hearns <span dir="ltr"><<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div>Jorg,  I did not know that you used Bright.  Or I may have forgotten!<br></div>I thought you were a Debian fan.  Of relevance, Bright 8 now supports Debian.<br><br></div><div>You commented on the Slurm configuration file being changed.<br></div><div>I found during the install at Greenwich, where we put in a custom slurm.conf, that Bright has an option<br></div><div>to 'freeze' files. This is defined in the cmd.conf file.  
So if new nodes are added, or other changes made,<br></div><div>the slurm.conf file is left unchanged and you have to manually manage it.<br></div><div>I am not 100% sure what happens with an update of the RPMs, but I would imagine the freeze state is respected.<br></div><span class=""><div><br><br>

>I should add I am working in academia and I know little about the commercial <br>

>world here. Having said that, my friends in commerce are telling me that the <br>

>company likes to outsource as it is 'cheaper'. <br></div></span>I would not say cheaper. However (see below) HPC skills are scarce.<br></div>And if you are in industry you commit to your management that HPC resources will be up and running<br></div>for XX % of a year - i.e. you have some explaining to do if there is extended downtime.<br></div><div>HPC is looked upon as something comparable to machine tools - in Formula 1 we competed for budget against<br></div><div>five-axis milling machines, for instance. Can you imagine what would happen if the machine shop supervisor said<br></div><div>"Sorry - no parts being made today. My guys have the covers off and we are replacing one of the motors with one we got off eBay"<br></div><div><br></div><div><br></div>So yes, you do want commercial support for aspects of your setup - let us say that jobs are going into hold states <br></div>on your batch system, or jobs are immediately terminating. Do you:<br><br></div>a) spend all day going through logs with a fine-tooth comb, and send out an email to the Slurm/PBS/SGE list and hope you get<br></div>some sort of answer<br><br></div>b) take a dump of the relevant logs and get a ticket opened with your support people<br><br></div>Actually in real life you do both, but path (b) is going to get you up and running quicker.<br><br></div>Also for storage, in industry you really want support on your storage.<span class=""><br><br><br><div><br><br>>Anyhow, I don't want to digress here too much. However, "..do HPC work in <br>

<span class="m_-3500645263727097909gmail-im">>commercial environments where the skills simply don't exist onsite."<br>

</span>>Are we a dying art?<br><br></div></span><div>Jorg, yes. HPC skills are rare, as are the people who take the time and trouble to learn deeply about the systems they operate.<br>I know this as recruitment consultants tell me this regularly.<br></div><div>I find that often in life people do the minimum they need, and once they are given instructions they never change,<br></div><div>even when the configuration steps they carry out have lost meaning. <br></div><div>I have met that attitude in several companies. Echoing Richard Feynman I call this 'cargo cult systems'<br>The people like you who are willing to continually learn and to abandon old ways of work<br></div><div>are invaluable.<br></div><div><br></div><div>I am consulting at the moment with a biotech firm in Denmark. Replying to Chris Dagdigian, this company does have excellent in-house<br></div><div>Linux skills, so I suppose is the exception to the rule!<br></div><div><br></div><div><br><br><br><br><br><div><br><br><br><br><br><br><br><br><br><div><div><div><div><div><br><br><div><div><div><br><br><br><br><br><br></div></div></div></div></div></div></div></div></div></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On 2 May 2018 at 23:04, Jörg Saßmannshausen <span dir="ltr"><<a href="mailto:sassy-work@sassy.formativ.net" target="_blank">sassy-work@sassy.formativ.net</a><wbr>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear Chris,<br>

<br>

further to your email:<br>

<span><br>

> - And if miracles occur and they do have expert level linux people then<br>

> more often than not these people are overworked or stretched in many<br>

> directions<br>

<br>

</span>This is exactly what has happened to me at the old work place: pulled into too <br>

many different directions. <br>

<br>

I am a bit surprised about the ZFS experiences. Although I did not have <br>

petabytes of storage and I did not generate 300 TB per week, I did have a <br>

fairly large storage space running on xfs and ext4 for backups and <br>

provisioning of file space. Some of it was running on old hardware (please sit <br>

down, I am talking about me messing around with SCSI cables) and I gradually <br>

upgraded to newer hardware. So, I am not quite sure what went wrong with the ZFS <br>

storage here. <br>

<br>

However, there is a common trend, at least what I observe here in the UK, to <br>

out-source problems: pass the buck to somebody else and we pay for it. <br>

I personally still favour an in-house expert over an out-sourced person <br>

who may or may not be able to understand what you are doing. <br>

I should add I am working in academia and I know little about the commercial <br>

world here. Having said that, my friends in commerce are telling me that the <br>

company likes to outsource as it is 'cheaper'. <br>

I agree with the Linux expertise. I think I am one of the two who are Linux <br>

admins in the present work place. The official line is: we do not support Linux <br>

(but we teach it). <br>

<br>

Anyhow, I don't want to digress here too much. However, "..do HPC work in <br>

<span>commercial environments where the skills simply don't exist onsite."<br>

</span>Are we a dying art?<br>

<br>

My 1 shilling here from a still cold and dark London.<br>

<span class="m_-3500645263727097909HOEnZb"><font color="#888888"><br>

Jörg<br>

</font></span><div class="m_-3500645263727097909HOEnZb"><div class="m_-3500645263727097909h5"><br>

<br>

<br>

On Wednesday, 2 May 2018, 16:19:48 BST, Chris Dagdigian wrote:<br>

> Jeff White wrote:<br>

> > I never used Bright.  Touched it and talked to a salesperson at a<br>

> > conference but I wasn't impressed.<br>

> > <br>

> > Unpopular opinion: I don't see a point in using "cluster managers"<br>

> > unless you have a very tiny cluster and zero Linux experience.  These<br>

> > are just Linux boxes with a couple applications (e.g. Slurm) running<br>

> > on them.  Nothing special. xcat/Warewulf/Scyld/Rocks just get in the<br>

> > way more than they help IMO.  They are mostly crappy wrappers around<br>

> > free software (e.g. ISC's dhcpd) anyway.  When they aren't it's<br>

> > proprietary trash.<br>

> > <br>

> > I install CentOS nodes and use<br>

> > Salt/Chef/Puppet/Ansible/WhoCa<wbr>res/Whatever to plop down my configs and<br>

> > software.  This also means I'm not stuck with "node images" and can<br>

> > instead build everything as plain old text files (read: write<br>

> > SaltStack states), update them at will, and push changes any time.  My<br>

> > "base image" is CentOS and I need no "baby's first cluster" HPC<br>

> > software to install/PXEboot it.  YMMV<br>

> <br>

> Totally legit opinion and probably not unpopular at all given the user<br>

> mix on this list!<br>

> <br>

> The issue here is assuming a level of domain expertise with Linux,<br>

> bare-metal provisioning, DevOps and (most importantly) HPC-specific<br>

> configStuff that may be pervasive or easily available in your<br>

> environment but is often not easily available in a<br>

> commercial/industrial  environment where HPC or "scientific computing"<br>

> is just another business area that a large central IT organization must<br>

> support.<br>

> <br>

> If you have that level of expertise available then the self-managed DIY<br>

> method is best. It's also my preference<br>

> <br>

> But in the commercial world where HPC is becoming more and more<br>

> important you run into stuff like:<br>

> <br>

> - Central IT may not actually have anyone on staff who knows Linux (more<br>

> common than you expect; I see this in Pharma/Biotech all the time)<br>

> <br>

> - The HPC user base is not given budget or resource to self-support<br>

> their own stack because of a drive to centralize IT ops and support<br>

> <br>

> - And if they do have Linux people on staff they may be novice-level<br>

> people or have zero experience with HPC schedulers, MPI fabric tweaking<br>

> and app needs (the domain stuff)<br>

> <br>

> - And if miracles occur and they do have expert level linux people then<br>

> more often than not these people are overworked or stretched in many<br>

> directions<br>

> <br>

> <br>

> So what happens in these environments is that organizations will<br>

> willingly (and happily) pay commercial pricing and adopt closed-source<br>

> products if they can deliver a measurable reduction in administrative<br>

> burden, operational effort or support burden.<br>

> <br>

> This is where Bright, Univa etc. all come in -- you can buy stuff from<br>

> them that dramatically reduces what onsite/local IT has to manage the<br>

> care and feeding of.<br>

> <br>

> Just having a vendor to call for support on Grid Engine oddities makes<br>

> the cost of Univa licensing worthwhile. Just having a vendor like Bright<br>

> be on the hook for "cluster operations" is a huge win for an overworked<br>

> IT staff that does not have linux or HPC specialists on-staff or easily<br>

> available.<br>

> <br>

> My best example of "paying to reduce operational burden in HPC" comes<br>

> from a massive well known genome shop in the Cambridge, MA area. They<br>

> often tell this story:<br>

> <br>

> - 300 TB of new data generation per week (many years ago)<br>

> - One of the initial storage tiers was ZFS running on commodity server<br>

> hardware<br>

> - Keeping the DIY ZFS appliances online and running took the FULL TIME<br>

> efforts of FIVE STORAGE ENGINEERS<br>

> <br>

> They realized that staff support was not scalable with DIY/ZFS at<br>

> 300TB/week of new data generation so they went out and bought a giant<br>

> EMC Isilon scale-out NAS platform<br>

> <br>

> And you know what? After the Isilon NAS was deployed the management of<br>

> *many* petabytes of single-namespace storage was now handled by the IT<br>

> Director in his 'spare time' -- And the five engineers who used to do<br>

> nothing but keep ZFS from falling over were re-assigned to more<br>

> impactful and presumably more fun/interesting work.<br>

> <br>

> <br>

> They actually went on stage at several conferences and told the story of<br>

> how Isilon allowed senior IT leadership to manage petabyte volumes of<br>

> data "in their spare time" -- this was a huge deal and really resonated.<br>

> Really reinforced for me how in some cases it's actually a good idea<br>

> to pay $$$ for commercial stuff if it delivers gains in<br>

> ops/support/management.<br>

> <br>

> <br>

> Sorry to digress! This is a topic near and dear to me. I often have to<br>

> do HPC work in commercial environments where the skills simply don't<br>

> exist onsite. Or more commonly -- they have budget to buy software or<br>

> hardware but they are under a hiring freeze and are not allowed to bring<br>

> in new Humans.<br>

> <br>

> Quite a bit of my work on projects like this is helping people make<br>

> sober decisions regarding "build" or "buy" -- and in those environments<br>

> it's totally clear that for some things it makes sense for them to pay<br>

> for an expensive commercially supported "thing" that they don't have to<br>

> manage or support themselves<br>

> <br>

> <br>

> My $.02 ...<br>

> <br>


> ______________________________<wbr>_________________<br>

> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

> To change your subscription (digest mode or unsubscribe) visit<br>

> <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/mailman<wbr>/listinfo/beowulf</a><br>

<br>


</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br></div>