From gus at ldeo.columbia.edu Thu Dec 4 09:43:38 2014 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 04 Dec 2014 12:43:38 -0500 Subject: [Beowulf] How to keep your chips neat ( `a la CERN) Message-ID: <54809D4A.1090007@ldeo.columbia.edu> http://cds.cern.ch/journal/CERNBulletin/2014/49/News%20Articles/1971984?ln=en ... and I always thought that air cans, a little alcohol, Q-tips, wipes, a household vacuum cleaner, were up to the task ... Gus Correa From prentice.bisbal at rutgers.edu Tue Dec 23 09:12:01 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Tue, 23 Dec 2014 12:12:01 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS Message-ID: <5499A261.6070503@rutgers.edu> Beowulfers, I have limited experience managing parallel filesytems like GPFS or Lustre. I was discussing putting /home and /usr/local for my cluster on a GPFS or Lustre filesystem, in addition to using it just for /scratch. I've never done this before, but it doesn't seem like all that bad an idea. My logic for this is the following: 1. Users often try to run programs from in /home, which leads to errors, no matter how many times I tell them not to do that. This would make the system more user-friendly. I could use quotas/policies to encourage them to use 'steer' them to use other filesystems if needed. 2. Having one storage system to manage is much better than 3. 3. Profit? Anyway, another person in the conversation felt that this would be bad, because if someone was running a job that would hammer the fileystem, it would make the filesystem unresponsive, and keep other people from logging in and doing work. I'm not buying this concern for the following reasons: If a job can hammer your parallel filesystem so that the login nodes become unresponsive, you've got bigger problems, because that means other jobs can't run on the cluster, and the job hitting the filesystem hard has probably slowed down to a crawl, too. I know there are some concerns with the stability of parallel filesystems, so if someone wants to comment on the dangers of that, too, I'm all ears. I think that the relative instability of parallel filesystems compared to NFS would be the biggest concern, not performance. -- Prentice Bisbal Manager of Information Technology Rutgers Discovery Informatics Institute (RDI2) Rutgers University http://rdi2.rutgers.edu From jeff.johnson at aeoncomputing.com Tue Dec 23 09:22:04 2014 From: jeff.johnson at aeoncomputing.com (Jeff Johnson) Date: Tue, 23 Dec 2014 09:22:04 -0800 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A261.6070503@rutgers.edu> References: <5499A261.6070503@rutgers.edu> Message-ID: <5499A4BC.5040204@aeoncomputing.com> 1. A little administrative 'tough love' isn't a bad thing. This is even if you unify everything under Lustre or GPFS. That same user could use up all of the inodes in your Lustre MDT just as easily as they can implode your NFS with reckless usage. I have seen several instances of /home running on Lustre. Just know the tradeoffs up front and if you are comfortable with them do it. Given the small block random I/O challenges in Lustre it can be a more robust approach to have places where different I/O can be run. (NFS and Lustre filesystems). That all depends on your NFS infrastructure being able to endure normal /home usage and small-block/random jobs. Just my $.02 worth. On 12/23/14 9:12 AM, Prentice Bisbal wrote: > Beowulfers, > > I have limited experience managing parallel filesytems like GPFS or > Lustre. 
I was discussing putting /home and /usr/local for my cluster > on a GPFS or Lustre filesystem, in addition to using it just for > /scratch. I've never done this before, but it doesn't seem like all > that bad an idea. My logic for this is the following: > > 1. Users often try to run programs from in /home, which leads to > errors, no matter how many times I tell them not to do that. This > would make the system more user-friendly. I could use quotas/policies > to encourage them to use 'steer' them to use other filesystems if needed. > > 2. Having one storage system to manage is much better than 3. > > 3. Profit? > > Anyway, another person in the conversation felt that this would be > bad, because if someone was running a job that would hammer the > fileystem, it would make the filesystem unresponsive, and keep other > people from logging in and doing work. I'm not buying this concern for > the following reasons: > > If a job can hammer your parallel filesystem so that the login nodes > become unresponsive, you've got bigger problems, because that means > other jobs can't run on the cluster, and the job hitting the > filesystem hard has probably slowed down to a crawl, too. > > I know there are some concerns with the stability of parallel > filesystems, so if someone wants to comment on the dangers of that, > too, I'm all ears. I think that the relative instability of parallel > filesystems compared to NFS would be the biggest concern, not > performance. > -- ------------------------------ Jeff Johnson Co-Founder Aeon Computing jeff.johnson at aeoncomputing.com www.aeoncomputing.com t: 858-412-3810 x1001 f: 858-412-3845 m: 619-204-9061 4170 Morena Boulevard, Suite D - San Diego, CA 92117 High-performance Computing / Lustre Filesystems / Scale-out Storage From landman at scalableinformatics.com Tue Dec 23 09:33:53 2014 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 23 Dec 2014 12:33:53 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A261.6070503@rutgers.edu> References: <5499A261.6070503@rutgers.edu> Message-ID: <5499A781.6040006@scalableinformatics.com> On 12/23/2014 12:12 PM, Prentice Bisbal wrote: > Beowulfers, > > I have limited experience managing parallel filesytems like GPFS or > Lustre. I was discussing putting /home and /usr/local for my cluster > on a GPFS or Lustre filesystem, in addition to using it just for > /scratch. I've never done this before, but it doesn't seem like all > that bad an idea. My logic for this is the following: This is not a great idea ... > > 1. Users often try to run programs from in /home, which leads to > errors, no matter how many times I tell them not to do that. This > would make the system more user-friendly. I could use quotas/policies > to encourage them to use 'steer' them to use other filesystems if needed. This is an educational problem more than anything else. You could easily set up their bashrc/others to cd to $SCRATCH on login, or process startup. > > 2. Having one storage system to manage is much better than 3. True, though having one system increases the need for stability and performance of that one file system. > > 3. Profit? > > Anyway, another person in the conversation felt that this would be > bad, because if someone was running a job that would hammer the > fileystem, it would make the filesystem unresponsive, and keep other > people from logging in and doing work. I'm not buying this concern for > the following This happens. I've seen it happen. Many people have seen this happen. 
It does happen. > reasons: > > If a job can hammer your parallel filesystem so that the login nodes > become unresponsive, you've got bigger problems, because that means > other jobs can't run on the cluster, and the job hitting the > filesystem hard has probably slowed down to a crawl, too. Note that "hitting the file system hard" could be a) an IOP storm (millions of small files, think bioinformatics/proteomics/*omics code), which starves the rest of the system from IOP standpoint. Its always fun to watch these, for a schadenfreude definition of the word "fun". Just try doing a 'df -h .' on a directory on a file system being hammered in an IOP storm. This is pretty much the definition of Denial Of Service. Its very annoying when your users are denied service. b) sudden massive bolus on the part of a cluster job *really* makes peoples vi and other sessions ... surprising (again with that schadenfreude definition of the word "surprising"). IOPs and stats may work, but so much bandwidth and bulk data is flowing, that your storage systems can't keep up. This happens less frequently, and with a good design/implementation you can largely mitigate this (c.f. https://scalability.org/images/dash-3.png ) c) scratch down time for any reason now prevents users from using the system. That is, the failure radius is now much larger and less localized, impacting people whom might be able to otherwise work. /home should generally be on a reasonably fast and very stable platform. Apply quotas, and active LARTing, with daily/weekly/monthly metrics so that it doesn't get abused. /usr/local, similar issue (though you can simple NFS ro mount that in most cases). You can do it, but beware the issues. > > I know there are some concerns with the stability of parallel > filesystems, so if someone wants to comment on the dangers of that, > too, I'm all ears. I think that the relative instability of parallel > filesystems compared to NFS would be the biggest concern, not > performance. > Performance is always concern (see the point "a" above). Most of these things can be handled with education, and some automation (login and batch automatically generate a temp directory, and chdir the user into it, with a failure test built into the login, so if the PFS is down, it will revert to $HOME ). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com twtr : @scalableinfo phone: +1 734 786 8423 x121 cell : +1 734 612 4615 From ellis at cse.psu.edu Tue Dec 23 10:30:08 2014 From: ellis at cse.psu.edu (Ellis H. Wilson III) Date: Tue, 23 Dec 2014 13:30:08 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A781.6040006@scalableinformatics.com> References: <5499A261.6070503@rutgers.edu> <5499A781.6040006@scalableinformatics.com> Message-ID: <5499B4B0.5050407@cse.psu.edu> On 12/23/2014 12:33 PM, Joe Landman wrote: >> Anyway, another person in the conversation felt that this would be >> bad, because if someone was running a job that would hammer the >> fileystem, it would make the filesystem unresponsive, and keep other >> people from logging in and doing work. I'm not buying this concern for >> the following > > This happens. I've seen it happen. Many people have seen this happen. > It does happen. And if the same physical storage is underlying both /home and /scratch, regardless of what file systems you have running there, you (and your users) are going to eventually be stricken with the above. 
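For what it's worth, the login redirection Joe suggests above can be a very small piece of glue. A rough, untested sketch, assuming every user gets a $SCRATCH variable pointing at the parallel filesystem (the file name and timeout here are made up for illustration):

    # /etc/profile.d/scratch.sh -- hypothetical example, sourced at login
    # Drop interactive logins into scratch when it responds; otherwise stay in $HOME.
    if [ -n "$PS1" ] && [ -n "$SCRATCH" ] && timeout 5 df "$SCRATCH" >/dev/null 2>&1; then
        cd "$SCRATCH"
    else
        cd "$HOME"
    fi
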
If the filesystem you decide upon provides some form of prioritization (QoS/etc) you might be able to get away with a converged storage pool with prioritized /home, but I would still be leery of it. If I were in your shoes, separate /home and /scratch and more shock-treatment of stupid users would be my plan. If your scratch is considerably faster than home as it should be, that should be enough encouragement for your power users to migrate away from running on /home. You'll always have some limited set of users running basic stuff on /home. Not a big deal unless it's spread over dozens/hundreds/thousands of machines. Those are the users you need to convince running on scratch is the "right" way. The best way to convince them is to demonstrate how much faster it is. And if you have users writing tons and tons of tiny files, please go beat them on my behalf. Best, ellis From mdidomenico4 at gmail.com Tue Dec 23 10:35:30 2014 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Tue, 23 Dec 2014 13:35:30 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A261.6070503@rutgers.edu> References: <5499A261.6070503@rutgers.edu> Message-ID: I've always shied away from gpfs/lustre on /home and favoured netapp's for one simple reason. snapshots. i can't tell you home many times people have "accidentally" deleted a file. but yes, the "user education" about running jobs from /home usually happens at least once a year when someone new starts. we tend to publicly shame that person and they don't seem to do it anymore you never want to be "that guy" that slowed the whole system down... :) On Tue, Dec 23, 2014 at 12:12 PM, Prentice Bisbal wrote: > Beowulfers, > > I have limited experience managing parallel filesytems like GPFS or Lustre. > I was discussing putting /home and /usr/local for my cluster on a GPFS or > Lustre filesystem, in addition to using it just for /scratch. I've never > done this before, but it doesn't seem like all that bad an idea. My logic > for this is the following: > > 1. Users often try to run programs from in /home, which leads to errors, no > matter how many times I tell them not to do that. This would make the system > more user-friendly. I could use quotas/policies to encourage them to use > 'steer' them to use other filesystems if needed. > > 2. Having one storage system to manage is much better than 3. > > 3. Profit? > > Anyway, another person in the conversation felt that this would be bad, > because if someone was running a job that would hammer the fileystem, it > would make the filesystem unresponsive, and keep other people from logging > in and doing work. I'm not buying this concern for the following reasons: > > If a job can hammer your parallel filesystem so that the login nodes become > unresponsive, you've got bigger problems, because that means other jobs > can't run on the cluster, and the job hitting the filesystem hard has > probably slowed down to a crawl, too. > > I know there are some concerns with the stability of parallel filesystems, > so if someone wants to comment on the dangers of that, too, I'm all ears. I > think that the relative instability of parallel filesystems compared to NFS > would be the biggest concern, not performance. 
> > -- > Prentice Bisbal > Manager of Information Technology > Rutgers Discovery Informatics Institute (RDI2) > Rutgers University > http://rdi2.rutgers.edu > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From jchong at scinet.utoronto.ca Tue Dec 23 10:46:16 2014 From: jchong at scinet.utoronto.ca (Jason Chong) Date: Tue, 23 Dec 2014 13:46:16 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: References: <5499A261.6070503@rutgers.edu> Message-ID: <20141223184616.GF28027@gemini.scinet.utoronto.ca> On Tue, Dec 23, 2014 at 01:35:30PM -0500, Michael Di Domenico wrote: > I've always shied away from gpfs/lustre on /home and favoured netapp's > for one simple reason. snapshots. i can't tell you home many times > people have "accidentally" deleted a file. I actually do run GPFS as /home mostly for client scalability reason (have 4000 clients mounting /home on the cluster). However, we only mount it as read-only on compute node and usually user would see I/O failure and ask what is happening to prevent jobs from running from home. > > but yes, the "user education" about running jobs from /home usually > happens at least once a year when someone new starts. we tend to > publicly shame that person and they don't seem to do it anymore Definitely needs "user education" and I have seen one user killing parallel filesystem that way and everyone is frustrated. In some cases, we had to limit the number of jobs a user run to minimize their impact since they still would not listen and just want to get their code to run and finish the task they are doing without considering the consequences. Jason > > you never want to be "that guy" that slowed the whole system down... :) > > > > On Tue, Dec 23, 2014 at 12:12 PM, Prentice Bisbal > wrote: > > Beowulfers, > > > > I have limited experience managing parallel filesytems like GPFS or Lustre. > > I was discussing putting /home and /usr/local for my cluster on a GPFS or > > Lustre filesystem, in addition to using it just for /scratch. I've never > > done this before, but it doesn't seem like all that bad an idea. My logic > > for this is the following: > > > > 1. Users often try to run programs from in /home, which leads to errors, no > > matter how many times I tell them not to do that. This would make the system > > more user-friendly. I could use quotas/policies to encourage them to use > > 'steer' them to use other filesystems if needed. > > > > 2. Having one storage system to manage is much better than 3. > > > > 3. Profit? > > > > Anyway, another person in the conversation felt that this would be bad, > > because if someone was running a job that would hammer the fileystem, it > > would make the filesystem unresponsive, and keep other people from logging > > in and doing work. I'm not buying this concern for the following reasons: > > > > If a job can hammer your parallel filesystem so that the login nodes become > > unresponsive, you've got bigger problems, because that means other jobs > > can't run on the cluster, and the job hitting the filesystem hard has > > probably slowed down to a crawl, too. > > > > I know there are some concerns with the stability of parallel filesystems, > > so if someone wants to comment on the dangers of that, too, I'm all ears. 
I > > think that the relative instability of parallel filesystems compared to NFS > > would be the biggest concern, not performance. > > > > -- > > Prentice Bisbal > > Manager of Information Technology > > Rutgers Discovery Informatics Institute (RDI2) > > Rutgers University > > http://rdi2.rutgers.edu > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jason Chong Phone: 416-978-4157 Systems Administrator and Web App Developer http://www.scinethpc.ca Compute/Calcul Canada http://www.computecanada.ca From samuel at unimelb.edu.au Tue Dec 23 15:33:13 2014 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 24 Dec 2014 10:33:13 +1100 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A261.6070503@rutgers.edu> References: <5499A261.6070503@rutgers.edu> Message-ID: <5499FBB9.30006@unimelb.edu.au> On 24/12/14 04:12, Prentice Bisbal wrote: > I have limited experience managing parallel filesytems like GPFS or > Lustre. I was discussing putting /home and /usr/local for my cluster on > a GPFS or Lustre filesystem, in addition to using it just for /scratch. We've been using GPFS for project space (which includes our home directories) as well as our scratch and HSM filesystems since 2010 and haven't had any major issues. We've done upgrades of GPFS over that time, the ability to do rolling upgrades is really nice (plus we have redundant pairs of NSDs so we can do hardware maintenance). Basically: Project space: * Uses filesets with quotas to limit a projects overall usage * Uses GPFS snapshots both for easy file recovery and as a target for TSM backups * Dedicated LUNs on IB connected DDN SFA10K with 1TB SATA drives Scratch space: * Any project that requests scratch space gets their own group writeable fileset (without quotas) so we can easily track space usage. * All LUNs on IB connected DDN SFA10K with 900GB SAS drives HSM space: * Uses filesets without quotas, except when projects exceed their allocated amount of tape+disk when we impose an immediate cap until they tidy up * Dedicated LUNs on IB connected DDN SFA10K with 1TB SATA drives (same controllers as project space) We kept a few LUNs up our sleeves on the SATA SFA10K, just in case.. All the best, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From samuel at unimelb.edu.au Tue Dec 23 15:42:27 2014 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 24 Dec 2014 10:42:27 +1100 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A261.6070503@rutgers.edu> References: <5499A261.6070503@rutgers.edu> Message-ID: <5499FDE3.8060204@unimelb.edu.au> On 24/12/14 04:12, Prentice Bisbal wrote: > Anyway, another person in the conversation felt that this would be bad, > because if someone was running a job that would hammer the fileystem, it > would make the filesystem unresponsive, and keep other people from > logging in and doing work. 
I don't believe we've ever seen this issue with GPFS and we have some people running some pretty pathological codes for I/O (including OpenFOAM which is plain insane and some of the bioinformatics codes that want to do single byte synchronous I/O). I think the worst issue we've had was a problem with OpenFOAM with a user who ran us out of inodes - they created many millions of directories, each with 4 files in them. But we killed the job, added metadata disks online to extend inodes and then educated the user. It's not something unique to GPFS though (and could probably be harder to recover from on other filesystems). All the best, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From samuel at unimelb.edu.au Tue Dec 23 15:43:47 2014 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 24 Dec 2014 10:43:47 +1100 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: References: <5499A261.6070503@rutgers.edu> Message-ID: <5499FE33.1080002@unimelb.edu.au> On 24/12/14 05:35, Michael Di Domenico wrote: > I've always shied away from gpfs/lustre on /home and favoured netapp's > for one simple reason. snapshots. i can't tell you home many times > people have "accidentally" deleted a file. We use GPFS snapshots for our project areas already, for just that reason. :-) -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From james_cuff at harvard.edu Tue Dec 23 10:52:44 2014 From: james_cuff at harvard.edu (James Cuff) Date: Tue, 23 Dec 2014 13:52:44 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A261.6070503@rutgers.edu> References: <5499A261.6070503@rutgers.edu> Message-ID: On Tue, Dec 23, 2014 at 12:12 PM, Prentice Bisbal wrote: > I was discussing putting /home and /usr/local for my cluster on a GPFS or > Lustre filesystem, in addition to using it just for /scratch. TL;DR: "Unless you have the constitution of an ox, I really, really wouldn't recommend it." Although, I still can't wait to read this thread, I see Joe has already chipped in - this is going to be fun! Happy holidays to one and all! Best, j. -- dr. james cuff, assistant dean for research computing, harvard university | division of science | thirty eight oxford street, cambridge. ma. 02138 | +1 617 384 7647 | http://rc.fas.harvard.edu From bill at princeton.edu Tue Dec 23 18:09:18 2014 From: bill at princeton.edu (Bill Wichser) Date: Tue, 23 Dec 2014 21:09:18 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: References: <5499A261.6070503@rutgers.edu> Message-ID: <549A204E.90607@princeton.edu> > On Tue, Dec 23, 2014 at 12:12 PM, Prentice Bisbal > wrote: >> I was discussing putting /home and /usr/local for my cluster on a GPFS or >> Lustre filesystem, in addition to using it just for /scratch. We too have debated this. Seems a waste to add some 8 or 20 T to a local cluster when we have this nice, central filesystem available. And it's not like the users aren't already using it for everything now anyway. Yet we always come back to locality of data. Or at least locality of the login directory. The three filesystems versus one filesystem, well, it seems attractive. Not only to admins but to users as well. 
In the end, while we'd like to consolidate everything in one place, there are reasons not to do so. I suppose the strongest is that clusters are dynamic whereas that central storage, not so much. There were Sandybridge executables, then Westmere, then Ivy. Haswell is just beginning. So a new cluster with a local home is a great place for these execs, keeping architecture codes distinct. If for no other reason this has perpetuated the current method of keeping these local /home directories. I have other reasons as well. But that is perhaps my strongest this month. Bill From novosirj at ca.rutgers.edu Tue Dec 23 18:14:49 2014 From: novosirj at ca.rutgers.edu (Novosielski, Ryan) Date: Tue, 23 Dec 2014 21:14:49 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A261.6070503@rutgers.edu> References: <5499A261.6070503@rutgers.edu> Message-ID: I run an old Lustre (1.8.9), but it doesn't support some forms of file locking that were even required for compiling some software. Doesn't happen often, but enough to give me pause. ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences* || \\UTGERS |---------------------*O*--------------------- ||_// Biomedical | Ryan Novosielski - Senior Technologist || \\ and Health | novosirj at rutgers.edu- 973/972.0922 (2x0922) || \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark `' On Dec 23, 2014, at 12:11, Prentice Bisbal > wrote: Beowulfers, I have limited experience managing parallel filesytems like GPFS or Lustre. I was discussing putting /home and /usr/local for my cluster on a GPFS or Lustre filesystem, in addition to using it just for /scratch. I've never done this before, but it doesn't seem like all that bad an idea. My logic for this is the following: 1. Users often try to run programs from in /home, which leads to errors, no matter how many times I tell them not to do that. This would make the system more user-friendly. I could use quotas/policies to encourage them to use 'steer' them to use other filesystems if needed. 2. Having one storage system to manage is much better than 3. 3. Profit? Anyway, another person in the conversation felt that this would be bad, because if someone was running a job that would hammer the fileystem, it would make the filesystem unresponsive, and keep other people from logging in and doing work. I'm not buying this concern for the following reasons: If a job can hammer your parallel filesystem so that the login nodes become unresponsive, you've got bigger problems, because that means other jobs can't run on the cluster, and the job hitting the filesystem hard has probably slowed down to a crawl, too. I know there are some concerns with the stability of parallel filesystems, so if someone wants to comment on the dangers of that, too, I'm all ears. I think that the relative instability of parallel filesystems compared to NFS would be the biggest concern, not performance. -- Prentice Bisbal Manager of Information Technology Rutgers Discovery Informatics Institute (RDI2) Rutgers University http://rdi2.rutgers.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From samuel at unimelb.edu.au Tue Dec 23 20:13:21 2014 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 24 Dec 2014 15:13:21 +1100 Subject: [Beowulf] List admin away until 5th January Message-ID: <549A3D61.2060801@unimelb.edu.au> Hi all, The University of Melbourne closes from today until the 5th January so I'll be away and not paying as much attention as usual to email until then, so I may not notice queries about or issues with the Beowulf list until the new year. Hope everyone has a nice seasonal festival of their choice, and remember that December 26th is the 0xDF anniversary of the birth of Charles Babbage, who as well as having something to do with computers was also able to demonstrate that confused users are not a new problem: # On two occasions I have been asked, ? "Pray, Mr. Babbage, if you # put into the machine wrong figures, will the right answers come # out?" In one case a member of the Upper, and in the other a member # of the Lower House put this question. I am not able rightly to # apprehend the kind of confusion of ideas that could provoke such # a question. :-) All the best, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From mdidomenico4 at gmail.com Wed Dec 24 04:44:03 2014 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 24 Dec 2014 07:44:03 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499FE33.1080002@unimelb.edu.au> References: <5499A261.6070503@rutgers.edu> <5499FE33.1080002@unimelb.edu.au> Message-ID: On Tue, Dec 23, 2014 at 6:43 PM, Christopher Samuel wrote: > On 24/12/14 05:35, Michael Di Domenico wrote: > >> I've always shied away from gpfs/lustre on /home and favoured netapp's >> for one simple reason. snapshots. i can't tell you home many times >> people have "accidentally" deleted a file. > > We use GPFS snapshots for our project areas already, for just that > reason. :-) Hmmm, i haven't followed GPFS all that much since probably 2005'ish, is snapshots in it fairly new? I don't recall them being there way back then. Perhaps a re-evaluation is in order... From skylar.thompson at gmail.com Wed Dec 24 06:02:31 2014 From: skylar.thompson at gmail.com (Skylar Thompson) Date: Wed, 24 Dec 2014 08:02:31 -0600 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: References: <5499A261.6070503@rutgers.edu> <5499FE33.1080002@unimelb.edu.au> Message-ID: <549AC777.6030405@gmail.com> On 12/24/2014 06:44 AM, Michael Di Domenico wrote: > On Tue, Dec 23, 2014 at 6:43 PM, Christopher Samuel > wrote: >> On 24/12/14 05:35, Michael Di Domenico wrote: >> >>> I've always shied away from gpfs/lustre on /home and favoured netapp's >>> for one simple reason. snapshots. i can't tell you home many times >>> people have "accidentally" deleted a file. >> >> We use GPFS snapshots for our project areas already, for just that >> reason. :-) > > Hmmm, i haven't followed GPFS all that much since probably 2005'ish, > is snapshots in it fairly new? I don't recall them being there way > back then. Perhaps a re-evaluation is in order... They've been supported at least since v3, not sure about before though. One caution is that deleting snapshots is very metadata-intensive, so if you have lots of files you'll want to consider placing your metadata on fast storage (although probably you'll want to consider it regardless of snapshots). 
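The day-to-day mechanics are simple even though the delete is the expensive part. A rough sketch of a nightly rotation, with a made-up filesystem device name ("gpfs0") and retention window; exact option syntax varies between GPFS releases, so check the mm* man pages before relying on it:

    #!/bin/bash
    # Hypothetical nightly snapshot rotation for GPFS device "gpfs0", 7-day retention.
    PATH=$PATH:/usr/lpp/mmfs/bin
    today=$(date +%Y%m%d)
    expire=$(date -d '7 days ago' +%Y%m%d)
    mmcrsnapshot gpfs0 daily-$today
    # The delete below is the metadata-heavy step -- schedule it off-peak.
    if mmlssnapshot gpfs0 | grep -q "daily-$expire"; then
        mmdelsnapshot gpfs0 daily-$expire
    fi
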
Skylar From prentice.bisbal at rutgers.edu Wed Dec 24 07:39:46 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Wed, 24 Dec 2014 10:39:46 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: References: <5499A261.6070503@rutgers.edu> Message-ID: <549ADE42.8020908@rutgers.edu> On 12/23/2014 01:35 PM, Michael Di Domenico wrote: > I've always shied away from gpfs/lustre on /home and favoured netapp's > for one simple reason. snapshots. i can't tell you home many times > people have "accidentally" deleted a file. We used a NetApp at my last employer for everything (/home, /usr/local, etc.) and everything used it (desktops, servers, the cluster), and the snapshot feature was priceless. Many users liked being able to replace a file that they accidentally deleted themselves. My inclination is towards GPFS for this reason instead of Lustre, since GPFS supports snapshotting (and a few other useful features that Lustre doesn't provide yet). > > but yes, the "user education" about running jobs from /home usually > happens at least once a year when someone new starts. we tend to > publicly shame that person and they don't seem to do it anymore > > you never want to be "that guy" that slowed the whole system down... :) > > > > On Tue, Dec 23, 2014 at 12:12 PM, Prentice Bisbal > wrote: >> Beowulfers, >> >> I have limited experience managing parallel filesytems like GPFS or Lustre. >> I was discussing putting /home and /usr/local for my cluster on a GPFS or >> Lustre filesystem, in addition to using it just for /scratch. I've never >> done this before, but it doesn't seem like all that bad an idea. My logic >> for this is the following: >> >> 1. Users often try to run programs from in /home, which leads to errors, no >> matter how many times I tell them not to do that. This would make the system >> more user-friendly. I could use quotas/policies to encourage them to use >> 'steer' them to use other filesystems if needed. >> >> 2. Having one storage system to manage is much better than 3. >> >> 3. Profit? >> >> Anyway, another person in the conversation felt that this would be bad, >> because if someone was running a job that would hammer the fileystem, it >> would make the filesystem unresponsive, and keep other people from logging >> in and doing work. I'm not buying this concern for the following reasons: >> >> If a job can hammer your parallel filesystem so that the login nodes become >> unresponsive, you've got bigger problems, because that means other jobs >> can't run on the cluster, and the job hitting the filesystem hard has >> probably slowed down to a crawl, too. >> >> I know there are some concerns with the stability of parallel filesystems, >> so if someone wants to comment on the dangers of that, too, I'm all ears. I >> think that the relative instability of parallel filesystems compared to NFS >> would be the biggest concern, not performance. 
>> >> -- >> Prentice Bisbal >> Manager of Information Technology >> Rutgers Discovery Informatics Institute (RDI2) >> Rutgers University >> http://rdi2.rutgers.edu >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From prentice.bisbal at rutgers.edu Wed Dec 24 07:47:37 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Wed, 24 Dec 2014 10:47:37 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <549A204E.90607@princeton.edu> References: <5499A261.6070503@rutgers.edu> <549A204E.90607@princeton.edu> Message-ID: <549AE019.30508@rutgers.edu> On 12/23/2014 09:09 PM, Bill Wichser wrote: >> On Tue, Dec 23, 2014 at 12:12 PM, Prentice Bisbal >> wrote: >>> I was discussing putting /home and /usr/local for my cluster on a >>> GPFS or >>> Lustre filesystem, in addition to using it just for /scratch. > > We too have debated this. Seems a waste to add some 8 or 20 T to a > local cluster when we have this nice, central filesystem available. > And it's not like the users aren't already using it for everything now > anyway. Yet we always come back to locality of data. Or at least > locality of the login directory. The three filesystems versus one > filesystem, well, it seems attractive. Not only to admins but to > users as well. > > In the end, while we'd like to consolidate everything in one place, > there are reasons not to do so. I suppose the strongest is that > clusters are dynamic whereas that central storage, not so much. There > were Sandybridge executables, then Westmere, then Ivy. Haswell is just > beginning. So a new cluster with a local home is a great place for > these execs, keeping architecture codes distinct. If for no other > reason this has perpetuated the current method of keeping these local > /home directories. > > I have other reasons as well. But that is perhaps my strongest this > month. > In that case, I can't wait until January! Only 8 more days! I see the logic in having separate /usr/local for every cluster so you can install optimized binaries for each processor, but do you find your users take the time to recompile their own codes for each processor type, or did you come up with this arrangement to force them to do so? Prentice From prentice.bisbal at rutgers.edu Wed Dec 24 07:48:19 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Wed, 24 Dec 2014 10:48:19 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: References: <5499A261.6070503@rutgers.edu> Message-ID: <549AE043.5020304@rutgers.edu> Ryan, Thanks for that tid-bit. I never thought of that. On 12/23/2014 09:14 PM, Novosielski, Ryan wrote: > I run an old Lustre (1.8.9), but it doesn't support some forms of file > locking that were even required for compiling some software. Doesn't > happen often, but enough to give me pause. 
> > ____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences* > || \\UTGERS |---------------------*O*--------------------- > ||_// Biomedical | Ryan Novosielski - Senior Technologist > || \\ and Health | novosirj at rutgers.edu - > 973/972.0922 (2x0922) > || \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark > `' > > On Dec 23, 2014, at 12:11, Prentice Bisbal > > wrote: > >> Beowulfers, >> >> I have limited experience managing parallel filesytems like GPFS or >> Lustre. I was discussing putting /home and /usr/local for my cluster on >> a GPFS or Lustre filesystem, in addition to using it just for /scratch. >> I've never done this before, but it doesn't seem like all that bad an >> idea. My logic for this is the following: >> >> 1. Users often try to run programs from in /home, which leads to errors, >> no matter how many times I tell them not to do that. This would make the >> system more user-friendly. I could use quotas/policies to encourage them >> to use 'steer' them to use other filesystems if needed. >> >> 2. Having one storage system to manage is much better than 3. >> >> 3. Profit? >> >> Anyway, another person in the conversation felt that this would be bad, >> because if someone was running a job that would hammer the fileystem, it >> would make the filesystem unresponsive, and keep other people from >> logging in and doing work. I'm not buying this concern for the following >> reasons: >> >> If a job can hammer your parallel filesystem so that the login nodes >> become unresponsive, you've got bigger problems, because that means >> other jobs can't run on the cluster, and the job hitting the filesystem >> hard has probably slowed down to a crawl, too. >> >> I know there are some concerns with the stability of parallel >> filesystems, so if someone wants to comment on the dangers of that, too, >> I'm all ears. I think that the relative instability of parallel >> filesystems compared to NFS would be the biggest concern, not >> performance. >> >> -- >> Prentice Bisbal >> Manager of Information Technology >> Rutgers Discovery Informatics Institute (RDI2) >> Rutgers University >> http://rdi2.rutgers.edu >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org >> sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From prentice.bisbal at rutgers.edu Wed Dec 24 07:54:12 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Wed, 24 Dec 2014 10:54:12 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <5499A261.6070503@rutgers.edu> References: <5499A261.6070503@rutgers.edu> Message-ID: <549AE1A4.4080100@rutgers.edu> Everyone, Thanks for the feedback you've provided to my query below. I'm glad I'm not the only one who thought of this, and a lot of you raised very good points I haven't thought about. While I've been following parallel filesystems for years, I have very little experience actually managing them up to this point. (My BG/P came with GPFS filesystem for /scratch, but everything was already setup before I got here, so I've only had to deal with it when something breaks). You've all convinced me that this may not be an ideal solution arrangement, but if I go this route, GPFS might be a better fit for this than Lustre (mainly because Chris Samuels has proven it *is* possible with GPFS, and GPFS has snapshotting). 
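The user-facing win should be much the same as with the NetApp: self-service restores. Assuming the default snapshot directory name and a daily naming scheme (both invented here for illustration), recovering a deleted file is just a copy:

    # Pull an accidentally deleted file back out of a GPFS snapshot.
    # ".snapshots" at the filesystem root is the default; mmsnapdir can rename/relocate it.
    cp -a /gpfs/home/.snapshots/daily-20141223/jdoe/input.dat /gpfs/home/jdoe/
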
Joe Landman, as always, has provided a wealth of information, and the rest of you have pointed out other potential pitfalls. with this approach. Thanks again for the feedback, and please keep the conversation going. Prentice On 12/23/2014 12:12 PM, Prentice Bisbal wrote: > Beowulfers, > > I have limited experience managing parallel filesytems like GPFS or > Lustre. I was discussing putting /home and /usr/local for my cluster > on a GPFS or Lustre filesystem, in addition to using it just for > /scratch. I've never done this before, but it doesn't seem like all > that bad an idea. My logic for this is the following: > > 1. Users often try to run programs from in /home, which leads to > errors, no matter how many times I tell them not to do that. This > would make the system more user-friendly. I could use quotas/policies > to encourage them to use 'steer' them to use other filesystems if needed. > > 2. Having one storage system to manage is much better than 3. > > 3. Profit? > > Anyway, another person in the conversation felt that this would be > bad, because if someone was running a job that would hammer the > fileystem, it would make the filesystem unresponsive, and keep other > people from logging in and doing work. I'm not buying this concern for > the following reasons: > > If a job can hammer your parallel filesystem so that the login nodes > become unresponsive, you've got bigger problems, because that means > other jobs can't run on the cluster, and the job hitting the > filesystem hard has probably slowed down to a crawl, too. > > I know there are some concerns with the stability of parallel > filesystems, so if someone wants to comment on the dangers of that, > too, I'm all ears. I think that the relative instability of parallel > filesystems compared to NFS would be the biggest concern, not > performance. > From landman at scalableinformatics.com Wed Dec 24 07:58:35 2014 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 24 Dec 2014 10:58:35 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <549AE1A4.4080100@rutgers.edu> References: <5499A261.6070503@rutgers.edu> <549AE1A4.4080100@rutgers.edu> Message-ID: <549AE2AB.8010205@scalableinformatics.com> On 12/24/2014 10:54 AM, Prentice Bisbal wrote: > Everyone, > > Thanks for the feedback you've provided to my query below. I'm glad > I'm not the only one who thought of this, and a lot of you raised very > good points I haven't thought about. While I've been following > parallel filesystems for years, I have very little experience actually > managing them up to this point. (My BG/P came with GPFS filesystem for > /scratch, but everything was already setup before I got here, so I've > only had to deal with it when something breaks). > > You've all convinced me that this may not be an ideal solution > arrangement, but if I go this route, GPFS might be a better fit for > this than Lustre (mainly because Chris Samuels has proven it *is* > possible with GPFS, and GPFS has snapshotting). > > Joe Landman, as always, has provided a wealth of information, and the > rest of you have pointed out other potential pitfalls. with this > approach. > My pleasure ... I do think asking James Cuff, Chris Dwan, and others running/managing big kit (and the teams running the kit), what they are doing and why would be quite instructive in a bigger picture sense. Which to a degree suggests that mebbe a devops/best practices BoF or talk series, or educational workshop at SC15 wouldn't be a bad thing ... 
I'd be happy to submit a proposal for this for this year. Let me know ... > Thanks again for the feedback, and please keep the conversation going. > > Prentice > > On 12/23/2014 12:12 PM, Prentice Bisbal wrote: >> Beowulfers, >> >> I have limited experience managing parallel filesytems like GPFS or >> Lustre. I was discussing putting /home and /usr/local for my cluster >> on a GPFS or Lustre filesystem, in addition to using it just for >> /scratch. I've never done this before, but it doesn't seem like all >> that bad an idea. My logic for this is the following: >> >> 1. Users often try to run programs from in /home, which leads to >> errors, no matter how many times I tell them not to do that. This >> would make the system more user-friendly. I could use quotas/policies >> to encourage them to use 'steer' them to use other filesystems if >> needed. >> >> 2. Having one storage system to manage is much better than 3. >> >> 3. Profit? >> >> Anyway, another person in the conversation felt that this would be >> bad, because if someone was running a job that would hammer the >> fileystem, it would make the filesystem unresponsive, and keep other >> people from logging in and doing work. I'm not buying this concern >> for the following reasons: >> >> If a job can hammer your parallel filesystem so that the login nodes >> become unresponsive, you've got bigger problems, because that means >> other jobs can't run on the cluster, and the job hitting the >> filesystem hard has probably slowed down to a crawl, too. >> >> I know there are some concerns with the stability of parallel >> filesystems, so if someone wants to comment on the dangers of that, >> too, I'm all ears. I think that the relative instability of parallel >> filesystems compared to NFS would be the biggest concern, not >> performance. >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com twtr : @scalableinfo phone: +1 734 786 8423 x121 cell : +1 734 612 4615 From prentice.bisbal at rutgers.edu Wed Dec 24 08:38:57 2014 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Wed, 24 Dec 2014 11:38:57 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <549AE2AB.8010205@scalableinformatics.com> References: <5499A261.6070503@rutgers.edu> <549AE1A4.4080100@rutgers.edu> <549AE2AB.8010205@scalableinformatics.com> Message-ID: <549AEC21.1030400@rutgers.edu> On 12/24/2014 10:58 AM, Joe Landman wrote: > > On 12/24/2014 10:54 AM, Prentice Bisbal wrote: >> Everyone, >> >> Thanks for the feedback you've provided to my query below. I'm glad >> I'm not the only one who thought of this, and a lot of you raised >> very good points I haven't thought about. While I've been following >> parallel filesystems for years, I have very little experience >> actually managing them up to this point. (My BG/P came with GPFS >> filesystem for /scratch, but everything was already setup before I >> got here, so I've only had to deal with it when something breaks). 
>> >> You've all convinced me that this may not be an ideal solution >> arrangement, but if I go this route, GPFS might be a better fit for >> this than Lustre (mainly because Chris Samuels has proven it *is* >> possible with GPFS, and GPFS has snapshotting). >> >> Joe Landman, as always, has provided a wealth of information, and the >> rest of you have pointed out other potential pitfalls. with this >> approach. >> > My pleasure ... I do think asking James Cuff, Chris Dwan, and others > running/managing big kit (and the teams running the kit), what they > are doing and why would be quite instructive in a bigger picture sense. > > Which to a degree suggests that mebbe a devops/best practices BoF or > talk series, or educational workshop at SC15 wouldn't be a bad thing > ... I'd be happy to submit a proposal for this for this year. > > Let me know ... Actually, several other System Admins and I are trying to get more emphasis on System Administration at the SC conferences, and to even have a SysAdmin track. Talking about practical issues about managing filesystems, like those brought up here, would be a great topic to include in this. > > >> Thanks again for the feedback, and please keep the conversation going. >> >> Prentice >> >> On 12/23/2014 12:12 PM, Prentice Bisbal wrote: >>> Beowulfers, >>> >>> I have limited experience managing parallel filesytems like GPFS or >>> Lustre. I was discussing putting /home and /usr/local for my cluster >>> on a GPFS or Lustre filesystem, in addition to using it just for >>> /scratch. I've never done this before, but it doesn't seem like all >>> that bad an idea. My logic for this is the following: >>> >>> 1. Users often try to run programs from in /home, which leads to >>> errors, no matter how many times I tell them not to do that. This >>> would make the system more user-friendly. I could use >>> quotas/policies to encourage them to use 'steer' them to use other >>> filesystems if needed. >>> >>> 2. Having one storage system to manage is much better than 3. >>> >>> 3. Profit? >>> >>> Anyway, another person in the conversation felt that this would be >>> bad, because if someone was running a job that would hammer the >>> fileystem, it would make the filesystem unresponsive, and keep other >>> people from logging in and doing work. I'm not buying this concern >>> for the following reasons: >>> >>> If a job can hammer your parallel filesystem so that the login nodes >>> become unresponsive, you've got bigger problems, because that means >>> other jobs can't run on the cluster, and the job hitting the >>> filesystem hard has probably slowed down to a crawl, too. >>> >>> I know there are some concerns with the stability of parallel >>> filesystems, so if someone wants to comment on the dangers of that, >>> too, I'm all ears. I think that the relative instability of parallel >>> filesystems compared to NFS would be the biggest concern, not >>> performance. >>> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > From mathera at gmail.com Wed Dec 24 17:36:52 2014 From: mathera at gmail.com (Andrew Mather) Date: Thu, 25 Dec 2014 12:36:52 +1100 Subject: [Beowulf] Sysadmin track/BoF/Workshop(s) at SC15. Message-ID: > > My pleasure ... 
I do think asking James Cuff, Chris Dwan, and others > > running/managing big kit (and the teams running the kit), what they > > are doing and why would be quite instructive in a bigger picture sense. > > > > Which to a degree suggests that mebbe a devops/best practices BoF or > > talk series, or educational workshop at SC15 wouldn't be a bad thing > > ... I'd be happy to submit a proposal for this for this year. > > > > Let me know ... > > Actually, several other System Admins and I are trying to get more > emphasis on System Administration at the SC conferences, and to even > have a SysAdmin track. Talking about practical issues about managing > filesystems, like those brought up here, would be a great topic to > include in this. > This.. This so much ! :) Merry Christmas from Down-Under -- - http://surfcoast.redbubble.com | https://picasaweb.google.com/107747436224613508618 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- "Unless someone like you, cares a whole awful lot, nothing is going to get better...It's not !" - The Lorax -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- A committee is a cul-de-sac, down which ideas are lured and then quietly strangled. *Sir Barnett Cocks * -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- "A mind is like a parachute. It doesnt work if it's not open." :- Frank Zappa - -------------- next part -------------- An HTML attachment was scrubbed... URL: From jason at lovesgoodfood.com Wed Dec 24 08:29:05 2014 From: jason at lovesgoodfood.com (Jason Riedy) Date: Wed, 24 Dec 2014 11:29:05 -0500 Subject: [Beowulf] Putting /home on Lusture of GPFS References: <5499A261.6070503@rutgers.edu> <549A204E.90607@princeton.edu> Message-ID: <87k31hau5q.fsf@qNaN.sparse.dyndns.org> And Bill Wichser writes: > We too have debated this. Seems a waste to add some 8 or 20 T to a > local cluster when we have this nice, central filesystem > available. Ah, but now we also have the filesystem-storage split... I'm contemplating using a single object pool (possibly Ceph) to provide both a slice for NFS /home and slices & pools for project storage. The reasons are more economic than performance or even management. The funding model is similar to "buy nodes for a cluster, get proportional access." If I can convince people to add a little overhead, then everyone gains some replication, etc... But, really, our current network setup is the limiting factor in back-of-the-envelope performance estimates. And I'm the limiting factor in the whole thing, as our system staff is, um, stretched thin. From samuel at unimelb.edu.au Thu Dec 25 15:39:59 2014 From: samuel at unimelb.edu.au (Chris Samuel) Date: Fri, 26 Dec 2014 10:39:59 +1100 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: References: <5499A261.6070503@rutgers.edu> <5499FE33.1080002@unimelb.edu.au> Message-ID: <1475247.iIG3j0qUyd@quad> On Wed, 24 Dec 2014 07:44:03 AM Michael Di Domenico wrote: > Hmmm, i haven't followed GPFS all that much since probably 2005'ish, > is snapshots in it fairly new? I don't recall them being there way > back then. Perhaps a re-evaluation is in order... I've only been using GPFS since 2010 and we've used them since then (can't remember the version we started on, sorry!). One thing we couldn't do was snapshots on an HSM GPFS filesystem (using TSM for Space Management). Apparently that's now supported in GPFS 4.x and TSM 7.x and so those upgrades are something on our list of tasks for the coming year. 
All the best, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From samuel at unimelb.edu.au Thu Dec 25 15:41:19 2014 From: samuel at unimelb.edu.au (Chris Samuel) Date: Fri, 26 Dec 2014 10:41:19 +1100 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <549AC777.6030405@gmail.com> References: <5499A261.6070503@rutgers.edu> <549AC777.6030405@gmail.com> Message-ID: <23663391.IBmI1hffbQ@quad> On Wed, 24 Dec 2014 08:02:31 AM Skylar Thompson wrote: > so if you have lots of files you'll want to consider placing your metadata > on fast storage This is what we do, we use SSDs on IBM v7000's for that. Works very well. -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From samuel at unimelb.edu.au Thu Dec 25 15:52:33 2014 From: samuel at unimelb.edu.au (Chris Samuel) Date: Fri, 26 Dec 2014 10:52:33 +1100 Subject: [Beowulf] Putting /home on Lusture of GPFS In-Reply-To: <549AE019.30508@rutgers.edu> References: <5499A261.6070503@rutgers.edu> <549A204E.90607@princeton.edu> <549AE019.30508@rutgers.edu> Message-ID: <1678209.7WL3ZGoYFn@quad> On Wed, 24 Dec 2014 10:47:37 AM Prentice Bisbal wrote: > I see the logic in having separate /usr/local for every cluster so you > can install optimized binaries for each processor, but do you find your > users take the time to recompile their own codes for each processor > type, or did you come up with this arrangement to force them to do so? As we are supporting life sciences most of our users are not programmers and so we build a lot of software for them and so each system has its own /usr/local. We do have some people who do build their own code, but not that many. One other reason is that we keep all our healthcheck scripts in /usr/local (sourced from a central git repo) which runs over ethernet and we want them to keep working if IB has issues (which would cause GPFS to be unavailable on a node) to flag up problems and take nodes offline automatically in Slurm. All the best, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci
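A healthcheck of the sort described above can be very small as well. A rough, untested sketch, with the mount point, timeout and drain reason invented for illustration:

    #!/bin/bash
    # Drain this node in Slurm if the GPFS mount stops responding.
    GPFS_MOUNT=/gpfs/scratch
    if ! timeout 30 stat -t "$GPFS_MOUNT" >/dev/null 2>&1; then
        scontrol update NodeName=$(hostname -s) State=DRAIN Reason="healthcheck: GPFS unresponsive"
    fi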