[Beowulf] Lustre on google cloud

John Hearns hearnsj at googlemail.com
Fri Jul 26 04:46:56 PDT 2019


2) Terabyte scale data movement into or out of the cloud is not scary in
2019. You can move data into and out of the cloud at basically the line
rate of your internet connection as long as you take a little care in
selecting and tuning your firewalls and inline security devices.  Pushing
1TB/day etc.  into the cloud these days is no big deal and that level of
volume is now normal for a ton of different markets and industries.
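That claim is easy to sanity-check with arithmetic. A minimal sketch (editor's addition; it assumes decimal terabytes and an ideal, loss-free link):

```python
# Back-of-the-envelope: what sustained rate does 1 TB/day actually require?

def required_gbps(tb_per_day: float) -> float:
    """Sustained throughput in Gbit/s needed to move tb_per_day TB in 24 h."""
    bits = tb_per_day * 1e12 * 8      # decimal TB -> bits
    seconds = 24 * 3600
    return bits / seconds / 1e9

rate = required_gbps(1.0)
print(f"1 TB/day needs ~{rate * 1000:.0f} Mbit/s sustained")  # ~93 Mbit/s
```

In other words, a fraction of one well-tuned 1 Gbit/s link covers the 1 TB/day case, which is why the firewalls and inline security devices, not raw bandwidth, tend to be the bottleneck.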

Amazon will of course also send you a semi-trailer full of hard drives to
import your data...  The web page says "Contact Sales for pricing".

On Fri, 26 Jul 2019 at 12:26, Chris Dagdigian <dag at sonsorol.org> wrote:

>
> Coming back late to this thread as yesterday was a travel/transit day ...
> some additional thoughts
>
> 1) I also avoid the term "cloud bursting" these days because it's been
> tarred by marketing smog and no longer means much. The blunt truth is
> that, from a technical perspective, building a hybrid on-premises/cloud
> HPC environment is very simple. The hard part is data -- either moving
> volumes back and forth or trying to maintain a consistent shared
> filesystem at WAN-scale networking distances.
>
> The only life science hybrid HPC environments I've repeatedly seen succeed
> are the chemistry- or modeling-focused ones, because the chemistry folks
> generally have very small volumes of data to move but very large CPU
> requirements and occasional GPU needs. Since the data movement
> requirements are small for chemistry, it's pretty easy to make them happy
> on-prem, in the cloud, or on a hybrid design.
>
> Not to say that full-on cloud-bursting HPC systems don't exist at all, of
> course, but they are rare. I was talking with a pharma yesterday that uses
> HTCondor to span on-premises HPC with on-demand AWS nodes. I just don't
> see that as often as I see distinct HPCs.
>
> My experience in this realm is that for life science we don't build a lot
> of WAN-spanning grids, because we get killed by the gravitational pull of
> our data. We build HPC where the data resides, we keep each system
> relatively simple in scope, and we try to limit WAN-scale data movement.
> For most this means having both onsite HPC and cloud HPC, and we simply
> direct the workload to whichever HPC resource is closest to the data.
>
> So for Jörg -- based on what you have said I'd take a look at your
> userbase, your application mix and how your filesystem is organized. You
> may be able to set things up so that you can "burst" to the cloud for just
> a special subset of your apps, user groups or data sets. That could be your
> chemists or maybe you have a group of people who regularly compute heavily
> against a data set or set of references that rarely change -- in that case
> you may be able to replicate that part of your GPFS over to a cloud and
> send just that workload remotely, thus freeing up capacity on your local
> HPC for other work.
>
>
>
>
> 2) Terabyte scale data movement into or out of the cloud is not scary in
> 2019. You can move data into and out of the cloud at basically the line
> rate of your internet connection as long as you take a little care in
> selecting and tuning your firewalls and inline security devices.  Pushing
> 1TB/day etc.  into the cloud these days is no big deal and that level of
> volume is now normal for a ton of different markets and industries. It's
> basically a cost and budget exercise these days, not a particularly hard
> IT or technology problem.
>
> There are two killer problems with cloud storage, even though it gets
> cheaper all the time:
>
> 2a) Cloud egress fees.  You get charged real money for data traffic
> leaving your cloud. In many environments these fees are so tiny as to be
> unnoticeable noise in the monthly bill. But if you are regularly moving
> terabyte or petabyte scale data into and out of a cloud provider then you
> will notice the egress fees on your bill and they will be large enough that
> you have to plan for them and optimize for cost.
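To put rough numbers on the egress problem (editor's sketch; the flat $/GB rate is an assumption, ballpark for 2019 internet-egress list prices; real bills are tiered by volume and vary by region and destination):

```python
# Crude egress-cost model; substitute your provider's actual price sheet.
EGRESS_USD_PER_GB = 0.09  # assumed flat rate, for illustration only

def egress_cost_usd(gb_out: float, usd_per_gb: float = EGRESS_USD_PER_GB) -> float:
    """Dollar cost of moving gb_out gigabytes out of the cloud."""
    return gb_out * usd_per_gb

print(f"1 TB out ~ ${egress_cost_usd(1_000):,.0f}")      # ~$90: noise
print(f"1 PB out ~ ${egress_cost_usd(1_000_000):,.0f}")  # ~$90,000: plan for it
```

The same model makes both halves of the observation above concrete: terabyte-scale egress disappears into the monthly bill, petabyte-scale egress becomes a budget line item.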
>
> 2b) The monthly recurring cost for cloud storage can be hard to bear at
> petascale unless you have solidly communicated all of the benefits /
> capabilities and can compare them honestly to a full transparent list of
> real world costs to do the same thing onsite. The monthly S3 storage bill
> once you have a few petabytes in AWS is high enough that you catch
> yourself doing math every once in a while along the lines of "I could
> build a Lustre filesystem with 2x the capacity for just two months' worth
> of our cloud storage opex budget!"
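That back-of-the-envelope math looks roughly like this (editor's sketch; the per-GB rate is an assumption near the 2019 S3 Standard list price, before volume discounts, request charges, or egress):

```python
# Monthly object-storage cost at petabyte scale; the rate is illustrative only.
S3_USD_PER_GB_MONTH = 0.023  # assumed; check current pricing

def monthly_cost_usd(petabytes: float) -> float:
    """Approximate monthly bill for `petabytes` of standard object storage."""
    return petabytes * 1_000_000 * S3_USD_PER_GB_MONTH

for pb in (1, 3, 7):
    print(f"{pb} PB ~ ${monthly_cost_usd(pb):,.0f}/month")
# At 7 PB that is ~$161,000/month, so two months of opex is real hardware money.
```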
>
>
>
>
>
>
> INKozin via Beowulf <beowulf at beowulf.org>
> July 26, 2019 at 4:23 AM
> I'm very much in favour of personal or team clusters, as Chris has also
> mentioned. Then the contract between the user and the cloud is explicit.
> The data can be uploaded / pre-staged to S3 in advance (at no cost other
> than time) or copied directly as part of the cluster creation process. It
> makes no sense to replicate your in-house infrastructure in the cloud.
> However, having a solid storage base in-house is good. What you should
> look into is the cost of transferring data back, if you really have to do
> it. The cost could be prohibitively high, e.g. if BAM files need to be
> returned. I'm sure Tim has an opinion.
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> Joe Landman <joe.landman at gmail.com>
> July 26, 2019 at 12:00 AM
>
>
>
> The issue is bursting with large data sets.  You might be able to
> pre-stage some portion of the data set in a public cloud, and then burst
> jobs from there.  Data motion between sites is going to be the hard problem
> in the mix.  Not technically hard, but hard from a cost/time perspective.
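The time side of that cost/time trade-off is also easy to estimate (editor's sketch; it assumes decimal terabytes and a 70% efficiency factor to account for protocol overhead and imperfect links):

```python
# How long does bulk data motion between sites actually take?

def transfer_hours(terabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours to move `terabytes` over a `link_gbps` link, derated by `efficiency`."""
    bits = terabytes * 1e12 * 8
    effective_bps = link_gbps * 1e9 * efficiency
    return bits / effective_bps / 3600

for tb in (1, 10, 100):
    print(f"{tb:>3} TB over 1 Gbit/s ~ {transfer_hours(tb, 1.0):,.1f} h")
```

A terabyte over a gigabit link is an afternoon; a hundred terabytes is roughly two weeks, which is exactly where pre-staging the data set starts to matter.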
>
>
> Jörg Saßmannshausen <sassy-work at sassy.formativ.net>
> July 25, 2019 at 8:26 PM
> Dear all, dear Chris,
>
> thanks for the detailed explanation. We are currently looking into cloud
> bursting, so your email was very timely for me as I am supposed to look
> into it.
>
> One of the issues I can see with our workload is simply getting data into
> the cloud and back out again. We are not talking about a few gigs here; we
> are talking up to 1 TB or more. For reference: we have 9 PB of storage
> (GPFS), of which we are currently using 7 PB, and there are around 1000+
> users connected to the system. So cloud bursting would only be possible in
> some cases.
> Do you happen to have a feeling for how to handle the issue of file sizes
> sensibly?
>
> Sorry for hijacking the thread here a bit.
>
> All the best from a hot London
>
> Jörg
>
> Chris Dagdigian <dag at sonsorol.org>
> July 22, 2019 at 2:14 PM
>
> A lot of production HPC runs on cloud systems.
>
> AWS is big for this via their AWS ParallelCluster stack, which includes
> Lustre support via the FSx for Lustre service, although they are careful
> to caveat it as staging/scratch space not suitable for persistent storage.
> AWS has some cool node types now with 25-, 50- and 100-gigabit network
> support.
>
> Microsoft Azure is doing amazing things now that they have the Cycle
> Computing folks on board, integrated, and able to call shots within the
> product space. They actually offer bare-metal HPC and InfiniBand SKUs now,
> and have some interesting parallel filesystem offerings as well.
>
> Can't comment on Google, as I've not touched or used it professionally,
> but AWS and Azure are for sure real players to consider now if you have an
> HPC requirement.
>
>
> That said, however, a sober cost accounting still shows that on-prem or
> "owned" HPC is best from a financial perspective if your workload runs
> constantly, 24x7x365. Cloud-based HPC is best for capability, bursty
> workloads, temporary workloads, auto-scaling, computing against
> cloud-resident data sets, or the neat new model where, instead of an
> on-prem multi-user shared HPC, you deliver individual bespoke HPC clusters
> to each user or team on the cloud.
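That 24x7x365 break-even can be sketched with a toy model (editor's addition; every number here is a hypothetical placeholder, not a quote from any vendor):

```python
# Crude owned-vs-cloud break-even over a 3-year hardware lifetime.
ONPREM_NODE_USD_3YR = 15_000     # assumed: purchase + power/cooling/admin
CLOUD_NODE_USD_PER_HOUR = 1.50   # assumed: comparable on-demand instance

def breakeven_utilization() -> float:
    """Fraction of 3 years a node must stay busy before owning is cheaper."""
    hours_3yr = 3 * 365 * 24
    cloud_cost_full_time = CLOUD_NODE_USD_PER_HOUR * hours_3yr
    return ONPREM_NODE_USD_3YR / cloud_cost_full_time

u = breakeven_utilization()
print(f"owning wins above ~{u:.0%} sustained utilization")
```

With these placeholder numbers the crossover sits well below 100% utilization, which is why steady 24x7 workloads favor owned hardware while bursty or temporary ones favor the cloud.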
>
> The big paradigm shift for cloud HPC is that it does not make a lot of
> sense to build a monolithic stack shared by multiple competing users and
> groups. The automated provisioning and elasticity of the cloud make it
> more sensible to build many clusters, so that you can tune each one
> specifically for its user or workload and then blow it away when the work
> is done.
>
> My $.02 of course!
>
> Chris
>
>
>
>
>