[Beowulf] Pretty Big Data

Mon Jan 25 07:13:11 PST 2016

Comments interspersed below.

On 1/25/16, 6:41 AM, "Beowulf on behalf of Jason Riedy"
<beowulf-bounces at beowulf.org on behalf of jason at lovesgoodfood.com> wrote:

>And Christopher Samuel writes:
>> The rest of us will carry on as before I suspect...
>
>Using libraries that hide the (sometimes proprietary) API behind
>sufficient POSIX semantics...  Pretty much what the linked article says.
>The "new" architecture is just the old architecture with the fastest
>components at the relevant data sizes.  The host CPU is faster than the
>storage CPU (in general), so move FS logic there when possible.  Limit
>meta data scaling needs by splitting the storage.  eh.
>
>But I suspect the point is that people think they're willing to give up
>the POSIX semantics they cannot even specify (and often already have
>given up) to say they're using faster hardware.  Kinda like computational
>accelerators.  Those started with lighter semantics: single user, no
>double precision, no atomics, no...
>
>The big, open question that's terribly difficult to address in research
>space:  How do you efficiently mix multiple massive storage allocations
>that need high performance for a three to five year funding period and
>then archival storage afterward?  I suspect much commercial data has a
>similar time horizon for immediate usefulness.  Health care data is
>interestingly different.  CPU allocations are relatively short, so
>inefficiencies from splitting that usage are relatively short-lived.
>Storage lasts longer.

That's an interesting point.  Your funding is short-lived, so "get results
now" might be more important than "get more results cheaply later".  That
pushes towards standardized, familiar access methods (POSIX) rather than
custom APIs, because I suspect that in most of these cases the
rate-limiting resource is the software development, not the
computational/storage hardware.

And it has that interesting "process it now, but save everything for
later" aspect.  Much of "the web" is ephemeral, and is clearly not
intended for long-term archiving and retrieval; given the large number of
dollars being spent on tools to use "the web", they're going to be tailored
for that "retrieve recent data" kind of model.
(After all, I don't find myself searching through 5-year-old emails very
often.  I do, but not very often compared to "what did X send me a month
ago?")

>
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf