[Beowulf] part randomization (was GPFS and failed metadata NSD)

Michael Di Domenico mdidomenico4 at gmail.com
Tue May 23 05:14:18 PDT 2017


On Mon, May 22, 2017 at 7:01 PM, Lux, Jim (337C)
<james.p.lux at jpl.nasa.gov> wrote:
>
>>all of my contracts have a "part-randomization" clause in them to
>>ensure vendors randomize the batches they pull parts from to build
>>the machines.  hopefully everyone else's does too... :)
>
> How do you verify that they really are from different batches (in a
> significant way)?  Assembled on different days? Or what?  I can easily see
> a mfr buying a week's worth or a month's worth of parts in one lot - and
> those will all have essentially the same characteristics.  Date codes on
> IC parts are typically just year and week, and relating that back to an
> actual production lot run is non-trivial.

we don't really check all that well.  but linux gives off a lot of
information these days on parts.  you can collect it and analyze it to
a point.
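
as a rough sketch of what "collect it and analyze it" could look like: the
snippet below parses text in the style of `dmidecode -t memory` output and
counts populated dimms per manufacturer and part number.  the sample text,
the vendor name and the part number are made up for illustration, and the
field names (Size, Manufacturer, Part Number) are assumed to appear the way
dmidecode usually prints them.  a part number is not a batch id, so this only
shows how uniform the parts are, not which lot they came from.

```python
from collections import Counter


def parse_memory_devices(text):
    """Split dmidecode-style text into one dict per "Memory Device" block."""
    devices, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            # a blank line ends the current block
            current = None
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def part_counts(devices):
    """Count populated dimms per (Manufacturer, Part Number)."""
    return Counter(
        (d.get("Manufacturer", "?"), d.get("Part Number", "?"))
        for d in devices
        if d.get("Size", "No Module Installed") != "No Module Installed"
    )


# invented sample, shaped like `dmidecode -t memory` output
SAMPLE = """\
Handle 0x0040, DMI type 17, 40 bytes
Memory Device
    Size: 16384 MB
    Manufacturer: VendorA
    Serial Number: 1A2B3C4D
    Part Number: PART-0001

Handle 0x0041, DMI type 17, 40 bytes
Memory Device
    Size: 16384 MB
    Manufacturer: VendorA
    Serial Number: 1A2B3C4E
    Part Number: PART-0001

Handle 0x0042, DMI type 17, 40 bytes
Memory Device
    Size: No Module Installed
    Manufacturer: NO DIMM
    Serial Number: NO DIMM
    Part Number: NO DIMM
"""

devices = parse_memory_devices(SAMPLE)
for key, count in part_counts(devices).items():
    print(key, count)
```

run across every node and merged, counts like these at least show when the
whole machine was built from a single part number.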

but in reality it's more of a legal lever, which gives us two
things.  1, the affected hardware should not take out the entire
machine.  2, if it does and we determine that part randomization wasn't
done, we can force the vendor to fix it.

i'm not a legal guy so i'm not willing to take this much further than
that.  i'm merely relaying a tidbit of knowledge i picked up when i
worked sales in the cluster industry and now for the govt.

> In the space business, we do a lot of lot tracking and so forth, and
> a) it isn't cheap
> b) it isn't always available
> c) it isn't necessarily meaningful
>
> That way when you get the alert that a bad batch of 2N2222 transistors has
> been found, you can check the as-built docs for your spacecraft on the way
> to Europa and breathe a sigh of relief that you didn't use one of "those
> units".

yes, agreed, but i think we're talking about different realities here.
i'm not concerned over the individual capacitors in all the machines.
if i order 1000 dimms, i want to ensure they come from as many
different batches as reasonably possible.  that's certainly an easier
task than trying to count and catalog all the capacitors in a cluster.
(that should be a raffle like counting gumballs in a jar)
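
one way to put a number on "as many different batches as reasonably
possible" is to look at how many distinct batches show up and what share of
the order the biggest one holds.  this is only a sketch: the lot labels below
are invented, and it assumes you have some per-dimm batch label to begin
with, which is the hard part.

```python
from collections import Counter


def batch_spread(batch_ids):
    """Return (number of distinct batches, largest share held by one batch)."""
    counts = Counter(batch_ids)
    if not counts:
        return 0, 0.0
    largest = max(counts.values())
    return len(counts), largest / sum(counts.values())


# hypothetical orders of 1000 dimms; the lot labels are made up
one_lot = ["lot-A"] * 1000
mixed = [f"lot-{i % 20}" for i in range(1000)]

print(batch_spread(one_lot))  # (1, 1.0)
print(batch_spread(mixed))    # (20, 0.05)
```

the second number is the fraction of dimms a single bad batch could take
out, which is the thing the clause is trying to keep small.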

i honestly haven't done much research in this area, but it's interesting...

