<html>
  <head>
    <meta content="text/html; charset=windows-1252"
      http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Someone on the MVAPICH mailing list
      posted this image a few years back. I find the style of
      visualization compelling:<br>
      <br>
      <img alt="" src="cid:part1.07070706.07030507@microway.com"
        height="480" width="588"><br>
      <br>
      <br>
      On 03/25/2016 02:39 PM, Jeffrey Layton wrote:<br>
    </div>
    <blockquote
cite="mid:CAJfzO5SFxrxc=NqQx5846r1sRSMPewnHY-kBx5OOJgTJF3gVcg@mail.gmail.com"
      type="cite">
      <p><defanged_div>Olli-Pekka, et al,<br>
          <br>
        </defanged_div></p>
      <defanged_div>I took a look at your updated website - it looks
        very good. One thing I wanted to ask, and this question is
        probably one for the entire list, when you run a test across all
        of the nodes in the cluster, what process do you use to
        determine if nodes are "outliers" and need attention?<br>
        <defanged_div>
          <p><defanged_div><br>
            </defanged_div></p>
          <defanged_div>For example, one test you mention is to run
            stream and look at the TRIAD results for all of the nodes.
            If you run it across an entire cluster you end up with a
            collection of results. What do you do with those results? Do
            you look for nodes that are a certain percentage outside of
            the mean? Or do you look for nodes that are outside one
            standard deviation from the mean?<br>
            <br>
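            <p>The two rules asked about above (a fixed percentage away from the mean, or a number of standard deviations from it) can be sketched as follows. This is only an illustrative sketch, not anyone's actual procedure: the function names, node names and TRIAD numbers are hypothetical, and the thresholds are arbitrary.</p>
<pre>
```python
import statistics


def outliers_by_percent(results, pct):
    """Return nodes whose result differs from the mean by more than pct percent."""
    values = list(results.values())
    mean = statistics.fmean(values)
    limit = abs(mean) * pct / 100.0
    return sorted(node for node, v in results.items() if abs(v - mean) > limit)


def outliers_by_stddev(results, k=1.0):
    """Return nodes whose result is more than k standard deviations from the mean."""
    values = list(results.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return sorted(node for node, v in results.items() if abs(v - mean) > k * sd)


# Hypothetical per-node STREAM TRIAD bandwidths in MB/s (made-up numbers).
triad = {"n01": 100.0, "n02": 102.0, "n03": 98.0, "n04": 100.0, "n05": 60.0}

print(outliers_by_percent(triad, 15))
print(outliers_by_stddev(triad, 1.0))
```
</pre>
            <p>With these made-up numbers both rules flag only n05. Tightening the percentage rule to 10% would also flag the healthy node n02, because the slow node drags the mean down to 92; a median-based centre is less sensitive to that effect.</p>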
            <defanged_div>Thanks!<br>
              <br>
              <defanged_div>Jeff<br>
                <br>
                <defanged_div>P.S. I have my own ideas but I'm really
                  curious what other people do.<br>
                  <br>
                  <defanged_div>
                    <p><defanged_div class="gmail_extra"><br>
                      </defanged_div></p>
                    <p><defanged_div class="gmail_quote">On Wed, Mar 23,
                        2016 at 9:55 AM, Douglas Eadline <defanged_span
                          dir="ltr"><<a moz-do-not-send="true"
                            href="mailto:deadline@eadline.org"
                            target="_blank">deadline@eadline.org</a>></defanged_span>
                        wrote:<br>
                      </defanged_div></p>
                    <blockquote class="gmail_quote"
                      defanged_style="margin:0 0 0 .8ex;border-left:1px
                      #ccc solid;padding-left:1ex"><defanged_span
                        class=""><br>
                        > Thanks for the kind words and comments!
                        Good catch with HPL. It's<br>
                        > definitely part of the test regime. I
                        typically run 3 tests for<br>
                        > consistency:<br>
                        ><br>
                        > - Separate instance of STREAM2 on each node<br>
                        > - Separate instance of HPL on each node<br>
                        > - Simple MPI latency / bandwidth test
                        called mpisweep that tests every<br>
                        > link (I'll put this up on github later as
                        well)<br>
                        ><br>
                        > I now made the changes to the document.<br>
                        ><br>
                        > After this set of tests I'm not completely
                        sure if NPB will add any<br>
                        > further information. Those 3 benchmarks
                        combined with the other checks<br>
                        > should pretty much expose all the possible
                        issues. However, I could be<br>
                        > missing something again :)<br>
                        <br>
                      </defanged_span>NAS will verify the results. On
                      several occasions I have<br>
                      found NAS gave good numbers but the results did
                      not verify.<br>
                      This allowed me to look at lower level issues
                      until I found<br>
                      the problem (in one case a cable, IIRC).<br>
                      <br>
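                      <p>Checking the verification flag alongside the performance number could be scripted roughly like this. It is a minimal sketch that assumes the NPB report contains a "Mop/s total = ..." line and a "Verification = SUCCESSFUL" (or UNSUCCESSFUL) line; the function name and the sample output strings are hypothetical, so adjust the patterns to whatever your build actually prints.</p>
<pre>
```python
import re


def npb_status(output):
    """Extract the Mop/s total and the verification flag from one NPB report.

    Assumes lines like 'Mop/s total = 1500.25' and
    'Verification = SUCCESSFUL'; this format is an assumption.
    """
    rate = re.search(r"Mop/s total\s*=\s*([0-9.]+)", output)
    verdict = re.search(r"Verification\s*=\s*(\w+)", output)
    mops = float(rate.group(1)) if rate else None
    verified = verdict is not None and verdict.group(1).upper() == "SUCCESSFUL"
    return mops, verified


# Made-up report fragments: similar speed, but only the first one verifies.
good = " Mop/s total     =   1500.25\n Verification    =   SUCCESSFUL\n"
bad = " Mop/s total     =   1498.70\n Verification    =   UNSUCCESSFUL\n"

print(npb_status(good))
print(npb_status(bad))
```
</pre>
                      <p>A run is treated as healthy only if the second value is True, however good the first value looks.</p>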
                      BTW, I run NAS all the time to test performance
                      and make sure<br>
                      things are running properly on my deskside
                      clusters. I have done<br>
                      it so often I can tell which test is running by
                      watching wwtop<br>
                      (a Warewulf cluster-based top that shows loads, net,
                      memory but no<br>
                      application names).<br>
                      <br>
                      --<br>
                      Doug<br>
                      <defanged_span class=""><br>
                        ><br>
                        > Best regards,<br>
                        > O-P<br>
                        > --<br>
                        > Olli-Pekka Lehto<br>
                        > Development Manager<br>
                        > Computing Platforms<br>
                        > CSC - IT Center for Science Ltd.<br>
                        > E-Mail: <a moz-do-not-send="true"
                          href="mailto:olli-pekka.lehto@csc.fi">olli-pekka.lehto@csc.fi</a><br>
                        > Tel: <a moz-do-not-send="true"
                          href="tel:%2B358%2050%20381%208604"
                          defanged_value="+358503818604">+358 50 381
                          8604</a><br>
                        > skype: oplehto // twitter: ople<br>
                        ><br>
                      </defanged_span>
                      <p><defanged_div class="h5">>> From:
                          "Jeffrey Layton" <<a moz-do-not-send="true"
                            href="mailto:laytonjb@gmail.com">laytonjb@gmail.com</a>><br>
                          >> To: "Olli-Pekka Lehto" <<a
                            moz-do-not-send="true"
                            href="mailto:olli-pekka.lehto@csc.fi">olli-pekka.lehto@csc.fi</a>><br>
                          >> Cc: <a moz-do-not-send="true"
                            href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a><br>
                          >> Sent: Tuesday, 22 March, 2016
                          16:45:20<br>
                          >> Subject: Re: [Beowulf] Cluster
                          consistency checks<br>
                          ><br>
                          >> Olli-Pekka,<br>
                          ><br>
                          >> Very nice - I'm glad you put a list
                          down. Many of the things that I do<br>
                          >> are based<br>
                          >> on experience.<br>
                          ><br>
                          >> A long time ago, in one of my
                          previous jobs, we used to run NAS Parallel<br>
                          >> Benchmark (NPB) on single nodes to
                          get a baseline of performance. We<br>
                          >> would look<br>
                          >> for outliers and triage and debug
                          them based on these results. We weren't<br>
                          >> running the test for performance but
                          to make sure the cluster was as<br>
                          >> homogeneous<br>
                          >> as possible. Have you done this
                          before?<br>
                          ><br>
                          >> I've also seen people run HPL on
                          single nodes and look for outliers.<br>
                          >> After<br>
                          >> triaging these, HPL is run on smaller
                          groups of nodes within a single<br>
                          >> switch,<br>
                          >> looking for outliers and triaging them.
                          This continues up to the entire<br>
                          >> system. The<br>
                          >> point is not to get a great HPL
                          number to submit to the Top500 but<br>
                          >> rather to<br>
                          >> find potential network issues,
                          particularly network links.<br>
                          ><br>
                          >> Thanks for the good work!<br>
                          ><br>
                          >> Jeff<br>
                          ><br>
                          >> On Tue, Mar 22, 2016 at 11:32 AM,
                          Olli-Pekka Lehto <<br>
                          >> <a moz-do-not-send="true"
                            href="mailto:olli-pekka.lehto@csc.fi">olli-pekka.lehto@csc.fi</a>
                          ><br>
                          >> wrote:<br>
                          ><br>
                          >>> Hi,<br>
                          ><br>
                          >>> I finally got around to writing
                          down my cluster-consistency checklist<br>
                          >>> that I've<br>
                          >>> been planning for a long time:<br>
                          ><br>
                          >>> <a moz-do-not-send="true"
                            href="https://github.com/oplehto/cluster-checks/"
                            defanged_rel="noreferrer" target="_blank">https://github.com/oplehto/cluster-checks/</a><br>
                          >>> The goal is to try to make the
                          baseline installation of a cluster as<br>
                          >>> consistent<br>
                          >>> as possible and make vendors work
                          for their money. :) Of course<br>
                          >>> hopefully<br>
                          >>> publishing this will help vendors
                          capture some of the issues that slip<br>
                          >>> through<br>
                          >>> the cracks even before clusters
                          are handed over. It's also a good idea<br>
                          >>> to run<br>
                          >>> these types of checks during the
                          lifetime of the system as there's<br>
                          >>> always some<br>
                          >>> consistency creep as hardware
                          gets replaced.<br>
                          ><br>
                          >>> If someone is interested in
                          contributing, pull requests or comments on<br>
                          >>> the list<br>
                          >>> are welcome. I'm sure that
                          there's something missing as well. Right now<br>
                          >>> it's<br>
                          >>> just a text file but making some
                          nicer scripts and postprocessing for<br>
                          >>> the<br>
                          >>> output might happen as well at
                          some point. All the examples are very HP<br>
                          >>> oriented as well at this point.<br>
                          ><br>
                          >>> Best regards,<br>
                          >>> Olli-Pekka<br>
                          >>> --<br>
                          >>> Olli-Pekka Lehto<br>
                          >>> Development Manager<br>
                          >>> Computing Platforms<br>
                          >>> CSC - IT Center for Science Ltd.<br>
                          >>> E-Mail: <a
                            moz-do-not-send="true"
                            href="mailto:olli-pekka.lehto@csc.fi">olli-pekka.lehto@csc.fi</a><br>
                          >>> Tel: <a moz-do-not-send="true"
                            href="tel:%2B358%2050%20381%208604"
                            defanged_value="+358503818604">+358 50 381
                            8604</a><br>
                          >>> skype: oplehto // twitter: ople<br>
                          ><br>
                          >>>
                          _______________________________________________<br>
                          >>> Beowulf mailing list, <a
                            moz-do-not-send="true"
                            href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a>
                          sponsored by Penguin<br>
                          >>> Computing<br>
                          >>> To change your subscription
                          (digest mode or unsubscribe) visit<br>
                          >>> <a moz-do-not-send="true"
                            href="http://www.beowulf.org/mailman/listinfo/beowulf"
                            defanged_rel="noreferrer" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>
                          ><br>
                        </defanged_div></p>
                      <defanged_div><defanged_div>> --<br>
                          > Mailscanner: Clean<br>
                          <defanged_span class="">><br>
                            <br>
                            <br>
                          </defanged_span>--<br>
                          Doug<br>
                          <defanged_span class="HOEnZb"><font
                              color="#888888"><br>
                              --<br>
                              Mailscanner: Clean<br>
                              <br>
                            </font></defanged_span></defanged_div></defanged_div></blockquote>
                    <defanged_div><br>
                      <defanged_div>
                      </defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></defanged_div></blockquote>
    <br>
    <br>
    <div class="moz-signature">-- <br>
      Eliot Eshelman<br>
      Microway, Inc.
    </div>
  </body>
</html>