<div dir="ltr"><div>Prentice, as I understand it, the problem here is that with the same OS and IB drivers, there is a big difference in performance between the stateful and NFS root nodes.</div><div>Throwing my hat into the ring: try looking to see whether there is an excessive rate of interrupts coming from the network card in the nfsroot case:</div><div><br></div><div>watch cat /proc/interrupts</div><div><br></div><div>You will probably need a large terminal window for this (or there is probably a way to filter the output).</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 14 September 2017 at 15:14, Prentice Bisbal <span dir="ltr"><<a href="mailto:pbisbal@pppl.gov" target="_blank">pbisbal@pppl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <p>Good question. I just checked using vmstat. When running xhpl on
      both systems, vmstat shows only zeros for si and so, even long
      after the performance degrades on the nfsroot instance. Just to be
      sure, I double-checked with top, which shows 0k of swap being
      used. <br><span class="HOEnZb"><font color="#888888">
    </font></span></p><span class="HOEnZb"><font color="#888888">
    <pre class="m_-8006342365568161241moz-signature" cols="72">Prentice</pre></font></span><div><div class="h5">
    <div class="m_-8006342365568161241moz-cite-prefix">On 09/13/2017 02:15 PM, Scott Atchley
      wrote:<br>
    </div>
    <blockquote type="cite">
      <div dir="ltr">Are you swapping?</div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Wed, Sep 13, 2017 at 2:14 PM, Andrew
          Latham <span dir="ltr"><<a href="mailto:lathama@gmail.com" target="_blank">lathama@gmail.com</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">
            <div dir="ltr">ack, so maybe validate you can reproduce with
              another nfs root. Maybe a lab setup where a single server
              is serving nfs root to the node. If you could reproduce in
              that way then it would give some direction. Beyond that it
              sounds like an interesting problem.</div>
            <div class="gmail_extra">
              <div>
                <div class="m_-8006342365568161241h5"><br>
                  <div class="gmail_quote">On Wed, Sep 13, 2017 at 12:48
                    PM, Prentice Bisbal <span dir="ltr"><<a href="mailto:pbisbal@pppl.gov" target="_blank">pbisbal@pppl.gov</a>></span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">Okay,
                      based on the various responses I've gotten here
                      and on other lists, I feel I need to clarify
                      things:<br>
                      <br>
                      This problem only occurs when I'm running our
                      NFSroot based version of the OS (CentOS 6). When I
                      run the same OS installed on a local disk, I do
                      not have this problem, using the same exact
                      server(s).  For testing purposes, I'm using
                      LINPACK, and running the same executable  with the
                      same HPL.dat file in both instances.<br>
                      <br>
                      Because I'm testing the same hardware with
                      different OS installations, this should rule out
                      the BIOS and faulty hardware as the cause.
                      This leads me to believe it's most likely a
                      software configuration issue, such as a kernel
                      tuning parameter or some other setting.<br>
                      <br>
                      These are Supermicro servers, and it seems they do
                      not provide CPU temps. I do see a chassis temp,
                      but not the temps of the individual CPUs. While I
                      agree that should be the first thing I look at,
                      it's not an option for me. Other tools like FLIR
                      and Infrared thermometers aren't really an option
                      for me, either.<br>
                      <br>
                      What software configuration, whether a kernel
                      parameter, the configuration of numad or cpuspeed, or
                      some other setting, could affect this?<span class="m_-8006342365568161241m_5099190104119760613HOEnZb"><font color="#888888"><br>
                          <br>
                          Prentice</font></span><span class="m_-8006342365568161241m_5099190104119760613im m_-8006342365568161241m_5099190104119760613HOEnZb"><br>
                        <br>
                        On 09/08/2017 02:41 PM, Prentice Bisbal wrote:<br>
                      </span><span class="m_-8006342365568161241m_5099190104119760613im m_-8006342365568161241m_5099190104119760613HOEnZb">
                        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">
                          Beowulfers,<br>
                          <br>
                          I need your assistance debugging a problem:<br>
                          <br>
                          I have a dozen servers that are all identical
                          hardware: SuperMicro servers with AMD Opteron
                          6320 processors. Ever since we upgraded to
                          CentOS 6, the users have been complaining of
                          wildly inconsistent performance across these
                          12 nodes. I ran LINPACK on these nodes, and
                          was able to duplicate the problem, with
                          performance varying from ~14 GFLOPS to 64
                          GFLOPS.<br>
                          <br>
                          I've identified that performance on the slower
                          nodes starts off fine, and then slowly
                          degrades throughout the LINPACK run. For
                          example, on a node with this problem, during
                          the first LINPACK test, I can see the performance
                          drop from 115 GFLOPS down to 11.3 GFLOPS. That
                          constant, downward trend continues throughout
                          the remaining tests. At the start of
                          subsequent tests, performance will jump up to
                          about 9-10 GFLOPS, but then drop to 5-6 GFLOPS
                          at the end of the test.<br>
                          <br>
                          Because of the nature of this problem, I
                          suspect this might be a thermal issue. My
                          guess is that the processor speed is being
                          throttled to prevent overheating on the "bad"
                          nodes.<br>
                          <br>
                          But here's the thing: this wasn't a problem
                          until we upgraded to CentOS 6. Where I work,
                          we use a read-only NFSroot filesystem for our
                          cluster nodes, so all nodes are mounting and
                          using the same exact read-only image of the
                          operating system. This only happens with these
                          SuperMicro nodes, and only with CentOS 6
                          on NFSroot. RHEL5 on NFSroot worked fine, and
                          when I installed CentOS 6 on a local disk, the
                          nodes worked fine.<br>
                          <br>
                          Any ideas where to look or what to tweak to
                          fix this? Any idea why this is only occurring
                          with RHEL 6 w/ NFS root OS?<br>
                          <br>
                        </blockquote>
                        <br>
                      </span>
                      <div class="m_-8006342365568161241m_5099190104119760613HOEnZb">
                        <div class="m_-8006342365568161241m_5099190104119760613h5">
                          ______________________________<wbr>_________________<br>
                          Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a>
                          sponsored by Penguin Computing<br>
                          To change your subscription (digest mode or
                          unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank" rel="noreferrer">http://www.beowulf.org/mailman<wbr>/listinfo/beowulf</a><br>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                  <br>
                  <br clear="all">
                  <div><br>
                  </div>
                </div>
              </div>
              <span>-- <br>
                <div class="m_-8006342365568161241m_5099190104119760613gmail_signature" data-smartmail="gmail_signature">
                  <div dir="ltr">
                    <div>
                      <div dir="ltr">- Andrew "lathama" Latham <a href="mailto:lathama@gmail.com" target="_blank">lathama@gmail.com</a>
                        <a href="http://lathama.org" target="_blank">http://lathama.com</a> -</div>
                    </div>
                  </div>
                </div>
              </span></div>
            <br>
            <br>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </div></div></div>

<br></blockquote></div><br></div>
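<div>P.S. The raw output of "watch cat /proc/interrupts" is hard to read on a many-core node. What follows is only a rough sketch of one way to filter it, assuming Python is available on the node and that /proc/interrupts has its usual layout (a header row of CPU names, then one row per interrupt source). The function names are made up for illustration. It takes two snapshots and ranks the interrupt sources by interrupts per second, so the stateful and nfsroot cases can be compared side by side.</div>

```python
import time


def parse_interrupts(text):
    """Return {irq_label: (total_count, description)} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Per-CPU counters come first; some rows (e.g. ERR) have fewer columns.
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def busiest(before, after, seconds, top=5):
    """Return the `top` sources as (rate_per_second, label, description), highest first."""
    rates = []
    for label, (count, description) in after.items():
        delta = count - before.get(label, (0, ""))[0]
        rates.append((delta / float(seconds), label, description))
    rates.sort(reverse=True)
    return rates[:top]


def sample_rates(interval=5.0, path="/proc/interrupts", top=5):
    """Take two snapshots `interval` seconds apart and return the busiest sources."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_interrupts(f.read())
    return busiest(first, second, interval, top)
```

<div>Calling sample_rates() during an xhpl run on a stateful node and again on an nfsroot node should show whether the network card's interrupt rate differs much between the two.</div>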