<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=UTF-8" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Dmitri Chubarov wrote:

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap="">Hello,

we have got a VX50 down here as well. We have observed very different

scalability on different applications. With an OpenMP molecular

dynamics code we have got over 14 times speedup while on a 2D finite

difference scheme I could not get far beyond 3 fold.

  </pre>

</blockquote>

A 2D finite difference scheme can be communication-intensive if the mesh is too small for

each processor to have a fair amount of work to do before it needs the

neighboring values from a "far" node.<br>
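To put a rough number on that, here is a minimal sketch, assuming an n x n mesh split into horizontal strips across p threads and a 5-point stencil. The function name and the strip decomposition are my own assumptions for illustration, not something taken from Dmitri's code.<br>

```c
#include <assert.h>

/* Ratio of halo points (read from neighbours) to interior points (updated
 * locally) for one interior strip of an n x n mesh split across p threads.
 * Hypothetical helper; assumes n is divisible by p. */
double halo_to_work_ratio(int n, int p)
{
    assert(n > 0 && p > 0 && n % p == 0);
    int rows = n / p;                 /* rows owned by one thread */
    double work = (double)rows * n;   /* points updated per sweep */
    double halo = 2.0 * n;            /* one neighbour row above, one below */
    return halo / work;               /* simplifies to 2p / n */
}
```

With 16 threads, a 1024 x 1024 mesh gives a ratio of about 3%, while a 64 x 64 mesh gives 50%, so on the small mesh half as many values come from a neighbour as are computed locally.<br>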

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap="">

On Tue, Oct 7, 2008 at 10:45 PM, Eric Thibodeau <a class="moz-txt-link-rfc2396E" href="mailto:kyron@neuralbs.com">&lt;kyron@neuralbs.com&gt;</a> wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">PS: Interesting figures, I couldn't resist compressing the same binary DB on

a 16Core Opteron (Tyan VX50) machine and was dumbfounded to get horrible

results given the same context. The processing speed only came up to 6.4E6

bytes/sec ...for 16 cores, and they were all at 100% during the entire run

(FWIW, I tried different block sizes and it does have an impact but this

also changes the problem parameters).

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Reading your message in the Beowulf list I should say that it looks

interesting and probably shows something happening with the memory

access on the NUMA nodes. Did you try to run the archiver with

different affinity settings?

  </pre>

</blockquote>

I don't have affinity control over the app per se; I would have to

look at and modify pbzip's code. Note, though, that the assignment of a PID to

a processor is governed by the kernel and is thus a scheduler issue.

Also, from what I have observed, the kernel doesn't move processes

around the cores just for the fun of it.<br>
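For reference, setting affinity from inside a program would look roughly like the sketch below. It is Linux-specific and uses the glibc sched_getaffinity/sched_setaffinity calls; the two helper names are hypothetical, and none of this comes from pbzip's code.<br>

```c
#define _GNU_SOURCE   /* must precede every #include to expose the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single CPU; returns 0 on success. */
int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t target;
    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    return sched_setaffinity(0, sizeof(target), &target);
}
```

Calling pin_to_cpu(first_allowed_cpu()) at the start of each worker thread, with a different CPU per thread, would be one way to test whether placement on the NUMA nodes changes the throughput.<br>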

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap="">

We have observed that the memory architecture shows some strange

behaviour. For instance the latency for a read from the same NUMA node

for different nodes varies significantly.

  </pre>

</blockquote>

This is the nature of NUMA. Furthermore, if you have to cross to a far

CPU, the latency also depends on that CPU's load.<br>

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap="">

Also on the profiler I often see that x86 instructions that have one

of the operands in memory may

take disproportionally long. I believe that could explain the 100% CPU

load reported by the kernel.

  </pre>

</blockquote>

How do you identify the specific instruction using a profiler? This is

something that interests me.<br>

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap="">

From the very little knowledge of this platform that we have got, I

tend to advise the users not to expect good speedup on their

multithreaded applications. </pre>

</blockquote>

Using OpenMP (from GCC 4.3.x) on an embarrassingly parallel problem

(computing K-Means on a large database), I do get a significant speedup

(15-16x).<br>

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap="">Yet it would be interesting to get a

better understanding of the programming techniques for this sedecimus

and the similar machines.</pre>

</blockquote>

OpenMP is IMHO the easiest option: it will bring you the most performance

for about 3 lines of #pragma directives. If you manage to get a cluster of

VX50s, then learn a bit of MPI to glue all of this together ;)<br>
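As an illustration of how little OpenMP asks for, here is a minimal sketch of the assignment step of K-Means on one-dimensional points. It is not my actual code; the function name and the 1D simplification are assumptions made to keep it short. Build with gcc -fopenmp; without that flag the pragma is ignored and the loop runs serially.<br>

```c
#include <assert.h>
#include <float.h>

/* Assign each of n one-dimensional points to the nearest of k centroids.
 * Iterations are independent, so a single pragma parallelizes the loop. */
void assign_clusters(const double *points, int n,
                     const double *centroids, int k, int *labels)
{
    assert(n >= 0 && k > 0);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        int best = 0;
        double best_d = DBL_MAX;
        for (int c = 0; c < k; c++) {
            double diff = points[i] - centroids[c];
            double d = diff * diff;          /* squared distance */
            if (d < best_d) { best_d = d; best = c; }
        }
        labels[i] = best;                    /* each thread writes its own i */
    }
}
```

Each thread only reads the shared arrays and writes its own labels[i], so no locking or reduction is needed, which is why this kind of loop scales well.<br>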

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap="">Even more so due to the QPI systems becoming

commercially available very soon.</pre>

</blockquote>

Don't know that one (QPI)...oh...new Intel stuff...no matter how much I

try to stay ahead, I'm always years behind!<br>

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap=""> At the moment we have got a few

small kernels written in C and Fortran with OpenMP that we use to

evaluate different parallelization strategies. Unfortunately, there

are no tools I would know of that could help to explain what's going

on inside the memory of this machine.

  </pre>

</blockquote>

Of course, check out TAU (

<a class="moz-txt-link-freetext" href="http://www.cs.uoregon.edu/research/tau/home.php">http://www.cs.uoregon.edu/research/tau/home.php</a> ), it will at least

help you identify bottlenecks and give you an impressive profiling

infrastructure.<br>

<blockquote

 cite="mid:a7b3ee1e0810071050g48f0895dvd5d8a9eb3c5b1381@mail.gmail.com"

 type="cite">

  <pre wrap="">

I am very much interested to hear more about your experience with VX50.

Best regards,

  Dima Chubarov

--

  Dmitri Chubarov

  junior researcher

  Siberian Branch of the Russian Academy of Sciences

  Institute of Computational Technologies

  <a class="moz-txt-link-freetext" href="http://www.ict.nsc.ru/indexen.php">http://www.ict.nsc.ru/indexen.php</a>

  </pre>

</blockquote>

<br>

Eric

</body>

</html>