[Beowulf] Strange error, gluster/ext4/zone_reclaim_mode
tegner at renget.se
Thu Aug 30 22:11:50 PDT 2012
Hi! And thanks for the answer, much appreciated!
On 08/31/2012 12:47 AM, Mark Hahn wrote:
>> However, at one point one of the machines serving the file system went
>> down, after spitting out error messages as indicated in
>> We used the advice indicated in that link ("sysctl -w
>> vm.zone_reclaim_mode=1"), and after that the file servers seems to run
> this seems to be quite dependent on hardware, architecture, workload,
> kernel. did you notice any performance problems? or try the
> vm.min_kbytes_free angle? (also, does this server have swap?)
We use the standard CentOS 6.2 kernel (2.6.32-220.17.1), and I didn't
notice anything strange on the servers (except the error messages before
changing zone_reclaim_mode). Regarding min_free_kbytes, it is 225280 on
the computational nodes (132 G total memory) and 90112 on the file
servers (32 G total memory). Is this something I should try changing?
The servers have swap (64 G).
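
To put those min_free_kbytes values in perspective, here is a minimal
sketch that reads the two sysctls from /proc/sys/vm and expresses the
reserve as a fraction of total memory. The helper names
(read_vm_sysctl, reserve_fraction) are made up for illustration, and it
assumes that "G" above means GiB.

```python
from pathlib import Path

KIB_PER_GIB = 1024 * 1024


def read_vm_sysctl(name, root="/proc/sys/vm"):
    """Return the integer value of a vm.* sysctl, or None if it is not readable."""
    try:
        return int(Path(root, name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def reserve_fraction(min_free_kbytes, total_gib):
    """Fraction of total RAM covered by vm.min_free_kbytes (total given in GiB)."""
    return min_free_kbytes / (total_gib * KIB_PER_GIB)


if __name__ == "__main__":
    # Current settings on the machine this runs on (None if not available).
    for name in ("zone_reclaim_mode", "min_free_kbytes"):
        print(f"vm.{name} = {read_vm_sysctl(name)}")

    # Values quoted in this mail; "G" is assumed to mean GiB.
    for label, kbytes, gib in (("compute node", 225280, 132),
                               ("file server", 90112, 32)):
        pct = 100 * reserve_fraction(kbytes, gib)
        print(f"{label}: {pct:.2f}% of RAM reserved")
```

Under that assumption the reserve comes out at roughly 0.16% of memory
on the computational nodes and 0.27% on the file servers.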
>> 1. We had to change the torque submit script like
>> ssh $(hostname) "mpirun -machinefile bla bla bla"
> I think this is unrelated. are you sure nothing changed, torque-wise,
> even its qmgr-level config? (or mpi versions/config.)
I agree, it seems unrelated, but I can't find anything else that has
>> 3. We have seen particularly lousy performance on one of our
> does it do a lot of file IO?
No. When profiling this I noticed that one particular operation,
computing the gradient of a vector field, took too long, and the time
to complete it also varies substantially between iterations. However,
when I performed the operation a second time (an extra "dummy
operation"), the second run was NOT that slow. Could this indicate
that it has something to do with how the memory is handled?
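
One way to make the first-call versus second-call comparison more
systematic would be to time both calls separately in every iteration
and compare the spread. The following is only a sketch: time_pairs and
summarize are hypothetical helper names, and the stand-in kernel is a
simple 1-D difference, not the actual gradient routine.

```python
import time


def time_pairs(op, iterations):
    """Run op() twice per iteration; return (first_times, second_times) in seconds."""
    first, second = [], []
    for _ in range(iterations):
        for bucket in (first, second):
            start = time.perf_counter()
            op()
            bucket.append(time.perf_counter() - start)
    return first, second


def summarize(times):
    """Return (min, mean, max) of a non-empty list of timings."""
    return min(times), sum(times) / len(times), max(times)


if __name__ == "__main__":
    data = [float(i) for i in range(200_000)]

    def stand_in_gradient():
        # Hypothetical placeholder: a 1-D forward difference,
        # not the real solver kernel.
        return [b - a for a, b in zip(data, data[1:])]

    first, second = time_pairs(stand_in_gradient, iterations=10)
    print("first call  (min, mean, max):", summarize(first))
    print("second call (min, mean, max):", summarize(second))
```

If the first call is consistently slower and noisier than the repeat,
that would be consistent with a memory placement or caching effect
rather than with the computation itself, though it does not prove it.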
Also, we have used a very similar setup previously, but where all
machines were running CentOS-5, and then we didn't see these strange