[Beowulf] Strange error, gluster/ext4/zone_reclaim_mode
diep at xs4all.nl
Fri Aug 31 02:47:49 PDT 2012
It seems to be a kernel paging problem. Maybe a file manager or other
piece of software had allocated too many shared memory pages?
This is easy to check by executing 'ipcs' on every node.
I saw some strange things there in the kernel used by Scientific Linux
6.2 - even after shared memory segments had been deleted, it kept
remembering them after a reboot.
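For reference, the check suggested above can be sketched like this; the
segment id in the cleanup line is hypothetical, take real ids from the
ipcs output:

```shell
# List System V shared memory segments on this node:
# key, shmid, owner, perms, bytes, nattch (attached processes).
ipcs -m

# Show attach/detach/change times, useful for spotting stale segments
# left behind by crashed jobs (nattch 0, old timestamps).
ipcs -m -t

# Remove a stale segment by its shmid (123456 is a made-up example;
# run as the segment owner or root).
# ipcrm -m 123456
```

Segments with zero attached processes that survive job completion are
the usual suspects.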
On Aug 30, 2012, at 9:24 PM, Jon Tegner wrote:
> We have this strange error. We run CFD calculations on a small cluster.
> Basically it consists of a bunch of machines connected to a file system.
> The file system consists of 4 servers, CentOS-6.2, ext4 and glusterfs
> (3.2.7) on top. InfiniBand is used for the interconnect.
> For scheduling/resource management we use torque/maui, and
> typically we
> submit jobs in a torque submit script like:
> mpirun -machinefile bla bla bla
> However, at one point one of the machines serving the file system went
> down, after spitting out error messages as indicated in
> We used the advice indicated in that link ("sysctl -w
> vm.zone_reclaim_mode=1"), and after that the file servers seem to run
> OK. This happened in the middle of summer, and a few weeks later we
> noticed a few strange things:
> 1. We had to change the torque submit script like
> ssh $(hostname) "mpirun -machinefile bla bla bla"
> 2. zone_reclaim_mode was set to 1 on all computational nodes (on the
> file servers this was done explicitly, NOT so on the computational
> nodes).
> 3. We have seen particularly lousy performance on one of our
> 4. The command "tail -f file" doesn't get updated properly.
> Any help/hints would be greatly appreciated!
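For anyone following along, the reclaim setting discussed above can be
inspected and changed like this (a minimal sketch; the sysctl name and
/proc path are the standard Linux ones):

```shell
# Read the current value: 0 = off, 1 = reclaim pages from the local
# NUMA zone before allocating off-node.
cat /proc/sys/vm/zone_reclaim_mode

# Equivalent via sysctl.
sysctl vm.zone_reclaim_mode

# Set it to 1 for the running kernel only, as in the advice the poster
# followed (requires root):
# sysctl -w vm.zone_reclaim_mode=1

# To persist across reboots, add the line
#   vm.zone_reclaim_mode = 1
# to /etc/sysctl.conf.
```

Note that the setting is per-boot unless persisted, so finding it at 1
on nodes where it was never set explicitly (point 2 above) is worth
investigating.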
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin