[Beowulf] Strange error, gluster/ext4/zone_reclaim_mode

Fri Aug 31 02:47:49 PDT 2012

Hi Jon,

This looks like a kernel page allocation problem. Perhaps a file manager
or some other software has allocated too many shared memory pages?
That is easy to check by running 'ipcs' on every node.
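
A minimal Python sketch of that check, in case it helps. It assumes the
column layout printed by the util-linux 'ipcs -m' (key, shmid, owner,
perms, bytes, nattch, status); the function names are made up for
illustration and the layout may differ on other systems.

```python
import subprocess


def parse_ipcs_shm(text):
    """Parse `ipcs -m` output into a list of (owner, bytes, nattch) tuples."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a hex key such as 0x00000000; banner and
        # header lines do not, so they are skipped here.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        segments.append((fields[2], int(fields[4]), int(fields[5])))
    return segments


def summarize(segments):
    """Return (total bytes, number of segments with no attached process)."""
    total = sum(size for _, size, _ in segments)
    unattached = sum(1 for _, _, nattch in segments if nattch == 0)
    return total, unattached


def local_shm_report():
    """Run `ipcs -m` on this machine and summarize its shared memory use."""
    out = subprocess.run(["ipcs", "-m"], capture_output=True,
                         text=True, check=True).stdout
    return summarize(parse_ipcs_shm(out))
```

Calling local_shm_report() on each node (for example over ssh) and
comparing the totals should show whether one node holds an unusual amount
of shared memory. Segments with nattch 0 have no process attached.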

I have seen some strange behaviour there with the kernel used by
Scientific Linux 6.2: even after the shared memory pages were deleted,
it still remembered them after a reboot.

On Aug 30, 2012, at 9:24 PM, Jon Tegner wrote:

> Hi,
>
> We have a strange error. We run CFD calculations on a small cluster.
> Basically it consists of a bunch of machines connected to a file system.
> The file system consists of 4 servers running CentOS-6.2, with ext4 and
> glusterfs (3.2.7) on top. Infiniband is used for the interconnect.
>
> For scheduling/resource management we use torque/maui, and typically we
> submit jobs from a torque submit script like:
>
> mpirun -machinefile bla bla bla
>
> However, at one point one of the machines serving the file system went
> down, after spitting out error messages like those described in
>
> https://bugzilla.redhat.com/show_bug.cgi?id=770545
>
> We followed the advice given in that link ("sysctl -w
> vm.zone_reclaim_mode=1"), and since then the file servers seem to run
> OK. This happened in the middle of summer, and a few weeks later we
> noticed a few strange things:
>
> 1. We had to change the torque submit script to something like
>
> ssh $(hostname) "mpirun -machinefile bla bla bla"
>
> 2. zone_reclaim_mode was set to 1 on all computational nodes (on the
> file servers this was done explicitly, but NOT on the computational nodes).
>
> 3. We have seen particularly lousy performance on one of our applications.
>
> 4. The output of "tail -f file" doesn't get updated properly.
>
> Any help/hints would be greatly appreciated!
>
> Regards,
>
> /jon