From wrzhu at etinternational.com Mon Mar 13 22:23:04 2006
From: wrzhu at etinternational.com (Weirong Zhu)
Date: Tue Nov 9 01:14:28 2010
Subject: [scyld-users] how to let bbq migrate batch jobs to compute nodes?
Message-ID: <44166148.8030701@etinternational.com>

We have just got our new Penguin Computing cluster. Since one of our main purposes is to submit a lot of batch jobs to the cluster, I tried to learn how to use the bbq system provided by Scyld.

As a simple test:

(1) I wrote a C program with a while(1) loop, and compiled it to generate the binary a.out.
(2) I wrote a simple job file, named run, containing only the command "./a.out".
(3) I submitted the job with "batch now -f run".
(4) I repeated step (3) many times.

Then, using the "bbq" command, I saw a lot of jobs listed, and I assumed those jobs would be migrated to the compute nodes. However, when I use "beostat -C", I find that all the compute nodes are actually idle, and all those instances are running on the master node.

Did I do something wrong when submitting my simple batch jobs? What should I do?

Moreover, I tried to use "atrm" to delete my jobs from the queue. After that, the "bbq" command shows nothing in the queue. However, "top" and "ps -fu myname" show those jobs still running on the master node. Did I do something wrong when deleting a batch job from the queue? What should I do?

I am really confused by the bbq batch system, and it seems that there is no PBS available on this cluster.

Any help and suggestions are welcome!

Regards,
Weirong

From wrzhu at etinternational.com Tue Mar 14 10:24:38 2006
From: wrzhu at etinternational.com (Weirong Zhu)
Date: Tue Nov 9 01:14:28 2010
Subject: [scyld-users] how to let bbq migrate batch jobs to compute nodes?
In-Reply-To: References: Message-ID: <44170A66.1020709@etinternational.com>

Thanks for the information. This seems a feasible method. Thanks very much.

Have you ever tried bbq? With a batch system, it will be easier to control jobs.
For example, you can watch the status of all the jobs in the queue, and easily delete a job from the queue. If we do the batching ourselves, we need to use "ps" to find the corresponding pid and kill the job.

So I still want to know how to make use of "bbq".

-- Weirong

Bishop, Ryan S SAJ Contractor wrote:
>We use a batch script that sets a variable, something like "NODE='beomap
>--nolocal'" and then runs a bpsh NODE [command]. That will issue the job to
>the next free node. YMMV - make sure to check out the beomap and bpsh man
>pages.
>
>--Schuyler
>
>[snip - original message quoted above]
>
>_______________________________________________
>Scyld-users mailing list, Scyld-users@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/scyld-users

From wrzhu at etinternational.com Tue Mar 14 10:45:18 2006
From: wrzhu at etinternational.com (Weirong Zhu)
Date: Tue Nov 9 01:14:28 2010
Subject: [scyld-users] how to let bbq migrate batch jobs to compute nodes?
In-Reply-To: References: Message-ID: <44170F3E.5040606@etinternational.com>

I just tried this method. It's working fine with "bbq" now. Thanks very much.

One more question: after I use "atrm" to delete jobs from the queue, those jobs disappear from the queue. However, when I do a "ps", they are still there. Any suggestions?

Thanks for your help!

-- Weirong

Bishop, Ryan S SAJ Contractor wrote:
>We use the methodology I described below with batch, which feeds bbq. Works
>like a charm.
>
>[snip - earlier messages in the thread quoted above]

From Ryan.S.Bishop at saj02.usace.army.mil Tue Mar 14 09:32:57 2006
From: Ryan.S.Bishop at saj02.usace.army.mil (Bishop, Ryan S SAJ Contractor)
Date: Tue Nov 9 01:14:28 2010
Subject: [scyld-users] how to let bbq migrate batch jobs to compute nodes?
Message-ID:

We use a batch script that sets a variable, something like "NODE='beomap --nolocal'" and then runs a bpsh NODE [command]. That will issue the job to the next free node. YMMV - make sure to check out the beomap and bpsh man pages.

--Schuyler

[snip - original message quoted above]

From Ryan.S.Bishop at saj02.usace.army.mil Tue Mar 14 10:31:11 2006
From: Ryan.S.Bishop at saj02.usace.army.mil (Bishop, Ryan S SAJ Contractor)
Date: Tue Nov 9 01:14:29 2010
Subject: [scyld-users] how to let bbq migrate batch jobs to compute nodes?
Message-ID:

We use the methodology I described below with batch, which feeds bbq. Works like a charm.

[snip - earlier messages in the thread quoted above]

From akerstens at utep.edu Tue Mar 14 16:29:40 2006
From: akerstens at utep.edu (Andre Kerstens)
Date: Tue Nov 9 01:14:29 2010
Subject: [scyld-users] Problem running Charmm on Scyld cluster
Message-ID: <44175FF4.1040105@utep.edu>

Hello all,

We have recently bought a Penguin cluster with Scyld release 29cz (29cz-5u0001 200506091805) on it and are trying to get a statically compiled version of Charmm to run on the nodes. The problem is that Charmm runs fine on the master node, but segfaults as soon as it is migrated to a compute node. From the strace below you can see that the segfault happens after the library /lib64/ld-linux-x86-64.so.2 cannot be found (it exists on the master node and is exported to the nodes in /etc/beowulf/config, though).
[akerstens@cluster 3ptb_1000110]$ bpsh 1 ./strace ./charmm64 execve("./charmm64", ["./charmm64"], [/* 22 vars */]) = 0 uname({sys="Linux", node=".1", ...}) = 0 brk(0) = 0x17a6dcc0 brk(0x17a8ecc0) = 0x17a8ecc0 brk(0x17a8f000) = 0x17a8f000 times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 180579094 times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 180579094 times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 180579094 access("charmm.inp", F_OK) = 0 open("charmm.inp", O_RDWR) = 3 fstat(3, {st_mode=S_IFREG|0775, st_size=23616, ...}) = 0 access("charmm.out", F_OK) = -1 ENOENT (No such file or directory) open("charmm.out", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4 fstat(4, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0 open("/etc/localtime", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=877, ...}) = 0 fstat(5, {st_mode=S_IFREG|0644, st_size=877, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a95556000 read(5, "TZif\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\4\0"..., 4096) = 877 close(5) = 0 munmap(0x2a95556000, 4096) = 0 readlink("/proc/self/fd/0", "socket:[5816]", 511) = 13 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fbfffee90) = -1 EINVAL (Invalid argument) getuid() = 500 socket(PF_UNIX, SOCK_STREAM, 0) = 5 fcntl(5, F_GETFL) = 0x2 (flags O_RDWR|O_LARGEFILE) fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(5, {sa_family=AF_UNIX, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(5) = 0 socket(PF_UNIX, SOCK_STREAM, 0) = 5 fcntl(5, F_GETFL) = 0x2 (flags O_RDWR|O_LARGEFILE) fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(5, {sa_family=AF_UNIX, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(5) = 0 open("/etc/nsswitch.conf", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=175, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a95556000 read(5, "# Generated by node_up for Scyld"..., 4096) = 
175 read(5, "", 4096) = 0 close(5) = 0 munmap(0x2a95556000, 4096) = 0 open("/etc/ld.so.cache", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=144816, ...}) = 0 mmap(NULL, 144816, PROT_READ, MAP_PRIVATE, 5, 0) = 0x2a95556000 close(5) = 0 open("/lib64/libnss_beo.so.2", O_RDONLY) = 5 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20(\0\0"..., 640) = 640 fstat(5, {st_mode=S_IFREG|0755, st_size=40535, ...}) = 0 mmap(NULL, 1079320, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x2a9557a000 madvise(0x2a9557a000, 1079320, MADV_SEQUENTIAL|0x1) = 0 mprotect(0x2a95581000, 1050648, PROT_NONE) = 0 mmap(0x2a95681000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x7000) = 0x2a95681000 close(5) = 0 open("/lib64/libc.so.6", O_RDONLY) = 5 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\327"..., 640) = 640 fstat(5, {st_mode=S_IFREG|0755, st_size=1567579, ...}) = 0 mmap(NULL, 2377064, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x2a95682000 madvise(0x2a95682000, 2377064, MADV_SEQUENTIAL|0x1) = 0 mprotect(0x2a957bd000, 1086824, PROT_NONE) = 0 mmap(0x2a958bd000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x13b000) = 0x2a958bd000 mmap(0x2a958c2000, 17768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2a958c2000 close(5) = 0 open("/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) open("/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) stat("/lib64", {st_mode=S_IFDIR|0755, st_size=440, ...}) = 0 open("/usr/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) stat("/usr/lib64", {st_mode=S_IFDIR|0755, st_size=460, ...}) = 0 munmap(0x2a95556000, 144816) = 0 munmap(0x2a9557a000, 1079320) = 0 munmap(0x2a95682000, 2377064) = 0 open("/etc/ld.so.cache", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=144816, ...}) = 0 mmap(NULL, 144816, PROT_READ, MAP_PRIVATE, 5, 0) = 0x2a95556000 
close(5) = 0 open("/lib64/libnss_bproc.so.2", O_RDONLY) = 5 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0%\0\0\0"..., 640) = 640 fstat(5, {st_mode=S_IFREG|0755, st_size=30705, ...}) = 0 mmap(NULL, 1070784, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x2a9557a000 madvise(0x2a9557a000, 1070784, MADV_SEQUENTIAL|0x1) = 0 mprotect(0x2a95580000, 1046208, PROT_NONE) = 0 mmap(0x2a9567f000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x5000) = 0x2a9567f000 close(5) = 0 open("/lib64/libc.so.6", O_RDONLY) = 5 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\327"..., 640) = 640 fstat(5, {st_mode=S_IFREG|0755, st_size=1567579, ...}) = 0 mmap(NULL, 2377064, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x2a95680000 madvise(0x2a95680000, 2377064, MADV_SEQUENTIAL|0x1) = 0 mprotect(0x2a957bb000, 1086824, PROT_NONE) = 0 mmap(0x2a958bb000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x13b000) = 0x2a958bb000 mmap(0x2a958c0000, 17768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2a958c0000 close(5) = 0 open("/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) open("/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) munmap(0x2a95556000, 144816) = 0 munmap(0x2a9557a000, 1070784) = 0 munmap(0x2a95680000, 2377064) = 0 uname({sys="Linux", node=".1", ...}) = 0 getpid() = 3646 open("/etc/resolv.conf", O_RDONLY) = -1 ENOENT (No such file or directory) uname({sys="Linux", node=".1", ...}) = 0 stat("/etc/resolv.conf", 0x7fbfffedc0) = -1 ENOENT (No such file or directory) socket(PF_UNIX, SOCK_STREAM, 0) = 5 fcntl(5, F_GETFL) = 0x2 (flags O_RDWR|O_LARGEFILE) fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(5, {sa_family=AF_UNIX, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(5) = 0 socket(PF_UNIX, 
SOCK_STREAM, 0) = 5 fcntl(5, F_GETFL) = 0x2 (flags O_RDWR|O_LARGEFILE) fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(5, {sa_family=AF_UNIX, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(5) = 0 open("/etc/ld.so.cache", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=144816, ...}) = 0 mmap(NULL, 144816, PROT_READ, MAP_PRIVATE, 5, 0) = 0x2a95556000 close(5) = 0 open("/lib64/libnss_files.so.2", O_RDONLY) = 5 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200%\0\0"..., 640) = 640 fstat(5, {st_mode=S_IFREG|0755, st_size=57649, ...}) = 0 mmap(NULL, 1096200, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x2a9557a000 madvise(0x2a9557a000, 1096200, MADV_SEQUENTIAL|0x1) = 0 mprotect(0x2a95586000, 1047048, PROT_NONE) = 0 mmap(0x2a95685000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0xb000) = 0x2a95685000 close(5) = 0 open("/lib64/libc.so.6", O_RDONLY) = 5 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\327"..., 640) = 640 fstat(5, {st_mode=S_IFREG|0755, st_size=1567579, ...}) = 0 mmap(NULL, 2377064, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x2a95686000 madvise(0x2a95686000, 2377064, MADV_SEQUENTIAL|0x1) = 0 mprotect(0x2a957c1000, 1086824, PROT_NONE) = 0 mmap(0x2a958c1000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x13b000) = 0x2a958c1000 mmap(0x2a958c6000, 17768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2a958c6000 close(5) = 0 open("/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) open("/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) munmap(0x2a95556000, 144816) = 0 munmap(0x2a9557a000, 1096200) = 0 munmap(0x2a95686000, 2377064) = 0 --- SIGSEGV (Segmentation fault) @ 0 (0) --- +++ killed by SIGSEGV +++ Master node: [akerstens@sacagawea 3ptb_1000110]$ ll 
/lib64/ld-linux-x86-64.so.2
lrwxrwxrwx  1 root  root  11 Sep 29 21:18 /lib64/ld-linux-x86-64.so.2 -> ld-2.3.2.so
[akerstens@sacagawea 3ptb_1000110]$ ll /lib64/ld-2.3.2.so
-rwxr-xr-x  1 root  root  100772 May 13 2005 /lib64/ld-2.3.2.so

Since the Charmm binary is static, it seems that bpsh is looking for this library and cannot find it on the compute nodes somehow.

Did anybody have this problem before and know what is going on? Any help is appreciated, as I haven't gotten any pointers from Penguin support for two weeks now.

Thanks

Andre Kerstens

--
---------------------------------------------------------------
Andre Kerstens
The University of Texas at El Paso
College of Engineering

The best way to predict the future is to invent it. --- Alan Kay
---------------------------------------------------------------

From jbernstein at scyld.com Wed Mar 15 16:32:51 2006
From: jbernstein at scyld.com (Joshua Bernstein)
Date: Tue Nov 9 01:14:29 2010
Subject: [Fwd: [scyld-users] Problem running Charmm on Scyld cluster]
In-Reply-To: <4418AF3A.3000808@scyld.com>
References: <4418AF3A.3000808@scyld.com>
Message-ID: <4418B233.6060607@jellyfish.highlyscyld.com>

Andre,

I want to apologize for support not getting back to you sooner. I am the Scyld engineer that support passed your case on to. I've validated CHARMM on Scyld and would like to make sure you are good to go.

To summarize, you have to copy the needed libraries out to the compute nodes because CHARMM is doing an open() rather than a dlopen(). Since libcache only traps proper dlopen() calls rather than open() calls, opening the library fails, and hence the segfault.
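The diagnosis can be read straight out of an strace log like the one earlier in this thread: a raw open() of a shared library that fails with ENOENT is exactly the kind of access the dlopen()-based library cache never sees. A minimal sketch of how to pull those lines out of a capture; the file trace.log and its contents here are hypothetical, modeled on the trace above:

```shell
# Build a tiny sample strace log (contents are hypothetical, modeled on
# the trace earlier in this thread).
cat > trace.log <<'EOF'
open("/lib64/libc.so.6", O_RDONLY) = 5
open("/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
getpid() = 3646
EOF

# List the shared-library open() calls that failed with ENOENT -- these
# are the files that would need to be staged onto the compute node.
grep 'open(.*\.so.*ENOENT' trace.log
```

The surviving lines name the libraries to copy out with bpcp, as in the script that follows later in this thread.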
-Joshua Bernstein
Software Engineer
Scyld Software

> -------- Original Message --------
> Subject: [scyld-users] Problem running Charmm on Scyld cluster
> Date: Tue, 14 Mar 2006 17:29:40 -0700
> From: Andre Kerstens
> Organization: UTEP
> To: scyld-users@beowulf.org
>
> [snip - original message and strace output quoted above in full]
>
> Thanks
>
> Andre Kerstens


From jbernstein at scyld.com  Mon Mar 27 12:02:52 2006
From: jbernstein at scyld.com (Joshua Bernstein)
Date: Tue Nov  9 01:14:29 2010
Subject: [scyld-users] Problem running Charmm on Scyld cluster
Message-ID: <442844EC.4070203@jellyfish.highlyscyld.com>

Hello Andre,

I've CC'd the list on a summary of the steps required to support Charmm
on the compute nodes, so that everyone on the list can benefit.

Below is the script. Copy it into /etc/beowulf/init.d, name it charmm,
and make it executable:

$ chmod +x /etc/beowulf/init.d/charmm

Then reboot the compute nodes (as root):

# bpctl -S all -R

NOTE: you may choose to reboot just one node first to see if the script
is working, and then reboot the rest of the cluster. To reboot, say,
node 4, do something like:

# bpctl -S 4 -R

Run your two tests to see if the script worked. If it does, I will CC
this e-mail out to the mailing list:

-----
#!/bin/bash
#############
# A script to pre-cache the required libraries for both the 32-bit
# and 64-bit versions of a static charmm binary.
#
# Written for Scyld cz-5
#
# By: Joshua Bernstein (jbernstein@scyld.com)
#     Scyld Software
#
# Use: Put in /etc/beowulf/init.d and reboot the compute nodes.
#      Make sure the script is set to be executable (chmod +x charmm).
#############

[ "$NODE" == "" ] && NODE=$1

if [ "$NODE" == "" ]; then
    echo "No node specified"
    exit 1
fi

# Here for 64-bit static charmm binary support
bpcp /lib64/libnss_files.so.2     $NODE:/lib64/libnss_files.so.2
bpcp /lib64/libnss_files-2.3.2.so $NODE:/lib64/libnss_files-2.3.2.so
bpcp /lib64/libc.so.6             $NODE:/lib64/libc.so.6
bpcp /lib64/ld-linux-x86-64.so.2  $NODE:/lib64/ld-linux-x86-64.so.2
bpcp /lib64/ld-2.3.2.so           $NODE:/lib64/ld-2.3.2.so
----SNIP----

To run the 32-bit CHARMM binary, you will also need the 32-bit
libraries, which should already be installed in /opt/lib32. If they
aren't, you'll have to install them from the CD before you continue
with everything below.
After the libraries are installed, you'll need to make the 32-bit
libraries available on the compute nodes. Your /etc/exports file should
look like this:

/home @cluster(rw)
/opt  @cluster(rw)

Notice the addition of /opt. Now, as root, run:

# exportfs -a
# /etc/init.d/nfs restart

Next, edit /etc/beowulf/fstab to make the compute nodes mount /opt in
the proper location. The end of my file looks like this:

$MASTER:/home /home nfs nolock,nonfatal 0 0
$MASTER:/opt  /opt  nfs nolock,nonfatal 0 0

Again, notice the added line for /opt. The next two commands reboot the
compute nodes and restart the Beowulf service. They must be typed in
quick succession to avoid having to manually power-cycle the nodes in
the machine room:

# bpctl -S all -R
# /etc/init.d/beowulf restart

Now the 32-bit CHARMM should work as expected...

-Joshua Bernstein
Software Engineer
Scyld Software
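[Editor's note on the debugging step earlier in this thread: the missing-file opens in an strace log like the one Andre posted can be extracted mechanically by filtering for ENOENT. A minimal sketch, demonstrated on two sample lines; the printf stands in for a real captured trace file, which you would pipe in instead:]

```shell
# Print the unique paths of open() calls that failed with ENOENT.
# The two printf lines below stand in for a captured strace log.
printf '%s\n' \
  'open("/lib64/ld-linux-x86-64.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)' \
  'open("/etc/ld.so.cache", O_RDONLY) = 5' |
grep 'ENOENT' |
sed -n 's/^open("\([^"]*\)".*/\1/p' |
sort -u
```

Each path this prints is a candidate for copying out to the compute nodes, as the charmm pre-cache script in this thread does with bpcp.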
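[Editor's note: the repeated bpcp lines in the charmm pre-cache script can also be driven by a loop over a single library list, which is easier to extend when a trace turns up another missing file. This is only a sketch, not a drop-in replacement: COPY and DEST are hypothetical knobs that default to plain cp and a scratch directory so the loop can be tried off-cluster; in a real /etc/beowulf/init.d script you would use bpcp and "$NODE:/lib64" instead.]

```shell
#!/bin/sh
# Loop-driven variant of the pre-cache idea. COPY and DEST are
# hypothetical stand-ins: they default to cp and a scratch directory so
# this runs anywhere; on a Scyld master you would set COPY=bpcp and
# DEST="$NODE:/lib64".
COPY=${COPY:-cp}
DEST=${DEST:-/tmp/charmm-libs-demo}
mkdir -p "$DEST"

LIBS="libnss_files.so.2 libnss_files-2.3.2.so libc.so.6 ld-linux-x86-64.so.2 ld-2.3.2.so"

for lib in $LIBS; do
    if [ -e "/lib64/$lib" ]; then
        $COPY "/lib64/$lib" "$DEST/"
    else
        # Missing on the master too -- nothing to copy, so just warn.
        echo "warning: /lib64/$lib not found on master" >&2
    fi
done
```

The warning branch matters: silently skipping a library that is missing on the master would reproduce exactly the ENOENT failure seen in the strace above.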
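[Editor's note: since /etc/beowulf/fstab is edited by hand in the steps above, guarding the append makes the setup safe to re-run without duplicating the /opt line. A hypothetical sketch; FSTAB defaults to a scratch path so it is safe to try, and on the real master it would be /etc/beowulf/fstab:]

```shell
# Append the /opt mount line only if it is not already present.
# FSTAB is a stand-in path; use /etc/beowulf/fstab on the real master.
FSTAB=${FSTAB:-/tmp/beowulf-fstab-demo}
touch "$FSTAB"
LINE='$MASTER:/opt /opt nfs nolock,nonfatal 0 0'
# grep -F treats $MASTER literally, which is what the fstab contains.
grep -qF "$LINE" "$FSTAB" || printf '%s\n' "$LINE" >> "$FSTAB"
```

Running this twice leaves exactly one /opt line, so it can live in a setup script that gets re-run.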