DQS drops jobs on SuSE 6.3 cluster
Kris Thielemans kris.thielemans at csc.mrc.ac.uk
Fri Nov 3 06:43:48 PST 2000
Dear Michael,

thanks a lot for your hint. In the meantime I have been experimenting a bit more, and I now think the problem was due to something else, i.e. I haven't applied the patch (yet).

I observed that which queue actually executes the job depends on where I submitted it from. So I started to look at file systems. The example dqs.sh uses the -cwd flag to run the job in the current directory and to put the output files there as well. This will obviously only work when the current directory is mounted on all systems under identical names. The job runs only on those systems (i.e. queues) which happen to have a directory with the same name. To achieve this, I used symbolic links, but it looks like qsub resolves the symbolic link to its original name (which is NOT common to all systems).

I have the following setup:
- 4 Linux machines, called pp1, pp2, ...
- each has a /data partition, which it exports (for NFS)
- pp1 mounts the /data partitions of the other machines as /pp2-data etc.
- on pp1, I linked /data as /pp1-data.

Net result: on each machine, you can do 'cd /ppx-data/bla' and end up in the same physical location.

However, if I am on pp1 and do 'cd /pp1-data/bla' followed by 'qsub dqs.sh', it turns out that the job ONLY runs when it is assigned to a queue on pp1. Looking at the output of the job, I see that PWD was set to /data/bla, and not to /pp1-data/bla as expected. On the other hand, when I do exactly the same from (say) pp2, everything works fine.

[ I tested this by creating a /data/bla on pp2 as well. Then indeed the job runs in a queue on pp2 as well, with output in pp2:/data/bla, and not in pp1:/data/kris ]

So, at the moment, everything seems to work fine when I submit from a machine which mounts the cwd. An alternative solution would be to rename the partitions as pp1:/pp1-data, such that I wouldn't need the symbolic link.
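The symbolic-link effect described above can be reproduced outside DQS. The following is only a minimal sketch on a POSIX system, not a statement about how qsub is implemented: it recreates the pp1 layout in a scratch directory (the directory names mirror the ones in this message, everything else is illustrative) and shows that asking the operating system for the current directory returns the resolved name, not the name that was typed. Any program that records its working directory this way would pass on /data/bla rather than /pp1-data/bla.

```python
import os
import tempfile

# Scratch directory standing in for the root of pp1 (illustrative only).
root = os.path.realpath(tempfile.mkdtemp())

# "data" is the real directory; "pp1-data" is a symbolic link to it,
# mirroring the link described in the message.
os.makedirs(os.path.join(root, "data", "bla"))
os.symlink(os.path.join(root, "data"), os.path.join(root, "pp1-data"))

# The path as the user typed it, and the same path with links resolved.
logical = os.path.join(root, "pp1-data", "bla")
physical = os.path.realpath(logical)

# Enter the directory through the link, then ask the OS where we are.
os.chdir(logical)
reported = os.getcwd()

print("typed    :", logical)
print("resolved :", physical)
print("getcwd() :", reported)
```

On Linux the last two lines printed are identical and end in data/bla. Comparing the typed and resolved paths in this way before submitting is one simple check for whether the working directory name will exist on the other nodes.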
Personally, I find this behaviour of DQS with symbolic links unexpected, and worth putting in the documentation (or changing in the code...). Also, I would expect the non-existence of the cwd on a system to be flagged in the DQS err_file. That doesn't seem to happen, though. I'll wait to apply the patch till I discover other problems.

Many thanks,
Kris

>
> Dear Kris:
>
> I think I can help you with this. This behavior sounds like it is due to
> a known bug in DQS 3.3.1 (and presumably earlier versions), for which I
> have a patch from the DQS authors at Florida State University. I attach
> a portion of an email I received from DQS support a while back regarding
> this issue, which contains a context 'diff' of the necessary patch. I
> hope this helps. We are running DQS 3.3.1 on a Red Hat based cluster here
> and it works very well.
>
> On Thu, 2 Nov 2000, Kris Thielemans wrote:
>
> > Hi,
> >
> > I'm trying to get DQS running on our cluster of 4 SuSE 6.3 systems. I tried
> > 3 different versions of DQS
> > - the RPM package on the original CD
> > - the RPM package provided on the SuSE website to update it to fix a y2k
> >   problem (version 3.2.7)
> > - the newest version (3.3.1) from ftp.scri.fsu.edu (compiled from
> >   sources)
> >
> > All 3 versions have the same problem:
> > jobs are occasionally dropped from the queue, or even not started
> >
> > Symptoms:
> > qsub somejob.sh -> works ok
> > qstat -f -> lists job
> >
> > (a little bit later)
> > qstat -f -> job gone
> >
> > This happens with the simple dqs.sh example script that they provide for
> > testing.
> >
> > There is NO error message in the dqs err_file, or anything in the log_file.
> >
> > This problem also occurs when I disable all queues except 1 (on the same
> > node as the qmaster).
More information about the Beowulf mailing list
