[StarCluster] Configure Nodes to submit jobs
Jennifer Staab
jstaab at cs.unc.edu
Wed Oct 1 10:33:38 EDT 2014
For software and scripts, you can log in to each node and check that the software is installed and that you can find the scripts. Also make sure the user you ran qsub as has the permissions needed to run the scripts and software and to write output.
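Rather than logging in to each node by hand, a small loop can do the check. This is only a sketch: it assumes passwordless ssh between the nodes (as a root or cluster user) and that `qconf -sel` lists your execution hosts. The function name `check_on_nodes` is made up for illustration.

```shell
# Sketch: report whether a file exists and is executable on every
# execution host known to SGE. Assumes passwordless ssh to each host.
check_on_nodes() {
    target=$1
    rc=0
    # qconf -sel prints one execution host name per line
    for host in $(qconf -sel); do
        # ssh -n keeps ssh from reading stdin inside the loop
        if ssh -n "$host" "test -x '$target'"; then
            echo "$host: ok"
        else
            echo "$host: MISSING or not executable: $target"
            rc=1
        fi
    done
    return $rc
}
```

Run it on the master as the same user that submits the jobs, for example `check_on_nodes /home/myuser/run_job.sh` (a hypothetical path), so that permission problems show up too.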
An easier way is to use qstat -j <JOBID>, where <JOBID> is the ID of one of the jobs in Eqw state. That output, and/or the .o/.e files you set when you submitted the job with qsub, will point you to the error messages. If you didn't set output and error files in your qsub call (using the -o and -e options), I believe it defaults to files named <jobname>.o<JOBID> (for output) and <jobname>.e<JOBID> (for error). I believe Chris talked about this in his reply. This is how I discovered my scripts weren't shared: the .e file indicated the scripts couldn't be found.
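The qstat steps can be wrapped in two small helpers. This is a rough sketch, assuming the default qstat layout where the job state is the fifth column and that `qstat -j` prints lines beginning with "error reason" for a job in error state. The function names `eqw_jobs` and `job_error` are made up for illustration.

```shell
# Sketch: print the IDs of all jobs currently in Eqw state.
# Assumes the default qstat columns, where state is column 5.
eqw_jobs() {
    qstat -u '*' | awk '$5 == "Eqw" { print $1 }'
}

# Sketch: print only the "error reason" lines for one job ID.
job_error() {
    qstat -j "$1" | grep -i '^error reason'
}
```

Something like `for j in $(eqw_jobs); do job_error "$j"; done` then shows the reason for every stuck job in one pass.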
Also, as Chris said, you can use qconf to change attributes of the SGE setup. I have done this before on a StarCluster cluster: log in to the master node and, as long as you have admin privileges, you can use the qconf command to change the SGE configuration.
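For the original question about letting nodes submit jobs, the qconf calls might look like the sketch below. It assumes you run it on the master from an account with SGE admin privileges; the function name `ensure_submit_hosts` is made up for illustration. It uses `qconf -sel` (list execution hosts), `qconf -ss` (list submit hosts) and the `qconf -as` that Chris mentioned.

```shell
# Sketch: make every execution host a submit host as well.
# Needs SGE admin privileges for the qconf -as step.
ensure_submit_hosts() {
    for host in $(qconf -sel); do
        # -x matches the whole line, -F treats the host name literally
        if qconf -ss | grep -qxF "$host"; then
            echo "$host: already a submit host"
        else
            qconf -as "$host"
        fi
    done
}
```

Running `qconf -ss` before and after is a quick way to see whether anything actually needed to be added.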
Good Luck.
-Jennifer
greg <margeemail at gmail.com> wrote:
>Thanks Jennifer! Being completely new to StarCluster, how can I
>check that my scripts are available to all nodes?
>
>-Greg
>
>On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab at cs.unc.edu> wrote:
>> Greg -
>>
>> Maybe check that your software and scripts are available to all nodes. I
>> have had StarCluster throw a bunch of Eqw's when I accidentally didn't have
>> all the software and script components in a directory that was NFS-shared
>> to all the nodes of my cluster and/or installed individually on every node
>> of the cluster.
>>
>> And as Chris just stated use:
>> qstat -j <JOBID> ==> gives complete info on that job
>> qstat -j <JOBID> | grep error (looks for errors in job)
>>
>> When you get the error debugged you can use:
>> qmod -cj <JOBID> (clears the job's error state, e.g. Eqw, so it can be scheduled again)
>>
>> Good Luck.
>>
>> -Jennifer
>>
>>
>>
>> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>>
>>> 'EQW' is a combination of multiple message states (e)(q)(w). The
>>> standard "qw" is familiar to everyone, the E indicates something bad at
>>> the job level.
>>>
>>> There are multiple levels of debugging, starting with easy and getting
>>> more cumbersome. Almost all require admin or sudo level access.
>>>
>>> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
>>> is in EQW state, that should provide a bit more information about what
>>> went wrong.
>>>
>>> After that you look at the .e and .o STDERR/STDOUT files from the script
>>> if any were created
>>>
>>> After that you can use sudo privs to go into
>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file, there
>>> are also per-node messages files you can look at as well.
>>>
>>> The next level of debugging after that usually involves setting the
>>> sge_execd parameter KEEP_ACTIVE=true, which triggers a behavior where SGE
>>> will stop deleting the temporary files associated with a job life cycle.
>>> Those files live down in the SGE spool at location
>>> <executionhost>/active_jobs/<jobID>/ -- and they are invaluable in
>>> debugging nasty subtle job failures.
>>>
>>> EQW should be easy to troubleshoot though - it indicates a fatal error
>>> right at the beginning of the job dispatch or execution process. No
>>> subtle things there
>>>
>>>
>>> And if your other question was about nodes being allowed to submit jobs
>>> -- yes, you have to configure this. It can be done during SGE install
>>> time or any time afterwards by doing "qconf -as <nodename>" from any
>>> account with SGE admin privs. I have no idea if StarCluster does this
>>> automatically or not, but I'd expect that it probably does. If not, it's
>>> an easy fix.
>>>
>>> -Chris
>>>
>>>
>>> greg wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I'm afraid I'm still stuck on this. Besides my original question,
>>>> which I'm still not sure about, does anyone have any general advice
>>>> on debugging an EQW state? The same software runs fine in our local
>>>> cluster.
>>>>
>>>> thanks again,
>>>>
>>>> Greg
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster at mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>