[StarCluster] Configure Nodes to submit jobs
greg
margeemail at gmail.com
Wed Oct 1 10:20:14 EDT 2014
Thanks. So I'm a bit confused about what's mounted where but it
appears /usr/local isn't shared?
root@master:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  2.8G  4.8G  37% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            3.7G  8.0K  3.7G   1% /dev
tmpfs           752M  196K  752M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            3.7G     0  3.7G   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdaa      414G  199M  393G   1% /mnt
root@master:~# touch /usr/local/testfile
root@master:~# ls /usr/local/
bin etc games include lib man sbin share src testfile <- here's my test file
root@master:~# ssh node001
root@node001:~# ls /usr/local
bin etc games include lib man sbin share src <- test file is missing!
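
The same check can be scripted instead of comparing directory listings by hand. The following is only a sketch: it assumes GNU coreutils `stat` (whose `-f -c %T` prints the filesystem type), the helper names `classify_fstype` and `check_shared` are made up for illustration, and the list of network filesystem types is an assumption that is certainly not exhaustive.

```shell
#!/bin/sh
# Hypothetical helper: map a filesystem type name to "shared" or "local".
# The set of network filesystem names below is an assumption, not exhaustive.
classify_fstype() {
    case "$1" in
        nfs*|cifs|smb*) echo "shared" ;;
        *) echo "local" ;;
    esac
}

# Hypothetical helper: report whether the filesystem holding a path is
# network-mounted. Assumes GNU stat, where "-f -c %T" prints the fs type.
check_shared() {
    fstype=$(stat -f -c %T "$1") || return 1
    echo "$1: $fstype ($(classify_fstype "$fstype"))"
}
```

Run on a compute node, `check_shared /usr/local` and `check_shared /home` should then differ if only the latter is exported from the master. Running it on the master itself is less informative, since the master holds the local copy of whatever it exports.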
On Wed, Oct 1, 2014 at 9:56 AM, Arman Eshaghi <arman.eshaghi at gmail.com> wrote:
> Please have a look at this http://linux.die.net/man/1/qconf or run
> "man qconf" command
>
> To check whether the scripts are available to a given host, you can run
> "df -h". The output shows which paths are mounted from
> an external host (your master node). If the script's path is not among
> them, you can move the script to one of the shared folders.
>
> All the best,
> Arman
>
>
> On Wed, Oct 1, 2014 at 5:10 PM, greg <margeemail at gmail.com> wrote:
>> Thanks Chris! I'll try those debugging techniques.
>>
>> So running "qconf -as <nodename>" turns that node into a job submitter?
>>
>> -Greg
>>
>> On Wed, Oct 1, 2014 at 8:09 AM, Chris Dagdigian <dag at bioteam.net> wrote:
>>>
>>> 'EQW' is a combination of multiple message states (e)(q)(w). The
>>> standard "qw" is familiar to everyone, the E indicates something bad at
>>> the job level.
>>>
>>> There are multiple levels of debugging, starting with easy and getting
>>> more cumbersome. Almost all require admin or sudo level access.
>>>
>>> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
>>> is in EQW state, that should provide a bit more information about what
>>> went wrong.
>>>
>>> After that you look at the .e and .o STDERR/STDOUT files from the script
>>> if any were created
>>>
>>> After that you can use sudo privs to go into
>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file, there
>>> are also per-node messages files you can look at as well.
>>>
>>> The next level of debugging after that usually involves setting the
>>> sge_execd parameter KEEP_ACTIVE=true which triggers a behavior where SGE
>>> will stop deleting the temporary files associated with a job life cycle.
>>> Those files live down in the SGE spool at location
>>> <executionhost>/active_jobs/<jobID>/ -- and they are invaluable in
>>> debugging nasty subtle job failures.
>>>
>>> EQW should be easy to troubleshoot though - it indicates a fatal error
>>> right at the beginning of the job dispatch or execution process. No
>>> subtle things there
>>>
>>>
>>> And if your other question was about nodes being allowed to submit jobs
>>> -- yes you have to configure this. It can be done during SGE install
>>> time or any time afterwards by doing "qconf -as <nodename>" from any
>>> account with SGE admin privs. I have no idea if StarCluster does this
>>> automatically or not but I'd expect that it probably does. If not, it's
>>> an easy fix.
>>>
>>> -Chris
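
The submit-host step described above can be sketched as a small shell helper. This is an illustration under assumptions, not something StarCluster is known to ship: the function names `is_submit_host` and `ensure_submit_host` are hypothetical, and the only SGE commands used are `qconf -ss` (list submit hosts) and `qconf -as <nodename>` (add one), both of which need an account with SGE admin privileges.

```shell
#!/bin/sh
# Hypothetical helper: succeed if host $1 appears as an exact line in the
# host list read from stdin (for example, the output of "qconf -ss").
is_submit_host() {
    grep -Fxq -- "$1"
}

# Hypothetical helper: add host $1 as a submit host unless already listed.
# Requires SGE admin privileges; "qconf -ss" lists, "qconf -as" adds.
ensure_submit_host() {
    if qconf -ss | is_submit_host "$1"; then
        echo "$1 is already a submit host"
    else
        qconf -as "$1"
    fi
}
```

For example, `ensure_submit_host node001` run as an SGE admin on the master. The match is exact, so if `qconf -ss` reports fully qualified names the short name will not match and the helper will simply try `qconf -as` again.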
>>>
>>>
>>> greg wrote:
>>>> Hi guys,
>>>>
>>>> I'm afraid I'm still stuck on this. Besides my original question,
>>>> which I'm still not sure about, does anyone have any general advice
>>>> on debugging an EQW state? The same software runs fine in our local
>>>> cluster.
>>>>
>>>> thanks again,
>>>>
>>>> Greg
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster