[StarCluster] Configure Nodes to submit jobs

Jennifer Staab jstaab at cs.unc.edu
Wed Oct 1 16:53:27 EDT 2014


Ed and Hugh have it right.  As long as it is a Python module, you can use
the Python package installer plugin as Hugh suggested.  If it is a
library or set of scripts you wrote for your own use, you will need to
propagate that code to the nodes yourself.  And as Ed suggested, you
could set up an AMI with all the software you need and/or take an AMI
of the master node (since it is already set up the way you need) and
use that for your worker nodes.
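
If you do end up pushing a package out by hand, here is a rough sketch of
one way to do it from the master node.  It assumes the usual StarCluster
node aliases (master, node001, node002, ...), passwordless ssh between
nodes, and that pip is on each node's PATH.  The function names are just
made up for illustration, and by default it only prints the commands:

```python
import shlex
import subprocess


def node_aliases(num_workers):
    """Return aliases for the master plus num_workers workers (assumed naming)."""
    return ["master"] + ["node%03d" % i for i in range(1, num_workers + 1)]


def install_commands(package, num_workers):
    """Build one ssh command (as an argv list) per node."""
    remote = "pip install " + shlex.quote(package)
    return [["ssh", alias, remote] for alias in node_aliases(num_workers)]


def install_everywhere(package, num_workers, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in install_commands(package, num_workers):
        print(" ".join(argv))
        if not dry_run:
            subprocess.check_call(argv)
```

Calling install_everywhere("numpy", 2) prints the three ssh commands so
you can check them; pass dry_run=False once they look right.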

Also, from past (bad) experience: if you are going to take an AMI of a
running EC2 instance, stop it first, then take the AMI.  If your
instance isn't EBS backed, don't stop it or try to take the AMI.  My
fear is that taking the AMI will somehow reboot it (or that it will need
a reboot to function after the AMI is taken) and you could lose
everything.  Recall that stopping an instance wipes all
instance/ephemeral storage, so even on an EBS-backed instance be sure to
copy anything important off instance/ephemeral storage before stopping
it.  One final note: when you stop a StarCluster instance it tends to
drop all your mounts (like mounted EBS volumes), so you will likely need
to remount all EBS volumes after restarting it.  I believe Hugh
attributed this to the "self.detach_volumes()" line in the
"stop_cluster" definition in "cluster.py" of the StarCluster source code
(thanks again, Hugh; remounting EBS volumes is a total pain).  To see
what you have mounted before stopping the master node, run 'df -h' or
'lsblk' on the master node.
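
If you want to keep a record of those mounts so you can redo them later,
something like the sketch below might help.  It parses the output of
'df -P' (the POSIX one-line-per-filesystem format) and keeps only the
/dev/* entries.  The sample text and device names are made up for
illustration and are not from a real cluster:

```python
import subprocess


def parse_df(text):
    """Return (device, mount_point) pairs for /dev/* lines of `df -P` output."""
    mounts = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 5)
        if len(fields) == 6 and fields[0].startswith("/dev/"):
            mounts.append((fields[0], fields[5]))
    return mounts


def current_mounts():
    """Run `df -P` on this machine and parse its output."""
    output = subprocess.check_output(["df", "-P"]).decode()
    return parse_df(output)


# Made-up sample output, only to show the expected input format.
SAMPLE = """Filesystem     1024-blocks    Used Available Capacity Mounted on
/dev/xvda1         8256952 3000000   4800000      39% /
/dev/xvdz        103212320  192116  97777324       1% /data
tmpfs               847068       0    847068       0% /dev/shm
"""
```

Run current_mounts() on the master before stopping it and save the
result somewhere that survives the stop (not on ephemeral storage).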

Good Luck.

-Jennifer

On 10/1/14 1:48 PM, greg wrote:
> Hi everyone,
>
> I found the bug.  Apparently a library I installed in Python is only
> available on the master node.  What's a good way to install Python
> libraries so they're available on all nodes?  I guess virtualenv, but I'm
> hoping for something simpler :-)
>
> -Greg
>
> On Wed, Oct 1, 2014 at 10:33 AM, Jennifer Staab <jstaab at cs.unc.edu> wrote:
>> For software and scripts, you can log in to each node and check that the
>> software is installed and that you can find the scripts. Also check that
>> the user you qsub'ed the jobs with has the correct permissions to run the
>> scripts and software and to write output.
>>
>> An easier way is to use qstat -j <JOBID>, where <JOBID> is the job ID of
>> one of the jobs with EQW status. That output and/or the .o/.e files you
>> set when you submitted the job will point you to the error messages.  If
>> you didn't set .o and .e files in your qsub call (using the -o and -e
>> options), I believe it defaults to files named <jobname>.o<JOBID> (for
>> output) and <jobname>.e<JOBID> (for error).  I believe Chris talked about
>> this in his reply.  This is how I discovered my scripts weren't shared:
>> the .e file indicated the scripts couldn't be found.
>>
>> Also, as Chris said, you can use qconf to change attributes of the SGE
>> setup. I have done this before on a StarCluster cluster: log in to the
>> master node and, as long as you have admin privileges, use the qconf
>> command to make the change.
>>
>> Good Luck.
>>
>> -Jennifer
>>
>>
>> greg <margeemail at gmail.com> wrote:
>>
>> Thanks Jennifer!  Being completely new to star cluster, how can I
>> check that my scripts are available to all nodes?
>>
>> -Greg
>>
>> On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab at cs.unc.edu> wrote:
>>> Greg -
>>>
>>>     Maybe check that your software and scripts are available to all
>>> nodes. I have had Starcluster throw a bunch of EQW's when I accidentally
>>> didn't have all the software and script components loaded in a directory
>>> that was NFS'ed for all the nodes of my cluster and/or individually
>>> loaded on all nodes of the cluster.
>>>
>>> And as Chris just stated use:
>>> qstat -j <JOBID> ==> gives complete info on that job
>>> qstat -j <JOBID> | grep error  (looks for errors in job)
>>>
>>> When you get the error debugged you can use:
>>> qmod -cj <JOBID>  (will clear error state and restart job - like Eqw )
>>>
>>> Good Luck.
>>>
>>> -Jennifer
>>>
>>>
>>>
>>> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>>> 'EQW' is a combination of multiple message states (e)(q)(w).  The
>>>> standard "qw" is familiar to everyone, the E indicates something bad at
>>>> the job level.
>>>>
>>>> There are multiple levels of debugging, starting with easy and getting
>>>> more cumbersome. Almost all require admin or sudo level access.
>>>>
>>>> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
>>>> is in EQW state, that should provide a bit more information about what
>>>> went wrong.
>>>>
>>>> After that you look at the .e and .o STDERR/STDOUT files from the script
>>>> if any were created
>>>>
>>>> After that you can use sudo privs to go into
>>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file, there
>>>> are also per-node messages files you can look at as well.
>>>>
>>>> The next level of debugging after that usually involves setting the
>>>> sge_execd parameter KEEP_ACTIVE=true, which makes SGE stop deleting
>>>> the temporary files associated with a job's life cycle. Those files
>>>> live down in the SGE spool at
>>>> <executionhost>/active_jobs/<jobID>/ and they are invaluable in
>>>> debugging nasty subtle job failures.
>>>>
>>>> EQW should be easy to troubleshoot though - it indicates a fatal error
>>>> right at the beginning of the job dispatch or execution process. No
>>>> subtle things there
>>>>
>>>>
>>>> And if your other question was about nodes being allowed to submit jobs:
>>>> yes, you have to configure this. It can be done during SGE install
>>>> time or any time afterwards by running "qconf -as <nodename>" from any
>>>> account with SGE admin privs. I have no idea whether StarCluster does
>>>> this automatically, but I'd expect that it probably does. If not, it's
>>>> an easy fix.
>>>>
>>>> -Chris
>>>>
>>>>
>>>> greg wrote:
>>>>> Hi guys,
>>>>>
>>>>> I'm afraid I'm still stuck on this.  Besides my original question,
>>>>> which I'm still not sure about, does anyone have any general advice
>>>>> on debugging an EQW state?  The same software runs fine in our local
>>>>> cluster.
>>>>>
>>>>> thanks again,
>>>>>
>>>>> Greg
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster at mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>


