[StarCluster] Configure Nodes to submit jobs

Jacob barhak jacob.barhak at gmail.com
Wed Oct 1 16:38:23 EDT 2014


Hi Greg,

There is a StarCluster plugin for installing packages with pip. Also, you may wish to check out an Anaconda AMI that comes with many libraries and conda preinstalled.

Here is a link to instructions on setting up an anaconda AMI:
http://continuum.io/blog/starcluster-anaconda

These two paths should give you enough options to continue. 
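If it helps, here is a sketch of what the pip route can look like in the StarCluster config file (~/.starcluster/config). This assumes the pypkginstaller plugin that ships with recent StarCluster releases; the package names and cluster name below are only examples.

```ini
# Hypothetical fragment of ~/.starcluster/config.
# The pypkginstaller plugin pip-installs the listed packages on every node
# at cluster start; numpy/scipy are placeholder package names.
[plugin pypkginstaller]
setup_class = starcluster.plugins.pypkginstaller.PyPkgInstaller
packages = numpy, scipy

[cluster smallcluster]
# ... your existing cluster settings ...
plugins = pypkginstaller
```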

        Jacob

Sent from my iPhone

On Oct 1, 2014, at 12:48 PM, greg <margeemail at gmail.com> wrote:

> Hi everyone,
> 
> I found the bug.  Apparently a library I installed in Python is only
> available on the master node.  What's a good way to install Python
> libraries so they're available on all nodes?  I guess virtualenv, but I'm
> hoping for something simpler :-)
> 
> -Greg
> 
> On Wed, Oct 1, 2014 at 10:33 AM, Jennifer Staab <jstaab at cs.unc.edu> wrote:
>> For software and scripts, you can log in to each node and check that the
>> software is installed and that you can view/find the scripts. Also, check
>> that the user you submitted the jobs as (via qsub) has the permissions
>> needed to run the scripts and software and to write output.
>> 
>> An easier way is to use qstat -j <JOBID>, where <JOBID> is the job ID of
>> one of the jobs in Eqw status. That, and/or the .o/.e files you set when
>> you submitted the qsub job, will point you to the files containing the
>> error messages.  If you didn't set .o and .e files in your qsub call
>> (using the -o and -e options), I believe it defaults to files named from
>> the job name and job ID, with extension .o (for output) and .e (for
>> error).  I believe Chris talked about this in his reply.  This is how I
>> discovered the scripts weren't shared: the .e file indicated the scripts
>> couldn't be found.
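For reference, a small sketch of SGE's default output file naming when -o/-e are not passed to qsub; the job name and job ID below are made up.

```shell
# Default SGE stdout/stderr file names when -o/-e are not given:
#   <jobname>.o<jobid>  (stdout)  and  <jobname>.e<jobid>  (stderr),
# written to the job's working directory (usually $HOME).
# JOBNAME and JOBID here are hypothetical.
JOBNAME="run_analysis.sh"
JOBID=101
echo "${JOBNAME}.o${JOBID}"   # stdout file name
echo "${JOBNAME}.e${JOBID}"   # stderr file name
```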
>> 
>> Also, as Chris said, you can use qconf to change attributes of the SGE
>> setup. I have done this before on a StarCluster cluster: log in to the
>> master node and, as long as you have admin privileges, you can use the
>> qconf command to change attributes of the SGE setup.
>> 
>> Good Luck.
>> 
>> -Jennifer
>> 
>> Sent from my Verizon Wireless 4G LTE DROID
>> 
>> 
>> greg <margeemail at gmail.com> wrote:
>> 
>> Thanks Jennifer!  Being completely new to StarCluster, how can I
>> check that my scripts are available to all nodes?
>> 
>> -Greg
>> 
>> On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab at cs.unc.edu> wrote:
>>> Greg -
>>> 
>>>   Maybe check that your software and scripts are available to all nodes.
>>> I have had StarCluster throw a bunch of Eqw states when I accidentally
>>> didn't have all the software and script components loaded in a directory
>>> that was NFS-shared across all the nodes of my cluster, and/or installed
>>> individually on every node of the cluster.
>>> 
>>> And as Chris just stated, use:
>>> qstat -j <JOBID>  ==> gives complete info on that job
>>> qstat -j <JOBID> | grep error  ==> looks for errors in the job
>>> 
>>> When you get the error debugged you can use:
>>> qmod -cj <JOBID>  ==> clears the error state and restarts the job (e.g. one in Eqw)
>>> 
>>> Good Luck.
>>> 
>>> -Jennifer
>>> 
>>> 
>>> 
>>> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>>> 
>>>> 'EQW' is a combination of multiple message states: (E)(q)(w).  The
>>>> standard "qw" is familiar to everyone; the E indicates something bad at
>>>> the job level.
>>>> 
>>>> There are multiple levels of debugging, starting with easy and getting
>>>> more cumbersome. Almost all require admin or sudo-level access.
>>>> 
>>>> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
>>>> is in EQW state, that should provide a bit more information about what
>>>> went wrong.
>>>> 
>>>> After that, look at the .e and .o (STDERR/STDOUT) files from the script,
>>>> if any were created.
>>>> 
>>>> After that, you can use sudo privileges to go into
>>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file; there
>>>> are also per-node messages files you can look at.
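As a sketch of where those logs live, assuming StarCluster's usual SGE layout (SGE_ROOT=/opt/sge6, SGE_CELL=default; check your own environment variables if your install differs, and "node001" is just StarCluster's default worker name):

```shell
# Print the likely locations of the SGE message logs, falling back to
# StarCluster's defaults when SGE_ROOT/SGE_CELL are not set in the
# environment. The node name "node001" is an assumption.
SGE_ROOT="${SGE_ROOT:-/opt/sge6}"
SGE_CELL="${SGE_CELL:-default}"
echo "${SGE_ROOT}/${SGE_CELL}/spool/qmaster/messages"   # qmaster log
echo "${SGE_ROOT}/${SGE_CELL}/spool/node001/messages"   # per-node execd log
```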
>>>> 
>>>> The next level of debugging after that usually involves setting the
>>>> sge_execd parameter KEEP_ACTIVE=true, which stops SGE from deleting the
>>>> temporary files associated with a job's life cycle. Those files live in
>>>> the SGE spool at <executionhost>/active.jobs/<jobID>/ -- and they are
>>>> invaluable in debugging nasty, subtle job failures.
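For reference, that parameter lives in the execd_params line of the SGE configuration, edited via "qconf -mconf"; the fragment below is illustrative, not a full config.

```
# Fragment of the SGE configuration as opened by "qconf -mconf".
# Only the execd_params line is the point here; remember to revert it
# later, since kept active.jobs directories accumulate in the spool.
execd_params                 KEEP_ACTIVE=true
```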
>>>> 
>>>> EQW should be easy to troubleshoot, though - it indicates a fatal error
>>>> right at the beginning of the job dispatch or execution process. No
>>>> subtle things there.
>>>> 
>>>> 
>>>> And if your other question was about nodes being allowed to submit jobs
>>>> -- yes, you have to configure this. It can be done at SGE install
>>>> time, or any time afterwards, by running "qconf -as <nodename>" from any
>>>> account with SGE admin privileges. I have no idea whether StarCluster
>>>> does this automatically, but I'd expect that it probably does. If not,
>>>> it's an easy fix.
>>>> 
>>>> -Chris
>>>> 
>>>> 
>>>> greg wrote:
>>>>> 
>>>>> Hi guys,
>>>>> 
>>>>> I'm afraid I'm still stuck on this, beyond my original question,
>>>>> which I'm still not sure about.  Does anyone have any general advice
>>>>> on debugging an EQW state?  The same software runs fine on our local
>>>>> cluster.
>>>>> 
>>>>> thanks again,
>>>>> 
>>>>> Greg
>>>> 
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster at mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster

