[StarCluster] Configure Nodes to submit jobs

Ed Gray gray_ed at hotmail.com
Wed Oct 1 14:53:55 EDT 2014


Hi Greg,

1) What does the PATH look like when you ssh to the node, versus in a
job qsubbed on the node? (See the quick check below.)
2) Are your qsubbed jobs bash scripts?
3) Do you use exactly (or essentially) the same AMI to start the nodes as
the AMI used to start the master?
4) I missed where you described your area of application earlier; is it
bioinformatics?
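
Re 1), a quick way to compare (the script and file names here are just
examples):

  $ cat checkpath.sh
  #!/bin/bash
  # print the PATH this job actually sees on the execution host
  echo "host: $(hostname)"
  echo "PATH: $PATH"
  $ qsub -cwd -o path.out -e path.err checkpath.sh

and then compare path.out against "echo $PATH" from an interactive ssh
session on the same node.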

There were some items I had to address in my startup scripts (/etc/profile
and the bashrc's) to ensure the path was correct.  I don't have the code in
front of me right now, but I believe there are some considerations for
situations where you run bash non-interactively.  My concern wasn't any
specific job or script, but an overall architecture that made it smooth
to develop/debug pipelines without qsubbing and then run those jobs over
a large set of genes (>40,000) on the cluster.
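
From memory, the gist was something like this near the top of the
nodes' bashrc (the /shared/tools path below is just an illustration):

  # Ubuntu's ~/.bashrc bails out early for non-interactive shells
  # (something like: [ -z "$PS1" ] && return), so anything a qsubbed
  # job needs, like PATH, must be set *above* that guard or in a file
  # that non-interactive shells also source.
  export PATH=/shared/tools/bin:$PATH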

We had code and libraries that were core to all of our work (i.e. used
both interactively and in qsubbed jobs) and that did not live in the
home directory.  This sounds kind of like what you are doing: a core
specialized environment that is then used by specific user code that
may vary on a cluster-by-cluster basis.

Otherwise, the advice from Hugh and Jennifer seems pretty complete on the
specific workings of the tools at hand to solve the larger problem.

Best,
Ed


-----Original Message-----
From: starcluster-bounces at mit.edu [mailto:starcluster-bounces at mit.edu] On
Behalf Of MacMullan, Hugh
Sent: Wednesday, October 01, 2014 2:19 PM
To: greg
Cc: starcluster at mit.edu
Subject: Re: [StarCluster] Configure Nodes to submit jobs

Hi Greg:

Use the Python Package Installer Plugin!

http://star.mit.edu/cluster/docs/latest/plugins/pypkginstaller.html

Note the "If you already have a cluster running that didn't originally
include the PyPkgInstaller plugin in its config you can manually run the
plugin on the cluster ..." at the end.
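
For reference, the config stanza looks roughly like this (the
setup_class path and package names below are from memory -- double-check
them against the linked page):

  [plugin pypkginstaller]
  setup_class = starcluster.plugins.pypkginstaller.PyPkgInstaller
  packages = scikit-learn, pandas

Add "pypkginstaller" to your cluster template's plugins list for new
clusters, or run it on the already-running cluster with:

  starcluster runplugin pypkginstaller myclustername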

Let us know how it goes.

-Hugh

-----Original Message-----
From: starcluster-bounces at mit.edu [mailto:starcluster-bounces at mit.edu] On
Behalf Of greg
Sent: Wednesday, October 01, 2014 1:49 PM
To: Jennifer Staab
Cc: starcluster at mit.edu
Subject: Re: [StarCluster] Configure Nodes to submit jobs

Hi everyone,

I found the bug.  Apparently a Python library I installed is only
available on the master node.  What's a good way to install Python
libraries so they're available on all nodes?  I guess virtualenv, but
I'm hoping for something simpler :-)

-Greg

On Wed, Oct 1, 2014 at 10:33 AM, Jennifer Staab <jstaab at cs.unc.edu> wrote:
> For software and scripts, you can log in to each node and check that
> the software is installed and that you can view/find the scripts.
> Also, you might check that the user you qsub'ed the jobs as has the
> correct permissions to run the scripts and software and to write
> output.
>
> An easier way is to use "qstat -j <JOBID>", where <JOBID> is the job
> id of one of the jobs with Eqw status.  That, and/or the .o/.e files
> you set when you submitted the qsub job, will give you the file
> locations where you can read the error messages.  If you didn't set
> .o and .e files in your qsub call (using the -o and -e options), I
> believe it defaults to files named after the jobid or jobname, with
> extensions .o (for output) and .e (for error).  I believe Chris
> talked about this in his reply.  This is how I discovered my scripts
> weren't shared: the .e file indicated the scripts couldn't be found.
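>
> For example, to set those explicitly (the file names here are just
> placeholders):
>
>   qsub -o myjob.out -e myjob.err myjob.sh
>
> puts STDOUT in myjob.out and STDERR in myjob.err; without -o/-e you
> should get <jobname>.o<JOBID> and <jobname>.e<JOBID> files (in your
> home directory by default).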
>
> Also, as Chris said, you can use qconf to change attributes of the
> SGE setup.  I have done this before on a StarCluster cluster: log in
> to the master node and, as long as you have admin privileges, you can
> use the qconf command to change the SGE configuration.
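>
> For example, from the master as an SGE admin user, something like:
>
>   qconf -sconf      (show the global SGE configuration)
>   qconf -mq all.q   (edit the all.q queue configuration in an editor)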
>
> Good Luck.
>
> -Jennifer
>
> Sent from my Verizon Wireless 4G LTE DROID
>
>
> greg <margeemail at gmail.com> wrote:
>
> Thanks Jennifer!  Being completely new to StarCluster, how can I
> check that my scripts are available to all nodes?
>
> -Greg
>
> On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab at cs.unc.edu> wrote:
>> Greg -
>>
>> Maybe check that your software and scripts are available to all
>> nodes.  I have had StarCluster throw a bunch of Eqw's when I
>> accidentally didn't have all the software and script components
>> loaded in a directory that was NFS-shared to all the nodes of my
>> cluster and/or individually loaded on all nodes of the cluster.
>>
>> And as Chris just stated, use:
>>
>>   qstat -j <JOBID>                (gives complete info on that job)
>>   qstat -j <JOBID> | grep error   (looks for errors in the job)
>>
>> When you get the error debugged, you can use:
>>
>>   qmod -cj <JOBID>   (clears the error state, e.g. Eqw, so the job
>>                       can be scheduled again)
>>
>> Good Luck.
>>
>> -Jennifer
>>
>>
>>
>> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>>
>>> 'EQW' is a combination of multiple message states: (E)(q)(w).  The
>>> standard "qw" is familiar to everyone; the E indicates something
>>> bad happened at the job level.
>>>
>>> There are multiple levels of debugging, starting with easy and
>>> getting more cumbersome.  Almost all require admin or sudo level
>>> access.
>>>
>>> The 1st-pass debug method is to run "qstat -j <jobID>" on the job
>>> that is in EQW state; that should provide a bit more information
>>> about what went wrong.
>>>
>>> After that, you look at the .e and .o STDERR/STDOUT files from the
>>> script, if any were created.
>>>
>>> After that, you can use sudo privs to go into
>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file;
>>> there are also per-node messages files you can look at as well.
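>>>
>>> For example, with the default cell name that usually means:
>>>
>>>   $SGE_ROOT/default/spool/qmaster/messages
>>>   $SGE_ROOT/default/spool/<nodename>/messages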
>>>
>>> The next level of debugging after that usually involves setting the
>>> sge_execd parameter KEEP_ACTIVE=true, which tells SGE to stop
>>> deleting the temporary files associated with a job's life cycle.
>>> Those files live down in the SGE spool at
>>> <executionhost>/active.jobs/<jobID>/ -- and they are invaluable in
>>> debugging nasty, subtle job failures.
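>>>
>>> If it helps: KEEP_ACTIVE goes in the execd_params line of the
>>> global configuration, e.g. via "qconf -mconf" and a line like
>>>
>>>   execd_params    KEEP_ACTIVE=true
>>>
>>> (remember to turn it back off afterwards, or the spool slowly
>>> fills up with old job directories).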
>>>
>>> EQW should be easy to troubleshoot, though - it indicates a fatal
>>> error right at the beginning of the job dispatch or execution
>>> process.  No subtle things there.
>>>
>>>
>>> And if your other question was about nodes being allowed to submit
>>> jobs -- yes, you have to configure this.  It can be done during SGE
>>> install time or any time afterwards by doing "qconf -as <nodename>"
>>> from any account with SGE admin privs.  I have no idea if
>>> StarCluster does this automatically or not, but I'd expect that it
>>> probably does.  If not, it's an easy fix.
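>>>
>>> For example:
>>>
>>>   qconf -as node001   (add node001 as a submit host)
>>>   qconf -ss           (list the current submit hosts)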
>>>
>>> -Chris
>>>
>>>
>>> greg wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I'm afraid I'm still stuck on this.  Besides my original question,
>>>> which I'm still not sure about: does anyone have any general
>>>> advice on debugging an EQW state?  The same software runs fine on
>>>> our local cluster.
>>>>
>>>> thanks again,
>>>>
>>>> Greg
>>>
>>
>>

_______________________________________________
StarCluster mailing list
StarCluster at mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster


