[StarCluster] Eqw errors in SGE with default starcluster configuration

Justin Riley jtriley at MIT.EDU
Tue Feb 7 11:44:10 EST 2012



Yes, the issue is most likely that /root is not, and never will be,
NFS-shared on the cluster. I'd recommend, as Dustin suggested, logging
in as the CLUSTER_USER (defined in your config) and retrying your
script(s) from $HOME, which *is* NFS-shared.

For example, assuming CLUSTER_USER=sgeadmin:

$ starcluster sshmaster yourcluster -u sgeadmin
sgeadmin at master $ qsub ....
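
If the job's files currently live under /root/lme, you'll also need
to copy them into the shared home before submitting; one way is
StarCluster's put command. A rough sketch building on the above, where
yourscript.sh stands in for your actual submit script:

$ starcluster put yourcluster ./lme /home/sgeadmin/lme
$ starcluster sshmaster yourcluster -u sgeadmin
sgeadmin at master $ cd ~/lme && qsub -cwd yourscript.sh

Submitting with -cwd tells SGE to run the job from the submit
directory, which only works when that directory exists on every node,
hence keeping it under the NFS-shared /home. (If the copied files end
up owned by root, a quick chown -R sgeadmin:sgeadmin on them as root
fixes that.) Jobs already stuck in Eqw can be cleared with
'qmod -cj <jobid>' once the path problem is fixed, or simply removed
with 'qdel <jobid>' and resubmitted.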

HTH,

~Justin

P.S. - Please join the list if you can, otherwise I have to manually
login and approve each message you post.

On 02/07/2012 11:37 AM, Dustin Machi wrote:
> It looks to me like either a) the script you are submitting is
> specifically looking in /root/lme (or setting that as its working
> directory), or b) you are submitting jobs as the root user and your
> script is looking in ~/lme. I don't think /root is shared via NFS
> across the cluster nodes, but /home for standard users is. If you
> run this as a normal user and make sure it looks in
> /home/username/lme instead of /root/, I think it will work. I'm
> guessing you don't run as root on your department's cluster.
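> 
> To double-check what is actually shared, you can hop onto one of the
> nodes and inspect the mounts. A quick sketch (node001 is just an
> example node name):
> 
> $ starcluster sshnode yourcluster node001
> root at node001 $ mount | grep home
> 
> If /home shows up as an NFS mount from the master while /root doesn't
> appear at all, that confirms the diagnosis.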
> 
> Dustin
> 
> 
> 
> On Feb 7, 2012, at 11:27 AM, Josh Moore wrote:
> 
>> Ah ok, the problem is obvious now:
>> 
>> 02/07/2012 16:20:31|worker|master|W|job 1.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1568]: error: can't chdir to /root/lme: No such file or directory
>> 02/07/2012 16:20:31|worker|master|W|rescheduling job 1.1
>> 02/07/2012 16:20:31|worker|master|W|job 3.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1569]: error: can't chdir to /root/lme: No such file or directory
>> 02/07/2012 16:20:31|worker|master|W|rescheduling job 3.1
>> 
>> The directory structure isn't there on the other nodes. Like I
>> said, this script runs fine on my department's cluster, which runs
>> SGE, without any extra steps to set up directories on the nodes.
>> What do I need to configure so that the directory structure etc.
>> gets replicated?
>> 
>> Best, Josh
>> 
>> On Tue, Feb 7, 2012 at 11:02 AM, Dustin Machi
>> <dmachi at vbi.vt.edu> wrote:
>> I'd check out /opt/sge6/default/spool/qmaster/messages to see if
>> there is anything useful about what is happening there. It will
>> generally tell you why it's not queuing an additional job. Are the
>> parallel environments set up the same between your two clusters?
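>> 
>> A sketch of both checks; the PE name 'orte' is just StarCluster's
>> usual default, so substitute whatever qconf -spl reports:
>> 
>> $ tail -50 /opt/sge6/default/spool/qmaster/messages
>> $ qconf -spl        # list the parallel environment names
>> $ qconf -sp orte    # show one PE's slots and allocation rule
>> 
>> Diffing the qconf -sp output from your two clusters should show
>> whether the slot counts or allocation rules differ.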
>> 
>> Dustin
>> 
>> On Feb 7, 2012, at 2:30 AM, Josh Moore wrote:
>> 
>>> I tried submitting a bunch of jobs with qsub, using a script that
>>> works fine on another (non-Amazon) cluster's SGE configuration.
>>> But on a cluster configured with StarCluster (8 c1.xlarge nodes,
>>> so 8 cores each), only the first 8 jobs enter the queue without
>>> error, and all of those are immediately executed on the master
>>> node. Even if I delete one of the jobs on the master node, another
>>> never takes its place. Here is the output of qconf -ssconf:
>>> 
>>> algorithm                         default
>>> schedule_interval                 0:0:15
>>> maxujobs                          0
>>> queue_sort_method                 load
>>> job_load_adjustments              np_load_avg=0.50
>>> load_adjustment_decay_time        0:7:30
>>> load_formula                      np_load_avg
>>> schedd_job_info                   false
>>> flush_submit_sec                  0
>>> flush_finish_sec                  0
>>> params                            none
>>> reprioritize_interval             0:0:0
>>> halftime                          168
>>> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
>>> compensation_factor               5.000000
>>> weight_user                       0.250000
>>> weight_project                    0.250000
>>> weight_department                 0.250000
>>> weight_job                        0.250000
>>> weight_tickets_functional         0
>>> weight_tickets_share              0
>>> share_override_tickets            TRUE
>>> share_functional_shares           TRUE
>>> max_functional_jobs_to_schedule   200
>>> report_pjob_tickets               TRUE
>>> max_pending_tasks_per_job         50
>>> halflife_decay_list               none
>>> policy_hierarchy                  OFS
>>> weight_ticket                     0.010000
>>> weight_waiting_time               0.000000
>>> weight_deadline                   3600000.000000
>>> weight_urgency                    0.100000
>>> weight_priority                   1.000000
>>> max_reservation                   0
>>> default_duration                  INFINITY
>>> 
>>> I can't figure out how to change schedd_job_info to true to 
>>> find out more about the error message... 
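
For the record: schedd_job_info can be flipped to true by editing the
scheduler config interactively with qconf -msconf, or non-interactively
along these lines (a sketch, using /tmp/sconf as a scratch file).
qstat -j <jobid> will then include a "scheduling info" section for
pending jobs:

$ qconf -ssconf | sed 's/^schedd_job_info.*/schedd_job_info true/' > /tmp/sconf
$ qconf -Msconf /tmp/sconf
$ qstat -j <jobid>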
