[StarCluster] Eqw errors in SGE with default starcluster configuration

Dustin Machi dmachi at vbi.vt.edu
Tue Feb 7 11:37:32 EST 2012


It looks to me like either a) the script you are submitting is specifically looking in /root/lme (or setting that as its working directory), or b) you are submitting jobs as the root user and your script is looking in ~/lme.  I don't think /root is shared via NFS across the cluster nodes, but /home for standard users is.  If you run this as a normal user and make sure it looks in /home/username/lme instead of /root/, I think it will work.  I'm guessing you don't run as root on your department's cluster.
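For example, something like this on the master (a rough sketch, untested — "mycluster", "myjob.sh", and the lme paths are placeholders; "sgeadmin" is the non-root user StarCluster creates by default, whose home directory lives on the NFS share):

$ starcluster sshmaster mycluster        # log in to the master as root
# cp -r /root/lme /home/sgeadmin/lme     # move the job files onto the NFS-shared /home
# chown -R sgeadmin /home/sgeadmin/lme
# su - sgeadmin
$ cd ~/lme
$ qsub -cwd myjob.sh                     # -cwd: run the job from this shared directory

That way every node sees the same working directory and the chdir should succeed.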

Dustin



On Feb 7, 2012, at 11:27 AM, Josh Moore wrote:

> Ah ok, the problem is obvious now:
> 
> 02/07/2012 16:20:31|worker|master|W|job 1.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1568]: error: can't chdir to /root/lme: No such file or directory
> 02/07/2012 16:20:31|worker|master|W|rescheduling job 1.1
> 02/07/2012 16:20:31|worker|master|W|job 3.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1569]: error: can't chdir to /root/lme: No such file or directory
> 02/07/2012 16:20:31|worker|master|W|rescheduling job 3.1
> 
> The directory structure isn't there on the other nodes. Like I said, this script runs fine on my department's cluster, which runs SGE, without any extra steps to set up directories on the nodes. What do I need to configure so that the directory structure is replicated across the nodes?
> 
> Best,
> Josh
> 
> On Tue, Feb 7, 2012 at 11:02 AM, Dustin Machi <dmachi at vbi.vt.edu> wrote:
> I'd check out /opt/sge6/default/spool/qmaster/messages to see if there is anything useful about what is happening there.  It will generally tell you why it's not queuing an additional job.  Are the parallel environments set up the same between your two clusters?
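> For example (a quick sketch — "orte" is the parallel environment StarCluster configures by default; substitute whatever qconf -spl reports on your clusters):
> 
> $ tail -50 /opt/sge6/default/spool/qmaster/messages   # recent qmaster log entries
> $ qconf -spl                                          # list the configured parallel environments
> $ qconf -sp orte                                      # show slots, allocation_rule, etc. for one PE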
> 
> Dustin
> 
> On Feb 7, 2012, at 2:30 AM, Josh Moore wrote:
> 
> > I tried submitting a bunch of jobs with qsub, using a script that works fine on another (non-Amazon) cluster's SGE setup. But on a cluster configured with StarCluster, only the first 8 jobs enter the queue without error (the nodes are c1.xlarge, so 8 cores each), and all of them execute immediately on the master node. Even if I delete one of the jobs on the master node, another never takes its place. I have a cluster of 8 c1.xlarge nodes. Here is the output of qconf -ssconf:
> >
> > algorithm                         default
> > schedule_interval                 0:0:15
> > maxujobs                          0
> > queue_sort_method                 load
> > job_load_adjustments              np_load_avg=0.50
> > load_adjustment_decay_time        0:7:30
> > load_formula                      np_load_avg
> > schedd_job_info                   false
> > flush_submit_sec                  0
> > flush_finish_sec                  0
> > params                            none
> > reprioritize_interval             0:0:0
> > halftime                          168
> > usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> > compensation_factor               5.000000
> > weight_user                       0.250000
> > weight_project                    0.250000
> > weight_department                 0.250000
> > weight_job                        0.250000
> > weight_tickets_functional         0
> > weight_tickets_share              0
> > share_override_tickets            TRUE
> > share_functional_shares           TRUE
> > max_functional_jobs_to_schedule   200
> > report_pjob_tickets               TRUE
> > max_pending_tasks_per_job         50
> > halflife_decay_list               none
> > policy_hierarchy                  OFS
> > weight_ticket                     0.010000
> > weight_waiting_time               0.000000
> > weight_deadline                   3600000.000000
> > weight_urgency                    0.100000
> > weight_priority                   1.000000
> > max_reservation                   0
> > default_duration                  INFINITY
> >
> > I can't figure out how to change schedd_job_info to true to find out more about the error message...
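> > From the SGE docs, it looks like qconf -msconf would be the way — a sketch, assuming the stock admin tools, which I haven't verified here:
> >
> > $ qconf -msconf       # opens the scheduler config in $EDITOR; needs SGE manager rights
> >   ...change "schedd_job_info false" to "schedd_job_info true", save and quit...
> > $ qstat -j <jobid>    # "scheduling info:" should then say why a job stays pending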
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster at mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
> 
> 



