<div dir="ltr">As a follow up:<div><br></div><div><div>>>> Loading full job history</div><div>Execution hosts: 41</div><div>Queued jobs: 165</div><div>Oldest queued job: 2013-10-30 01:15:34</div><div>Avg job duration: 1541 secs</div>
<div>Avg job wait time: 992 secs</div><div>Last cluster modification time: 2013-10-30 01:17:05</div><div>>>> Not adding nodes: already at or above maximum (1)</div><div>>>> Sleeping...(looping again in 60 secs)</div>
<div><br></div><div>Execution hosts: 41</div><div>Queued jobs: 161</div><div>Oldest queued job: 2013-10-30 01:15:34</div><div>Avg job duration: 0 secs</div><div>Avg job wait time: 0 secs</div><div>Last cluster modification time: 2013-10-30 01:17:05</div>
<div>>>> Not adding nodes: already at or above maximum (1)</div><div>>>> Sleeping...(looping again in 60 secs)</div></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">
On Tue, Oct 29, 2013 at 10:04 PM, Ryan Golhar <span dir="ltr"><<a href="mailto:ngsbioinformatics@gmail.com" target="_blank">ngsbioinformatics@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Hi Rajat,<div><br></div><div>It's happening again. My jobs are, on average, 1 hr long. I'm attaching the qacct output:<br></div><div><br></div><div>qacct -j -b "10291300" > qacct.out<br>
</div><div><br></div><div><br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Sep 20, 2013 at 11:28 AM, Rajat Banerjee <span dir="ltr"><<a href="mailto:rajatb@post.harvard.edu" target="_blank">rajatb@post.harvard.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div><div dir="ltr"><div><div><div>Hi Ryan,<br>Sorry, wrong qacct command. I think I may know
what's happening. Are your jobs really long-running? I think the
'lookback window' for checking the job history may be too short for you.
You could try setting it to at least twice the duration of one of your
qsub'd tasks. See how every other line says ">>> Loading full
job history"? That comes up because jobstats are empty: 'qacct -j -b
<some time>' is coming back empty.<br>
<br></div><div>Trying to reproduce the behavior from:<br><a href="https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L504" target="_blank">https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L504</a><br>
<br></div>Could you send the output from this:<br></div>Make a date for approximately when you started your cluster, in this format: <br><pre>MMDDhhmm (month, day, hours, minutes)</pre>qacct -j -b <put that date format><br>
<br></div><div>And please paste that qacct output here. That should
always have a history of all jobs. Then try the same with the date
set to only 3 hours ago. You can try toying with the lookback
window. The default is 3 hours, and you can feed a new one in on the
command line:<br>
<br><strong>Lookback window</strong> (-l LOOKBACK_WIN, --lookback_window=LOOKBACK_WIN) - How
long, in hours, to look back for past job history<br><br></div><div>Justin Riley, can you please update the doc on this site?<br><a href="http://star.mit.edu/cluster/docs/0.93.3/manual/load_balancer.html" target="_blank">http://star.mit.edu/cluster/docs/0.93.3/manual/load_balancer.html</a><br>
<br></div>It says the window is in minutes but it's in fact in hours.<br><br>Thanks,<br>Raj</div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
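<div>For anyone following the thread, here is a minimal sketch of how the MMDDhhmm start time for qacct -b could be derived from a lookback window given in hours, along with Raj's rule of thumb of making the window at least twice the job duration. The helper names are hypothetical and this is not StarCluster's own code. Falling back to the 3-hour default as a floor is an assumption made for illustration.</div>

```python
from datetime import datetime, timedelta


def qacct_begin_time(now, lookback_hours):
    """Format the time `lookback_hours` before `now` as MMDDhhmm for `qacct -b`."""
    start = now - timedelta(hours=lookback_hours)
    return start.strftime("%m%d%H%M")


def suggested_lookback_hours(job_duration_secs, default_hours=3):
    """Pick a window of at least twice the job duration, in whole hours.

    Assumption for this sketch: never go below the 3-hour default.
    """
    needed = -(-2 * job_duration_secs // 3600)  # ceiling division
    return max(default_hours, needed)


if __name__ == "__main__":
    # Hypothetical example values: a fixed "now" and 2-hour jobs.
    now = datetime(2013, 10, 30, 1, 17)
    hours = suggested_lookback_hours(2 * 3600)
    print("lookback window (hours):", hours)
    print("qacct -j -b", qacct_begin_time(now, hours))
```

<div>With the hours value from this sketch, the window could then be passed to the load balancer through the -l option quoted in the thread.</div>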