[Starcluster] Load Balancer Problems

Rajat Banerjee rbanerj at fas.harvard.edu
Tue Aug 3 11:05:10 EDT 2010


Thanks. Fixed and checked in.

Raj
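
For anyone skimming the thread: the fix boils down to ignoring qacct's
nonzero exit status when no job has completed yet, as Justin suggests
below. A minimal sketch of the idea, assuming the StarCluster ssh wrapper
and the ignore_exit_status keyword from Justin's one-liner (the helper
function name and batch-id argument here are illustrative, not the actual
balancer code):

def get_qacct_output(node, batch_id):
    """Fetch SGE accounting info from a node, tolerating qacct's
    exit status 1 when no job has completed yet."""
    cmd = 'source /etc/profile && qacct -j -b %s' % batch_id
    # node.ssh.execute is the wrapper from Justin's one-liner; passing
    # ignore_exit_status=True suppresses the ssh.py ERROR message
    # without changing the balancer's behavior.
    return node.ssh.execute(cmd, ignore_exit_status=True)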


On Tue, Aug 3, 2010 at 10:15 AM, Justin Riley <jtriley at mit.edu> wrote:
> Raj,
>
>> I know what that error is - when no job has completed, qacct returns 1
>> instead of 0. I am looking for a way to get rid of that message. It
>> has no effect on the balancer.
>
> master.ssh.execute('qacct .....', ignore_exit_status=True)
>
> ~Justin
>
> ________________________________________
> From: rqbanerjee at gmail.com [rqbanerjee at gmail.com] On Behalf Of Rajat Banerjee [rbanerj at fas.harvard.edu]
> Sent: Monday, August 02, 2010 4:53 PM
> To: Amaro Taylor
> Cc: Justin Riley; starcluster at mit.edu
> Subject: Re: [Starcluster] Load Balancer Problems
>
> Hey Amaro,
> Great. Glad it is working for you.
>
> I know what that error is - when no job has completed, qacct returns 1
> instead of 0. I am looking for a way to get rid of that message. It
> has no effect on the balancer.
>
> Rajat
>
>
> On Mon, Aug 2, 2010 at 4:51 PM, Amaro Taylor
> <amaro.taylor at resgroupinc.com> wrote:
>> Hey Rajat,
>>
>> Just to update you on the testing progress. I'm currently running a job and
>> it seems to be working as expected. We also got one error that didn't seem to
>> change anything: ssh.py:248 - ERROR - command source /etc/profile && qacct
>> -j -b 201008021652 failed with status 1. The balancer looks to be working
>> great.
>> Best,
>> Amaro Taylor
>> RES Group, Inc.
>> 1 Broadway • Cambridge, MA 02142 • U.S.A.
>> Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
>> amaro.taylor at resgroupinc.com
>>
>> Disclaimer: The information contained in this email message may be
>> confidential. Please be careful if you forward, copy or print this message.
>> If you have received this email in error, please immediately notify the
>> sender and delete the message.
>>
>>
>> On Mon, Aug 2, 2010 at 12:59 PM, Amaro Taylor <amaro.taylor at resgroupinc.com>
>> wrote:
>>>
>>> Hey Guys,
>>>
>>> As far as the node idle time goes, I think we just misinterpreted what was
>>> happening. The modulus statement was what we wanted.
>>>
>>> Thanks
>>> Amaro Taylor
>>> RES Group, Inc.
>>> 1 Broadway • Cambridge, MA 02142 • U.S.A.
>>> Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
>>> amaro.taylor at resgroupinc.com
>>>
>>>
>>> On Mon, Aug 2, 2010 at 12:30 PM, Justin Riley <jtriley at mit.edu> wrote:
>>>>
>>>> Raj,
>>>>
>>>> > 2. What is your preference for how long a job should stay idle before
>>>> > being killed?
>>>>
>>>> I think you meant *node* not job...
>>>>
>>>> > I usually don't check how long it has been idle. If it
>>>> > is idle now and the queue is empty then kill it. I could add code to
>>>> > check how long it has been idle, if it seems useful. Is there a use
>>>> > case?
>>>>
>>>> Also, the node must be up for the "majority of the hour" before it can
>>>> be considered for removal. This provides flexibility for the queue to
>>>> stabilize and also saves money given that you pay for the entire
>>>> instance hour anyway.
>>>>
>>>> As far as the "code to check how long a node has been idle" goes, I'm not
>>>> sure I understand the use case/context either. Mind bringing the list up
>>>> to date on this discussion?
>>>>
>>>> ~Justin
>>>>
>>>> On 08/02/2010 02:38 PM, Rajat Banerjee wrote:
>>>> > Hey Amaro,
>>>> > Cool, thanks. I called Brian and got info regarding array jobs.
>>>> > I checked in some test code that works fine on my (simple) cluster
>>>> > with qsub -t 1-20:1. I'd appreciate it if you'd test and let me know
>>>> > how it goes. Just committed to github:
>>>> >
>>>> > http://github.com/rqbanerjee/StarCluster/commit/17998a68feab3d1440aa5d9edc2e74697e43ef54
>>>> >
>>>> > Making requests during a business day has its rewards :)
>>>> >
>>>> > Regarding the host that had been inactive for a short time:
>>>> > 1. If the "tasks" field had been properly recognized, as it is now, the
>>>> > queue would have been recognized as full, and that node probably wouldn't
>>>> > have been killed.
>>>> > 2. What is your preference for how long a job should stay idle before
>>>> > being killed? I usually don't check how long it has been idle. If it
>>>> > is idle now and the queue is empty then kill it. I could add code to
>>>> > check how long it has been idle, if it seems useful. Is there a use
>>>> > case?
>>>> >
>>>> > Thanks,
>>>> > Rajat
>>>>
>>>
>>
>>
>
>
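
The removal policy described in the thread (only terminate a node that is
idle, whose queue is empty, and that has already used the majority of its
paid instance hour) roughly amounts to a check like the one below. This is
a sketch of the idea only; the 0.75 threshold, function name, and launch
time argument are illustrative assumptions, not StarCluster's actual code:

import time

def node_is_removable(node_idle, queue_empty, launch_time,
                      min_fraction=0.75, seconds_per_hour=3600):
    """Return True if an idle node may be terminated.

    EC2 bills by the full instance hour, so an idle node only becomes a
    removal candidate once it is in the latter part of its current hour;
    this also gives the queue time to stabilize.
    """
    if not (node_idle and queue_empty):
        return False
    uptime = time.time() - launch_time
    seconds_into_hour = uptime % seconds_per_hour  # the "modulus statement"
    return seconds_into_hour >= min_fraction * seconds_per_hour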



