[StarCluster] Twilight Zone: sge_gethostbyname failed

Rayson Ho raysonlogin at gmail.com
Fri Dec 27 19:35:54 EST 2013


(Updating the list...)

The hostname on the master gets reset to centos-ami, which is not
resolvable: /etc/hosts only has entries for master and node001. That is
why Grid Engine fails with "sge_gethostbyname failed".
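
For anyone who wants to reproduce the check outside of SGE, here is a
minimal Python sketch of the same test: take the local hostname and try
to resolve it. check_local_hostname and its two parameters are names
made up for this example (they are not part of SGE or StarCluster); the
defaults use the standard socket module.

```python
import socket


def check_local_hostname(get_name=socket.gethostname,
                         resolve=socket.gethostbyname):
    """Return (hostname, address), or (hostname, None) if it does not resolve.

    Both callables are injectable so the logic can be exercised without
    touching the real resolver (an assumption made for this sketch).
    """
    name = get_name()
    try:
        return name, resolve(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return name, None


if __name__ == "__main__":
    host, addr = check_local_hostname()
    if addr is None:
        print("local hostname %r does not resolve" % host)
    else:
        print("local hostname %r resolves to %s" % (host, addr))
```

On a node whose hostname is missing from /etc/hosts and DNS, this
should report that the name does not resolve, which is the condition
Grid Engine is tripping over.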

Lyn: what is the value of the HOSTNAME key in "/etc/sysconfig/network"
on your master instance?
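
If it helps, the key can be pulled out with a few lines of Python. This
is only a sketch: read_sysconfig_hostname is a hypothetical helper, and
it assumes the usual KEY=value layout of that file.

```python
def read_sysconfig_hostname(text):
    """Return the HOSTNAME value from KEY=value text, or None if absent."""
    for line in text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a KEY=value pair.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "HOSTNAME":
            # Values may or may not be quoted.
            return value.strip().strip("\"'")
    return None


if __name__ == "__main__":
    with open("/etc/sysconfig/network") as f:
        print(read_sysconfig_hostname(f.read()))
```

If this prints something other than the name listed for the master in
/etc/hosts, the persistent hostname and the hosts file disagree.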

Justin & other devs: set_hostname() in node.py works on Ubuntu because
Ubuntu keeps the persistent hostname in /etc/hostname, but RHEL (and
RHEL-based distros like CentOS, Oracle Linux, Scientific Linux) uses
the HOSTNAME key in /etc/sysconfig/network, and SuSE uses yet another
file, /etc/HOSTNAME!
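
To make the difference concrete, here is a rough sketch of what a
distro-aware version could look like. This is NOT the current
set_hostname() code in StarCluster; HOSTNAME_FILES, the family names
and render_hostname_file are all invented for this example. It only
computes the file path and the new file contents and leaves the actual
write (and running "hostname <name>") to the caller.

```python
# Persistent hostname file per distro family (paths as listed above).
HOSTNAME_FILES = {
    "ubuntu": "/etc/hostname",
    "rhel": "/etc/sysconfig/network",
    "suse": "/etc/HOSTNAME",
}


def render_hostname_file(family, hostname, current_text=""):
    """Return (path, new_contents) for persisting `hostname` on `family`.

    For "rhel" the file holds several KEY=value settings, so only the
    HOSTNAME line is replaced and the other lines are kept. For the
    other families the file contains just the hostname.
    """
    path = HOSTNAME_FILES[family]
    if family != "rhel":
        return path, hostname + "\n"
    kept = [line for line in current_text.splitlines()
            if not line.strip().startswith("HOSTNAME=")]
    kept.append("HOSTNAME=" + hostname)
    return path, "\n".join(kept) + "\n"
```

Keeping the function free of file I/O makes it easy to test against
sample file contents before wiring it into anything that touches a
live node.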

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Fri, Dec 27, 2013 at 6:39 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> I used the Scientific Linux AMI (been a long time, but I found it from the
> SC site), and 0.94.3 is my SC version.
>
>
> On Fri, Dec 27, 2013 at 1:36 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>
>> Hmm, which AMI did you use, and what's the version of SC?
>>
>> Rayson
>>
>>
>> On Fri, Dec 27, 2013 at 6:33 PM, Lyn Gerner <schedulerqueen at gmail.com>
>> wrote:
>> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> > # /opt/sge6/utilbin/linux-x64/gethostname -name
>> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
>> >
>> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> > # hostname
>> > centos-ami
>> >
>> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> > # hostname -f
>> > hostname: Unknown host
>> >
>> > What's weird is that I have never mucked with any of this under
>> > StarCluster, and have only recently started having problems.  Can't
>> > pinpoint any specific event or thing that changed--except that I
>> > started leaving the config up for days instead of hours at a stretch.
>> >
>> > Thanks,
>> > Lyn
>> >
>> >
>> > On Fri, Dec 27, 2013 at 1:30 PM, Rayson Ho <raysonlogin at gmail.com>
>> > wrote:
>> >>
>> >> No problem, and I think that's why it is failing. Can you also send me
>> >> the output of:
>> >>
>> >> 1) gethostname -name
>> >>
>> >> 2) hostname
>> >>
>> >> 3) hostname -f
>> >>
>> >> Rayson
>> >>
>> >>
>> >> On Fri, Dec 27, 2013 at 6:27 PM, Lyn Gerner <schedulerqueen at gmail.com>
>> >> wrote:
>> >> > My bad:
>> >> >
>> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> > # /opt/sge6/utilbin/linux-x64/gethostname -all
>> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
>> >> >
>> >> > Thanks for any insights,
>> >> > Lyn
>> >> >
>> >> >
>> >> > On Fri, Dec 27, 2013 at 1:25 PM, Rayson Ho <raysonlogin at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> But I need the output of "gethostname", not "gethostbyname"... :-P
>> >> >>
>> >> >> Rayson
>> >> >>
>> >> >>
>> >> >> On Fri, Dec 27, 2013 at 6:11 PM, Lyn Gerner
>> >> >> <schedulerqueen at gmail.com>
>> >> >> wrote:
>> >> >> > Thanks for the quick response, Rayson.  Output from gethostbyname is
>> >> >> > in between the ****s below:
>> >> >> >
>> >> >> > On Fri, Dec 27, 2013 at 1:04 PM, Rayson Ho <raysonlogin at gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> What is the output of "gethostname"? (gethostname is shipped with SGE
>> >> >> >> in the util dir.)
>> >> >> >>
>> >> >> >> Rayson
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, Dec 27, 2013 at 5:34 PM, Lyn Gerner
>> >> >> >> <schedulerqueen at gmail.com>
>> >> >> >> wrote:
>> >> >> >> > Hi All,
>> >> >> >> >
>> >> >> >> > Okay, I'm in the Twilight Zone now.  After starting a small cluster on
>> >> >> >> > the 23rd, and doing minimal reconfig (qmod -d) to disable the sge_execd
>> >> >> >> > on the master and qconf -mq all.q to change some slot counts -- all of
>> >> >> >> > which worked fine -- I come back these days later to find an unusable
>> >> >> >> > SGE config:
>> >> >> >> >
>> >> >> >> > root@AWS-VTMXmaster-w2b ~
>> >> >> >> > # qstat -f
>> >> >> >> > error: sge_gethostbyname failed
>> >> >> >> >
>> >> >> >> > /etc/hosts is correct for all its (internal) host addrs:
>> >> >> >> >
>> >> >> >> > root@AWS-VTMXmaster-w2b ~
>> >> >> >> > # cat /etc/hosts
>> >> >> >> > 127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
>> >> >> >> > ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
>> >> >> >> > 10.250.65.204 master
>> >> >> >> > 10.251.30.12 node001
>> >> >> >> >
>> >> >> >> *****
>> >> >> >>
>> >> >> >> > The gethostbyname utility works correctly (so does gethostbyaddr):
>> >> >> >> >
>> >> >> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname master
>> >> >> >> > Hostname: master
>> >> >> >> > Aliases:
>> >> >> >> > Host Address(es): 10.250.65.204
>> >> >> >> >
>> >> >> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname node001
>> >> >> >> > Hostname: node001
>> >> >> >> > Aliases:
>> >> >> >> > Host Address(es): 10.251.30.12
>> >> >> >
>> >> >> >
>> >> >> > ******
>> >> >> >>
>> >> >> >> >
>> >> >> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> > # qstat -f
>> >> >> >> > error: sge_gethostbyname failed
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > I went so far as to edit the hostname in /etc/sysconfig/network to
>> >> >> >> > contain "master" and "node001" on the two nodes.  Same error.
>> >> >> >> >
>> >> >> >> > I have been all over the 'net looking for solutions, but have found
>> >> >> >> > nothing with a clear resolution.  gridengine.sunsource.net is gone.
>> >> >> >> > The follow-on at http://gridengine.org/pipermail/users/ doesn't seem
>> >> >> >> > to be searchable, except on an onerous, month-by-month click-thru
>> >> >> >> > basis (which hasn't yielded anything useful as I slog thru it).
>> >> >> >> >
>> >> >> >> > Short of starcluster restart'ing, I'll appreciate anyone's inputs on
>> >> >> >> > what to try next.
>> >> >> >> >
>> >> >> >> > Thanks much,
>> >> >> >> > Lyn
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > _______________________________________________
>> >> >> >> > StarCluster mailing list
>> >> >> >> > StarCluster at mit.edu
>> >> >> >> > http://mailman.mit.edu/mailman/listinfo/starcluster
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>

