[StarCluster] Twilight Zone: sge_gethostbyname failed

Rayson Ho raysonlogin at gmail.com
Fri Dec 27 19:57:50 EST 2013


We need to change the SC code for RHEL-based distros. Each distro does
things slightly differently, and that's why you get that behavior.
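
As a rough sketch of what a distro-aware fix might look like (this is not
StarCluster's actual set_hostname(); the helper name and the family labels
are made up for illustration), the hostname file could be chosen per distro
family, using the file locations named further down this thread:

```python
# Hypothetical helper, NOT part of StarCluster. The keys are illustrative
# labels; the paths are the ones mentioned in this thread.
HOSTNAME_FILES = {
    "ubuntu": "/etc/hostname",
    "rhel": "/etc/sysconfig/network",  # RHEL, CentOS, Oracle Linux, Scientific Linux
    "suse": "/etc/HOSTNAME",
}


def hostname_file_for(family):
    """Return the file that persists the hostname for a distro family."""
    try:
        return HOSTNAME_FILES[family]
    except KeyError:
        raise ValueError("unknown distro family: %r" % (family,))
```

Note that on RHEL-based distros the file holds several KEY=value settings,
so the HOSTNAME line has to be edited in place rather than the whole file
being overwritten.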

In the meantime, you might want to go to each node, set the hostname by
editing the HOSTNAME key in /etc/sysconfig/network and running
"hostname <name>" as root, and then restart OGS/GE.
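
The file edit in that workaround can be sketched in Python as below. This is
only an illustration: set_sysconfig_hostname is a made-up name, it operates
on the file's text rather than on the file itself, and it assumes the usual
KEY=value layout of /etc/sysconfig/network. Running "hostname <name>" as
root and restarting OGS/GE would still be separate steps.

```python
def set_sysconfig_hostname(text, name):
    """Return `text` with its HOSTNAME= line set to `name`.

    Hypothetical helper for illustration. Lines other than HOSTNAME= are
    left untouched; if no HOSTNAME= line exists, one is appended.
    """
    out = []
    replaced = False
    for line in text.splitlines():
        if line.strip().startswith("HOSTNAME="):
            out.append("HOSTNAME=%s" % name)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append("HOSTNAME=%s" % name)
    return "\n".join(out) + "\n"
```

For the master in this thread, the input would contain HOSTNAME=centos-ami
and the call would be set_sysconfig_hostname(text, "master").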

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Fri, Dec 27, 2013 at 7:47 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> Yep, it works again with those changes.
>
> So, how should I stop the regression in a non-kludgy way?
>
> Thanks again,
> Lyn
>
>
> On Fri, Dec 27, 2013 at 2:43 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>
>> /etc/sysconfig/network is read during reboot, and may be after DHCP...
>>
>> To see if it is the issue, set HOSTNAME back to master, and also run
>> "hostname master" as root.
>>
>> Rayson
>>
>>
>>
>> On Fri, Dec 27, 2013 at 7:40 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
>> > Thanks for digging, Rayson.
>> >
>> > So, /etc/sysconfig/network had HOSTNAME=centos-ami when the problem first
>> > occurred.  I tried resetting it to "master" and then retried the SGE
>> > commands (qstat, qsub, etc.).  They still failed with the same error at that
>> > point, so I switched them back, not knowing for sure if they'd been set to
>> > master and node001 to begin with.
>> >
>> > Thanks,
>> > Lyn
>> >
>> >
>> > On Fri, Dec 27, 2013 at 2:35 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>> >>
>> >> (Updating the list...)
>> >>
>> >> The hostname on the master gets reset to centos-ami, which is not
>> >> resolvable. Thus Grid Engine complains about the hostname issue.
>> >>
>> >> Lyn: what is the value of the HOSTNAME key in "/etc/sysconfig/network"
>> >> on your master instance??
>> >>
>> >> Justin & other devs: set_hostname() in node.py works on Ubuntu because
>> >> Ubuntu uses /etc/hostname, but RHEL (and RHEL-based distros like
>> >> CentOS, Oracle Linux, Scientific Linux) uses /etc/sysconfig/network,
>> >> and yet SuSE uses /etc/HOSTNAME!
>> >>
>> >> Rayson
>> >>
>> >>
>> >>
>> >> On Fri, Dec 27, 2013 at 6:39 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
>> >> > I used the Scientific Linux AMI (been a long time, but I found it from the
>> >> > SC site), and 0.94.3 is my SC version.
>> >> >
>> >> >
>> >> > On Fri, Dec 27, 2013 at 1:36 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>> >> >>
>> >> >> Hmm, which AMI did you use, and what's the version of SC?
>> >> >>
>> >> >> Rayson
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Fri, Dec 27, 2013 at 6:33 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
>> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> > # /opt/sge6/utilbin/linux-x64/gethostname -name
>> >> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
>> >> >> >
>> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> > # hostname
>> >> >> > centos-ami
>> >> >> >
>> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> > # hostname -f
>> >> >> > hostname: Unknown host
>> >> >> >
>> >> >> > What's weird is that I have never mucked with any of this under StarCluster,
>> >> >> > and have only recently started having problems.  Can't pinpoint any specific
>> >> >> > event or thing that changed--except that I started leaving the config up for
>> >> >> > days instead of hours at a stretch.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Lyn
>> >> >> >
>> >> >> >
>> >> >> > On Fri, Dec 27, 2013 at 1:30 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>> >> >> >>
>> >> >> >> No problem, and I think that's why it is failing. Can you also send me
>> >> >> >> the output of:
>> >> >> >>
>> >> >> >> 1) gethostname -name
>> >> >> >>
>> >> >> >> 2) hostname
>> >> >> >>
>> >> >> >> 3) hostname -f
>> >> >> >>
>> >> >> >> Rayson
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, Dec 27, 2013 at 6:27 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
>> >> >> >> > My bad:
>> >> >> >> >
>> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostname -all
>> >> >> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
>> >> >> >> >
>> >> >> >> > Thanks for any insights,
>> >> >> >> > Lyn
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Fri, Dec 27, 2013 at 1:25 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>> >> >> >> >>
>> >> >> >> >> But I need the output of "gethostname", not "gethostbyname"... :-P
>> >> >> >> >>
>> >> >> >> >> Rayson
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Fri, Dec 27, 2013 at 6:11 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
>> >> >> >> >> > Thanks for the quick response, Rayson.  Output from gethostbyname is in
>> >> >> >> >> > between the ****s below:
>> >> >> >> >> >
>> >> >> >> >> > On Fri, Dec 27, 2013 at 1:04 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> What is the output of "gethostname"? (gethostname is shipped with SGE
>> >> >> >> >> >> in the util dir.)
>> >> >> >> >> >>
>> >> >> >> >> >> Rayson
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> On Fri, Dec 27, 2013 at 5:34 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
>> >> >> >> >> >> > Hi All,
>> >> >> >> >> >> >
>> >> >> >> >> >> > Okay, I'm in the Twilight Zone now.  After starting a small cluster on the
>> >> >> >> >> >> > 23rd, and doing minimal reconfig (qmod -d) to disable the sge_execd on the
>> >> >> >> >> >> > master and qconf -mq all.q to change some slot counts -- all of which worked
>> >> >> >> >> >> > fine -- I come back days later to find an unusable SGE config:
>> >> >> >> >> >> >
>> >> >> >> >> >> > root at AWS-VTMXmaster-w2b ~
>> >> >> >> >> >> > # qstat -f
>> >> >> >> >> >> > error: sge_gethostbyname failed
>> >> >> >> >> >> >
>> >> >> >> >> >> > /etc/hosts is correct for all its (internal) host addrs:
>> >> >> >> >> >> >
>> >> >> >> >> >> > root at AWS-VTMXmaster-w2b ~
>> >> >> >> >> >> > # cat /etc/hosts
>> >> >> >> >> >> > 127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
>> >> >> >> >> >> > ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
>> >> >> >> >> >> > 10.250.65.204 master
>> >> >> >> >> >> > 10.251.30.12 node001
>> >> >> >> >> >> >
>> >> >> >> >> >> *****
>> >> >> >> >> >>
>> >> >> >> >> >> > The gethostbyname utility works correctly (so does gethostbyaddr):
>> >> >> >> >> >> >
>> >> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname master
>> >> >> >> >> >> > Hostname: master
>> >> >> >> >> >> > Aliases:
>> >> >> >> >> >> > Host Address(es): 10.250.65.204
>> >> >> >> >> >> >
>> >> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname node001
>> >> >> >> >> >> > Hostname: node001
>> >> >> >> >> >> > Aliases:
>> >> >> >> >> >> > Host Address(es): 10.251.30.12
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > ******
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> >> > # qstat -f
>> >> >> >> >> >> > error: sge_gethostbyname failed
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > I went so far as to edit the hostname in /etc/sysconfig/network to contain
>> >> >> >> >> >> > "master" and "node001" on the two nodes.  Same error.
>> >> >> >> >> >> >
>> >> >> >> >> >> > I have been all over the 'net looking for solutions, but have found nothing
>> >> >> >> >> >> > with a clear resolution.  gridengine.sunsource.net is gone.  The follow-on
>> >> >> >> >> >> > at http://gridengine.org/pipermail/users/ doesn't seem to be searchable,
>> >> >> >> >> >> > except on an onerous, month-by-month click-thru basis (which hasn't yielded
>> >> >> >> >> >> > anything useful as I slog thru it).
>> >> >> >> >> >> >
>> >> >> >> >> >> > Short of starcluster restart'ing, I'll appreciate anyone's inputs on what to
>> >> >> >> >> >> > try next.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Thanks much,
>> >> >> >> >> >> > Lyn
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > _______________________________________________
>> >> >> >> >> >> > StarCluster mailing list
>> >> >> >> >> >> > StarCluster at mit.edu
>> >> >> >> >> >> > http://mailman.mit.edu/mailman/listinfo/starcluster

