[StarCluster] Twilight Zone: sge_gethostbyname failed

Lyn Gerner schedulerqueen at gmail.com
Fri Dec 27 19:40:18 EST 2013


Thanks for digging, Rayson.

So, /etc/sysconfig/network had HOSTNAME=centos-ami when the problem first
occurred.  I tried resetting it to "master" (and to "node001" on the other
node) and then retried the SGE commands (qstat, qsub, etc.).  They still
failed with the same error, so I switched both files back to their original
values, since I didn't know for sure whether they had been set to master and
node001 to begin with.
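
For reference, the mismatch can be checked offline. The sketch below is only an illustration: it assumes name resolution goes through /etc/hosts alone (DNS is ignored), and the function names are made up for this example, not part of SGE or StarCluster. It tests whether a given hostname appears in /etc/hosts-style text, using the master's /etc/hosts as posted earlier in this thread.

```python
def names_in_hosts(hosts_text):
    """Return every hostname and alias listed in /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # field 0 is the address, the rest are names
    return names


def hostname_is_listed(hostname, hosts_text):
    """True if hostname could be resolved from this hosts text alone."""
    return hostname in names_in_hosts(hosts_text)


# /etc/hosts from the master, as posted earlier in this thread
HOSTS = """\
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.250.65.204 master
10.251.30.12 node001
"""

if __name__ == "__main__":
    for name in ("master", "node001", "centos-ami"):
        print(name, hostname_is_listed(name, HOSTS))
```

With that file, "master" and "node001" are listed and "centos-ami" is not, which matches the gethostname failure. Also worth keeping in mind: on RHEL-style systems /etc/sysconfig/network is read at boot, so editing it alone would not be expected to change what `hostname` reports on a running node.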

Thanks,
Lyn


On Fri, Dec 27, 2013 at 2:35 PM, Rayson Ho <raysonlogin at gmail.com> wrote:

> (Updating the list...)
>
> The hostname on the master gets reset to centos-ami, which is not
> resolvable. Thus Grid Engine complains about the hostname issue.
>
> Lyn: what is the value of the HOSTNAME key in "/etc/sysconfig/network"
> on your master instance?
>
> Justin & other devs: set_hostname() in node.py works on Ubuntu because
> Ubuntu uses /etc/hostname, but RHEL (and RHEL-based distros like
> CentOS, Oracle Linux, Scientific Linux) uses /etc/sysconfig/network,
> and SuSE uses yet another file, /etc/HOSTNAME!
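
A rough sketch of what a distro-aware version might look like, based only on the three file locations listed above. This is not the actual set_hostname() from node.py: the family keys ("ubuntu", "rhel", "suse") and the function name are hypothetical, and the function only renders the new file contents without writing anything or changing the running hostname.

```python
import re

# Persistent hostname locations, per the list above.
# The dictionary keys are illustrative labels, not StarCluster identifiers.
HOSTNAME_FILES = {
    "ubuntu": "/etc/hostname",
    "rhel": "/etc/sysconfig/network",
    "suse": "/etc/HOSTNAME",
}


def render_hostname_file(family, hostname, current_text=""):
    """Return (path, new_contents) for the given distro family.

    For Ubuntu and SuSE the file holds just the hostname. For RHEL-style
    systems the HOSTNAME= key is replaced in place (or appended if absent)
    so that other keys such as NETWORKING=yes are preserved.
    """
    path = HOSTNAME_FILES[family]
    if family != "rhel":
        return path, hostname + "\n"

    new_line = "HOSTNAME=" + hostname
    pattern = re.compile(r"^HOSTNAME=.*$", flags=re.MULTILINE)
    if pattern.search(current_text):
        # A callable replacement avoids any backslash interpretation.
        return path, pattern.sub(lambda m: new_line, current_text)

    separator = "" if current_text == "" or current_text.endswith("\n") else "\n"
    return path, current_text + separator + new_line + "\n"


if __name__ == "__main__":
    old = "NETWORKING=yes\nHOSTNAME=centos-ami\n"
    print(render_hostname_file("rhel", "master", old))
```

Keeping the rendering separate from the write makes the per-distro logic easy to test without touching a live node.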
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Fri, Dec 27, 2013 at 6:39 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> > I used the Scientific Linux AMI (been a long time, but I found it from the
> > SC site), and 0.94.3 is my SC version.
> >
> >
> > On Fri, Dec 27, 2013 at 1:36 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >>
> >> Hmm, which AMI did you use, and what's the version of SC?
> >>
> >> Rayson
> >>
> >>
> >> On Fri, Dec 27, 2013 at 6:33 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> > # /opt/sge6/utilbin/linux-x64/gethostname -name
> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
> >> >
> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> > # hostname
> >> > centos-ami
> >> >
> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> > # hostname -f
> >> > hostname: Unknown host
> >> >
> >> > What's weird is that I have never mucked with any of this under StarCluster,
> >> > and have only recently started having problems.  Can't pinpoint any specific
> >> > event or thing that changed--except that I started leaving the config up for
> >> > days instead of hours at a stretch.
> >> >
> >> > Thanks,
> >> > Lyn
> >> >
> >> >
> >> > On Fri, Dec 27, 2013 at 1:30 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >> >>
> >> >> No problem, and I think that's why it is failing. Can you also send me
> >> >> the output of:
> >> >>
> >> >> 1) gethostname -name
> >> >>
> >> >> 2) hostname
> >> >>
> >> >> 3) hostname -f
> >> >>
> >> >> Rayson
> >> >>
> >> >>
> >> >> On Fri, Dec 27, 2013 at 6:27 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> >> > My bad:
> >> >> >
> >> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> > # /opt/sge6/utilbin/linux-x64/gethostname -all
> >> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
> >> >> >
> >> >> > Thanks for any insights,
> >> >> > Lyn
> >> >> >
> >> >> >
> >> >> > On Fri, Dec 27, 2013 at 1:25 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >> >> >>
> >> >> >> But I need the output of "gethostname", not "gethostbyname"... :-P
> >> >> >>
> >> >> >> Rayson
> >> >> >>
> >> >> >>
> >> >> >> On Fri, Dec 27, 2013 at 6:11 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> >> >> > Thanks for the quick response, Rayson.  Output from gethostbyname is in
> >> >> >> > between the ****s below:
> >> >> >> >
> >> >> >> > On Fri, Dec 27, 2013 at 1:04 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> What is the output of "gethostname"? (gethostname is shipped with SGE
> >> >> >> >> in the util dir.)
> >> >> >> >>
> >> >> >> >> Rayson
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Fri, Dec 27, 2013 at 5:34 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> >> >> >> > Hi All,
> >> >> >> >> >
> >> >> >> >> > Okay, I'm in the Twilight Zone now.  After starting a small cluster on the
> >> >> >> >> > 23rd, and doing minimal reconfig (qmod -d) to disable the sge_execd on the
> >> >> >> >> > master and qconf -mq all.q to change some slot counts -- all of which worked
> >> >> >> >> > fine -- I come back these days later to find an unusable SGE config:
> >> >> >> >> >
> >> >> >> >> > root@AWS-VTMXmaster-w2b ~
> >> >> >> >> > # qstat -f
> >> >> >> >> > error: sge_gethostbyname failed
> >> >> >> >> >
> >> >> >> >> > /etc/hosts is correct for all its (internal) host addrs:
> >> >> >> >> >
> >> >> >> >> > root@AWS-VTMXmaster-w2b ~
> >> >> >> >> > # cat /etc/hosts
> >> >> >> >> > 127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
> >> >> >> >> > ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
> >> >> >> >> > 10.250.65.204 master
> >> >> >> >> > 10.251.30.12 node001
> >> >> >> >> >
> >> >> >> >> *****
> >> >> >> >>
> >> >> >> >> > The gethostbyname utility works correctly (so does gethostbyaddr):
> >> >> >> >> >
> >> >> >> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname master
> >> >> >> >> > Hostname: master
> >> >> >> >> > Aliases:
> >> >> >> >> > Host Address(es): 10.250.65.204
> >> >> >> >> >
> >> >> >> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname node001
> >> >> >> >> > Hostname: node001
> >> >> >> >> > Aliases:
> >> >> >> >> > Host Address(es): 10.251.30.12
> >> >> >> >
> >> >> >> >
> >> >> >> > ******
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> >> > # qstat -f
> >> >> >> >> > error: sge_gethostbyname failed
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > I went so far as to edit the hostname in /etc/sysconfig/network to contain
> >> >> >> >> > "master" and "node001" on the two nodes.  Same error.
> >> >> >> >> >
> >> >> >> >> > I have been all over the 'net looking for solutions, but have found nothing
> >> >> >> >> > with a clear resolution.  gridengine.sunsource.net is gone.  The follow-on
> >> >> >> >> > at http://gridengine.org/pipermail/users/ doesn't seem to be searchable,
> >> >> >> >> > except on an onerous, month-by-month click-thru basis (which hasn't yielded
> >> >> >> >> > anything useful as I slog thru it).
> >> >> >> >> >
> >> >> >> >> > Short of starcluster restart'ing, I'll appreciate anyone's inputs on what to
> >> >> >> >> > try next.
> >> >> >> >> >
> >> >> >> >> > Thanks much,
> >> >> >> >> > Lyn
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > _______________________________________________
> >> >> >> >> > StarCluster mailing list
> >> >> >> >> > StarCluster at mit.edu
> >> >> >> >> > http://mailman.mit.edu/mailman/listinfo/starcluster
> >> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>