Topic
  • 2 replies
  • Latest Post - 2012-11-14T16:48:30Z by wiedela
wiedela
19 Posts

Pinned topic Nodes Shutting Down Unexpectedly

2012-11-13T23:13:29Z
We are seeing some strange behavior in a new environment that we just moved a grid instance into. In our development environment, our catalog and container servers have run without problems. After the move to one of our test environments, the catalog server seems to work fine, but our containers keep shutting down unexpectedly and then attempting to restart.

CWOBJ1224I JVM process is ending because a replacement JVM has started
CWOBJ2523I Stopping this catalog or container server due to an external signal from the OS

This repeats over and over, and the containers never fully start. Has anyone seen this behavior before? How do we go about troubleshooting it? We suspect it has something to do with the environment/OS, but we don't know where to start in determining the differences between our dev and test environments.

Thanks
Updated on 2012-11-14T16:48:30Z by wiedela
  • jhanders
    260 Posts

    Re: Nodes Shutting Down Unexpectedly

    2012-11-13T23:44:57Z
    What you are seeing is the container restart function in eXtreme Scale. When a container is isolated from the rest of the servers because of missed heartbeats, it is removed from the grid and its shards are rebalanced over the remaining containers. When that container comes back into view of the other servers, it is restarted because it is no longer part of the grid. Once the restart finishes, it rejoins the other servers in the grid, and if it loses view again, the cycle repeats.
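    If you want to confirm what the catalog service thinks is happening, you can query shard placement while the cycle repeats. A rough sketch, assuming xscmd is on your path, the catalog listens on the default port 2809, and MyGrid/MyMapSet stand in for your actual grid and map set names:

        # Show where shards are currently placed, as seen by the catalog server
        xscmd -c showPlacement -cep cataloghost:2809 -g MyGrid -ms MyMapSet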

    There are several reasons a server can go out of view. CPU starvation can prevent other servers from contacting it for a while, and a network split can make it invisible to the other container servers.
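    If brief stalls are tripping failure detection, one mitigation to try is relaxing the heartbeat level when the catalog server is started. A hedged sketch, assuming the -heartbeat option on startOgServer (0 is the typical default, -1 the relaxed setting; confirm against the documentation for your version):

        # Start the catalog service with relaxed failure detection so
        # short CPU or network stalls are less likely to evict a container
        startOgServer.sh catalogServer -heartbeat -1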

    A message you should see is CWOBJ1123, indicating that the server has become disconnected. The other servers in the grid should also report that it is no longer reachable. With the server logs we could give more information about what the problem is.
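    A quick way to check is to search every server's log directory for those message IDs, for example (the log path is a placeholder for wherever your servers write their logs):

        # Find the disconnect and restart messages across all server logs
        grep -r "CWOBJ1123\|CWOBJ1224I\|CWOBJ2523I" /opt/ibm/wxs/logs/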

    One thing to validate is that your heap size is set appropriately. If it is not set on this container, for instance, and entries are placed on it, the JVM could garbage collect heavily and potentially cause CPU starvation.
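    For example, you could pin the heap and turn on verbose GC when starting the container so that long pauses show up in the logs. A sketch only, with placeholder names and sizes; the -jvmArgs option passes everything after it to the JVM:

        # Fix the heap size and log GC activity that could starve heartbeats
        startOgServer.sh container0 -objectgridFile objectgrid.xml \
            -catalogServiceEndPoints cataloghost:2809 \
            -jvmArgs -Xms1024m -Xmx1024m -verbose:gc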

    I hope this helps give some direction,

    Jared Anderson
  • wiedela
    19 Posts

    Re: Nodes Shutting Down Unexpectedly

    2012-11-14T16:48:30Z
    Thank you, Jared, for the quick answer. We have double-checked the CPU usage and relaxed the heartbeat, and we are still seeing the same behavior. The catalog server logs tell us that the heartbeat is expiring, so somehow the containers are not able to bootstrap. Another odd behavior: when we try to tear down using the xscmd scripts against the catalog's host:listener-port, the containers don't tear down.
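    For reference, the teardown invocation we are trying looks roughly like this, with cataloghost:2809 standing in for our actual catalog host and listener port:

        # Tear down all container servers registered with this catalog server
        xscmd -c teardown -cep cataloghost:2809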

    Our current theory is that our network is somehow not allowing the nodes to communicate with the catalog server, so the heartbeat messages never get through. What steps should we take to troubleshoot our network and verify that the containers' heartbeats reach the catalog server?
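    For instance, are basic connectivity checks from a container host enough, along these lines (a sketch; 2809 is the default catalog listener port, ours may differ)?

        # Can the container host resolve and reach the catalog host?
        ping cataloghost

        # Is the catalog server's listener port open from this host?
        telnet cataloghost 2809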

    Thanks