We are seeing some odd behavior in a new environment that we just moved a grid instance into. In our development environment, we have had no problems running our catalog and container servers. After the move to one of our test environments, the catalog server seems to work fine, but our containers keep shutting down unexpectedly and are then restarted.
CWOBJ1224I JVM process is ending because a replacement JVM has started
CWOBJ2523I Stopping this catalog or container server due to an external signal from the OS
This repeats over and over, and the container never manages to start fully. Has anyone ever seen this behavior? How do we go about troubleshooting the problem? We are guessing that it has something to do with the environment/OS, but we don't know where to start in working out the differences between our dev and test environments.
jhanders
Re: Nodes Shutting Down Unexpectedly
2012-11-13T23:44:57Z
This is the accepted answer.
What you are seeing is the container restart function in eXtreme Scale. When heartbeating detects that a container has become isolated from the rest of the servers, the container is removed from the grid and its shards are rebalanced across the remaining containers. When that container becomes visible to the other servers again, it is restarted because it is no longer part of the grid. Once the restart finishes, it rejoins the other servers in the grid, and if it drops out of view again, the cycle repeats.
There are several reasons a server can go out of view. It can suffer CPU starvation, so that other servers cannot contact it for a while. A network split can also leave it no longer visible to the other container servers.
You should see a CWOBJ1123 message indicating that the server has become disconnected. The other servers in the grid should also be reporting that it is no longer present. With the server logs we could give more information about what the problem is.
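If it helps, here is a rough Python sketch for pulling those messages out of the logs on each server. It only looks for the message IDs mentioned in this thread and takes the log lines as plain strings, so nothing about the log file name or layout is assumed. The function names are made up for illustration.

```python
import re
from collections import Counter

# Message IDs quoted in this thread. An optional trailing severity letter
# is allowed, so "CWOBJ1224" also matches "CWOBJ1224I".
MESSAGE_IDS = ("CWOBJ1123", "CWOBJ1224", "CWOBJ2523")
PATTERN = re.compile(r"\b(" + "|".join(MESSAGE_IDS) + r")[A-Z]?\b")


def count_messages(lines):
    """Count how often each message ID of interest appears in the lines."""
    counts = Counter()
    for line in lines:
        for match in PATTERN.finditer(line):
            counts[match.group(1)] += 1
    return counts


def matching_lines(lines):
    """Return the lines that mention any of the message IDs, in order."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

To use it, read a server log into a list of lines (for example with `open(path).readlines()`, where `path` is wherever your logs live) and pass that list to both functions. Comparing the counts and the order of the matching lines across servers should show whether the disconnect message appears before each restart.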
One thing to validate is that your heap size is set appropriately. If, for instance, it is not set on this container and entries are placed on the container, the JVM could garbage collect heavily and potentially cause CPU starvation.
I hope this helps give some direction,
wiedela
Re: Nodes Shutting Down Unexpectedly
2012-11-14T16:48:30Z
This is the accepted answer.
Our current theory is that our network is somehow preventing the nodes from communicating with the catalog server, so the heartbeat messages are not getting through. What steps should we take to troubleshoot our network and make sure the containers' heartbeats get through?
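A basic first check is whether each machine can open TCP connections to the others on the ports the grid is configured to use. Below is a minimal Python sketch of that check, not an eXtreme Scale tool. The function names are invented for illustration, and the host names and ports in the usage comment are placeholders to replace with the values from your own catalog and container configuration. A successful connect only shows that the port is reachable at that moment; it does not prove that heartbeats are flowing.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each (host, port) pair to whether a TCP connection could be opened."""
    return {
        (host, port): can_connect(host, port, timeout)
        for host, port in endpoints
    }


# Example usage (placeholder hosts and ports, substitute your own):
# results = check_endpoints([("catalog-host.example", 1234),
#                            ("container-host.example", 5678)])
# for endpoint, reachable in results.items():
#     print(endpoint, "reachable" if reachable else "NOT reachable")
```

Running it from each container machine toward the catalog server, and from the catalog machine back toward each container, would show whether the problem is in one direction only, which is something dev and test firewalls often differ on.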