
Diagnosis of errors when trying to run CPLEX's distributed MIP solver

Troubleshooting


Problem

CPLEX's distributed MIP solver relies more heavily on system calls than other CPLEX features, so its error messages are often less informative about the specific cause of a problem. This technote addresses that challenge by providing additional details about how the distributed MIP solver is initialized, which can help users identify and correct the errors that prevent the solver from running.

Symptom

Error messages with limited information when running the distributed MIP optimizer

Cause

Errors in the configuration/setup of the distributed MIP solver

Environment

All operating systems on which CPLEX supports the distributed MIP solver

Resolving The Problem


The CPLEX distributed MIP solver offers 3 different transport protocols to identify and connect the master and worker processes. However, regardless of the specific choice of protocol, the following 3 tasks must be performed.

  1. An XML-formatted .vmc file must be configured for the master process; it identifies the workers that will be used.
  2. The distributed MIP service must be started on each worker machine. The IP address or machine name in the command that starts the service corresponds to the worker, not the master.
  3. The location of the shared libraries used by the master and worker processes must be accessible to all machines. Typically this is done by installing CPLEX in a directory accessible to all machines, but it can also be accomplished by separate installations on each machine to be used. Depending on the operating system, the LD_LIBRARY_PATH (Linux/Unix), PATH (Windows), DYLD_LIBRARY_PATH (MacOS) or LIBPATH (AIX) environment variable must be set on the master machine to point to the bin directory of your CPLEX installation that contains the required shared library. For the worker machines, use the -libpath option in the command that starts up the distributed MIP service.
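
As an illustration of the third task, here is a minimal sketch for a Linux master using a bash shell, assuming the CPLEX shared libraries live in /nfs/CPLEX as in the examples later in this document:

export LD_LIBRARY_PATH=/nfs/CPLEX:$LD_LIBRARY_PATH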

The specifics of how these tasks are performed (especially the second one) vary depending on the transport protocol selected, the details of which appear below. However, all three must be done regardless of the selected protocol, and if something is missing or inconsistently configured, an error will occur when you try to start the distributed MIP optimization.


TCP/IP transport protocol

With the TCP/IP transport protocol, these 3 tasks are performed separately and explicitly. Task 2 is performed on the worker machines independently of the master, before the master is even started. On each worker, a distributed MIP service is started that listens for connections and, when the master connects, starts the worker process. This distributed MIP service is itself a tcp/ip server, started by invoking cplex with command line options that specify the location of the CPLEX shared libraries, the IP address of the worker machine through which it will communicate with the master machine, and a free port on the worker machine on which the distributed MIP service will listen:

<path_to_CPLEX>/cplex -worker=tcpip -libpath=<path_to_CPLEX> -address=<ip_address_of_the_interface_to_use_on_the_worker>:<port_to_use>

In this command, replace <...> with the appropriate values. If the worker machine has only a single IP address (e.g. a public one for internet access), use that value in the address specification. If the worker machine has multiple network interfaces (e.g. a private interface to a private LAN in addition to the aforementioned public one), CPLEX needs to know on which interface it will communicate with the master. To reiterate: the IP address in the -address argument is the address of the network interface on the worker machine through which it will communicate with the master machine.

The port to use is a free port on which cplex will listen. This port cannot be in use by any other process, and it cannot be a "privileged" port, that is, a port number below 1024.

Optionally add the argument -debug to get extra information out of the worker process.
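
For illustration, a concrete invocation on a worker might look like the following, assuming (as in the examples below) that CPLEX is installed under /nfs/CPLEX, that 192.168.1.12 is the hypothetical address of the worker's interface to the master, and that port 12345 is free:

/nfs/CPLEX/cplex -worker=tcpip -libpath=/nfs/CPLEX -address=192.168.1.12:12345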

Once the workers are all started, the master can be started. Set the LD_LIBRARY_PATH or equivalent environment variable using the appropriate shell command (e.g. the export command if using the bash shell). The .vmc file lists the name or IP address of each worker machine and the port through which the TCP/IP connection between the master and that worker is made; an example appears in the user manual section on the TCP/IP transport protocol. The port specified to the master in the .vmc file for each worker must match the port passed on the command line when the distributed MIP service was started on that worker.
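
For orientation only, a .vmc file for TCP/IP transport has roughly the following shape (treat the element and attribute names here as a sketch and take the authoritative format from the user manual example); host2 and the hypothetical address and port match the worker start-up example above:

<vmc>
<machine name="host2">
<transport type="TCP/IP">
<address host="192.168.1.12" port="12345"/>
</transport>
</machine>
</vmc>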


Process transport protocol

When using the process transport protocol, the .vmc file content describes how to handle task 2. The .vmc file not only identifies each worker machine to be used, it also defines the connection, so that the distributed MIP service is started when the connection is made. For example, let's have a look at the vmc file used in the documentation for the process transport protocol. Note that this assumes that passwordless login has been set up (the changes needed if passwords must be entered are discussed below).


<vmc>
<machine name="host2">
<transport type="process">
<cmdline>
<item value="ssh"/>
<item value="host2"/>
<item value="/nfs/CPLEX/cplex"/>
<item value="-worker=process"/>
<item value="-stdio"/>
<item value="-libpath=/nfs/CPLEX"/>
</cmdline>
</transport>
</machine>
</vmc>

The <cmdline> section builds the command that the master process (which reads the .vmc file) issues to connect to each worker machine listed in the file. So this section of the vmc file says to connect to worker machine host2 via the command formed by appending the 6 listed items:

ssh host2 /nfs/CPLEX/cplex -worker=process -stdio -libpath=/nfs/CPLEX

Note that the basic syntax for ssh is "ssh hostname command", which is what we are doing here. If we want to add the -debug option (discussed below) to get more information about what is going wrong, we would add a line

<item value="-debug"/>

to the list of items in the vmc file snippet above. Similarly, if the login command needed to supply a password on the command line (see the note on sshpass below), each additional argument would need its own <item> line as well. Instead of ssh you can potentially use any other program that transparently passes through the stdin/stdout/stderr of the process; however, only ssh is officially tested and supported, so we recommend having a really good reason before using an alternative to ssh.

The description above assumes that key-based passwordless communication is already set up from the master machine to the worker machines. If that is not the case, tools like sshpass can be used instead of ssh in the .vmc file to supply the password needed, but we strongly recommend to use passwordless communication instead.
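
For illustration only, here is a sketch of how the <cmdline> section above might change if sshpass were used to supply the password non-interactively; the -p option and the placeholder password are assumptions, and the password then sits in plain text in the .vmc file, which is one more reason to prefer key-based passwordless login:

<cmdline>
<item value="sshpass"/>
<item value="-p"/>
<item value="your_password"/>
<item value="ssh"/>
<item value="host2"/>
<item value="/nfs/CPLEX/cplex"/>
<item value="-worker=process"/>
<item value="-stdio"/>
<item value="-libpath=/nfs/CPLEX"/>
</cmdline>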


OpenMPI and MPICH transport protocol

With this protocol, MPI is used for communication between the master and the worker processes. Of the many MPI implementations, OpenMPI and MPICH are supported. A cluster of machines is created that contains the master and worker machines. MPI is the most complicated transport protocol to use; we recommend it only for users who have used MPI before. Note also that MPI is not supported on Windows and is therefore less portable.

Using this protocol involves running a script on the master machine that first configures the cluster of machines and then invokes the CPLEX distributed MIP service, making it available to all machines in the cluster. The example below illustrates how to use OpenMPI to start up the environment. The first 4 lines of the script invoke mpirun to configure the cluster, in this case consisting of 3 machines (host1, host2 and host3). Make sure the MPIDIR environment variable correctly identifies the directory containing the MPI installation, and that the LD_LIBRARY_PATH environment variable identifies the cplex directory containing the CPLEX shared libraries used for distributed MIP. The next 5 lines specify the command to run on each machine. The command is cplex, and its arguments specify that MPI will be used for communication, that the OpenMPI implementation will be used, and where the MPI libraries are located:

$MPIDIR/bin/mpirun \
-x "LD_LIBRARY_PATH=/nfs/CPLEX:$MPIDIR/lib" \
-tag-output \
-host host1 -host host2 -host host3 \
/nfs/CPLEX/cplex \
-mpi \
-libpath="$MPIDIR/lib" \
-mpilib="$MPIDIR/lib/libmpi.so" \
-mpiapi="openmpi"

The vmc file must then be configured to correctly identify the worker machines and to indicate that the MPI transport protocol is being used. Each worker is identified by its rank in the MPI cluster (host1, listed first in the mpirun command, is the master at rank 0; host2 and host3 are the workers at ranks 1 and 2), e.g.

<vmc>
<machine name="host2">
<transport type="MPI">
<rank value="1"/>
</transport>
</machine>

<machine name="host3">
<transport type="MPI">
<rank value="2"/>
</transport>
</machine>
</vmc>

If the MPICH implementation of MPI is used then the mpirun command is slightly different.
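
As a rough sketch only (the exact mpirun/mpiexec options vary between MPICH versions, so check the MPICH documentation and the CPLEX user manual), the corresponding MPICH invocation could look like the following; here -genvlist forwards an already-exported LD_LIBRARY_PATH to the remote processes, -l labels the output by rank, and the -mpilib and -mpiapi values shown are assumptions pointing at MPICH:

$MPIDIR/bin/mpirun \
-genvlist LD_LIBRARY_PATH \
-l \
-hosts host1,host2,host3 \
/nfs/CPLEX/cplex \
-mpi \
-libpath="$MPIDIR/lib" \
-mpilib="$MPIDIR/lib/libmpich.so" \
-mpiapi="mpich"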



Proper networking of the master and worker machines

In order to execute distributed optimization with CPLEX, the master and worker machines must be properly networked so that they can communicate. If you experience problems, confirm that basic network functionality between the master and worker machines is working properly. While it is ultimately the user's responsibility to ensure that the network is functioning, here is a list of steps to validate basic network functionality. If these do not help you resolve the problem, contact your network administrator for additional help.

First, make sure the master and workers can see each other on the network in the most basic sense by using the ping command followed by the name or IP address of the other machines that should be accessible on the network.
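
For example, from the master (using the worker name host2 from the examples above, or the hypothetical worker address 192.168.1.12 from the TCP/IP example):

ping host2
ping 192.168.1.12

Then run the same check in the other direction, from each worker back to the master.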

Second, make sure that the master and workers can communicate properly using the tools that the CPLEX distributed MIP will rely on.
  • For process communication make sure that you can connect to the worker machines and execute a simple command there. For example, issue the command "ssh <worker_ip_address> ls". That command should list the contents of your home directory on the worker machine.
  • For tcp/ip communication, ensure correct port usage. Port numbers below 1024 require root or administrator privileges on Unix and Windows; do not use a port number below 1024 in the .vmc file or in the command that starts up the distributed MIP service on the worker machines.
    • On Windows you can use the "netstat -a" command on the worker machine to find out which ports are already occupied. Any port in a listening state cannot be used for distributed MIP. Even if netstat indicates that a port is usable, there might still be a firewall blocking communication between the master and worker machines. The final test to make sure that communication is possible is to set up a simple tcp/ip server on the worker and connect to it from the master. For this purpose one can use "Network Stuff" (http://jacquelin.potier.free.fr/networkstuff/), a freely available tool (IBM does not endorse this software; we just found it useful). The first item in the "User Manual" section of that page describes how to set up a tcp/ip server (do this on the worker) and how to connect to it (do this from the master).
    • On Linux the netcat command (nc or ncat on some systems) can be used to verify that a port is free. Issue
      "netcat -l <ip_address_of_the_interface_to_use_on_the_worker> <desired_port>"
      on the worker machine. The message "netcat: Address already in use" means that the port is already in use and cannot be used by cplex. If the command just waits, the port is free (you can stop netcat with ctrl-c). Just like on Windows, it is possible that a firewall still blocks communication. netcat can also be used to set up a simple tcp/ip server and then connect to it. On the worker issue the command
      "netcat -l <ip_address_of_the_interface_to_use_on_the_worker> <desired_port>"
      and afterwards on the master issue
      "netcat -v -z <ip_address_of_the_interface_to_use_on_the_worker> <desired_port>"
      The test is successful if the netcat command on the worker stops and you get back a prompt, and on the master you get a "Connection ... succeeded!" message.
    • Machines accessed through a cloud service provider may have additional security restrictions that must be relaxed for the CPLEX distributed MIP to work. That is, the firewall blocking communication between the master and the workers may exist neither on the master nor on the worker, but in the cloud infrastructure itself. For example, Microsoft(TM)'s Azure system appears to have this issue: https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-create-vnet-arm-pportal?toc=%2fazure%2fvirtual-machines%2fwindows%2ftoc.json
  • For MPI transport, when using OpenMPI or MPICH, we recommend first getting MPI itself to run properly with a simple program or command that just prints a message to the screen (e.g. "Hello World"); see the sketch after this list.
  • To recap this section: make sure you can establish a basic master-worker connection first. Multiple IP addresses on the master or worker machines can cause problems.
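
As a minimal MPI sanity check, under the same assumptions as the OpenMPI example above (MPIDIR set and the same three hosts), the following runs the standard hostname command on every machine in the cluster; if MPI and the remote login it relies on are working, each host prints its own name:

$MPIDIR/bin/mpirun -host host1 -host host2 -host host3 hostname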

Diagnostic tools

CPLEX offers two tools to obtain more verbose output.
  • Set the environment variable ILOG_CPLEX_TRACE_REMOTE_OBJECT=99 to get more verbose information (see the example after this list).
  • Use the -debug command line option when starting the distributed MIP service on the worker machines, e.g.
    /nfs/CPLEX/cplex -debug -worker=tcpip -libpath=/nfs/CPLEX -address=<yourmachine.yourcompany.com>:12345
    The machine specified in the -address field refers to the worker machine on which the command is run, not the master machine.
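
For the first option, assuming a bash shell on the master machine, set the variable before invoking CPLEX (use the equivalent mechanism for your shell or operating system):

export ILOG_CPLEX_TRACE_REMOTE_OBJECT=99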

In addition, here are some error messages and explanations. This list is not exhaustive and may change in future releases.


Error code: CPLEX Error 1813: Unspecified operating system error.
Explanation: Possible causes: a mismatch between the port numbers in the .vmc file and those used when starting the distributed MIP service on the workers; the distributed MIP service not started at all, or not started properly, on one or more worker machines; the name of a worker machine incorrectly specified in the .vmc file (either pointing to a nonexistent machine or to a machine on which the service has not been started).

Error code: CPLEX Error 1814: Failed to load dynamic library.
Explanation: LD_LIBRARY_PATH or equivalent not properly set when invoking CPLEX on the master machine.

Error: Could not load worker tcpip: -13
Explanation: The appropriate directory was not specified in the -libpath command line flag on a worker. In addition to failing to specify the path to the CPLEX shared libraries, this error message can occur in the presence of a version mismatch between the shared libraries identified and the version of CPLEX invoked to start the worker service.

Error: getaddrinfo: -2/Name or service not known
Explanation: Most likely an error in the IP address or name of the worker machine in the command used to start the distributed MIP service on that worker. In other words, in the specification -address=<yourmachine.yourcompany.com>:12345, yourmachine.yourcompany.com does not correctly identify a machine available in your network. It is best to use the IP address of the network interface on the worker that should be used for communication.

Issue: No error message appears, but CPLEX quietly runs regular MIP rather than distributed MIP.
Explanation: This typically means the .vmc file was not read in or contains errors. You can confirm that the distributed MIP solver is running by checking for the following output after the presolve step at the start of the optimization:

Running distributed MIP on 2 solvers.
Setting up 2 distributed solvers.
Setup time = 0.02 sec. (0.00 ticks)
Starting ramp-up.

Summarizing, there are 3 core operations involved in the setup and configuration of the distributed MIP solver: proper setting of environment variables so that the master machine can locate the directories containing the required shared libraries; proper configuration of the .vmc file; and proper command line arguments when starting the distributed MIP service on the worker machines. Most problems with the distributed MIP solver involve errors in one or more of these operations, or inconsistencies between two or more of them. If errors persist, use the simple tests in the section above on proper networking to determine whether the problem lies in your network configuration rather than in your CPLEX distributed MIP configuration.


[{"Product":{"code":"SSSA5P","label":"IBM ILOG CPLEX Optimization Studio"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"Parallel CPLEX","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"12.8.0;12.7.1;12.7.0;12.6.3;12.6.2;12.6.1;12.6.0.1;12.6","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 June 2018

UID

swg22012097