Getting started with DDL

WML CE Distributed Deep Learning (DDL) is an MPI-based communication library that is optimized specifically for deep learning training. An application integrated with DDL becomes an MPI application, so the ddlrun command can be used to launch the job in parallel across a cluster of systems. DDL understands multi-tier network environments and selects among different libraries (for example, NCCL) and algorithms to get the best performance in multi-node, multi-GPU environments.
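As a sketch of what such a launch looks like, the following assumes two nodes named host1 and host2 (placeholder hostnames) and a DDL-integrated training script train.py (hypothetical); it cannot run outside a configured cluster:

```shell
# Hypothetical example: host1, host2, and train.py are placeholders.
# ddlrun launches the script as an MPI job across the listed hosts,
# one rank per GPU, using the DDL communication library.
ddlrun -H host1,host2 python train.py
```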

Initial set up

Some configuration steps are common to all use of DDL:

  • WML CE frameworks must be installed at the same version on all nodes in the DDL cluster.
  • The DDL master node must be able to log in to all the nodes in the cluster by using ssh keys. Keys can be created and distributed as follows:
    • Generate an ssh private/public key pair on the master node by using:
      ssh-keygen
    • Copy the generated public key in ~/.ssh/ to each node's ~/.ssh/authorized_keys file:
      ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@$HOST
  • Linux system firewalls might need to be adjusted to allow MPI traffic. This adjustment can be made broadly, as shown:
    Note: Opening only required ports would be more secure. Required ports vary with configuration.
    sudo iptables -A INPUT -p tcp --dport 1024:65535 -j ACCEPT
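The key-setup steps above can be combined into a short script run once on the master node; this is a sketch in which the HOSTS list is a placeholder for your cluster's hostnames, and it assumes the default id_rsa key location:

```shell
# Hypothetical hostnames; replace with the nodes in your DDL cluster.
HOSTS="node1 node2 node3"

# Generate a key pair with an empty passphrase (-N '') so that ddlrun
# can log in non-interactively. Skip if ~/.ssh/id_rsa already exists.
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa

# Push the public key to each node's ~/.ssh/authorized_keys file.
for host in $HOSTS; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub "$USER@$host"
done
```

An empty passphrase trades some security for unattended operation; a passphrase-protected key with ssh-agent is an alternative if your site policy requires it.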