System requirements for Data Science Experience Local

Ensure that your servers meet the hardware and software requirements for DSX Local.

You need to provide each server's IP address and the storage partition name for the installation.

Operating system requirements

For detailed operating system requirements, search for "Data Science Experience Local" on the Operating systems for a specific product page of the Software Product Compatibility Reports site.

Requirement: To install packages on Red Hat Enterprise Linux, you must set up repositories for installing DSX Local and install the required RPM packages.

Docker requirements

DSX Local for RHEL requires docker to be installed. Because the DSX Local installer does not include the docker distribution from RHEL, you must install docker yourself on all of the nodes before installing DSX Local. Complete the following steps on each node in the cluster:
  1. Enable a Red Hat repo for package installation.
  2. Enable the extras repo so that docker can be installed, for example: subscription-manager repos --enable rhel-7-server-extras-rpms.
  3. Allocate a raw disk with at least 200 GB on each node for docker storage.
  4. Run the docker_redhat_install.sh script on each node (extractable from the DSX Local installation package by running it with the --extract-pre-install-scripts parameter) to automatically install docker from the RHEL repo and to set up devicemapper as the storage driver in "direct-lvm" mode with a 25 GB docker base container size.
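
A minimal sketch of steps 2 through 4 on one node follows; the installer package name is a placeholder, and the exact script behavior can vary by release:

  # Run as root. Enable the extras repo (step 2).
  subscription-manager repos --enable rhel-7-server-extras-rpms
  # Extract the pre-install scripts from the DSX Local installation package (placeholder name).
  ./<dsx-local-installer-package> --extract-pre-install-scripts
  # Install docker from the RHEL repo and configure devicemapper in direct-lvm mode (step 4);
  # the 200 GB raw disk from step 3 must already be attached.
  ./docker_redhat_install.sh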

Hardware and software requirements for a seven-node configuration

This Version 1.2 configuration requires a minimum of six servers (either physical or virtual machines) and either one or two optional servers for deployment management.

Recommendation: Install the operating system with a minimal operating system installation package selection.

You need a sudo username and password for each node (this credential needs sudo root access and is used by the installer to lay down files and configure the node). The password cannot contain a single quotation mark ('), double quotation mark ("), pound sign (#), or white space ( ). After installation, DSX Local runs as root.

Alternatively, you can use an SSH key installation by using root's private SSH key that has been copied to each node (for example, by using the ssh-copy-id command).

If you are using a power broker to manage access, run pbrun to become root on the node that you will install from, and copy this root private SSH key to all other nodes (and use it for installation through the wdp.conf configuration file).

Ensure that each node has an extra disk partition for the installer files. Each storage node requires an additional disk partition. All of these disk partitions must be mounted to paths (the installer will ask for these paths) and formatted with XFS with ftype functionality enabled. Example command to format each partition: mkfs.xfs -f -n ftype=1 -i size=512 -n size=8192 /dev/sdb1
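
For reference, a sketch of one way to prepare such a partition on a spare disk (the device names are examples only; adapt them to your environment):

  # Create a single partition spanning the disk, then format it with XFS (ftype enabled)
  parted -s /dev/sdb mklabel gpt mkpart primary xfs 0% 100%
  mkfs.xfs -f -n ftype=1 -i size=512 -n size=8192 /dev/sdb1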

Recommendation:

To improve performance, add the noatime flag to the mount options in /etc/fstab for both the installer and data storage partitions. Example:


/dev/sdb1       /installer              xfs     defaults,noatime    1 2

As a result, inode access times will not be updated on the filesystems.
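
A quick verification sketch, assuming /installer is the mount path used in the example above:

  mkdir -p /installer
  mount /installer                  # picks up the /etc/fstab entry above
  mount | grep ' /installer '       # mount options should include noatime
  xfs_info /installer | grep ftype  # should report ftype=1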

Minimum server specifications for a seven-node configuration on Red Hat Enterprise Linux (x86, POWER and z)

Node Type | Number of Servers (BM/VM) | CPU | RAM | Disk partitions | IP addresses
Control Plane/Storage | 3 | 8 cores | 48 GB | Minimum 300 GB with XFS format for the installer files partition + minimum 500 GB with XFS format for the data storage partition + minimum 200 GB of extra raw disk space for docker. |
Compute | 3 | 16 cores | 64 GB | Minimum 300 GB with XFS format for the installer files partition + minimum 200 GB of extra raw disk space for docker. If you add additional cores, a total of 48-50 cores distributed across multiple nodes is recommended. |
Deployment | 1 | 16 cores | 64 GB | Minimum 300 GB with XFS format for the installer files partition + minimum 200 GB of extra raw disk space for docker. If you add additional cores, a total of 48-50 cores distributed across multiple nodes is recommended. |

Other requirements:

  • The installation requires at least 10 GB on the root partition.
  • If you plan to place /var on its own partition, reserve at least 10 GB for the partition.
  • SPSS Modeler add-on requirement: If you plan to install the SPSS Modeler add-on, add 0.5 CPU and 8 GB of memory for each stream you plan to create.
  • All servers must be synchronized in time (ideally through NTP or Chrony). Ensure that the system time of all the nodes in the cluster is synchronized to within one second. On each node, if NTP or Chrony is installed but the node is not synchronized to within one second, the installer will not allow you to proceed. If neither NTP nor Chrony is installed, the installer will warn you. If an NTP or Chrony service is running but not used to synchronize time, stop and disable that service on all nodes before running the installer (a quick check is sketched after this list).
  • SSH between nodes should be enabled.
  • YUM should not be already running.
  • Prerequisites for installing DSX Local with NVIDIA GPU support
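
A quick time-synchronization check (a sketch; these commands are not part of the installer) that you can run on each node before starting the install:

  timedatectl status | grep -i 'ntp synchronized'   # RHEL 7 reports "NTP synchronized: yes" when in sync
  chronyc tracking                                   # if Chrony is used, the system time offset should be well under one second
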
Control plane/Storage
Requires a minimum of three servers: one master node to manage the entire cluster and at least two additional nodes for high availability. The Kubernetes cluster requires either a load balancer or one unused IP address as the HA proxy IP address. The IP address must be static, portable, and in the same subnet as the cluster. The data storage path is used by GlusterFS storage management.
Compute
Requires a minimum of three servers: one primary node and at least two extra nodes for high availability and scaling compute resources. During installation, you can add additional nodes for scaling Compute resources, for example, if you expect to run resource-intensive computations or have many processes that run simultaneously.
Deployment
Requires a minimum of one server: one primary node and one optional extra node for high availability. The Deployment nodes are the production versions of the Compute nodes, and thus have identical requirements.

Hardware and software requirements for a four-node configuration

This Version 1.2 configuration requires a minimum of three servers (either physical or virtual machines) and either one or two optional servers for deployment management.

You need a sudo username and password for each node; the password must match the login password of that user (this credential needs sudo root access and is used by the installer to lay down files and configure the node). The password cannot contain a single quotation mark ('), double quotation mark ("), pound sign (#), or white space ( ). After installation, DSX Local runs as root.

Alternatively, you can use an SSH key installation by using root's private SSH key that is copied to each node (for example, by using the ssh-copy-id command).

If you are using a power broker to manage access, run pbrun to become root on the node that you install from, and copy this root private SSH key to all other nodes (and use it for installation by way of the wdp.conf configuration file).

Ensure that each node has an extra disk partition for the installer files. Each storage node requires an additional disk partition. All of these disk partitions must be mounted to paths (the installer asks for these paths) and formatted with XFS with ftype functionality enabled. Example command to format each partition: mkfs.xfs -f -n ftype=1 -i size=512 -n size=8192 /dev/sdb1

Recommendation:

To improve performance, add the noatime flag to the mount options in /etc/fstab for both the installer and data storage partitions. Example:


/dev/sdb1       /installer              xfs     defaults,noatime    1 2

As a result, inode access times will not be updated on the filesystems.

Minimum server specifications for a four-node configuration on Red Hat Enterprise Linux (x86, POWER and z)

Node Type | Number of Servers (BM/VM) | CPU | RAM | Disk partitions | IP addresses
Control Plane/Storage/Compute | 3 | 24 cores | 64 GB | Minimum 300 GB with XFS format for the installer files partition + minimum 500 GB with XFS format for the data storage partition + minimum 200 GB of extra raw disk space for docker. |
Deployment | 2 | 16 cores | 64 GB | Minimum 300 GB with XFS format for the installer files partition + minimum 200 GB of extra raw disk space for docker. If you add additional cores, a total of 48-50 cores distributed across multiple nodes is recommended. |

Other requirements:

  • The installation requires at least 10 GB on the root partition.
  • If you plan to place /var on its own partition, reserve at least 10 GB for the partition.
  • SPSS Modeler add-on requirement: If you plan to install the SPSS Modeler add-on, add 0.5 CPU and 8 GB of memory for each stream you plan to create.
  • All servers must be synchronized in time (ideally through NTP or Chrony). Ensure that the system time of all the nodes in the cluster is synchronized to within one second. On each node, if NTP or Chrony is installed but the node is not synchronized to within one second, the installer will not allow you to proceed. If neither NTP nor Chrony is installed, the installer will warn you. If an NTP or Chrony service is running but not used to synchronize time, stop and disable that service on all nodes before running the installer.
  • SSH between nodes should be enabled.
  • YUM should not be already running.
  • Prerequisites for installing DSX Local with NVIDIA GPU support

Control Plane, Storage, and Compute are all installed on a single node with at least two extra nodes for high availability. Deployment is installed on a single node with an optional extra node for high availability. The Deployment nodes are the production versions of the Compute nodes, and have identical requirements. The Kubernetes cluster requires either a load balancer or one unused IP address as the HA proxy IP address. The IP address must be static, portable, and in the same subnet as the cluster. The data storage path is used for file storage and GlusterFS storage management. You can add extra nodes for scaling Compute resources, for example, if you expect to run resource-intensive computations or have many processes that run simultaneously.

Disk requirements

Ensure that the storage has good disk I/O performance.

Disk latency test: dd if=/dev/zero of=/<path-to-install-path-directory>/testfile bs=512 count=1000 oflag=dsync. The result must be comparable to or better than: 512000 bytes (512 kB) copied, 1.7917 s, 286 kB/s

Disk throughput test: dd if=/dev/zero of=/<path-to-install-directory>/testfile bs=1G count=1 oflag=dsync. The result must be comparable to or better than: 1073741824 bytes (1.1 GB) copied, 5.14444 s, 209 MB/s

To ensure that the data stored within IBM DSX is stored securely, you can encrypt your storage partition. If you use Linux Unified Key Setup-on-disk-format (LUKS) for this purpose, you must enable LUKS and format the partition with XFS before you install DSX Local.
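
A minimal LUKS sketch, assuming /dev/sdb1 is the data storage partition; "dsxdata" is a hypothetical mapping name and /data is an example mount path:

  cryptsetup luksFormat /dev/sdb1         # initialize LUKS encryption (destroys any existing data on the partition)
  cryptsetup luksOpen /dev/sdb1 dsxdata   # open the encrypted device at /dev/mapper/dsxdata
  mkfs.xfs -f -n ftype=1 -i size=512 -n size=8192 /dev/mapper/dsxdata
  mount /dev/mapper/dsxdata /data         # mount at the data storage path before installing DSX Local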

Network requirements

  • Each node needs a working DNS and a gateway specified in its network configuration, regardless of whether the gateway allows outbound network access.
  • The cluster requires a network that it can use for the overlay network within Kubernetes. The network cannot conflict with other networks that might establish a connection to the cluster. DSX Local configures 9.242.0.0/16 as the default network. Use this default only if it does not conflict with other networks that this cluster is connected to.
  • In the /etc/sysctl.conf file, you must set net.ipv4.ip_forward = 1, and load the setting by running sysctl -p (see the sketch after this list).
  • From the first master node, where the installer will run, verify that you can actually SSH to every other node by either user ID or SSH key.
  • Verify the DNS configuration on every node and ensure that the DNS server you configured actually accepts lookup requests. Enter a dig or nslookup command against a name on your network, and ensure that your DNS correctly responds with an IP address.
  • Ensure the IP addresses being used for the installation match the host name for each node (hostnames and IP addresses need to be unique across the nodes).
  • Verify the machine-id is unique on each node by entering the command: cat /etc/machine-id. If they are not unique, you can generate new IDs with the following command: uuidgen > /etc/machine-id.
  • Ensure that ICMP is enabled between the nodes and that you can ping each node from every other node.
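
A short sketch of these pre-checks, to be run on every node (the hostname is a placeholder):

  echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf   # enable IP forwarding persistently
  sysctl -p                                            # load the setting now
  dig +short master1.example.com                       # DNS check: should return the node's IP address
  cat /etc/machine-id                                  # must be unique on every node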

Proxy IP or load balancer configuration

To provide High Availability for DSX Local, you must use either a proxy IP address or a load balancer.

Option 1: Proxy IP address
Requirements:
  • All of the master nodes must be on the same subnet (minimum 1 Gb network between nodes). The compute and deployment nodes can be on any accessible subnet.
  • A static, unused IP address on the network is required, on the same VLAN and subnet as the master nodes. For high availability purposes, this address is used as a failover IP address from which DSX Local is accessed. The master nodes share this IP address so that if one of the master nodes fails, another node takes over the IP address and provides fault tolerance. The network administrator must reserve this IP address before you can install DSX Local.
Option 2: Load balancer
For high availability purposes, you can use an external load balancer that is configured on your network. The load balancer does not require the nodes to be on the same subnet and VLAN. The load balancer can be specified only for a DSX Local installation that uses a wdp.conf file.

You can use one or two load balancers for this configuration:

External Traffic Routing
This load balancer must be configured to forward traffic for ports 6443 and 443 to all three control nodes (or master nodes) with persistent-IP round robin for the cluster to function properly. After DSX Local is installed, you can access it by connecting to the load balancer on port 443 over HTTPS.
Internal Traffic Routing
This load balancer must be configured before installing DSX Local to forward internal traffic for port 6443 to all three control nodes (or master nodes). All nodes must have access to the Kubernetes API server for the cluster to communicate internally.
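
The following is a minimal sketch of both routings, assuming HAProxy is used as the load balancer; the master node IP addresses are placeholders:

  frontend dsx_https
      bind *:443
      mode tcp
      default_backend masters_https

  frontend k8s_api
      bind *:6443
      mode tcp
      default_backend masters_api

  backend masters_https
      mode tcp
      balance roundrobin
      stick-table type ip size 200k expire 30m
      stick on src                          # persistent source-IP round robin
      server master1 10.0.0.11:443 check
      server master2 10.0.0.12:443 check
      server master3 10.0.0.13:443 check

  backend masters_api
      mode tcp
      balance roundrobin
      server master1 10.0.0.11:6443 check
      server master2 10.0.0.12:6443 check
      server master3 10.0.0.13:6443 check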

Firewall restrictions

  • Kubernetes uses iptables for cluster communication. Because Kubernetes cannot run alongside a server firewall that also manipulates iptables on each node, the host firewall (for example, firewalld or the iptables service) must be disabled (see the sketch after this list). If an extra firewall is needed, it is recommended that you set up the firewall around the cluster (for example, a vyatta firewall) and open port 443.
  • SELinux must be in either Enforcing or Permissive mode. Use the getenforce command to get the current SELinux mode. If the command shows "Disabled", then edit /etc/selinux/config and change the SELINUX= line to either SELINUX=permissive or SELINUX=enforcing. Then, restart the node for the change to take effect.
  • DSX Local expects to be exposed externally through one port: 443 (HTTPS), for which access must be permitted.
  • The DSX Local runtime environment components connect to data sources (for example, relational databases, HDFS, and the enterprise LDAP server and port) to support authentication; access to these data sources should be permitted.
  • Ensure that no daemon, script, process, or cron job makes any modification to /etc/hosts, IP tables, routing rules, or firewall settings (like enabling or refreshing firewalld or iptables) during or after install.
  • Ensure every node has at least one localhost entry in the /etc/hosts file corresponding to IP 127.0.0.1.
  • If your cluster uses multiple network interfaces (one with public IP addresses and one with private IP addresses), use only the private IP address in the /etc/hosts file with the short hostnames.
  • Ansible requirement: ensure the libselinux-python package is available.
  • Restriction: DSX Local does not support dnsmasq. Check with your network administrator to make sure that dnsmasq is not enabled.
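
A sketch of the firewall and SELinux pre-checks above, to be run on every node:

  systemctl stop firewalld && systemctl disable firewalld   # no host firewall alongside the Kubernetes iptables rules
  getenforce                                                # must print Enforcing or Permissive, not Disabled
  yum install -y libselinux-python                          # Ansible requirement
  grep '127.0.0.1' /etc/hosts                               # confirm that a localhost entry exists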

Certificates

DSX Local generates SSL certificates during installation. The certificates are used for inter-cluster communication and must be trusted during first-time access by users.

IBM Cloud offering requirements

See IBM Cloud documentation for details on ordering resources and performing installation tasks.

  • Set up a minimum of three virtual machines or bare metal servers, choosing the specifications needed for DSX Local. Choose SSD drives when ordering.
  • Ensure the DNS you have configured on each node is working, and can resolve names or IP addresses on the network you are on.
  • Set up a local load balancer and configure it to redirect TCP port 6443 to the three master node instances. Choose a persistent-IP, round-robin configuration. For health checks, check whether the port is open.
  • Install DSX Local using the wdp.conf file with virtual_ip_address= commented out and the new line added: load_balancer_ip_address=<IP of the network load balancer>. Use the private IPs for each of the nodes to ensure DSX Local is installed using the private network.
  • After the installation completes, create an external load balancer for HTTPS (443) and point this to the three master nodes. Do not use SSL off loading. Use this external load balancer to connect to DSX Local through HTTPS and TCP port 443.

Microsoft Azure requirements

See Microsoft Azure Documentation for details on ordering resources and performing installation tasks.

  • Order an Availability set within Azure for your dsx-local cluster. When creating the VMs, all three master nodes must be added to the availability set.
  • Order virtual machines with RHEL 7.4 or 7.3:
    • For a four-node cluster, order three VMs with a minimum of 24 CPU cores and 64 GB of RAM (SSD hard drives). Order one or two VMs for the deployment nodes with a minimum of 16 CPU cores and 64 GB of RAM (SSD hard drives). Ensure each VM is on the same VLAN and ensure you add each of the three master nodes to the availability set.
    • For a seven-node cluster, order three VMs for the three master/storage nodes with a minimum of 24 CPU cores and 64 GB of RAM (SSD hard drives). Order three VMs for the compute nodes and one or two VMs for the deployment nodes with a minimum of 16 CPU cores and 64 GB of RAM (SSD hard drives). Ensure each VM is on the same VLAN and ensure you add each of the three master nodes to the availability set.
  • Choose to have a public IP on at least the first node for you to sign in and access the VM remotely.
  • Choose to have a static IP on the private VLAN for each VM (default is DHCP). This can be done after provisioning through the portal, if needed.
  • On each VM, add an additional SSD drive of at least 500 GB to install dsx-local on (the default is one hard drive at 30 GB). After you add the additional SSD hard drive, create at least one partition on it and format the partition using XFS. Example:
    
    mkfs.xfs -f -n ftype=1 -i size=512 -n size=8192 /dev/sdb1
    
    Then mount this partition to a directory of your choosing. If the cluster is intended for production usage, create this partition as an LVM logical volume to allow for expansion of disk space, if needed, using lvextend (see the sketch after this list).
  • Update your inbound rules in the networking section to remove the default SSH rule that allows access from everyone. Add an SSH rule that allows access only from a specific IP address or IP range. Add an inbound rule for HTTPS and, similar to the SSH rule, restrict its source to a specific IP address or IP range.
  • Stop and disable firewalld on the Azure RHEL VMs.
  • Before installing DSX Local, create an internal load balancer with a static IP address. Configure this load balancer to redirect the TCP port 6443 to the three master node instances. Then install DSX Local using the wdp.conf file with virtual_ip_address= commented out and the new line added: load_balancer_ip_address=<static IP of the load balancer>. After the installation completes, create an external load balancer for HTTPS (443) and point this to the three master nodes.
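
A sketch of the suggested LVM layout for production clusters; the partition, volume group, logical volume, and mount point names are examples only (here /dev/sdc1 stands for the partition created on the added SSD drive):

  pvcreate /dev/sdc1
  vgcreate dsxvg /dev/sdc1
  lvcreate -n dsxlv -l 100%FREE dsxvg
  mkfs.xfs -f -n ftype=1 -i size=512 -n size=8192 /dev/dsxvg/dsxlv
  mkdir -p /ibm && mount /dev/dsxvg/dsxlv /ibm
  # Later, after enlarging the underlying disk: lvextend -r -l +100%FREE /dev/dsxvg/dsxlv

Using lvextend with the -r option grows the XFS file system along with the logical volume.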

More Amazon Web Services requirements

Before installing DSX Local, complete the following steps:

  1. Create an HTTPS "Application" Elastic Load Balancer that forwards traffic on port 443 to the three master nodes. This load balancer will be the front-facing URL for users, so you can choose the port it listens on and the certificate that secures the connection through AWS Certificate Manager.
  2. Create a TCP "Network" Elastic Load Balancer that listens on port 6443 and forwards to port 6443 on the three master nodes. This load balancer will be used by the cluster to communicate with the Kubernetes API server.

For version 1.2.1.0 or later: Install DSX Local using the wdp.conf file with virtual_ip_address= commented out and the new line added: load_balancer_fqdn=<FQDN of the TCP load balancer>.

For versions earlier than 1.2.1.0: Install DSX Local using the wdp.conf file with virtual_ip_address= commented out and the new line added: load_balancer_ip_address=<static IP of the TCP load balancer>.
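
A sketch of the relevant wdp.conf fragment; the FQDN and IP address are placeholders:

  # virtual_ip_address=
  load_balancer_fqdn=dsx-nlb.example.com        # version 1.2.1.0 or later
  # For versions earlier than 1.2.1.0, use instead:
  # load_balancer_ip_address=203.0.113.10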

Hadoop requirements

See Hortonworks Data Platform (HDP) or Cloudera Distribution for Hadoop (CDH).

Supported web browsers

  • Google Chrome (recommended)
  • Mozilla Firefox