Developing IBM PureData System for Hadoop applications with the Eclipse IDE

Developing applications using the Eclipse IDE for IBM PureData™ System for Hadoop requires access to the Hadoop cluster. This article describes how to set up and configure OpenVPN on the connected client, a process that provides secure access to the cluster.

Jordan Denny (dennyj@us.ibm.com), Software Engineer, IBM

Jordan Denny is a software engineer on the IBM PureData System for Hadoop product development team.



Mirav Kapadia (mirav@us.ibm.com), Advisory software engineer, IBM Corporation

Mirav Kapadia leads the software development team in IBM Lenexa for the IBM PureData System for Hadoop offering.



17 December 2013


Introduction

This article describes how to install and configure OpenVPN Community Edition for IBM PureData™ System for Hadoop. OpenVPN is an open source implementation of both VPN server and VPN client software published under the GNU General Public License. It is available for all major operating systems on the market today, including Windows® and most distributions of Linux®.

A VPN is used to extend a private network across a public network. A VPN server is installed on a machine with access to the public network and the private network. A VPN client is installed on a machine on the public network and is used to connect to the VPN server. Once connected, the machine running the VPN client can send and receive data across the public network as if it were a part of the private network. OpenVPN provides many layers of security to ensure that data sent across the public network is encrypted and that users connected to the private network cannot compromise its integrity. The idea of a virtual private network has important ramifications when discussing the network specifications of IBM PureData System for Hadoop.


Use a connected client to enable external clients to access data nodes

IBM PureData System for Hadoop consists of two master nodes — one active and one standby — and 18 data nodes. The appliance is configured with an internal network that prevents the exposure of the data nodes to the outside world. The appliance also maintains a firewall that blocks access to most of its ports. An external client is able to communicate with the appliance only by connecting to the active master node on an open port. The external client can then run a Hadoop service or package that requires only master node connectivity, such as the REST APIs.

This network model provides a high level of security, and users can be confident that data residing on each of the data nodes is protected from external access. However, because external clients cannot directly access the data nodes, many development tools and Hadoop packages cannot be used with the appliance. One noteworthy application that cannot be used is the InfoSphere® BigInsights™ plug-in for Eclipse. Rather than restricting users to only those services and packages that rely solely on master node connectivity, IBM developed the notion of a connected client in IBM PureData System for Hadoop.

A connected client is an edge node that contains two network interface cards. One interface is connected directly into the fabric switch of the appliance, and the other is connected into the user's internal network. By connecting directly into the fabric switch, the connected client is granted access to the internal network of the appliance and has direct access to the data nodes. Any development tools and Hadoop packages that require data node access can be installed and run from the connected client.


Extend the connected client with VPNs

The capabilities of the connected client model can be extended with the use of a VPN. A connected client is connected directly to the fabric switch of the appliance and has access to its internal network. This access includes the ability to connect directly to the data nodes. This means that a connected client can write data into the Hadoop Distributed File System (HDFS) by getting location information from the name node and writing that data on the specified data node. A connected client can also read data from HDFS by communicating with the name node to get the location of that data and going to the specified data node to retrieve it.

OpenVPN Community Software provides a full-featured open source SSL VPN solution. The OpenVPN server software can be installed on the connected client, and the OpenVPN client software can be installed on the user's development machine. When users want to develop applications or use services from IBM PureData System for Hadoop, they connect to the VPN server on the connected client and are instantly given direct access to the system's internal network, as shown in Figure 1. When the user's work is finished, the user can easily disconnect from the VPN server and resume sending data across the original network.

In the following figure, the bold red line between the VPN clients and the VPN server indicates the VPN connection.

Figure 1. High-level architecture of the VPN client, VPN server, and connected client
Data nodes accessed by VPN and connected client

Set up OpenVPN on a connected client

To set up OpenVPN on a connected client, you need to:

  • Install OpenVPN
  • Configure OpenVPN
  • Create a certificate authority
  • Create server keys
  • Create client keys
  • Build the OpenVPN server configuration file
  • Build the OpenVPN client configuration files
  • Enable IP packet forwarding
  • Define IP tables rules

Install OpenVPN

The first step in deploying OpenVPN is to download the software, along with any dependencies. This step varies depending on the operating system running on the connected client. Our examples use Red Hat Enterprise Linux 6.2 on the connected client. Refer to the OpenVPN website to install any prerequisites and the correct OpenVPN package. OpenVPN 2.3 and later also require easy-rsa, which must be obtained separately.

By default, OpenVPN installs itself in the /usr/share/openvpn directory. If you installed OpenVPN 2.3 or later, run the following commands as the superuser from the directory that contains the downloaded easy-rsa directory:

mv easy-rsa /usr/share/openvpn
cp -pr /usr/share/openvpn /etc/openvpn

For the remainder of this article, we will use the /etc/openvpn directory for our configuration. We will leave the /usr/share/openvpn directory as a backup.

Configure OpenVPN

After making a copy of the default OpenVPN installation directory, open the vars file located at /etc/openvpn/easy-rsa/2.0/vars and find the following line:

export KEY_CONFIG=`$EASY_RSA/whichopensslcnf $EASY_RSA`

Change it to point directly at the OpenSSL configuration file, as shown in the following line and in Figure 2:

export KEY_CONFIG=/etc/openvpn/easy-rsa/2.0/openssl-1.0.0.cnf
Figure 2. Screen capture showing the result of the commands to configure OpenVPN
Screen printout shows rules for the export command
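
The same edit can be scripted. The following is a minimal sketch, assuming the vars file contains a single line that begins with export KEY_CONFIG=. The function name set_key_config is hypothetical and is not part of easy-rsa. The original file is kept with a .bak suffix.

```shell
# Hypothetical helper (not part of easy-rsa): point KEY_CONFIG in a vars file
# at a fixed OpenSSL configuration file.
set_key_config() {
    vars_file=$1   # path to the easy-rsa vars file
    cnf_path=$2    # path to the OpenSSL configuration file

    # Replace the whole "export KEY_CONFIG=..." line. '|' is the sed delimiter
    # because the path contains '/'. Assumes cnf_path contains no '|' or '&'.
    sed -i.bak "s|^export KEY_CONFIG=.*|export KEY_CONFIG=${cnf_path}|" "$vars_file"
}
```

For example, as the superuser: set_key_config /etc/openvpn/easy-rsa/2.0/vars /etc/openvpn/easy-rsa/2.0/openssl-1.0.0.cnf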

Create a certificate authority

The next step is to create the certificate authority (CA) certificate and key. The CA signs the certificate of each machine that takes part in the VPN, which lets each side verify that the other's public key is legitimate. The CA certificate must reside on every machine that will be a part of the VPN (the server and all clients). The CA key needs to reside on only one machine (typically the server). Whenever a client attempts to connect, the server uses the CA certificate to check that the client's certificate was signed with the CA key.

It is critically important to never copy the CA key across an unsecured network. In most scenarios, the CA key is generated on the machine on which it will permanently reside and is never copied. The CA key is the most critical aspect of any VPN because it is responsible for authenticating every client connection. If the CA key is compromised, a whole new set of certificates and keys will need to be generated.

To generate both the CA certificate and key, run the following commands:

cd /etc/openvpn/easy-rsa/2.0
chmod 755 *
source ./vars
./clean-all
./build-ca
Figure 3. Screen capture showing the result of the commands to create a CA certificate
Screen printout shows results from the create CA commands

When building the CA certificate, you must specify a set of parameters, including country name, state, city, etc. These parameters can be filled in with the appropriate information for your installation environment, or the default values can be used by simply pressing Enter on each line.

The only important parameter to take note of is the common name. Use a unique string for this parameter or other users could build their own CA certificates using the default parameters and compromise your VPN. After all of the parameters have been specified, you will return to the command line.

Create server keys

After building the CA certificate and key, the next step is to build the server certificate and key. This can be done by running the following command:

./build-key-server server
Figure 4. Screen capture showing the result of the commands to create server keys
Screen printout shows results from the build command

The default values for each parameter can be used by pressing Enter. These values do not have to match those listed in the CA. As before, the only important parameter is the common name. Make sure you use the name server for the common name, so the VPN knows that the certificate and key residing on this machine belong to the server. If you passed the argument server to the script as shown above, then server will be set as the default value, which you can use by pressing Enter.

Create client keys

We use certificate authentication for the VPN. Certificate authentication requires that each client have its own unique certificate and key. Only one user can access the VPN with a specific certificate at any time. Generate as many certificates and keys as there are machines that want to access the VPN. It is usually a good idea to generate a few extra certificates and keys in case more machines need to enter the network in the future.

OpenVPN uses two-way authentication, meaning that the server must verify the client's certificate and the client must verify the server's certificate. This is why both sides require both a certificate (the public key) and a key (the private key). Both sides also require the same CA certificate to verify their authenticity.

As with the CA and server keys, the client keys should be copied only across a secure network. Typically, each client generates its own key locally and sends only a certificate signing request to the CA, so the private key never leaves the client machine. However, for this scenario, it is sufficient to generate the client key on the server and securely copy it to each client machine.

To begin generating client certificates and keys, run the following commands:

./build-key client1
./build-key client2
./build-key client3
...
Figure 5. Screen capture showing result of the commands to build client keys
Screen printout shows results from the build-key commands

As with the CA and server, specify whatever you want for the parameters. The only important parameter is the common name. Ensure that you use distinct common names for each certificate (if you ran the above command, then the default value for the common name parameter will be client1, which you can use by pressing Enter).

For an additional layer of security, you can password-protect your client certificates. Every time you attempt to connect to the VPN, you can require the client to enter a password. If you want to use this option, you will have to remember what password you used during certificate creation.

To create password-protected client certificates, run the following commands instead:

./build-key-pass client1
./build-key-pass client2
./build-key-pass client3
...
Figure 6. Screen capture showing result of command to create password-protected client keys
Screen printout shows results from the build-key-pass commands

Now generate the Diffie-Hellman parameters used for key exchange by running the following command:

./build-dh
Figure 7. Screen capture showing the result of the command to build Diffie-Hellman parameters
Screen printout shows results from the build-dh command
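
Before moving on, it can be worth confirming that the server and client certificates were in fact signed by your CA. The following is a sketch that wraps the standard openssl verify command. The function name verify_signed_by_ca is hypothetical. It checks the command's output for OK rather than relying on the exit status, because older OpenSSL releases do not always return a failing status when verification fails.

```shell
# Hypothetical helper: succeed only if every certificate passed after the
# first argument is signed by the CA certificate given as the first argument.
verify_signed_by_ca() {
    ca_cert=$1
    shift
    for cert in "$@"; do
        # "openssl verify" prints "<file>: OK" on success.
        if ! openssl verify -CAfile "$ca_cert" "$cert" 2>/dev/null | grep -q ': OK$'; then
            printf 'NOT signed by CA: %s\n' "$cert" >&2
            return 1
        fi
    done
    return 0
}
```

For example, from the /etc/openvpn/easy-rsa/2.0 directory: verify_signed_by_ca keys/ca.crt keys/server.crt keys/client1.crt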

Build the OpenVPN server configuration file

In addition to the certificates and keys, each server and client machine needs a configuration file. This file specifies how the VPN connection is laid out. Copy the default configuration files into your OpenVPN directory. This can be done with the following commands:

cp /usr/share/doc/openvpn-2.3.2/sample-config-files/server.conf /etc/openvpn/
cp /usr/share/doc/openvpn-2.3.2/sample-config-files/client.conf /etc/openvpn/

Using your favorite editor, open the server.conf file and find the lines:

ca            ca.crt
cert          server.crt
key           server.key

dh            dh1024.pem

Replace these lines with the absolute paths:

ca            /etc/openvpn/easy-rsa/2.0/keys/ca.crt
cert          /etc/openvpn/easy-rsa/2.0/keys/server.crt
key           /etc/openvpn/easy-rsa/2.0/keys/server.key

dh            /etc/openvpn/easy-rsa/2.0/keys/dh1024.pem

By default, the internal network of IBM PureData System for Hadoop resides on the 10.168.0.0/16 subnet. So that each VPN client routes traffic for this subnet through the VPN, update the push route directive. Find the following lines in the server.conf file:

;push "route 192.168.10.0 255.255.255.0"
;push "route 192.168.20.0 255.255.255.0"

Replace these lines with:

push "route 10.168.0.0 255.255.0.0"
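
Before starting the server, it can help to confirm that every file named in server.conf actually exists, since a mistyped path is an easy error to make here. The following is a minimal sketch. The function name missing_key_files is hypothetical, and it assumes the ca, cert, key, and dh directives use absolute paths as recommended above (relative paths are resolved against the current directory).

```shell
# Hypothetical helper: print each ca/cert/key/dh path in an OpenVPN
# configuration file that does not exist on disk. Prints nothing if all exist.
missing_key_files() {
    conf=$1
    # Lines commented out with ';' or '#' have a different first field,
    # so they are skipped automatically.
    awk '$1 ~ /^(ca|cert|key|dh)$/ { print $2 }' "$conf" |
    while read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}
```

For example: missing_key_files /etc/openvpn/server.conf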

Build the OpenVPN client configuration files

Each client needs a separate configuration file for specifying how it will connect to IBM PureData System for Hadoop. For each client certificate and key generated earlier, copy a separate client configuration file by running the following commands:

cp /etc/openvpn/client.conf /etc/openvpn/client1.conf
cp /etc/openvpn/client.conf /etc/openvpn/client2.conf
cp /etc/openvpn/client.conf /etc/openvpn/client3.conf
...

For each configuration file, modify the same lines with the appropriate values. To determine the external IP address of master node 1 of the appliance, open its /etc/hosts file and look near the top. Below the 127.0.0.1 lines, you will see the external IP address and alias for the master node. Inside each clientX.conf file, find the line:

remote my-server-1 1194

Replace that line with:

remote IP-FROM-HOSTS-FILE 1194

Next, find the lines:

cert          client.crt
key           client.key

Replace those lines so that the file names match the name of the configuration file. For example, the client1.conf file will have:

cert          client1.crt
key           client1.key

The client2.conf file will have:

cert          client2.crt
key           client2.key

Continue to make this replacement for each configuration file.
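
With more than a few clients, the copy-and-edit steps above can be scripted. The following is a sketch under stated assumptions: the function name make_client_conf is hypothetical, the template is the sample client.conf copied earlier, and the remote, cert, and key directives each start at the beginning of a line. Commented-out lines, such as a second ;remote entry, are left untouched.

```shell
# Hypothetical helper: derive <name>.conf from a template client.conf by
# rewriting the remote, cert, and key directives.
make_client_conf() {
    template=$1    # e.g. /etc/openvpn/client.conf
    name=$2        # e.g. client1
    server_ip=$3   # the address taken from the hosts file
    out_dir=$4     # e.g. /etc/openvpn

    sed -e "s|^remote .*|remote ${server_ip} 1194|" \
        -e "s|^cert .*|cert ${name}.crt|" \
        -e "s|^key .*|key ${name}.key|" \
        "$template" > "${out_dir}/${name}.conf"
}
```

For example: for n in client1 client2 client3; do make_client_conf /etc/openvpn/client.conf "$n" IP-FROM-HOSTS-FILE /etc/openvpn; done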

Now restart the OpenVPN server with the command:

service openvpn restart

Enable IP packet forwarding

For the VPN server running on the connected client to redirect traffic into the internal network of IBM PureData System for Hadoop, we have to enable packet forwarding. Using your favorite editor, open the /etc/sysctl.conf file and change the net.ipv4.ip_forward = 0 line to net.ipv4.ip_forward = 1, as shown below.

Figure 8. Screen capture showing results of the command to enable IP forwarding
Image shows enabling IP forwarding

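Run this command to load the new setting from /etc/sysctl.conf so that it takes effect immediately. Because the value is stored in that file, it also persists across reboots: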

sysctl -p
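
If you prefer to script the change to /etc/sysctl.conf instead of editing it by hand, the following sketch shows one way. The function name enable_ip_forward is hypothetical. It rewrites an existing net.ipv4.ip_forward line, or appends one if the file has none, and keeps a .bak copy when it rewrites.

```shell
# Hypothetical helper: make sure a sysctl configuration file sets
# net.ipv4.ip_forward to 1.
enable_ip_forward() {
    conf=$1
    if grep -Eq '^[[:space:]]*net\.ipv4\.ip_forward[[:space:]]*=' "$conf"; then
        # A setting already exists: rewrite it in place.
        sed -i.bak -E 's|^[[:space:]]*net\.ipv4\.ip_forward[[:space:]]*=.*|net.ipv4.ip_forward = 1|' "$conf"
    else
        # No setting yet: append one.
        printf 'net.ipv4.ip_forward = 1\n' >> "$conf"
    fi
}
```

For example, as the superuser: enable_ip_forward /etc/sysctl.conf && sysctl -p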

Define IP tables rules

The next step is to create IP tables rules that allow inbound traffic on UDP port 1194 of the connected client and redirect traffic coming from the 10.8.0.0/24 subnet (the address range that the sample server.conf assigns to VPN clients) into the internal network of IBM PureData System for Hadoop. Run the following commands to set up these rules:

iptables -A INPUT -p udp -m udp --dport 1194 -j ACCEPT
iptables -t nat -A POSTROUTING -s 10.8.0.0/24 -o bond0 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 10.8.0.0/24 -j SNAT --to-source 10.168.5.145

The --to-source address in the last rule (10.168.5.145 in this example) will vary depending upon the network configuration of your connected client. To obtain the correct address, run ifconfig on your connected client and find the inet addr of your bonded interface. It's a good idea to set this as a static IP address so you don't have to update your IP tables rules every time the connected client receives a new IP from the IBM PureData System for Hadoop rack switch.

Figure 9. Screen capture showing ifconfig address
Screen printout shows results from the ifconfig command
Figure 10. Screen capture showing IP tables rules
Screen printout shows IP tables rules
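
Because the three rules differ between installations only in the VPN subnet, the interface name, and the source address, it can be convenient to generate them from those values and review them before running anything. The following sketch only prints the commands and does not execute them. The function name print_vpn_iptables_rules is hypothetical.

```shell
# Hypothetical dry-run helper: print (do not execute) the three iptables
# commands for a given VPN subnet, outbound interface, and source address.
print_vpn_iptables_rules() {
    vpn_subnet=$1   # e.g. 10.8.0.0/24
    iface=$2        # e.g. bond0
    src_ip=$3       # inet addr of the bonded interface on the connected client

    printf 'iptables -A INPUT -p udp -m udp --dport 1194 -j ACCEPT\n'
    printf 'iptables -t nat -A POSTROUTING -s %s -o %s -j MASQUERADE\n' "$vpn_subnet" "$iface"
    printf 'iptables -t nat -A POSTROUTING -s %s -j SNAT --to-source %s\n' "$vpn_subnet" "$src_ip"
}
```

For example, print_vpn_iptables_rules 10.8.0.0/24 bond0 10.168.5.145 reproduces the commands shown above. After checking the output, run the printed commands as the superuser.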

Save the newly created IP tables rules with the following command:

service iptables save

Connect to the OpenVPN server

Now that the OpenVPN server is installed, configured, and running on the connected client, OpenVPN needs to be set up on every external machine that needs direct access to the IBM PureData System for Hadoop data nodes.

Follow the same steps as before to download and install OpenVPN on each of your development machines. After OpenVPN is installed, securely copy the CA certificate, a client certificate, and the matching client key from the server onto each development machine, along with the corresponding clientX.conf file. The certificates and keys reside in the /etc/openvpn/easy-rsa/2.0/keys directory on the server.

Each development machine has a ca.crt file and a matching clientX.crt and clientX.key file. Never give the same clientX.crt and clientX.key file to more than one machine.
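
To avoid handing the wrong key to a machine, it can help to build the list of files for each client from its name. The following is a minimal sketch. The function name client_bundle_files is hypothetical, and KEYS_DIR is assumed to be the keys directory used throughout this article.

```shell
# Directory where easy-rsa wrote the certificates and keys (from this article).
KEYS_DIR=/etc/openvpn/easy-rsa/2.0/keys

# Hypothetical helper: print the three files a given client needs, one per line.
client_bundle_files() {
    name=$1   # e.g. client1
    printf '%s\n' "$KEYS_DIR/ca.crt" "$KEYS_DIR/$name.crt" "$KEYS_DIR/$name.key"
}
```

The output can then be passed to a secure copy tool such as scp, for example: scp $(client_bundle_files client1) user@devmachine:/etc/openvpn/ (the user and host names here are placeholders).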

The last steps will vary depending upon the operating system running on your development machine.

For Linux, copy all three files and the clientX.conf file into the /etc/openvpn/ directory on your development machine and open a terminal window with root access. To make the VPN connection, change to that directory and run the following command, substituting the name of your configuration file:

openvpn clientX.conf

If you created password-protected client certificates, you will need to give the password at this time. After giving the correct password, the VPN connection will be made. The connection can be closed at any time by pressing CTRL+C at this terminal window.

On Windows®, copy all three files into the C:\Program Files\OpenVPN\config directory. Rename the clientX.conf file to clientX.ovpn. Go to the C:\Program Files\OpenVPN\bin directory and run the OpenVPN GUI as an administrator. (If you do not run it as administrator, the VPN connection will fail.)

Double-click the OpenVPN icon in your system tray. If you created password-protected client certificates, you will need to give the password at this time. After giving the correct password, the VPN connection will be made. The connection can be closed at any time by right-clicking the icon and clicking Disconnect.

Regardless of your operating system, after you're connected to the VPN, you will have access to the IBM PureData System for Hadoop internal network through the connected client. You can test your connection by looking at the /etc/hosts file on IBM PureData System for Hadoop and finding the IP addresses of the data nodes (search for rack1-data). Try pinging the IP address of one of those data nodes; you should get a response back.
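
If you have a local copy of the appliance's /etc/hosts file, the data node addresses can be extracted instead of being picked out by eye. The following is a sketch. The function name host_ips_matching is hypothetical, and it assumes the usual hosts file layout of an IP address followed by one or more names.

```shell
# Hypothetical helper: print the IP address (first field) of every
# non-comment hosts-file line that contains the given text.
host_ips_matching() {
    pattern=$1      # e.g. rack1-data
    hosts_file=$2   # a copy of the appliance's /etc/hosts file
    awk -v pat="$pattern" '$1 !~ /^#/ && index($0, pat) { print $1 }' "$hosts_file"
}
```

For example, on Linux, with the copy saved under the placeholder name appliance-hosts: for ip in $(host_ips_matching rack1-data appliance-hosts); do ping -c 1 "$ip"; done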

A final step is to add an alias for rack1-master to the hosts file on your external machine. With the appliance's /etc/hosts file still open, find the line that contains rack1-master. Add a matching entry to the hosts file on your external machine, and then try pinging rack1-master from your external machine. You should get a response back.
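
Adding the entry can also be scripted so that running it twice does not create a duplicate. The following is a minimal sketch. The function name add_host_alias is hypothetical, and the IP address must be the one you found in the appliance's hosts file.

```shell
# Hypothetical helper: append "<ip> <name>" to a hosts file unless some
# non-comment line already lists that name.
add_host_alias() {
    ip=$1
    name=$2
    hosts_file=$3

    # awk exits 0 if the name already appears as a host name field.
    if awk -v n="$name" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file"; then
        return 0
    fi
    printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
}
```

For example, as the superuser on a Linux development machine: add_host_alias IP-FROM-HOSTS-FILE rack1-master /etc/hosts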

Now that your external machine has access to the IBM PureData System for Hadoop internal network, you can develop applications using the InfoSphere BigInsights plug-ins for the Eclipse Integrated Development Environment, submit Hadoop jobs using the External Job Submission API, and much more.

Simply close the OpenVPN client connection when you are finished working with IBM PureData System for Hadoop.


Conclusion

This article described a secure method for accessing HDFS data on IBM PureData System for Hadoop from the IBM InfoSphere BigInsights tools for Eclipse: an OpenVPN server running on a connected client.
