Application files deployment

With the MapReduce framework in IBM® Spectrum Symphony, you can deploy application files to the hosts in the cluster that run tasks requiring them. You can use different deployment mechanisms to transfer application user packages (containing user class files, JAR files, or other files required for a MapReduce job) or application files such as patches, libraries, or data.

Types of deployment

The MapReduce framework in IBM Spectrum Symphony supports the following mechanisms to deploy application files to hosts requiring those files in the cluster:

SOAM common data (default)

SOAM common data is the default deployment mechanism for MapReduce applications and is useful for passing data from a client application to a service. The service loads the data when the job (session) is created.

Common data transfers the application JAR and other application files containing the logic for the MapReduce job to the compute host only once per job, not on every task.

Repository server

The repository server in IBM Spectrum Symphony is a system service, called rs, that stores application packages associated with a resource consumer or a group of consumers. There is at least one instance of the repository service per cluster, and it runs on the primary host or other management hosts. You can specify additional resource requirements to control which hosts run repository server instances.

The repository service is useful for deploying an application user package: the package is transferred to compute hosts in the cluster only the first time a job requires the latest package. Once the package has been transferred to a compute host, all subsequent jobs that require the same package use the copy already downloaded on that host.

With the repository service, multiple jobs that use the same package on one host download the package only once. If many jobs reuse the same libraries or other files with significant data sizes, we recommend that you use the repository service for better performance. The repository service also separates application deployment from job submission, allowing you to set permissions differently for different users (consumer administrator versus application user).

Deploying application packages through the repository service involves creating an application user package and adding it to the repository with the soamdeploy command. The MapReduce application can then use the files from the package directly, because they are automatically downloaded on the service side, unpacked, and included in the service class path. To have the system look for the main application JAR in the package on the service side, rather than transfer the JAR from the client side, specify the optional pmr.userjar.in.package property during job submission.
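
For example, a minimal end-to-end sketch of this flow (the package name MyLibs, the JAR myapp.jar, and the main class MyMainClass are placeholders; the full procedures follow later in this topic):

soamdeploy add MyLibs -p MyLibs.tar.gz -c /MapReduceConsumer/MapReduceversion
mrsh jar myapp.jar MyMainClass -Dpmr.userjar.in.package=true -Dpmr.job.package=MyLibs input output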

Note: The repository service provides advanced functionality that does not apply to the MapReduce workload.

Distributed file system

The distributed file system mechanism provides a service for copying files specified as file cache from a shared file system, or from a distributed file system such as HDFS, to the compute hosts in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to a compute host only once per job.

Depending on the values you set for the optional pmr.ship.local.files.mode parameter during job submission, the MapReduce framework uses the distributed or shared file system to copy the files to local file systems on the service host.

Deploy application files via default mechanism

Deploying application files via SOAM common data is the default deployment mechanism. Submitting a MapReduce job without specifying any additional parameters deploys the application JAR file and other application files to the compute host through common data.
  1. Submit a MapReduce job without any additional parameters. For example:
    mrsh jar $PMR_HOME/version/os_type/samples/hadoop-0.20.2-examples.jar wordcount hdfs://host_name:9000/input hdfs://host_name:9000/output

    where host_name is the name of the host on which the HDFS NameNode process started.

Deploy application packages via repository service

The package deployment process has two phases: First, packages are copied to the central repository on the repository server, the host on which the rs service is running. Then, when workload comes in, the package is downloaded to a temporary location on the compute host, uncompressed, and then copied to the $SOAM_HOME/deploy/ directory.

The package removal process also has two phases: When a request to remove a package is made, the package is removed from the central repository. Then, when a new application is deployed and existing packages on the compute hosts are no longer needed, those packages are removed from the compute hosts. For existing applications, when a package is updated, the older version of the package is removed; then, when workload comes in, the latest version of the package is downloaded.

Package deployment process

  1. You deploy the service package using the soamdeploy add command.
    The package is copied to the repository server host.
  2. As workload comes in, IBM Spectrum Symphony checks whether the latest package is already on the compute host.
    If the package is not already on the compute host, the service instance manager (SIM) downloads the package from the repository server to the compute host and decompresses it, ready for use.
    Note: If files or directories transferred by common data have the same names as files or directories in the repository package, the service instance uses the files and directories transferred by common data, rather than those downloaded from the repository. Whenever you add a package with an identical name, the repository server stores only the latest package.

Package removal process

  1. You request to remove the package using the soamdeploy remove command, or you update an existing package using the soamdeploy add command.
    Note: You cannot update a package when jobs running in the MapReduce cluster are using the package. If you update a package when running jobs are using that package:
    • Slots on hosts already allocated to the job continue to use the old package to run tasks.
    • Slots on hosts allocated to the job after the package is updated will use the new package to run tasks.

      To overcome these issues, do not update a package when jobs are running. Instead, add a new package, as shown in the example after this procedure.

    The package is removed from the repository server host.
  2. Whenever a new package is deployed on to the host, the older version of the package is deleted.
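
For example, rather than updating SamplePackage while jobs are running, deploy the new version under a new name and reference that name at submission time (a sketch; the names are placeholders):

soamdeploy add SamplePackage_v2 -p SamplePackage_v2.tar.gz -c /MapReduceConsumer/MapReduceversion
mrsh jar hadoop-examples-1.1.1.jar wordcount -Dpmr.job.package=SamplePackage_v2 input output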

Deploy an application package to the repository service

Before you deploy an application user package for use with MapReduce jobs, you must create the package. The following package formats are supported:
  • .zip
  • .tar
  • .tar.gz
  1. Create a package for use with MapReduce jobs:
    1. Navigate to any directory and gather all the required files in that directory.
    2. Create an application user package by compressing all the files in that directory into, for example, a tar file:
      tar -cvf SamplePackage.tar SamplePackage
      gzip SamplePackage.tar
      Note: Ensure that all JAR files are at the beginning of the package. For the contents of a repository package to be added to the class path of your MapReduce logic, pack those JARs, folders, and files at the beginning of the repository service package, as shown in the following example.
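      For example, a sketch of creating the package with the JAR files listed first so that they are packed at the beginning of the archive (the file names are placeholders):
      tar -cvf SamplePackage.tar myjob.jar util.jar conf/ lookup.txt
      gzip SamplePackage.tar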

    You have now created an application user package.

  2. Deploy the package with the soamdeploy add command:
    soamdeploy add package_name -p package_file -c consumer_name [-u username -x user_password]
    where:
    package_name
    Specifies the name of the application user package.
    -p package_file
    Specifies the file name of the application user package.
    -c consumer_name
    Specifies the consumer name of a registered MapReduce application. Only applications that are registered to this consumer and consumers within the consumer tree can use the package.
    Both short consumer names (consumer name without any path information) and full consumer names (consumer name with complete path information) are supported. Note the following guidelines:
    • If the package is deployed with a short name, the consumer ID in the application profile can be specified either by short name or full name.
    • If the package is deployed with a full name, the consumer ID in the application profile can be specified by full name only.
    • If the package is deployed to a non-leaf consumer, the consumer ID in the application profile can be specified by full name only.
    • If the package is deployed to the root consumer, the consumer ID in the application profile can be specified either by full name or short name.
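     For example, the same package deployed with a full consumer name and with its short name (a sketch; the consumer names are placeholders):
     soamdeploy add SamplePackage -p SamplePackage.tar.gz -c /MapReduceConsumer/MapReduceversion
     soamdeploy add SamplePackage -p SamplePackage.tar.gz -c MapReduceversion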
    -u username
    Optionally specifies the user name of that consumer.
    -x user_password
    Optionally specifies the password of that consumer's user.
    For example:
    soamdeploy add SamplePackage -p SamplePackage.tar.gz -c /MapReduceConsumer/MapReduceversion -u Admin -x Admin
  3. View information about the deployed package with the soamdeploy view command. For example:
    soamdeploy view -c /MapReduceConsumer/MapReduceversion
    You should see output similar to the following:
    PACKAGE         APPLICATION    CREATED TIME
    SamplePackage   -              Mon May 9 10:20:50 2012 
  4. Submit one or more jobs that use the package with the mrsh utility. For example:
    mrsh jar hadoop-examples-1.1.1.jar wordcount -Dpmr.userjar.in.package=true -Dpmr.job.package=SamplePackage input output
    You can also specify additional files or archives to be copied to the compute host using the -files and -archives options:
    -archives archive1,archive2
    Copies the specified archives to the shared file system used by the job tracker (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task's working directory.
    -files file1,file2
    Copies the specified files to the shared file system used by the job tracker (usually HDFS) and makes them available to MapReduce programs in the task’s working directory.
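     As a sketch, the following submission ships an archive that is unpacked into the task's working directory (the archive name dict.tar.gz is a placeholder):
     mrsh jar hadoop-examples-1.1.1.jar wordcount -archives dict.tar.gz -Dpmr.job.package=SamplePackage input output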
    For example, if a file that you want to use in the cluster is larger than 50 MB, you can compress that file and deploy it as a package to the repository service; then, submit a job specifying that package along with additional files. In this scenario, follow these steps:
    1. Compress a file larger than 50 MB that is to be used for a MapReduce job. For example:
      tar -cvf file.tar file1
      gzip file.tar
    2. Deploy the package with the soamdeploy add command. For example:
      soamdeploy add SamplePackage -p file.tar.gz -c /MapReduceConsumer/MapReduceversion
    3. Submit a job that uses the package with additional files or archives, using the mrsh utility. For example:
      mrsh jar hadoop-examples-1.1.1.jar wordcount -files file2,file3 -Dpmr.job.package=SamplePackage input output

    Once the package is downloaded on the compute host, it is not downloaded again for later jobs that specify the same package name.

Remove an application package deployed to the repository service

When you remove a deployed package, the specified package is deleted from the repository under the specified consumer.
Note: You cannot remove a package if there are registered applications using the package. Unregister the applications with the soamunreg command before attempting to remove the package.
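For example, a sketch of unregistering an application by name before removing its package (the application name MyMapReduceApp is a placeholder; see the soamunreg command reference for the complete syntax):
soamunreg MyMapReduceApp -u Admin -x Admin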
Remove a deployed package with the soamdeploy remove command:
soamdeploy remove package_name -p package_file -c consumer_name [-u username -x user_password]
where:
package_name
Specifies the name of the application user package.
-p package_file
Specifies the file name of the application user package.
-c consumer_name
Specifies the consumer name of a registered MapReduce application. Both short consumer names (consumer name without any path information) and full consumer names (consumer name with complete path information) are supported. If the package is deployed with a short name, the soamdeploy command can access the package by short name only. Similarly, if the package is deployed with a full name, the soamdeploy command can access the package by full name only.
-u username
Optionally specifies the user name of that consumer.
-x user_password
Optionally specifies the password of that consumer's user.
For example:
soamdeploy remove SamplePackage -p SamplePackage.tar.gz -c /MapReduceConsumer/MapReduceversion -u Admin -x Admin

Deploy application files via distributed file system

Deploy distributed cache files required for a MapReduce job via HDFS or via a shared file system by specifying these parameters during job submission:
pmr.ship.local.files.mode
Only takes effect when the distributed cache files are on the local file system.
  • Use the following syntax during job submission to transfer distributed cache files that are located in a shared file system (such as NFS), so that each compute host can access the files directly:
    mrsh jar jarfile -Dpmr.ship.local.files.mode=sharedfs -files=file:///path_to_file hdfs://host_name:9000/input hdfs://host_name:9000/output
    For example:
    mrsh jar hadoop-examples-1.1.1.jar wordcount -Dpmr.ship.local.files.mode=sharedfs -files=file:///shared/test.txt hdfs://namenode1:9000/input hdfs://namenode1:9000/output

    In this case, the shared (NFS-mounted) /shared/test.txt file is not copied, but is accessed directly by the compute hosts.

  • Use the following syntax during job submission to copy distributed cache files from the local file system to the temp folder on the HDFS. The temp folder is the working directory on HDFS, used to store intermediate files, and is cleared when the job is finished:
    mrsh jar jarfile -Dpmr.ship.local.files.mode=hdfs://namenodeAddress:port -files=file:///path_to_file hdfs://namenodeAddress:port/input hdfs://namenodeAddress:port/output
    For example:
    mrsh jar hadoop-examples-1.1.1.jar wordcount -Dpmr.ship.local.files.mode=hdfs://namenode1:9000 -files=file:///home/user/test.txt hdfs://namenode1:9000/input hdfs://namenode1:9000/output
pmr.ship.hdfs.files.by.hdfs
Only takes effect when the distributed cache files are on HDFS. This parameter is set to true by default, so that when the distributed cache files come from HDFS, the service side uses the files directly from their original HDFS location.
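For example, to make the default behavior explicit when a cache file already resides on HDFS (a sketch; the host name and file path are placeholders):
mrsh jar hadoop-examples-1.1.1.jar wordcount -Dpmr.ship.hdfs.files.by.hdfs=true -files=hdfs://namenode1:9000/cache/test.txt hdfs://namenode1:9000/input hdfs://namenode1:9000/output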