Application files deployment
With the MapReduce framework in IBM® Spectrum Symphony, you can deploy application files to hosts in the cluster that run tasks requiring these application files. You can use different deployment mechanisms to transfer application user packages containing user class files, JAR files, or other files required for a MapReduce job, or application files such as patches, libraries, or data.
Types of deployment
The MapReduce framework in IBM Spectrum Symphony supports the following mechanisms to deploy application files to hosts requiring those files in the cluster:
SOAM common data (default)
SOAM common data is the default deployment mechanism for MapReduce applications and is useful for passing data from a client application to a service. The service loads the data when the job (session) is created.
Common data transfers the application JAR and other application files containing the logic for the MapReduce job to the compute host only once per job, not on every task.
Repository server
The repository server in IBM Spectrum Symphony is a system service, called rs, that stores application packages associated with a resource consumer or a group of consumers. At least one instance of the repository service runs per cluster, on the primary host or other management hosts. You can specify additional resource requirements to choose the hosts that run repository server instances.
The repository service is useful for deploying an application user package and for transferring that package to compute hosts in the cluster only at the first instance of a job requiring the latest package. Once the package has been transferred to the compute host, all subsequent jobs that require the same package use the package already downloaded on the compute host.
By using the repository service, multiple jobs that use the same package on one host download the package only once. If many jobs reuse the same libraries or other files with significant data sizes, we recommend that you use the repository service for better performance. The repository service also separates application deployment from job submission, allowing you to set permissions differently for different users (consumer administrator versus application user).
Deploying application packages via a repository service involves creating an application user package and adding the package to the repository using the soamdeploy command. The MapReduce application can then use the files from the package directly because they are automatically downloaded on the service side, unpacked, and added to the service class path. To do so, however, you must specify the optional pmr.userjar.in.package property during job submission to tell the system to look for the main application JAR in the package on the service side, rather than transfer the JAR from the client side.
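As a sketch, assuming a Symphony client host that provides the soamdeploy and mrsh commands, and treating the package name, consumer path, and example JAR as placeholders, the two-step flow looks like this:

```shell
# Illustrative only: requires a Symphony client with soamdeploy and mrsh
# on the PATH; package, consumer, and JAR names are placeholders.

# 1. Add the application user package to the repository for a consumer:
soamdeploy add SamplePackage -p SamplePackage.tar.gz -c /MapReduceConsumer/MapReduceversion

# 2. Submit the job, telling the system to take the main application JAR
#    from the package on the service side instead of shipping it from
#    the client:
mrsh jar hadoop-examples-1.1.1.jar wordcount \
    -Dpmr.userjar.in.package=true -Dpmr.job.package=SamplePackage \
    input output
```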
Distributed file system
The distributed file system mechanism provides a service for copying files specified as file cache from a shared file system or from a distributed file system such as HDFS to the compute hosts in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to a compute node only once per job.
Depending on the values you set for the optional pmr.ship.local.files.mode parameter during job submission, the MapReduce framework uses the distributed or shared file system to copy the files to local file systems on the service host.
Deploy application files via default mechanism
- Submit a MapReduce job without any additional parameters. For example:
mrsh jar $PMR_HOME/version/os_type/samples/hadoop-0.20.2-examples.jar wordcount hdfs://host_name:9000/input hdfs://host_name:9000/output
where host_name is the name of the host on which the HDFS NameNode process runs.
Deploy application packages via repository service
The package removal process has two phases: when a request to remove a package is made, the package is removed from the central repository. Then, when a new application is deployed and existing packages on the compute hosts are no longer needed, those packages are removed from the compute hosts. For existing applications, when a package is updated, the older version of the package is removed; then, when workload comes in, the latest version of the package is downloaded.
Package deployment process
- You deploy the service package using the soamdeploy add command. The package is copied to the repository server host.
- As workload comes in, IBM Spectrum Symphony checks whether the latest package is already on the compute host. If the package is not already on the compute host, the SIM requests to download the package from the repository server to the compute host and decompresses it, ready to be used.
Note: If files or directories in the package have the same names as user class files or JAR files transferred by common data, the service instance uses the files and directories transferred by common data rather than those downloaded from the repository. Whenever you add a package with an identical name, the repository server stores only the latest package.
Package removal process
- You request to remove the package using the soamdeploy remove command, or you update an existing package using the soamdeploy add command. The package is removed from the repository server host.
Note: You cannot update a package while jobs running in the MapReduce cluster are using it. If you update a package while running jobs are using it:
  - Slots on hosts already allocated to the job continue to use the old package to run tasks.
  - Slots on hosts allocated to the job after the package is updated use the new package to run tasks.
To avoid these issues, do not update a package while jobs are running. Instead, add a new package.
- Whenever a new package is deployed on to a host, the older version of the package is deleted.
Deploy an application package to the repository service
An application user package can be in one of the following formats:
- .zip
- .tar
- .tar.gz
- Create a package for use with MapReduce jobs:
- Navigate to any directory and gather all the required files in that directory.
- Create an application user package by compressing all the files in that directory into, for example, a tar file:
tar -cvf SamplePackage.tar SamplePackage
gzip SamplePackage.tar
Note: Ensure that all JAR files are at the beginning of the package. To add the content of any repository package to the class path of your MapReduce logic code, pack those JARs, folders, and files into the beginning of the repository service package.
You have now created an application user package.
- Deploy the package with the soamdeploy add command:
soamdeploy add package_name -p package_file -c consumer_name [-u username -x user_password]
where:
- package_name
- Specifies the name of the application user package.
- -p package_file
- Specifies the file name of the application user package.
- -c consumer_name
- Specifies the consumer name of a registered MapReduce application. Only applications that are
registered to this consumer and consumers within the consumer tree can use the package. Both short consumer names (consumer name without any path information) and full consumer names (consumer name with complete path information) are supported. Note the following guidelines:
- If the package is deployed with a short name, the consumer ID in the application profile can be specified either by short name or full name.
- If the package is deployed with a full name, the consumer ID in the application profile can be specified by full name only.
- If the package is deployed to a non-leaf consumer, the consumer ID in the application profile can be specified by full name only.
- If the package is deployed to the root consumer, the consumer ID in the application profile can be specified either by full name or short name.
- -u username
- Optionally specifies the user name of that consumer.
- -x user_password
- Optionally specifies the password of that consumer's user.
For example:
soamdeploy add SamplePackage -p SamplePackage.tar.gz -c /MapReduceConsumer/MapReduceversion -u Admin -x Admin
- View information about the deployed package with the soamdeploy view command. For example:
soamdeploy view -c /MapReduceConsumer/MapReduceversion
You should see output similar to the following:
PACKAGE        APPLICATION    CREATED TIME
SamplePackage  -              Mon May 9 10:20:50 2012
- Submit one or more jobs that use the package with the mrsh utility. For example:
mrsh jar hadoop-examples-1.1.1.jar wordcount -Dpmr.userjar.in.package=true -Dpmr.job.package=SamplePackage input output
You can also specify more files or archives to be copied to the compute host using the -files and -archives options:
- -archives archive1,archive2
- Copies the specified archives to the shared file system used by the job tracker (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task's working directory.
- -files file1,file2
- Copies the specified files to the shared file system used by the job tracker (usually HDFS) and makes them available to MapReduce programs in the task’s working directory.
For example, if a file that you want to use in the cluster is larger than 50 MB, you can compress that file and deploy it as a package to the repository service; then, submit a job specifying that package along with more files. In this scenario, follow these steps:
- Compress the file larger than 50 MB that is to be used for the MapReduce job. For example:
tar -cvf file.tar file1
gzip file.tar
- Deploy the package with the soamdeploy add command. For example:
soamdeploy add SamplePackage -p file.tar.gz -c /MapReduceConsumer/MapReduceversion
- Submit a job that uses the package with additional files or archives using the mrsh utility. For example:
mrsh jar hadoop-examples-1.1.1.jar wordcount -files file2,file3 -Dpmr.job.package=SamplePackage input output
Once the package is downloaded on the compute host, it is not downloaded again for later jobs that specify the same package name.
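The packaging step in the walkthrough above, including the requirement that JAR files come at the beginning of the archive, can be sketched as follows. The directory, file, and package names are placeholders standing in for real application files:

```shell
# Sketch of building an application user package whose JAR files appear at
# the beginning of the archive, per the packaging note above. All file and
# package names here are placeholders.
set -e
mkdir -p SamplePackage
echo 'dummy' > SamplePackage/app.jar      # stand-in for a real JAR file
echo 'data'  > SamplePackage/lookup.dat   # stand-in for other job files

# List the JARs explicitly first so they land at the start of the archive:
tar -cvf SamplePackage.tar SamplePackage/app.jar SamplePackage/lookup.dat
gzip -f SamplePackage.tar

# Confirm that the first entry in the archive is the JAR:
tar -tzf SamplePackage.tar.gz | head -n 1   # SamplePackage/app.jar
```

Listing the JAR paths explicitly on the tar command line (rather than archiving the whole directory at once) is one way to control member ordering so that the JARs are first.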
Remove an application package deployed to the repository service
Remove a package with the soamdeploy remove command:
soamdeploy remove package_name -p package_file -c consumer_name [-u username -x user_password]
- package_name
- Specifies the name of the application user package.
- -p package_file
- Specifies the file name of the application user package.
- -c consumer_name
- Specifies the consumer name of a registered MapReduce application. Both short consumer names (consumer name without any path information) and full consumer names (consumer name with complete path information) are supported. If the package is deployed with a short name, the soamdeploy command can access the package by short name only. Similarly, if the package is deployed with a full name, the soamdeploy command can access the package by full name only.
- -u username
- Optionally specifies the user name of that consumer.
- -x user_password
- Optionally specifies the password of that consumer's user.
For example:
soamdeploy remove SamplePackage -p SamplePackage.tar.gz -c /MapReduceConsumer/MapReduceversion -u Admin -x Admin
Deploy application files via distributed file system
- pmr.ship.local.files.mode
- Only takes effect when the distributed cache files are on the local file system.
- Use the following syntax during job submission to transfer distributed cache files that are
located in a shared file system (such as NFS), so that each compute host can access the files
directly:
mrsh jar jarfile -Dpmr.ship.local.files.mode=sharedfs -files=file:///path_to_file hdfs://host_name:9000/input hdfs://host_name:9000/output
For example:
mrsh jar hadoop-examples-1.1.1.jar wordcount -Dpmr.ship.local.files.mode=sharedfs -files=file:///shared/test.txt hdfs://namenode1:9000/input hdfs://namenode1:9000/output
In this case, the shared (NFS-mounted) /shared/test.txt file is not copied, but is accessed directly by the compute hosts.
- Use the following syntax during job submission to copy distributed cache files from the local
file system to the temp folder on the HDFS. The temp folder is the working directory on HDFS, used
to store intermediate files, and is cleared when the job is
finished:
mrsh jar jarfile -Dpmr.ship.local.files.mode=hdfs://namenodeAddress:port -files=file:///path_to_file hdfs://namenodeAddress:port/input hdfs://namenodeAddress:port/output
For example:
mrsh jar hadoop-examples-1.1.1.jar wordcount -Dpmr.ship.local.files.mode=hdfs://namenode1:9000 -files=file:///home/user/test.txt hdfs://namenode1:9000/input hdfs://namenode1:9000/output
- pmr.ship.hdfs.files.by.hdfs
- Only takes effect when the distributed cache files are on HDFS. This parameter is set to true by default, so that when the distributed cache files come from HDFS, the system uses the files on the service side directly from the original HDFS location.
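As an illustration, the following submission leaves the parameter at its default (stated explicitly here for clarity), so the HDFS-hosted cache file is used directly from its original location rather than being shipped to compute hosts. The JAR, host, and path names are placeholders:

```shell
# Illustrative submission; requires a Symphony client with mrsh on the PATH.
# With pmr.ship.hdfs.files.by.hdfs=true (the default), the file specified
# with -files is used on the service side directly from its original HDFS
# location instead of being copied.
mrsh jar hadoop-examples-1.1.1.jar wordcount \
    -Dpmr.ship.hdfs.files.by.hdfs=true \
    -files hdfs://namenode1:9000/shared/test.txt \
    hdfs://namenode1:9000/input hdfs://namenode1:9000/output
```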