Hadoop distcp support

The hadoop distcp command is used for data migration from HDFS to the IBM Spectrum Scale™ file system and between two IBM Spectrum Scale file systems.

distcp-based data migration between HDFS and IBM Spectrum Scale

No additional configuration changes are required. The hadoop distcp command is supported in HDFS transparency 2.7.0-2 (gpfs.hdfs-protocol-2.7.0-2) and later.

hadoop distcp hdfs://nn1_host:8020/source/dir hdfs://nn2_host:8020/target/dir

Known issues and workarounds

Issue 1: Permission is denied when the hadoop distcp command is run with the root credentials.

The super user root in Linux is not the super user for Hadoop. If you do not add the super user account to gpfs.supergroup, the system displays the following error message:
org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE,
inode="/user/root/.staging":hdfs:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)

Workaround

Add the super user account to gpfs.supergroup in gpfs-site.xml to configure root as the super user, or run the hadoop distcp command with the super user credentials.
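The second option in the workaround can be sketched as follows. This is a minimal example that assumes hdfs is the Hadoop super user on the cluster (the inode in the error message is owned by hdfs:hdfs) and that sudo is available. The host names and paths are the placeholders from the earlier example, and the variable names are illustrative only.

```shell
# Assumption: "hdfs" is the Hadoop super user on this cluster.
# Substitute the super user account that applies to your environment.
SUPER_USER="hdfs"
SRC="hdfs://nn1_host:8020/source/dir"
DST="hdfs://nn2_host:8020/target/dir"

# Build the command first so that it can be reviewed before it is run.
DISTCP_CMD="sudo -u ${SUPER_USER} hadoop distcp ${SRC} ${DST}"
echo "${DISTCP_CMD}"

# Uncomment the next line to run the copy as the super user:
# ${DISTCP_CMD}
```

Printing the command before running it is a deliberate choice here: it lets you confirm the account and both URIs before any data is copied.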

Issue 2: Access time exception while copying files from IBM Spectrum Scale to HDFS with the -p option

[hdfs@c8f2n03 conf]$ hadoop distcp -overwrite -p hdfs://c16f1n03.gpfs.net:8020/testc16f1n03/
hdfs://c8f2n03.gpfs.net:8020/testc8f2n03

Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Access time for HDFS is not configured. Set the dfs.namenode.accesstime.precision configuration parameter.
at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:101)

Workaround

Change the dfs.namenode.accesstime.precision value from 0 to a value such as 3600000 (1 hour) in hdfs-site.xml for the HDFS cluster.
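Before running distcp with the -p option, the configured value can be checked on a node of the target HDFS cluster. The following is a sketch only: atime_enabled is a hypothetical helper name, and hdfs getconf reports the value from the configuration files on the node where it runs, which is assumed here to match what the NameNode uses.

```shell
# Hypothetical helper: succeeds when the precision value is a positive
# integer (access time enabled) and fails for 0, empty, or non-numeric input.
atime_enabled() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric value
  esac
  [ "$1" -gt 0 ]
}

# Read the configured value. The check is guarded so that the sketch does
# nothing on a machine where the hdfs command is not installed.
if command -v hdfs >/dev/null 2>&1; then
  precision="$(hdfs getconf -confKey dfs.namenode.accesstime.precision)"
  if atime_enabled "$precision"; then
    echo "Access time is enabled (precision=${precision} ms)."
  else
    echo "Access time is disabled; distcp -p is expected to fail as described above."
  fi
fi
```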

Issue 3: The distcp command fails when the source directory is root.

[hdfs@c16f1n03 root]$ hadoop distcp hdfs://c16f1n03.gpfs.net:8020/ hdfs://c8f2n03.gpfs.net:8020/test5
16/03/03 22:27:34 ERROR tools.DistCp: Exception encountered
java.lang.NullPointerException
at org.apache.hadoop.tools.util.DistCpUtils.getRelativePath(DistCpUtils.java:144)
at org.apache.hadoop.tools.SimpleCopyListing.writeToFileListing(SimpleCopyListing.java:353)

Workaround

Instead of the root directory itself, specify at least one directory or file under the root as the source.
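One way to apply this workaround is to pass each top-level entry of the source file system as a separate source, because distcp accepts several sources followed by a single target. In the sketch below, the directory names /user, /tmp, and /apps are hypothetical; replace them with the entries that hadoop fs -ls / reports on the source cluster. The host names are taken from the failing command above.

```shell
SRC_NN="hdfs://c16f1n03.gpfs.net:8020"
DST="hdfs://c8f2n03.gpfs.net:8020/test5"

# Hypothetical top-level directories under the source root.
SRC_ARGS=""
for dir in /user /tmp /apps; do
  SRC_ARGS="${SRC_ARGS} ${SRC_NN}${dir}"
done

# All sources come first; the last argument is the target.
DISTCP_CMD="hadoop distcp${SRC_ARGS} ${DST}"
echo "${DISTCP_CMD}"

# Uncomment the next line to run the copy:
# ${DISTCP_CMD}
```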

Issue 4: The distcp command throws NullPointerException when the target directory is root in the federation configuration, but the job completes.

In the federation configuration, the hadoop distcp command throws NullPointerException when the target directory is root. The job still completes despite the exception. For more details, see https://issues.apache.org/jira/browse/HADOOP-11724.