Service fails to start
- HDFS does not start after adding the IBM Spectrum Scale® service or after running an Integrate_Transparency or Unintegrate_Transparency UI action in HA mode.
Solution:
After Integrate_Transparency or Unintegrate_Transparency in HA mode, if the HDFS service or its components (for example, NameNodes) do not come up during start, then do the following:
- Check whether the zkfc process is running by issuing the following command on each NameNode host:
# ps -eaf | grep zkfc
If the zkfc process is running, kill it on that NameNode host by running the kill -9 command on its pid.
- After the zkfc process is no longer running on any NameNode host, go to the HDFS service dashboard and start the HDFS service.
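For example, a minimal check-and-kill sequence on a NameNode host for the zkfc steps above (the pid value 24153 is illustrative):
# ps -eaf | grep zkfc | grep -v grep
# kill -9 24153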
- In a non-root Ambari environment, IBM Storage Scale™ fails to start because the NFS mount point is not accessible by root.
Solution:
For example, /usrhome/am_agent is an NFS mount point with its permission set to 700. The following error is seen:
2017-04-04 15:42:49,901 - ========== Check for changes to the configuration. ===========
2017-04-04 15:42:49,901 - Updating remote.copy needs service reboot.
2017-04-04 15:42:49,901 - Values don't match for gpfs.remote.copy. running_config[gpfs.remote.copy]: sudo wrapper in use; gpfs_config[gpfs.remote.copy]: /usr/bin/scp
2017-04-04 15:42:49,902 - Updating remote.shell needs service reboot.
2017-04-04 15:42:49,902 - Values don't match for gpfs.remote.shell. running_config[gpfs.remote.shell]: /usr/lpp/mmfs/bin/sshwrap; gpfs_config[gpfs.remote.shell]: /usr/bin/ssh
2017-04-04 15:42:49,902 - Shutdown all gpfs clients.
2017-04-04 15:42:49,902 - Run command: sudo /usr/lpp/mmfs/bin/mmshutdown -N k-001,k-002,k-003,k-004
2017-04-04 15:44:03,608 - Status: 0, Output: Tue 4 Apr 15:42:50 CEST 2017: mmshutdown: Starting force unmount of GPFS file systems
k-003.gpfs.net: mmremote: Invalid current working directory detected: /usrhome/am_agent
To resolve this issue, change the permissions of the home directory of the GPFS™ non-root user to at least 711.
Example: For the /usrhome/am_agent directory, set at least a 711 permission set, that is, rwx--x--x,
where 7 = rwx for the user, 1 = x for the group, and 1 = x for others; the x permission allows users to cd into the home directory.
This is required because the IBM Storage Scale commands cd into the home directory of the user. Therefore, the permission must be at least 711.
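A minimal way to apply and verify this, using the example directory above:
# chmod 711 /usrhome/am_agent
# ls -ld /usrhome/am_agent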
- Accumulo Tserver fails to start.
Solution:
The Accumulo Tserver might go down. Ensure that the block size is set to the IBM Storage Scale file system value.
- In the Accumulo configuration in Ambari, set tserver.wal.blocksize to the <GPFS file system block size of the data pool>.
For example, tserver.wal.blocksize = 2097152.
[root@c902f05x04 ~]# mmlsfs /dev/bigpfs -B
flag                value                    description
------------------- ------------------------ ------------------------------------
 -B                 262144                   Block size (system pool)
                    2097152                  Block size (other pools)
[root@c902f05x04 ~]#
- Restart the Accumulo service.
- Run the Accumulo service check from Ambari.
For additional failures, see What to do when the Accumulo service start or service check fails?.
- Hive fails to start with the default HDP and Ambari recommendations.
Solution:
There is a bug in HDP 3.0 (https://hortonworks.jira.com/projects/SPEC/issues/SPEC-18) that causes Accumulo to use the same port number as HiveServer2, leading to a port binding conflict. As a workaround, use one of the following configurations:
- Put Accumulo and HiveServer2 on different hosts, or
- Use a non-default port for either of them.
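For example, to use a non-default port for HiveServer2, you could change the HiveServer2 thrift port in the Hive configuration in Ambari (this assumes the binary transport mode; the value 10010 is illustrative, pick any free port):
hive.server2.thrift.port = 10010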
- Kafka service fails to start if Kafka is added after IBM Spectrum Scale.
Solution:
Check the Kafka log directory configuration to see whether the Kafka log directory contains the IBM Storage Scale shared mount point (/<gpfs mount point>/kafka-logs). Remove the shared mount point from the directory list and restart the Kafka service. For more information, see Adding Services.
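A hypothetical before/after illustration, assuming the property is log.dirs under the Kafka Broker configuration in Ambari and that /bigpfs is the IBM Storage Scale mount point (both values are illustrative):
Before: log.dirs = /bigpfs/kafka-logs,/kafka-logs
After:  log.dirs = /kafka-logs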
- HBase service fails to start.
Solution:
If IBM Spectrum Scale is integrated, the HBase Master fails to start or goes down. This could be because of stale znodes in ZooKeeper created for HBase.
To clean the HBase znodes, perform the following steps:
- Log in to ZooKeeper by executing the following command:
/usr/hdp/current/zookeeper-server/bin/zkCli.sh -server <any zookeeper hostname>
- Run rmr /hbase-unsecure or rmr /hbase-secure (depending on whether Kerberos is disabled or enabled).
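A sample session, assuming a hypothetical ZooKeeper host zk1.example.com and a cluster without Kerberos:
# /usr/hdp/current/zookeeper-server/bin/zkCli.sh -server zk1.example.com
[zk: zk1.example.com(CONNECTED) 0] rmr /hbase-unsecure
[zk: zk1.example.com(CONNECTED) 1] quit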
- Start All services fails because zkfc fails to start, which puts the NameNodes into standby mode.
This error occurs in the following sequence: NameNode HA is enabled in native HDFS, the IBM Spectrum Scale service is then integrated to use HDFS Transparency, and Kerberos is enabled later. While Kerberos is being enabled, zkfc fails to start, which leaves both NameNodes in standby mode. Therefore, HDFS cannot be used, and the Start All operation fails.
Solution:
Restart HDFS or restart all services from Ambari.
- NameNode and DataNodes failed to start with mapreduce.tar.gz error.
For Mpack version 2.7.0.0, the NameNode and DataNode might fail to start with the following error message when the data directory (gpfs.data.dir) is specified through the Ambari IBM Spectrum Scale UI: Failed to replace mapreduce.tar.gz with Transparency jars.
Solution:
Follow these steps to place mapreduce.tar.gz in the proper directory so that the NameNode/DataNode can start:
- Check whether the /<gpfs.mnt.dir>/<gpfs.data.dir>/hdp/apps/<hdp-version>/mapreduce/ directory exists. If not, create the directory by running the following command:
mkdir -p /<gpfs.mnt.dir>/<gpfs.data.dir>/hdp/apps/<hdp-version>/mapreduce/
- Copy the mapreduce.tar.gz from HDP to the Scale directory by running the following command:
cp /usr/hdp/<hdp-version>/hadoop/mapreduce.tar.gz /<gpfs.mnt.dir>/<gpfs.data.dir>/hdp/apps/<hdp-version>/mapreduce/mapreduce.tar.gz
where <gpfs.mnt.dir> is the IBM Storage Scale mount point, <gpfs.data.dir> is the IBM Spectrum Scale data directory, and <hdp-version> is the HDP version, which can be obtained by running hdp-select versions.
For example:
mkdir -p /bigpfs/datadir1/hdp/apps/2.6.5.0-292/mapreduce/
cp /usr/hdp/2.6.5.0-292/hadoop/mapreduce.tar.gz /bigpfs/datadir1/hdp/apps/2.6.5.0-292/mapreduce/mapreduce.tar.gz
- Restart the failed HDFS components.
- The zookeeper failover controller (ZKFC) fails during the Start All
operation after integrating IBM Spectrum Scale
service with NameNode High Availability for the first time.
There is a timing issue during the formatting of the ZooKeeper directory, which is shared by both ZKFCs in HA mode, regarding which ZKFC should be started first.
Solution:
Rerun the Start All operation to get the services back up.
- The zkfc fails to start when Kerberos is enabled.
The zkfc might fail to start with a Can't set priority for process error if IBM Spectrum Scale is added to an HA-enabled HDP cluster before Kerberos is added. The hdfs_jaas.conf file might not be generated during the Kerberos enablement action.
Solution:
- Create the hdfs_jaas.conf file in the /etc/hadoop/conf/secure directory on both the NameNodes. For example:
# cat /etc/hadoop/conf/secure/hdfs_jaas.conf
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=false
keyTab="/etc/security/keytabs/nn.service.keytab"
principal="nn/c902f09x13.gpfs.net@IBM.COM";
};
Note: Ensure that you change the keyTab and principal values based on your environment.
- If /etc/hosts is used for hostname resolution instead of DNS in your environment, use the FQDN hostname in /etc/hosts. Ensure that the output from the hostname command matches the following:
- Hostname specified in the Ambari wizard.
- IP/hostname used for DNS.
Check the same for all the hosts in the cluster and restart HDFS.
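A quick way to perform this check on each host (generic commands, no cluster-specific values assumed):
# hostname
# hostname -f
# grep "$(hostname -f)" /etc/hosts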
- When the Scale service is unintegrated, the Active NameNode starts whereas the standby NameNode
fails to start with Failed to start namenode.java.io.FileNotFoundException: No valid
image files found error message in the
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-<standby_namenode>.log
file:
ERROR namenode.NameNode (NameNode.java:main(1774)) - Failed to start namenode.
java.io.FileNotFoundException: No valid image files found
at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.getLatestImages(FSImageTransactionalStorageInspector.java:165)
This is because the dfs.namenode.name.dir directory (default path: /hadoop/hdfs/namenode) is empty.
Solution:
Because the Active NameNode is up and running, run the following steps to start the Standby NameNode:
- Run the following commands only on the Standby NameNode:
# su - hdfs
# hdfs namenode -bootstrapStandby
Note: Do not run this command on the Active NameNode. This command tries to recover all the metadata on the Standby NameNode.
- Restart both the ZKFailover Controllers from Ambari.
- Restart the Standby NameNode from Ambari.
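Optionally, verify that both NameNodes are up and in the expected states. The service IDs nn1 and nn2 are illustrative; use the IDs defined by dfs.ha.namenodes.<nameservice> in your cluster:
# sudo -u hdfs hdfs haadmin -getServiceState nn1
# sudo -u hdfs hdfs haadmin -getServiceState nn2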
- In a SLES environment, the NameNode might fail to start due to an Out of Memory error with the following error message: Exiting with status 1: java.lang.OutOfMemoryError: unable to create new native thread.
Solution:
Increase the NameNode heap size to at least 2 GB in the Ambari HDFS configuration and restart the NameNodes.
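For example, in Ambari under HDFS > Configs, the NameNode Java heap size is typically controlled by the namenode_heapsize setting in hadoop-env (the exact placement can vary with the Ambari/HDP version); a value along the following lines meets the requirement:
namenode_heapsize = 2048 MB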
- In a SLES environment, the Zeppelin Notebook service stop action can remain stuck for a long period of time.
Solution:
Stop and start the Zeppelin Notebook service to get out of the hang situation.
- ZKFC fails to start because the hdfs_jaas.conf file is missing when Kerberos is enabled while IBM Spectrum Scale is integrated.
Error message:
2019-05-08 13:34:44,595 WARN zookeeper.ClientCnxn (ClientCnxn.java:startConnect(1014)) - SASL configuration failed: javax.security.auth.login.LoginException: Zookeeper client cannot authenticate using the Client section of the supplied JAAS configuration: '/usr/hdp/3.1.0.0-78/hadoop/conf/secure/hdfs_jaas.conf' because of a RuntimeException: java.lang.SecurityException: java.io.IOException: /usr/hdp/3.1.0.0-78/hadoop/conf/secure/hdfs_jaas.conf (No such file or directory) Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
Solution:
- Copy the /etc/hadoop/conf/secure/hdfs_jaas.conf file into /usr/hdp/3.1.0.0-78/hadoop/conf/secure/hdfs_jaas.conf on all the NameNodes.
- Restart ZKFC.
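For example, the copy step can be run as follows on each NameNode (the paths are taken from the error message above):
# cp /etc/hadoop/conf/secure/hdfs_jaas.conf /usr/hdp/3.1.0.0-78/hadoop/conf/secure/hdfs_jaas.conf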
- When Kerberos is enabled on RH 7.5, the ZKFController fails with the following
errors:
2019-05-06 06:10:09,974 ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(388)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2019-05-06 06:10:09,974 ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1059)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
2019-05-06 06:10:10,081 ERROR ha.ActiveStandbyElector (ActiveStandbyElector.java:fatalError(719)) - Unexpected Zookeeper watch event state: AuthFailed
2019-05-06 06:10:10,081 ERROR ha.ZKFailoverController (ZKFailoverController.java:fatalError(379)) - Fatal error occurred:Unexpected Zookeeper watch event state: AuthFailed
2019-05-06 06:10:10,081 FATAL tools.DFSZKFailoverController (DFSZKFailoverController.java:main(197)) - DFSZKFailOverController exiting due to earlier exception java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/nn'
2019-05-06 06:10:10,083 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/nn'
2019-05-06 06:10:10,085 INFO tools.DFSZKFailoverController (LogAdapter.java:info(49)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at dn01-dat.gpfs.net/30.1.1.15
************************************************************/
Solution:
The default KDC version on RHEL 7.5 has a known bug. You need to upgrade the krb packages to version 1.15.1-19 or later.
Steps:
- Check the installed version of krb on all the hosts.
# yum list installed | grep krb
- Stop Kerberos.
# systemctl stop krb5kdc
# systemctl stop kadmin
- Upgrade krb-server, libs, and workstation to 1.15.1-19 on the ambari-server and all the ambari-agent nodes.
For example:
# rpm -Uvh krb5-workstation-1.15.1-19.el7.ppc64le.rpm krb5-libs-1.15.1-19.el7.ppc64le.rpm libkadm5-1.15.1-19.el7.ppc64le.rpm
- Start Kerberos.
# systemctl start krb5kdc
# systemctl start kadmin
- Restart the Ambari server.
# ambari-server restart
- Restart ZKFController in Ambari.
For additional information, see 2nd generation HDFS Protocol troubleshooting DataNode reports exceptions after Kerberos is enabled on RHEL7.5.
- Yarn Timeline Service 2.0 fails to start.
In HDP 3.0: The Timeline Service 2.0 in Yarn fails to start.
Solution:
There is a new implementation of the Timeline service in HDP 3.0 named Timeline Service 2.0. It can run in two modes (Embedded mode or System service mode) depending on the cluster capacity. To check which mode is set, filter the search for is_hbase_system_service_launch under the YARN configuration. If this value is checked, Timeline Service 2.0 is running in system service mode. If it is running in system service mode, follow the best practices from Enable System Service Mode.
Perform the following important step after integrating/unintegrating the IBM Spectrum Scale service and enabling/disabling Kerberos: Remove ats-hbase before switching between clusters.
If you get the ERROR client.ApiServiceClient: Failed to destroy service ats-hbase, because it is still running error above, perform the following steps:
- Check the status of the ats-hbase service by executing the following command:
yarn app -status ats-hbase
- If the state is STOPPED, perform the following steps:
Get the application_ID from the ResourceManager UI in the Ambari GUI and run:
yarn application -kill <application_ID>
yarn app -destroy ats-hbase
You might need to remove the /<gpfs.mnt.dir value>/<gpfs.data.dir value>/user/yarn-ats/{stack-version} directory.
For example:
rm -rf /gpfs/datadir_1/user/yarn-ats/{stack-version}
- Run all the service checks to ensure that all the services are successful.
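Putting the cleanup steps above together, a hypothetical end-to-end sequence, with an illustrative application ID and illustrative path and stack version values:
yarn app -status ats-hbase
yarn application -kill application_1557300000000_0001
yarn app -destroy ats-hbase
rm -rf /gpfs/datadir_1/user/yarn-ats/3.1.0.0-78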