General

Note: For known HDFS Transparency issues, see HDFS Transparency protocol troubleshooting.
  1. Data capturing for problem determination

    Solution:

    Capture the following data for problem determination:
    • Failed service and HDFS service logs from the Ambari UI. These appear as the output*.txt and error*.txt files in the operations logs in the Ambari UI.
    • Ambari server and agent logs: /var/log/ambari-server/ambari-server.log and /var/log/ambari-agent/ambari-agent.log.
    • Transparency NameNode and DataNode logs.
    • ZKFC log from NameNode host - /var/log/hadoop/root/hadoop-root-zkfc*.log.
    • The following software versions:
      • Management pack installed on the Ambari server node: list the /var/lib/ambari-server/resources/mpacks directory to see which package directory is installed.
      • HDFS Transparency version: rpm -qa | grep gpfs.hdfs-protocol.
      • IBM Spectrum® Scale version.
    • SpectrumScaleMPackInstaller.py/SpectrumScaleMPackUninstaller.py/SpectrumScale_UpgradeIntegrationPackage script failures: capture the SpectrumScale* log from the directory where the script is located. Any produced *.json files also reside in this directory.

    Find Ambari Mpack version

    Starting with Mpack 2.7.0.0, you can get the Mpack version through the Ambari GUI service actions.

    Otherwise, get the Mpack version through the Ambari directory under /var/lib/ambari-server.

    For example:
    /var/lib/ambari-server/resources/extensions/SpectrumScaleExtension/2.4.2.0/services/GPFS

    This example is using Mpack 2.4.2.0.
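The data-capture list above can be scripted. The following is a minimal sketch, assuming the default paths named in this section; files that do not exist on a given node are skipped silently, and the output path is illustrative:

```shell
#!/bin/sh
# Sketch: gather the diagnostic data listed above into one tarball.
# Paths are the defaults named in this document; missing files are skipped.
OUT=/tmp/scale-hdfs-debug
mkdir -p "$OUT"

# Ambari server/agent logs and the ZKFC log (whichever exist on this node)
for f in /var/log/ambari-server/ambari-server.log \
         /var/log/ambari-agent/ambari-agent.log \
         /var/log/hadoop/root/hadoop-root-zkfc*.log; do
  [ -e "$f" ] && cp "$f" "$OUT/"
done

# Software versions: installed Mpack directory and HDFS Transparency rpm
{
  ls /var/lib/ambari-server/resources/mpacks 2>/dev/null
  command -v rpm >/dev/null 2>&1 && rpm -qa | grep gpfs.hdfs-protocol
} > "$OUT/versions.txt"

tar -czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
echo "collected $OUT.tar.gz"
```

Attach the resulting tarball, together with the Ambari operation logs downloaded from the UI, to the problem report.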

  2. What IBM Storage Scale™ edition is required for the Ambari deployment?

    Solution: If you want to perform a new installation, including cluster creation and file system creation, use the Standard or Advanced edition because the IBM Spectrum Scale file system policy is used by default. If you only have the Express® Edition, select Deploy HDP over existing IBM Storage Scale file system.

  3. Why do I fail in registering the Ambari agent?

    Solution: Run ps -elf | grep ambari on the failing agent node to see what is running. While the agent node is registering, there must be nothing under /etc/yum.repos.d/. If there is an additional repository that does not work because of an incorrect path or yum server address, the Ambari agent registration fails.
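A quick way to audit the node before registering is to dump every repo file with its baseurl, so a stray or broken repository stands out. A sketch (the report path is illustrative):

```shell
#!/bin/sh
# Sketch: list every yum repo file and its name/baseurl/enabled lines so
# stray or broken repositories can be spotted before agent registration.
REPORT=/tmp/repo-check.txt
: > "$REPORT"
for f in /etc/yum.repos.d/*.repo; do
  [ -e "$f" ] || continue              # the glob may match nothing
  printf '== %s\n' "$f" >> "$REPORT"
  grep -E '^(name|baseurl|enabled)' "$f" >> "$REPORT"
done
echo "report written to $REPORT"
```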

  4. Which yum repository must be under /etc/yum.repos.d?

    Solution: Before registering, on the Ambari server node, under /etc/yum.repos.d, there is only the one Ambari repository file that you create in Installing the Ambari server rpm. On the Ambari agent, there must be no repository files related to Ambari. After the Ambari agent has registered successfully, the Ambari server copies the Ambari repository to all Ambari agents. After that, the Ambari server creates the HDP and HDP-UTILS repositories on the Ambari server and agents, according to your specification in the Select Stack section of the Ambari GUI.

    If you interrupt the Ambari deployment, clean up these files before starting Ambari the next time, especially if you specify a different IBM Storage Scale, HDP, or HDP-UTILS yum URL.

  5. Must all nodes have the same root password?

    Solution: No, this is not necessary. You only need to specify the ssh key file for root on the Ambari server.

  6. How do I check the superuser and the supergroup?

    Solution:

    For HortonWorks HDP 3.0, HDFS Transparency 3.0 has removed the configuration gpfs.supergroup defined in /var/mmfs/hadoop/etc/hadoop/gpfs-site.xml.

    By default, the groups from the configuration dfs.permissions.superusergroup in /var/mmfs/hadoop/etc/hadoop/hdfs-site.xml and the group root are super groups.

  7. Why am I unable to connect to the Ambari Server through the web browser?
    Solution: If you cannot connect to the Ambari Server through the web browser, check to see if the following message is displayed in the Ambari Server log which is in /var/log/ambari-server:
    WARN [main] AbstractConnector:335 - insufficient threads configured for SelectChannelConnector@0.0.0.0:8080

    The size of the thread pool can be increased to match the number of CPUs on the node where the Ambari Server is running.

    For example, if you have 160 CPUs, add the following properties to /etc/ambari-server/conf/ambari.properties:
    server.execution.scheduler.maxThreads=160
    agent.threadpool.size.max=160
    client.threadpool.size.max=160
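The three values above can be derived from the CPU count instead of hard-coded. The following sketch appends the properties; PROPS points at a scratch file here, whereas on a real server it would be /etc/ambari-server/conf/ambari.properties, followed by an Ambari server restart:

```shell
#!/bin/sh
# Sketch: size the Ambari thread pools from the online CPU count.
# PROPS is a scratch copy; on a real server use
# /etc/ambari-server/conf/ambari.properties and restart ambari-server.
PROPS=${PROPS:-/tmp/ambari.properties}
NCPU=$(getconf _NPROCESSORS_ONLN)
for p in server.execution.scheduler.maxThreads \
         agent.threadpool.size.max \
         client.threadpool.size.max; do
  echo "$p=$NCPU" >> "$PROPS"
done
cat "$PROPS"
```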
    
  8. HDFS Download Client Configs does not contain HDFS Transparency configuration.
    Solution: In the HDFS dashboard, when you go to Service Actions > Download Client Configs, the downloaded tar file does not contain the HDFS Transparency configuration.

    The workaround is to tar up the HDFS Transparency directory.

    Run the following command on an HDFS Transparency host to tar up the HDFS Transparency directory into /tmp:
    # cd /var/mmfs/hadoop/etc/
    # tar -cvf /tmp/hdfs.transparency.hadoop.etc.tar hadoop
  9. HDFS checkpoint confirmation warning message from Actions > Stop All¹ when integrated with IBM Storage Scale.
    Solution: When IBM Storage Scale is integrated, the NameNode is stateless. The HDFS Transparency does not support the HDFS dfsadmin command.
    Therefore, when you run Ambari dashboard > Actions > Stop All¹, Ambari generates a confirmation box asking the user to do an HDFS checkpoint using the hdfs dfsadmin -safemode commands. This checkpoint is not needed when HDFS Transparency is integrated, and the step can be skipped. Click Next to skip this step.
  10. What happens if the Ambari admin password is modified after installation?

    Solution: When the Ambari admin password is modified, the new password must also be set in the IBM Storage Scale service.

    To change the Ambari admin password in IBM Storage Scale, follow these steps:

    • Log in to the Ambari GUI.
    • Click Spectrum Scale > Configs tab > Advanced tab > Advanced gpfs-ambari-server-env > AMBARI_USER_PASSWORD to update the Ambari admin password.

    If the Ambari admin password is not modified in the IBM Storage Scale Advanced configuration panel, starting Ambari services might fail. For example, Hive starting fails with exception errors.

  11. Kerberos authentication error during Unintegrate Transparency action
    ERROR: Kerberos Authentication Not done Successfully. Exiting Unintegration.
    Enter Correct Credentials of Kerberos KDC Server in Spectrum Scale Configuration.

    Solution:

    If this error occurs in a Kerberos environment, check that the KDC_PRINCIPAL and KDC_PRINCIPAL_PASSWORD values in Spectrum Scale service > Configs > Advanced tab are correct. Save the configuration changes.

  12. NameNodes and DataNodes failed with the error Fail to replace Transparency jars with hadoop client jars when short-circuit is enabled.

    Solution: Install the Java™ OpenJDK development tool-kit package, java-<version>-openjdk-devel, on all the Transparency nodes. Ensure that the version is compatible with your existing JDK version. See HDFS Transparency package.

  13. ssh rejects additional ssh connections, which causes the HDFS Transparency syncconf connection to be rejected.

    Solution: If the sshd MaxStartups value is too low, additional ssh connections can be rejected.

    Review the ssh configuration values and increase the MaxStartups value.

    For example:

    Review ssh configuration:

    # sshd -T | grep -i max
    maxauthtries 6
    maxsessions 10
    clientalivecountmax 3
    maxstartups 10:30:100
    
    Modify the ssh configuration: Edit the /etc/ssh/sshd_config file to set the MaxStartups value.
    maxstartups 1024:30:1024
    Restart the ssh daemon:
    # service sshd restart
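The edit can be made idempotent by replacing any existing MaxStartups line rather than appending a duplicate. A sketch against a scratch copy (on a real node CFG would be /etc/ssh/sshd_config, followed by the sshd restart shown above):

```shell
#!/bin/sh
# Sketch: set MaxStartups, replacing any existing line for it.
# CFG is a scratch copy; use /etc/ssh/sshd_config on a real node.
CFG=${CFG:-/tmp/sshd_config}
touch "$CFG"
grep -iv '^maxstartups' "$CFG" > "$CFG.tmp" || true   # drop old value
echo 'MaxStartups 1024:30:1024' >> "$CFG.tmp"
mv "$CFG.tmp" "$CFG"
grep -i '^maxstartups' "$CFG"
```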
  14. Not able to view Solr audits in Ranger.
    Solution: To resolve this issue:
    1. As root or as the owner of the file, remove the Solr Ranger audit write lock file if it exists.
      $ ls /bigpfs/apps/solr/data/ranger_audits/core_node1/data/index/write.lock 
      $ rm /bigpfs/apps/solr/data/ranger_audits/core_node1/data/index/write.lock
    2. Restart HDFS and Solr.

      Click Ambari GUI > HDFS > Actions > Restart All

      Click Ambari GUI > Solr > Actions > Restart All

  15. On restarting a service that failed because its network port was in use, the NameNode is still up after doing a Stop All¹ from the Ambari GUI or HDFS service > Stop.
    Solution: As a root user, ssh to the NameNode to check if the NameNode is up:
    # ps -ef | grep namenode
    If it exists, kill the NameNode pid:
    # kill -9 namenode_pid

    Restart the service.
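The check-and-kill step above can be sketched with pgrep. Matching on proc_namenode assumes the usual -Dproc_namenode marker that Hadoop daemons carry on their command line; DRY_RUN=1 only reports what would be killed:

```shell
#!/bin/sh
# Sketch: find a leftover NameNode process and kill it unless DRY_RUN=1.
# Matching 'proc_namenode' assumes the usual Hadoop daemon command line.
DRY_RUN=${DRY_RUN:-1}
PIDS=$(pgrep -f 'proc_namenode' 2>/dev/null || true)
if [ -z "$PIDS" ]; then
  MSG="no NameNode process found"
elif [ "$DRY_RUN" = "1" ]; then
  MSG="would kill: $PIDS"
else
  kill -9 $PIDS
  MSG="killed: $PIDS"
fi
echo "$MSG" | tee /tmp/nn-check.txt
```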

  16. UID/GID failed with the error Illegal value: USER = xxxxx > MAX = 8388607

    Solution: If you have installed Ranger and need to leverage Ranger capabilities, the UID/GID values must be less than 8388607.

    If you do not need Ranger, follow these steps to disable Ranger from HDFS Transparency:
    1. On the Ambari GUI, click IBM Storage Scale > Configs, add gpfs.ranger.enabled, and set it to false.
    2. Save the configuration.
    3. Restart IBM Spectrum Scale.
    4. Restart HDFS.
  17. What do I do when I see performance degradation when using HDFS Transparency version 2.7.3-0 or earlier?

    Solution:

    For HDFS Transparency version 2.7.3-0 and earlier, if you see performance degradation and you are not using Ranger, set gpfs.ranger.enabled to false.
    1. On the Ambari GUI, click Spectrum Scale > Configs > Advanced > Custom gpfs-site, add gpfs.ranger.enabled, and set it to false.
    2. Save the configuration.
    3. Restart IBM Spectrum Scale.
    4. Restart HDFS.
  18. Why did the IBM Storage Scale service not stop or restart properly?

    This can result from a failure to unmount the IBM Storage Scale file system, which may be busy. See the IBM Spectrum Scale operation task output in Ambari to verify the actual error messages.

    Solution:

    Stop all services. Ensure that the IBM Storage Scale file system is not being accessed through either HDFS or POSIX by running the lsof or fuser command. Then stop or restart the IBM Storage Scale service again.

    For an FPO cluster, do not run Stop All from the Ambari GUI. Refer to the Limitations > General section on how to properly stop IBM Spectrum Scale.

  19. IBM Storage Scale service cannot be deployed in a non-root environment.

    Solution:

    If the deployment of the IBM Storage Scale service in a non-root environment fails with the error message Error occurred during stack advisor command invocation: Cannot create /var/run/ambari-server/stack-recommendations, see I cant add new services into ambari.

  20. User permission denied when Ranger is disabled.

    If Kerberos is enabled and Ranger is disabled, users get permission denied errors when accessing the file system with HDFS Transparency 3.0.0 and earlier.

    Solution:

    Check the Kerberos principal mapping hadoop.security.auth_to_local field in /var/mmfs/hadoop/etc/hadoop/core-site.xml, or in Ambari under the HDFS configuration, to ensure that the NameNode and DataNode principals are mapped to root instead of hdfs. For example, change
    FROM:
    RULE:[2:$1@$0](dn@COMPANY.DIV.COM)s/.*/hdfs/ 
    RULE:[2:$1@$0](nn@COMPANY.DIV.COM)s/.*/hdfs/ 
    
    TO:
    RULE:[2:$1@$0](dn@COMPANY.DIV.COM)s/.*/root/ 
    RULE:[2:$1@$0](nn@COMPANY.DIV.COM)s/.*/root/
    Restart the HDFS service in Ambari or HDFS Transparency by using the following command:
    /usr/lpp/mmfs/bin/mmhadoopctl connector stop; /usr/lpp/mmfs/bin/mmhadoopctl connector start
  21. Updating ulimit settings for HDFS Transparency.

    After updating the ulimit values on your nodes, perform the following procedure for HDFS Transparency to pick up the ulimit values properly.

    Solution:
    1. Restart each node’s Ambari agent by issuing the following command:
      ambari-agent restart
    2. Restart HDFS service from Ambari.
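To confirm that a restarted daemon actually picked up the new limits, you can read /proc/<pid>/limits (Linux-specific). A sketch that inspects the current shell by default; on a real node, set PID to the NameNode or DataNode process ID:

```shell
#!/bin/sh
# Sketch: show the effective open-files limit of a running process.
# PID defaults to this shell; use the NameNode/DataNode PID on a real node.
PID=${PID:-$$}
grep -i 'open files' "/proc/$PID/limits" | tee /tmp/limits-check.txt
```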
  22. In a Kerberized environment, Ambari gives an error because a user fails to authenticate.

    If Kerberos is enabled and the UID was changed, the Kerberos ticket cache becomes invalid for that user.

    Solution:

    If the user fails to authenticate, run the klist command to find the path to the ticket cache and remove the krb5* files.

    For example:

    As the user, run klist.

    Check the Ticket cache value (for example, Ticket cache: FILE:/tmp/krb5cc_0).

    Remove the /tmp/krb5cc_0 file from all nodes.
    Note: Kerberos regenerates the file on the node.
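The steps above can be sketched as follows. Parsing the klist output and fanning out with mmdsh are assumptions based on the commands shown elsewhere in this document, and the actual removal is left commented out:

```shell
#!/bin/sh
# Sketch: locate the user's ticket cache, then remove it on all nodes.
CACHE=$(klist 2>/dev/null | sed -n 's/^Ticket cache: FILE://p')
CACHE=${CACHE:-/tmp/krb5cc_$(id -u)}    # conventional default path
echo "ticket cache: $CACHE" | tee /tmp/krb5-cache-path.txt
# On a real cluster, remove it everywhere (Kerberos regenerates it):
#   mmdsh -N all rm -f "$CACHE"
```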
  23. NameNode GUI quicklinks are not accessible from the HDFS service in a multihomed network environment.

    In multihomed networks, the cluster nodes are connected to more than one network interface.

    The Quicklinks from HDFS service are not accessible with the following errors:
    This site can't be reached.
    <Host> refused to connect.
    ERR_CONNECTION_REFUSED

    Solution:

    For fixing the NameNode binding so that the HDFS service NameNode UI can be accessed properly, see the Hortonworks documentation. Ensure that you do an HDFS service restart after changing the values in the HDFS configuration in Ambari.
  24. Enable Kerberos action fails.

    Solution:

    If the IBM Spectrum Scale service is integrated, Enable Kerberos action might fail due to an issue with GPFS Service Check underneath. In such cases, retry the operation.

  25. Enable the autostart of services when IBM Storage Scale is integrated.
    Solution:
    1. In Ambari GUI, go to Admin > Service Auto Start Configuration and enable autostart.
    2. Enable autoload and automount on the IBM Spectrum Scale cluster (on the HDP cluster side).
    3. If ESS is being used, enable autoload on the ESS cluster.

    For more information, see the IBM Spectrum Scale mmchfs <fsname> -A yes (automount) and mmchconfig autoload=yes commands.

  26. GPFS Master fails with the error message: The UID and GID of the user "anonymous" is not uniform across all the IBM Storage Scale hosts.
    Solution:
    1. Ensure that the userid/groupid for the user anonymous are uniform across all the GPFS hosts in the cluster. Correct the inconsistent values on any GPFS host.
    2. If there is no anonymous userid/groupid existing on a GPFS host, ensure that you create the same anonymous userid/groupid value as all the other GPFS hosts' anonymous userid/groupid value in the same IBM Spectrum Scale cluster.

      The following example shows how to create the anonymous user as a regular OS user across all the GPFS hosts. If you are using LDAP or another network authentication service, refer to its documentation.

      Create the GID first by running the following command:

      mmdsh -N all groupadd -g <common group ID> anonymous

      where, <common group ID> can be set to a value like 11888.

      Create the UID by running the following command:
      mmdsh -N all useradd -u <common user ID> -g anonymous anonymous

      where, <common user ID> can be set to a value like 11889.

  27. IBM Spectrum Scale installation fails during deployment in Ambari due to a script not found error.

    stdout: /var/lib/ambari-agent/data/output-402.txt Caught an exception while executing custom service command:

    <class 'ambari_agent.AgentException.AgentException'>: 
    'Script /var/lib/ambari-agent/cache/extensions/SpectrumScaleExtension 
    /2.7.0.1/services/GPFS/package/scripts/slave.py does not exist';

    Solution:

    See Ambari Release Notes SPEC-57 for resolution.

  28. IBM Spectrum Scale service installation in Ambari fails in the stack advisor because the default login shell incorrectly returns a zero return code for failed shell commands.

    The log file: /var/run/ambari-server/stack-recommendations/<number>/stackadvisor.out shows errors:

    mmlsfs: No file systems were found.

    mmlsfs: Command failed. Examine previous error messages to determine cause.

    Error occurred in the stack advisor.

    Error details: local variable 'mount' referenced before assignment.

    Solution:

    Check whether the default login shell returns a return code of zero ('0') for the failed command below. A correctly behaving shell returns a value greater than 0 for this failed command.

    Run the following command on the Ambari server:
    ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no <USER>@<AMBARI-HOST-NAME> "sudo cat /notpresentfile"

    The <USER> is either the root or non-root Ambari user, depending on how Ambari was configured. If the command returns a zero ('0') return code, update the default login shell of the Ambari user to bash.
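The check can be wrapped in a small helper. The ssh invocation is the one from the text (with placeholders left as-is); the local cat call below demonstrates the expected non-zero propagation:

```shell
#!/bin/sh
# Sketch: report whether a failing command's non-zero return code survives
# the login shell. A broken shell would report rc=0 for the failure.
check_rc() {
  "$@" >/dev/null 2>&1
  rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "BROKEN: shell reported success (rc=0) for a failing command"
  else
    echo "OK: non-zero return code $rc propagated"
  fi
}
# On the Ambari server, as in the text:
#   check_rc ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no \
#       <USER>@<AMBARI-HOST-NAME> "sudo cat /notpresentfile"
check_rc cat /notpresentfile | tee /tmp/rc-check.txt
```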

  29. Unable to stop IBM Spectrum Scale service in Mpack 2.7.0.4.

    In Mpack 2.7.0.4, if gpfs.storage.type is set to shared, stopping the IBM Spectrum Scale service from Ambari reports a failure in the UI even if the operation succeeded internally.

    Solution:

    To work around this issue:
    1. Before you stop IBM Spectrum Scale or do a STOP All, set the IBM Spectrum Scale service to maintenance mode.
    2. On the command line, stop IBM Spectrum Scale using the mmshutdown command.
      # /usr/lpp/mmfs/bin/mmshutdown -a
    3. Put the IBM Spectrum Scale service out of maintenance mode.
    4. Start the IBM Spectrum Scale service or do a Start All using Ambari.
  30. If SSL is enabled in Ambari, running the SpectrumScaleMPackUninstaller.py script to uninstall the IBM Spectrum Scale Mpack with an IP address might fail with a certificate error during the validation of the Ambari server's credentials.

    Solution:

    Depending on whether the SSL certificate that the Ambari server is registered with uses the hostname or the IP address, using the IP address of the Ambari server while running the SpectrumScaleMPackUninstaller.py script can give a certificate error because the certificate is registered with the hostname. Therefore, provide the Ambari server's hostname instead of the IP address when the Mpack uninstaller script prompts for the Ambari server IP address.

  31. Ambari 2.7.X adds additional directories during deployment.

    Solution:

    For HDP 3.X using Ambari 2.7.X, Ambari adds directories in addition to the default /hadoop/hdfs directory path. Review the HDFS NameNode and DataNode directories, the Yarn local directories, and the other directories listed in the Customize services step to ensure that only the required directories are listed.

    For example, when integrating/unintegrating Scale service:
    DFS NameNode: /hadoop/hdfs/namenode,/.snapshots/hadoop/hdfs/namenode,
    /opt/hadoop/hdfs/namenode,/srv/hadoop/hdfs/namenode,/usr/local/hadoop/hdfs/namenode,
    /var/cache/hadoop/hdfs/namenode,/var/crash/hadoop/hdfs/namenode,
    /var/lib/libvirt/images/hadoop/hdfs/namenode,/var/lib/machines/hadoop/hdfs/namenode,
    /var/lib/mailman/hadoop/hdfs/namenode,/var/lib/mariadb/hadoop/hdfs/namenode,
    /var/lib/mysql/hadoop/hdfs/namenode,/var/lib/named/hadoop/hdfs/namenode,
    /var/lib/pgsql/hadoop/hdfs/namenode,/var/log/hadoop/hdfs/namenode,
    /var/opt/hadoop/hdfs/namenode,/var/spool/hadoop/hdfs/namenode,/var/tmp/hadoop/hdfs/namenode
    DFS DataNode: /hadoop/hdfs/data,/.snapshots/hadoop/hdfs/data,/opt/hadoop/hdfs/data,
    /srv/hadoop/hdfs/data,/usr/local/hadoop/hdfs/data,/var/cache/hadoop/hdfs/data,
    /var/crash/hadoop/hdfs/data,/var/lib/libvirt/images/hadoop/hdfs/data,
    /var/lib/machines/hadoop/hdfs/data,/var/lib/mailman/hadoop/hdfs/data,
    /var/lib/mariadb/hadoop/hdfs/data,/var/lib/mysql/hadoop/hdfs/data,
    /var/lib/named/hadoop/hdfs/data,/var/lib/pgsql/hadoop/hdfs/data,/var/log/hadoop/hdfs/data,
    /var/opt/hadoop/hdfs/data,/var/spool/hadoop/hdfs/data,/var/tmp/hadoop/hdfs/data

    Even though HDFS Transparency does not use the NameNode and DataNode directories listed above, native HDFS needs them.

    The default directory paths are /hadoop/hdfs/namenode and /hadoop/hdfs/data. All other directories are not needed.

  32. Ambari 2.7.x - Cannot find a valid baseurl for repo.

    For Ambari 2.7.x, Ambari writes empty baseurl values to the repo files when using a local repository, causing stack installation failures.

    Solution:

    See AMBARI-25069/SPEC-58/BUG-116328 workaround:

    For Ambari 2.7.0.0: Ambari 2.7.0 Known Issues.

    For Ambari 2.7.1.0: Ambari 2.7.1 Known Issues.

    For Ambari 2.7.3.0: Ambari 2.7.3 Known Issues.

  33. The IBM Spectrum Scale Mpack installer fails with No JSON object could be decoded error.

    If the Ambari certificate is expired, self-signed, or invalid, the Mpack installation fails while executing the REST API calls.

    Error seen:
    
        INFO: ***Starting the Spectrum Scale Mpack Installer v2.7.0.7***  
        Enter the Ambari server host name or IP address. If SSL is configured, enter host name, to verify the SSL certificate. Default=192.0.2.22  :   c902f09x05.gpfs.net
        Enter Ambari server port number. If it is not entered, the installer will take default port 8080  :   9443
        Enter the Ambari server username, default=admin  :   admin
        Enter the Ambari server password  :  
        INFO: Verifying Ambari server address, username and password.
        Traceback (most recent call last):
        File "./SpectrumScaleMPackInstaller.py", line 312, in <module>
            InstallMpack(**darg)
        File "./SpectrumScaleMPackInstaller.py", line 162, in InstallMpack
            cluster_details = verify(ambari_hostname.strip(), ambari_username.strip(), ambari_password, ambari_port)
        File "/root/mpack2707/mpack_utils.py", line 417, in verify
            clusters_json=json.loads(result)
        File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
            return _default_decoder.decode(s)
        File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
            obj, end = self.raw_decode(s, idx=_w(s, 0).end())
        File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
            raise ValueError("No JSON object could be decoded")
        ValueError: No JSON object could be decoded
        SpectrumScaleMPackInstaller failed.
    

    Solution:

    The following are two possible solutions:
    1. Enable urllib2 to work with the self-signed certificate by setting verify to disable in the /etc/python/cert-verification.cfg file. For more information, see Certificate verification in Python standard library HTTP clients.
    2. Configure Ambari with the correct SSL certificate.
  34. Mpack installation / uninstallation fails while restarting Ambari due to a Server not yet listening on http port timeout error.
    Error seen:
    
        ERROR: Failed to run Ambari server restart command, with error: Using python  /usr/bin/python
        Restarting ambari-server
        Waiting for server stop...
        Ambari Server stopped
        Ambari Server running with administrator privileges.
        Organizing resource files at /var/lib/ambari-server/resources...
        Ambari database consistency check started...
        Server PID at: /var/run/ambari-server/ambari-server.pid
        Server out at: /var/log/ambari-server/ambari-server.out
        Server log at: /var/log/ambari-server/ambari-server.log
        Waiting for server start....................................................................................................
        DB configs consistency check found warnings. See /var/log/ambari-server/ambari-server-check-database.log for more details.
        ERROR: Exiting with exit code 1.
        REASON: Server not yet listening on http port 8080 after 90 seconds. Exiting..
    

    Solution:

    1. Increase the timeout by adding or updating the server.startup.web.timeout property to 180 seconds in the /etc/ambari-server/conf/ambari.properties file on the Ambari server. For more information, see change the port for ambari server.
    2. Retry the Mpack install / uninstall procedure.
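Step 1 above can be applied idempotently by replacing any existing line for the property rather than appending a duplicate. A sketch against a scratch copy (the real file is /etc/ambari-server/conf/ambari.properties):

```shell
#!/bin/sh
# Sketch: set server.startup.web.timeout=180, replacing any existing line.
# PROPS is a scratch copy; use /etc/ambari-server/conf/ambari.properties
# on a real Ambari server.
PROPS=${PROPS:-/tmp/ambari-timeout.properties}
touch "$PROPS"
grep -v '^server.startup.web.timeout=' "$PROPS" > "$PROPS.tmp" || true
echo 'server.startup.web.timeout=180' >> "$PROPS.tmp"
mv "$PROPS.tmp" "$PROPS"
grep 'server.startup.web.timeout' "$PROPS"
```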

¹ For an FPO cluster, do not run Stop All from the Ambari GUI. Refer to the Limitations > General section on how to properly stop IBM Spectrum Scale.