IBM Support

hi_core_utils command fails to work in Jupyter Notebook

Troubleshooting


Problem

While creating an HDFS directory through the Spark session by calling the Python function hi_core_utils.run_command from the hi_core_utils library, the call might fail with an error.
%%spark -s $session_name

# Declare imports needed for all of the cells that will run remotely.
import getpass, time, os, shutil

# Load IBM Hadoop Integration utilities to facilitate remote functionality.
# This line assumes that HI version >= X.Y has been installed on the registered
# Hadoop Integration system.
hi_utils_lib = os.getenv("HI_UTILS_PATH", "")
sc.addPyFile("hdfs://{}".format(hi_utils_lib))

import hi_core_utils
# Declare a target HDFS directory path that will be used for our data.
hdfs_dataset_dir = "/user/{}/datasets".format(getpass.getuser())
input_ds = "{}/{}".format(hdfs_dataset_dir, "cars.csv")

# Create target hdfs directory, if it does not already exist.
hi_core_utils.run_command("hdfs dfs -mkdir -p {}".format(hdfs_dataset_dir))

Symptom

The error observed in the Jupyter Notebook is:
log4j:ERROR Could not read configuration file from URL [file:/run/cloudera-scm-agent/process/7147-yarn-NODEMANAGER/log4j.properties].
java.io.FileNotFoundException: /run/cloudera-scm-agent/process/7147-yarn-NODEMANAGER/log4j.properties (Permission denied)

Cause

This is a non-fatal error that comes from the CDH configuration: the log4j.properties file referenced under the cloudera-scm-agent process directory cannot be read (Permission denied) by the user running the command.

Environment

Hadoop Execution Engine (2.1.0) and Watson Studio Local (1.2.3.1 Patch 10)
CDH 6.2
Notebook: Jupyter with Python 3.5 and Spark 2.2.1

Resolving The Problem

The error comes from the CDH configuration and is non-fatal. In this case, verify that the target HDFS directory was actually created despite the error; a minimal check is sketched below.
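For example, the following cell is a minimal sketch of such a check. It reuses the session_name and hdfs_dataset_dir values defined in the notebook above and assumes, as in the earlier example, that hi_core_utils.run_command prints the output of the HDFS command.

%%spark -s $session_name

# Verify that the target HDFS directory exists by listing it.
# If the mkdir succeeded, the listing succeeds (an empty directory simply shows
# no entries); otherwise hdfs reports "No such file or directory".
hi_core_utils.run_command("hdfs dfs -ls {}".format(hdfs_dataset_dir))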

While admittedly noisy in the logs, this non-fatal error from the CDH configuration can be avoided by setting the HADOOP_CONF_DIR environment variable in one of two ways (both are sketched in the example after this list):

  1. As part of the command being run, prepend the command with an export of the variable (note the semicolon separating it from the rest of the command).
    export HADOOP_CONF_DIR=/etc/hadoop/conf;
  2. Within your Notebook, run the following once:
    os.environ['HADOOP_CONF_DIR'] = os.environ['HADOOP_CLIENT_CONF_DIR']
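As a concrete illustration, the cell below is a minimal sketch of both options applied to the earlier example. It reuses session_name and hdfs_dataset_dir from the notebook above, and the /etc/hadoop/conf path is the one shown in option 1; adjust it to match your cluster.

%%spark -s $session_name

import os

# Option 2: point HADOOP_CONF_DIR at the client configuration directory once per session.
os.environ['HADOOP_CONF_DIR'] = os.environ['HADOOP_CLIENT_CONF_DIR']

# Option 1: alternatively, export the variable as part of the command being run.
hi_core_utils.run_command(
    "export HADOOP_CONF_DIR=/etc/hadoop/conf; hdfs dfs -mkdir -p {}".format(hdfs_dataset_dir))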

Document Location

Worldwide

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSHGWL","label":"IBM Watson Studio Local"},"ARM Category":[{"code":"a8m0z000000bmNlAAI","label":"Modeling"}],"ARM Case Number":"TS003661724","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
27 May 2020

UID

ibm16217320