Refining data stored in tables in Impala in Data Refinery
Refine data stored in tables in Impala on the Hadoop cluster.
Prerequisite
Create the connection to the Hadoop cluster. See Impala via Execution Engine for Hadoop connection.
Restrictions
- The Data Refinery flow source and target and the Hadoop environment must reference the same Hadoop system.
- You must use a Hadoop environment to run Data Refinery jobs on a Hadoop cluster.
- On Impala, Data Refinery supports only jobs that write to tables with files in the Parquet format.
- If you will be overwriting or re-creating the target data set, you must have write permission (specifically, the delete permission) for the Impala table’s HDFS data directory. For example, if the HDFS data directory of the Impala table is /user/hive/warehouse/table_name and you do not have the delete permission for the data files in that directory, run this command to change the owner (see the example after this list for checking the current owner):
  hdfs dfs -chown -R new_owner:hive /user/hive/warehouse/table_name
- If you want to use the Replace Table action with an external table for the target, the external table must be empty.
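To check who currently owns the directory before you change it, you can list the directory and its files with the HDFS shell. This is a minimal sketch that assumes the default warehouse path and the placeholder table_name from the example above:
  hdfs dfs -ls -d /user/hive/warehouse/table_name   # shows the directory entry, its owner, and its permissions
  hdfs dfs -ls /user/hive/warehouse/table_name      # shows the data files and their owners
If the owner that is listed is not a user that you can run jobs as, use the chown command from the restriction above.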
Procedure
1. Create a connected data asset for the source (the data that you want to refine):
   - Go to the project page.
   - Click Assets > Import asset > Connected data.
   - Click Select source.
   - Select the Impala via Execution Engine for Hadoop connection. Navigate to the data that you want and click Select.
   - Type a name and description.
   - Click Create. The asset appears on the project Assets page.
2. Repeat step 1 to create a connected data asset for the target file for the output of the Data Refinery flow.
3. Create a Data Refinery flow:
   - Click the connected data asset for the source that you created in step 1.
   - Click Prepare data to open Data Refinery.
   - Apply operations to refine the data.
4. Change the target location for the output file:
   - Click the Flow settings icon on the toolbar, go to the Target data set tab, and click Select target.
   - Click Data asset, select the connected data asset for the target output file, and click Next.
   - In the Select target and format properties window, select a Write mode and a Table action.
   - Click Save and then Apply.
5. Create a job that runs the Data Refinery flow in the Hadoop environment:
   - From the Data Refinery toolbar, click the Jobs icon, and then select Save and create a job.
   - Enter a name and description. Select the Hadoop environment.
   - Optional: Add a one-time or repeating schedule.
   - Create the job and run it immediately, or create the job and run it later.
6. After the job completes, resync the Impala metadata. On the Hadoop cluster, start impala-shell, connect to the database, and run this command:
   REFRESH table_name
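If you prefer not to open an interactive session, impala-shell can also run a single statement from the command line. In this sketch, impala_host and database_name are placeholders for your Impala daemon host and the database that contains the table:
  impala-shell -i impala_host -d database_name -q "REFRESH table_name"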
Parent topic: Refining data on the Hadoop cluster