Refining data stored in tables in a Hive warehouse using Data Refinery

Refine data stored in tables in a Hive warehouse on the Hadoop cluster.

Prerequisite

Create the connection to the Hadoop cluster. See Hive via Execution Engine for Hadoop connection.

Restriction

The source and target of the Data Refinery flow and the Hadoop runtime environment must all reference the same Hadoop system.
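
As a rough illustration of this restriction, the following Python sketch compares the host names of three endpoints before a job is created. It is not part of Data Refinery. The function names, the example URLs, and the idea that "same Hadoop system" can be approximated by "same host name" are all assumptions made for illustration; a real cluster might expose several hosts for one system.

```python
from urllib.parse import urlparse


def hadoop_host(url: str) -> str:
    """Return the lowercased host name from a URL such as a Hive JDBC URL.

    Hypothetical helper: not a Data Refinery or Hive API.
    """
    # Strip a leading "jdbc:" so urlparse sees a regular scheme such as hive2://
    if url.lower().startswith("jdbc:"):
        url = url[len("jdbc:"):]
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no host found in {url!r}")
    return host.lower()


def same_hadoop_system(source_url: str, target_url: str, runtime_url: str) -> bool:
    """Assumption: endpoints belong to the same system if their hosts match."""
    hosts = {hadoop_host(u) for u in (source_url, target_url, runtime_url)}
    return len(hosts) == 1


if __name__ == "__main__":
    # Example values are placeholders, not real endpoints.
    ok = same_hadoop_system(
        "jdbc:hive2://edge1.example.com:10000/default",
        "jdbc:hive2://edge1.example.com:10000/sales",
        "https://edge1.example.com:8443/gateway",
    )
    print("same system" if ok else "different systems")
```

A check like this only catches obvious mismatches, such as a target asset that points at a different cluster than the runtime environment. It does not replace verifying the connection details in the project.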

Procedure

  1. Create a connected data asset for the source (data that you want to refine):

    1. Go to the project page.
    2. Click Assets > Import asset > Connected data.
    3. Click Select source.
    4. Select the Hive via Execution Engine for Hadoop connection. Navigate to the data you want and click Select.
    5. Type a name and description.
    6. Click Create. The asset appears on the project Assets page.
  2. Repeat step 1 to create a connected data asset for the target file that receives the output of the Data Refinery flow.

  3. Create a Data Refinery flow:

    1. Click the connected data asset for the source that you created in step 1.
    2. Click Prepare data to open Data Refinery.
    3. Apply operations to refine the data.
  4. Change the target location for the output file:

    1. Open Flow settings from the toolbar. Go to the Target data set tab and click Select target.
    2. Click Data asset, select the connected data asset for the target output file, and then click Next.
    3. In the Select target and format properties window, select a Write mode and a Table action.
    4. Click Save and then Apply.
  5. Create a job that runs the Data Refinery flow in the Hadoop runtime environment:

    1. From the Data Refinery toolbar, click the Jobs icon, and then select Save and create a job.
    2. Enter a name and description. Select the Hadoop runtime environment.
    3. Optional: Add a one-time or repeating schedule.
    4. Create the job and run it immediately, or create the job and run it later.

Known issues

Troubleshooting Hadoop environments
