Sample steps to ingest CSV files (Open Data for Industries)

To ingest a CSV data file into the Open Data for Industries storage layer, you use the ingestion workflow.

You can also follow this process whenever you want to verify that the ingestion flow is functioning properly.
Restriction: The steps in this process require Open Data for Industries release 2.0.0 or later.

One of the simplest data formats that the Open Data for Industries ingestion process supports is the comma-separated values (CSV) format.
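As an illustration of the format only, the following minimal Python sketch parses a small CSV document with the standard csv module. The column names and values are hypothetical and are not taken from the sample data file.

```python
import csv
import io

# Hypothetical CSV content: the first line is the header,
# and each later line is one record.
SAMPLE_CSV = """well_name,depth_m,operator
Well-A,1200.5,ExampleCo
Well-B,980.0,ExampleCo
"""


def read_records(text):
    """Return the CSV rows as a list of dictionaries keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(SAMPLE_CSV)
for record in records:
    print(record["well_name"], record["depth_m"])
```

All values are read as strings; any type conversion is left to the consumer of the records.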

The CSV format is processed through a csv-parser-dag definition. For more information, see Workflow DAGs installation and configuration (Open Data for Industries).

The end-to-end ingestion process for CSV files consists of the following elements:
Sample schema structure
The schema structure that is used to validate the structure of the data in the data file.
Sample legal tag
A sample entity that enforces governance of the ingested data by tagging the data records at the storage level.
Signed upload URL
A pre-signed URL that you use to upload the data file directly to the object storage of the Open Data for Industries service.
Sample data file
A sample CSV data file that adheres to the sample schema structure and is uploaded through the signed URL.
Sample metadata structure
A sample metadata structure that binds the schema information and the data file together so that they can be sent through the ingestion process.
CSV DAG on Apache Airflow
A predefined and configured DAG for CSV data files. The CSV DAG parses the ingested data on the Open Data for Industries installation. Apache Airflow and the CSV DAG are configured on the Open Data for Industries service platform when the required utilities are installed. For more information, see Installing software utilities.
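To make the role of the schema structure more concrete, the following Python sketch checks a CSV file against a simplified schema before upload. The SCHEMA mapping, the column names, and the validate_csv helper are hypothetical and exist only for this illustration; the sketch does not reproduce the actual schema structure that the ingestion workflow uses.

```python
import csv
import io

# Hypothetical, simplified schema: column name -> converter that must accept the value.
# This is an assumption for illustration, not the real schema structure.
SCHEMA = {"well_name": str, "depth_m": float}


def validate_csv(text, schema):
    """Return a list of problems; an empty list means that the file conforms."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in schema if name not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, convert in schema.items():
            try:
                convert(row[name])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: bad value for {name}: {row[name]!r}")
    return problems
```

Running a local check of this kind before the upload can surface structural problems earlier than a failed DAG run would.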

For more information, see Ingesting and governing oil and gas data with Open Data for Industries.

Prerequisites

Note: You can import your own CSV file or the sample CSV Postman collection file that is provided.
Be sure that you complete the following tasks for your Open Data for Industries cluster:
  • Acquire access to the Apache Airflow test environment.
  • Acquire access to the Postman desktop application or web utility.
  • Configure the management environment and create a sample environment template. The template is needed to run the sample ingestion against it. For more information, see Validating the Open Data for Industries environment.
    Tip: Make sure that the Identity Provider (IdP) that you configure has a token expiry time of at least 30 minutes. This way, you have enough time to run the test and to collect the statistics.
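The token lifetime from the tip above can be checked before a test run. The following Python sketch assumes that the IdP issues access tokens as JSON Web Tokens (JWT) with the standard exp claim; if your IdP issues opaque tokens, this approach does not apply. The function names are hypothetical, and the sketch reads the claim without verifying the token signature.

```python
import base64
import json
import time

MIN_LIFETIME_SECONDS = 30 * 60  # the 30-minute minimum from the tip


def remaining_lifetime(jwt_token, now=None):
    """Return the seconds left before the token's exp claim (signature is NOT verified)."""
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    current = time.time() if now is None else now
    return claims["exp"] - current


def has_enough_time(jwt_token, now=None):
    """Return True if at least 30 minutes remain before the token expires."""
    return remaining_lifetime(jwt_token, now) >= MIN_LIFETIME_SECONDS
```

Calling has_enough_time on a freshly issued token gives a quick indication of whether the configured expiry time is long enough for the collection run.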

Procedure

Note: To perform this procedure, you need a third-party API testing tool. The following steps assume that you are using the Postman API development tool.

  1. Import the sample ingestion collection to the Postman tool in the testing environment. You can get the sample ingestion collection file from the Cloud Pak for Data public repository.
  2. In the Postman tool, select the binary representation of the body data for the Upload File Using Signed URL request.
  3. For the Upload File Using Signed URL request, select the file that you want to ingest from the local storage drive. You can select your own CSV file or use the sample data file from the Cloud Pak for Data public repository. The sample data file adheres to the sample ingestion collection's schema.
  4. Run the entire collection with the Postman Collection Runner.
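Steps 2 and 3 amount to sending the raw bytes of the file as the request body. Outside of Postman, the same idea can be sketched in Python with the standard urllib.request module. The PUT method and the Content-Type value are assumptions; the method and headers that a signed URL accepts depend on how the object storage generated it. The function names are hypothetical.

```python
import urllib.request


def build_upload_request(signed_url, file_bytes):
    """Build an HTTP request whose body is the raw bytes of the CSV file."""
    return urllib.request.Request(
        signed_url,
        data=file_bytes,
        method="PUT",  # assumption: the signed URL was issued for PUT uploads
        headers={"Content-Type": "application/octet-stream"},  # assumption
    )


def upload_file(signed_url, path):
    """Read the file in binary mode, send it to the signed URL, and return the HTTP status."""
    with open(path, "rb") as handle:
        request = build_upload_request(signed_url, handle.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

Reading the file in binary mode mirrors the binary body setting in Postman: the bytes are sent unchanged, without form encoding.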

Verifying the test collection execution

  1. Log in to the Apache Airflow web console.
  2. Click the DAG csv-parser-dag to open the process status page.
  3. Locate the most recent run on the Tree View tab and click it.
  4. Click the Log tab in the resulting menu.
  5. Search the log for the phrase ‘record-ids’ and copy all of the record IDs that are listed with it.
  6. Use the search API request in the Postman collection to query the record IDs that you copied.

If the search request returns the records, the ingestion was successful.
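If you save the task log as text, collecting the record IDs in step 5 can be scripted. The following Python sketch is based on an assumed log layout: it expects the IDs to appear as quoted strings in a bracketed list on the same line as the phrase record-ids. The actual layout of the csv-parser-dag log is not documented here, so adjust the pattern to match your log. The function name is hypothetical.

```python
import re

# Assumed layout: "... record-ids ... ['id-1', 'id-2']" on a single log line.
LIST_AFTER_PHRASE = re.compile(r"record-ids.*?\[(.*?)\]")
QUOTED_VALUE = re.compile(r"""["']([^"']+)["']""")


def extract_record_ids(log_text):
    """Return the quoted values that follow 'record-ids', in order and without duplicates."""
    ids = []
    for line in log_text.splitlines():
        found = LIST_AFTER_PHRASE.search(line)
        if not found:
            continue
        for value in QUOTED_VALUE.findall(found.group(1)):
            if value not in ids:
                ids.append(value)
    return ids
```

The resulting list can then be pasted into the search request in the Postman collection.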