Implement a data lake by using the Hadoop Distributed File System

A data lake is a large-scale data storage repository and processing engine. You can upload large amounts of raw data into a data lake, without any transformation, and use the data in IBM® Predictive Maintenance and Quality for further analysis.

There are two ways to upload raw data into a data lake that is implemented with the Hadoop Distributed File System (HDFS), retrieve the data from the data lake, profile it for analytics usage, and save the transformed data to the analytics store.

The following table describes the methods:

Table 1. Methods for implementing a data lake in HDFS

Method: Use IBM SPSS® Modeler and IBM SPSS Analytic Server to upload data to, and retrieve data from, HDFS.
Advantages:
  • Easier for development.
  • Requires fewer coding skills.

Method: Use the HDFS command line to load data into HDFS, and then use Spark to load the data and do profiling. An example is shown after this table.
Advantages:
  • Native to IBM Open Platform.
  • Offers better performance.
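
The following Python (PySpark) sketch illustrates the second method. It is a minimal example, not part of the product: the HDFS paths, the file name raw_events.csv, and the column names asset_id and reading are hypothetical placeholders, and the raw file is assumed to have been loaded beforehand with a command such as hdfs dfs -put raw_events.csv /data/lake/raw/.

    # Minimal PySpark sketch: read raw data from the data lake, profile it,
    # and save a transformed copy to the analytics store.
    # All paths and column names below are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("DataLakeProfiling").getOrCreate()

    # Read the raw, untransformed data directly from HDFS.
    raw = spark.read.csv("hdfs:///data/lake/raw/raw_events.csv",
                         header=True, inferSchema=True)

    # Basic profiling: row count, inferred schema, and per-column summary statistics.
    print("Row count:", raw.count())
    raw.printSchema()
    raw.describe().show()

    # Example transformation: drop rows that are missing the key column and
    # cast a measurement column to a numeric type.
    transformed = (raw
                   .dropna(subset=["asset_id"])
                   .withColumn("reading", F.col("reading").cast("double")))

    # Save the transformed data to the analytics store (Parquet, in this sketch).
    transformed.write.mode("overwrite").parquet("hdfs:///data/lake/analytics/events")

    spark.stop()

In a real deployment, replace the paths, schema, and output format with the ones that your IBM Open Platform environment uses.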