Implement a data lake by using the Hadoop Distributed File System
A data lake is a large-scale data storage repository and processing engine. You can upload large amounts of raw data into a data lake, without any transformation, and use the data in IBM® Predictive Maintenance and Quality for further analysis.
There are two ways to upload raw data into a data lake that is implemented with the Hadoop Distributed File System (HDFS), retrieve the data from the data lake, profile it for analytics, and save the transformed data to the analytics store.
The following table describes the methods:
| Method | Advantages |
|---|---|
| Use IBM SPSS® Modeler and IBM SPSS Analytic Server to upload data to, and retrieve data from, HDFS. | |
| Use the HDFS command line to load data into HDFS, and then use Spark to load the data and do profiling (see the sketch after this table). | |
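The following is a minimal sketch of the second method, assuming a CSV file of raw sensor readings named `readings.csv` and the HDFS paths `/data/raw` and `/data/analytics`; the file name, paths, and the use of Parquet for the analytics store are hypothetical, not prescribed by Predictive Maintenance and Quality.

```python
# Step 1 (HDFS command line): load the raw file into HDFS.
# The directory and file name are assumptions for this sketch:
#   hdfs dfs -mkdir -p /data/raw
#   hdfs dfs -put readings.csv /data/raw/
#
# Step 2 (Spark): load the raw data from HDFS and do basic profiling.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataLakeProfiling").getOrCreate()

# Read the raw CSV from HDFS; infer the schema from the header row.
raw = spark.read.csv(
    "hdfs:///data/raw/readings.csv",
    header=True,
    inferSchema=True,
)

# Basic profiling: row count, inferred schema, and summary statistics
# (count, mean, stddev, min, max) for the numeric columns.
print("Row count:", raw.count())
raw.printSchema()
raw.describe().show()

# Save the transformed data to the analytics store, assumed here to be
# a Parquet directory on HDFS.
raw.write.mode("overwrite").parquet("hdfs:///data/analytics/readings")

spark.stop()
```

In practice, the profiling step is where you would add the transformations your analytics workload needs; `describe()` is shown only as a starting point.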