Data sources (Analytic Server console)

A data source is a collection of records, plus a data model, that together define a data set for analysis. The source of records can be a file (delimited text, fixed-width text, Excel) on HDFS, a relational database, HCatalog, or a geospatial source. The data model defines all the metadata (field names, storage, measurement level, and so on) necessary for analyzing the data. Data source owners can grant or restrict access to data sources.

Data source listing

The main Data sources page provides a list of data sources of which the current user is a member.

  • Click a data source's name to display its details and edit its properties.
  • Type in the search area to filter the listing to display only data sources with the search string in their name.
  • Click New to create a new data source with the name and content type you specify in the Add new data source dialog.
    • See Naming rules (Analytic Server console) for restrictions on the names you can give to data sources.
    • The available content types are File, Database, HCatalog, and Geospatial.
      Notes:
      • The HCatalog option is only available if Analytic Server has been configured to work with HCatalog data sources.
      • The HCatalog type is not available for HDP 3.0 or later and CDH 6.0 or later.
      • The content type cannot be edited once selected.
      • You can import/export multiple data sources in a single action.
  • Click Delete to remove the data source. This action leaves all files associated with the data source intact.
  • Click Refresh to update the listing.
  • Use the Actions drop-down list to apply one of the following actions to the selected data sources.
    1. Select Export to create an archive from the selected data sources and save the archive to the local file system. The archive includes any files that were added to the selected data sources in Projects mode or Data source mode.
      Note: When only one data source is selected, the archive file name matches the name of the selected data source. When more than one data source is selected, the archive file name defaults to datasources.zip.
    2. Select Import to import archives that were created by the Export action.
      Note: Archive files that contain information from multiple data sources cannot be imported. In these cases, the individual data source archives must first be extracted from the datasources.zip archive.
    3. Select Duplicate to create a copy of the data source.

Individual data source details

The content area is divided into several sections, which can depend on the content type of the data source.

Details
These settings are common to all content types.
Name
An editable text field that shows the name of the data source.
Display name
An editable text field that shows the name of the data source as displayed in other applications. If this is blank, the Name is used as the display name.
Description
An editable text field to provide explanatory text about the data source.
Is public
A check box that indicates whether anyone can see the data source (checked) or if users and groups must be explicitly added as members (cleared).
Is global share
A check box that controls whether the Spark RDD is cached in the global cache. When selected, the Spark RDD is always cached in the global cache. When deselected, the Spark RDD is removed from the global cache when no Spark job is using it.
Custom attributes
Applications can attach properties to data sources, such as whether the data source is temporary, through the use of custom attributes. These attributes are exposed in the Analytic Server console to provide further insight into how applications use the data source.

Click Save to keep the current state of the settings.

Sharing

These settings are common to all content types.

You can share ownership of a data source by adding users and groups as Authors or Readers.

  • Typing in the text box filters on users and groups with the search string in their name. Select Author or Reader from the drop-down list to assign their role within the data source. Click Add member to add them to the list of members.
  • To remove a participant, select a user or group in the member list and click Remove member.
Note: Users with the Administrator role have read and write access to every data source, regardless of whether they are specifically listed as a member.
File Input
Settings that are specific to defining data sources with file content type.
File Viewer

Shows available files for inclusion in the data source. Select Projects mode to view files within the Analytic Server project structure, Data source mode to view files stored within a data source, or File system mode to view the file system (typically HDFS). You can browse any of these structures, but the file system is read-only, and in Projects mode you can add files, create folders, and delete items only within defined projects, not at the root level. To create, edit, or delete a project, use Projects.

  • Click Upload to upload a file to the current data source or project/subfolder. You can browse for and select multiple files in a single directory.
    Note: Files are uploaded to the distributed file system. You can find the uploaded files in the /analytic-root directory structure, under the appropriate tenant, data source or project (depending upon the mode chosen), and subfolder. For example, if you:
    1. Log on to the tenant ibm
    2. Create a data source called fraudDetection
    3. Select Data source mode
    4. Create a subfolder called historicalData
    5. Upload a file charges2015.csv
    Then the file can be found on the distributed file system in /analytic-root/ibm/.datasource/fraudDetection/historicalData/charges2015.csv. If, on the other hand, you:
    1. Log on to the tenant ibm
    2. Create a data source called fraudDetection
    3. Select Project mode
    4. Select an existing project called creditProcessing
    5. Create a subfolder called historicalData
    6. Upload a file charges2015.csv
    Then the file can be found on the distributed file system in /analytic-root/ibm/creditProcessing/historicalData/charges2015.csv.
  • Click New folder to create a new folder under the current folder, with the name you specify in the New Folder Name dialog.
  • Click Download to download the selected files to the local file system.
  • Click Delete to remove the selected files/folders.
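The path layout described in the note above can be sketched as a small function. This is only an illustration, not an Analytic Server API; the tenant, data source, project, folder, and file names are the examples from the note.

```python
# Illustrative sketch only (not an Analytic Server API): computes the
# distributed file system path where an uploaded file lands, following the
# layout described above.

def uploaded_file_path(tenant, subfolder, filename,
                       mode="datasource", datasource=None, project=None):
    """Return the /analytic-root path for a file uploaded in the given mode."""
    if mode == "datasource":
        # Data source mode: files live under a hidden .datasource folder.
        base = "/analytic-root/{}/.datasource/{}".format(tenant, datasource)
    elif mode == "project":
        # Projects mode: files live directly under the project folder.
        base = "/analytic-root/{}/{}".format(tenant, project)
    else:
        raise ValueError("unknown mode: " + mode)
    return "{}/{}/{}".format(base, subfolder, filename)

print(uploaded_file_path("ibm", "historicalData", "charges2015.csv",
                         mode="datasource", datasource="fraudDetection"))
# -> /analytic-root/ibm/.datasource/fraudDetection/historicalData/charges2015.csv
print(uploaded_file_path("ibm", "historicalData", "charges2015.csv",
                         mode="project", project="creditProcessing"))
# -> /analytic-root/ibm/creditProcessing/historicalData/charges2015.csv
```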
Files included in data source definition
Use the move button to add selected files and folders to, or remove them from, the data source. For each selected file or folder in the data source, click Settings to define the specifications for reading the file.
When multiple files are included in a data source, they must share common metadata; that is, each file must have the same number of fields, the fields must be parsed in the same order in each file, and each field must have the same storage across all files. Mismatches between files can cause the console to fail to create the Preview and Metadata, or can cause otherwise valid values to be parsed as invalid (null) when Analytic Server reads the file.
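As a hypothetical pre-flight check (not an Analytic Server feature), you could verify that a set of delimited files share the same field names in the same order before including them in one data source:

```python
# Hypothetical pre-flight check (not part of Analytic Server): verifies that
# several delimited files share common metadata -- the same field names in
# the same order -- before they are combined into one data source.
import csv
import io

def header(fileobj, delimiter=","):
    """Read the first row (field names) of a delimited file."""
    fileobj.seek(0)
    return next(csv.reader(fileobj, delimiter=delimiter))

def share_common_metadata(fileobjs, delimiter=","):
    """True if every file has the same fields, in the same order."""
    headers = [header(f, delimiter) for f in fileobjs]
    return all(h == headers[0] for h in headers)

# Two files that match, and one whose fields are in a different order.
a = io.StringIO("id,amount,date\n1,9.99,2015-01-02\n")
b = io.StringIO("id,amount,date\n2,5.00,2015-01-03\n")
c = io.StringIO("id,date,amount\n3,2015-01-04,7.50\n")

print(share_common_metadata([a, b]))     # True
print(share_common_metadata([a, b, c]))  # False -- likely to break Preview
```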
Database Selections

Specify the connection parameters for the database that contains the record content.

Database
Select the type of database to connect to. Choose from: Db2, Greenplum, Apache Impala, Amazon Redshift, MySQL, Netezza, Oracle, SQL Server, Teradata, Hive, dashDB, or BigSQL. If the type you are seeking is not listed, ask your server administrator to configure Analytic Server with the appropriate JDBC driver.
Note: Analytic Server supports MySQL databases that are located on remote systems.
Hive Connect Type
This option is available only when Hive is selected as the Database type. Select the connection type: Single Server for a single Hive server, or High Availability for a highly available Hive server cluster. The following options are available when High Availability is selected:
Zookeeper Quorum
Enter a comma-separated list of all ZooKeeper server host:port pairs (for example, zkhost1:2181,zkhost2:2181).
Name Space
Enter the Hive root namespace on ZooKeeper. For example, hiveserver2, or hiveserver2-hive2 (when HiveServer2 Interactive is enabled and used on HDP 2.6).
Notes:
  • The Zookeeper Quorum and Name Space values are located in the hive-site.xml file.
  • By default, Hive High Availability is disabled in Cloudera and must be enabled manually.
  • When using a Hive data source in a non-Kerberos environment, ensure that the Username you entered in the Database Selections section is the same as the user name that is used to log in to Analytic Server.
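For reference, the High Availability settings correspond to the standard Hive JDBC mechanism of ZooKeeper service discovery. A connection URL built from the two values above would look like the following (the host names, ports, and database name here are placeholder examples; the console assembles the actual connection itself):

```
jdbc:hive2://zkhost1:2181,zkhost2:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
```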
Server address
Enter the URL of the server that hosts the database.
Server port
The port number that the database listens on.
Database name
The name of the database you want to connect to.
Username
If the database is password-protected, enter your user name.
Password
If the database is password-protected, enter your password.
Table name
Enter the name of a table from the database that you want to use.
Maximum concurrent reads
Enter the limit on the number of parallel queries that can be sent from Analytic Server to the database to read from the table specified in the data source.
HCatalog Selections
Specify the parameters for accessing data that is managed under Apache HCatalog.
Database
The name of the HCatalog database.
Table name
Enter the name of a table from the database that you want to use.
Filter
The partition filter for the table, if the table was created as a partitioned table. HCatalog filtering is supported only on Hive partition keys of type string.
Note: The !=, <>, and LIKE operators do not appear to work in certain Hadoop distributions. This is a compatibility issue between HCatalog and those distributions.
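A partition filter is a boolean expression over the table's partition keys. For example, assuming a hypothetical table partitioned on string keys year and month, a filter that selects one month of data might look like:

```
year="2015" and month="01"
```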
HCatalog Field Mappings
Displays the mapping of an element in HCatalog to a field in the data source. Click Edit to modify the field mappings.
Note: After creating an HCatalog-based data source that exposes data from a Hive table, you may find that when the Hive table is formed from a large number of data files, there is a substantial delay each time Analytic Server starts to read data from the data source. If you notice such delays, rebuild the Hive table with a smaller number of larger data files, reducing the total to 400 files or fewer.
Geospatial Selections
Specify the parameters for accessing geographic data.
Geospatial type
The geographic data can come from an online map service or a shape file.
If you are using a map service, specify the URL of the service and select the map layer you want to use.

If you are using a shape file, select or upload the shape file. Note that a shape file is actually a set of files with a common file name, stored in the same directory. Select the file with the .shp suffix; Analytic Server looks for and uses the other files. Two additional files with the .shx and .dbf suffixes must always be present; depending on the shape file, a number of additional files may also be present.
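As an illustration of the companion-file requirement (this is only a sketch, not an Analytic Server API), the following checks that the mandatory .shx and .dbf files sit next to a given .shp file:

```python
# Illustrative check only (not an Analytic Server API): given the path to a
# .shp file, report which of the mandatory companion files (.shx and .dbf)
# are missing from the same directory.
import os
import tempfile

def missing_companions(shp_path, required=(".shx", ".dbf")):
    base, _ = os.path.splitext(shp_path)
    return [ext for ext in required if not os.path.exists(base + ext)]

# Demo with a temporary directory standing in for the upload folder.
with tempfile.TemporaryDirectory() as d:
    for name in ("regions.shp", "regions.shx"):  # note: no regions.dbf
        open(os.path.join(d, name), "w").close()
    print(missing_companions(os.path.join(d, "regions.shp")))  # ['.dbf']
```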

Preview and Metadata
After you specify the settings for the data source, click Preview and Metadata to check and confirm the data source specifications.
Output
Data sources with file, database, or HCatalog content type can be appended or overwritten by output from streams that are run on Analytic Server. Select Make writeable to enable appending or overwriting and:
  • For data sources with database content type, choose an output database table where the output data are written.
  • For data sources with file content type:
    1. Choose an output folder where the new files are written.
      Tip: Use a separate folder for each data source so it's easier to keep track of the associations between files and data sources.
    2. Select a file format: either CSV (comma-separated values) or Splittable binary format.
    3. Optionally select Make sequence file. This is useful if you want to create splittable compressed files that are usable in downstream MapReduce jobs.
    4. Select Newlines can be escaped if your output is CSV and you have string fields that contain embedded newline or carriage return characters. This will cause each newline to be written as a backslash followed by the letter “n”, carriage return as a backslash followed by the letter “r”, and backslash as two consecutive backslashes. Such data must be read with the same setting. We strongly suggest using the Splittable binary format when handling string data that contains newline or carriage return characters.
    5. Select a compression format. The list includes all formats that have been configured for use with your installation of Analytic Server.
      Note: Some combinations of compression format and file format result in output that cannot be split, and is therefore unsuitable for further MapReduce processing. Analytic Server produces a warning in the Output section when you make such a selection.
  • For data sources with HCatalog content type, choose an output hive table where the output data are written.
    HCatalog data source notes and restrictions:
    • The HCatalog data source table must exist before working with Analytic Server (Analytic Server does not create the required table).
    • The table's metadata/data model must be consistent with the data model of the results to be exported.
    • The HCatalog data source supports append mode only; overwrite mode is not supported.
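The escaping scheme that the Newlines can be escaped option describes can be sketched as follows. This is a minimal illustration of the documented rules, not Analytic Server's own implementation:

```python
# Minimal illustration of the documented escaping rules (not Analytic
# Server's own code): backslash becomes two backslashes, newline becomes
# backslash-n, and carriage return becomes backslash-r.

def escape_field(value):
    # Escape backslashes first so the other substitutions are unambiguous.
    return (value.replace("\\", "\\\\")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def unescape_field(value):
    out, i = [], 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            out.append({"n": "\n", "r": "\r", "\\": "\\"}.get(nxt, nxt))
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)

address = "123 Main St.\nSpringfield"
escaped = escape_field(address)
print(escaped)                             # 123 Main St.\nSpringfield
assert unescape_field(escaped) == address  # round-trips losslessly
```

Data written this way must be read back with the same setting, which is why the documentation recommends the Splittable binary format when string fields contain embedded newlines or carriage returns.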