Data sources (Analytic Server console)
Data source listing
The main Data sources page provides a list of data sources of which the current user is a member.
- Click a data source's name to display its details and edit its properties.
- Type in the search area to filter the listing to display only data sources with the search string in their name.
- Click New to create a new data source with the name and content type you
specify in the Add new data source dialog.
- See Naming rules (Analytic Server console) for restrictions on the names you can give to data sources.
- The available content types are File, Database, HCatalog, and Geospatial.
Notes:
- The HCatalog option is only available if Analytic Server has been configured to work with those data sources.
- The HCatalog type is not available for HDP 3.0 or later and CDH 6.0 or later.
- The content type cannot be edited once selected.
- You can import/export multiple data sources in a single action.
- Click Delete to remove the data source. This action leaves all files associated with the data source intact.
- Click Refresh to update the listing.
- The Actions dropdown list performs the selected action.
- Select Export to create an archive from the selected data sources and save the archive to the local file system. The archive includes any files that were added to the selected data sources in Projects mode or Data source mode.
Note: When only one data source is selected, the archive file name shares the selected data source name. When more than one data source is selected, the archive file name defaults to datasources.zip.
- Select Import to import archives that were created by the Export action.
Note: Archive files that contain information from multiple data sources cannot be imported. In these cases, the individual data source archives must first be extracted from the datasources.zip archive.
- Select Duplicate to create a copy of the data source.
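The extraction step described in the Import note above can be sketched as follows. This is an illustrative helper, not part of Analytic Server; the file names (datasources.zip and the per-data-source .zip members) follow the naming convention described above.

```python
# Sketch: split a combined datasources.zip into its per-data-source
# archives so each one can be imported individually.
import zipfile
from pathlib import Path

def extract_member_archives(combined_zip, dest_dir):
    """Extract every .zip member of a combined archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(combined_zip) as zf:
        for name in zf.namelist():
            if name.endswith(".zip"):
                zf.extract(name, dest)
                extracted.append(dest / name)
    return extracted
```

Each extracted archive can then be supplied to the Import action one at a time.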
Individual data source details
The content area is divided into several sections, which can depend on the content type of the data source.
- Details
- These settings are common to all content types.
- Name
- An editable text field that shows the name of the data source.
- Display name
- An editable text field that shows the name of the data source as displayed in other applications. If this is blank, the Name is used as the display name.
- Description
- An editable text field to provide explanatory text about the data source.
- Is public
- A check box that indicates whether anyone can see the data source (checked) or if users and groups must be explicitly added as members (cleared).
- Is global share
- A check box that controls whether the Spark RDD is cached in the global cache. When selected, the Spark RDD is always cached in the global cache. When deselected, the Spark RDD is removed from the global cache when no Spark job is using it.
- Custom attributes
- Applications can attach properties to data sources, such as whether the data source is temporary, through the use of custom attributes. These attributes are exposed in the Analytic Server console to provide further insight into how applications use the data source.
Click Save to keep the current state of the settings.
- Sharing
-
These settings are common to all content types.
You can share ownership of a data source by adding users and groups as Authors or Readers.
- Typing in the text box filters on users and groups with the search string in their name. Select Author or Reader from the drop-down list to assign their role within the data source. Click Add member to add them to the list of members.
- To remove a participant, select a user or group in the member list and click Remove member.
Note: Users with the Administrator role have read and write access to every data source, regardless of whether they are specifically listed as a member.
- File Input
- Settings that are specific to defining data sources with file content type.
- File Viewer
Shows available files for inclusion in the data source. Select Projects mode to view files within the Analytic Server project structure, Data source mode to view files stored within a data source, or File system mode to view the file system (typically HDFS). You can browse either folder structure, but the HDFS file system is read-only, and in Projects mode you cannot add files, create folders, or delete items at the root level; you can do so only within defined projects. To create, edit, or delete a project, use Projects.
- Click Upload to upload a file to the current data source or project/subfolder. You can browse for and select multiple files in a single directory.
Note: Files are uploaded to the distributed file system. You can find the uploaded files in the /analytic-root directory structure, under the appropriate tenant, data source or project (depending upon the mode chosen), and subfolder. For example, if you:
- Log on to the tenant ibm
- Create a data source called fraudDetection
- Select Data source mode
- Create a subfolder called historicalData
- Upload a file charges2015.csv
then the file is stored under /analytic-root/ibm/fraudDetection/historicalData. Similarly, if you:
- Log on to the tenant ibm
- Create a data source called fraudDetection
- Select Projects mode
- Select an existing project called creditProcessing
- Create a subfolder called historicalData
- Upload a file charges2015.csv
then the file is stored under /analytic-root/ibm/creditProcessing/historicalData.
- Click New folder to create a new folder under the current folder, with the name you specify in the New Folder Name dialog.
- Click Download to download the selected files to the local file system.
- Click Delete to remove the selected files/folders.
- Files included in data source definition
- Use the move button to add selected files and folders to, or remove them from, the data source. For each selected file or folder in the data source, click Settings to define the specifications for reading the file.
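The upload location rule described in the Note above can be sketched as a path builder. The helper function is hypothetical and only illustrates the documented layout: /analytic-root, then the tenant, then the data source (Data source mode) or project (Projects mode), then any subfolders.

```python
# Illustrative only: Analytic Server does not expose such a function.
def uploaded_file_path(tenant, container, subfolder, filename):
    """container is the data source name (Data source mode)
    or the project name (Projects mode)."""
    return "/".join(["/analytic-root", tenant, container, subfolder, filename])

# Data source mode example from above:
uploaded_file_path("ibm", "fraudDetection", "historicalData", "charges2015.csv")
# -> /analytic-root/ibm/fraudDetection/historicalData/charges2015.csv
```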
- Database Selections
Specify the connection parameters for the database that contains the record content.
- Database
- Select the type of database to connect to. Choose from: Db2, Greenplum, Apache Impala, Amazon Redshift, MySQL, Netezza, Oracle, SQL Server, Teradata, Hive, dashDB, or BigSQL. If the type you are seeking is not listed, ask your server administrator to configure Analytic Server with the appropriate JDBC driver.
Note: Analytic Server supports MySQL databases that are located on remote systems.
- Hive Connect Type
- This option is available only when Hive is selected as the
Database type. Select the connection type, Single
Server or High Availability. Single Server
is used when a single Hive server is employed; High Availability is used when
a highly available Hive server cluster is employed. The following options are available when
High Availability is selected:
- Zookeeper Quorum
- Enter a comma-separated list for all Zookeeper server hosts:ports (for example: zkhost1:2181,zkhost2:2181).
- Name Space
- Enter the Hive root name space on Zookeeper. For example, hiveserver2 or hiveserver2-hive2 (when hiveserver2 interactive is enabled and used on HDP 2.6).
Notes:
- The Zookeeper Quorum and Name Space values are located in the hive-site.xml file.
- By default, Hive High Availability is disabled in Cloudera and must be enabled manually.
- When using a Hive data source in a non-Kerberos environment, you must ensure that the Username you enter in the Database Selections section is the same as the user name used to log in to Analytic Server.
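The Zookeeper Quorum and Name Space settings correspond to Hive's standard JDBC service-discovery connection string. The sketch below shows how those two values typically map into such a URL; Analytic Server assembles the connection internally, so the builder function is illustrative only.

```python
# Sketch: build a Hive High Availability JDBC URL from the Zookeeper
# Quorum (host:port list) and the Hive root name space on Zookeeper.
def hive_ha_jdbc_url(zookeeper_quorum, namespace, database="default"):
    return ("jdbc:hive2://{quorum}/{db};serviceDiscoveryMode=zooKeeper;"
            "zooKeeperNamespace={ns}").format(
                quorum=zookeeper_quorum, db=database, ns=namespace)

hive_ha_jdbc_url("zkhost1:2181,zkhost2:2181", "hiveserver2")
# -> jdbc:hive2://zkhost1:2181,zkhost2:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
```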
- Server address
- Enter the URL of the server that hosts the database.
- Server port
- The port number that the database listens on.
- Database name
- The name of the database you want to connect to.
- Username
- If the database is password-protected, enter your user name.
- Password
- If the database is password-protected, enter your password.
- Table name
- Enter the name of a table from the database that you want to use.
- Maximum concurrent reads
- Enter the limit on the number of parallel queries that can be sent from Analytic Server to the database to read from the table specified in the data source.
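To illustrate what Maximum concurrent reads bounds, the sketch below splits a table read into at most N parallel range queries. This is a rough conceptual model, not Analytic Server's actual query plan; the table, key column, and SQL shape are hypothetical.

```python
# Sketch: split a read of [min_key, max_key] into at most
# max_concurrent_reads non-overlapping range queries.
def partition_queries(table, key_column, min_key, max_key, max_concurrent_reads):
    span = max_key - min_key + 1
    n = min(max_concurrent_reads, span)
    step = -(-span // n)  # ceiling division
    queries = []
    for lo in range(min_key, max_key + 1, step):
        hi = min(lo + step - 1, max_key)
        queries.append(
            f"SELECT * FROM {table} WHERE {key_column} BETWEEN {lo} AND {hi}")
    return queries
```

A higher limit allows more parallel queries against the database, at the cost of heavier load on the database server.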
- HCatalog Selections
- Specify the parameters for accessing data that are managed under
Apache HCatalog.
- Database
- The name of the HCatalog database.
- Table name
- Enter the name of a table from the database that you want to use.
- Filter
- The partition filter for the table, if the table was created as a partitioned table. HCatalog filtering is supported only on Hive partition keys of type string.
Note: The !=, <>, and LIKE operators do not appear to work in certain Hadoop distributions. This is a compatibility issue between HCatalog and those distributions.
- HCatalog Field Mappings
- Displays the mapping of an element in HCatalog to a field in the data source. Click Edit to modify the field mappings.
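Given the operator restrictions noted under Filter above, a filter can be checked for the problematic operators before use. The helper below is a hypothetical guard, not part of Analytic Server; the example filter strings assume string-typed partition keys named year and month.

```python
# Sketch: flag partition filters that use the operators noted above as
# unreliable on some Hadoop distributions (!=, <>, LIKE).
UNRELIABLE = ("!=", "<>", " like ")

def filter_is_safe(partition_filter):
    lowered = " %s " % partition_filter.lower()
    return not any(op in lowered for op in UNRELIABLE)

filter_is_safe('year="2015" and month="12"')   # -> True
filter_is_safe('year<>"2015"')                 # -> False
```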
- Geospatial Selections
- Specify the parameters for accessing geographic data.
- Geospatial type
- The geographic data can come from an online map service, or a shape file.
- Preview and Metadata
- After you specify the settings for the data source, click Preview and Metadata to check and confirm the data source specifications.
- Output
- Data sources with file, database, or HCatalog content type can be appended or overwritten by
output from streams that are run on Analytic Server. Select
Make writeable to enable appending or overwriting and:
- For data sources with database content type, choose an output database table where the output data are written.
- For data sources with files content type:
- Choose an output folder where the new files are written.
Tip: Use a separate folder for each data source so it's easier to keep track of the associations between files and data sources.
- Select a file format; either CSV (comma separated value) or Splittable binary format.
- Optionally select Make sequence file. This is useful if you want to create splittable compressed files that are usable in downstream MapReduce jobs.
- Select Newlines can be escaped if your output is CSV and you have string fields that contain embedded newline or carriage return characters. This will cause each newline to be written as a backslash followed by the letter “n”, carriage return as a backslash followed by the letter “r”, and backslash as two consecutive backslashes. Such data must be read with the same setting. We strongly suggest using the Splittable binary format when handling string data that contains newline or carriage return characters.
- Select a compression format. The list includes all formats that have been configured for use
with your installation of Analytic Server.
Note: Some combinations of compression format and file format result in output that cannot be split, and is therefore unsuitable for further MapReduce processing. Analytic Server produces a warning in the Output section when you make such a selection.
- For data sources with HCatalog content type, choose an output Hive table where the output data are written.
HCatalog data source notes and restrictions:
- The HCatalog data source table must exist before working with Analytic Server (Analytic Server does not create the required table).
- The table's metadata/data model must be consistent with the data model of the results to be exported.
- The HCatalog data source supports append mode only; overwrite mode is not supported.
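The escaping rule described under Newlines can be escaped above can be sketched as a pair of helpers: newline becomes backslash-n, carriage return becomes backslash-r, and a backslash becomes two backslashes. These are illustrative functions, not Analytic Server's implementation; data written with one must be read back with the other.

```python
# Sketch of the documented CSV escaping rule for embedded newlines.
def escape_field(value):
    return (value.replace("\\", "\\\\")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def unescape_field(value):
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            out.append({"n": "\n", "r": "\r", "\\": "\\"}.get(nxt, nxt))
        else:
            out.append(ch)
    return "".join(out)
```

Because both sides must agree on the setting, the Splittable binary format remains the safer choice for string data containing newlines or carriage returns.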