Benefits of new features, version 11.7

This topic describes the new features and functionality added to IBM® InfoSphere® Information Server.

For a detailed list of all new features added for InfoSphere Information Server, version 11.7.1, see New features and changes in InfoSphere Information Server, Version 11.7.1.

For a detailed list of all new features added with rollup patches or fix packs since the release of InfoSphere Information Server, Version 11.7, see New features and changes in InfoSphere Information Server, Version 11.7.

IBM Information Server

IBM Information Server delivers unified governance by enabling you to explore, cleanse, and analyze your data. By using the following capabilities of unified governance, you can easily govern the data in your enterprise, making sure that it is of the highest quality and always up-to-date.
New and enhanced user interfaces
  • IBM Information Server Enterprise Search, a new component where you can find assets in your enterprise and explore their relationships in the relationship graph. InfoSphere Information Server now brings social collaboration to the domain of information governance. Assets can be rated by all users, and everyone can collaborate by sharing comments about critical assets such as reports, source files, and more.
  • Information Governance Catalog New, a new version of the Information Governance Catalog that you know from previous versions. With a completely new look and feel, you can use it to discover, analyze, and govern your data. You can explore assets and their relationships, apply a workflow process to keep the catalog up-to-date, and more.
  • Information Server Governance Monitor, a new component where you can check the status and health of your data on the Quality Dashboard and the Curation Dashboard. This user interface helps you measure your progress toward your enterprise governance objectives. Is all of your data being governed? What levels of data quality have you achieved? How many assets are you already governing? These and many other questions can be answered with the Governance Monitor.
Integration with IBM Watson Knowledge Catalog
Watson Knowledge Catalog is an extension to Information Governance Catalog that provides self-service access to data assets for knowledge workers who need to use those data assets to gain insights. After you create glossary assets and profile, classify, and curate information assets with Information Governance Catalog, you use Watson Knowledge Catalog to protect and display data assets in a self-service catalog where users can find and prepare data assets.
For details, see IBM Watson Knowledge Catalog.
Searching for assets
Searching for assets has never been easier. You don't need to know anything about the data in your enterprise to explore it. Suppose you want to find information about bank accounts: simply type 'bank account' in the search field in enterprise search, and that's it. The search engine looks for the information in all asset types. It takes into account factors like text match, related assets, ratings and comments, modification date, quality score, and usage.
For details, see the topic in Information Server Enterprise Search.
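The multi-factor ranking described above can be sketched as a weighted score. This is an illustrative assumption only: the factor names, weights, and normalization below are hypothetical, not the Enterprise Search engine's actual algorithm.

```python
# Hypothetical weighted-ranking sketch (assumed factors and weights,
# not the product's real scoring algorithm).
WEIGHTS = {
    "text_match": 0.40,
    "related_assets": 0.15,
    "ratings_comments": 0.15,
    "recency": 0.10,
    "quality_score": 0.10,
    "usage": 0.10,
}

def rank(assets):
    """Order assets by a weighted sum of normalized factor scores (0..1)."""
    def score(asset):
        return sum(WEIGHTS[f] * asset.get(f, 0.0) for f in WEIGHTS)
    return sorted(assets, key=score, reverse=True)

assets = [
    {"name": "bank_account", "text_match": 1.0, "usage": 0.9},
    {"name": "account_audit", "text_match": 0.6, "quality_score": 0.8},
]
ranked = rank(assets)  # exact text match plus heavy usage ranks first
```

The point of the sketch is that no single factor dominates: a heavily used, well-rated asset can outrank a slightly better text match.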
Are you already familiar with your organization and looking for something more specific? Just open the catalog with your data, and select asset types that you want to browse. To narrow down search results, apply advanced filters like creation and modification dates, stewards, labels, or custom attributes.
For details, see the topic in Information Governance Catalog.
Exploring relationships
Data in large organizations can be very complex, and assets can be related to one another in multiple ways. To understand these complex relations better, explore them in graphical form by using the relationship graph. By default, this view displays all relationships of one asset that you select. But that is just the starting point: you can further expand the relationships of those related assets in the same view. Having all this information in one place, in a graphical format, makes it a lot easier to dig into the structure of your data. You'll be surprised when you discover how assets are connected!
For details, see Exploring relationships.
Using workflow
When you update the data catalog, you can enable workflow to ensure the highest quality of your data. With workflow enabled, changes in the catalog are verified by experts before they are available to all users: the workflow team must review and approve the changes before they are made public. Therefore, you can be sure that only high-quality updates are made to your data.
For details, see Workflow process.
Rating assets and adding comments
When you browse your data, you might want to know what other experts think of it. Now you can: rate an asset on a scale of one to five stars, or leave a short comment. This enables all members of your organization to collaborate and share their expertise right where it's needed. Also remember that the more popular an asset is, the higher its position on the search results list.
For details, see Rating assets and adding comments.
Queries and collections
When you discover your data, queries and collections might come in handy. When you regularly search for the same type of data, you can run queries with specified search criteria and get up-to-date results that you can export to a CSV file. And to group assets related to one specific subject, you can add them to a collection.
For details, see Queries and Collections.
Unstructured data sources
The data in your enterprise consists of databases, tables, columns, and other sources of structured data. But what about data that doesn't have such a clear structure? What about email messages, word-processing documents, audio or video files, collaboration software, or instant messages? They are also a very valuable source of information. Now you can classify such data by integrating Information Governance Catalog with IBM StoredIQ®. As a result, new asset types are available in your enterprise: instances, infosets, volumes, and filters.
For details, see Unstructured data sources.
Using producers to populate data
Producers are applications that collect relevant data on systems like Db2®, Hive, Hadoop Distributed File System (HDFS), Oracle, or Teradata. These applications monitor activity on the systems and generate information that helps improve the quality of search results. For example, thanks to these producers, apart from assets, you can also see users in the search results. You can then easily find which users contribute to which assets by displaying the relationships in the graph explorer.
For details, see Assets in Information Governance Catalog New.
Automatically discovering data
Importing and analyzing new data sets used to require many manual steps. Now there is a faster and more convenient way to complete all these tasks: click the Discover button in Information Governance Catalog New. With this one click, you can register data sets and add metadata to a new or existing workspace, run column analysis and quality analysis, publish analysis results, and automatically assign terms to assets.
If you prefer not to use the user interface, you can also discover data by using the command-line discover action.
For details, see Discovering assets and Reviewing and working with the discovery results.
Running a quick scan of your data
If you don't have time to run a full analysis on your data, especially on large data sets, you can run a quick scan. A quick scan provides a general overview of the quality of your data and is much faster than traditional discovery: metadata is not imported, and only the first 1000 rows of each data set are analyzed.
For details, see Running a quick scan.
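The sampling idea behind a quick scan can be sketched in a few lines: analyze only the first 1000 rows and derive a rough quality overview from them. The per-column null-rate metric below is an illustrative assumption; the product computes its own quality measures.

```python
from itertools import islice

QUICK_SCAN_ROW_LIMIT = 1000  # a quick scan reads only the first 1000 rows

def quick_scan(rows, limit=QUICK_SCAN_ROW_LIMIT):
    """Compute a rough per-column null rate from a row sample.

    `rows` is any iterable of dicts; only the first `limit` rows are
    consumed, which is what makes the scan fast on large data sets.
    """
    sample = list(islice(rows, limit))
    if not sample:
        return {}
    nulls = {}
    for row in sample:
        for col, value in row.items():
            nulls[col] = nulls.get(col, 0) + (value is None)
    return {col: count / len(sample) for col, count in nulls.items()}

# A generator standing in for a huge data set: only 1000 rows are read.
big = ({"id": i, "email": None if i % 4 == 0 else "x"} for i in range(10_000_000))
overview = quick_scan(big)  # e.g. {"id": 0.0, "email": 0.25}
```

Because the input is consumed lazily, the cost is bounded by the row limit, not by the size of the data set.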
Analyzing data
If you want to run analysis on a smaller set of data, you can easily do it in the workspace view in Information Governance Catalog New. In this view, you can also review detailed analysis results, and publish them to the catalog to share them with other users.
For details, see Analyzing data.
Using data and quality rules
While you discover and analyze your data, you can use data rules and quality rules to validate specific conditions associated with your data sources. They provide an easy way to check the quality of your data. After discovery and analysis run, you can check the results and review the current state of your data. All of these actions can be performed in a single application: Information Governance Catalog New.
For details, see Using rules.
Automatically assigning terms to information assets
One of the tasks that you can configure for automatic discovery is automatically assigning terms to information assets. When you add new data to your enterprise, you want it properly classified, which until now meant assigning terms to new information assets manually. What if this could be done automatically? Sounds good, but is it safe? How does the system know which terms to assign to which assets? The mechanism takes into account the similarity between the term and the name of an asset, matching data classifications that are identified during column analysis, and a machine learning service that constantly learns from user actions. Additionally, you can specify a confidence threshold that controls when terms are automatically assigned and when they are identified as candidates for assignment.
For details, see Automatic term assignment.
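The threshold behavior described above can be sketched as follows. The blend of the three signals and their weights is a hypothetical illustration; only the threshold split (assign at or above, suggest as candidate below) mirrors the description.

```python
# Hypothetical sketch of confidence-based term assignment.
# Signal names and weights are assumptions, not the product's model.

def term_confidence(name_similarity, class_match, ml_score):
    """Blend the three signals mentioned above (illustrative weights)."""
    return 0.4 * name_similarity + 0.3 * class_match + 0.3 * ml_score

def classify_term(confidence, threshold=0.8):
    """At or above the threshold the term is assigned automatically;
    below it the term is only suggested as a candidate."""
    return "assigned" if confidence >= threshold else "candidate"

conf = term_confidence(name_similarity=0.9, class_match=1.0, ml_score=0.7)
status = classify_term(conf)  # high confidence -> assigned automatically
```

Raising the threshold makes automatic assignment more conservative: more terms end up as review candidates instead of being applied directly.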
Running automation rules
Another process that you can automate in Information Governance Catalog New to save time and energy is evaluating and validating specific criteria associated with your data sources. Starting with this release, you can create automation rules, which use graphical or command-line based if ... then logic to automatically generate quality rules based on terms when you automatically discover data or run a column analysis, whether automatically or manually. With automation rules, you can be sure that any violations in your data are quickly discovered and reported so that you can correct them at once.
For details, see Automation rules.
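The if ... then shape of an automation rule can be sketched like this: when a column's assigned term matches a rule condition, a quality rule is generated for that column. The term and rule names here are hypothetical examples, and the data model is an assumption.

```python
# Minimal sketch of "if <term> then <quality rule>" automation logic.
# Term and rule names are hypothetical illustrations.

AUTOMATION_RULES = [
    # (if the assigned term is ..., then generate this quality rule)
    ("Email Address", "valid_email_format"),
    ("Social Security Number", "ssn_checksum"),
]

def generate_quality_rules(columns):
    """columns: list of (column_name, assigned_term) pairs.
    Returns the (column, quality_rule) bindings the rules produce."""
    generated = []
    for column, term in columns:
        for rule_term, quality_rule in AUTOMATION_RULES:
            if term == rule_term:
                generated.append((column, quality_rule))
    return generated

bindings = generate_quality_rules([
    ("CUST_EMAIL", "Email Address"),
    ("CUST_NAME", "Full Name"),   # no matching rule: nothing generated
])
```

Running this during discovery or column analysis is what lets violations surface automatically, without anyone hand-binding rules to columns.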
Monitoring your data
To quickly check on the status and health of your data, you can open the Monitoring tab in Information Governance Catalog New. Two dashboards give you a deep understanding of what's going on with your information. On the Curation Dashboard, you can see whether the data in your enterprise is cataloged, classified, and governed. On the Quality Dashboard, you can review the overall quality of the data in your enterprise, including scoring and quality dimensions.
For details, see Monitoring the data.
Set checkpoints for InfoSphere DataStage® jobs
When InfoSphere DataStage jobs failed in the past, you had to manually restart the whole job from scratch. Now there is a great new capability to create checkpoints! You can save persistent checkpoints on disk across job runs and then use those checkpoints on subsequent runs. With a checkpoint, a job that fails during a write, insert, or update operation is automatically restarted from the checkpoint instead of from the beginning.
For details, see Configuring checkpoints for jobs.
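The resume-from-checkpoint idea can be sketched in a few lines: persist progress to disk after each committed row, and on restart skip everything already committed. The JSON file format and row-level granularity here are illustrative assumptions, not DataStage's on-disk checkpoint format.

```python
import json
import os
import tempfile

# Sketch of checkpoint-based restart (hypothetical file format).

def run_job(rows, checkpoint_path):
    """Process rows, persisting a checkpoint after each committed row.
    On restart, resume from the last committed position."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["last_committed"]
    written = []
    for i in range(start, len(rows)):
        written.append(rows[i])          # the write/insert/update step
        with open(checkpoint_path, "w") as f:
            json.dump({"last_committed": i + 1}, f)  # commit progress
    return written

path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
rows = list(range(10))
run_job(rows[:6], path)        # simulate a run that stopped after 6 rows
resumed = run_job(rows, path)  # the rerun picks up at row 6, not row 0
```

The second run processes only rows 6 through 9, which is the whole point: a failed job does not redo work it already committed.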
Run Information Analyzer jobs on Spark
Take advantage of the power of Spark by running Information Analyzer jobs on this powerful framework! You now have the ability to run column analysis, data quality analysis, primary key analysis, and data rules on Apache Spark from the InfoSphere Information Analyzer thin client and workbench.
For details, see Configuring analysis and data rule jobs to run on Spark.

IBM DataStage Flow Designer

IBM DataStage Flow Designer is a web-based user interface for DataStage. You can use it to create, edit, load, and run DataStage jobs. The greatest benefits are:
Specify table definitions in jobs
Use table definitions when designing your jobs to specify the data that you want to use at each stage of a job. Table definitions are shared by all the jobs in a project.
Schedule your jobs
Use the job scheduling feature to automatically run jobs at specified times every day, week, or month. This helps you manage how your jobs are run and ensures that they run at the times that matter to you.
Suggested stages
This feature makes working with the canvas easier and faster. It suggests stages when you click a stage on the canvas that has no outputs. The suggested stages are highlighted in the palette and displayed on the canvas with dotted lines. You can select one of the suggestions by either dragging the suggested stage from the palette or clicking the suggested stage on the canvas. When you add a stage by either method, the added stage is automatically linked to the stage that you clicked that had no outputs.
No need to migrate jobs
You do not need to migrate jobs to a new location in order to use the new web-based IBM DataStage Flow Designer user interface. Any existing DataStage jobs can be rendered in IBM DataStage Flow Designer, avoiding complex, error-prone migrations that could lead to costly outages. Furthermore, any new jobs created in IBM DataStage Flow Designer can be opened in the Windows-based DataStage Designer thick client, maintaining backward compatibility.
No need to upgrade servers and purchase virtualization technology licenses
Getting rid of a thick client means no more keeping up with the latest version of the software, upgrading servers, or purchasing Citrix licenses. IBM DataStage Flow Designer saves time AND money!
Easily work with your favorite jobs
You can mark your favorite jobs in the Jobs Dashboard and have them automatically show up on the welcome page. This gives you fast, one-click access to jobs that are typically used for reference, saving you navigation time.
Easily continue working where you left off
Your recent activity automatically shows up on the welcome page. This gives you fast, one-click access to the jobs that you were working on before, so you can easily start where you left off in the last session. No need to go through several levels of folder navigation to find your job.
Efficiently search any job
Many organizations have thousands of DataStage jobs. You can very easily find your job with the built-in type-ahead Search feature on the Jobs Dashboard, with no need to go through several levels of folder navigation. For example, you can search by job name, description, or timestamp to find what you are looking for very quickly. Built-in virtual scrolling allows the results page to scale to thousands of jobs. You can also organize jobs by using sorting and group-by. The Category option allows you to drill down to a specific job, starting with a folder.
Cloning a job
Instead of always starting a job design from scratch, you can clone an existing job on the Jobs Dashboard and use it to jump-start your new job design. Open the cloned job and start editing connectors and stages.
Quick Tour and videos
As a new user, you might not know how to navigate the new user interface. Take the built-in Quick Tour to familiarize yourself with the product, or watch the Create your first job video on the welcome page.
Flow Designer Features
IBM DataStage Flow Designer has many features to enhance your job-building experience. You can use the palette to drag and drop connectors and operators onto the designer canvas. You can link nodes by selecting the previous node and dropping the next node, or by drawing the link between the two nodes. You can edit stage properties in the sidebar and make changes to your schema on the Column Properties tab. You can zoom in and out by using your mouse, and use the mini-map on the lower right of the window to focus on a particular part of the DataStage job. This is very useful when you have a very large job with tens or hundreds of stages.
Automatic metadata propagation
IBM DataStage Flow Designer comes with a powerful feature to automatically propagate metadata. Once you add a source connector to your job and link it to an operator, the operator automatically inherits the metadata; you do not have to specify the metadata in each stage of the job. For example, if your source is Db2 and you link it to the Sort operator, it automatically shows you the columns that are eligible for sorting. Changing metadata in a DataStage job can be a very time-consuming process because you must go to each subsequent stage and redo the change. IBM DataStage Flow Designer automatically propagates the changed metadata to subsequent stages in that flow, increasing productivity.
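The propagation described above can be sketched as a change to a source schema flowing to every downstream stage. The stage model below is a hypothetical illustration of the mechanism, not DataStage's internal representation.

```python
# Sketch of automatic metadata propagation through a linked flow.
# The Stage class and its methods are illustrative assumptions.

class Stage:
    def __init__(self, name):
        self.name = name
        self.columns = []
        self.downstream = []

    def link(self, other):
        """Link this stage to a downstream stage; the downstream stage
        inherits the current column metadata immediately."""
        self.downstream.append(other)
        other.inherit(self.columns)

    def inherit(self, columns):
        """Adopt upstream metadata and push it further downstream."""
        self.columns = list(columns)
        for stage in self.downstream:
            stage.inherit(columns)

    def set_columns(self, columns):
        """Change the schema; the change cascades automatically."""
        self.columns = list(columns)
        for stage in self.downstream:
            stage.inherit(columns)

source = Stage("db2_source")
sort = Stage("sort")
target = Stage("target")
source.link(sort)
sort.link(target)
source.set_columns(["CUST_ID", "CUST_NAME"])  # cascades to sort and target
```

One schema change at the source reaches every stage in the flow, which is why you never re-enter columns stage by stage.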
Storing your preferences
You can easily customize your viewing preferences and have the IBM DataStage Flow Designer automatically save them across sessions. Preferences, such as showing lists or tiles in the Jobs Dashboard, or Flow Designer settings to show and hide node types, links or annotations, are designed to make the user interface work for you!
Saving a job
IBM DataStage Flow Designer allows you to save a job in any folder. The job is saved as a DataStage job in the repository, alongside other jobs that might have been created using the DataStage Designer thick client.
Highlighting of all compilation errors
The DataStage thick client identifies compilation errors one at a time, so large jobs with many stages can take a long time to troubleshoot. IBM DataStage Flow Designer highlights all errors and lets you see each problem with a quick hover over the stage, so you can fix multiple problems before recompiling. When the compilation is successful, you can view the generated OSH script for your job.
Running a job
IBM DataStage Flow Designer allows you to run a job and refresh its status on the new user interface. You can also view the Job Log or launch the Ops Console to see more details of the job execution.

For details, see Data Integration.


Google Cloud Storage connector
You can use the Google Cloud Storage connector to connect to the Google Cloud Storage service and perform the following operations:
  • Write data to files residing in Google Cloud Storage using various file formats.
  • Read data from files residing in Google Cloud Storage using various file formats.
  • Read metadata information of different file formats for files residing in Google Cloud Storage.
  • Create Google Cloud Storage buckets for storing files.
  • Delete files residing in Google Cloud Storage.
Cassandra connector
You can use the Cassandra connector to connect to tables stored in a Cassandra database and perform the following operations:
  • Read data from a Cassandra database.
  • Write data to a Cassandra database.
For details, see Cassandra connector.
HBase connector
You can use the HBase connector to connect to tables stored in the HBase database and perform the following operations:
  • Read data from or write data to the HBase database.
  • Read data in parallel mode.
  • Use an HBase table as a lookup table in sparse or normal mode.
For details, see HBase connector.
Hive connector
The Hive connector supports modulus partition mode and minimum-maximum partition mode during the read operation.
For details, see Hive connector.
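The two partitioned-read modes can be illustrated by the predicates each reader node would use: modulus mode hashes rows across nodes by `MOD(key, N)`, while minimum-maximum mode splits the key range into contiguous slices. The SQL shape below is an illustrative assumption; the connector generates its own queries.

```python
# Illustrative sketch of the two partitioned-read strategies
# (hypothetical SQL predicates, not the connector's generated queries).

def modulus_predicates(key_column, node_count):
    """Modulus mode: node i reads rows where MOD(key, N) = i."""
    return [f"MOD({key_column}, {node_count}) = {i}" for i in range(node_count)]

def minmax_predicates(key_column, lo, hi, node_count):
    """Min/max mode: split the key range [lo, hi] into contiguous slices."""
    step = (hi - lo + 1) // node_count
    preds = []
    for i in range(node_count):
        start = lo + i * step
        end = hi if i == node_count - 1 else start + step - 1
        preds.append(f"{key_column} BETWEEN {start} AND {end}")
    return preds

mod_parts = modulus_predicates("order_id", 4)
range_parts = minmax_predicates("order_id", 1, 100, 4)
```

Modulus mode balances well for evenly hashed keys; min/max mode works best when the key values are roughly uniform over their range.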
Kafka connector
The Kafka connector has been enhanced with the following new capabilities:
  • Continuous mode, where incoming topic messages are consumed without stopping the connector.
  • Transactions, where a number of Kafka messages are fetched within a single transaction. After the record count is reached, an end-of-wave marker is sent to the output link.
  • TLS connections to Kafka.
  • Support for Kerberos keytab locality.
For details, see Kafka connector.
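The transaction batching above can be sketched as a stream that emits an end-of-wave marker after every `record_count` messages. This is a pure-Python illustration of the batching logic only; it uses no Kafka client API and does not represent the connector's implementation.

```python
# Sketch of end-of-wave batching over a message stream
# (illustrative logic, not the Kafka connector itself).

END_OF_WAVE = object()  # sentinel standing in for the wave marker

def waves(messages, record_count):
    """Yield messages, inserting an end-of-wave marker after each batch
    of `record_count` messages, and after a partial final batch."""
    batch = 0
    for msg in messages:
        yield msg
        batch += 1
        if batch == record_count:
            yield END_OF_WAVE
            batch = 0
    if batch:
        yield END_OF_WAVE  # flush the partial final batch

out = list(waves(["m1", "m2", "m3", "m4", "m5"], record_count=2))
# -> m1, m2, <wave>, m3, m4, <wave>, m5, <wave>
```

Downstream stages can treat each wave as a unit of work, which is what makes transactional fetching useful in a continuously consuming flow.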
Amazon S3 connector
The Amazon S3 connector now supports connecting by using an HTTP proxy server.
For details, see Amazon S3 connector.
File connector
The File connector has been enhanced with the following new capabilities:
  • Native HDFS FileSystem mode is supported.
  • You can import metadata from ORC files.
  • New data types are supported for reading and writing Parquet-formatted files: Date/Time and Timestamp.
For details, see File connector.
New Hadoop distributions
The following Hadoop distributions are supported:
  • MapR 5.2.2
  • Cloudera CDH 5.13
  • Hortonworks HDP 2.6.2