Cleaning data

Data is often messy and inconsistent. You might want to clean your data so that it's clearer and easier to read.

About this task

The Clean options that are available for a column depend on the column data type. Some options can be specified for multiple columns with the same data type, and some for singular columns only.

The following options are available to clean your data:

Whitespace
Trim leading and trailing whitespace

Select this check box to remove leading and trailing whitespace from strings.

Convert case to
UPPERCASE, lowercase, Do not change

Use this option to change the case of all characters in the string to either uppercase or lowercase, or to ensure that the case of each individual character is unchanged.

Return a substring of characters
Return a string that includes only part of the original string in each value. For example, an employee code can be stored as CA096670, but you need only the number 096670 so you use this option to remove the CA part. You can specify this option for singular columns only.

For the Start value, type a number that represents the position of a character in the string that will start the substring. Number 1 represents the first character in the string. For the Length value, specify the number of characters that will be included in the substring.

NULL values
Specify NULL-handling options for columns with text, numeric, date, and time data types that allow NULL values. When Cognos Analytics detects that a column does not allow NULL values, these options are not available for that column.

The default value for each option depends on the column data type. For text data, the default is an empty string. For numbers, the default is 0. For dates, the default is 2000-01-01. For time, the default is 12:00:00. For date and time (timestamp), the default is 2000-01-01T12:00:00.

The entry field for each option also depends on the column data type. For text, the entry field accepts alphanumeric characters, for numbers, the entry field accepts only numeric input. For dates, a date picker is provided to select the date, and for time, a time picker is provided to select the time.

The following NULL-handling options are available:

Replace this value with NULL

Replaces the text, numbers, date, and time values, as you specify in the entry field, with Null.

If you replace all column values with Null, the column data type remains unchanged. You need to change the data type manually to avoid errors when the values are used in expressions. For more information, see step 4.

Replace NULL values with

Replaces NULL values with text, numbers, date, and time values, as you specify in the entry field.

For example, for the Middle Name column, you can specify the following values to be used for cells where middle name does not exist: n/a, none, or the default empty string. For the Discount Amount column, you can specify 0.00 for cells where the amount is unknown.

Procedure

  1. In the data module tree, click the context menu icon Context menu for a column, and click Clean.

    To clean data in multiple columns at once, control-select the columns that you want to clean. The Clean option is available only if the data type of each selected column is the same.

  2. Specify the options that are applicable for the selected column or columns.
  3. Click Clean.

    The "cleaned" values are displayed in the data module. This is the final step for most Clean options.

    The expression editor automatically creates an expression for the modified columns. To view the expression, open the column properties panel, and for the Expression property, click View or edit.

  4. If you used the Replace this value with NULL option to replace all column values, perform the following steps to change the column data type:
    1. Open the column properties panel, and for the Expression property, click View or edit.
    2. View the expression for the cleaned column. The expression is similar to this one:
      nullif (
      	i2000_YR2000
      	;
      	''
      )

      The column Data type property in this example is Text.

    3. Use the cast function on the expression to convert the values in the column to the required data type, as shown in the following example:
      cast(nullif (
      	i2000_YR2000
      	;
      	''
      ), integer)
      Tip: To view the documentation about the cast function, click the function name in the expression, and then click the information icon Help icon in the toolbar. The function description is displayed in the Information panel.
    4. Click OK to save the expression.

      The column Data type property is now changed to the required type. In the example provided in this step, it's the Integer data type.

      The values in each column now show Null.

Example

If you want to use an empty string instead of NULL in a given column, but your uploaded file sometimes uses the string n/a to indicate that the value is unknown, you can replace n/a with NULL, and then replace NULL with the empty string.