GUI operations in Data Refinery (Data Refinery)

Data Refinery supports the following categories of GUI operations.

Select a GUI operation from the + Operation button.

A subset of the operations is available from each column’s Actions menu (three vertical dots in the column header actions icon three vertical dots). You can rename a column by clicking the Edit icon in the column header.

FREQUENTLY USED

Calculate
Perform a calculation with another column or with a specified value. The operators are:


Convert column type
When you open a file in Data Refinery, the Convert column type operation is automatically applied as the first step if it detects any non-string data types in the data. Data types are automatically converted to inferred data types. Click Edit to change the automatic conversion for selected columns. As with any other operation, you can undo the step. The Convert column type operation is reapplied every time that you open the file in Data Refinery. Automatic conversion is applied as needed for file-based data sources only. (It does not apply to a data source from a database connection.)

If the data is converted to an Integer or to a Decimal data type, you can specify the decimal symbol and the thousands grouping symbol for all applicable columns. Strings that are converted to the Decimal data type use a dot for the decimal symbol and a comma for the thousands grouping symbol. Alternatively, you can select comma for the decimal symbol and dot or a custom symbol for the thousands grouping symbol. The decimal symbol and the thousands grouping symbol cannot be the same.

The source data is read from left to right until a terminator or an unrecognized character is encountered. For example, if you are converting string data 12,834 to Decimal and you do not specify what to do with the comma (,), the data will be truncated to 12. Similarly, if the source data has multiple dots (.), and you select dot for the decimal symbol, the first dot is used as the decimal separator and the digits following the second dot are truncated. A source string of 1.834.230,000 is converted to a value of 1.834.

The Convert column type operation automatically converts these date and timestamp formats:

You can manually apply the Convert column type operation to change the data type of a column at any point in the Data Refinery flow. You can create a new column to hold the result of this operation or you can overwrite the existing column.

Tip: A column’s data type determines the operations that you can perform. Changing the data type can affect which operations are relevant for that column.


Filter
Filter rows by the selected columns. Keep rows with the selected column values; filter out all other rows.

The operators for numeric, string, and Boolean (logical), and date and timestamp columns are:

Operator Numeric String Boolean Date and timestamp
Contains      
Does not contain      
Does not end with      
Does not start with      
Ends with      
Is between two numbers      
Is empty  
Is equal to  
Is false      
Is greater than    
Is greater than or equal to    
Is in    
Is less than    
Is less than or equal to    
Is not empty  
Is not equal to  
Is not in    
Is not null      
Is null    
Is true      
Starts with      


Math

You can apply math operations only to numeric columns. You can create a new column to hold the result of an operation or you can overwrite the existing column.

Math > Absolute value
Get the absolute value of a number.
Example: The absolute value of both 4 and -4 is 4.

Math > Arc cosine
Get the arc cosine of an angle.

Math > Ceiling
Get the nearest integer of greater value, also known as the ceiling of the number.
Examples: The ceiling of 2.31 is 3. The ceiling of -2.31 is -2.

Math > Exponent
Get a number raised to the power of the column value.

Math > Floor
Get the nearest integer of lesser value, also known as the floor of the number.
Example: The floor of 2.31 is 2. The floor of -2.31 is -3.

Math > Round
Get the whole number nearest to the column value. If the column value is a whole number, return it.

Math > Square root
Get the square root of the column value.


Remove
Remove the selected column.


Rename
Rename the selected column.


Sort ascending
Sort all the rows in the table by the selected column in ascending order.

Sort descending
Sort all the rows in the table by the selected column in descending order.


Substitute
Obscure sensitive information from view by substituting a random string of characters for the actual data in the selected column.


Text

You can apply text operations only to string columns. You can create a new column to hold the result of an operation or you can overwrite the existing column.

Text > Collapse spaces
Collapse multiple, consecutive spaces in the text to a single space.

Text > Concatenate string
Link together any string to the text. You can prepend the string to the text, append the string to the text, or both.

Text > Lower case
Convert the text to lower case.

Text > Number of characters
Return the number of characters in the text.

Text > Pad characters
Pad the text with the specified string. Specify whether to pad the text on the left, right, or both the left and right.

Text > Substring
Create substrings from the text that start at the specified position and have the specified length.

Text > Title case
Convert the text to title case.

Text > Trim quotes
Remove single or double quotation marks from the text.

Text > Trim spaces
Remove leading, trailing, and extra spaces from the text.

Text > Upper case
Convert the text to upper case.


CLEANSE

Convert column value to missing
Convert values in the selected column to missing values if they match values in the specified column or they match a specified value.


Extract date or time value
Extract a selected portion of a date or time value from a column with a date or timestamp data type.


Remove duplicates
Remove rows with duplicate column values.


Remove empty rows
Remove rows that have a blank or missing value for the selected column.


Replace missing values
Replace missing values in the column with a specified value or with the value from a specified column in the same row.


Replace substring
Replace the specified substring with the specified text.


ORGANIZE

Aggregate
Apply summary calculations to the values of one or more columns. Each aggregation creates a new column. Optionally, select Group by columns to group the new column by another column that defines a characteristic of the group, for example, a department or an ID. You can group by multiple columns. You can combine multiple aggregations in a single operation.

The available aggregate operations depend on the data type.

Numeric data:

String data:


Concatenate
Concatenate the values of two or more columns.


Conditional replace
Replace the values in a column based on conditions.


Join
Combine data from two data sets based on a comparison of the values in specified key columns. Specify the type of join to perform, select the columns (join keys) in both data sets that you want to compare, and select the columns that you want in the resulting data set.

The join key columns in both data sets need to be compatible data types. If the Join operation is the first step that you add, check whether the Convert column type operation automatically converted the data type of the join key columns in the first data set when you opened the file in Data Refinery. Also, depending where the Join operation is in the Data Refinery flow, you can use the Convert column type operation to ensure that the join key columns’ data types match. Click Steps to see the snapshot view of the steps.

The join types include:

Join type Description
Left join Returns all rows in the original data set and return only matching rows in the joining data set. Returns one row in the original data set for each matching row in the joining data set.
Right join Returns all rows in the joining data set and return only matching rows in the original data set. Returns one row in the joining data set for each matching row in the original data set.
Inner join Returns only the rows in each data set that match rows in the other data set. Returns one row in the original data set for each matching row in the joining data set.
Full join Returns all rows in both data sets. Blends rows in the original data set with matching rows in the joining data set.
Semi join Returns only the rows in the original data set that match rows in the joining data set. Returns one row in the original data set for all matching rows in the joining data set.
Anti join Returns only the rows in the original data set that do not match rows in the joining data set.


Sample
Generate a subset of your data by using one of the following methods. Sampling steps from UI operations apply only when the flow is run.


Split column
Split the column by non-alphanumeric characters, position, pattern, or text.

Restriction: The Split column operation is not supported when you run a Data Refinery flow on a Hadoop cluster.


Union
Combine the rows from two data sets that share the same schema and filter out the duplicates. If you select Allow a different number of columns and allow duplicate values, the operation is a UNION ALL command.


Tip: If you receive an error about incompatible schemas, check if the automatic Convert column type operation changed the data types of the first data set. Delete the Convert column type step and try again.

NATURAL LANGUAGE

Remove stop words
Remove common words of the English language, such as “the” or “and.” Stop words usually have little semantic value for text analytics algorithms and models. Remove the stop words to reduce the data volume and to improve the quality of the data that you use to train machine learning models.

Optional: To confirm which words were removed, apply the Tokenize operation (by words) on the selected column, and then view the statistics for the words in the Profile tab. You can undo the Tokenize step later in the Data Refinery flow.


Tokenize
Break up English text into words, sentences, paragraphs, lines, characters, or by regular expression.

Restriction: The Tokenize operation is not supported when you run a Data Refinery flow on a Hadoop cluster.