Get started with Informix Warehouse Feature, Part 2: Extract, load, and transform your data in Design Studio

The new IBM® Informix® Warehouse Feature provides an integrated and simplified software platform to design and deploy a warehouse repository on your existing IBM Informix Dynamic Server (IDS) infrastructure. This tutorial, the second part of a series, gives you a hands-on and example-driven view of the Informix Warehouse Client component, the Design Studio. Follow the steps for designing and testing the data movements and transformations (Extract-Load-Transform, or ELT jobs) in the form of data flows and control flows that will do the task of populating the new data warehouse repository you created in Part 1 of this series.


Pamela Siebert (psiebert@us.ibm.com), IDS Staff Software Test Engineer, IBM

Pamela Siebert is a quality assurance software tester and has been working with various Informix and DB2 database server products for over five years. Currently, she works with Java tools and connectivity, with a focus on drivers for IBM data servers.



Veronica Gomes (vgomes@us.ibm.com), IT Specialist, IDS Competitive Technologies and Enablement, IBM

Veronica Gomes is an IT specialist in the IDS Competitive Technologies and Enablement team. She previously worked as a specialist in IBM Information Management products in IBM ChannelWorks and was a member of the Informix and DB2 Technical Support team for several years.



Manjula Panthagani (manjulap@us.ibm.com), Informix Advanced Technical Support Engineer, IBM

Manjula Panthagani is an Informix advanced technical support engineer. She has been working as a down systems engineer for over eight years. She is involved in the development of the Cheetah certification exam and tutorials. She has worked with Informix Warehouse Feature for a year.



12 November 2009


Before you start

About this series

This tutorial series, Get started with Informix Warehouse Feature, introduces you to the highlights and capabilities of the new client and server software tools available to you in the Informix Warehouse option. These tools enable you to create and deploy a project for modeling the databases in your Informix data warehousing platform, and for automating the data transformation and movement processes to easily acquire and integrate data from different sources into your Informix warehouse.

The tutorials in this series are organized following this line of deployment: First, you will use the client component of the Informix Warehouse software—the Design Studio—to create the data warehouse (DW) project around Informix and build the data models for source and target databases. Then, you will continue using Design Studio to create and test the jobs for extracting, loading, and transforming (ELT) data from various sources (typically, OLTP databases and external files) into the target warehouse repository on Informix. Finally, the series illustrates the use of the server components in the Informix Warehouse—the Administration Console and SQW services—to manage and monitor resources associated with your warehouse, including the ELT jobs you designed earlier in Design Studio.

About this tutorial

This tutorial is the second part of the Get started with Informix Warehouse Feature series. It introduces data flows and control flows, as well as the ELT operators used to design data flows. You will get familiar with the components of a data warehouse project; use and test data flow operators, ports, and connectors; and define control flows, get familiar with their components, and use and test them. Throughout, you will continue using Design Studio to create and test the jobs for extracting, loading, and transforming (ELT) data from various sources (typically, OLTP databases and external files) into the target warehouse repository on Informix.

Objectives

When you have completed this tutorial, you will:

  • Understand how Design Studio provides a graphical development platform for creating SQL-based Extract-Load-Transform (ELT) jobs
  • Understand the difference between data flows and control flows, and how they form the ELT activities of a warehouse application
  • Understand the capabilities for SQL-based ELT jobs provided by the Informix Warehouse Feature
  • Design applications for populating warehouse and mart tables using SQL-based data flows and control flows
  • Test data flows and control flows in your warehouse application using the execution and debugging capabilities in Design Studio

Prerequisites

Downloading the required products

Some of the products mentioned in this tutorial are available in trial versions (see the Resources section). Informix Dynamic Server Developer Edition is available at no cost. You can now also get the Informix Warehouse Feature Trial edition (see Resources). For any software that you do not find available in trial or demo resources, please consult with your IBM representative.

This tutorial is written based on an Informix Dynamic Server and Informix Warehouse installation on a Microsoft® Windows® platform. However, most of the information in this tutorial applies in the context of a solution on UNIX® or Linux® too. The tutorial is written for Informix database users who are familiar with and have working knowledge of Informix tools such as dbaccess, SQL language and scripts, Windows environments, and Eclipse-based IDE/GUI tools. You should also have skills and experience in relational database design, ER models, data warehousing and ETL tools.

This tutorial assumes that you have:

  • Informix Warehouse Feature Client installed. For more information about how to install the product, please refer to the Resources section.
  • Informix Dynamic Server (IDS) 11.50.xC3 or later installed and a basic working knowledge of IDS. For download information, please refer to the Resources section.
  • Completed the following basic installation and configuration steps from Part 1 of this series:
    • Followed the prerequisites for Part 1, including creating both stores_demo and sales_demo databases.
    • Read the introductory information about Informix Warehouse Feature.
    • Read the business scenario for the tutorial.
    • Created a data warehousing project in Design Studio with connections to both source and target databases.
    • Created the physical data model for source OLTP database stores_demo from reverse engineering.
    • Created the physical data model for target data mart database sales_demo from an empty template, with new tables: time, customer, product, geography, and sales.

System requirements

To run the examples in this tutorial, you need a system with at least the Informix Warehouse Feature client installed and an Informix Dynamic Server installed, with a database server instance up and running (local or remote).

You should also have already followed the steps in Part 1 of this series.

Additional steps required on the Informix database instance for the source database stores_demo before starting this tutorial

  1. Make sure you have the environment variables set at the Informix instance level, including the default setting DBDATE=MDY4/, which is needed for loading additional rows into the orders table (a minimal sketch of this setup follows this list).
  2. Download and uncompress into a temporary directory the following files (provided in the Downloads section):
    • add_orders.zip
    • add_items.zip
  3. Log on to the Windows operating system, and connect to the database stores_demo using the same userid admin that you used to create the database stores_demo in Part 1, or with another user that has resource privileges on the database:
    dbaccess stores_demo -
    > load from 'add_orders.unl' insert into orders;
    > load from 'add_items.unl' insert into items;
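
As a minimal sketch, assuming the temporary directory C:\temp\IW_tutorial_files suggested elsewhere in this tutorial, the Windows session could begin by setting DBDATE before launching dbaccess:

    rem Set the date format expected by the .unl files for this session
    set DBDATE=MDY4/
    cd C:\temp\IW_tutorial_files
    dbaccess stores_demo -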

Additional steps required on the Informix database instance for the target database sales_demo before starting this tutorial:

  1. Connect to the database sales_demo using the same userid admin you used when creating the sales_demo database.
  2. Make sure you have the environment variables set at the Informix instance level to use dbaccess against the target database sales_demo.
  3. Download and uncompress into a temporary directory the following file (provided in the Downloads section):
    • sales_demo_alter.zip
  4. Add primary key and foreign key constraints by running the provided SQL script (sales_demo_alter.sql) against the target database sales_demo (a hypothetical sketch of its contents follows this list):
    dbaccess sales_demo sales_demo_alter.sql
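
The authoritative statements are in the provided sales_demo_alter.sql file; a hypothetical sketch of the kind of constraints such a script adds, assuming the star schema column names from Part 1, could look like this:

    { Sketch only: the downloaded sales_demo_alter.sql is the authoritative version }
    alter table customer add constraint primary key (customer_code);
    alter table product add constraint primary key (product_code);
    alter table geography add constraint primary key (district_code);
    alter table sales add constraint foreign key (time_code) references time;
    alter table sales add constraint foreign key (customer_code) references customer;
    alter table sales add constraint foreign key (product_code) references product;
    alter table sales add constraint foreign key (district_code) references geography;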

Additional steps required on the Informix Warehouse client (Design Studio) machine before starting this tutorial:

  1. Log in on the operating system where the Informix Warehouse client (Design Studio) is located.
  2. Download the following files (provided in the Downloads section):
    • time.zip
    • cost.zip
  3. Uncompress the two zip files into the same directory on the Informix Warehouse Client machine: C:\temp\IW_tutorial_files. The uncompressed files are the UNL data files time.unl and cost.unl, which are used in this part of the series to populate the dimension table sales_demo:time and the staging table sales_demo:cost using Design Studio.
    • Notice that the file time.unl contains date values in the format DBDATE=MDY4/, which is not the default date format in Design Studio. In this tutorial, you will learn how to configure the JDBC connection to the target data mart database, and the input file format, so that both use the same date format as a plain text file whose data is loaded into a table in the target data warehousing system.

Introduction to data flows and control flows for ELT activities

Data flows

Data flows model the SQL-based data movement and transformation activities that execute in the IDS database. A data flow consists of activities that extract data from flat files or relational tables, transform the data, and load it into a relational table in a data warehouse, data mart, or staging area. Data can also be exported to flat files.

The Design Studio provides a graphical editor with an intuitive way to visualize and design data flows. Graphical operators model the various steps of a data flow activity. By arranging these source, transform, and target operators in a canvas work area, connecting them, and defining their properties, models can be created that meet business requirements. After creating data flows, you generate their SQL code, which creates the specific SQL operations that are performed by the execution database when you run a data warehouse application.

A data flow consists of operators, ports, and connectors.

Data flow operators

Working offline in Design Studio

Creating the data models in Design Studio (as you did in Part 1 of this series), associated with JDBC connections to the source and target systems of the warehousing project, identifies the source and target tables used in the data movement activities and allows you to work offline while designing the data flows and control flows.

These data models allow the ELT designer to select the tables that are part of the different ELT jobs in data flows and control flows, removing the need to be connected to the source and target databases until it is time to test or debug the data flows and control flows.

Data flow operators represent the source, transform, and target steps in a data flow. The operators—graphical objects that you can select from a palette and drop in the canvas work area—form the nodes in a data flow diagram. Each type of operator has a specific set of input and output data ports, one or more lists of data columns, and a number of properties that define exactly how each operator moves or transforms data. Figure 1 illustrates a simple data flow showing different types of operators:

Figure 1. Simple data flow showing types of operators
Simple data flow showing types of operators

There are three categories of data flow operators:

  • Source operators that cause data to be brought into the data flow from a persistent source, typically a relational table or a flat file.
  • Target operators that cause data processed in the data flow to be persistent, typically in a relational table or into a flat file.
  • Transform operators that perform some type of action on the data as it moves from source to target.
Source operators
Source operators represent some type of data that is input to your data flow. There are two types of source operators:

Table source operator
A table source operator represents a JDBC connection to a relational table in a database management system and corresponds closely to the SELECT clause of an SQL statement. Table source operators can be local or remote, relative to the execution database of a data flow.

A table source operator provides column information, including column names and data types, to other operators in the data flow and is typically populated from the physical data model that is referenced by a Data Warehouse project.
File import operator
A file import operator makes data in a file available to a data flow by presenting it, through its output port, to a downstream operator as if it were a virtual table. It may be connected directly to input ports of transform or target operators. This means that, for instance, you could use file import operators along with table source operators as inputs to a join operator, treating a file import node as if it were a table.
Target operators
Target operators represent some type of persistent store into which the data flow will insert or update data, and they identify the techniques with which that store will be updated. There are two categories of target operators:

File export operator
The file export operator exports data to delimited flat files. In addition to the obvious use of a file export operator to export data from a relational table, you might also use it to receive discarded rows from other operators, such as the distinct or key lookup operator.
Table target operator
The table target operator emulates standard INSERT, UPDATE, and DELETE operations against a JDBC relational table (typically on IDS where the warehouse repository will reside). Table target operators can be local or remote, relative to the execution database of a data flow.

Target table operators provide three ways to update the data in a target table (a rough SQL sketch follows this list):
  • INSERT - Adds rows from the input data set to the target table without any changes to existing rows.
  • UPDATE - Matches rows from the input data set with rows in the target table. If a match is found, the row is updated; if no match exists, no action is taken.
  • DELETE - Checks the target table for rows that meet a specified condition and deletes those rows.
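
Roughly, these options correspond to the following SQL forms on the target, where input_set is only an illustrative placeholder for the operator's input data set:

    insert into time select * from input_set;  { INSERT: append the input rows }

    update customer  { UPDATE: change matched rows only }
    set customer_name = (select i.customer_name from input_set i
        where i.customer_code = customer.customer_code)
    where exists (select 1 from input_set i
        where i.customer_code = customer.customer_code);

    delete from sales where time_code < 100;  { DELETE: remove rows that meet a condition }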
Transform operators
Transform operators represent some type of transformation action taken against the data flowing into the operator.

Custom SQL operator
The custom SQL operator is used to add one or more embedded SQL statements that modify the database in some way. The operator can have one or more input ports, but it does not have an output port. The input ports make the input schemas available to the expression builder, but any table in the execution database may be referenced in the supplied SQL.
Data station operator
The data station operator is used to define an explicit staging point in a data flow. Staging is done to store intermediate processed data for the purpose of tracking, debugging, or ease of data recovery. Staging may occur implicitly within a data flow, as determined by the data flow code generator, but does not persist after the data flow execution. However, you may want to have data put into a persistent store at various points during processing, perhaps as logical recovery points. The data station operator allows you to define when to put the data into a persistent store.

There are a number of reasons to persist intermediate data during processing. During development, you may want to view the data after a certain operation to ensure that the processing was done correctly. So, you can add a data station in the middle of a data flow for debugging purposes. You may also need to keep track of the data at certain points in the processing for audit purposes or perhaps to provide a recovery point. A data station can be added at the appropriate points in the data flow.

There are four storage types for data stations:
  • Persistent table - Stores the data in a permanent relational table that exists in the data model. You can optionally specify that all data be deleted from the table after the data flow has executed.
  • Temporary table - Stores the data in a temporary table object (a regular table) during the data flow; the data does not persist after the execution of the data flow.
  • View - Influences code generation so that data is not implicitly persisted; this is useful in data mining scenarios.
  • Flat file - Persistently stores the data in a flat file, which may be useful in scenarios where you want to use a bulk loader to load data from a flat file. One flat file data station operator can be used instead of a file export target and a file import source.
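
For example, a temporary table data station roughly corresponds to staging an intermediate result like the following sketch, where the table and column choices are illustrative:

    select o.order_num, o.order_date, i.quantity, i.total_price
    from stores_demo:orders o, stores_demo:items i
    where o.order_num = i.order_num
    into temp stage_order_items with no log;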
Distinct operator
The DISTINCT operator simply removes duplicate rows from the input table, passing unique rows to the result port and duplicates to the discard port. When there are duplicates, the first row found is passed to the result port and the rest to the discard port; there is no guarantee of the order of processing.
Group by operator
The GROUP BY operator groups rows in a data set and summarizes values in designated columns using COUNT, SUM, MIN, MAX, AVG, and other functions, thereby emulating the SQL GROUP BY clause. A GROUP BY operator is similar to a SELECT LIST operator, except that it provides the additional functionality of a GROUP BY property.
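
For instance, the summarization performed by a group by operator corresponds to SQL such as the following, modeled on the sales fact aggregation shown later in Listing 1:

    select o.customer_num, sum(i.quantity) as units_sold, sum(i.total_price) as revenue
    from stores_demo:orders o, stores_demo:items i
    where o.order_num = i.order_num
    group by o.customer_num;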
Order by operator
The ORDER BY operator will sort the input data according to the values in one or more designated columns, passing all of the columns of the entire row to the result port.
Select list operator
The select list operator emulates the SELECT clause of an SQL statement. It can be used to add, drop, split, modify, or combine columns of a data set. Columns can be added and modified using scalar functions, arithmetic, and constant values. Certain scalar functions, such as CONCATENATE and DATE, and column expressions containing arithmetic functions enable you to effectively combine values from multiple columns into a single column. Other functions enable you to change the data type of a column. From the column list property of the select list operator, you can access the expression builder, which makes it easier to create complex column expressions by providing interactive lists of available input columns, scalar functions, and boolean and arithmetic operators.
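
As an illustration, a select list operator that builds a derived column corresponds to SQL such as the customer name concatenation used later in Listing 1:

    select c.customer_num, trim(c.fname) || ' ' || c.lname as customer_name, c.company
    from stores_demo:customer c;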
Table join operator
The TABLE JOIN operator does exactly what the name indicates; it performs joins between two or more relational tables. It supports inner joins, left outer joins, right outer joins, and full outer joins.

By default, the TABLE JOIN operator has two input ports and four output ports. You may add any number of input ports by clicking on the add a new port icon, found underneath the last input port. The four output ports represent the join types: inner, left outer, right outer, and full outer; and the output ports that have connections define the type of joins to be done. If you have only two input ports, then you may use any combination of the four output ports. However, if there are more than two input ports, then outer joins are not allowed.
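
For reference, an inner join produced by a table join operator corresponds to SQL such as the three-table join used later for the product dimension in Listing 1:

    select a.catalog_num, m.manu_code, m.manu_name, s.stock_num, s.description
    from stores_demo:catalog a, stores_demo:manufact m, stores_demo:stock s
    where a.stock_num = s.stock_num
    and a.manu_code = s.manu_code
    and s.manu_code = m.manu_code;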
Union operator
The UNION operator performs the union operation of two sets. There are two inputs that are processed according to the selected set operation and passed to the result output port. All columns of each row are passed through. The set operator type is selected in the set details property tab.

The UNION operator unconditionally merges two sets of input rows into a single output data set, removing any duplicate rows.
Where condition operator
The WHERE CONDITION operator is used to implement filtering of the input data based on a condition with matching rows flowing to the result port. There is no discard port on the WHERE CONDITION. The SPLITTER operator can be used when there is a need for multiple conditions and outputs.
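
In the generated SQL, a where condition is a simple row filter. For example, based on the region assignments used later in Listing 1, a splitter with two output ports would generate one such filter per port:

    { rows routed to the first output port }
    select district_code, district_name from geography where state_code = 'CA';
    { rows routed to the second output port }
    select district_code, district_name from geography where state_code <> 'CA';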
Fact key replace operator
The FACT KEY REPLACE operator looks up surrogate keys from dimension tables or key mapping tables and uses them to replace corresponding natural keys in a fact table. The input consists of a data table, one or more lookup tables, and an output table. Natural and surrogate keys are identified for each lookup table, and the natural keys of the lookup tables mapped to the corresponding column in the input table.
Key lookup operator
The KEY LOOKUP operator is used to compare keys from a single input table with keys in one or more lookup tables and to discard input table rows that do not have matching rows in the lookup tables. Rows that match successfully are sent to the output port; otherwise, they are sent to the discard port. Using the select list properties, you can select a subset of columns, add columns, and use expressions from the input table or lookup tables.
Pivot operator
The PIVOT operator groups data from several related columns into a single column by creating additional rows in the result. In other words, it turns column-oriented data into row-oriented data.
Splitter operator
The SPLITTER operator takes one input and multiple outputs based on specified criteria. Each output may be different in terms of columns and rows, and does not need to contain an exclusive set of rows. The property of the SPLITTER is primarily the filter condition associated with each output port. The mapping of columns is handled using the port connections.
Unpivot operator
The unpivot operator does just the opposite of the pivot operator by using repeating values (months or quarters, for example) in a single column, called the data group, as headings for a new set of columns. Values from a specified set of input columns, called unpivot columns, are arranged under the new column heading according to the pivot group value found in each row. Other columns, called key columns, define the new set of rows.
Subflows
A subflow is a data flow that can be embedded inside another data flow. Subflows can also be embedded in another subflow, allowing the nesting of flows. The primary advantage of the subflow is that it is reusable across multiple data flows.

Subflows can be helpful when there is a series of operations that are the same across a number of data flows. One subflow can be created and connected inline into each data flow. Subflows can also be useful to simplify a complex data flow.

Control flows

A control flow models a sequence of one or more data flows and integrates other kinds of data processing activities. Typically, you will develop a number of data flows for a particular warehouse application, and these data flows may have dependencies on the order of execution, such that certain flows need to execute before other flows according to some type of processing rules. Design Studio provides a graphical editor in which you can define the order of execution of these related data flows. These sequences are called control flows.

In addition to sequencing data flows, a control flow can execute other data processing functions, such as operating system scripts, IDS SQL scripts, batch executables, FTP, sending e-mail notifications, and others. A control flow contains data flow and non-data-flow activities, such as command execution and file wait, for example. You cannot deploy data flows directly; they have to be included in a control flow.

The Design Studio provides a graphical editor with an intuitive capability to visualize and design control flows. Graphical operators model various SQW and data processing activities. By arranging these operators in a canvas work area, connecting them, and defining their properties, you can create work flow models that define the sequence of execution of the activities. Figure 2 shows a simple control flow, showing the operators:

Figure 2. Simple control flow, showing the operators
Diagram of a simple control flow

Control flow operators

Control flow operators represent some type of data processing activity to be executed in the sequence of the control flow. Operators are graphical objects that are dragged from the palette and dropped onto the editor canvas, and form the nodes of the control flow sequence. Each operator represents a specific type of activity and has properties to define that activity. There are two categories of control operators:

Common operators
SQW flow operators
The SQW flow operators represent the data flows that are developed within SQW. These flows, developed using the Design Studio editor, must be available in the same Data Warehouse project folder as the control flow.

Data flow operator
A data flow operator represents a data flow in the control flow sequence. The data flow must exist in the same Data Warehouse project. Drag the data flow operator onto the canvas, connect an output port from the previous operator in the sequence, and define the properties. The properties consist of a pointer to the data flow and logging/tracing information.
Command operators
The command operators are used to execute batch code that can be invoked using some type of command line interface. It is not recommended to execute commands that prompt for user input. Commands can be operating system scripts, IDS scripts, executable programs, or the FTP command, and all are supported by the same command operator. The type of command is a property of the command operator.

Command
With the command operator, you can run executables like shell scripts or the FTP command. Choose the type of command line operation to be used. The appropriate properties are displayed.

The command that is invoked with the command operator must terminate with an exit value. This exit value is checked to determine whether the command completed successfully or not. A return code of 0 is considered to be a successful completion.
Secure command
Invoke a command on a remote system using the SSH protocol. The secure command operator provides the same functionality as the command operator, but uses a secure connection between the Design Studio client and a remote server to run a script or a command on the remote computer.
Stored procedure
The stored procedure operator executes an existing stored procedure through a control flow. A stored procedure is a user-written application program that is stored on the database server and runs inside the database. An existing stored procedure can be invoked inside a control flow. Stored procedures are typically used to implement custom code that executes on the database. Some common uses include row-by-row processing of data and invoking external or third-party functionality. The stored procedure uses parameters to complete its logic when it runs.
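
As a minimal sketch, with a procedure name and logic that are only illustrative (based on the region assignment in Listing 1), a control flow could invoke an SPL routine such as:

    create procedure set_regions()
        update geography set region = 1 where state_code = 'CA';
        update geography set region = 2 where state_code <> 'CA';
    end procedure;

    execute procedure set_regions();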
Secure FTP
Securely copy one or more files from or to a remote host, using the SFTP protocol. The secure FTP operator transfers one or more files to and from a remote computer using the SSH File Transfer Protocol (SFTP).
Control operators
Start
Every control flow must have one and only one start operator. When you create a control flow, a start operator is placed on the canvas automatically. Because of this, you cannot select a start operator from the palette. Like other operators, you can select it in order to view and define its properties.
Continue
Proceed unconditionally with the next iteration in a processing loop performed by an iterator operator.
Break
Break out of a processing loop performed by an iterator operator. The break operator provides a way to prematurely terminate an iteration.
Fail
Explicitly cause the control flow to fail and proceed to the next On Failure path, if any.
End
End a series of activities or a control flow.
File wait
Check for the existence or non-existence of a file. The file wait operator can pause the control flow execution for a period of time while the system checks for the existence or non-existence of a specified file. You can specify the amount of time to allocate to this operator.
File write
Write the specified text to the default execution log file or a specified file. The file write operator is used to diagnose the status of the control flow at several stages and to track the values of variables at each stage.
Iterator
The iterator operator is used for looping over a set of control flow operators, causing the operators to be repeated until certain conditions are met. There are three types of iteration supported:
  • Repeat loop a fixed number of times: The number of times to execute the loop is based on integer values for the starting value, the step increment, and the end value. The current value is available in the defined iteration variable.
  • Repeat loop for each delimited data item in a file: This reads a delimited data file and loops once for each value found between the defined delimiters, including whitespace. A comma-delimited file with the values 1, 6, 18, 22 will loop four times, in order, passing the current value in the defined iteration variable.
  • Repeat loop for each file in a directory: This loop technique reads the names of all the files in a directory and loops once for each, making the file name available in the defined iteration variable.
Parallel container
Group a set of activities that can run in parallel.

You can design control flows that support parallel processing and scheduling. The parallel container operator groups together independent activities that can run concurrently.

The parallel container operator defines activities that have no dependency on each other, but fall at the same point in the overall process flow. Therefore, they qualify for parallel execution.
Subprocess
Run a subprocess.
Variable assignment
Assign a variable to a fixed value or to another variable.
Variable comparison
Compare a variable with a value, and apply conditional processing logic based on the result.

The variable comparison operator allows for branching logic in a control flow. A specified variable is compared to either a constant value or to another variable. The result is either a true or a false condition. (There can also be an error condition that would result from a type mismatch.) The variable compare operation can be numeric (=, <, >, ...), string (equals, substring, ...), and boolean, or it can test for null.
Datastage operators
DataStage parallel job operator
A DataStage job operator represents a DataStage parallel job. Parallel jobs can significantly improve performance because the different stages of a job are run concurrently rather than sequentially. There are two basic types of parallel processing: pipeline and partitioning.
DataStage job sequence operator
A DataStage job sequence operator represents a DataStage job sequence. A job sequence is a reference to a single job in the DataStage server whose type is job sequence. A job sequence specifies the processing sequence for a set of DataStage jobs.
Notification operators
Notification operators are used as an alert for some type of important event that has taken place in the control flow. This is typically used for notifying an administrator that some error has occurred, but could be an event such as the successful completion of the control flow. The control flow uses an e-mail operator to accomplish notification.

E-mail notification
The e-mail operator simply sends an e-mail to a specified e-mail address with the provided message. There are properties for the sender e-mail address, the recipient e-mail address, the subject text, and the message text. The SMTP server is not specified in the operator, but is defined in the runtime environment as a system resource. E-mail will not be sent when executing a control flow test within the Design Studio.
Informix operators
The Informix control flow operators are Informix-specific operators for handling fragmentation and SQL scripts.

Attach partition
Attaches a partition from another table to an existing partitioned table.
Detach partition
This detaches a partition from an existing partitioned table to a separate table.
IDS custom SQL
This operator is used to execute SQL commands as part of the control flow. You can write any number of statements, both DDL and DML. The statements are executed against the database that you specify on the General tab of the operator's properties.
IDS SQL script
IDS SQL script operators run SQL script files on the IDS engine. The IDS SQL script operator works only with files that are on the database server. When you run a flow that contains the IDS SQL script operator, a stored procedure called INFORMIX.DS_EXECSQLSCRIPT_FILE will be created and executed on the IDS server, if one does not exist already.
Update statistics
This operator performs an UPDATE STATISTICS operation to gather statistical information about database tables and data distribution, and it records the information in the system catalog tables. The UPDATE STATISTICS code can either be generated by choosing various property options, or it can be coded manually.
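
The generated or manually coded statement is standard Informix SQL; for example, with illustrative table and column choices:

    update statistics medium for table sales;
    update statistics high for table sales (customer_code, time_code);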

Business scenario

This tutorial uses the Informix Warehouse Design Studio to design SQL-based Extract-Load-Transform (ELT) activities that simplify and accomplish a job similar to the one done by the SQL script code in Listing 1.

The code in Listing 1 is used as guidance for creating a similar extract-load-transform process that you will graphically design and test using Informix Warehouse Design Studio. You will start by populating the dimension tables (Time, Customer, Geography, and Product), and then the fact table, Sales. There is also a staging table, Cost, used only for the purpose of loading the Sales fact table.

Listing 1. Sample SQL load data script code intended for sales_demo database
{ ************************************************************************* }
{                                                                           }
{   Licensed Materials - Property of IBM                                    }
{                                                                           }
{   "Restricted Materials of IBM"                                           }
{                                                                           }
{   IBM Informix Extended Parallel Server                                   }
{   (c) Copyright IBM Corporation 1991, 2002 All rights reserved.           }
{                                                                           }
{ ************************************************************************* }

{ The following statement connects to the stores_demo database }
connect to "stores_demo";

{ The following statements add rows to orders and items table of stores_demo } 
load from 'C:\temp\IW_tutorial_files\add_orders.unl' 
insert into stores_demo:orders;

load from 'C:\temp\IW_tutorial_files\add_items.unl' 
insert into stores_demo:items;

{ The following statement connects to the sales_demo database }
connect to "sales_demo";

{ The following statement loads data from the costs.unl file }
{ into the costs table, which is used to load the sales fact table. }
load from 'C:\temp\IW_tutorial_files\costs.unl'
insert into cost;

{ The following statement loads data into the time dimension table. }
load from 'C:\temp\IW_tutorial_files\time.unl'
insert into time;

{ The following statement loads data into the geography dimension table. }
insert into geography (district_name, state_code, state_name)
select distinct c.city, s.code, s.sname
from stores_demo:customer c, stores_demo:state s
where c.state = s.code;

update geography   	 { converts state_code values to region values }
set region = 1
where state_code = "CA";

update geography
set region = 2
where state_code <> "CA";

{ The following statement loads data into the customer dimension table. }
insert into customer (customer_code, customer_name, company_name)
select c.customer_num, trim(c.fname) || " " || c.lname, c.company
from stores_demo:customer c;

{ The following statement loads data into the product dimension table. }
insert into product (product_code, product_name, vendor_code,
 vendor_name, product_line_code, product_line_name)
 select a.catalog_num, 
 trim(m.manu_name) || " " || s.description,
 m.manu_code, m.manu_name, s.stock_num, s.description
 from stores_demo:catalog a, stores_demo:manufact m,
 stores_demo:stock s
 where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
 s.manu_code = m.manu_code;

{ The following statement loads data into the sales fact table. }
insert into sales (customer_code, district_code, time_code, product_code,
units_sold, revenue, cost, net_profit)
select c.customer_num, g.district_code, t.time_code, p.product_code,
SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
SUM(i.total_price) - SUM(i.quantity * x.cost)
from stores_demo:customer c, geography g, time t, product p,
stores_demo:items i, stores_demo:orders o, cost x
where c.customer_num = o.customer_num and o.order_num = i.order_num and
p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
 t.order_date = o.order_date and  p.product_code = x.product_code and
 c.city = g.district_name
 GROUP BY 1,2,3,4;

Designing the data flow for loading dimension table Time

Data flows will be used to define the SQL-based data transformation activities that extract data from flat files (time.unl and cost.unl) and relational tables (in the source OLTP database stores_demo), transform the data, and load it into the tables within the data mart sales_demo.

The Design Studio provides an intuitive way to visualize, design, test, and debug data flows. In this section, you are going to build one data flow for each target table in the sales_demo database that needs to be fully loaded for the first time.

First-time load, versus periodic updates

In this tutorial, you are designing the data flows and control flows to populate the target data mart sales_demo for the first time. Another set of control flows will be needed for the periodic maintenance of the data mart tables based on the changes that occur in the source systems after the data mart has been populated the first time.

You will create one data flow per table in sales_demo:

  • time (dimension table: Time)
  • cost (staging table: Cost)
  • customer (dimension table: Customer)
  • geography (dimension table: Geography)
  • product (dimension table: Product)
  • sales (fact table: Sales)

Let's start with the dimension table Time.

Data flow for loading dimension table sales_demo:time

The Time dimension table in the sales_demo database is populated from the data contained in the plain text file time.unl, which is provided in the Downloads section of this tutorial.

Listing 2 shows, in the form of an SQL script, the type of ETL activity that loads the dimension table Time for the first time; you will recreate it in Design Studio as a graphical SQL-based ELT data flow:

Listing 2. Populating the table sales_demo:time using SQL
{ The following statement connects to the sales_demo database }
connect to "sales_demo";

{ The following statement loads data into the time dimension table. }
load from 'C:\temp\IW_tutorial_files\time.unl'
insert into time;

The following steps implement this SQL-based ETL activity as a data flow in Informix Warehouse Design Studio:

  1. Right-click on the Sales Demo DW-ELT project's Data Flow folder, and select New > Data Flow, as illustrated in Figure 3:
    Figure 3. Creating a data flow
    Creating a data flow
  2. The New Data Flow wizard appears. Change the Data flow name to populate_time, and select the radio button for Work against data models (Offline). This reduces the performance impact when working on data flows.
    Figure 4. New Data Flow wizard for Time dimension table
    New Data Flow wizard for Time dimension table
  3. Click on Finish.
  4. An empty data flow canvas for the new data flow populate_time appears on the right side of the workspace.

File import operator to get data from source file time.unl

  1. Drag a file import operator from the General Operators palette on the right, and drop it onto the canvas for the populate_time data flow, as Figure 5 shows:
    Figure 5. Adding a new file import operator into the populate_time data flow
    Adding a file import operator for populate_time data flow

DBDATE

The default DATE data type format in Design Studio is YYYY-MM-DD, whereas time.unl uses Informix's default DBDATE format MDY4/, which is MM/DD/YYYY. In order to successfully import time.unl in a data flow, you need to configure DBDATE to the right value in the JDBC connection for sales_demo. You may also want to create a new file format that contains more properties than the basic ones; one of these non-basic file format properties lets you set the DATE format to a value other than the default.

  1. The File Import operator wizard appears, displaying the General properties page. Change the Label to a meaningful name, such as file_import_time_unl.
  2. Click on Browse to locate the data input file time.unl, which should be under C:\temp\IW_tutorial_files.
  3. Click on Open once you locate the file to import, as illustrated in Figure 6:
    Figure 6. Filling in information for File Import source to load Time dimension
    File Import operator for loading Time dimension
  4. You are taken back to the General properties page. Click on Next to move on to the next property page of this file import operator.
  5. The File Format properties page appears. This is where you provide all the information about the type of file you are importing data from, the fields contained within the file, and their data types.
    1. In the Delimiter field, enter the pipe character (|) to overwrite the default delimiter, since time.unl is a pipe-delimited plain text file.
    2. Click on Generate File Format so Design Studio reads the time.unl file and produces an initial discovery of its format. Figure 7 illustrates these steps:
      Figure 7. Generating a file format based on reading the pipe-delimited plain text file
      Generating file format of a pipe-delimited data file
  6. The Generate File Format page appears for you to specify the Format Type and File Selection properties.
    1. Select Generic Delimited for Format type.
    2. Leave the default CP1252 character set for the Encoding field.
    3. Verify that Field delimiter contains a pipe (|).
    4. This screen allows you to preview the ASCII data file to make sure the properties are right. Once all the file type and selection properties are verified, similar to as shown in Figure 8, click on Next:
      Figure 8. Generate File Format page to enter file type information
      File type information
  7. The Column Specification property page appears. This is where the fields and their datatypes within the file time.unl are described based on the columns and data types to be loaded in table sales_demo:time.
    1. Consider the sample data from the file and the values as they should look in the table, and the schema for table time in the business scenario in the first tutorial.

      The last field in time.unl, COL_8, is empty and unnecessary. The table time contains seven (7) columns, so this field COL_8 should be removed from the list of columns in the file format.

      To remove this last field/column, select it, and then click on the Delete icon on the toolbar, as illustrated in Figure 9:
      Figure 9. Column specification page and elimination of last unnecessary field COL_8
      Column specification page and elimination of last field COL_8
    2. The next step is to make corrections to the data types of the fields in file time.unl discovered by Design Studio, so that they reflect the data types that best match the columns of table time. Listing 3 shows the time table's schema.
      Listing 3. sales_demo:time schema
      create table "admin".time 
      (
      	time_code integer,
      	order_date date,
      	month_code smallint,
      	month_name char(10),
      	quarter_code smallint,
      	quarter_name char(10),
      	year integer,
      	primary key (time_code) 
      );


      Based on the schema above and the file format discovered by the tool, slight corrections need to be made; these can be done in place on the cells, under the Field list section of the Column specification page.

      Figure 10 illustrates the final status of the Field list grid after the changes are made that are compatible with both the time's schema above and the fields' data types discovered by Design Studio:
      Figure 10. Column specification's Field list after changes
      Column specification's Field list after corrections
    3. If you wish, replace the names COL_n given to the fields in the file with their corresponding column names in table time to make it easier to map the field in the file with the column in the table:
      Figure 11. Column specification's Field list after changes
      Column specification's Field list after corrections
    4. Click on Finish.
  8. You are now back to the File Format property page of the file import operator. It now shows the corrected format and the modified field names to facilitate matching the fields in the file with the columns in the table. Figure 12 shows this screen.

    In order to document the actual date format (DBDATE) used in the file, you need to configure additional non-basic properties of the file format that will require a new file format object in the data warehousing project.
    1. Click on Save to File Format to create the new file format.
      Figure 12. File Format page with corrected file format and saving file format for further use
      File Format page with corrected file format and saving file format
    2. A Save As browse window appears. Find the folder file-formats under your data warehousing project, and enter fileFormattime.fileformat as the name for the file format:
      Figure 13. Save to File Format window
      Save to File Format window
    3. Click on OK.
  9. You are now back in the File Format page again. Click on Next, as you will edit the properties of the new file format later on.
  10. The Column select property page appears. Here, you verify that all the selected columns that appear on the right are actually the ones you want to pick up/select from the source file time.unl.

    You need all the columns in the file to be used to populate the time table, so no changes are needed. However, if you want to use only some of the fields in the import data file, then this is the screen where you get to move the columns you do not need from the right side, Selected Columns, to the left side, Available Columns.

    It is important to be selective about the information that is really needed from the import file in order to avoid unnecessary data movements and loads that may impact performance of the data flow. Figure 14 illustrates this last screen, indicating that all the available columns of the file will be used in the data flow:
    Figure 14. Available columns, versus selected columns of a file import
    Available columns, versus selected columns of a file import
  11. You are now back to the canvas for data flow populate_time with the newly configured file import operator file_import_time_unl. Save (Ctrl+S) all the changes.
  12. Click on Finish.

File format for time.unl used by the file import operator

Project objects: File definitions

Design Studio allows you to save more complex file format definitions as .fileformat files so you can reuse them across several file import and file export operators, in case your data warehousing project manipulates files that use the same format.

In this section, configure the properties of the file format and fields in the plain text file time.unl.

As indicated in previous steps, the file import operator file_import_time_unl uses the file format fileFormattime.fileformat for the definition of the file type, fields, and format in the input file time.unl.

Only by using a separate file definition object, rather than the default file format settings in the file import operator, can you set up additional properties such as the ones needed for time.unl, including the date format (DBDATE).

  1. Look for the File Definitions folder under the data warehousing project, and expand it.
  2. You will see the newly created format, fileFormattime.fileformat. Right-click on this file format, and select Open With > File Format Editor, as illustrated in Figure 15:
    Figure 15. Open newly created file format in File Format Editor
    Open newly created file format in File Format Editor
  3. A special file format editor opens the file definition created for time.unl on the right side of the workspace. Edit the properties as follows:
    • Verify that the settings you used earlier for the file time.unl, such as the File type set to Delimited format, the File delimiter set to the pipe (|), the Code page set to CP1252, and the data types of each field/column in the file, are present and have the correct values.
    • In Date format, choose the non-default setting that corresponds to DBDATE=MDY4/ by selecting the value MM/DD/YYYY in the drop-down list. Figure 16 illustrates the desired settings for the file format used by time.unl:
      Figure 16. Desired settings for file format used to import time from file time.unl
      Desired settings for file format used to import time from file
  4. Click on the cross icon (X) to close the file format fileFormattime.fileformat opened with the editor and leave only the data flow populate_time, as illustrated in Figure 17:
    Figure 17. Closing the file format properties editor
    Closing the file format properties editor
  5. To make sure your file import operator file_import_time_unl uses this new file format definition in fileFormattime.fileformat file, follow these last steps:
    1. Right-click on the file import operator file_import_time_unl, and select Show Wizard.
    2. Once there, click on Next to go to the File Format page.
    3. On this page, click on the Load from File Format button on the top, and select the file format fileFormattime.fileformat you just created.
    4. Click on Open.
    5. Back on the File Format page of the File Import wizard, click on Finish to close the operator.
    6. Back to the canvas, save (Ctrl+S) all the changes.

Setting the right DBDATE in JDBC connection to sales_demo

In addition to setting the right DBDATE=MDY4/ for the date fields in the file format used to define the import data file time.unl, you need to set DBDATE=MDY4/ value as an additional JDBC property for the connection to the target database that will use it (in this case, sales_demo).

To do this, follow the steps in this section. When the settings are done, Design Studio will prompt you to reconnect to the database through JDBC.

  1. Under the Data Source Explorer panel on the bottom left side of the workspace, right-click on the connection to database sales_demo and select Properties.
  2. The properties for sales_demo database will appear at the General tab. Click on the Optional tab.
  3. Once on the Optional tab, enter the text DBDATE=MDY4/ inside the Additional properties field, and then click on Add to incorporate it into the additional JDBC parameters of the connection, as shown in Figure 18:
    Figure 18. Entering right DBDATE settings as additional JDBC connection property
    Entering right DBDATE settings as additional JDBC connection property
  4. Click on the Test Connection button to make sure the connection works fine with the new DBDATE parameter.
  5. You should receive a message saying the connection succeeded. Click on OK to close this message.
  6. Click on OK on the Properties > Optional window to close the JDBC properties settings.
  7. A message will ask you to reconnect to the database. Click on Yes to reconnect.
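
Design Studio stores this setting as a JDBC connection property. For reference, the equivalent on a raw Informix JDBC URL would look roughly like the following, where the host name, port, and database server name are placeholders for your own environment:

    jdbc:informix-sqli://myhost:9088/sales_demo:INFORMIXSERVER=myserver;DBDATE=MDY4/;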

Table target operator to load data coming from the file import into table time

  1. Drag the Table Target operator that is in the General Operators palette on the right, and drop it onto the canvas so it appears to the right of the file import operator.
  2. The table target operator's wizard will appear, on the General properties page. Change the general properties as follows:
    • Change the Label of the operator to a meaningful name, such as insert_time.
    • Enter the Description: Table Target operator to insert/load data into sales_demo:time dimension table.
    • Keep Commit interval at 0. This field can be changed as needed. For the SQL insert operation, it specifies the interval, in rows, at which the operation is committed. When the interval is zero, the entire insert operation is completed before it is committed.
    • In SQL operation, leave Insert, as you are populating the table sales_demo:time for the first time. Depending on the intention of the data flow, you may need to use the target table operator to either insert, delete, or update rows.
    • In JDBC batch size, keep 0. You may want to increase this value for performance purposes.
    • In Location, you can choose either SQL execution database, if the target database is the SQL execution database defined in the general properties of the data flow (at this point, the execution database for this data flow has not been set, so the target table time is assumed to be local to that database), or Remote database, if the target table is in another database that you want to specify at the operator level. All JDBC-compliant databases are supported, not only Informix.

      For performance purposes, it is recommended that the execution database be the same as the target database, so that no additional connections are required. Therefore, in this case, set Location to SQL execution database.

      For more information about the SQL execution database, please refer to the IBM Informix Warehouse, Version 11.50 Information Center (see Resources).
    • In Target database table, click on the ellipsis (...) button to get the Table Selection Dialog window (see Figure 19).

      Inside the window, expand the physical data model for the target data mart database sales_demo, which is named sales_demo DB Model.dbm.
    • Expand the admin schema under which you created the tables (see Figure 19).
    • Select the table time (see Figure 19).
    • Click on OK, as illustrated in Figure 19:
      Figure 19. Table selection dialog inside target table operator
      Table selection dialog inside target table operator
    • Back in the General page, select the check box to have the table use the Not-Logged Initially option.
    • Check that all the values are correct, as in Figure 20, and then click on Finish and save (Ctrl+S) all changes:
      Figure 20. Final General page in target table operator
      Final General page in target table operator

Setting the execution database and the temporary prefix for the data flow populate_time

Operators with cross or exclamation signs

At times, an operator appears on the canvas with a red cross (X) in the upper left corner. This means that its status is either inconsistent or incomplete. Normally, the cross goes away once you work through all of the operator's settings. For example, with a new operator, you may still need to connect it to another operator and then go back to its wizard to make sure all the interaction points and properties are correctly set up before the invalid status goes away.

Whenever an operator appears on the canvas with a yellow exclamation sign in the upper left corner, Design Studio has found warning-level issues that might or might not affect the execution of the data flow, such as matching a smallint data type at an input port with an integer data type at an output port.

Sometimes these alerts from the tool's validation functionality go away once you save all the recent changes. Reviewing the settings of all subsequent operators in the data flow after changing an operator (for example, by using Propagate changes across the data flow) and saving often are best practices to make sure that these alerts are not false positives and that they disappear.

To see what the error or warning is, click on the red cross (X) or yellow exclamation sign to get details about the validation issue, or go to the Problems tab in the bottom section of the workspace.

  1. Click on a blank area of the populate_time data flow's canvas, and find the Properties page at the bottom of the workspace, opened at the General tab, to set up the general properties of the data flow:
    1. In the SQL execution schema, enter admin as the default schema for the tables used in data movement.
    2. Set the SQL execution database to sales_demo using the drop-down list.
    3. In Default table space for internal staging table, leave the field blank so the IDS instance will take care of the default behavior for temp tables.
    4. In the Default database for internal staging table, leave the field blank, so the internal staging tables will be created within the SQL execution database.
    5. In Temporary Staging Table Name Prefix, enter the prefix time_ so you will be able to identify the temporary staging tables generated by different data flows.
    6. Keep the check boxes marked for optimization: Enable Optimization for Code Generation and Enable Optimization for Code Generation on Distributed Queries.
      Figure 21. Properties page settings for data flow populate_time
      Properties page settings for data flow

Connecting the output of file import operator with the input of table target operator

Clicking on the operator, versus clicking on one of its ports

Before opening the Properties page or the wizard of an operator, make sure that you have clicked on the grayed header area of the operator rather than on one of its input or output ports, unless you want to see the properties that apply specifically to that port. If you want to see the properties or wizard pages of the operator itself, make sure none of its ports is highlighted; if one is, click on that port again to deselect it, and then select the operator (on the gray area) again.

  1. Grab the output port at the file import operator file_import_time_unl to make a connection to the input port at the table target operator insert_time.
  2. Right-click on the table target operator insert_time, and select Show Wizard:
    Figure 22. Connecting output and input ports of operators in data flow populate_time
    Connecting output and input ports of operators in data flow
  3. The table target operator wizard will appear on the General page. Click on Next.
  4. The Map page of the insert_time operator appears.

    Make sure that the mapping of the fields in the file time.unl (which corresponds to input INPUT_02_0) to the columns in table time is correct. The fields of the input file are listed on the left panel for Available Columns and also in the Map from column on the right panel for Result Columns; the columns of table time appear in the Map to column of the Result Columns.
    Figure 23. Mapping input fields from file time.unl with columns in table time
    Mapping input fields from File time.unl with columns in Table time
  5. Click on Finish, and save (Ctrl+S) all changes.

Testing the execution of the data flow populate_time

It is time to test in your development environment that the newly created data flow populate_time correctly extracts data from the input data file time.unl and loads the data (with no transformation) into the target dimension table sales_demo:time.
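
Conceptually, this data flow does the same job as a simple dbaccess load script such as the following sketch (illustrative only, assuming time.unl sits in the same C:\temp\IW_tutorial_files directory used for the other source files):

{ Sketch only: the work of populate_time expressed as a plain dbaccess load }
connect to "sales_demo";

load from 'C:\temp\IW_tutorial_files\time.unl'
insert into time;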

Execute, versus debug

In cases where you need to identify execution issues in the data flow, the Design Studio menus and toolbars also offer a debug execution of the data flow. Debug mode allows you to set breakpoints, step into and step out of operators, visualize the status and number of data rows at the input and output ports of the operators, and view the current values of the variables used during the execution. To run the data flow in debug mode, select the Debug icon on the upper toolbar or the menu options Informix Data Flow > Debug Data Flow.

In order to do this execution test, follow the steps below:

  1. Right-click on the data flow populate_time located under the Data Flows folder on your project explorer, and click on Validate.

    You want to make sure that any data type mismatches and other issues are identified and resolved. The ideal situation is to get no warnings or errors from the validate function of Design Studio.
    Figure 24. Validating the data flow populate_time
    Validating the data flow populate_time
  2. Click on the Execute button on the upper toolbar of Design Studio or use the menu options Informix Data Flow > Execute to start the execution of the data flow.
    Figure 25. Executing/Testing the data flow populate_time
    Executing/Testing the data flow populate_time
  3. The Flow Execution wizard appears.

    On the General tab, you can configure several options for the execution, such as changing the default userid/schema that will execute the flow (which should not be confused with the userid in the JDBC connection) or choosing another execution database to run the process.

    On the Diagnostics tab, you can configure the tracing options and statistics information for the data flow execution and performance, as well as other diagnostics settings.

    On the Resources tab, you can verify the database connections the data flow is supposed to open against source and target systems.

    On the Variables tab, you can set up variables to customize the data flow's execution.

    Leave all the default values alone as shown in Figure 26, and click on the Execute button.
    Figure 26. Flow Execution wizard for data flow populate_time
    Flow Execution wizard for data flow populate_time
  4. A dialog indicating that the data flow operation is in progress appears.

    You can choose to run the data flow in the background using the Run in Background button, but for the purposes of this exercise, do not, because you want the Execution Result report to appear in the foreground as soon as the execution finishes.
  5. Once the execution of the data flow completes, an Execution Result report/log with detailed information appears in a separate window.

    As you can see in Figure 27, the execution of data flow populate_time was successful. You can also see the details of the execution process in this Execution Result window and have the option to save this execution log. Click on OK to close this window.
    Figure 27. Flow execution results window with details of succeeded (in this case) or failed data flow execution
    Flow execution results window with details of succeeded (in this case) or failed data flow execution
  6. If you close this window but want to see the result again, you can open the data flow populate_time and locate the Execution Status tab on the bottom of the workspace. There, you can see the history of executions of this data flow.

    A report very similar to the Execution Result report/log you just saw after running the data flow can be obtained again by clicking on the Tail Log tab on the right side. In the Variables tab, you find the reserved variables and their values at run time. Figure 28 shows the Execution Status information for the data flow populate_time.
    Figure 28. Execution status information for data flow populate_time
    Execution status information for data flow populate_time
  7. Verify directly on the Informix instance that contains the database sales_demo that its table time has been populated and now contains data.

    For instance, execute the command in Listing 4 against the Informix instance to get the count of rows in the sales_demo:time table:
    Listing 4. Get count of rows in sales_demo:time table
    dbaccess sales_demo -
    Database selected.
    
    > select count(*) from time;
    
    
          (count(*))
    
                  92
    
    1 row(s) retrieved.

Now you are ready to continue with the remaining data flows:

  • cost (staging table: Cost)
  • customer (dimension table: Customer)
  • geography (dimension table: Geography)
  • product (dimension table: Product)
  • sales (fact table: Sales)

Designing the data flow for loading staging table Cost

Data flow for loading staging table sales_demo:cost

The Cost staging table is not going to be consulted by the end users of the sports stores' sales data mart. It is used only briefly during the ETL process, for the sole purpose of storing temporary cost-related metrics that will later populate part of the data in the fact table sales_demo:sales. Therefore, the table sales_demo:cost was not illustrated in the star schema of the "Business Scenario" section in Part 1 of this series, and the table was not created within the sales_demo database as part of the database model created in Design Studio.

Listing 5 shows the type of ETL activity, in the form of SQL script, that you will implement in Design Studio in the form of an SQL-based Data Flow.

Listing 5. How the table sales_demo:cost is created and populated in SQL
{ The following statement connects to the sales_demo database }
connect to "sales_demo";

{ The following statement creates the staging table cost }
create table cost 
  (
    product_code integer,
    manu_code char(3),
    cost money(8,2)
  );

{ The following statement loads data into the staging table cost. }
load from 'C:\temp\IW_tutorial_files\cost.unl'
insert into cost;

Using a standalone data flow to create the temporary table Cost does not make sense, because Cost, being a temporary table, is dropped as soon as the data flow completes and therefore cannot stand on its own. So, instead, you will create a "sub" data flow to create and load the temporary table Cost, and then its operators will be copied into the data flow that loads the Sales fact table.

Let's create another Data Flow to perform the above steps for the cost table.

  1. Right-click on the Sales Demo DW-ELT Project's Data Flows folder, and select New > Data Flow.
  2. Enter the information for a new data flow called populate_cost, and select the Work against data models (Offline) option. Select Informix as the SQL execution database type.
  3. Click on Finish.

File import operator to get data from file cost.unl

Performance implications of creating a cost table

Informix Warehouse Feature is mostly an SQL-based Extract-Load-Transform (ELT) tool, which means that before doing any transformation on the data, the data needs to be loaded somewhere (in the default SQL execution database or in some other database set in the data flow). With this implementation approach, implicit temporary tables are created as needed to deal with data extractions and transformations.

The purpose of the table Cost is basically to hold the data from the file cost.unl, with no transformations, so that it can later be used as an input table in a 6-table join that populates the fact table Sales.

For tutorial purposes, and in order to illustrate the use of a Data Station operator, a staging table Cost is created to hold the data imported from source file cost.unl. Then, both the File Import and the Data Station operators will be copied as a subflow inside the Sales data flow, so that the temp table Cost will serve as input for the 6-table Join operator prior to loading the fact table Sales.

A more efficient approach would be to use the file import operator for cost.unl directly in the Sales data flow, as input for the 6-table Join used to load fact table Sales, because the File Import operator has an underlying temporary table associated with it, has an output port, and can hold transformations prior to the table join.

  1. Drag a file import operator from the General Operators palette on the right, and drop it onto the canvas of the populate_cost data flow.
    Figure 29. Adding a new file import operator into the populate_cost data flow
    Adding a file import operator for populate_cost data flow
  2. The file import operator wizard appears displaying the General properties page.

    Change the Label to file_import_cost_unl. Then, click on Browse to locate the data input file cost.unl, which should be under C:\temp\IW_tutorial_files. Click on Open once you locate the file.
    Figure 30. Filling in information for file import source to load Cost staging table
    File Import operator for loading Cost staging table
  3. Click on Next, and the File Format properties page appears. Type a pipe character (|) in the Field delimiter field, and click on the Generate File Format button.
    Figure 31. Generating a file format based on reading the pipe-delimited plain text file
    Generating file format of a pipe-delimited data file
  4. The File Type and File Selection property page appears.

    Here, you see a preview of the file cost.unl and configure its basic format properties. Make sure that Format type is Generic Delimited, Encoding is Cp1252, and Field delimiter is the pipe character (|). The desired settings are shown in Figure 32. Click on Next.
    Figure 32. File type and file selection page in file import for cost.unl
    File type and file selection page in File Import for cost.unl
  5. The Column Specification property page appears. This is where you match the fields in the file cost.unl with the data types.

    The last field identified by Design Studio in cost.unl, COL_4, is unnecessary. Since the table cost only contains three columns, remove this fourth field COL_4 by selecting it. Then, click on the Delete icon on the toolbar. Figure 33 illustrates this step.
    Figure 33. Column specification page and elimination of unnecessary field COL_4 in file cost.unl
    Column specification page and elimination of last field COL_4
    The next step is to correct the data types that Design Studio discovered for the fields in the file cost.unl, setting the data types that best match the column types of the table cost.

    Based on the intended schema for cost in Listing 5, and the file format of cost.unl discovered by the tool, a correction is needed: COL_3 should be of data type MONEY(8,2).

    Figure 34 illustrates the final state of the Field list grid after these small changes, which are compatible with both the intended schema of cost shown above and the field data types discovered by Design Studio.
    Figure 34. Column Specification's Field list after changes
    Column Specification's Field list after corrections
    Replace the COL_n names with the real names of the columns in table cost to make it easier to track which field in the file maps to which column in the table. Change the field names by clicking on the COL_n cell and editing the name in the Field list grid. Figure 35 shows the file format with the changed field names on the Field list grid. Then, click on Finish.
    Figure 35. Column Specification's Field list after renaming fields
    Column Specification's Field list after renaming fields
  6. You are back to the File Format property page of the file import operator, but now showing the corrected format and the modified field names. After reviewing the settings, click on Next.
    Figure 36. Final file format window
    Final file format window
  7. The Column select property page appears.

    Here, verify that all the selected columns that appear on the right side are actually the columns you want to be selected from the file cost.unl.

    You want all the columns in the file to be used to load the cost table, so no changes are needed.
    Figure 37. Available columns, versus selected columns of a file import
    Available columns, versus selected columns of a file import
  8. Click on Finish to go back to the canvas for the data flow populate_cost, where you can see the newly created and configured file import operator file_import_cost_unl.
  9. Save (Ctrl+S) all the changes.

Data station operator for temporary table Cost

As explained earlier, the Cost table is not part of the data mart model; rather, it is a staging table that stores temporary cost-related data prior to populating the fact table Sales. At this point in the tutorial, the table sales_demo:cost, with a schema like the one shown in Listing 5 above, does not exist yet.

This next section of the tutorial briefly illustrates how to use a data station operator to create, truncate, and/or populate staging tables in the form of permanent tables, temporary tables, views, or files on disk.

In this tutorial, cost is a temporary table that is created and populated at execution time and then automatically dropped as soon as the data flow execution finishes.
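
In Informix SQL terms, a TEMPORARY_TABLE data station roughly corresponds to a session-scoped temp table. The following is a minimal sketch for illustration only (Design Studio generates its own SQL), using the schema from Listing 5:

-- Illustrative only: a session-scoped temp table for cost
create temp table cost
  (
    product_code integer,
    manu_code char(3),
    cost money(8,2)
  ) with no log;

-- Rows imported from cost.unl are inserted here during the data flow run;
-- Informix drops the temp table automatically when the session ends.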

Follow the steps below to create and populate sales_demo:cost as a data station:

  1. Drag a data station operator from the General Operators palette on the right, and drop it onto the canvas for the populate_cost data flow.
  2. The Data Station Properties view appears at the bottom of the workspace. You can either edit the settings in this Properties view/page, which first displays the General tab, like the one shown in Figure 38, or right-click on the data station operator and select Show Wizard to edit the same properties in a wizard environment. For this tutorial, the wizard option is used.
    Figure 38. Default Properties view for a data station operator
    Default Properties view for a data station operator
  3. The Data Station operator wizard will appear displaying the General properties page. Use these settings:
    • Change the Label to data_station_cost.
    • Enter the Description as you wish.
    • In Station type, select TEMPORARY_TABLE.
    • Enter cost in Table name.
    • Leave the other default values.
      Figure 39. Filling in information for data station operator for Cost staging/temp table
      Filling in information for data station operator for Cost staging/temp table
  4. Click on Finish, and save (Ctrl+S) all the changes.

Testing the execution for the data flow populate_cost

Although the data flow populate_cost will later be placed inside the data flow populate_sales, you want to make sure it executes correctly on its own.

  1. Click on some blank area of the data flow populate_cost's canvas, and find the Properties page on the bottom, opened at the General tab, to set up the general properties of the data flow. Set the following:
    1. In the SQL execution schema, enter admin.
    2. Set the SQL execution database to sales_demo value from the drop-down list of available connections.
    3. In Temporary Staging Table Name Prefix, enter the prefix cost_.
    4. Keep the check boxes marked for optimization: Enable Optimization for Code Generation and Enable Optimization for Code Generation on Distributed Queries.
    5. Leave the other fields as they are.
  2. Grab the output port at the file import operator file_import_cost_unl to make a connection to the input port at the data station operator data_station_cost. Then, save (Ctrl+S) the changes.
  3. Click on the data station operator data_station_cost and review the information on the bottom Properties view/page. Make corrections, if any, as needed.
  4. Validate the data flow, and ensure there are no issues.
  5. Execute the data flow as you did for the data flow populate_time, but with tracing and statistics on (under Diagnostics tab).
  6. Look at the Tail Log tab under the Execution Status view. See "Number of rows modified" for the Insert operation into the cost table. It should be 74 rows.

Now, you are ready to continue with the other data flows that remain:

  • customer (dimension table: Customer)
  • geography (dimension table: Geography)
  • product (dimension table: Product)
  • sales (fact table: Sales)

Designing the data flow for loading dimension table Customer

Data flow for loading dimension table sales_demo:customer

The Customer dimension table in the sales_demo database is populated from the data contained in the source OLTP table stores_demo:customer, with some basic transformations: only a few columns are taken from the source table, and the first name and last name columns are concatenated to form a single column holding the customer's name.

The following listing shows the type of ETL activity that is desired for a data flow in Design Studio, using SQL.

Listing 6. How the table sales_demo:customer is populated using SQL
{ The following statement connects to the sales_demo database }
connect to "sales_demo";

{ The following statement loads data into the customer dimension table. }
insert into customer (customer_code, customer_name, company_name)
select c.customer_num, trim(c.fname) || " " || c.lname, c.company
from stores_demo:customer c;

Listing 6 shows, in the form of an SQL script, the ETL activity that loads the dimension table customer for the first time; you will now recreate it in Design Studio as a graphical SQL-based ELT data flow.

  1. Right-click on the Sales Demo DW-ELT project's Data Flows folder and select New > Data Flow.
  2. The New Data Flow wizard appears. Change the Data flow name to populate_customer, and also indicate to Work against data models (Offline).
  3. Click on Finish.

Table source operator to get data from OLTP system table stores_demo:customer

  1. Drag a table source operator from the General Operators palette on the right, and drop it onto the canvas for the populate_customer data flow.
  2. The Table Source operator wizard appears displaying the General properties page. Make these changes:
    1. Change the Label to a meaningful name, such as stores_demo_customer.
    2. Enter the Description that you prefer for this operator.
    3. Leave the other fields as they are, such as JDBC batch size at 1000 and Debug sample type at Whole Data.
    4. In Location, select Remote database, given that you are fetching data from a connection different from the one to the SQL execution database sales_demo (you will set up the execution database for this data flow later in this section).
    5. In the Source database table, click on the ellipsis (...) button to make the Table Selection dialog appear. There, locate the table customer under the stores_demo DB Model.dbm. Click on OK once you have located the stores_demo:customer table.
    6. Finally, in Connection, select stores_demo from the list. Review the inputs in the Table Source's General page, make sure they are correct, and click on Next.
  3. The Condition page appears. There is no where condition to set because you are selecting all rows from the source stores_demo:customer table. So, leave it blank and click on Next.
  4. The Select List property page appears. As shown in Listing 6 above, the target table sales_demo:customer only uses these elements in its select list:
    • customer_num
    • a concatenation of fname and lname
    • company

    Therefore, keep only the four source columns involved (customer_num, fname, lname, and company) on the Result Columns side.

    To remove the columns that are not needed from the Result Columns side, select them (you can hold the Ctrl key to select several), and then click on the red cross (X), which is the delete button, to remove them from the Select List.
    Figure 40. Select List page of table source operator, removing unwanted columns from source table
    Select List page of table source operator, removing unwanted columns from source table
  5. The resulting selected columns are the ones shown in Figure 41 (customer_num, fname, lname, and company):
    Figure 41. Select List page of table source operator
    Select List page of table source operator
  6. On this same Select List page, make all these other changes to be consistent with the desired transformation in Listing 6 above:
    1. Rename the values in Column Name for these columns: customer_num, which in the target table is customer_code, and company, which becomes company_name.

      At this point, do not change the data types of the columns that are not being transformed (customer_num and company) because these values come from the source and you are still at the Table Source operator at this point (stores_demo:customer), not the Table Target (sales_demo:customer).

      To change the names listed in the Column Name column, just click in the cell whose column name you want to change, and type the desired name on the Result Columns side. These changes to the desired names are illustrated in Figure 42:
      Figure 42. Select List page of table source operator, renaming column names on select list
      Select List page of table source operator, renaming column names on select list
    2. The next step is to combine the columns fname and lname that come from the source stores_demo:customer into a single target column customer_name formed by the concatenation of those two columns. Therefore, the Result Columns on the right will contain only three columns:
      • customer_num, as customer_code
      • fname || ' ' || lname, as customer_name
      • company, as company_name
      To leave only three columns, let's remove the column for lname, as illustrated in Figure 43:
      Figure 43. Select List page of table source operator, removing one more column: lname
      Select List page of table source operator, removing one more column: lname
    3. The resulting three columns are shown in Figure 44. As you can see, the second column, fname, needs to be changed to also include lname. You also need to change that column's name from fname to customer_name:
      Figure 44. Select List page of table source operator, leaving three columns
      Select List page of table source operator, leaving three columns
    4. Let's change the column name for column fname to customer_name. This is equivalent to an alias in the select list of an SQL SELECT statement (for example, SELECT ..., fname AS customer_name, ...). To change the column name, just click in the cell where fname appears as the column name and replace it with customer_name. This step is illustrated in Figure 45:
      Figure 45. Select List page of table source operator, renaming second column in select list
      Select List page of table source operator, renaming second column in select list
    5. The resulting columns, with the corrected name for the second column, are listed in Figure 46:
      Figure 46. Select List page of table source operator, results after renaming column
      Select List page of table source operator, results after renaming column
    6. Finally, let's modify the result column fname so that it is replaced by an expression that concatenates fname and lname.

      To do this, click in the Expression cell where fname is. An ellipsis (...) button will show up. Click on the ellipsis button so the SQL Expression Builder appears. Figure 47 illustrates this step.
      Figure 47. Select List page of table source operator, opening SQL Expression builder
      Select List page of table source operator, opening SQL Expression builder
    7. The SQL Expression Builder wizard appears. Double-click on the available elements (input columns, keywords, operations, functions, and variables) to construct the SQL-based expression that you need for the column where originally just fname used to be. You can also edit the text inside the SQL Text box with any valid SQL-based expression, even if the functions, operators, and keywords don't appear in the available lists.
      1. In this case, the current expression "admin"."customer"."fname" needs to be replaced with an expression that concatenates this column with lname.
      2. To do this, position the cursor at the end of the expression inside the SQL Text.
      3. Add a blank space.
      4. Double-click on the concat operator, ||, listed in the operations frame, or just type it in the SQL Text.
      5. Add a blank space.
      6. Add the literal for the blank space at the SQL level: ' '.
      7. Type or double-click on the concat operator, || again.
      8. Add a blank space.
      9. From the list of Inputs on the upper-left corner, double-click on the "admin"."customer"."lname" column to add it as the second operand of the concat operator.
      10. Finally, wrap each of the columns fname and lname in the trim function so that blank spaces are removed before concatenating the column values. Notice that the trim function is not listed among the functions in the upper-right corner. Even so, because trim is a valid Informix function, you can still use it by just typing it in the SQL Text.
      11. The final expression will look like this one: trim("admin"."customer"."fname") || ' ' || trim("admin"."customer"."lname"). Figure 48 illustrates the final SQL expression constructed from the SQL Expression Builder. Click on OK to close this window.
        Figure 48. SQL Expression builder to create concatenation between fname and lname columns
        SQL Expression builder to create concatenation between fname and lname columns
    8. Back in the Select List page of the table source operator, you can now see that the second column is an expression that concatenates the trimmed fname and lname columns from the stores_demo:customer table. Figure 49 illustrates the final select list. Click on Finish.
      Figure 49. Select list's result columns after creating SQL expression
      Select list's result columns after creating SQL expression
    9. Save (Ctrl+S) all the changes. The new table source operator, with the select list and transformations for the columns taken from source table stores_demo:customer, appears on the canvas. (A plain-SQL sketch of what this operator now selects follows.)
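
For reference, the table source operator you just configured selects roughly the following, expressed as plain SQL (a sketch based on Listing 6; the aliases correspond to the renamed result columns):

-- Sketch of the configured table source's select list (illustrative only)
select c.customer_num                        as customer_code,
       trim(c.fname) || ' ' || trim(c.lname) as customer_name,
       c.company                             as company_name
from stores_demo:customer c;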

Table target operator for target table Customer in sales_demo

Where to place an SQL-based transformation

In some cases, you will find that there is not a single place in the whole ETL workflow to make an SQL-based transformation on a set of columns/fields coming from a table or a file and inserted/updated into a target table or file. For instance, the basic SQL-based transformation on fname and lname retrieved from source table stores_demo:customer was put in the Table Source operator, but it could also have gone in the Table Target operator instead, to be done at the time you actually load the data into the target sales_demo:customer table. Therefore, there are several ways to do the same data extraction, cleansing, transformation, movement, and load. The path you choose will depend on which alternative works best in your own environment, in terms of performance and administration.

  1. Drag the table target operator that is in the General Operators palette on the right, and drop it onto the canvas of the data flow populate_customer, as shown in Figure 50.
    Figure 50. Drag and drop table target operator onto data flow's canvas
    Drag and drop table target operator onto data flow's canvas
  2. The Table Target operator's wizard appears and is on the General properties page.
  3. Edit the General properties, like the following:
    1. Change the Label of the operator to insert_customer.
    2. Enter the Description: Table Target operator to insert/load data into sales_demo:customer dimension table.
    3. Keep Commit interval to 0. This field can be changed as needed.
    4. In SQL operation, leave Insert, as you are populating the table sales_demo:customer for the first time. Depending on the intention of the data flow, you may need to use the Target Table operator to either insert, delete, or update rows.
    5. In JDBC batch size, keep 0. You may want to increase this value for performance purposes.
    6. In Location, you can choose SQL execution database if the target database is the SQL execution database defined in the general properties for the data flow (at this point, you have not set the execution database on the data flow's properties page, so the target table customer is local to this database), or Remote database if the target table is in another database that you specify at the operator level.

      All JDBC-compliant databases are supported, not only Informix. For performance purposes, it is recommended that the execution database be the same as the target database so that no additional connections are required.

      Therefore, in this case, set Location to SQL execution database.
    7. In Target database table, click on the ellipsis (...) button.
      • Inside the Table Selection dialog, expand the physical data model for the target data mart database, sales_demo, called sales_demo DB Model.dbm. Then, expand the admin schema under which you created the tables, and select the table customer. Click on OK.
    8. Back in the General page, check the check box to have the table use the Not-Logged Initially option.
    9. Check that all values are correct, and then click on Finish.
  4. The table target operator appears on the canvas. Save (Ctrl+S) all changes. Figure 51 shows the canvas for the data flow populate_customer as it is at this point. Notice that this last operator shows a validation error (the red cross on the upper-left corner) because its input port has not been configured. Once you connect the Table Source operator to this Table Target operator and complete the setup, this error will be gone.
    Figure 51. Target table operator insert_customer added into the data flow canvas
    Target table operator insert_customer added into the data flow canvas

Setting the SQL execution database and the temporary prefix for the data flow populate_customer

  1. Click on some blank area of the data flow populate_customer's canvas, and find the Properties page on the bottom, opened at the General tab, to set up the general properties of the data flow. Set the following:
    • In the SQL execution schema, enter admin.
    • Set the SQL execution database to sales_demo value from the drop-down list of available connections.
    • In Temporary Staging Table Name Prefix, enter the prefix customer_.
    • Keep the check boxes marked for optimization: Enable Optimization for Code Generation and Enable Optimization for Code Generation on Distributed Queries.
    • Leave the other fields as they are.

Connecting the output of table source with the input of table target operator

  1. Grab the output port at the table source operator stores_demo_customer to make a connection to the input port at the table target operator insert_customer. Then, save the changes. Figure 52 shows the status of both operators, connected:
    Figure 52. Connecting output and input ports of operators in data flow populate_customer
    Connecting output and input ports of operators in data flow
  2. Right-click on the table target operator insert_customer and select Show Wizard.
  3. The Table Target operator wizard will appear on the General page. Click on Next.
  4. The Map page of the insert_customer operator appears.

    Make sure that the mapping of the fields in the source table stores_demo:customer's select list (which corresponds to input INPUT_02_0) to the columns in the target table sales_demo:customer is correct. The fields of the source select list are listed on the left panel for Available Columns and also in the Map from column on the right panel for Result Columns; the columns of table customer appear in the Map to column of the Result Columns. Note that, since you have done all the transformations to the columns in the Table Source operator, you do not need to make further transformations in this Table Target operator when matching source to target. Click on Finish, and save (Ctrl+S) all changes.
    Figure 53. Mapping input fields from source table customer with columns in target table customer
    Mapping input fields from source table customer with columns in target table customer

Testing the execution of the data flow populate_customer

It is time to test in your development environment that the newly created data flow populate_customer correctly extracts data from the input table stores_demo:customer and loads the data into the target dimension table sales_demo:customer.

In order to do this execution test, follow the steps below.

  1. Since you are testing a full-insert data flow that populates the sales_demo:customer table for the first time, make sure that the table sales_demo:customer is empty (it has no rows).
  2. Make sure that the data flow you want to test is open and shown as the active tab on the right side of the workspace, and that you have saved all the changes.
  3. Validate the data flow so that data type mismatches and other issues are identified and resolved. To validate the data flow, right-click on the data flow populate_customer located under the Data Flows folder in your project explorer, and click on Validate.
  4. You should receive a message saying "Validation succeeded".
  5. Use the Execute button on the upper toolbar of Design Studio to start the execution of the data flow.
  6. The Flow Execution wizard appears. Leave the default values and click on the Execute button.
  7. Wait until the execution of the data flow populate_customer finishes.
  8. Once the execution of the data flow completes, view the Execution Result report and ensure it is successful.
  9. Click on OK to close this window.
  10. Verify directly on the Informix instance that contains the database sales_demo that its table customer has been populated and now contains data:
    Listing 7. Verify table customer is populated and contains data
    dbaccess sales_demo -
    
    Database selected.
    
    > select count(*) from customer;
    
    
          (count(*))
    
                  28
    
    1 row(s) retrieved.

Now, you are ready to continue with the other data flows that remain:

  • geography (dimension table: Geography)
  • product (dimension table: Product)
  • sales (fact table: Sales)

Designing the data flow for loading dimension table Geography

Data flow for loading dimension table sales_demo:geography

The dimension table geography in the sales_demo database is populated using data from two source OLTP tables: stores_demo:customer and stores_demo:state.

The following listing shows the ETL activity, in the form of an SQL script, that loads the dimension table geography for the first time and that you will recreate in Design Studio as a graphical SQL-based ELT data flow.

Listing 8. How the table sales_demo:geography is populated
-- The following statement loads data into the geography dimension table.
insert into geography (district_name, state_code, state_name)
select distinct c.city, s.code, s.sname
from stores_demo:customer c, stores_demo:state s
where c.state = s.code;

update geography   	-- converts state_code values to region values
set region = 1
where state_code = "CA";

update geography
set region = 2
where state_code <> "CA";

Let's create another Data Flow to load the geography table.

  1. Right-click on the Sales Demo DW-ELT Project's Data Flows folder and select New > Data Flow.
  2. Enter the information for a new data flow called populate_geo, and select the Work against data models (Offline) option. Select Informix as the SQL execution database type.
  3. Click on Finish.

Adding customer table source operator

You need two source tables from the stores_demo database: customer and state. The first source table is the customer table.

  1. To create a source table for stores_demo:customer, drag a Table Source operator from the palette and drop it onto the empty canvas area. A wizard for the Table Source operator appears.
  2. The first field you see is Label. This label will be automatically named "customer" once you select the customer table from the stores_demo data model.
  3. Select "remote database" radio button.
  4. Now you need to list the customer table as the Source Database Table. Click on the ellipsis (...) button next to it. From there, expand the stores_demo data model, and then expand the "admin" schema. Select the customer table.
  5. Click on OK.
  6. Back to the wizard, in the Connection field, select stores_demo.
  7. Click on Next twice.
  8. As per the original SQL statement in loaddw.sql to populate geography, you do not need a select * on stores_demo:customer.

    Only two columns are needed—city and state:
    Listing 9. Loading data into geography dimension table
    -- The following statement loads data into the geography dimension table.
    insert into geography (district_name, state_code, state_name)
    select distinct c.city, s.code, s.sname
    from stores_demo:customer c, stores_demo:state s
    where c.state = s.code;

    So, remove the unused columns using the delete button (the red X) at the top, leaving just the city and state columns, as indicated in Figure 54:
    Figure 54. Select List wizard for table source operator for customer
    Select List wizard for Table Source operator for customer
  9. Click on Finish.

    You now see the table source operator labeled "customer" on the canvas.

The first Table Source operator is now complete.

Adding state table source operator

The next table source is the state table, which also comes from the stores_demo database. Let's add the second Table Source operator for stores_demo:state.

  1. Follow the same steps as for the Table Source operator for stores_demo:customer, except select the state table for the Source database table.
  2. After completing the first screen of the wizard, click on Next twice. You want to take a look at the Select List properties and confirm whether all the columns from the state table are needed.
  3. As per the original SQL statement in loaddw.sql, both columns in state are needed: code and sname.
    Listing 10. Loading data into geography dimension table
    -- The following statement loads data into the geography dimension table.
    insert into geography (district_name, state_code, state_name)
    select distinct c.city, s.code, s.sname
    from stores_demo:customer c, stores_demo:state s
    where c.state = s.code;

    So, leave them as they are:
    Figure 55. Select List properties for table source operator, state table
    Select List properties for table source operator, state table
  4. Click on Finish.
  5. Save all changes.
    Figure 56. Table source operators on populate_geo canvas
    Table source operators on populate_geo canvas

Now the populate_geo data flow has two table sources: stores_demo:customer and stores_demo:state.
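
At this point, the two table source operators correspond roughly to the following two selects, sketched here as plain SQL based on Listings 9 and 10:

-- Sketch of what the two configured table sources select (illustrative only)
select c.city, c.state from stores_demo:customer c;
select s.code, s.sname from stores_demo:state s;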

Adding table join operator to join two table sources

The table sources need to be joined, so you need to use the table join operator.

Let's add the table join operator for the join between the two tables.

  1. Drag a Table Join operator from the palette onto the canvas, and connect the output ports from customer and state to the in and in1 ports of the Table Join, like this:
    Figure 57. Connecting the ports between table source operators and table join operator
    Connecting the ports between table source operators and table join operator
  2. Let's configure the Table Join operator. Right-click on the Table Join > Show Wizard.
  3. Change the name to Table_join_cust_state.
  4. Click on Next.
  5. The next screen displays the join condition. Click on the ellipsis (...) button on the upper right to enter the join condition.
    Figure 58. Table Join Wizard, screen 2
    Table Join Wizard, screen 2
  6. A SQL Expression Builder screen appears.

    Use the operators in the SQL Expression Builder to build the equivalent of this join condition:
    Listing 11. Loading data into geography dimension table
    -- The following statement loads data into the geography dimension table.
    insert into geography (district_name, state_code, state_name)
    select distinct c.city, s.code, s.sname
    from stores_demo:customer c, stores_demo:state s
    where c.state = s.code;

    So, the equivalent of c.state = s.code, expressed with the input column names shown in the builder, would work:
    Figure 59. Editing SQL condition in SQL Condition Builder for table join operator
    Editing SQL condition in SQL Condition Builder for table join operator
  7. Click on OK, and then click on Next to go to the next screen.
  8. Review the result columns in the Select List. As per the original SQL statement, you don't need to repeat the state code (from customer.state and from state.code).

    So, only the following columns are needed: city, code and sname. You can use the delete button to remove the rest (only state is removed):
    Figure 60. Select List
    Select List
  9. Click on Finish.
  10. Save all changes.

The two tables are now joined.

Adding distinct operator

Now, you will apply a DISTINCT to remove duplicates from the output of the inner join between stores_demo:customer and stores_demo:state, and then you will populate the target table sales_demo:geography with the result plus the additional region logic from the original SQL statement in the loaddw.sql script.

  1. Drag and drop a Distinct operator onto the canvas and connect the Inner port in the Table Join operator to the input port in the Distinct operator.

    You will see a red cross (X) on the Table Join operator at first, but after saving this flow the red cross will disappear.
    Figure 61. Addition of distinct operator to populate_geo data flow
    Addition of distinct operator to populate_geo data flow
  2. Let's configure the Distinct operator. Double-click on the Distinct operator to open up its Wizard.
  3. Change the label to Distinct_join_cust_state.
  4. Click on Next to view the Column Select screen.
    Figure 62. Distinct Operator Wizard, screen 2
    Distinct Operator Wizard, screen 2
    You are OK here: you want all of those columns (city, code, sname) to be distinct.
  5. Click on Finish.
  6. Save all changes.

You have made the join between the two table sources distinct.

Adding table target operator

Now, you will add the final operator: Target Table operator for the geography table in the sales_demo database.

  1. Drag and drop the Target Table operator.
  2. Perform the same steps as you did with the Table Source operators, except select the geography table under the sales_demo data model (instead of stores_demo, which is the source).
  3. Click on Finish.
  4. Connect the result port in the Distinct operator to the input port in the table target operator and save all changes.
    Figure 63. Distinct operator connected to table target operator
    Distinct operator connected to table target operator
  5. You need to map the columns and make the few transformations that are required. Right-click on the table target operator, select Show Wizard, and then Next to go to the Map screen:
    Figure 64. Table Target Wizard Map screen
    Table Target Wizard Map screen
  6. This is the original SQL statement to populate geography:
    Listing 12. Original SQL statement to populate geography
    -- The following statement loads data into the geography dimension table.
    insert into geography (district_name, state_code, state_name)
    select distinct c.city, s.code, s.sname
    from stores_demo:customer c, stores_demo:state s
    where c.state = s.code;
    
    update geography   	-- converts state_code values to region values
    set region = 1
    where state_code = "CA";
    
    update geography
    set region = 2
    where state_code <> "CA";

    Notice that district_code is a SERIAL column and is not mapped, so the server assigns the next value in the serial sequence automatically; district_name has a one-to-one mapping with the city column coming from the distinct; and state_code and state_name map to code and sname, respectively, coming from the distinct.

    So, let's make these mappings as explained above:
    Figure 65. Columns mapped
    Columns mapped
  7. Finally, the region value depends on the state_code value. This can be expressed with a CASE expression.

    The two UPDATE statements can be converted into this CASE logic:
    CASE WHEN state_code = 'CA' THEN 1 ELSE 2 END

    Click on the cell of the region column, and then click on the ellipsis (...) button that appears next to the region name, to enter the CASE expression over the corresponding input column:
    CASE WHEN "INPUT_08_0"."code" = 'CA' THEN 1 ELSE 2 END

    The input name "INPUT_08_0" may differ. (A consolidated SQL sketch of the resulting load follows this list.)
    Figure 66. Writing CASE expression for Region column
    Writing CASE expression for Region column
  8. Click on OK.
  9. Click on Finish.
  10. Save all changes.
    Figure 67. Final populate_geo data flow
    Final populate_geo data flow
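
Taken together, the populate_geo data flow is roughly equivalent to collapsing the INSERT and the two UPDATE statements from loaddw.sql into a single statement. The following is only a sketch for illustration; Design Studio generates its own SQL at execution time:

-- Sketch: the join, DISTINCT, and CASE logic of populate_geo in one statement
insert into geography (district_name, state_code, state_name, region)
select distinct c.city, s.code, s.sname,
       case when s.code = 'CA' then 1 else 2 end
from stores_demo:customer c, stores_demo:state s
where c.state = s.code;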

Ready? Let's execute.

Executing populate_geo data flow

  1. Select the Execute button at the top of Design Studio. It is a green button with a white arrow. See the following figure for the location of the Execute button.
    Figure 68. Location of Execute button
    Location of Execute button
  2. You will see the Flow Execution Wizard. Ensure the following values are used:
    • SQL execution schema: admin
    • Select SQL execution database: sales_demo
  3. Click on Execute.
  4. A pop-up box appears indicating that the data flow populate_geo is executing. When the flow finishes, the Execution Result window appears. It tells you whether the flow execution succeeded.
  5. In Figure 69, the Result summary at the top indicates that the execution has succeeded. In the log below, look for the red arrow that marks where the execution is reported as successful. This log is also a useful place to look if there are failures.
    Figure 69. Execution result window
    Execution result window
  6. Click on OK to close the Execution Result Window.
  7. To verify the flow by checking the table geography in the sales_demo database, you can go to the Data Source Explorer and expand the sales_demo tree to view the tables under the admin schema.
  8. Right-click on geography table and select Data > Return All Rows.
    Figure 70. Using the 'Return All Rows' option in Data Source Explorer
    Using the 'Return All Rows' option in Data Source Explorer
  9. Take a look at the SQL Results window, which is usually to the right of the Data Source Explorer.
  10. Click on the Result1 tab. A total of 17 rows are returned, which means the flow is a success. (You can also verify the row count from dbaccess, as shown after this list.)
    Figure 71. Result of returning all rows from geography table
    Result of returning all rows from geography table
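
Alternatively, you can verify the row count directly from dbaccess, as you did for the time and customer tables. The following is a sketch of the expected session, assuming the same 17 rows:

dbaccess sales_demo -

Database selected.

> select count(*) from geography;


      (count(*))

              17

1 row(s) retrieved.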

Designing the data flow for loading dimension table Product

Data flow for loading dimension table sales_demo:product

In this section, you populate the product table in the sales_demo database. The product table is another dimension table in the star schema.

Let's take a look at the next piece of the loaddw.sql script:

Listing 13. How the table sales_demo:product is populated using SQL
-- The following statement loads data into the product dimension table.
insert into product (product_code, product_name, vendor_code,
 vendor_name, product_line_code, product_line_name)
 select a.catalog_num, 
 trim(m.manu_name) || " " || s.description,
 m.manu_code, m.manu_name, s.stock_num, s.description
 from stores_demo:catalog a, stores_demo:manufact m,
 stores_demo:stock s
 where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
 s.manu_code = m.manu_code;

As you can see, product will be populated by using a join between catalog, manufact, and stock in the stores_demo database.

You want the final data flow to look similar to the following figure:

Figure 72. Final data flow design for populate_product
Final data flow design for populate_product

Create a data flow and name it populate_product. By now, you should already know how to create a data flow.

Creating first table source, catalog table

The data flow above shows that there are three table source operators: the catalog, manufact, and stock tables from the stores_demo database.

  1. Drag and drop the first Table Source operator onto the empty canvas. This will be the catalog table.
  2. Ensure the remote database radio button is selected.
  3. Choose the catalog table under admin schema using the stores_demo's data model.
  4. Press Next twice to reach the Select List screen (third screen).
  5. From the stores_demo:catalog table, only these columns are needed (see Listing 14): catalog_num, manu_code, and stock_num.
    Listing 14. Needed columns
    insert into product (product_code, product_name, vendor_code,
     vendor_name, product_line_code, product_line_name)
     select a.catalog_num, 
     trim(m.manu_name) || " " || s.description,
     m.manu_code, m.manu_name, s.stock_num, s.description
     from stores_demo:catalog a, stores_demo:manufact m,
     stores_demo:stock s
     where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
     s.manu_code = m.manu_code;

    So, remove the rest and leave catalog_num, stock_num, and manu_code columns in the Result Columns list.
    Figure 73. catalog table source operator's Select List wizard
    catalog table source operator's Select List wizard
  6. Click on Finish.

Creating second table source, manufact table

  1. Drag and drop the second Table Source operator. This will use the manufact table from the stores_demo data model.
  2. Edit the properties on the first screen of the wizard as you did for the Table Source operator for catalog, except select the manufact table in the stores_demo data model instead.
  3. Press Next twice to reach to the Select List screen.
  4. From the list on the screen, you only need to keep manu_code and manu_name (see SQL in Listing 15 below):
    Listing 15. Needed columns
    insert into product (product_code, product_name, vendor_code,
     vendor_name, product_line_code, product_line_name)
     select a.catalog_num, 
     trim(m.manu_name) || " " || s.description,
     m.manu_code, m.manu_name, s.stock_num, s.description
     from stores_demo:catalog a, stores_demo:manufact m,
     stores_demo:stock s
     where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
     s.manu_code = m.manu_code;

    So, discard the lead_time column from the Result Columns list.
    Figure 74. manufact table source operator's Select List wizard
    manufact table source operator's Select List wizard
  5. Click on Finish.

Creating third table source, stock table

  1. Drag and drop the third Table Source operator. This will be the stock table from the stores_demo data model.
  2. Edit the properties on the first screen of the wizard as you did for the Table Source operators for catalog and manufact, except select the stock table in the stores_demo data model instead.
  3. Press Next twice to reach to the Select List screen.
  4. Based on the original SQL in loaddw.sql, you only need these columns from the stores_demo:stock table: stock_num, manu_code, and description.
    Listing 16. Needed columns
    insert into product (product_code, product_name, vendor_code,
     vendor_name, product_line_code, product_line_name)
     select a.catalog_num, 
     trim(m.manu_name) || " " || s.description,
     m.manu_code, m.manu_name, s.stock_num, s.description
     from stores_demo:catalog a, stores_demo:manufact m,
     stores_demo:stock s
     where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
     s.manu_code = m.manu_code;

    So, remove the rest but keep stock_num, manu_code, and description columns.
    Figure 75. stock table source operator's Select List wizard
    stock table source operator's Select List wizard
  5. Click on Finish.
  6. Save all changes.
    Figure 76. Three table source operators for populate_product data flow
    Three table source operators for populate_product data flow

Creating a table join

Now, you need to join the three tables.

  1. Drag and drop the Table Join operator onto the canvas to the right of the three table source operators.
  2. Connect the Table Source operators' output ports to the input ports in the Table Join operator.
  3. You may notice there are only two input ports in the Table Join operator. Click on the little icon right below the "in1" port to create a new input port. A new input port "in2" will appear below "in1".
    Figure 77. Creating new input port in the table join operator
    Creating new input port in the table join operator
  4. Connect from the stock Table Source operator's output port to the "in2" input port in the Table Join operator.
    Figure 78. Connecting between stock table source operator and table join operator
    Connecting between stock table source operator and table join operator
  5. You need to configure the Table Join operator. Double-click on it to get the Wizard.
  6. Name the label as Table_Join_product.
  7. Click on Next to view the Join Condition screen.
  8. Click on the ellipsis (...) button on the upper right to set up the join condition.
  9. The join condition needs to be set up as shown in Listing 17:
    Listing 17. Join condition setup
    insert into product (product_code, product_name, vendor_code,
     vendor_name, product_line_code, product_line_name)
     select a.catalog_num, 
     trim(m.manu_name) || " " || s.description,
     m.manu_code, m.manu_name, s.stock_num, s.description
     from stores_demo:catalog a, stores_demo:manufact m,
     stores_demo:stock s
     where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
     s.manu_code = m.manu_code;

    So, in the SQL Condition Builder, enter a condition like the following (the input names in the SQL Condition Builder might differ):
    "IN_04_0"."stock_num" = "IN2_04_2"."stock_num" AND 
    "IN_04_0"."manu_code" = "IN2_04_2"."manu_code" AND 
    "IN2_04_2"."manu_code" = "IN1_04_1"."manu_code"
    Figure 79. SQL Condition Builder for table join operator
    SQL Condition Builder for table join operator
  10. Click on OK.
  11. You are returned to the Condition screen. Click on Next.
  12. You are now in the Select List screen.

     Remove the columns that are not required for the result set. Based on the SQL statement in Listing 18, you only need the columns referenced in the SELECT list (catalog_num, manu_name, manu_code, stock_num, and description):
    Listing 18. Needed columns
    insert into product (product_code, product_name, vendor_code,
     vendor_name, product_line_code, product_line_name)
    select a.catalog_num, 
     trim(m.manu_name) || " " || s.description,
     m.manu_code, m.manu_name, s.stock_num, s.description
     from stores_demo:catalog a, stores_demo:manufact m,
     stores_demo:stock s
     where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
     s.manu_code = m.manu_code;

    So, remove the rest of the columns from the Result Columns list.
    Figure 80. Table join operator Select List wizard
    Table join operator Select List wizard
  13. You may notice that some of the columns end with "_1". These suffixes were added because the joined inputs originally contained duplicate column names; now that the unneeded columns are removed there are no duplicates left, but the suffixes remain.

     For clarity, rename these columns to remove the "_1" suffix.

     Click on any field under the Column Name column in the Result Columns list to edit the name.
    Figure 81. manu_code and stock_num renamed with "_1" suffix removed
    manu_code and stock_num renamed with '_1' suffix removed
  14. Click on Finish.
  15. Save all changes.
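
For reference, the join and select list you just configured are roughly equivalent to the following standalone query, taken from the SELECT portion of loaddw.sql (a sketch only; inside Design Studio the operator works on generated internal column names):

select a.catalog_num, m.manu_code, m.manu_name, s.stock_num, s.description
 from stores_demo:catalog a, stores_demo:manufact m, stores_demo:stock s
 where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
 s.manu_code = m.manu_code;

The product_name concatenation is intentionally not part of the join; you will add it in the Table Target operator's column mapping in the next section.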

Creating a target table

Finally, add the table target operator of sales_demo:product.

  1. Drag and drop the table target operator onto the canvas.
  2. The wizard screen appears. Edit the fields to use the remote database option, the product table from the sales_demo data model (not stores_demo!), and sales_demo as the connection.
  3. Click on Finish.
  4. Connect the Inner port from the Table Join operator to the input port of the Table Target operator.
    Figure 82. Table join operator connects to the table target operator
    Table join operator connects to the table target operator
  5. Save all changes.
  6. A small yellow warning icon appears in the upper-left corner of the Table Target operator.

    Hover the mouse above the yellow icon to see the message.
    Figure 83. Column mapping warning
    Column mapping warning
     It explains that the column mapping has not been specified: you have not yet mapped the columns between the Table Join operator and the Table Target operator, which is required.
  7. Let's open the Table Target wizard by double-clicking on the Table Target operator on the canvas.
  8. The first screen is fine. Press Next.
  9. Notice that the columns are not mapped because the fields for "Map From" are all blank.
    Figure 84. Column mapping warning
    Column mapping warning
    You need to create the correct mappings. The original SQL statement will help:
    Listing 19. Original SQL statement
    insert into product (product_code, product_name, vendor_code,
     vendor_name, product_line_code, product_line_name)
     select a.catalog_num, 
     trim(m.manu_name) || " " || s.description,
     m.manu_code, m.manu_name, s.stock_num, s.description
     from stores_demo:catalog a, stores_demo:manufact m,
     stores_demo:stock s
     where a.stock_num = s.stock_num and a.manu_code = s.manu_code and
     s.manu_code = m.manu_code;

    Map the catalog_num to product_code by dragging and dropping the catalog_num column from the left into the Map From field next to product_code column.
    Figure 85. Drag and drop the catalog_num to map to product_code in target table
    Drag and drop the catalog_num to map to product_code in target table
  10. Do the same for the following mappings:
    • manu_code to vendor_code
    • manu_name to vendor_name
    • stock_num to product_line_code
    • description to product_line_name
    Figure 86. Mapping all columns except product_name
    Mapping all columns except product_name
  11. Notice that product_name is not mapped. The product name is a concatenation of manu_name and description.

    Click on the empty Map From field next to the product_name. The ellipsis (...) button appears.
    Figure 87. Making the ellipsis button appear
    Making the ellipsis button appear
  12. Click on the ellipsis (...) button to use the SQL Expression Builder.
  13. Create the concatenation transformation. There are several ways to do this:

    First option:
    "INPUT_08_0"."manu_name" || ' ' || "INPUT_08_0"."description"

    Second option, using SUBSTR function:
    SUBSTR("INPUT_08_0"."manu_name",0,LENGTH("INPUT_08_0"."manu_name"))
     || ' ' || "INPUT_08_0"."description"

    Third option, using the TRIM function (you need to type this in directly, as it is not available in the Functions list as of this release):
    TRIM("INPUT_08_0"."manu_name") || ' ' || "INPUT_08_0"."description"

    Feel free to come up with your own way.

    This tutorial uses the third option. Click on OK.
    Figure 88. All columns mapped
    All columns mapped
  14. Click on Finish.
  15. Notice the yellow warning icon on the Table Target operator. Saving the flow should take care of it.

    Save the flow. The warning icon disappears.
    Figure 89. The warning icon disappears
    The warning icon disappears
  16. Execute the data flow, just as you did with the previous data flows in this tutorial.
  17. Validate the results by querying the product table in the sales_demo database. You should see 74 rows (a quick count query is sketched after the figure below).
    Figure 90. populate_product data flow validated with 74 rows returned
    populate_product data flow validated with 74 rows returned
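
For example, using dbaccess (or any SQL client you prefer) connected to sales_demo, a simple count confirms the load:

> select count(*) from product;


      (count(*))

              74

1 row(s) retrieved.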

Designing the data flow for loading fact table Sales

Data flow for loading fact table sales_demo:sales

Finally, you need to create one more, very important data flow: the one that populates the fact table. All the previous data flows in this tutorial populated dimension tables.

In the sales_demo database schema, our fact table is named sales.

Let's take a look at the next piece of the loaddw.sql script:

Listing 20. How the fact table sales is populated
-- The following statement loads data into the sales fact table.
insert into sales (customer_code, district_code, time_code, product_code,
units_sold, revenue, cost, net_profit)
select c.customer_num, g.district_code, t.time_code, p.product_code,
SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
SUM(i.total_price) - SUM(i.quantity * x.cost)
from stores_demo:customer c, geography g, time t, product p,
stores_demo:items i, stores_demo:orders o, cost x
where c.customer_num = o.customer_num and o.order_num = i.order_num and
p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
 t.order_date = o.order_date and  p.product_code = x.product_code and
 c.city = g.district_name
 GROUP BY 1,2,3,4;

This is equivalent to the data flow shown in the following populate_sales diagram:

Figure 91. Final data flow design for populate_sales
Final data flow design for populate_sales

According to the diagram, the flow uses seven Table Source operators, one Table Join operator, one Group By operator, and one Table Target operator. The diagram may look intimidating, but consider writing the equivalent SQL by hand instead; Design Studio makes this easier with drag-and-drop operators and connectors.

Because you used a data station for the cost source, the populate_cost data flow will be reused instead of a table source operator, so your flow will not look exactly like the diagram above.

First, create a new data flow and name it populate_sales.

Creating first source table

Let's start putting the source tables together.

The fact table requires all of these source tables (notice that some come from stores_demo and others come from sales_demo):

  • stores_demo:customer c, geography g, time t, product p
  • stores_demo:items i, stores_demo:orders o, cost x

Let's start with the first Table Source operator: the customer table from stores_demo.

  1. Drag and drop the Table Source operator onto the canvas and name it customer. The location will be remote database. The source database table will be stores_demo.admin.customer, which means the table name is customer and the schema name is admin. Connection will be from stores_demo.
  2. Select Next twice (there is no filter condition for customer table).
  3. The next screen that appears is the Select List wizard.
     Figure 92. Select List wizard for source table operator customer
    Select List wizard for source table operator customer
     You do not need all the columns in customer. As per the original SQL statement in the loaddw.sql script, you only need customer_num and city:
    Listing 21. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
     t.order_date = o.order_date and  p.product_code = x.product_code and
     c.city = g.district_name
     GROUP BY 1,2,3,4;

     So, keep customer_num and city, and remove the rest using the delete button.

     See the following figure for the location of the delete button, which is a red cross (X) in the upper-left corner of the Result Columns area.
    Figure 93. Two columns remain (circled 'X' is the delete button)
    Two columns remain (circled 'X' is the delete button)
  4. Click on Finish, and save (Ctrl+S).
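
Conceptually, the source operator you just configured behaves like the following small query (a sketch only; inside Design Studio the operator exposes these columns through its output port rather than running a standalone statement):

select customer_num, city
 from stores_demo:customer;

Each of the remaining source operators in this flow follows the same pattern: keep only the columns that the original statement in loaddw.sql references.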

Creating second source table, sales_demo:geography

Now, let's make a second Table Source operator: the geography table from sales_demo.

  1. Drag and drop the Table Source operator onto the canvas and name it geography. The location will be remote database. The source database table will be sales_demo.admin.geography which means the table name is geography and the schema name is admin. Connection will be from sales_demo.
  2. Select Next twice (there is no filter condition for geography table).
  3. The next screen that appears is the Select List wizard.
    Figure 94. Select List wizard for source table operator geography
    Select List wizard for source table operator geography
    As per the original SQL statement in loaddw.sql, you don't need all the columns in geography table. Only district_code and district_name are needed.
    Listing 22. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
     t.order_date = o.order_date and  p.product_code = x.product_code and
     c.city = g.district_name
     GROUP BY 1,2,3,4;

    So, remove all the columns that are not district_code or district_name:
    Figure 95. Columns district_code and district_name remain
    Columns district_code and district_name remain
  4. Click on Finish, and save (Ctrl+S).

Creating third source table, sales_demo:time

Now, let's make a third Table Source operator: the time table from sales_demo.

  1. Drag and drop the Table Source operator onto the canvas and name it time. The location will be remote database. The source database table will be sales_demo.admin.time, which means the table name is time and the schema name is admin. Connection will be from sales_demo.
  2. Select Next twice (there is no filter condition for the time table).
  3. The next screen that appears is the Select List wizard.
    Figure 96. Select List wizard for source table operator time
    Select List wizard for source table operator time
    As per the original SQL statement, you don't need all the columns in time table. You only need time_code and order_date:
    Listing 23. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
    t.order_date = o.order_date and  p.product_code = x.product_code and
    c.city = g.district_name
    GROUP BY 1,2,3,4;

    So, remove all the columns that are not time_code or order_date:
    Figure 97. Columns time_code and order_date remain
    Columns time_code and order_date remain
  4. Click on Finish, and save changes.

Creating fourth source table, sales_demo:product

Now, let's make a fourth Table Source operator: the product table from sales_demo.

  1. Drag and drop the table source operator onto the canvas, and name it product. The location will be remote database. The source database table will be sales_demo.admin.product, which means the table name is product and the schema name is admin. Connection will be from sales_demo.
  2. Select Next twice (there is no filter condition for product table).
  3. The next screen that appears is the Select List wizard.

     As per the original SQL statement, you don't need all the columns in the product table. You only need product_code, vendor_code, and product_line_code:
    Listing 24. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
    t.order_date = o.order_date and p.product_code = x.product_code and
    c.city = g.district_name
    GROUP BY 1,2,3,4;

    So, remove all the columns that are not product_code, vendor_code, or product_line_code:
    Figure 98. Columns product_code, vendor_code, and product_line_code remain
    Columns product_code, vendor_code, and product_line_code remain
  4. Click on Finish, and save changes.

Creating fifth source table, stores_demo:items

Now, let's make a fifth Table Source operator: the items table from stores_demo.

  1. Drag and drop the Table Source operator onto the canvas and name it items. The location will be remote database. The source database table will be stores_demo.admin.items, which means the table name is items and the schema name is admin. Connection will be from stores_demo.
  2. Select Next twice (there is no filter condition for items table).
  3. The next screen that appears is the Select List wizard.

     As per the original SQL statement, you don't need all the columns in the items table. You only need quantity, total_price, order_num, stock_num, and manu_code:
    Listing 25. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity* x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and p.vendor_code = i.manu_code and
    t.order_date = o.order_date and p.product_code = x.product_code and
    c.city = g.district_name
    GROUP BY 1,2,3,4;

     Only the item_num column is not needed, so remove just that one.
    Figure 99. Columns quantity, total_price, order_num, stock_num, and manu_code remain
    quantity, total_price, order_num, stock_num, and manu_code remain
  4. Click on Finish and save changes.

Creating sixth source table, stores_demo:orders

Now, let's make a sixth Table Source operator: the orders table from stores_demo.

  1. Drag and drop the Table Source operator onto the canvas and name it orders. The location will be remote database. The source database table will be stores_demo.admin.orders, which means the table name is orders and the schema name is admin. Connection will be from stores_demo.
  2. Select Next twice (there is no filter condition for the orders table).
  3. The next screen that appears is the Select List wizard. As per the original SQL statement, you don't need all the columns in orders table. You only need customer_num, order_num, and order_date:
    Listing 26. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
    t.order_date = o.order_date and p.product_code = x.product_code and
    c.city = g.district_name
    GROUP BY 1,2,3,4;

    So, remove the columns that are not customer_num, order_num, and order_date:
    Figure 100. Columns customer_num, order_num and order_date remain
    customer_num, order_num and order_date remain
  4. Click on Finish, and save changes.

Optional: Renaming the table source operators

Let's rename all the Table Source operators so you can easily tell which table comes from which source. For example, change the label of the orders Table Source operator from "orders" to "stores_orders". You can do this by clicking on any Table Source operator and modifying its properties at the bottom. This step is optional.

Figure 101. Renaming the table sources
Renaming the table sources

Adding the cost source

  1. Finally, you need to add the cost source. Recall the data flow you created for cost, populate_cost. You can drag and drop that data flow onto the canvas without recreating the job.
    Figure 102. Drag populate_cost from Data Flows folder onto the canvas
    Drag populate_cost from Data Flows folder onto the canvas
  2. A pop-up window warns you that this operation cannot be undone and asks you if you want to proceed. Click on Yes.
  3. The next window asks you to select the drag and drop method.
    Figure 103. Select drag and drop method
    Select drag and drop method
    Since only one option is available, which is copying the operators to the target flow, click on OK. The operators are now pasted onto the canvas.
    Figure 104. populate_cost operators in populate_sales data flow
    populate_cost operators in populate_sales dataflow
  4. The operators may not land where you want them.

     Let's relocate the populate_cost operators to the bottom. You can do this by clicking on any of its operators and dragging it downwards.
    Figure 105. Relocating populate_cost operators
    Relocating populate_cost operators
  5. After moving the group to the desired area, you can reposition the operators individually to tidy up the layout.
    Figure 106. populate_cost operators in place
    populate_cost operators in place
  6. As per the original SQL script loaddw.sql, only these columns are needed: product_code and cost.
    Listing 27. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
     p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
     t.order_date = o.order_date and  p.product_code = x.product_code and
     c.city = g.district_name
     GROUP BY 1,2,3,4;

    So the manu_code column is not needed.

    Click on the input port in the Data Station operator. You want to edit its virtual table properties to remove the manu_code column.
    Figure 107. Properties of data station operator's input port
    Properties of data station operator's input port
  7. Click on the virtual table tab in the properties.
  8. Delete the manu_code column by clicking on red cross (X).
    Figure 108. Deleting the manu_code column
    Deleting the manu_code column
    Figure 109. Final virtual table
    Final virtual table
  9. You want to confirm the manu_code column is not there in the result port. Select the result port and view its virtual table.
    Figure 110. Confirming manu_code is not in virtual table for result port
    Confirming manu_code is not in virtual table for result port
     manu_code is not there, so you have just validated your virtual table changes.
  10. Save.

Joining all source tables

The next step is to join all of those source tables.

  1. Drag and drop a Table Join operator onto the canvas.
  2. Connect all the outputs from the different sources to the input ports in the Table Join operator. Remember, you can add more input ports in the Table Join operator by clicking on "add a new port" icon inside the operator.

    However, you can go ahead and connect the output port to the Table Join operator without adding an input port manually: it will automatically add the input port for you.
    Figure 111. Adding a connection from sales_product output to table join but no input port
    Adding a connection from sales_product output to table join but no input port
    Figure 112. An input port automatically added as 'in3'
    An input port automatically added as 'in3'
  3. Continue to connect the remaining source tables.
    Figure 113. All source tables connected to table join operator
    All source tables connected to table join operator
  4. Now, let's configure the Table Join operator. Right-click on the operator itself > Show Wizard. Make sure you right-click on the title of the operator, not on any of its ports.
  5. Name it Table_Join_4_Sales.
  6. Click on Next.
  7. Here is where you need to make all the join conditions. Click on the ellipsis (...) button, and use the input columns and the operators to create a join equivalent to this one:
    Listing 28. Create join
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and p.vendor_code = i.manu_code and
    t.order_date = o.order_date and p.product_code = x.product_code and
    c.city = g.district_name
    GROUP BY 1,2,3,4;

     So, the condition in Listing 29 would work in the Join expression builder (the generated input names in your flow might differ):
    Listing 29. Create join
    "IN_017_0"."customer_num" = "IN5_017_5"."customer_num" AND 
    "IN4_017_4"."order_num" = "IN5_017_5"."order_num" AND 
    "IN3_017_3"."product_line_code" = "IN4_017_4"."stock_num" AND 
    "IN3_017_3"."vendor_code" = "IN4_017_4"."manu_code" AND 
    "IN2_017_2"."order_date" = "IN5_017_5"."order_date" AND 
    "IN3_017_3"."product_code" = "IN6_017_6"."product_code" AND 
    "IN_017_0"."city" = "IN1_017_1"."district_name"
    Figure 114. Creating the join condition
    Creating the join condition
  8. Click on OK.
    Figure 115. Join condition expression updated
    Join condition expression updated
  9. Click on Next.
    Figure 116. Select List wizard for table join operator
    Select List wizard for table join operator
  10. From all the columns listed on the right, identify the ones that are required in the result set. Notice that several are duplicates.

    As per the SQL in Listing 30, you will need: customer_num, district_code, time_code, product_code, quantity, total_price, and cost:
    Listing 30. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and p.vendor_code = i.manu_code and
    t.order_date = o.order_date and p.product_code = x.product_code and
    c.city = g.district_name
    GROUP BY 1,2,3,4;

    So, remove all the unnecessary columns.
  11. Change the data type for district_code to be SMALLINT, instead of SERIAL.
  12. Change the data type for customer_num to be INTEGER, instead of SERIAL.
  13. View the updated expressions for the result columns.
    Figure 117. Result columns updated for Select List wizard
    Result columns updated for Select List wizard
  14. Click on Finish. Save all changes.

Adding a group by operator

Now, you will add a group by operator so you can compute the required aggregates (using SUM).

  1. Drag and drop a group by operator onto the canvas.
  2. Connect the Inner join output from the Join operator with the input port of the group by operator.
    Figure 118. Table join operator connected to group by operator
    Table join operator connected to group by operator
  3. Let's configure the Group By operator by right-clicking on it > Show Wizard.
  4. Label the operator Group_By_4_Sales.
  5. Click on Next.
    Figure 119. Second screen in group by operator wizard
    Second screen in group by operator wizard
  6. The result set needs to have three aggregation/expression columns. As per the original SQL script:
    Listing 31. Needed columns
    insert into sales (customer_code, district_code, time_code, product_code,
    units_sold, revenue, cost, net_profit)
    select c.customer_num, g.district_code, t.time_code, p.product_code,
    SUM(i.quantity), SUM(i.total_price), SUM(i.quantity * x.cost),
    SUM(i.total_price) - SUM(i.quantity * x.cost)
    from stores_demo:customer c, geography g, time t, product p,
    stores_demo:items i, stores_demo:orders o, cost x
    where c.customer_num = o.customer_num and o.order_num = i.order_num and
    p.product_line_code = i.stock_num and  p.vendor_code = i.manu_code and
    t.order_date = o.order_date and  p.product_code = x.product_code and
    c.city = g.district_name
    GROUP BY 1,2,3,4;

     You will edit the cells for quantity, total_price, and cost to add expressions like the ones shown above (a consolidated SQL sketch of the whole aggregation follows this list).

    For the quantity column, you need to edit the expression to look like this:
    SUM(i.quantity)

     This column will also be renamed units_sold.
  7. Click on the ellipsis (...) button located next to quantity in the expression field.
    Figure 120. Modifying quantity column
    Modifying quantity column
  8. Build the SUM expression for quantity column.
    Figure 121. Building SUM expression using quantity column
    Building SUM expression using quantity column
  9. Click on OK.
  10. Rename the quantity column to units_sold.
    Figure 122. Modified quantity column
    Modified quantity column
  11. Do the same for the next aggregate expression, focusing on the total_price column:
    SUM(i.total_price)

     Rename the total_price column to revenue.
    Figure 123. Renamed total_price and SUM aggregate expression
    Renamed total_price and SUM aggregate expression
  12. Do the same for the next aggregate expression, focusing on cost column:
    SUM(i.quantity * x.cost)

    The column name remains the same.
    Figure 124. Aggregate SUM expression of quantity multiplied by cost
    Aggregate SUM expression of quantity multiplied by cost
  13. Finally, a fourth aggregate column is needed:
    SUM(i.total_price) - SUM(i.quantity * x.cost)

    This column will be named net_profit.

    To add a new column, click on the Add new button icon at the upper left corner of the Result Columns toolbox, as indicated by the following figure:
    Figure 125. Result Columns:Add New
    Result Columns:Add New
  14. Name this new column net_profit.
    Figure 126. Naming new column net_profit
    Naming new column net_profit
  15. Add the new aggregate expression for the net_profit column.
    Figure 127. SUM(i.total_price) - SUM(i.quantity * x.cost)
    SUM(i.total_price) - SUM(i.quantity * x.cost)
  16. Click on OK.
  17. Fix the data types where needed. You may find that customer_num, district_code, or both have the SERIAL data type.

     Fix them to be customer_num (INTEGER) and district_code (SMALLINT).
  18. According to the createdw.sql script, Listing 32 shows the data types of the columns in the sales table:
    Listing 32. Data types of sales table columns
    create table sales (
       customer_code integer,
       district_code smallint,
       time_code integer,
       product_code integer,
       units_sold smallint,
       revenue money (8,2),
       cost money (8,2),
       net_profit money(8,2));

     So, fix the data types in the aggregation to match the ones above, as shown in the following figure.
    Figure 128. Final result columns with expressions, column names, and data types updated
    Final result columns with expressions, column names, and data types updated
  19. Click on Next.
  20. You are in the Group By wizard screen now.

    Keep the columns that would go in the Group by clause, which are customer_num, district_code, time_code, and product_code. Remove the rest.

    Ensure district_code is SMALLINT and customer_num is INTEGER and fix them if necessary.
    Figure 129. Editing the group by fields
    Editing the group by fields
  21. Click on Next.
  22. You do not have a HAVING clause, so click on Finish, and save all changes.
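
For reference, the combination of the table join and this Group By operator corresponds to the SELECT portion of the original statement, with the result columns renamed as you did above. The following is only a sketch: in the data flow, the cost rows actually come from the reused populate_cost operators rather than from a direct query of the cost table.

select c.customer_num, g.district_code, t.time_code, p.product_code,
 SUM(i.quantity) as units_sold,
 SUM(i.total_price) as revenue,
 SUM(i.quantity * x.cost) as cost,
 SUM(i.total_price) - SUM(i.quantity * x.cost) as net_profit
 from stores_demo:customer c, geography g, time t, product p,
 stores_demo:items i, stores_demo:orders o, cost x
 where c.customer_num = o.customer_num and o.order_num = i.order_num and
 p.product_line_code = i.stock_num and p.vendor_code = i.manu_code and
 t.order_date = o.order_date and p.product_code = x.product_code and
 c.city = g.district_name
 group by 1, 2, 3, 4;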

Adding the table target operator

Finally, let's add a Table Target operator for the fact table, which is the sales table in the sales_demo database.

  1. Drag and drop the Table Target operator onto the canvas and name it sales. The location will be remote database.
  2. The target database table will be sales_demo.admin.sales, which means the table name is sales and the schema name is admin. Connection will be from sales_demo.
  3. Click on Finish, and save changes.
  4. Now, connect the output port from the Group By operator with the input port in the Table Target operator.
  5. Let's edit the mappings between the output and the input of this connection. Right-click on the Table Target Operator Sales > Show Wizard.
  6. The General screen appears. You have filled this out already. Click on Next.
  7. Check all the mappings. Notice the "Map From" field is empty for the customer_code. You need to map the customer_num column from the source table to the customer_code in the target table.
    Figure 130. "Map From" field is empty for customer_code
    'Map From' field is empty for customer_code
  8. Drag the customer_num column from the source and drop it in the empty field next to the customer_code column in the target.
    Figure 131. Fixing the mapping
    Fixing the mapping
  9. Check all the mappings now to see if there is anything else to fix.
    Figure 132. All columns mapped correctly
    All columns mapped correctly
  10. They seem okay. So, click on Finish. Save changes.
    Figure 133. populate_sales data flow completed
    populate_sales data flow completed

Validating the populate_sales data flow

  1. Let's execute the data flow by clicking on the green execution arrow at the top.
  2. Enter admin for SQL execution schema, and select sales_demo for SQL execution database.
  3. Click on Execute.
  4. Execution should succeed.
    Figure 134. Execution a success
    Execution a success

    Execution succeeded!

  5. Check contents of sales_demo:sales table:
    Listing 33. Check contents of sales_demo:sales table
    > select count(*) from sales;
    
    
          (count(*))
    
                 946
    
    1 row(s) retrieved.
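
Beyond the row count, you can also spot-check the aggregates that the Group By operator computed. The column names below come from createdw.sql; the totals depend on your source data, so no expected values are shown here:

select sum(units_sold) as total_units, sum(revenue) as total_revenue,
 sum(cost) as total_cost, sum(net_profit) as total_profit
 from sales;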

Congratulations! Next, you will create a control flow integrating all the data flows used to populate the sales_demo database.


Designing the control flow to populate sales_demo database for the first time

Control flow for loading the target database sales_demo

On success, unconditional, and on failure output ports

It is important to remember the following when connecting operators using the output and input ports available:

  • Each output port (on success, unconditional, and on failure) of a control flow operator can be used only once
  • Each input port (input) in a control flow operator can be used only once
  • Unconditional output ports get disabled in some cases when on success, on failure, or both ports have already been used

After creating and testing all the data flows for the individual tables in the target data mart, it is now time to design the whole ELT job sequence. This sequence includes the data movements and transformations, along with control flow operators that handle errors and perform additional tasks, such as sending e-mails, writing or transferring files, executing loops, running operating system commands or customized Informix SQL statements, updating statistics, invoking DataStage jobs, and more.

As explained in the introductory section of this tutorial, a control flow can comprise data flows in parallel or sequential execution, other control flows and subflows, and additional operators to link them together or handle additional control and error logic. For instance, you can control what should happen following a successful, failed, or unconditional execution of a data flow, control flow, or control flow operator.

The following is the list of activities to execute in the whole ELT job that populates all the tables in the sales_demo data mart for the first time:

  1. Truncate/empty all the tables in target data mart sales_demo.
  2. Load all dimension tables in parallel: time, customer, product, and geography.
  3. If all the dimension tables are loaded successfully, proceed to load the fact table sales. If there is a failure, write to a failure-log file.
  4. If the load of the sales fact table is successful, proceed to execute UPDATE STATISTICS on all tables in the target database sales_demo. If everything, including the UPDATE STATISTICS, succeeds, write the positive result to a log file.
  5. If there is a failure at any step, record it in the failure-log file.

The following steps implement this ELT job in the form of a control flow in Informix Warehouse Design Studio:

  1. Right-click on the Sales Demo DW-ELT project's Control Flows folder, and select New > Control Flow, as shown in Figure 135.
    Figure 135. Creating a new control flow in your data warehousing project
    Creating a new control flow in your data warehousing project
  2. The New Control Flow wizard appears. Change the control flow name to first_load_sales_demo, and also select Work against data models (Offline).
  3. Click on Finish.
  4. An empty control flow canvas appears for the new control flow first_load_sales_demo, with a start operator placed by default.

Customized SQL (IDS custom SQL) operator to empty the tables in sales_demo

Since this control flow is intended to serve the first full load of data into the data mart sales_demo, it is important to make sure that the tables you want to populate are empty before the load starts.

One way to do this is by executing a full DELETE or, preferably, a TRUNCATE on the affected tables. This could be done with different mechanisms from the individual data flows, or all at once at the level of the control flow.

For this tutorial, an IDS custom SQL operator is used to execute all the TRUNCATE [TABLE] table_name commands.

  1. Drag an IDS custom SQL operator from the Informix Operators palette on the right, and drop it onto the canvas for the first_load_sales_demo control flow, as Figure 136 shows:
    Figure 136. Adding a new IDS custom SQL operator into the first_load_sales_demo control flow
    Adding a new IDS custom SQL operator into the first_load_sales_demo control flow
  2. Right-click on the new IDS Custom SQL operator and select Show Wizard to display and edit the properties of this operator. When the General page of the IDS Custom SQL wizard appears, make these changes:

    • Change the Label to Empty_database.
    • Enter the Description as you wish.
    • In Informix database connection, select sales_demo from the drop-down list. This is the IDS database, against which you are going to run the TRUNCATE SQL commands.
    • Click on Next when done. Figure 137 illustrates the suggested inputs.
      Figure 137. Filling in General page of new IDS custom SQL operator
      Filling in General page of new IDS custom SQL operator
  3. The Diagnostics page appears. Here, you can configure the logging level to capture either both warnings and errors, or only errors that occur while processing the Informix SQL statements. You can also indicate the tracing level of the control flow (Methods, Contents, or Both). Finally, if you want to skip the execution of this particular operator, you can do so by marking the check box Set the execution status of this operator to successful and continue.

    Leave all the default settings as they are, and click on Next.
    Figure 138. Filling in diagnostics information for IDS custom SQL operator
    Filling in diagnostics information for IDS custom SQL operator
  4. The SQL Statements page is next. This is where you will enter the Informix SQL statements to delete all the data from the tables you intend to load.

     In the SQL Statements field, you can directly type in the TRUNCATE statements that delete all the data from the tables to be loaded: the dimension tables time, customer, product, and geography, and the fact table sales.

    You can click on the ellipsis (...) button to access the SQL Expression Builder if you need to build an expression.

     It is very important to place the TRUNCATE SQL statements in the correct order, taking care not to truncate a parent table while dependent (child) tables still reference its rows. Start with the child tables (such as the fact table sales), and then proceed with the parent tables (such as the dimension tables). See Listing 34. (A DELETE-based alternative is sketched after this list.)
    Listing 34. SQL Statements inside the IDS custom SQL to clean up all data in tables
    connect to "sales_demo";
    
    { The following series of SQL Statements will delete all rows in the tables }
    truncate sales;
    truncate time;
    truncate customer;
    truncate product;
    truncate geography;

    This is shown in Figure 139. Click on Finish, and save (Ctrl+S) changes.
     Figure 139. SQL Statements page of the IDS custom SQL operator
     SQL Statements page of the IDS custom SQL operator
  5. The IDS custom SQL operator is now on the canvas. You want the first step after the control flow starts to be emptying the target tables in sales_demo. Therefore, you need to connect the start operator with the IDS custom SQL operator in sequence.

     As depicted in Figure 140, connect the start output port of the start operator with the input port of the IDS custom SQL operator Empty_database. Then, save (Ctrl+S) the changes.
    Figure 140. Connecting start operator with IDS custom SQL operator
    Connecting start operator with IDS custom SQL operator
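
As mentioned earlier, a full DELETE is an alternative to TRUNCATE (for example, if TRUNCATE cannot be used on a particular table in your environment). A sketch of the same cleanup using DELETE statements, again starting with the child table, would look like this:

connect to "sales_demo";

{ Alternative cleanup: delete all rows, fact (child) table first }
delete from sales;
delete from time;
delete from customer;
delete from product;
delete from geography;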

Parallel container to run the dimension tables' data flows in parallel

The next step is to load all the dimension tables first (time, customer, product, and geography). Since they do not depend on each other to get loaded, they can run in parallel using a parallel container.

  1. Drag the parallel container operator from the Common Operators palette on the right, and drop it onto the canvas of the control flow first_load_sales_demo.
  2. Drag and drop a data flow operator from the Common Operators palette into the parallel container.
    Figure 141. Add a data flow into a parallel container
    Add a data flow into a parallel container
  3. Edit the data flow in the parallel container to associate it with an existing data flow. To do this, right-click on the data flow you just added, and select Show Wizard, as illustrated in Figure 142.
    Figure 142. Add a data flow into a parallel container
    Add a data flow into a parallel container
  4. The data flow operator's wizard appears, displaying the General page.

    • In Label, enter load_time to associate it with the populate_time data flow.
    • In Description, enter the desired description of this data flow.
    • In Data flow, browse for the populate_time data flow by clicking on the ellipsis (...) button. After selecting populate_time, click on OK. Figure 143 shows this procedure.
      Figure 143. Data Flow wizard and selecting the data flow to run
      Data Flow wizard and selecting the data flow to run
  5. Click on Finish, and then save all changes.
    Figure 144. Parallel container on canvas with first data flow placed
    Parallel container on canvas with first data flow placed
  6. Add the data flows for the remaining dimension tables to the parallel container:
    • data flow load_customer that is associated to existing data flow populate_customer
    • data flow load_geography that is associated to existing data flow populate_geo
    • data flow load_product that is associated to existing data flow populate_product

Setting properties for parallel container and connecting it with on-success flow sequence

Now you will rename the parallel container, review its properties, and connect it to the existing flow.

  1. Right-click on the parallel container that was just added, and select Show Wizard.
  2. The General page appears. Change the Label to Parallel_load_dim_tables. Also, make sure that all the activities inside the container run in parallel by verifying that Execution is set to Parallel. As for logging, this tutorial uses the same log file for all the activities in the container, but you could choose to use separate log files, one per activity. Figure 145 shows the suggested values. Click on Next.
    Figure 145. Setting General properties for parallel container
    Setting General properties for parallel container
  3. The Diagnostics page appears. Leave all the defaults for this tutorial, and click on Next.
  4. The Activities page appears. If you choose to execute the activities serially rather than in parallel, this is where you can change their execution order inside the container. Leave the order as it is, because these activities will run in parallel. Click on Finish.
  5. After all these changes, back on the canvas, connect the on success port (with the check mark) of the IDS custom SQL operator Empty_database with the input port (with the incoming arrow) of the new parallel container operator Parallel_load_dim_tables, as shown in Figure 146. Save changes.
    Figure 146. Parallel container on canvas and connected to the success-sequence of the control flow
    Parallel container on canvas and connected to the success-sequence of the control flow

Adding data flow operator to on-success control flow sequence, to load fact table Sales

After loading all the dimension tables (time, customer, geography, and product) in parallel, the next step is to load the fact table sales. The fact table uses information from tables in the stores_demo source database and from the input file cost.unl, but also from some of the dimension tables, which is why those had to be loaded first.

Follow the steps below to continue the on-success sequence of activities in the control flow with the step that loads the sales fact table by invoking the populate_sales data flow:

  1. Drag a data flow operator from the Common Operators palette on the right, and drop it onto the control flow canvas.
  2. Right-click on the new data flow operator, and select Show Wizard. Alternatively, edit the Properties page of the operator. On the General page, shown in Figure 147, change the label of the data flow to something meaningful, such as Load_Sales. In Data Flow, browse using the ellipsis (...) button to find and select the existing populate_sales data flow under the Data Flows folder of the data warehousing project. Click on Finish when done, and save changes.
    Figure 147. Data Flow wizard after having selected the data flow to run: populate_sales
    Data Flow wizard after having selected the data flow to run: populate_sales
  3. Back on the canvas, connect the on success output port of the parallel container Parallel_load_dim_tables with the input port of the Load_Sales data flow you just added to the flow. Figure 148 shows these operators connected. Save changes.
    Figure 148. Data Flow Load_sales on canvas and connected to on-success sequence of the control flow
    Data Flow Load_sales on canvas and connected to on-success sequence of the control flow

Adding an update statistics operator to the success-sequence of the control flow

After successfully loading all the dimension tables and the fact table in the target data mart sales_demo, the next step you could include in the control flow is to invoke a series of UPDATE STATISTICS commands in IDS, so the SQL optimizer has the latest information about the tables and data in the data mart, which helps query performance.

Follow the steps below to continue the on-success sequence of activities in the control flow with the step that invokes a customized UPDATE STATISTICS command:

  1. Drag an update statistics operator from the Informix Operators palette on the right, and drop it onto the control flow canvas.
  2. Right-click on the new update statistics operator, and select Show Wizard. Alternatively, edit the Properties page of the operator. Edit the General page, as follows and shown in Figure 149:
    • Change the Label of the operator as you wish; in this case Update_Statistics_Sales_Demo.
    • In the Database connection, select the target sales_demo database from the drop-down list.
    • In the Update Statistics on field, select Both tables and routines. This is the most flexible option for running customized update statistics commands: if you choose Tables, you can only select a single table to update statistics for, and if you choose Routines, you can only update statistics for procedures. With Both tables and routines, Design Studio provides the additional option of entering the SQL commands for the update statistics yourself, which is what you want here (entering the individual UPDATE STATISTICS commands per table).
    Figure 149. General page settings for the update statistics operator
    General page settings for the update statistics operator
  3. Click on Next. No changes are needed on the Diagnostics page of the update statistics operator at this point, so click on Next again to reach the Update Statistics Statement page.
    • Change the Statement type to Customized.
    • In Customized statement, enter the commands in Listing 35, or the ones you prefer, based on best practices for the optimizer (an example of varying the statistics mode per table is sketched after this list).
     Listing 35. Customized UPDATE STATISTICS statements for the target database
    connect to "sales_demo";
    
    { Update statistics to run at the target database after loading the data }
    UPDATE STATISTICS FOR TABLE time;
    UPDATE STATISTICS FOR TABLE customer;
    UPDATE STATISTICS FOR TABLE geography;
    UPDATE STATISTICS FOR TABLE product;
    UPDATE STATISTICS FOR TABLE sales;

    Figure 150 shows these suggested instructions for the customized Update Statistics statements. Then, click on Finish, and save the changes.
    Figure 150. Customized SQL statements for update statistics operator
    Customized SQL statements for update statistics operator
  4. Back on the control flow canvas, connect the new update statistics operator with the rest of the on-success flow: connect the on success port of the data flow Load_Sales with the input port of the update statistics operator Update_Statistics_Sales_Demo, and save the changes, as depicted in Figure 151:
    Figure 151. Update statistics operator connected to the on-success sequence of control flow
    Update statistics operator connected to the on-success sequence of control flow
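
As an illustration of the kind of tuning you can do in the customized statement, Informix also lets you choose the statistics mode per table or per column. The following is only a sketch: the modes and column lists shown here are assumptions that you should adapt to your own data volumes and tuning guidelines.

connect to "sales_demo";

{ Example of mixing statistics modes in a customized statement }
UPDATE STATISTICS MEDIUM FOR TABLE sales;
UPDATE STATISTICS HIGH FOR TABLE sales (customer_code, product_code);
UPDATE STATISTICS LOW FOR TABLE time;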

Adding on-success file write operator at the end of the successful control flow sequence

Considerations about writing to a file

The file write operator should be created so that it can be tested in the development environment using Design Studio, but it must also keep working once the data warehousing project's package has been deployed. In other words, it must still be valid in production, when it is executed on the Informix Warehouse server machine by the runtime services scheduled through the Admin Console. Therefore, the File name field of the file write operator, which is actually the absolute path to the file, should be a valid path in the production environment where the Informix Warehouse SQW services run, even if the file does not exist initially and is created by the operator's execution.

Finally, on the success sequence of the control flow that populates the whole data mart sales_demo for the first time, you would like to write to a success-log file, adding an entry that says the whole ELT job sequence executed successfully, along with the date and time it finished. To do this, you will add one last operator, a file write, that appends a new line about the successful execution to a predetermined file.

The file write operator you are using will automatically create the file you want to write to if it does not exist at control flow runtime.

In order to do this, follow the steps below.

  1. Drag a file write operator from the Common Operators palette on the right, and drop it onto the control flow canvas.
  2. Right-click on the new file write operator, and select Show Wizard. Alternatively, edit the Properties page of the operator. Edit the General page as follows:
    • Change the Label of the operator to Success_File_Write.
    • In the Description, add the one you need. In this case, Writes/appends a new entry into the successful execution log file, with execution timestamp.
    • In the Write to, select non-default option Specified file, so you can enter your own file next.
    • In the File name, enter the absolute path to the file where you want to write the new entry for the successful execution. You can use the Browse button to locate and/or create the desired file. In this tutorial, enter the absolute path to the file: C:\temp\IW_tutorial_files\sales_demo_success.log

      This file does not exist yet, so instead of using the Browse button on the right side, type the path in directly. Informix Warehouse will create the file at control flow execution time if it does not exist.

    • In the Write to file by option, you have three possibilities:
      • Overwriting text in existing file, if you are replacing the file's contents with new contents
      • Appending text to end of file, if you are appending text at the end of the file without starting a new line
      • Appending text as a new line at end of file, if you are adding a new entry at the end of the file, as a new line

      Select: Appending text as a new line at end of file.

    • In the Text field, you will combine a fixed text with a variable value:

      First, type the fixed text Successful sales_demo control flow execution at: into the Text field, and leave a blank space at the end of it.
    • Then, click on the button to the left of the Text field, and choose Use Variable (instead of Use Fixed Value), as shown in Figure 152, since you are going to append a variable value (the current timestamp) to the fixed text you just typed.
      Figure 152. General page settings for the file write operator
      General page settings for the file write operator
    • Then, click on the browse (...) button on the right side to open the Variable Selection dialog.
    • Inside the Variable Selection dialog, select the predefined variable CURRENT_TIMESTAMP, and click on Append, as shown in Figure 153.
      Figure 153. Variable selection dialog, to select CURRENT_TIMESTAMP
      Variable selection dialog, to select CURRENT_TIMESTAMP
  3. The text would look like: Successful sales_demo control flow execution at: ${RESERVED/CURRENT_TIMESTAMP}

    Figure 154 shows the General page with these suggested inputs. No need to click on Next because you are not going to configure any Diagnostics properties for this operator. So, you can directly click on Finish.
    Figure 154. General page final settings for the file write operator
    General page final settings for the file write operator
  4. Back on the control flow canvas, connect the new file write operator Success_File_Write with the rest of the on-success flow: connect the on success port of the update statistics operator Update_Statistics_Sales_Demo with the input port of the file write operator Success_File_Write, and save the changes, as depicted in Figure 155.
    Figure 155. File write operator for success log connected to the on-success sequence of control flow
    File write operator for success log connected to the on-success sequence of control flow

Setting on-failure actions for all the operators that may fail

You can decide what to do if a failure happens when executing any of the operators within the control flow. Among the possible actions, you could run any of the available common operators or Informix operators. In this tutorial, you will use the same kind of file write operator for all the on failure cases, reporting, in the form of an entry in the same log file, that a failure was encountered while executing the control flow.

To do this, you will first add one file write operator and then replicate it across the rest of the on failure ports in the control flow.

  1. Drag a file write operator from the Common Operators palette on the right, and drop it onto the control flow canvas, as shown in Figure 156.
    Figure 156. Adding a file write operator into the control flow on-failure sequence
    Adding a file write operator into the control flow on-failure sequence
  2. Right-click on the new file write operator, and select Show Wizard. Alternatively, edit the Properties page of the operator. Edit the General page as follows:
    • Change the Label of the operator to Failure_File_Write.
    • In the Description, add "Writes/appends a new entry into the failed execution log file, with execution timestamp".
    • In the Write to, select non-default option Specified file.
    • In the File name, enter the absolute path to the file where you want to write the new entry for the failed execution. The absolute path should look like this: C:\temp\IW_tutorial_files\sales_demo_failure.log

      Informix Warehouse will create the file at the control flow execution runtime if it does not exist.

    • In Write to file by option, you have three possibilities:
      • Overwriting text in existing file, if you are replacing the file's contents with new contents
      • Appending text to end of file, if you are appending text at the end of the file without starting a new line
      • Appending text as a new line at end of file, if you are adding a new entry at the end of the file, as a new line

      Select: Appending text as a new line at end of file.

    • In the Text field, you will combine fixed text with a variable value:

      First, enter the fixed text Failed sales_demo control flow execution at: in the Text field. At the end of this fixed text, leave a blank space.
    • Then, click on the button to the left of the Text field, choose Use Variable (instead of Use Fixed Value), and in the Variable Selection dialog, select the reserved variable CURRENT_TIMESTAMP. Click on Append in that window.
    • The resulting text should look like this: Failed sales_demo control flow execution at: ${RESERVED/CURRENT_TIMESTAMP}

      Figure 157 shows the General page with these suggested inputs. Click on Finish.
      Figure 157. General page final settings for the file write operator
      General page final settings for the file write operator
  3. Back on the control flow canvas, connect the new file write operator Failure_File_Write to the rest of the on-failure flow: connect the on-failure port of the IDS Custom SQL operator Empty_Database to the input port of the file write operator Failure_File_Write, and save the changes, as depicted in Figure 158:
    Figure 158. File write operator for failure log connected to first on-failure sequence of control flow
    File write operator for failure log connected to first on-failure sequence of control flow
  4. Replicate as many Failure_File_Write file write operators as needed to connect to all the on-failure ports of the other operators on the canvas. To replicate the steps above easily, right-click on the existing file write operator Failure_File_Write and select Copy. Then, for each on-failure case you want to handle with this same file write operator, right-click on a blank area of the control flow canvas, select Paste, and connect the on-failure port of an operator that may fail to the newly pasted file write operator. Do this for all the operators that may fail. Rather than responding to every failure in the same single way, you may also choose different actions or operators depending on the type of operator that fails; in some cases, for instance, you may want to send an email notifying of the failure, or write a row into a table (a sketch of the table-based alternative follows this list). Figure 159 shows the final canvas with all the successes and failures taken care of. Save all changes.
    Figure 159. Final canvas for control flow first_load_sales_demo
    Final canvas for control flow first_load_sales_demo
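
As an alternative to the flat-file approach, you could record failures in a database table by connecting the on-failure ports to an IDS Custom SQL operator instead of a file write operator. The sketch below shows the kind of SQL such an operator might run; the table flow_run_log and its columns are hypothetical names used only for illustration, not objects defined by this tutorial.

    -- Hypothetical log table for recording control flow outcomes
    CREATE TABLE flow_run_log (
        flow_name   VARCHAR(64),
        run_status  VARCHAR(16),
        run_time    DATETIME YEAR TO SECOND
    );

    -- Statement an IDS Custom SQL operator could run from an on-failure port
    INSERT INTO flow_run_log (flow_name, run_status, run_time)
    VALUES ('first_load_sales_demo', 'FAILED', CURRENT YEAR TO SECOND);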

Testing execution of the control flow first_load_sales_demo

Finally, it is time to test, in your development environment, the execution of the whole control flow first_load_sales_demo, to make sure it correctly handles the entire sequence of data flows, logic, and error handling involved in loading all the tables of the target data mart sales_demo for the first time.

In order to do this execution test, follow the steps below.

  1. Make sure the control flow you want to test is open and shown as the active tab on the right side of the workspace, and that you have saved all the changes.
  2. Validate the control flow so that data type mismatches and other kinds of issues are identified and resolved. To validate the control flow, right-click on the control flow first_load_sales_demo located under the Control Flows folder in your project explorer, and click on Validate.
  3. Ideally, the Design Studio validate function reports no warnings or errors, and you receive a message saying "Validation succeeded".
  4. Use the Execute button on the upper toolbar of Design Studio, or the menu options Control Flow > Execute to start the execution of the control flow.
  5. The Flow Execution wizard will appear. As with data flows, the execution of a control flow has general, diagnostics, resources, and variables settings that configure different execution options. Leave the default values and click on the Execute button.
  6. A dialog indicating that the flow operation is in progress will appear. You can move it a little lower so you can watch the behavior on the canvas during execution. You will notice that the operators on the canvas are colored: yellow while they execute, red after execution if they failed, or green after execution if they succeeded. This coloring is not shown if you close the in-progress dialog and let the execution run in the background. Figure 160 shows a partial execution state in which the parallel container operator is being processed and the operator before it, Empty_Database, has completed successfully.
    Figure 160. Execution in progress for control flow first_load_sales_demo
    Execution in progress for control flow first_load_sales_demo
  7. Once the execution of the control flow completes, ideally you see all the operators in the on-success sequence of the control flow colored green, as in Figure 161.
    Figure 161. Flow execution results on canvas for successful control flow execution
    Flow execution results on canvas for successful control flow execution
  8. The Execution Result window appears; it should report a successful execution. Click on OK to close this window.
  9. Verify directly on the Informix instance that contains the database sales_demo that all its dimension tables and the fact table are populated:
    Listing 36. Verify dimension and fact tables are populated
    dbaccess sales_demo -
    
    Database selected.
    
    > select count(*) from time;
    
    
          (count(*))
    
                  92
    
    1 row(s) retrieved.
    
    > select count(*) from customer;
    
    
          (count(*))
    
                  28
    
    1 row(s) retrieved.
    
    > select count(*) from product;
    
    
          (count(*))
    
                  74
    
    1 row(s) retrieved.
    
    > select count(*) from geography;
    
    
          (count(*))
    
                  17
    
    1 row(s) retrieved.
    
    > select count(*) from sales;
    
    
          (count(*))
    
                2311
    
    1 row(s) retrieved.
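
Because the control flow finishes by running UPDATE STATISTICS on sales_demo, you can optionally cross-check these counts against the row counts recorded in the system catalog. This is an extra check, not a tutorial step; the nrows values in systables reflect the counts captured when statistics were last updated, so after a successful run they should match the results above.

    -- Optional cross-check: row counts recorded by UPDATE STATISTICS
    SELECT tabname, nrows
      FROM systables
     WHERE tabid > 99 AND tabtype = 'T'
     ORDER BY tabname;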

Congratulations! This concludes this tutorial, and you are now ready to package your data warehousing project for deployment.


Conclusion

This second tutorial has allowed you to get familiar with designing and testing the data flows and control flows in the Informix Warehouse Feature using its client component, Design Studio.

It also introduced you to the ELT capabilities of Design Studio and guided you through the step-by-step process of designing data movements, transformations, and loads in the form of data flows and, finally, control flows composed of different operators to do the job. These were used to populate a data mart on Informix for the first time, sourced from an operational database and flat files.

The next and final part of this tutorial series will show you how to prepare the data warehousing application (in this case, containing a control flow to load the data mart for the first time) for deployment using Design Studio, deploy it to a production system using the Administration Console, schedule the ETL job in the form of control flows to run at a specific date and time, and, finally, monitor it while learning how to troubleshoot and inspect for errors.


Downloads

Description | Name | Size
DDL Script to add foreign keys and primary keys | sales_demo_alter.zip | 10KB
UNL file to add more rows into stores_demo:orders | add_orders.zip | 10KB
UNL file to add more rows into stores_demo:items | add_items.zip | 10KB
UNL data file to populate Time dimension table | time.zip | 10KB
UNL data file to populate Cost staging table | cost.zip | 10KB
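
If you want to inspect or load one of these UNL files manually, outside of the Design Studio flows, dbaccess can load a pipe-delimited .unl file directly into a table. The sketch below assumes the archive extracts to a file named time.unl whose field order matches the time dimension table; adjust the file name and target table for the other downloads.

    -- Load a pipe-delimited UNL file into the time dimension table with dbaccess
    -- (time.unl and its column order are assumed to match the table definition)
    LOAD FROM 'time.unl' DELIMITER '|' INSERT INTO time;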
