The Informix TimeSeries DataBlade was designed to help Informix users better handle time-related data. Using the object-relational technology built into the Informix Dynamic Server, the TimeSeries DataBlade reduces the size of the data, speeds up processing across that data, and provides a new set of tools for analyzing it. As the User's Guide presents, the performance gains can be remarkable. Space can be reduced by 55%. Processing can be up to 30 times faster. It is truly a useful technology for a specific data problem.
The TimeSeries DataBlade works on large sets of date-time-based data. Examples are: stock/financial data (this is the primary example in the User's Guide), audit trails or logs, monitoring statistics, weather, seismic readings, and billing. The TimeSeries DataBlade can be paired with the Real-Time Loader to handle huge volumes of streaming, real-time data. Because of the difficulty of migrating TimeSeries data and schemas, TimeSeries is best used with known data schemas that are unlikely to change.
For the examples in this article, I will use annual baseball statistics for all players since 1882. I used the baseball statistics provided by The Baseball Archive (http://www.baseball1.com/). Here's a glimpse at what the TimeSeries does to the schema of the data. Notice that a table may have multiple TimeSeries columns.
Listing 1. A Standard Date-Time indexed table
CREATE TABLE playerstats (
playerID char(9),
year int, --(or date or datetime)
atbats int,
hits int,
runs int,
walks int,
etc...
);
CREATE UNIQUE INDEX ix_playerstats on playerstats (playerID, year); |
Listing 2. Playerstats with TimeSeries columns
CREATE TABLE playerstats (
playerID char(9),
hitting TimeSeries(batstat_t),
pitching TimeSeries (pitchstat_t)
);
CREATE UNIQUE INDEX ix_playerstats on playerstats (playerID); |
The team -- basic TimeSeries structures
There are four basic structures for building a TimeSeries. They are: the Calendar and Calendar Pattern, the Subtype (or ROW type), the Container, and the TimeSeries itself. A TimeSeries can be either regular or irregular. A regular TimeSeries has data at known and defined time intervals. At every interval there is an entry. Irregular TimeSeries, as you might guess, are for data that doesn't recur at known intervals. In the baseball example below, I use a regular TimeSeries. For each year, on October 1st, I have recorded the full year's stats for each player. If, however, your data is capturing live stock trade transactions, it would likely be an irregular TimeSeries. You must decide on the type of TimeSeries before creating it.
TimeSeries data is more compact and faster to access precisely because it does not actually store the date and time with the data. All the data is stored as an offset to a known beginning date called the origin. Each TimeSeries column can have a different origin. The DataBlade can quickly translate between DateTimes and the offset values. Thus, whole columns of data and the associated indices are not stored in the database. This is the basic idea that powers the DataBlade to its substantial performance improvements.
The offsets are controlled by Calendars and Calendar Patterns. The Calendar Pattern defines the valid intervals for data. The Calendar combines the Pattern with the earliest possible start date. You can have as many Calendars and Calendar Patterns as your data requires. They are just rows in system tables created by the TimeSeries DataBlade. Here are some examples:
Listing 3. Calendars and calendar patterns
INSERT INTO CalendarPatterns VALUES (‘weekly’,
‘{1 on, 6 off}, day’)
INSERT INTO CalendarPatterns VALUES (‘work_hours’,
‘{24 off, 120 on, 24 off}, hour’)
INSERT INTO CalendarTable VALUES (‘week’, ‘startdate(
1998-01-01 00:00:00.00000), pattstart(
1998-01-02 00:00:00.00000), pattname (weekly)’) |
There are some details to notice here. First, in the TimeSeries DataBlade, all the datetimes are handled as datetime year to fraction(5), regardless of the precision of your calendar patterns. Second, the functions and tables provided by the TimeSeries DataBlade often use specially formatted lvarchar strings as parameters or column values. This does make syntax checking more difficult. Use the documentation to make sure you get them exactly correct. Finally, notice that the calendar and the calendar pattern can start on different dates.
The next basic structure is the subtype or ROW type. This is a user-defined datatype that defines the columns or elements of data to be stored in the TimeSeries. This is where the TimeSeries DataBlade takes advantage of the IDS object-relational architecture. This ROW type must start with a datetime year to fraction(5) column. This column is not actually stored in the database, but it is required so that the DataBlade can translate between offsets and datetimes. The rest of the ROW type can be columns of any datatype, including other user-defined types. Listing 4 is an example:
Listing 4. Creating the ROW type
CREATE ROW TYPE batstat_t (
statyear datetime year to fraction(5),
atbats int,
hits int,
runs int,
walks int,
etc...); |
As a naming convention, I add the "_t" to the type name so that I know it is a ROW type. In many of the TimeSeries functions, there is a mix of TimeSeries types and TimeSeries column names as parameters. The examples in the documentation are unclear because they don't follow any naming convention. In the current release of IDS (version 10), there is no ALTER ROWTYPE statement. That is why it is crucial to get the schema of the type correct from the beginning. Otherwise, you have to build a new ROW type and a new table to hold an updated TimeSeries schema.
The final basic structure for building TimeSeries data is the Container. The TimeSeries Container describes how the data is stored on disk. It is a named space within a dbspace that holds the data for one TimeSeries ROW type. It has an initial size and a growth parameter. Here's the syntax for creating the container: EXECUTE PROCEDURE TSContainerCreate( ‘name_cont’, ‘dbspace_name’, ‘subtype_t’, initial_size_int_kb, growth_increment_int_kb); TimeSeries data will stay with the table until it exceeds the row size for the table (or a user-defined limit). It then gets promoted to the Container. Once promoted, the data will always stay in the Container.
The User's Guide has performance suggestions for the Container. There should be one container per physical disk, only one concurrent updating user per container, and each container should exist in a separate dbspace. Obviously, the requirements of your application will dictate acceptable performance. Since my process is a weekly batch process run on the weekends, I get acceptable performance with all of my Containers in the same dbspace.
The Lineup -- assembling the TimeSeries
The following example puts all of the basic structures together. Then we will look at what we can do with the data.
Listing 5. Building the baseball example
-- Create the calendar pattern and calendar
INSERT INTO CalendarPatterns VALUES (‘annual’, ‘{1 on}, year’);
INSERT INTO CalendarTable (c_name, c_calendar) VALUES (‘yearly’,
‘startdate(1882-10-01 00:00:00.00000),
pattstart(1882-10-01 00:00:00.00000), pattname (annual)’);
-- Create the row types describing the data
CREATE ROW TYPE batstat_t (statyear datetime year to fraction(5),
atbats int, hits int, walks int, ...);
CREATE ROW TYPE pitchstat_t (statyear datetime year to fraction(5),
inningspitched inning_t, hits int, earnedruns int, ...);
-- Create the TimeSeries Container
EXECUTE PROCEDURE TSContainerCreate( 'ts_baseball_bat',
'tscontainer', 'batstat_t', 1000, 1000);
-- Create the TimeSeries table
CREATE TABLE playerstats (playerid int,
hitting TimeSeries(batstat_t), pitching TimeSeries(pitchstat_t) ); |
Now the data must be loaded into the TimeSeries table. The baseball data comes from a large data file. I wrote a .NET program to parse through the file and build INSERT statements for the data. The User's Guide describes many other ways to load data into a TimeSeries. As a standard in my production application, I insert the row with an empty TimeSeries first. Then I use the PutElemNoDups() function provided by the DataBlade. Listing 6 is a sample:
Listing 6. Inserting new rows into a TimeSeries table
INSERT INTO playerstats (playerid, hitting) VALUES (new_id,
TSCreate (‘yearly’, origin_date, threshold_num, 0,
initial_element_num, ‘container’));
UPDATE playerstats
SET hitting = PutElemNoDups(hitting,
ROW('2004-10-01 00:00:00.00000','SFN',147,373,...)::batstat_t)
WHERE playerid = 'bondsba01'; |
Play ball! -- accessing TimeSeries data
How can we retrieve the data now tucked away in the TimeSeries? Here is how to retrieve one element from the TimeSeries for one player and one year:
Listing 7. Selecting one TimeSeries element
SELECT GetElem(hitting,'2004-10-01 00:00:00.00000').hr FROM playerstats WHERE playerid = 'bondsba01'; Results: 45 |
Use the dot notation (".hr") to retrieve one element from the batstat_t ROW type. If no element name is included, GetElem returns the whole batstat_t ROW type. The DataBlade provides other retrieval functions such as: GetFirstElem(), GetLastElem(), GetClosestElem(), and GetNextValid(). There is also a useful Clip() function to extract a ranged subset of the TimeSeries as another TimeSeries.
SQL functions and commands are not the only tools for working with TimeSeries data. The DataBlade includes a wealth of Java and C functions, too. The C functions are the most complete and offer the fastest access. The Java and C functions can be used in external programs, or directly inside the engine with User Defined Routines (UDRs). In this introduction, though, I will limit the discussion to the SQL functions. See the User's Guide and the resources for more help with these advanced topics.
Here is a more complicated SELECT statment to calculate and retreive all the batting averages for players in 2004 who had more than 300 at-bats:
Listing 8. Selecting a set of data
SELECT playerid, (GetElem(hitting,'2004-10-01 0:00:00.00000’ ).hits / GetElem( hitting, '2004-10-01 00:00:00.00000').atbats) as ba FROM playerstats WHERE GetElem(hitting,'2004-10-01 00:00:00.00000').atbats > 300; |
There are some significant restrictions here. GetElem requires a date, so the query cannot be run over all years and all players. And the constant re-typing of the GetElem function's syntax is wordy and annoying. There are some remedies, but first, let's look at some of the power of TimeSeries. The DataBlade provides a catch-all Apply function. Apply can execute a SQL expression or function across a TimeSeries and return another TimeSeries. How is this useful? Here is another way to calculate batting averages:
Listing 9. Using apply() -- Complicated
SELECT * FROM table ((SELECT TSColNumToList(
(Apply('$hits/$atbats', hitting)::TimeSeries(one_real_t)),
1)::list (real not null)
FROM playerstats
WHERE playerid = 'bondsba01'));
Results:
0.22276029
0.261343
0.2825279
0.24827586
... |
Yes, this is a mess of TimeSeries functions, casts (the :: operator), and other syntactic nastiness. These types of TimeSeries queries are not easy to write. And functions like TSColNumToList only return one column of data. There has to be a better way. There are a couple, in fact. One is the extremely useful TSSetToList function in Listing 10 below. Another is virtual tables, which are discussed in the next section.
Listing 10. TSSetToList() -- Useful
SELECT * FROM table
((SELECT TSSetToList(hitting)::list(batstat_t not null)
FROM playerstats WHERE playerid = 'bondsba01'));
Results:
statyear team games atbats runs hits doubles ...
1986-10-01 00:00:00.00000 PIT 113 413 72 92 26
1987-10-01 00:00:00.00000 PIT 150 551 99 144 34
1988-10-01 00:00:00.00000 PIT 144 538 97 152 30
1989-10-01 00:00:00.00000 PIT 159 580 96 144 34
... |
This returns regular, tabular data and is not too complicated to understand. Basically, the inner select creates a LIST (an Informix collection type) of the data as defined by the batstat_t ROW type. The LIST is transformed into a table by the outer SELECT FROM TABLE syntax.
The TimeSeries DataBlade provides another tool to cut through some of the complexity of working with TimeSeries data -- routines to build virtual tables. This functionality is built upon the virtual table interface (VTI) provided by IDS. A virtual table allows us to work with the TimeSeries data more like a "normal" table. It is like a view; there is no duplication of the underlying data. Virtual tables can also be used for loading TimeSeries data. Virtual tables allow SELECT and INSERT statements, but not UPDATE or DELETE. INSERTS, though, will operate as an update if the data already exists. Thus, the limitation is the absence of DELETE. Another limitation is that you cannot create indexes on TimeSeries virtual tables. Listing 11 shows how to build virtual tables for the baseball example:
Listing 11. Creating virtual tables
EXECUTE PROCEDURE TSCreateVirtualTab(‘playerhitting_virt’, ‘playerstats’, ‘calendar(yearly), origin(1882-10-01 00:00:00.00000), container( ts_baseball_bat)’, 0, ‘hitting’); EXECUTE PROCEDURE TSCreateVirtualTab(‘playerpitching_virt’, ‘playerstats’, ‘calendar(yearly), origin(1882-10-01 00:00:00.00000), container( ts_baseball_pitch)’, 0, ‘pitching’); SELECT * FROM playerhitting_virt; Result: Playerid Statyear Team Games AtBats aardsda01 2004-10-01 00:00:00.00000 SFN 11 0 ... aaronha01 1954-10-01 00:00:00.00000 ML1 122 468 ... aaronha01 1955-10-01 00:00:00.00000 ML1 153 602 ... |
The TSCreateVirtualTab() function creates the virtual table. The first parameter is the new table name. Again, I created a naming convention. I add "_virt" to the end of the virtual table's name, so that I know it is a virtual table. The second parameter is the TimeSeries table name. The third parameter (shown in bold) is an optional description of the empty TimeSeries to create if a new row is inserted into the virtual table. Again, it is a specially formatted lvarchar. Notice that the example creates a separate virtual table for each TimeSeries column. With the virtual tables in place, you can build traditional queries. Here is one that calculates the batting average and slugging percentage for all players since 1905:
Listing 12. Queries on virtual tables
SELECT playerid, year(statyear), round(hits/atbats, 3) as ba, round(((hits + 2* doubles + 3*triples + 4*hr)/atbats),3) as slg FROM playerhitting_virt WHERE atbats > 300 AND year(statyear) > 1905 ORDER BY 4 DESC; Results: playerid (expression) ba slg bondsba01 2001 0.328 1.088 bondsba01 2004 0.362 1.013 bondsba01 2002 0.370 0.995 gehrilo01 1927 0.373 0.966 |
After seeing the TimeSeries functions and casts and other elements, a virtual table looks like an easy choice. However, there are some drawbacks with virtual tables. The primary one is performance. It can be very slow to do queries across a virtual table, especially if you use one of the TimeSeries elements in the WHERE clause. In my tests, it can be an order of magnitude slower than a TimeSeries function that does the same query. Also, I have occasionally encountered strange results from virtual table queries that do not occur when I use the direct TimeSeries functions. Virtual tables are great for quick queries, loading data, and for use in report writers and other end-user tools (if the performance doesn't kill you). For speed, though, you must dig in and build the more complicated TimeSeries queries. A tool like ServerStudioJE's Advanced SQL Editor becomes quite helpful.
To close this introduction to TimeSeries, I'd like to offer you four implementation tips from my 3+ years of experience with the TimeSeries DataBlade. Call it an intentional walk to get you to first base. Ball one: Be very sure of your schema. As noted above, there is no ALTER ROW TYPE statement. Migration from one TimeSeries schema to another is tricky and time-consuming. Ball two: Watch the TimeSeries system tables like tsinstance. They can quickly multiply into a large number of extents. Rebuild them with better extent sizing. Ball three: Use virtual tables for convenience, not speed. Ball four: Build stored procedures to handle the details of creating and updating your TimeSeries. There are several steps to getting it right and several performance-related parameters to remember; write them once and get it right. Take your base!
Learn
- TimeSeries DataBlade product page
- TimeSeries DataBlade User's Guide: The official IBM Informix documentation. Includes examples for most of the DataBlades functions (SQL, Java, and C).
- From A-Z of DataBlades: How to build your own DataBlade extensions to IDS.
- Informix Infocenter: Gateway to all the Informix documentation including several volumes on extensible types and DataBlades.
- Developing Object-Relational DataBase Applications: Paul Brown is one of the main architects of Informix object-relational technology. His articles and books are home runs.
- Open-Source Components for Informix Dynamic Server 9.x: A book by Paul Brown, Jacques Roy, and others. Jacques is another IBMer who has done extensive work with extensibility in IDS.
Get products and technologies
- Informix product family trials and downloads: Obtain evaluation and trial software, drivers, fixes and patches for IBM Informix products.
Discuss
- Participate in the discussion forum.
- Discuss Informix Dynamic Server and related topics in the IDS user and developer discussion forum.
Comments (Undergoing maintenance)






