© 2002 International Business Machines Corporation. All rights reserved.
In today's environment of distributed transactions, Enterprise JavaBeansTM and application servers talking to nameless Object Request Brokers, the concept of centralized processing may seem rather old-fashioned and perhaps too client-server, or too mainframe-like. I would argue that databases, while possibly thought of as commodities, should in fact be regarded as a framework for content exploitation in many ways superior to the capacities of plug-and-play distributed applications. Distributed computing is evolving from a predominately client-server model towards a thin-client model with a standard user interface such as a Web browser or software agent. A natural consequence of this evolution is a requirement for progressively more functionality to move from the client to the server.
Database engineers have had almost 30 years to hone, fine tune and develop methods for storing, retrieving, and locating objects using more sophisticated, space-efficient, and response-time aware techniques than ever before. The main difference in today's storage requirement however, is the increase in complexity of data that needs accessing. In fact, the plethora of data mining tools and statistical analysis engines that require a database would imply that the objective has always been to exploit the value contained within a database, with the database often considered nothing more than a sophisticated file system. This perception is certainly a mistake since technology exits today to apply innumerable operations on the data within the server - that is, there is no requirement to leave your Excel or JDBC interface if the database can provide the operations for you. For example, image recognition, statistical analysis, complex mathematical operations, encryption, watermarking, and audio analysis are all being done inside IBM® Informix® databases using an object-relational approach. This is provided via server-based user-defined data types and routines. With these "DataBladeTM modules" (known as "DataBlades," or "blades" for short), new and interesting functionality can be plugged in on demand to deal with concepts and requirements previously unheard of.
This article shows how easy it is to create a DataBlade module using a small amount of C and SQL to create a Soundex data type. This will allow character strings to be stored and found later using a sound search algorithm. To demonstrate how powerful this can be, its functionality is then extended to include regular expression matching on sounds.
In its simplest terms a DataBlade can best be defined as a number of domain-specific data types and any functionality associated with those types. For example, a spatial DataBlade might contain data types such as Point, Line, and Polygon. Some logical functions associated with these might be Contains() and Intersects(). An encryption DataBlade might include data types such as PlainText, CryptText, and Key. Functions associated with these might include Encrypt(), Decrypt(), and Verify().
These data types become as part of the available data types as regular types such as integer, float, character, or date. Indeed, applications do not perceive them as special or different. One advantage of making very different types accessible in a standard way is that an application can use a number of very different data sources in ways unique or not thought of before. There is no limit to how many DataBlades you can create or use simultaneously. The database allows you to add the required functionality dynamically and remove it later once that expertise is no longer required within the database. Pictorially, it might look like Figure 1.
Figure 1. Third-party or user-defined functionality slots directly into the database. The new facilities work together and are treated equally.

Informix provides a tool to assist in the DataBlade creation process. Known as the DataBlade Development Kit (DBDK), the framework of a blade can be written allowing you to plug in your own domain-specific code. The generated code is fully commented (that is; quite verbose) and provides instructions as to where you add your specific code. Think of it as a DataBlade version of a Visual C++ code generator. To show how we create a DataBlade from scratch, we won't use the DBDK here - it's not strictly necessary if you know what you're doing. However I would suggest using it for your own DataBlades, since error checking is also added automatically.
Imagine that you need to store an inventory value representing the number of items left in a warehouse. This would normally be an integer type. The database engine would require data structures to exist for managing that data item's value. In addition, routines dealing with comparisons (equal, less-than, etc.) might be necessary for you to determine when to re-order inventory or to compare purchase availability.
DataBlades require the same "infrastructure." You create structures that hold the data, routines to deal with the input and presentation of the content, and any functions you might need to apply to that data. In addition, you provide the SQL interface to allow any of those data types to be used within standard SQL statements. Your DataBlade is loaded on demand and is run within the server's address space. The not-so-obvious advantage of running your functionality within a new data type as opposed to an externally called process, is that the optimizer can be made aware of the nature of the data type being used. It can treat simple and complex data types accordingly, thereby providing orders of magnitude improvements to performance.
It's very easy to write DataBlades in C or JavaTM; the code in this article uses C. You can also link to external libraries compiled in other languages as well; for example, you could link to mathematical libraries in FORTRAN, thus enabling complex statistical routines through simple SQL. Traditionally, you might export your data to a statistical package, run your analyses, and then re-import it - now you can run statistical operations on your data without any need for it to leave the security of your database. In addition, every process your data has to go through outside of the database is an opportunity for errors to creep into the results. The other significant point about the ability to link externally is that there is a lot of legacy code out there that might not actually need rewriting in C and could be integrated into new systems with fairly little effort. While this may sound complex, the example in this article shows just how simple it can be.
Many applications often require the need to search on character strings based on their so-called "sound." I have seen this functionality (called Soundex) built into customer search screens, or street name search screens because spelling is often subjective, and mistakes in annotation are common. Unfortunately for the application designers, it is usually necessary to compromise the relational integrity of the database by introducing redundant columns into tables that hold some form of Soundex code. This is further complicated by the fact that any change in the column holding the actual value requires an update to the associated sound, and every application needs to specify a reference to the sound column in order to search against it. Because applications should be able to use this as transparently as possible, automatic updating of the Soundex signature is traditionally achieved using triggers and stored procedures. In short, the work is left to the database administrator who may have to go through hoops depending on the database design.
Our DataBlade will be used within an employee table in which the name column has Soundex properties. No triggers, stored procedures, or extra columns are required to allow Soundex functionality, and a consistent method of accessing this can be used from any interface.
For example, in Listing 1, a standard SQL query has asked for employees where the name equals "Smith." Note that "Smythe" and "Smithie" are also returned. This is not an error - equality is now based on the sound attributes of the name column.
Listing 1. Searching for a value that sounds like "Smith"
|
Don't worry too much now about doing a search on the employee whose name actually is "Smith." This is also possible but less interesting for the purposes of this article.
Listing 2 shows all employees in the database with similar-sounding names. Note that the programmer does not even have to understand how the Soundex algorithm works.
Listing 2. Find all similar-sounding names.
|
This article will show the following features of DataBlade creation and use:
- Using a new data type within SQL
Our data type is called TSndx, and is effectively a replacement for a char type. To make things simple, it has a fixed length. The user will simply specifies TSndx in place of char. As Listing 1 and Listing 2 showed, no special syntax is necessary to take advantage of this. Further, no extra columns are required in the employee table. Soundex is now part of the data type's behaviour. - Creating comparison functions for the TSndx datatype
While you could argue that one sound is not necessarily greater or less than another, we show how you can overload comparison operators to take advantage of the sound. This also allows us to build indexes on the sound of the data. - Creating functional indexes
Functional indexes allow us to build an index on the result of a function. This is very powerful and avoids performance issues such as full-table scans on large columns. I could for example create a function that determines the heritage of a name (Scottish, German, Spanish, etc.), or process verb tables returning their infinitive. - Extending our TSndx datatype functionality
It's one thing to compare strings against a known sound. But what about doing a search for all names that might sound like something beginning with a "J," having an "N," "P," "F," in the middle and ending with either "B" or "SH"? Regular expression matching on sound could be difficult, but a simple function declaration and some C code shows that this can be implemented with very little fuss.
The actual blade takes a relatively small amount of code. The majority of this code is in the actual sound generation function and regular expression functions. Having said that, the design of this example DataBlade is definitely not written in an optimal fashion. Without going into too much detail, this code has been written in a way that demonstrates a number of features about DataBlades, namely new type and function creation, functional indexes, function overloading, and the ability to extend core database functionality. Strictly speaking, a number of features such as error checking and code efficiency are not included to make the code both easier to understand, as well as focus on our core objective of understanding how DataBlades work. These are covered in great detail in the relevant manuals mentioned towards the end of this article.
New data types are used in SQL like any other regular data type. Listing 3 shows the create statement for the employee table. The only thing out of the ordinary here is the reference to the data type TSndx.
Listing 3. The CREATE TABLE statement using the TSndx data type.
|
Before we can use the type in a table, however, we need to let the database know exactly how TSndx behaves, how the data is stored, and how we search on the contents. This process is simpler than it sounds.
We define the type in Listing 4. This is an opaque type, which as the name suggests means we cannot see inside the type. All modifications to the type contents require encapsulating functions, which we'll show a little later. In Listing 4, we provide a structure in C to hold the actual text and an internal representation. The data attribute holds a copy of the characters, and sound holds a pseudo-representation of those characters. For simplicity, we assume that there are never more than thirty distinct sounds in a stream of text. We use the SQL CREATE OPAQUE type statement to let the database know data type now exists.
Listing 4. C code for holding text and SQL statement to define the new type to the database
|
All data types require a way to convert from the externally given format to an internal representation, and back again. For example, an integer stored as a SQL SMALLINT with the value 32767 doesn't require 5 bytes to store the value. It would use two bytes for the value and convert it to 5 characters on display. Similarly, a spatial point such as (3, 6) would be parsed on input and two integers (X and Y values) stored internally. On output, one would represent the point as two numbers with a comma separating them and surrounded by brackets. We require similar functionality with our TSndx type. Remember that the sound is held internally, and the external representation is in fact the same as a character data type. To achieve this, an input routine and an output routine are required.
Listing 5. Defining a routine to store character representation internally
|
What Listing 5 actually shows is that when we need to store any TSndx data (e.g., on input or update), the TSndxInput function within the shared object (or DLL) Soundex.bld is called. The function takes a character string as its only argument and returns a TSndx type. The actual code for this function is in Listing 6.
Listing 6. Source code to convert from an external character representation to an internal representation
|
Now we need a routine to display the contents of the data type. The output function is a little different from the input, in that the argument is of type TSndx, and the return type is character. This should make sense since we expect a character-like output.
Listing 7. Routine to convert a TSndx type to a character string acceptable for display
|
One important issue remains: If we are going to use TSndx like a character, then we will almost certainly expect to use it in all situations for which we would use a character. This provides a small problem because TSndx is in fact a completely different type, and logically, the only way for us to use characters in place of TSndx is to convert every character string to a TSndx before storing it in the database. Instead of having to convert (or cast) from TSndx to char and back again when necessary, the database allows us to provide what are known as implicit casts. This means that in the right context, the database will not just come back with an "incorrect data type" error. We give it permission to make certain assumptions and do our casting (or converting) for us.
Listing 8. Implicit casts allow TSndx to be converted to character and back again automatically
|
We've now done most of the hard work. We can now insert data into our table and display it. Listing 9 shows the insert statements. Note that we are now using character strings in the insert. Even though the "name" column is of type TSndx, the engine understands that we can cast it automatically. So far the user is unaware that name is in fact more than a standard character type.
Listing 9. Inserting data into our employee table
|
If we now select from the employee table, we should see the same sort of output as if name were a simple character type.
Listing 10. Listing of the contents of the employee table
|
So if everything is so normal, how then do we get the kind of behavior shown in Listing 1 and Listing 2? How do we find similar-sounding character strings? We simply need to overload the equals() operator to allow a new sort of comparison. In addition, we provide some C code to return a comparison measure.
Listing 11. Overloading the = operator allows comparison of two TSndx data types, and the C code required to perform this comparison.
|
What is probably not so evident from this is that the normal "=" operator used in SQL is now available to compare sound values. The full listing of all SQL and C code at the end of this article shows that I have gone all the way here and also overloaded the >, >, >=, >=, != operators as well. It's debatable whether a string sounds greater or less than any other, but the fact is, if I can order the sounds, then I can index them and gain great performance benefits.
Informix Dynamic ServerTM has a number of indexing methods. Any ordinal value can be indexed with a standard B-Tree, as long as compare() functions exist. You've already seen how simple that is, so we get B-Tree indexes for nothing. Spatial data types might use an R-Tree index, which is provided with the system, but we're not indexing in more than one dimension. Assuming we had a criteria different from the normal compare() functionality, how can we index on this value? We can do this using a feature known as functional indexes. With functional indexes, you create an index on the result of a function, and our function returns an integer value representing the sound of the text. I could also index on name length, number of consonants, text language, etc. The internals are of less importance than the fact that you can do it. In our example, by indexing on the sound itself we ensure that names like "Duffies" and "Toffies" are stored close to each other.
Listing 12 shows code necessary to display the actual sound value for a piece of text, and Listing 13 shows how we can now view the so-called sound of that text.
Listing 12. Code to display the actual Soundex values for a character string of type TSndx
|
Here then are the corresponding sounds of our employee names:
Listing 13. The names in the employee table and their sounds
|
This shows that for every character string, there is a function that will return a single value representing the sound of that string. Using this, we can now create a functional index based on the sound value (Listing 14).
Listing 14. Creating a functional index
|
Now compare how data might be stored using an index on employee name as opposed to an index on its associated sound. Figure 2 shows that similar-sounding employees are grouped together sound-wise using the functional index and alphabetically using a regular index. Note that "Duffies" and "Davis" are stored near "Toffies" and "Doofus." Indeed, the spread of data is greatly exaggerated here at three rows per page to highlight how having to search for n similar-sounding objects may mean having to read n pages. In the case where no intelligence can be built around the data type, this may mean a worst-case scenario of the system doing a full table scan.
Figure 2. How data might be physically stored using indexes against strings versus the sound of the string

Extending the base functionality
We now have what you might think is an acceptable solution to our sound-search function. However, we're not finished by a long shot. There is so much code in the public domain that we can use to extend the functionality of our new Soundex data type. For example, we can look into using regular expression syntax to extend the range of search possibilities on our data. Here, we use a simple regular expression parser from the source code for GNU "make," but there are far more sophisticated versions around. Feel free to use what you feel is appropriate. The use of wild cards and optional sounds make this new data type a particularly powerful facility, and SQL already has the facility to use wild cards with the LIKE clause. Once again, we are going to make it so easy to use this data type that the user need not even know they are doing so. The code for Sndx() and strmatch() are available for download.
Listing 15. Overloading the LIKE function
|
Now that the server knows what we mean by LIKE when referring to a TSndx data type, we can use queries such as in Listing 16.
Listing 16. Examples of using LIKE with Sound values
|
My addition of a regular expression comparison as an alternative comparison operator was arbitrary. You could easily extend the comparison functions to include things like fuzzy matching or partial equality. For example, "find me all records that have an 85% match" (whatever you might want that to mean). You might also add an attribute to the data type indicating nationality. For example a Belgian application could store the name as Flemish, German, French or English, in which the Soundex routine is customized to work optimally based on sound rules for the language of interest. Considerations could be made for dialects, too, allowing for dropped consonants and extended vowels. This parallel can be extended infinitely. For example, using sound matching together with more functionality to aid, data cleansing and data de-duplication within name and address fields in direct mailing lists, or searching for chemical structures both on the chemical structure itself, and using the textual description as a pre-index to group similar "sounding" structures for performance reasons. Perhaps even doing something similar with DNA and family names for hypothesis testing could be useful.
It should be apparent that creating DataBlades using Informix Dynamic Server is not an arduous task. With a small amount of code, the database administrator can add an enormous amount of functionality with very little effort. The power of functional indexes shows how performance benefits can be achieved by indexing intelligently on the data type being used (that is, the real content of the data). Extending the data type with regular expression on the sound of the string was simple once the basic data type functionality was present. This blade is small yet provides a piece of domain specific functionality arguably of use to every organization. It shows the reality of component-ware from a data-centric perspective, and potentially allows businesses to truly understand their data, its real value, and how best to deploy it.
This article has only touched on some peripheral issues of DataBlade creation. In effect, if you can imagine something you'd like to do with your data, then actually doing it is not that difficult. There are a number of good resources that might help you understand DataBlades in more detail. I have outlined these in Resources.
- Roy, Jacques. Informix Dynamic Server.2000. Prentice Hall PTR; ISBN: 013013709X; 1st edition (November 8, 1999).
A good introduction to developing DataBlades. Describes performance considerations, tracing and debugging, and gives lots of code examples.
- Brown, Paul. Object Relational Database Development: A Plumber's Guide. Prentice Hall PTR; ISBN: 0130194603; 1st edition (December 22, 2000).
Very good book for understanding Object Relational technology, and why DataBlades (or equivalent) are a superior medium for addressing many complex problems. Lots of good code examples for Datablades.
- Sanchez, Angela (Ed.). Informix Dynamic Server with Universal Data Option: Best Practices. Prentice Hall PTR; ISBN: 0139110747; 1st edition (September 1998).
A good overall guide to the object-relational technology, and how third parties can embed domain expertise within a database. This also provides some good examples of DataBlade code.
- Informix DataBlade development manuals at http://www.ibm.com/software/data/informix/pubs/library/
You can download almost any manual you need from this site. Of most importance here are the DataBlade Developer's Kit Users Guide (PDF), and the DataBlade API Programmer's Manual (PDF). The DataBlade Developer's Kit gives extensive information about creating and deploying DataBlades, and the Programmer's Manual gives a lot of source code examples.
- Informix Developer Zone at http://www.ibm.com/software/data/developer/informix
This page is not just for DataBlades. This also covers XML, Java databases, performance tips and the like. Informix employees have been writing DataBlades for ages, and there is a ton of code, tutorials, and discussions on this site. I mention this page from a DataBlade perspective but the rest of the site has loads of content of interest.

Jon Machtynger has been working with data management technology for 15 years, the last 10 working for database suppliers. He has held a variety of roles from technical to commercial. He recently joined IBM by way of the Informix acquisition. He can be reached at jon.machtynger@uk.ibm.com.





