This is a continuation of last week’s entry, in which I summarized IDC’s paper “The Third Generation of Database Technology: Vendors and Products That Are Shaking up the Market” by Carl Olofson. Those interested in Part I should refer to the previous entry on this topic.
The Third Generation
The economics of computing changed in the 2000s such that multi-core processors are now common. 4- and 8-way systems, each with dual-core processors, are typical for even moderate workloads. 64-bit technology is taking over for enterprise servers, especially for database servers, and memory is cheap. Even though disks are also cheaper and faster than ever before, reading and writing whole records on disk represents a drag on overall system performance. Clever adaptations to reduce I/O help, but don’t really eliminate the problem. Third-generation DBMS products began to emerge in the late 1990s and are now beginning to supplant second-generation products in significant ways.
In-Memory Database Technology
A conventional disk-based RDBMS maps buffer contents to segments that, in turn, represent pages (disk blocks), and it must translate and map database keys before the data can be located. In-memory databases (IMDBs) instead simply manage blocks of main memory, using memory addresses as direct pointers to the data. This technique eliminates the I/O drag of disk-based data and is also claimed to reduce the instruction path length of a typical database operation by a factor of anywhere from 20 to 200.
An IMDB is not simply a shared memory cache, and its contents cannot be accessed directly by applications. Instead, applications access the data through the services of the DBMS as exposed in an API, which may be proprietary or, most commonly, a relational convention such as SQL, ODBC, or JDBC. This ensures not only that an application can’t corrupt the database but also that the database contents are controlled by the IMDB, which uses various techniques to present a consistent view of the data to each application and, in most cases, manages multiple concurrent sessions.
A common misconception regarding IMDB is that it lacks the ACID properties of a transactional database. This is not true. Most IMDB implementations used for transaction processing still have transaction logs for error recovery, and can stream the logs to physically persistent storage such as SSM or spinning disk. A few provide full recoverability by nonlogging means. They also commonly replicate their memory contents to other servers, to provide high availability functionality through failover support.
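To make the logging point concrete, here is a minimal sketch of how an in-memory store can stay recoverable by appending each committed change to a log on persistent storage and replaying it at startup. The class name `TinyInMemoryStore`, the JSON log format, and the `demo` helper are all my own assumptions for illustration; this is not how any of the products named here actually implement recovery.

```python
import json
import os
import tempfile


class TinyInMemoryStore:
    """Toy key-value store: data lives in a dict, durability comes from a log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()  # rebuild memory contents from the log, if one exists
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                # Each line is one committed transaction: a dict of key -> value.
                self.data.update(json.loads(line))

    def commit(self, changes):
        # Write the log record and force it to storage BEFORE touching memory,
        # so a crash can never leave an applied change without a log entry.
        self.log.write(json.dumps(changes) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data.update(changes)

    def close(self):
        self.log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "txn.log")
    store = TinyInMemoryStore(path)
    store.commit({"acct:1": 100, "acct:2": 50})
    store.commit({"acct:1": 70, "acct:2": 80})  # a transfer of 30
    store.close()  # pretend the server went down and memory was lost

    recovered = TinyInMemoryStore(path)  # a fresh process replays the log
    result = dict(recovered.data)
    recovered.close()
    return result
```

Calling `demo()` rebuilds both account balances purely from the log file. A real IMDB would add checkpoints so the log does not grow without bound, and would handle concurrency and partial writes, none of which this sketch attempts.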
Examples of DBMS products that include this technology are Oracle TimesTen, Sybase ASE 15.5 IMDB, ENEA’s Polyhedra, GeneroDB from Four Js, Xcelerix IMDB from Frontex, IBM’s SolidDB, eXtremeDB from McObject, and VoltDB.
DBMS servers are using peer-to-peer clusters to provide a hybrid of failover and scalability support, spreading workloads across many servers and exchanging small units of data at high rates of speed to execute database operations. This is an especially useful approach when the workload mainly consists in access to a large number of read-only tables, but a small number of updatable tables, and where the transactions do not require a high degree of data sharing for execution, such as classic data entry operations.
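One common way to spread such a workload, sketched below, is to assign each updatable row to exactly one node by hashing its key, so that transactions touching a single key never need to coordinate with other nodes. The node names, the `owner_node` and `route` helpers, and the choice of SHA-256 are assumptions of mine; the paper does not say which partitioning scheme any particular vendor uses.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def owner_node(key, nodes=NODES):
    """Pick the node that owns an updatable row, using a stable hash of its key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def route(transactions, nodes=NODES):
    """Group single-key transactions by owning node, preserving their order."""
    plan = {node: [] for node in nodes}
    for key, operation in transactions:
        plan[owner_node(key, nodes)].append((key, operation))
    return plan


plan = route([
    ("order:1", "insert"),
    ("order:2", "insert"),
    ("order:1", "update"),
])
```

Because the hash is deterministic, both operations on `order:1` land on the same node and can run there without any cross-node traffic. Read-only tables would simply be copied to every node, which is why this style of cluster suits workloads with little data sharing between transactions.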
Examples of products with this technology include ParAccel and VoltDB (and, I would add, Informix MACH 11).
Columnar Data Storage
Storing table rows as blocks with selected indexed columns is a spectacularly inefficient approach for databases that are commonly used for statistical analysis, such as data warehouses. This is because many data warehouse operations scan tables, performing aggregate operations on the data, usually in select columns, and because when they randomly select rows, it is normally for only a few columns in that row, and not for the whole row.
Columnar data storage involves storing tables as blocks by column rather than by row. This offers several advantages for analytic databases. One advantage is that operations on a column of data can be carried out with very few I/O operations. Another is that since the column contains all data of the same type, it can very easily be compressed to a tiny fraction of its size by using indexing to eliminate duplicate values and then compressing the values themselves. Once that is done, any random select on the table will return very quickly because every column is, in effect, indexed. Finally, the index structures that define the columns can themselves be cross-indexed, further optimizing access.
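The duplicate-elimination idea can be illustrated with a small dictionary-encoding sketch: each column is stored as a short list of distinct values plus one small integer code per row. The sample rows and the helper names `encode_column` and `rows_matching` are made up for illustration, and real column stores layer further compression and on-disk block layouts on top of this.

```python
def encode_column(values):
    """Dictionary-encode one column: distinct values plus one integer code per row."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in values]
    return dictionary, codes


def rows_matching(dictionary, codes, wanted):
    """Return the row positions whose column value equals `wanted`."""
    if wanted not in dictionary:
        return []
    code = dictionary.index(wanted)
    return [pos for pos, c in enumerate(codes) if c == code]


# Hypothetical sales table, first as rows...
rows = [
    ("alice", "EU", 120),
    ("bob", "US", 75),
    ("carol", "EU", 200),
    ("dave", "US", 30),
]

# ...then pivoted into one list per column.
names, regions, amounts = (list(col) for col in zip(*rows))

region_dict, region_codes = encode_column(regions)  # ['EU', 'US'], [0, 1, 0, 1]
eu_rows = rows_matching(region_dict, region_codes, "EU")  # [0, 2]
eu_total = sum(amounts[i] for i in eu_rows)  # 320
```

Note that the aggregate touches only the `regions` and `amounts` columns; the `names` column is never read, which is the I/O saving described above.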
Examples of products that incorporate this technology include Oracle Exadata Database Machine V2, Sybase IQ, Vertica, ParAccel, and VectorWise from Ingres.
Nonschematic DBMS may be one of the most controversial and least understood of the third-generation DBMS technologies, though it is based on an idea that is not all that new. It could be argued that every native XML DBMS is really a nonschematic DBMS, in that it dynamically builds index structures based on the tag structures found in the XML documents that are loaded into it, and uses those index structures to organize the data. (Kevin Brown and I have had several discussions with one of the vendors mentioned below to assess its suitability with IDS.)
A nonschematic DBMS is one that stores data as a complex of key-value pairs, building cross-pair structures, such as table rows, based on the order inherent in the data presented to it. The user can then use tools to discover structures inherent in the data, and derive a schema from those structures.
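As a rough illustration of deriving a schema from the data itself, the sketch below scans a handful of key-value records and reports, for each key, which value types were seen and whether the key appears in every record. The record contents and the `discover_schema` function are hypothetical; actual nonschematic products use far richer discovery tools than this.

```python
from collections import defaultdict


def discover_schema(records):
    """Summarize the keys found in key-value records: observed types and presence."""
    types_seen = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for key, value in record.items():
            types_seen[key].add(type(value).__name__)
            counts[key] += 1
    return {
        key: {
            "types": sorted(types_seen[key]),
            # A key counts as "required" only if every record carries it.
            "required": counts[key] == len(records),
        }
        for key in types_seen
    }


# Hypothetical records loaded with no schema declared up front.
records = [
    {"id": 1, "name": "widget", "price": 9.5},
    {"id": 2, "name": "gadget"},
    {"id": 3, "name": "gizmo", "price": 12, "tags": ["new"]},
]

schema = discover_schema(records)
```

Here `schema["price"]` comes back as optional with both `float` and `int` observed, exactly the kind of irregularity a fixed relational schema would have rejected at load time.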
Nonschematic DBMS could be used as a tool in preparing data to be loaded into a data warehouse, but would make a poor platform for the warehouse itself. It is best used for managing semi-structured data, such as XML, and for aggregating large amounts of data from many sources for performing search and discovery operations. It seems to have a bright future for a variety of applications that may be offered as services on the Internet and in cloud environments.
Examples of products with this technology include Google’s BigTable, Infobionics Knowledge Server, and Amazon’s SimpleDB. Among general-purpose XML DBMS vendors, the pure-plays include Raining Data (TigerLogic XDMS), Software AG (webmethods Tamino), and Xpriori (XMS). A number of leading relational DBMS products also offer native XML support, including IBM DB2, Oracle, and Sybase ASE.
Cloud computing is based on the principles of utility computing, resource virtualization, and an approach to shared resource management, such as multitenancy, that enables a service to offer users the illusion of limitlessly expandable abstracted resources (memory, processing power, etc.). To fit into such a framework, DBMS technology must also offer virtualization, horizontal scalability, and multitenancy.
These capabilities can be offered through server and storage resources that are tightly controlled, or through virtualization techniques that cooperate with other external resource management systems. A third approach is to encapsulate these capabilities within a physical environment that is directly controlled by the vendor, offering DBMS functionality as cloud services.
Only a few DBMS products feature dynamic, virtualized hardware and software resource management for scaling processing power or storage up or down without manual reconfiguration. Examples of products that explicitly use this technology include Amazon SimpleDB and Microsoft SQL Server Azure.
Part III of this entry will discuss some of the vendors mentioned so far and the specific products that fit into the categories discussed above. Still further out, there will be a discussion of what IBM and Informix are doing with these technologies.