This entry is primarily a recap of a recent article published by IDC (Carl Olofson) titled "The Third Generation of Database Technology: Vendors and Products That Are Shaking Up the Market". My disclaimer for this entry is that I cannot attach the original article published by IDC (due to copyright). Therefore, I will attempt to summarize and discuss the salient points in the article. It will not be a complete summary or review of the article, due to time and relevance to this audience. At times, I may inject my own thoughts and comments (highlighted). I am hoping to make people aware of some of the competitive products already in the marketplace and others along the way (including IBM and Informix). Part I will cover the 1st and 2nd generation technology, while Part II (the good stuff) will cover the 3rd generation systems.
The premise of the article is that a new generation of DBMS technology is sending a simple message to the current generation of DBAs and users: "Everything you know is wrong". These new systems will encourage you to forget disk-based partitioning schemes, indexing strategies, and buffer management. They embrace a world of large-memory models, multi-core processors, clustered servers, and highly compressed column-wise storage. He further predicts that within 5 years:
Most data warehouses will be stored in a columnar fashion.
Most OLTP databases will either be augmented by an in-memory database or reside entirely in memory.
Most large-scale database servers will achieve horizontal scalability through clustering.
Many data collection and reporting problems will be solved with databases that have no formal schema at all.
The First Generation
The first generation of database technology started in the 60's and continued into the 70's. It served two distinct purposes: 1) to enable disparate but related applications to share their data by a means other than passing files back and forth, and 2) to provide a platform for independent data query and reporting that did not require custom code.
At the time, computer programs typically ran in batch mode, so the databases were linked to applications as layers of data organization and indexing between the application code and the file system. Because memory was extremely limited, these systems lacked generalized data access support; instead, they provided services that required very explicit data structure navigation, coded directly into the application.
Each database management system was different, having its own style for organizing data, and its own set of services or access languages (DDL and DML) that were quite distinct and completely incompatible. This meant that once one developed an application for one of the DBMSs, that application was "locked in". A DBMS migration would require a total application rewrite.
Because of this "lock in" factor, most of these DBMSs are still actively in use today. These products include IBM's IMS, CA's IDMS and DATACOM, Unisys' DMS II (that I worked on), and Software AG's Adabas. All these products have undergone various forms of "modernization" since the 1970's, including the addition of relational interfaces, and support for service conventions such as SOA. The reason that these products are out of favor is due to the lock in quality as well as the cost and complexity of developing and maintaining these databases.
The Second Generation
A new paradigm began to emerge in 1970 with the publication of "A Relational Model of Data for Large Shared Data Banks" by Dr. E. F. Codd. His premise was that data should be made accessible by managing it in a catalog structured along the lines of mathematical set theory, presented as attributes of relations organized into tuples. IBM then developed a test DBMS based on Codd's ideas, called System R, using terms like "tables", "columns", and "rows", metaphors made familiar by popular spreadsheet products that were delivering computer power to the masses. IBM made relational DBMS seem comprehensible and accessible to ordinary people.
Relational DBMS offered a standard way of modeling and accessing data that could enable users to break out of the dreaded lock in effect. At first, multiple competing languages were proposed for relational databases. IBM took a generalized query language that it had developed for System R, called the Structured Query Language (SQL), and extended it to include full DDL and DML capabilities.
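To make the DDL/DML distinction concrete, here is a minimal, hypothetical sketch (not from the article) using Python's built-in sqlite3 module. The CREATE TABLE statement is DDL (defining a relation and its attributes), while the INSERT and SELECT statements are DML (manipulating and querying its tuples). The customer table and its columns are made-up names for illustration only.

```python
import sqlite3

# In-memory SQLite database, purely for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the schema -- a relation ("table") with named attributes ("columns").
cur.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    )
""")

# DML: insert and query rows ("tuples") of the relation.
cur.execute("INSERT INTO customer (customer_id, name, city) VALUES (?, ?, ?)",
            (1, "Acme Corp", "Springfield"))
cur.execute("SELECT name, city FROM customer WHERE customer_id = ?", (1,))
print(cur.fetchone())   # -> ('Acme Corp', 'Springfield')

conn.commit()
conn.close()
```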
By the mid-1980's, users resented lock in not only at the DBMS level but also at the operating system level. Most systems, whether mainframe or midrange, were driven by proprietary OSs designed and controlled by their computer vendors. Users were increasingly attracted to what were then called minicomputers, driven by Unix. Such systems were called "open systems".
New RDBMS vendors quickly embraced open systems, pushing the idea that users could get cheaper processing and escape lock in at both the OS and the DBMS level. They were fantastically successful, and companies like Ingres, Oracle, Informix, and Sybase emerged as the major players in the late 1980s and early 1990s.
These second generation DBMS products were well designed for the economic constraints of their times. Memory was expensive, and most systems had one or a few processors. So, they used buffer management techniques designed to minimize memory usage. They tended to manage processing threads based on single processor architectures, and they based their internal data structures on optimal layout on dedicated disk drives that were typically direct attached storage using SCSI connections. Some RDBMS products offered cluster configurations for greater scalability, but these would typically require purpose-built hardware provided by the manufacturer: Teradata, Sequent, and Tandem (where I also worked) were examples of vendors of such products.
Since the 1990s, some RDBMS products have been enhanced with various kinds of alternative clustering, featuring shared-disk support enabled by cluster file systems deployed on network-attached storage (NAS) or storage area networks (SANs). [Mach-11 anyone?] Some also offer caching techniques, 64-bit buffers, automated database and query optimization, multi-level partitioning, and some degree of parallel query processing, all designed to deliver incremental benefits over the basic product.
Nonetheless, these products are still based on the core design principles that drove their early development: laying out data on disk for optimal performance by scattering it across volumes, partitioning data to simplify indexes and maintenance operations, and offering query optimization options that take full advantage of the way data is organized on storage. In short, these systems are still essentially focused on so-called spinning disks, and with that come both the limitations on internal operation scalability and the bottlenecks represented by disk I/O.
Everything up to now should already be familiar to the reader. It does represent an accurate view (as shared by other authors as well) of the current state of database technology. Stay tuned for Part II of this entry.