A motto in "Extreme Programming" (XP) has a widespread, and understandable, appeal among many developers—especially those of us who tend to do consulting projects: "You ain't gonna need it" (abbreviated YAGNI) is a shorthand way of saying that you should not develop a given bit of functionality until it is actually required by your client or project. In general spirit, it is hard to argue with YAGNI; it has the clear advantage of letting you show clients "something" at an early stage of development and continuously along the way.
One area where eschewing too-early complications often goes wrong is in putting off database design decisions until you have already coded yourself into a cul-de-sac. The issue I find most often is not really in choices among RDBMSes per se, but rather with "quick and dirty" prototypes that create ad hoc solutions that skip any RDBMS altogether (or that settle for partial implementations like MS-Access, SQLite, or even Berkeley DB).
For the large majority of dynamic Web sites, some sort of true RDBMS at the back end will wind up being the right engineering approach. While switching between RDBMSes is unlikely to entirely meet the ideal of happening solely at the database layer, passing data access through SQL calls does a good job at componentizing schemas. If you have genuinely specialized performance requirements, you might eventually need optimization and tuning at the RDBMS level; this might include choosing an RDBMS, or even choosing a specialized "Column-oriented DBMS" for certain data warehousing needs. In any case, by the time you worry about these types of optimizations, you probably have a pretty good, working Web site, and your concerns have moved beyond those addressed in this article.
Stages of development
Another slogan loosely associated with XP is Kent Beck's, "Make it work, make it right, make it fast." I do not want this article to digress into a discussion—neither pro nor con—on extreme programming. However, there is a definite sense in which some ideals often associated with this programming methodology are the cause of a misguided eschewal of databases and good schema design.
I do not disagree with the benefit of early prototyping. A mockup that "works" is an excellent first step for demonstrating a concept. To get to that stage, creating almost-static pages with just a little bit of placeholder data is something almost all developers do. The next step comes soon thereafter, however: "make it right." In my mind, "right" usually means utilizing an RDMBS; and moreover, it means utilizing one that makes "pretty good" assumptions about the underlying structure of the data. Making it "fast" should be postponed as long as possible—which doesn't mean forever, but usually means for longer than many developers think.
A digression on configuration tables
One area where ad hoc storage of data is used is in application
configuration. Formats like flat files, custom XML, YAML, JSON, native
data syntax of your server language (Python, Perl, Ruby, Scheme, PHP),
and other high-level languages all make it easy to store data in
importable modules with names like
The problem with configuration data is that its boundaries are
ill-defined. There is no bright line between "configuration" and
"application," and as a result,
config files have a tendency to
grow—and often to multiply—into multiple files as various
"configuration sets" become needed. Key/value databases like Berkeley
DB often propagate this confusion at an only slightly higher level.
I have found that many temptations can be reduced by simply using an
RDBMS table called
configuration right at the beginning. Usually
this table can settle for three columns (just using two is tempting),
SQL that defines generic configuration table
-- General configuration CREATE TABLE configuration ( key TEXT, -- Configuration key version TEXT, -- Allow single switch for multiple and -- experimental application versions value TEXT, -- Configuration value PRIMARY KEY(key) );
A table like this reduces temptation; it does not eliminate it. After all, it is very possible to put all the data your application handles in this one table. But in practice, once you need to read and write configuration information through layered SQL calls, it becomes much easier to "bite the bullet" and move all the data that might live in some other tables into appropriately structured schemas. And even if you do not do this right away, when you do it, you're only required to change the SQL in your server code, not rip out your lovely XML or YAML parsing code and replace it with SQL and database connection code.
What tools encourage
The advice of this article is primarily aimed at developers who use "agile" or "very high level" programming languages for server code—that is to say, Perl, Python, Ruby, PHP, and a few others. For Web application developers that choose Java or C++, you can almost assume that a "heavier" methodology goes hand-in-hand with the development process—throwing in up-front schema design adds little "weight" to the process in these latter cases.
Moreover, in any language, using a "Web development framework" moves the temptations for too-early, too-persistent simplification laterally. That is, developers who use CakePHP, Apache Struts, Ruby on Rails, Django, or other Model-View-Controller (MVC) frameworks are unlikely to skip the RDBMS altogether. Instead, a different simplification danger usually awaits them: Object-Relational Mappings (ORMs).
In my experience—and that of many developers—starting with an ORM leaves you with poorly (under-)designed database schemas, essentially something not very far from the key/value configuration table mentioned above. While many ORMs are technically capable of generating and utilizing referential constraints, usually foreign key constraints are ignored, or sloppily pushed from the database to the application layer. What we often wind up with is simply RDBMSes reduced to collections of flat files with slightly better performance characteristics.
While frameworks are seemingly becoming more commonplace, a great deal
of Web application code is still developed as basic (Fast)CGI—or at
least as its moral equivalent using server modules like
mod_php, JSP, and so on. When using these "close to the metal"
techniques, it is easy to put data in local files, in various
structured or semi-structured formats. Moreover, the default behavior
of most Web servers to mirror the structure of a local file system in
served URLs adds to the "obviousness" of storing data locally. Adding to
those factors, many types of source data already come "prepackaged"
as textual or binary formats, so converting them to schema-structured
tables might genuinely require extra effort. For example, the job of
some Web applications is to serve content that derives from log files;
CSV dumps; spreadsheets; user-generated, loosely-structured documentation in formats like reStructuredText or
MarkDown; or from wiki formats.
Even within the set of "scripting languages," philosophical differences exist—many reflected in code syntax and module organization. Perl—and to a slightly lesser extent Ruby and Python—make parsing a wide variety of data formats absurdly easy. CPAN has a parser for every obscure text format you have ever heard of, and for most binary ones. PHP or Java, in contrast, are less obviously facile in parsing text formats—and more importantly (for PHP especially), RDBMS access functions are "inherent" rather than merely importable modules. Obviously, one or a few lines of code to import an RDBMS binding is trivial, but the tiny fact that it is a separate step ties in with the "parse every format" attitude of Perl and Python, and makes "just use the file system" seem more compelling in those latter languages.
The reasons why you "are going to need" an RDBMS for your Web application are pretty much the same reasons why anyone needs one for any application. Not all applications are coded using MVC frameworks, but nearly all applications can benefit from at least a conceptual separation of data structure, business logic, and presentation. Moreover, what is important is not only the structure of data, but its integrity.
The acronym ACID (Atomicity, Consistency, Isolation, Durability) is a good summary of the integrity (and reliability) guarantees we expect from an RDBMS. Moreover, "informal" solutions that skip using an RDBMS usually wind up taking ad hoc measures to implement (some of) these guarantees, and usually get their subtleties wrong.
A recap of the ACID guarantees goes a long way toward illustrating why using an established infrastructure that provides these guarantees (a RDBMS) is better than trying to make guarantees "as needed."
Atomicity: Any transaction that is performed is done in its entirety, or is not done at all. A user action might need to update a variety of data values. If you were to perform these updates in a collection of files on the Web server file system, a failure in the middle might update one file, but not another related one. Or worse still, individual files might be left in invalid states (for example, partial writes create data that does not match the data format of the filespec). We all know how susceptible transactions on the Web are to a variety of failures (many not controlled by our servers): Failure should be as graceful as possible.
Consistency: The distinction between consistency and atomicity is subtle. Both aim at making sure your data always "makes sense." But atomicity only assures that a whole series of steps are performed, not necessarily that these steps are themselves logically consistent. By analogy, a journaling file system gives you a sort of atomicity guarantee: Each file is written or modified either with the whole change, or not at all. Journaling does not guarantee the relation of the contents of files to each other. A journaling file system does not assure that each file has unique contents, nor that a certain content or file cannot exist unless another file already does. But those sorts of guarantees are exactly what relational theory defines (for instance, primary and secondary keys in the example), and they are also exactly the sort of knowledge we have about our data during design stage.
Isolation: This should be simple enough: Your application has—or might have—multiple users, and what each one does must not create inconsistencies with the effects of another user's actions. Do not try to jury-rig isolation by, for example, having a self-written daemon process queue up changes. I have committed this sin myself, but the creators of well-tested RDBMSes have solved this problem already, and they have done so much better than your team can.
Durability: If a transaction commits successfully, you will not lose it. Genuinely guaranteeing this property is hard. It is one thing to use transaction logs and snapshot, it is rather harder to guarantee that a transaction survives every imaginable catastrophic failure. (Keeping a transaction through a major earthquake at your server farm is not easy.) In any case, regular RDBMSes do a pretty good job of making transactions durable.
It is worth mentioning a few "freebies" that RDBMSes give you that you might otherwise have to implement laboriously. The simplest of these is plain old indexes. Finding the right data becomes more difficult as data grows. Various solutions, made out of naming conventions, duct tape, and bailing wire, often arise for organization of file-system organized data. What they have in common is that the index fields built into RDBMSes have already solved most of these issues better. Yes, index optimization is a tricky topic past a certain level (index types, table partitioning, compound index types, join tables, caching options, and so on), but at least it is a known issue with existing experts. And much of the time "out of the box" is good enough with modern RDBMSes.
Another situation that I have often encountered where eschewing RDBMSes appears compelling is for full-text searching. A large number of quite excellent tools exist for indexing and querying collections of files on a file system, or at the end of URLs. Many of these are less straightforward to use for indexing free-form textual fields in databases. Fortunately, modern RDBMSes come with their own full-text search engines.
It is difficult in an article like this to be entirely specific about the design principles that apply to your Web application. Every application exists for its own purpose, and comes with its own requirements and constraints. But in my experience as a consultant and developer, the tendency to skip use of an RDBMS—or at least to push it off for longer than is desirable—is far too common in rapid-prototyping contexts, and those spill over to perhaps most of what gets called Web application development.
It is really just not that hard to configure a simple RDBMS, and good bindings for most back ends almost surely exist for your favorite programming language or framework. Rather than wait, just set up a RDBMS right at the beginning, and work from there.
- Extreme Programming and the related term, Agile Programming, are largely development methodologies intended to minimize the perceived excesses of design processes in other methodologies. As with any good political movement, Agile Programming has its own manifesto.
- As with many topics, Wikipedia has a good take on the problems with ORMs.
- Information on MySQL's full-text search capability can be found at the MySQL developer zone.
PostgreSQL 8.3, TSearch2 is fully integrated into the core engine
(prior to that, it had been part of the
- Find product information on IBM DB2's Net Search Extender at the IBM Software pages.
- For implementors for whom a small footprint and easy installation are premium concerns, take a look at Derby and/or Firebird On Linux systems, both MySQL and PostgreSQL also usually come either pre-installed, or as an automated upgrade.
- Get your hands on more articles from the Web development zone technical library.
- Subscribe to the developerWorks Web development newsletter.
Get products and technologies
- Download IBM product evaluation versions.