Why (almost) every Web site needs an RDBMS

Deciding to bite the bullet early

When your Web application reaches a certain size, it needs a good database design behind it. And in fact, this "certain size" is much smaller than almost every small-site developer thinks. Relational Data Base Management Systems (RDBMSes) need not be restrictive or over-architected, as their bad reputation sometimes brings developers to fear. A bit of thought toward what your site does quickly turns into a sensible schema design, and it is easy to leave open expandable storage mechanisms like a configuration table within an RDBMS back end.

David Mertz, Ph.D (mertz@gnosis.cx), ACID Washed Author, Gnosis Software, Inc.

Photo of David MertzDavid Mertz values data, and values relationships even more. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Check out David's book Text Processing in Python.



11 December 2007

Also available in Russian

Introduction

A motto in "Extreme Programming" (XP) has a widespread, and understandable, appeal among many developers—especially those of us who tend to do consulting projects: "You ain't gonna need it" (abbreviated YAGNI) is a shorthand way of saying that you should not develop a given bit of functionality until it is actually required by your client or project. In general spirit, it is hard to argue with YAGNI; it has the clear advantage of letting you show clients "something" at an early stage of development and continuously along the way.

One area where eschewing too-early complications often goes wrong is in putting off database design decisions until you have already coded yourself into a cul-de-sac. The issue I find most often is not really in choices among RDBMSes per se, but rather with "quick and dirty" prototypes that create ad hoc solutions that skip any RDBMS altogether (or that settle for partial implementations like MS-Access, SQLite, or even Berkeley DB).

For the large majority of dynamic Web sites, some sort of true RDBMS at the back end will wind up being the right engineering approach. While switching between RDBMSes is unlikely to entirely meet the ideal of happening solely at the database layer, passing data access through SQL calls does a good job at componentizing schemas. If you have genuinely specialized performance requirements, you might eventually need optimization and tuning at the RDBMS level; this might include choosing an RDBMS, or even choosing a specialized "Column-oriented DBMS" for certain data warehousing needs. In any case, by the time you worry about these types of optimizations, you probably have a pretty good, working Web site, and your concerns have moved beyond those addressed in this article.


Stages of development

Another slogan loosely associated with XP is Kent Beck's, "Make it work, make it right, make it fast." I do not want this article to digress into a discussion—neither pro nor con—on extreme programming. However, there is a definite sense in which some ideals often associated with this programming methodology are the cause of a misguided eschewal of databases and good schema design.

I do not disagree with the benefit of early prototyping. A mockup that "works" is an excellent first step for demonstrating a concept. To get to that stage, creating almost-static pages with just a little bit of placeholder data is something almost all developers do. The next step comes soon thereafter, however: "make it right." In my mind, "right" usually means utilizing an RDMBS; and moreover, it means utilizing one that makes "pretty good" assumptions about the underlying structure of the data. Making it "fast" should be postponed as long as possible—which doesn't mean forever, but usually means for longer than many developers think.

A digression on configuration tables

One area where ad hoc storage of data is used is in application configuration. Formats like flat files, custom XML, YAML, JSON, native data syntax of your server language (Python, Perl, Ruby, Scheme, PHP), and other high-level languages all make it easy to store data in importable modules with names like config.py or config.pl.

The problem with configuration data is that its boundaries are ill-defined. There is no bright line between "configuration" and "application," and as a result, config files have a tendency to grow—and often to multiply—into multiple files as various "configuration sets" become needed. Key/value databases like Berkeley DB often propagate this confusion at an only slightly higher level.

I have found that many temptations can be reduced by simply using an RDBMS table called configuration right at the beginning. Usually this table can settle for three columns (just using two is tempting), for example:

SQL that defines generic configuration table
-- General configuration
CREATE TABLE configuration (
  key     TEXT,       -- Configuration key
  version TEXT,       -- Allow single switch for multiple and
                      -- experimental application versions
  value   TEXT,       -- Configuration value
  PRIMARY KEY(key)
);

A table like this reduces temptation; it does not eliminate it. After all, it is very possible to put all the data your application handles in this one table. But in practice, once you need to read and write configuration information through layered SQL calls, it becomes much easier to "bite the bullet" and move all the data that might live in some other tables into appropriately structured schemas. And even if you do not do this right away, when you do it, you're only required to change the SQL in your server code, not rip out your lovely XML or YAML parsing code and replace it with SQL and database connection code.

What tools encourage

The advice of this article is primarily aimed at developers who use "agile" or "very high level" programming languages for server code—that is to say, Perl, Python, Ruby, PHP, and a few others. For Web application developers that choose Java or C++, you can almost assume that a "heavier" methodology goes hand-in-hand with the development process—throwing in up-front schema design adds little "weight" to the process in these latter cases.

Moreover, in any language, using a "Web development framework" moves the temptations for too-early, too-persistent simplification laterally. That is, developers who use CakePHP, Apache Struts, Ruby on Rails, Django, or other Model-View-Controller (MVC) frameworks are unlikely to skip the RDBMS altogether. Instead, a different simplification danger usually awaits them: Object-Relational Mappings (ORMs).

In my experience—and that of many developers—starting with an ORM leaves you with poorly (under-)designed database schemas, essentially something not very far from the key/value configuration table mentioned above. While many ORMs are technically capable of generating and utilizing referential constraints, usually foreign key constraints are ignored, or sloppily pushed from the database to the application layer. What we often wind up with is simply RDBMSes reduced to collections of flat files with slightly better performance characteristics.

While frameworks are seemingly becoming more commonplace, a great deal of Web application code is still developed as basic (Fast)CGI—or at least as its moral equivalent using server modules like mod_perl, mod_php, JSP, and so on. When using these "close to the metal" techniques, it is easy to put data in local files, in various structured or semi-structured formats. Moreover, the default behavior of most Web servers to mirror the structure of a local file system in served URLs adds to the "obviousness" of storing data locally. Adding to those factors, many types of source data already come "prepackaged" as textual or binary formats, so converting them to schema-structured tables might genuinely require extra effort. For example, the job of some Web applications is to serve content that derives from log files; CSV dumps; spreadsheets; user-generated, loosely-structured documentation in formats like reStructuredText or MarkDown; or from wiki formats.

Even within the set of "scripting languages," philosophical differences exist—many reflected in code syntax and module organization. Perl—and to a slightly lesser extent Ruby and Python—make parsing a wide variety of data formats absurdly easy. CPAN has a parser for every obscure text format you have ever heard of, and for most binary ones. PHP or Java, in contrast, are less obviously facile in parsing text formats—and more importantly (for PHP especially), RDBMS access functions are "inherent" rather than merely importable modules. Obviously, one or a few lines of code to import an RDBMS binding is trivial, but the tiny fact that it is a separate step ties in with the "parse every format" attitude of Perl and Python, and makes "just use the file system" seem more compelling in those latter languages.


Acid

The reasons why you "are going to need" an RDBMS for your Web application are pretty much the same reasons why anyone needs one for any application. Not all applications are coded using MVC frameworks, but nearly all applications can benefit from at least a conceptual separation of data structure, business logic, and presentation. Moreover, what is important is not only the structure of data, but its integrity.

The acronym ACID (Atomicity, Consistency, Isolation, Durability) is a good summary of the integrity (and reliability) guarantees we expect from an RDBMS. Moreover, "informal" solutions that skip using an RDBMS usually wind up taking ad hoc measures to implement (some of) these guarantees, and usually get their subtleties wrong.

A recap of the ACID guarantees goes a long way toward illustrating why using an established infrastructure that provides these guarantees (a RDBMS) is better than trying to make guarantees "as needed."

Atomicity: Any transaction that is performed is done in its entirety, or is not done at all. A user action might need to update a variety of data values. If you were to perform these updates in a collection of files on the Web server file system, a failure in the middle might update one file, but not another related one. Or worse still, individual files might be left in invalid states (for example, partial writes create data that does not match the data format of the filespec). We all know how susceptible transactions on the Web are to a variety of failures (many not controlled by our servers): Failure should be as graceful as possible.

Consistency: The distinction between consistency and atomicity is subtle. Both aim at making sure your data always "makes sense." But atomicity only assures that a whole series of steps are performed, not necessarily that these steps are themselves logically consistent. By analogy, a journaling file system gives you a sort of atomicity guarantee: Each file is written or modified either with the whole change, or not at all. Journaling does not guarantee the relation of the contents of files to each other. A journaling file system does not assure that each file has unique contents, nor that a certain content or file cannot exist unless another file already does. But those sorts of guarantees are exactly what relational theory defines (for instance, primary and secondary keys in the example), and they are also exactly the sort of knowledge we have about our data during design stage.

Isolation: This should be simple enough: Your application has—or might have—multiple users, and what each one does must not create inconsistencies with the effects of another user's actions. Do not try to jury-rig isolation by, for example, having a self-written daemon process queue up changes. I have committed this sin myself, but the creators of well-tested RDBMSes have solved this problem already, and they have done so much better than your team can.

Durability: If a transaction commits successfully, you will not lose it. Genuinely guaranteeing this property is hard. It is one thing to use transaction logs and snapshot, it is rather harder to guarantee that a transaction survives every imaginable catastrophic failure. (Keeping a transaction through a major earthquake at your server farm is not easy.) In any case, regular RDBMSes do a pretty good job of making transactions durable.

RDBMS tools

It is worth mentioning a few "freebies" that RDBMSes give you that you might otherwise have to implement laboriously. The simplest of these is plain old indexes. Finding the right data becomes more difficult as data grows. Various solutions, made out of naming conventions, duct tape, and bailing wire, often arise for organization of file-system organized data. What they have in common is that the index fields built into RDBMSes have already solved most of these issues better. Yes, index optimization is a tricky topic past a certain level (index types, table partitioning, compound index types, join tables, caching options, and so on), but at least it is a known issue with existing experts. And much of the time "out of the box" is good enough with modern RDBMSes.

Another situation that I have often encountered where eschewing RDBMSes appears compelling is for full-text searching. A large number of quite excellent tools exist for indexing and querying collections of files on a file system, or at the end of URLs. Many of these are less straightforward to use for indexing free-form textual fields in databases. Fortunately, modern RDBMSes come with their own full-text search engines.


Wrapping up

It is difficult in an article like this to be entirely specific about the design principles that apply to your Web application. Every application exists for its own purpose, and comes with its own requirements and constraints. But in my experience as a consultant and developer, the tendency to skip use of an RDBMS—or at least to push it off for longer than is desirable—is far too common in rapid-prototyping contexts, and those spill over to perhaps most of what gets called Web application development.

It is really just not that hard to configure a simple RDBMS, and good bindings for most back ends almost surely exist for your favorite programming language or framework. Rather than wait, just set up a RDBMS right at the beginning, and work from there.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Web development on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Web development
ArticleID=276771
ArticleTitle= Why (almost) every Web site needs an RDBMS
publish-date=12112007