A database is a digital repository for storing, managing and securing organized collections of data.
Different types of databases store data in different ways. For example, relational databases store it in defined tables with rows and columns, while nonrelational databases can store it as a variety of data structures, including key-value pairs or graphs.
Organizations use these different kinds of databases to manage different types of data. Relational databases excel with structured data such as financial records. Nonrelational databases are best for unstructured data types such as text files, audio and video. Vector databases store data as vector embeddings, a format used by many generative AI applications.
Businesses own large amounts of data (often measured in petabytes, or quadrillions of bytes) on everything from customer transactions and product inventory to internal processes and proprietary research. This data must be organized in a coherent data architecture so that users and apps can access it when they need it.
Databases are foundational to building such a data architecture. They are more than a place to store information. Rather, they enable organizations to centrally manage data, enforce data integrity and security standards and facilitate data access.
With the proper database systems in place, organizations can use high-quality data sets for key business initiatives, including business intelligence (BI), artificial intelligence (AI) and machine learning (ML) projects.
People often use the term “database” rather loosely, which can cause confusion about what a database is—and what it is not.
A database is a system for storing and managing data, comprising both the physical hardware on which the data is stored and the software that organizes and controls access to the data.
Databases underpin much modern IT infrastructure, including websites, apps and platforms such as Amazon and Google. These services are not databases themselves, but they do rely on databases to manage information, such as product inventories or search results.
It is also worth noting that Microsoft Excel is not a database, but a spreadsheet application. An Excel spreadsheet organizes data in rows and columns much like a relational database does, but that spreadsheet is a single file. Databases, however, are robust, centrally managed systems that can store many different types of data, in many different formats, while supporting more advanced queries.
Organizations use different types of databases to manage different types of data and support different applications. Some of the most common types of databases include:
Navigational databases store data in sets of linked records. Users must navigate between these records to reach the data they want, hence the name.
The 2 most common types of navigational databases are hierarchical databases and network databases.
Hierarchical databases arrange data in a tree-like structure of parent records and child records. Each child record can have only a single parent, but parent records can have multiple children. To reach the wanted record, users must start at the top of the tree and work their way down.
Network databases behave much like hierarchical databases, except they allow each child record to be linked to multiple parent records. Users must still navigate through linked records, typically by using pointers to arrive at the data they want.
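The navigational access pattern described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular navigational product works: the `Record` class, the `navigate` helper and the sample company data are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One record in a hierarchical (tree-shaped) store."""
    name: str
    data: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)  # child name -> Record

    def add_child(self, child: "Record") -> "Record":
        self.children[child.name] = child
        return child


def navigate(root: Record, path: list[str]) -> Record:
    """Walk from the root down the tree, following one child link at a time."""
    current = root
    for name in path:
        current = current.children[name]  # raises KeyError if the link is missing
    return current


# Hypothetical sample data: company -> department -> employee
company = Record("company")
sales = company.add_child(Record("sales"))
sales.add_child(Record("alice", {"title": "Account manager"}))

employee = navigate(company, ["sales", "alice"])
print(employee.data["title"])  # Account manager
```

Note that the caller must already know the path to the record. That requirement is what the relational model later removed.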
Navigational databases were once common, but advancements in database technology—particularly the development of the relational data model—have made them much less popular.
Relational databases store data in formatted tables of rows and columns. They are sometimes called “SQL databases” because many relational databases support the use of structured query language (SQL) to query and manipulate data. (For more information, see “Database languages.”)
Each table in a relational database contains information on one type of entity. For example, an organization might have one table that contains information on all its customers and a separate table that records those customers’ purchases.
IBM scientist Edgar F. Codd developed the relational model in the 1970s. The model quickly outpaced the navigational model’s popularity because it greatly simplifies the act of retrieving data. Instead of specifying paths between records, users can use SQL statements to name the data they want. The database figures out how to retrieve the relevant records, often by using indexes instead of full-table scans to speed up the process.
Relational databases also cut down on redundancy, as each datapoint needs to be stored only once. Data from different tables can be combined into a single view without needing to duplicate the data.
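To make the declarative style concrete, here is a small sketch that uses Python's built-in `sqlite3` module with a throwaway in-memory database. The `customers` and `purchases` tables and their sample rows are assumptions made up for illustration. The query names the data it wants and leaves the retrieval strategy to the database, and each customer's name is stored only once even though it appears alongside several purchases in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE purchases (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item        TEXT NOT NULL,
        amount      REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO purchases (customer_id, item, amount) VALUES
        (1, 'keyboard', 49.0),
        (1, 'monitor', 199.0),
        (2, 'mouse', 25.0);
""")

# Combine both tables into a single view without duplicating stored data.
rows = conn.execute("""
    SELECT c.name, COUNT(p.id) AS purchase_count, SUM(p.amount) AS total
    FROM customers AS c
    JOIN purchases AS p ON p.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Ada', 2, 248.0), ('Grace', 1, 25.0)]
```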
Relational databases are some of the most common databases today. They are well suited for managing structured data sets with a standard format, such as financial transactions or user contact information.
A more recent class of relational databases, called “NewSQL databases,” aims to make the relational model more scalable by adopting a distributed database architecture, that is, distributing data across multiple database servers.
“Nonrelational database” is essentially a catch-all term for any database that does not store data in a rigid format, such as a table. Nonrelational databases are sometimes called “NoSQL databases” because they generally don’t require SQL to query.
Nonrelational databases arose to support unstructured and semistructured data types, such as free-form text and images, that don’t fit neatly into relational tables.
Common types of nonrelational databases include:
Graph databases store data as “nodes” (representing entities) and “edges” (representing relationships between them). Graph databases are often used to track relationships, such as the connections between users of a social networking site.
Document databases store data as documents, including formats such as JSON, XML and BSON. Document databases are common in content management systems.
Key-value databases store information as key-value pairs, where keys are unique identifiers (such as a digital shopping cart ID) and values are arrays of data (such as the items in the cart).
Wide-column databases use rows and columns much like relational databases. The difference is that each row can have its own distinct set of columns that store different information than the other rows. Wide-column databases are often used to support data warehouses, where data must be extracted from multiple sources and centralized.
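The four nonrelational shapes listed above can be compared side by side using plain Python structures. This is only a sketch of the data shapes, not of any real database engine. All keys, names and values below are hypothetical sample data.

```python
import json

# Key-value: a value looked up by a unique key (here, a shopping cart ID).
key_value_store = {"cart:1001": ["keyboard", "mouse"]}

# Document: a self-describing, nested record, serialized here as JSON.
document = {
    "_id": "cart:1001",
    "customer": {"name": "Ada"},
    "items": [{"sku": "keyboard", "qty": 1}, {"sku": "mouse", "qty": 1}],
}
document_json = json.dumps(document)

# Graph: nodes (entities) plus edges (relationships between them).
nodes = {"ada", "grace", "linus"}
edges = {("ada", "grace"), ("grace", "linus")}  # undirected "knows" links


def neighbors(node: str) -> set[str]:
    """Return every node directly connected to `node`."""
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}


# Wide-column: each row key maps to its own, possibly different, set of columns.
wide_column_rows = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Grace", "last_login": "2024-01-01"},
}

print(neighbors("grace"))
```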
Object-oriented databases, also called object databases, store data as objects in the sense of object-oriented programming.
Objects are basically bundles of information and associated code. Each object represents an entity. Objects are grouped in classes and have attributes that describe their characteristics and methods that define their behavior.
For example, an object in the “cat” class might have the attributes “color” and “weight,” and the methods “purr” and “hunt.”
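In code, the “cat” example might look like the following Python class. The class layout, the method bodies and the sample values are assumptions made for illustration. An object-oriented database would persist instances like `tom` directly, rather than first converting them to table rows.

```python
class Cat:
    """An object bundles data (attributes) with behavior (methods)."""

    def __init__(self, color: str, weight: float) -> None:
        self.color = color    # attribute
        self.weight = weight  # attribute

    def purr(self) -> str:
        return "purr"

    def hunt(self, prey: str) -> str:
        return f"The {self.color} cat hunts the {prey}."


tom = Cat(color="gray", weight=4.5)
print(tom.hunt("mouse"))  # The gray cat hunts the mouse.
```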
Object-oriented databases gained popularity in the 1990s alongside object-oriented programming. Relational databases can pose problems for some apps built with object-oriented languages, as data objects must be converted to tables to be stored in these databases. Object-oriented databases allow developers to avoid that problem.
Vector databases store information as arrays of numbers called “vectors,” which are clustered based on similarity. For example, a weather model might store the low, mean and high temperatures for a single day in vector form: [62, 77, 85].
Vectors can also represent complex objects such as words, images, videos and audio. This high-dimensional vector data is essential to machine learning, natural language processing (NLP) and other AI tasks.
Vector databases are common in AI and ML use cases. For example, many implementations of retrieval augmented generation (RAG) frameworks—which enable large language models (LLMs) to retrieve facts from an external knowledge base—use vector databases.
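The core operation behind these use cases is similarity search: given a query vector, find the stored vectors closest to it. The sketch below reuses the temperature example and ranks stored vectors by Euclidean distance with a brute-force scan. The day names and readings are hypothetical. Production vector databases typically work with much higher-dimensional embeddings and use approximate indexes rather than scanning every vector.

```python
import math

# Hypothetical daily temperature vectors in the form [low, mean, high].
daily_temperatures = {
    "monday": [62.0, 77.0, 85.0],
    "tuesday": [60.0, 75.0, 84.0],
    "january_day": [20.0, 28.0, 35.0],
}


def nearest_days(query: list[float], k: int = 1) -> list[str]:
    """Return the k stored days whose vectors are closest to the query."""
    ranked = sorted(
        daily_temperatures,
        key=lambda day: math.dist(query, daily_temperatures[day]),
    )
    return ranked[:k]


print(nearest_days([61.0, 76.0, 85.0], k=2))  # ['monday', 'tuesday']
```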
Cloud databases are databases hosted in the cloud. Any kind of database—relational, nonrelational or otherwise—can be a cloud database.
There are 2 main types of cloud databases. The first, and most basic, is a self-managed database system that runs in the cloud. The second is called database as a service (DBaaS).
DBaaS is a cloud computing service that enables users to access and use database software without managing the system themselves. As the name suggests, DBaaS providers offer a suite of database services, including upgrades, backups, database security and more.
Cloud databases are generally easier to scale than on-premises databases. If an organization needs more storage space or performance starts to drop, it can spin up more resources as needed.
Multimodel databases can store more than one type of data. For example, IBM® Db2® cloud database can support XML, JSON, text and spatial data in a single database instance.
In-memory databases store information in a device’s main memory, or RAM. Applications can typically retrieve data from RAM faster than from disk-based storage, so in-memory databases are often used to cache data and support real-time data processing. However, storage capacity is much more limited, and data can easily be lost because RAM is volatile: its contents disappear when the device loses power.
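The caching role mentioned above can be sketched as a read-through cache: the first request for a key goes to the slower backing store, and later requests are served from memory. The `ReadThroughCache` class and the `disk_store` dictionary are hypothetical stand-ins. A real deployment would sit in front of an actual disk-based database.

```python
class ReadThroughCache:
    """Keep recently loaded values in memory to avoid repeated slow lookups."""

    def __init__(self, load_from_store):
        self._load = load_from_store  # callable that fetches from the slow store
        self._memory = {}             # in-memory copy; lost when the process exits
        self.misses = 0

    def get(self, key):
        if key not in self._memory:
            self.misses += 1
            self._memory[key] = self._load(key)
        return self._memory[key]


# Hypothetical stand-in for a slower, disk-based database.
disk_store = {"product:42": {"name": "keyboard", "price": 49.0}}

cache = ReadThroughCache(disk_store.__getitem__)
first = cache.get("product:42")   # loaded from the backing store
second = cache.get("product:42")  # served from memory
print(cache.misses)  # 1
```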
Databases are not the only way to organize data, and organizations often use different data stores to support different initiatives.
Databases are primarily built for automated data capture, fast queries and transaction processing.
Data lakes are low-cost storage environments designed to handle massive amounts of raw structured and unstructured data. Unlike databases, data lakes generally don’t clean, validate or normalize data. They typically house vast amounts of data to support activities such as AI training and big data analytics where real-time performance is less important.
Data warehouses are built to support data analytics, business intelligence and data science efforts. They aggregate data from various databases, clean it and prepare it so that it is ready for use.
Data lakehouses merge the capabilities of warehouses and lakes into a single data management solution. A lakehouse combines low-cost storage with a high-performance query engine and intelligent metadata governance. This enables organizations to store large amounts of structured and unstructured data and easily use that data for AI, ML and analytics efforts.
At a high level, a database system has 2 key components: the data storage system, which physically or logically houses the data, and the database management system (DBMS), which enables users to interact with the stored data sets.
One can also take a more granular look at the components of a database system to get an even better understanding of what makes a database tick.
Databases must store their data somewhere, on some kind of hardware. That said, databases don’t require specialized machines.
Instead, most database systems are composed of database software running on a computer, server or other device. The machine offers the physical hardware on which the database runs. The software handles the logical arrangement of the data, for example, by formatting the data as tables in a relational database or graphs in a graph database.
A database and the applications that use it can run on the same piece of hardware, but today, most database systems use a multitier architecture that separates app servers and database servers. This arrangement offers more scalability and reliability. App and database servers can scale independently of each other, and outages in one tier need not affect the others.
A data model is a visual representation of an information system. Models are conceptual tools that database administrators and designers use to understand the types of data they must track, relationships between datapoints and how to best organize the data.
The data model helps identify the right database model, that is, the practical implementation of the database system, including technical requirements and storage formats. For example, the preceding logical data model might result in a relational database that looks like this:
A database schema technically and logically defines how data is organized within a database. Put another way, it translates the data model into a set of rules for the database to follow.
For example, a relational database schema would define things such as table names, fields, data types and relationships between these things.
Schemas can be represented through visual charts, written out with SQL statements or other programming languages or defined in some other way. It depends on the type of schema and the database system in question.
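As an illustration of a schema written out in SQL, the sketch below defines two tables with Python's built-in `sqlite3` module and shows the database rejecting rows that break the schema's rules. The table names, columns and constraints are assumptions chosen for this example. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

# Hypothetical schema: table names, fields, data types and relationships.
SCHEMA = """
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
);
"""

schema_conn = sqlite3.connect(":memory:")
schema_conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
schema_conn.executescript(SCHEMA)
schema_conn.execute(
    "INSERT INTO customers (name, email) VALUES ('Ada', 'ada@example.com')"
)


def try_insert_order(customer_id: int, total_cents: int) -> bool:
    """Return True if the row satisfies the schema, False if it is rejected."""
    try:
        schema_conn.execute(
            "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
            (customer_id, total_cents),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert_order(1, 500))   # True: valid order
print(try_insert_order(1, -5))    # False: violates the CHECK constraint
print(try_insert_order(99, 100))  # False: no such customer
```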
All relational database systems have schemas. Some nonrelational databases have schemas, some don’t and some allow but don’t require them.
A database management system (DBMS) is software that enables database administrators, users and apps to easily interface with data in a database.
DBMSs allow users to perform key data management tasks such as formatting databases, managing metadata, querying data sets and adding, updating or deleting data.
Some DBMSs help enforce data security measures, such as by applying database access controls and logging user activity. They might also track database performance.
Like databases themselves, DBMSs can vary in model. For example, relational database management systems (RDBMS) are built for relational databases, while object-oriented database management systems (OODBMS) manage object-oriented databases.
Some common database management systems include:
MySQL is an open source RDBMS often used for e-commerce sites and other web apps.
PostgreSQL is known for its emphasis on extensibility and transaction reliability.
Microsoft SQL Server is widely used by organizations with Microsoft networks.
Oracle Database is a multimodel DBMS that can manage both structured and unstructured data.
IBM Db2 is a cloud-native database system that includes database management, warehousing, storage and other features to support real-time analytics and AI applications.
Database languages are specialized programming languages that people use to interact with databases. They give users a syntax for writing queries to fetch, combine, update or otherwise use data.
The most common database language is structured query language (SQL), which most relational databases use. Developed by IBM scientists in the 1970s, SQL helps database administrators, developers and data analysts perform tasks such as data definition, access control, data sharing, data integration and analytical queries.
Other database languages include object query language (OQL), which works with object-oriented databases, and XQuery, which works with XML document databases.
There are also database-specific languages such as MongoDB query language (MQL) for MongoDB and Cassandra query language (CQL) for Apache Cassandra.
Databases are crucial to many technologies that people rely on today, from banking apps that track financial transactions in relational databases to AI assistants that use vector databases to improve accuracy. Databases are so common precisely because they are key to supporting:
Organizations today own massive amounts of data, but that doesn’t mean much if people can’t use that data. In fact, the IBM Data Differentiator reports that as much as 68% of enterprise data is never analyzed. Often, that’s because people don’t know it’s there or silos keep them from accessing it.
Databases give organizations a way to curate, store and centrally manage a collection of data. They can also help automate much of the data collection process, including capturing events and transactions in real time.
The way an organization selects, designs and implements its database applications can make or break key business initiatives. When data is organized and readily accessible, it can drive decision-making, fuel business intelligence and power AI and ML projects.
Databases can offer significant advantages over spreadsheets and other manual recordkeeping processes, which are prone to error, redundancy and inaccuracy.
Because databases can be centrally managed, they can make it easier to enforce cleansing and formatting rules, monitor usage and track data lineage. Databases also remove the need to circulate multiple copies of data sets, which can become unsynchronized over time. Instead, every application and user can work off the same shared repository.
Ultimately, databases can help connect users of all kinds—people, apps, APIs—with clean, trustworthy data.
Depending on location and industry, organizations must comply with data protection and data privacy regulations, such as the US Health Insurance Portability and Accountability Act (HIPAA) and the EU General Data Protection Regulation (GDPR).
Beyond legal requirements, organizations have a business interest in preventing unauthorized data access. According to the IBM Cost of a Data Breach Report, the average breach costs USD 4.88 million between lost business, system downtime, remediation efforts and other costs.
Databases can help protect data and maintain compliance by enforcing data security measures such as role-based access controls (RBAC) to help ensure that only the right users can access the right data.
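The idea behind RBAC can be sketched as two lookups: a user maps to a role, and a role maps to a set of permitted actions. The roles, users and actions below are hypothetical. Real database systems typically express the same idea through their own grant and role mechanisms rather than application code like this.

```python
# Hypothetical roles and the actions each one permits.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
    "admin": {"read", "write", "grant"},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {"ada": "admin", "grace": "analyst"}


def is_allowed(user: str, action: str) -> bool:
    """Allow an action only if the user's role includes it; deny by default."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("grace", "read"))   # True
print(is_allowed("grace", "write"))  # False
```

Unknown users and unknown roles fall through to an empty permission set, so the sketch denies access unless it is explicitly granted.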
75% of CEOs believe that having the most advanced generative AI will be a deciding factor in organizational competitive advantage moving forward. To support such advanced AIs, organizations need the ability to store, manage and govern massive amounts of structured and unstructured data. They can only do that with the right database systems in place.
Different types of databases can support AI and ML efforts in different ways. For example, vector databases are commonly used to implement RAG frameworks that can help reduce hallucinations. Key-value databases can speed up data retrieval and processing. In-memory databases can support caching and streaming analytics.
Several factors can influence the types of databases an organization chooses for a specific initiative. Some of the most salient include:
Type of data: Each type of database handles some kinds of data better than others. For example, a graph database is often a better choice for mapping relationships than an SQL database.
Purpose: Different types of databases are also better suited for different applications. For example, a vector database is often the best choice for a RAG framework.
Performance requirements: If an app continuously pulls data in real time, the organization needs a database that optimizes for query speed. However, if the organization needs a place to store data before sending it to a warehouse, performance might be less important.
Price: The amount of data an organization needs to store, the format of that data and the performance requirements can all contribute to database cost.
Scalability: Some databases can only scale vertically, meaning more resources must be added to an existing server or machine. Others can scale horizontally, meaning more servers can be added to support the database in a distributed fashion.
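One common way to scale horizontally is to split records across servers, often called sharding. The sketch below assigns each key to a shard by hashing it. It is a simplified illustration under stated assumptions: the `shard_for` helper, the key format and the shard count are hypothetical, and simple modulo placement moves many keys whenever the shard count changes, which is why distributed databases often use more elaborate schemes.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in the range [0, shard_count)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical cluster of 4 database servers.
SHARD_COUNT = 4
placement = {key: shard_for(key, SHARD_COUNT) for key in ("user:1", "user:2", "user:3")}
print(placement)
```

Because the hash is deterministic, every application server computes the same shard for the same key without coordinating with the others.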