Data reduction is the process by which an organization limits the amount of data it stores.
Data reduction techniques seek to lessen the redundancy in the original data set so that large amounts of originally sourced data can be stored more efficiently as reduced data.
At the outset, it should be stressed that the term “data reduction” does not automatically equate to a loss of information. In many instances, data reduction only means that data is now being stored in a smarter fashion—perhaps after going through the optimization process and then being reassembled with related data in a more practical configuration.
Nor is data reduction the same thing as data deduplication, in which extra copies of the same data are purged for streamlining purposes. More accurately, data reduction combines various aspects of different activities, such as data deduplication and data consolidation, to achieve its goals.
When data is discussed in the context of data reduction, we’re often speaking of data in its singular form (individual data points) rather than the pluralized, aggregate sense in which the word is typically used. One aspect of data reduction, for example, deals with defining the actual dimensions of individual data points.
There’s a considerable amount of data science involved in data-reduction activities. The material can be fairly complex and difficult to summarize concisely, and that difficulty has spawned its own term: interpretability, or the ability of a person of average intelligence to understand a particular machine learning model.
Grasping the meanings of some of these terms can be challenging because this is data seen from a near-microscopic perspective. We usually discuss data in its “macro” form, but in data reduction we’re often speaking of data in its most “micro” sense; in practice, most discussions of the topic move between the macro level and the micro end of the scale.
When an organization reduces the volume of data it’s carrying, it typically realizes substantial financial savings, because consuming less storage space means lower storage costs.
Data reduction methods provide other advantages, as well, like increasing data efficiency. When data reduction has been achieved, that resulting data is easier for artificial intelligence (AI) methods to use in a variety of ways, including sophisticated data analytics applications that can greatly streamline decision-making tasks.
For example, when storage virtualization is used successfully, it assists the coordination between server and desktop environments, enhancing their overall efficiency and making them more reliable.
Data reduction efforts play a key role in data mining activities. Data must be as clean and prepared as possible before it’s mined and used for data analysis.
The following are some of the methods organizations can use to achieve data reduction.
The notion of data dimensionality underpins this entire concept. Dimensionality refers to the number of attributes (or features) assigned to a single dataset. However, there’s a tradeoff at work here: the greater the dimensionality, the more storage that dataset demands. Furthermore, the higher the dimensionality, the sparser the data tends to be, which complicates the outlier analysis that’s often necessary.
Dimensionality reduction counters that by limiting the “noise” in the data and enabling better visualization of data. A prime example of dimensionality reduction is the wavelet transform method, which assists image compression by maintaining the relative distance that exists between objects at various resolution levels.
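As a rough sketch of the wavelet transform idea, the following Python example uses the PyWavelets library to decompose an image into coefficients at multiple resolution levels, discard the smallest coefficients and reconstruct an approximation. The random image, choice of Haar wavelet and threshold value are illustrative assumptions rather than anything prescribed here.

```python
# Illustrative wavelet-based reduction sketch using PyWavelets (pip install PyWavelets).
# The image, wavelet and threshold below are assumptions made for demonstration.
import numpy as np
import pywt

image = np.random.rand(256, 256)  # stand-in for a real image

# Decompose the image into wavelet coefficients at two resolution levels
coeffs = pywt.wavedec2(image, wavelet="haar", level=2)

# Zero out small detail coefficients; the surviving coefficients form a
# much sparser (and therefore more compressible) representation
threshold = 0.1
reduced = [coeffs[0]] + [
    tuple(pywt.threshold(c, threshold, mode="hard") for c in level)
    for level in coeffs[1:]
]

# Reconstruct an approximation of the original image from the reduced coefficients
approximation = pywt.waverec2(reduced, wavelet="haar")
```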
Feature extraction is another possible transformation for data, one that converts original data into numeric features and works in conjunction with machine learning. It differs from principal component analysis (PCA), another means of reducing the dimensionality of large data sets, in which a sizable set of variables is transformed into a smaller set that still retains most of the information in the larger set.
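A minimal PCA sketch, assuming scikit-learn, a synthetic 10-feature dataset and an arbitrary choice of three components, might look like this:

```python
# Minimal PCA sketch; the synthetic data and component count are illustrative.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 10)       # 500 samples, 10 original features

pca = PCA(n_components=3)         # transform into a smaller set of variables
X_reduced = pca.fit_transform(X)  # shape: (500, 3)

# Fraction of the original variance retained by the reduced representation
print(pca.explained_variance_ratio_.sum())
```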
Numerosity reduction involves selecting a smaller, less data-intensive format for representing data. There are two types of numerosity reduction: that which is based on parametric methods and that which is based on non-parametric methods. Parametric methods such as regression keep model parameters instead of the data itself. Similarly, a log-linear model might be employed that focuses on subspaces within the data. Meanwhile, non-parametric methods (like histograms, which show the way numerical data is distributed) don’t rely upon models at all.
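To make the contrast concrete, the short sketch below summarizes the same synthetic data two ways: parametrically, by keeping only the fitted parameters of a regression line, and non-parametrically, by keeping only histogram bin counts. The data, noise level and bin count are illustrative assumptions.

```python
# Parametric vs. non-parametric numerosity reduction on synthetic data.
import numpy as np

x = np.arange(1_000, dtype=float)
y = 2.5 * x + np.random.normal(scale=5.0, size=x.size)

# Parametric: fit a regression line and keep only its parameters
# (slope and intercept) rather than the raw y values
slope, intercept = np.polyfit(x, y, deg=1)

# Non-parametric: summarize the distribution of y with a histogram,
# keeping only bin counts and edges rather than a fitted model
counts, bin_edges = np.histogram(y, bins=20)
```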
Data cubes are a visual way to store data. The term “data cube” is actually almost misleading in its implied singularity, because it’s really describing a large, multidimensional cube that’s composed of smaller, organized cuboids. Each of the cuboids represents some aspect of the total data within that data cube, specifically pieces of data concerning measurements and dimensions. Data cube aggregation, therefore, is the consolidation of data into the multidimensional cube visual format, which reduces data size by giving it a unique container specifically built for that purpose.
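One way to picture the aggregation step is with a small pandas sketch that rolls detailed records up into a multidimensional summary; the column names (region, product, quarter, sales) and values are hypothetical.

```python
# Hypothetical sales records aggregated into a small "data cube"-style summary.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "sales":   [100, 150, 200, 250, 120],
})

# Each cell of the resulting summary (a "cuboid") holds the total sales for
# one region/product/quarter combination, replacing the row-level detail
cube = sales.groupby(["region", "product", "quarter"])["sales"].sum()
print(cube)
```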
Another method enlisted for data reduction is data discretization, in which a continuous range of data values is divided into a defined set of intervals, with each interval corresponding to a single, determined data value.
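A minimal discretization sketch, assuming pandas and a hypothetical set of age intervals, might look like the following:

```python
# Map continuous age values onto a defined set of labeled intervals.
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 68, 90])

labels = pd.cut(
    ages,
    bins=[0, 18, 40, 65, 120],                          # interval boundaries
    labels=["child", "young adult", "adult", "senior"]  # one value per interval
)
print(labels)
```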
In order to limit file size and achieve successful data compression, various types of encoding can be used. In general, data compression techniques are grouped into two types: lossless compression and lossy compression. In lossless compression, data size is reduced through encoding techniques and algorithms, and the complete original data can be restored if needed. Lossy compression, on the other hand, discards some information as it compresses; although the processed data may still be worth retaining, it will not be an exact copy of the original, as it would be with lossless compression.
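The lossless case is easy to demonstrate with Python’s built-in zlib module; the repeated sample text below is purely illustrative.

```python
# Lossless compression round trip with the standard-library zlib module.
import zlib

original = b"data reduction " * 1_000

compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # the encoded form is far smaller
assert restored == original            # lossless: the exact original is recovered
```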
Some data needs to be cleaned, treated and processed before it undergoes the data analysis and data reduction processes. Part of that transformation may involve converting the data from analog to digital form. Binning is another example of data preprocessing, one in which median values are used to smooth and normalize various types of data and help ensure data integrity across the board.
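As a rough sketch of binning, assuming equal-width bins and a small set of illustrative measurements, each value can be replaced by the median of its bin:

```python
# Smooth noisy values by replacing each one with the median of its bin.
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.cut(values, bins=3)  # three equal-width bins
smoothed = values.groupby(bins, observed=True).transform("median")
print(smoothed)
```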