Published: 18 January 2024
Contributors: Phill Powell, Ian Smalley
Data reduction is the process by which an organization limits the amount of data it stores. Data reduction techniques lessen the redundancy found in the original data set so that large volumes of source data can be stored more efficiently as reduced data.
At the outset, it should be stressed that the term “data reduction” does not automatically equate to a loss of information. In many instances, data reduction simply means that data is now stored in a smarter fashion, perhaps after being optimized and then reassembled with related data in a more practical configuration.
Nor is data reduction the same thing as data deduplication, in which extra copies of the same data are purged for streamlining purposes. More accurately, data reduction combines various aspects of different activities, such as data deduplication and data consolidation, to achieve its goals.
When data is discussed in the context of data reduction, it is often data in its singular sense, individual data points rather than data in the aggregate, plural sense typically used. One aspect of data reduction, for example, deals with defining the actual physical dimensions of individual data points.
There’s a considerable amount of data science involved in data-reduction activities. The material can be complex and difficult to summarize concisely, a dilemma that has spawned its own term: interpretability, the degree to which a human can understand a particular machine learning model.
Grasping the meanings of some of these terms can be challenging because this is data seen from a near-microscopic perspective. Data is usually discussed in its “macro” form, but data reduction often deals with data in its most “micro” sense. In practice, most discussions of this topic require both perspectives, the macro level and the micro end of the scale.
When an organization reduces the volume of data it carries, it typically realizes substantial financial savings, because consuming less storage space lowers storage costs.
Data reduction methods provide other advantages as well, such as increased data efficiency. Once data reduction has been achieved, the resulting data is easier for artificial intelligence (AI) methods to use in a variety of ways, including sophisticated data analytics applications that can greatly streamline decision-making tasks.
For example, when storage virtualization is used successfully, it helps coordinate server and desktop environments, enhancing their overall efficiency and reliability.
Data reduction efforts play a key role in data mining activities. Data must be as clean and prepared as possible before it’s mined and used for data analysis.
The following are some of the methods organizations can use to achieve data reduction.
The notion of data dimensionality underpins this entire concept. Dimensionality refers to the number of attributes (or features) assigned to a dataset. There’s a tradeoff at work here: the higher the dimensionality, the more storage the dataset demands. Furthermore, high-dimensional data tends to be sparse, which complicates necessary outlier analysis.
Dimensionality reduction counters this by limiting the “noise” in the data and enabling better visualization of data. A prime example of dimensionality reduction is the wavelet transform method, which assists image compression by maintaining the relative distance between objects at various resolution levels.
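To give a feel for the idea, here is a minimal sketch of one level of the 1-D Haar wavelet transform (a toy stand-in for the full wavelet machinery used in image codecs; the sample signal is hypothetical). The signal is split into pairwise averages and details, and near-zero details can be discarded to shrink storage:

```python
# Minimal 1-D Haar wavelet sketch (illustrative, not a production codec).
# One level: averages capture the coarse signal, details the local change.
def haar_step(signal):
    averages = [(a + b) / 2 for a, b in zip(signal[::2], signal[1::2])]
    details = [(a - b) / 2 for a, b in zip(signal[::2], signal[1::2])]
    return averages, details

def haar_inverse(averages, details):
    signal = []
    for avg, det in zip(averages, details):
        signal += [avg + det, avg - det]
    return signal

data = [10.0, 10.0, 12.0, 12.0, 50.0, 52.0, 8.0, 8.0]
avg, det = haar_step(data)
# Zeroing tiny detail coefficients is where the "reduction" happens.
det_compressed = [d if abs(d) > 0.5 else 0.0 for d in det]
restored = haar_inverse(avg, det_compressed)
```

Because this sample signal is smooth in pairs, most details are zero and the reconstruction is exact; on noisier data, dropping small details trades a little fidelity for a smaller representation.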
Feature extraction is another possible transformation for data, one that converts original data into numeric features and works in conjunction with machine learning. It differs from principal component analysis (PCA), another means of reducing the dimensionality of large data sets, in which a sizable set of variables is transformed into a smaller set that still retains most of the information from the original.
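As a sketch of the PCA idea (the toy dataset below is hypothetical: three features, two of them strongly correlated), projecting onto the top eigenvectors of the covariance matrix keeps most of the variance in fewer columns:

```python
import numpy as np

# Hypothetical toy data: feature 2 is nearly 2x feature 1, feature 3 is
# independent noise, so two components capture almost all the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + 0.01 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort axes by variance, descending
components = eigvecs[:, order[:2]]      # keep the top-2 principal axes
X_reduced = Xc @ components             # 100x3 reduced to 100x2
```

Storing `X_reduced` plus the two component vectors takes roughly two thirds of the space of `X` while retaining nearly all of the variance.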
Numerosity reduction involves selecting a smaller, less data-intensive format for representing data. There are two types of numerosity reduction: that based on parametric methods and that based on non-parametric methods. Parametric methods, such as regression, store model parameters rather than the data itself; similarly, a log-linear model might be employed that focuses on subspaces within the data. Non-parametric methods (like histograms, which show how numerical data is distributed) don’t rely on models at all.
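The contrast can be sketched in a few lines (the measurement series below is hypothetical). A parametric approach fits a regression line and keeps only its two parameters; a non-parametric approach keeps a handful of histogram counts instead of a model:

```python
import statistics

# Hypothetical series: 1,000 measurements that follow y = 3x + 7.
xs = list(range(1000))
ys = [3.0 * x + 7.0 for x in xs]

# Parametric: least-squares fit of y = a*x + b. Only (a, b) is stored,
# replacing 1,000 raw values with two parameters.
mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

# Non-parametric: a 10-bucket histogram stores only counts, no model.
bucket_counts = [0] * 10
for x in xs:
    bucket_counts[x * 10 // 1000] += 1
```

The parametric form is far smaller but only as good as the model's fit; the histogram makes no model assumptions but records less detail.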
Data cubes are a visual way to store data. The term “data cube” is almost misleading in its implied singularity, because it really describes a large, multidimensional cube composed of smaller, organized cuboids. Each cuboid represents some aspect of the total data within the data cube, specifically pieces of data concerning measurements and dimensions. Data cube aggregation, therefore, is the consolidation of data into this multidimensional format, which reduces data size by giving the data a unique container built specifically for that purpose.
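A minimal sketch of the aggregation step, using a hypothetical sales fact table (region, product, quarter, revenue): pre-summed cuboids, one per dimension, let queries hit small tables instead of scanning the raw rows.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter, revenue) rows.
rows = [
    ("East", "widgets", "Q1", 120.0),
    ("East", "widgets", "Q2", 135.0),
    ("East", "gadgets", "Q1", 80.0),
    ("West", "widgets", "Q1", 95.0),
    ("West", "gadgets", "Q2", 60.0),
]

# Aggregate into one cuboid per dimension: each holds pre-summed totals,
# so the raw rows no longer need to be scanned for these queries.
by_region = defaultdict(float)
by_product = defaultdict(float)
by_quarter = defaultdict(float)
for region, product, quarter, revenue in rows:
    by_region[region] += revenue
    by_product[product] += revenue
    by_quarter[quarter] += revenue
```

In a real OLAP system the cuboids cover combinations of dimensions as well, but the principle is the same: store aggregates once, in a container built for the queries.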
Another method enlisted for data reduction is data discretization, in which a continuous range of data values is divided into a defined set of intervals, each of which is represented by a single, determined data value.
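A small sketch of the idea (the age brackets and labels below are hypothetical): each continuous value is mapped to the interval it falls in, and the interval's label stands in for the exact value.

```python
# Hypothetical intervals: [low, high) ranges, each replaced by one label.
bins = [(0, 18, "minor"), (18, 40, "young adult"),
        (40, 65, "middle-aged"), (65, 200, "senior")]

def discretize(age):
    """Map a continuous age to the label of its interval."""
    for low, high, label in bins:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")

ages = [4, 17, 23, 39, 52, 71]
labels = [discretize(a) for a in ages]
```

Four labels replace an unbounded range of raw ages, which both shrinks the data and makes it easier to analyze categorically.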
To limit file size and achieve successful data compression, various types of encoding can be used. Data compression techniques are generally grouped into two types: lossless compression and lossy compression. In lossless compression, data size is reduced through encoding techniques and algorithms, and the complete original data can be restored when needed. Lossy compression uses other methods to perform its compression, and although its processed data may well be worth retaining, it will not be an exact copy of the original, as it would be with lossless compression.
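The distinction can be demonstrated with Python's standard library (the payload and sensor readings below are hypothetical): zlib round-trips to the exact original bytes, while a toy lossy step, quantizing readings, discards precision that cannot be recovered.

```python
import zlib

# Lossless: zlib restores the exact original bytes on decompression.
text = b"data reduction " * 200   # highly redundant sample payload
packed = zlib.compress(text)
restored = zlib.decompress(packed)

# Lossy (toy example): quantizing hypothetical sensor readings shrinks
# the representation, but the exact originals are gone for good.
readings = [12.34, 12.37, 12.91, 13.02]
quantized = [round(r, 1) for r in readings]
```

Redundant data like the sample payload compresses dramatically under lossless encoding; lossy methods go further by deciding which detail the consumer can live without.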
Some data needs to be cleaned, treated and processed before it undergoes data analysis and data reduction. Part of that transformation may involve converting data from analog to digital form. Binning is another example of data preprocessing, in which median values are used to smooth and normalize various types of data and help ensure data integrity across the board.
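Smoothing by bin medians can be sketched as follows (the values are a hypothetical noisy series): sort the data, split it into equal-size bins, and replace every value with its bin's median to damp out noise.

```python
import statistics

# Hypothetical noisy series, sorted before binning.
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4

# Replace each value with the median of its bin: small fluctuations
# inside a bin are smoothed away.
smoothed = []
for i in range(0, len(values), bin_size):
    bin_vals = values[i:i + bin_size]
    med = statistics.median(bin_vals)
    smoothed.extend([med] * len(bin_vals))
```

Bin means or bin boundaries can be substituted for medians; medians are the more outlier-resistant choice.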
Take advantage of a win-win situation for both your organization and the environment by using IBM FlashSystem storage. Consume less energy and reap cost savings, while reducing your company’s carbon footprint.
Imagine a solution that supports mirroring between on-premises and cloud data centers or between cloud data centers. IBM Spectrum Virtualize for Public Cloud also helps implement disaster recovery strategies.
Get the best of two worlds with IBM Storage as-a-Service. Start with on-premises hardware provided and managed by IBM. Couple that with a cloud-like, consumption-based pricing model, for a flexible combination.
Explore FlashSystems powered by IBM Spectrum Virtualize software that uses symmetric virtualization.
Energy costs and data seem to both be growing at exponential rates. As corporations grapple with this expensive reality, they require energy-efficient storage that they can rely on.
The Data Reduction Estimator tool (DRET) is a command-line, host-based utility for estimating the data reduction savings on block devices.