The more data you have, the more hardware you need in the back end. This is where the economics of storage come in: if you can reduce the amount of data, you can cut costs by reducing the amount of hardware needed to store it. And if you can get an accurate estimate of which data can be reduced and by how much, you can save even more by buying only the hardware you need.
Because data reduction works differently for different kinds of data, clients are not always aware of the potential and of whether or not it pays to invest in new hardware. This is where our tool can help.
As part of our work on IBM’s new FlashSystem A9000, my team at IBM Research – Haifa made two contributions. First, we worked on algorithms to help scale the performance of data reduction for very large systems. Second, we built a new tool, code-named the “Data Reduction Estimator”, to help clients identify how much they can benefit from deduplication and compression of their data.
Deduplication, also known as “dedup”, is a data reduction method that works by saving repeated data chunks only once and then pointing to that single copy from all of the other places the chunk is used. This method is particularly well suited to flash storage, which can read data almost instantaneously, so following those pointers is cheap, but which costs more than standard disk storage, so every chunk saved counts.
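To make the idea concrete, here is a minimal Python sketch of deduplication. It is a toy illustration, not how the FlashSystem A9000 implements it: the fixed 4 KB chunk size, the SHA-256 fingerprints, and the `DedupStore` class are all assumptions chosen for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


class DedupStore:
    """Toy store that keeps each distinct chunk once (hypothetical class)."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes, stored only once
        self.files = {}   # file name -> list of fingerprints (the "pointers")

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this fingerprint has not been seen yet.
            self.chunks.setdefault(fp, chunk)
            refs.append(fp)
        self.files[name] = refs

    def read(self, name):
        # Follow the pointers to rebuild the original data.
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
block_a = b"A" * CHUNK_SIZE
block_b = b"B" * CHUNK_SIZE
store.write("file1", block_a + block_b)
store.write("file2", block_a + block_a + block_b)
print(store.stored_bytes())  # 8192
```

Five chunks are written across the two files, but only two distinct chunks are kept, so 20,480 logical bytes occupy 8,192 bytes in the store. Reading a file back means following its pointers, which is the kind of random read that flash handles well.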
New challenges for data reduction in primary storage
Compression has been around since the 1970s and is widely accepted in the industry. In fact, many of us use it every day in zip files. Deduplication made its name about 10 years ago in backup, where it could reduce the size of the data being backed up by allowing systems to store only what had changed from the day before. At that time it became clear that if you use a smart system for deduplication, you can save tons of space for backup.
More recently, dedup has become a popular choice for reducing the size of primary storage. We looked carefully into the kinds of data that do or don’t benefit from deduplication. Take, for example, high-level enterprise data like databases that store repositories of names or transactions, where there are a lot of small entries without very much repetition. Such data gains little from deduplication. But the opposite is true for the world of virtual machines running on the cloud.
A virtual machine essentially takes a person’s computer and moves it into central storage, where software simulates a physical machine and runs the operating system on it. A large enterprise like a bank doesn’t need to give every employee a computer. Rather, each person can have a screen and a keyboard and run their “desktop” virtually on the cloud. With each person running the same operating system, and very likely many of the same programs, we can benefit from huge cost reductions with deduplication. For an organization with 1,000 such machines, deduplication can significantly reduce the amount of storage space the system needs.
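The virtual desktop case can be sketched with a rough estimate in Python. This is not the Data Reduction Estimator itself, only a hedged illustration of the kind of question such a tool answers. The `estimate_reduction` function, the 4 KB chunk size, the use of zlib, and the synthetic data (ten desktops sharing one "OS image") are all assumptions made for the example.

```python
import hashlib
import os
import zlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def estimate_reduction(volumes):
    """Return (logical bytes, bytes after dedup, bytes after dedup + compression).

    Hypothetical helper: scans every chunk of every volume in memory.
    """
    seen = set()
    logical = after_dedup = after_both = 0
    for data in volumes:
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            logical += len(chunk)
            fp = hashlib.sha256(chunk).digest()
            if fp in seen:
                continue  # duplicate chunk: would be stored as a pointer only
            seen.add(fp)
            after_dedup += len(chunk)
            # Count the compressed size only when compression actually helps.
            after_both += min(len(chunk), len(zlib.compress(chunk)))
    return logical, after_dedup, after_both


# Synthetic stand-in for virtual desktops: a shared "OS image" of 64 chunks
# plus 4 chunks of unique data per desktop.
os_image = os.urandom(64 * CHUNK_SIZE)
desktops = [os_image + os.urandom(4 * CHUNK_SIZE) for _ in range(10)]

logical, deduped, reduced = estimate_reduction(desktops)
print(f"dedup ratio: {logical / deduped:.1f}:1")
```

With these made-up numbers, 680 chunks are written but only 104 are distinct, a ratio of roughly 6.5:1. The test data here is random, so the compression step saves nothing; real operating system images and user files would behave differently, which is exactly why measuring the actual data matters.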
The bottom line is that data reduction is great for reducing costs, and deduplication is a key capability when it comes to primary storage in all-flash systems. My talk at Edge, together with IBM Fellow Andy Walls, was a “Deep Dive into Deduplication in All-Flash Storage” that explained deduplication, the related challenges, what to expect from different workloads, and how to use the tool.