Netflix, Hadoop, and big data

Six rectangular tiles organized in neat horizontal bands define most of our days. These carefully curated images change with some regularity, so as not to get stale, but they only give the illusion that we have a choice of TV shows and movies to watch. They’re chosen for us.

Netflix uses data to suggest shows to visitors. How? Hordes of data scientists analyze what we’re watching and when we’re watching it, to make each dashboard match the interests of its viewer. That’s why no two dashboards are alike.

This is all thanks to an abundance of data being generated by you and me. It’s no surprise that we love data so much. It inspires beautiful, carefully curated experiences for users. But with great opportunity comes great risk.

Data can be a powerful resource if used properly, but it can also be a swamp of jumbled, unintelligible information.


One assumption about big data to avoid is that it will save you money.

Exploiting big data technologies to reduce costs is enticing. But engineers can’t simply lift and shift (a common approach to migrating data to the cloud).

While this approach can sometimes be cost-effective, it doesn’t result in improved data access and quality. Why? Because the data format in big data processing technologies requires a few wide data fields, not hundreds of small, discrete tables, as is the case with relational databases. Big data requires deliberate, optimized structures, which adds another layer to an environment and increases cost and complexity.
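To make the contrast concrete, here is a minimal sketch of what restructuring relational-style data into wide records might look like. The table names, fields and values are all hypothetical and exist only for illustration; a real migration would use the tooling of whichever platform you choose.

```python
# Hypothetical relational-style tables: small, discrete, linked by ids.
customers = {
    1: {"name": "Ana", "country": "BR"},
    2: {"name": "Lee", "country": "KR"},
}
titles = {
    10: {"title": "Show A", "genre": "drama"},
    11: {"title": "Show B", "genre": "comedy"},
}
views = [
    {"customer_id": 1, "title_id": 10, "minutes": 42},
    {"customer_id": 2, "title_id": 11, "minutes": 17},
    {"customer_id": 1, "title_id": 11, "minutes": 5},
]


def denormalize(views, customers, titles):
    """Join each view with its customer and title into one wide record."""
    wide = []
    for view in views:
        customer = customers[view["customer_id"]]
        title = titles[view["title_id"]]
        wide.append({
            "customer_id": view["customer_id"],
            "customer_name": customer["name"],
            "country": customer["country"],
            "title": title["title"],
            "genre": title["genre"],
            "minutes": view["minutes"],
        })
    return wide


wide_rows = denormalize(views, customers, titles)
```

Each wide row repeats the customer and title details, so nothing has to be joined at read time. The price is a second, duplicated copy of the data that someone has to design and maintain.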

Getting the most out of big data is expensive.

Another assumption about big data that has the potential for catastrophe is that data scientists must work in Hadoop, the ubiquitous data processing framework.

Spark, Cassandra, and Accumulo are just a few alternatives to Hadoop. Yet each still carries the possibility of increased data set duplication and a hefty price tag. As organizations adopt big data technologies, the need for experienced data scientists to help navigate new and emerging tools becomes greater.

As with anything new and shiny, big data is misunderstood. That’s why IBM has developed a new platform, the Digital Insights Platform, to help bring order to the ever-growing number of data technologies.

The Digital Insights Platform includes a three-tiered data lake component designed to help you manage the complexity risks of big data. It aims to increase your success by rationalizing your current data footprint through a data architecture that’s optimized for big data use. It can capture, store, analyze and act on data from traditional first- and third-party digital sources, giving you a common platform for analytic, operational and digital use cases.

The three data layers include:

  • One that stores and manages raw data in multiple formats such as Hadoop Distributed File System (HDFS), Parquet, Hive, Columnar and other object storage types
  • A conformed storage layer, which uses IBM’s industry-based data models to create common target files in HDFS to conform and integrate data from multiple sources into a common format
  • A data access layer, which contains both big data and traditional data formats and technologies and serves as the primary interface between analytic, cognitive and digital processes and the underlying data for large, complex organizations
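As a rough illustration of how three such layers relate, the sketch below keeps raw payloads untouched, conforms them to one common record shape, and exposes a single query over the result. It is a toy model under stated assumptions, not the Digital Insights Platform’s API: the file names, field names and functions are all hypothetical, and in-memory strings stand in for HDFS or object storage.

```python
import csv
import io
import json

# Raw layer: payloads stored as they arrived, each in its own format.
raw_layer = {
    "web_events.json": '[{"user": "u1", "show": "Show A", "mins": 42}]',
    "partner_feed.csv": "viewer_id,title,minutes\nu2,Show B,17\n",
}


def conform(raw):
    """Conformed layer: map every source onto one common record shape."""
    records = []
    for row in json.loads(raw["web_events.json"]):
        records.append({
            "viewer": row["user"],
            "title": row["show"],
            "minutes": int(row["mins"]),
        })
    for row in csv.DictReader(io.StringIO(raw["partner_feed.csv"])):
        records.append({
            "viewer": row["viewer_id"],
            "title": row["title"],
            "minutes": int(row["minutes"]),
        })
    return records


def minutes_by_title(records):
    """Access layer: one query interface over the conformed records."""
    totals = {}
    for record in records:
        totals[record["title"]] = totals.get(record["title"], 0) + record["minutes"]
    return totals


conformed_layer = conform(raw_layer)
totals = minutes_by_title(conformed_layer)
```

The consumer at the access layer never needs to know that one source was JSON and the other CSV. That separation is the point of conforming data before exposing it.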

We can’t all be Netflix. But we can take steps to properly manage and gain insights from our data through a multi-tiered data architecture and common data platform.

