Amp up your Spark: three keys to successful deployment

Today, the advantages of an integrated nationwide electricity infrastructure are obvious.  But in the early days of electrical power in the late 1800s, electricity was available only through small, isolated networks.  AC and DC grids competed with one another. This mishmash of isolated systems was costly, inefficient and inflexible.

Today, IT directors often face a similar situation with Apache Spark. Spark is generating a huge amount of excitement as a framework for powering big data analytics, and we are seeing many examples of individual, isolated Spark clusters as different lines of business or functional groups set up their own infrastructure to learn how to take advantage of it.

With its known advantages and increasing maturity, Spark is set to move into the mainstream. But businesses are not always sure how best to take the next step on the path to move Spark from its early adopter stage to production-level deployment. There are three important factors that can help organizations adopt Spark successfully.


Shared infrastructure is akin to an electrical grid that can move power from one region of a country to another in response to demand. By sharing computing infrastructure across applications and business groups, resources that would otherwise sit idle for lack of local demand can be made available to meet other groups' current workloads. This approach improves service levels and reduces costs, since the higher utilization allows the same amount of work to be accomplished with fewer resources. In one real-world case, a large corporation runs its applications on a shared infrastructure of approximately 16,000 cores. If these applications were in isolated silos, each sized to manage local peak demand, they would need about 28,000 cores. That's more than a 40 percent reduction in required cores.
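The arithmetic behind that figure can be checked with a short sketch. The function name `core_reduction` is a hypothetical helper introduced here for illustration; the only inputs are the two approximate core counts quoted above.

```python
def core_reduction(shared_cores: int, siloed_cores: int) -> float:
    """Return the fractional reduction in cores gained by sharing.

    Hypothetical helper for illustration; not part of Spark or any product.
    """
    if siloed_cores <= 0:
        raise ValueError("siloed_cores must be positive")
    return (siloed_cores - shared_cores) / siloed_cores


# Approximate figures quoted in the text above.
SHARED_CORES = 16_000
SILOED_CORES = 28_000

reduction = core_reduction(SHARED_CORES, SILOED_CORES)
print(f"Cores saved: {SILOED_CORES - SHARED_CORES}")  # prints "Cores saved: 12000"
print(f"Reduction: {reduction:.1%}")                  # prints "Reduction: 42.9%"
```

Because both core counts are approximate, the result is best read as "a little over 40 percent" rather than as a precise figure.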


When you deploy Spark, you need other components. Spark requires a resource scheduler to allocate work to available servers. It also requires data management to handle the data that you are going to analyze, as well as the results of that analysis. And you need to monitor and report on the state of the system. This is rather like the different components of the power supply system, such as power generation plants, high-voltage long-distance transmission, and local distribution, all of which work together to bring electrical power to the home. One important consideration is how willing, or indeed able, your organization is to build a solution from individual components instead of buying a complete solution from a single vendor. Putting together open-source components requires significant time and expertise. With an integrated solution from a vendor, you have a single point of contact to help get the system up and running, and to keep it running if you have problems.


Finally, you need to be sure not to get locked into an inflexible solution.  Think about how power generation has changed over the years, including sources such as coal, hydroelectricity, nuclear, solar, wind and wave energy, all of which have been successfully incorporated into the power grid. Given the rapid pace of Spark development, a system to handle multiple versions of Spark efficiently and flexibly is vital.  And powerful as it is, Spark is not the solution to everything and won’t be around forever. You need an infrastructure that has the flexibility to work with Spark and other solutions, and to handle whatever comes after Spark.

Armed with a good understanding of these issues and how to address them, you will be ready to “amp up your Spark.” For more about deploying Spark in an enterprise environment, please check out this webcast.
