
Get Smarter About Apache Spark

We often forget how new Spark is. Although it was invented much earlier, Apache Spark became a top-level Apache project only in February 2014, a status that generally indicates it’s ready for anyone to use. That was just 18 months ago. I might have a toothbrush that is older than Apache Spark!

Since then, Spark has generated tremendous interest because the new data processing platform scales so well, delivers high performance (up to 100 times faster than alternatives), and is more flexible than the alternatives, both open source and commercial. (If you’re interested, see the trends in both Google searches and Indeed job postings.)

Spark gives data scientists, business analysts, and developers a new platform for managing data and building services, because its in-memory processing makes it possible to compute in real time. The project is extremely active, with ongoing development and serious investment from IBM and key players in Silicon Valley.

Tips for getting started with Apache Spark

Given Spark’s great potential to revolutionize advanced analytics for big data and modern applications, the IBM Analytics for Apache Spark team is frequently asked for tips on great resources for getting up to speed on Spark.

Below is our team’s list of recommended resources that we share with you in anticipation of the IBM Analytics for Apache Spark open beta:

You have no idea what Spark is and want to at least be informed

You want to use Spark and want to understand the basics

You are familiar with Spark and want to continue learning

You are already experienced with Spark and want to reach expert level

6 Comments


Tamar Eilam

This is a great list


WhitepeakSoftware

Nice compilation of resources!

A few days back, I started the Spark Fundamentals I course on bigdatauniversity.com. I downloaded the 5+ GB QSE image but was surprised to find that the Spark service was missing when I started all services. On digging deeper (after the spark-shell failed to start), I found that the Spark binaries are not present in the required folder [the soft link spark-client -> /usr/iop/4.0.0.0/spark is there, but the actual binaries are missing].

I wasted a lot of time troubleshooting. Since I could not fix the problem with the Spark image (out of desperation I even tried to install and build Spark myself), I am now trying to see whether the alternate Docker image works.

I posted in the help section on bigdatauniversity.com, but no one replied. I am surprised there are no forums on bigdatauniversity.com. I searched a lot but could not find any related link. Can you help me out, please? I need to get up to speed with Spark quickly, for both professional and academic reasons.


WhitepeakSoftware

Fortunately, I now find that I can get the spark-shell up and running with the Docker image, but I would love to get the same on the QSE (BigInsights) image. Apache Ambari simply does not show the Spark service as up, even though I do “start all” from the console or run “restartAll.sh” from the terminal. Looking for your input and help!


WhitepeakSoftware

I am surprised by the author’s unresponsiveness to the problem I faced in the bigdatauniversity Spark course. It was the author who suggested the bigdatauniversity course, and when we ran into a problem and mentioned it, he was silent. This shows a great lack of responsibility. If you do not know the answer, at least admit it; do not be silent.

Fortunately, I found the answer to my question in the forum. The problem was that I could not locate the forum link.


huangdk

Where can I download the Spark Docker image?


Luis Arellano

Hi huangdk,

IBM’s Spark-as-a-Service is not available as a Docker image; rather, it is a fully multitenant cloud service. You simply sign up for a free 30-day trial at the following link:

http://www.ibm.com/analytics/us/en/technology/cloud-data-services/spark-as-a-service/

Cheers,
Luis
