Apache Spark Tutorial: Top 5 Tips to Get Started


I relish building tools that solve real-world problems, and I’m tremendously excited about our Apache Spark-as-a-service product, IBM Analytics for Apache Spark. We’re making the power and capabilities of Spark – and a new platform for creating big data analytics and application design – available to developers, data scientists, and business analysts, who previously had to deal with IT for support or simply do without.

If you’ve signed up for the Beta, you are part of a tribe of next-generation problem-solvers and application builders. On behalf of the Spark team, we look forward to hearing about your experience with the product.

The service is nearing public beta, so here are some top tips to help you ramp up quickly on Spark:

  1. Gather and load data – a few simple steps

    The core of any data analysis/processing effort is getting your arms around all of your data. Your first step should be to understand where your data resides and how to get it out of its current home so that it can be uploaded to the IBM Analytics for Apache Spark service. This will usually involve some sort of export or extract from its current location. It also helps to understand the structure of the data so that you can start thinking through how to analyze it. One of the key attributes of Spark is that you no longer have to clean or reformat your data before loading it; Spark works exceptionally well for Extract-Load-Transform (ELT) use cases, so you’ll most likely find it much easier to do this work within the Spark service.

    Once you’re ready to upload, navigate to the Swift Object Storage attached to each Spark account and use the web interface to upload your data. For files larger than 5GB, I’d recommend the command-line approach outlined in the Swift documentation: because Swift enforces a 5GB-per-object limit, the client splits a single large file into multiple chunks, uploads them, and reassembles them within Swift Object Storage. The limit then applies only to the individual chunks, which is what makes uploads of larger files possible.

    Uploading will also take time, and how long depends on your internet connection and the size of your data. We highly recommend uploading from your workplace or, even better, directly from your server farm if possible, as these tend to have faster connections that sit closer to Internet backbones.

  2. Super in-notebook tutorials are at your fingertips

    You’ll find built-in tutorials within the IBM Analytics for Apache Spark service that are straightforward, relevant to day-to-day needs, and walk through each step of getting started. These tutorials demystify Spark and highlight key concepts clearly and concisely. I highly recommend working through one or both tutorials if you’re new to Spark – or even if you’re not new but still shaky on how Spark works. You’ll be up and running with your own Spark programs in no time!

  3. Get smart on Spark with a variety of resources

    While the tutorials are great, they aren’t your only way to get up to speed quickly. In a prior post, we outlined key resources for different levels of familiarity with IBM Analytics for Apache Spark – New, Basic, Intermediate, and Expert. Please take a look and jump in at whatever level feels comfortable.

  4. Ask for help – almost everyone out there is a beginner

    Given the newness of IBM Analytics for Apache Spark, there are few experts in the Spark community so far. As an example, in a recent survey of developers and data scientists, 28% had never even heard of Spark! Spark users will be learning from each other, and we encourage you to ask the Spark community questions in the forums. Most of the community is learning Spark for the first time, so please join the discussions and post your questions freely.

  5. Try before you buy – and have fun!

    For the duration of the public beta of IBM Analytics for Apache Spark, we will not charge for the service (note that we will monitor usage and may implement per-user caps if necessary). We hope you take the opportunity to try out new analytics and processing approaches – and learn a new programming language in doing so – with no commitment required. At IBM we’re fully committed to bringing the power of Spark to everyone, and making this initial period free is one way we’re doing so.
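
The chunked-upload approach from tip 1 can be sketched in plain Python. This is an illustrative sketch of the split-and-reassemble idea only – the function names are hypothetical and it is not the Swift client itself, which handles the upload and server-side reassembly for you. A tiny chunk size stands in for Swift’s 5GB per-object cap:

```python
import os
import tempfile

def split_file(path, chunk_size):
    """Split the file at `path` into numbered chunk files; return their paths in order."""
    chunks = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            chunk_path = f"{path}.part{index:04d}"
            with open(chunk_path, "wb") as dst:
                dst.write(data)
            chunks.append(chunk_path)
            index += 1
    return chunks

def reassemble(chunks, out_path):
    """Concatenate the chunk files back into one file (what Swift does server-side)."""
    with open(out_path, "wb") as dst:
        for chunk_path in chunks:
            with open(chunk_path, "rb") as src:
                dst.write(src.read())

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "data.csv")
    with open(original, "wb") as f:
        f.write(b"row," * 20)            # 80 bytes of sample data
    parts = split_file(original, 16)     # 16-byte chunks -> 5 chunk files
    restored = os.path.join(tmp, "restored.csv")
    reassemble(parts, restored)
    assert open(restored, "rb").read() == open(original, "rb").read()
```

In practice you never write this yourself – the Swift command-line client does the chunking and reassembly – but it shows why the 5GB limit applies only to the individual chunks, not to your original file.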
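
For a rough sense of the upload times mentioned in tip 1, you can divide the data size by the link speed. This is a back-of-the-envelope sketch (the function is hypothetical, and real throughput will be somewhat lower due to protocol overhead and contention):

```python
def upload_time_hours(size_gb, link_mbps):
    """Estimate upload time in hours for `size_gb` gigabytes over a `link_mbps` link."""
    size_megabits = size_gb * 1024 * 8       # GB -> megabits (1 GB = 1024 MB = 8192 Mb)
    return size_megabits / link_mbps / 3600  # megabits / (Mb/s) = seconds; then hours

# A 5GB extract over a 10 Mb/s home connection vs. a 1 Gb/s office link:
print(round(upload_time_hours(5, 10), 2))    # ~1.14 hours
print(round(upload_time_hours(5, 1000), 4))  # ~0.0114 hours (about 41 seconds)
```

The gap between those two numbers is why uploading from a well-connected workplace or server farm beats a home connection for large datasets.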

And of course, if you haven’t already done so, check out my recommended reading in Get Smarter About Apache Spark.
