Spark Summit Europe
Jean-Francois Puget
Please join me at the Spark Summit next week (Oct 25-27) in Brussels. This is one of the yearly events where the Spark community gathers. More details can be found at: http
I will be talking about Machine Learning with my colleague Nick Pentreath at the meetup we organize right after the summit, on Thursday night. Location details below:
Here is the Agenda:
— Philippe Van Impe, Founder, European Data Innovation Hub & Brussels Data Science Community.
— Berni Schiefer, IBM Fellow
19:00 Creating an end-to-end Recommender System with Spark ML
There are many resources available for building basic recommendation models using Spark. But how does a practitioner go from the basics to creating an end-to-end machine learning system, including deployment and management of models for real-time serving? In this session, we will demonstrate how to build such a system based on Spark ML and Elasticsearch. In particular, we will focus on how to go from data ingestion to model training to a real-time predictive system.
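The session's actual stack is Spark ML plus Elasticsearch; for orientation only, the collaborative-filtering core behind spark.ml.recommendation.ALS can be sketched in plain numpy on a tiny, fully observed ratings matrix. This is a toy sketch of alternating least squares, not the session's code:

```python
import numpy as np

def als(R, k=2, reg=0.1, iters=20, seed=0):
    """Alternating least squares on a dense ratings matrix R (users x items).

    Returns user factors U and item factors V such that R is
    approximated by U @ V.T.
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.standard_normal((m, k)) * 0.1
    V = rng.standard_normal((n, k)) * 0.1
    I = np.eye(k)
    for _ in range(iters):
        # Fix V and solve a ridge regression per user, then the reverse.
        U = R @ V @ np.linalg.inv(V.T @ V + reg * I)
        V = R.T @ U @ np.linalg.inv(U.T @ U + reg * I)
    return U, V

# Tiny example: two groups of users with opposite tastes.
R = np.array([[5.0, 4.0, 0.5],
              [4.0, 5.0, 1.0],
              [1.0, 0.5, 5.0]])
U, V = als(R)
pred = U @ V.T  # predicted ratings; rank items per user by this score
```

Spark's real implementation distributes the same alternating updates over partitioned factor blocks and handles implicit feedback; the serving side the session covers (Elasticsearch) is not shown here.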
10-minute Spark and machine learning talks.
1. Data Science as a Team Sport
20:45 Networking & Refreshments
There are also a number of presentations and events hosted by my IBM colleagues. You'll find the complete list below. And of course there are a number of great talks from speakers outside IBM. This is a unique opportunity to catch up with the vibrant Spark community. Please join us!
“Scaling Factorization Machines on Spark Using Parameter Servers”
by Nick Pentreath
Factorization machines (FMs) are a relatively new class of model that is extremely powerful, as FMs can efficiently capture arbitrary-order interactions between features. FMs are becoming increasingly popular in settings with large amounts of sparse data, including recommender systems and online advertising. Furthermore, with appropriate feature engineering, they can mimic most commonly used factorization-based models for collaborative filtering. However, one drawback of FMs is that, even though they are relatively efficient to train, they can still be difficult to scale to very large feature dimensions. This talk will explore scaling up FMs on Spark, using the Glint parameter server built on Akka. Rather than a general exploration of parameter server architectures, the focus will be on specific technical aspects of training factorization machines, with code examples, performance analysis and comparisons. It will also cover integration with Spark DataFrames and ML pipelines for feature engineering and cross-validation. Example code will be available as open source.
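The abstract doesn't spell out the model itself. For orientation, a second-order FM prediction can be computed in O(kn) rather than O(n²) using the standard reformulation of the pairwise term. A minimal numpy sketch (not the speaker's Glint/Spark code):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction for one sample x.

    The pairwise interactions are computed in O(k*n) via the identity
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ].
    """
    linear = w0 + w @ x
    s = V.T @ x                 # per-factor sums, shape (k,)
    s2 = (V ** 2).T @ (x ** 2)  # per-factor sums of squares, shape (k,)
    return linear + 0.5 * np.sum(s * s - s2)

# Sanity check against the naive O(n^2) pairwise sum.
rng = np.random.default_rng(0)
n, k = 5, 3
x = rng.standard_normal(n)
w0, w, V = 0.1, rng.standard_normal(n), rng.standard_normal((n, k))
naive = w0 + w @ x + sum(
    V[i] @ V[j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n)
)
fast = fm_predict(x, w0, w, V)
```

The O(kn) form is what makes FMs tractable on sparse data; scaling the *parameters* (w and V) across machines is the part the parameter-server approach in this talk addresses.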
“From Single-Tenant Hadoop to 3000 Tenants in Apache Spark: Experiences from Watson Analytics”
by Alexander Lang
IBM Watson Analytics for Social Media uses a pipeline for deep text analytics and predictive analytics based on Apache Spark. This session describes our journey from our predecessor product, which used Hadoop in environments dedicated per tenant, to a system based on Apache Spark (both “core” and streaming), Kafka and ZooKeeper that serves more than 3000 tenants. We will describe our thought process, our current architecture, and the lessons we’ve learned since we put the environment into production in December 2015. Key takeaways:
– Changes to design, development and operations thinking required when going from single-tenancy to multi-tenancy
– Architecture of a multi-tenant Spark solution in production
– Orchestration of several Spark apps within a common data pipeline
– Benefits of Apache Spark, Kafka and ZooKeeper in a multi-tenant data pipeline architecture
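The abstract doesn't include implementation details. As a purely illustrative sketch of the multi-tenancy idea, here is a toy example (plain Python, hypothetical tenant IDs) of fanning one shared record stream out to per-tenant processing, the way a shared Kafka topic keyed by tenant might be consumed by a common pipeline:

```python
from collections import defaultdict

def route_by_tenant(messages):
    """Group a shared stream of (tenant_id, payload) records per tenant.

    In a real pipeline the stream would come from Kafka and each group
    would feed tenant-scoped processing; here we just bucket in memory.
    """
    buckets = defaultdict(list)
    for tenant_id, payload in messages:
        buckets[tenant_id].append(payload)
    return dict(buckets)

# Hypothetical tenants sharing one stream.
stream = [("acme", "tweet-1"), ("globex", "post-9"), ("acme", "tweet-2")]
per_tenant = route_by_tenant(stream)
```

The hard parts the session covers, such as isolation, orchestration of several Spark apps, and operating this for 3000 tenants, are exactly what this toy sketch leaves out.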
“From machine learning to learning machines: Creating an end-to-end cognitive process with Apache Spark™”
Thursday, October 27, 10:00 – 10:10
By Dinesh Nirmal
Many people think of machine learning as something that begins with data and ends with a model. But machine learning in practice is a continuous process that begins with an application and never ends. Apache Spark has made many parts of this process dramatically easier. As an active member of the Apache Spark community, we have recognized – through hosting meet-ups, advisory boards, and working with clients – the challenges that practitioners face in closing the loop and adapting automatically to changing business environments. Over the last 12 months we contributed over 25,600 lines of code to Apache Spark, including Spark ML, SparkR, and PySpark, and we’ve brought Apache SystemML to 356,000 lines of code, laying the groundwork for machine learning in business solutions and in particular for an end-to-end machine learning framework. In this keynote, I will share our recent progress and where we are headed with machine learning – towards a comprehensive vision for more effectively supporting continuous machine learning.
“SparkOscope: Enabling Apache Spark Optimization Through Cross-Stack Monitoring and Visualization”
by Yiannis Gkoufas
During the last year we have been using Apache Spark to perform analytics on large volumes of sensor data. These applications need to be executed on a daily basis, so it was essential for us to understand Spark resource utilization. We found it cumbersome to manually consume and efficiently inspect the CSV files for the metrics generated at the Spark worker nodes. Although an external monitoring system like Ganglia would automate this process, we were still plagued by the inability to derive temporal associations between system-level metrics (e.g. CPU utilization) and job-level metrics (e.g. job or stage ID) as reported by Spark. For instance, we were not able to trace the root cause of a peak in HDFS reads or CPU usage back to the code in our Spark application causing the bottleneck. To overcome these limitations we developed SparkOscope. To minimize source-code pollution, SparkOscope builds on the existing Spark Web UI, taking advantage of the job-level information it already exposes to monitor and visualize job-level metrics of a Spark application (e.g. completion time). More importantly, it extends the Web UI with a palette of system-level metrics for the server/VM/container that each of the Spark job’s executors ran on. Using SparkOscope, the user can navigate to any completed application and identify application-logic bottlenecks by inspecting in-depth time series for all relevant system-level metrics related to the Spark executors, while easily associating them with the stages, jobs and even source code lines incurring the bottleneck. Github: http
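For context, the per-node CSV files the abstract mentions come from Spark's built-in metrics system, which can write metrics to disk via its CSV sink. Enabling it in conf/metrics.properties looks roughly like this (the directory and period below are illustrative values, not the authors' configuration):

```properties
# Attach the CSV sink to all metric sources (drivers, executors, workers)
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Poll and write metrics every 10 seconds
*.sink.csv.period=10
*.sink.csv.unit=seconds
# Directory on each node where the per-metric CSV files accumulate
*.sink.csv.directory=/tmp/spark-metrics
```

It is exactly these scattered per-node files that are tedious to correlate with jobs and stages by hand, which is the gap SparkOscope addresses.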
“Spark SQL 2.0 Experiences Using TPC-DS”
Thursday, October 27, 17:15 – 17:45
by Berni Schiefer
This talk summarizes the results of using the TPC-DS workload to characterize the SQL capability, performance and scalability of Apache Spark SQL 2.0 at the multi-terabyte scale, in both single-user dedicated and multi-user concurrent execution modes. We track the evolution of Spark SQL across versions 1.5, 1.6 and 2.0 to underscore the pace of improvement in Spark SQL capability and performance. We also provide best practices and configuration tuning parameters to support the concurrent execution of the 99 TPC-DS queries at scale. Key takeaways:
1) The substantial progress made by Spark SQL 2.0
2) What TPC-DS is and why it has become the preferred workload for evaluating SQL-on-Hadoop systems
3) Experimental results supporting the optimized execution of multi-user, multi-terabyte TPC-DS-based workloads
4) Tuning and configuration changes used to attain excellent performance of Spark SQL
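The talk's specific tuning values aren't given in this summary. As a hedged illustration only, multi-user Spark SQL runs are typically tuned through settings like the following in spark-defaults.conf; the values below are placeholders for a hypothetical cluster, not the speaker's recommendations:

```properties
# Fair scheduling so concurrent query streams share executors
spark.scheduler.mode                  FAIR
# Shuffle parallelism sized to the cluster rather than the default of 200
spark.sql.shuffle.partitions          800
# Broadcast small dimension tables (here up to 100 MB) instead of shuffling them
spark.sql.autoBroadcastJoinThreshold  104857600
# Executor memory sized for multi-terabyte scans and joins
spark.executor.memory                 16g
```

Settings like these interact with data scale and concurrency level, which is precisely what the TPC-DS experiments in the talk measure.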