
Data scientists need a cloud sandbox

Big data is not just about scaling your data analytics platforms to keep up with the onslaught of new information. Just as important, it is about bringing together your best and brightest minds, your data scientists, and giving them the tools they need to explore rich information sets interactively and collaboratively.

Data scientist productivity is a critical concern, especially when you’re talking about high-priced talent in short supply. If you don’t provide your data scientists with scalable modeling platforms, you won’t realize the full value of your investment in big data.

Today’s statistical modelers and business analysts need high-performance, cloud-centric development platforms (often known as “sandboxes”) where they can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns.

Big data sandboxes are where you develop the all-important intellectual property, the advanced analytic models that extract intelligence from otherwise inchoate gobs of content. To be as productive as possible, teams of data scientists must have massively parallel cloud-computing resources (CPU, memory, storage, and I/O capacity) at their fingertips, available both within their sandboxing platforms and in the operational cloud environments to which they will deploy their models.

If you fail to provide them with the cloud-based scalability they need to run a growing range of jobs, you’ll be wasting their time as they queue up for access to limited processing and storage resources.
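To make the queuing cost concrete, here is a minimal back-of-the-envelope sketch. It is an illustration under stated assumptions, not a model of any real scheduler: all jobs are submitted at the same moment, they run first-come, first-served on identical compute slots, and the job lengths and slot counts are hypothetical numbers chosen for the example.

```python
import heapq


def total_wait_hours(job_hours, slots):
    """Total hours jobs spend waiting to start.

    Assumes every job is submitted at time 0 and runs first-come,
    first-served on `slots` identical compute slots (hypothetical setup).
    """
    free_at = [0.0] * slots          # time at which each slot next becomes free
    heapq.heapify(free_at)
    waited = 0.0
    for hours in job_hours:
        start = heapq.heappop(free_at)   # earliest available slot
        waited += start                  # job was submitted at time 0
        heapq.heappush(free_at, start + hours)
    return waited


if __name__ == "__main__":
    jobs = [2.0] * 8                 # eight hypothetical two-hour modeling jobs
    for slots in (1, 2, 4, 8):
        print(slots, "slots ->", total_wait_hours(jobs, slots), "hours waiting")
```

With these made-up numbers, a single shared slot costs the team 56 hours of cumulative waiting, while eight slots cost none. Real workloads are messier, but the shape of the trade-off is the same.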

Sandbox scalability is critical, but it’s more than just raw horsepower. Your sandboxing platform must also embed comprehensive, extensible libraries of reusable algorithms and models for advanced analytics. Today your data science requirements may revolve around traditional statistical analysis, data mining, and predictive modeling, and these libraries should be included in all of your sandboxing environments. But your data scientists will increasingly need to incorporate libraries of MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis, and other advanced analytic algorithms as well.

And don’t skimp on training and other skills-enhancement initiatives to ensure that you have sufficient numbers of the right kinds of data scientists for your big-data projects. Data science’s learning curve is formidable. Your organization may need to establish a data-science center of excellence and a structured training curriculum to ensure you have the right kinds of professionals who’ve mastered this demanding discipline.

Here, for your inspiration, are several IBM resources on the topic of data scientists in the business:

And here are several blogs that I authored examining various aspects of data scientists in modern business:

Last but not least, we will be holding a Twitter chat on “The Rise of the Data Scientist,” on May 9 from 4-5 p.m. ET. We invite you to join us on this chat, using hashtag #cloudchat. I will be one of the panelists, along with bit.ly’s chief scientist Hilary Mason and STORM Insights founder & CEO Adrian Bowles. More info on the chat can be found here.

We look forward to engaging you further on this exciting topic.

