May 7, 2013 | Written by: James Kobielus
Big data is not just about scaling your data analytics processing platforms to keep up with the onslaught of new information. Just as important, big data is about bringing together your best and brightest minds–your data scientists–and giving them the tools they need to interactively and collaboratively explore rich information sets.
Data scientist productivity is a critical concern, especially when you’re talking about high-priced talent in short supply. If you don’t provide your data scientists with scalable modeling platforms, you won’t realize the full value of your investment in big data.
Today’s statistical modelers and business analysts need high-performance cloud-centric development platforms–often known as “sandboxes”–where they can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns.
Big data sandboxes are where you develop the all-important intellectual property–advanced analytic models–that extracts intelligence from otherwise inchoate gobs of content. To be as productive as possible, teams of data scientists must have massively parallel cloud-computing resources–including CPU, memory, storage, and I/O capacity–at their fingertips, available within their sandboxing platforms and in the operational cloud environments to which they will deploy their models.
If you fail to provide them with the cloud-based scalability they need to run a growing range of jobs, you’ll be wasting their time as they queue up for access to limited processing and storage resources.
Sandbox scalability is critical, but it’s more than just raw horsepower. Your sandboxing platform must also embed comprehensive, extensible libraries of reusable algorithms and models for advanced analytics. Today your data science requirements may revolve around traditional statistical analysis, data mining, and predictive modeling, and libraries for these techniques should be included in all of your sandboxing environments. But your data scientists will increasingly need to incorporate libraries of MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis, and other advanced analytic algorithms as well.
And don’t skimp on training and other skills-enhancement initiatives to ensure that you have sufficient numbers of the right kinds of data scientists for your big-data projects. Data science’s learning curve is formidable. Your organization may need to establish a data-science center of excellence and a structured training curriculum to ensure you have the right kinds of professionals who’ve mastered this demanding discipline.
Here, for your inspiration, are several IBM resources on the topic of data scientists in the business:
And here are several blogs that I authored examining various aspects of data scientists in modern business:
Last but not least, we will be holding a Twitter chat on “The Rise of the Data Scientist,” on May 9 from 4-5 p.m. ET. We invite you to join us on this chat, using hashtag #cloudchat. I will be one of the panelists, along with bit.ly’s chief scientist Hilary Mason and STORM Insights founder & CEO Adrian Bowles. More info on the chat can be found here.
We look forward to engaging you further on this exciting topic.