By: IBM Cloud Education

Data science combines the scientific method, math and statistics, specialized programming, advanced analytics, AI, and even storytelling to uncover and explain the business insights buried in data.

What is data science?

Data science is a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations. Data science encompasses preparing data for analysis and processing, performing advanced data analysis, and presenting the results to reveal patterns and enable stakeholders to draw informed conclusions.

Data preparation can involve cleansing, aggregating, and manipulating data so that it is ready for specific types of processing. Analysis requires the development and use of algorithms, analytics, and AI models. It’s driven by software that combs through data to find patterns and then transforms those patterns into predictions that support business decision-making. The accuracy of these predictions must be validated through scientifically designed tests and experiments. And the results should be shared through the skillful use of data visualization tools that make it possible for anyone to see the patterns and understand trends.
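
As a rough illustration of the cleansing and aggregating steps described above, here is a minimal Python sketch that uses only the standard library. The records, the field names (`order_id`, `region`, `amount`), and the helper functions are hypothetical and chosen only for the example; real preparation work depends heavily on the data at hand.

```python
from collections import defaultdict

# Hypothetical raw records with typical problems: inconsistent casing,
# stray whitespace, a duplicated order, and a missing amount.
raw_rows = [
    {"order_id": "1", "region": " North ", "amount": "120.50"},
    {"order_id": "2", "region": "south", "amount": "80"},
    {"order_id": "2", "region": "south", "amount": "80"},   # duplicate order
    {"order_id": "3", "region": "NORTH", "amount": ""},      # missing amount
    {"order_id": "4", "region": "South", "amount": "19.50"},
]


def cleanse(rows):
    """Drop rows with no amount, skip repeated order IDs, normalize fields."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount or row["order_id"] in seen_ids:
            continue
        seen_ids.add(row["order_id"])
        cleaned.append({
            "order_id": row["order_id"],
            "region": row["region"].strip().lower(),
            "amount": float(amount),
        })
    return cleaned


def aggregate_by_region(rows):
    """Sum the amount of the cleaned rows for each region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


clean_rows = cleanse(raw_rows)
totals = aggregate_by_region(clean_rows)
print(totals)
```

Dropping rows with missing values is only one possible policy; depending on the analysis, imputing a value or flagging the row may be more appropriate.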

As a result, data scientists (as data science practitioners are called) require computer science and pure science skills beyond those of a typical data analyst. A data scientist must be able to do the following:

  • Apply mathematics, statistics, and the scientific method
  • Use a wide range of tools and techniques for evaluating and preparing data—everything from SQL to data mining to data integration methods
  • Extract insights from data using predictive analytics and artificial intelligence (AI), including machine learning and deep learning models
  • Write applications that automate data processing and calculations
  • Tell—and illustrate—stories that clearly convey the meaning of results to decision-makers and stakeholders at every level of technical knowledge and understanding
  • Explain how these results can be used to solve business problems

This combination of skills is rare, and it’s no surprise that data scientists are currently in high demand. According to an IBM survey, the number of job openings in the field continues to grow at over 5% per year, with over 60,000 forecast for 2020.

The data science lifecycle

The data science lifecycle—also called the data science pipeline—includes anywhere from five to sixteen (depending on whom you ask) overlapping, continuing processes. The processes common to just about everyone’s definition of the lifecycle include the following:

  • Capture: This is the gathering of raw structured and unstructured data from all relevant sources via just about any method—from manual entry and web scraping to capturing data from systems and devices in real time.
  • Prepare and maintain: This involves putting the raw data into a consistent format for analytics or machine learning or deep learning models. This can include everything from cleansing, deduplicating, and reformatting the data, to using ETL (extract, transform, load) or other data integration technologies to combine the data into a data warehouse, data lake, or other unified store for analysis.
  • Preprocess or process: Here, data scientists examine biases, patterns, ranges, and distributions of values within the data to determine the data’s suitability for use with predictive analytics, machine learning, and/or deep learning algorithms (or other analytical methods).
  • Analyze: This is where the discovery happens—where data scientists perform statistical analysis, predictive analytics, regression, machine learning and deep learning algorithms, and more to extract insights from the prepared data.
  • Communicate: Finally, the insights are presented as reports, charts, and other data visualizations that make the insights—and their impact on the business—easier for decision-makers to understand. A data science programming language such as R or Python (see below) includes components for generating visualizations; alternatively, data scientists can use dedicated visualization tools.
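
The five stages above can be sketched as a small end-to-end script. This is a toy illustration under stated assumptions, not a reference pipeline: the data points, the function names, and the choice of a least-squares line as the "analysis" are all hypothetical, and only the Python standard library is used.

```python
import statistics


def capture():
    """Capture: stand-in for reading files, APIs, or device feeds."""
    # Hypothetical (x, y) observations; None marks a missing reading.
    return [(1, 3.1), (2, 4.9), (2, 4.9), (3, None), (4, 9.2),
            (5, 10.8), (6, 13.1), (7, 15.0), (8, 16.9)]


def prepare(records):
    """Prepare: drop incomplete rows and exact duplicates, keeping order."""
    seen, out = set(), []
    for rec in records:
        if None in rec or rec in seen:
            continue
        seen.add(rec)
        out.append(rec)
    return out


def preprocess(records):
    """Preprocess: summarize range and mean to judge suitability."""
    ys = [y for _, y in records]
    return {"n": len(ys), "min": min(ys), "max": max(ys),
            "mean": statistics.fmean(ys)}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def analyze(records, holdout=2):
    """Analyze: fit on the first rows, measure error on the held-out rows."""
    train, test = records[:-holdout], records[-holdout:]
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    errors = [abs(slope * x + intercept - y) for x, y in test]
    return slope, intercept, statistics.fmean(errors)


def communicate(summary, slope, intercept, mae):
    """Communicate: a plain-text report; a chart would serve the same role."""
    return (f"rows={summary['n']} mean_y={summary['mean']:.2f} "
            f"slope={slope:.2f} intercept={intercept:.2f} holdout_mae={mae:.2f}")


records = capture()
prepared = prepare(records)
summary = preprocess(prepared)
slope, intercept, mae = analyze(prepared)
report = communicate(summary, slope, intercept, mae)
print(report)
```

Holding back the last two rows gives a simple check of the fitted line against data it has not seen, which is the same idea, on a very small scale, as validating predictions before relying on them.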

Data science tools

Data scientists must be able to build and run code in order to create models. The most popular programming languages among data scientists are open source tools that include or support pre-built statistical, machine learning and graphics capabilities. These languages include:

  • R: An open source programming language and environment for statistical computing and graphics, R is one of the most popular programming languages among data scientists. R provides a broad variety of libraries and tools for cleansing and prepping data, creating visualizations, and training and evaluating machine learning and deep learning algorithms. It’s also widely used among data science scholars and researchers.
  • Python: Python is a general-purpose, object-oriented, high-level programming language that emphasizes code readability, notably through its use of significant whitespace (indentation) to delimit code blocks. Several Python libraries support data science tasks, including NumPy for handling large multidimensional arrays, pandas for data manipulation and analysis, and Matplotlib for building data visualizations.

Data scientists need to be proficient in the use of big data processing platforms, such as Apache Spark and Apache Hadoop. They also need to be skilled with a wide range of data visualization tools, including the simple graphics tools included with business presentation and spreadsheet applications, built-for-purpose commercial visualization tools like Tableau and Microsoft Power BI, and open source tools like D3.js (a JavaScript library for creating interactive data visualizations) and RAWGraphs.

Data science and cloud computing

Cloud computing is bringing many data science benefits within reach of even small and midsized organizations.

Data science’s foundation is the manipulation and analysis of extremely large data sets; the cloud provides access to storage infrastructures capable of handling large amounts of data with ease. Data science also involves running machine learning algorithms that demand massive processing power; the cloud makes available the high-performance compute that’s necessary for the task. To purchase equivalent on-site hardware would be far too expensive for many enterprises and research teams, but the cloud makes access affordable with per-use or subscription-based pricing.
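
The trade-off between buying hardware and paying per use comes down to simple break-even arithmetic, sketched below in Python. The purchase price and hourly rate are hypothetical placeholders, not real quotes from any provider, and the sketch deliberately ignores on-site running costs such as power, space, and staff.

```python
def break_even_hours(hardware_cost, hourly_cloud_rate):
    """Hours of cloud use at which cumulative rental equals the purchase price."""
    if hourly_cloud_rate <= 0:
        raise ValueError("hourly_cloud_rate must be positive")
    return hardware_cost / hourly_cloud_rate


def cheaper_option(hardware_cost, hourly_cloud_rate, expected_hours):
    """Compare pay-per-use cost with the up-front purchase for a given workload."""
    cloud_cost = hourly_cloud_rate * expected_hours
    return "cloud" if cloud_cost < hardware_cost else "on-site"


# Hypothetical figures, for illustration only.
HARDWARE_COST = 50_000.0   # up-front price of an on-site compute server
HOURLY_RATE = 4.0          # pay-per-use price of a comparable cloud instance

hours = break_even_hours(HARDWARE_COST, HOURLY_RATE)
print(f"Break-even at {hours:.0f} compute hours")
print(cheaper_option(HARDWARE_COST, HOURLY_RATE, expected_hours=2_000))
```

Under these assumed numbers, a team that needs only a few thousand compute hours pays far less by renting, while a workload that runs continuously for years may favor owning the hardware.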

Cloud infrastructures can be accessed from anywhere in the world, making it possible for multiple groups of data scientists to share access to the data sets they’re working with in the cloud—even if they’re located in different countries.

Open source technologies are widely used in data science tool sets. When they’re hosted in the cloud, teams don’t need to install, configure, maintain, or update them locally. Several cloud providers also offer prepackaged tool kits that enable data scientists to build models without coding, further democratizing access to the innovations and insights that this discipline is making available.

Data science use cases

There’s no limit to the number or kind of enterprises that could potentially benefit from the opportunities data science is creating. Nearly any business process can be made more efficient through data-driven optimization, and nearly every type of customer experience (CX) can be improved with better targeting and personalization.

Here are a few representative use cases for data science and AI:

Data science and IBM Cloud

IBM Cloud offers a highly secure public cloud infrastructure with a full-stack platform that includes more than 170 products and services, many of which were designed to support data science and AI.

IBM’s data science and AI lifecycle product portfolio is built upon our longstanding commitment to open source technologies and includes a range of capabilities that enable enterprises to unlock the value of their data in new ways.

AutoAI, a powerful new automated development capability in IBM Watson Studio, speeds the data preparation, model development, and feature engineering stages of the data science lifecycle. This allows data scientists to be more efficient and helps them make better-informed decisions about which models will perform best for real-world use cases. AutoAI simplifies enterprise data science across any cloud environment.

IBM Cloud Pak for Data provides a fully integrated and extensible data and information architecture built on the Red Hat OpenShift Container Platform that runs on any cloud. With IBM Cloud Pak for Data, enterprises can more easily collect, organize, and analyze data, making it possible to infuse insights from AI throughout the entire organization.

Want to learn more about building and running data science models on IBM Cloud? Get started for free by signing up for an IBM Cloud account today.
