Data Science

Data science combines the scientific method, math and statistics, specialized programming, advanced analytics, AI, and even storytelling to uncover and explain the business insights buried in data.

What is data science?

Data science is a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations. Data science encompasses preparing data for analysis and processing, performing advanced data analysis, and presenting the results to reveal patterns and enable stakeholders to draw informed conclusions.

Data preparation can involve cleansing, aggregating, and manipulating data so that it is ready for specific types of processing. Analysis requires the development and use of algorithms, analytics, and AI models. It is driven by software that combs through data to find patterns and then transforms those patterns into predictions that support business decision-making. The accuracy of these predictions must be validated through scientifically designed tests and experiments. The results should then be shared through the skillful use of data visualization tools that make it possible for anyone to see the patterns and understand trends.
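
One common way to validate predictions is to hold back part of the data, fit a model on the rest, and measure the error on the held-back portion. The sketch below illustrates that idea with a straight-line fit using only the Python standard library. The data set, the function names (fit_line, holdout_validate), and the 25% test fraction are assumptions made for illustration, not a prescribed method.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_validate(xs, ys, test_fraction=0.25, seed=0):
    """Fit on a training split and report mean absolute error on a test split."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Fit only on the training rows so the test rows stay unseen.
    slope, intercept = fit_line(
        [xs[i] for i in train_idx], [ys[i] for i in train_idx]
    )

    # Measure how far predictions fall from the held-out observations.
    errors = [abs(ys[i] - (slope * xs[i] + intercept)) for i in test_idx]
    return slope, intercept, statistics.fmean(errors)


# Hypothetical data: y = 3x + 5 plus a small amount of random noise.
rng = random.Random(42)
xs = [float(x) for x in range(40)]
ys = [3.0 * x + 5.0 + rng.uniform(-1.0, 1.0) for x in xs]

slope, intercept, mae = holdout_validate(xs, ys)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MAE={mae:.2f}")
```

A test error close to the noise level suggests the fitted line generalizes to unseen rows. A test error much larger than the training error would be a sign of overfitting or of a poorly chosen model.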

As a result, data scientists (as data science practitioners are called) require computer science and pure science skills beyond those of a typical data analyst. A data scientist must be able to do the following:

  • Apply mathematics, statistics, and the scientific method
  • Use a wide range of tools and techniques for evaluating and preparing data—everything from SQL to data mining to data integration methods
  • Extract insights from data using predictive analytics and artificial intelligence (AI), including machine learning and deep learning models
  • Write applications that automate data processing and calculations
  • Tell—and illustrate—stories that clearly convey the meaning of results to decision-makers and stakeholders at every level of technical knowledge and understanding
  • Explain how these results can be used to solve business problems

This combination of skills is rare, and it’s no surprise that data scientists are currently in high demand. According to an IBM survey, the number of job openings in the field continues to grow at over 5% per year, with over 60,000 openings forecast for 2020.

The data science lifecycle

The data science lifecycle—also called the data science pipeline—includes anywhere from five to sixteen (depending on whom you ask) overlapping, continuing processes. The processes common to just about everyone’s definition of the lifecycle include the following:

  • Capture: This is the gathering of raw structured and unstructured data from all relevant sources via just about any method—from manual entry and web scraping to capturing data from systems and devices in real time.
  • Prepare and maintain: This involves putting the raw data into a consistent format for analytics or machine learning or deep learning models. This can include everything from cleansing, deduplicating, and reformatting the data, to using ETL (extract, transform, load) or other data integration technologies to combine the data into a data warehouse, data lake, or other unified store for analysis.
  • Preprocess or process: Here, data scientists examine biases, patterns, ranges, and distributions of values within the data to determine the data’s suitability for use with predictive analytics, machine learning, and/or deep learning algorithms (or other analytical methods).
  • Analyze: This is where the discovery happens. Data scientists apply statistical analysis, predictive analytics, regression, machine learning and deep learning algorithms, and other methods to extract insights from the prepared data.
  • Communicate: Finally, the insights are presented as reports, charts, and other data visualizations that make the insights—and their impact on the business—easier for decision-makers to understand. A data science programming language such as R or Python (see below) includes components for generating visualizations; alternatively, data scientists can use dedicated visualization tools.
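
The five stages above can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library. The CSV content, the column names (store, ad_spend, sales), and the function names are hypothetical, and a real project would typically use dedicated data integration, modeling, and visualization tools at each stage.

```python
import csv
import io
import statistics

# Hypothetical raw export with a duplicate row, stray spaces, and a missing value.
RAW_CSV = """store,ad_spend,sales
north, 10, 120
south, 20, 210
south, 20, 210
east, 30,
west, 40, 405
central, 50, 498
"""


def capture(text):
    """Capture: read raw rows from a source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare(rows):
    """Prepare: trim whitespace, drop incomplete rows, deduplicate, convert types."""
    seen, clean = set(), []
    for row in rows:
        values = {key: (value or "").strip() for key, value in row.items()}
        if not all(values.values()):
            continue  # skip rows with missing fields
        record = (values["store"], float(values["ad_spend"]), float(values["sales"]))
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean


def process(records):
    """Process: summarize ranges and averages to judge suitability for modeling."""
    sales = [record[2] for record in records]
    return {
        "rows": len(records),
        "min_sales": min(sales),
        "max_sales": max(sales),
        "mean_sales": statistics.fmean(sales),
    }


def analyze(records):
    """Analyze: least-squares regression of sales on ad spend."""
    xs = [record[1] for record in records]
    ys = [record[2] for record in records]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def communicate(summary, slope):
    """Communicate: a plain-text report standing in for charts and dashboards."""
    return (
        f"{summary['rows']} clean rows; sales range from "
        f"{summary['min_sales']:.0f} to {summary['max_sales']:.0f}. "
        f"Each extra unit of ad spend is associated with about {slope:.1f} more sales."
    )


records = prepare(capture(RAW_CSV))
summary = process(records)
slope, intercept = analyze(records)
report = communicate(summary, slope)
print(report)
```

Keeping each stage as its own function mirrors the overlapping, repeatable nature of the lifecycle: any stage can be rerun or replaced as new data arrives without rewriting the others.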

Data science tools

Data scientists must be able to build and run code in order to create models. The most popular programming languages among data scientists are open source tools that include or support pre-built statistical, machine learning and graphics capabilities. These languages include:

  • R: An open source programming language and environment for statistical computing and graphics, R is one of the most popular programming languages among data scientists. R provides a broad variety of libraries and tools for cleansing and prepping data, creating visualizations, and training and evaluating machine learning and deep learning algorithms. It’s also widely used among data science scholars and researchers.
  • Python: Python is a general-purpose, object-oriented, high-level programming language that emphasizes code readability, notably through its use of significant indentation. Several Python libraries support data science tasks, including NumPy for handling large multidimensional arrays, pandas for data manipulation and analysis, and Matplotlib for building data visualizations.

For a deep dive into the differences between these languages, check out "Python vs. R: What's the Difference?"

Data scientists need to be proficient in the use of big data processing platforms, such as Apache Spark and Apache Hadoop. They also need to be skilled with a wide range of data visualization tools, including the simple graphics tools included with business presentation and spreadsheet applications, built-for-purpose commercial visualization tools like Tableau and Microsoft Power BI, and open source tools like D3.js (a JavaScript library for creating interactive data visualizations) and RAWGraphs.

Data science and cloud computing

Cloud computing is bringing many data science benefits within reach of even small and midsized organizations.

Data science’s foundation is the manipulation and analysis of extremely large data sets; the cloud provides access to storage infrastructures capable of handling large amounts of data with ease. Data science also involves running machine learning algorithms that demand massive processing power; the cloud makes available the high-performance compute that’s necessary for the task. To purchase equivalent on-site hardware would be far too expensive for many enterprises and research teams, but the cloud makes access affordable with per-use or subscription-based pricing.

Cloud infrastructures can be accessed from anywhere in the world, making it possible for multiple groups of data scientists to share access to the data sets they’re working with in the cloud—even if they’re located in different countries.

Open source technologies are widely used in data science tool sets. When they’re hosted in the cloud, teams don’t need to install, configure, maintain, or update them locally. Several cloud providers also offer prepackaged tool kits that enable data scientists to build models without coding, further democratizing access to the innovations and insights that this discipline is making available.

Data science use cases

There’s no limit to the number or kind of enterprises that could potentially benefit from the opportunities data science is creating. Nearly any business process can be made more efficient through data-driven optimization, and nearly every type of customer experience (CX) can be improved with better targeting and personalization.

Here are a few representative use cases for data science and AI:

  • An international bank created a mobile app offering on-the-spot decisions to loan applicants using machine learning-powered credit risk models and a hybrid cloud computing architecture that is both powerful and secure.
  • An electronics firm is developing ultra-powerful 3D-printed sensors that will guide tomorrow’s driverless vehicles. The solution relies on data science and analytics tools to enhance its real-time object detection capabilities.
  • A robotic process automation (RPA) solution provider developed a cognitive business process mining solution that reduces incident handling times by between 15% and 95% for its client companies. The solution is trained to understand the content and sentiment of customer emails, directing service teams to prioritize those that are most relevant and urgent.
  • A digital media technology company created an audience analytics platform that enables its clients to see what’s engaging TV audiences as they’re offered a growing range of digital channels. The solution employs deep analytics and machine learning to gather real-time insights into viewer behavior.
  • An urban police department created statistical incident analysis tools to help officers understand when and where to deploy resources in order to prevent crime. The data-driven solution creates reports and dashboards to augment situational awareness for field officers.
  • A smart healthcare company developed a solution enabling seniors to live independently for longer. Combining sensors, machine learning, analytics, and cloud-based processing, the system monitors for unusual behavior and alerts relatives and caregivers, while conforming to the strict security standards that are mandatory in the healthcare industry.

Data science and IBM Cloud®

IBM Cloud offers a highly secure public cloud infrastructure with a full-stack platform that includes more than 170 products and services, many of which were designed to support data science and AI.

IBM’s data science and AI lifecycle product portfolio is built upon our longstanding commitment to open source technologies and includes a range of capabilities that enable enterprises to unlock the value of their data in new ways.

AutoAI, a powerful new automated development capability in IBM Watson® Studio, speeds the data preparation, model development, and feature engineering stages of the data science lifecycle. This allows data scientists to be more efficient and helps them make better-informed decisions about which models will perform best for real-world use cases. AutoAI simplifies enterprise data science across any cloud environment.

The IBM Cloud Pak® for Data platform provides a fully integrated and extensible data and information architecture built on the Red Hat OpenShift Container Platform that runs on any cloud. With IBM Cloud Pak for Data, enterprises can more easily collect, organize and analyze data, making it possible to infuse insights from AI throughout the entire organization.

Want to learn more about building and running data science models on IBM Cloud? Get started at no charge by signing up for an IBM Cloud® account today.


Autostrade per l’Italia

Autostrade per l’Italia implemented several IBM solutions for a complete digital transformation to improve how it monitors and maintains its vast array of infrastructure assets.


MANA Community

MANA Community teamed with IBM Garage to build an AI platform to mine huge volumes of environmental data from multiple digital channels and thousands of sources.