What is data science?

Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data. These insights can be used to guide decision making and strategic planning.

The accelerating growth of data sources, and of the data they generate, has made data science one of the fastest-growing fields across all industries. It is no surprise, then, that Harvard Business Review dubbed the role of data scientist the “sexiest job of the 21st century.” Organizations increasingly rely on data scientists to interpret data and provide actionable recommendations to improve business outcomes.

The data science lifecycle involves various roles, tools and processes, which enable analysts to glean actionable insights. Typically, a data science project undergoes the following stages:

Data ingestion: The lifecycle begins with data collection, in which raw structured and unstructured data is gathered from all relevant sources through multiple methods. These methods can include manual entry, web scraping and real-time streaming from systems and devices. Data sources can include structured data, such as customer data, along with unstructured data like log files, video, audio, pictures, the Internet of Things (IoT), social media and more.

Data storage and data processing: Because data can have different formats and structures, companies need to consider different storage systems based on the type of data that needs to be captured. Data management teams help to set standards around data storage and structure, which facilitate workflows around analytics, machine learning and deep learning models. This stage includes cleaning data, deduplicating, transforming and combining the data through ETL (extract, transform, load) jobs or other data integration technologies. This data preparation is essential for promoting data quality before loading into a data warehouse, data lake or other repository.
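The cleaning, deduplicating, transforming and loading steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production pipeline: the sample records, the field names and the `customers` table are hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import sqlite3

# Hypothetical raw customer records, as they might arrive from several sources.
RAW_RECORDS = [
    {"id": "1", "name": "  Ada Lovelace ", "email": "ADA@EXAMPLE.COM", "spend": "120.50"},
    {"id": "2", "name": "Alan Turing", "email": "alan@example.com", "spend": "80"},
    {"id": "1", "name": "Ada Lovelace", "email": "ada@example.com", "spend": "120.50"},  # duplicate id
    {"id": "3", "name": "Grace Hopper", "email": "", "spend": "n/a"},  # missing values
]


def clean(record):
    """Transform one record: trim whitespace, lowercase email, parse numbers."""
    try:
        spend = float(record["spend"])
    except ValueError:
        spend = None  # unparseable values become NULL rather than guesses
    return {
        "id": int(record["id"]),
        "name": record["name"].strip(),
        "email": record["email"].strip().lower() or None,
        "spend": spend,
    }


def deduplicate(records):
    """Keep the first record seen for each id."""
    seen = {}
    for rec in records:
        seen.setdefault(rec["id"], rec)
    return list(seen.values())


def load(records, conn):
    """Load prepared rows into a table that stands in for a warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, email TEXT, spend REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (:id, :name, :email, :spend)", records
    )
    conn.commit()


def run_etl(raw, conn):
    """Clean, deduplicate, then load; returns the rows that were loaded."""
    unique = deduplicate([clean(r) for r in raw])
    load(unique, conn)
    return unique


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    loaded = run_etl(RAW_RECORDS, connection)
    print(len(loaded), "rows loaded")
```

Cleaning before deduplicating matters here: the two records for customer 1 only look identical once whitespace and letter case have been normalized.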

Data analysis: Here, data scientists conduct an exploratory data analysis to examine biases, patterns, ranges and distributions of values within the data. This exploration drives hypothesis generation for A/B testing. It also allows analysts to determine the data’s relevance for use within modeling efforts for predictive analytics, machine learning or deep learning. Depending on a model’s accuracy, organizations can become reliant on these insights for business decision making, allowing them to drive greater scalability.
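As a rough illustration of what examining ranges and distributions can look like in practice, the sketch below summarizes a small, hypothetical list of order values using Python's standard `statistics` module. The data and the function names are assumptions made for the example, and the 1.5 × IQR threshold is a common rule of thumb rather than a fixed standard.

```python
import statistics

# Hypothetical order values; the last one is deliberately extreme.
ORDER_VALUES = [12.0, 15.5, 14.0, 13.5, 16.0, 15.0, 14.5, 13.0, 15.5, 95.0]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


def flag_outliers(values):
    """Flag values more than 1.5 * IQR below the first or above the third quartile."""
    s = summarize(values)
    low = s["q1"] - 1.5 * s["iqr"]
    high = s["q3"] + 1.5 * s["iqr"]
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    for key, value in summarize(ORDER_VALUES).items():
        print(f"{key}: {value}")
    print("possible outliers:", flag_outliers(ORDER_VALUES))
```

In this sample the mean sits well above the median because a single extreme value pulls it upward. Spotting that kind of skew early helps an analyst decide whether a value is an error, a rare but real event, or a pattern worth testing.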

Communicate: Finally, insights are presented as reports and other data visualizations that make the insights and their impact on business easier for business analysts and other decision-makers to understand. A data science programming language such as R or Python includes components for generating visualizations, while data scientists can also use dedicated visualization tools.

What data scientists do

Data scientists are experts at extracting industry-specific insights and answers from data. They possess computer science and pure science skills beyond the skills of a typical business analyst or data analyst. They also have a deep understanding of the industry or business discipline in which they work, such as automobile manufacturing, e-commerce or healthcare.

A data scientist must be able to:

Know enough about the business to ask pertinent questions and identify business pain points.
Apply statistics and computer science, along with business acumen, to data analysis.
Use a wide range of tools and techniques for preparing and extracting data, everything from databases and SQL to data mining to data integration methods.
Extract insights from big data through predictive analytics and artificial intelligence (AI), including machine learning models, natural language processing and deep learning.
Write programs and algorithms that automate data processing and calculations.
Tell and illustrate stories that clearly convey the meaning of results to decision-makers and stakeholders at every level of technical understanding.
Explain how the results can be used to solve business problems.
Collaborate with other data science team members, such as data and business analysts, IT architects, data engineers and application developers.

These skills are in high demand. As a result, many individuals breaking into a data science career explore a range of data science programs, such as certification programs, data science courses and degree programs offered by educational institutions.

Data scientists are not necessarily directly responsible for all the processes involved in the data science lifecycle. Data engineers typically handle data pipelines, but data scientists can recommend what types of data are useful or required.

While data scientists can build machine learning models, scaling these efforts at a larger level requires more software engineering skills to optimize a program to run more quickly. As a result, it’s common for a data scientist to partner with machine learning engineers to scale machine learning models.

A data scientist’s responsibilities commonly overlap with those of a data analyst, particularly in exploratory data analysis and data visualization. However, a data scientist’s skillset is typically broader than that of the average data analyst. Comparatively speaking, data scientists use common programming languages, such as R and Python, to conduct more statistical inference and data visualization.

Data science versus business intelligence

It is often easy to confuse the terms “data science” and “business intelligence” (BI) because they both relate to an organization’s data and the analysis of that data. However, they differ in focus.

Business intelligence (BI) is typically an umbrella term for the technology that enables data preparation, data mining, data management and data visualization. Business intelligence tools and processes allow end users to identify actionable information from raw data, facilitating data-driven decision-making within organizations across various industries.

While data science tools overlap in much of this regard, business intelligence focuses more on data from the past and the insights from BI tools are more descriptive in nature. It uses data to understand what happened before to inform a course of action. BI is geared toward static (unchanging) data that is structured.

While data science uses descriptive data, it typically uses it to determine predictive variables, which are then used to categorize data or to make forecasts. Data science and BI are not mutually exclusive; digitally savvy organizations use both to fully understand and extract value from their data.
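The difference between a descriptive summary and a predictive use of the same data can be shown with a small sketch. The figures below are hypothetical, and a one-variable least-squares line is used only because it is the simplest possible forecasting model; real projects typically involve more variables and more careful validation.

```python
# Hypothetical monthly figures: advertising spend (x) and sales (y).
AD_SPEND = [10.0, 20.0, 30.0, 40.0, 50.0]
SALES = [25.0, 45.0, 65.0, 85.0, 105.0]


def describe(values):
    """Descriptive view: summarize what has already happened."""
    return {"total": sum(values), "average": sum(values) / len(values)}


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(x_new, slope, intercept):
    """Predictive view: estimate an outcome that has not been observed yet."""
    return slope * x_new + intercept


if __name__ == "__main__":
    print("past sales:", describe(SALES))
    slope, intercept = fit_line(AD_SPEND, SALES)
    print("forecast sales at spend 60:", forecast(60.0, slope, intercept))
```

Here `describe` answers a BI-style question about past sales, while `fit_line` treats advertising spend as a predictive variable and `forecast` extrapolates to a spend level that does not appear in the data.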

Data science tools

Data scientists rely on widely used programming languages to conduct exploratory data analysis and statistical regression. These open source languages offer pre-built statistical modeling, machine learning and graphics capabilities. They include the following (read more at “Python versus R: What’s the difference?”):

R: An open source programming language and environment for statistical computing and graphics, often used through the RStudio development environment.
Python: A dynamic and flexible programming language. Python includes numerous libraries for analyzing data quickly, such as NumPy, pandas and Matplotlib.

To facilitate sharing code and other information, data scientists can use GitHub and Jupyter Notebooks.

Some data scientists prefer a user interface. Two common enterprise tools for statistical analysis are:

SAS: A comprehensive tool suite, including visualizations and interactive dashboards, for analyzing, reporting, data mining and predictive modeling.
IBM SPSS: Offers advanced statistical analysis, a large library of machine learning algorithms, text analysis, open source extensibility, integration with big data and seamless deployment into applications.

Data scientists also gain proficiency in using big data processing platforms, such as Apache Spark, the open source framework Apache Hadoop and NoSQL databases. They are also skilled with a wide range of data visualization tools, including simple graphics tools included with business presentation and spreadsheet applications (such as Microsoft Excel). They also use built‑for‑purpose commercial visualization tools such as Tableau and IBM Cognos® and open source tools such as D3.js and RAWGraphs.

For building machine learning models, data scientists frequently turn to frameworks such as PyTorch, TensorFlow, MXNet and Spark MLlib.

Given the steep learning curve in data science, many companies are seeking to accelerate their return on investment for AI projects. They often struggle to hire the talent needed to realize a data science project’s full potential. To address this gap, they are turning to multipersona data science and machine learning (DSML) platforms, leading to the role of “citizen data scientist.”

Multipersona DSML platforms use automation, self-service portals and low-code/no-code user interfaces so that people with little or no background in digital technology or expert data science can participate in data-driven work and create business value through data science and machine learning. These platforms also support expert data scientists by offering a more technical interface. Using a multipersona DSML platform encourages collaboration across the enterprise.

Data science and cloud computing

Cloud computing scales data science by providing access to extra processing power, storage and other tools required for data science projects.

Because data science frequently uses large datasets, tools that can scale with the size of the data are especially important, particularly for time-sensitive projects. Cloud storage solutions, such as data lakes, provide access to storage infrastructure that can ingest and process large volumes of data with ease.

These storage systems provide flexibility to end users, allowing them to spin up large clusters as needed. They can also add incremental compute nodes to expedite data processing jobs, allowing the business to make short-term tradeoffs for a larger long-term outcome. Cloud platforms typically offer different pricing models, such as per-use or subscription pricing, to meet the needs of their end users, whether they are large enterprises or small startups.

Open source technologies are widely used in data science tool sets. When they’re hosted in the cloud, teams don’t need to install, configure, maintain or update them locally. Several cloud providers, including IBM Cloud®, also offer prepackaged toolkits that enable data scientists to build models without coding, further democratizing access to technology innovations and data insights.

Data science use cases

Enterprises can unlock numerous benefits from data science. Common use cases include process optimization through intelligent automation and enhanced targeting and personalization to improve the customer experience (CX). More specific examples of data science and artificial intelligence in practice include:

An international bank delivers faster loan services with a mobile app through machine learning-powered credit risk models and a hybrid cloud computing architecture that is both powerful and secure.
An electronics firm is developing ultra-powerful 3D-printed sensors to guide tomorrow’s driverless vehicles. The solution relies on data science and analytics tools to enhance its real-time object detection capabilities.
A robotic process automation (RPA) solution provider developed a cognitive business process mining solution that reduces incident handling times by between 15% and 95% for its client companies. The solution is trained to understand the content and sentiment of customer emails, directing service teams to prioritize the emails that are most relevant and urgent.
A digital media technology company created an audience analytics platform that enables its clients to see what’s engaging TV audiences as they are presented with a growing range of digital channels. The solution employs deep analytics and machine learning to gather real-time insights into viewer behavior.
An urban police department created statistical incident analysis tools to help officers understand when and where to deploy resources to prevent crime. The data-driven solution creates reports and dashboards to augment situational awareness for field officers.
Shanghai Changjiang Science and Technology Development used IBM Watson® technology to build an AI-based medical assessment platform. The platform can analyze existing medical records to categorize patients based on their risk of experiencing a stroke and predict the success rate of different treatment plans.
