Data is a collection of facts, numbers, words, observations or other useful information. Through data processing and data analysis, organizations transform raw data points into valuable insights that improve decision-making and drive better business outcomes.
Organizations collect data from various sources and in various formats, including non-numerical qualitative data (such as customer reviews) and numerical quantitative data (such as sales figures). Other examples of data include public data, such as government statistics and census records, and private data, such as customer purchase histories or a person’s healthcare records.
Over the past decade, big data—large, complex data sets from sources such as social media, e-commerce and financial transactions—has driven digital transformation across industries. In fact, big data has earned the nickname “the new oil” due to its value as a driver of business growth and innovation.
In recent years, the rise of artificial intelligence (AI) has further increased the focus on data. Organizations need data to train machine learning (ML) models and refine predictive algorithms. The more high-quality data these AI systems analyze, the more accurate and effective they become.
As data’s volume, complexity and importance grow, organizations need effective data management processes to keep information organized and accessible for data analysis.
At the same time, mounting concerns around data security and privacy—from both users and regulators—have placed growing emphasis on data protection and compliance with laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).
Data comes in many different forms, each defined by its unique characteristics, sources and formats. Understanding these distinctions can allow for more effective organization and data analysis, as different types of data support different use cases.
Furthermore, a single data point or data set can fall under multiple categories. For example, a sales figure is both structured and quantitative, while a free-text product review is both unstructured and qualitative.
Some of the most common types of data include:
Quantitative data
Qualitative data
Structured data
Unstructured data
Semi-structured data
Metadata
Big data
Quantitative data consists of values that can be measured numerically. Examples of quantitative data include discrete data points (such as the number of products sold) and continuous data points (such as temperature or revenue figures).
Quantitative data is often structured, making it easy to analyze using mathematical tools and algorithms.
Common use cases of quantitative data include trend forecasting, statistical analysis, budgeting, pattern identification and performance measurement.
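As a minimal illustration with made-up figures, the Python sketch below computes simple summary statistics for a discrete measure (units sold) and a continuous measure (daily revenue); all values and variable names are assumptions for demonstration.

```python
from statistics import mean

# Hypothetical quantitative data: discrete counts and continuous values
units_sold_per_day = [12, 9, 15, 11, 14]                  # discrete
daily_revenue = [240.50, 198.75, 310.00, 225.25, 289.10]  # continuous

print("Total units sold:", sum(units_sold_per_day))
print("Average daily revenue:", round(mean(daily_revenue), 2))
```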
Qualitative data is descriptive and non-numerical, capturing characteristics, concepts or experiences that numbers cannot measure. Examples include customer feedback, product reviews and social media comments.
Qualitative data can be structured (such as coded survey responses) or unstructured (such as free-text responses or interview transcripts).
Common use cases for qualitative data include understanding customer behavior, market trends and user experiences.
Structured data is organized in a clear, defined format, often stored in relational databases or spreadsheets. It can consist of both quantitative data (such as sales figures) and qualitative data (such as categorical labels like “yes” or “no”).
Examples of structured data include customer records and financial reports, where data fits neatly into rows and columns with predefined fields.
The highly organized nature of structured data allows for quick querying and data analysis, making it useful for business intelligence systems and reporting processes.
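To illustrate that point, the sketch below uses Python's built-in sqlite3 module to query a small, hypothetical customers table; the table name, columns and figures are assumptions for demonstration only.

```python
import sqlite3

# Create an in-memory relational database with a hypothetical customers table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT, total_spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Ada", "EMEA", 1200.0), (2, "Bo", "APAC", 850.5), (3, "Cy", "EMEA", 430.0)],
)

# Structured rows and columns make aggregation straightforward
for region, spend in conn.execute(
    "SELECT region, SUM(total_spend) FROM customers GROUP BY region"
):
    print(region, spend)
conn.close()
```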
Unstructured data lacks a strictly defined format. It often comes in complex forms such as text documents, images and videos. Unstructured data can include both qualitative information (such as customer comments) and quantitative elements (such as numerical values embedded in text).
Examples of unstructured data include emails, social media content and multimedia files.
Unstructured data doesn’t easily fit into traditional relational databases, and organizations often use techniques such as natural language processing (NLP) and machine learning to streamline analysis of unstructured data.
Unstructured data often plays a key role in sentiment analysis, complex pattern recognition and other advanced analytics projects.
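As a rough sketch of the idea rather than a real NLP pipeline, the toy example below scores free-text comments against small, hypothetical word lists; production systems would rely on trained NLP or machine learning models instead.

```python
# Toy sentiment scoring over unstructured text; word lists are illustrative only
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "bad", "refund"}

def score(comment: str) -> int:
    words = [w.strip(".,!?").lower() for w in comment.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

comments = ["Love the fast shipping", "Arrived broken, want a refund"]
for c in comments:
    print(score(c), c)
```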
Semi-structured data blends elements of structured and unstructured data. It doesn't follow a rigid format but can include tags or markers that make it easier to organize and analyze. Examples of semi-structured data include XML files and JSON objects.
Semi-structured data is widely used in scenarios such as web scraping and data integration projects because it offers flexibility while retaining some structure for search and analysis.
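For instance, the snippet below parses a small, hypothetical JSON object with Python's standard json module; the keys act as the tags that let specific fields be queried even though the overall schema stays flexible.

```python
import json

# Hypothetical semi-structured record: tagged fields, flexible schema
raw = '{"order_id": 1001, "customer": "Ada", "items": [{"sku": "A1", "qty": 2}], "note": "gift wrap"}'

order = json.loads(raw)
print(order["customer"])                             # access a tagged field
print(sum(item["qty"] for item in order["items"]))   # aggregate nested data
```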
Metadata is data about data. In other words, it is information about the attributes of a data point or data set, such as file names, authors, creation dates or data types.
Metadata enhances data organization, searchability and management. It is critical to systems such as databases, digital libraries and content management platforms because it helps users more easily sort and find the data they need.
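As a small illustration, the sketch below reads basic file metadata (name, size and last-modified time) with Python's standard library; the file path is an assumption and would need to point at a real file.

```python
import os
from datetime import datetime

# Hypothetical path; replace with a real file to run
path = "report.csv"

info = os.stat(path)  # raises FileNotFoundError if the file does not exist
print("File name:", os.path.basename(path))
print("Size (bytes):", info.st_size)
print("Last modified:", datetime.fromtimestamp(info.st_mtime))
```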
Big data refers to massive, complex data sets that traditional systems can't handle. It includes both structured and unstructured data from sources such as sensors, social media and transactions.
Big data analytics helps organizations process and analyze these large data sets to systematically extract valuable insights. It often requires advanced tools such as machine learning.
Common use cases for big data include customer behavior analysis, fraud detection and predictive maintenance.
Data enables organizations to transform raw information into actionable insights to predict customer behavior, optimize supply chains and fuel innovation.
The term "data" comes from the plural of "datum", a Latin word meaning "something given": a definition that remains just as relevant today. Every day, millions of people provide data to businesses through interactions such as impressions, clicks, transactions, sensor readings or even just browsing online.
Organizations across industries can then use this constant flow of information to drive growth and innovation. For example, e-commerce retailers use vast data sets and data analytics to forecast demand, helping to ensure that they stock the right products at the right time.
Similarly, data-driven streaming platforms use machine learning algorithms not only to recommend content but also to optimize it, analyzing which scenes resonate most with viewers to help inform future production decisions.
Data is also increasingly essential in the era of artificial intelligence (AI), where large, high-quality data sets are necessary for training machine learning models (see “The role of data in artificial intelligence (AI)” for more information).
Additionally, AI's real-time data processing ability is critical in areas such as cybersecurity, where rapid data analysis identifies threats before they escalate; financial trading, where split-second decisions impact profits; and edge computing, where handling data closer to its source leads to faster insights, quicker decision-making and reduced bandwidth use.
Organizations across industries use data for various purposes, including improving decision-making, streamlining operations and driving innovation.
Common ways organizations use data in their operations include:
Predictive analytics
Generative AI
Healthcare innovations
Social science research
Cybersecurity and risk management
Operational efficiency
Customer experience
Government initiatives
Business intelligence (BI)
Predictive analytics is a branch of advanced analytics that predicts future trends and outcomes using historical data combined with statistical modeling, data mining and machine learning.
E-commerce companies frequently use predictive analytics to anticipate customer purchasing behaviors based on past transactions. In manufacturing and transportation, predictive analytics enables predictive maintenance by analyzing real-time machine data to predict when equipment will likely fail and recommend proactive maintenance.
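As a minimal sketch of the idea, the example below fits a straight-line trend to hypothetical monthly sales with NumPy and extrapolates one month ahead; real predictive analytics typically draws on richer features and dedicated machine learning tooling.

```python
import numpy as np

# Hypothetical historical data: 6 months of sales
months = np.arange(1, 7)
sales = np.array([120, 132, 128, 140, 151, 160])

# Fit a simple linear trend (degree-1 polynomial) and forecast month 7
slope, intercept = np.polyfit(months, sales, 1)
forecast = slope * 7 + intercept
print(f"Forecast for month 7: {forecast:.1f}")
```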
Generative AI, sometimes called gen AI, is artificial intelligence (AI) that can create original content—such as text, images, video, audio or software code—in response to a user’s prompt or request.
Generative AI relies on sophisticated machine learning models called deep learning models. These models are trained on vast data sets, which allows them to do things such as understand users’ requests, generate personalized marketing content and write code.
Data analytics can help healthcare providers improve patient care, predict disease outbreaks and enhance treatment protocols.
For instance, monitoring patients through time series data, such as tracking patient vitals over time, provides real-time insights into patient conditions. This, in turn, enables faster interventions and more personalized treatments.
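A minimal sketch of that kind of monitoring, assuming hypothetical heart-rate readings and an illustrative alert threshold, might use pandas to smooth the series and flag out-of-range values:

```python
import pandas as pd

# Hypothetical heart-rate readings (beats per minute), one per minute
heart_rate = pd.Series([72, 75, 74, 78, 90, 104, 110, 108, 95, 88])

rolling_avg = heart_rate.rolling(window=3).mean()
ALERT_THRESHOLD = 100  # assumed threshold for illustration only

alerts = heart_rate[heart_rate > ALERT_THRESHOLD]
print("Rolling average:", rolling_avg.round(1).tolist())
print("Readings above threshold:", alerts.tolist())
```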
Social science researchers frequently analyze quantitative and qualitative data from surveys, census reports and social media. Examining these data sets allows them to study behaviors, trends and policy impacts.
For instance, researchers might use census data to track population changes, survey responses to measure public opinion and social media data to analyze emerging trends.
As cyberattacks and data breaches become more frequent, organizations are increasingly turning to data analysis to identify and respond to threats faster, minimizing damage and reducing downtime.
For example, security information and event management (SIEM) systems can help detect and respond to anomalies in real time by aggregating and analyzing security alerts from throughout the network.
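The toy sketch below hints at the underlying idea: aggregating security events (here, hypothetical failed-login records) and flagging any source that exceeds an assumed threshold. Real SIEM platforms correlate far more signals and run continuously.

```python
from collections import Counter

# Hypothetical stream of failed-login events (source IP addresses)
failed_logins = ["10.0.0.5", "10.0.0.7", "10.0.0.5", "10.0.0.5", "10.0.0.9", "10.0.0.5"]

THRESHOLD = 3  # assumed alert threshold for illustration
counts = Counter(failed_logins)

for source, count in counts.items():
    if count >= THRESHOLD:
        print(f"ALERT: {source} had {count} failed logins")
```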
Machine learning algorithms, trained on vast data sets, can help organizations boost operational efficiency by optimizing logistics, predicting demand, improving scheduling and automating workflows.
For example, e-commerce companies frequently collect and analyze real-time sales data to inform inventory management, reducing the likelihood of stockouts or overstocking.
Data is the backbone of personalized customer experiences, particularly in marketing, where organizations can use data analytics to tailor content and ads to different users.
For example, streaming services rely on machine learning algorithms to analyze viewing habits and recommend content.
Governments worldwide frequently use open data policies to make valuable data sets publicly accessible, encouraging businesses and organizations to use these resources for research and innovation.
For example, the US government's Data.gov platform provides access to various data sets across healthcare, education and transportation. This access helps foster transparency and allows businesses across industries to develop data-driven solutions based on publicly available information.
Business intelligence (BI) is a set of technological processes for collecting, managing and analyzing data, turning raw data into insights that can guide business decisions.
Business analytics complements BI by helping organizations interpret and visualize data through graphs, dashboards and reports, making it easier to spot trends and make informed decisions.
Data collection is the systematic process of gathering data from various sources while helping to ensure its quality and integrity. Typically performed by data scientists and analysts, it is the foundation for accurate and reliable data analysis.
Data collection starts with setting clear objectives and identifying relevant sources. Data is then acquired, cleaned and integrated into a unified data set. Data storage systems and ongoing quality checks help ensure the collected data is accurate and reliable.
Without proper data collection, organizations risk basing their analyses on incomplete, inaccurate or misleading data, leading to compromised insights and decision-making.
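As a minimal sketch of the cleaning and integration steps, the example below uses pandas with two small, hypothetical source tables: duplicate and incomplete rows are dropped, then the sources are merged into one unified data set.

```python
import pandas as pd

# Hypothetical data from two sources
crm = pd.DataFrame({"customer_id": [1, 2, 2, 3], "name": ["Ada", "Bo", "Bo", None]})
sales = pd.DataFrame({"customer_id": [1, 2, 3], "total_spend": [1200.0, 850.5, 430.0]})

# Clean: remove duplicates and rows with missing values
crm_clean = crm.drop_duplicates().dropna()

# Integrate: merge into a unified data set on the shared key
unified = crm_clean.merge(sales, on="customer_id", how="inner")
print(unified)
```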
Some common data sources include:
Surveys and customer feedback
Sensor readings
Social media content
E-commerce and financial transactions
Public records, such as government statistics and census data
Organizations handle vast amounts of data in multiple formats scattered across public and private clouds, making data fragmentation and mismanagement significant challenges.
According to the IBM Data Differentiator, 82% of enterprises struggle with data silos that disrupt workflows, and 68% of data goes unanalyzed, limiting its full potential.
Data management is the practice of collecting, processing and using data securely and efficiently to improve business outcomes. It addresses critical challenges such as managing large data sets, breaking down silos and handling inconsistent data formats.
Data management solutions typically integrate with existing infrastructure to help ensure access to high-quality, usable data for data scientists, analysts and other stakeholders. These solutions often incorporate data lakes, data warehouses or data lakehouses, combined in a unified data fabric.
These systems help create a solid data management foundation, feeding high-quality data into business intelligence (BI) tools, dashboards and AI models, including machine learning (ML) and generative AI.
Additionally, AI is transforming how organizations handle data. AI data management is the practice of using artificial intelligence (AI) and machine learning in the data management lifecycle. Examples include applying AI to automate or streamline data collection, data cleaning, data analysis, data security and other data management processes.
As businesses across industries increasingly rely on data to drive decision-making, improve operations and enhance customer experiences, the demand for skilled data professionals has surged.
Two of the most significant roles in the field of data science are data scientists and data analysts.
Both roles span data collection, data modeling, data analysis and data quality assurance. Analysts and scientists alike might use various methodologies and tools to wrangle and prepare data, including Microsoft Excel, Python and structured query language (SQL).
They might also use data visualization techniques, such as dashboards and graphs, to help discover trends, correlations and insights in the data, though each role typically applies them in different ways.
For example, a data scientist might develop a predictive model using machine learning to forecast future customer behavior. This model could help the company anticipate trends, personalize marketing campaigns and make informed long-term strategic decisions.
By comparison, a data analyst on the same project might use a visualization tool to create a dashboard showing customer behavior patterns over time. This ability to chart historical sales trends alongside engagement metrics could help the team optimize current marketing strategies or adjust product offerings to increase profits.
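A minimal sketch of the analyst side of that workflow, assuming hypothetical sales records, might use pandas to aggregate revenue by month, surfacing the historical trend a dashboard would then chart.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-02", "2024-03"],
    "revenue": [1200, 900, 1500, 1100, 1700],
})

# Aggregate revenue by month to reveal the historical trend
monthly = sales.groupby("month")["revenue"].sum()
print(monthly)
```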
Data protection is the practice of safeguarding sensitive information from data loss, theft and corruption. Data protection is increasingly important as organizations handle larger volumes of sensitive data across complex, distributed environments.
The growing risk of cyberthreats and stricter data privacy regulations have also made data protection a priority for businesses and consumers. According to a recent study, 81% of Americans are concerned about how companies use the data collected about them.1
There is also a strong business case to be made for prioritizing data protection. The average data breach costs an organization USD 4.88 million across lost business, system downtime, reputational damage and response efforts, according to the IBM Cost of a Data Breach Report.
Data protection has 2 critical subfields: data security and data privacy. Both play distinct yet complementary roles in safeguarding and managing data.
Data security involves protecting digital information from unauthorized access, corruption or theft. It encompasses various aspects of information security, spanning physical security, organizational policies and access controls.
Data privacy focuses on policies that support the general principle that a person should have control over their personal data, including the ability to decide how organizations collect, store and use their data.
Data faces many vulnerabilities and potential cyberthreats, particularly as AI capabilities advance.
Some of the most common threats include:
Data breaches and cyberattacks
Data loss, theft and corruption
AI-generated false or misleading content
Organizations use various data protection technologies to defend against threat actors and help ensure data integrity, confidentiality and availability.
Some of the most popular solutions include:
Access controls
Security information and event management (SIEM) systems
AI-powered security tools that automate detection, prevention and response
72% of top-performing CEOs agree that having a competitive advantage depends on who has the most advanced generative AI. Yet, having cutting-edge AI is only part of the equation. Without properly managed and accessible data, even the most powerful AI tools cannot reach their full potential.
Data is the foundation for the advancement and success of artificial intelligence. AI systems, particularly machine learning models, rely on data to learn, adapt and deliver value across industries.
Machine learning models are trained on vast data sets and use this data to identify patterns and make decisions.
The diversity and quality of an AI model’s training data directly affect its performance. If data is biased or incomplete, AI outputs can become inaccurate and unreliable.
For example, in healthcare, AI models trained on biased data sets might underrepresent certain racial groups, leading to poor diagnostic outcomes. Similarly, in hiring, poor data quality can result in flawed predictions, potentially reinforcing gender or racial stereotypes and creating AI models that favor certain demographic groups over others.
In short, AI is only as good as the data it processes.
Ensuring high-quality input through comprehensive data validation and cleansing is essential for building ethical, reliable AI systems that avoid perpetuating bias.
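As a small sketch of what such validation can look like in practice, the example below uses pandas on a hypothetical training set to report missing values and check how balanced the label classes are before the data reaches a model.

```python
import pandas as pd

# Hypothetical training data with a label column
data = pd.DataFrame({
    "age": [34, 29, None, 45, 52],
    "income": [52000, 48000, 61000, None, 75000],
    "label": ["approve", "deny", "approve", "approve", "approve"],
})

print("Missing values per column:")
print(data.isna().sum())

print("\nLabel distribution (class balance check):")
print(data["label"].value_counts(normalize=True))
```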
While generative AI can create valuable content, it also presents new challenges. AI models can generate false or misleading data, which attackers can exploit to deceive systems or individuals.
Data authenticity and security are growing concerns. A recent report found that 75% of senior cybersecurity professionals are seeing more cyberattacks, with 85% attributing the rise to bad actors using generative AI.2
To counter these threats, many organizations are turning to AI security, using AI itself to automate detection, prevention and response, and to enhance data protection.
All links reside outside ibm.com.
1 How Americans View Data Privacy, Pew Research Center, 18 October 2023.
2 AI advances risk facilitating cyber crime, top US officials say, Reuters, 9 January 2024.