
External data sets

Transactional data is central to your business. But if you only analyze your internal transactional data, your analysis is incomplete. Complement your data with external data sets that give you a 360-degree view of your business landscape. With external data, you can run a more comprehensive analysis and make better-informed decisions.

IBM® Cloud Pak for Data has partnered with industry leaders to provide easy and seamless access to external reference data that you can use to enrich your transactional data. Some of the data sets provide historical data, while others provide real-time data.

The data sets make it easy for data scientists to access the data that they need from the same platform where they build and run their analytic models.

Data offering Provided by Pricing Learn more
Weather Company Data Limited Edition The Weather Company® Included with Cloud Pak for Data
About this offering
90-day access to cloud-based APIs that enable you to obtain historical weather data, current conditions, and forecast conditions.
Use cases
You can use weather data to optimize operations, reduce overhead costs, increase safety, and uncover new revenue opportunities. For example, you can:
  • Predict power outages with greater accuracy so that you can restore power to customers faster
  • Reduce utility costs with smarter vegetation management
  • Improve flight safety, efficiency and performance
  • Keep policyholders safe while reducing insurance claims and fraud
  • Improve supply chain visibility and minimize weather-related disruptions
  • Transport people and goods more safely
Industry accelerators
The following industry accelerators can help you get started with this data set:
Get started
For details, see https://www.ibm.com/weather.
Document Layout Analysis Data
(Image analysis)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of two data sets:
  • The PubLayNet data set contains images of research papers, articles, and annotations that identify elements such as text, titles, lists, tables, and figures.

    File types: JPG, JSON

  • The PubTabNet data set contains tables in image and HTML format.

    File types: PNG, JSON
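
As a rough illustration of working with JSON layout annotations, the following sketch counts how many elements of each type appear in an annotation file. It assumes a COCO-style structure with `images`, `categories`, and `annotations` keys; the sample content, file name, and category IDs are hypothetical, so check the schema of the downloaded files before you rely on it.

```python
import json
from collections import Counter

# Hypothetical annotation fragment in a COCO-style layout (an assumption,
# not copied from the data set).
SAMPLE_JSON = """
{
  "images": [{"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792}],
  "categories": [
    {"id": 1, "name": "text"},
    {"id": 2, "name": "title"},
    {"id": 4, "name": "table"}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 90, 500, 200]},
    {"id": 12, "image_id": 1, "category_id": 1, "bbox": [50, 310, 500, 150]},
    {"id": 13, "image_id": 1, "category_id": 4, "bbox": [50, 480, 500, 220]}
  ]
}
"""


def count_elements(annotation_doc: dict) -> Counter:
    """Count annotated layout elements by category name."""
    # Map numeric category IDs to readable names such as "text" or "table".
    names = {cat["id"]: cat["name"] for cat in annotation_doc["categories"]}
    return Counter(names[ann["category_id"]] for ann in annotation_doc["annotations"])


counts = count_elements(json.loads(SAMPLE_JSON))
for name, total in sorted(counts.items()):
    print(f"{name}: {total}")
```

A per-category count like this is a quick way to check class balance before you train a layout model.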

Use cases
Build models that can:
  • Identify the layout of unstructured documents, such as PDF files
  • Interpret the structure and content of image-based tables
Get started
For details, see:
  • PubLayNet data set on the IBM Developer site
  • PubTabNet data set on the IBM Developer site
Visual Question Answering Data
(Image analysis)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of a single data set.

The VizWiz - Visual Question Answering data set contains numerous images for training, testing, and validating your model. Each training image and validation image has a set of questions and answers that are associated with the image.

File types: JSON

Use cases
  • Build applications that can interpret images for visually impaired people
  • Create educational or recreational applications that can generate a set of questions and answers about an image
  • Create an image retrieval system
Get started
For details, see the VizWiz - Visual Question Answering data set on the IBM Developer site.
Finance and Contract Report Data
(Natural language processing)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of two data sets:
  • The Contracts Proposition Bank data set contains approximately 1000 compliance sentences from IBM's publicly available contracts. The sentences focus on 60 predicates that are specific to contract compliance.

    File types: CoNLL-U

  • The Finance Proposition Bank data set contains approximately 1000 finance sentences from IBM's publicly available annual financial reports. The sentences focus on 40 predicates that are specific to financial reporting.

    File types: CoNLL-U
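
CoNLL-U is a plain-text format with one token per line, ten tab-separated columns, `#` comment lines, and a blank line between sentences. The following minimal sketch reads such text with the Python standard library only. The sample sentence is hypothetical and is not taken from either data set, and the actual files might carry extra columns for predicate and argument labels, which this sketch ignores.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment or sentence-level metadata; skipped in this sketch.
            continue
        else:
            columns = line.split("\t")
            # zip() drops any columns beyond the ten standard ones.
            sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


# Hypothetical sample, not taken from the data set.
SAMPLE = (
    "# text = Supplier shall deliver goods\n"
    "1\tSupplier\tsupplier\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tshall\tshall\tAUX\t_\t_\t3\taux\t_\t_\n"
    "3\tdeliver\tdeliver\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\tgoods\tgood\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
verbs = [tok["lemma"] for tok in sentences[0] if tok["upos"] == "VERB"]
print(verbs)
```

To process a downloaded file, read its contents into a string and pass that string to `read_conllu`.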

Use cases
Build natural language processing models to:
  • Analyze contracts to identify content related to agreement terms, intellectual property protection, limitation of liability, warranty terms, and so on.
  • Analyze financial reports to identify content related to market risk, investment outcomes, financial health, and so on.
Get started
For details, see the following data sets on the IBM Developer site:
Rich Text Data
(Natural language processing)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of more than 10 data sets:
  • The WikiText-103 data set contains more than 1 million tokens from good or featured articles on Wikipedia.

    File types: TXT

  • The Groningen Meaning Bank data set contains more than 1 million multi-sentence texts with annotations for parts-of-speech, named entities, lexical categories and other natural language structural phenomena.

    File types: TXT

  • The IBM Debater® Mention Detection Benchmark data set contains 3000 sentences that are annotated with mentions so that they can be mapped to the relevant concepts in a knowledge base.

    File types: ANN

  • The Forum Classify data set contains 100 discussion threads. Each message in a thread is classified as a question, repeat question, clarification, further details, solution, positive feedback, negative feedback, or junk.

    File types: XML

  • The Forum Summarize data set contains more than 113,000 discussion threads. The data set contains information about the structure and metadata (title, posts, user IDs, and so on) of each thread.

    File types: XML

  • The IBM Debater Sentiment Lexicon of IDiomatic Expressions (SLIDE) data set contains 5000 idioms that are annotated with sentiment analysis.

    File types: TSV

  • The IBM Debater Claim Sentences Search data set contains 1,490,000 sentences. The sentences contain information about a series of preselected topics. The sentences can be used as claims (phrases that are used to support an argument).

    File types: CSV

  • The IBM Debater® Wikipedia Category Stance data set contains information about more than 4600 Wikipedia pages. Each page discusses one of 132 concepts. The pages are annotated with their stance (for or against) on the concept.

    File types: CSV

  • The IBM Debater Thematic Clustering of Sentences data set contains 692 Wikipedia articles. The sentences in each article are annotated with the thematic cluster that they belong to.

    File types: CSV

  • The IBM Debater Wikipedia Oriented Relatedness data set identifies the relatedness between different concepts on Wikipedia. The data set is composed of more than 19,000 pairs of related concepts.

    File types: CSV

  • The IBM Debater Multi Word Term Relatedness Benchmark data set identifies the relatedness between multiword terms, including acronyms and named entities. The data set is composed of more than 9,800 labeled pairs of terms.

    File types: CSV

  • The Nutch data set includes execution log files from Nutch, an open source web crawler application. The log files were generated before and after changes were committed to the application. The data set highlights the difference in execution behavior based on the changes that were committed.

    File types: CSV, JSON

  • The IBM Debater Labeled Emphasized Words in Speech data set contains more than 4000 sentences from speeches that were given to an audience. The sentences are annotated with the words that were emphasized most when the sentence was spoken.

    File types: TXT

  • The IBM Debater Sentiment Composition Lexicons data set contains more than 66,000 words and 262,000 two-word phrases. Each word and phrase is annotated with a positive or negative sentiment score. For example, absolute bliss has a positive sentiment score, but absolute chaos has a negative sentiment score.

    File types: TXT, XLSX

  • The IBM Debater Concept Abstractness data set contains 100,000 words, 100,000 two-word phrases, and 100,000 three-word phrases that represent concepts. Each word and phrase is annotated with the degree of abstractness. For example, a bad dream is more abstract than a hammer.

    File types: CSV

Use cases
Build natural language processing models to:
  • Discover the content of documents
  • Search the contents of documents
  • Classify and organize documents
  • Generate article or product recommendations
  • Determine the topic of a document
  • Retrieve relevant information
  • Identify similarities between documents
  • Detect plagiarism
  • Analyze customer sentiment
  • Create plans based on customer sentiment
  • Create applications that better emulate human speech by predicting which words should be emphasized when converting text to speech
  • Analyze log files to identify differences in behavior between two versions of an application
Get started
For details, see the following data sets on the IBM Developer site:
Speech Command Data
(Audio analysis)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of a single data set.

The TensorFlow Speech Commands data set contains a set of audio files that contain core words, auxiliary words, or background noises.

File types: WAV
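
Before you train on WAV files, it can help to confirm the sample rate, channel count, and clip length. The following sketch uses Python's standard `wave` module. It builds a silent clip in memory so that it runs without the data set; the 16 kHz mono, 16-bit settings are illustrative assumptions, not a statement about the files in this offering.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Return basic properties of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Build a one-second silent clip in memory (illustrative settings only).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)       # mono
    out.setsampwidth(2)       # 16-bit samples
    out.setframerate(16000)   # 16 kHz
    out.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

info = describe_wav(buffer)
print(info)
```

To inspect a real recording, pass the path of a downloaded file to `describe_wav` instead of the in-memory buffer.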

Use cases
Build systems that are capable of recognizing spoken commands. For example, you can build:
  • Voice-activated assistants
  • Voice-operated IoT devices
Get started
For details, see the TensorFlow Speech Commands data set on the IBM Developer site.
IBM Debater® Data
(Audio analysis)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of three data sets:
  • The IBM Debater Recorded Debating #1 data set contains 60 argumentative speeches given by expert debaters. The data set covers 16 controversial topics. The data set also includes transcripts of the recordings and an annotated list of claims that could be used to support the argument.

    File types: WAV, CSV, TXT

  • The IBM Debater Recorded Debating #2 data set contains 200 argumentative speeches given by expert debaters. The data set covers 50 controversial topics. The data set also includes transcripts of the recordings and an annotated list of claims that could be used to support the argument.

    File types: WAV, CSV, TXT

  • The IBM Debater Recorded Debating #3 data set contains audio recordings of 400 argumentative speeches given by expert debaters. The data set covers 200 controversial topics. The data set also includes transcripts of the recordings and an annotated list of claims that could be used to support the argument.

    File types: WAV, CSV, TXT

Use cases
Build systems like IBM Debater that are capable of understanding and rebutting arguments. For example, you could use the system to understand potential counterarguments for legal cases, legislation, and public policy.
Get started
For details, see the following data sets on the IBM Developer site:
Historical Weather Data
(Time series analysis)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of a single data set.

The NOAA Weather Data – JFK Airport data set includes over 114,000 hourly observations of weather data from JFK Airport. The weather data includes visibility, temperature, wind speed and direction, humidity, dew point, and pressure.

File types: CSV

Use cases
Build models that can generate weather predictions.
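
When you build a forecasting model on hourly observations, a common first step is a persistence baseline, which predicts each hour's value from the previous hour's value. The following sketch computes that baseline's mean absolute error. The sample rows and the column names `timestamp` and `temperature_f` are hypothetical and do not reflect the data set's actual schema.

```python
import csv
import io

# Hypothetical sample rows; the column names are assumptions,
# not the data set's schema.
SAMPLE_CSV = """timestamp,temperature_f
2018-01-01 00:00,20.0
2018-01-01 01:00,19.0
2018-01-01 02:00,17.0
2018-01-01 03:00,18.0
"""


def persistence_mae(values):
    """Mean absolute error when each value is predicted by the one before it."""
    errors = [abs(curr - prev) for prev, curr in zip(values, values[1:])]
    return sum(errors) / len(errors)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
temps = [float(row["temperature_f"]) for row in rows]
baseline = persistence_mae(temps)
print(f"Persistence baseline MAE: {baseline:.2f}")
```

A trained model is only useful if it beats this baseline on held-out data.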
Get started
For details, see the NOAA Weather Data – JFK Airport data set on the IBM Developer site.
Activity Verification Data
(Video analysis)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of a single data set.

The Video-Text Compliance data set is a series of videos that show atomic activities. The videos are accompanied by text instructions and compliance labels.

File types: MP4, CSV

Use cases
Build models that can determine whether the person being monitored is performing a task according to an associated set of text-based instructions.
Get started
For details, see the Video-Text Compliance data set on the IBM Developer site.
Core Science Data
(Video analysis)
IBM Included with Cloud Pak for Data
About this offering
This offering is comprised of a single data set.

The Double Pendulum Chaotic data set is a series of videos that show the motion of a double pendulum over the course of 21 different runs. The data set also includes an annotated list of frames.

File types: H.264, CSV

Use cases
Build models that can generate spatiotemporal predictions for the behavior of a chaotic system.
Get started
For details, see the Double Pendulum Chaotic data set on the IBM Developer site.
People data People Data Labs Separately priced
About this offering
Access more than 1 billion profiles of people from around the world. The data covers more than 150 data points and includes information such as professional experience, interests, social profiles, and more.

You can purchase a bulk data license or you can purchase access to the APIs.

Use cases
Build models that help you:
  • Enrich inbound sales
  • Identify new prospects
  • Deduplicate existing data
Get started
For details, see the following pages on the People Data Labs website: