Important:

IBM Cloud Pak® for Data Version 4.6 will reach end of support (EOS) on 31 July 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.6 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.

Loading and accessing data in a notebook (Watson Studio)

You can integrate data into notebooks by accessing the data from a local file, from free data sets, or from a data source connection. You load that data into a data structure or container in the notebook, for example, a pandas.DataFrame, numpy.array, Spark RDD, or Spark DataFrame.

To work with data in a notebook, you can choose between the following options:

| Option | Recommended method | Requirements | Details |
| --- | --- | --- | --- |
| Add data from a file from your local system | Use code generated by the Insert to code function | The file must exist as a project asset | Add a file from your local system |
| Load data from data source connections | Use code generated by the Insert to code function | The connections must exist as project assets | Load data from a data source connection |
| Access project assets and metadata programmatically | Use ibm-watson-studio-lib library functions | The data sources must exist as project assets | Use the ibm-watson-studio-lib library to interact with data assets |
| Generate your own code to read or write data | Use the Flight Service and the Apache Arrow Flight protocol to read from and write to data assets in a project | The data assets must exist as assets in the project before you can read or write data | Access data sources using the Flight service in Python notebooks or Access data sources using the Flight service in R notebooks |
| Access data using an API function or operating system command | For example, use Wget | N/A | Access data using an API function or operating system command |

Important: Make sure that the environment in which the notebook is started has enough memory to store the data that you load to the notebook. Often, the environment must have significantly more memory than the total size of the loaded data, because some data frameworks, such as pandas, can hold multiple copies of the data in memory.

Load data from local files

To access data from a local file, you can load the file from within a notebook, or first load the file into your project. From your notebook, you add automatically generated code to access the data by using the Insert to code function. The inserted code gives you a quick start for working with data sets.

The Insert to code function supports file types such as CSV, JSON, and XLSX. To learn which data structures are generated for which notebook language, see Data load support. For file types that are not supported, you can only insert the file credentials. With the credentials, you can write your own code to load the file data into a DataFrame or other data structure in a notebook cell, as shown in the sketch that follows.
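
For an unsupported file type, your own load code might resemble the following minimal sketch. It assumes the project file is stored in S3-compatible IBM Cloud Object Storage and that the inserted credentials look like the dictionary below; your deployment's storage and credential keys might differ, and every value, the bucket name, and the file key are placeholders.

```python
import io

import ibm_boto3
import pandas as pd
from ibm_botocore.client import Config

# Placeholder credentials in the shape that Insert to code might insert
# for a file in object storage; replace every value with your own.
credentials = {
    "endpoint_url": "https://s3.example.cloud-object-storage.appdomain.cloud",
    "api_key": "<api-key>",
    "service_instance_id": "<resource-instance-id>",
    "auth_endpoint": "https://iam.cloud.ibm.com/identity/token",
    "bucket": "<bucket-name>",
    "file_key": "data.parquet",
}

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id=credentials["api_key"],
    ibm_service_instance_id=credentials["service_instance_id"],
    ibm_auth_endpoint=credentials["auth_endpoint"],
    config=Config(signature_version="oauth"),
    endpoint_url=credentials["endpoint_url"],
)

# Download the object into memory and parse it with a reader that fits
# the file type; Parquet is used here only as an example.
raw = cos.get_object(Bucket=credentials["bucket"], Key=credentials["file_key"])["Body"]
df = pd.read_parquet(io.BytesIO(raw.read()))
```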

To add a file from your local system to your notebook:

  1. Click the Find and Add Data icon, and then browse for a data file or drag it into your notebook sidebar.
  2. Click in an empty code cell in your notebook and then click the Insert to code link below the file.
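
For a supported file type such as CSV, the inserted code typically resembles the following minimal sketch, which loads a CSV project asset into a pandas DataFrame by using the ibm-watson-studio-lib library. This is a sketch, not the exact code that Insert to code generates; the asset name sales.csv is a placeholder.

```python
import pandas as pd
from ibm_watson_studio_lib import access_project_or_space

# Connect to the project that contains the notebook.
wslib = access_project_or_space()

# Load the project asset as a file-like object and parse it with pandas.
buffer = wslib.load_data("sales.csv")  # placeholder asset name
df = pd.read_csv(buffer)
df.head()
```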

Load data from data source connections

You must create a connection to an IBM data service or an external data source before you can add data from that data source to your notebook. See Adding connections to projects.

The Insert to code function supports loading data from database connections and connected database assets. To learn which database connections are supported, see Data load support. For database connections that are not supported, you can only insert the database connection credentials. With the credentials, you can write your own code to load the data into a DataFrame or other data structure in a notebook cell.

To load data from an existing data source connection into a data structure in your notebook:

  1. Open the notebook in edit mode.
  2. Click in an empty code cell, click Find and Add Data, and then click the Connections tab to see your connections.
  3. Click Insert to code under the connection name.
  4. If necessary, enter your personal credentials for locked data connections that are marked with a key icon. This is a one-time step that permanently unlocks the connection for you; after you unlock the connection, the key icon is no longer displayed. See Adding connections to projects.
  5. If the connection is supported, choose how to load the data to your notebook: select the schema and choose a table.
  6. If the connection is not supported, the connection credentials are inserted; write your own code that uses those credentials to open the connection and load the data, as shown in the sketch after these steps.
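
For step 6, your own load code might resemble the following minimal sketch, which assumes a PostgreSQL connection and a credentials dictionary similar to what Insert to code inserts; the dictionary keys, values, and the table name are all placeholders to adapt to your connection.

```python
import pandas as pd
import sqlalchemy

# Placeholder credentials; Insert to code inserts the actual values
# for your connection.
credentials = {
    "host": "db.example.com",
    "port": 5432,
    "database": "sales",
    "username": "<username>",
    "password": "<password>",
}

# Build a connection and load a bounded result set into a DataFrame.
engine = sqlalchemy.create_engine(
    "postgresql://{username}:{password}@{host}:{port}/{database}".format(**credentials)
)
df = pd.read_sql("SELECT * FROM orders LIMIT 1000", engine)
```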

For a Planning Analytics connection, see Adding data from a Planning Analytics connection.

Use an API function or operating system command to access the data

You can use API functions or operating system commands in your notebook to access data, for example, the Wget command to access data by using the HTTP, HTTPS, or FTP protocols.
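
For example, the following minimal sketch downloads a file over HTTPS from within a notebook cell; the URL and file name are placeholders.

```python
# In a Jupyter cell, you can also shell out to Wget directly:
#   !wget https://example.com/data/iris.csv -O iris.csv
import urllib.request

url = "https://example.com/data/iris.csv"  # placeholder URL
urllib.request.urlretrieve(url, "iris.csv")
```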

Best practices when loading large volumes of data from a file or connection

Where possible, use the Insert to code function to generate the code that loads data from a file or connection when you work in a Python notebook.

However, if you need to write your own code that uses Flight service APIs to access large data volumes from a file or connection, consider the following guidelines to improve efficiency and avoid memory issues, especially when you work with pandas DataFrames.

  • Create smaller DataFrames by loading smaller subsets of the data, or do not use pandas DataFrames at all. pandas can use significantly more memory than the size of the data that is loaded.
  • You can read data from Flight by using the pyarrow Flight client directly; you do not need to use pandas. See the sketch after this list.
  • When you run many iterations of CPU-intensive tests, periodically run Python garbage collection manually. If CPU usage is never idle, garbage collection might not run normally to clean up released memory, so forcing a collection periodically helps keep memory usage lower.
  • Also make sure that the data goes out of scope as soon as it is read so that garbage collection can properly clean it up.
  • Be aware of the type differences between Arrow and pandas, and know which Arrow types map to which pandas types. For details, see Type differences.
  • Ensure that the requested schema contains types that pandas can work with efficiently. For example, decimal types result in Python objects that take a large amount of space; requesting them as single-precision floating point uses less memory. You can cast types in an SQL statement or in the fields property of a Flight request.
  • If the table is too big to read into a single DataFrame, consider partitioning the data and working with one partition at a time.
  • You can partition the data by setting the num_partitions property in the request. The request then returns up to num_partitions Flight endpoints that you can read independently.
  • If you must work within a specific partition size, try to reduce the number of columns with a SELECT statement to only what is necessary.
  • If you still need all of the columns at the set partition size and memory usage is too high, reduce the batch size of the Flight request and process the DataFrames incrementally.
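
The following minimal sketch brings several of these ideas together: it reads partition endpoints one at a time with the pyarrow Flight client and forces garbage collection between partitions. The Flight service location, the request payload keys (asset_id, num_partitions, batch_size), and the process function are illustrative placeholders; see the Flight service topics linked in the table above for the exact request schema and authentication details.

```python
# A minimal sketch, not the definitive Flight service request format.
# Authentication (for example, a bearer token passed in FlightCallOptions)
# is omitted for brevity.
import gc
import json

import pyarrow.flight as flight

client = flight.FlightClient("grpc+tls://flight-service.example.com:443")

command = json.dumps({
    "asset_id": "<data-asset-id>",  # placeholder asset reference
    "num_partitions": 4,            # ask the service to split the data
    "batch_size": 10000,            # smaller batches lower peak memory
})
info = client.get_flight_info(flight.FlightDescriptor.for_command(command))

# Read one partition (endpoint) at a time instead of the whole table.
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    for chunk in reader:        # stream Arrow record batches incrementally
        process(chunk.data)     # hypothetical per-batch processing function
    del reader                  # let the partition's data go out of scope
    gc.collect()                # force a collection between partitions
```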

Learn more

If you are using the project-lib library in a notebook, see Use the project-lib library to interact with data assets. However, bear in mind that the ibm-watson-studio-lib library is the successor of project-lib, and you should plan to move your code to ibm-watson-studio-lib.

Parent topic: Coding and running notebooks