Best practices for loading large volumes of data from a file or connection

You can use generated code to load data from a file or connection when you work in a Python notebook. If you write your own code that uses Flight service APIs to access large volumes of data from a file or connection, consider the following recommendations to improve efficiency and avoid memory issues when you work with pandas DataFrames.

  • Use the write_table function to set the max_chunksize parameter on Flight service write requests that send a pyarrow table to Db2. Keep each chunk at or below 3 MB; larger chunks cause out-of-memory issues. See the first sketch after this list.
  • Create smaller DataFrames by loading smaller subsets of the data, or avoid pandas DataFrames entirely. pandas uses far more memory than the raw size of the data that is loaded.
  • Read data from the Flight service with the pyarrow Flight client; you don't need pandas at all. See the second sketch after this list.
  • Run Python garbage collection manually when you run many iterations of CPU-intensive tests. If the CPU is never idle, the garbage collector might not get a chance to clean up released memory; forcing a collection periodically can reduce memory usage. See the third sketch after this list.
  • Make sure that the data goes out of scope after it is read so that garbage collection can clean it up; the third sketch also illustrates this scoping pattern.
  • Be aware of the type differences between Apache Arrow Flight and pandas, and know which Apache Arrow Flight types map to which pandas types. For more information, see Type differences.
  • Ensure that the requested schema contains types that pandas can work with efficiently. For example, decimal types become Python objects that take a large amount of space; request such columns in single-precision floating-point format instead to use less memory. You can cast types in an SQL statement or in the fields property of a Flight request. See the fourth sketch after this list.
  • Partition the data and work with a single partition at a time when a table is too large to read into a single DataFrame.
  • To partition the data, set the num_partitions property in the request. The request returns up to num_partitions Flight service endpoints, each of which can be read independently. See the fifth sketch after this list.
  • When you must work within a specific partition size, reduce the columns to only those you need by listing them explicitly in the SELECT statement.
  • When you need all of the columns at the set partition size and memory usage is still too high, reduce the batch size of the Flight service request and process the resulting DataFrames incrementally. See the last sketch after this list.
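
The first sketch writes a pyarrow table through the Flight service in chunks. In pyarrow, max_chunksize is a maximum row count per record batch, so the sketch derives a row count that keeps each chunk under 3 MB. The endpoint URL, command payload, and table name are hypothetical placeholders, not real service values.

```python
# Minimal sketch of a chunked Flight write; the URL, command JSON, and
# table name are placeholders.
import json

import pyarrow as pa
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tls://flight-service.example.com:443")

table = pa.table({
    "id": pa.array(range(100_000), pa.int64()),
    "value": pa.array([1.0] * 100_000, pa.float32()),
})

# max_chunksize is a row count, so derive a per-chunk row count whose
# total size stays at or below 3 MB.
bytes_per_row = max(1, table.nbytes // max(1, table.num_rows))
rows_per_chunk = max(1, (3 * 1024 * 1024) // bytes_per_row)

# The command that identifies the write target is service-specific; this
# JSON shape is only an assumption.
descriptor = flight.FlightDescriptor.for_command(
    json.dumps({"interaction_properties": {"table_name": "MY_TABLE"}})
)

writer, _ = client.do_put(descriptor, table.schema)
writer.write_table(table, max_chunksize=rows_per_chunk)
writer.close()
```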
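
The second sketch reads a flight directly into a pyarrow Table, skipping pandas entirely. The URL and ticket value are hypothetical stand-ins; a real ticket would come from a get_flight_info call.

```python
# Minimal sketch of a pandas-free read; the URL and ticket are placeholders.
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tls://flight-service.example.com:443")
reader = client.do_get(flight.Ticket(b"placeholder-ticket"))

table = reader.read_all()   # a pyarrow.Table, not a pandas DataFrame
print(table.num_rows, table.schema)

# Filtering and aggregation can often be done directly on the Table with
# pyarrow.compute, with no pandas conversion at all.
```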
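
The third sketch keeps each iteration's DataFrame in function scope and forces a periodic collection. The load_batch function and the value column are hypothetical stand-ins for a Flight read.

```python
# Minimal sketch of scoping plus manual garbage collection.
import gc

import pandas as pd

def load_batch(i):
    # Hypothetical stand-in for reading one batch from the Flight service.
    return pd.DataFrame({"value": range(100_000)})

def process_one_batch(i):
    # df exists only inside this function, so it goes out of scope (and
    # becomes collectable) as soon as the function returns.
    df = load_batch(i)
    return df["value"].sum()

totals = []
for i in range(100):
    totals.append(process_one_batch(i))
    if i % 10 == 0:
        gc.collect()   # reclaim released memory even if the CPU is never idle
```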
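
The fourth sketch casts decimal columns to single-precision floats after the read, so that to_pandas produces compact float32 columns instead of Python Decimal objects. You can achieve the same result in the SQL statement itself, for example with CAST(PRICE AS REAL); the column data here is a placeholder.

```python
# Minimal sketch: replace decimal columns with float32 before converting
# to pandas.
from decimal import Decimal

import pyarrow as pa

table = pa.table({
    "price": pa.array([Decimal("1.10"), Decimal("2.20")], pa.decimal128(10, 2)),
})

# Rewrite every decimal field in the schema as float32, then cast.
fields = [
    pa.field(f.name, pa.float32()) if pa.types.is_decimal(f.type) else f
    for f in table.schema
]
table = table.cast(pa.schema(fields))

df = table.to_pandas()
print(df.dtypes)   # price is now float32, not object
```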
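
The fifth sketch requests up to four partitions and reads them one at a time. The exact command payload that carries the num_partitions property is service-specific; the JSON shape, URL, table, and column names here are assumptions.

```python
# Minimal sketch of a partitioned read; the command payload shape, URL,
# and SQL are placeholders.
import json

import pyarrow.flight as flight

client = flight.FlightClient("grpc+tls://flight-service.example.com:443")

command = json.dumps({
    "num_partitions": 4,   # ask for up to 4 Flight endpoints
    "interaction_properties": {
        "select_statement": "SELECT ID, AMOUNT FROM MY_TABLE",
    },
})
info = client.get_flight_info(flight.FlightDescriptor.for_command(command))

# Each endpoint serves one partition; reading them sequentially keeps only
# one partition in memory at a time.
for endpoint in info.endpoints:
    df = client.do_get(endpoint.ticket).read_all().to_pandas()
    print(len(df))
    del df   # let this partition go out of scope before the next read
```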
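
The last sketch shrinks the batch size and processes the stream one record batch at a time. The batch_size property name, SQL, and column name are assumptions; the incremental iteration over the reader is standard pyarrow.

```python
# Minimal sketch of incremental, batch-at-a-time processing; the
# batch_size property, URL, SQL, and column name are placeholders.
import json

import pyarrow.flight as flight

client = flight.FlightClient("grpc+tls://flight-service.example.com:443")

command = json.dumps({
    "batch_size": 10_000,   # hypothetical property that caps rows per batch
    "interaction_properties": {"select_statement": "SELECT * FROM MY_TABLE"},
})
info = client.get_flight_info(flight.FlightDescriptor.for_command(command))

running_total = 0
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    for chunk in reader:                    # one record batch at a time
        df = chunk.data.to_pandas()         # small, incremental DataFrame
        running_total += df["AMOUNT"].sum() # hypothetical column
print(running_total)
```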

Parent topic: Notebooks and scripts