Convert CSV objects to Parquet in Cloud Object Storage
IBM Cloud SQL Query is a serverless solution that lets you use standard SQL to quickly analyze data stored in IBM Cloud Object Storage (COS) without ETL or schema definitions. SQL Query can query multiple types of data in your COS buckets (including CSV, JSON, and Parquet), and each has its benefits. However, if you’re looking to lower storage costs by compressing your data and to get better query performance when querying the data stored in COS, you may have wondered how to convert your data to Parquet. We have you covered.
If you’re starting out with a bucket of CSV objects that you want to convert to Parquet, I’ve developed a simple command-line tool written in Python, called csvtoparquet, that will help you do just that. All you need to do is list the names of the CSV objects in your COS bucket that you want to convert and give each one a new name (you can add prefixes to the names as well); the tool then uses Apache Arrow to convert the CSV data to Parquet and stores it back in the COS bucket.
In this article, we’ll show you how to set up the tool and get started using it.
How the tool converts CSV objects
csvtoparquet uses Pandas and Apache Arrow to convert your CSV objects to Parquet. First, the tool retrieves the objects without storing them on your system using the Python IBM COS SDK:
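The retrieval step can be sketched like this (a minimal sketch, not the tool’s exact code: `fetch_object_bytes` is an illustrative helper name, and the client passed in is an `ibm_boto3` client, whose `get_object` mirrors boto3’s S3 interface):

```python
def fetch_object_bytes(cos_client, bucket, key):
    """Stream an object's contents from COS into memory.

    `cos_client` is an ibm_boto3 ('s3') client; the ibm-cos-sdk mirrors
    boto3, so get_object returns a dict with a streaming "Body".
    Nothing is written to the local filesystem.
    """
    response = cos_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```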
Those objects are then read with Pandas’ `read_csv` and converted to a DataFrame:
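A sketch of that step, with inline sample bytes standing in for an object downloaded from COS:

```python
import io

import pandas as pd

# Sample CSV payload standing in for an object fetched from COS
csv_bytes = b"trip duration,start station id\n680,3183\n1282,3160\n"

# read_csv accepts any file-like object, so no temporary file is needed
df = pd.read_csv(io.BytesIO(csv_bytes))
print(df.shape)  # → (2, 2)
```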
CSV objects that have column names containing spaces need the spaces converted to `_`; otherwise, Parquet throws an error.
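That renaming is a simple substitution over the DataFrame’s column index, something like:

```python
import pandas as pd

df = pd.DataFrame({"start station id": [3183], "end station id": [3160]})

# Parquet rejects spaces in column names, so swap them for underscores
df.columns = [name.replace(" ", "_") for name in df.columns]
print(list(df.columns))  # → ['start_station_id', 'end_station_id']
```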
Next, Pandas has a `to_parquet` method that will convert the DataFrame to Parquet. The engine we’ve selected for the conversion is `pyarrow`, the Python library for Apache Arrow, which is installed as a dependency of the tool:
Once the object is converted to Parquet, we’ll save it back to the COS bucket, adding the `.parquet` extension to the new object:
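Saving the result back just appends the extension to the user-supplied name before uploading (again a sketch: `upload_parquet` is an illustrative helper, and `put_object` on the `ibm_boto3` client mirrors boto3):

```python
def upload_parquet(cos_client, bucket, new_name, parquet_bytes):
    """Store converted Parquet bytes in COS under the new object name."""
    key = new_name + ".parquet"  # add the extension to the new object
    cos_client.put_object(Bucket=bucket, Key=key, Body=parquet_bytes)
    return key
```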
Now that we understand how the converter works, let’s look at how to use the tool to start converting objects to Parquet.
Installing and using the tool
You will need to have Python 3 installed on your system to use the tool. The tool is readily available to install using pip:
Now that it’s installed, typing `csvtoparquet -h` will show you the available options. However, let’s examine the required parameters first:
Under “usage,” everything in brackets `[ ]` is optional, while the flags `-a`, `-e`, and `-b` are required.
The `-a` flag takes your IBM Cloud API key, which you can get in your IBM Cloud account by clicking Manage > Security > Platform API Keys.
For `-e`, you’ll need the endpoint for the location of the COS bucket where your CSV objects reside. That’s available in the COS management panel by clicking Endpoint.
For `-b`, you need the name of the bucket where your objects are located.
Once you have those, you’re able to start using the optional commands that come with the tool:
The `-l` and `-c` flags are commands that only list the objects currently stored in a COS bucket: `-l` lists all the objects in a specified bucket, while `-c` lists only the CSV objects. All this does internally is check whether an object’s name ends with `.csv`. Let’s look at this using some data in a COS bucket.
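That CSV check amounts to a filename test, roughly like this (a sketch; the real tool’s variable names may differ):

```python
object_names = ["201801-tripdata.csv", "readme.txt", "201802-tripdata.csv"]

# Keep only objects whose key ends with the .csv extension
csv_objects = [name for name in object_names if name.endswith(".csv")]
print(csv_objects)  # → ['201801-tripdata.csv', '201802-tripdata.csv']
```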
For this example, I’ve uploaded three months of CSV data (January to March 2018) from a New York City bike share company called Citi Bike to my COS bucket. Once the data has been uploaded, run `csvtoparquet` with the `-c` flag to show the CSV objects. If you already have CSV objects in your bucket, all of them will appear in your results.
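The listing invocation looks like this, where `<API_KEY>`, `<ENDPOINT>`, and `<BUCKET_NAME>` are placeholders for your own values (so this is illustrative rather than directly runnable):

```shell
csvtoparquet -a <API_KEY> -e <ENDPOINT> -b <BUCKET_NAME> -c
```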
As you can see, the command provides the object name (Key), the date it was uploaded or modified, a tag, the size in bytes, and the COS storage class. When these objects are converted to Parquet, you’ll see a significant reduction in object size, because Parquet objects are compressed and thus optimized for SQL Query use.
In order to convert these CSV objects, you’ll need to use the `-f` and `-n` flags together. The `-f` flag takes a list of the stored COS objects that you want to convert, and the `-n` flag takes a list of new names for the Parquet objects (you can add prefixes). Both lists are space-separated; don’t separate the names with commas. So, to convert the three CSV objects in the example above, we select them by name and then give each one a new name. In the command below, I’ve added a prefix just to separate the new Parquet objects from those I’ve uploaded.
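The command looks like the following (the object names are illustrative, following Citi Bike’s monthly naming; `<API_KEY>`, `<ENDPOINT>`, and `<BUCKET_NAME>` are placeholders for your own values):

```shell
csvtoparquet -a <API_KEY> -e <ENDPOINT> -b <BUCKET_NAME> \
  -f 201801-citibike-tripdata.csv 201802-citibike-tripdata.csv 201803-citibike-tripdata.csv \
  -n trip/201801 trip/201802 trip/201803
```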
Running this command, you’ll see it convert the CSV objects and report the name of each new Parquet object:
Looking in your COS bucket, or using the `csvtoparquet` tool with the `-l` flag, we can view the new Parquet objects in the bucket along with our original CSV objects. Below are the new Parquet objects we created with the tool.
Notice the significant size difference of the Parquet objects. By converting them to Parquet, we’ve reduced each object from 3-4 MB to 300-400 KB. As a result, any SQL queries we run in SQL Query will be faster. Running the following query on the original CSV objects takes about 30 seconds:
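The CSV query looked something like this (an illustrative reconstruction, not the exact original: the bucket URI, `us-geo` location alias, and selected columns are placeholders, and the backticks quoting column names that still contain spaces are an assumption based on SQL Query’s Spark SQL dialect):

```sql
-- Illustrative: substitute your own bucket name and columns
SELECT `start station id`, `end station id`, COUNT(*) AS trips
FROM cos://us-geo/<BUCKET_NAME>/*-citibike-tripdata.csv STORED AS CSV
GROUP BY `start station id`, `end station id`
```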
We use `*` as a wildcard in the query to join all three CSV objects together without having to use the SQL `JOIN` clause. To run the query over the converted Parquet objects, we only need to modify the object name and change `start station id` and `end station id` to `start_station_id` and `end_station_id`, since we substituted `_` for spaces in the column names. Additionally, we can use the wildcard `*` after the `trip/` prefix to join all of our trip Parquet objects, along with `STORED AS PARQUET` to indicate that the objects use the Parquet file format:
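The Parquet version of the query follows the same shape (again an illustrative reconstruction: only the object path, the underscore column names, and the storage format change):

```sql
-- Illustrative: the trip/* wildcard matches all converted Parquet objects
SELECT start_station_id, end_station_id, COUNT(*) AS trips
FROM cos://us-geo/<BUCKET_NAME>/trip/* STORED AS PARQUET
GROUP BY start_station_id, end_station_id
```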
Running this query, you’ll see a significant reduction in time: the query drops from 30 seconds to just 8 seconds. That speedup is a good indicator of the advantage querying Parquet objects has over CSV. That said, if your objects are smaller than about 2 MB, the difference in query times between CSV and Parquet might be negligible for your use case.
The new tool I’ve developed to get you started converting CSV objects to Parquet is easy to install and try out. One caveat is that it’s meant for smaller objects, less than 2 GB in size, because it uses your system’s resources to make the conversion rather than the cloud. At the moment, we’re experimenting with turning the tool into a serverless function with IBM Cloud Functions, and we’ll write a blog post about that soon. In the meantime, use the tool and give us some feedback about how you use it and any improvements you’d like to see.