How To Convert CSV Objects to Parquet in Cloud Object Storage

5 min read

Convert CSV objects to Parquet in Cloud Object Storage

IBM Cloud SQL Query is a serverless solution that allows you to use standard SQL to quickly analyze your data stored in IBM Cloud Object Storage (COS) without ETL or defining schemas. SQL Query allows you to query multiple types of data in your COS buckets—including CSV, JSON, and Parquet—and each one has its benefits. However, if you’re looking to lower storage costs by compressing your data and get better query performance when querying the data stored in COS, you might have considered how to convert your data into Parquet. We have you covered.

If you’re starting out with a bucket of CSV objects and you want to convert those to Parquet, I’ve developed a simple command-line tool written in Python that’ll help you do that called csvtoparquet. All you need to do is list the names of the CSV objects in your COS bucket that you want to convert, provide those objects with a new name (you can add prefixes to the names as well), and it will use Apache Arrow to convert that CSV data to Parquet and store it back into the COS bucket.

In this article, we’ll show you how to set up the tool and get started using it.

How the tool converts CSV objects

csvtoparquet uses Pandas and Apache Arrow to convert your CSV objects to Parquet. First, the tool retrieves the objects without storing them on your system using the Python IBM COS SDK:

def _get_object(self, obj):
        """ get the body of an object """

        cos = self.connection
        cos_object = cos.get_object(Bucket=self.cos_bucket, Key=obj)
        body = cos_object['Body'].read().decode('utf-8')

        return body

Those objects are then read with Pandas read_csv and converted to a dataframe:

def _read_csv_to_df(self, obj):
        """ read the object contents and put them into a panda's dataframe """

        get_object = self._get_object(obj)
        buff = StringIO(get_object)

            df = pd.read_csv(buff)

            # replaces spaces in column names like `sample column` with `sample_column`
            cols = df.columns.str.strip().str.replace(' ', '_')
            df.columns = cols   

            return df
        except Exception:
            logging.error("Can't create panda's dataframe from object")

CSV objects that have column names with spaces need to have the spaces converted to _, otherwise, Parquet throws an error.

Next, Pandas has a to_parquet method that will convert the dataframe to Parquet. The engine that we’ve selected to make the conversion is pyarrow, which is the Python library for Apache Arrow. That’s installed as a dependency of the tool:

def _convert_to_parquet(self, obj, new_obj_name):
        """ convert the pandas dataframe to Parquet """

        df = self._read_csv_to_df(obj)
        parquet_obj = BytesIO()
        df.to_parquet(parquet_obj, compression="gzip", engine="pyarrow")
        return self._save_object(new_obj_name, parquet_obj)

Once the object is converted to Parquet, we’ll save it back to the COS bucket and add the .parquet extension to the new object:

def _save_object(self, new_obj_name, parquet_data):
        """ save the body of the object and rename with the .parquet file extension """

        cos = self.connection
        new_obj_name_parquet = "{}.parquet".format(new_obj_name)
        return cos.put_object(Body=parquet_data.getvalue(), Bucket=self.cos_bucket, Key=new_obj_name_parquet)

Now that we understand how the converter works, let’s move to understanding how to use the tool to start converting objects to Parquet.

Installing and using the tool

You will need to have Python 3 installed on your system to use the tool. The tool is readily available to install using pip:

pip install csvtoparquet

Now that it’s installed, typing csvtoparquet -h will provide you with the options that are available. However, let’s examine the required parameters first:

usage: csvtoparquet [-h] -a APIKEY -e COS_ENDPOINT -b COS_BUCKET [-l] [-c]
                    [-f FILES [FILES ...]] [-n NAMES [NAMES ...]]

Under “usage,” everything in brackets [ ] is optional, while the flags -a-e, and -b are required.

  • The-a flag is your IBM Cloud API Key that you can get in your IBM Cloud account by clicking Manage > Security > Platform API Keys.

  • For -e, you’ll need the endpoint for the location of the COS bucket where your CSV objects reside. That’s available in the COS management panel by clicking on Endpoint.

  • For -b, you need the name of the bucket where your objects are located.

Once you have those, you’re able to start using the optional commands that come with the tool:


Options -l and -c are commands that only list the objects that are currently stored in a COS bucket. -l lists all the objects in a specified bucket, while -c will list objects that are only CSV. All this does internally is check the object name to see if it has .csv attached. Let’s look at this using some data in a COS bucket.

For this example, I’ve uploaded three months of CSV data (January to March 2018) from a New York City bike share company called Citi Bike to my COS bucket. Once the data has been uploaded, run csvtoparquet using the -c flag to show the CSV objects. If you already have CSV objects in your bucket, all of them will appear in your results.

csvtoparquet -a API_KEY -e -b mydata -c
        "Key": "201801_citibikejc_tripdata.csv",
        "LastModified": "2018-06-23T02:31:31.031000+00:00",
        "ETag": "\"536c4c2cf6f9b5ed9ae0bb090a157c4d\"",
        "Size": 2523746,
        "StorageClass": "STANDARD"
        "Key": "201802_citibikejc_tripdata.csv",
        "LastModified": "2018-06-23T02:31:34.224000+00:00",
        "ETag": "\"d3c8cbe0082d11f5c6d5c005ec04a2d1\"",
        "Size": 3008614,
        "StorageClass": "STANDARD"
        "Key": "201803_citibikejc_tripdata.csv",
        "LastModified": "2018-06-23T02:32:18.667000+00:00",
        "ETag": "\"684925a792d9a2afaa7bc54f3c911762\"",
        "Size": 3402749,
        "StorageClass": "STANDARD"

As you can see, the command will provide you with the object name (Key), the date it was uploaded or modified, a tag, the size in bytes, and the COS storage class. When converting these objects to Parquet, you’ll see a significant size reduction of the object size because they are compressed and thus optimized for SQL Query use.

In order to convert these CSV objects, you’ll need to use the -f and -n flags together. The-f flag should be a list of objects that you want to convert stored in COS, and the -n flag is a list of new names of the objects in Parquet (you can add prefixes). Both lists should not have commas separating the objects. So, if we wanted to convert the three CSV objects in the example above, we’ll need to select them by name and then rename them. In the command below, I’ve added a prefix just to separate the new Parquet objects from those I’ve uploaded.

csvtoparquet -a API_KEY -e -b mydata -f 201801_citibikejc_tripdata.csv 
201802_citibikejc_tripdata.csv 201803_citibikejc_tripdata.csv -n trip/01 trip/02 trip/03

Running this command, you’ll see that it’s converting the CSV objects and providing you with the new Parquet object extension:

csvtoparquet -a API_KEY -e -b mydata -f 201801_citibikejc_tripdata.csv 
201802_citibikejc_tripdata.csv 201803_citibikejc_tripdata.csv -n trip/01 trip/02 trip/03
Now Converting: 201801_citibikejc_tripdata.csv --> trip/01.parquet
Now Converting: 201802_citibikejc_tripdata.csv --> trip/02.parquet
Now Converting: 201803_citibikejc_tripdata.csv --> trip/03.parquet

Querying data

Looking in your COS bucket or using the csvtoparquet tool with the -l flag, we can view the new Parquet objects in the bucket along with our original CSV objects. Below are the new Parquet objects we created with csvtoparquet:

        "Key": "trip/01.parquet",
        "LastModified": "2018-08-14T00:10:58.173000+00:00",
        "ETag": "\"4a42038cbd6f19365d345dee2d10360d\"",
        "Size": 314153,
        "StorageClass": "STANDARD"
        "Key": "trip/02.parquet",
        "LastModified": "2018-08-14T00:11:11.833000+00:00",
        "ETag": "\"c135c35caee3359320e60b9ec9e1d4d5\"",
        "Size": 372361,
        "StorageClass": "STANDARD"
        "Key": "trip/03.parquet",
        "LastModified": "2018-08-14T00:11:29.132000+00:00",
        "ETag": "\"dd7467dcd209d10f1df35d7c2d23efd2\"",
        "Size": 426614,
        "StorageClass": "STANDARD"

Notice the significant size difference of the Parquet objects. By converting them to Parquet, we’ve reduced their size from 3-4MB to 300-400KB. As a result, the speed of any SQL queries we run in SQL Query will be increased. Running the following query on the original CSV objects will take about 30 seconds to run:

        `start station id`,
        `end station id`,
FROM cos://us-geo/geodata/2018* 
WHERE bikeid = 31929 AND `start station id` != `end station id`
ORDER BY tripduration DESC

We use * as a wildcard in the query to join all three CSV files together without having to use the SQL JOIN clause. To run the query over the converted Parquet objects, we’ll only need to modify the name of start station id to start_station_id and end station id to end_station_id since csvtoparquet added _ for spaces in column names. Additionally, we can use the wildcard * after the trip/ prefix to join all of our trip Parquet objects along with STORED AS PARQUET to indicate the objects use the Parquet file format:

FROM cos://us-geo/geodata/trip/* STORED AS PARQUET
WHERE bikeid = 31929 AND `start station id` != `end station id`
ORDER BY tripduration DESC

Running this query, you will see a significant time reduction as it goes from a 30-second query to just 8 seconds. The significant time reduction is a very good indicator of the advantage that querying Parquet objects has over CSV. Nonetheless, if our objects were any smaller than 2MB, the query times between CSV and Parquet objects might be negligible for your use case.

Summing up

The new tool that I’ve developed to get started converting your CSV objects to Parquet is easy to install and try out. One caveat is that it’s meant for smaller objects  that are less than 2 GB in size because it uses your system’s resources to make the conversion, not the cloud. At the moment, we’re experimenting with turning the tool into a serverless function with IBM Cloud Functions. We will be writing a blog post about how to do that soon. In the mean time, use the tool and give us some feeback about how you use it and some improvement’s that you’d like to see made.

Get started with IBM Cloud SQL Query

Be the first to hear about news, product updates, and innovation from IBM Cloud