December 18, 2018 By Daniel Pittner 3 min read

IBM Cloud SQL Query opens up serverless data transformation in Cloud Object Storage

We’re excited to announce that IBM Cloud SQL Query now allows you to specify the format and layout in which a result for a SQL query is written. By adding these abilities, we’re opening up serverless data transformation in IBM Cloud Object Storage.

SQL Query is a serverless, interactive querying service used for analyzing data stored in Cloud Object Storage. Users can query data directly where it is stored with no ETL, databases, or infrastructure to manage and seamlessly create an active workspace for a range of big data analytics use cases. With these latest enhancements, users can now lay out data for better efficiency, cost, and speed with the power of ANSI SQL.

Format transformation

You can now add an INTO clause like this to your SQL statement:

SELECT * from cos://<endpoint>/<bucket>/<data set prefix> STORED AS CSV LIMIT 10 
INTO cos://<endpoint>/<bucket>/<result prefix> STORED AS PARQUET

With an INTO clause, you can now control the format of your result. Replace STORED AS PARQUET with STORED AS ORC, STORED AS JSON, STORED AS CSV, or STORED AS AVRO as necessary for your use case.

This means that you can now transform your data from a format that doesn’t give optimal query performance, like CSV or JSON, to a very query friendly format like Apache Parquet with a single statement.

SELECT * from cos://<endpoint>/<bucket>/large-csv-files/ STORED AS CSV
INTO cos://<endpoint>/<bucket>/optimized/ STORED AS PARQUET

Data partitioning layout transformation

To optimize the layout of your data, you may already be following the best practices in this blog article. SQL Query now helps you re-layout your data for better efficiency, cost, and speed by simply running a single SQL statement. In addition to the format, you can also specify a custom data partitioning in the new INTO clause.

SQL Query can produce three types of partitioning layouts (see blog article):

1. Generating hive-style partitioning for one or more columns of your result data:

SELECT * from cos://<endpoint>/<bucket>/<data set prefix> STORED AS CSV 
INTO cos://<endpoint>/<bucket>/<result prefix> STORED AS PARQUET PARTITIONED BY (<column1>, <column2> ...)

When querying the result with subsequent queries, SQL Query or any runtime that is aware of hive-style partitioning can exploit information encoded into the object names to avoid unnecessary access to object storage.
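The pruning idea can be illustrated outside of SQL Query itself. The following sketch (not the engine's actual implementation, and with hypothetical object keys) shows how hive-style object names let a reader skip objects that cannot match a predicate:

```python
# Sketch: partition pruning over hive-style object names.
# The object keys below are hypothetical examples, not real SQL Query output.
def prune_keys(keys, column, value):
    """Keep only object keys whose hive-style path segment matches value."""
    segment = f"{column}={value}"
    return [k for k in keys if f"/{segment}/" in k]

keys = [
    "result/SENSOR_ID=1/part-00000.parquet",
    "result/SENSOR_ID=2/part-00000.parquet",
    "result/SENSOR_ID=2/part-00001.parquet",
]

# A query with WHERE SENSOR_ID = 2 only needs to touch two of the three objects.
print(prune_keys(keys, "SENSOR_ID", 2))
```

Because the partition value is encoded in the object name, no object has to be opened to decide whether it is relevant.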

2. Segmenting results into a fixed number of objects:

SELECT * from cos://<endpoint>/<bucket>/<data set prefix> STORED AS CSV 
INTO cos://<endpoint>/<bucket>/<result prefix> STORED AS PARQUET PARTITIONED INTO 3 BUCKETS

This is very useful if you want to optimize the layout for processing with IBM Analytics Engine using an Apache Spark cluster. By aligning the number of objects with the number of nodes and executors, huge gains in efficiency and speed are possible.
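The alignment reasoning can be sketched with a small calculation (the node and executor counts here are illustrative assumptions, not recommendations):

```python
# Sketch: choosing a bucket count that matches a Spark cluster's parallelism.
def ideal_bucket_count(nodes, executors_per_node, cores_per_executor):
    """One result object per task slot lets every core read input in parallel."""
    return nodes * executors_per_node * cores_per_executor

# e.g. 3 nodes x 1 executor x 1 core -> PARTITIONED INTO 3 BUCKETS
print(ideal_bucket_count(3, 1, 1))
```

If there are fewer objects than task slots, some cores sit idle; many more objects than slots adds per-object overhead, so a count near the total slot number is a reasonable starting point.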

3. Creating paginated result objects:

SELECT * from cos://<endpoint>/<bucket>/<data set prefix> STORED AS CSV 
INTO cos://<endpoint>/<bucket>/<result prefix> STORED AS JSON PARTITIONED EVERY 30 ROWS

Using this clause is very useful if you’re building an application and want to paginate results. By writing one object per page, your application can easily page through the result by fetching one object that represents a single page.
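A minimal sketch of the page-to-object mapping follows. Note that the object naming scheme here is a hypothetical illustration, not the exact key format SQL Query emits; in practice you would list the result prefix to discover the real object names:

```python
# Sketch: paging through a result written with PARTITIONED EVERY 30 ROWS.
PAGE_SIZE = 30  # must match the EVERY ... ROWS value in the SQL statement

def rows_for_page(page):
    """Row range (inclusive start, exclusive end) covered by a page object."""
    start = page * PAGE_SIZE
    return start, start + PAGE_SIZE

def object_key_for_page(prefix, page):
    """Hypothetical key layout: one zero-padded part object per page."""
    return f"{prefix}part-{page:05d}"

print(rows_for_page(2))
print(object_key_for_page("result/", 2))
```

Each page request then becomes a single object fetch instead of a scan over the whole result.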

If you don’t specify an INTO clause, we’ll keep writing the result as a single CSV object.

This means that conversion from one format to another also becomes a lot easier. The multi-step process of converting CSV to Apache Parquet described in this blog post can now be reduced to a single statement.

Example

The table below compares query cost and run times between different data formats and layouts. All data sets hold 2.47 GB of data, stored once as CSV, once as Parquet, and once as partitioned Parquet.

Both Parquet data sets used in this example have been generated using the following statements. Adapt them to your data to optimize dataset layout and see similar benefits.

Converting CSV to Parquet:

SELECT * from cos://us-geo/iot/csv/2017/09/ STORED AS CSV
INTO cos://us-geo/iot/parquet/2017/09/ STORED AS PARQUET

Converting CSV to Parquet with Hive-Style partitioning on SENSOR_ID:

SELECT * from cos://us-geo/iot/csv/2017/09/ STORED AS CSV
INTO cos://us-geo/iot/parquetBySensorId/2017/09/ STORED AS PARQUET PARTITIONED BY (SENSOR_ID)

Just by converting to Parquet—a columnar, compressed format—the query executes roughly seven times (6.8x) faster, while the cost for the query is reduced from $0.12 to $0.003 (38x less cost). When data is partitioned in a query-efficient way, the cost drops even further. The query price is only the minimum charge per query (10 MB, or $0.0047) for this example.

When using the IBM Cloud REST API directly, you can now omit the result_target parameter if you specify an INTO clause. We’ve also updated the SQL Query Python SDK to include support for these new features.

You can also use this self-service Starter Notebook to interactively experiment with the new data partitioning capabilities.

For more information, consult the IBM Cloud SQL Query documentation.
