Don’t treat your data like an object

By: Thom Crowe

Don’t treat your data like an object

Gaining rapid insight into object storage data with IBM® Cloud SQL Query

  • Why does object storage often create a trade-off between cost-efficiency and analyzability?

  • Is there a way to sidestep the complexity of data warehouse-based analysis?

  • How can we gain insight without building and maintaining yet another set of analytics tools?

In recent years, object storage has become a dominant paradigm for large-scale, cost-effective data storage. Solutions like IBM Cloud Object Storage are both highly scalable and extremely resilient, utilizing ingenious algorithms to distribute data across a cluster of storage systems and provide a high level of fault tolerance. At the same time, these algorithms minimize the number of complete copies of the data that need to be stored, which reduces total data volumes and keeps costs low.

IBM Cloud Object Object Storage is a great option for countless uses in storing structured data, but if you need built-in data processing capabilities of a relational or NoSQL database, it may not be the most ideal fit. It’s difficult to perform large-scale transformations or analytics tasks. The good news is, you don’t have to let cost-efficient storage of IBM Object Storage translate to data that becomes opaque.

Working around the problem

One response is simply to avoid storing structured data in object storage if it needs to be transformed or analyzed regularly. This approach uses object storage primarily as a repository for “cold” data—archives, log files, and information that needs to be stored for regulatory or legal reasons, but rarely needs to be accessed.

However, this limits the usefulness of object storage and also fails to address the problem of what happens when you eventually do need to analyze the data after all.or want to analyze fresh data in pipeline (For example, if there’s a security breach in your network, the ability to analyze server logs could become extremely important; or if your organization needs to go through a legal discovery process to defend itself against a lawsuit, insight into your archives might prove to be priceless).

Extract, transform and load

A more effective strategy, then, is to find ways to integrate object storage data into your main enterprise analytics process when necessary—for example, by building extract, transform and load (ETL) processes that allow you to move data temporarily into a data warehouse for analysis.

The problem is that these ETL processes are often complex to build and maintain, slow to execute, and greedy for processor power and network bandwidth. They place an extra burden on your data warehouse, which is probably already fully occupied with serving the day-to-day reporting needs of the business. Scheduling ETL jobs for rarely analyzed data sets may disrupt business-as-usual reporting, or require considerable extra effort from your data warehousing team to work around the problem.

The problem is that moving data is always expensive—the ideal solution would be to find a way to analyze object storage data where it lives. And that’s where IBM Cloud SQL Query comes in.

Introducing IBM Cloud SQL Query

As a cloud-native service available via IBM Cloud, SQL Query uses Apache Spark and Spark SQL to allow users to execute queries directly against an object storage repository. It supports common structured data formats, such as CSV, JSON and Parquet, so as long as your data can be exported to one of these file-types, you can store it in IBM Cloud Object Storage and execute queries against it with just a few clicks.

By allowing you to analyze your data in situ, SQL Query frees you from the need to build complex ETL pipelines, or to add to the workload of your data warehousing team. It also eliminates the need to wait for the data loading process to finish before analysis can begin: once you submit your request, SQL Query spins itself up and starts processing data almost instantly. Moreover, since the service is built on the lightning-fast Spark architecture, it is often able to return results in seconds, even for complex queries on very large data sets.

Going serverless

If these capabilities sound too good to be true, it is only natural to be skeptical. Many IT teams would be wary of adding a new analytics solution that is completely separate from the main data warehouse environment, because every time a new system is introduced, it creates additional costs and maintenance effort.

SQL Query mitigates these issues because it is built on a serverless architecture, which means it spins up when it is needed, and spins down again as soon as its work is done. Not only is there no physical hardware or software for your organization to deploy, maintain and upgrade—you don’t even need to pay a monthly subscription to use the service. Instead, you simply pay a one-off fee per query, which is calculated based on the amount of data that each query needs to process.

As a result, the costs and risks involved in adopting SQL Query are extremely low. You can start and stop using the service whenever you wish—and when you are not using it, you pay nothing.

Get started today

SQL Query provides a fast, convenient, low-impact solution for many different use cases, from analyzing server logs and data archives to gaining quick insight into clickstream or IoT data. What’s more, since it supports the full ANSI-standard SQL syntax, it can also be used for data transformation tasks; for example, aggregating time-series data, or compressing CSV or JSON data into a more compact Parquet format.

SQL Query can give your business a quick, easy and affordable way to analyze and transform structured data in object storage. The service is currently in open beta, and you can use it free of charge during the beta period—just sign up for a free IBM Cloud account today, and take a look at our guide to help you get started.

Be the first to hear about news, product updates, and innovation from IBM Cloud