What is Presto?

Presto, or Presto database (PrestoDB), is an open-source, distributed SQL query engine that can query large data sets from different data sources, enabling enterprises to address data problems at scale.

Presto gives organizations of all sizes a fast, efficient way to analyze big data from various sources, including on-premises systems and the cloud. It also helps businesses query petabytes of data using their current SQL capabilities, without having to learn a new language.

Today, Presto is most commonly used for running queries on Hadoop and other common data storage providers, enabling users to manage multiple query languages and interfaces to databases and storage. 

In the digital age, big data analytics is fast becoming a core competency for enterprises regardless of size or industry. The ability to gather, store and analyze large amounts of data relating to business processes, customer preferences and market trends is extremely valuable. Presto’s primary importance to data analytics is its ability to analyze data regardless of where the data is being stored, and without having to move it into a more structured system first, such as a data warehouse or data lake.


The benefits of Presto

Presto has become a popular tool for data scientists and engineers dealing with multiple query languages, siloed databases and different types of storage. Its high-performance capabilities enable users to query large volumes of data in real time, regardless of where the data is located, using a simple ANSI SQL interface. Presto’s speed and performance at executing queries on large volumes of data have made it an indispensable tool for some of the largest companies in the world, including Facebook, Airbnb, Netflix, Microsoft, Apple and AWS (Athena and Amazon S3).

Presto architecture is unique in that it is built to query data no matter where the data is being stored, making it more scalable and efficient than other, similar solutions. Presto queries allow engineers to use data without having to physically move it from location to location. This is an important capability to have as organizations deal with an ever-increasing amount of data they need to store and analyze.

Presto was built to empower data scientists and engineers to interactively query vast amounts of data regardless of the source or type of storage. Because Presto doesn’t store data, but rather communicates with a separate database for its queries, it is more flexible than its competitors and can scale queries up or down swiftly based on the shifting needs of the organization. According to an IBM whitepaper, Presto, optimized for business intelligence (BI) workloads, can help enterprises optimize the pricing of their data warehouses and reduce costs by up to 50 percent. 

Here are some of the key benefits to using a Presto workflow:

Lower costs: As the size of data warehouses and the number of users conducting queries grow, it’s not uncommon for enterprises to see their costs rapidly increase. Presto, however, is optimized for large numbers of small queries, making it easy to query any amount of data while also keeping costs down. Also, since Presto is open source, there are no fees associated with deploying it, which can result in significant savings for enterprises looking to process large volumes of data.

Increased scalability: It’s common for engineers to set up multiple engines and languages on a single data lake storage system, which can make it necessary to re-platform in the future and limit the scalability of their solution. With Presto, all queries are conducted using the universal ANSI SQL language and interface, making re-platforming unnecessary. Additionally, Presto can be used for both small and large amounts of data and easily scaled up from one or two users to thousands. Presto replaces the need to deploy multiple compute engines with unique SQL dialects and APIs, making it an ideal tool for scaling workloads that could otherwise be too complex and time-intensive for teams of engineers and data scientists to handle.

Better performance: While many query engines that run SQL on Hadoop are restricted in their compute performance because they are built to write their results to disk, Presto’s distributed in-memory model enables it to run large numbers of interactive queries at once against large data sets. Following a classic massively parallel processing (MPP) design, Presto splits each query into tasks that run in parallel across its worker nodes and uses in-memory streaming shuffle between stages to increase its processing speeds even more. Executing tasks in memory avoids writing to and reading from disk between stages and shortens the time of each query execution, making Presto a lower-latency option than its competitors.

Improved flexibility: Presto uses a plug-and-play model for all its data sources including Cassandra, Kafka, MySQL, Hadoop distributed file system (HDFS), PostgreSQL and others, making querying across them faster and easier than with other comparable tools that lack this functionality. Also, Presto’s flexible architecture means it isn’t restricted to a single vendor but runs on most Hadoop distributions, making it one of the most portable tools available.

While Presto isn’t the only SQL-on-Hadoop option available to developers and data engineers, its unique architecture, which keeps query functionality separate from data storage, makes it one of the most flexible. Unlike other tools, Presto separates the query engine from the data storage and uses connectors to communicate between them. This added functionality gives engineers more flexibility than other tools in how they construct solutions using Presto.

How does Presto work?

Presto uses an MPP database management system with one coordinator node that works in tandem with other nodes. A Presto ecosystem is made up of three server types: a coordinator, workers and a resource manager.

Coordinator: A coordinator is considered the “brains” of a Presto installation. It is responsible for some of the most critical tasks including parsing statements, planning queries and managing Presto worker nodes. Ultimately, it is responsible for retrieving data from the worker nodes and delivering the results to the client.

Worker: A worker is responsible for executing the tasks assigned to it by the coordinator, fetching data through the connectors and exchanging intermediate data with other workers.

Resource manager: The resource manager gathers data from all the coordinator and worker nodes and creates a global view of the Presto cluster.
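The division of labor between the coordinator and the workers can be pictured with a toy simulation. The sketch below is illustrative only (Presto itself is a distributed Java system, not this Python snippet): a "coordinator" function splits a query's work into per-partition tasks, fans them out to "worker" threads, and merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a table split into partitions, as a
# connector might expose it to the engine.
PARTITIONS = [
    [("us", 10), ("eu", 5)],
    [("us", 7), ("apac", 3)],
    [("eu", 2), ("apac", 8)],
]

def worker_task(partition):
    """Each 'worker' aggregates its own partition in memory."""
    partial = {}
    for region, amount in partition:
        partial[region] = partial.get(region, 0) + amount
    return partial

def coordinator(partitions):
    """The 'coordinator' schedules tasks across workers and merges results."""
    final = {}
    with ThreadPoolExecutor(max_workers=3) as pool:
        for partial in pool.map(worker_task, partitions):
            for region, amount in partial.items():
                final[region] = final.get(region, 0) + amount
    return final

print(coordinator(PARTITIONS))  # {'us': 17, 'eu': 7, 'apac': 11}
```

The key point the sketch mirrors is that no single node sees the whole data set: each worker processes only its slice, and only small partial aggregates flow back to the coordinator.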

When the Presto coordinator receives a SQL query from a user, the first thing it does is parse and analyze the statement, plan the query and schedule a distributed plan across the other nodes. The Presto REST API is used to submit query statements for execution on a server and to retrieve the results for the client. Presto supports standard ANSI SQL semantics, including joins, sub-queries and aggregations. Once it has compiled the query, Presto splits the request into stages that are executed across the worker nodes.
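The ANSI SQL constructs mentioned above, joins, sub-queries and aggregations, are standard across engines. The sketch below runs all three against an in-memory SQLite database purely as a local, self-contained stand-in; against a Presto cluster, the same statement would be submitted through the REST API or a client driver instead.

```python
import sqlite3

# In-memory SQLite as a local stand-in for tables that Presto
# would reach through its connectors.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'lin');
    INSERT INTO orders VALUES (1, 30.0), (1, 12.5), (2, 7.5);
""")

# One plain ANSI SQL statement combining a join, an aggregation
# and a sub-query: customers who spend more than the average order.
rows = con.execute("""
    SELECT u.name, SUM(o.total) AS spend
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    HAVING SUM(o.total) > (SELECT AVG(total) FROM orders)
    ORDER BY spend DESC
""").fetchall()

print(rows)  # [('ada', 42.5)]
```

Because the SQL itself is standard, the query text stays the same whether the tables live in SQLite, Hive or PostgreSQL; only the connection underneath changes.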

Since Presto was built on the concept of data abstraction, it is extensible to any data source and can easily query data sources such as data lakes, data warehouses and relational databases. Data abstraction is a programming process that allows data to be stored and manipulated more efficiently by separating its representation from its physical storage. This abstraction allows the query engine to focus exclusively on the aspects of the data that are relevant to its query. Using the process of data abstraction, data is queried wherever it is being stored rather than once it has been moved into another analytics system.
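One way to picture this abstraction is a minimal connector interface: the engine programs against a single scan method, and each source implements it however its physical storage requires. This is an illustrative sketch only, not Presto's actual connector SPI (which is Java-based); the two connector classes below are hypothetical.

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """What the engine sees: rows, regardless of physical storage."""
    @abstractmethod
    def scan(self, table):
        ...

class CsvConnector(Connector):
    # Hypothetical source backed by CSV text.
    def __init__(self, files):
        self.files = files
    def scan(self, table):
        for line in self.files[table].strip().splitlines():
            yield tuple(line.split(","))

class DictConnector(Connector):
    # Hypothetical source backed by in-memory Python tuples.
    def __init__(self, tables):
        self.tables = tables
    def scan(self, table):
        yield from self.tables[table]

def count_rows(connector, table):
    # The 'engine' never needs to know where the data lives.
    return sum(1 for _ in connector.scan(table))

csv_src = CsvConnector({"events": "1,click\n2,view\n3,click"})
dict_src = DictConnector({"users": [(1, "ada"), (2, "lin")]})
print(count_rows(csv_src, "events"), count_rows(dict_src, "users"))  # 3 2
```

The design choice this illustrates is the one the article describes: query logic written once against the interface works unchanged across every source, so data never has to be moved into a single system first.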

A brief history of Presto

Initially developed at Facebook to run interactive queries on a massive Apache Hadoop data warehouse, Presto’s developers always envisioned it as open-source software and sought to make it free for commercial use so anyone could use it for data analytics and data management. In 2013, it was open-sourced on GitHub for anyone to download under the Apache Software license. In 2019, three of the original members of the Presto development team left the project and founded a “fork” of Presto under the Presto Software Foundation, known more commonly as PrestoSQL (later renamed Trino).

The Linux Foundation and other open-source communities offer webinars and training on Presto in English and other languages for engineers and developers looking to obtain certification. These forums are also a good place to visit to find out what’s new in Presto.

Presto use cases 

Presto enables organizations to query large-scale data repositories and NoSQL databases quickly and efficiently for a variety of business purposes. Here are some of its most common use cases:

Ad-hoc queries

Presto enables speedy data exploration and straightforward reporting for a variety of business purposes. Using popular Presto connectors, such as Hive, MongoDB or Cassandra, users can query data they have an interest in and get results in seconds. With its speed and flexibility, Presto empowers users to iterate on and further explore data sets, regardless of where they reside.

Here are some of the most widely used data repositories that Presto can connect to:

  • BigQuery
  • HDFS
  • Cloud Storage
  • Cloud SQL for MySQL
  • Apache Cassandra or Kafka
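For ad-hoc queries from Python, one option is the open-source presto-python-client package, which exposes a standard DBAPI interface. This is a hedged sketch, not a definitive recipe: the host, port, user, catalog and schema values below are placeholders for your own cluster, and the orders table is hypothetical.

```python
# An ad-hoc query, expressed once in ANSI SQL regardless of
# which connector ultimately serves the data.
QUERY = """
    SELECT order_date, COUNT(*) AS orders
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
"""

def run_adhoc(host="presto.example.com", port=8080):
    # Assumes the open-source 'presto-python-client' package
    # (pip install presto-python-client) and a reachable cluster;
    # all connection parameters here are placeholders.
    import prestodb

    conn = prestodb.dbapi.connect(
        host=host,
        port=port,
        user="analyst",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()
    cur.execute(QUERY)
    return cur.fetchall()
```

Because the driver follows the DBAPI convention, swapping the catalog from Hive to, say, Cassandra changes only the connection arguments; the query text stays the same.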
Cloud and hybrid cloud deployments

According to a 2021 performance assessment by Red Hat, the growing use of hybrid cloud environments by enterprises is placing increased pressure on cloud-native storage, for which Presto, “the fastest distributed query engine available today,” is ideal.1 Moving workloads from an on-premises environment to a cloud or hybrid cloud infrastructure has many benefits, including increased performance and scalability. Presto’s architecture makes it a strong choice for such deployments because it can be launched in a few minutes without additional provisioning, configuration or tuning.

Machine learning (ML)

Presto helps engineers prepare data and perform feature engineering and extraction in a highly efficient way that ensures it is ready for machine learning (ML). Its number of connectors, SQL engine and querying capabilities make it ideal for engineers seeking fast, easy access to large volumes of data. Additionally, Presto has tools designed specifically for ML functions such as aggregation, which allow data scientists to train support vector machine (SVM) classifiers and regressors to address supervised learning problems.

Reporting

Presto allows for data to be queried from multiple sources generating a single, easily accessible report or dashboard for BI purposes. Presto is simple and easy enough to use that analysts can conduct queries and create reports without the help of engineers.

Analytics

Presto enables analysts to conduct queries on both structured and unstructured data directly on a data lake without going through a data transformation process.

Data preparation

The process of gathering and preparing data can be costly and inefficient. Data scientists can spend hours each day collecting and preparing data before it can even be analyzed. Presto automates this process with speed and accuracy so data scientists and engineers can focus more of their time on higher value tasks.  

IBM® watsonx.data

Presto is open source and can be installed manually. You can also use Presto with a data lakehouse solution such as IBM® watsonx.data for faster scaling of your AI workloads. IBM® watsonx.data is a fit-for-purpose data store built on open lakehouse architecture and supported by querying, governance and open data formats to help access and share data.

Learn more about IBM® watsonx.data

Related resources

Presto: Make sense of all your data, any size, anywhere

See how Presto, a fast and flexible open-source SQL query engine can help deliver the insights enterprises need.

IBM to help businesses scale AI workloads

Find out more about IBM® watsonx.data, a data store that helps enterprises easily unify and govern their structured and unstructured data.

The disruptive potential of open data lakehouse architectures and IBM® watsonx.data

Explore open data lakehouse architecture and find out how it combines the flexibility and cost advantages of data lakes with the performance of data warehouses.

IBM® watsonx.data: An open, hybrid, governed data store

Discover how IBM® watsonx.data helps enterprises address the challenges of today’s complex data landscape and scale AI to suit their needs.

Take the next step

Scale AI workloads for all your data, anywhere, with IBM watsonx.data, a fit-for-purpose data store built on an open data lakehouse architecture.

Explore watsonx.data
Book a live demo
Footnotes

1 External-mode performance characterization for databases and analytics (link resides outside ibm.com), Red Hat, January 18th, 2021