Presto has become a popular tool for data scientists and engineers who deal with multiple query languages, siloed databases and different types of storage. Its high-performance engine lets users query large volumes of data in real time, regardless of where the data is located, through a simple ANSI SQL interface. Presto’s speed at executing queries on large volumes of data has made it an indispensable tool for some of the largest companies in the world, including Facebook, Airbnb, Netflix, Microsoft, Apple (iOS) and AWS (Athena and Amazon S3).
Presto’s architecture is unique in that it is built to query data wherever that data is stored, making it more scalable and efficient than similar solutions. Presto queries let engineers use data without having to physically move it from one location to another, an important capability as organizations deal with an ever-increasing amount of data to store and analyze.
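For example, a single ANSI SQL statement can join data that lives in different systems without copying it anywhere first. A minimal sketch, assuming a Presto cluster with two configured catalogs named hive and mysql (the schema, table and column names below are hypothetical):

    -- Join page views stored in a Hive/HDFS data lake with customer
    -- records stored in MySQL, in one federated Presto query.
    SELECT c.customer_name, count(*) AS page_views
    FROM hive.web.page_views AS v
    JOIN mysql.crm.customers AS c
      ON v.customer_id = c.customer_id
    WHERE v.view_date >= DATE '2023-01-01'
    GROUP BY c.customer_name
    ORDER BY page_views DESC
    LIMIT 10;

Presto pushes filtering down to each source where it can and combines the results in memory, so neither data set has to be exported or reloaded before it is queried.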
Presto was built to empower data scientists and engineers to interactively query vast amounts of data regardless of its source or type of storage. Because Presto doesn’t store data itself but instead communicates with separate data stores to run its queries, it is more flexible than its competitors and can scale queries up or down quickly as the needs of the organization shift. According to an IBM whitepaper, Presto, optimized for business intelligence (BI) workloads, can help enterprises optimize their data warehouse spending and reduce costs by up to 50 percent.
Here are some of the key benefits of using a Presto workflow:
Lower costs: As the size of data warehouses and the number of users running queries grow, it’s not uncommon for enterprises to see their costs rise rapidly. Presto, however, is optimized to handle a high volume of small, interactive queries, making it practical to query any amount of data while keeping costs down. And because Presto is open source, there are no licensing fees associated with deploying it, which can mean significant savings for enterprises processing large volumes of data.
Increased scalability: It’s common for engineers to set up multiple engines and languages on a single data lake storage system, which can force a re-platforming effort later and limit the scalability of the solution. With Presto, all queries are written in standard ANSI SQL against a single interface, eliminating the need to re-platform. Presto works equally well on small and large volumes of data and scales easily from one or two users to thousands. Rather than deploying multiple compute engines, each with its own SQL dialect and API, teams can rely on one engine, making Presto an ideal tool for scaling workloads that would otherwise be too complex and time-intensive for engineers and data scientists to manage.
Better performance: While many query engines that run SQL on Hadoop are limited in compute performance because they are built to write their intermediate results to disk, Presto’s distributed in-memory model lets it run large numbers of interactive queries concurrently against large data sets. Following a classic massively parallel processing (MPP) design, Presto splits each query into stages, distributes the work across many worker nodes in parallel and uses in-memory streaming shuffles to move data between stages. Executing tasks in memory eliminates the need to write and read intermediate results from disk between stages and shortens each query’s execution time, making Presto a lower-latency option than its competitors (the example following this list shows how to inspect a query’s distributed plan).
Improved flexibility: Presto uses a plug-and-play connector model for its data sources, including Cassandra, Kafka, MySQL, the Hadoop Distributed File System (HDFS), PostgreSQL and others, making querying across them faster and easier than with comparable tools that lack this functionality. Presto’s flexible architecture also means it isn’t tied to a single vendor; it runs on most Hadoop distributions, making it one of the most portable tools available.
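To make the staged, in-memory execution model concrete, Presto’s EXPLAIN statement with the DISTRIBUTED type prints the plan fragments that run on worker nodes and the exchanges that stream data between them in memory. A minimal sketch (the catalog, schema and table names are hypothetical):

    -- Print the distributed plan: each fragment runs in parallel on
    -- worker nodes, and data flows between fragments through in-memory
    -- exchanges rather than being written to disk between stages.
    EXPLAIN (TYPE DISTRIBUTED)
    SELECT region, sum(amount) AS total_sales
    FROM hive.sales.orders
    GROUP BY region;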
While Presto isn’t the only SQL-on-Hadoop option available to developers and data engineers, its architecture, which keeps query processing separate from data storage, makes it one of the most flexible. Unlike many other tools, Presto decouples the query engine from the data storage layer and uses connectors to communicate between the two, giving engineers more flexibility in how they construct solutions with Presto.
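In practice, each data source appears in Presto as a catalog backed by a connector, and the same ANSI SQL works against all of them. A short sketch, assuming catalogs named mysql and kafka have already been configured (the schema and table names are hypothetical):

    -- List the data sources (catalogs) this Presto cluster can reach.
    SHOW CATALOGS;

    -- The same SQL interface works whether the data sits in a
    -- relational database or in a Kafka topic.
    SELECT * FROM mysql.shop.orders LIMIT 10;
    SELECT * FROM kafka.default.clickstream LIMIT 10;

Swapping a storage backend then becomes a matter of configuring a different connector rather than rewriting queries in a new dialect.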