Database Deep Dives: PostgreSQL
30 July 2019
14 min read
Author
Josh Mintz Program Director

In the latest installment of Database Deep Dives, we caught up with Dave Cramer, an Architect and Engineer for the IBM Cloud Databases portfolio, and Brad Nicholson, the maintainer of the PostgreSQL JDBC driver. Read the interview below to learn how the PostgreSQL community can be its greatest strength—but also, sometimes, a weakness—and how the new pluggable storage engine can help take Postgres from a database to an application development platform.

 

Tell us a little about yourself and what you are working on today.

Dave Cramer (DC): I currently live in a small town in Ontario, Canada, and I have been fortunate enough to work at home all of my career. For fun and excitement, I like to explore my current understanding of physics on a race track with my car.

As to what I’m working on today, Crunchy Data has allowed me to work on Postgres full time and that’s really, really cool. I’m currently working on the JDBC driver (link resides outside ibm.com), fixing bugs; and actually, as we speak, adding something to the backend of PostgreSQL to help the JDBC driver—and other drivers—deal with things like people changing the search path on them.

I also help out with the PL/R procedural language (link resides outside ibm.com) that Joe Conway wrote (who also works for Crunchy Data). There are a bunch of other interesting technologies that I’m helping out with or getting involved with. One of my favorites is Logical decoding, which enables Change Data Capture in a unique way that only Postgres can do. More recently in the Java PostgreSQL world, there are a number of reactive/async drivers that I’m very interested in.

Brad Nicholson (BN): I live in Toronto. Today, I’m working on IBM Cloud Databases. I’m an Architect and an Engineer on that, working in the Postgres space and across other databases as well. Really dealing with the challenges of bringing a database-as-a-service from its internal homegrown solution to a full-blown, Kubernetes (link resides outside ibm.com)-native platform and product.

 

How did you get involved with PostgreSQL?

BN: Completely by accident. I started my career as a web developer. The stuff I was working on was MySQL (back in the days before transactions existed), and there was all sorts of weird stuff happening, like columns were getting truncated silently, people would stop their browser in that day and get one table written to and not the other one. I was thinking, there has got to be a better way than this, and that was right around the time that Postgres was becoming a viable open source product.

I can’t remember exactly what version that would have been, but folks like Tom Lane had taken it over and started doing a lot of work on it. So, I switched on over to that and it fixed my problems pretty quickly because it works like you expected it to. From there, I got moved to a Postgres DBA job and went from there.

DC: It was pretty much by accident, kind of . . . maybe. There is a bit of a story here—back around 1998, I quit my job to be a consultant. The contract was supporting a Java application, and, at that time, that meant getting a Microsoft Network subscription.

I had an issue and called up support and they said it would be three weeks. I went, “Huh, how is that going to work?” I was getting paid by the hour so three weeks seemed a little long time. I thought there has to be a better way, and I started looking at open source.

I knew nothing about it [at first], so I asked a friend how to get started. He said, “Go on the mailing list and start answering every question you can.” I was interested in Java, so I found the JDBC list, and I started answering every question. The first one took a couple of days to figure out, the second one took a little bit less, and by a month or two later, I could answer questions without doing the research.

At some point, the guy maintaining the JDBC driver decided to step down and Bruce Momjian asked me if I was interested in supporting it. I said, “Yep,” and I have been doing it for 20 years now. That’s how I got involved in the community.

 

In your opinion, what are the strengths and weaknesses of PostgreSQL?

DC: Oh wow, this an interesting challenge. I think the strengths and weaknesses are in the community, not necessarily in the technology. Quoting the Roadmap (link resides outside ibm.com)—which I’ll summarize because it’s a bit long—it says it’s a “non-commercial, all-volunteer, free software project, and, as such, there is no formal list of feature requirements required for the development. We enjoy allowing developers to explore the topics of their choosing.” There is more, but that’s the gist of it.

I see the advantage in some ways because the project tends to not get locked into a specific solution because there is a company marketing that specific solution. Because it’s completely open source, when we realize what we have been doing is wrong or there is a better way to do it, we have no vested interested in maintaining the status quo. So, we give it up and change our path.

To give you a concrete example, a few years ago, there was a de-facto standard for replication, which was a trigger-based solution. It had a lot of challenges, and since then, the community developed something called logical replication which has superseded trigger-based replication. We now have a much better solution.

In my opinion, the strength is also actually a weakness—the notion that we have this big project and all these different solutions. Typically, in a company-developed product, there is one way to do a specific thing and that’s the only way.

In PostgreSQL, there are 10 or 15 ways to do something. And right now, Postgres is enjoying considerable uptake in adoption, and that means there are a lot of new people to the project trying to navigate what is the right way to do something. There may not be a right way, but there may be a whole bunch of wrong ways.

For someone used to commercial software, where there is one supported solution, figuring out how to architect your solution using Postgres can be pretty challenging. The good thing about this, for me and Crunchy Data, is that this presents an opportunity for companies like Crunchy Data to provide support for known working solutions (at least known working solutions that we understand and can support.) [That’s why] I see it as both a strength and a weakness.

BN: Community is one of the strengths, for sure. Dovetailing off that a bit is the licensing. There has been a lot of stuff in the news recently about open source licensing with Mongo, Redis, and Elasticsearch.

In Postgres, the license is open—people can’t change it or take things away from it. Nobody can buy Postgres; it’s going to be an open source project as it is as long as it exists. That’s a huge, huge benefit.

On the more technical side, as well as an obvious and boring thing, is stability. It’s by and large a rock-solid database as long as you stay away from some of the bleeding-edge things that might have a few edge cases in them.

[It’s also] very secure. One of things I really like about it—especially compared to most databases—is that predictability of it. By and large, it’s quite predictable in how it works and how it fails. The predictability in how it fails is a really, really nice, especially when you are getting into automating solutions on for it. The lack of predictability is one of the biggest challenges I see in some of the other databases I work with (too many strange edge cases).

The other thing is extensibility. When you look at the extension system, the type of stuff that you can build on top of Postgres without forking the main codebase is actually kind of remarkable. You take a look at this thing that started its life as a relational database over 30 years ago and is now moving towards being an application development platform or really even a database development platform. That’s really cool.

On weaknesses, one of the things we’re starting to see more and more is the lack of a native scale-out story like native sharding that allows larger workloads to run on Postgres natively.

I think the connection model is also really limiting. This whole per-process connection thing and using external connection poolers; we have third-party tools that work for that but they don’t tend to work well with multi-tenancy because you have to relax your security model at some point due to needing to keep the total number of connections relatively low and not being able to pool connections between database users.

The logical decoding and native Postgres logical replication is really cool, as well. I think there is a big hole there right now—it doesn’t actually work all that well with streaming replication. You can’t switchover or failover a member and have it sync your place. It’s really nice for building downstream systems to consume changes off them but as soon as you have to switchover or failover, you lose your place. You have to think about re-syncing your systems, and that gets messy.

The final one I’ll mention is an overreliance on external tooling to do a lot of basic stuff properly. Postgres gives you this rock-solid database with hooks to deploy backups, replicate, etc., but it’s up to you to put the solution together. I think Dave touched on this earlier, it can be convoluted to do that. You have to have a lot of knowledge to put together a reliable Postgres deployment, use a DbaaS solution, or pay people to do it for you.

I think especially if you look at more modern databases, it’s really not that tough to set up replication and have failover and all that stuff. It’s a bit of configuration, it’s built in the system and kind of just works.

 

Speaking of modern databases, PG 12 is coming out pretty soon. What are you looking forward to there?

DC: It dovetails into what I said earlier about the cool, new features and all the different itches people scratch. One of the things that come about is basically the basis for the next, sort-of jump in Postgres’s ability—the pluggable table storage interface. It’s the “developer cool thing” I’m interested in.

Since the beginning, sort of the way Postgres does things—in one version, we’ll start the framework of a feature and it won’t be particularly useful but the next version will be a little more useful, [and the one after that will be even more useful.] Pluggable table storage interface is going to enable columnar storage, encrypted columns, and other domain-specific storage systems. In addition, we are starting to see improvements in partitioning. We are adding functionality to partitioning in each major version. This will enable PostgreSQL to handle much larger workloads.

BN: Pluggable table storage for me as well. That’s kind of the really exciting thing for Postgres. I mentioned the extensibility bit earlier and this dovetails into that, right? Once you start looking at combining extensions and pluggable table storage together, you have this development platform where you can start building all sorts of cool stuff on top of there.

Another thing, which is more developer-centric, is the JSON path. Postgres has really awesome JSON capabilities. But if you aren’t a Postgres user and you come to it from using a document store, it’s going to be really weird to use. The JSON path gets closer to being able to work with JSON in Postgres in a way that developers should find more friendly.

 

What advice would you give people to want to deploy PostgreSQL on Kubernetes?

BN: Don’t. Seriously, everyone wants to right now because it’s the buzz technology. It definitely solves a certain set of problems, but it introduces another set of them. It’s all about tradeoffs and understanding the ones you need to make.

You have to really ask yourself why you want to run it. You need to really understand what you are buying when you move into the Kubernetes world. Coming from a more standard deployment profile, you are totally changing your deployment platform— how things get deployed, how things move around.

Chances are that you’re not running your databases in containers already. So, when you come to Kubernetes, you are coming to containerization. You are going to have to throw a lot of what you know [around things like memory management and resource utilization] out the window and relearn it all. It’s a really big undertaking.

You also likely need to be comfortable cracking open Kubernetes source code and figuring out why it’s doing what you don’t think it should be doing (because it will sometimes).

From there, if you want to forge ahead with it, don’t reinvent the wheel. Basically, there are a couple operators out there (there is a plug for Dave there, Crunchy has a Postgres Operator) (link resides outside ibm.com) and Zalando, who do Patroni (which we love), also have a Postgres Operator (link resides outside ibm.com). So, if you really want to do it, take a look at those operators and work from there. Don’t start from scratch.

DC: As opposed to don’t, I would say call Crunchy Data. You know, Kubernetes was initially designed for stateless workloads. It certainly presents a challenge when you are working with PostgreSQL; the two have some impedance mismatches.

Crunchy Data has deep insight into the Kubernetes world and deep insight into Postgres world. As Brad pointed out, trying to do this on your own is pretty daunting. You have to be a master of two worlds here! One person doing both is quite a challenge. If you are trying to run it on Kubernetes, you have to either hire someone or call us.

BN: Adding one more thing—the blogosphere has tons of stuff about running databases on Kubernetes. They are very narrowly focused on Day 1: Standing up a database. That stuff is easy these days. It’s everything that happens after that—the lifecycle of the database—that gets very difficult.

DC: Same thing, we have solutions to many problems. If you want to get just get a database running, that’s fine. But if you want to keep it running, that’s a much bigger problem. Kubernetes might just decide to shut down your pods for instance. Reschedule a running Postgres—we have solutions for those sorts of things. I’ll tell you, there was hair pulled to find those things. I’ll reiterate what Brad said: “Don’t reinvent the wheel.”

 

Running on Kubernetes by yourself might be one mistake, what are the other top mistakes people make with PostgreSQL?

BN: I would say the top one is not using a connection pooler and not understanding the Postgres connection model and its limitations. People fire up 20 copies of their application. Each one tries to open couple hundred connections each, and the next thing you know, you are running out of connections or hitting performance problems. You have to understand how the connection model in Postgres works and use a pooler like PgBouncer (link resides outside ibm.com) to bring things down to a more manageable number.

DC: Sort of the same thing. I think it depends on the scale of the project. There are some large projects that would have benefited from some preliminary architectural consulting. They aren’t database experts (and specifically not Postgres experts).

I actually had someone call me the other day about an IoT project, and the guy had a particular solution in mind that I knew just wouldn’t work. He had a naive, simple solution that was going to fail. Reading, researching, or calling someone up that has that kind of knowledge (and saying, “This is what I want to do, what tools are available to do it?”) will go a long way at the beginning of a project.

Certainly, on a smaller scale, if you aren’t trying to make something architecturally challenging, learning how to tune Postgres. At a minimum, reading the documentation on settings and security would go a long way to making your life easier. We have an amazing documentation site, and it just appears not a lot of people read it. I guess it’s kind of like trying to eat an elephant, where do you start? But I think some reading goes a long way.

 

Do you have any secret performance tuning tips or tricks?

BN: I don’t think there are any secret performance tuning tips. The ones that I would say to people are:

  1. Learn how to read a query plan and tune your queries.
  2. Adjust your data models.
  3. Fix bad access patterns.

Honestly, in my experience, you can tune knobs to make things change in the planner, add more memory, those types of things. But, by and large, the biggest wins come from fixing bad queries and fixing bad access patterns properly. To that end, install the pg_stat_statements (link resides outside ibm.com) extension and learn how to use it. Analyze your access pattern, validate your access patterns. Do it in an iterative fashion—don’t just do it once and forget about it. Do it as your product evolves.

One of the more nuanced things—in Postgres and most databases—indexes are expensive. Postgres suffers from write amplification pretty heavily. People will throw a bunch of indexes on to handle any type of query, but they don’t realize the overall impact that has with all the writes that are driven as a part of that. So, index carefully and look for indexes you aren’t using and remove them. You can find queries on the web really easily that will find those for you.

DC: Pretty much what Brad said. At the end of the day, in a database, the most time is taken reading and writing to disk. It’s pretty straightforward math—the I/O channel has certain bandwidth, and you’re trying to read/write a certain number of bytes on the disk, and you have a certain number of time. You can’t fix that, but you can make sure your queries are running well.

Everybody I have seen that is good at this is vigilant and ruthless about monitoring their database. They make sure that every query they add to the code doesn’t blow up. They check for regressions when they release new code, they measure every SQL statement or instrument the app to measure it. It’s just a matter of making sure you understand what you are doing.

The other big thing is their development systems and development data that they’re developing on are orders of magnitude smaller than their production data. They then get these huge surprises in production! There is a typical moment of: “Why is this not using an index?” Because the index is actually slower. It’s because they don’t have enough development data or test data to actually “test” what they are doing in production. At the end of the day, it’s common sense. Test everything, double-check it, verify it, measure it, and fix what’s broken.

 

PostgreSQL vs. MySQL: What are some things application developers should be considering as they pick a database?

DC: I’m not really familiar enough with MySQL to do justice to this question. But I can talk about what they should be aware of in Postgres. They should understand the question, “What is a transaction?” They should understand the answer to the question, “What is transaction isolation?”

What we’re seeing today in big applications is that the application framework itself is pretty big. They are large, complex frameworks that take time and effort to master in their own right. They tend to look at the database as sort of a “dumb” store. They don’t really pay attention to it until it does something they didn’t expect. So, I think they should at least be aware of what a transaction is and understand what transaction isolation is. They should call someone . . . Crunchy Data would be glad to help.

BN: I don’t have too much to contribute here because I don’t know too much about MySQL from an app-development perspective. I haven’t used it in that context in a very long time and it has changed a lot.

 

Wrapping it up, where do you see Postgres going in the next 5-10 years?

DC: That’s a really good question, which means I really don’t have an answer. I do think there are some things we can look at.

I can give you some guesses; we have overcome the stigma of using open source in big businesses. I would be willing to bet that there isn’t a Fortune 500 company out there without a Postgres running it in somewhere. What this means is that its popularity is going to increase geometrically. The more people that use it, the more people that tell their friends. And the more they tell their friends, the more people use it. What that means to the project is that we are going to have more people working on the codebase.

One of things I find working in the open source community is how smart other people are. It’s kind of humbling. There are some brilliant people out there that are going to solve some of the hard problems we have today. The biggest challenge is that the community is going to have to figure out how to deal with more people. I don’t know what the future brings, but I know we will have more people.

BN: I want to add to what Dave there, that’s a really good point. I think there is going to be friction with how Postgres does stuff—they have their own way of doing things that’s different from the modern GitHub-style workflow most folks are accustomed to these days. There is nothing wrong with that, but it’s a higher barrier of entry for folks coming from GitHub workflows and that type of stuff.

There will be some challenges there. I think as a database in general, no one can predict the future. But if you look at where it’s come from, it’s come from being an academic relational database to a solid, open source relational database to a mixed-model system with relational and non-relational stuff.

When you start looking at the pluggable table storage and the extension framework, I think you will start to see Postgres become even more of a swiss-army knife database. You will start seeing it handling a lot more workloads.

 

Thank you very much to Brad and David for joining us and sharing their thoughts on PostgreSQL. Keep an eye out for our next interview in the coming weeks. Special thanks to Emily Hu for help on this interview. If you haven’t already, check out our previous installments:

 
Learn more about PostgreSQL