Take a Dip into PostgreSQL Arrays
13 min read
PostgreSQL array datatypes
There are a number of datatypes available in PostgreSQL. In this article, we’re going to take a look at the array datatype.
There are times when you might want to store multiple values in one database column instead of across multiple tables. PostgreSQL gives you this capability with the array datatype.
Arrays are some of the most useful datatypes for storing lists of information. Whether you have a list of baseball scores, blog tags, or favorite book titles, arrays can be found everywhere. The bigger question, however, is why you’d even consider using arrays. The answer comes down to performance and ease of use. Aggregating and joining data across tables and rows can be difficult, and depending on the type of data you’re storing, it might get expensive too.
In this post, we’re going to create a table containing student information. We’ll create two fields that demonstrate two ways to create arrays. We’ll insert some data into the table, index the array columns, show how to query data from one- and multidimensional arrays, and we’ll look at some of the performance benefits and drawbacks along the way.
So, let’s get started . . .
Creating a table with arrays
We’ll start by creating a table of students. The table is comprised of their name
, exam scores
as a multidimensional array, and contact information that’s a one-dimensional array of phone numbers. We’ll create the table using both the bracket notation []
for creating arrays and the ARRAY
keyword, which are the two ways you can designate a field to be an array. See the following example:
We are using the text
datatype for the name. The student scores
is an array of integers, but since we will insert four exam scores taken at four different times, we’ll create a multidimensional array of integers to hold all the scores. This is indicated using the double brackets. Student contacts
is also an array using the varchar
datatype, but instead of using brackets, we’ll use the ARRAY
constructor. The brackets and the ARRAY
constructor are synonymous.
Once the table has been created, we can run \d students
to see how PostgreSQL has stored the datatypes for each table row.
Notice that the table doesn’t indicate explicitly that scores
is a multidimensional array. That’s because PostgreSQL allows arrays to be either multidimensional and one-dimensional. So, even though we created the table with double brackets [][]
, PostgreSQL would have allowed us to insert a one-dimensional array as well. At the same time, we could have simply created thescores
field as a one-dimensional array, with single brackets []
, and inserted a multidimensional array.
Now, let’s move on and insert some data into our student table.
Inserting data with arrays
Once our table has been created, let’s insert some data. For this article, we’ve entered 400 students into our database as show in the following example:
student_
with a number is the unique student name for each record. The exam scores is a multidimensional array. Each of these arrays contains four exam arrays that represent a single exam enclosed using braces {}
. Each of the values within an array represents a portion of the exam. The key when inserting a multidimensional array is that the enclosed arrays must have matching dimensions. So, we couldn’t have a student who has only three scores for a single exam, for example, because we’ll get an error like this:
The last values that we’ll insert are enclosed in a one-dimensional array containing two phone numbers for each student. If a student has one number or two numbers, it won’t matter. The only constraints we have are for multidimensional arrays where the array dimensions must match.
After inserting the data, we should see something like:
Now, let’s set up some indexes on these array columns.
Indexing arrays
To index arrays in PostgreSQL, it’s best to use a GIN or GiST index. Using either index has its benefits and drawbacks; however, GiST indexes were primarily developed for geometric datatypes while GIN indexes were designed for arrays. As a rule of thumb, PostgreSQL recommends using a GIN index for static data, especially for arrays, and a GiST index for data that is frequently updated.
Both types of indexes provide us with special operator classes that we can use when searching for data in arrays. GIN indexed queries support <@, @>, =, and && operators, while GiST indexed queries support many more operators like <<, &<, &>, >>, <<|, &<|, |&>, |>>, @>, <@, ~=, and &&.
Indexing arrays don’t require a special syntax to use. Using PostgreSQL’s CREATE INDEX
operation, we’ll create a Btree index for student names and GIN indexes on the other columns like:
Once the indexed have been created, let’s get into querying the data.
Accessing array elements
To take advantage of our indexes when accessing student scores and contacts, we can use GIN operators to find whether students got a certain exam score or whether a contact number matches with a student record. As a simple example, we could find out which student has the contact number (555)941-8927. For that, we’d write something like this:
Since we used the @>
operator, which looks for numbers contained in the array, the GIN index is automatically used. Using EXPLAIN
on the query, we can view its performance while searching for the number in the contacts column.
Without the GIN index, PostgreSQL would perform a sequential scan that would have to look at all of our contacts in the table before finding returning the right student. Using a data source of more than 400 students, one can see that the costs of the query would be much higher without an index:
Finding a specific exam score within a multidimensional array is a little tougher. Let’s try to find the number of students who got a 97 on their exam. One way to write this query is like this:
The problem with this query is that we will get 87 students back who got a 97 in either four of their exams. At the same time, PostgreSQL didn’t use our index because it had to look at each student to find out who got a 97.
This way of querying is quite costly, so let’s see if there is a better approach. One of the more interesting ways to search through arrays containing integers is using the intarray extension, which contains special operator classes for arrays. To install the extension, we’d write:
Once the extension is installed, we can drop our scores index DROP INDEX idx_scores;
and create a new index using the intarray extension we installed and the gin__int_ops
operator class that comes with intarray.
Now that we’ve created the new index when we run EXPLAIN
on the query, we can see that the index is used and it’s much faster to search for the score we want.
Sometimes we want to search for particular scores within a single exam and don’t want to include all the exams a student has taken. Since we have four arrays representing four exams, we could select one of the arrays/exams and see how many of the 87 students got a 97 on the first exam. To do that, we’d write a query like the following:
From this query, we’ll get 20 students who got a 97 as one of their scores on their first exam. Notice that the only addition to the query is scores[1:1]
. This is an array slice, which selects only the first array of our scores
multidimensional array.
The only problem with the query is that even though we set up an index on the scores column, the index won’t be used because we haven’t indexed both dimensions of the array.
Since the scores array is two-dimensional, we’d have to set up two indexes instead of one. To get around this, we can create a materialized view that will contain the student names in one column and their first exam scores as a one-dimensional array in another column. Then we can index the score’s column and see what performance differences are there. Let’s set that up now:
With the new materialized view, our data looks like the following with scores as a one-dimensional array:
Now, we can set up a GIN index on the scores column:
CREATE INDEX idx_materialized_student_score_view ON student_scores USING GIN(scores);
Let’s run the query using the student_scores
materialized view:
We’ll still get the 20 students who got a 97 on their first exam, but let’s run EXPLAIN
on the query to see its performance.
As we can see, PostgreSQL chose to do a sequential scan on our data without using the index. PostgreSQL intuitively selects the fastest way to query our data, and in this case, it was via the sequential scan and not the index. But, if we look at the performance of querying our data using the slice of the scores
array (scores[1:1]
) or creating the materialized view, we can see that creating the materialized view and searching for our scores that way outperformed looking for values through the slice.
Selecting ANY of the Array Elements`54’098765l,
Another way that PostgreSQL allows us to search for values is using the array subexpression ANY
. This expression lets us check if any of the values in an array meet the expression requirements. The only drawback with this is that it doesn’t use an index, so on larger datasets, we might see significant performance loss if we use it. Nonetheless, let’s try it out.
When constructing a query using ANY
, we place the expression in the WHERE
clause on the right side of an operator. So, if we are looking again for students who scored a 97 on an exam we’d write:
Like the query above where we used the GIN and GiST indexes with the @> operator, the ANY
expression provides us with the same result of 87 students. Using EXPLAIN
, however, shows us the real performance problems using the expression:
Therefore, it’s not that advantageous to use ANY
and it’s favorable to put an index on the columns that we’re going to be accessing a lot.
Summing it up
We’ve discussed quite a bit about how to create tables with arrays, how to access them, how to index them and their performance, as well as a trick to put values from a multidimensional array into a materialized view for better indexing. By shedding some light into how PostgreSQL arrays work and how to query them, your performance and understanding of how arrays work will enhance your effectiveness when coming across these datatypes in your database.