Posted Wednesday, July 27, 2016
I know, what a cool word, right?
Since I have been dying to write about something (anything) since retiring, this seemed like a fun subject to attack. I’m sure I’ll get comments correcting everything I’ve said, that’s fine. After all, we are talking about something that started being discussed over 30 years ago.
Unfortunately, in my excitement to get back to blogging, it appears that I was less than thorough in my research of the term “sargable”. Its origin is documented, or at least it is sort of documented, in a 1979 research paper co-authored by Pat Selinger.
On the off chance you've only been working with relational database for a very very short period of time, and don’t know Pat, Pat Selinger is an IBM Fellow, now retired from IBM (however, once an IBM Fellow, always an IBM Fellow in my opinion). If you ever have the opportunity to hear her speak, she’s excellent.
She is one of the original (and one of the lead) members of the IBM Research team responsible for System R, the SQL language, and cost-based query optimization for relational database.
Back to “sargable”.
Pat coined the word "sargable" for the System R optimizer back when she was part of a small team computer scientist working at IBM Research in San Jose, CA. The term “sargable” was first used in a paper she co-authored titled “ Access path selection in a relational database management system” that was published as part of the proceedings of the 1979 ACM SIGMOD international Conference on Management of Data.
Pat explained to me that the inspiration for “sargable” was the word "sargs" that IMS used to refer to search arguments that the database could use for filtering results. She decided it made sense to her at the time to call predicates that could be sent to the storage layer as search arguments "sargable".
BTW, if you cannot get to the ACM Digital Library, the paper is also available (at least for now) from the Stanford Engineering Computer Science website for download: http://cs.stanford.edu/people/chrismre/cs345/rl/selinger.pdf. The discussion of the term “sargable” starts on page 25 of the ACM proceedings or physical page 3 of the downloadable PDF mentioned in the previous sentence.
That should end our back story, now on to more about “sargable”.
Let’s kick off this conversation with a question. Does anyone here remember Prem Mehra?
Prem was with IBM’s Dallas System Center and was considered by many to be one of the elite DB2 performance people in the early days of DB2. To the best of my knowledge, he may have been one of the first to use the term “sargable”. In fact, he was one of the authors (or the author) of an IBM Washington System Center “blue book” (because the books had blue covers of course) describing DB2 sargable predicates.
The exact origin of the word in my opinion is a tad vague. There is even disagreement about where (or is it what) the word was derived from. I’ve always thought sargable was an acronym for: “Search ARGumentaBLE”. Maybe that’s a bit of a stretch, or is it? It does works.
Sargable is (was) used to describe predicates that can be applied during the first stage of queries processing.
Although some may debate how the word was constructed, no one debated its meaning. It meant simply that you could use an index to satisfy a predicate. Today, you almost always hear this step in predicate processing described as Stage 1.
The terms “sargable” and “non-sargable” became common place in the very first releases of DB2. They were explained in the above mentioned “Blue Book” put out by the IBM Washington Systems Center about a hundred of years ago, the one that Prem authored. If anyone still has a copy of this “Blue Book”, send me a picture of its cover. I had one at one point. However, with all the moves over the years, it seems to have disappeared.
Enough of the chit-chat, let’s talk dated optimization terms.
The terms most commonly used today are indexable, stage 1, and stage 2. Indexable predicates are always applied first, stage 1 predicates are applied second, and stage 2 predicates are applied last.
Tying today’s terms back to our terms from the beginning days of DB2, “sargable” refers to stage 1, while Stage 2 predicates would correspond to “non-sargable”. The terms “sargable” and “non-sargable” are seldom used anymore because they can sometimes confuse. However, even though stage 1 and stage 2 are today’s commonly used terminology, I’m not sure they are any less confusing. today. The following are more detailed definitions of the both the old and current terms.
BTW, I actually came across someone using “non-sargable” on Twitter the other day. That’s what prompted this post.
Stage 1 (sargable) – rows are processed by the DB2 component data manager. Data manager is less costly and preforms closer to the data. All indexed predicates are stage 1, although not all stage 1 predicates are indexable. For indexable predicates, only qualifying rows are selected.
Stage 2 (non-sargable) – Rows not handled by the data manager are passed on and processed by the DB2 component RDS (relational data system). The RDS is responsible for all of the heavy lifting. Any predicate that cannot be resolved by the data manager are passed on to the RDS; additional processing, additional code path, and further from the data. Predicate processing by the RDS is far more expensive than resolving a predicate by the data manager.
If you are interested in IBM’s descriptions of all of the above, you can read about it at IBM’s web based Knowledge Center in the “Summary of stage 1 and stage 2 predicate processing” section.
I’d also like to take a shot at an analogy of what stage 1 and stage 2 are and how they differ. Please let me know if my analogy helps, causes more confusion, or is just plain incorrect. I’m thick skinned so negative will not crush me.
I also would like to point out that my analogy uses coins minted in the USA. The denominations of the coins are nowhere near as important as their appearance. To make sure anyone reading this gets an accurate picture (someone outside the US for example), I’m including a few pictures of various coinage I’ll be referring to at the end of this post.
"S" mint code, circled in red in the second example below, designates that this coin was minted in San Francisco, CA. The San Francisco mint only stamps a small fraction of the coins produced by the US mints as compared to the Denver mint (“D"s stamp) and Philadelphia mint (“no stamp" designation). Hopefully, the importance of that "S" will become more evident as the analogy progresses.
Each US coin type is labeled "A" through "F”. I will refer to those labels in this analogy along with the coin denomination.
OK, here goes. This analogy is how I made sense of all this stage 1/stage 2 stuff. Of course, it also assumes that I actually understand it all. Please jump in if you think this is incorrect or falls apart somehow.
I get a big bag of miscellaneous coins. Using USA terms, the bag contains pennies (A), dimes (C), quarters (D), rolls of dimes (F), and rolls of quarters (E). I dump this huge bag of coins out onto a table so I can find all the pennies marked with the mint code “S” (that penny is labeled B).
If you look at all these coins laying on the table, the rolls should jump out immediately. They are different than the actual coins because they are dozens of the same coins rolled up into a paper tube. If I go through the pile of coins I can easily and quickly select all of the rolls. Think of that selection as the cheapest part of looking for my pennies with the “S”. I could also think of those rolls as being indexable, sargable, and stage 1.
Next, I can find all the loose dines and quarters. This will cost me a bit more in search time. However, it’s still a pretty inexpensive procedure. This operation I equate to sargable and stage 1.
Finally, I start looking for my pennies with the “S” mint stamp. Now everything left on the table are pennies. So I have to pick up each individual penny and examine the mint code to if it is an “S”. Those pennies that match my search criteria are selected to satisfy my query. I think of this as being non-sargable or stage 2.
Hopefully, from this analogy, you can see that the cost of find my pennies marked with an “S” is far more tedious and expensive compared to find the rolls of coins or even the dines and quarters.
If this helps, great. If not, let me know why. Maybe we’ll be lucky enough to have one of my optimization friends come across this post and offer their opinion.
Let me remove the coins from my table and I’ll spend the rest of this afternoon play Star Wars with my Grandson on his PS4.
....While writing today's blog entry, I was listening to Keiko Matsui's 2007 release "Moyo". This is a pianist who covers smooth jazz to new age and makes it all sound magnificent. If you have never listened to her, you really should........