July 13, 2017 | Written by: James Young
Categorized: Data Analytics
Share this post:
For decades now, SQL has been the colossus of data management query languages. Generations of database administrators, developers and analysts have built highly successful careers around tables and foreign key relationships, to the point where thinking about data in a relational schema has become second nature. In fact, many believe that relational databases are still the only safe and effective way to model complex data relationships.
Over the past decade, that belief has been tested like never before. The rise of NoSQL databases built on data structures such as JSON and key-value pairs has shown that if large-scale, high-speed data ingestion and flexibility are the key requirements, a relational data model may no longer always be the best option.
Graph databases in particular are one flavor of NoSQL data stores that have gained momentum in the last couple of years, especially for use-cases relating to real-time fraud detection, recommendation engines, and deep data analysis. The ability to analyze complex relationships between people, places, products, and transactions by traversing a graph can provide an enormous boost in performance, compared to running a complex SQL query with dozens of joins.
If you’re a SQL expert who is skeptical about non-relational databases, it’s probably a good idea to know how these other data models work, how to use them, and what the trade-offs are.
For the most part, SQL developers don’t have much trouble with a JSON or key-value database. In fact, most actually find them easier to use than SQL, although the lack of a rigorous schema and ACID-compliant transactions can make it harder to build a truly robust application.
With graph databases, though, it’s a different story. Until recently, graph databases were considered a fairly esoteric branch of technology, mainly used in academic research rather than enterprise applications. Under the hood, these databases are heavily reliant on graph theory, a branch of advanced mathematics that is more complex than the principles behind tables and joins. As a result, a myth has arisen that graph databases are highly specialized tools, which require a skill-set that SQL users find difficult to acquire.
Register for webinar
Fortunately, this is no longer the case. The current generation of graph databases are built for enterprise deployment and modern graph query languages such as Gremlin (the language of choice for the Apache® TinkerPop™ graph computing framework), although superficially very different to SQL, are designed to be highly expressive and easy to learn.
Let’s look at a simplified example of a product recommendation use-case — first in SQL, and then in Gremlin.
When working with SQL, we have three tables: one for people, one for products, and one for purchases of products by people:
Given the relational data model above, imagine you wanted a product recommendation for Jim. First, let’s find out what he has already purchased, using a SQL query:
JOIN Purchases ON
Purchases.Product = Products.ID
JOIN People ON
Purchases.Person = People.ID
WHERE People.Name = "Jim";
The result tells us that Jim has purchased a pizza. Now we need to know what other pizza-purchasers are buying:
SELECT Products.Name, COUNT(Products.ID)
JOIN Purchases ON
Purchases.Product = Products.ID AND
Purchases.Person <> "Jim"
Products.Name <> "Pizza" AND
JOIN Products ON
Purchases.Product = Products.ID AND
Products.Name = "Pizza"
JOIN People ON
Purchases.Person = People.ID AND
People.Name <> "Jim")
GROUP BY Products.Name;
This tells us that among people other than Jim who have purchased pizza, there have been two purchases of milkshakes, but only one purchase of soda. So, if we’re seeking the best product recommendation for Jim, we might want to ask him if he’s thirsty for a milkshake.
However, notice how complex our query has gotten already. We already have several joins and a sub-query, even for a relatively simple inquiry. As the level of complexity of the relationships we want to explore increases, we’ll need to write more lines of code, and as the amounts of data and number of tables grow, the joins and sub-queries will become more and more complicated and expensive to compute.
Now let’s look at how we might model this same problem in a graph database, using Gremlin. If you would like to try this out for yourself, download a copy of the Gremlin Console from
Apache® TinkerPop™, and follow along with the examples.
This time, we have vertices for people and products, connected by edges representing the purchases.
// Create a new graph
graph = TinkerGraph.open()
// Add our people as ‘vertices’ (nodes) in the graph
jim = graph.addVertex(id, 1, label, ‘person’, ‘name’, ‘Jim’)
joanna = graph.addVertex(id, 2, label, ‘person’, ‘name’, ‘Joanna’)
david = graph.addVertex(id, 3, label, ‘person’, ‘name’, ‘David’)
lynn = graph.addVertex(id, 4, label, ‘person’, ‘name’, ‘Lynn’)
// Add our products as vertices in the graph
pizza = graph.addVertex(id, 5, label, ‘product’, ‘name’, ‘Pizza’)
milkshake = graph.addVertex(id, 6, label, ‘product’, ‘name’, ‘Milkshake’)
soda = graph.addVertex(id, 7, label, ‘product’, ‘name’, ‘Soda’)
// Add our purchases as ‘edges’, pointing from our people vertices
// to our product vertices
jim.addEdge(‘purchased’, pizza, id, 8)
joanna.addEdge(‘purchased’, pizza, id, 9)
joanna.addEdge(‘purchased’, milkshake, id, 10)
david.addEdge(‘purchased’, pizza, id, 11)
david.addEdge(‘purchased’, soda, id, 12)
lynn.addEdge(‘purchased’, pizza, id, 13)
lynn.addEdge(‘purchased’, milkshake, id, 14)
You’ll notice that instead of treating purchases as just another set of rows in a table, we use Gremlin to model them as a different type of object from the people and products: they are edges, whereas the people and products are vertices. Often, it’s easy to think of vertices as “objects” or “nouns”, and edges as “actions” or “verbs”.
The relationships that edges represent are unidirectional; you can imagine them as arrows pointing from one vertex to another. In our case, all the edges point from people to products: the “Jim” vertex has an outbound edge labeled “purchased”, which points to the “Pizza” vertex. This one-way relationship makes sense if you consider the edges as verbs: saying that “Jim has purchased a pizza” does not also imply that “a pizza has purchased Jim”.
When we start writing queries, the direction of the edge is important. If an edge points from “Jim” to “Pizza”, we say that the edge goes “out” from “Jim”, and “in” to “Pizza”. You can visualize the graph we’ve just created as follows:
Now let’s look at some queries. First, to find what Jim has purchased, all we need to do is this:
// Set up a graph traverser – this provides some settings to help Gremlin understand
// how to traverse our graph
g = graph.traversal()
// With the traverser in place, we can start our query:
g.V(). // 1. Get all vertices in the graph
has(‘name’, ‘Jim’). // 2. Select any vertices that have the name ‘Jim’
out(‘purchased’). // 3. Traverse any outbound edges from the ‘Jim’
// vertex that have the ‘purchased’ label. Then
// select the vertices at the ends of those edges
values(‘name’) // 4. Return the names of the vertices that we’ve found
This will give us the same answer as our original SQL query: Jim has purchased pizza. Unlike the SQL query, however, the Gremlin query takes the form of a chain of functions. Each function returns a result, which is immediately fed into the next function in the chain, and the output of the last function is returned to the user.
Now let’s see how we would find some product recommendations for Jim. We’re using the same basic logic as in the SQL query: looking for products (except pizza) purchased by people who have also purchased pizza (except Jim). But instead of the abstraction of joining tables together, Gremlin gives us a much more natural way to think about how to solve the problem by simply following the relationships between vertices.
g.V(). // 1. As before, we start by getting a list of products
// purchased by Jim, which leads us to the pizza vertex.
in(‘purchased’). // 2. Follow the ‘purchased’ edges from the pizza vertex
// to get all the people who have purchased pizza
has(‘name’, neq(‘Jim’)). // 3. Exclude Jim from the list of people
out(‘purchased’). // 4. Follow the ‘purchased’ edges again to get all the
// products that these people have purchased
has(‘name’, neq(‘Pizza’)). // 5. Exclude pizza from the list of purchases
groupCount().by(‘name’) // 6. Group the results by product name and count
// the number of purchases of each product
Incidentally, as a SQL expert, you may note that the two queries above are written in an imperative or procedural style—that is, we are giving specific instructions to Gremlin to tell it how to traverse the graph at each step. This is a change in mindset from SQL’s declarative approach, which focuses on specifying the results that we are looking for, and leaves it up to the database to decide how to find them.
If you find this imperative approach unfamiliar, you may be interested to know that Gremlin can also provide a declarative approach to querying, using a method known as pattern matching. It’s a little beyond the scope of this blog to explain in detail, but we could rewrite the product recommendation query above in a declarative style as follows:
This has been a whistle-stop tour of Gremlin syntax for the SQL user. Of course, there is a lot more to learn.
If we’ve piqued your interest and you would like to take a deeper dive, please register for our one-hour webinar, “Gremlin Traversals for the SQL User”, at 11:00 US Eastern Daylight Time on Tuesday, July 25, 2017. You can sign up through
The New Builders Webinar Series.