In the third installment of Database Deep Dives, we caught up with JanusGraph PMC members Florian Hockmann and Jason Plurad to get some guidance on the wide world of Graph.
JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. The project was forked from Titan and brought under open governance at the Linux Foundation in 2017.
Read on to learn from Florian Hockmann of G DATA and Jason Plurad of IBM how JanusGraph compares with Neo4j and why you should be keeping an eye out for TinkerPop 4, and to get expert tips on graph data modeling.
Tell us a little about yourself and what you are working on today?
Florian Hockmann (FH): My name is Florian Hockmann, and I’m working as an R&D engineer at G DATA, a German antivirus vendor. The team I’m part of is responsible for analysing the hundreds of thousands of malware samples we receive each day. We use a graph database to store information about these malware samples so that we can find connections between similar samples.
Jason Plurad (JP): I'm Jason Plurad, an open source developer and advocate with IBM's Cognitive Applications. I've been active in graph communities like JanusGraph and Apache TinkerPop to help grow those open source communities and to enable our product teams and clients with graph and other open source data technology. I've also been getting more involved with my team on exploring other emerging open source data and AI initiatives.
How did you get involved with JanusGraph?
JP: IBM was a founding member of JanusGraph, and I was on the team that pulled it together. We had been using its predecessor, Titan, for several different products. We liked Titan because of its open source license and the flexibility it gave us with respect to building an overall graph platform.
When Aurelius (the company that created Titan) was acquired by DataStax, the open source community was left wondering what the future of Titan would be. Eventually, DataStax released a graph offering as part of DataStax Enterprise, but there was no open source option. We knew we weren't alone in wanting an open source graph database, so we found others in the community and worked together to fork Titan and bring JanusGraph to the Linux Foundation with open governance.
FH: We originally used Titan, which was the predecessor of JanusGraph. Titan was a natural fit for us as we were looking for a database that could scale horizontally and that enabled us to find connections between malware samples, which is a typical use case for a graph database. After the company that developed Titan was acquired and, shortly afterward, stopped all work on Titan, we were left with a database system that wasn’t maintained anymore. So, we were of course quite happy when IBM and others forked Titan to found JanusGraph, and we wanted to contribute to this new project to play our part in ensuring that JanusGraph succeeds as a scalable open source graph database.
I had already been involved in Apache TinkerPop, where I mostly develop the Gremlin .NET variant Gremlin.Net, so it was a natural fit to contribute a JanusGraph extension library for it. But I also made small contributions to other parts of the project and helped new users on the mailing list or on Stack Overflow as well. This was a good way for me to get to know the various parts of the project and to get more involved in it.
What should people know when deciding between Neo4j and JanusGraph?
JP: This might not be the answer folks would expect, but teams should work with their lawyers to evaluate the licenses to determine which fits their needs. JanusGraph uses the Apache License, which is a liberal open source license that allows you to use it with few restrictions. Neo4j Community Edition uses the GNU General Public License, which has more restrictive requirements on distributing software. Many developers eventually need the scaling and availability features that are only available in the Neo4j Enterprise Edition, which requires a commercial subscription license.
People should also know that JanusGraph and Neo4j both support the Apache TinkerPop graph framework. TinkerPop gives you the ability to use the same graph structure and Gremlin graph traversal language to evaluate multiple graph databases with the same code. TinkerPop is also supported by many other vendors' products, including Amazon Neptune, Microsoft Azure Cosmos DB, and DataStax Enterprise Graph, although keep in mind that many of the TinkerPop implementations are not free open source.
FH: I see mainly two differentiating factors between these two graph databases. Firstly, Neo4j is mostly a project that is kind of self-contained. What I mean by this is that it implements its own storage engine, indices, server component, network protocol, and query language.
JanusGraph, on the other hand, relies on third-party projects for most of these aspects. The reasoning behind this is that there are already existing solutions for these problems that are good at their specific job. By using them, JanusGraph can really concentrate on the graph aspect instead of having to also solve these problems again.
JanusGraph can, for example, use Elasticsearch or Apache Solr for advanced index capabilities like full-text search and scalable databases like Apache Cassandra or HBase to store the data. Because of that, it’s probably easier to get started with Neo4j as fewer moving parts are involved, but JanusGraph offers more flexibility as users can choose, for example, between different storage and index backends based on their specific needs. Users can decide for themselves which approach they prefer.
The other key differentiating factor I see is the user-facing interface of these two graph databases, with the query language as its central aspect. JanusGraph implements TinkerPop for this (which can be considered the de facto standard for graph databases right now, as most graph databases currently implement it), which offers users mostly the same experience across different graph databases, similar to the role SQL plays for relational databases.
While it’s also possible to use TinkerPop and its query language Gremlin together with Neo4j, Neo4j mostly promotes its own query language, Cypher. So, most Neo4j users probably end up using that language.
Users, of course, have to decide for themselves again which query language they prefer, Gremlin or Cypher, and how important it is for them to be able to easily switch to another graph database at some point in the future.
Apart from these technical aspects, I, of course, also want to point out that JanusGraph is an open source project that is completely community-driven. Users who want to see a certain feature implemented can therefore simply implement it themselves.
What advice would you give people who want to deploy JanusGraph in production?
FH: I already mentioned that JanusGraph uses a few different components, like index and storage engines, to create a graph database that offers rich functionality. While this approach gives users great flexibility and a rich feature set, it can also be a bit overwhelming for new users.
I would, however, like to point out that one doesn’t need a deep knowledge of all components to get started with JanusGraph. When I started with Titan (and it’s basically still the same for JanusGraph), I didn’t really know anything about Cassandra or Elasticsearch, but I was still able to set up and deploy Titan with these backends quickly.
Over the years, we switched from Cassandra to Scylla, added Apache Spark for machine learning, and made our deployment easier to scale by moving JanusGraph into Docker containers hosted on Docker Swarm.
So, my advice is to start with a small and simple deployment and then increase the size of the deployment and its complexity as needed. JanusGraph’s docs also contain a chapter, “Deployment Scenarios,” that describes a relatively simple getting-started scenario and how it can be evolved into a more advanced scenario.
Another project that is very important for JanusGraph is TinkerPop, which I already mentioned a few times. So, I would advise new users to get familiar with TinkerPop and, most importantly, its graph query language Gremlin. There are really good resources to get started like TinkerPop’s tutorials or the free e-book Practical Gremlin.
JP: First and foremost, be prepared to fully embrace and contribute to open source. JanusGraph is a community project, and it is neither owned nor driven by a single vendor. Your team should become engaged with the JanusGraph community in identifying and resolving bugs that you encounter since you'll be the most motivated to fix them. Over time, with continued contributions, your team can become leaders in JanusGraph to help move the project forward.
Operations can be a big obstacle as teams go into production. When you're dealing with a rather large stack of technologies that may be new to your team, you should put enough due diligence into understanding how to keep your data infrastructure up and running. Since JanusGraph relies on an external storage backend (such as Apache Cassandra or Apache HBase), ultimately, your team will need skills in deploying and operating those horizontally scalable databases and their dependencies. Of course, you should get involved in those open source communities as well.
What are you looking forward to in JanusGraph and TinkerPop in the next few years?
JP: I've been in the graph data space for several years, but it is still emerging. In the next few years, I'd love to see improved tooling around the graph ecosystem. This would include tools for graph modeling, graph visualization, and graph database operations.
Graphs usually aren't alone in an overall data architecture, so tooling that allows you to bridge the gap between graph data and other data models will be useful to propel graphs into the mainstream.
Interest has been growing this year at the W3C in standardization for graph data, including property graphs, RDF, and SQL. An open standard specification for graph data could better align the graph database vendors to grow their share of the database market.
FH: Especially for JanusGraph, it’s hard to predict the future development as the project is completely community-driven and many contributions come from developers who are basically interested users who want to improve JanusGraph based on their own experiences and needs.
Apart from many small performance improvements, JanusGraph will most likely soon have an in-memory backend with significantly improved performance that is also ready for production usage, as opposed to the current in-memory backend, which is only intended for testing purposes. This improved backend is a good example of a contribution made by users of JanusGraph, in this case developers at Goldman Sachs.
Backends are in general an area where I expect substantial improvements in the next few years for JanusGraph. We, of course, simply benefit from improvements in new releases of the backends themselves, but completely new backends can also provide big improvements or completely new functionality for JanusGraph.
FoundationDB looks, for example, very promising as it concentrates completely on achieving a scalable storage engine that offers transactions with ACID properties, and additional layers can add features like rich data models or advanced index capabilities. This approach seems to be a good fit for JanusGraph’s modular architecture and has the potential to solve some frequent problems with JanusGraph, like storing supernodes or performing upserts.
But, it’s good that you also asked about TinkerPop, as many improvements for JanusGraph will actually come from TinkerPop, especially when the next major version, TinkerPop 4, gets released.
The development of TinkerPop 4 is still in a very early state, but some major improvements can already be identified. What I’m personally looking forward to most is a wider range of execution engines for Gremlin traversals. Right now, one can choose between executing a traversal with a single thread (which is a good fit for real-time use cases) or on a computing cluster with Spark (e.g., for machine learning or graph analytics).
At G DATA, we often have use cases that are in the middle of these two options, as they should be answered in a matter of a few seconds—which isn’t quite possible with Spark since it has some overhead—but they involve traversing over a significant number of edges, which also isn’t a good fit for single-threaded execution. An additional execution engine that is able to use more computing resources but that doesn’t need to load the whole graph first could be the perfect fit for those use cases.
A lot of effort is also currently being spent on creating a more abstract data model for TinkerPop that is not specific to graphs. This has the potential to also open up TinkerPop to non-graph databases and computing engines. So, it can really grow the ecosystem of TinkerPop-enabled databases.
Do you have any tips or tricks for performant graph modeling?
FH: This may sound obvious, but I think many users still aren’t doing it—namely evaluating a new schema or major changes to a schema before taking it into production.
This should be done with real data if possible, and the evaluation should include queries that model actual use cases. There is really no other way to ensure that your schema is actually a good fit for your use cases, and changing the schema later in production is a lot more time consuming than doing an initial evaluation.
A topic that is very important for probably all graph databases is supernodes (vertices with a very large number of edges), as they can be really painful and lead to very high query execution times. So, it’s best to check early whether supernodes can occur in your data model and then to work around them, for example, by changing the schema accordingly.
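As an illustration of the early supernode check FH describes, here is a minimal sketch in plain Python. It does not use JanusGraph or Gremlin; the edge list, the vertex names, and the threshold of 100 are hypothetical and only show the idea of counting edges per vertex on a representative dataset before settling on a schema.

```python
from collections import Counter


def find_supernodes(edges, threshold):
    """Return {vertex: degree} for vertices with more than `threshold` edges."""
    degree = Counter()
    for source, target in edges:
        # Every edge counts toward the degree of both of its endpoints.
        degree[source] += 1
        degree[target] += 1
    return {vertex: d for vertex, d in degree.items() if d > threshold}


# Hypothetical sample data: 1000 malware samples all contact the same domain.
edges = [(f"sample-{i}", "domain-popular.example") for i in range(1000)]
edges.append(("sample-0", "domain-rare.example"))

supernodes = find_supernodes(edges, threshold=100)
print(supernodes)
```

A vertex flagged this way is a candidate for a schema change, for example splitting it into several vertices or storing the value as a property instead.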
Another general thing to consider for a graph model is whether something should be a property on a vertex or a separate vertex of its own that is connected to the other vertex with an edge. My usual approach is to decide whether I want to be able to search for other vertices that have the same value for that property, in which case I model it as its own vertex with edges connecting it to all vertices with that value. Otherwise, it can usually just be a vertex property.
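The trade-off between a property and a separate vertex can be sketched in plain Python. This is only an illustration under assumptions: the `packer` attribute, the sample names, and the small `Graph` class are hypothetical and stand in for a real graph database. With a plain property and no index, finding vertices with the same value means scanning all vertices; with a shared value vertex, it is a two-hop traversal.

```python
from collections import defaultdict

# Option A: the value is a plain property on each vertex.
samples_with_property = {
    "sample-1": {"packer": "UPX"},
    "sample-2": {"packer": "UPX"},
    "sample-3": {"packer": "Themida"},
}


def similar_by_property(samples, sample_id, key):
    """Find vertices sharing a property value by scanning every vertex."""
    value = samples[sample_id][key]
    return sorted(
        other for other, props in samples.items()
        if other != sample_id and props.get(key) == value
    )


# Option B: the value is its own vertex, connected to each sample by an edge.
class Graph:
    def __init__(self):
        self.adjacent = defaultdict(set)

    def add_edge(self, a, b):
        self.adjacent[a].add(b)
        self.adjacent[b].add(a)

    def similar_by_vertex(self, sample_id):
        """Two hops: sample -> shared value vertex -> other samples."""
        found = set()
        for value_vertex in self.adjacent[sample_id]:
            found |= self.adjacent[value_vertex]
        found.discard(sample_id)
        return sorted(found)


graph = Graph()
graph.add_edge("sample-1", "packer:UPX")
graph.add_edge("sample-2", "packer:UPX")
graph.add_edge("sample-3", "packer:Themida")

print(similar_by_property(samples_with_property, "sample-1", "packer"))
print(graph.similar_by_vertex("sample-1"))
```

Both calls find the same neighbors, but Option B only touches the vertices connected to the shared value. The flip side is that a very common value can turn its vertex into a supernode.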
JP: Graph modeling takes time. It's easy to get started with a naive graph model but, most likely, you won't get the best model on the first try. It usually takes several iterations to get the model right for your use cases.
Be prepared to work with a small representative dataset for your domain and a list of the queries that you want to run so you can see how well the model performs against your use cases. Pay close attention to the branching factor as you jump from vertex to vertex. Even with a reasonable number of edges on a given vertex, the number of graph elements the query will touch can explode exponentially with a few jumps. Consider denormalizing the graph structure so you can better leverage filtering (matching on label or properties) to reduce the number of elements early in the query.
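The effect of the branching factor can be made concrete with a little arithmetic. The following is a rough sketch in plain Python, assuming every vertex at a given hop has the same number of outgoing edges and ignoring vertices that are reached more than once, so the numbers are upper bounds. The fan-out values of 50 and 5 are hypothetical.

```python
def frontier_sizes(fanouts):
    """Return the number of vertices reached after each hop.

    fanouts[i] is the assumed number of edges followed per vertex at hop i.
    """
    sizes = []
    frontier = 1  # the traversal starts from a single vertex
    for fanout in fanouts:
        frontier *= fanout
        sizes.append(frontier)
    return sizes


# Follow all 50 edges of every vertex for three hops.
unfiltered = frontier_sizes([50, 50, 50])
# Filter on a label or property at the first hop so only 5 edges are followed.
filtered = frontier_sizes([5, 50, 50])

print(unfiltered)  # [50, 2500, 125000]
print(filtered)    # [5, 250, 12500]
```

Under these assumptions, a filter that removes 90% of the edges at the first hop also removes 90% of the work at every later hop, which is why reducing the number of elements early in the query matters so much.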
How can someone get involved with JanusGraph?
FH: It depends on whether you want to contribute code, improve the documentation, or help in some other way, like helping other users on the mailing list who encounter a problem you have already encountered yourself and know how to solve.
For code or documentation changes, you can just look through our open issues on GitHub to find one that interests you or create a new issue to describe the suggested improvement and then just submit a pull request for it.
This is no different from other open source projects. One advantage of JanusGraph for new contributors is probably that it consists of so many different modules that there is a wide range of topics to contribute to: something specific to a certain backend like Cassandra or Elasticsearch, core areas like how a query is executed, or utility aspects around JanusGraph like schema management or client libraries for a certain programming language. So, you can choose an area to contribute to in which you already have some knowledge or that you are interested in.
If someone is interested in contributing to JanusGraph but needs some guidance to get started, then it’s of course always possible to ask me or any other active contributor and we are more than happy to help.
JP: JanusGraph is an open community, and the diversity of our community has helped drive the project in many new directions. We've had solid contributions from our community to expand JanusGraph with drivers for different programming languages and with storage adapters for different database backends.
Our developers from IBM Compose contributed features back to open source for dynamic graph management on the server. We've received improvements on the build and testing infrastructures, and also integrations with Docker and Apache Ambari.
We'd love to see more people get involved, and there are many ways to help even beyond programming the source code. I think it's most important as a collaborative community that people share their knowledge and experiences: by answering questions on the forums, by updating the JanusGraph documentation, by building example projects that use JanusGraph in innovative ways, or by presenting on JanusGraph at local meetups or conferences. The best way to reach out is on our Google Groups. If Google is restricted by your corporate firewall or in your geography, you can subscribe to them as a mailing list with an email address. See https://github.com/JanusGraph/janusgraph/#community
Thank you to Jason and Florian for taking the time to talk with us! If you want to read other Database Deep Dives, check out our other interviews: