A fairly long time ago we became enamored of the RDF data model for representing application data. This is not the place to explain why, but we have built up some experience and expertise in designing with this model, and it seems to me to have brought a simplicity and elegance that we would not otherwise have achieved. One problem we have had with it, though, is implementing queries. We have tried two different approaches:
- Represent the RDF data as XML, store it in an XML database, and query on it using XQuery
- Store the RDF as triples in an RDF triple store (Jena), and query on it using SPARQL
Neither of these approaches has been the success we might have hoped for. The first approach, based on XML, had several major issues. One was the performance of XQuery in our chosen implementation (DB2), another was the difficulty of XQuery itself, and a third was the poor portability of XQuery across different DBMS implementations (which mattered because it is difficult, if not impossible, to use DB2 as an embedded component). The major problems with the second approach, native RDF stores, have been the performance, scalability and enterprise-readiness of Jena.
My current thought is that we will use JSON-LD as a standard for representing RDF data as JSON, store the JSON-LD in MongoDB and query it with MongoDB’s query capability. Ironically, this approach is closely analogous to our first approach, with JSON-LD replacing XML, MongoDB replacing DB2 and MongoDB's query replacing XQuery. So why do I think this might work better?
- JSON is a much better model for representing data than XML.
- MongoDB query is more understandable than XQuery.
- MongoDB query on JSON-LD data might perform significantly better than XQuery on RDF/XML. The primary reason for optimism here is the simplicity of MongoDB query. My (unsubstantiated) hypothesis is that the primary problem with both XQuery and SPARQL from a performance perspective is that they are simply too ambitious. They both provide great power and flexibility, with the consequence that many, if not all, queries run too slowly. The performance of similar queries can also differ hugely and be hard to predict. By being significantly more restrictive in the sort of queries that can be expressed, MongoDB may provide higher performance and, just as importantly, predictable performance. There are two problems with this hypothesis. The first is that it may or may not be true. The second is that even if it is true, it will impose a burden on application developers to solve application problems with less reliance on complex queries, and we do not know what the consequences of this will be. Finding out will be one of the goals of the project.
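
To make the idea concrete, here is a minimal sketch in Python of what a JSON-LD resource and a deliberately restricted query over it might look like. Everything in it is an assumption made for illustration: the vocabulary prefixes, the resource URLs, the property names and the helper functions are hypothetical and not taken from any real system. The `find` helper is a tiny in-memory stand-in that imitates only one small part of MongoDB's query style (equality matching on dotted field paths), so the example runs without a database. It uses compact, prefixed property names because full URLs contain dots, which clash with the dotted-path notation.

```python
# Two hypothetical JSON-LD resources, represented as plain Python dicts.
# All URLs, prefixes and property names are invented for this sketch.
CONTEXT = {
    "dc": "http://purl.org/dc/terms/",
    "ex": "http://example.org/vocab#",
}

resources = [
    {
        "@context": CONTEXT,
        "@id": "http://example.org/defects/1",
        "@type": "ex:Defect",
        "dc:title": "Login page times out",
        "ex:severity": "high",
        "dc:creator": {"@id": "http://example.org/users/alice"},
    },
    {
        "@context": CONTEXT,
        "@id": "http://example.org/defects/2",
        "@type": "ex:Defect",
        "dc:title": "Typo in footer",
        "ex:severity": "low",
        "dc:creator": {"@id": "http://example.org/users/bob"},
    },
]

_MISSING = object()  # sentinel for "this path does not exist in the document"


def get_path(document, dotted_path):
    """Follow a dotted path such as 'dc:creator.@id' through nested dicts."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def matches(document, query):
    """True if every dotted path in the query equals the expected value."""
    return all(
        get_path(document, path) == expected for path, expected in query.items()
    )


def find(collection, query):
    """In-memory stand-in for an equality-only query over a collection."""
    return [document for document in collection if matches(document, query)]


# "All defects created by alice", expressed as a flat equality filter.
query = {
    "@type": "ex:Defect",
    "dc:creator.@id": "http://example.org/users/alice",
}
result = find(resources, query)
print([document["dc:title"] for document in result])
```

With a real MongoDB collection, the same `query` dict could in principle be passed to pymongo's `Collection.find`, although I have not verified how well that works on real JSON-LD data. The point of the sketch is the shape of the query: a flat set of field conditions on one document, with no joins or graph traversal, which is the kind of restriction the hypothesis above depends on.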