Date: 2026-01-21
DataStax® Astra DB on IBM watsonx.data® simplifies machine accessibility and app development on this 120-million-entry knowledge graph, boosting query speed 30-fold and cutting development time by 90%.
Wikipedia is renowned for its thoroughness, widespread accessibility and the trust it has engendered. Key to these characteristics is its community-based creation and maintenance. This massive compilation of knowledge—to the tune of 300 languages and 25 billion monthly views—is a reliable, collaborative and open source of information used by countless people every day.
However, the rise of AI made machine accessibility a new challenge for the organizations that develop and support Wikipedia. Wikidata, the linked, open platform that makes Wikipedia data available to thousands of developers across the open source landscape, needed to make its massive, multilingual knowledge graph (about 120 million entries and 2.4 billion edits to date) more accessible to and usable by large language models (LLMs).
After test-driving several vector databases, Wikimedia Deutschland, the organization that develops Wikidata, turned to DataStax Astra DB on IBM watsonx.data. Compared to computing vectors locally, the highly scalable, low-latency Astra DB made queries 30 times faster, a critical factor for retrieval-augmented generation (RAG) apps. Development time at Wikimedia Deutschland fell by 90%, as its development team can now focus on innovation rather than hosting and maintaining data infrastructure.
Wikimedia’s use case is grounded in the fact that LLM adoption is rising, and teams want to use trusted data to make generative AI more reliable and transparent. They also want to provide the community more control over which data is referenced.
But access was a hurdle: Wikidata is primarily accessed through SPARQL (a semantic query language). It’s powerful but requires users to learn both the query language and Wikidata’s domain-specific structure.
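To make the hurdle concrete, here is a minimal sketch of a SPARQL request built in Python. It assumes the public Wikidata Query Service endpoint and uses an illustrative query (museums in Germany) that is not taken from Wikimedia's project. Note that the developer has to know the opaque identifiers (P31, P279, P17, Q33506, Q183) before writing a single line.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint (assumed here for illustration).
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: items that are an instance of (a subclass of) museum
# and whose country is Germany. The identifiers must be looked up first.
query = """
SELECT ?museum ?museumLabel WHERE {
  ?museum wdt:P31/wdt:P279* wd:Q33506 ;   # instance of / subclass of: museum
          wdt:P17 wd:Q183 .               # country: Germany
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

# Build the request URL only; sending it is left to the caller.
url = f"{SPARQL_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
print(url[:60])
```

A natural-language search layer lets a developer find candidate items first and write a query like this only once the right identifiers are known.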
Wikimedia sought a simpler way for developers to explore and retrieve relevant items before writing precise graph queries.
Building an API layer atop a vector database provided this access for developers, supporting downstream applications. These applications include multilingual user experiences (OpenStreetMap is a good example) and search engines that need fast, trusted context (information about museums, books and cultural institutions, for example).
The API layer reduces time spent crafting complex queries, lowers the learning curve for new developers and speeds iteration on RAG pipelines.
Wikidata’s API layer provides machines with access to a vector database through two routes:
The search route starts with a natural-language query plus configuration parameters, and performs hybrid search by combining keyword search with vector search.
Results from keyword and vector search are merged by using reciprocal rank fusion, a simple method that rewards items that rank highly and appear in both lists.
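Reciprocal rank fusion can be sketched in a few lines. The function name, the sample item IDs and the smoothing constant k=60 (a commonly used default) are assumptions for illustration; the source does not state which constant Wikimedia uses.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item IDs into one ranked list.

    Each item earns 1 / (k + rank) from every list it appears in, so items
    that rank highly and appear in both lists accumulate the largest scores.
    k=60 is an assumed default, not a value confirmed by the source.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by item ID for a deterministic order.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical result lists from the two retrievers.
keyword_hits = ["item_a", "item_b", "item_c"]
vector_hits = ["item_b", "item_d", "item_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Because the method uses only rank positions, it does not require keyword and vector scores to be on a comparable scale.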
Finally, Wikimedia adds an optional reranking step. When enabled, the system calls the Wikidata API to fetch the latest item information, then applies a Jina.ai reranker model to reorder results by relevance. The reranking step is intentionally optional because, in some RAG use cases, the full list is passed downstream to an LLM and ordering is less critical. Users can skip reranking for faster response times.
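The optional step can be pictured as a flag that either returns the fused list unchanged or reorders it. This is a sketch under stated assumptions: `finalize_results`, `fetch_text` and `score` are hypothetical names, and the word-overlap scorer below is a toy stand-in for the reranker model, not the Jina.ai model itself.

```python
from typing import Callable, List, Optional, Sequence

def finalize_results(
    query: str,
    candidates: Sequence[str],
    rerank: bool = False,
    fetch_text: Optional[Callable[[str], str]] = None,
    score: Optional[Callable[[str, str], float]] = None,
) -> List[str]:
    """Return candidates as-is, or reordered by relevance when rerank=True."""
    if not rerank:
        # Fast path: skip the extra lookups and model call entirely.
        return list(candidates)
    if fetch_text is None or score is None:
        raise ValueError("reranking needs both fetch_text and score")
    # Fetch fresh text for each item, then sort by the relevance score.
    texts = {item: fetch_text(item) for item in candidates}
    return sorted(candidates, key=lambda item: score(query, texts[item]), reverse=True)

# Toy data and toy scorer (hypothetical stand-ins for the live lookups).
toy_texts = {"item_a": "a large river", "item_b": "an english writer", "item_c": "a city"}

def toy_score(query: str, text: str) -> float:
    return float(len(set(query.lower().split()) & set(text.lower().split())))

fast = finalize_results("english writer", ["item_a", "item_c", "item_b"])
ranked = finalize_results(
    "english writer", ["item_a", "item_c", "item_b"],
    rerank=True, fetch_text=toy_texts.__getitem__, score=toy_score,
)
print(fast, ranked)
```

Keeping the fast path free of lookups is what makes skipping the step worthwhile when the caller passes the whole list to an LLM anyway.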
The Astra DB vector database is segmented by:
The similarity score route starts with a natural language query and a user-specified list of Wikidata entities. Instead of retrieving candidates, the system measures how closely each provided entity aligns with the query.
The process begins by embedding the query with the same Jina.ai model. It then looks up the stored vectors for the specified entities in Astra DB and computes their similarity scores against the query vector.
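The scoring step can be sketched as follows. Cosine similarity is assumed here because it is a common choice for embedding vectors; the source does not name the metric. The two-dimensional vectors, the entity IDs and the in-memory dictionary standing in for the Astra DB lookup are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_entities(query_vector, entity_ids, stored_vectors):
    """Score only the entities the caller listed; no candidate retrieval happens."""
    return {
        entity_id: cosine_similarity(query_vector, stored_vectors[entity_id])
        for entity_id in entity_ids
        if entity_id in stored_vectors  # silently skip IDs without a stored vector
    }

# Hypothetical stored vectors (a dictionary stands in for the database lookup).
stored_vectors = {
    "entity_a": [1.0, 0.0],
    "entity_b": [0.0, 1.0],
    "entity_c": [1.0, 1.0],
}
query_vector = [1.0, 0.0]  # would come from the embedding model in practice

scores = score_entities(query_vector, ["entity_a", "entity_c"], stored_vectors)
best = max(scores, key=scores.get)
print(scores, best)
```

A downstream system can take the highest-scoring entity as the chosen label, or compare the scores against a threshold of its own.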
This route supports applications such as classification, entity linking or named entity disambiguation, where downstream systems can use the similarity scores directly to choose the best label or resolve which entity a mention refers to.
The API components run on Wikimedia Cloud Services, an infrastructure hosted by the Wikimedia Foundation. Wikimedia hosts its own infrastructure for two reasons: privacy (protecting the contributor community and taking responsibility for data stewardship) and control over what information is stored, where it is stored and who can access it.
This project is ultimately about making a foundational, widely reused knowledge asset easier to use in modern AI pipelines—without asking every developer to become a graph-query expert first.
Relying on Astra DB resulted in some clear benefits:
Wikimedia also came across a meaningful multilingual insight: creating discrete vectors for each language initially seemed redundant, but experiments showed that accuracy improved as more languages were incorporated. The results suggested that the embedding approach captured language nuance rather than simple one-to-one translation.
Wikimedia promoted the launch of this API in October 2025 and is committed to updating it to keep improving access to grounding data for Wikidata reusers and AI developers.
Wikimedia's next steps focus on expanding language coverage, encouraging real-world usage and collecting feedback from developers building atop Astra DB. It also aims to continue building out a Model Context Protocol (MCP) integration for Wikidata that uses Astra DB to support exploration while retaining the precision of graph querying. In addition, Wikimedia is exploring advanced RAG techniques, including GraphRAG, which incorporates graph-structured data to handle highly complex queries.
By separating the API layer, combining keyword and vector retrieval and making reranking optional, Wikimedia created a flexible path that can serve both interactive exploration and production AI retrieval flows. It did so without forcing a replatforming of Wikimedia’s core infrastructure or governance posture.
Adopting Astra DB gave Wikimedia a managed vector database, performance and scalability headroom and reduced development overhead, all of which help it move faster while keeping the focus on user outcomes. Those outcomes are better retrieval, faster responses and simplified access to Wikidata for the developers building the next generation of AI-enabled experiences.