⚠️ This project is archived and no longer maintained.
Parts of this prototype supported the development of the Wikidata Embedding Project.
A Prototype for a Wikidata Question-Answering System
This system allows users to query Wikidata using natural language questions. The responses contain links to sources. If Wikidata does not provide the information requested, the system refuses to answer.
The system is in an early proof-of-concept state.
To give it a try, use ➡️ this Google Colab Notebook or load AskWikidata_Quickstart.ipynb in your infrastructure.
To answer questions based on Wikidata, the system uses retrieval-augmented generation (RAG). First, it transforms Wikidata items into text and generates embeddings for them. The user query is then embedded as well. Nearest-neighbor search identifies the most relevant Wikidata items, and a reranker model selects only the best matches among those neighbors. Finally, these matches are incorporated into the LLM prompt so that the LLM can generate its answer from Wikidata knowledge.
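The retrieval steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the AskWikidata implementation: the documents are invented examples, the bag-of-words `embed` function stands in for a neural embedding model, exact cosine search stands in for an approximate index, and the word-overlap `rerank` stands in for a reranker model. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

# Invented stand-ins for Wikidata items already rendered as text.
DOCS = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
    "The Spree is a river that flows through Berlin.",
    "Hamburg is a city in Germany.",
]

def embed(text):
    """Toy embedding: bag-of-words counts (assumption; a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k):
    """Exact nearest-neighbor search: keep the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rerank(query, candidates, n):
    """Toy reranker: score each candidate by the fraction of query words it contains."""
    q_words = set(embed(query))

    def score(doc):
        return len(q_words & set(embed(doc))) / len(q_words) if q_words else 0.0

    return sorted(candidates, key=score, reverse=True)[:n]

def build_prompt(query, context):
    """Place the selected passages into the prompt that would be sent to the LLM."""
    lines = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{lines}\n\nQuestion: {query}\nAnswer:"
    )

query = "What is the capital of Germany?"
candidates = retrieve(query, DOCS, k=3)   # broad first pass
context = rerank(query, candidates, n=1)  # narrow to the best match
prompt = build_prompt(query, context)
print(prompt)
```

The two-stage design keeps the first pass cheap and broad, then spends the more expensive scoring only on a handful of candidates before anything reaches the LLM.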
All models, including the LLM, can run on the local machine using PyTorch and bitsandbytes quantization. For nearest-neighbor search, an Annoy index is used.
On Nix the dev shell will install all required dependencies.
nix develop .
Alternatively, install the Python requirements using pip.
pip install -r requirements.txt
For faster execution, the results of some pre-computation steps are cached. In order to use those caches, unpack them:
bunzip2 --keep --force *.json.bz2
Generate text representations for Wikidata items. The list of items to use is currently hardcoded in text_representation.py.
python text_representation.py
This Python code will use AskWikidata to answer one question.
from askwikidata import AskWikidata
config = {
    "chunk_size": 1280,
    "chunk_overlap": 0,
    "index_trees": 1024,
    "retrieval_chunks": 16,
    "context_chunks": 5,
    "embedding_model_name": "BAAI/bge-small-en-v1.5",
    "reranker_model_name": "BAAI/bge-reranker-base",
    "qa_model_url": "Qwen/Qwen2.5-3B-Instruct",
}
askwikidata = AskWikidata(**config)
askwikidata.setup()
print(askwikidata.ask("Who is the current mayor of Berlin? And since when have they been serving?"))
A simple interactive read-eval-print loop (REPL) can be used to ask questions.
python repl.py
A script to evaluate the performance of different configurations is provided.
python eval.py
If you do not want to use a local LLM, AskWikidata can access the Hugging Face LLM API. Configure your Hugging Face API key in the HUGGINGFACE_API_KEY environment variable.
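As a small sketch of how a script might check for the key before calling a hosted API, the helper below reads HUGGINGFACE_API_KEY and fails early with a clear message when it is missing. The function name `get_huggingface_api_key` is hypothetical and not part of AskWikidata; how the project itself reads the variable is not shown here.

```python
import os

def get_huggingface_api_key(env=None):
    """Return the key from HUGGINGFACE_API_KEY, or raise if it is unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = env.get("HUGGINGFACE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "HUGGINGFACE_API_KEY is not set; export it before using the hosted API."
        )
    return key

# Demonstration with an explicit mapping, so no real key is needed.
print(get_huggingface_api_key({"HUGGINGFACE_API_KEY": "hf_example"}))
```

Passing the environment as a parameter keeps the check easy to exercise in unit tests without touching the real process environment.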
To execute the unit test suite, run:
$ python -m unittest
To get a coverage report, run:
$ coverage run -m unittest
$ coverage report --omit="test_*,/nix/*" --show-missing
