Ask-xDD: RAG for Academic Publications
Built a RAG-powered chat interface that makes 17M+ scholarly articles queryable in natural language, bridging the gap between academic research and public understanding.
Overview
Ask-xDD is a conversational search system that lets users query a massive database of full-text scholarly articles through natural language. It combines dense passage retrieval with LLM reasoning to surface evidence-backed answers from academic literature.
Problem
Scholarly articles are locked behind copyright restrictions and technical jargon, making them inaccessible to non-experts. Meanwhile, misinformation spreads freely on social media. There was no easy way for the public or researchers outside a field to get evidence-based answers from the academic literature.
Approach
Implemented a retrieval-augmented generation (RAG) architecture: a Dense Passage Retriever (DPR) finds relevant passages in the xDD database, then ChatGPT with ReAct prompting generates answers grounded in those retrieved passages.
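The retrieve-then-generate flow can be sketched roughly as follows. This is a minimal illustration, not the project's code: the corpus, the passage ids, and the helper names (`embed`, `retrieve`, `build_prompt`) are hypothetical, and `embed` is a toy term-count encoder standing in for a learned DPR encoder. The ReAct loop and the actual ChatGPT call are left out; the sketch stops at the grounded prompt that would be sent to the model.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; the real system retrieves from the xDD database.
PASSAGES = {
    "p1": "Sea level rise is driven by thermal expansion and melting ice sheets.",
    "p2": "Basalt is a fine-grained volcanic rock rich in iron and magnesium.",
    "p3": "mRNA vaccines train the immune system to recognize the spike protein.",
}

def embed(text):
    """Stand-in encoder: L2-normalized term counts (NOT a learned DPR encoder)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}

def score(q_vec, p_vec):
    """Inner product of two sparse vectors, the usual dense-retrieval score."""
    return sum(w * p_vec.get(term, 0.0) for term, w in q_vec.items())

def retrieve(question, passages, k=2):
    """Return the ids of the k passages most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(
        passages,
        key=lambda pid: score(q_vec, embed(passages[pid])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, passages, hit_ids):
    """Assemble a prompt that restricts the model to the retrieved passages."""
    context = "\n".join(f"[{pid}] {passages[pid]}" for pid in hit_ids)
    return (
        "Answer using only the passages below and cite passage ids in brackets.\n"
        f"{context}\n"
        f"Question: {question}\nAnswer:"
    )

question = "What causes sea level rise?"
hits = retrieve(question, PASSAGES)
prompt = build_prompt(question, PASSAGES, hits)
print(prompt)
```

In a real deployment the encoder, the index, and the LLM call would replace the toy pieces here, but the shape stays the same: retrieve first, then constrain generation to what was retrieved.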
Constraints
- Must handle 17M+ full-text articles across diverse domains
- Answers must be grounded in retrieved passages, not LLM knowledge
- System must scale across geoscience, climate change, and COVID-19 topics
Key Decisions
RAG over fine-tuned LLM
Because the corpus of 17M+ articles keeps growing, retrieval-augmented generation keeps answers grounded in source material without expensive retraining: new articles only need to be indexed. It also lets answers cite their sources transparently.
Dual retrieval backend with Weaviate and Elasticsearch
Weaviate handles semantic vector search while Elasticsearch provides keyword matching. Combining both improves recall across different query types.
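The write-up does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method Ask-xDD uses. The document ids and the two hit lists are hypothetical stand-ins for responses from the vector and keyword backends, and `k=60` is the conventional default for RRF, not a value taken from the project.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists standing in for the two backends' responses.
semantic_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from vector search
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from keyword search

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

RRF only needs ranks, so it avoids having to calibrate vector similarity scores against keyword relevance scores, which live on different scales. Documents found by both backends rise to the top, while documents found by only one are still kept.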
Result & Impact
- 17M+ articles indexed
Made academic literature conversationally accessible to non-experts, with initial coverage spanning geoscience, climate change, and COVID-19. Funded by DARPA and USGS.
Learnings
- Hybrid retrieval (semantic + keyword) significantly outperforms either approach alone on diverse academic queries.
- Grounding LLM responses in retrieved passages is essential for trust: users need to see where answers come from.
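One lightweight way to enforce the grounding point is to check that every citation in a generated answer refers to a passage that was actually retrieved. The sketch below assumes a bracketed citation format such as `[p1]`; that format and the function names are illustrative assumptions, not part of the original system. It validates citation ids only and does not verify that the cited passage supports the claim.

```python
import re

def cited_ids(answer):
    """Extract bracketed passage ids such as [p1] from an answer string."""
    return set(re.findall(r"\[([A-Za-z0-9_]+)\]", answer))

def check_grounding(answer, retrieved_ids):
    """Report which citations are present and whether all were retrieved."""
    cited = cited_ids(answer)
    unknown = cited - set(retrieved_ids)
    return {
        "cited": sorted(cited),
        "unknown": sorted(unknown),
        # Grounded only if there is at least one citation and none is unknown.
        "grounded": bool(cited) and not unknown,
    }

ok = check_grounding("Thermal expansion raises sea level [p1].", ["p1", "p2"])
bad = check_grounding("Sea level is falling [p9].", ["p1", "p2"])
print(ok)
print(bad)
```

An answer that fails this check could be regenerated or shown with a warning, so users only see citations that point to passages they can inspect.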