Ask-xDD: RAG for Academic Publications

Data Scientist · 2024 · 15 months · 7 people · github.com

Built a RAG-powered chat interface that makes 17M+ scholarly articles queryable in natural language, bridging the gap between academic research and public understanding.

Weaviate · ElasticSearch · Dense Passage Retriever · OpenAI ChatGPT · FastAPI · Streamlit · Python · Kubernetes

Overview

Ask-xDD is a conversational search system that lets users query a massive database of full-text scholarly articles through natural language. It combines dense passage retrieval with LLM reasoning to surface evidence-backed answers from academic literature.

Problem

Scholarly articles are locked behind copyright restrictions and technical jargon, making them inaccessible to non-experts. Meanwhile, misinformation spreads freely on social media. There was no easy way for the public or researchers outside a field to get evidence-based answers from the academic literature.

Approach

Implemented a retrieval-augmented generation (RAG) architecture: a Dense Passage Retriever (DPR) finds relevant passages in the xDD database, then ChatGPT with ReAct prompting generates answers grounded in those passages.
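The retrieve-then-generate flow can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `embed` function is a word-count stand-in for DPR's learned encoders, the passages, `doc_id` values, and prompt wording are hypothetical, and the call to the LLM is left out entirely. What it does share with DPR is the scoring rule, an inner product between the question vector and each passage vector.

```python
import re
from collections import Counter

def embed(text):
    """Stand-in for a learned DPR encoder: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def inner_product(a, b):
    """Inner product of two sparse vectors stored as Counters."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)

def retrieve(question, passages, k=2):
    """Return the k passages whose vectors score highest against the question."""
    q_vec = embed(question)
    ranked = sorted(
        passages,
        key=lambda p: inner_product(q_vec, embed(p["text"])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, retrieved):
    """Assemble a prompt that tells the model to answer only from the passages."""
    context = "\n".join(
        f"[{i}] ({p['doc_id']}) {p['text']}"
        for i, p in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using only the numbered passages below. "
        "Cite passages as [n]. If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical sample data, for illustration only.
PASSAGES = [
    {"doc_id": "doc-a", "text": "Thermal expansion and melting ice sheets drive global sea level rise."},
    {"doc_id": "doc-b", "text": "Basalt forms when lava cools rapidly at the surface."},
    {"doc_id": "doc-c", "text": "Sea level records from tide gauges extend back over a century."},
]

question = "What drives sea level rise?"
top_passages = retrieve(question, PASSAGES, k=2)
prompt = build_prompt(question, top_passages)
```

The resulting `prompt` string is what would be sent to the chat model. In a real deployment the encoder would be a trained model and the passage vectors would be precomputed and stored in a vector index rather than recomputed per query.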

Constraints

  • Must handle 17M+ full-text articles across diverse domains
  • Answers must be grounded in retrieved passages, not LLM knowledge
  • System must scale across geoscience, climate change, and COVID-19 topics
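One way the grounding constraint could be checked after generation is sketched below. It assumes, purely for illustration, that answers cite passages with bracketed numbers such as `[1]`; the function names and the citation format are hypothetical and are not taken from the project. The check only confirms that every citation points at a passage that was actually retrieved, not that the cited passage supports the claim.

```python
import re

def cited_indices(answer):
    """Collect the passage numbers an answer cites, e.g. "[2]" -> 2."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}

def is_grounded(answer, num_passages):
    """True if the answer cites at least one passage and every citation
    refers to one of the num_passages passages that were retrieved."""
    cited = cited_indices(answer)
    return bool(cited) and all(1 <= i <= num_passages for i in cited)
```

An answer that fails this check could be regenerated or replaced with a message saying that the retrieved literature does not cover the question.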

Key Decisions

RAG over fine-tuned LLM

Because the corpus of 17M+ articles keeps growing, retrieval-augmented generation keeps answers grounded in source material without expensive retraining. It also provides transparent citations, so each answer can point to the passages it draws on.

Alternatives: Fine-tuned domain-specific LLM · Traditional keyword search with summarization

Dual retrieval backend with Weaviate and ElasticSearch

Weaviate handles semantic vector search while ElasticSearch provides keyword matching. Combining both ensures recall across different query types.
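The write-up does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method Ask-xDD uses. Each document scores `1 / (k + rank)` in every list it appears in, and the scores are summed; the document IDs and the constant `k=60` are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs using reciprocal rank fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant; larger values flatten the rank differences.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists from the vector backend and the keyword backend.
semantic_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

RRF works on ranks alone, so it avoids having to put vector similarity scores and keyword relevance scores on a common scale.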

Result & Impact

  • 17M+ articles indexed

Made academic literature conversationally accessible to non-experts, with initial coverage spanning geoscience, climate change, and COVID-19. Funded by DARPA and USGS.

Learnings

  • Hybrid retrieval (semantic + keyword) significantly outperforms either approach alone on diverse academic queries.
  • Grounding LLM responses in retrieved passages is essential for trust: users need to see where answers come from.