Ongoing

LLM-Assisted Meta-Analysis in Agriscience

Data Scientist & Developer · 2026 · 6 months · 7 people

Built a multi-agent LLM pipeline to replicate an agronomic meta-analysis, achieving >99% screening recall while revealing fundamental limitations in automated quantitative data extraction.

GPT-5.1 · Gemini 3 Pro Preview · Python · R · metafor · Landing.ai

Overview

Developed a five-agent LLM pipeline that automates key steps of systematic review and meta-analysis—from study screening to effect-size extraction—using GPT-5.1 and Gemini 3 Pro Preview with structured human-in-the-loop verification.

Problem

Systematic reviews and meta-analyses in agriscience are labor-intensive, requiring manual screening, data extraction, and synthesis across dozens of studies with non-standardized reporting of statistics, yield data, and experimental designs.

Approach

Designed a modular five-agent pipeline: metadata extraction, study design analysis, statistical mining, effect-size calculation, and model-specific filtering. Each agent was run in parallel on two LLMs (GPT-5.1 and Gemini 3 Pro Preview), with outputs compared and discrepancies flagged for expert review. PDFs were converted to structured Markdown using Landing.ai's document extraction. Final meta-analysis was conducted in R using the metafor package.
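The staged, dual-model flow described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the stage names come from the description, while `ModelFn`, `StageResult`, `run_pipeline`, and the exact-match comparison are hypothetical choices made for the example. Real model calls would replace the plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Stage names follow the five agents listed above; all other names are hypothetical.
STAGES = [
    "metadata_extraction",
    "study_design_analysis",
    "statistical_mining",
    "effect_size_calculation",
    "model_specific_filtering",
]

# A "model" is any callable taking (stage, document_markdown) and returning a dict.
ModelFn = Callable[[str, str], Dict[str, object]]


@dataclass
class StageResult:
    stage: str
    outputs: Dict[str, Dict[str, object]]  # model name -> structured output
    needs_review: bool                     # True when the models disagree


def run_pipeline(markdown: str, models: Dict[str, ModelFn]) -> List[StageResult]:
    """Run every stage on every model and flag stages where outputs differ."""
    results = []
    for stage in STAGES:
        outputs = {name: fn(stage, markdown) for name, fn in models.items()}
        # Assumption: exact equality of the structured outputs counts as agreement.
        distinct = {repr(sorted(out.items())) for out in outputs.values()}
        results.append(StageResult(stage, outputs, needs_review=len(distinct) > 1))
    return results
```

Keeping each stage's outputs per model makes it possible to see which agent and which model introduced a discrepancy before it reaches the next stage.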

Constraints

  • Extracted data must match human expert quality—errors in effect sizes directly affect downstream inference
  • Non-standardized reporting across agriscience journals makes automated extraction unreliable for variance components and sample sizes
  • Human-in-the-loop verification required at every pipeline stage to prevent hallucination propagation
  • Pipeline must reproduce a published meta-analysis as ground truth benchmark

Key Decisions

Five specialized agents instead of a single monolithic prompt

Modular design lets each agent focus on a well-defined role—analogous to a research team—reducing prompt complexity and making errors easier to trace and fix.

Alternatives: Single-agent extraction with one large prompt · Two-stage pipeline (screening + extraction only)

Dual-LLM consensus with human arbitration

Running GPT-5.1 and Gemini 3 Pro Preview in parallel on identical tasks surfaces disagreements automatically. Cases where the two outputs diverge by more than 1% are flagged for human review, catching hallucinations that a single-model run would let through unnoticed.

Alternatives: Single LLM with manual spot-checking · Three-model voting without human review
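One way the >1% divergence rule for numeric extractions could be expressed is shown below. How divergence was actually measured is not stated here, so the symmetric relative difference, the treatment of missing values, and the function names are all assumptions made for illustration.

```python
from typing import Optional


def relative_divergence(a: float, b: float) -> float:
    """Symmetric relative difference between two numeric extractions (assumed metric)."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def needs_human_review(
    a: Optional[float], b: Optional[float], threshold: float = 0.01
) -> bool:
    """Flag a pair of model outputs for expert arbitration."""
    # Assumption: a value missing from either model is always sent to a human.
    if a is None or b is None:
        return True
    return relative_divergence(a, b) > threshold
```

For example, extractions of 4.20 and 4.21 differ by well under 1% and would pass, while 4.2 against 4.5 would be flagged.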

Structured Markdown intermediary format instead of raw PDF parsing

Landing.ai's document extraction preserves table structures, figure annotations, and statistical groupings in Markdown, enabling more reliable downstream extraction than raw text.

Alternatives: Direct PDF-to-text with pdftotext · OCR-based extraction
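To illustrate why preserved table structure helps, the sketch below turns a pipe-delimited Markdown table into row dictionaries that a downstream agent could consume. It assumes a plain pipe table with a header and separator row; the exact Markdown that Landing.ai emits is not shown here, and `parse_markdown_table` plus the sample values are invented for the example.

```python
from typing import Dict, List


def _cells(line: str) -> List[str]:
    """Split one table row into trimmed cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_markdown_table(block: str) -> List[Dict[str, str]]:
    """Parse a simple Markdown pipe table into one dict per data row."""
    lines = [ln for ln in block.strip().splitlines() if ln.strip()]
    header = _cells(lines[0])
    # lines[1] is the |---|---| separator row and carries no data.
    return [dict(zip(header, _cells(ln))) for ln in lines[2:]]


# Hypothetical table; the numbers are placeholders, not data from the study.
SAMPLE = """
| Treatment | Mean yield (t/ha) | SD | n |
|---|---|---|---|
| Control | 4.2 | 0.5 | 4 |
| Treated | 4.9 |  | 4 |
"""

rows = parse_markdown_table(SAMPLE)
missing_sd = [r["Treatment"] for r in rows if r["SD"] == ""]
```

With the columns kept aligned, an empty SD cell shows up as an explicit gap in a named field instead of silently shifting the remaining values, which is what tends to happen with flattened text.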

Result & Impact

  • >99%
    Screening Recall
  • 28.5%
    Manual Effort Reduction
  • 293 of 671
    Effect Sizes Recovered (HITL)

Demonstrated that LLMs can reliably automate study screening but struggle with quantitative data extraction in agriscience due to non-standardized reporting. Findings advocate for 'AI-Ready' reporting formats in agricultural publications.

Learnings

  • LLMs excel at structured screening tasks but struggle with quantitative extraction when primary studies report statistics inconsistently: missing standard deviations (SDs) accounted for 57% of extraction failures.
  • A multi-agent architecture with dual-LLM consensus catches errors that single-model pipelines miss, but 71.5% of extraction cases still required human review.
  • The bottleneck isn't the model—it's the source material. Standardized, machine-readable reporting in primary studies would unlock far greater automation gains.
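The first learning above, on missing SDs, can be made concrete with a small sketch. It assumes the effect size is the log response ratio, a common choice in agronomic meta-analysis (available in metafor's `escalc` as `measure="ROM"`); the write-up does not say which measure the replicated study used, and the function name and inputs are hypothetical.

```python
import math
from typing import Optional, Tuple


def log_response_ratio(
    mean_t: float, sd_t: Optional[float], n_t: int,
    mean_c: float, sd_c: Optional[float], n_c: int,
) -> Tuple[float, Optional[float]]:
    """Return (effect size, sampling variance) for treatment vs. control.

    The point estimate needs only the two means. The sampling variance
    needs both SDs, so it is returned as None when either one is missing.
    """
    yi = math.log(mean_t / mean_c)
    if sd_t is None or sd_c is None:
        return yi, None
    vi = sd_t**2 / (n_t * mean_t**2) + sd_c**2 / (n_c * mean_c**2)
    return yi, vi
```

Without a sampling variance a study cannot be given an inverse-variance weight, so under this assumption an effect size with a missing SD is effectively unusable unless the SD is imputed or recovered by a human reviewer.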