ProGov21: Automated Policy Data Extraction Using AI
Built an AI-powered pipeline that automates policy document extraction and labeling for ProGov21.org, replacing a fully manual workflow with a semi-automated, human-in-the-loop system.
Overview
An automated data pipeline that crawls, extracts, labels, and publishes policy documents on ProGov21.org. The system uses LLMs to extract structured metadata from policy PDFs, with human review for quality assurance before publication through a RESTful API.
Problem
ProGov21.org hosts a digital library of local, state, and federal policy documents. Every document had to be reviewed and labeled by hand with structured metadata, a time-consuming bottleneck that limited how quickly new policy information could reach the public.
Approach
Built a multi-stage pipeline: a web crawler discovers and downloads policy PDFs from partner sites, an LLM-powered labeling step extracts structured metadata, a human-in-the-loop review interface enables quality assurance, and a RESTful API serves validated data to the ProGov21.org frontend.
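One way to picture the stages above is as a small state machine that each document moves through. The sketch below is illustrative only: the `Stage` names, the `PolicyDocument` class, and the `advance` helper are hypothetical and are not taken from the project's codebase.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    CRAWLED = "crawled"      # PDF discovered and downloaded by the crawler
    LABELED = "labeled"      # metadata proposed by the LLM labeling step
    APPROVED = "approved"    # accepted by a human reviewer
    PUBLISHED = "published"  # served through the API


# Each stage has exactly one successor; PUBLISHED is terminal.
NEXT_STAGE = {
    Stage.CRAWLED: Stage.LABELED,
    Stage.LABELED: Stage.APPROVED,
    Stage.APPROVED: Stage.PUBLISHED,
}


@dataclass
class PolicyDocument:
    url: str
    stage: Stage = Stage.CRAWLED
    metadata: dict = field(default_factory=dict)


def advance(doc: PolicyDocument) -> PolicyDocument:
    """Move a document forward by exactly one stage."""
    if doc.stage not in NEXT_STAGE:
        raise ValueError(f"{doc.url} is already published")
    doc.stage = NEXT_STAGE[doc.stage]
    return doc
```

Because `advance` only ever moves one step, a document in this sketch cannot reach `PUBLISHED` without first passing through `APPROVED`.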
Constraints
- Policy PDFs come from diverse sources with inconsistent formatting
- Extracted metadata must map to a pre-defined schema (title, date, jurisdiction, issue area, summary)
- Automated labeling must be accurate enough for human reviewers to validate efficiently
- System must run as a scheduled job with minimal manual intervention
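The schema constraint can be made concrete with a small record type and a validator. This is a minimal sketch that assumes ISO 8601 dates and a jurisdiction field limited to local, state, or federal. Neither assumption is confirmed by the project, and `PolicyMetadata` and `validation_errors` are hypothetical names.

```python
import datetime
from dataclasses import dataclass

# Assumed jurisdiction levels, based on the library's local/state/federal scope.
JURISDICTION_LEVELS = {"local", "state", "federal"}


@dataclass(frozen=True)
class PolicyMetadata:
    title: str
    date: str          # assumed ISO 8601, e.g. "2021-03-15"
    jurisdiction: str  # assumed to be one of JURISDICTION_LEVELS
    issue_area: str
    summary: str


def validation_errors(meta: PolicyMetadata) -> list[str]:
    """Return human-readable problems; an empty list means the record is valid."""
    errors = []
    for name in ("title", "issue_area", "summary"):
        if not getattr(meta, name).strip():
            errors.append(f"{name} is empty")
    try:
        datetime.date.fromisoformat(meta.date)
    except ValueError:
        errors.append(f"date is not ISO 8601: {meta.date!r}")
    if meta.jurisdiction not in JURISDICTION_LEVELS:
        errors.append(f"unknown jurisdiction: {meta.jurisdiction!r}")
    return errors
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with a record at once.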
Key Decisions
LLM-based extraction over rule-based parsing
Policy documents vary widely in format and structure. LLMs handle this diversity far better than brittle rule-based parsers, and can extract semantic fields like summaries that rules cannot.
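A rough sketch of what an LLM labeling step could look like follows. It does not reflect the project's actual prompt or model provider: `complete` is a hypothetical stand-in for whichever LLM client is used, and the JSON reply format is an assumption.

```python
import json
from typing import Callable

SCHEMA_FIELDS = ("title", "date", "jurisdiction", "issue_area", "summary")


def build_prompt(document_text: str) -> str:
    """Compose an extraction prompt that asks for one JSON object."""
    fields = ", ".join(SCHEMA_FIELDS)
    return (
        "Extract the following fields from the policy document below and "
        f"reply with a single JSON object with exactly these keys: {fields}. "
        "Use null for any field the document does not state.\n\n"
        f"DOCUMENT:\n{document_text}"
    )


def extract_metadata(document_text: str, complete: Callable[[str], str]) -> dict:
    """Ask the model for metadata and keep only the schema fields.

    `complete` takes a prompt string and returns the model's text reply.
    """
    reply = complete(build_prompt(document_text))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    # Missing or unparseable fields become None so a reviewer can fill them in.
    return {name: parsed.get(name) for name in SCHEMA_FIELDS}
```

Passing the model call in as a function keeps the parsing logic testable without network access and makes it straightforward to swap providers.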
Human-in-the-loop review before publication
Policy data accuracy is critical for public trust. Automated labeling reduces the workload, while a human validation step catches the edge cases the model misses before anything is published.
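The review gate can be sketched as a queue in which nothing is published until a reviewer approves it, and reviewer corrections take precedence over model output. The `ReviewItem` class, the status values, and both functions below are hypothetical illustrations, not the project's review interface.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReviewItem:
    doc_id: str
    proposed: dict            # metadata suggested by the labeling step
    status: str = "pending"   # "pending", "approved", or "rejected"
    final: dict = field(default_factory=dict)


def review(item: ReviewItem, approve: bool,
           corrections: Optional[dict] = None) -> ReviewItem:
    """Record a reviewer's decision; corrections override model output."""
    if item.status != "pending":
        raise ValueError(f"{item.doc_id} was already reviewed")
    if approve:
        item.final = {**item.proposed, **(corrections or {})}
        item.status = "approved"
    else:
        item.status = "rejected"
    return item


def publishable(items: list[ReviewItem]) -> list[dict]:
    """Only approved items contribute to the published dataset."""
    return [item.final for item in items if item.status == "approved"]
```

Keeping `proposed` and `final` separate also preserves a record of what the model suggested versus what the reviewer accepted, which could later be used to measure labeling accuracy.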
RESTful API for frontend integration
Serving validated data through an API decouples the frontend from the crawl/label pipeline, so the policy library can support faceted search and flexible querying regardless of how documents are ingested.
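The faceted querying an API endpoint would need can be sketched as two pure functions. This sketch assumes records are plain dictionaries and that the facets are issue area, state, and year. The key names and function names are hypothetical, and the actual API routes and storage layer are not described in this write-up.

```python
from collections import Counter

# Assumed facet keys on each published record.
FACETS = ("issue_area", "state", "year")


def search(records: list[dict], **filters) -> list[dict]:
    """Return records matching every supplied facet value."""
    unknown = set(filters) - set(FACETS)
    if unknown:
        raise ValueError(f"unsupported facets: {sorted(unknown)}")
    # A filter left as None means "do not restrict on this facet".
    active = {k: v for k, v in filters.items() if v is not None}
    return [r for r in records if all(r.get(k) == v for k, v in active.items())]


def facet_counts(records: list[dict], facet: str) -> dict:
    """Count how many records fall under each value of one facet."""
    return dict(Counter(r[facet] for r in records if r.get(facet) is not None))
```

A handler could call `search` with the query-string parameters and then `facet_counts` on the result to show how many documents remain under each filter.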
Result & Impact
Drastically reduced the manual workload for processing policy documents while maintaining data quality. Policy information on ProGov21.org is now easier to find and browse by issue area, state, and year, helping the public stay informed on local, state, and federal policies.
Learnings
- Human-in-the-loop design is worth the overhead when data accuracy directly affects public trust.
- LLMs excel at extracting structured data from messy, inconsistent document formats where rule-based approaches would be brittle.
- Scheduling crawlers as cron jobs keeps the system simple and maintainable for a small team.
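For a cron-driven crawler to need little manual intervention, each run should be safe to repeat. The sketch below shows one way to do that by remembering which URLs earlier runs have already handled. The state file, the function names, and the crontab line in the comment are all assumptions rather than details from the project.

```python
import json
from pathlib import Path

# Example crontab entry (assumed schedule and paths), running nightly at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/pipeline/run_crawl.py


def load_seen(state_file: Path) -> set[str]:
    """Read the set of URLs handled by earlier runs, if any."""
    if state_file.exists():
        return set(json.loads(state_file.read_text()))
    return set()


def run_once(discovered_urls: list[str], state_file: Path) -> list[str]:
    """Return URLs not seen in earlier runs and record them as seen."""
    seen = load_seen(state_file)
    # dict.fromkeys removes duplicates while preserving discovery order.
    new_urls = [u for u in dict.fromkeys(discovered_urls) if u not in seen]
    state_file.write_text(json.dumps(sorted(seen | set(new_urls))))
    return new_urls
```

With this shape, a missed or repeated cron run does not cause the same document to be downloaded or labeled twice.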
“The DSI team were excellent collaborators. Their help with designing an interface capable of scraping data with LLM-assistance for our digital policy library was a huge service. Their team was responsive, creative, and thoughtful in managing the project and its development. I’d recommend their work to anyone looking to implement similar cutting-edge tools.” — Project collaborator, High Road Strategy Center