Research Paper • June 2025

Automation of Systematic Reviews with Large Language Models

Introducing otto-SR: An end-to-end agentic workflow that achieves superhuman performance in systematic review automation, completing 12 work-years of research in just 2 days.

96.7% Screening Sensitivity (vs 81.7% human performance)

93.1% Data Extraction Accuracy (vs 79.7% human performance)

2 Days to Complete a Cochrane Issue (12 reviews, ~12 work-years)

146K Citations Processed (across 12 systematic reviews)

Revolutionizing Evidence Synthesis

Systematic reviews are the foundation of evidence-based medicine, but they typically take over 16 months and cost $100,000+ to complete. otto-SR changes that.

The Challenge

Time-Intensive Process

Traditional systematic reviews take 16+ months to complete

Prone to Human Error

Dual human screening varies considerably between reviewers and misses eligible studies

Resource Intensive

Costs upwards of $100,000 and requires specialized expertise

The Solution

AI-Powered Automation

GPT-4.1 for screening, o3-mini-high for data extraction

Superhuman Accuracy

Exceeds dual human reviewers in screening sensitivity (96.7% vs 81.7%) and data extraction accuracy (93.1% vs 79.7%)

Rapid Processing

Complete systematic reviews in days, not months

How otto-SR Works

An end-to-end agentic workflow supporting both fully automated and human-in-the-loop systematic reviews

1. Literature Search
Comprehensive search across databases to capture all potentially relevant citations
  • RIS format upload
  • Multiple database support
  • Automated deduplication
2. AI Screening
GPT-4.1-powered screening agent for abstract and full-text review
  • 96.7% sensitivity
  • 97.9% specificity
  • PDF to Markdown conversion
3. Data Extraction
o3-mini-high model for precise data extraction and analysis
  • 93.1% accuracy
  • Structured data output
  • Meta-analysis ready
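
The "RIS format upload" and "Automated deduplication" items under step 1 can be illustrated with a minimal sketch. This is not otto-SR's implementation, which the page does not describe. The function names (`parse_ris`, `dedup_key`, `deduplicate`) and the sample records are hypothetical, and the matching rule (DOI first, normalized title as a fallback) is an assumption chosen for simplicity. Only the RIS tags themselves (`TY`, `TI`, `T1`, `DO`, `ER`) come from the RIS format.

```python
import re

# Hypothetical sample export: three records, two of which share a DOI.
SAMPLE = """TY  - JOUR
TI  - Aspirin for primary prevention
DO  - 10.1000/example.1
ER  - 
TY  - JOUR
TI  - Aspirin for Primary Prevention.
DO  - 10.1000/EXAMPLE.1
ER  - 
TY  - JOUR
TI  - Statins in older adults
ER  - 
"""

def parse_ris(text):
    """Parse RIS text into a list of dicts mapping tag -> list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z][A-Z0-9])  - ?(.*)$", line)
        if not match:
            continue  # this sketch ignores blank and continuation lines
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # end-of-record marker
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records

def dedup_key(record):
    """Assumed rule: prefer the DOI, else fall back to a normalized title."""
    doi = record.get("DO", [""])[0].lower()
    if doi:
        return ("doi", doi)
    title = (record.get("TI") or record.get("T1") or [""])[0]
    return ("title", re.sub(r"[^a-z0-9]+", " ", title.lower()).strip())

def deduplicate(records):
    """Keep the first record seen for each key; keep records with no key."""
    seen, unique = set(), []
    for record in records:
        key = dedup_key(record)
        if key[1] and key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique

records = parse_ris(SAMPLE)
unique = deduplicate(records)
print(len(records), "records parsed,", len(unique), "kept after deduplication")
```

A rule this simple misses duplicates where only one copy carries a DOI. A production workflow would likely also compare authors, year, and journal.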

Breakthrough Results

otto-SR demonstrated superhuman performance across multiple systematic review tasks

Cochrane Reproducibility Study
Reproduced and updated an entire issue of Cochrane reviews (n=12) in under 2 days
  • Studies correctly identified: 64/64 (100%)
  • Median studies incorrectly excluded: 0 (IQR 0-0.25)
  • Additional eligible studies found: 54 studies
  • New statistically significant findings: 2 reviews
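
To show how a summary such as "0 (IQR 0-0.25)" is read, the sketch below computes a median and interquartile range over per-review counts. The counts are hypothetical placeholders, not the study's data, and were chosen only so that the summary has the same shape. The quartile method (linear interpolation, `method="inclusive"`) is also an assumption, since the page does not say which one was used.

```python
import statistics

# HYPOTHETICAL counts of incorrectly excluded studies for 12 reviews.
# These values are illustrative only and are not taken from the paper.
missed_per_review = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2]

median = statistics.median(missed_per_review)

# Quartiles by linear interpolation between order statistics (assumed method).
q1, _, q3 = statistics.quantiles(missed_per_review, n=4, method="inclusive")

print(f"median = {median}, IQR = {q1} to {q3}")
```

With twelve reviews, an upper quartile as low as 0.25 means most reviews had no incorrectly excluded studies at all.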
Performance Comparison
otto-SR vs traditional dual human reviewers across key metrics
  • Screening sensitivity: 96.7% (otto-SR) vs 81.7% (human)
  • Data extraction accuracy: 93.1% (otto-SR) vs 79.7% (human)
  • Processing time: 2 days (otto-SR) vs 12 work-years (traditional)
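
For readers less familiar with the screening metrics, the sketch below gives the standard definitions. Sensitivity is the share of truly eligible studies that were included, and specificity is the share of truly ineligible studies that were excluded. The function name `screening_metrics` and the toy decisions are hypothetical and do not reproduce the reported figures.

```python
def screening_metrics(predicted, truth):
    """Return (sensitivity, specificity) for parallel lists of booleans.

    predicted[i] is True if the screener included citation i;
    truth[i] is True if citation i is actually eligible.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)          # eligible, included
    fn = sum(1 for p, t in pairs if not p and t)      # eligible, missed
    tn = sum(1 for p, t in pairs if not p and not t)  # ineligible, excluded
    fp = sum(1 for p, t in pairs if p and not t)      # ineligible, included
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Toy example (hypothetical): 3 eligible and 5 ineligible citations.
truth = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, False, False, False, False, True]

sens, spec = screening_metrics(predicted, truth)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

In a systematic review, sensitivity is usually the metric to protect, because a missed eligible study cannot be recovered later in the pipeline, whereas a wrongly included one can still be excluded at full-text review.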

Research Team

A collaborative effort across leading institutions worldwide

Lead Authors

Christian Cao - University of Toronto

Rohit Arora - Harvard Medical School

Paul Cento - Independent Researcher

Key Contributors

Niklas Bobrovitz - University of Calgary

George Church - Harvard Medical School

David Moher - University of Ottawa

Institutions

University of Toronto

Harvard Medical School

University of Calgary

MIT

McGill University

+ 12 more institutions

Transform Your Research Process

otto-SR represents a major advancement in systematic review automation. Join the future of evidence synthesis.