arXiv Preprint

DeepEvidence

Deep Knowledge Graph Research Agent

DeepEvidence is a hierarchical multi-agent system for comprehensive biomedical literature research and evidence synthesis. It leverages deep knowledge graph exploration to systematically gather, analyze, and synthesize evidence from multiple biomedical knowledge bases.

Multi-Agent Architecture

Orchestrator Agent

Coordinates research strategy, decides which knowledge bases to explore, and synthesizes findings.

BFRS Agent

Breadth-First ReSearch (BFRS) of knowledge graphs to discover related concepts and broad connections.

DFRS Agent

Depth-First ReSearch (DFRS) of specific knowledge paths to extract detailed information.
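
The difference between the two sub-agent strategies can be pictured with a small graph-traversal sketch. This is only an illustration of breadth-first versus depth-first exploration in general, not the DeepEvidence implementation; the graph `TOY_GRAPH`, its node names, and both function names are hypothetical.

```python
from collections import deque

# Toy knowledge graph: every node and edge is a made-up placeholder.
TOY_GRAPH = {
    "disease": ["gene_a", "gene_b"],
    "gene_a": ["pathway_x", "drug_1"],
    "gene_b": ["pathway_x"],
    "pathway_x": ["drug_2"],
    "drug_1": [],
    "drug_2": [],
}


def breadth_first(graph, start, max_hops):
    """Return all nodes within max_hops of start, nearest first."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        order.append(node)
        if hops == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return order


def depth_first_paths(graph, start, goal, path=None):
    """Yield every cycle-free path from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for neighbor in graph.get(start, []):
        if neighbor not in path:
            yield from depth_first_paths(graph, neighbor, goal, path)


broad = breadth_first(TOY_GRAPH, "disease", max_hops=1)
deep = list(depth_first_paths(TOY_GRAPH, "disease", "drug_2"))
```

Here `broad` collects the immediate neighbourhood of the starting concept, while `deep` follows individual chains of relations all the way to a specific target node.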

Unified Knowledge Graph APIs

DeepEvidence integrates 15+ biomedical APIs across multiple domains for comprehensive research.

Literature & Publications

PubMed, PubTator, UMLS

Genes & Proteins

MyGene, UniProt, Gene Ontology, MyVariants

Drugs & Chemicals

PubChem, MyChem, OpenFDA

Diseases & Phenotypes

MyDisease, OpenTargets

Pathways & Reactions

Reactome, KEGG

Clinical Data

ClinicalTrials.gov

Evidence Graph

DeepEvidence builds a persistent knowledge graph during research that captures entities (papers, genes, diseases, drugs) and their relationships.

Accumulates knowledge across search rounds
Enables retrieval of previously discovered information
Supports iterative refinement of research questions
Exports to interactive HTML/PDF visualizations

python

# Export the evidence graph as an interactive HTML page
results.export_evidence_graph_html("evidence_graph.html")

# Access the discovered entities and relations
entities = results.evidence_graph_data['entities']
relations = results.evidence_graph_data['relations']
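
How a graph can accumulate knowledge across search rounds is sketched below. This is a minimal in-memory model under stated assumptions: the class `EvidenceGraph`, its methods, the identifier format, and the example entities are all hypothetical and are not taken from the DeepEvidence code base.

```python
class EvidenceGraph:
    """Hypothetical in-memory store; not the DeepEvidence implementation."""

    def __init__(self):
        self.entities = {}      # entity id -> attribute dict
        self.relations = set()  # (source id, relation type, target id)

    def add_round(self, entities, relations):
        """Merge one search round and return how many entities were new."""
        new_count = 0
        for entity_id, attrs in entities.items():
            if entity_id not in self.entities:
                new_count += 1
                self.entities[entity_id] = {}
            # Attributes found in later rounds enrich the existing entry.
            self.entities[entity_id].update(attrs)
        self.relations.update(relations)
        return new_count

    def neighbors(self, entity_id):
        """Retrieve previously stored relations that touch an entity."""
        return sorted(r for r in self.relations if entity_id in (r[0], r[2]))


graph = EvidenceGraph()
# Round 1: placeholder gene and disease with one relation.
first = graph.add_round(
    {"gene:G1": {"type": "gene"}, "disease:D1": {"type": "disease"}},
    {("gene:G1", "associated_with", "disease:D1")},
)
# Round 2: the gene is seen again and a new drug appears.
second = graph.add_round(
    {"gene:G1": {"symbol": "G1"}, "drug:X1": {"type": "drug"}},
    {("drug:X1", "targets", "gene:G1")},
)
```

Keying entities by identifier means a repeated discovery enriches the existing node instead of duplicating it, which is one simple way to support retrieval and iterative refinement.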

Evidence Graph Exploration

[Interactive visualization: DeepEvidence iteratively builds a knowledge graph through systematic exploration, shown step by step across the 6 phases of target identification. Each step highlights the nodes and edges newly added to the graph.]

Benchmark Results

DeepEvidence achieves the highest score among the compared systems (Biomni, Sonnet-4.5, ToolUniverse, and GPT-5) on all four biomedical research benchmarks reported here.

DeepEvidence score on each benchmark, with its ratio to the GPT-5 score:

HLE-Medicine: 40% (12× GPT-5)
LabBench-LitQA2: 80% (1.7× GPT-5)
SuperGPQA-Hard: 47% (1.6× GPT-5)
TrialPanorama: 96% (1.3× GPT-5)

HLE-Medicine (hard medicine questions)

DeepEvidence: 40%
Biomni: 20%
Sonnet-4.5: 10%
ToolUniverse: 6.7%
GPT-5: 3.3%

LabBench-LitQA2 (literature QA)

DeepEvidence: 80%
GPT-5: 48%
Sonnet-4.5: 40%
Biomni: 32%
ToolUniverse: 24%

SuperGPQA-Hard (expert medicine questions)

DeepEvidence: 47.1%
Sonnet-4.5: 43.6%
Biomni: 40.7%
ToolUniverse: 37.2%
GPT-5: 29.4%

TrialPanorama (evidence synthesis)

DeepEvidence: 96%
Sonnet-4.5: 88%
Biomni: 84%
ToolUniverse: 73.7%
GPT-5: 72%
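
The headline multipliers relative to GPT-5 appear to be simple ratios of the reported scores. The short check below recomputes them from the DeepEvidence and GPT-5 figures listed above; the names `SCORES` and `ratio_vs_gpt5` are illustrative only.

```python
# Scores copied from the benchmark results above (percent).
SCORES = {
    "HLE-Medicine": {"DeepEvidence": 40.0, "GPT-5": 3.3},
    "LabBench-LitQA2": {"DeepEvidence": 80.0, "GPT-5": 48.0},
    "SuperGPQA-Hard": {"DeepEvidence": 47.1, "GPT-5": 29.4},
    "TrialPanorama": {"DeepEvidence": 96.0, "GPT-5": 72.0},
}


def ratio_vs_gpt5(benchmark):
    """Ratio of the DeepEvidence score to the GPT-5 score, to one decimal."""
    row = SCORES[benchmark]
    return round(row["DeepEvidence"] / row["GPT-5"], 1)


RATIOS = {name: ratio_vs_gpt5(name) for name in SCORES}
```

The recomputed ratios agree with the reported multipliers once rounded. The largest one reflects the very low GPT-5 baseline on HLE-Medicine rather than a high absolute score.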

DeepEvidence Benchmark

Seven types of knowledge-graph-driven deep research tasks, 159 tasks in total, spanning the biomedical discovery pipeline.

View on HuggingFace

Target Identification

25 tasks

Identify therapeutic targets for diseases by integrating gene-disease associations and pathway evidence.

MoA Pathway Reasoning

25 tasks

Multi-hop mechanistic reasoning to explain molecular perturbation propagation through pathways.

Metabolic Flux Response

25 tasks

Predict metabolic flux suppression in pre-clinical models based on pathway dependencies.

Drug Regimen Design

25 tasks

Design drug dosing regimens considering pharmacological and clinical factors.

Surrogate Endpoint

14 tasks

Identify plausible surrogate endpoints that reflect downstream clinical outcomes.

Sample Size Estimation

25 tasks

Estimate clinical trial sample sizes under design assumptions and outcome constraints.

Evidence Gap Discovery

20 tasks

Identify missing, weak, or conflicting evidence across biomedical knowledge sources.

Total tasks: 159
Task types: 7
APIs integrated: 15+
Reasoning required: multi-hop
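
As a quick consistency check, the per-type task counts listed above can be summed to confirm the stated total. The dictionary name `TASK_COUNTS` is just a label for this sketch.

```python
# Task counts as listed for each task type above.
TASK_COUNTS = {
    "Target Identification": 25,
    "MoA Pathway Reasoning": 25,
    "Metabolic Flux Response": 25,
    "Drug Regimen Design": 25,
    "Surrogate Endpoint": 14,
    "Sample Size Estimation": 25,
    "Evidence Gap Discovery": 20,
}

total_tasks = sum(TASK_COUNTS.values())
task_types = len(TASK_COUNTS)
```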

Example Discovery Tasks

DeepEvidence tackles complex biomedical research questions requiring multi-hop reasoning across knowledge bases.

Target Identification

Therapeutic target discovery for inflammatory diseases

User Instruction

Which genes are most promising to be effective therapeutic targets for modulating the inflammatory response in Ulcerative Colitis?

Agent Execution Plan

Identify candidate gene targets using disease-gene association databases (OpenTargets, DisGeNET).
Filter for genes involved in inflammatory pathways (NF-κB, JAK-STAT, IL-6 signaling).
Prioritize druggable targets with existing small molecule or antibody modulators.
Cross-reference with clinical trial data for IBD therapeutics.
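
The plan above is essentially a filter-then-rank pipeline. The sketch below mimics its shape on placeholder data only: the `Candidate` fields, the scoring rule, and the gene names `GENE_A` to `GENE_D` are hypothetical and carry no biological meaning, and the real agent plans and queries knowledge bases dynamically rather than running fixed code like this.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str
    association_score: float  # placeholder disease-gene score in [0, 1]
    pathways: frozenset       # pathways the gene participates in
    druggable: bool           # has a known small molecule or antibody modulator
    trial_count: int          # number of related clinical trials


INFLAMMATORY_PATHWAYS = {"NF-κB", "JAK-STAT", "IL-6 signaling"}


def prioritize(candidates, pathways=INFLAMMATORY_PATHWAYS):
    """Keep druggable candidates in the given pathways, best first."""
    in_pathway = [c for c in candidates if c.pathways & pathways]
    druggable = [c for c in in_pathway if c.druggable]
    # Candidates with clinical trial support rank first, then by score.
    return sorted(
        druggable,
        key=lambda c: (c.trial_count > 0, c.association_score),
        reverse=True,
    )


# Made-up candidates used purely to exercise the pipeline.
CANDIDATES = [
    Candidate("GENE_A", 0.90, frozenset({"JAK-STAT"}), True, 3),
    Candidate("GENE_B", 0.95, frozenset({"NF-κB"}), False, 0),
    Candidate("GENE_C", 0.70, frozenset({"lipid metabolism"}), True, 1),
    Candidate("GENE_D", 0.80, frozenset({"IL-6 signaling"}), True, 0),
]

ranked = [c.gene for c in prioritize(CANDIDATES)]
```

`GENE_B` is dropped because it is not druggable and `GENE_C` because it lies outside the inflammatory pathways, leaving `GENE_A` ahead of `GENE_D` on trial support.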

Quick Start

python

import os

from biodsa.agents import DeepEvidenceAgent

# Initialize the agent
agent = DeepEvidenceAgent(
    model_name="gpt-5",
    api_type="azure",
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    model_kwargs={
        "max_completion_tokens": 5000,
        "reasoning_effort": "minimal",
    },
    subagent_action_rounds_budget=5,  # action rounds for sub research agents
    main_search_rounds_budget=2,      # search rounds for main orchestrator
    main_action_rounds_budget=15,     # action rounds for main orchestrator
    light_mode=False,                 # use full memory graph
    llm_timeout=120,
)

# Run the agent
execution_results = agent.go(
    "Summarize the cutting-edge immunotherapy drugs that are in late-phase "
    "clinical trials or have been approved for NSCLC.",
    knowledge_bases=[
        "pubmed_papers", "clinical_trials", "drug", "disease"
    ]
)
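
Because `os.environ.get` returns `None` when a variable is unset, a missing key would only surface later, inside the agent. A small guard like the one below can fail early with a clearer message; `require_env` is a hypothetical helper for this sketch and is not part of `biodsa`.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible use in the Quick Start call above:
#   api_key=require_env("AZURE_OPENAI_API_KEY"),
#   endpoint=require_env("AZURE_OPENAI_ENDPOINT"),
```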

Citation

@article{wang2025deepevidence,
  title   = {DeepEvidence: Empowering Biomedical Discovery with Deep Knowledge Graph Research},
  author  = {Wang, Zifeng and Chen, Zheng and Yang, Ziwei and Wang, Xuan and Jin, Qiao and Peng, Yifan and Lu, Zhiyong and Sun, Jimeng},
  journal = {arXiv preprint},
  year    = {2025}
}