BioDSA
An open-source framework for benchmarking and prototyping AI agents for biomedicine.
BioDSA is a modular, open-source framework for building, reproducing, and evaluating biomedical data science agents. It is designed for AI agent research, prioritizing clean abstractions, rapid prototyping, and systematic benchmarking to accelerate R&D of AI agents for biomedicine.
For Biomedical Researchers
Not a developer? TrialMind by Keiji AI offers these AI agents as a ready-to-use platform—no coding required. Analyze your data, explore literature, and accelerate your research today.
A Quickstart to Prototyping Data Science Agents
BioDSA is a modular framework for rapidly prototyping and experimenting with data science agents. Extend the base agent, plug in your tools, define your workflow graph, and run—no boilerplate, just clean agent logic.
Modular Base Agent
Extend BaseAgent to build custom agents with built-in LLM support (OpenAI, Anthropic, Azure, Google), sandbox execution, and retry handling
LangGraph Workflows
Define agent logic as composable state graphs with conditional edges, enabling complex multi-step reasoning patterns (see the sketch after this list)
Plug-and-Play Tools
Ready-to-use tool wrappers for PubMed, cBioPortal, ClinicalTrials.gov, gene databases, and more—or create your own
Sandboxed Execution
Execute generated code safely in Docker containers with resource monitoring, automatic cleanup, and artifact collection
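To make the workflow-graph idea concrete, here is a minimal sketch using the public langgraph StateGraph API. The AgentState fields, node bodies, and retry condition are illustrative assumptions, not BioDSA code.

from typing import TypedDict

from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    question: str
    answer: str

def plan(state: AgentState) -> dict:
    # Placeholder planning node; a real agent would call the LLM here.
    return {"answer": ""}

def execute(state: AgentState) -> dict:
    # Placeholder execution node; a real agent would run tools or code here.
    return {"answer": f"result for: {state['question']}"}

def should_finish(state: AgentState) -> str:
    # Conditional edge: loop back to planning until an answer exists.
    return END if state["answer"] else "plan"

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("execute", execute)
graph.add_edge(START, "plan")
graph.add_edge("plan", "execute")
graph.add_conditional_edges("execute", should_finish)

app = graph.compile()
print(app.invoke({"question": "Which genes are most frequently mutated?", "answer": ""}))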
Agent Implementation Patterns
The same BaseAgent abstraction powers linear, loop, multi-phase, and multi-agent workflows.
class BaseAgent:
    """All agents inherit these capabilities for free."""

    # Multi-provider LLM support
    llm: BaseLanguageModel  # OpenAI, Anthropic, Azure, Google

    # Sandboxed code execution
    sandbox: ExecutionSandboxWrapper  # Docker container with monitoring

    # Workspace management
    registered_datasets: List[str]  # Auto-uploaded to sandbox
    workdir: str  # Isolated working directory

    def __init__(self, model_name, api_type, api_key, ...):
        self.llm = self._get_model(api_type, model_name, api_key)
        self.sandbox = ExecutionSandboxWrapper()
        self.install_biodsa_tools_in_sandbox()

    def register_workspace(self, workspace_dir):
        # Upload CSVs to sandbox, track datasets
        ...

    def _call_model(self, messages, tools=None):
        # Retry logic, timeout handling, token tracking
        return run_with_retry(self.llm.invoke, messages)

    def go(self, query: str):
        # Subclasses implement this
        return self.agent_graph.invoke({"messages": [query]})
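To show the extension pattern, a minimal linear subclass might look like the sketch below. It relies only on what the snippet above exposes; the message format and return value are assumptions, not BioDSA's documented API.

class SummarizeAgent(BaseAgent):
    """Minimal linear agent: one retry-wrapped LLM call, no tools."""

    def go(self, query: str):
        # Single-step workflow: route the query straight through the
        # retry/timeout handling that BaseAgent provides in _call_model.
        messages = [{"role": "user", "content": query}]
        return self._call_model(messages)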
Specialized Agents
Agents and benchmarks developed using BioDSA's modular architecture
DSWizard
Biomedical Data Science Agent
Using BioDSA, we rapidly implemented and benchmarked a wide range of agent prompting methods—including Few-shot, AutoPrompt, RAG, PlanPrompt, and DSWizard—across more than 15 LLMs such as GPT-4o, Gemini, Claude, DeepSeek, Llama-3, and Qwen. The DSWizard agent achieved up to 0.74 Pass@1, more than 2× higher than vanilla prompting.
Single Agent
Tool calling execution
Multi-Agent
Hierarchical orchestration
Memory
BM25-indexed context (see the sketch after this list)
MCP Tools
Protocol integrations
Sandbox
Docker execution
Knowledge APIs
15+ connectors
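As an illustration of the BM25-indexed memory above, the sketch below indexes summaries of earlier agent steps and retrieves the most relevant ones for the current query. It uses the rank_bm25 package; the memory entries and whitespace tokenizer are illustrative assumptions, not BioDSA's actual memory module.

from rank_bm25 import BM25Okapi

# Illustrative memory: summaries of earlier agent steps.
memory = [
    "Loaded cBioPortal breast cancer cohort with 1,084 samples",
    "Computed TP53 mutation frequency per molecular subtype",
    "Fitted Kaplan-Meier curves stratified by PIK3CA status",
]
index = BM25Okapi([entry.lower().split() for entry in memory])

# Retrieve the two most relevant entries for the current step.
query = "survival analysis by mutation status".lower().split()
print(index.get_top_n(query, memory, n=2))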
See BioDSA Agents in Action
Watch how our AI agents autonomously handle complex biomedical research tasks, from data analysis to evidence synthesis.
DSWizard
Biomedical Data Science Agent
Cluster the patients based on their genomic mutation data to maximize the separation of prognostic survival outcomes.
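For intuition, the sketch below shows the kind of analysis code such a prompt could elicit: cluster patients on a binary mutation matrix, then check whether the clusters separate survival with a log-rank test. The synthetic data, cluster count, and library choices are illustrative assumptions; the agent's generated code will differ.

import numpy as np
from lifelines.statistics import multivariate_logrank_test
from sklearn.cluster import KMeans

# Synthetic stand-in for a patient-by-gene binary mutation matrix.
rng = np.random.default_rng(0)
mutations = rng.integers(0, 2, size=(200, 50))
survival_months = rng.exponential(scale=24, size=200)
event_observed = rng.integers(0, 2, size=200)  # 1 = death observed

# Cluster patients on their mutation profiles.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(mutations)

# Log-rank test: do the clusters separate survival outcomes?
result = multivariate_logrank_test(survival_months, labels, event_observed)
print(f"log-rank p-value: {result.p_value:.4f}")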
Comprehensive Benchmarks
BioDSA includes multiple benchmark datasets for evaluating data science agent performance on biomedical tasks.
BioDSA-1K
Hypothesis validation tasks from real biomedical studies
BioDSBench-Python
Python coding tasks for biomedical data analysis
BioDSBench-R
R coding tasks for biomedical data analysis
DeepEvidence
Knowledge-graph-driven deep research tasks
Additional Benchmarks
Curated benchmarks from external sources
Examples of BioDSA Benchmark Analyses
Benchmark Results
Comprehensive evaluation across data science, deep research, and biomedical discovery tasks.
DSWizard: Biomedical Data Science Programming
Pass@1 scores on BioDSBench tasks by difficulty level
Easy: 30 tasks
Medium: 38 tasks
Hard: 60 tasks
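Pass@1 here is presumably the standard unbiased pass@k estimator (Chen et al., 2021); a minimal sketch, assuming n completions are sampled per task and c of them pass:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k drawn completions passes,
    # given c of n generated completions pass.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=10, c=7, k=1))  # pass@1 reduces to c/n = 0.7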
Research Papers
BioDSA is based on a series of research papers. Cite the relevant paper if you use BioDSA in your research.
Core Publication
Making large language models reliable data science programming copilots for biomedical research
https://www.nature.com/articles/s41551-025-01587-2

@article{wang2026reliable,
  title   = {Making large language models reliable data science programming copilots for biomedical research},
  author  = {Wang, Zifeng and Danek, Benjamin and Yang, Ziwei and Chen, Zheng and Sun, Jimeng},
  journal = {Nature Biomedical Engineering},
  year    = {2026},
  doi     = {10.1038/s41551-025-01587-2},
}

Related Publications
These papers are developed on top of the core DSWizard framework.
BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research
https://arxiv.org/abs/2505.16100

@article{wang2025biodsa1k,
  title   = {BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research},
  author  = {Wang, Zifeng and Danek, Benjamin and Sun, Jimeng},
  journal = {arXiv preprint arXiv:2505.16100},
  year    = {2025},
}

Built for reproducible biomedical data science research.
For questions or collaborations, please open an issue on GitHub.