BioDSA
An open-source framework for benchmarking and prototyping AI agents for biomedicine.
BioDSA is a modular, open-source framework for building, reproducing, and evaluating biomedical data science agents. It is designed for AI agent research, prioritizing clean abstractions, rapid prototyping, and systematic benchmarking to accelerate R&D of AI agents for biomedicine.
For Biomedical Researchers
Not a developer? TrialMind by Keiji AI offers these AI agents as a ready-to-use platform—no coding required. Analyze your data, explore literature, and accelerate your research today.
A Quickstart to Prototyping Data Science Agents
BioDSA is a modular framework for rapidly prototyping and experimenting with data science agents. Extend the base agent, plug in your tools, define your workflow graph, and run—no boilerplate, just clean agent logic.
Modular Base Agent
Extend BaseAgent to build custom agents with built-in LLM support (OpenAI, Anthropic, Azure, Google), sandbox execution, and retry handling
LangGraph Workflows
Define agent logic as composable state graphs with conditional edges, enabling complex multi-step reasoning patterns (see the sketch after this list)
Plug-and-Play Tools
Ready-to-use tool wrappers for PubMed, cBioPortal, ClinicalTrials.gov, gene databases, and more—or create your own
Sandboxed Execution
Execute generated code safely in Docker containers with resource monitoring, automatic cleanup, and artifact collection
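To make the workflow-graph idea concrete, here is a minimal sketch using the public langgraph StateGraph API. The AgentState fields, node bodies, and retry condition are illustrative assumptions, not BioDSA code.

from typing import TypedDict

from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    question: str
    answer: str

def plan(state: AgentState) -> dict:
    # Placeholder planning node; a real agent would call the LLM here.
    return {"answer": ""}

def execute(state: AgentState) -> dict:
    # Placeholder execution node; a real agent would run tools or code here.
    return {"answer": f"result for: {state['question']}"}

def should_finish(state: AgentState) -> str:
    # Conditional edge: loop back to planning until an answer exists.
    return END if state["answer"] else "plan"

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("execute", execute)
graph.add_edge(START, "plan")
graph.add_edge("plan", "execute")
graph.add_conditional_edges("execute", should_finish)

app = graph.compile()
print(app.invoke({"question": "Which genes are most frequently mutated?", "answer": ""}))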
Agent Implementation Patterns
The same BaseAgent abstraction powers linear, loop, multi-phase, and multi-agent workflows.
class BaseAgent:
    """All agents inherit these capabilities for free."""

    # Multi-provider LLM support
    llm: BaseLanguageModel  # OpenAI, Anthropic, Azure, Google

    # Sandboxed code execution
    sandbox: ExecutionSandboxWrapper  # Docker container with monitoring

    # Workspace management
    registered_datasets: List[str]  # Auto-uploaded to sandbox
    workdir: str  # Isolated working directory

    def __init__(self, model_name, api_type, api_key, ...):
        self.llm = self._get_model(api_type, model_name, api_key)
        self.sandbox = ExecutionSandboxWrapper()
        self.install_biodsa_tools_in_sandbox()

    def register_workspace(self, workspace_dir):
        # Upload CSVs to sandbox, track datasets
        ...

    def _call_model(self, messages, tools=None):
        # Retry logic, timeout handling, token tracking
        return run_with_retry(self.llm.invoke, messages)

    def go(self, query: str):
        # Subclasses implement this
        return self.agent_graph.invoke({"messages": [query]})
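To show the extension pattern, a minimal linear subclass might look like the sketch below. It relies only on what the snippet above exposes; the message format and return value are assumptions, not BioDSA's documented API.

class SummarizeAgent(BaseAgent):
    """Minimal linear agent: one retry-wrapped LLM call, no tools."""

    def go(self, query: str):
        # Single-step workflow: route the query straight through the
        # retry/timeout handling that BaseAgent provides in _call_model.
        messages = [{"role": "user", "content": query}]
        return self._call_model(messages)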
Specialized Agents
Agents and benchmarks developed using BioDSA's modular architecture
DSWizard
Biomedical Data Science Agent
Using BioDSA, we rapidly implemented and benchmarked a wide range of agent prompting methods—including Few-shot, AutoPrompt, RAG, PlanPrompt, and DSWizard—across more than 15 LLMs such as GPT-4o, Gemini, Claude, DeepSeek, Llama-3, and Qwen. The DSWizard agent achieved up to 0.74 Pass@1, more than 2× higher than vanilla prompting.
Single Agent
Tool calling execution
Multi-Agent
Hierarchical orchestration
Memory
BM25-indexed context (see the sketch after this list)
MCP Tools
Protocol integrations
Sandbox
Docker execution
Knowledge APIs
15+ connectors
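As an illustration of the BM25-indexed memory above, the sketch below indexes summaries of earlier agent steps and retrieves the most relevant ones for the current query. It uses the rank_bm25 package; the memory entries and whitespace tokenizer are illustrative assumptions, not BioDSA's actual memory module.

from rank_bm25 import BM25Okapi

# Illustrative memory: summaries of earlier agent steps.
memory = [
    "Loaded cBioPortal breast cancer cohort with 1,084 samples",
    "Computed TP53 mutation frequency per molecular subtype",
    "Fitted Kaplan-Meier curves stratified by PIK3CA status",
]
index = BM25Okapi([entry.lower().split() for entry in memory])

# Retrieve the two most relevant entries for the current step.
query = "survival analysis by mutation status".lower().split()
print(index.get_top_n(query, memory, n=2))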
See BioDSA Agents in Action
Watch how our AI agents autonomously handle complex biomedical research tasks, from data analysis to evidence synthesis.
DSWizard
Biomedical Data Science Agent
Cluster the patients based on their genomic mutation data to maximize the separation of prognostic survival outcomes.
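For intuition, the sketch below shows the kind of analysis code such a prompt could elicit: cluster patients on a binary mutation matrix, then check whether the clusters separate survival with a log-rank test. The synthetic data, cluster count, and library choices are illustrative assumptions; the agent's generated code will differ.

import numpy as np
from lifelines.statistics import multivariate_logrank_test
from sklearn.cluster import KMeans

# Synthetic stand-in for a patient-by-gene binary mutation matrix.
rng = np.random.default_rng(0)
mutations = rng.integers(0, 2, size=(200, 50))
survival_months = rng.exponential(scale=24, size=200)
event_observed = rng.integers(0, 2, size=200)  # 1 = death observed

# Cluster patients on their mutation profiles.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(mutations)

# Log-rank test: do the clusters separate survival outcomes?
result = multivariate_logrank_test(survival_months, labels, event_observed)
print(f"log-rank p-value: {result.p_value:.4f}")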
Comprehensive Benchmarks
BioDSA includes multiple benchmark datasets for evaluating data science agent performance on biomedical tasks.
BioDSA-1K
Hypothesis validation tasks from real biomedical studies
BioDSBench-Python
Python coding tasks for biomedical data analysis
BioDSBench-R
R coding tasks for biomedical data analysis
DeepEvidence
Knowledge-graph-driven deep research tasks
Additional Benchmarks
Curated benchmarks from external sources
Examples of BioDSA Benchmark Analyses
Benchmark Results
Comprehensive evaluation across data science, deep research, and biomedical discovery tasks.
DSWizard: Biomedical Data Science Programming
Pass@1 scores on BioDSBench tasks by difficulty level
Easy: 30 tasks
Medium: 38 tasks
Hard: 60 tasks
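Pass@1 here is presumably the standard unbiased pass@k estimator (Chen et al., 2021); a minimal sketch, assuming n completions are sampled per task and c of them pass:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k drawn completions passes,
    # given c of n generated completions pass.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=10, c=7, k=1))  # pass@1 reduces to c/n = 0.7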
Research Papers
BioDSA is based on a series of research papers. Cite the relevant paper if you use BioDSA in your research.
Core Publication
Making large language models reliable data science programming copilots for biomedical research
https://www.nature.com/articles/s41551-025-01587-2

@article{wang2026reliable,
  title   = {Making large language models reliable data science programming copilots for biomedical research},
  author  = {Wang, Zifeng and Danek, Benjamin and Yang, Ziwei and Chen, Zheng and Sun, Jimeng},
  journal = {Nature Biomedical Engineering},
  year    = {2026},
  doi     = {10.1038/s41551-025-01587-2},
}

Related Publications
These papers are developed on top of the core DSWizard framework.
BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research
https://arxiv.org/abs/2505.16100

@article{wang2025biodsa1k,
  title   = {BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research},
  author  = {Wang, Zifeng and Danek, Benjamin and Sun, Jimeng},
  journal = {arXiv preprint arXiv:2505.16100},
  year    = {2025},
}

Built for reproducible biomedical data science research.
For questions or collaborations, please open an issue on GitHub.