DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

Authors: Shaolei Zhang, Ju Fan*, Meihao Fan, Guoliang Li, Xiaoyong Du

DeepAnalyze is the first agentic LLM for autonomous data science. It can autonomously complete a wide range of data-centric tasks without human intervention, supporting:

🛠 Entire data science pipeline: Automatically perform data science tasks such as data preparation, analysis, modeling, visualization, and report generation.
🔍 Open-ended data research: Conduct deep research on diverse data sources, including structured data (Databases, CSV, Excel), semi-structured data (JSON, XML, YAML), and unstructured data (TXT, Markdown), and finally produce analyst-grade research reports.
📊 Fully open-source: The model, code, training data, and demo of DeepAnalyze are all open-sourced, allowing you to deploy or extend your own data analysis assistant.

If you find DeepAnalyze useful, please ⭐ star the repository. Authors of useful issues and pull requests will be listed as contributors.

🖥 Demo

Upload your data, and DeepAnalyze can perform data-oriented deep research 🔍 and any data-centric task 🛠

deepanalyze-8b.mp4

Tip

Clone this repository to deploy DeepAnalyze locally as your own data analyst. It completes data science tasks end to end, with no predefined workflow and no closed-source APIs.

🔥 The demo UI is an initial version. You are welcome to develop it further, and we will list you as a contributor.

Clone this repo and download DeepAnalyze-8B.
Run these scripts to launch the API and interface, and then interact through the browser (http://localhost:4000):
```
cd demo/chat
npm install
cd ..
bash start.sh

# stop the api and interface
bash stop.sh
```
To deploy under a specific IP address, replace localhost with that address in ./demo/backend.py and ./demo/chat/lib/config.ts.

🚀 Quick Start

Requirements

Install packages: torch==2.6.0, transformers==4.53.2, vllm==0.8.5

```
conda create -n deepanalyze python=3.12 -y
conda activate deepanalyze
pip install -r requirements.txt

# For training
(cd ./deepanalyze/ms-swift/ && pip install -e .)
(cd ./deepanalyze/SkyRL/ && pip install -e .)
```
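After installation, it can help to confirm that the environment matches the pinned versions listed under Requirements. The following is a minimal sketch, not part of the repository; the helper names `installed_version` and `find_mismatches` are made up for illustration, and treating local build tags such as `+cu124` as a match is an assumption.

```python
from importlib import metadata

# Pins copied from the Requirements line of this README.
PINNED = {"torch": "2.6.0", "transformers": "4.53.2", "vllm": "0.8.5"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # ASSUMPTION: a local build tag such as "2.6.0+cu124" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found}")
```

Running the file inside the `deepanalyze` environment prints one line per deviating package and nothing when all pins match.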

Command Interaction

Deploy DeepAnalyze-8B via vllm: vllm serve DeepAnalyze-8B
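Loading an 8B model can take a while, so a script that depends on the server may want to wait until it is ready. The sketch below polls the OpenAI-compatible `/v1/models` endpoint that `vllm serve` exposes. It assumes vLLM's default port 8000 and that the served model name equals the argument passed to `vllm serve`; the helper names `list_models` and `wait_for_model` are illustrative.

```python
import json
import time
import urllib.error
import urllib.request


def list_models(base_url):
    """Return the model ids reported by an OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload["data"]]


def wait_for_model(model_name, base_url="http://localhost:8000",
                   timeout_s=300.0, poll_s=5.0, fetch=list_models):
    """Poll until model_name is served; return True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if model_name in fetch(base_url):
                return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # the server is probably still starting up
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    # ASSUMPTION: default vLLM host/port and the model name from the command above.
    print("ready" if wait_for_model("DeepAnalyze-8B") else "timed out")
```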

Run a script like the one below for any data science task:

You can specify any data science task, from a specific data task to open-ended data research.
You can specify any number of data sources, and DeepAnalyze will explore them automatically.
You can specify any type of data source, e.g., structured data (Databases, CSV, Excel), semi-structured data (JSON, XML, YAML), and unstructured data (TXT, Markdown).

```
from deepanalyze import DeepAnalyzeVLLM

prompt = """# Instruction
Generate a data science report.

# Data
File 1: {"name": "bool.xlsx", "size": "4.8KB"}
File 2: {"name": "person.csv", "size": "10.6KB"}
File 3: {"name": "disabled.xlsx", "size": "5.6KB"}
File 4: {"name": "enlist.csv", "size": "6.7KB"}
File 5: {"name": "filed_for_bankrupcy.csv", "size": "1.0KB"}
File 6: {"name": "longest_absense_from_school.xlsx", "size": "16.0KB"}
File 7: {"name": "male.xlsx", "size": "8.8KB"}
File 8: {"name": "no_payment_due.xlsx", "size": "15.6KB"}
File 9: {"name": "unemployed.xlsx", "size": "5.6KB"}
File 10: {"name": "enrolled.csv", "size": "20.4KB"}"""

workspace = "/home/u2023000922/zhangshaolei/deepanalyze_dev/example/student_loan/"

deepanalyze = DeepAnalyzeVLLM(
    "/fs/fast/u2023000922/zhangshaolei/checkpoints/deepanalyze-8b/"
)
answer = deepanalyze.generate(prompt, workspace=workspace)
print(answer["reasoning"])
```
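Writing the `# Data` section by hand becomes tedious for larger workspaces. The sketch below builds a prompt in the same layout from the files in a directory. The helper `build_prompt` is hypothetical and not part of the repository; the alphabetical ordering and the one-decimal KB formatting are assumptions inferred from the example prompt.

```python
import json
from pathlib import Path


def build_prompt(instruction, workspace):
    """Build a '# Instruction / # Data' prompt listing every file in workspace."""
    # ASSUMPTION: files are listed alphabetically; the example prompt uses another order.
    files = sorted(p for p in Path(workspace).iterdir() if p.is_file())
    lines = ["# Instruction", instruction, "", "# Data"]
    for index, path in enumerate(files, start=1):
        # ASSUMPTION: sizes are reported in KB (bytes / 1024) with one decimal place.
        size_kb = path.stat().st_size / 1024
        entry = json.dumps({"name": path.name, "size": f"{size_kb:.1f}KB"})
        lines.append(f"File {index}: {entry}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Generate a data science report.", "example/student_loan/"))
```

The returned string can be passed as `prompt` to `generate`, with the same directory as `workspace`.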

You should get a deep research report, which can be rendered as a PDF:

# Comprehensive Analysis of Student Enrollment Patterns and Institutional Transfers

## Introduction and Research Context

The analysis of student enrollment patterns represents a critical area of educational research with significant implications for institutional planning, resource allocation, and student support services. This comprehensive study examines a comprehensive dataset encompassing 1,194 enrollment records across six educational institutions, merged with supplementary demographic, financial, and employment status data. The research employs advanced analytical techniques including network analysis, predictive modeling, and temporal pattern recognition to uncover both macro-level institutional trends and micro-level student mobility patterns. The dataset's longitudinal nature, spanning fifteen months of enrollment records, provides unique insights into the complex dynamics of student pathways through higher education systems.

Our methodological approach combines quantitative analysis of enrollment durations, transfer probabilities, and financial indicators with qualitative ...

The research contributes to the growing body of literature on student mobility by providing empirical evidence of institutional transfer networks and their relationship to student outcomes...
.....

For more examples and task completion details, please refer to DeepAnalyze's homepage.

API

You can build an OpenAI-style API with the script below. First change MODEL_PATH = "DeepAnalyze-8B" in demo/backend.py to the model name served by vLLM:
```
python demo/backend.py
```

API usage (streaming response):

```
curl -X POST http://localhost:8200/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
           "messages": [
             {
               "role": "user",
               "content": "Generate a data science report."
             }
           ],
           "workspace": "example/student_loan/"
         }'
```
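The same request can be sent from Python. This is a rough sketch using only the standard library. The request body mirrors the curl example, but the response parsing assumes the backend streams OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), which this README does not confirm; check demo/backend.py for the actual format. The function names are illustrative.

```python
import json
import urllib.request


def build_request(content, workspace):
    """Build the JSON body used in the curl example."""
    return {
        "messages": [{"role": "user", "content": content}],
        "workspace": workspace,
    }


def iter_stream_text(lines):
    """Yield text fragments from server-sent-event lines.

    ASSUMPTION: chunks follow the OpenAI streaming format, with the text
    in choices[*].delta.content and a final 'data: [DONE]' line.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(content, workspace, url="http://localhost:8200/chat/completions"):
    """POST the request and yield streamed text as it arrives."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_request(content, workspace)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        yield from iter_stream_text(response)


if __name__ == "__main__":
    for piece in stream_chat("Generate a data science report.", "example/student_loan/"):
        print(piece, end="", flush=True)
```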

🎈 Develop Your Own DeepAnalyze

1. Download Model and Training Data

Download DeepSeek-R1-0528-Qwen3-8B as the base model, or fine-tune directly from DeepAnalyze-8B.

If you use DeepSeek-R1-0528-Qwen3-8B as the base model, first add the special tokens to its vocabulary:

```
MODEL_PATH=path_to_DeepSeek-R1-0528-Qwen3-8B
SAVE_PATH=path_to_save_DeepSeek-R1-0528-Qwen3-8B-addvocab

python deepanalyze/add_vocab.py \
  --model_path "$MODEL_PATH" \
  --save_path "$SAVE_PATH" \
  --add_tags
```
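For readers unfamiliar with vocabulary extension, the general technique looks roughly like the sketch below. It is not the contents of deepanalyze/add_vocab.py. It only illustrates the usual Hugging Face pattern of registering new special tokens and resizing the embedding matrix, and the tag strings in the usage comment are placeholders, not the tags DeepAnalyze actually uses.

```python
def add_special_tags(tokenizer, model, tags):
    """Register tags as special tokens and grow the embedding matrix to match.

    Works with any Hugging Face tokenizer/model pair exposing `add_tokens`
    and `resize_token_embeddings`. Returns how many tokens were new.
    """
    added = tokenizer.add_tokens(list(tags), special_tokens=True)
    if added > 0:
        # New token ids need embedding rows, otherwise lookups go out of range.
        model.resize_token_embeddings(len(tokenizer))
    return added


# Usage sketch (requires transformers and a local checkpoint):
#
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#
#   tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
#   model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
#   # HYPOTHETICAL tags, for illustration only.
#   add_special_tags(tokenizer, model, ["<ExampleTag>", "</ExampleTag>"])
#   tokenizer.save_pretrained(SAVE_PATH)
#   model.save_pretrained(SAVE_PATH)
```

The new embedding rows start out untrained, so the model learns what the tags mean only during subsequent fine-tuning.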

Download the training data DataScience-Instruct-500K.
- Unzip DataScience-Instruct-500K/RL/data.zip

2. Curriculum-based Agentic Training

Single-ability Fine-tuning: ./scripts/single.sh
Multi-ability Agentic Training (cold start): ./scripts/multi_coldstart.sh
Multi-ability Agentic Training (RL): ./scripts/multi_rl.sh
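The three stages are meant to run in the order listed. A small driver such as the sketch below can enforce that order and stop at the first failure. It is only an illustration: the helper names are made up, and it assumes each script has already been edited so that its model and data paths point to the right inputs, including the checkpoint produced by the previous stage.

```python
import subprocess
from pathlib import Path

# Stage order as listed in this README.
STAGES = [
    ("single-ability fine-tuning", "scripts/single.sh"),
    ("multi-ability cold start", "scripts/multi_coldstart.sh"),
    ("multi-ability RL", "scripts/multi_rl.sh"),
]


def plan_stages(repo_root, stages=STAGES):
    """Return [(name, script_path)] in order, failing early if a script is missing."""
    root = Path(repo_root).resolve()
    plan = []
    for name, relative in stages:
        script = root / relative
        if not script.is_file():
            raise FileNotFoundError(f"missing script for {name}: {script}")
        plan.append((name, script))
    return plan


def run_stages(repo_root, stages=STAGES):
    """Run each stage with bash from repo_root; stop at the first failure."""
    for name, script in plan_stages(repo_root, stages):
        print(f"==> {name}")
        # check=True raises CalledProcessError if the stage exits non-zero.
        subprocess.run(["bash", str(script)], cwd=repo_root, check=True)


if __name__ == "__main__":
    run_stages(".")
```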

3. Evaluation

We have unified the evaluation of most existing data science benchmarks using vLLM, and more are being added. Follow the instructions in ./playground to quickly evaluate DeepAnalyze or your own agent.

🤝 Acknowledgement

Training framework: ms-swift, SkyRL
Source of Training Data: Reasoning-Table, Spider, BIRD, DABStep

🖋Citation

If this repository is useful for you, please cite as:

@misc{deepanalyze,
      title={DeepAnalyze: Agentic Large Language Models for Autonomous Data Science}, 
      author={Shaolei Zhang and Ju Fan and Meihao Fan and Guoliang Li and Xiaoyong Du},
      year={2025},
      eprint={2510.16872},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.16872}, 
}

If you have any questions, please feel free to submit an issue or contact zhangshaolei98@ruc.edu.cn.
