
Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents

1The University of Hong Kong, 2Microsoft, 3The Chinese University of Hong Kong, Shenzhen, 4The Hong Kong University of Science and Technology, 5Chinese Academy of Sciences

Abstract

Interactive Data Analysis, a collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic interactive logs for data analysis hinder the quantitative evaluation of Large Language Model (LLM) agents on this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark for evaluating LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed by an economical multi-agent environment, Decision Company, with little human effort. We evaluate popular and advanced LLM agents on Tapilot-Crossing, and the results underscore the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that AIR can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.6%.


How Decision_Company Generates Interactive Data Analysis Data

Agent Action Taxonomy

Tapilot-Crossing covers 6 common action types that an agent will encounter.
1) Update_Code: the user requests corrections for bugs or refinements to the conditions of previous queries.
2) Fast_Fail: the agent alerts users when the current data contents or resources are insufficient to meet their requests, or when user queries contain factual errors.
3) Clarification: a common action in response to under-specified questions, which are frequent in data-analysis queries. The agent makes the conditions of the question more specific and clear by seeking additional information from users.
4) Best_Guess: while clarification effectively reduces uncertainty, it can cause user impatience through repeated questioning, and long dialog histories that lead to attention distraction and long-context problems. This action addresses these issues by making appropriate assumptions for under-specified questions based on data contents, domain knowledge, and commonsense knowledge. However, there is a risk that incorrect guesses lead to hallucinations.
5) Plot_QA: in real data analysis settings, agents are also expected to answer user questions about insights derived from plots. The Plot_QA action helps users better understand the contents of plots for decision making.
6) Insight_Mining: beyond generating code for users to retrieve expected results, interactive data analysis agents are also tasked with summarizing executed results from the environment to help users make informed decisions. This process plays an important role in data analysis, since it contributes to the evolution of code agents into comprehensive data analysis agents.
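
The six action types above could be represented in code as a simple enumeration. The following is a minimal Python sketch, not the benchmark's own implementation: the names `AgentAction` and `parse_action` are hypothetical, and only the six labels come from the taxonomy.

```python
from enum import Enum


class AgentAction(Enum):
    """The six agent action types listed in the taxonomy (values are the labels used there)."""

    UPDATE_CODE = "Update_Code"        # fix bugs or refine conditions of earlier queries
    FAST_FAIL = "Fast_Fail"            # alert the user to insufficient data or factual errors
    CLARIFICATION = "Clarification"    # ask the user to resolve an under-specified question
    BEST_GUESS = "Best_Guess"          # proceed on a stated assumption instead of asking
    PLOT_QA = "Plot_QA"                # answer questions about a plot
    INSIGHT_MINING = "Insight_Mining"  # summarize executed results for decision making


def parse_action(label: str) -> AgentAction:
    """Map a label such as 'Best_Guess' to its enum member; raises ValueError if unknown."""
    return AgentAction(label)


if __name__ == "__main__":
    print(parse_action("Plot_QA").name)
```

Keeping the labels as enum values means an unknown label raises an error early instead of being silently carried through an evaluation script.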


Leaderboard

Overall CodeGen PrivateGen Analysis Clarification BestGuess Unanswerable PlotQA Correction
Overall

| Rank | Date | Model | Organization | Reference | CG Reasoning | MC Reasoning | Reflection | Score |
|---|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | ReAct | AIR | 30.2 |
| 2 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft | Li et al., '23 | COT | ReAct | N/A | 25.9 |
| 3 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | ReAct | AIR | 21.5 |
| 4 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | N/A | 20.9 |
| 5 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft | Li et al., '23 | COT | ReAct | N/A | 20.4 |
| 6 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | ReAct | AIR | 19.2 |
| 7 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | OpenAI, '23 | N/A | N/A | N/A | 17.2 |
| 8 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | N/A | 16.7 |
| 9 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | ReAct | AIR | 16.4 |
| 10 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft | Li et al., '23 | COT | ReAct | N/A | 15.6 |
| 11 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft | Li et al., '23 | COT | ReAct | N/A | 15.2 |
| 12 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | N/A | 13.7 |
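
The abstract reports gains as a relative performance improvement. As a sketch of how such a figure is computed, the snippet below applies the standard formula, (improved - baseline) / baseline, to two overall scores from the listing above: GPT-4-32k (20.9) and GPT-4-32k + Inter-Agent (30.2). Choosing this pair is an assumption made for illustration; the page does not state which comparison yields the reported 44.6%, and the rounded leaderboard scores give a slightly different value. The function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    base_score = 20.9  # GPT-4-32k, overall score (assumed baseline for this example)
    air_score = 30.2   # GPT-4-32k + Inter-Agent, overall score
    print(f"{relative_improvement(base_score, air_score):.1f}%")  # prints 44.5%
```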

CodeGen

| Rank | Date | Model | Organization | Reference | Reasoning | Reflection | Score |
|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 32.2 |
| 2 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | 29.7 |
| 3 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 29.2 |
| 4 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 29.1 |
| 5 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 28.8 |
| 6 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | OpenAI, '23 | N/A | N/A | 27.6 |
| 7 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | 27.5 |
| 8 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 24.8 |
| 9 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 23.4 |
| 9 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 23.4 |
| 11 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | 20.2 |
| 12 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 18.5 |

PrivateGen

| Rank | Date | Model | Organization | Reference | Reasoning | Reflection | Score |
|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 12.0 |
| 2 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 10.6 |
| 3 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 10.1 |
| 4 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 9.1 |
| 5 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | 7.1 |
| 6 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | OpenAI, '23 | N/A | N/A | 5.3 |
| 7 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 5.1 |
| 8 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 3.9 |
| 9 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | 2.4 |
| 10 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 2.1 |
| 11 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | 1.5 |
| 12 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 1.0 |

Analysis

| Rank | Date | Model | Organization | Reference | Reasoning | Reflection | Score |
|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 48.0 |
| 2 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 45.9 |
| 3 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | 31.5 |
| 4 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 27.6 |
| 5 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 26.7 |
| 6 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | 25.5 |
| 7 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 24.6 |
| 8 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 23.5 |
| 9 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 21.5 |
| 10 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | 20.0 |
| 11 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | Li et al., '23 | N/A | N/A | 19.8 |
| 12 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 14.8 |

Clarification

| Rank | Date | Model | Organization | Reference | Reasoning | Reflection | Score |
|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 20.4 |
| 2 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | OpenAI, '23 | N/A | N/A | 19.1 |
| 3 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 17.8 |
| 4 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 14.3 |
| 5 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 13.7 |
| 6 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 13.1 |
| 7 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 9.9 |
| 8 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 7.2 |
| 9 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | 6.2 |
| 10 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | 6.1 |
| 11 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 5.9 |
| 12 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | 5.6 |

BestGuess

| Rank | Date | Model | Organization | Reference | Reasoning | Reflection | Score |
|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 46.4 |
| 2 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 45.6 |
| 3 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 42.8 |
| 4 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 34.8 |
| 5 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 34.4 |
| 6 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 30.6 |
| 7 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | 24.2 |
| 8 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | 21.4 |
| 9 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 15.3 |
| 10 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | OpenAI, '23 | N/A | N/A | 11.2 |
| 11 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | 11.1 |
| 12 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 5.5 |

Unanswerable

| Rank | Date | Model | Organization | Reference | Reasoning | Reflection | Score |
|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 28.6 |
| 2 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 27.2 |
| 2 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 27.2 |
| 4 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 21.6 |
| 5 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | 20.3 |
| 6 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft | Li et al., '23 | ReAct | N/A | 19.3 |
| 7 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | OpenAI, '23 | N/A | N/A | 19.0 |
| 8 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | 15.8 |
| 9 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 14.9 |
| 10 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 14.7 |
| 11 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | 9.2 |
| 12 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft | Li et al., '23 | ReAct | AIR | 9.1 |

PlotQA

| Rank | Date | Model | Organization | Reference | Reasoning | Tool | Reflection | Score |
|---|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | DePlot | AIR | 31.4 |
| 2 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | DePlot | AIR | 26.0 |
| 3 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | N/A | 22.1 |
| 4 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | OpenAI, '23 | N/A | N/A | N/A | 20.8 |
| 5 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | DePlot | AIR | 20.7 |
| 6 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | N/A | 14.1 |
| 7 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | ReAct | DePlot | N/A | 14.0 |
| 8 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | N/A | 4.4 |

Correction

| Rank | Date | Model | Organization | Reference | Reasoning | Reflection | Score |
|---|---|---|---|---|---|---|---|
| 1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 6.1 |
| 2 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 5.0 |
| 3 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 3.6 |
| 4 | Feb 20, 2024 | GPT-4-Turbo | OpenAI | OpenAI, '23 | N/A | N/A | 3.4 |
| 5 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 3.1 |
| 6 | Feb 20, 2024 | Claude-2.1 | Anthropic | Anthropic, '23 | N/A | N/A | 2.7 |
| 7 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 1.9 |
| 8 | Feb 20, 2024 | CodeLlama-34B | Meta | Rozière et al., '23 | N/A | N/A | 1.7 |
| 9 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 1.6 |
| 10 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft | Li et al., '23 | COT | AIR | 0.9 |
| 11 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft | Li et al., '23 | COT | N/A | 0.0 |
| 11 | Feb 20, 2024 | GPT-4-32k | OpenAI | OpenAI, '23 | N/A | N/A | 0.0 |

Why Tapilot-Crossing?

The name of our project is inspired by the popular Switch game Animal Crossing, where users can perform complex tasks, such as constructing fantastic architecture, through interactions with animal citizens (agents). Our research aims to provide a similar benchmark playground, enabling AGI development for more complicated and more realistic data analysis tasks.

Acknowledgement

We thank Bowen Li and Bowen Qin for their early discussions. We also sincerely thank Prof. Laks V.S. Lakshmanan and Dr. Xiaodong Li for their suggestions. The background music of the video has been licensed; please contact the first authors regarding copyright. The final data was generated by Azure OpenAI from HKU ITS Support. Thanks for the support!

BibTeX

@article{li2024tapilot,
  title={Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents},
  author={Li, Jinyang and Huo, Nan and Gao, Yan and Shi, Jiayi and Zhao, Yingxiu and Qu, Ge and Wu, Yurong and Ma, Chenhao and Lou, Jian-Guang and Cheng, Reynold},
  journal={arXiv preprint arXiv:2403.05307},
  year={2024}
}