Interactive Data Analysis, a collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. However, realistic interaction logs for data analysis are difficult and costly to collect, which hinders quantitative evaluation of large language model (LLM) agents on this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark for evaluating LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed with little human effort by an economical multi-agent environment, Decision Company. We evaluate popular and state-of-the-art LLM agents on Tapilot-Crossing, and the results underscore the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that AIR can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.6%.
Tapilot-Crossing contains 6 common action types that an agent will encounter:
Update_Code: this refers to instances where the user requests corrections for bugs or refinements to the conditions of previous queries.
Fast_Fail: it is an action that alerts users when the current data contents or resources are insufficient to meet their requests, or when user queries contain factual errors.
Clarification: this is a common action in response to under-specified questions, which are frequent in data-analysis queries. In this action, agents make the conditions of the question more specific and clear by seeking additional information from users.
Best_Guess: while clarification is an effective way to reduce uncertainty, it can cause user impatience through repeated questioning, and it produces long dialog histories that distract attention and raise long-context problems. This action addresses these issues by making appropriate assumptions for under-specified questions based on data contents, domain knowledge, and commonsense knowledge. However, incorrect guesses risk leading to hallucinations.
Plot_QA: in real data analysis settings, agents are also expected to answer user questions about insights derived from plots. The Plot_QA action can assist users in better understanding the contents of plots for decision making.
Insight_Mining: beyond generating code for users to retrieve expected results, interactive data analysis agents are also tasked with summarizing executed results from the environment to assist users in making informed decisions. This process, known as Insight_Mining, plays an important role in data analysis since it contributes to the evolution of code agents into comprehensive data analysis agents.
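The six action types above can be treated as a closed label set when tagging or scoring agent turns. The following is a minimal sketch, not the benchmark's actual data schema: the names `Action`, `SUMMARY`, `UNDER_SPECIFIED_ACTIONS`, and `parse_action` are hypothetical, and the one-line summaries are paraphrased from the descriptions above.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical label set mirroring the six action types listed above."""
    UPDATE_CODE = "Update_Code"
    FAST_FAIL = "Fast_Fail"
    CLARIFICATION = "Clarification"
    BEST_GUESS = "Best_Guess"
    PLOT_QA = "Plot_QA"
    INSIGHT_MINING = "Insight_Mining"


# One-line summaries paraphrased from the descriptions in the text.
SUMMARY = {
    Action.UPDATE_CODE: "fix bugs or refine conditions of a previous query",
    Action.FAST_FAIL: "alert the user to insufficient data or factual errors",
    Action.CLARIFICATION: "ask the user to resolve an under-specified question",
    Action.BEST_GUESS: "answer an under-specified question under assumptions",
    Action.PLOT_QA: "answer questions about insights derived from plots",
    Action.INSIGHT_MINING: "summarize executed results to support decisions",
}

# Both of these respond to under-specified questions, per the text above.
UNDER_SPECIFIED_ACTIONS = frozenset({Action.CLARIFICATION, Action.BEST_GUESS})


def parse_action(label: str) -> Action:
    """Map a label such as ' plot_qa ' to its Action, ignoring case and outer spaces."""
    normalized = label.strip().lower()
    for action in Action:
        if action.value.lower() == normalized:
            return action
    raise ValueError(f"unknown action type: {label!r}")


if __name__ == "__main__":
    for action in Action:
        print(f"{action.value}: {SUMMARY[action]}")
```

Keeping the labels in an enum makes an unknown or misspelled action type fail loudly instead of silently forming a seventh category.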
Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | CG Reasoning: COT; MC Reasoning: ReAct; Reflection: AIR | 30.2 |
2 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft (Li et al., '23) | CG Reasoning: COT; MC Reasoning: ReAct; Reflection: None | 25.9 |
3 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | CG Reasoning: COT; MC Reasoning: ReAct; Reflection: AIR | 21.5 |
4 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | CG Reasoning: N/A; MC Reasoning: N/A; Reflection: N/A | 20.9 |
5 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft (Li et al., '23) | CG Reasoning: COT; MC Reasoning: ReAct; Reflection: N/A | 20.4 |
6 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | CG Reasoning: COT; MC Reasoning: ReAct; Reflection: AIR | 19.2 |
7 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | CG Reasoning: N/A; MC Reasoning: N/A; Reflection: N/A | 17.2 |
8 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | CG Reasoning: N/A; MC Reasoning: N/A; Reflection: N/A | 16.7 |
9 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | CG Reasoning: COT; MC Reasoning: ReAct; Reflection: AIR | 16.4 |
10 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft (Li et al., '23) | CG Reasoning: COT; MC Reasoning: ReAct; Reflection: N/A | 15.6 |
11 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft (Li et al., '23) | CG Reasoning: COT; MC Reasoning: ReAct; Reflection: N/A | 15.2 |
12 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | CG Reasoning: N/A; MC Reasoning: N/A; Reflection: N/A | 13.7 |
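As a worked example of what "relative performance improvement" means, the sketch below compares each base model in the first leaderboard table above with its Inter-Agent variant. It assumes the metric is (improved - baseline) / baseline; the function and variable names are hypothetical, and the rounded leaderboard scores used here are not claimed to be the inputs behind the 44.6% figure quoted in the abstract.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the percentage gain of `improved` over `baseline`."""
    if baseline == 0:
        # Some leaderboard entries score 0.0, where a relative gain is undefined.
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0


# (base model score, Inter-Agent score), copied from the first leaderboard table.
SCORES = {
    "GPT-4-32k": (20.9, 30.2),
    "GPT-4-Turbo": (17.2, 21.5),
    "CodeLlama-34B": (16.7, 19.2),
    "Claude-2.1": (13.7, 16.4),
}


def summarize(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Relative improvement per model, rounded to one decimal place."""
    return {
        model: round(relative_improvement(base, inter), 1)
        for model, (base, inter) in scores.items()
    }


if __name__ == "__main__":
    for model, gain in summarize(SCORES).items():
        print(f"{model}: +{gain}%")
```

For instance, GPT-4-Turbo moves from 17.2 to 21.5, a relative gain of 25.0%, even though the absolute gain is only 4.3 points.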

Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 32.2 |
2 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 29.7 |
3 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 29.2 |
4 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 29.1 |
5 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 28.8 |
6 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 27.6 |
7 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | Reasoning: N/A; Reflection: N/A | 27.5 |
8 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 24.8 |
9 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 23.4 |
9 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 23.4 |
11 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | Reasoning: N/A; Reflection: N/A | 20.2 |
12 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 18.5 |

Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 12.0 |
2 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 10.6 |
3 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 10.1 |
4 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 9.1 |
5 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 7.1 |
6 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 5.3 |
7 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 5.1 |
8 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 3.9 |
9 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | Reasoning: N/A; Reflection: N/A | 2.4 |
10 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 2.1 |
11 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | Reasoning: N/A; Reflection: N/A | 1.5 |
12 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 1.0 |

Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 48.0 |
2 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 45.9 |
3 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | Reasoning: N/A; Reflection: N/A | 31.5 |
4 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 27.6 |
5 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 26.7 |
6 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 25.5 |
7 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 24.6 |
8 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 23.5 |
9 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 21.5 |
10 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | Reasoning: N/A; Reflection: N/A | 20.0 |
11 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 19.8 |
12 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 14.8 |

Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 20.4 |
2 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 19.1 |
3 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 17.8 |
4 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 14.3 |
5 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 13.7 |
6 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 13.1 |
7 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 9.9 |
8 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 7.2 |
9 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | Reasoning: N/A; Reflection: N/A | 6.2 |
10 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | Reasoning: N/A; Reflection: N/A | 6.1 |
11 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 5.9 |
12 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 5.6 |

Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 46.4 |
2 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 45.6 |
3 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 42.8 |
4 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 34.8 |
5 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 34.4 |
6 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 30.6 |
7 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | Reasoning: N/A; Reflection: N/A | 24.2 |
8 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 21.4 |
9 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 15.3 |
10 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 11.2 |
11 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | Reasoning: N/A; Reflection: N/A | 11.1 |
12 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 5.5 |

Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 28.6 |
2 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 27.2 |
2 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 27.2 |
4 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 21.6 |
5 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 20.3 |
6 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: N/A | 19.3 |
7 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 19.0 |
8 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | Reasoning: N/A; Reflection: N/A | 15.8 |
9 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 14.9 |
10 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 14.7 |
11 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | Reasoning: N/A; Reflection: N/A | 9.2 |
12 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Reflection: AIR | 9.1 |

Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Tool: DePlot; Reflection: AIR | 31.4 |
2 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Tool: DePlot; Reflection: AIR | 26.0 |
3 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | Reasoning: N/A; Tool: N/A; Reflection: N/A | 22.1 |
4 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | Reasoning: N/A; Tool: N/A; Reflection: N/A | 20.8 |
5 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Tool: DePlot; Reflection: AIR | 20.7 |
6 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | Reasoning: N/A; Tool: N/A; Reflection: N/A | 14.1 |
7 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: ReAct; Tool: DePlot; Reflection: N/A | 14.0 |
8 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | Reasoning: N/A; Tool: N/A; Reflection: N/A | 4.4 |

Rank | Date | Model | Source | Setting | Score |
---|---|---|---|---|---|
1 | Feb 20, 2024 | GPT-4-32k + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 6.1 |
2 | Feb 20, 2024 | GPT-4-Turbo + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 5.0 |
3 | Feb 20, 2024 | GPT-4-32k + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 3.6 |
4 | Feb 20, 2024 | GPT-4-Turbo | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 3.4 |
5 | Feb 20, 2024 | GPT-4-Turbo + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 3.1 |
6 | Feb 20, 2024 | Claude-2.1 | Anthropic (Anthropic, '23) | Reasoning: N/A; Reflection: N/A | 2.7 |
7 | Feb 20, 2024 | Claude-2.1 + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 1.9 |
8 | Feb 20, 2024 | CodeLlama-34B | Meta (Rozière et al., '23) | Reasoning: N/A; Reflection: N/A | 1.7 |
9 | Feb 20, 2024 | Claude-2.1 + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 1.6 |
10 | Feb 20, 2024 | CodeLlama-34B + Inter-Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: AIR | 0.9 |
11 | Feb 20, 2024 | CodeLlama-34B + Agent | HKU & Microsoft (Li et al., '23) | Reasoning: COT; Reflection: N/A | 0.0 |
11 | Feb 20, 2024 | GPT-4-32k | OpenAI (OpenAI, '23) | Reasoning: N/A; Reflection: N/A | 0.0 |
@article{li2024tapilot,
title={Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents},
author={Li, Jinyang and Huo, Nan and Gao, Yan and Shi, Jiayi and Zhao, Yingxiu and Qu, Ge and Wu, Yurong and Ma, Chenhao and Lou, Jian-Guang and Cheng, Reynold},
journal={arXiv preprint arXiv:2403.05307},
year={2024}
}