Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents

Abstract

Interactive Data Analysis, a collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. However, the difficulty and cost of collecting realistic interactive logs for data analysis hinder the quantitative evaluation of Large Language Model (LLM) agents on this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark for evaluating LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed in an economical multi-agent environment, Decision Company, with little human effort. We evaluate popular and advanced LLM agents on Tapilot-Crossing, and the results underscore the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that AIR can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.6%.

Agent Action Taxonomy

Tapilot-Crossing covers 6 common action types that an agent will encounter.
Update_Code: this covers cases where the user requests bug fixes or refinements to the conditions of previous queries.
Fast_Fail: this action alerts users when the current data contents or resources are insufficient to meet their requests, or when their queries contain factual errors.
Clarification: this is a common response to under-specified questions, which are frequent in data-analysis queries. The agent makes the conditions of the question more specific and clear by seeking additional information from the user.
Best_Guess: while clarification effectively reduces uncertainty, it can make users impatient through repeated questioning, and it lengthens dialog histories, which distracts attention and causes long-context problems. The Best_Guess action addresses these issues by making appropriate assumptions for under-specified questions based on data contents, domain knowledge, and commonsense knowledge. However, incorrect guesses risk leading to hallucinations.
Plot_QA: in real data analysis settings, agents are also expected to answer user questions about insights derived from plots. The Plot_QA action helps users better understand the contents of plots for decision-making.
Insight_Mining: beyond generating code for users to retrieve expected results, interactive data analysis agents are also tasked with summarizing executed results from the environment to help users make informed decisions. This process, known as Insight_Mining, plays an important role in data analysis because it contributes to the evolution of code agents into comprehensive data analysis agents.
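The six action types above can be written down as a small enumeration, which is handy when tagging or filtering interactions by action. This is only a minimal sketch: the names AgentAction and parse_action are hypothetical and do not come from the Tapilot-Crossing codebase; only the six label strings are taken from the list above.

```python
from enum import Enum


class AgentAction(Enum):
    """The six action types listed above (class name is illustrative)."""
    UPDATE_CODE = "Update_Code"
    FAST_FAIL = "Fast_Fail"
    CLARIFICATION = "Clarification"
    BEST_GUESS = "Best_Guess"
    PLOT_QA = "Plot_QA"
    INSIGHT_MINING = "Insight_Mining"


def parse_action(label: str) -> AgentAction:
    """Map a label such as 'plot_qa' or ' Plot_QA ' to an AgentAction.

    Hypothetical helper: matching ignores case and surrounding whitespace.
    """
    normalized = label.strip().lower()
    for action in AgentAction:
        if action.value.lower() == normalized:
            return action
    raise ValueError(f"unknown action label: {label!r}")
```

For instance, parse_action("best_guess") returns AgentAction.BEST_GUESS, while an unrecognized label raises ValueError rather than silently falling back to a default.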

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	CG Reasoning: COT MC Reasoning: ReAct Reflection: AIR	30.2
2 Feb 20, 2024	GPT-4-32k + Agent HKU & Microsoft Li et al., '23	CG Reasoning: COT MC Reasoning: ReAct Reflection: N/A	25.9
3 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	CG Reasoning: COT MC Reasoning: ReAct Reflection: AIR	21.5
4 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	CG Reasoning: N/A MC Reasoning: N/A Reflection: N/A	20.9
5 Feb 20, 2024	GPT-4-Turbo + Agent HKU & Microsoft Li et al., '23	CG Reasoning: COT MC Reasoning: ReAct Reflection: N/A	20.4
6 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	CG Reasoning: COT MC Reasoning: ReAct Reflection: AIR	19.2
7 Feb 20, 2024	GPT-4-Turbo OpenAI OpenAI, '23	CG Reasoning: N/A MC Reasoning: N/A Reflection: N/A	17.2
8 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	CG Reasoning: N/A MC Reasoning: N/A Reflection: N/A	16.7
9 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	CG Reasoning: COT MC Reasoning: ReAct Reflection: AIR	16.4
10 Feb 20, 2024	Claude-2.1 + Agent HKU & Microsoft Li et al., '23	CG Reasoning: COT MC Reasoning: ReAct Reflection: N/A	15.6
11 Feb 20, 2024	CodeLlama-34B + Agent HKU & Microsoft Li et al., '23	CG Reasoning: COT MC Reasoning: ReAct Reflection: N/A	15.2
12 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	CG Reasoning: N/A MC Reasoning: N/A Reflection: N/A	13.7

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	32.2
2 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	29.7
3 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	29.2
4 Feb 20, 2024	GPT-4-Turbo + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	29.1
5 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	28.8
6 Feb 20, 2024	GPT-4-Turbo OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	27.6
7 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	Reasoning: N/A Reflection: N/A	27.5
8 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	24.8
9 Feb 20, 2024	Claude-2.1 + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	23.4
9 Feb 20, 2024	GPT-4-32k + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	23.4
11 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	Reasoning: N/A Reflection: N/A	20.2
12 Feb 20, 2024	CodeLlama-34B + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	18.5

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	12.0
2 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	10.6
3 Feb 20, 2024	GPT-4-Turbo + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	10.1
4 Feb 20, 2024	GPT-4-32k + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	9.1
5 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	7.1
6 Feb 20, 2024	GPT-4-Turbo OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	5.3
7 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	5.1
8 Feb 20, 2024	Claude-2.1 + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	3.9
9 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	Reasoning: N/A Reflection: N/A	2.4
10 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	2.1
11 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	Reasoning: N/A Reflection: N/A	1.5
12 Feb 20, 2024	CodeLlama-34B + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	1.0

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	48.0
2 Feb 20, 2024	GPT-4-32k + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	45.9
3 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	Reasoning: N/A Reflection: N/A	31.5
4 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	27.6
5 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	26.7
6 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	25.5
7 Feb 20, 2024	Claude-2.1 + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	24.6
8 Feb 20, 2024	CodeLlama-34B + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	23.5
9 Feb 20, 2024	GPT-4-Turbo + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	21.5
10 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	Reasoning: N/A Reflection: N/A	20.0
11 Feb 20, 2024	GPT-4-Turbo OpenAI Li et al., '23	Reasoning: N/A Reflection: N/A	19.8
12 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	14.8

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-Turbo + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	20.4
2 Feb 20, 2024	GPT-4-Turbo OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	19.1
3 Feb 20, 2024	Claude-2.1 + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	17.8
4 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	14.3
5 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	13.7
6 Feb 20, 2024	GPT-4-32k + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	13.1
7 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	9.9
8 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	7.2
9 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	Reasoning: N/A Reflection: N/A	6.2
10 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	Reasoning: N/A Reflection: N/A	6.1
11 Feb 20, 2024	CodeLlama-34B + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	5.9
12 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	5.6

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	46.4
2 Feb 20, 2024	GPT-4-32k + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	45.6
3 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	42.8
4 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	34.8
5 Feb 20, 2024	GPT-4-Turbo + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	34.4
6 Feb 20, 2024	CodeLlama-34B + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	30.6
7 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	Reasoning: N/A Reflection: N/A	24.2
8 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	21.4
9 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	15.3
10 Feb 20, 2024	GPT-4-Turbo OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	11.2
11 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	Reasoning: N/A Reflection: N/A	11.1
12 Feb 20, 2024	Claude-2.1 + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	5.5

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	28.6
2 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	27.2
2 Feb 20, 2024	GPT-4-32k + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	27.2
4 Feb 20, 2024	CodeLlama-34B + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	21.6
5 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	20.3
6 Feb 20, 2024	Claude-2.1 + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: N/A	19.3
7 Feb 20, 2024	GPT-4-Turbo OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	19.0
8 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	Reasoning: N/A Reflection: N/A	15.8
9 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	14.9
10 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	14.7
11 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	Reasoning: N/A Reflection: N/A	9.2
12 Feb 20, 2024	GPT-4-Turbo + Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Reflection: AIR	9.1

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Tool: DePlot Reflection: AIR	31.4
2 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Tool: DePlot Reflection: AIR	26.0
3 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	Reasoning: N/A Tool: N/A Reflection: N/A	22.1
4 Feb 20, 2024	GPT-4-Turbo OpenAI OpenAI, '23	Reasoning: N/A Tool: N/A Reflection: N/A	20.8
5 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Tool: DePlot Reflection: AIR	20.7
6 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	Reasoning: N/A Tool: N/A Reflection: N/A	14.1
7 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: ReAct Tool: DePlot Reflection: N/A	14.0
8 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	Reasoning: N/A Tool: N/A Reflection: N/A	4.4

Rank	Model	Setting	Score
1 Feb 20, 2024	GPT-4-32k + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	6.1
2 Feb 20, 2024	GPT-4-Turbo + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	5.0
3 Feb 20, 2024	GPT-4-32k + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	3.6
4 Feb 20, 2024	GPT-4-Turbo OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	3.4
5 Feb 20, 2024	GPT-4-Turbo + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	3.1
6 Feb 20, 2024	Claude-2.1 Anthropic Anthropic, '23	Reasoning: N/A Reflection: N/A	2.7
7 Feb 20, 2024	Claude-2.1 + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	1.9
8 Feb 20, 2024	CodeLlama-34B Meta Rozière et al., '23	Reasoning: N/A Reflection: N/A	1.7
9 Feb 20, 2024	Claude-2.1 + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	1.6
10 Feb 20, 2024	CodeLlama-34B + Inter-Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: AIR	0.9
11 Feb 20, 2024	CodeLlama-34B + Agent HKU & Microsoft Li et al., '23	Reasoning: COT Reflection: N/A	0.0
11 Feb 20, 2024	GPT-4-32k OpenAI OpenAI, '23	Reasoning: N/A Reflection: N/A	0.0
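
As a reading aid for the scores above, relative improvement is (improved - baseline) / baseline. The sketch below applies that formula to the "+ Agent" and "+ Inter-Agent" rows of the first leaderboard table, where the two settings differ only in the reflection entry. The function name and the pairing of rows are assumptions made for illustration. The page does not say which comparison yields the 44.6% figure quoted in the abstract, so this sketch does not attempt to reproduce it.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# (Agent score, Inter-Agent score) copied from the first leaderboard table.
first_table_pairs = {
    "GPT-4-32k": (25.9, 30.2),
    "GPT-4-Turbo": (20.4, 21.5),
    "CodeLlama-34B": (15.2, 19.2),
    "Claude-2.1": (15.6, 16.4),
}

for model, (agent, inter_agent) in first_table_pairs.items():
    gain = relative_improvement(agent, inter_agent)
    print(f"{model}: {gain:+.1f}% relative to the Agent setting")
```

An absolute gain of 4.3 points for GPT-4-32k, for example, corresponds to a relative gain of roughly 16.6% because the baseline is 25.9.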

BibTeX

@article{li2024tapilot,
  title={Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents},
  author={Li, Jinyang and Huo, Nan and Gao, Yan and Shi, Jiayi and Zhao, Yingxiu and Qu, Ge and Wu, Yurong and Ma, Chenhao and Lou, Jian-Guang and Cheng, Reynold},
  journal={arXiv preprint arXiv:2403.05307},
  year={2024}
}
