  • Peter Toumbourou

Evaluating AI for Tax Domains: Insights from Instant.Tax





In the realm of tax evaluation, AI faces some of its toughest challenges. Tax questions are complex, often requiring advanced reasoning, calculations, and a deep understanding of tax laws.


Our recent assessment using the Instant.Tax technology framework examines how today’s leading AI models perform on tax-related tasks. Here's an in-depth look at the results and their implications for the future of AI in this crucial field.

 

 Key Models and Performance Metrics

 

The Instant.Tax benchmark evaluates several leading AI models on accuracy, cost, and latency when solving tax-related questions. Claude 3.5 Sonnet and GPT-4o top the leaderboard, each with an accuracy of 61.2%. The two models differ in cost and latency: Claude 3.5 Sonnet is priced at $3 per million input tokens ($3/M), while GPT-4o costs $5/M. GPT-4o responds in 0.43 seconds, making it one of the fastest models tested.
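To make the price difference concrete, here is a minimal sketch of the input-side cost arithmetic. The two prices are the ones quoted above; the workload size (1,000 questions at 500 input tokens each) and the function name `input_cost` are assumptions made only for illustration, and output-token pricing is not covered.

```python
# Input prices quoted in the article, in USD per million input tokens.
INPUT_PRICE_PER_MILLION = {
    "Claude 3.5 Sonnet": 3.00,
    "GPT-4o": 5.00,
}


def input_cost(model: str, input_tokens: int) -> float:
    """Return the input-side cost in USD for a number of input tokens."""
    return INPUT_PRICE_PER_MILLION[model] * input_tokens / 1_000_000


# Hypothetical workload (an assumption, not a benchmark figure):
# 1,000 questions at 500 input tokens each.
workload_tokens = 1_000 * 500

for model in INPUT_PRICE_PER_MILLION:
    print(f"{model}: ${input_cost(model, workload_tokens):.2f}")
```

Under these assumed volumes the input cost comes to $1.50 for Claude 3.5 Sonnet and $2.50 for GPT-4o, so the gap only becomes material at much larger scale.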


Other models trail the leaders. Llama 3.1 Instruct (405B) performed relatively well with an accuracy of 52.2%, but accuracy drops significantly for GPT-4o Mini (43.8%) and Llama 3.1 Instruct (70B) (39.7%). Lower-tier models, such as Claude 3 Sonnet and Gemini Pro, struggle to deliver competitive results, with accuracies hovering around the 40% mark.

 

 Challenging the Models: Tax Reasoning is No Easy Task

 

The Instant.Tax benchmark highlights the complexities involved in tax-related questions. While Claude 3.5 Sonnet and GPT-4o performed well on multiple-choice questions, both struggled significantly in free-response questions, especially when dealing with advanced tax calculations. This suggests that while models may perform adequately when given constrained options, they falter in open-ended scenarios, reflecting the challenges of using AI for tax predictions and estimates in real-world applications.

 

Additionally, the performance of open-source models such as Llama 3.1 shows promise, though there is a significant performance gap compared to closed-source models like GPT-4o. These open-source models still fall short on tasks requiring deeper tax knowledge, such as deferred tax asset calculations or determining the appropriate tax rate for complex financial transactions.

 

 

Insights into Tax-Specific Model Tasks

 

Instant.Tax focuses on six main categories of tax-related tasks:

 

  1. Taxable Income Calculation – Understanding differences between accounting and taxable income.

  2. Tax Rates Application – Applying the appropriate tax rates to income.

  3. Deferred Tax Assets and Liabilities – Recognizing and measuring temporary differences.

  4. Effective Tax Rate Calculation – Analyzing effective tax rates.

  5. Tax Accounting Methods – Understanding cash-basis and accrual-basis methods.

  6. Discontinued Operations – Calculating the tax impact on disposed operations.

 

These categories present diverse challenges, with models performing better on basic tax knowledge tasks (e.g., rule recall) and worse on multi-step calculations or nuanced tax law interpretations.
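As an illustration of what a multi-step question in the first and fourth categories can involve, the sketch below adjusts accounting income for a permanent difference and then derives the effective tax rate. All figures, the 25% statutory rate, and the function names are hypothetical; this is not an item from the benchmark.

```python
def taxable_income(pretax_book_income: float, permanent_addbacks: float) -> float:
    """Adjust accounting income for items that are never deductible for tax."""
    return pretax_book_income + permanent_addbacks


def effective_tax_rate(tax_expense: float, pretax_book_income: float) -> float:
    """Tax expense expressed as a fraction of pretax accounting income."""
    return tax_expense / pretax_book_income


# Hypothetical figures, chosen only for illustration.
book_income = 100_000.0
non_deductible_fines = 5_000.0  # permanent difference, added back
statutory_rate = 0.25           # assumed flat rate

income_for_tax = taxable_income(book_income, non_deductible_fines)
tax_expense = income_for_tax * statutory_rate
etr = effective_tax_rate(tax_expense, book_income)

print(f"Taxable income: {income_for_tax:,.0f}")
print(f"Tax expense: {tax_expense:,.0f}")
print(f"Effective tax rate: {etr:.2%}")
```

With these inputs, taxable income is 105,000, tax expense is 26,250, and the effective rate is 26.25%, higher than the 25% statutory rate because the fines are never deductible. A model has to chain all three steps correctly to reach the final figure.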



 

 Notable Challenges: Math and Reasoning in Tax AI

 

Tax-related tasks often require precise numerical reasoning, something even top models like GPT-4o and Claude 3.5 Sonnet struggle with. For instance, GPT-4o excelled at rule-recall questions, such as identifying the appropriate IRS rule, but had difficulty calculating taxable income across multiple steps.

 

This challenge is particularly significant in areas like deferred tax asset calculations, where models not only need to identify the rules but also apply them to hypothetical scenarios, considering multiple factors. The inherent complexity of tax reasoning, especially in cases that require precise calculations, makes this an area where human expertise remains indispensable.
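A small worked example shows why such questions are harder than rule recall. The sketch below measures a deferred tax asset for a deductible temporary difference that reverses over two years at different enacted rates. The warranty scenario, the amounts, the rates, and the name `deferred_tax_asset` are all hypothetical, and valuation allowances are deliberately left out.

```python
from typing import Iterable, Tuple


def deferred_tax_asset(reversal_schedule: Iterable[Tuple[float, float]]) -> float:
    """Sum each expected reversal multiplied by the rate enacted for that year.

    Each schedule entry is (deductible amount reversing, enacted tax rate).
    """
    return sum(amount * rate for amount, rate in reversal_schedule)


# Hypothetical scenario: a 40,000 warranty accrual expensed in the books now
# but deductible for tax only when claims are paid.
schedule = [
    (25_000.0, 0.25),  # expected to reverse in year 1 at an assumed 25% rate
    (15_000.0, 0.20),  # expected to reverse in year 2 at an assumed 20% rate
]

dta = deferred_tax_asset(schedule)
print(f"Deferred tax asset: {dta:,.2f}")
```

Here the asset is 6,250 plus 3,000, or 9,250. Applying a single rate to the full 40,000 would give a different answer, which is the kind of slip a multi-factor question is designed to expose.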

 

Looking Forward: The Future of AI in Tax

 

Despite these challenges, there’s optimism surrounding the future of AI in the tax domain. While models are not yet at the level where they can replace tax professionals, they show potential as assistive tools. With further fine-tuning, particularly in mathematical reasoning, these models could eventually handle routine tax calculations, allowing tax professionals to focus on more complex tasks.

 

In the meantime, benchmarks like Instant.Tax play a crucial role in highlighting the current limitations and pushing the development of better, more specialized AI models for the tax industry. As model architectures evolve and datasets become more sophisticated, we can expect steady improvements in AI’s ability to handle intricate tax-related questions.

 

 Methodology Bytes for Instant.Tax

 

The Instant.Tax methodology is designed to produce reliable results. The dataset comprises both multiple-choice and free-response tax questions sourced from real-world tax scenarios. Many questions have multiple sub-parts and often involve multi-step calculations.

  • Multiple-Choice Questions: Evaluated directly based on accuracy, with random guessing yielding 25% accuracy.

  • Free-Response Questions: Evaluated using an AI-driven auto-evaluation system, which compares the model’s response with a pre-determined correct answer. This method allows for more tasks to be tested without the need for human review.
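The two scoring modes can be sketched roughly as follows. The multiple-choice scorer and the chance baseline follow directly from the description above. The free-response check is a deliberately simpler stand-in: it compares numeric answers within a tolerance and is not the AI-driven evaluator that Instant.Tax uses. All function names and the 1% tolerance are assumptions.

```python
import math
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the chosen option matches the answer key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1 / num_options


def numeric_answer_matches(response: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Simplified stand-in for auto-evaluation: numeric match within rel_tol."""
    return math.isclose(response, reference, rel_tol=rel_tol)


# Hypothetical answers, for illustration only.
accuracy = multiple_choice_accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
print(f"Accuracy: {accuracy:.0%} (chance: {chance_baseline(4):.0%})")
print(numeric_answer_matches(26_250.0, 26_200.0))
```

A tolerance-based check only works when the expected answer is a single number; judging explanations or multi-part responses is where a model-based evaluator becomes necessary.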

 

Each API request was retried four times to mitigate transient errors, ensuring the robustness of the results.
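A retry policy of this kind might look like the sketch below. It assumes the four retries are an upper bound that applies only after a failure, and it adds exponential backoff between attempts, which the methodology does not specify. `TransientError`, `call_with_retries`, and the `request_fn` callable are hypothetical names, not part of any real client library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_retries(
    request_fn: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request_fn, retrying up to max_retries times on TransientError."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)  # assumed backoff: 1s, 2s, 4s, 8s
    raise AssertionError("unreachable")


# Usage with a stand-in request that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


recorded_delays: list = []
result = call_with_retries(flaky_request, sleep=recorded_delays.append)
print(result, attempts["count"], recorded_delays)
```

Passing `sleep` as a parameter keeps the example testable without real waiting. Only errors marked as transient are retried, so a genuinely wrong request still fails immediately.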

 

AI in the Tax Industry is a Work in Progress

In conclusion, while AI models have made significant strides in other industries, they still have a long way to go in tax evaluation. Top performers like GPT-4o and Claude 3.5 Sonnet show promise but need substantial improvements, especially in areas requiring mathematical reasoning. Tax professionals can expect AI to be a useful assistive tool in the near future, but for now, human expertise remains essential for tackling complex tax scenarios.

Instant.Tax provides a valuable framework for understanding these models' capabilities and limitations, offering a roadmap for future developments in the integration of AI into the tax domain.


The future of AI Tax is exciting.

Peter Toumbourou


Stay tuned for more insights and updates on how Instant.Tax is empowering individuals, businesses and professionals globally. For more information or to request a demo, contact us.

