Generative AI ignited the first wave of conversational AI, yet it speaks only for the individual user. When an agent must act on behalf of an entity (an airline, retailer, or bank), LLM output becomes an existential liability: broken compliance, policy violations, zero audit trail. In these scenarios, which represent a substantial and economically critical share of potential AI applications, Generative AI remains an unpredictable black box.
The fundamental shortcomings of Generative-only AI have become increasingly visible, and increasingly costly.
Generative AI alone simply cannot fulfill the promises the world expects from advanced artificial intelligence. Recent academic work underscores how far today’s LLM agents are from that bar. A May 2025 paper, “CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions”, reports that state-of-the-art agents score only 58% on single-turn tasks and collapse to ≈35% once the dialogue spans multiple turns, leading the authors to highlight a “significant gap between current LLM capabilities and real-world enterprise demands.”1
For decades, AI researchers debated two distinct paths: symbolic AI, emphasizing rules, logic, and explicit reasoning; and neural networks, skilled at pattern recognition and statistical learning. Both have strengths, both have critical limitations. Purely symbolic AI struggles with natural, conversational interaction. Purely neural, generative AI falters when trust, reliability, and consistency are non-negotiable. The long-standing goal was to combine the two—leveraging each to offset the other’s weaknesses. With Apollo-1, that vision is now reality: conversational agents that converse fluently and act reliably.
To reach the accuracy, trust, and reliability that benchmarks like τ-Bench measure, the model core must shift from neural to neuro-symbolic reasoning. Apollo-1, our neuro-symbolic foundation model for conversational agents, marks the beginning of this transformation. It is designed to power conversational agents acting on behalf of entities across industries and use cases. Apollo-1 enables advanced native tool use through reliable, structured symbolic interactions with complex external systems and APIs. It provides comprehensive traceability, with every decision logged in a fully inspectable, editable format. Finally, it offers steerability and controllability, allowing organizations to consistently steer agents toward desired behaviors through granular context, instructions, and guidelines.
On tasks that require conversational fluency combined with dependable, transparent action, Apollo-1 consistently outperforms state-of-the-art generative models by wide margins, as illustrated by the recent τ-Bench evaluation results detailed below.
Neuro-Symbolic AI bridges the gap between Generative AI’s linguistic capabilities and Symbolic AI’s structured reasoning, unlocking actionable, reliable, transparent, and steerable AI interactions. This second wave moves beyond conversation to dependable execution, enabling conversational agents that work for entities, not just end-users.
τ-Bench-Airline is the toughest public benchmark for tool-using conversational agents, measuring complex, multi-turn user-agent interactions and advanced tool use. On the headline Pass^¹ metric, preliminary results place Apollo-1 in the 90.8–92.5 % range—over 30 percentage points ahead of Claude-4’s 60 %.
Final Pass^¹, Pass^², Pass^³, and Pass^⁴ numbers will be published in our official τ-Bench report in September 2025.
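The Pass^k family of metrics rewards only tasks that succeed on every one of k independent trials. As a rough sketch (not the τ-Bench harness itself; the function name and boolean trial encoding are ours), the commonly used unbiased estimator computes, per task, the chance that k trials drawn without replacement from n recorded trials all passed:

```python
from math import comb

def pass_hat_k(trial_results, k):
    """Unbiased per-task estimate of Pass^k: the probability that k trials
    sampled without replacement from the recorded trials all succeeded."""
    n, c = len(trial_results), sum(trial_results)
    return comb(c, k) / comb(n, k)

# A task that passed 2 of 4 trials: Pass^1 = 0.5, but Pass^2 falls to 1/6.
results = [True, True, False, False]
print(pass_hat_k(results, 1))  # 0.5
print(pass_hat_k(results, 2))  # ≈ 0.167
```

The benchmark-level score averages this estimate over all tasks, which is why Pass^² through Pass^⁴ fall sharply for agents whose successes are inconsistent across runs.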
Task #12: Reservation Update & Booking.
| Time | Speaker | Message |
|---|---|---|
| 12:18 PM | Agent | Welcome! This is a simulation environment for the TAU Benchmark – Airlines. How can I help you today? |
| 12:18 PM | User | Hi, I want to book the same flights as my current reservation, but this time for my friend Ivan Smith. His date of birth should be in my profile. |
| 12:18 PM | Agent | Thank you for your request! To get started with booking the same flights for Ivan Smith, could you please provide your user ID? |
| 12:19 PM | User | My user ID is ivan_muller_7015. |
| 12:19 PM | Agent | Agent Options: HAT097, HAT251; HAT275, HAT156; HAT169, HAT069; HAT275, HAT251; HAT035, HAT045; HAT097, HAT045 |
| 12:19 PM | Agent | I found the same one-stop flights as your current reservation—HAT097 (DTW to PHX, departs 7:00am) and HAT251 (PHX to SEA, departs 12:00pm) in economy class, total fare $375. Please confirm if you’d like to proceed with booking these flights for Ivan Smith. |
| 12:19 PM | User | Yes, please proceed. I’d like to use my certificate for payment—can you tell me how much of my certificate balance will remain after this booking? |
| 12:19 PM | Agent | Thanks for confirming. The total fare for Ivan Smith’s flight is $375, which will be deducted from your $500 certificate—leaving a $125 balance that will be forfeited, as unused certificate funds are non-refundable. Would you like to add travel insurance for $30 (covers full refund for health or weather-related cancellations), and should I go ahead with the booking? |
| 12:19 PM | User | If $125 will be lost, I’d rather use my gift card and credit card instead. Can you switch the payment to those? |
| 12:19 PM | Agent | Thanks for the update. To confirm, you’d like to book flights HAT097 (DTW to PHX, departs 7:00am) and HAT251 (PHX to SEA, departs 12:00pm) on 05-17-2024 in economy for Ivan Smith, using your gift card and credit card for the $375 fare. Would you like to add travel insurance for $30 (covers full refund for health or weather-related cancellations)? |
| 12:19 PM | User | No insurance, and I don’t need any bags. Please go ahead and book it. |
| 12:19 PM | Agent | Ivan Smith is now booked on flights HAT097 and HAT251 in economy on 05-17-2024, with no checked bags or insurance. Payment was completed using your gift card ($128) and credit card ($247), and your reservation ID is HATHAU. Let me know if you need the itinerary details or have any other requests. |
A single reservation-update chat shows how quickly reliability collapses for generative-only agents. Even at a seemingly healthy 60 % Pass^¹, the odds of clearing real-world workflows plummet:
| Requirement | Success at 60% Pass^¹ | Odds All Succeed |
|---|---|---|
| 1 scenario | 60% | 1 in 1.7 |
| 5 scenarios | 0.6⁵ ≈ 7.8% | 1 in 13 |
| 50 scenarios (full τ-Bench suite) | 0.6⁵⁰ ≈ 8 × 10⁻¹² | a 1-in-125-billion moonshot |
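The arithmetic behind these figures is plain compounding of independent per-scenario success, which a few lines make concrete:

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent scenarios all succeed
    when each passes with probability p."""
    return p ** n

print(f"{chain_success(0.60, 1):.0%}")   # 60%
print(f"{chain_success(0.60, 5):.1%}")   # ≈ 7.8%
print(f"{chain_success(0.60, 50):.1e}")  # ≈ 8.1e-12
```

Even generous per-scenario accuracy collapses geometrically as workflows chain together, which is why per-turn reliability, not average-case fluency, is the binding constraint.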
Zero-slip reality: an airline, bank, or hospital doesn’t care whether the agent usually works; it must work every time. Purely generative AI makes that level of reliability practically impossible.
The Generative Catch-22
When revenue, reputation, or regulation are on the line, a 60% Pass^¹ leaves four in ten customer journeys unfinished.
Apollo-1 is a neuro-symbolic foundation model purpose-built for Conversational Agents.
In Apollo-1, a Neuro-Symbolic Reasoner replaces the transformer as the model’s decision-making core. To dive into Apollo-1’s architecture, click here.
When an agent acts on behalf of an entity, every interaction can affect revenue, reputation, or regulatory exposure. Hallucinations cost money; tone missteps erode trust; missed guardrails create legal risk.
| Domain | Test Size | Apollo-1 (one-shot) | Baseline (one-shot) |
|---|---|---|---|
| Travel (Google Flights live chats) | 111 scenarios | 83% | 22% (Gemini 2.5 Flash) |
| Retail (Amazon live chats) | 120 scenarios | 91% | 17% (Amazon Rufus) |
To test whether the τ-Bench results extrapolate to real-world use cases and APIs, we pitted Apollo-1 against two leading LLM agents. The Google Flights and Amazon Retail results reflect one-take scoring: each scenario allowed a single attempt per interaction, with zero retries, and required perfect multi-turn execution to pass; any incorrect step failed the scenario.
Google Flights evaluation: 111 scenarios (booking seats, adding bags, eco-friendly flights, reroutes, upgrades) against real-time inventory from Google Flights. Amazon Retail evaluation: 120 scenarios (4K TVs, laptop comparisons, color variants, cart management, gift selection, policy queries) against Amazon.com’s real-time product catalog. Both tests provided identical API access, so performance differences resulted solely from reasoning and agentic capabilities.
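One-take scoring can be sketched as follows. This is a hypothetical harness for illustration only (the step functions, names, and scenario encoding are ours, not the actual evaluation code): a scenario is a sequence of pass/fail steps, and one wrong step fails the whole scenario with no retry.

```python
from typing import Callable, List

Step = Callable[[], bool]  # one agent action, judged pass/fail

def run_scenario(steps: List[Step]) -> bool:
    """A scenario passes only if every step succeeds; the first
    incorrect step fails the whole scenario, with no retries."""
    return all(step() for step in steps)

def one_take_pass_rate(scenarios: List[List[Step]]) -> float:
    """Fraction of scenarios executed perfectly end to end."""
    return sum(run_scenario(s) for s in scenarios) / len(scenarios)

# Two scenarios: one fully correct, one with a single bad step.
ok, bad = (lambda: True), (lambda: False)
print(one_take_pass_rate([[ok, ok, ok], [ok, bad, ok]]))  # 0.5
```

Under this regime, partial credit does not exist, which is what makes the one-shot gap between the two columns in the table above so stark.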
These comparisons deliberately used Google Flights and Amazon inventory: both platforms are among the most structured and organized API environments available, making them a rigorous test of an AI agent’s ability to interact reliably with complex external tools. Apollo-1 had no internal or privileged access to either system.
Generative AI unlocked fluent chat. AUI’s Apollo-1 adds something generative models can’t: verifiable action. Apollo-1 makes conversational agents that converse fluently and act reliably a reality by combining Generative and Symbolic AI.
τ-Bench shows that a 60 % Pass^¹ leaves four in ten journeys unresolved, while CRMArena-Pro confirms leading LLM agents hover at 58 % single-turn and collapse in multi-turn dialogues.
Apollo-1’s neuro-symbolic core unites generative fluency with reliable actionability to clear > 90 % Pass^¹ where Large Language Models falter. Generative AI cannot operate for an entity; neuro-symbolic AI can.
Trajectories and reward logs will be released with the public results.
Click here to explore Apollo-1’s Eval Playground, where you can experience its capabilities firsthand. Interact with the model and its Reasoning Panel, navigate real-time conversations across multiple evaluation domains, and review live benchmark trajectories (passkey required; request access).
1 Huang, K-H.; Prabhakar, A.; Thorat, O.; Agarwal, D.; Choubey, P.K.; Mao, Y.; Savarese, S.; Xiong, C.; Wu, C-S. CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. Salesforce AI Research (2025). arXiv:2505.18878.