News

Apollo-1 One-Shots 83% of 111 Live Google-Flights Scenarios — A Controllable AI Showcase

Apollo-1’s 83% vs Gemini 2.5-Flash’s 22%: Where generative AI stalls, neuro-symbolic controllable AI powers through

06/25/25

Introduction

Generative AI excels when acting on behalf of a user, but it falters when an agent must act for an entity—whether it’s an insurance company, a brand, or a travel platform. That’s where controllable AI becomes essential, giving companies the levers to keep an agent aligned with their policies, guardrails, and desired behaviour. To demonstrate, we wired Apollo-1 into the same real-time Google Flights feed powering Gemini 2.5-Flash and ran 111 multi-turn conversations across both models.

Apollo-1 is engineered to operate on behalf of an entity, not an individual. Its neuro-symbolic core merges generative fluency with deterministic policy enforcement, giving organisations the control, traceability, and reliability that conventional LLM co-pilots lack. The head-to-head below shows how that design translates into real-world wins over Gemini when every query hits identical inventory, prices, and schedules.

Most leaderboards cherry-pick the best answer from a batch of generations. We do the opposite. Every prompt gets one shot—exactly what a live user would see—and that single response is what we score.

| Rule | What it enforces |
| --- | --- |
| 01 One shot per scenario | The agent gets zero retries; whatever it says in each turn is final. |
| 02 Multi-turn pass rule | A scenario passes only if every step (R-1…R-3) is correct. One miss → scenario fail. |
| 03 111 customer scenarios | Book a seat, add a bag, pick the greenest flight, reroute through a hub, quote an upgrade, etc. |
| 04 Live Google Flights backend | Both agents hit the same real-time inventory, so any gap is pure reasoning and control. |
| 05 Sampling window | Runs executed between May 30 and June 2, 2025; results reflect that snapshot. |

We reuse τ-Bench’s philosophy in miniature: no retries, binary checks that mirror real customer journeys. And because there’s no second chance, the outcome is a single, unforgiving line in the sand—a simple, reproducible, brutally honest snapshot of how an agent really behaves when someone is on the other side of the chat box.

Results in a Snapshot

[Figure: results snapshot]

How Controllable AI Keeps Agents on the Rails

For almost eight years we built toward one goal: agents that act on behalf of an entity, not a user.

  • Neuro-symbolic foundation model: a Symbolic Reasoner replaces the transformer as the model’s decision-making core, enabling controllable, context-aware interactions every turn.
  • Deterministic control on demand: symbolic guardrails guarantee brand and regulatory compliance before the answer leaves the model, while still leveraging generative fluency.
  • Real-time Control Panel: teams inspect reasoning traces, tweak context schemas, inject policies, and replay trajectories; fine-tuning is continuous and happens at the sub-interaction level instead of retraining the whole stack.

This is how an agent can operate on behalf of an airline, a bank, or a government office – not just with a user – and still sound natural.
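
Apollo-1’s internals are not public, so the sketch below only illustrates the general pattern described above: a deterministic policy gate that validates a generative draft before it is released. The `Policy` class, the example policies, and all field names are illustrative assumptions, not Apollo-1 APIs.

```python
# Minimal illustrative sketch of a deterministic guardrail gate
# (an assumed pattern, not Apollo-1's actual implementation).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    name: str
    check: Callable[[dict], bool]  # True if the drafted answer complies

# Hypothetical policies an entity might inject via a control panel.
POLICIES = [
    Policy("must_quote_total_price", lambda d: d.get("total_price") is not None),
    Policy("own_inventory_only", lambda d: d.get("carrier") in d.get("allowed_carriers", [])),
]

def release_or_block(draft: dict) -> dict:
    """Release the generative draft only if every symbolic policy passes;
    otherwise block it and report which rule failed (a reasoning trace)."""
    for policy in POLICIES:
        if not policy.check(draft):
            return {"released": False, "violated": policy.name}
    return {"released": True, "answer": draft}

# Example: a draft quoting another carrier's inventory is blocked before release.
draft = {"carrier": "OtherAir", "allowed_carriers": ["BlueSky"], "total_price": 244}
print(release_or_block(draft))  # {'released': False, 'violated': 'own_inventory_only'}
```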

Example Scenario

Scoring criteria — Scenario 28, “Same-day One-Way, Arrive Before Lunch”

| Step | Tester prompt | Check | Pass criterion |
| --- | --- | --- | --- |
| R-1 | “One-way LAX → JFK overnight on 9 Aug 2025 (arrive morning 10 Aug), 1 adult, economy.” | Return at least one qualifying red-eye itinerary | Departs 9 Aug, arrives next day before 12:00 PM, shows total price |
| R-2 | “What’s the cost to upgrade the cheapest red-eye?” | Quote upgrade price (or say none) for that same itinerary | States whether an upgrade exists and the correct price delta |
| R-3 | “If I skip the upgrade and instead add one checked bag, what’s the total cost?” | Add the airline’s checked-bag fee and compute new grand total | Gives per-bag fee, adds it to base fare, and returns the exact all-in total |

Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3); the scenario earns reward 1 only if every step passes, 0 otherwise.
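
To make the rule concrete, here is a minimal sketch of the scoring arithmetic as described above (our own restatement, not the benchmark harness itself):

```python
def score_scenario(step_results: list[bool]) -> tuple[int, int]:
    """Each step scores 1 (pass) or 0 (fail); the scenario earns
    reward 1 only if every step passes (no retries, no partial credit)."""
    score = sum(step_results)
    reward = 1 if all(step_results) else 0
    return score, reward

# Scenario 28, Apollo-1 run: all three checks pass -> 3/3, reward 1.
print(score_scenario([True, True, True]))    # (3, 1)
# Gemini run: only R-1 passes -> 1/3, reward 0.
print(score_scenario([True, False, False]))  # (1, 0)
```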

Scenario Reward Logs

| Model | Red-eye found? | Upgrade price? | Bag math right? | Score | Reward |
| --- | --- | --- | --- | --- | --- |
| Apollo-1 | ✅ 22:00 → 06:30, $204 | ✅ +$80 Blue Extra | ✅ Fare + $40 bag = $244 | 3/3 | 1 |
| Gemini | ✅ 20:47 → 05:30, $139 | ❌ pulled upgrade from wrong airline | ❌ bag fee range, no total | 1/3 | 0 |

Benchmark Overview & Group Analysis 

The 111 scenarios are grouped around five distinct technical competencies that an enterprise-grade flight agent must master. Each group isolates one capability, so a perfect model would score 100% in every group:

| Group | Technical capability under test |
| --- | --- |
| Baseline Retrieval | Same-/next-day queries with minimal dialogue: slot-filling from a single prompt, immediate cheapest-non-stop extraction, earliest-arrival lookup, basic aircraft or alliance attribution, and quick FAA-safety or risk answers. |
| Core Search | Two-turn clarification flows and flexible-window optimisation: resolving missing dates or trip type, iterating on “next flight / land-by” constraints, and selecting the absolute lowest fare from a live inventory once parameters are locked. |
| Ancillary Costs | End-to-end fee arithmetic: parsing carrier-specific price tables (checked bags, pets, seat upgrades, Wi-Fi, refundability), aggregating per-leg vs per-trip charges, recomputing totals after user changes, and preserving fare-rule validity. |
| Cabin Precision | Fare-family graph reasoning and cabin hierarchy mapping (Economy → Premium Eco → Business/First): real-time availability checks, upgrade-delta calculation, group-size propagation through re-quotes, and class-of-service validation for non-stop constraints. |
| Constrained Planning | Multi-constraint itinerary optimisation: enforcing strict arrival/departure windows, airline/airport and lay-over filters, short-connection thresholds, hub or region routing, CO₂ or environmental factors, historical delay-risk ranking, and fallback-option generation when no direct match exists. |
[Figure: per-group pass-rate chart, Apollo-1 vs Gemini 2.5-Flash]
| Group | Scenario IDs | What it measures | Apollo-1 | Gemini |
| --- | --- | --- | --- | --- |
| Baseline Retrieval | 37–39 · 70–72 · 91–93 · 100–111 | Same-/next-day look-ups, cheapest non-stop pick, earliest-arrival check, basic aircraft/alliance flags, FAA-safety acknowledgement (e.g. “Cheapest BOS→MIA next Saturday”, “Land before midnight despite FAA alert”). | 85.7% | 71.4% |
| Core Search | 1–6 · 25–27 · 40–42 · 64–66 · 73–78 | Multi-turn date/trip-type clarification, flexible-window lowest fare, next-flight/arrival-window filters (e.g. round-trip MIA⇄NYC date probe; 1–7 Jul cheapest RT). | 95.2% | 28.6% |
| Ancillary Costs | 7–9 · 28–30 · 34–36 · 52–54 · 61–63 · 82–84 · 94–96 | Fee arithmetic: checked bags, pet-in-cabin, upgrade-vs-bag deltas, Wi-Fi/refundability tiers, OW-vs-RT price optimisation (e.g. basic-econ + checked bag; red-eye upgrade vs bag). | 57.1% | 0% |
| Cabin Precision | 13–18 · 46–51 · 85–90 · 97–99 | Fare-family graph reasoning, Premium/Biz quotes, group upgrades, non-stop cabin comparisons, multi-city upgrade pricing (e.g. Economy→Premium delta on non-stop JFK→LAX). | 90.5% | 0% |
| Constrained Planning | 10–12 · 19–24 · 31–33 · 43–45 · 55–60 · 67–69 · 79–81 | Multi-constraint routing: arrival/departure cut-offs, red-eyes + Wi-Fi, CO₂ & offset, hub/layover filters, delay-risk backups, national-park & festival trips. | 85.2% | 11.1% |
| Totals (n = 111 scenarios) | | | 82.9% | 21.6% |
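
The headline percentages follow directly from the raw pass counts (92 and 24 full-conversation passes out of 111, as reported in the Bottom Line below); a quick check:

```python
# Sanity-check of the headline pass rates from the raw counts.
passes = {"Apollo-1": 92, "Gemini 2.5-Flash": 24}
for model, n in passes.items():
    print(f"{model}: {n}/111 = {n / 111:.1%}")
# Apollo-1: 92/111 = 82.9%
# Gemini 2.5-Flash: 24/111 = 21.6%
```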

Apollo-1 Strengths Across the Five Competency Buckets

  • Context-aware dialogue control – in Core Search scenarios Apollo-1 clarifies missing slots (dates, O/W vs R/T, party size) before searching, then keeps that context intact through follow-up constraints.
  • High-fidelity data retrieval – consistently surfaces the exact fare family, bag fee, upgrade delta or aircraft type demanded in Ancillary Costs and Cabin Precision scenarios.
  • Multi-constraint filtering – satisfies arrival-before / depart-after windows, hub-routing, Wi-Fi-only, airline-specific and environmental (CO₂) filters in a single pass, dominating the Constrained Planning bucket.
  • Complete itineraries – returns full outbound + return legs where required, a failure point that still plagues Gemini in every bucket except Baseline Retrieval.

Areas that Still Need Work

  • True multi-city pricing flows – both models stumble on leg-by-leg bag or upgrade math in scenarios that chain three or more segments.
  • Long-tail ancillaries – niche items such as carrier-published CO₂ offset add-ons or exotic pet-in-cabin fees remain patchy (Gemini far more so).

Gemini’s Persistent Gaps

  • Fails slot clarification: often assumes round-trip, ignores supplied date windows, or answers with only the outbound leg.
  • Zero full-conversation passes in Ancillary Costs and Cabin Precision buckets.

Bottom Line

Generative AI is fine for user chat; Controllable AI is mandatory when an agent operates on behalf of a business or an entity. Across 111 one-shot scenarios, Apollo-1 delivers a full-conversation pass on 92 (82.9%); Gemini 2.5-Flash manages 24 (21.6%), underscoring the need for neuro-symbolic control. The advantage grows as queries add pricing arithmetic or routing constraints, signalling Apollo-1’s readiness for production settings where predictable, auditable behaviour is mandatory.

Appendix A: Reward Logs

View Full Scenario List

Each row pairs the tester prompts with both models’ per-step results (1 = pass, 0 = fail; two-step scenarios have no R3).

| # | Tester prompts | Gemini 2.5-Flash | Apollo-1 |
| --- | --- | --- | --- |
| 1 | R1: “Hi! I need round-trip flights from MIA to NYC.” · R2: “Where does the return leg depart from?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 2 | R1: “Hi! I need round-trip flights from BOS to WAS.” · R2: “Where does the return leg depart from?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 3 | R1: “Hi! I need round-trip flights from LON to PAR.” · R2: “Where does the return leg depart from?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 4 | R1: “Hi! I need to find round-trip flights from LON to PAR in August.” · R2: “What is the duration of each leg?” · R3: “What is the baggage allowance for each leg?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 5 | R1: “Hi! I need to find round-trip flights from MIA to NYC in August.” · R2: “What is the duration of each leg?” · R3: “What is the baggage allowance for each leg?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 6 | R1: “Hi! I need to find round-trip flights from BOS to WAS in August.” · R2: “What is the duration of each leg?” · R3: “What is the baggage allowance for each leg?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 7 | R1: “BOS → DUB 1 Nov – 8 Nov, basic-econ.” · R2: “Add one checked bag each way—what’s the new total?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 8 | R1: “MCO → DFW 10 Jul – 16 Jul, basic-econ.” · R2: “Add one checked bag each way—what’s the new total?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 9 | R1: “LAS → LAX 11 Aug – 14 Aug, basic-econ.” · R2: “Add one checked bag each way—what’s the new total?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 10 | R1: “Looking to fly tomorrow from BOS to NYC.” · R2: “Need to get to NYC on time for lunch.” · R3: “Is there a seat with more legroom on this flight?” | R1 0 · R2 1 · R3 0 → Fail (1/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 11 | R1: “Looking to fly tomorrow from LAX to SFO.” · R2: “Need to get to SFO on time for lunch.” · R3: “Is there a seat with more legroom on this flight?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 12 | R1: “Looking to fly tomorrow from MAD to BER.” · R2: “Need to get to BER on time for lunch.” · R3: “Is there a seat with more legroom on this flight?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |

Appendix B: Trajectories

View Full Trajectories 
