News

Apollo-1 One-Shots 83% of 111 Live Google-Flights Scenarios — A Controllable AI Showcase

Apollo-1’s 83% vs Gemini 2.5-Flash’s 22%: Where generative AI stalls, neuro-symbolic controllable AI powers through

06/25/25

Introduction

Generative AI excels when acting on behalf of a user, but it falters when an agent must act for an entity—whether it’s an insurance company, a brand, or a travel platform. That’s where controllable AI becomes essential, giving companies the levers to keep an agent aligned with their policies, guardrails, and desired behaviour. To demonstrate, we wired Apollo-1 into the same real-time Google Flights feed powering Gemini 2.5-Flash and ran 111 multi-turn conversations across both models.

Apollo-1 is engineered to operate on behalf of an entity, not an individual. Its neuro-symbolic core merges generative fluency with deterministic policy enforcement, giving organisations the control, traceability, and reliability that conventional LLM co-pilots lack. The head-to-head below shows how that design translates into real-world wins over Gemini when every query hits identical inventory, prices, and schedules.

Most leaderboards cherry-pick the best answer from a batch of generations. We do the opposite. Every prompt gets one shot—exactly what a live user would see—and that single response is what we score.

| Rule | What it enforces |
| --- | --- |
| 01 One shot per scenario | The agent gets zero retries; whatever it says in each turn is final. |
| 02 Multi-turn pass rule | A scenario passes only if every step (R-1…R-3) is correct. One miss → scenario fail. |
| 03 111 customer scenarios | Book a seat, add a bag, pick the greenest flight, reroute through a hub, quote an upgrade, etc. |
| 04 Live Google Flights backend | Both agents hit the same real-time inventory, so any gap is pure reasoning and control. |
| 05 Sampling window | Runs executed between May 30 and June 2, 2025; results reflect that snapshot. |

We reuse τ-Bench’s philosophy in miniature: no retries, binary checks that mirror real customer journeys. And because there’s no second chance, the outcome is a single, unforgiving line in the sand—a simple, reproducible, brutally honest snapshot of how an agent really behaves when someone is on the other side of the chat box.

Results in a Snapshot

[Figure: results snapshot]

How Controllable AI Keeps Agents on the Rails

For almost eight years we built toward one goal: agents that act on behalf of an entity, not a user.

  • Neuro-symbolic foundation model: a Symbolic Reasoner replaces the transformer as the model’s decision-making core, enabling controllable, context-aware interactions every turn.
  • Deterministic control on demand: symbolic guardrails guarantee brand and regulatory compliance before the answer leaves the model, while still leveraging generative fluency.
  • Real-time Control Panel: teams inspect reasoning traces, tweak context schemas, inject policies, and replay trajectories; fine-tuning is continuous and happens at the sub-interaction level instead of retraining the whole stack.

This is how an agent can operate on behalf of an airline, a bank, or a government office – not just with a user – and still sound natural.
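
Apollo-1’s internals are not public, so the sketch below only illustrates the general pattern described above: a deterministic policy gate that validates a generative draft before it is released. The `Policy` class, the example policies, and all field names are illustrative assumptions, not Apollo-1 APIs.

```python
# Minimal illustrative sketch of a deterministic guardrail gate
# (an assumed pattern, not Apollo-1's actual implementation).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    name: str
    check: Callable[[dict], bool]  # True if the drafted answer complies

# Hypothetical policies an entity might inject via a control panel.
POLICIES = [
    Policy("must_quote_total_price", lambda d: d.get("total_price") is not None),
    Policy("own_inventory_only", lambda d: d.get("carrier") in d.get("allowed_carriers", [])),
]

def release_or_block(draft: dict) -> dict:
    """Release the generative draft only if every symbolic policy passes;
    otherwise block it and report which rule failed (a reasoning trace)."""
    for policy in POLICIES:
        if not policy.check(draft):
            return {"released": False, "violated": policy.name}
    return {"released": True, "answer": draft}

# Example: a draft quoting another carrier's inventory is blocked before release.
draft = {"carrier": "OtherAir", "allowed_carriers": ["BlueSky"], "total_price": 244}
print(release_or_block(draft))  # {'released': False, 'violated': 'own_inventory_only'}
```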

Example Scenario

Scoring criteria — Scenario 28, “Same-day One-Way, Arrive Before Lunch”

| Step | Tester prompt | Check | Pass criterion |
| --- | --- | --- | --- |
| R-1 | “One-way LAX → JFK overnight on 9 Aug 2025 (arrive morning 10 Aug), 1 adult, economy.” | Return at least one qualifying red-eye itinerary | Departs 9 Aug, arrives next day before 12:00 PM, shows total price |
| R-2 | “What’s the cost to upgrade the cheapest red-eye?” | Quote upgrade price (or say none) for that same itinerary | States whether an upgrade exists and the correct price delta |
| R-3 | “If I skip the upgrade and instead add one checked bag, what’s the total cost?” | Add the airline’s checked-bag fee and compute new grand total | Gives per-bag fee, adds it to base fare, and returns the exact all-in total |

Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3); the scenario earns reward 1 only if every step passes, 0 otherwise.
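
To make the rule concrete, here is a minimal sketch of the scoring arithmetic as described above (our own restatement, not the benchmark harness itself):

```python
def score_scenario(step_results: list[bool]) -> tuple[int, int]:
    """Each step scores 1 (pass) or 0 (fail); the scenario earns
    reward 1 only if every step passes (no retries, no partial credit)."""
    score = sum(step_results)
    reward = 1 if all(step_results) else 0
    return score, reward

# Scenario 28, Apollo-1 run: all three checks pass -> 3/3, reward 1.
print(score_scenario([True, True, True]))    # (3, 1)
# Gemini run: only R-1 passes -> 1/3, reward 0.
print(score_scenario([True, False, False]))  # (1, 0)
```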

Scenario Reward Logs

| Model | Red-eye found? | Upgrade price? | Bag math right? | Score | Reward |
| --- | --- | --- | --- | --- | --- |
| Apollo-1 | ✅ 22:00 → 06:30, $204 | ✅ +$80 Blue Extra | ✅ Fare + $40 bag = $244 | 3/3 | 1 |
| Gemini | ✅ 20:47 → 05:30, $139 | ❌ pulled upgrade from wrong airline | ❌ bag fee range, no total | 1/3 | 0 |

Benchmark Overview & Group Analysis 

The 111 scenarios are grouped around five distinct technical competencies that an enterprise-grade flight agent must master. Each group isolates one capability, so a perfect model would score 100% in every group:

| Group | Technical capability under test |
| --- | --- |
| Baseline Retrieval | Same-/next-day queries with minimal dialogue: slot-filling from a single prompt, immediate cheapest-non-stop extraction, earliest-arrival lookup, basic aircraft or alliance attribution, and quick FAA-safety or risk answers. |
| Core Search | Two-turn clarification flows and flexible-window optimisation: resolving missing dates or trip type, iterating on “next flight / land-by” constraints, and selecting the absolute lowest fare from a live inventory once parameters are locked. |
| Ancillary Costs | End-to-end fee arithmetic: parsing carrier-specific price tables (checked bags, pets, seat upgrades, Wi-Fi, refundability), aggregating per-leg vs per-trip charges, recomputing totals after user changes, and preserving fare-rule validity. |
| Cabin Precision | Fare-family graph reasoning and cabin hierarchy mapping (Economy → Premium Eco → Business/First): real-time availability checks, upgrade-delta calculation, group-size propagation through re-quotes, and class-of-service validation for non-stop constraints. |
| Constrained Planning | Multi-constraint itinerary optimisation: enforcing strict arrival/departure windows, airline/airport and lay-over filters, short-connection thresholds, hub or region routing, CO₂ or environmental factors, historical delay-risk ranking, and fallback-option generation when no direct match exists. |
[Figure: per-group pass-rate chart, Apollo-1 vs Gemini 2.5-Flash]
| Group | Scenario IDs | What it measures | Apollo-1 | Gemini |
| --- | --- | --- | --- | --- |
| Baseline Retrieval | 37–39 · 70–72 · 91–93 · 100–111 | Same-/next-day look-ups, cheapest non-stop pick, earliest-arrival check, basic aircraft/alliance flags, FAA-safety acknowledgement (e.g. “Cheapest BOS→MIA next Saturday”, “Land before midnight despite FAA alert”). | 85.7% | 71.4% |
| Core Search | 1–6 · 25–27 · 40–42 · 64–66 · 73–78 | Multi-turn date/trip-type clarification, flexible-window lowest fare, next-flight/arrival-window filters (e.g. round-trip MIA⇄NYC date probe; 1–7 Jul cheapest RT). | 95.2% | 28.6% |
| Ancillary Costs | 7–9 · 28–30 · 34–36 · 52–54 · 61–63 · 82–84 · 94–96 | Fee arithmetic: checked bags, pet-in-cabin, upgrade-vs-bag deltas, Wi-Fi/refundability tiers, OW-vs-RT price optimisation (e.g. basic-econ + checked bag; red-eye upgrade vs bag). | 57.1% | 0% |
| Cabin Precision | 13–18 · 46–51 · 85–90 · 97–99 | Fare-family graph reasoning, Premium/Biz quotes, group upgrades, non-stop cabin comparisons, multi-city upgrade pricing (e.g. Economy→Premium delta on non-stop JFK→LAX). | 90.5% | 0% |
| Constrained Planning | 10–12 · 19–24 · 31–33 · 43–45 · 55–60 · 67–69 · 79–81 | Multi-constraint routing: arrival/departure cut-offs, red-eyes + Wi-Fi, CO₂ & offset, hub/layover filters, delay-risk backups, national-park & festival trips. | 85.2% | 11.1% |
| Totals (n = 111 scenarios) | | | 82.9% | 21.6% |
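
The headline percentages follow directly from the raw pass counts (92 and 24 full-conversation passes out of 111, as reported in the Bottom Line below); a quick check:

```python
# Sanity-check of the headline pass rates from the raw counts.
passes = {"Apollo-1": 92, "Gemini 2.5-Flash": 24}
for model, n in passes.items():
    print(f"{model}: {n}/111 = {n / 111:.1%}")
# Apollo-1: 92/111 = 82.9%
# Gemini 2.5-Flash: 24/111 = 21.6%
```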

Apollo-1 Strengths Across the Five Competency Buckets

  • Context-aware dialogue control – in Core Search scenarios Apollo-1 clarifies missing slots (dates, O/W vs R/T, party size) before searching, then keeps that context intact through follow-up constraints.
  • High-fidelity data retrieval – consistently surfaces the exact fare family, bag fee, upgrade delta or aircraft type demanded in Ancillary Costs and Cabin Precision scenarios.
  • Multi-constraint filtering – satisfies arrival-before / depart-after windows, hub-routing, Wi-Fi-only, airline-specific and environmental (CO₂) filters in a single pass, dominating the Constrained Planning bucket.
  • Complete itineraries – returns full outbound + return legs where required, a failure point that still plagues Gemini in every bucket except Baseline Retrieval.

Areas that Still Need Work

  • True multi-city pricing flows – both models stumble on leg-by-leg bag or upgrade math in scenarios that chain three or more segments.
  • Long-tail ancillaries – niche items such as carrier-published CO₂ offset add-ons or exotic pet-in-cabin fees remain patchy (Gemini far more so).

Gemini’s Persistent Gaps

  • Fails slot clarification: often assumes round-trip, ignores supplied date windows, or answers with only the outbound leg.
  • Zero full-conversation passes in Ancillary Costs and Cabin Precision buckets.

Bottom Line

Generative AI is fine for user chat; Controllable AI is mandatory when an agent operates on behalf of a business or an entity. Across 111 one-shot scenarios, Apollo-1 delivers a full-conversation pass on 92 (82.9%); Gemini 2.5-Flash manages 24 (21.6%), underscoring the need for neuro-symbolic control. The advantage grows as queries add pricing arithmetic or routing constraints, signalling Apollo-1’s readiness for production settings where predictable, auditable behaviour is mandatory.

Appendix A: Reward Logs

View Full Scenario List

Each row pairs the tester prompts with both models’ per-step results (1 = pass, 0 = fail; two-step scenarios have no R3).

| # | Tester prompts | Gemini 2.5-Flash | Apollo-1 |
| --- | --- | --- | --- |
| 1 | R1: “Hi! I need round-trip flights from MIA to NYC.” · R2: “Where does the return leg depart from?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 2 | R1: “Hi! I need round-trip flights from BOS to WAS.” · R2: “Where does the return leg depart from?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 3 | R1: “Hi! I need round-trip flights from LON to PAR.” · R2: “Where does the return leg depart from?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 4 | R1: “Hi! I need to find round-trip flights from LON to PAR in August.” · R2: “What is the duration of each leg?” · R3: “What is the baggage allowance for each leg?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 5 | R1: “Hi! I need to find round-trip flights from MIA to NYC in August.” · R2: “What is the duration of each leg?” · R3: “What is the baggage allowance for each leg?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 6 | R1: “Hi! I need to find round-trip flights from BOS to WAS in August.” · R2: “What is the duration of each leg?” · R3: “What is the baggage allowance for each leg?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 7 | R1: “BOS → DUB 1 Nov – 8 Nov, basic-econ.” · R2: “Add one checked bag each way—what’s the new total?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 8 | R1: “MCO → DFW 10 Jul – 16 Jul, basic-econ.” · R2: “Add one checked bag each way—what’s the new total?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 9 | R1: “LAS → LAX 11 Aug – 14 Aug, basic-econ.” · R2: “Add one checked bag each way—what’s the new total?” | R1 0 · R2 0 → Fail (0/2) | R1 1 · R2 1 → Pass (2/2) |
| 10 | R1: “Looking to fly tomorrow from BOS to NYC.” · R2: “Need to get to NYC on time for lunch.” · R3: “Is there a seat with more legroom on this flight?” | R1 0 · R2 1 · R3 0 → Fail (1/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 11 | R1: “Looking to fly tomorrow from LAX to SFO.” · R2: “Need to get to SFO on time for lunch.” · R3: “Is there a seat with more legroom on this flight?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |
| 12 | R1: “Looking to fly tomorrow from MAD to BER.” · R2: “Need to get to BER on time for lunch.” · R3: “Is there a seat with more legroom on this flight?” | R1 0 · R2 0 · R3 0 → Fail (0/3) | R1 1 · R2 1 · R3 1 → Pass (3/3) |

Appendix B: Trajectories

View Full Trajectories 
