Real-Time Hallucination Detection in Voice Agents

Overview

AI agents are increasingly used in real-time customer service conversations, but as adoption increases, so do the risks of low-quality output: hallucinated facts, misinterpreted inputs, and unresolved objectives. These aren’t abstract concerns—they erode trust, create confusion, and often result in unmet customer needs.

This project was designed to solve exactly that: we built a real-time hallucination detection and correction pipeline for AI Voice Agents inside Feeding Frenzy CRM. The system runs on Buffaly, our in-house implementation of OGAR (Ontology-Guided Augmented Retrieval), and operates at runtime to monitor AI behavior, catch mistakes, and redirect the conversation.

The Problem

AI hallucinations are common in generative dialogue systems. They manifest as unsupported claims, incorrect assumptions, or confident answers to questions that were never asked. In real-time voice contexts, this is compounded by issues like:

  • Audio/text mismatches due to dialect, transcription error, or model bias.
  • Unsupported conclusions drawn from minimal input.
  • Agents that move on before completing their assigned objective.

In short, conversations fail in subtle but critical ways. When this happens during a support call, it leads to dissatisfaction, failed automation, and the very problems AI was supposed to fix.

Solution Architecture

We built a real-time hallucination detection system by embedding OGAR directly into Feeding Frenzy Voice Agent pipelines via Buffaly. Buffaly continuously monitors AI-agent conversations in real time; its monitoring is structured around three key capabilities:

  • Hallucination Detection
    Buffaly compares AI statements against structured ontologies to check for factual grounding. If an agent outputs a claim like "Thanks, Paul Johnson," but the caller never gave that name, Buffaly flags it immediately.
  • Audio/Text Mismatch Resolution
    Buffaly tracks the raw audio and transcribed text simultaneously, surfacing any misalignments (e.g., "I'm a watermelon" as a potential mis-transcription of "Florida mall").
  • Objective Verification
    Every conversation is guided by a set of goals defined in the ontology. Buffaly tracks whether those goals are met in a structured way (e.g., has the agent correctly gathered first and last name?). If the agent fails to satisfy a required field, Buffaly stops it from moving on (see the sketch after this list).
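To make the first and third capabilities concrete, here is a minimal, hypothetical Python sketch. The class and function names are illustrative stand-ins, not Buffaly's actual API: it checks the slots an agent asserts against the structured record gathered so far, and refuses to mark an objective complete until every required field is grounded.

```python
# Minimal sketch (hypothetical names and structures, not the actual Buffaly API):
# ground agent claims and gate objective completion against a structured record.

from dataclasses import dataclass, field

@dataclass
class OntologyRecord:
    """Structured facts collected so far, keyed by slot name (e.g. 'first_name')."""
    slots: dict[str, str] = field(default_factory=dict)

@dataclass
class Objective:
    """A conversational goal defined in the ontology: the slots it requires."""
    name: str
    required_slots: list[str]

def find_ungrounded_claims(agent_claims: dict[str, str], record: OntologyRecord) -> list[str]:
    """Return slot names the agent asserted without supporting data in the record."""
    return [
        slot for slot, value in agent_claims.items()
        if record.slots.get(slot) != value
    ]

def objective_complete(objective: Objective, record: OntologyRecord) -> bool:
    """An objective is satisfied only when every required slot has a grounded value."""
    return all(record.slots.get(slot) for slot in objective.required_slots)

if __name__ == "__main__":
    record = OntologyRecord(slots={"first_name": "Paul"})          # caller only gave a first name
    objective = Objective("identify_caller", ["first_name", "last_name"])

    agent_claims = {"first_name": "Paul", "last_name": "Johnson"}  # "Thanks, Paul Johnson"
    print(find_ungrounded_claims(agent_claims, record))  # ['last_name'] -> flag as ungrounded
    print(objective_complete(objective, record))         # False -> block moving on
```

The roadside-assistance example in the next section walks through exactly this situation: a last name asserted with no supporting data, and an objective that stays open until the agent asks again.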

Implementation Example: AAA Roadside Assistance

A real-world customer service scenario highlights this clearly. The agent is supposed to gather the customer’s name and location before dispatch. However, the customer has a thick Florida accent, and the voice transcription is ambiguous. The AI agent confidently responds with an incorrect name (“Thank you, Paul Johnson”) and proceeds to the next objective, even though it never received a last name.

Buffaly intercepts this, extracts the agent’s assumptions, and checks them for supporting data. It finds none, flags the statement as a hallucination, and marks the objective as incomplete. It then prompts the agent to clarify instead of moving forward. Later in the call, the user says “I’m a watermelon,” and the agent interprets it as “Mims, Florida.” Buffaly uses ontology-based reasoning to determine that this interpretation is supported neither by the audio nor by the conversation logic. It corrects the location and redirects the agent to confirm it.
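A rough illustration of that location check, with simple string similarity standing in for Buffaly's ontology-based reasoning (the location list and function are hypothetical, not part of the actual system):

```python
# Hypothetical sketch: reject a transcribed location that has no ontology support and
# surface near-matches for the agent to confirm. difflib is a crude stand-in for
# Buffaly's ontology-guided reasoning over audio and conversation context.
import difflib

KNOWN_LOCATIONS = ["Mims, Florida", "Orlando, Florida", "Titusville, Florida"]  # illustrative entries

def check_location(transcribed: str, candidates=KNOWN_LOCATIONS):
    """Return (is_supported, suggestions) for a transcribed location string."""
    if transcribed in candidates:
        return True, []
    # Not supported: offer close candidates so the agent asks rather than guesses.
    return False, difflib.get_close_matches(transcribed, candidates, n=2, cutoff=0.4)

print(check_location("I'm a watermelon"))  # unsupported -> agent must confirm with the caller
print(check_location("Mims, Florida"))     # supported -> safe to proceed
```

The real system reasons over the ontology and the audio rather than string distance; the point of the sketch is only the gate: unsupported values trigger a confirmation step instead of being passed on to dispatch.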

Key Techniques

  • Prototype Matching: Buffaly uses ontology-defined prototypes to establish expected conversational goals.
  • Statement Extraction + Source Mapping: All assertions by the agent are broken down and traced back to user input or verified ontology records.
  • Semantic Mismatch Detection: Discrepancies between intent, transcription, and output are classified as hallucinations, transcription errors, or unmet objectives.
  • Corrective Guidance: Buffaly doesn’t just flag errors; it intervenes with structured next steps to keep the conversation on track (see the sketch below).
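A compressed, hypothetical sketch of how the last three techniques might fit together (the enum and functions are illustrative, not Buffaly's internal types): each extracted assertion is traced to a source, labeled as a specific kind of failure, and mapped to a corrective next step.

```python
# Hypothetical sketch of the classification and guidance steps. Names are illustrative.

from enum import Enum

class Finding(Enum):
    GROUNDED = "grounded"
    HALLUCINATION = "hallucination"              # no user input or record supports it
    TRANSCRIPTION_ERROR = "transcription_error"  # text disagrees with the audio-side hypothesis
    UNMET_OBJECTIVE = "unmet_objective"          # a required goal slot is still missing

def classify(assertion: dict, user_inputs: set, audio_hypotheses: set) -> Finding:
    """Map one extracted assertion to a finding the guidance layer can act on."""
    value = assertion["value"]
    if value in user_inputs:
        return Finding.GROUNDED
    if value in audio_hypotheses:
        return Finding.TRANSCRIPTION_ERROR
    return Finding.HALLUCINATION

def next_step(finding: Finding) -> str:
    """Corrective guidance: turn a finding into a structured next step for the agent."""
    return {
        Finding.GROUNDED: "continue",
        Finding.HALLUCINATION: "ask the caller to confirm the detail before proceeding",
        Finding.TRANSCRIPTION_ERROR: "re-confirm against the audio-side hypothesis",
        Finding.UNMET_OBJECTIVE: "return to the unfinished objective",
    }[finding]

# Example: the agent asserts a last name the caller never gave.
finding = classify({"slot": "last_name", "value": "Johnson"},
                   user_inputs={"Paul"}, audio_hypotheses=set())
print(finding, "->", next_step(finding))  # Finding.HALLUCINATION -> ask the caller to confirm ...
```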

Results

Buffaly reduced hallucinations and unmet objectives by over 30% across trial deployments. More importantly, it allowed customer service AI to behave with a higher level of integrity. Conversations that would previously have continued on incorrect assumptions were corrected in flight. Agents were no longer "just sounding confident": they were provably correct, or they asked for clarification.

Broader Impact

This project represents a shift away from freeform generative responses and toward structured, supervised, ontology-aware automation. Buffaly ensures that AI agents don’t invent facts, don’t skip steps, and don’t hide behind language models. All of this runs locally, in real time, on top of structured ontologies tailored to the domain, whether that’s roadside assistance, billing, or insurance.

Looking Ahead

We’re expanding Buffaly’s hallucination detection framework to support multimodal interactions and broadening our domain ontologies for deployment into finance, healthcare, and telecom. If you’re shipping AI into production environments and still treating hallucinations as just a model-tuning problem, you’re not solving the right problem. OGAR is how we address it. Buffaly is how we run it.