
Detecting Hallucinations in Language Models with Natural Language Inference


AI systems continue to advance at an impressive pace, yet one particular challenge remains surprisingly persistent across all major models: hallucinations. These are the moments when a language model produces an answer that reads fluently and convincingly, but is factually incorrect. OpenAI’s recent research underscores that hallucinations are not accidental artefacts or unpredictable glitches. Instead, they emerge from the structural incentives and statistical processes that underpin how modern language models are trained, evaluated, and rewarded. Even GPT-5, which exhibits far stronger reasoning ability and significantly fewer hallucinations, cannot avoid them entirely.

Why Hallucinations Occur

OpenAI defines hallucinations as “plausible but false statements” generated by language models. They can appear in the most ordinary interactions. In one example highlighted by the research, chatbots repeatedly provided incorrect answers about the birth date and dissertation title of one of the paper’s authors - each time with complete confidence. This behaviour traces back to how models are taught. Most evaluation systems reward accuracy without rewarding uncertainty. Models are therefore incentivised to guess when unsure, because guessing may earn points while saying “I don’t know” earns nothing.

The analogy OpenAI uses is a multiple-choice exam: if you guess, there’s a chance you get it right; if you leave it blank, you definitely get it wrong. Over thousands of questions, a model that guesses will appear stronger than one that is conservative - even if it produces far more factual errors along the way. Metrics that focus exclusively on accuracy unintentionally encourage this behaviour, pushing models to output confident answers even when they lack the information to justify them.
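
To make the incentive concrete, here is a small back-of-the-envelope sketch. The 25% guess rate and the scoring rules are illustrative assumptions of my own, not figures from the OpenAI paper: under an accuracy-only metric a blind guess has positive expected value, but under a metric that penalises confident wrong answers, abstaining wins.

// Illustrative only: expected score for a single question the model cannot answer.
// pCorrect is an assumed 25% chance that a blind guess happens to be right.
const pCorrect = 0.25;

// Accuracy-only scoring: 1 point if correct, 0 otherwise.
const guessUnderAccuracy = pCorrect * 1 + (1 - pCorrect) * 0; // 0.25
const abstainUnderAccuracy = 0;                               // "I don't know" earns nothing

// Uncertainty-aware scoring (hypothetical): +1 correct, -1 wrong, 0 for abstaining.
const guessUnderPenalty = pCorrect * 1 + (1 - pCorrect) * -1; // -0.5
const abstainUnderPenalty = 0;

console.log({ guessUnderAccuracy, abstainUnderAccuracy, guessUnderPenalty, abstainUnderPenalty });
// Guessing looks better under accuracy-only metrics, but worse once wrong answers are penalised.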

The Role of Next-Word Prediction

Hallucinations are also deeply connected to how models are pretrained. During pretraining, an LLM is only asked to predict the next word in billions of text sequences. It is never told which statements are true or false. Because it learns only from examples of fluent language, the model absorbs patterns of grammar and structure extremely well. However, low-frequency factual details - particularly dates, quantities, personal information, or niche domain facts - do not follow consistent patterns in natural text. As a result, the model fills in the gaps using probability, not truth.

This distinction explains why hallucinations persist even as raw accuracy improves. Some questions are inherently unanswerable due to ambiguous phrasing, incomplete world knowledge, or missing context. Pretraining alone cannot resolve these limitations. Accuracy cannot reach 100% in open-ended settings, which means the model will always face situations where it must either hold back or attempt a guess. If evaluation systems penalise abstention, guesses become the default.

OpenAI’s paper addresses several misconceptions directly:

  • Hallucinations won’t disappear simply by increasing accuracy, because some questions have no determinable answer.
  • Hallucinations are not inevitable, because models can abstain when uncertain.
  • Larger models are not always better at avoiding hallucinations, as smaller models may recognise their own knowledge boundaries more reliably.
  • Hallucinations are not mysterious, but rather a statistical outcome of next-word prediction.
  • We don’t just need better hallucination evals; we need mainline eval metrics that reward uncertainty-aware behaviour.

All of this leads to a practical conclusion: hallucinations require mitigation, not blind trust that future models will magically eliminate them.

Natural Language Inference as a Safety Layer

This is where Natural Language Inference (NLI) becomes particularly useful. NLI is a long-standing NLP task focused on determining the logical relationship between two pieces of text: a premise (trusted information) and a hypothesis (a claim or generated statement).

The relationship falls into one of three categories:

  • Entailment - The claim follows logically from the source.
  • Contradiction - The claim conflicts with the source.
  • Neutral - The source does not confirm or deny the claim.

In the context of hallucination detection:

  • Contradiction indicates a definite hallucination.
  • Neutral suggests that the model introduced information not present in the source.

This makes NLI an elegant and model-agnostic approach: instead of asking, “Is this fact true?”, we ask the simpler, more reliable question: “Does this statement follow from the information we trust?”

Building an NLI-Based Hallucination Detector in JavaScript

To demonstrate how NLI can be applied in practice, I built a small hallucination-detection module using Transformers.js alongside the Xenova/nli-deberta-v3-base model. This setup runs entirely in JavaScript - either in Node.js or directly in the browser - and requires no Python backend.

The detector follows a clear workflow:

  1. Load the NLI model via a Transformers.js pipeline.
  2. Accept a trusted reference text (the premise).
  3. Take the LLM’s generated claim (the hypothesis).
  4. Run the premise–hypothesis pair through the NLI model.
  5. Interpret the label as entailment, contradiction, or neutral.
  6. Flag contradiction and neutral outputs as potential hallucinations.

To make the evaluation more robust, the detector also breaks longer generations into smaller, “atomic” factual units. LLM outputs often contain multiple facts woven together through conjunctions like and, however, because, or but. By splitting these into individual claims, the NLI model can assess each fact independently. This avoids cases where a single incorrect clause corrupts the evaluation of an entire multi-clause sentence.

The implementation then aggregates all results into metrics such as hallucination rate and overall faithfulness score, offering a high-level picture of how reliable the generation is relative to the provided context.

Let’s take a look at how this could be implemented in Node.js.

Import and basic detector skeleton

// Import the Transformers.js pipeline helper.
// This gives us a simple way to load a model and run inference.
import { pipeline } from '@xenova/transformers';

/**
 * HallucinationDetector
 *
 * Uses a Natural Language Inference (NLI) model to check whether a generated
 * claim is supported by a given source text.
 *
 * - sourceText: trusted reference (e.g. documentation, retrieved context)
 * - generatedClaim: model output you want to verify
 *
 * The NLI model classifies the pair as:
 *   - entailment     → claim is supported by the source
 *   - contradiction  → claim conflicts with the source
 *   - neutral        → cannot be determined from the source
 *
 * We treat "contradiction" and "neutral" as potential hallucinations.
 */
class HallucinationDetector {
  constructor() {
    // Will hold the initialised classifier pipeline.
    this.classifier = null;
  }

This imports the high-level pipeline API from Transformers.js and declares the base HallucinationDetector class. The class encapsulates an NLI model and exposes a clean interface for checking claims. The constructor just initialises a classifier field to null, so the model is only loaded when first needed.

Lazy model initialisation

  /**
   * Lazily load and initialise the NLI classifier.
   * This only happens once; subsequent calls reuse the same instance.
   */
  async initialise() {
    // Create a text-classification pipeline using an NLI model.
    // Xenova/nli-deberta-v3-base is a DeBERTa-based NLI model ported to Transformers.js.
    this.classifier = await pipeline('text-classification', 'Xenova/nli-deberta-v3-base');
  }

The initialise method loads the DeBERTa-based NLI model into a text-classification pipeline. It is asynchronous and only called when the first detection is run, which prevents unnecessary model loading if the detector is never used.

Single-claim hallucination detection

  /**
   * Run hallucination detection for a single claim against a source.
   *
   * @param {string} sourceText      - The trusted source text (premise).
   * @param {string} generatedClaim  - The claim to verify (hypothesis).
   * @returns {Promise<{
   *   isHallucination: boolean,
   *   label: 'entailment' | 'neutral' | 'contradiction',
   *   confidence: number,
   *   severity: 'none' | 'low' | 'medium' | 'high'
   * }>}
   */
  async detectHallucination(sourceText, generatedClaim) {
    // Ensure the classifier is loaded before first use.
    if (!this.classifier) {
      await this.initialise();
    }

    // Many NLI models are trained to receive:
    //   "[premise] </s> [hypothesis]"
    // as a single string input.
    const input = `${sourceText} </s> ${generatedClaim}`;

    // Run inference. The pipeline returns an array of predictions.
    // e.g. [{ label: 'entailment', score: 0.95 }]
    const result = await this.classifier(input);

    const label = result[0].label;
    const confidence = result[0].score;

    // We consider "contradiction" and "neutral" as hallucinations:
    // - contradiction: clearly wrong given the source
    // - neutral: not supported by the source (may be unverifiable or extra)
    const isHallucination = label === 'contradiction' || label === 'neutral';

    return {
      isHallucination,
      label,
      confidence,
      severity: this.getSeverity(label, confidence),
    };
  }

This is the core method. It ensures the classifier is loaded, formats the NLI input as "premise </s> hypothesis", and calls the model. It then pulls out the top label and score, interprets contradiction and neutral as hallucinations, and returns a structured result including a severity level derived from the model’s confidence.

Mapping labels to severity

  /**
   * Map the NLI output into a simple severity level.
   *
   * This is a heuristic; you can tune thresholds based on your use case.
   */
  getSeverity(label, confidence) {
    if (label === 'contradiction') {
      // High-confidence contradictions are very likely serious hallucinations.
      return confidence > 0.9 ? 'high' : 'medium';
    } else if (label === 'neutral') {
      // Neutral means "not entailed"; often less severe but still suspicious.
      return confidence > 0.8 ? 'medium' : 'low';
    }

    // For entailment, we consider there to be no hallucination.
    return 'none';
  }

This helper converts the label and confidence score into a coarse severity rating. Strong contradictions are “high” severity, weaker ones are “medium”; neutral claims are treated as low or medium risk depending on confidence; and entailed claims are treated as having no hallucination.

Multi-claim checking and overall score

  /**
   * Check multiple claims for hallucinations against the same source.
   *
   * @param {string} sourceText - Trusted source text.
   * @param {string[]} claims   - Array of claims to check.
   * @returns {Promise<{
   *   claims: Array<{
   *     text: string,
   *     isHallucination: boolean,
   *     label: string,
   *     confidence: number,
   *     severity: string
   *   }>,
   *   overallScore: number
   * }>}
   *
   * overallScore is a simple "faithfulness" metric between 0 and 1:
   *   1 = no hallucinations
   *   0 = all claims hallucinated
   */
  async checkMultipleClaims(sourceText, claims) {
    // Run all claim checks in parallel for efficiency.
    const results = await Promise.all(
      claims.map((claim) => this.detectHallucination(sourceText, claim)),
    );

    return {
      // Attach the original text of each claim to its result.
      claims: claims.map((claim, i) => ({
        text: claim,
        ...results[i],
      })),
      // Aggregate into a simple overall faithfulness score.
      overallScore: this.calculateOverallScore(results),
    };
  }

  /**
   * Compute a simple overall faithfulness score.
   *
   * score = 1 - (# hallucinated claims / total claims)
   */
  calculateOverallScore(results) {
    const hallucinationCount = results.filter((r) => r.isHallucination).length;
    return results.length === 0
      ? 1 // edge-case: no claims → treat as fully faithful
      : 1 - hallucinationCount / results.length;
  }
}

Here you move from single claims to batches. checkMultipleClaims runs all detections in parallel, then returns both per-claim results and a single overallScore which is effectively “1 − hallucination rate”. calculateOverallScore implements that simple metric and handles the edge case of no claims.

Decimal-safe sentence splitting

/**
 * Helper: decimal-safe sentence splitting.
 *
 * Splits on '.', '!' or '?' that are NOT part of a number like "5.6".
 *
 * Examples:
 *   "Singapore has 5.6 million people. It is an island nation."
 *   → ["Singapore has 5.6 million people", "It is an island nation"]
 */
function splitSentences(text) {
  return (
    text
      // Split on punctuation that is not sandwiched between two digits (e.g. "5.6").
      // (?<!\d)[.!?]+  → punctuation not preceded by a digit
      // [.!?]+(?!\d)   → punctuation not followed by a digit
      // A split is skipped only when digits appear on BOTH sides of the punctuation.
      .split(/(?<!\d)[.!?]+|[.!?]+(?!\d)/g)
      .map((s) => s.trim())
      .filter((s) => s.length > 0)
  );
}

This helper splits long text into sentences without accidentally splitting numbers like “5.6”. It uses lookbehind and lookahead to ensure punctuation is not surrounded by digits, then trims and filters out empty fragments.

Conjunction-based claim separator

/**
 * Regex for conjunctions/connectors that often indicate separate factual units.
 *
 * This is intentionally a bit aggressive to break compound sentences into
 * simpler clauses that NLI can judge more reliably.
 */
const CLAIM_SEPARATOR_REGEX =
  /\b(?:and|but|or|nor|yet|so|however|although|though|even though|nevertheless|nonetheless|still|on the other hand|in contrast|because|since|therefore|thus|consequently|as a result|also|further|furthermore|moreover|in addition|plus|as well as|when|while|before|after|once|if|unless)\b/i;

This regular expression finds conjunctions and discourse markers that typically join separate factual ideas. You later use it to chop sentences into smaller, more “atomic” claims so that each claim can be evaluated independently by the NLI model.

Advanced detector: decomposing generated text

class AdvancedHallucinationDetector extends HallucinationDetector {
  /**
   * Decompose a generated text into simpler, "atomic" claims.
   *
   * Steps:
   *  1. Split into sentences using decimal-safe punctuation splitting.
   *  2. Further split each sentence on conjunctions/connectors that often
   *     separate independent factual units.
   *  3. Trim and filter out very short fragments to avoid noise.
   *
   * @param {string} text - The generated text to decompose.
   * @returns {string[]}  - Array of claim strings.
   */
  decomposeClaims(text) {
    // Step 1: split text into sentences, but do NOT break numeric decimals.
    const sentences = splitSentences(text);

    const claims = [];

    for (const sentence of sentences) {
      // Step 2: split on common conjunctions/connectors.
      //
      // Example:
      //   "Singapore is a sovereign island nation and has a population of 5.6 million."
      // becomes:
      //   ["Singapore is a sovereign island nation",
      //    "has a population of 5.6 million"]
      //
      // This allows NLI to independently evaluate each factual piece.
      const parts = sentence.split(CLAIM_SEPARATOR_REGEX);

      for (let part of parts) {
        // Step 3: clean up whitespace and trailing punctuation.
        part = part.trim();

        // Remove leading/trailing commas/semicolons/extra spaces.
        part = part.replace(/^[,;:\s]+|[,;:\s]+$/g, '');

        // Filter out very short fragments (e.g., "and", "however", etc).
        // Threshold 10 chars is heuristic; you can tune this.
        if (part.length > 10) {
          claims.push(part);
        }
      }
    }

    return claims;
  }

This subclass extends the base detector with the ability to break a full LLM response into atomic claims. It uses splitSentences to get sentences, then splits each sentence on conjunctions, cleans up punctuation, and discards short fragments that are unlikely to be meaningful claims. The result is a list of reasonably self-contained factual statements.

Full-generation analysis

  /**
   * analyse a generated text against a source text:
   *
   * - Decompose the generated text into smaller claims.
   * - Run hallucination detection on each claim.
   * - Aggregate statistics and return detailed results.
   *
   * @param {string} sourceText     - Trusted source (premise).
   * @param {string} generatedText  - Model-generated text to analyse.
   * @returns {Promise<{
   *   totalClaims: number,
   *   hallucinationCount: number,
   *   hallucinationRate: number,
   *   hallucinations: Array<{
   *     text: string,
   *     isHallucination: boolean,
   *     label: string,
   *     confidence: number,
   *     severity: string
   *   }>,
   *   faithfulnessScore: number,
   *   details: Array<{
   *     text: string,
   *     isHallucination: boolean,
   *     label: string,
   *     confidence: number,
   *     severity: string
   *   }>
   * }>}
   */
  async analyseGeneration(sourceText, generatedText) {
    // Step 1: break generated text into atomic claims.
    const claims = this.decomposeClaims(generatedText);

    // Step 2: run hallucination detection for each claim.
    const results = await this.checkMultipleClaims(sourceText, claims);

    // Step 3: extract only hallucinated claims (neutral or contradiction).
    const hallucinations = results.claims.filter((c) => c.isHallucination);

    const totalClaims = claims.length;
    const hallucinationCount = hallucinations.length;

    return {
      totalClaims,
      hallucinationCount,
      // Fraction of claims that are hallucinations.
      hallucinationRate: totalClaims === 0 ? 0 : hallucinationCount / totalClaims,
      // List of only the problematic claims.
      hallucinations,
      // Overall "how faithful is this generation to the source?" in [0, 1].
      faithfulnessScore: results.overallScore,
      // Per-claim breakdown with labels, confidence, and severity.
      details: results.claims,
    };
  }
}

analyseGeneration is the high-level API for a whole LLM response. It decomposes the text, runs the base detector on each claim, filters out only hallucinated ones, and returns summary metrics plus detailed per-claim data. This is what you’d call from a chat app or RAG pipeline.

Instantiating detectors

// Instantiate detectors first
const detector = new HallucinationDetector();
const advancedDetector = new AdvancedHallucinationDetector();

Here you create instances of both the basic and advanced detectors.

Helper functions for running tests

// ---------------------------------------------------------------------------
// Helper functions
// ---------------------------------------------------------------------------

// Pretty-print a single-claim test
async function runSingleTest(title, source, claim) {
  console.log(`\n=== ${title} ===\n`);
  console.log('Source:');
  console.log(`  "${source}"\n`);
  console.log('Claim:');
  console.log(`  "${claim}"\n`);

  const result = await detector.detectHallucination(source, claim);

  console.log('Result:');
  console.table([
    {
      claim,
      isHallucination: result.isHallucination,
      label: result.label,
      confidence: result.confidence,
      severity: result.severity,
    },
  ]);
}

runSingleTest is a small utility that prints a labelled single-claim check, logs the source and claim, then uses console.table to show the result in a readable tabular format.

// Pretty-print a multi-claim test
async function runMultiTest(title, source, generated) {
  console.log(`\n=== ${title} ===\n`);
  console.log('Source:');
  console.log(`  "${source}"\n`);
  console.log('Generated text:');
  console.log(`  "${generated}"\n`);

  const analysis = await advancedDetector.analyseGeneration(source, generated);

  console.log('Summary:');
  console.table([
    {
      totalClaims: analysis.totalClaims,
      hallucinationCount: analysis.hallucinationCount,
      hallucinationRate: analysis.hallucinationRate,
      faithfulnessScore: analysis.faithfulnessScore,
    },
  ]);

  console.log('\nDetails per claim:');
  console.table(
    analysis.details.map((c) => ({
      claim: c.text,
      hallucinated: c.isHallucination,
      label: c.label,
      confidence: c.confidence,
      severity: c.severity,
    })),
  );
}

runMultiTest does the same but for full generations. It calls analyseGeneration, prints a summary row for the overall metrics, then prints each claim with its label, confidence and severity.

Example sets: basic, decomposition, and realistic outputs

// ---------------------------------------------------------------------------
// EXAMPLE SET 1 - Basic Single-Claim Checks
// ---------------------------------------------------------------------------

console.log('\n\n=== EXAMPLE SET 1: Basic Single-Claim Checks ===');

await runSingleTest(
  '1. Entailment Test',
  'The Eiffel Tower was completed in 1889 and stands 330 meters tall.',
  'The Eiffel Tower was completed in 1889.',
);

await runSingleTest(
  '2. Contradiction Test',
  'Paris is the capital of France.',
  'Paris is the capital of Germany.',
);

await runSingleTest(
  '3. Neutral / Unsupported Test',
  'The iPhone 15 uses Apple\'s A16 chip.',
  'The iPhone 15 sold 30 million units on launch day.',
);

The first set of examples covers the three basic NLI cases: entailment, contradiction, and neutral, using simple, obvious facts. It’s a sanity check that the model behaves as expected.

// ---------------------------------------------------------------------------
// EXAMPLE SET 2 - Multi-Claim Decomposition
// ---------------------------------------------------------------------------

console.log('\n\n=== EXAMPLE SET 2: Multi-Claim Decomposition ===');

await runMultiTest(
  '4. Simple Conjunction',
  'Tokyo is the capital of Japan. It has a population of 14 million.',
  'Tokyo is the capital of Japan and has 20 million people.',
);

await runMultiTest(
  '5. Contrast (however)',
  'Mars has two moons: Phobos and Deimos.',
  'Mars has two moons, however it also has rings like Saturn.',
);

await runMultiTest(
  '6. Causal (because)',
  'The Great Wall of China is over 21,000 km long.',
  'The Great Wall is extremely long because it spans all of Asia.',
);

The second set showcases decomposition across conjunctions, contrastive connectors, and causal language. Each example contains a mix of true and false or unsupported statements inside one sentence, which tests whether the claim splitting and per-claim NLI actually catch the problematic parts.

// ---------------------------------------------------------------------------
// EXAMPLE SET 3 - Complex, Realistic LLM Responses
// ---------------------------------------------------------------------------

console.log('\n\n=== EXAMPLE SET 3: Realistic Mixed Outputs ===');

await runMultiTest(
  '7. Photography / Tech Example',
  'Nikon released the Z8 in 2023, featuring a 45.7MP sensor.',
  'The Nikon Z8 was released in 2023 and features a 60MP sensor with built-in GPS.',
);

await runMultiTest(
  '8. Architecture / History Example',
  'Sydney Opera House opened in 1973.',
  'The Sydney Opera House opened in 1973 and was designed by an American architect.',
);

await runMultiTest(
  '9. Watch Example (Rolex Submariner)',
  'The Rolex Submariner 116610LN uses the calibre 3135 movement.',
  'The Submariner 116610LN uses the 3135 movement and was first released in 2020.',
);

await runMultiTest(
  '10. Multi-sentence With Multiple Conjunctions',
  'Singapore has a population of 5.6 million.',
  'Singapore has 5.6 million people but is located in Malaysia, and it was founded in 1519.',
);

await runMultiTest(
  '11. A true statement with conjunction',
  'Singapore is a sovereign island nation. It has a population of 5.6 million.',
  'Singapore is a sovereign island nation and has a population of 5.6 million.',
);
console.log('\n\n=== ALL EXAMPLES COMPLETE ===');

The final set uses more realistic, LLM-style outputs, mixing true and false details within the same answer. These examples validate that the entire pipeline - from decomposition through NLI and aggregation - behaves sensibly in scenarios much closer to actual chat or RAG usage.

Output:

=== EXAMPLE SET 1: Basic Single-Claim Checks ===

=== 1. Entailment Test ===

Source: “The Eiffel Tower was completed in 1889 and stands 330 meters tall.”

Claim: “The Eiffel Tower was completed in 1889.”

Result:

(index) | claim | isHallucination | label | confidence | severity
0 | 'The Eiffel Tower was completed in 1889.' | false | 'entailment' | 0.9942113757133484 | 'none'

=== 2. Contradiction Test ===

Source: “Paris is the capital of France.”

Claim: “Paris is the capital of Germany.”

Result:

(index) | claim | isHallucination | label | confidence | severity
0 | 'Paris is the capital of Germany.' | true | 'contradiction' | 0.9975730776786804 | 'high'

=== 3. Neutral / Unsupported Test ===

Source: “The iPhone 15 uses Apple’s A16 chip.”

Claim: “The iPhone 15 sold 30 million units on launch day.”

Result:

(index) | claim | isHallucination | label | confidence | severity
0 | 'The iPhone 15 sold 30 million units on launch day.' | true | 'neutral' | 0.9993295669555664 | 'medium'

=== EXAMPLE SET 2: Multi-Claim Decomposition ===

=== 4. Simple Conjunction ===

Source: “Tokyo is the capital of Japan. It has a population of 14 million.”

Generated text: “Tokyo is the capital of Japan and has 20 million people.”

Summary:

(index) | totalClaims | hallucinationCount | hallucinationRate | faithfulnessScore
0 | 2 | 1 | 0.5 | 0.5

Details per claim:

(index) | claim | hallucinated | label | confidence | severity
0 | 'Tokyo is the capital of Japan' | false | 'entailment' | 0.99550461769104 | 'none'
1 | 'has 20 million people' | true | 'neutral' | 0.8924828767776489 | 'medium'

=== 5. Contrast (however) ===

Source: “Mars has two moons: Phobos and Deimos.”

Generated text: “Mars has two moons, however it also has rings like Saturn.”

Summary:

(index) | totalClaims | hallucinationCount | hallucinationRate | faithfulnessScore
0 | 2 | 1 | 0.5 | 0.5

Details per claim:

(index) | claim | hallucinated | label | confidence | severity
0 | 'Mars has two moons' | false | 'entailment' | 0.9952455163002014 | 'none'
1 | 'has rings like Saturn' | true | 'contradiction' | 0.7735910415649414 | 'medium'

=== 6. Causal (because) ===

Source: “The Great Wall of China is over 21,000 km long.”

Generated text: “The Great Wall is extremely long because it spans all of Asia.”

Summary:

(index) | totalClaims | hallucinationCount | hallucinationRate | faithfulnessScore
0 | 2 | 1 | 0.5 | 0.5

Details per claim:

(index) | claim | hallucinated | label | confidence | severity
0 | 'The Great Wall is extremely long' | false | 'entailment' | 0.9373018741607666 | 'none'
1 | 'it spans all of Asia' | true | 'neutral' | 0.9989165663719177 | 'medium'

=== EXAMPLE SET 3: Realistic Mixed Outputs ===

=== 7. Photography / Tech Example ===

Source: “Nikon released the Z8 in 2023, featuring a 45.7MP sensor.”

Generated text: “The Nikon Z8 was released in 2023 and features a 60MP sensor with built-in GPS.”

Summary:

(index) | totalClaims | hallucinationCount | hallucinationRate | faithfulnessScore
0 | 2 | 1 | 0.5 | 0.5

Details per claim:

(index) | claim | hallucinated | label | confidence | severity
0 | 'The Nikon Z8 was released in 2023' | false | 'entailment' | 0.998198926448822 | 'none'
1 | 'features a 60MP sensor with built-in GPS' | true | 'contradiction' | 0.9959971308708191 | 'high'

=== 8. Architecture / History Example ===

Source: “Sydney Opera House opened in 1973.”

Generated text: “The Sydney Opera House opened in 1973 and was designed by an American architect.”

Summary:

(index) | totalClaims | hallucinationCount | hallucinationRate | faithfulnessScore
0 | 2 | 1 | 0.5 | 0.5

Details per claim:

(index) | claim | hallucinated | label | confidence | severity
0 | 'The Sydney Opera House opened in 1973' | false | 'entailment' | 0.9974907636642456 | 'none'
1 | 'was designed by an American architect' | true | 'neutral' | 0.9959149360656738 | 'medium'

=== 9. Watch Example (Rolex Submariner) ===

Source: “The Rolex Submariner 116610LN uses the calibre 3135 movement.”

Generated text: “The Submariner 116610LN uses the 3135 movement and was first released in 2020.”

Summary:

(index) | totalClaims | hallucinationCount | hallucinationRate | faithfulnessScore
0 | 2 | 1 | 0.5 | 0.5

Details per claim:

(index) | claim | hallucinated | label | confidence | severity
0 | 'The Submariner 116610LN uses the 3135 movement' | false | 'entailment' | 0.956702709197998 | 'none'
1 | 'was first released in 2020.' | true | 'contradiction' | 0.9791918396949768 | 'high'

=== 10. Multi-sentence With Multiple Conjunctions ===

Source: “Singapore has a population of 5.6 million.”

Generated text: “Singapore has 5.6 million people but is located in Malaysia, and it was founded in 1519.”

Summary:

(index) | totalClaims | hallucinationCount | hallucinationRate | faithfulnessScore
0 | 3 | 2 | 0.6666666666666666 | 0.33333333333333337

Details per claim:

(index) | claim | hallucinated | label | confidence | severity
0 | 'Singapore has 5.6 million people' | false | 'entailment' | 0.9973888397216797 | 'none'
1 | 'is located in Malaysia' | true | 'contradiction' | 0.998812198638916 | 'high'
2 | 'it was founded in 1519.' | true | 'neutral' | 0.9919210076332092 | 'medium'

=== 11. A true statement with conjunction ===

Source: “Singapore is a sovereign island nation. It has a population of 5.6 million.”

Generated text: “Singapore is a sovereign island nation and has a population of 5.6 million.”

Summary:

(index) | totalClaims | hallucinationCount | hallucinationRate | faithfulnessScore
0 | 2 | 0 | 0 | 1

Details per claim:

(index) | claim | hallucinated | label | confidence | severity
0 | 'Singapore is a sovereign island nation' | false | 'entailment' | 0.9956467151641846 | 'none'
1 | 'has a population of 5.6 million' | false | 'entailment' | 0.9727723002433777 | 'none'

=== ALL EXAMPLES COMPLETE ===

From Detection to Real-World Usage

What makes this approach particularly compelling is how naturally it integrates into a chat application. Consider a conversational assistant powered by retrieval-augmented generation (RAG). When a user asks a question, the system retrieves relevant documents and feeds them into the model as context. The LLM then generates an answer. Before the system presents that answer to the user, it can pass the text through the hallucination detector.

If a claim contradicts the retrieved documents, the system can choose how to handle it:

  • It may request regeneration with a stronger instruction to cite only verified information.
  • It may display a gentle disclaimer, noting that part of the answer cannot be verified.
  • It may replace unsupported statements with grounded alternatives from the source.
  • For severe discrepancies, it may simply respond, “I don’t have enough information to answer that reliably.”

This additional verification layer creates a safer, more trustworthy conversational experience. It mirrors how human experts behave: we cross-reference information, we double-check our sources, and we’re willing to say “I’m not certain” when needed. NLI gives our applications a similar ability, transforming raw model output into a more accountable and transparent result.
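
To make that integration concrete, here is a rough gating sketch built on the AdvancedHallucinationDetector from earlier. The 0.8 threshold and the fallback wording are illustrative choices of mine, not recommendations, and a real system would tune both:

// Illustrative gating step for a RAG pipeline: verify a draft answer against the
// retrieved context before showing it to the user. Thresholds are example values.
async function answerWithVerification(retrievedContext, draftAnswer) {
  const analysis = await advancedDetector.analyseGeneration(retrievedContext, draftAnswer);

  // Any high-severity contradiction: refuse rather than risk misleading the user.
  if (analysis.hallucinations.some((h) => h.severity === 'high')) {
    return "I don't have enough information to answer that reliably.";
  }

  // Partially unsupported: surface the answer with a disclaimer
  // (or, alternatively, trigger regeneration with stricter grounding instructions).
  if (analysis.faithfulnessScore < 0.8) {
    return `${draftAnswer}\n\nNote: parts of this answer could not be verified against the available sources.`;
  }

  // Sufficiently grounded: return the answer unchanged.
  return draftAnswer;
}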

Conclusion

Hallucinations are an inherent by-product of next-word prediction and the incentive structures built into current model evaluations. They cannot be eliminated entirely, even by progressively larger or more capable models. The solution lies not in hoping they disappear, but in applying systematic techniques to detect them when they occur.

By pairing trusted retrieval sources with NLI-based verification, we create a reliable and efficient safety layer that works across domains, models, and application types. The JavaScript implementation using Transformers.js demonstrates how accessible this approach can be: lightweight, portable, and practical for both server-side and client-side environments.

Hallucinations may persist, but with the right tools in place, they no longer have to be silent or unnoticed. Instead, they become identifiable, diagnosable, and manageable - allowing us to build AI systems that behave with far greater responsibility and clarity.