Module 6 · Slide 01
6. Hallucinations in Coding Tasks
When large language models generate code that looks correct but invokes non-existent APIs, uses wrong parameters, or produces logically flawed implementations. Understanding, detecting, and mitigating code hallucinations is critical for safe AI-assisted software development.
Learning Objectives
Objective 1
Define code hallucinations and distinguish them from general LLM hallucinations
Objective 2
Apply the CodeHalu and HalluCode taxonomies to classify hallucination types
Objective 3
Evaluate mitigation strategies including De-Hallucinator and MARIN
Key References
CodeHalu
Tian et al. (AAAI 2025) — Taxonomy of 4 categories, 8 subcategories; CodeHaluEval benchmark
De-Hallucinator
Eghbali & Pradel (2024) — Iterative grounding with API documentation
HalluCode
5 primary categories, 19 specific types of code hallucination
Module 6 · Slide 02
What Are Hallucinations?
In general NLP, a hallucination occurs when a model generates content that is nonsensical, unfaithful to the source, or factually incorrect. In code generation, hallucinations take on a uniquely dangerous character: they produce syntactically valid code that compiles or runs — but behaves incorrectly.
General LLM Hallucination
Output that is fluent and confident but factually wrong. Example: claiming a historical event occurred on the wrong date, or citing a paper that does not exist.
Code Hallucination
Generated code that looks plausible but calls non-existent APIs, uses wrong method signatures, introduces logical errors, or references unavailable libraries. The code often passes superficial review. — Lee et al. 2025, arXiv:2504.20799
Why Code Hallucinations Are Worse
1. Verifiability gap: Text hallucinations can be fact-checked; code hallucinations require execution or deep expertise to detect.
2. Downstream impact: A hallucinated API call can cause runtime crashes, security vulnerabilities, or silent data corruption in production.
3. False confidence: Syntactically correct code creates a strong illusion of correctness, even for experienced developers.
Module 6 · Slide 03
How LLMs Generate Code: A Quick Primer
Before understanding hallucinations, we need to understand how LLMs produce code token by token. The autoregressive generation process is the root cause of many hallucination patterns.
1
Receive Prompt / Context
The LLM receives the user's prompt plus any system instructions or retrieved context as input tokens.
2
Compute Probability Distribution
The model computes a probability distribution over its entire vocabulary for the next token. Each possible token gets a score based on what the model has learned.
3
Sample a Token
A token is selected from the distribution using sampling parameters (temperature, top-k, top-p). Lower temperature means more deterministic choices.
4
Append & Repeat
The chosen token is appended to the context. The model now sees the original prompt plus all previously generated tokens, and computes the next distribution. This repeats until a stop condition is met.
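The four steps above can be sketched with a toy "model" — a hard-coded lookup table standing in for the neural network. The vocabulary and probabilities are purely illustrative:

```python
import random

# Toy next-token model: maps a context string to a probability
# distribution over a tiny vocabulary. A real LLM computes this
# distribution with a neural network over its full vocabulary.
TOY_MODEL = {
    "sort a list":          {"def": 0.42, "import": 0.30, "class": 0.28},
    "sort a list def":      {"sort": 0.38, "main": 0.35, "run": 0.27},
    "sort a list def sort": {"_list": 0.31, "_items": 0.40, "(": 0.29},
}

def generate(prompt, max_tokens=3, seed=0):
    random.seed(seed)
    context = prompt
    for _ in range(max_tokens):
        dist = TOY_MODEL.get(context)
        if dist is None:                              # stop condition
            break
        tokens, probs = zip(*dist.items())
        next_tok = random.choices(tokens, probs)[0]   # sample a token
        context = context + " " + next_tok            # append & repeat
    return context

print(generate("sort a list"))
```

Note that the loop never has an "I don't know" option: some token is always sampled, however weak the model's knowledge of the context.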
Visual: Token-by-Token Generation
Prompt: "sort a list" → P(def)=0.42 → Token: def
Context: ...sort a list def → P(sort)=0.38 → Token: sort
Context: ...def sort → P(_list)=0.31 → Token: _list
Key Insight
Hallucinations emerge because the model always produces a plausible-sounding token, even when it has no knowledge of the correct answer. It cannot say "I don't know" mid-generation — it must always pick the next token.
Module 6 · Slide 04 · Interactive
Temperature & Sampling: The Hallucination Knob
Generation parameters directly affect hallucination rates. Temperature controls how "peaked" or "flat" the probability distribution is when selecting the next token.
Low Temperature (0.1 – 0.3)
More deterministic — the model strongly favors the highest-probability token.
• Fewer hallucinations
• Less creative / more repetitive
• Best for production code generation
• May repeat common patterns
High Temperature (0.8 – 1.2)
More diverse — lower-probability tokens get a real chance of being selected.
• More hallucinations
• More creative / exploratory
• Better for brainstorming
• Can produce surprising (and wrong) outputs
Practical Guidelines
For production code generation, use T=0.0–0.2. For brainstorming and exploration, use T=0.7–1.0. Never use T>1.0 for code you will actually ship.
Temperature Visualizer interactive
Temperature slider: 0.0 (greedy) to 1.0 (balanced) to 2.0 (chaotic); shown at 0.5
Next-token candidates: def 45% · func 25% · class 15% · async 10% · xyzq 5%
Probability distribution over next-token candidates
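Temperature scaling can be demonstrated directly. This sketch rescales an illustrative next-token distribution (the candidate tokens and base probabilities are made up for the demo, echoing the visualizer above):

```python
import math

def apply_temperature(probs, T):
    """Rescale a probability distribution by temperature T.
    T < 1 sharpens the distribution (favors the top token);
    T > 1 flattens it (gives weak tokens a real chance)."""
    logits = {tok: math.log(p) for tok, p in probs.items()}
    scaled = {tok: math.exp(l / T) for tok, l in logits.items()}
    z = sum(scaled.values())
    return {tok: v / z for tok, v in scaled.items()}

# Illustrative next-token candidates
base = {"def": 0.45, "func": 0.25, "class": 0.15, "async": 0.10, "xyzq": 0.05}

for T in (0.2, 1.0, 2.0):
    dist = apply_temperature(base, T)
    top = max(dist, key=dist.get)
    print(f"T={T}: P({top})={dist[top]:.2f}")
```

At T=0.2 the top token absorbs nearly all probability mass; at T=2.0 even the junk token xyzq becomes a live option — which is exactly why high temperature raises hallucination rates.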
Module 6 · Slide 05
Why Hallucinations Matter in Code
Code hallucinations are not merely academic curiosities. They have real consequences across the software development lifecycle, from introducing subtle bugs to creating exploitable security vulnerabilities.
API Misuse
LLMs frequently invent API methods that do not exist or use real APIs with incorrect parameter types and orderings. This is the most common form of code hallucination. ISSTA 2025 — API Knowledge Conflicts
Security Vulnerabilities
Hallucinated code may omit input validation, use deprecated cryptographic functions, or introduce injection points that would not appear in human-written code.
Dependency Conflicts
Models may reference library versions that are incompatible, or combine APIs from different major versions of the same framework. ISSTA 2025 — Dependency Conflicts
Silent Failures
The most insidious hallucinations produce code that runs without errors but produces incorrect results — wrong calculations, missed edge cases, or flawed logic that passes basic tests.
Tasks in CodeHaluEval: 699 · Hallucination samples: 8,883 · LLMs evaluated: 17 · Hallucination categories: 4
Source: Tian et al. CodeHalu (AAAI 2025)
Module 6 · Slide 06
Real-World Examples
These examples illustrate common hallucination patterns observed in LLM-generated code. Each snippet looks plausible but contains fabricated or incorrect API usage.
Example 1: Non-Existent API
The model invents a method that does not exist in the library:
# Hallucinated: pandas has no .smart_merge()
import pandas as pd
df1 = pd.read_csv("users.csv")
df2 = pd.read_csv("orders.csv")
# This method does NOT exist in pandas
result = pd.smart_merge(df1, df2,
                        on="user_id",
                        strategy="fuzzy",  # not a real param
                        threshold=0.8)     # not a real param
Example 2: Wrong Method Signature
The model uses a real API but with incorrect parameters:
// Hallucinated: wrong overload of Files.readString
import java.nio.file.Files;
import java.nio.file.Path;
// Files.readString() takes a Path, not
// a Path + boolean + Charset
String content = Files.readString(
    Path.of("data.txt"),
    true,      // no boolean param
    "UTF-8");  // wants Charset obj
Example 3: Fabricated Library
The model imports a library that has never been published:
# Hallucinated: no such package exists
from sklearn.neural import DeepClassifier
model = DeepClassifier(
    layers=[128, 64, 32],
    activation="gelu",
    optimizer="adamw")
model.fit(X_train, y_train)
Example 4: Logic Hallucination
Syntactically valid but logically wrong — returns before completing:
def find_duplicates(lst):
    """Find all duplicate elements."""
    seen = set()
    duplicates = set()
    for item in lst:
        if item in seen:
            duplicates.add(item)
            return duplicates  # BUG: returns on
        seen.add(item)         # first dup found
    return duplicates
Module 6 · Slide 07
CodeHalu Taxonomy (Part 1)
Tian et al. (AAAI 2025) propose a systematic taxonomy of code hallucinations based on execution-based verification, dividing them into four categories with eight subcategories. This slide covers the first two categories.
Category 1: Mapping Hallucinations
The LLM fails to correctly map the task description to executable code. The generated solution does not align with what was asked.
Subcategory 1a — Task Misunderstanding: The model solves a different problem than specified. E.g., asked to sort descending but sorts ascending.
Subcategory 1b — Specification Violation: The code ignores explicit constraints such as time complexity requirements or input/output formats.
# Mapping hallucination: task says "return indices"
# but model returns values instead
def two_sum(nums, target):
    for i in range(len(nums)):
        for j in range(i+1, len(nums)):
            if nums[i] + nums[j] == target:
                return [nums[i], nums[j]]  # wrong!
                # should be: return [i, j]
Category 2: Naming Hallucinations
The LLM references identifiers — variable names, function names, class names — that either do not exist in the current scope or are used inconsistently.
Subcategory 2a — Undefined References: Using variables or functions that were never declared or imported.
Subcategory 2b — Name Confusion: Mixing up similar-sounding API names (e.g., getSize() vs size() vs length()).
// Naming hallucination: mixing up API names
import java.util.ArrayList;
ArrayList<String> list = new ArrayList<>();
list.add("hello");
// ArrayList uses .size(), not .length()
int n = list.length(); // compile error!
// Also confuses with array .length field
// and String .length() method
Source: Tian et al. "CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification" (AAAI 2025)
Module 6 · Slide 08
CodeHalu Taxonomy (Part 2)
The remaining two categories in the CodeHalu taxonomy cover hallucinations related to resource management and logical reasoning — often the hardest to detect without execution.
Category 3: Resource Hallucinations
The LLM generates code that references external resources (files, network endpoints, database tables, environment variables) that do not exist or are inaccessible.
Subcategory 3a — Missing Dependencies: Importing packages that are not installed or do not exist in the target ecosystem.
Subcategory 3b — Environment Assumptions: Assuming the presence of files, directories, services, or configurations that are not guaranteed.
# Resource hallucination: assumes file exists
import json
# Model assumes config.json is always present
with open("config.json") as f:
    config = json.load(f)
# No error handling, no fallback defaults
db_host = config["database"]["host"]
db_port = config["database"]["port"]
Category 4: Logic Hallucinations
The code is syntactically valid and uses real APIs correctly, but implements flawed algorithms or incorrect control flow that produces wrong results.
Subcategory 4a — Algorithm Errors: Implementing a sorting algorithm that does not actually sort, or a search that misses valid results.
Subcategory 4b — Edge Case Failures: Code that works for common inputs but fails on boundary conditions (empty lists, negative numbers, null values).
# Logic hallucination: off-by-one in binary search
def binary_search(arr, target):
    lo, hi = 0, len(arr)  # should be len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:  # IndexError!
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
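For contrast, a standard correct implementation (not taken from the paper) initializes hi to the last valid index, so mid can never run past the end of the array:

```python
def binary_search(arr, target):
    """Correct binary search over a sorted list."""
    lo, hi = 0, len(arr) - 1       # hi is the last valid index
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1                       # target not present

print(binary_search([1, 3, 5, 7, 9], 9))   # → 4
print(binary_search([1, 3, 5, 7, 9], 4))   # → -1
```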
Source: Tian et al. "CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification" (AAAI 2025)
Module 6 · Slide 09 · Interactive
Spot the Hallucination
Each pair below contains one snippet with a real API call and one with a hallucinated API. Click the snippet you believe is hallucinated.
Challenge 1: Python HTTP Requests interactive
import requests
response = requests.get("https://api.example.com")
data = response.json()
import requests
response = requests.fetch("https://api.example.com")
data = response.to_json()
Explanation
Right snippet is hallucinated. The requests library uses .get(), not .fetch(). Also, the response method is .json() not .to_json(). This is a Naming Hallucination per the CodeHalu taxonomy.
Challenge 2: Java Collections interactive
Map<String,Integer> map = new HashMap<>();
map.put("a", 1);
int val = map.getOrDefault("b", 0, true);
Map<String,Integer> map = new HashMap<>();
map.put("a", 1);
int val = map.getOrDefault("b", 0);
Explanation
Left snippet is hallucinated. Map.getOrDefault() takes exactly 2 parameters (key, defaultValue). The third boolean parameter does not exist. This is a Naming Hallucination (Name Confusion) — the model added a phantom parameter.
Challenge 3: Python File I/O interactive
from pathlib import Path
text = Path("file.txt").read_text()
lines = text.splitlines()
from pathlib import Path
text = Path("file.txt").read_lines()
lines = text.to_list()
Explanation
Right snippet is hallucinated. pathlib.Path has .read_text() and .read_bytes(), but no .read_lines() method. The model fabricated both the method name and the .to_list() conversion. This is a Naming Hallucination.
Module 6 · Slide 10
HalluCode Taxonomy
HalluCode proposes a complementary taxonomy focusing on the relationship between generated code and the developer's intent, context, and domain knowledge. It defines 5 primary categories and 19 specific types.
1. Intent Conflicting
Generated code contradicts the user's stated intent or task description. The code runs but does something fundamentally different from what was requested. Example: asked to delete duplicates, but the code deletes unique elements instead.
2. Context Inconsistency
Code is inconsistent with the surrounding context — uses variables defined elsewhere with wrong types, breaks established patterns in the codebase, or contradicts previous lines in the same function.
3. Context Repetition
The model unnecessarily repeats code blocks, re-declares variables, or duplicates logic that is already present. This can indicate the model has lost track of what it has already generated.
4. Dead Code
Generated code includes unreachable statements, unused variables, redundant conditions, or function definitions that are never called. While not always incorrect, dead code signals the model is generating without genuine understanding.
5. Knowledge Conflicting
Code contradicts established programming knowledge: using deprecated APIs, violating language-specific idioms, applying algorithms to unsuitable data structures, or using patterns that are known anti-patterns. Example: using a bubble sort where the prompt requires O(n log n) complexity. This category overlaps with CodeHalu's Naming and Resource hallucinations.
Source: HalluCode — 5 primary categories, 19 specific types of code hallucination
Module 6 · Slide 11
Comparing Taxonomies
CodeHalu and HalluCode approach code hallucination classification from different angles. Understanding both provides a more complete picture of the hallucination landscape.
Dimension | CodeHalu (AAAI 2025) | HalluCode
Focus | Execution-based verification; what goes wrong at runtime | Intent alignment; how code deviates from user expectations
Categories | Mapping, Naming, Resource, Logic | Intent Conflicting, Context Inconsistency, Context Repetition, Dead Code, Knowledge Conflicting
Granularity | 4 categories, 8 subcategories | 5 categories, 19 specific types
Detection Method | Execution-based: run code, compare outputs against expected results | Hybrid: static analysis + semantic comparison to intent
Neither taxonomy is strictly superior. CodeHalu excels at automated, execution-based detection of hallucinations that cause runtime failures. HalluCode captures subtler issues like dead code and context repetition that may not cause crashes but still indicate poor code quality. A comprehensive hallucination analysis should draw from both frameworks.
Module 6 · Slide 12 · Interactive
Interactive: Spot the Hallucination in Code
Each code snippet below was generated by an LLM and contains a specific hallucination. Click the line you think is hallucinated, then click "Reveal" to check your answer.
Code Hallucination Hunt interactive
Snippet A: Python List Sorting
import random
data = [random.randint(1, 100) for _ in range(20)]
sorted_data = data.sortDescending()
print(sorted_data)
Line 3 is hallucinated. Python lists have no .sortDescending() method. The correct approach is data.sort(reverse=True) or sorted_data = sorted(data, reverse=True). This is a Naming Hallucination — the model invented a Java-style method name.
Snippet B: Scikit-learn Import
from sklearn.neural import MLPClassifier
Line 1 is hallucinated. The correct import is from sklearn.neural_network import MLPClassifier. The module sklearn.neural does not exist — it is sklearn.neural_network. This is a Resource Hallucination (Missing Dependencies).
Snippet C: JavaScript String Method
const text = "Hello, World!";
const reversed = text.reverse();
console.log(reversed);
Line 2 is hallucinated. JavaScript strings do not have a .reverse() method. Only arrays have .reverse(). The correct approach is text.split('').reverse().join(''). This is a Naming Hallucination (Name Confusion) — confusing Array and String APIs.
Snippet D: Python JSON Handling
import json
data = {"name": "Alice", "age": 30}
json_str = json.stringify(data)
print(json_str)
Line 3 is hallucinated. Python's json module uses json.dumps(), not json.stringify(). The model confused Python with JavaScript's JSON.stringify(). This is a Knowledge Conflicting hallucination — mixing language conventions.
Snippet E: Semantic Error (Logic)
def is_palindrome(s):
    s = s.lower().strip()
    return s == s[:-1]  # missing the full reverse
Line 3 is hallucinated. The code compares s to s[:-1] (all characters except the last), not to the reversed string s[::-1]. Because the lengths differ, this returns False for every non-empty string (including actual palindromes like "racecar") and True only for the empty string. This is a Logic Hallucination — syntactically valid but semantically wrong.
Module 6 · Slide 13
How LLMs Hallucinate Code
Understanding the mechanisms behind code hallucinations helps inform both detection and mitigation strategies. Several factors contribute to hallucination in code generation.
1
Training Data Gaps
APIs that are infrequent in training data are more likely to be hallucinated. The model may have seen similar but different APIs and conflates them. Less popular libraries have far fewer training examples, leading to fabricated method names and signatures.
2
API Version Confusion
Training data contains code using multiple versions of the same library (e.g., TensorFlow 1.x vs 2.x, Python 2 vs 3). The model may blend APIs across incompatible versions, producing code that was valid in an older version but fails in the current one. ISSTA 2025 — Environment Conflicts
3
Distribution Shift
When a prompt asks for code combining libraries in novel ways not well-represented in training data, the model falls back on statistical patterns rather than genuine understanding, increasing hallucination probability.
4
Autoregressive Accumulation
Each generated token conditions on all previous tokens. A small error early in generation (e.g., wrong import) cascades — the model then generates code consistent with the hallucinated import, compounding the hallucination throughout the function.
5
Pattern Mimicry over Semantics
LLMs are pattern-matching systems. They learn that methods are often named get_X(), set_X(), to_X(), and may synthesize plausible-sounding method names that follow common conventions but do not actually exist in the target library. Lee et al. 2025, arXiv:2504.20799
Module 6 · Slide 14
Measuring Hallucinations
Quantifying code hallucinations requires specialized benchmarks and metrics. The CodeHaluEval benchmark by Tian et al. provides a rigorous framework for evaluation.
CodeHaluEval Benchmark
Scale: 699 tasks generating 8,883 code samples across 17 LLMs
Process: Each task is paired with a specification and test cases. Generated code is executed against the tests. Failures are categorized by hallucination type using the CodeHalu taxonomy.
Coverage: Tasks span algorithm design, data structure manipulation, API usage, and systems programming.
ISSTA 2025 Framework
Focuses on hallucinations detectable via static analysis without execution:
• Dependency Conflicts — incompatible library versions
• Environment Conflicts — wrong Python/Java versions
• API Knowledge Conflicts — non-existent or misused APIs
Key Metrics
MiHN (Micro Hallucination Number): The total count of individual hallucinated elements (wrong API calls, incorrect parameters, etc.) across all generated samples. Lower is better.
MaHR (Macro Hallucination Rate): The fraction of generated code samples that contain at least one hallucination. Captures how frequently the model produces any hallucination at all.
Edit Distance: How many edits are needed to fix hallucinated code to match correct code. Used by De-Hallucinator to measure improvement.
API Recall: The fraction of required API calls that are correctly included in generated code.
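Following the definitions above, both MiHN and MaHR can be computed from per-sample evaluation results. The numbers here are hypothetical, just to make the two metrics concrete:

```python
# Hypothetical evaluation results: number of hallucinated elements
# (wrong API calls, bad parameters, ...) found in each generated sample.
hallucinations_per_sample = [0, 2, 0, 1, 0, 0, 3, 0, 0, 1]

# MiHN: total count of individual hallucinated elements (lower is better)
mihn = sum(hallucinations_per_sample)

# MaHR: fraction of samples containing at least one hallucination
mahr = (sum(1 for n in hallucinations_per_sample if n > 0)
        / len(hallucinations_per_sample))

print(f"MiHN = {mihn}, MaHR = {mahr:.0%}")   # MiHN = 7, MaHR = 40%
```

The two metrics can diverge: one sample with ten wrong API calls inflates MiHN but moves MaHR no more than a sample with a single wrong call.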
MiHN decrease (MARIN): 67.5% · MaHR decrease (MARIN): 73.6% · Edit distance improvement: 50.6% · API recall improvement: 61.0%
MARIN (FSE 2025) and De-Hallucinator (Eghbali & Pradel 2024) results
Module 6 · Slide 15 · Interactive
Rate the Severity
Not all hallucinations are equally dangerous. Classify each example below by severity: Low (cosmetic / style issue), Medium (will cause errors but detectable), High (runtime crash or wrong results), or Critical (security vulnerability or data loss).
Severity Classification interactive
An unused import statement is generated at the top of the file
An unused import is a Dead Code hallucination (HalluCode). It does not affect execution — at worst, it adds a minor dependency. Severity: Low.
The model calls torch.cuda.memory_optimized_forward(), a method that does not exist
A non-existent method causes an AttributeError at runtime. This is a Naming Hallucination (CodeHalu). Easily caught by testing but will crash in production if undetected. Severity: High.
User input is concatenated directly into a SQL query string instead of using parameterized queries
SQL injection is a security vulnerability. The code runs fine in normal testing but is exploitable. This is a Knowledge Conflicting hallucination (HalluCode) — the model ignores known security practices. Severity: Critical.
A function re-declares a variable that was already defined 3 lines above with the same value
This is a Context Repetition hallucination (HalluCode). It may cause confusion and in some languages could shadow variables leading to subtle bugs. Severity: Medium.
A sorting function returns after finding the first unsorted pair instead of fully sorting the list
A Logic Hallucination (CodeHalu). The function appears to work on small or nearly-sorted inputs but silently produces wrong results on general inputs. Severity: High.
Module 6 · Slide 16
Detection Approaches
Detecting code hallucinations requires multiple complementary strategies, ranging from dynamic execution to static analysis and semantic reasoning.
Execution-Based Verification
How it works: Execute generated code against test suites and compare actual outputs to expected results. Any mismatch indicates a hallucination.
Strengths: Definitive — if code fails a test, something is wrong. Catches all categories of hallucination that affect behavior.
Limitations: Requires comprehensive test suites. Cannot detect dead code or stylistic issues. Expensive for large-scale evaluation.
Used by: CodeHaluEval (AAAI 2025)
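A minimal sketch of execution-based verification: run the generated code, then evaluate test expressions against expected values. The helper and test data are illustrative, and exec on untrusted code should only ever happen inside a sandbox:

```python
def run_against_tests(code: str, tests: list) -> list:
    """Execute generated code, then check (expression, expected) pairs.
    Any mismatch or exception is reported as a potential hallucination.
    WARNING: sketch only -- never exec untrusted code unsandboxed."""
    namespace = {}
    failures = []
    try:
        exec(code, namespace)           # run the generated code
    except Exception as e:
        return [f"execution failed: {e}"]
    for expr, expected in tests:
        try:
            actual = eval(expr, namespace)
            if actual != expected:
                failures.append(f"{expr}: got {actual!r}, expected {expected!r}")
        except Exception as e:
            failures.append(f"{expr}: raised {e}")
    return failures

# LLM-generated code with a logic hallucination (early return, as in
# the find_duplicates example earlier in this module)
generated = """
def find_duplicates(lst):
    seen, dups = set(), set()
    for item in lst:
        if item in seen:
            dups.add(item)
            return dups          # bug: returns on first duplicate
        seen.add(item)
    return dups
"""
print(run_against_tests(generated, [("find_duplicates([1, 2, 1, 2])", {1, 2})]))
```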
AST-Based Analysis
How it works: Parse the generated code into an Abstract Syntax Tree and check for structural anomalies: undefined references, type mismatches, unreachable code paths, redundant expressions.
Strengths: Fast, no execution needed. Good at catching naming hallucinations and dead code.
Limitations: Cannot catch logic hallucinations or semantic errors that are structurally valid.
Static Analysis
How it works: Use existing static analysis tools (linters, type checkers, dependency resolvers) to identify hallucinated dependencies, version conflicts, and API misuse without running the code.
ISSTA 2025 — "LLM Hallucinations in Practical Code Generation"
LLM-Based Self-Verification
How it works: Use a second LLM pass (or the same model) to review generated code for correctness, asking it to identify potential issues.
Strengths: Can catch semantic and logic issues that static tools miss. No test suite required.
Limitations: The verifier LLM may itself hallucinate and approve incorrect code. Not reliable as a sole detection method.
Module 6 · Slide 17
Hallucination Detection: Automated Approaches
Beyond manual code review, several automated techniques can detect hallucinations at scale. The best strategies combine multiple signals for robust detection.
Static Analysis
Check if generated APIs exist in library documentation. Linters and type checkers catch non-existent methods and wrong signatures automatically.
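A minimal existence check can be built with Python's ast module: parse the generated code and flag attribute accesses on an imported module that do not actually exist. This is an illustrative sketch, not a production linter:

```python
import ast
import importlib

def undefined_api_calls(code: str, module_name: str) -> list:
    """Flag module.attr references where attr does not exist on the
    named module -- a crude static existence check."""
    mod = importlib.import_module(module_name)
    missing = []
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == module_name
                and not hasattr(mod, node.attr)):
            missing.append(node.attr)
    return missing

# json.stringify is a classic hallucination; json.dumps is the real API
print(undefined_api_calls("import json\nx = json.stringify({})", "json"))
# → ['stringify']
```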
Type Checking
Run a type checker (mypy, TypeScript compiler) on generated code. Type errors often indicate hallucinated APIs or wrong parameter types.
Test Generation
Generate tests for the code and run them. Failing tests suggest hallucinations. This can be automated with the same LLM that generated the code.
Cross-Model Verification
Ask multiple LLMs to solve the same task, then compare outputs. Disagreement between models suggests at least one is hallucinating.
Confidence Scoring
Low-probability tokens in the generation are more likely to be hallucinated. Token-level confidence can flag suspicious API calls and parameter names.
Key Insight
No single detection method is perfect. Static analysis catches naming and resource hallucinations but misses logic errors. Testing catches behavioral issues but requires good test coverage. Cross-model verification adds a probabilistic signal but is expensive. The best approaches combine multiple signals into a detection pipeline.
Module 6 · Slide 18
RAG for Hallucination Reduction
Retrieval-Augmented Generation (RAG) is a foundational technique for reducing hallucinations. Instead of relying on the model's parametric memory alone, RAG retrieves relevant documentation at query time to ground the generation in factual data.
1
The Problem
LLMs hallucinate because they generate from statistical patterns, not factual knowledge. When the training data is sparse or outdated for a given API, the model invents plausible-sounding but incorrect code.
2
The Solution: Retrieve at Query Time
Instead of relying solely on what the model memorized during training, retrieve relevant documentation, code examples, and API specifications from an external knowledge base before generating.
3
The Pipeline
User query → Embed the query → Search codebase / docs → Retrieve relevant files → Add retrieved context to prompt → Generate code with grounded context.
4
The Result
The model generates code based on actual documentation and real code examples rather than imagination. API names, parameter types, and method signatures are grounded in retrieved facts.
User Query → Embed → Search Docs → Retrieve Context → Augmented Prompt → Grounded Code
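A toy version of the retrieval step, using token overlap as a stand-in for embedding similarity. The documentation snippets in the knowledge base are hypothetical:

```python
def tokenize(text):
    return set(text.lower().split())

def retrieve(query, docs, k=1):
    """Rank docs by Jaccard overlap with the query (a crude stand-in
    for embedding-based similarity search) and return the top-k."""
    def score(doc):
        q, d = tokenize(query), tokenize(doc)
        return len(q & d) / len(q | d)
    return sorted(docs, key=score, reverse=True)[:k]

# Hypothetical documentation snippets in the knowledge base
docs = [
    "json.dumps(obj) serializes a Python object to a JSON string",
    "csv.reader(f) iterates over rows of a CSV file",
    "re.findall(pattern, string) returns all matches",
]

query = "serialize a dict to a JSON string"
context = retrieve(query, docs)[0]
prompt = f"Using this documentation:\n{context}\n\nTask: {query}"
print(prompt)
```

The retrieved snippet about json.dumps lands in the prompt, so the model is far less likely to reach for a fabricated method like json.stringify.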
Important Caveat
RAG does not eliminate hallucinations, but it dramatically reduces them by providing the model with factual grounding. The model can still hallucinate if retrieved context is irrelevant, incomplete, or if the model ignores the context. More advanced RAG variants (like MARIN and De-Hallucinator) address these limitations.
Module 6 · Slide 19
Mitigation: De-Hallucinator
Eghbali & Pradel (2024) introduce De-Hallucinator, an iterative grounding approach that retrieves relevant API documentation to reduce hallucinations in generated code.
User Prompt → Initial LLM Generation → Extract API Calls → Retrieve API Docs → Re-query with Context → Refined Output
1
Initial Generation
The LLM generates code from the user's prompt as usual. This code may contain hallucinated API calls.
2
API Extraction
Parse the generated code to identify all external API calls, library imports, and method invocations.
3
Documentation Retrieval
For each identified API, retrieve the official documentation including correct method signatures, parameter types, return types, and usage examples.
4
Iterative Re-query
Re-prompt the LLM with the original request augmented by the retrieved API docs. The model uses this grounding to correct hallucinated calls. This process can repeat for multiple iterations.
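The four steps can be sketched as a loop. The llm, extract_api_calls, and lookup_docs helpers are hypothetical stand-ins for the components described above, not the paper's actual interface:

```python
def de_hallucinate(prompt, llm, extract_api_calls, lookup_docs, iterations=3):
    """Sketch of iterative grounding: generate, extract API calls,
    retrieve their docs, and re-query with the docs prepended."""
    code = llm(prompt)
    for _ in range(iterations):
        apis = extract_api_calls(code)      # step 2: parse out API calls
        if not apis:                        # nothing left to ground
            break
        docs = [lookup_docs(api) for api in apis]   # step 3: fetch docs
        grounded = "\n".join(d for d in docs if d) + "\n\n" + prompt
        new_code = llm(grounded)            # step 4: re-query with context
        if new_code == code:                # converged
            break
        code = new_code
    return code

# Tiny demo with fake components: this "model" fixes json.stringify
# only once the real documentation appears in its prompt.
fake_docs = {"json.stringify": "json has no stringify; use json.dumps(obj)"}

def fake_llm(prompt):
    return ("json.dumps(data)" if "json.dumps" in prompt
            else "json.stringify(data)")

def fake_extract(code):
    return ["json.stringify"] if "stringify" in code else []

print(de_hallucinate("serialize data", fake_llm, fake_extract, fake_docs.get))
# → json.dumps(data)
```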
Results
De-Hallucinator achieves significant improvements across multiple metrics when tested on code generation benchmarks:
Edit distance improvement: 23.3-50.6% · API recall improvement: 23.9-61.0% · More fixed tests: 63.2% · Statement coverage increase: 15.5%
Key Insight
The iterative approach is crucial. A single round of documentation retrieval helps, but multiple iterations allow the model to progressively resolve cascading hallucinations where one incorrect API call leads to further errors.
Source: Eghbali & Pradel "De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding" (2024)
Module 6 · Slide 20
Mitigation: MARIN
MARIN (FSE 2025) takes a hierarchical, dependency-aware approach to mitigating API hallucinations by modeling the relationships between APIs in a project's dependency graph.
Core Idea
Rather than treating each API call independently, MARIN builds a hierarchical dependency graph that captures relationships between packages, classes, and methods. When generating code, the model is constrained to only use APIs that are reachable in this dependency hierarchy.
1
Dependency Graph Construction
Build a hierarchical graph of all available APIs from the project's declared dependencies, organized by package → class → method.
2
Hierarchical Context Retrieval
When the model needs an API, retrieve candidates at each level of the hierarchy: first the package, then the class, then compatible methods.
3
Constrained Generation
Provide the LLM with only verified, reachable APIs as context, preventing hallucination of non-existent methods.
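A sketch of the hierarchy constraint: candidate APIs are filtered to those reachable under the project's imported packages. The API tree here is hypothetical and vastly simplified relative to MARIN's actual dependency graph:

```python
# Hypothetical API hierarchy: package -> class -> methods
API_TREE = {
    "java.nio.file": {"Files": ["readString", "readAllLines", "write"]},
    "java.io": {"BufferedReader": ["read", "readLine", "close"]},
}

def reachable_methods(imported_packages):
    """Offer the model only methods under the project's imported
    packages -- a sketch of hierarchy-constrained retrieval."""
    return {
        f"{pkg}.{cls}.{method}"
        for pkg in imported_packages
        for cls, methods in API_TREE.get(pkg, {}).items()
        for method in methods
    }

# If the code already imports java.nio.file, BufferedReader.read()
# from java.io is filtered out, preventing API mixing.
print(sorted(reachable_methods(["java.nio.file"])))
```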
MARIN vs RAG Baseline
MARIN significantly outperforms standard RAG (Retrieval-Augmented Generation) approaches because RAG retrieves documentation by text similarity, which can return irrelevant APIs with similar names. MARIN's hierarchical approach ensures structural compatibility.
MiHN decrease vs RAG: 67.52% · MaHR decrease vs RAG: 73.56%
Why Hierarchy Matters
Without hierarchy (flat RAG): A query for "read file" might retrieve BufferedReader.read(), FileInputStream.read(), and Scanner.nextLine() — all valid but from different APIs, creating mixing errors.
With hierarchy (MARIN): If the code already imports java.nio.file.Files, MARIN retrieves only methods from that package tree, ensuring consistency.
Source: "MARIN: Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware" (FSE 2025)
Module 6 · Slide 21
Mitigation: RAG-Based Approaches
Retrieval-Augmented Generation (RAG) is the foundation for many hallucination mitigation strategies. By grounding the LLM with retrieved documentation, we reduce the model's reliance on potentially incorrect parametric knowledge.
How RAG Reduces Hallucinations
Standard LLM Generation: The model relies entirely on patterns learned during training. If the training data was sparse or contradictory for a given API, hallucinations are likely.
RAG-Augmented Generation: Before generating code, the system retrieves relevant documentation, code examples, or API specifications from an external knowledge base. This retrieved context is prepended to the prompt, giving the model accurate, up-to-date information.
RAG Pipeline Components
1. Knowledge Base: API docs, code examples, Stack Overflow answers, library changelogs
2. Embedding Model: Converts queries and documents to vectors for similarity search
3. Retriever: Finds the most relevant documents for the current code generation task
4. Generator: The LLM produces code grounded in retrieved context
Limitations of Naive RAG
Retrieval noise: Similar text does not mean compatible APIs. Searching for "read file Python" may return docs for open(), pathlib, io, and csv — overwhelming the model with options.
Context window limits: Stuffing too much documentation into the prompt can degrade model performance and push out the actual task description.
Stale indexes: If the knowledge base is not updated when libraries release new versions, RAG can reinforce outdated API usage.
Advanced RAG Strategies
MARIN's hierarchical RAG solves retrieval noise by constraining to dependency-compatible APIs (FSE 2025).
De-Hallucinator's iterative RAG solves context limits by retrieving docs only for APIs the model actually uses, then re-querying (Eghbali & Pradel 2024).
Version-aware RAG indexes API docs by version, ensuring retrieved context matches the project's declared dependencies.
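Version-aware RAG can be sketched as a filter over a version-keyed doc index. The index layout and the `==`-only requirements parsing below are assumptions made for illustration, not any tool's actual format.

```python
# Version-aware retrieval sketch: docs are indexed by (library, version),
# and only entries matching the project's pinned versions are returned.
# The index format and '=='-only pin parsing are illustrative assumptions.
def parse_pins(requirements: list[str]) -> dict[str, str]:
    """Extract name==version pins from requirements-style lines."""
    pins = {}
    for line in requirements:
        if "==" in line:
            name, version = line.strip().split("==")
            pins[name] = version
    return pins

def retrieve_versioned(index: dict, pins: dict) -> list[str]:
    """Keep only documentation recorded for the pinned versions."""
    return [doc for (lib, ver), docs in index.items()
            if pins.get(lib) == ver
            for doc in docs]
```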
Module 6 · Slide 22
Prompt Engineering as Mitigation
Careful prompt design is one of the simplest and most effective ways to reduce hallucinations. These techniques require no infrastructure changes — just better prompts.
1
Be Explicit About Constraints
Tell the model exactly what it can and cannot use: "Only use Python standard library functions. Do not import any third-party packages."
2
Provide API Documentation in Context
Paste the relevant API docs directly into the prompt. The model is far less likely to hallucinate APIs when it has the real docs in its context window.
3
Ask for Citations / Reasoning
Prompt: "Cite which library each method comes from" or "Explain why you chose this approach before writing code." Chain-of-thought prompting reduces hallucinations.
4
Add Negative Examples
"Do NOT use deprecated methods. Do NOT use eval(). Do NOT use string concatenation for SQL queries."
5
Request Self-Verification
"After generating the code, verify that all API methods you used actually exist in the specified library version."
Before (Vague Prompt)
Write a Python function
to read a CSV file
and find duplicates.
After (Constrained Prompt)
Write a Python function using
only the csv module from the
standard library. Read a CSV
file, find duplicate rows based
on all columns, and return them
as a list of dicts.
Do NOT use pandas.
Do NOT use third-party libs.
Explain your approach first.
Why This Works
Constrained prompts reduce the model's "degrees of freedom" for hallucination. When you specify which libraries to use, the model cannot fabricate new ones. When you request chain-of-thought reasoning, the model is forced to justify its choices, making it more likely to catch its own errors before they appear in code.
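The five techniques above compose mechanically, which a small template helper can make concrete. The exact wording below is an assumption; adjust it to your task.

```python
# Constrained-prompt template combining the techniques above: an explicit
# allow-list, negative constraints, a reasoning request, and a
# self-verification request. The phrasing is illustrative.
def constrained_prompt(task: str, allowed: list[str],
                       forbidden: list[str]) -> str:
    lines = [task,
             "Only use these libraries: " + ", ".join(allowed) + "."]
    lines += [f"Do NOT use {item}." for item in forbidden]
    lines.append("Explain your approach before writing code.")
    lines.append("After generating the code, verify that every API method "
                 "you used exists in the specified library version.")
    return "\n".join(lines)
```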
Module 6 · Slide 23
Tool-Augmented Generation
Tool-augmented generation gives LLMs access to external tools that can verify code in real time, dramatically reducing hallucinations by providing ground truth feedback during the generation process.
1
LLM Generates Code
The model produces an initial code solution based on the user's prompt, as in standard generation.
2
Code Execution Tool Runs It
A sandboxed code interpreter executes the generated code. Runtime errors (ImportError, AttributeError, TypeError) are caught immediately and fed back to the model.
3
Static Analysis Tool Checks It
Linters and type checkers analyze the code for issues that might not cause immediate runtime errors: type mismatches, unused imports, deprecated API usage, and potential security vulnerabilities.
4
Documentation Lookup Verifies APIs
A documentation search tool verifies that every API method, class, and parameter used in the code actually exists in the specified library version.
5
LLM Self-Corrects
The model receives all feedback from the tools and generates a corrected version. This loop can repeat multiple times until the code passes all checks.
Key Insight
Tool-augmented generation reduces hallucinations by giving the LLM access to ground truth. Instead of guessing whether an API exists, the model can verify its own outputs against real execution results, documentation, and static analysis. This is the approach used by "code interpreter" features in modern LLM products.
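Steps 1, 2, and 5 of the loop can be sketched as follows. `generate` is a hypothetical stand-in for an LLM call, and a plain subprocess takes the place of a real sandbox.

```python
# Sketch of the generate -> execute -> feed-back loop. generate() is a
# hypothetical LLM call; a plain subprocess stands in for a real sandbox.
import os
import subprocess
import sys
import tempfile

def run_code(code: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute code in a subprocess; return (passed, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True,
                                text=True, timeout=timeout)
        return result.returncode == 0, result.stderr
    finally:
        os.remove(path)

def refine(generate, prompt: str, max_rounds: int = 3) -> str:
    """Re-query the model with runtime errors until the code runs clean."""
    code = generate(prompt)
    for _ in range(max_rounds):
        ok, stderr = run_code(code)
        if ok:
            break
        code = generate(f"{prompt}\n\nYour previous attempt failed:\n{stderr}")
    return code
```

A production loop would also run the static analysis and documentation lookup steps and merge all feedback into the re-query.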
Module 6 · Slide 24 · Interactive
Design a Mitigation Pipeline
Arrange the components below into the correct order for a hallucination mitigation pipeline. Click a component to place it in the next available slot, or click a filled slot to remove it.
Pipeline Builder (interactive)
Available Components (click to place):
Retrieve API Documentation · Verify with Test Execution · Receive User Prompt · Re-generate with Grounded Context · Static Analysis Check · Initial LLM Code Generation
Pipeline Order:
Step 1: click a component above
Step 2
Step 3
Step 4
Step 5
Step 6
Module 6 · Slide 25 · Interactive
Practical Detection Exercise
For each code snippet below, identify the hallucination type using the CodeHalu taxonomy and select the best fix. This exercise combines taxonomy knowledge with practical debugging skills.
Resource Hallucination (Missing Dependencies). collections.OrderedSet does not exist in Python's standard library. The collections module provides OrderedDict, but there is no OrderedSet. Fix: use dict.fromkeys() for insertion-ordered uniqueness in Python 3.7+, or install the third-party ordered-set package.
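The suggested fix, sketched; it relies only on dict preserving insertion order, which is guaranteed since Python 3.7:

```python
# Insertion-ordered de-duplication without any OrderedSet class, using
# dict key ordering (guaranteed since Python 3.7).
def ordered_unique(items):
    """Return items with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(items))
```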
Snippet 2
def flatten(nested_list):
    """Flatten a nested list of arbitrary depth."""
    result = []
    for item in nested_list:
        if isinstance(item, list):
            result.extend(item)  # only 1 level!
        else:
            result.append(item)
    return result
What type of hallucination is this?
Mapping Hallucination (Task Misunderstanding). The docstring says "arbitrary depth" but the code only flattens one level. extend(item) handles [[1,2],[3,4]] but fails on [[[1,2],[3]],4]. Fix: use recursion — replace result.extend(item) with result.extend(flatten(item)).
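Applying the fix described above yields the recursive version:

```python
# Corrected version: recursing into sublists flattens arbitrary depth,
# where the original's extend(item) only removed one level of nesting.
def flatten(nested_list):
    """Flatten a nested list of arbitrary depth."""
    result = []
    for item in nested_list:
        if isinstance(item, list):
            result.extend(flatten(item))  # recurse into sublists
        else:
            result.append(item)
    return result
```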
Naming Hallucination (Name Confusion). The Java Stream API method is .filter(), not .filterBy(). The model likely conflated Java streams with a different framework's naming convention. Fix: replace .filterBy() with .filter().
Module 6 · Slide 26
Case Study: Hallucinations in Production
These are real-world cases where LLM hallucinations caused significant problems. They demonstrate why hallucination mitigation is not just an academic concern.
The Lawyer Who Cited Fake Cases
In 2023, a lawyer used ChatGPT to research legal precedents. The model generated citations to cases that sounded real but did not exist. The lawyer submitted these to a federal court, leading to sanctions. The hallucinated case names followed real naming conventions, making them appear legitimate.
Package Hallucination Attacks
Researchers discovered that LLMs consistently recommend npm and PyPI packages that do not exist. Attackers then register those package names with malicious code. When developers follow the LLM's recommendation and run pip install or npm install, they install malware. This is called "package confusion" or "slopsquatting."
API Version Confusion
LLMs trained on older documentation regularly generate TensorFlow 1.x code for users expecting TensorFlow 2.x. The code is syntactically valid Python but uses deprecated session-based execution instead of eager mode. This wastes developer time debugging version mismatches.
Security Vulnerability Injection
Studies show LLMs sometimes generate code with known vulnerability patterns (CVEs): buffer overflows in C, SQL injection in web apps, insecure deserialization in Java. The model has learned these patterns from training data that included vulnerable code alongside secure code.
Takeaway
These are not hypothetical scenarios — they have all happened. This is why we study hallucinations: the consequences range from embarrassment (fake legal citations) to security breaches (malicious packages, vulnerability injection). Every code output from an LLM should be treated as untrusted input that requires verification.
Module 6 · Slide 27
Building a Hallucination-Resistant Workflow
A practical framework you can apply to your course projects and beyond. This workflow layers multiple mitigation strategies for increasing levels of protection.
1
Use RAG to Ground Generation
Provide relevant documentation and code context in your prompts. The model is far less likely to hallucinate APIs when it has real docs in its context window.
2
Set Low Temperature for Production Code
Use T=0.0–0.2 for code you plan to ship. Save higher temperatures for brainstorming and exploration only.
3
Add Explicit Constraints in Prompts
Specify which libraries, versions, and patterns to use. Include negative constraints ("do NOT use deprecated APIs").
4
Run Linting + Type Checking
Automatically catch naming and resource hallucinations with static analysis tools (pylint, mypy, ESLint, TypeScript compiler).
5
Generate and Run Tests Automatically
Ask the LLM to generate tests for its own code, then execute them. Failing tests indicate hallucinations or logic errors.
6
Human Review for Critical Logic
Logic hallucinations cannot be caught by automated tools alone. Have a human review the algorithm design and edge case handling for critical code paths.
7
Monitor in Production
Log LLM outputs, track error rates, and set up alerts for unexpected failures. Production monitoring catches hallucinations that slip through earlier stages.
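The cheapest piece of step 4 (do the imported packages even resolve?) can be sketched without an external linter. This is an illustrative helper, not a substitute for pylint or mypy.

```python
# Cheap resource-hallucination check: flag imported top-level modules that
# cannot be resolved in the current environment. Illustrative only; a real
# workflow would run pylint/mypy on top of this.
import ast
import importlib.util

def find_missing_imports(code: str) -> list[str]:
    """Return imported module names that do not resolve locally."""
    missing = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing
```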
Minimum Viable Safety (Steps 1–4)
These four steps are the bare minimum for any project using LLM-generated code. They catch the majority of naming, resource, and API hallucinations with low effort. You should always do at least these steps.
Full Protection (All 7 Steps)
For production systems, critical infrastructure, or security-sensitive code, apply all seven steps. The additional cost of testing, human review, and monitoring is justified by the risk reduction.
Course Projects
You will apply this workflow in your course projects. At minimum, every LLM-generated code submission must include evidence of steps 1–4 (grounding, low temperature, constrained prompts, linting).
Module 6 · Slide 28
Open Challenges
Despite significant progress, several fundamental challenges remain in the study and mitigation of code hallucinations. These represent active areas of research and important open problems.
Challenge 01
Benchmark Coverage
Current benchmarks like CodeHaluEval focus on algorithmic tasks. Real-world hallucinations often involve framework-specific APIs, multi-file contexts, and system-level interactions that are poorly represented in existing benchmarks.
Challenge 02
Multi-Language Support
Most research focuses on Python and Java. Hallucination patterns in languages like Rust, Go, TypeScript, and Kotlin remain understudied. Each language's type system and ecosystem create unique hallucination profiles.
Challenge 03
Long-Context Hallucinations
As context windows grow, models can hallucinate consistency with earlier parts of a long file while introducing subtle contradictions. Detecting hallucinations across thousands of lines of code remains extremely difficult.
Challenge 04
Logic Hallucination Detection
Mapping, naming, and resource hallucinations can often be caught by static analysis. Logic hallucinations — where the code uses real APIs correctly but implements wrong algorithms — fundamentally require semantic understanding or comprehensive tests.
Challenge 05
Mitigation-Performance Tradeoff
Techniques like De-Hallucinator and MARIN add latency through retrieval and re-querying. Finding the right balance between hallucination reduction and generation speed is critical for practical adoption in IDE-integrated tools.
Challenge 06
Evolving API Landscapes
Libraries update frequently. An API that was correct when the model was trained may be deprecated or modified. Keeping knowledge bases current across thousands of libraries and versions is an ongoing infrastructure challenge.
Research Frontier
Future work includes combining static analysis with execution-based verification, developing language-specific hallucination profiles, creating adversarial benchmarks that specifically target model weaknesses, and integrating hallucination detection directly into code editors as real-time feedback. Lee et al. 2025, arXiv:2504.20799 — Survey of open challenges
Module 6 · Slide 29
Lab Challenge: Hallucination Hunt
Your assignment: systematically investigate LLM hallucinations in code generation across multiple tasks. This exercise combines detection, classification, and mitigation skills.
Assignment Overview
Use an LLM to generate code for each of the 5 tasks below. For each task: (a) identify any hallucinations in the output, (b) classify the type using CodeHalu or HalluCode taxonomies (intent-conflicting, context-conflicting, knowledge-conflicting, dead code, naming, resource, logic, mapping), (c) apply one mitigation strategy and regenerate, (d) compare before and after. Write a 2-page analysis of patterns you observed.
Task 1: Data Processing
"Write a Python function using pandas to read a CSV file, remove duplicate rows, normalize column names to snake_case, and export to Parquet format."
Task 2: Web API Client
"Write a Python class that wraps the GitHub REST API. Include methods to list repositories, create issues, and fetch pull request details. Handle rate limiting and authentication."
Task 3: Algorithm
"Implement a balanced BST (AVL tree) in Java with insert, delete, and search operations. Include proper rotation logic and height balancing."
Task 4: Database Query
"Write SQLAlchemy ORM models for a blog platform (Users, Posts, Comments, Tags with many-to-many). Include a function to query the 10 most commented posts in the last 30 days."
Task 5: Security
"Write a Node.js Express middleware for JWT authentication. Include token generation, validation, refresh token rotation, and proper error handling for expired/invalid tokens."
Deliverables
• Original LLM output for each of the 5 tasks
• Annotated hallucinations with taxonomy classification
• Mitigation strategy applied for each task
• Regenerated output after mitigation
• 2-page written analysis of patterns observed
Analysis Questions
• Which hallucination types were most common?
• Did certain task types produce more hallucinations?
• Which mitigation strategy was most effective?
• Were there hallucinations you only caught on second review?
• What patterns did you notice across all 5 tasks?
Module 6 · Slide 30 · Interactive
Knowledge Check
Test your understanding of the key concepts covered in this module. Click each question to reveal the answer.
What are the four categories in the CodeHalu taxonomy?
Mapping, Naming, Resource, and Logic hallucinations. Mapping hallucinations occur when the model misinterprets the task. Naming hallucinations involve incorrect identifiers or API names. Resource hallucinations reference non-existent dependencies or files. Logic hallucinations produce syntactically valid but semantically incorrect algorithms. (Tian et al., AAAI 2025)
How does De-Hallucinator differ from standard RAG approaches?
De-Hallucinator uses iterative grounding. Instead of a single retrieval step, it: (1) generates code, (2) extracts API calls from the generated code, (3) retrieves documentation for those specific APIs, and (4) re-queries the model with the documentation as context. This iterative process allows it to correct cascading hallucinations that a single RAG pass would miss. (Eghbali & Pradel 2024)
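Step (2) of that loop, extracting the API calls the model actually used, can be sketched with Python's ast module. The extraction below is a simplification for illustration, not the paper's implementation.

```python
# Simplified sketch of De-Hallucinator's call-extraction step: collect the
# dotted names of everything the generated code calls, so documentation can
# be retrieved for exactly those APIs.
import ast

def extract_api_calls(code: str) -> set[str]:
    """Collect dotted call targets like 'os.path.join' from Python code."""
    def dotted(node):
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Attribute):
            base = dotted(node.value)
            return f"{base}.{node.attr}" if base else None
        return None

    calls = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            name = dotted(node.func)
            if name:
                calls.add(name)
    return calls
```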
What is MARIN's key advantage over flat RAG for API hallucination?
Hierarchical dependency awareness. MARIN models the package → class → method hierarchy of available APIs. This ensures retrieved API documentation is structurally compatible with the project's actual dependencies, achieving a 67.52% decrease in MiHN and 73.56% decrease in MaHR compared to standard RAG. (FSE 2025)
Name the three types of conflicts detectable by static analysis (ISSTA 2025).
Dependency Conflicts, Environment Conflicts, and API Knowledge Conflicts. Dependency conflicts involve incompatible library versions. Environment conflicts involve wrong runtime assumptions (e.g., Python version). API knowledge conflicts involve non-existent or misused API calls. All three can be detected without executing the code. (ISSTA 2025)
Why are logic hallucinations the hardest to detect?
They are syntactically valid and use real APIs correctly. Unlike naming or resource hallucinations (which can be caught by static analysis or dependency checking), logic hallucinations implement wrong algorithms or miss edge cases. The code compiles, runs, and may even pass simple test cases — but produces incorrect results on certain inputs. Detection requires either comprehensive test suites or deep semantic analysis of the algorithm's intent. (Tian et al., AAAI 2025)
What distinguishes HalluCode's "Context Repetition" from other categories?
Context Repetition captures when the model loses track of its own output. It unnecessarily re-declares variables, duplicates code blocks, or repeats logic already present. While this is often harmless (the code still works), it signals that the model is generating tokens based on local patterns rather than maintaining a coherent understanding of the full function it has already produced.