Module 6 · Slide 01
6. Hallucinations in Coding Tasks
When large language models generate code that looks correct but invokes non-existent APIs, uses wrong parameters, or produces logically flawed implementations. Understanding, detecting, and mitigating code hallucinations is critical for safe AI-assisted software development.
Learning Objectives
Objective 1
Define code hallucinations and distinguish them from general LLM hallucinations
Objective 2
Apply the CodeHalu and HalluCode taxonomies to classify hallucination types
Objective 3
Evaluate mitigation strategies including De-Hallucinator and MARIN
Key References
CodeHalu
Tian et al. (AAAI 2025) — Taxonomy of 4 categories, 8 subcategories; CodeHaluEval benchmark
De-Hallucinator
Eghbali & Pradel (2024) — Iterative grounding with API documentation
HalluCode
5 primary categories, 19 specific types of code hallucination
Module 6 · Slide 02
What Are Hallucinations?
In general NLP, a hallucination occurs when a model generates content that is nonsensical, unfaithful to the source, or factually incorrect. In code generation, hallucinations take on a uniquely dangerous character: they produce syntactically valid code that compiles or runs — but behaves incorrectly.
General LLM Hallucination
Output that is fluent and confident but factually wrong. Example: claiming a historical event occurred on the wrong date, or citing a paper that does not exist.
Code Hallucination
Generated code that looks plausible but calls non-existent APIs, uses wrong method signatures, introduces logical errors, or references unavailable libraries. The code often passes superficial review. — Lee et al. 2025, arXiv:2504.20799
Why Code Hallucinations Are Worse
1. Verifiability gap: Text hallucinations can be fact-checked; code hallucinations require execution or deep expertise to detect.
2. Downstream impact: A hallucinated API call can cause runtime crashes, security vulnerabilities, or silent data corruption in production.
3. False confidence: Syntactically correct code creates a strong illusion of correctness, even for experienced developers.
Module 6 · Slide 03
How LLMs Generate Code: A Quick Primer
Before understanding hallucinations, we need to understand how LLMs produce code token by token. The autoregressive generation process is the root cause of many hallucination patterns.
1
Receive Prompt / Context
The LLM receives the user's prompt plus any system instructions or retrieved context as input tokens.
2
Compute Probability Distribution
The model computes a probability distribution over its entire vocabulary for the next token. Each possible token gets a score based on what the model has learned.
3
Sample a Token
A token is selected from the distribution using sampling parameters (temperature, top-k, top-p). Lower temperature means more deterministic choices.
4
Append & Repeat
The chosen token is appended to the context. The model now sees the original prompt plus all previously generated tokens, and computes the next distribution. This repeats until a stop condition is met.
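The four steps above can be sketched with a toy "model" — a hard-coded lookup table standing in for the neural network. The vocabulary and probabilities are purely illustrative:

```python
import random

# Toy next-token model: maps a context string to a probability
# distribution over a tiny vocabulary. A real LLM computes this
# distribution with a neural network over its full vocabulary.
TOY_MODEL = {
    "sort a list":          {"def": 0.42, "import": 0.30, "class": 0.28},
    "sort a list def":      {"sort": 0.38, "main": 0.35, "run": 0.27},
    "sort a list def sort": {"_list": 0.31, "_items": 0.40, "(": 0.29},
}

def generate(prompt, max_tokens=3, seed=0):
    random.seed(seed)
    context = prompt
    for _ in range(max_tokens):
        dist = TOY_MODEL.get(context)
        if dist is None:                              # stop condition
            break
        tokens, probs = zip(*dist.items())
        next_tok = random.choices(tokens, probs)[0]   # sample a token
        context = context + " " + next_tok            # append & repeat
    return context

print(generate("sort a list"))
```

Note that the loop never has an "I don't know" option: some token is always sampled, however weak the model's knowledge of the context.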
Visual: Token-by-Token Generation
Prompt: "sort a list" → P(def)=0.42 → Token: def
Context: ...sort a list def → P(sort)=0.38 → Token: sort
Context: ...def sort → P(_list)=0.31 → Token: _list
Key Insight
Hallucinations emerge because the model always produces a plausible-sounding token, even when it has no knowledge of the correct answer. It cannot say "I don't know" mid-generation — it must always pick the next token.
Module 6 · Slide 04 · Interactive
Temperature & Sampling: The Hallucination Knob
Generation parameters directly affect hallucination rates. Temperature controls how "peaked" or "flat" the probability distribution is when selecting the next token.
Low Temperature (0.1 – 0.3)
More deterministic — the model strongly favors the highest-probability token.
• Fewer hallucinations
• Less creative / more repetitive
• Best for production code generation
• May repeat common patterns
High Temperature (0.8 – 1.2)
More diverse — lower-probability tokens get a real chance of being selected.
• More hallucinations
• More creative / exploratory
• Better for brainstorming
• Can produce surprising (and wrong) outputs
Practical Guidelines
For production code generation, use T=0.0–0.2. For brainstorming and exploration, use T=0.7–1.0. Never use T>1.0 for code you will actually ship.
Temperature Visualizer interactive
Temperature slider: 0.0 (greedy) to 1.0 (balanced) to 2.0 (chaotic); shown at 0.5
Next-token candidates: def 45% · func 25% · class 15% · async 10% · xyzq 5%
Probability distribution over next-token candidates
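Temperature scaling can be demonstrated directly. This sketch rescales an illustrative next-token distribution (the candidate tokens and base probabilities are made up for the demo, echoing the visualizer above):

```python
import math

def apply_temperature(probs, T):
    """Rescale a probability distribution by temperature T.
    T < 1 sharpens the distribution (favors the top token);
    T > 1 flattens it (gives weak tokens a real chance)."""
    logits = {tok: math.log(p) for tok, p in probs.items()}
    scaled = {tok: math.exp(l / T) for tok, l in logits.items()}
    z = sum(scaled.values())
    return {tok: v / z for tok, v in scaled.items()}

# Illustrative next-token candidates
base = {"def": 0.45, "func": 0.25, "class": 0.15, "async": 0.10, "xyzq": 0.05}

for T in (0.2, 1.0, 2.0):
    dist = apply_temperature(base, T)
    top = max(dist, key=dist.get)
    print(f"T={T}: P({top})={dist[top]:.2f}")
```

At T=0.2 the top token absorbs nearly all probability mass; at T=2.0 even the junk token xyzq becomes a live option — which is exactly why high temperature raises hallucination rates.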
Module 6 · Slide 05
Why Hallucinations Matter in Code
Code hallucinations are not merely academic curiosities. They have real consequences across the software development lifecycle, from introducing subtle bugs to creating exploitable security vulnerabilities.
API Misuse
LLMs frequently invent API methods that do not exist or use real APIs with incorrect parameter types and orderings. This is the most common form of code hallucination. ISSTA 2025 — API Knowledge Conflicts
Security Vulnerabilities
Hallucinated code may omit input validation, use deprecated cryptographic functions, or introduce injection points that would not appear in human-written code.
Dependency Conflicts
Models may reference library versions that are incompatible, or combine APIs from different major versions of the same framework. ISSTA 2025 — Dependency Conflicts
Silent Failures
The most insidious hallucinations produce code that runs without errors but produces incorrect results — wrong calculations, missed edge cases, or flawed logic that passes basic tests.
Tasks in CodeHaluEval: 699 · Hallucination samples: 8,883 · LLMs evaluated: 17 · Hallucination categories: 4
Source: Tian et al. CodeHalu (AAAI 2025)
Module 6 · Slide 06
Real-World Examples
These examples illustrate common hallucination patterns observed in LLM-generated code. Each snippet looks plausible but contains fabricated or incorrect API usage.
Example 1: Non-Existent API
The model invents a method that does not exist in the library:
# Hallucinated: pandas has no .smart_merge()
import pandas as pd
df1 = pd.read_csv("users.csv")
df2 = pd.read_csv("orders.csv")
# This method does NOT exist in pandas
result = pd.smart_merge(df1, df2,
                        on="user_id",
                        strategy="fuzzy",  # not a real param
                        threshold=0.8)     # not a real param
Example 2: Wrong Method Signature
The model uses a real API but with incorrect parameters:
// Hallucinated: wrong overload of Files.readString
import java.nio.file.Files;
import java.nio.file.Path;
// Files.readString() takes a Path, not
// a Path + boolean + Charset
String content = Files.readString(
    Path.of("data.txt"),
    true,      // no boolean param
    "UTF-8");  // wants Charset obj
Example 3: Fabricated Library
The model imports a library that has never been published:
# Hallucinated: no such package exists
from sklearn.neural import DeepClassifier
model = DeepClassifier(
    layers=[128, 64, 32],
    activation="gelu",
    optimizer="adamw")
model.fit(X_train, y_train)
Example 4: Logic Hallucination
Syntactically valid but logically wrong — returns before completing:
def find_duplicates(lst):
    """Find all duplicate elements."""
    seen = set()
    duplicates = set()
    for item in lst:
        if item in seen:
            duplicates.add(item)
            return duplicates  # BUG: returns on
        seen.add(item)         # first dup found
    return duplicates
Module 6 · Slide 07
CodeHalu Taxonomy (Part 1)
Tian et al. (AAAI 2025) propose a systematic taxonomy of code hallucinations based on execution-based verification, dividing them into four categories with eight subcategories. This slide covers the first two categories.
Category 1: Mapping Hallucinations
The LLM fails to correctly map the task description to executable code. The generated solution does not align with what was asked.
Subcategory 1a — Task Misunderstanding: The model solves a different problem than specified. E.g., asked to sort descending but sorts ascending.
Subcategory 1b — Specification Violation: The code ignores explicit constraints such as time complexity requirements or input/output formats.
# Mapping hallucination: task says "return indices"
# but model returns values instead
def two_sum(nums, target):
    for i in range(len(nums)):
        for j in range(i+1, len(nums)):
            if nums[i] + nums[j] == target:
                return [nums[i], nums[j]]  # wrong!
                # should be: return [i, j]
Category 2: Naming Hallucinations
The LLM references identifiers — variable names, function names, class names — that either do not exist in the current scope or are used inconsistently.
Subcategory 2a — Undefined References: Using variables or functions that were never declared or imported.
Subcategory 2b — Name Confusion: Mixing up similar-sounding API names (e.g., getSize() vs size() vs length()).
// Naming hallucination: mixing up API names
import java.util.ArrayList;
ArrayList<String> list = new ArrayList<>();
list.add("hello");
// ArrayList uses .size(), not .length()
int n = list.length(); // compile error!
// Also confuses with array .length field
// and String .length() method
Source: Tian et al. "CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification" (AAAI 2025)
Module 6 · Slide 08
CodeHalu Taxonomy (Part 2)
The remaining two categories in the CodeHalu taxonomy cover hallucinations related to resource management and logical reasoning — often the hardest to detect without execution.
Category 3: Resource Hallucinations
The LLM generates code that references external resources (files, network endpoints, database tables, environment variables) that do not exist or are inaccessible.
Subcategory 3a — Missing Dependencies: Importing packages that are not installed or do not exist in the target ecosystem.
Subcategory 3b — Environment Assumptions: Assuming the presence of files, directories, services, or configurations that are not guaranteed.
# Resource hallucination: assumes file exists
import json
# Model assumes config.json is always present
with open("config.json") as f:
    config = json.load(f)
# No error handling, no fallback defaults
db_host = config["database"]["host"]
db_port = config["database"]["port"]
Category 4: Logic Hallucinations
The code is syntactically valid and uses real APIs correctly, but implements flawed algorithms or incorrect control flow that produces wrong results.
Subcategory 4a — Algorithm Errors: Implementing a sorting algorithm that does not actually sort, or a search that misses valid results.
Subcategory 4b — Edge Case Failures: Code that works for common inputs but fails on boundary conditions (empty lists, negative numbers, null values).
# Logic hallucination: off-by-one in binary search
def binary_search(arr, target):
    lo, hi = 0, len(arr)  # should be len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:  # IndexError!
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
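For contrast, a standard correct implementation (not taken from the paper) initializes hi to the last valid index, so mid can never run past the end of the array:

```python
def binary_search(arr, target):
    """Correct binary search over a sorted list."""
    lo, hi = 0, len(arr) - 1       # hi is the last valid index
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1                       # target not present

print(binary_search([1, 3, 5, 7, 9], 9))   # → 4
print(binary_search([1, 3, 5, 7, 9], 4))   # → -1
```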
Source: Tian et al. "CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification" (AAAI 2025)
Module 6 · Slide 09 · Interactive
Spot the Hallucination
Each pair below contains one snippet with a real API call and one with a hallucinated API. Click the snippet you believe is hallucinated.
Challenge 1: Python HTTP Requests interactive
import requests
response = requests.get("https://api.example.com")
data = response.json()
import requests
response = requests.fetch("https://api.example.com")
data = response.to_json()
Explanation
Right snippet is hallucinated. The requests library uses .get(), not .fetch(). Also, the response method is .json() not .to_json(). This is a Naming Hallucination per the CodeHalu taxonomy.
Challenge 2: Java Collections interactive
Map<String,Integer> map = new HashMap<>();
map.put("a", 1);
int val = map.getOrDefault("b", 0, true);
Map<String,Integer> map = new HashMap<>();
map.put("a", 1);
int val = map.getOrDefault("b", 0);
Explanation
Left snippet is hallucinated. Map.getOrDefault() takes exactly 2 parameters (key, defaultValue). The third boolean parameter does not exist. This is a Naming Hallucination (Name Confusion) — the model added a phantom parameter.
Challenge 3: Python File I/O interactive
from pathlib import Path
text = Path("file.txt").read_text()
lines = text.splitlines()
from pathlib import Path
text = Path("file.txt").read_lines()
lines = text.to_list()
Explanation
Right snippet is hallucinated. pathlib.Path has .read_text() and .read_bytes(), but no .read_lines() method. The model fabricated both the method name and the .to_list() conversion. This is a Naming Hallucination.
Module 6 · Slide 10
HalluCode Taxonomy
HalluCode proposes a complementary taxonomy focusing on the relationship between generated code and the developer's intent, context, and domain knowledge. It defines 5 primary categories and 19 specific types.
1. Intent Conflicting
Generated code contradicts the user's stated intent or task description. The code runs but does something fundamentally different from what was requested. Example: asked to delete duplicates, but the code deletes unique elements instead.
2. Context Inconsistency
Code is inconsistent with the surrounding context — uses variables defined elsewhere with wrong types, breaks established patterns in the codebase, or contradicts previous lines in the same function.
3. Context Repetition
The model unnecessarily repeats code blocks, re-declares variables, or duplicates logic that is already present. This can indicate the model has lost track of what it has already generated.
4. Dead Code
Generated code includes unreachable statements, unused variables, redundant conditions, or function definitions that are never called. While not always incorrect, dead code signals the model is generating without genuine understanding.
5. Knowledge Conflicting
Code contradicts established programming knowledge: using deprecated APIs, violating language-specific idioms, applying algorithms to unsuitable data structures, or using patterns that are known anti-patterns. Example: using a bubble sort where the prompt requires O(n log n) complexity. This category overlaps with CodeHalu's Naming and Resource hallucinations.
Source: HalluCode — 5 primary categories, 19 specific types of code hallucination
Module 6 · Slide 11
Comparing Taxonomies
CodeHalu and HalluCode approach code hallucination classification from different angles. Understanding both provides a more complete picture of the hallucination landscape.
Dimension | CodeHalu (AAAI 2025) | HalluCode
Focus | Execution-based verification; what goes wrong at runtime | Intent alignment; how code deviates from user expectations
Categories | Mapping, Naming, Resource, Logic | Intent Conflicting, Context Inconsistency, Context Repetition, Dead Code, Knowledge Conflicting
Granularity | 4 categories, 8 subcategories | 5 categories, 19 specific types
Detection Method | Execution-based: run code, compare outputs against expected results | Hybrid: static analysis + semantic comparison to intent
Neither taxonomy is strictly superior. CodeHalu excels at automated, execution-based detection of hallucinations that cause runtime failures. HalluCode captures subtler issues like dead code and context repetition that may not cause crashes but still indicate poor code quality. A comprehensive hallucination analysis should draw from both frameworks.
Module 6 · Slide 12 · Interactive
Interactive: Spot the Hallucination in Code
Each code snippet below was generated by an LLM and contains a specific hallucination. Click the line you think is hallucinated, then click "Reveal" to check your answer.
Code Hallucination Hunt interactive
Snippet A: Python List Sorting
import random
data = [random.randint(1, 100) for _ in range(20)]
sorted_data = data.sortDescending()
print(sorted_data)
Line 3 is hallucinated. Python lists have no .sortDescending() method. The correct approach is data.sort(reverse=True) or sorted_data = sorted(data, reverse=True). This is a Naming Hallucination — the model invented a Java-style method name.
Snippet B: Scikit-learn Import
from sklearn.neural import MLPClassifier
Line 1 is hallucinated. The correct import is from sklearn.neural_network import MLPClassifier. The module sklearn.neural does not exist — it is sklearn.neural_network. This is a Resource Hallucination (Missing Dependencies).
Snippet C: JavaScript String Method
const text = "Hello, World!";
const reversed = text.reverse();
console.log(reversed);
Line 2 is hallucinated. JavaScript strings do not have a .reverse() method. Only arrays have .reverse(). The correct approach is text.split('').reverse().join(''). This is a Naming Hallucination (Name Confusion) — confusing Array and String APIs.
Snippet D: Python JSON Handling
import json
data = {"name": "Alice", "age": 30}
json_str = json.stringify(data)
print(json_str)
Line 3 is hallucinated. Python's json module uses json.dumps(), not json.stringify(). The model confused Python with JavaScript's JSON.stringify(). This is a Knowledge Conflicting hallucination — mixing language conventions.
Snippet E: Semantic Error (Logic)
def is_palindrome(s):
    s = s.lower().strip()
    return s == s[:-1]  # missing the full reverse
Line 3 is hallucinated. The code compares s to s[:-1] (all characters except the last), not to the reversed string s[::-1]. Because the lengths differ, this returns False for every non-empty string (including actual palindromes like "racecar") and True only for the empty string. This is a Logic Hallucination — syntactically valid but semantically wrong.
Module 6 · Slide 13
How LLMs Hallucinate Code
Understanding the mechanisms behind code hallucinations helps inform both detection and mitigation strategies. Several factors contribute to hallucination in code generation.
1
Training Data Gaps
APIs that are infrequent in training data are more likely to be hallucinated. The model may have seen similar but different APIs and conflates them. Less popular libraries have far fewer training examples, leading to fabricated method names and signatures.
2
API Version Confusion
Training data contains code using multiple versions of the same library (e.g., TensorFlow 1.x vs 2.x, Python 2 vs 3). The model may blend APIs across incompatible versions, producing code that was valid in an older version but fails in the current one. ISSTA 2025 — Environment Conflicts
3
Distribution Shift
When a prompt asks for code combining libraries in novel ways not well-represented in training data, the model falls back on statistical patterns rather than genuine understanding, increasing hallucination probability.
4
Autoregressive Accumulation
Each generated token conditions on all previous tokens. A small error early in generation (e.g., wrong import) cascades — the model then generates code consistent with the hallucinated import, compounding the hallucination throughout the function.
5
Pattern Mimicry over Semantics
LLMs are pattern-matching systems. They learn that methods are often named get_X(), set_X(), to_X(), and may synthesize plausible-sounding method names that follow common conventions but do not actually exist in the target library. Lee et al. 2025, arXiv:2504.20799
Module 6 · Slide 14
Measuring Hallucinations
Quantifying code hallucinations requires specialized benchmarks and metrics. The CodeHaluEval benchmark by Tian et al. provides a rigorous framework for evaluation.
CodeHaluEval Benchmark
Scale: 699 tasks generating 8,883 code samples across 17 LLMs
Process: Each task is paired with a specification and test cases. Generated code is executed against the tests. Failures are categorized by hallucination type using the CodeHalu taxonomy.
Coverage: Tasks span algorithm design, data structure manipulation, API usage, and systems programming.
ISSTA 2025 Framework
Focuses on hallucinations detectable via static analysis without execution:
• Dependency Conflicts — incompatible library versions
• Environment Conflicts — wrong Python/Java versions
• API Knowledge Conflicts — non-existent or misused APIs
Key Metrics
MiHN (Micro Hallucination Number): The total count of individual hallucinated elements (wrong API calls, incorrect parameters, etc.) across all generated samples. Lower is better.
MaHR (Macro Hallucination Rate): The fraction of generated code samples that contain at least one hallucination. Captures how frequently the model produces any hallucination at all.
Edit Distance: How many edits are needed to fix hallucinated code to match correct code. Used by De-Hallucinator to measure improvement.
API Recall: The fraction of required API calls that are correctly included in generated code.
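Following the definitions above, both MiHN and MaHR can be computed from per-sample evaluation results. The numbers here are hypothetical, just to make the two metrics concrete:

```python
# Hypothetical evaluation results: number of hallucinated elements
# (wrong API calls, bad parameters, ...) found in each generated sample.
hallucinations_per_sample = [0, 2, 0, 1, 0, 0, 3, 0, 0, 1]

# MiHN: total count of individual hallucinated elements (lower is better)
mihn = sum(hallucinations_per_sample)

# MaHR: fraction of samples containing at least one hallucination
mahr = (sum(1 for n in hallucinations_per_sample if n > 0)
        / len(hallucinations_per_sample))

print(f"MiHN = {mihn}, MaHR = {mahr:.0%}")   # MiHN = 7, MaHR = 40%
```

The two metrics can diverge: one sample with ten wrong API calls inflates MiHN but moves MaHR no more than a sample with a single wrong call.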
MiHN decrease (MARIN): 67.5% · MaHR decrease (MARIN): 73.6% · Edit distance improvement: 50.6% · API recall improvement: 61.0%
MARIN (FSE 2025) and De-Hallucinator (Eghbali & Pradel 2024) results
Module 6 · Slide 15 · Interactive
Rate the Severity
Not all hallucinations are equally dangerous. Classify each example below by severity: Low (cosmetic / style issue), Medium (will cause errors but detectable), High (runtime crash or wrong results), or Critical (security vulnerability or data loss).
Severity Classification interactive
An unused import statement is generated at the top of the file
An unused import is a Dead Code hallucination (HalluCode). It does not affect execution — at worst, it adds a minor dependency. Severity: Low.
The model calls torch.cuda.memory_optimized_forward(), a method that does not exist
A non-existent method causes an AttributeError at runtime. This is a Naming Hallucination (CodeHalu). Easily caught by testing but will crash in production if undetected. Severity: High.
User input is concatenated directly into a SQL query string instead of using parameterized queries
SQL injection is a security vulnerability. The code runs fine in normal testing but is exploitable. This is a Knowledge Conflicting hallucination (HalluCode) — the model ignores known security practices. Severity: Critical.
A function re-declares a variable that was already defined 3 lines above with the same value
This is a Context Repetition hallucination (HalluCode). It may cause confusion and in some languages could shadow variables leading to subtle bugs. Severity: Medium.
A sorting function returns after finding the first unsorted pair instead of fully sorting the list
A Logic Hallucination (CodeHalu). The function appears to work on small or nearly-sorted inputs but silently produces wrong results on general inputs. Severity: High.
Module 6 · Slide 16
Detection Approaches
Detecting code hallucinations requires multiple complementary strategies, ranging from dynamic execution to static analysis and semantic reasoning.
Execution-Based Verification
How it works: Execute generated code against test suites and compare actual outputs to expected results. Any mismatch indicates a hallucination.
Strengths: Definitive — if code fails a test, something is wrong. Catches all categories of hallucination that affect behavior.
Limitations: Requires comprehensive test suites. Cannot detect dead code or stylistic issues. Expensive for large-scale evaluation.
Used by: CodeHaluEval (AAAI 2025)
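A minimal sketch of execution-based verification: run the generated code, then evaluate test expressions against expected values. The helper and test data are illustrative, and exec on untrusted code should only ever happen inside a sandbox:

```python
def run_against_tests(code: str, tests: list) -> list:
    """Execute generated code, then check (expression, expected) pairs.
    Any mismatch or exception is reported as a potential hallucination.
    WARNING: sketch only -- never exec untrusted code unsandboxed."""
    namespace = {}
    failures = []
    try:
        exec(code, namespace)           # run the generated code
    except Exception as e:
        return [f"execution failed: {e}"]
    for expr, expected in tests:
        try:
            actual = eval(expr, namespace)
            if actual != expected:
                failures.append(f"{expr}: got {actual!r}, expected {expected!r}")
        except Exception as e:
            failures.append(f"{expr}: raised {e}")
    return failures

# LLM-generated code with a logic hallucination (early return, as in
# the find_duplicates example earlier in this module)
generated = """
def find_duplicates(lst):
    seen, dups = set(), set()
    for item in lst:
        if item in seen:
            dups.add(item)
            return dups          # bug: returns on first duplicate
        seen.add(item)
    return dups
"""
print(run_against_tests(generated, [("find_duplicates([1, 2, 1, 2])", {1, 2})]))
```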
AST-Based Analysis
How it works: Parse the generated code into an Abstract Syntax Tree and check for structural anomalies: undefined references, type mismatches, unreachable code paths, redundant expressions.
Strengths: Fast, no execution needed. Good at catching naming hallucinations and dead code.
Limitations: Cannot catch logic hallucinations or semantic errors that are structurally valid.
Static Analysis
How it works: Use existing static analysis tools (linters, type checkers, dependency resolvers) to identify hallucinated dependencies, version conflicts, and API misuse without running the code.
ISSTA 2025 — "LLM Hallucinations in Practical Code Generation"
LLM-Based Self-Verification
How it works: Use a second LLM pass (or the same model) to review generated code for correctness, asking it to identify potential issues.
Strengths: Can catch semantic and logic issues that static tools miss. No test suite required.
Limitations: The verifier LLM may itself hallucinate and approve incorrect code. Not reliable as a sole detection method.
Module 6 · Slide 17
Hallucination Detection: Automated Approaches
Beyond manual code review, several automated techniques can detect hallucinations at scale. The best strategies combine multiple signals for robust detection.
Static Analysis
Check if generated APIs exist in library documentation. Linters and type checkers catch non-existent methods and wrong signatures automatically.
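A minimal existence check can be built with Python's ast module: parse the generated code and flag attribute accesses on an imported module that do not actually exist. This is an illustrative sketch, not a production linter:

```python
import ast
import importlib

def undefined_api_calls(code: str, module_name: str) -> list:
    """Flag module.attr references where attr does not exist on the
    named module -- a crude static existence check."""
    mod = importlib.import_module(module_name)
    missing = []
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == module_name
                and not hasattr(mod, node.attr)):
            missing.append(node.attr)
    return missing

# json.stringify is a classic hallucination; json.dumps is the real API
print(undefined_api_calls("import json\nx = json.stringify({})", "json"))
# → ['stringify']
```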
Type Checking
Run a type checker (mypy, TypeScript compiler) on generated code. Type errors often indicate hallucinated APIs or wrong parameter types.
Test Generation
Generate tests for the code and run them. Failing tests suggest hallucinations. This can be automated with the same LLM that generated the code.
Cross-Model Verification
Ask multiple LLMs to solve the same task, then compare outputs. Disagreement between models suggests at least one is hallucinating.
Confidence Scoring
Low-probability tokens in the generation are more likely to be hallucinated. Token-level confidence can flag suspicious API calls and parameter names.
Key Insight
No single detection method is perfect. Static analysis catches naming and resource hallucinations but misses logic errors. Testing catches behavioral issues but requires good test coverage. Cross-model verification adds a probabilistic signal but is expensive. The best approaches combine multiple signals into a detection pipeline.
Module 6 · Slide 18
RAG for Hallucination Reduction
Retrieval-Augmented Generation (RAG) is a foundational technique for reducing hallucinations. Instead of relying on the model's parametric memory alone, RAG retrieves relevant documentation at query time to ground the generation in factual data.
1
The Problem
LLMs hallucinate because they generate from statistical patterns, not factual knowledge. When the training data is sparse or outdated for a given API, the model invents plausible-sounding but incorrect code.
2
The Solution: Retrieve at Query Time
Instead of relying solely on what the model memorized during training, retrieve relevant documentation, code examples, and API specifications from an external knowledge base before generating.
3
The Pipeline
User query → Embed the query → Search codebase / docs → Retrieve relevant files → Add retrieved context to prompt → Generate code with grounded context.
4
The Result
The model generates code based on actual documentation and real code examples rather than imagination. API names, parameter types, and method signatures are grounded in retrieved facts.
User Query → Embed → Search Docs → Retrieve Context → Augmented Prompt → Grounded Code
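A toy version of the retrieval step, using token overlap as a stand-in for embedding similarity. The documentation snippets in the knowledge base are hypothetical:

```python
def tokenize(text):
    return set(text.lower().split())

def retrieve(query, docs, k=1):
    """Rank docs by Jaccard overlap with the query (a crude stand-in
    for embedding-based similarity search) and return the top-k."""
    def score(doc):
        q, d = tokenize(query), tokenize(doc)
        return len(q & d) / len(q | d)
    return sorted(docs, key=score, reverse=True)[:k]

# Hypothetical documentation snippets in the knowledge base
docs = [
    "json.dumps(obj) serializes a Python object to a JSON string",
    "csv.reader(f) iterates over rows of a CSV file",
    "re.findall(pattern, string) returns all matches",
]

query = "serialize a dict to a JSON string"
context = retrieve(query, docs)[0]
prompt = f"Using this documentation:\n{context}\n\nTask: {query}"
print(prompt)
```

The retrieved snippet about json.dumps lands in the prompt, so the model is far less likely to reach for a fabricated method like json.stringify.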
Important Caveat
RAG does not eliminate hallucinations, but it dramatically reduces them by providing the model with factual grounding. The model can still hallucinate if retrieved context is irrelevant, incomplete, or if the model ignores the context. More advanced RAG variants (like MARIN and De-Hallucinator) address these limitations.
Module 6 · Slide 19
Mitigation: De-Hallucinator
Eghbali & Pradel (2024) introduce De-Hallucinator, an iterative grounding approach that retrieves relevant API documentation to reduce hallucinations in generated code.
User Prompt → Initial LLM Generation → Extract API Calls → Retrieve API Docs → Re-query with Context → Refined Output
1
Initial Generation
The LLM generates code from the user's prompt as usual. This code may contain hallucinated API calls.
2
API Extraction
Parse the generated code to identify all external API calls, library imports, and method invocations.
3
Documentation Retrieval
For each identified API, retrieve the official documentation including correct method signatures, parameter types, return types, and usage examples.
4
Iterative Re-query
Re-prompt the LLM with the original request augmented by the retrieved API docs. The model uses this grounding to correct hallucinated calls. This process can repeat for multiple iterations.
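The four steps can be sketched as a loop. The llm, extract_api_calls, and lookup_docs helpers are hypothetical stand-ins for the components described above, not the paper's actual interface:

```python
def de_hallucinate(prompt, llm, extract_api_calls, lookup_docs, iterations=3):
    """Sketch of iterative grounding: generate, extract API calls,
    retrieve their docs, and re-query with the docs prepended."""
    code = llm(prompt)
    for _ in range(iterations):
        apis = extract_api_calls(code)      # step 2: parse out API calls
        if not apis:                        # nothing left to ground
            break
        docs = [lookup_docs(api) for api in apis]   # step 3: fetch docs
        grounded = "\n".join(d for d in docs if d) + "\n\n" + prompt
        new_code = llm(grounded)            # step 4: re-query with context
        if new_code == code:                # converged
            break
        code = new_code
    return code

# Tiny demo with fake components: this "model" fixes json.stringify
# only once the real documentation appears in its prompt.
fake_docs = {"json.stringify": "json has no stringify; use json.dumps(obj)"}

def fake_llm(prompt):
    return ("json.dumps(data)" if "json.dumps" in prompt
            else "json.stringify(data)")

def fake_extract(code):
    return ["json.stringify"] if "stringify" in code else []

print(de_hallucinate("serialize data", fake_llm, fake_extract, fake_docs.get))
# → json.dumps(data)
```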
Results
De-Hallucinator achieves significant improvements across multiple metrics when tested on code generation benchmarks:
Edit distance improvement: 23.3-50.6% · API recall improvement: 23.9-61.0% · More fixed tests: 63.2% · Statement coverage increase: 15.5%
Key Insight
The iterative approach is crucial. A single round of documentation retrieval helps, but multiple iterations allow the model to progressively resolve cascading hallucinations where one incorrect API call leads to further errors.
Source: Eghbali & Pradel "De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding" (2024)
Module 6 · Slide 20
Mitigation: MARIN
MARIN (FSE 2025) takes a hierarchical, dependency-aware approach to mitigating API hallucinations by modeling the relationships between APIs in a project's dependency graph.
Core Idea
Rather than treating each API call independently, MARIN builds a hierarchical dependency graph that captures relationships between packages, classes, and methods. When generating code, the model is constrained to only use APIs that are reachable in this dependency hierarchy.
1
Dependency Graph Construction
Build a hierarchical graph of all available APIs from the project's declared dependencies, organized by package → class → method.
2
Hierarchical Context Retrieval
When the model needs an API, retrieve candidates at each level of the hierarchy: first the package, then the class, then compatible methods.
3
Constrained Generation
Provide the LLM with only verified, reachable APIs as context, preventing hallucination of non-existent methods.
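A sketch of the hierarchy constraint: candidate APIs are filtered to those reachable under the project's imported packages. The API tree here is hypothetical and vastly simplified relative to MARIN's actual dependency graph:

```python
# Hypothetical API hierarchy: package -> class -> methods
API_TREE = {
    "java.nio.file": {"Files": ["readString", "readAllLines", "write"]},
    "java.io": {"BufferedReader": ["read", "readLine", "close"]},
}

def reachable_methods(imported_packages):
    """Offer the model only methods under the project's imported
    packages -- a sketch of hierarchy-constrained retrieval."""
    return {
        f"{pkg}.{cls}.{method}"
        for pkg in imported_packages
        for cls, methods in API_TREE.get(pkg, {}).items()
        for method in methods
    }

# If the code already imports java.nio.file, BufferedReader.read()
# from java.io is filtered out, preventing API mixing.
print(sorted(reachable_methods(["java.nio.file"])))
```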
MARIN vs RAG Baseline
MARIN significantly outperforms standard RAG (Retrieval-Augmented Generation) approaches because RAG retrieves documentation by text similarity, which can return irrelevant APIs with similar names. MARIN's hierarchical approach ensures structural compatibility.
MiHN decrease vs RAG: 67.52% · MaHR decrease vs RAG: 73.56%
Why Hierarchy Matters
Without hierarchy (flat RAG): A query for "read file" might retrieve BufferedReader.read(), FileInputStream.read(), and Scanner.nextLine() — all valid but from different APIs, creating mixing errors.
With hierarchy (MARIN): If the code already imports java.nio.file.Files, MARIN retrieves only methods from that package tree, ensuring consistency.
Source: "MARIN: Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware" (FSE 2025)
Module 6 · Slide 21
Mitigation: RAG-Based Approaches
Retrieval-Augmented Generation (RAG) is the foundation for many hallucination mitigation strategies. By grounding the LLM with retrieved documentation, we reduce the model's reliance on potentially incorrect parametric knowledge.
How RAG Reduces Hallucinations
Standard LLM Generation: The model relies entirely on patterns learned during training. If the training data was sparse or contradictory for a given API, hallucinations are likely.
RAG-Augmented Generation: Before generating code, the system retrieves relevant documentation, code examples, or API specifications from an external knowledge base. This retrieved context is prepended to the prompt, giving the model accurate, up-to-date information.
RAG Pipeline Components
1. Knowledge Base: API docs, code examples, Stack Overflow answers, library changelogs
2. Embedding Model: Converts queries and documents to vectors for similarity search
3. Retriever: Finds the most relevant documents for the current code generation task
4. Generator: The LLM produces code grounded in retrieved context
Limitations of Naive RAG
Retrieval noise: Similar text does not mean compatible APIs. Searching for "read file Python" may return docs for open(), pathlib, io, and csv — overwhelming the model with options.
Context window limits: Stuffing too much documentation into the prompt can degrade model performance and push out the actual task description.
Stale indexes: If the knowledge base is not updated when libraries release new versions, RAG can reinforce outdated API usage.
Advanced RAG Strategies
MARIN's hierarchical RAG solves retrieval noise by constraining to dependency-compatible APIs (FSE 2025).
De-Hallucinator's iterative RAG solves context limits by retrieving docs only for APIs the model actually uses, then re-querying (Eghbali & Pradel 2024).
Version-aware RAG indexes API docs by version, ensuring retrieved context matches the project's declared dependencies.
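Version-aware RAG can be sketched as a filter over a version-keyed doc index. The index layout and the `==`-only requirements parsing below are assumptions made for illustration, not any tool's actual format.

```python
# Version-aware retrieval sketch: docs are indexed by (library, version),
# and only entries matching the project's pinned versions are returned.
# The index format and '=='-only pin parsing are illustrative assumptions.
def parse_pins(requirements: list[str]) -> dict[str, str]:
    """Extract name==version pins from requirements-style lines."""
    pins = {}
    for line in requirements:
        if "==" in line:
            name, version = line.strip().split("==")
            pins[name] = version
    return pins

def retrieve_versioned(index: dict, pins: dict) -> list[str]:
    """Keep only documentation recorded for the pinned versions."""
    return [doc for (lib, ver), docs in index.items()
            if pins.get(lib) == ver
            for doc in docs]
```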
Module 6 · Slide 22
Prompt Engineering as Mitigation
Careful prompt design is one of the simplest and most effective ways to reduce hallucinations. These techniques require no infrastructure changes — just better prompts.
1
Be Explicit About Constraints
Tell the model exactly what it can and cannot use: "Only use Python standard library functions. Do not import any third-party packages."
2
Provide API Documentation in Context
Paste the relevant API docs directly into the prompt. The model is far less likely to hallucinate APIs when it has the real docs in its context window.
3
Ask for Citations / Reasoning
Prompt: "Cite which library each method comes from" or "Explain why you chose this approach before writing code." Chain-of-thought prompting reduces hallucinations.
4
Add Negative Examples
"Do NOT use deprecated methods. Do NOT use eval(). Do NOT use string concatenation for SQL queries."
5
Request Self-Verification
"After generating the code, verify that all API methods you used actually exist in the specified library version."
Before (Vague Prompt)
Write a Python function
to read a CSV file
and find duplicates.
After (Constrained Prompt)
Write a Python function using
only the csv module from the
standard library. Read a CSV
file, find duplicate rows based
on all columns, and return them
as a list of dicts.
Do NOT use pandas.
Do NOT use third-party libs.
Explain your approach first.
Why This Works
Constrained prompts reduce the model's "degrees of freedom" for hallucination. When you specify which libraries to use, the model cannot fabricate new ones. When you request chain-of-thought reasoning, the model is forced to justify its choices, making it more likely to catch its own errors before they appear in code.
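The five techniques above compose mechanically, which a small template helper can make concrete. The exact wording below is an assumption; adjust it to your task.

```python
# Constrained-prompt template combining the techniques above: an explicit
# allow-list, negative constraints, a reasoning request, and a
# self-verification request. The phrasing is illustrative.
def constrained_prompt(task: str, allowed: list[str],
                       forbidden: list[str]) -> str:
    lines = [task,
             "Only use these libraries: " + ", ".join(allowed) + "."]
    lines += [f"Do NOT use {item}." for item in forbidden]
    lines.append("Explain your approach before writing code.")
    lines.append("After generating the code, verify that every API method "
                 "you used exists in the specified library version.")
    return "\n".join(lines)
```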
Module 6 · Slide 23
Tool-Augmented Generation
Tool-augmented generation gives LLMs access to external tools that can verify code in real time, dramatically reducing hallucinations by providing ground truth feedback during the generation process.
1
LLM Generates Code
The model produces an initial code solution based on the user's prompt, as in standard generation.
2
Code Execution Tool Runs It
A sandboxed code interpreter executes the generated code. Runtime errors (ImportError, AttributeError, TypeError) are caught immediately and fed back to the model.
3
Static Analysis Tool Checks It
Linters and type checkers analyze the code for issues that might not cause immediate runtime errors: type mismatches, unused imports, deprecated API usage, and potential security vulnerabilities.
4
Documentation Lookup Verifies APIs
A documentation search tool verifies that every API method, class, and parameter used in the code actually exists in the specified library version.
5
LLM Self-Corrects
The model receives all feedback from the tools and generates a corrected version. This loop can repeat multiple times until the code passes all checks.
Key Insight
Tool-augmented generation reduces hallucinations by giving the LLM access to ground truth. Instead of guessing whether an API exists, the model can verify its own outputs against real execution results, documentation, and static analysis. This is the approach used by "code interpreter" features in modern LLM products.
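Steps 1, 2, and 5 of the loop can be sketched as follows. `generate` is a hypothetical stand-in for an LLM call, and a plain subprocess takes the place of a real sandbox.

```python
# Sketch of the generate -> execute -> feed-back loop. generate() is a
# hypothetical LLM call; a plain subprocess stands in for a real sandbox.
import os
import subprocess
import sys
import tempfile

def run_code(code: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute code in a subprocess; return (passed, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True,
                                text=True, timeout=timeout)
        return result.returncode == 0, result.stderr
    finally:
        os.remove(path)

def refine(generate, prompt: str, max_rounds: int = 3) -> str:
    """Re-query the model with runtime errors until the code runs clean."""
    code = generate(prompt)
    for _ in range(max_rounds):
        ok, stderr = run_code(code)
        if ok:
            break
        code = generate(f"{prompt}\n\nYour previous attempt failed:\n{stderr}")
    return code
```

A production loop would also run the static analysis and documentation lookup steps and merge all feedback into the re-query.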
Module 6 · Slide 24 · Interactive
Design a Mitigation Pipeline
Arrange the components below into the correct order for a hallucination mitigation pipeline. Click a component to place it in the next available slot, or click a filled slot to remove it.
Pipeline Builder (interactive)
Available Components (click to place):
Retrieve API Documentation · Verify with Test Execution · Receive User Prompt · Re-generate with Grounded Context · Static Analysis Check · Initial LLM Code Generation
Pipeline Order:
Step 1: click a component above
Step 2
Step 3
Step 4
Step 5
Step 6
Module 6 · Slide 25 · Interactive
Practical Detection Exercise
For each code snippet below, identify the hallucination type using the CodeHalu taxonomy and select the best fix. This exercise combines taxonomy knowledge with practical debugging skills.
Resource Hallucination (Missing Dependencies). collections.OrderedSet does not exist in Python's standard library. The collections module provides OrderedDict, but there is no OrderedSet. Fix: use dict.fromkeys() for insertion-ordered uniqueness in Python 3.7+, or install the third-party ordered-set package.
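The suggested fix, sketched; it relies only on dict preserving insertion order, which is guaranteed since Python 3.7:

```python
# Insertion-ordered de-duplication without any OrderedSet class, using
# dict key ordering (guaranteed since Python 3.7).
def ordered_unique(items):
    """Return items with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(items))
```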
Snippet 2
def flatten(nested_list):
    """Flatten a nested list of arbitrary depth."""
    result = []
    for item in nested_list:
        if isinstance(item, list):
            result.extend(item)  # only 1 level!
        else:
            result.append(item)
    return result
What type of hallucination is this?
Mapping Hallucination (Task Misunderstanding). The docstring says "arbitrary depth" but the code only flattens one level. extend(item) handles [[1,2],[3,4]] but fails on [[[1,2],[3]],4]. Fix: use recursion — replace result.extend(item) with result.extend(flatten(item)).
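Applying the fix described above yields the recursive version:

```python
# Corrected version: recursing into sublists flattens arbitrary depth,
# where the original's extend(item) only removed one level of nesting.
def flatten(nested_list):
    """Flatten a nested list of arbitrary depth."""
    result = []
    for item in nested_list:
        if isinstance(item, list):
            result.extend(flatten(item))  # recurse into sublists
        else:
            result.append(item)
    return result
```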
Naming Hallucination (Name Confusion). The Java Stream API method is .filter(), not .filterBy(). The model likely conflated Java streams with a different framework's naming convention. Fix: replace .filterBy() with .filter().
Module 6 · Slide 26
Case Study: Hallucinations in Production
These are real-world cases where LLM hallucinations caused significant problems. They demonstrate why hallucination mitigation is not just an academic concern.
The Lawyer Who Cited Fake Cases
In 2023, a lawyer used ChatGPT to research legal precedents. The model generated citations to cases that sounded real but did not exist. The lawyer submitted these to a federal court, leading to sanctions. The hallucinated case names followed real naming conventions, making them appear legitimate.
Package Hallucination Attacks
Researchers discovered that LLMs consistently recommend npm and PyPI packages that do not exist. Attackers then register those package names with malicious code. When developers follow the LLM's recommendation and run pip install or npm install, they install malware. This is called "package confusion" or "slopsquatting."
API Version Confusion
LLMs trained on older documentation regularly generate TensorFlow 1.x code for users expecting TensorFlow 2.x. The code is syntactically valid Python but uses deprecated session-based execution instead of eager mode. This wastes developer time debugging version mismatches.
Security Vulnerability Injection
Studies show LLMs sometimes generate code with known vulnerability patterns (CVEs): buffer overflows in C, SQL injection in web apps, insecure deserialization in Java. The model has learned these patterns from training data that included vulnerable code alongside secure code.
Takeaway
These are not hypothetical scenarios — they have all happened. This is why we study hallucinations: the consequences range from embarrassment (fake legal citations) to security breaches (malicious packages, vulnerability injection). Every code output from an LLM should be treated as untrusted input that requires verification.
Module 6 · Slide 27
Building a Hallucination-Resistant Workflow
A practical framework you can apply to your course projects and beyond. This workflow layers multiple mitigation strategies for increasing levels of protection.
1
Use RAG to Ground Generation
Provide relevant documentation and code context in your prompts. The model is far less likely to hallucinate APIs when it has real docs in its context window.
2
Set Low Temperature for Production Code
Use T=0.0–0.2 for code you plan to ship. Save higher temperatures for brainstorming and exploration only.
3
Add Explicit Constraints in Prompts
Specify which libraries, versions, and patterns to use. Include negative constraints ("do NOT use deprecated APIs").
4
Run Linting + Type Checking
Automatically catch naming and resource hallucinations with static analysis tools (pylint, mypy, ESLint, TypeScript compiler).
5
Generate and Run Tests Automatically
Ask the LLM to generate tests for its own code, then execute them. Failing tests indicate hallucinations or logic errors.
6
Human Review for Critical Logic
Logic hallucinations cannot be caught by automated tools alone. Have a human review the algorithm design and edge case handling for critical code paths.
7
Monitor in Production
Log LLM outputs, track error rates, and set up alerts for unexpected failures. Production monitoring catches hallucinations that slip through earlier stages.
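The cheapest piece of step 4 (do the imported packages even resolve?) can be sketched without an external linter. This is an illustrative helper, not a substitute for pylint or mypy.

```python
# Cheap resource-hallucination check: flag imported top-level modules that
# cannot be resolved in the current environment. Illustrative only; a real
# workflow would run pylint/mypy on top of this.
import ast
import importlib.util

def find_missing_imports(code: str) -> list[str]:
    """Return imported module names that do not resolve locally."""
    missing = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing
```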
Minimum Viable Safety (Steps 1–4)
These four steps are the bare minimum for any project using LLM-generated code. They catch the majority of naming, resource, and API hallucinations with low effort. You should always do at least these steps.
Full Protection (All 7 Steps)
For production systems, critical infrastructure, or security-sensitive code, apply all seven steps. The additional cost of testing, human review, and monitoring is justified by the risk reduction.
Course Projects
You will apply this workflow in your course projects. At minimum, every LLM-generated code submission must include evidence of steps 1–4 (grounding, low temperature, constrained prompts, linting).
Module 6 · Slide 28
Open Challenges
Despite significant progress, several fundamental challenges remain in the study and mitigation of code hallucinations. These represent active areas of research and important open problems.
Challenge 01
Benchmark Coverage
Current benchmarks like CodeHaluEval focus on algorithmic tasks. Real-world hallucinations often involve framework-specific APIs, multi-file contexts, and system-level interactions that are poorly represented in existing benchmarks.
Challenge 02
Multi-Language Support
Most research focuses on Python and Java. Hallucination patterns in languages like Rust, Go, TypeScript, and Kotlin remain understudied. Each language's type system and ecosystem create unique hallucination profiles.
Challenge 03
Long-Context Hallucinations
As context windows grow, models can hallucinate consistency with earlier parts of a long file while introducing subtle contradictions. Detecting hallucinations across thousands of lines of code remains extremely difficult.
Challenge 04
Logic Hallucination Detection
Mapping, naming, and resource hallucinations can often be caught by static analysis. Logic hallucinations — where the code uses real APIs correctly but implements wrong algorithms — fundamentally require semantic understanding or comprehensive tests.
Challenge 05
Mitigation-Performance Tradeoff
Techniques like De-Hallucinator and MARIN add latency through retrieval and re-querying. Finding the right balance between hallucination reduction and generation speed is critical for practical adoption in IDE-integrated tools.
Challenge 06
Evolving API Landscapes
Libraries update frequently. An API that was correct when the model was trained may be deprecated or modified. Keeping knowledge bases current across thousands of libraries and versions is an ongoing infrastructure challenge.
Research Frontier
Future work includes combining static analysis with execution-based verification, developing language-specific hallucination profiles, creating adversarial benchmarks that specifically target model weaknesses, and integrating hallucination detection directly into code editors as real-time feedback. Lee et al. 2025, arXiv:2504.20799 — Survey of open challenges
Module 6 · Slide 29
Lab Challenge: Hallucination Hunt
Your assignment: systematically investigate LLM hallucinations in code generation across multiple tasks. This exercise combines detection, classification, and mitigation skills.
Assignment Overview
Use an LLM to generate code for each of the 5 tasks below. For each task: (a) identify any hallucinations in the output, (b) classify the type using CodeHalu or HalluCode taxonomies (intent-conflicting, context-conflicting, knowledge-conflicting, dead code, naming, resource, logic, mapping), (c) apply one mitigation strategy and regenerate, (d) compare before and after. Write a 2-page analysis of patterns you observed.
Task 1: Data Processing
"Write a Python function using pandas to read a CSV file, remove duplicate rows, normalize column names to snake_case, and export to Parquet format."
Task 2: Web API Client
"Write a Python class that wraps the GitHub REST API. Include methods to list repositories, create issues, and fetch pull request details. Handle rate limiting and authentication."
Task 3: Algorithm
"Implement a balanced BST (AVL tree) in Java with insert, delete, and search operations. Include proper rotation logic and height balancing."
Task 4: Database Query
"Write SQLAlchemy ORM models for a blog platform (Users, Posts, Comments, Tags with many-to-many). Include a function to query the 10 most commented posts in the last 30 days."
Task 5: Security
"Write a Node.js Express middleware for JWT authentication. Include token generation, validation, refresh token rotation, and proper error handling for expired/invalid tokens."
Deliverables
• Original LLM output for each of the 5 tasks
• Annotated hallucinations with taxonomy classification
• Mitigation strategy applied for each task
• Regenerated output after mitigation
• 2-page written analysis of patterns observed
Analysis Questions
• Which hallucination types were most common?
• Did certain task types produce more hallucinations?
• Which mitigation strategy was most effective?
• Were there hallucinations you only caught on second review?
• What patterns did you notice across all 5 tasks?
Module 6 · Slide 30 · Interactive
Knowledge Check
Test your understanding of the key concepts covered in this module. Click each question to reveal the answer.
What are the four categories in the CodeHalu taxonomy?
Mapping, Naming, Resource, and Logic hallucinations. Mapping hallucinations occur when the model misinterprets the task. Naming hallucinations involve incorrect identifiers or API names. Resource hallucinations reference non-existent dependencies or files. Logic hallucinations produce syntactically valid but semantically incorrect algorithms. (Tian et al., AAAI 2025)
How does De-Hallucinator differ from standard RAG approaches?
De-Hallucinator uses iterative grounding. Instead of a single retrieval step, it: (1) generates code, (2) extracts API calls from the generated code, (3) retrieves documentation for those specific APIs, and (4) re-queries the model with the documentation as context. This iterative process allows it to correct cascading hallucinations that a single RAG pass would miss. (Eghbali & Pradel 2024)
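Step (2) of that loop, extracting the API calls the model actually used, can be sketched with Python's ast module. The extraction below is a simplification for illustration, not the paper's implementation.

```python
# Simplified sketch of De-Hallucinator's call-extraction step: collect the
# dotted names of everything the generated code calls, so documentation can
# be retrieved for exactly those APIs.
import ast

def extract_api_calls(code: str) -> set[str]:
    """Collect dotted call targets like 'os.path.join' from Python code."""
    def dotted(node):
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Attribute):
            base = dotted(node.value)
            return f"{base}.{node.attr}" if base else None
        return None

    calls = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            name = dotted(node.func)
            if name:
                calls.add(name)
    return calls
```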
What is MARIN's key advantage over flat RAG for API hallucination?
Hierarchical dependency awareness. MARIN models the package → class → method hierarchy of available APIs. This ensures retrieved API documentation is structurally compatible with the project's actual dependencies, achieving a 67.52% decrease in MiHN and 73.56% decrease in MaHR compared to standard RAG. (FSE 2025)
Name the three types of conflicts detectable by static analysis (ISSTA 2025).
Dependency Conflicts, Environment Conflicts, and API Knowledge Conflicts. Dependency conflicts involve incompatible library versions. Environment conflicts involve wrong runtime assumptions (e.g., Python version). API knowledge conflicts involve non-existent or misused API calls. All three can be detected without executing the code. (ISSTA 2025)
Why are logic hallucinations the hardest to detect?
They are syntactically valid and use real APIs correctly. Unlike naming or resource hallucinations (which can be caught by static analysis or dependency checking), logic hallucinations implement wrong algorithms or miss edge cases. The code compiles, runs, and may even pass simple test cases — but produces incorrect results on certain inputs. Detection requires either comprehensive test suites or deep semantic analysis of the algorithm's intent. (Tian et al., AAAI 2025)
What distinguishes HalluCode's "Context Repetition" from other categories?
Context Repetition captures when the model loses track of its own output. It unnecessarily re-declares variables, duplicates code blocks, or repeats logic already present. While this is often harmless (the code still works), it signals that the model is generating tokens based on local patterns rather than maintaining a coherent understanding of the full function it has already produced.