Beyond Exact Matches: Metrics for Evaluating Technical NL & Code
When we evaluate AI systems for software engineering, we compare model predictions against reference answers. But an exact textual match is not always necessary for correctness. This module explores why overlap-based metrics can fail and what alternatives exist.
12 Metrics Covered · 3 Loss Functions · SIDE Final Framework · 7 Live Demos
Core Question
If two outputs mean the same thing, should a metric give them a low score just because they use different words?
Module 3 · Slide 02
The Semantic Equivalence Problem
Consider a Java method read() that opens a stream, reads contents, converts to string, and closes resources. Two summaries describe it:
Prediction (PR)
"Reads the contents of this source as a string."
Ground Truth (GT)
"Get the textual information from this source and represent it as a string."
BLEU: 0.21
Interactive: Word Overlap
PR tokens:
GT tokens:
Insight
Only 3 tokens overlap — yet the summaries are semantically equivalent.
Module 3 · Slide 03
Classification Metrics: Precision, Recall & F1
Before we turn to generative models, note that many SE tasks are classification problems. These are evaluated with standard classification metrics.
Precision
Of all items predicted positive, how many truly are? TP / (TP + FP)
Recall
Of all actual positives, how many did we find? TP / (TP + FN)
F1 Score
Harmonic mean of precision and recall. 2 · P · R / (P + R)
Accuracy
Overall correct predictions. Can be misleading with imbalanced data. (TP + TN) / Total
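The four formulas above translate directly into a few lines of Python. The counts below are illustrative, chosen to mimic an imbalanced dataset like vulnerability detection:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative imbalanced case: 1000 samples, only 10 actual positives.
p, r, f1, acc = classification_metrics(tp=8, fp=1, fn=2, tn=989)
print(f"P={p:.2f}  R={r:.2f}  F1={f1:.2f}  Acc={acc:.3f}")
# P=0.89  R=0.80  F1=0.84  Acc=0.997
```

Note how accuracy looks near-perfect while F1 tells a more honest story, which is exactly the imbalance trap described above.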
Why This Matters in SE
In SE tasks like vulnerability detection, classes are highly imbalanced — 95%+ of code is non-vulnerable. Accuracy alone is misleading; F1 and especially recall matter more.
Module 3 · Slide 04
Interactive: Confusion Matrix Explorer
Predicted +
Predicted −
Actual +
Actual −
Adjust values to explore how metrics change.
Computed Metrics
Precision
0.89
Recall
0.80
F1
0.84
Accuracy
0.99
Interpretation
The confusion matrix reveals WHERE a model fails. High FP = noisy alerts. High FN = missed bugs.
Module 3 · Slide 05
BLEU Score: The Workhorse Metric
BLEU (Bilingual Evaluation Understudy) measures how much the generated text's n-grams overlap with the reference.
Formula
BLEU = BP × exp(Σ w_n · log p_n)
p_n = modified n-gram precision
w_n = 1/N (uniform weights, typically N = 4)
BP = min(1, e^(1 − r/c)) (brevity penalty; r = reference length, c = candidate length)
Brevity Penalty
Without BP, a model could output just one confident n-gram and score perfectly on precision. The brevity penalty penalizes candidates shorter than the reference.
Interactive: BLEU Calculator
1-gram
0.00
2-gram
0.00
BP
0.00
BLEU
0.00
Module 3 · Slide 06
Worked Example: Computing BLEU Step by Step
Let us walk through the full BLEU computation for a code-generation scenario: reference "public int getMax(int[] arr)" vs. candidate "public int findMax(int[] array)".
1
Extract 1-grams: Ref = {public, int, getMax(int[], arr)} | Cand = {public, int, findMax(int[], array)}. Match: public, int → precision = 2/4 = 0.50
2
Extract 2-grams: Ref = {public int, int getMax(int[], getMax(int[] arr)} | Cand = {public int, int findMax(int[], findMax(int[] array)}. Match: public int → precision = 1/3 = 0.33
3
Brevity Penalty: |candidate| = 4, |reference| = 4. Since c = r → BP = min(1, e^(1−4/4)) = 1.00
4
BLEU (using 1- and 2-grams): BLEU = 1.00 × exp(0.5 × ln(0.50) + 0.5 × ln(0.33)) = 0.41. The variable rename alone dropped the score significantly.
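The computation above can be reproduced with a short, self-contained script. This is a minimal single-reference BLEU sketch (whitespace tokenization, uniform weights), not a replacement for standard implementations such as sacreBLEU:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ref, cand, n):
    ref_counts = Counter(ngrams(ref, n))
    cand_counts = Counter(ngrams(cand, n))
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(ref, cand, max_n=4):
    precisions = [modified_precision(ref, cand, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the log-average to -inf
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_avg)

ref = "public int getMax(int[] arr)".split()
cand = "public int findMax(int[] array)".split()
print(round(bleu(ref, cand, max_n=2), 2))  # 0.41, matching the worked example
```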
Module 3 · Slide 07
Interactive: Compute Your Own BLEU
Type or paste any code into the reference and candidate boxes. Watch n-gram precisions, brevity penalty, and the final BLEU score update in real time. Matching tokens are highlighted.
Matched reference tokens:
Matched candidate tokens:
1-gram P
—
2-gram P
—
3-gram P
—
4-gram P
—
BP
—
BLEU-4
—
Tip
Try changing just one variable name or operator. Notice how even small changes impact BLEU, yet the code may still be functionally identical.
Module 3 · Slide 08
ROUGE & METEOR: Beyond BLEU
BLEU measures precision — how much of the candidate appears in the reference. But what about recall? Different tasks need different emphasis.
ROUGE (Recall-Oriented)
ROUGE-1
Unigram recall: what fraction of reference unigrams appear in the candidate?
ROUGE-2
Bigram recall: captures some word-order information.
ROUGE-L
Longest common subsequence: rewards in-order overlap without requiring contiguity.
METEOR (Flexible Matching)
Key Features
Stemming: "running" matches "run"
Synonyms: "get" matches "retrieve"
Word order: penalizes fragmented matches
F-mean: combines precision and recall (recall-weighted)
BLEU vs ROUGE
BLEU = precision-oriented (how much of the candidate is correct?)
ROUGE = recall-oriented (how much of the reference is captured?)
For summarization, recall often matters more.
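The recall-oriented variants are easy to sketch in plain Python. These compute only the recall side of ROUGE-1 and ROUGE-L (the official metric also reports precision and an F-measure), with whitespace tokenization and made-up example summaries:

```python
from collections import Counter

def rouge_1_recall(ref, cand):
    """Fraction of reference unigrams that appear in the candidate (clipped counts)."""
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    overlap = sum(min(c, cand_counts[t]) for t, c in ref_counts.items())
    return overlap / len(ref)

def lcs_length(a, b):
    """Longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(ref, cand):
    """In-order (not necessarily contiguous) overlap, normalized by reference length."""
    return lcs_length(ref, cand) / len(ref)

ref = "returns the maximum value in the array".split()
cand = "returns the largest value of the array".split()
print(round(rouge_1_recall(ref, cand), 2))  # 0.71
print(round(rouge_l_recall(ref, cand), 2))  # 0.71
```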
Module 3 · Slide 09
Automated vs. Manual Evaluation
Every evaluation approach trades off between scalability and semantic sensitivity.
Automated Metrics
Quick, cheap, easy to apply at scale
Useful for benchmarking many predictions
May fail to capture semantic equivalence
Can over-reward surface similarity
VS
Manual Evaluation
Expert analyzes prediction vs. ground truth
Captures nuance, correctness, usefulness
Expensive in time and money ($$$)
Subject to annotator variability
Takeaway
Automated metrics are attractive because they are fast and inexpensive, but that convenience comes with a risk: surface-level similarity is not the same as semantic correctness.
Module 3 · Slide 10
Embeddings Primer: Vectors for Code
Before we use embedding-based metrics, we need to understand what embeddings are and why they matter.
What is an Embedding?
A dense vector representation where similar items are close in vector space. Unlike one-hot encodings (sparse, high-dimensional), embeddings are compact and capture semantic meaning.
How Are They Trained?
Neural networks learn embeddings by observing context. Words/tokens that appear in similar contexts get similar vectors. Models like Word2Vec, CodeBERT, and UniXcoder produce code-aware embeddings.
Why They Matter
Embeddings enable similarity computation: instead of comparing raw text, we compare vectors. Two semantically equivalent code snippets will have vectors that point in similar directions.
Conceptual 2D Embedding Space
Similar concepts cluster together. "for" and "while" (both loops) are close; "return" and "import" are far apart.
Key Insight
Embeddings transform the evaluation problem from string matching to geometric distance in a learned space.
Module 3 · Slide 11
Cosine Similarity Explained
The most common way to compare embeddings is cosine similarity — it measures the angle between two vectors, ignoring their magnitude.
Formula
cos(A, B) = (A · B) / (||A|| × ||B||)
Range: −1 (opposite) to +1 (identical direction); 0 = orthogonal (unrelated)
Two vectors pointing in the same direction have cosine = 1.0 regardless of their length.
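A plain-Python sketch makes the magnitude-invariance concrete; doubling a vector's length leaves the score unchanged:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
print(round(cosine_similarity(v, [2.0, 4.0, 6.0]), 6))  # 1.0: same direction, twice the length
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0: orthogonal
```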
Key Concept
Cosine similarity measures the angle between vectors, ignoring magnitude. It is the foundation of embedding-based metrics. Two code snippets with cosine similarity > 0.9 are likely semantically equivalent.
Module 3 · Slide 12
The Metric Landscape
Click each metric to learn what it measures, when it helps, and where it fails.
Code Metrics
Technical NL Metrics
Select a metric
Click any card above to see details.
Module 3 · Slide 13
Embedding-based Metrics
Instead of comparing surface tokens, encode each text into a vector and measure geometric closeness.
Sentence A (PR)
"Reads the contents of this source as a string."
Vector A
Sentence B (GT)
"Get the textual information from this source and represent it as a string."
Vector B
Key Point
The choice of encoder determines the similarity: a code-trained encoder gives code-specific embeddings; a NL-trained encoder gives NL embeddings; a bimodal model handles both.
Module 3 · Slide 14
CodeBLEU
CodeBLEU extends BLEU by recognizing that code should be evaluated not only as text, but also as structure and behavior.
N-gram Match 0.25
+
Weighted N-gram 0.25
+
AST Match 0.25
+
Data-flow Match 0.25
=
CodeBLEU
N-gram Match
Surface overlap of token sequences — classic BLEU approach.
Weighted N-gram
Emphasizes more informative tokens over boilerplate.
AST Match
Compares syntactic structure — balanced brackets, control flow, nesting.
Data-flow Match
Compares how values and dependencies move through the program.
Takeaway
CodeBLEU is a stronger fit for code because it treats code as more than just a string — it evaluates structure and behavior too.
Module 3 · Slide 15
CrystalBLEU
CrystalBLEU rewards correct yet novel outputs by reducing the influence of frequent boilerplate tokens.
The Problem
Code contains many common, repeated, low-information tokens (generic variable names, routine syntax). These inflate BLEU scores without reflecting meaningful similarity.
CrystalBLEU Solution
Discount the contribution of trivially frequent tokens so the score better reflects substantive similarity rather than boilerplate reuse.
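A toy sketch of the discounting idea. Note the simplification: real CrystalBLEU removes the k most frequent n-grams measured over a corpus, whereas here the "trivial" set is hand-picked and only unigram precision is shown:

```python
from collections import Counter

def unigram_precision(ref, cand, trivial=frozenset()):
    """Clipped unigram precision, optionally ignoring trivially frequent tokens."""
    ref_c = Counter(t for t in ref if t not in trivial)
    cand_c = Counter(t for t in cand if t not in trivial)
    clipped = sum(min(c, ref_c[t]) for t, c in cand_c.items())
    return clipped / max(sum(cand_c.values()), 1)

ref = "public int sum = a + b ;".split()
cand = "public int total = a * b ;".split()
print(unigram_precision(ref, cand))  # 0.75: inflated by boilerplate tokens
print(unigram_precision(ref, cand, trivial={"public", "int", "=", ";"}))  # 0.5
```

Once the boilerplate is discounted, the score drops to reflect only the substantive overlap.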
Interactive: Token Informativeness
Click tokens to classify them:
Result
Select tokens above to see how discounting boilerplate changes the similarity judgment.
Module 3 · Slide 16
pass@k: Evaluating Code Generation
For code generation, we do not need EVERY output to be correct — just ONE. pass@k measures the probability that at least one of k generated solutions passes all test cases.
Formula
pass@k = 1 − C(n−c, k) / C(n, k)
n = total samples generated
c = number that pass all tests
k = number of attempts allowed
pass@k captures a fundamental property of code generation: users often generate multiple candidates and pick the best one. A model that produces one correct solution out of ten is still useful.
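The formula translates directly into code using exact binomial coefficients (Python's math.comb). The sample counts below are illustrative:

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k draws (from n samples, c of which pass) is correct."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a passing one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples generated, 3 pass all tests:
print(round(pass_at_k(10, 3, 1), 2))  # 0.3  — single attempt
print(round(pass_at_k(10, 3, 5), 2))  # 0.92 — five attempts
```

The jump from 0.3 to 0.92 quantifies the "generate several candidates, pick the best" workflow described above.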
Module 3 · Slide 17
Contrastive Learning: Core Idea
If lexical metrics are not enough, how else might we learn what "similar meaning" looks like? The answer: shape the embedding space itself.
Goal
Learn an embedding space in which similar pairs stay close and dissimilar pairs are pushed far apart.
Why This Matters for SE
Code summarization produces many valid paraphrases. Instead of hand-crafting similarity rules, we can train a model to recognize semantic equivalence directly.
Green = positive pairs. Red = negative pairs. Points drift together/apart during training.
Module 3 · Slide 18
Contrastive Loss Functions
Three progressively more powerful objectives for shaping the embedding space.
Contrastive Loss
Pull similar pairs together, push dissimilar pairs apart up to a margin.
L = (1-y)·D² + y·max(0, m-D)²
y = similar (0) / dissimilar (1) D = distance, m = margin
Triplet Loss
Anchor should be closer to positive than to negative by margin m.
L = max(0, D(A,P) - D(A,N) + m)
A = anchor, P = positive, N = negative
N-pair Loss
Generalize to multiple negatives — distinguish the correct match from many wrong candidates.
L = -log(e^(a·p) / (e^(a·p) + Σ e^(a·n_i)))
Scales the contrastive idea from one wrong alternative to many: the positive competes against all negatives in a softmax.
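The first two objectives can be sketched in a few lines, using Euclidean distance and the slide's convention (y = 0 for similar pairs, 1 for dissimilar). The vectors are toy 2D points, for intuition only:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, y, margin=1.0):
    # y = 0: similar pair (pull together); y = 1: dissimilar (push apart up to the margin)
    d = euclidean(a, b)
    return (1 - y) * d ** 2 + y * max(0.0, margin - d) ** 2

def triplet_loss(anchor, pos, neg, margin=1.0):
    # The positive should be closer to the anchor than the negative, by at least the margin.
    return max(0.0, euclidean(anchor, pos) - euclidean(anchor, neg) + margin)

a = [0.0, 0.0]
print(triplet_loss(a, pos=[1.0, 0.0], neg=[3.0, 0.0]))  # 0.0: gap already exceeds the margin
print(triplet_loss(a, pos=[2.0, 0.0], neg=[2.5, 0.0]))  # 0.5: negative is too close
```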
Module 3 · Slide 19
Interactive: Triplet Loss Explorer
Drag the Anchor (A), Positive (P), and Negative (N) points. Watch the loss update live.
Drag points to explore. The loss is zero when D(A,P) + margin ≤ D(A,N).
50
Loss
Drag points to begin.
Intuition
Triplet loss doesn't just say "similar" or "different" — it says the right match should be closer than the wrong one by a meaningful gap.
Module 3 · Slide 20
Hard Negatives & Batch Size
The quality of contrastive learning depends on how informative the training comparisons are.
Hard-Negative Mining
A hard negative is a negative sample very close to the anchor in embedding space — challenging to distinguish from a positive. These force the model to learn fine-grained semantic distinctions.
Classify These Negatives:
Anchor: "The dog is playing with the bone"
Positive: "The dog is enjoying his bone"
Large Batch Size
Larger batches expose the model to a more diverse set of negatives. That richer contrast helps learn more discriminative representations.
16
Negatives per Anchor
With batch size 16, each anchor sees up to 15 negatives per step.
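With in-batch negatives, every other example in the batch serves as a negative for the anchor. A minimal N-pair-style sketch (dot-product scores, one positive softmaxed against the negatives; the vectors are made up for illustration):

```python
import math

def n_pair_loss(anchor, positive, negatives):
    """Softmax cross-entropy: the positive must outscore every negative."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    pos = math.exp(dot(anchor, positive))
    total = pos + sum(math.exp(dot(anchor, n)) for n in negatives)
    return -math.log(pos / total)

anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # points away from the anchor
hard_negative = [0.8, -0.2]   # close to the anchor: more informative
print(round(n_pair_loss(anchor, positive, [easy_negative]), 3))
print(round(n_pair_loss(anchor, positive, [hard_negative]), 3))  # higher loss
```

The hard negative produces a larger loss, i.e., a stronger training signal — the intuition behind hard-negative mining.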
Module 3 · Slide 21
From Summary-to-Summary → Summary-to-Code
Traditional metrics compare prediction vs. reference summary. But a summary can sound fluent and still be wrong with respect to the code. A stronger metric should measure alignment with code semantics.
public ConnectionConsumer createConnectionConsumer(
    Destination dest, String selector,
    ServerSessionPool pool, int max) {
  return new ActiveMQConnectionConsumer(
      this, pool, dest, selector, max);
}
Good Summary
"Create a connection to the consumer."
Bad Summary
"Connect to the server and return the status."
SIDE — Summary alIgnment to coDe sEmantics
Instead of only comparing prediction vs. reference summary, SIDE learns a metric that measures whether the summary aligns with the meaning of the code itself, using contrastive learning on ~180K method-summary pairs from CodeXGLUE.
Code Method
↔
Summary
→
MPNet Encoder
→
SIDE Score
Result
Good summary: SIDE = 0.81
Bad summary: SIDE = 0.23
Module 3 · Slide 22
Human Evaluation: When Metrics Are Not Enough
Automated metrics have blind spots. Some qualities can only be assessed by human judgment.
Semantic Correctness
BLEU cannot check if code runs. A syntactically different but functionally equivalent solution may score poorly on all automated metrics.
Code Readability
Style, clarity, and naming conventions are subjective. No automated metric captures whether code is "clean" or "idiomatic" for a given language.
Intent Alignment
Does the code solve the RIGHT problem? A correct implementation of the wrong feature scores well on metrics but fails the user.
Human Evaluation Approaches
Likert scales: rate quality 1–5
A/B comparison: which output is better?
Think-aloud: experts narrate their judgment process
Challenges
Expensive: requires domain experts
Slow: cannot scale to thousands of examples
Subjective: inter-rater disagreement (measured by Cohen's Kappa)
Conclusion
No automated metric perfectly captures code quality. The best evaluations combine automated AND human judgment.
Module 3 · Slide 23
Metric Selection Guide
Different tasks demand different metrics. Using the wrong metric can lead to misleading conclusions about model quality.
Code Summarization
ROUGE + METEOR + Human Eval Recall matters: did the summary capture all key information?
Code Translation
CodeBLEU + Exact Match Structure preservation across languages requires AST-aware metrics.
Bug Fixing
Exact Match + Test Pass Rate The fix either works or it does not. Binary correctness dominates.
Code Review
Human Eval + SIDE Quality, relevance, and actionability require human judgment.
Documentation
METEOR + Embedding Sim Paraphrasing is common; synonym-aware and semantic metrics shine.
Rule of Thumb
The right metric depends on the task. Using the wrong metric can lead to misleading conclusions. When in doubt, use multiple metrics and include human evaluation.
Module 3 · Slide 24
Common Pitfalls in Evaluation
Even with the right metrics, evaluation can go wrong. Watch out for these traps.
Cherry-Picking Examples
Showing only the best outputs in papers or demos. This creates a misleadingly positive impression of model quality.
Fix: Report aggregate metrics over the full test set.
Wrong Baseline
Comparing to weak or outdated models to inflate relative improvement.
Fix: Compare against current state-of-the-art.
Overfitting to Benchmarks
Optimizing for metric scores rather than actual quality. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Fix: Validate with held-out data and human eval.
Ignoring Statistical Significance
Reporting single-run results without confidence intervals. Small improvements may be noise.
Fix: Run multiple seeds, report mean ± std.
Test Set Contamination
LLMs may have seen the test data during pre-training. Results on contaminated benchmarks are inflated and unreliable.
Fix: Use time-split data or create new benchmarks.
Module 3 · Slide 25
Evaluation in Practice: A Real Study
How does a published paper evaluate their approach? Here is a condensed example of a real evaluation pipeline.
1
Task: Code summarization — generate natural language descriptions of Java methods.
2
Dataset: CodeSearchNet (6 programming languages, ~2M code-comment pairs). Test split held out from training.
Human Evaluation: 3 expert annotators rated 200 random samples on a 5-point Likert scale. Annotators only preferred the new model 55% of the time (vs. 45% baseline).
5
Conclusion: Automated metrics showed large gains, but human evaluators found the improvement modest. Automated metrics are necessary but not sufficient.
Lesson
A 15% BLEU improvement does not mean a 15% improvement in quality. Always triangulate automated metrics with human judgment to understand the true magnitude of improvement.
Module 3 · Slide 26
Interactive: Multi-Metric Comparison
Compare how different metrics score the same prediction. Select a scenario:
Prediction
Ground Truth
Insight
Module 3 · Slide 27
Interactive: Metric Comparison Dashboard
See how the same generated code scores differently across metrics. Click each example to see the divergence.
Reference
Generated
Analysis
Module 3 · Slide 28
From Evaluation to Improvement
Evaluation is not the end — it is the beginning of improvement. Here is the feedback cycle.
Evaluate Model
→
Error Analysis
→
Identify Failure Patterns
→
Targeted Improvement
→
Re-evaluate
1
Error Analysis: Categorize failures — is the model failing on long methods? Nested logic? Specific APIs? Quantify each failure type.
2
Identify Patterns: Cluster errors by type. Are they syntactic? Semantic? Missing context? Understanding the pattern guides the fix.
3
Targeted Improvement: Fine-tune on weak areas, adjust prompting strategies (Module 5), augment training data, or change architecture.
4
Re-evaluate: Measure again with the same metrics. Did the targeted fix help? Did it hurt performance elsewhere? Track regression.
Connection
Prompting strategies (Module 5) can improve performance without retraining. Evaluation tells you WHERE prompting fails so you can improve the prompt.
Module 3 · Slide 29
Lab Challenge: Evaluate a Code Generator
Put your knowledge into practice with this structured assignment.
Assignment
Given outputs from two code generation models (Model A and Model B), compute BLEU, pass@k, and perform a mini human evaluation. Write a 1-page analysis comparing the models.
Analysis Template
1
Automated Metrics: Compute BLEU-4 and pass@1 for both models on the provided test set. Report means and standard deviations.
2
Human Evaluation: Rate 10 random outputs from each model on correctness (1-5), readability (1-5), and completeness (1-5).
3
Comparison Table: Side-by-side metrics for Model A vs. Model B. Include both automated and human scores.
4
Discussion: Where do automated metrics agree/disagree with human judgment? What does this tell you about metric selection?
Grading Criteria
Correctness of metric computation (40%)
Quality of human evaluation (30%)
Depth of analysis and discussion (30%)
Bonus
Compute ROUGE-L and an embedding similarity metric. Discuss how they compare to BLEU for your specific examples.
Module 3 · Slide 30
Key Takeaways
01 · Foundation
Exact Overlap ≠ Semantic Correctness
A model can be right in meaning and still look wrong to a token-overlap metric like BLEU.
02 · Tradeoffs
Automated: Fast but Shallow
Automated metrics scale well but can miss semantics. Manual evaluation is rich but expensive.
03 · Embeddings
Meaning Through Geometry
Embedding-based metrics capture semantic relatedness better than overlap, but depend on encoder quality.
04 · Code Needs More
CodeBLEU & CrystalBLEU
Code requires structure-aware evaluation. CrystalBLEU further reduces boilerplate distortion.
05 · Functional Testing
pass@k for Code Generation
For code generation, functional correctness via test execution is the gold standard. pass@k captures multi-sample evaluation.
06 · Contrastive Learning
Shape the Embedding Space
Contrastive learning arranges embeddings so similar items cluster and dissimilar items separate. Hard negatives are crucial.
07 · SIDE
Summary ↔ Code Alignment
The strongest evaluation aligns summaries with code semantics — not just with a reference sentence.
08 · Best Practice
Combine Multiple Metrics
No single metric tells the whole story. Use multiple automated metrics plus human evaluation for reliable conclusions.