Evaluating AI-enabled SD Techniques · Module 3
Module 3 · Slide 01

Beyond Exact Matches:
Metrics for Evaluating Technical NL & Code

When we evaluate AI systems for software engineering, we compare model predictions against reference answers. But an exact textual match is not always necessary for correctness. This module explores why overlap-based metrics can fail and what alternatives exist.

12 Metrics Covered · 3 Loss Functions · SIDE Final Framework · 7 Live Demos
Core Question
If two outputs mean the same thing, should a metric give them a low score just because they use different words?
Module 3 · Slide 02

The Semantic Equivalence Problem

Consider a Java method read() that opens a stream, reads contents, converts to string, and closes resources. Two summaries describe it:

Prediction (PR)
"Reads the contents of this source as a string."
Ground Truth (GT)
"Get the textual information from this source and represent it as a string."
BLEU: 0.21

Interactive: Word Overlap

PR tokens:

GT tokens:

Insight
Only 3 tokens overlap — yet the summaries are semantically equivalent.
Module 3 · Slide 03

Classification Metrics: Precision, Recall & F1

Before we turn to generative models, note that many SE tasks are classification problems, evaluated with standard classification metrics.

Precision
Of all items predicted positive, how many truly are?
TP / (TP + FP)
Recall
Of all actual positives, how many did we find?
TP / (TP + FN)
F1 Score
Harmonic mean of precision and recall.
2 · P · R / (P + R)
Accuracy
Overall correct predictions. Can be misleading with imbalanced data.
(TP + TN) / Total
Why This Matters in SE
In SE tasks like vulnerability detection, classes are highly imbalanced — 95%+ of code is non-vulnerable. Accuracy alone is misleading; F1 and especially recall matter more.
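The four definitions above can be computed directly from confusion-matrix counts. A minimal sketch in Python (the counts below are illustrative, not from a real detector), showing how accuracy stays near-perfect on imbalanced data even while recall misses 20% of vulnerabilities:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Imbalanced example: 1000 samples, only 10 actual vulnerabilities.
p, r, f1, acc = classification_metrics(tp=8, fp=1, fn=2, tn=989)
print(f"P={p:.2f}  R={r:.2f}  F1={f1:.2f}  Acc={acc:.3f}")
# Accuracy is 0.997, yet 2 of 10 vulnerabilities were missed.
```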
Module 3 · Slide 04

Interactive: Confusion Matrix Explorer

Predicted +
Predicted −
Actual +
Actual −

Adjust values to explore how metrics change.

Computed Metrics

Precision
0.89
Recall
0.80
F1
0.84
Accuracy
0.99
Interpretation
The confusion matrix reveals WHERE a model fails. High FP = noisy alerts. High FN = missed bugs.
Module 3 · Slide 05

BLEU Score: The Workhorse Metric

BLEU (Bilingual Evaluation Understudy) measures how much the generated text's n-grams overlap with the reference.

Formula
BLEU = BP × exp(Σₙ wₙ · log pₙ)
pₙ = modified n-gram precision
wₙ = 1/N (uniform weight, typically N=4)
BP = min(1, e^(1 − r/c)) (brevity penalty; r = reference length, c = candidate length)
Brevity Penalty
Without BP, a model could output just one confident n-gram and score perfectly on precision. The brevity penalty penalizes candidates shorter than the reference.

Interactive: BLEU Calculator

1-gram
0.00
2-gram
0.00
BP
0.00
BLEU
0.00
Module 3 · Slide 06

Worked Example: Computing BLEU Step by Step

Let us walk through the full BLEU computation for a code-generation scenario.

Reference
public int getMax(int[] arr)
Candidate
public int findMax(int[] array)
1
Extract 1-grams: Ref = {public, int, getMax(int[], arr)} | Cand = {public, int, findMax(int[], array)}. Matches: public, int → precision = 2/4 = 0.50
2
Extract 2-grams: Ref = {public int, int getMax(int[], getMax(int[] arr)} | Cand = {public int, int findMax(int[], findMax(int[] array)}. Match: public int → precision = 1/3 = 0.33
3
Brevity Penalty: |candidate| = 4, |reference| = 4. Since c = r → BP = min(1, e^(1 − 4/4)) = e⁰ = 1.00
4
BLEU (using 1- and 2-gram): BLEU = 1.00 × exp(0.5 × ln(0.50) + 0.5 × ln(0.33)) = 0.41. The variable rename alone dropped the score significantly.
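The steps above can be reproduced with a short script — a sketch using whitespace tokenization, a single reference, and no smoothing, not a full BLEU implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ref, cand, n):
    """Clipped n-gram precision: candidate n-grams credited at most as often as they appear in the reference."""
    ref_counts = Counter(ngrams(ref, n))
    cand_counts = Counter(ngrams(cand, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(ref, cand, max_n=2):
    precisions = [modified_precision(ref, cand, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # any zero precision collapses the geometric mean
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "public int getMax(int[] arr)".split()
cand = "public int findMax(int[] array)".split()
print(round(bleu(ref, cand), 2))  # 0.41 — matches the worked example
```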
Module 3 · Slide 07

Interactive: Compute Your Own BLEU

Type or paste any code into the reference and candidate boxes. Watch n-gram precisions, brevity penalty, and the final BLEU score update in real time. Matching tokens are highlighted.

Matched reference tokens:

Matched candidate tokens:

1-gram P
2-gram P
3-gram P
4-gram P
BP
BLEU-4
Tip
Try changing just one variable name or operator. Notice how even small changes impact BLEU, yet the code may still be functionally identical.
Module 3 · Slide 08

ROUGE & METEOR: Beyond BLEU

BLEU measures precision — how much of the candidate appears in the reference. But what about recall? Different tasks need different emphasis.

ROUGE (Recall-Oriented)

ROUGE-1
Unigram recall: what fraction of reference unigrams appear in the candidate?
ROUGE-2
Bigram recall: captures some word-order information.
ROUGE-L
Longest common subsequence: rewards in-order overlap without requiring contiguity.

METEOR (Flexible Matching)

Key Features
Stemming: "running" matches "run"
Synonyms: "get" matches "retrieve"
Word order: penalizes fragmented matches
F-mean: combines precision and recall (recall-weighted)
BLEU vs ROUGE
BLEU = precision-oriented (how much of the candidate is correct?)
ROUGE = recall-oriented (how much of the reference is captured?)
For summarization, recall often matters more.
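The ROUGE variants above reduce to two small computations: clipped unigram recall and longest-common-subsequence length. A toy sketch (real ROUGE implementations add stemming, sentence splitting, and F-measure variants):

```python
from collections import Counter

def rouge_1_recall(ref, cand):
    """Fraction of reference unigrams (clipped counts) found in the candidate."""
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / len(ref)

def lcs_length(a, b):
    """Longest common subsequence length — the core of ROUGE-L."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

ref = "the cat sat on the mat".split()
cand = "the cat lay on the mat".split()
print(rouge_1_recall(ref, cand))  # 5/6 — only "sat" is missed
print(lcs_length(ref, cand))      # 5 tokens match in order without being contiguous
```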
Module 3 · Slide 09

Automated vs. Manual Evaluation

Every evaluation approach trades off between scalability and semantic sensitivity.

Automated Metrics

  • Quick, cheap, easy to apply at scale
  • Useful for benchmarking many predictions
  • May fail to capture semantic equivalence
  • Can over-reward surface similarity
VS

Manual Evaluation

  • Expert analyzes prediction vs. ground truth
  • Captures nuance, correctness, usefulness
  • Expensive in time and money ($$$)
  • Subject to annotator variability
Takeaway
Automated metrics are attractive because they are fast and inexpensive, but that convenience comes with a risk: surface-level similarity is not the same as semantic correctness.
Module 3 · Slide 10

Embeddings Primer: Vectors for Code

Before we use embedding-based metrics, we need to understand what embeddings are and why they matter.

What is an Embedding?
A dense vector representation where similar items are close in vector space. Unlike one-hot encodings (sparse, high-dimensional), embeddings are compact and capture semantic meaning.
How Are They Trained?
Neural networks learn embeddings by observing context. Words/tokens that appear in similar contexts get similar vectors. Models like Word2Vec, CodeBERT, and UniXcoder produce code-aware embeddings.
Why They Matter
Embeddings enable similarity computation: instead of comparing raw text, we compare vectors. Two semantically equivalent code snippets will have vectors that point in similar directions.

Conceptual 2D Embedding Space

Similar concepts cluster together. "for" and "while" (both loops) are close; "return" and "import" are far apart.

Key Insight
Embeddings transform the evaluation problem from string matching to geometric distance in a learned space.
Module 3 · Slide 11

Cosine Similarity Explained

The most common way to compare embeddings is cosine similarity — it measures the angle between two vectors, ignoring their magnitude.

Formula
cos(A, B) = (A · B) / (||A|| × ||B||)
Range: −1 (opposite) to +1 (identical direction)
0 = orthogonal (unrelated)

Worked Example

1
A = [1, 2, 3]   B = [2, 4, 6]
2
A · B = 1×2 + 2×4 + 3×6 = 2 + 8 + 18 = 28
3
||A|| = sqrt(1+4+9) = sqrt(14) ≈ 3.74
||B|| = sqrt(4+16+36) = sqrt(56) ≈ 7.48
4
cos(A,B) = 28 / (3.74 × 7.48) = 1.00 — identical direction (B = 2×A)

Two vectors pointing in the same direction have cosine = 1.0 regardless of their length.

Key Concept
Cosine similarity measures the angle between vectors, ignoring magnitude. It is the foundation of embedding-based metrics. Two code snippets with cosine similarity > 0.9 are likely semantically equivalent.
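The worked example is easy to verify in code — a pure-Python sketch of the formula:

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1, 2, 3], [2, 4, 6]), 6))  # 1.0 — B = 2×A, same direction
print(round(cosine_similarity([1, 0], [0, 1]), 6))        # 0.0 — orthogonal vectors
```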
Module 3 · Slide 12

The Metric Landscape

Click each metric to learn what it measures, when it helps, and where it fails.

Code Metrics

Technical NL Metrics

Select a metric
Click any card above to see details.
Module 3 · Slide 13

Embedding-based Metrics

Instead of comparing surface tokens, encode each text into a vector and measure geometric closeness.

Sentence A (PR)
"Reads the contents of this source as a string."
Vector A
Sentence B (GT)
"Get the textual information from this source and represent it as a string."
Vector B
Key Point
The choice of encoder determines the similarity: a code-trained encoder gives code-specific embeddings; a NL-trained encoder gives NL embeddings; a bimodal model handles both.
Module 3 · Slide 14

CodeBLEU

CodeBLEU extends BLEU by recognizing that code should be evaluated not only as text, but also as structure and behavior.

N-gram Match
0.25
+
Weighted N-gram
0.25
+
AST Match
0.25
+
Data-flow Match
0.25
=
CodeBLEU
N-gram Match
Surface overlap of token sequences — classic BLEU approach.
Weighted N-gram
Emphasizes more informative tokens over boilerplate.
AST Match
Compares syntactic structure — balanced brackets, control flow, nesting.
Data-flow Match
Compares how values and dependencies move through the program.
Takeaway
CodeBLEU is a stronger fit for code because it treats code as more than just a string — it evaluates structure and behavior too.
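The final score is a weighted sum of the four components. Computing the components themselves requires a parser and data-flow analysis, but the combination step is simple — a sketch with the slide's uniform weights and illustrative component scores:

```python
def code_bleu(ngram, weighted_ngram, ast_match, dataflow_match,
              weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four CodeBLEU components with the given weights."""
    components = (ngram, weighted_ngram, ast_match, dataflow_match)
    return sum(w * s for w, s in zip(weights, components))

# A candidate with renamed identifiers: surface overlap is low,
# but AST and data-flow components stay high.
print(code_bleu(0.41, 0.45, 0.95, 1.00))  # 0.7025
```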
Module 3 · Slide 15

CrystalBLEU

CrystalBLEU rewards correct yet novel outputs by reducing the influence of frequent boilerplate tokens.

The Problem
Code contains many common, repeated, low-information tokens (generic variable names, routine syntax). These inflate BLEU scores without reflecting meaningful similarity.
CrystalBLEU Solution
Discount the contribution of trivially frequent tokens so the score better reflects substantive similarity rather than boilerplate reuse.

Interactive: Token Informativeness

Click tokens to classify them:

Result
Select tokens above to see how discounting boilerplate changes the similarity judgment.
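A toy illustration of the discounting idea. Note this is a simplification: real CrystalBLEU identifies the k most frequent n-grams from a corpus and removes them across all n-gram orders, whereas here we hand-pick trivial unigrams:

```python
from collections import Counter

def unigram_precision(ref, cand, ignore=frozenset()):
    """Clipped unigram precision, optionally skipping trivially frequent tokens."""
    ref_f = [t for t in ref if t not in ignore]
    cand_f = [t for t in cand if t not in ignore]
    ref_counts = Counter(ref_f)
    matched = sum(min(n, ref_counts[t]) for t, n in Counter(cand_f).items())
    return matched / max(len(cand_f), 1)

ref = "public int max ( int a , int b ) { return a > b ? a : b ; }".split()
cand = "public int foo ( int x , int y ) { return 0 ; }".split()
trivial = {"public", "int", "(", ")", "{", "}", ";", ",", "return"}

print(round(unigram_precision(ref, cand), 2))           # 0.73 — inflated by boilerplate
print(round(unigram_precision(ref, cand, trivial), 2))  # 0.0  — no substantive overlap
```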
Module 3 · Slide 16

pass@k: Evaluating Code Generation

For code generation, we do not need EVERY output to be correct — just ONE. pass@k measures the probability that at least one of k generated solutions passes all test cases.

Formula
pass@k = 1 − C(n−c, k) / C(n, k)
n = total samples generated
c = number that pass all tests
k = number of attempts allowed
Example
Generate 100 solutions, 23 pass tests:
pass@1 ≈ 0.23 | pass@10 ≈ 0.94 | pass@100 = 1.00
Used by OpenAI Codex, HumanEval benchmark.

Interactive: pass@k Explorer

100
23
pass@1
pass@10
pass@100
Key Insight
pass@k captures a fundamental property of code generation: users often generate multiple candidates and pick the best one. A model that produces one correct solution out of ten is still useful.
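The formula can be computed directly with Python's `math.comb` — a sketch of the unbiased estimator:

```python
from math import comb

def pass_at_k(n, c, k):
    """P(at least one of k sampled solutions passes), given c of n samples pass all tests."""
    if n - c < k:  # fewer failing samples than attempts: a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 100 solutions generated, 23 pass the tests:
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(100, 23, k):.2f}")
```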
Module 3 · Slide 17

Contrastive Learning: Core Idea

If lexical metrics are not enough, how else might we learn what "similar meaning" looks like? The answer: shape the embedding space itself.

Goal
Learn an embedding space in which similar pairs stay close and dissimilar pairs are pushed far apart.
Why This Matters for SE
Code summarization produces many valid paraphrases. Instead of hand-crafting similarity rules, we can train a model to recognize semantic equivalence directly.

Green = positive pairs. Red = negative pairs. Points drift together/apart during training.

Module 3 · Slide 18

Contrastive Loss Functions

Three progressively more powerful objectives for shaping the embedding space.

Contrastive Loss
Pull similar pairs together, push dissimilar pairs apart up to a margin.
L = (1-y)·D² + y·max(0, m-D)²
y = similar (0) / dissimilar (1)
D = distance, m = margin
Triplet Loss
Anchor should be closer to positive than to negative by margin m.
L = max(0, D(A,P) - D(A,N) + m)
A = anchor, P = positive, N = negative
N-pair Loss
Generalize to multiple negatives — distinguish the correct match from many wrong candidates.
L = -log(e^(a·p) / (e^(a·p) + Σ e^(a·n_i)))
Scales contrastive idea from 1 wrong alternative to many.
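Minimal sketches of the three objectives on raw vectors — in practice the inputs are encoder embeddings, and the margins and example points here are illustrative:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, y, margin=1.0):
    """y = 0 for similar pairs, 1 for dissimilar (slide convention)."""
    d = euclidean(a, b)
    return (1 - y) * d ** 2 + y * max(0.0, margin - d) ** 2

def triplet_loss(anchor, pos, neg, margin=1.0):
    """Zero once the positive is closer than the negative by at least the margin."""
    return max(0.0, euclidean(anchor, pos) - euclidean(anchor, neg) + margin)

def n_pair_loss(anchor, pos, negs):
    """Softmax cross-entropy: identify the positive among many negatives."""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    scores = [dot(anchor, pos)] + [dot(anchor, n) for n in negs]
    m = max(scores)  # subtract max for numerical stability
    denom = sum(math.exp(s - m) for s in scores)
    return -math.log(math.exp(scores[0] - m) / denom)

print(triplet_loss([0, 0], [1, 0], [3, 0]))    # 0.0 — margin satisfied
print(triplet_loss([0, 0], [2, 0], [2.5, 0]))  # 0.5 — gap too small
```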
Module 3 · Slide 19

Interactive: Triplet Loss Explorer

Drag the Anchor (A), Positive (P), and Negative (N) points. Watch the loss update live.

Drag points to explore. The loss is zero when D(A,P) + margin < D(A,N).

50
Loss
Drag points to begin.
Intuition
Triplet loss doesn't just say "similar" or "different" — it says the right match should be closer than the wrong one by a meaningful gap.
Module 3 · Slide 20

Hard Negatives & Batch Size

The quality of contrastive learning depends on how informative the training comparisons are.

Hard-Negative Mining
A hard negative is a negative sample very close to the anchor in embedding space — challenging to distinguish from a positive. These force the model to learn fine-grained semantic distinctions.

Classify These Negatives:

Anchor: "The dog is playing with the bone"

Positive: "The dog is enjoying his bone"

Large Batch Size
Larger batches expose the model to a more diverse set of negatives. That richer contrast helps learn more discriminative representations.
16
Negatives per Anchor
With batch size 16, each anchor sees up to 15 negatives per step.
Module 3 · Slide 21

From Summary-to-Summary → Summary-to-Code

Traditional metrics compare prediction vs. reference summary. But a summary can sound fluent and still be wrong with respect to the code. A stronger metric should measure alignment with code semantics.

public ConnectionConsumer createConnectionConsumer(
        Destination dest, String selector,
        ServerSessionPool pool, int max) {
    return new ActiveMQConnectionConsumer(
        this, pool, dest, selector, max);
}
Good Summary
"Create a connection to the consumer."
Bad Summary
"Connect to the server and return the status."
SIDE — Summary alIgnment to coDe sEmantics
Instead of only comparing prediction vs. reference summary, SIDE learns a metric that measures whether the summary aligns with the meaning of the code itself, using contrastive learning on ~180K method-summary pairs from CodeXGLUE.
Code Method
Summary
MPNet Encoder
SIDE Score
Result
Good summary: SIDE: 0.81
Bad summary: SIDE: 0.23
Module 3 · Slide 22

Human Evaluation: When Metrics Are Not Enough

Automated metrics have blind spots. Some qualities can only be assessed by human judgment.

Semantic Correctness

BLEU cannot check if code runs. A syntactically different but functionally equivalent solution may score poorly on all automated metrics.

Code Readability

Style, clarity, and naming conventions are subjective. No automated metric captures whether code is "clean" or "idiomatic" for a given language.

Intent Alignment

Does the code solve the RIGHT problem? A correct implementation of the wrong feature scores well on metrics but fails the user.

Human Evaluation Approaches
Likert scales: rate quality 1-5
A/B comparison: which output is better?
Think-aloud: experts narrate their judgment process
Challenges
Expensive: requires domain experts
Slow: cannot scale to thousands of examples
Subjective: inter-rater disagreement (measured by Cohen's Kappa)
Conclusion
No automated metric perfectly captures code quality. The best evaluations combine automated AND human judgment.
Module 3 · Slide 23

Metric Selection Guide

Different tasks demand different metrics. Using the wrong metric can lead to misleading conclusions about model quality.

Code Generation
pass@k + BLEU
Functional correctness is paramount. pass@k tests execution; BLEU adds surface comparison.
Code Summarization
ROUGE + METEOR + Human Eval
Recall matters: did the summary capture all key information?
Code Translation
CodeBLEU + Exact Match
Structure preservation across languages requires AST-aware metrics.
Bug Fixing
Exact Match + Test Pass Rate
The fix either works or it does not. Binary correctness dominates.
Code Review
Human Eval + SIDE
Quality, relevance, and actionability require human judgment.
Documentation
METEOR + Embedding Sim
Paraphrasing is common; synonym-aware and semantic metrics shine.
Rule of Thumb
The right metric depends on the task. Using the wrong metric can lead to misleading conclusions. When in doubt, use multiple metrics and include human evaluation.
Module 3 · Slide 24

Common Pitfalls in Evaluation

Even with the right metrics, evaluation can go wrong. Watch out for these traps.

Cherry-Picking Examples
Showing only the best outputs in papers or demos. This creates a misleadingly positive impression of model quality.
Fix: Report aggregate metrics over the full test set.
Wrong Baseline
Comparing to weak or outdated models to inflate relative improvement.
Fix: Compare against current state-of-the-art.
Overfitting to Benchmarks
Optimizing for metric scores rather than actual quality. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Fix: Validate with held-out data and human eval.
Ignoring Statistical Significance
Reporting single-run results without confidence intervals. Small improvements may be noise.
Fix: Run multiple seeds, report mean ± std.
Test Set Contamination
LLMs may have seen the test data during pre-training. Results on contaminated benchmarks are inflated and unreliable.
Fix: Use time-split data or create new benchmarks.
Module 3 · Slide 25

Evaluation in Practice: A Real Study

How does a published paper evaluate their approach? Here is a condensed example of a real evaluation pipeline.

1
Task: Code summarization — generate natural language descriptions of Java methods.
2
Dataset: CodeSearchNet (6 programming languages, ~2M code-comment pairs). Test split held out from training.
3
Automated Metrics: BLEU improved 15% over baseline, ROUGE-L improved 12%, METEOR improved 8%.
4
Human Evaluation: 3 expert annotators rated 200 random samples on a 5-point Likert scale. Annotators only preferred the new model 55% of the time (vs. 45% baseline).
5
Conclusion: Automated metrics showed large gains, but human evaluators found the improvement modest. Automated metrics are necessary but not sufficient.
Lesson
A 15% BLEU improvement does not mean a 15% improvement in quality. Always triangulate automated metrics with human judgment to understand the true magnitude of improvement.
Module 3 · Slide 26

Interactive: Multi-Metric Comparison

Compare how different metrics score the same prediction. Select a scenario:

Prediction
Ground Truth
Insight
Module 3 · Slide 27

Interactive: Metric Comparison Dashboard

See how the same generated code scores differently across metrics. Click each example to see the divergence.

Reference
Generated
Analysis
Module 3 · Slide 28

From Evaluation to Improvement

Evaluation is not the end — it is the beginning of improvement. Here is the feedback cycle.

Evaluate Model
Error Analysis
Identify Failure Patterns
Targeted Improvement
Re-evaluate
1
Error Analysis: Categorize failures — is the model failing on long methods? Nested logic? Specific APIs? Quantify each failure type.
2
Identify Patterns: Cluster errors by type. Are they syntactic? Semantic? Missing context? Understanding the pattern guides the fix.
3
Targeted Improvement: Fine-tune on weak areas, adjust prompting strategies (Module 5), augment training data, or change architecture.
4
Re-evaluate: Measure again with the same metrics. Did the targeted fix help? Did it hurt performance elsewhere? Track regression.
Connection
Prompting strategies (Module 5) can improve performance without retraining. Evaluation tells you WHERE prompting fails so you can improve the prompt.
Module 3 · Slide 29

Lab Challenge: Evaluate a Code Generator

Put your knowledge into practice with this structured assignment.

Assignment
Given outputs from two code generation models (Model A and Model B), compute BLEU, pass@k, and perform a mini human evaluation. Write a 1-page analysis comparing the models.

Analysis Template

1
Automated Metrics: Compute BLEU-4 and pass@1 for both models on the provided test set. Report means and standard deviations.
2
Human Evaluation: Rate 10 random outputs from each model on correctness (1-5), readability (1-5), and completeness (1-5).
3
Comparison Table: Side-by-side metrics for Model A vs. Model B. Include both automated and human scores.
4
Discussion: Where do automated metrics agree/disagree with human judgment? What does this tell you about metric selection?
Grading Criteria
Correctness of metric computation (40%)
Quality of human evaluation (30%)
Depth of analysis and discussion (30%)
Bonus
Compute ROUGE-L and an embedding similarity metric. Discuss how they compare to BLEU for your specific examples.
Module 3 · Slide 30

Key Takeaways

01 · Foundation

Exact Overlap ≠ Semantic Correctness

A model can be right in meaning and still look wrong to a token-overlap metric like BLEU.

02 · Tradeoffs

Automated: Fast but Shallow

Automated metrics scale well but can miss semantics. Manual evaluation is rich but expensive.

03 · Embeddings

Meaning Through Geometry

Embedding-based metrics capture semantic relatedness better than overlap, but depend on encoder quality.

04 · Code Needs More

CodeBLEU & CrystalBLEU

Code requires structure-aware evaluation. CrystalBLEU further reduces boilerplate distortion.

05 · Functional Testing

pass@k for Code Generation

For code generation, functional correctness via test execution is the gold standard. pass@k captures multi-sample evaluation.

06 · Contrastive Learning

Shape the Embedding Space

Contrastive learning arranges embeddings so similar items cluster and dissimilar items separate. Hard negatives are crucial.

07 · SIDE

Summary ↔ Code Alignment

The strongest evaluation aligns summaries with code semantics — not just with a reference sentence.

08 · Best Practice

Combine Multiple Metrics

No single metric tells the whole story. Use multiple automated metrics plus human evaluation for reliable conclusions.

Overlap Metrics
Embedding Similarity
Contrastive Learning
SIDE: Code-Semantic Alignment
🎉

Module Complete!

You've finished Evaluation Metrics. Great work!