Beyond Exact Matches: Metrics for Evaluating Technical NL & Code
When we evaluate AI systems for software engineering, we compare model predictions against reference answers. But an exact textual match is not always necessary for correctness. This module explores why overlap-based metrics can fail and what alternatives exist.
12 Metrics Covered · 3 Loss Functions · SIDE Final Framework · 7 Live Demos
Core Question
If two outputs mean the same thing, should a metric give them a low score just because they use different words?
Module 3 · Slide 02
The Semantic Equivalence Problem
Consider a Java method read() that opens a stream, reads contents, converts to string, and closes resources. Two summaries describe it:
Prediction (PR)
"Reads the contents of this source as a string."
Ground Truth (GT)
"Get the textual information from this source and represent it as a string."
BLEU: 0.21
Interactive: Word Overlap
PR tokens:
GT tokens:
Insight
Only 3 tokens overlap — yet the summaries are semantically equivalent.
Module 3 · Slide 03
Classification Metrics: Precision, Recall & F1
Before we turn to generative models, note that many SE tasks are classification problems. These are evaluated with standard classification metrics.
Precision
Of all items predicted positive, how many truly are? TP / (TP + FP)
Recall
Of all actual positives, how many did we find? TP / (TP + FN)
F1 Score
Harmonic mean of precision and recall. 2 · P · R / (P + R)
Accuracy
Overall correct predictions. Can be misleading with imbalanced data. (TP + TN) / Total
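The four formulas above translate directly into a few lines of Python. The counts below are illustrative, chosen to mimic an imbalanced dataset like vulnerability detection:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative imbalanced case: 1000 samples, only 10 actual positives.
p, r, f1, acc = classification_metrics(tp=8, fp=1, fn=2, tn=989)
print(f"P={p:.2f}  R={r:.2f}  F1={f1:.2f}  Acc={acc:.3f}")
# P=0.89  R=0.80  F1=0.84  Acc=0.997
```

Note how accuracy looks near-perfect while F1 tells a more honest story, which is exactly the imbalance trap described above.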
Why This Matters in SE
In SE tasks like vulnerability detection, classes are highly imbalanced — 95%+ of code is non-vulnerable. Accuracy alone is misleading; F1 and especially recall matter more.
Module 3 · Slide 04
Interactive: Confusion Matrix Explorer
Predicted +
Predicted −
Actual +
Actual −
Adjust values to explore how metrics change.
Computed Metrics
Precision
0.89
Recall
0.80
F1
0.84
Accuracy
0.99
Interpretation
The confusion matrix reveals WHERE a model fails. High FP = noisy alerts. High FN = missed bugs.
Module 3 · Slide 05
BLEU Score: The Workhorse Metric
BLEU (Bilingual Evaluation Understudy) measures how much the generated text's n-grams overlap with the reference.
Formula
BLEU = BP × exp(Σ w_n · log p_n)
p_n = modified n-gram precision
w_n = 1/N (uniform weights, typically N = 4)
BP = min(1, e^(1 − r/c)) (brevity penalty; r = reference length, c = candidate length)
Brevity Penalty
Without BP, a model could output just one confident n-gram and score perfectly on precision. The brevity penalty penalizes candidates shorter than the reference.
Interactive: BLEU Calculator
1-gram
0.00
2-gram
0.00
BP
0.00
BLEU
0.00
Module 3 · Slide 06
Worked Example: Computing BLEU Step by Step
Let us walk through the full BLEU computation for a code-generation scenario: reference "public int getMax(int[] arr)" vs. candidate "public int findMax(int[] array)".
1
Extract 1-grams: Ref = {public, int, getMax(int[], arr)} | Cand = {public, int, findMax(int[], array)}. Match: public, int → precision = 2/4 = 0.50
2
Extract 2-grams: Ref = {public int, int getMax(int[], getMax(int[] arr)} | Cand = {public int, int findMax(int[], findMax(int[] array)}. Match: public int → precision = 1/3 = 0.33
3
Brevity Penalty: |candidate| = 4, |reference| = 4. Since c = r → BP = min(1, e^(1−4/4)) = 1.00
4
BLEU (using 1- and 2-grams): BLEU = 1.00 × exp(0.5 × ln(0.50) + 0.5 × ln(0.33)) = 0.41. The variable rename alone dropped the score significantly.
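The computation above can be reproduced with a short, self-contained script. This is a minimal single-reference BLEU sketch (whitespace tokenization, uniform weights), not a replacement for standard implementations such as sacreBLEU:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ref, cand, n):
    ref_counts = Counter(ngrams(ref, n))
    cand_counts = Counter(ngrams(cand, n))
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(ref, cand, max_n=4):
    precisions = [modified_precision(ref, cand, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the log-average to -inf
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_avg)

ref = "public int getMax(int[] arr)".split()
cand = "public int findMax(int[] array)".split()
print(round(bleu(ref, cand, max_n=2), 2))  # 0.41, matching the worked example
```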
Module 3 · Slide 07
Interactive: Compute Your Own BLEU
Type or paste any code into the reference and candidate boxes. Watch n-gram precisions, brevity penalty, and the final BLEU score update in real time. Matching tokens are highlighted.
Matched reference tokens:
Matched candidate tokens:
1-gram P
—
2-gram P
—
3-gram P
—
4-gram P
—
BP
—
BLEU-4
—
Tip
Try changing just one variable name or operator. Notice how even small changes impact BLEU, yet the code may still be functionally identical.
Module 3 · Slide 08
ROUGE & METEOR: Beyond BLEU
BLEU measures precision — how much of the candidate appears in the reference. But what about recall? Different tasks need different emphasis.
ROUGE (Recall-Oriented)
ROUGE-1
Unigram recall: what fraction of reference unigrams appear in the candidate?
ROUGE-2
Bigram recall: captures some word-order information.
ROUGE-L
Longest common subsequence: rewards in-order overlap without requiring contiguity.
METEOR (Flexible Matching)
Key Features
Stemming: "running" matches "run"
Synonyms: "get" matches "retrieve"
Word order: penalizes fragmented matches
F-mean: combines precision and recall (recall-weighted)
BLEU vs ROUGE
BLEU = precision-oriented (how much of the candidate is correct?)
ROUGE = recall-oriented (how much of the reference is captured?)
For summarization, recall often matters more.
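The recall-oriented variants are easy to sketch in plain Python. These compute only the recall side of ROUGE-1 and ROUGE-L (the official metric also reports precision and an F-measure), with whitespace tokenization and made-up example summaries:

```python
from collections import Counter

def rouge_1_recall(ref, cand):
    """Fraction of reference unigrams that appear in the candidate (clipped counts)."""
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    overlap = sum(min(c, cand_counts[t]) for t, c in ref_counts.items())
    return overlap / len(ref)

def lcs_length(a, b):
    """Longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(ref, cand):
    """In-order (not necessarily contiguous) overlap, normalized by reference length."""
    return lcs_length(ref, cand) / len(ref)

ref = "returns the maximum value in the array".split()
cand = "returns the largest value of the array".split()
print(round(rouge_1_recall(ref, cand), 2))  # 0.71
print(round(rouge_l_recall(ref, cand), 2))  # 0.71
```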
Module 3 · Slide 09
Automated vs. Manual Evaluation
Every evaluation approach trades off between scalability and semantic sensitivity.
Automated Metrics
Quick, cheap, easy to apply at scale
Useful for benchmarking many predictions
May fail to capture semantic equivalence
Can over-reward surface similarity
VS
Manual Evaluation
Expert analyzes prediction vs. ground truth
Captures nuance, correctness, usefulness
Expensive in time and money ($$$)
Subject to annotator variability
Takeaway
Automated metrics are attractive because they are fast and inexpensive, but that convenience comes with a risk: surface-level similarity is not the same as semantic correctness.
Module 3 · Slide 10
Embeddings Primer: Vectors for Code
Before we use embedding-based metrics, we need to understand what embeddings are and why they matter.
What is an Embedding?
A dense vector representation where similar items are close in vector space. Unlike one-hot encodings (sparse, high-dimensional), embeddings are compact and capture semantic meaning.
How Are They Trained?
Neural networks learn embeddings by observing context. Words/tokens that appear in similar contexts get similar vectors. Models like Word2Vec, CodeBERT, and UniXcoder produce code-aware embeddings.
Why They Matter
Embeddings enable similarity computation: instead of comparing raw text, we compare vectors. Two semantically equivalent code snippets will have vectors that point in similar directions.
Conceptual 2D Embedding Space
Similar concepts cluster together. "for" and "while" (both loops) are close; "return" and "import" are far apart.
Key Insight
Embeddings transform the evaluation problem from string matching to geometric distance in a learned space.
Module 3 · Slide 11
Cosine Similarity Explained
The most common way to compare embeddings is cosine similarity — it measures the angle between two vectors, ignoring their magnitude.
Formula
cos(A, B) = (A · B) / (||A|| × ||B||)
Range: −1 (opposite) to +1 (identical direction); 0 = orthogonal (unrelated)
Two vectors pointing in the same direction have cosine = 1.0 regardless of their length.
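A plain-Python sketch makes the magnitude-invariance concrete; doubling a vector's length leaves the score unchanged:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
print(round(cosine_similarity(v, [2.0, 4.0, 6.0]), 6))  # 1.0: same direction, twice the length
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0: orthogonal
```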
Key Concept
Cosine similarity measures the angle between vectors, ignoring magnitude. It is the foundation of embedding-based metrics. Two code snippets with cosine similarity > 0.9 are likely semantically equivalent.
Module 3 · Slide 12
The Metric Landscape
Click each metric to learn what it measures, when it helps, and where it fails.
Code Metrics
Technical NL Metrics
Select a metric
Click any card above to see details.
Module 3 · Slide 13
Embedding-based Metrics
Instead of comparing surface tokens, encode each text into a vector and measure geometric closeness.
Sentence A (PR)
"Reads the contents of this source as a string."
Vector A
Sentence B (GT)
"Get the textual information from this source and represent it as a string."
Vector B
Key Point
The choice of encoder determines the similarity: a code-trained encoder gives code-specific embeddings; a NL-trained encoder gives NL embeddings; a bimodal model handles both.
Module 3 · Slide 14
CodeBLEU
CodeBLEU extends BLEU by recognizing that code should be evaluated not only as text, but also as structure and behavior.
N-gram Match 0.25
+
Weighted N-gram 0.25
+
AST Match 0.25
+
Data-flow Match 0.25
=
CodeBLEU
N-gram Match
Surface overlap of token sequences — classic BLEU approach.
Weighted N-gram
Emphasizes more informative tokens over boilerplate.
AST Match
Compares syntactic structure — balanced brackets, control flow, nesting.
Data-flow Match
Compares how values and dependencies move through the program.
Takeaway
CodeBLEU is a stronger fit for code because it treats code as more than just a string — it evaluates structure and behavior too.
Module 3 · Slide 15
CrystalBLEU
CrystalBLEU rewards correct yet novel outputs by reducing the influence of frequent boilerplate tokens.
The Problem
Code contains many common, repeated, low-information tokens (generic variable names, routine syntax). These inflate BLEU scores without reflecting meaningful similarity.
CrystalBLEU Solution
Discount the contribution of trivially frequent tokens so the score better reflects substantive similarity rather than boilerplate reuse.
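A toy sketch of the discounting idea. Note the simplification: real CrystalBLEU removes the k most frequent n-grams measured over a corpus, whereas here the "trivial" set is hand-picked and only unigram precision is shown:

```python
from collections import Counter

def unigram_precision(ref, cand, trivial=frozenset()):
    """Clipped unigram precision, optionally ignoring trivially frequent tokens."""
    ref_c = Counter(t for t in ref if t not in trivial)
    cand_c = Counter(t for t in cand if t not in trivial)
    clipped = sum(min(c, ref_c[t]) for t, c in cand_c.items())
    return clipped / max(sum(cand_c.values()), 1)

ref = "public int sum = a + b ;".split()
cand = "public int total = a * b ;".split()
print(unigram_precision(ref, cand))  # 0.75: inflated by boilerplate tokens
print(unigram_precision(ref, cand, trivial={"public", "int", "=", ";"}))  # 0.5
```

Once the boilerplate is discounted, the score drops to reflect only the substantive overlap.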
Interactive: Token Informativeness
Click tokens to classify them:
Result
Select tokens above to see how discounting boilerplate changes the similarity judgment.
Module 3 · Slide 16
pass@k: Evaluating Code Generation
For code generation, we do not need EVERY output to be correct — just ONE. pass@k measures the probability that at least one of k generated solutions passes all test cases.
Formula
pass@k = 1 − C(n−c, k) / C(n, k)
n = total samples generated
c = number that pass all tests
k = number of attempts allowed
pass@k captures a fundamental property of code generation: users often generate multiple candidates and pick the best one. A model that produces one correct solution out of ten is still useful.
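The formula translates directly into code using exact binomial coefficients (Python's math.comb). The sample counts below are illustrative:

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k draws (from n samples, c of which pass) is correct."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a passing one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples generated, 3 pass all tests:
print(round(pass_at_k(10, 3, 1), 2))  # 0.3  — single attempt
print(round(pass_at_k(10, 3, 5), 2))  # 0.92 — five attempts
```

The jump from 0.3 to 0.92 quantifies the "generate several candidates, pick the best" workflow described above.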
Module 3 · Slide 17
Contrastive Learning: Core Idea
If lexical metrics are not enough, how else might we learn what "similar meaning" looks like? The answer: shape the embedding space itself.
Goal
Learn an embedding space in which similar pairs stay close and dissimilar pairs are pushed far apart.
Why This Matters for SE
Code summarization produces many valid paraphrases. Instead of hand-crafting similarity rules, we can train a model to recognize semantic equivalence directly.
Green = positive pairs. Red = negative pairs. Points drift together/apart during training.
Module 3 · Slide 18
Contrastive Loss Functions
Three progressively more powerful objectives for shaping the embedding space.
Contrastive Loss
Pull similar pairs together, push dissimilar pairs apart up to a margin.
L = (1-y)·D² + y·max(0, m-D)²
y = similar (0) / dissimilar (1) D = distance, m = margin
Triplet Loss
Anchor should be closer to positive than to negative by margin m.
L = max(0, D(A,P) - D(A,N) + m)
A = anchor, P = positive, N = negative
N-pair Loss
Generalize to multiple negatives — distinguish the correct match from many wrong candidates.
L = -log(e^(a·p) / (e^(a·p) + Σ e^(a·n_i)))
Scales the contrastive idea from one wrong alternative to many: the positive competes against all negatives in a softmax.
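The first two objectives can be sketched in a few lines, using Euclidean distance and the slide's convention (y = 0 for similar pairs, 1 for dissimilar). The vectors are toy 2D points, for intuition only:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a, b, y, margin=1.0):
    # y = 0: similar pair (pull together); y = 1: dissimilar (push apart up to the margin)
    d = euclidean(a, b)
    return (1 - y) * d ** 2 + y * max(0.0, margin - d) ** 2

def triplet_loss(anchor, pos, neg, margin=1.0):
    # The positive should be closer to the anchor than the negative, by at least the margin.
    return max(0.0, euclidean(anchor, pos) - euclidean(anchor, neg) + margin)

a = [0.0, 0.0]
print(triplet_loss(a, pos=[1.0, 0.0], neg=[3.0, 0.0]))  # 0.0: gap already exceeds the margin
print(triplet_loss(a, pos=[2.0, 0.0], neg=[2.5, 0.0]))  # 0.5: negative is too close
```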
Module 3 · Slide 19
Interactive: Triplet Loss Explorer
Drag the Anchor (A), Positive (P), and Negative (N) points. Watch the loss update live.
Drag points to explore. The loss is zero when D(A,P) + margin ≤ D(A,N).
50
Loss
Drag points to begin.
Intuition
Triplet loss doesn't just say "similar" or "different" — it says the right match should be closer than the wrong one by a meaningful gap.
Module 3 · Slide 20
Hard Negatives & Batch Size
The quality of contrastive learning depends on how informative the training comparisons are.
Hard-Negative Mining
A hard negative is a negative sample very close to the anchor in embedding space — challenging to distinguish from a positive. These force the model to learn fine-grained semantic distinctions.
Classify These Negatives:
Anchor: "The dog is playing with the bone"
Positive: "The dog is enjoying his bone"
Large Batch Size
Larger batches expose the model to a more diverse set of negatives. That richer contrast helps learn more discriminative representations.
16
Negatives per Anchor
With batch size 16, each anchor sees up to 15 negatives per step.
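With in-batch negatives, every other example in the batch serves as a negative for the anchor. A minimal N-pair-style sketch (dot-product scores, one positive softmaxed against the negatives; the vectors are made up for illustration):

```python
import math

def n_pair_loss(anchor, positive, negatives):
    """Softmax cross-entropy: the positive must outscore every negative."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    pos = math.exp(dot(anchor, positive))
    total = pos + sum(math.exp(dot(anchor, n)) for n in negatives)
    return -math.log(pos / total)

anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # points away from the anchor
hard_negative = [0.8, -0.2]   # close to the anchor: more informative
print(round(n_pair_loss(anchor, positive, [easy_negative]), 3))
print(round(n_pair_loss(anchor, positive, [hard_negative]), 3))  # higher loss
```

The hard negative produces a larger loss, i.e., a stronger training signal — the intuition behind hard-negative mining.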
Module 3 · Slide 21
From Summary-to-Summary → Summary-to-Code
Traditional metrics compare prediction vs. reference summary. But a summary can sound fluent and still be wrong with respect to the code. A stronger metric should measure alignment with code semantics.
public ConnectionConsumer createConnectionConsumer(
    Destination dest, String selector,
    ServerSessionPool pool, int max) {
  return new ActiveMQConnectionConsumer(
      this, pool, dest, selector, max);
}
Good Summary
"Create a connection to the consumer."
Bad Summary
"Connect to the server and return the status."
SIDE — Summary alIgnment to coDe sEmantics
Instead of only comparing prediction vs. reference summary, SIDE learns a metric that measures whether the summary aligns with the meaning of the code itself, using contrastive learning on ~180K method-summary pairs from CodeXGLUE.
Code Method
↔
Summary
→
MPNet Encoder
→
SIDE Score
Result
Good summary: SIDE = 0.81
Bad summary: SIDE = 0.23
Module 3 · Slide 22
Human Evaluation: When Metrics Are Not Enough
Automated metrics have blind spots. Some qualities can only be assessed by human judgment.
Semantic Correctness
BLEU cannot check if code runs. A syntactically different but functionally equivalent solution may score poorly on all automated metrics.
Code Readability
Style, clarity, and naming conventions are subjective. No automated metric captures whether code is "clean" or "idiomatic" for a given language.
Intent Alignment
Does the code solve the RIGHT problem? A correct implementation of the wrong feature scores well on metrics but fails the user.
Human Evaluation Approaches
Likert scales: rate quality 1–5
A/B comparison: which output is better?
Think-aloud: experts narrate their judgment process
Challenges
Expensive: requires domain experts
Slow: cannot scale to thousands of examples
Subjective: inter-rater disagreement (measured by Cohen's Kappa)
Conclusion
No automated metric perfectly captures code quality. The best evaluations combine automated AND human judgment.
Module 3 · Slide 23
Metric Selection Guide
Different tasks demand different metrics. Using the wrong metric can lead to misleading conclusions about model quality.
Code Summarization
ROUGE + METEOR + Human Eval Recall matters: did the summary capture all key information?
Code Translation
CodeBLEU + Exact Match Structure preservation across languages requires AST-aware metrics.
Bug Fixing
Exact Match + Test Pass Rate The fix either works or it does not. Binary correctness dominates.
Code Review
Human Eval + SIDE Quality, relevance, and actionability require human judgment.
Documentation
METEOR + Embedding Sim Paraphrasing is common; synonym-aware and semantic metrics shine.
Rule of Thumb
The right metric depends on the task. Using the wrong metric can lead to misleading conclusions. When in doubt, use multiple metrics and include human evaluation.
Module 3 · Slide 24
Common Pitfalls in Evaluation
Even with the right metrics, evaluation can go wrong. Watch out for these traps.
Cherry-Picking Examples
Showing only the best outputs in papers or demos. This creates a misleadingly positive impression of model quality.
Fix: Report aggregate metrics over the full test set.
Wrong Baseline
Comparing to weak or outdated models to inflate relative improvement.
Fix: Compare against current state-of-the-art.
Overfitting to Benchmarks
Optimizing for metric scores rather than actual quality. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Fix: Validate with held-out data and human eval.
Ignoring Statistical Significance
Reporting single-run results without confidence intervals. Small improvements may be noise.
Fix: Run multiple seeds, report mean ± std.
Test Set Contamination
LLMs may have seen the test data during pre-training. Results on contaminated benchmarks are inflated and unreliable.
Fix: Use time-split data or create new benchmarks.
Module 3 · Slide 25
Evaluation in Practice: A Real Study
How does a published paper evaluate their approach? Here is a condensed example of a real evaluation pipeline.
1
Task: Code summarization — generate natural language descriptions of Java methods.
2
Dataset: CodeSearchNet (6 programming languages, ~2M code-comment pairs). Test split held out from training.
Human Evaluation: 3 expert annotators rated 200 random samples on a 5-point Likert scale. Annotators only preferred the new model 55% of the time (vs. 45% baseline).
5
Conclusion: Automated metrics showed large gains, but human evaluators found the improvement modest. Automated metrics are necessary but not sufficient.
Lesson
A 15% BLEU improvement does not mean a 15% improvement in quality. Always triangulate automated metrics with human judgment to understand the true magnitude of improvement.
Module 3 · Slide 26
Interactive: Multi-Metric Comparison
Compare how different metrics score the same prediction. Select a scenario:
Prediction
Ground Truth
Insight
Module 3 · Slide 27
Interactive: Metric Comparison Dashboard
See how the same generated code scores differently across metrics. Click each example to see the divergence.
Reference
Generated
Analysis
Module 3 · Slide 28
From Evaluation to Improvement
Evaluation is not the end — it is the beginning of improvement. Here is the feedback cycle.
Evaluate Model
→
Error Analysis
→
Identify Failure Patterns
→
Targeted Improvement
→
Re-evaluate
1
Error Analysis: Categorize failures — is the model failing on long methods? Nested logic? Specific APIs? Quantify each failure type.
2
Identify Patterns: Cluster errors by type. Are they syntactic? Semantic? Missing context? Understanding the pattern guides the fix.
3
Targeted Improvement: Fine-tune on weak areas, adjust prompting strategies (Module 5), augment training data, or change architecture.
4
Re-evaluate: Measure again with the same metrics. Did the targeted fix help? Did it hurt performance elsewhere? Track regression.
Connection
Prompting strategies (Module 5) can improve performance without retraining. Evaluation tells you WHERE prompting fails so you can improve the prompt.
Module 3 · Slide 29
Lab Challenge: Evaluate a Code Generator
Put your knowledge into practice with this structured assignment.
Assignment
Given outputs from two code generation models (Model A and Model B), compute BLEU, pass@k, and perform a mini human evaluation. Write a 1-page analysis comparing the models.
Analysis Template
1
Automated Metrics: Compute BLEU-4 and pass@1 for both models on the provided test set. Report means and standard deviations.
2
Human Evaluation: Rate 10 random outputs from each model on correctness (1-5), readability (1-5), and completeness (1-5).
3
Comparison Table: Side-by-side metrics for Model A vs. Model B. Include both automated and human scores.
4
Discussion: Where do automated metrics agree/disagree with human judgment? What does this tell you about metric selection?
Grading Criteria
Correctness of metric computation (40%)
Quality of human evaluation (30%)
Depth of analysis and discussion (30%)
Bonus
Compute ROUGE-L and an embedding similarity metric. Discuss how they compare to BLEU for your specific examples.
Module 3 · Slide 30
Key Takeaways
01 · Foundation
Exact Overlap ≠ Semantic Correctness
A model can be right in meaning and still look wrong to a token-overlap metric like BLEU.
02 · Tradeoffs
Automated: Fast but Shallow
Automated metrics scale well but can miss semantics. Manual evaluation is rich but expensive.
03 · Embeddings
Meaning Through Geometry
Embedding-based metrics capture semantic relatedness better than overlap, but depend on encoder quality.
04 · Code Needs More
CodeBLEU & CrystalBLEU
Code requires structure-aware evaluation. CrystalBLEU further reduces boilerplate distortion.
05 · Functional Testing
pass@k for Code Generation
For code generation, functional correctness via test execution is the gold standard. pass@k captures multi-sample evaluation.
06 · Contrastive Learning
Shape the Embedding Space
Contrastive learning arranges embeddings so similar items cluster and dissimilar items separate. Hard negatives are crucial.
07 · SIDE
Summary ↔ Code Alignment
The strongest evaluation aligns summaries with code semantics — not just with a reference sentence.
08 · Best Practice
Combine Multiple Metrics
No single metric tells the whole story. Use multiple automated metrics plus human evaluation for reliable conclusions.