A Probabilistic Language Model (PLM) is a statistical approach to modeling the likelihood of words or tokens occurring in a sequence. It estimates how likely a given word or sequence of words is to appear in a particular context, based on statistical patterns observed in large corpora.
Core Goal
Estimate P(word | context) and P(sequence) — the probability of the next word given previous context, and the probability of an entire sequence of words.
Learning Objectives
By the end of this module you will be able to:
1. Apply conditional probabilities and the chain rule
2. Explain n-gram models and the Markov assumption
3. Compute and interpret perplexity
4. Describe smoothing, interpolation, and backoff
5. Apply sampling from language models
6. Explain why code is “natural” (Hindle et al.)
7. Compare n-gram LMs to neural LMs
PLM
Probabilistic Language Model — a statistical model that assigns probabilities to sequences of words/tokens based on patterns observed in training data.
Corpus
A large collection of text (or code) used to train the language model. The model learns statistical patterns from this data.
Context Window
The number of preceding tokens the model considers when predicting the next token. Larger windows capture more context but require more data.
Module 2 · Slide 02
Probability Refresher
A quick visual recap of the probability concepts you will need throughout this module.
Sample Space & Events
Sample Space (S): the set of all possible outcomes.
Example — rolling a die: S = {1, 2, 3, 4, 5, 6}
Event A = “roll an even number” = {2, 4, 6}
P(A) = |A| / |S| = 3 / 6 = 0.5
Probability of an Event
P(A) = Number of favorable outcomes / Total outcomes
Conditional Probability
P(A|B) = P(A ∩ B) / P(B)
Example: a bag has 3 red and 2 blue marbles. You draw one marble and it is red (event B). What is the probability the next draw is also red (event A)?
P(A|B) = 2/4 = 0.5 (since one red was removed)
Readiness Check
If you are comfortable with P(A|B), you are ready for this module. If not, review your probability notes — these concepts are the foundation of ALL language models.
Module 2 · Slide 03
Conditional Probabilities & the Chain Rule
To estimate the probability of an entire sentence, we need two building blocks: conditional probability and the chain rule.
The chain rule decomposes a joint probability into a product of conditional probabilities. This means we can compute the probability of any sequence by multiplying together the probability of each word given all the words that came before it.
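As a tiny illustration, the chain rule can be coded directly. The conditional probabilities below are made-up numbers for the sketch, not estimates from a real corpus:

```python
# Hypothetical conditional probabilities (invented for illustration).
cond_probs = {
    ("its",): 0.05,                  # P(its)
    ("its", "water"): 0.10,          # P(water | its)
    ("its", "water", "is"): 0.40,    # P(is | its water)
}

def sequence_prob(cond_probs, sentence):
    """Chain rule: multiply P(w_i | w_1..w_{i-1}) over the sentence."""
    p = 1.0
    for i in range(len(sentence)):
        p *= cond_probs[tuple(sentence[: i + 1])]
    return p

p = sequence_prob(cond_probs, ["its", "water", "is"])
print(round(p, 4))   # 0.05 * 0.10 * 0.40 = 0.002
```

The product shrinks quickly as the sentence grows — which is why implementations usually sum log probabilities instead of multiplying raw ones.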
Module 2 · Slide 04
Computing Probability of a Sentence
Let us apply the chain rule step by step to compute the probability of a concrete sentence.
Step-Through: P("its water is so transparent") interactive
P(its water is so transparent)
= P(its)
× P(water | its)
× P(is | its water)
× P(so | its water is)
× P(transparent | its water is so)
The Problem
To compute P(the | its water is so transparent that), we need:
Count(its water is so transparent that the) / Count(its water is so transparent that)
Exact long sequences are vanishingly rare in any corpus. Even huge corpora will never contain most long word sequences.
Why is it impractical to estimate P(word | very long context) using raw counts?
The number of possible sequences grows exponentially with length. Most long sequences will never appear in any finite corpus, so the counts will be zero. Even sequences that do appear will have counts too small to give reliable probability estimates.
Module 2 · Slide 05
Maximum Likelihood Estimation (MLE)
When we estimate probabilities by counting, we are implicitly using Maximum Likelihood Estimation — the simplest and most intuitive way to learn a language model.
1
Count occurrences
For each event (e.g., bigram), count how many times it appears in the training data.
2
Divide by total observations
Normalize by the total count of the context to get a probability.
3
That is your probability estimate!
The MLE estimate maximizes the likelihood of the observed training data.
MLE Formula
PMLE(w | context) = Count(context, w) / Count(context)
Worked Example
In a Java corpus, suppose: Count(“(”, “for”) = 100 — “for” appears 100 times after “(” Count(“(”) = 500 — “(” appears 500 times total
PMLE(“for” | “(”) = 100 / 500 = 0.20
Key Insight
MLE is the simplest estimator. It works well with lots of data but fails for rare events — if an n-gram was never observed, its MLE probability is zero. This is called the zero-count problem.
Limitation
MLE assigns P = 0 to any event not seen in training. In language modeling, this means a single unseen bigram makes the probability of an entire sentence zero!
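A minimal sketch of MLE bigram estimation on an invented token stream. Note that any bigram absent from the training counts simply has no entry — the zero-count problem described above:

```python
from collections import Counter, defaultdict

def mle_bigram(tokens):
    """MLE bigram model: count pairs, then normalize each context row."""
    counts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    return {w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
            for w1, ctr in counts.items()}

# Invented stream: "(" is followed by "for" twice and by "int" once.
tokens = ["(", "for", "(", "int", "(", "for"]
probs = mle_bigram(tokens)
print(round(probs["("]["for"], 3))   # 2 of the 3 "("-contexts continue with "for"
```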
Module 2 · Slide 06
The Data Sparsity Problem: Why N-grams?
To compute exact sequence probabilities, we would need enough data to observe every possible word combination. The number of combinations is astronomical.
Combinatorics
C(n, r) = n! / (r!(n−r)!)
1,000
Vocabulary Size
8.25 × 10^12
5-word Combos
Example
With just 1,000 unique words, the number of possible 5-word combinations is:
C(1000, 5) ≈ 8.25 × 10^12
No corpus is large enough to observe all of these!
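The figure above can be checked with the standard library (math.comb computes C(n, r)); ordered 5-token sequences are even more numerous:

```python
import math

V = 1000
combos = math.comb(V, 5)    # 5-word combinations from a 1,000-word vocabulary
print(f"{combos:,}")        # 8,250,291,250,200  (~8.25e12)
print(f"{V ** 5:.0e}")      # ordered 5-token sequences: 1e+15
```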
Data Sparsity
The number of possible word sequences grows exponentially with sequence length. Even with enormous corpora, most sequences will never be observed. This makes direct counting impractical for long contexts.
The Solution
Instead of conditioning on the entire history, approximate by conditioning on only the last few words. This is the Markov assumption, and it gives us n-gram models.
1
Full history is impractical
P(wi | w1...wi-1) requires too much data
2
Truncate the context
Approximate with only the last k words
3
N-gram model
P(wi | wi-k...wi-1) is tractable to estimate
Module 2 · Slide 07
The Markov Assumption & N-gram Models
The key idea: approximate the full conditional probability by looking at only the last N-1 tokens.
P(the | its water is so transparent that) ≈ P(the | that) — using a bigram model
Model
Approximation
Unigram
P(w1...wn) ≈ Π P(wi) — each word independent
Bigram
P(wi|w1...wi-1) ≈ P(wi|wi-1) — previous word only
Trigram
P(wi|w1...wi-1) ≈ P(wi|wi-2,wi-1)
N-gram
Condition on the last N−1 tokens (3-grams, 4-grams, 5-grams...)
Unigram
Context = 0 words. Each word is treated independently. Simplest but least accurate.
Bigram
Context = 1 word. Predicts based on the immediately preceding token.
Trigram
Context = 2 words. Captures short-range dependencies like “for (int”.
N-gram
General form. Higher N = more context but exponentially more data needed.
What is the tradeoff between using higher-order n-grams (e.g., 5-grams vs bigrams)?
Higher-order n-grams capture more context, leading to more informed predictions. However, the number of possible n-grams grows exponentially, causing severe data sparsity — most n-grams will never appear in training data, making their estimated probability zero or unreliable.
Module 2 · Slide 08
Start & End Tokens: Sentence Boundaries
Language models need to know where sequences begin and end. Special boundary tokens make this possible.
Raw Sequence
int x = 5 ;
With Boundary Markers
<s> int x = 5 ; </s>
Why <s>?
Without <s>, the model cannot learn which tokens start sequences. The bigram P(int | <s>) captures that int is a common way to begin a statement.
Why </s>?
Without </s>, the model cannot learn when to stop generating. P(</s> | ;) tells the model that a semicolon often ends a statement.
Every modern language model needs to know where sequences begin and end. These special tokens make that possible — from simple n-gram models to large Transformers.
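Boundary padding is a one-liner. A minimal sketch, using a made-up two-statement corpus:

```python
from collections import Counter

def pad(seq, n=2):
    """Add boundary markers for an n-gram model (n-1 start tokens)."""
    return ["<s>"] * (n - 1) + seq + ["</s>"]

# Hypothetical two-statement corpus.
corpus = [["int", "x", "=", "5", ";"], ["int", "y", ";"]]
bigrams = Counter()
for seq in corpus:
    padded = pad(seq)
    bigrams.update(zip(padded, padded[1:]))

print(bigrams[("<s>", "int")])   # 2 — both statements start with int
print(bigrams[(";", "</s>")])    # 2 — both end with a semicolon
```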
Module 2 · Slide 09
Zipf's Law: The Long Tail
Word frequencies in natural language (and code) follow a power-law distribution known as Zipf's Law: a small number of words occur very frequently, while a large number occur rarely.
Zipf Distribution visualization
(Axis: most frequent → least frequent)
20%
of Words
80%
of Occurrences
Pareto's Law (80/20 Rule)
20% of words account for 80% of all occurrences. In code, tokens like ;, (, ), {, }, = dominate, while most identifiers appear very few times.
Impact on N-grams
Zipf's law means that most n-grams in our vocabulary will be extremely rare or never observed, making probability estimation challenging. This is the fundamental reason we need smoothing techniques.
In Code
High frequency: ; = ( ) return if this
Low frequency: processUserData, calcTaxRate, getEmbeddedIPv4
Most custom identifiers fall in the “long tail.”
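The 80/20 pattern falls out of the Zipf assumption itself. A sketch with an idealized distribution (frequency of the rank-r word proportional to 1/r, vocabulary size chosen arbitrarily):

```python
# Idealized Zipf distribution over a 1,000-word vocabulary.
V = 1000
freqs = [1 / r for r in range(1, V + 1)]
head = sum(freqs[: V // 5]) / sum(freqs)   # mass held by the top 20% of words
print(round(head, 2))   # ~0.79 — close to the 80/20 rule
```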
Module 2 · Slide 10
The Naturalness of Software
Hindle et al. (ICSE 2012) made a groundbreaking discovery: programming languages, in theory, are complex and flexible, but real programs are mostly simple and repetitive, with predictable statistical properties that can be captured by language models.
Key Insight — Hindle et al., ICSE 2012
Programming languages are theoretically complex, flexible, and powerful. But the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models.
Natural Language
Rich vocabulary, diverse phrasing
Many ways to express the same idea
Ambiguity is common and accepted
Follows statistical patterns
Repetitive structures (grammar)
vs
Source Code
Constrained by strict syntax rules
Coding conventions enforce patterns
Repetitive idioms (for-loops, if-checks)
Also follows statistical patterns
Even MORE predictable than English!
Why This Matters
Because code is “natural” — repetitive and predictable — statistical language models like n-grams can be effectively applied to software engineering tasks: code completion, bug detection, code recommendation, and more. This insight is what makes PLMs useful for SE.
Module 2 · Slide 11
Training a Bigram Model for Code
Training an n-gram model is simply counting. Split your corpus into training (80%) and test (20%) sets. Count all bigram occurrences, then normalize to get probabilities.
public static Inet4Address getEmbeddedIPv4CA(Inet6Address ip) {
    ip1 = new IP("127.0.0.1");
    if (isCompatIPv4Address(ip)) {
        return getCompatIPv4Address(ip);
    }
    ...
}
Count Matrix
Build a table where rows = preceding token, columns = following token, and each cell = the count of that bigram in the training corpus. Then normalize each row to get probabilities.
Click any cell to see the bigram probability computation.
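The count-and-normalize procedure can be sketched in a few lines. The token stream below is a hypothetical for-loop header, standing in for a real training corpus:

```python
from collections import Counter, defaultdict

# Hypothetical training stream: tokens of a for-loop header.
train = ["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"]

counts = defaultdict(Counter)        # rows = preceding token, cols = next token
for w1, w2 in zip(train, train[1:]):
    counts[w1][w2] += 1

# Normalize the row for context "i" into probabilities.
row_total = sum(counts["i"].values())
probs = {w: c / row_total for w, c in counts["i"].items()}
print(probs)   # "i" is followed by "=", "<", "++" once each: 1/3 apiece
```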
Module 2 · Slide 12
Interactive: Bigram Probability Calculator
Select a code snippet and see how bigram counts and probabilities are computed from the token sequence.
Bigram Calculator interactive
Module 2 · Slide 13
Interactive: Code Completion with N-grams
See how a language model predicts the next token given a partial code snippet. Toggle between unigram, bigram, and trigram to see how more context improves predictions.
Code Completion Simulator interactive
Module 2 · Slide 14
Pros & Cons of N-gram Language Models
N-gram models are a foundational approach with clear strengths and known limitations.
Pros
No need for specialized hardware (GPU/TPU)
CPU-based computation is sufficient
Easy to implement — just counting and dividing
Fast to train and query
Interpretable: inspect the count tables directly
Well-understood mathematical foundations
vs
Cons
Long-distance relationships cannot be captured
Fixed context window (cannot see beyond N-1 tokens)
Data sparsity grows exponentially with N
No semantic understanding of tokens
Cannot generalize to unseen n-grams
Open vocabulary problem with code identifiers
Key Takeaway
N-gram models are powerful for local patterns but fundamentally limited by their fixed context window. They are excellent baselines and remain useful in practice, but complex code understanding requires models that can look further back.
Module 2 · Slide 15
Long-Range Dependencies in Code
Code often has dependencies that span many tokens. A variable declared at the top of a method may be used at the bottom — far beyond any n-gram window.
public static Inet4Address getEmbeddedIPv4ClientAddress(Inet6Address ip) {
    ip1 = new IP("127.0.0.1");
    if (isCompatIPv4Address(ip)) {
        return getCompatIPv4Address(ip);
    }
    if (is6to4Address(ip)) {
        return get6to4IPv4Address(ip);
    }
    if (isTeredoAddress(ip)) {
        return getTeredoInfo(ip).getClient();
    }
    return ip1.getDefault();
}
The Problem
ip1 is defined at line 2 but used at the last line — many tokens apart. A bigram or trigram model would have no knowledge of ip1 when predicting the final return statement.
Neural Networks to the Rescue
Neural networks (RNNs, Transformers) can easily handle longer context windows. They maintain hidden state or attention mechanisms that allow information from distant tokens to influence predictions.
In the code above, which dependency would a bigram model miss?
The dependency between ip1 on line 2 and ip1 on the last line. These are separated by many tokens (all the if-statements and return statements). A bigram model only sees the previous token, so when it reaches the final return, it has no memory of ip1 being declared earlier.
Module 2 · Slide 16
Model Evaluation: Extrinsic Evaluation
How do we know if a language model is good? One approach: test it on an actual downstream task and measure its performance.
1
Choose a Task
Pick a downstream task: code completion, code generation, bug detection, etc.
2
Run Both Models
Apply Model A and Model B to the same task with the same test data.
3
Measure Accuracy
How many tokens correctly predicted? How many mispredicted? Compare metrics.
4
Compare Results
The model with better task performance is the better language model for that task.
Model A
→
Code Completion downstream task
→
Accuracy: 72%
Model B
→
Code Completion downstream task
→
Accuracy: 85%
Drawback
Extrinsic evaluation is time-consuming. Running a full downstream task evaluation can take days or even weeks. We need a faster way to compare models during development.
Gold Standard
Extrinsic evaluation is the gold standard because it directly measures what we care about. But it is expensive — which motivates intrinsic metrics like perplexity.
Module 2 · Slide 17
Intrinsic Evaluation: Perplexity
Perplexity measures model quality independent of any application. Lower perplexity = better model (more confident, less “surprised” by the test data).
Perplexity (General)
PP(W) = P(w1w2...wN)^(−1/N)
= (Π i=1..N 1/P(wi | w1...wi-1))^(1/N)
For Bigrams
PP(W) = (Π i=1..N 1/P(wi | wi-1))^(1/N)
Comparison
If Model A has perplexity 20 and Model B has perplexity 5, Model B is much more confident in its predictions. It is as if Model B is choosing among only 5 equally likely options at each step, while Model A is choosing among 20.
Caveat
Perplexity provides a bad approximation unless the test data looks like the training data. It is generally only useful in pilot experiments for quick model comparison.
Warning: Zero Probabilities
If any P(wi|context) = 0, then 1/0 = infinity, and perplexity goes to infinity. A single unseen n-gram breaks the entire metric! This is why we need smoothing.
Module 2 · Slide 18
Worked Example: Computing Perplexity
Let us walk through a full numerical example before using the interactive calculator.
Sequence (4 tokens)
for ( int i
Bigram probabilities:
P(for | <s>) = 0.15
P( ( | for ) = 0.85
P( int | ( ) = 0.40
P( i | int ) = 0.30
1
Multiply probabilities
0.15 × 0.85 × 0.40 × 0.30 = 0.0153
2
Compute log2 of each
-2.74 + (-0.23) + (-1.32) + (-1.74) = -6.03
3
Average (negative)
-(-6.03) / 4 = 1.508 (this is cross-entropy H)
4
Perplexity = 2^H
2^1.508 = 2.84
Interpretation
A perplexity of 2.84 means the model is, on average, as uncertain as choosing between ~3 equally likely next tokens. Lower is better — a perfect model would have perplexity 1.
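The steps above can be reproduced in a few lines of Python, using the same four bigram probabilities:

```python
import math

# Bigram probabilities from the worked example above.
probs = [0.15, 0.85, 0.40, 0.30]   # for, (, int, i

cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)
perplexity = 2 ** cross_entropy

print(round(cross_entropy, 3))   # 1.508
print(round(perplexity, 2))      # 2.84
```

Working in log space also avoids the numerical underflow that multiplying many small probabilities would cause on longer sequences.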
Module 2 · Slide 19
Cross-Entropy: The Information Theory Connection
Cross-entropy connects language modeling to information theory. It measures how well our model Q approximates the true distribution P.
Cross-Entropy
H(P, Q) = −Σ P(x) log2 Q(x)
In Language Modeling
P = the true distribution (actual next tokens) Q = our model's predicted distribution
In practice, we approximate with the test set: H ≈ -(1/N) Σ log2 Q(wi | context)
Key Relationship
Perplexity = 2^H(P,Q)
Why Log Base 2?
We measure information in bits. One bit of entropy means one binary choice. Log base 2 gives us units of bits — the standard in information theory.
Worked Examples
If our model assigns P = 0.5 to every next token:
H = −log2(0.5) = 1 bit, Perplexity = 2^1 = 2
If our model assigns P = 0.25:
H = −log2(0.25) = 2 bits, Perplexity = 2^2 = 4
Higher entropy = more uncertainty = higher perplexity.
Intuition
Cross-entropy measures the average number of bits needed to encode the true data using our model. A better model needs fewer bits.
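The two worked examples above amount to one tiny function — the "surprise" of a token in bits, and the perplexity it implies:

```python
import math

def bits(p):
    """Surprise of a token assigned probability p, in bits."""
    return -math.log2(p)

print(bits(0.5), 2 ** bits(0.5))    # 1.0 bit  -> perplexity 2
print(bits(0.25), 2 ** bits(0.25))  # 2.0 bits -> perplexity 4
```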
Module 2 · Slide 20
Interactive: Perplexity Calculator
Adjust the bigram probabilities for each token and see how cross-entropy and perplexity change in real time. Try setting one probability to zero!
Perplexity Calculator interactive
Sequence: for ( int i = 0
Module 2 · Slide 21
Laplace & Add-k Smoothing
Zero-count n-grams break the model. Smoothing techniques redistribute probability mass to handle unseen events.
The Problem
If count(wi-1, wi) = 0, then P(wi|wi-1) = 0. A single zero probability makes the entire sequence probability zero and perplexity infinite.
Laplace (Add-1) Smoothing
Add 1 to all n-gram counts: P(wi|wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V)
where V = vocabulary size. Simple but crude — steals too much mass from observed events.
Add-k Smoothing
Generalize Laplace by adding k instead of 1: P(wi|wi-1) = (count(wi-1, wi) + k) / (count(wi-1) + kV) k is a hyperparameter — optimize it on a held-out set.
Laplace Smoothing Example
Bigram     | Raw Count | +1 Count | Raw P           | Smoothed P
for (      | 18        | 19       | 18/100 = 0.180  | 19/110 = 0.173
for return | 0         | 1        | 0/100 = 0.000   | 1/110 = 0.009
Assume count(for) = 100, V = 10
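The smoothed values in the example can be reproduced with a one-line estimator:

```python
def add_k_prob(bigram_count, context_count, V, k=1.0):
    """Add-k smoothed P(w2|w1); k = 1 gives Laplace smoothing."""
    return (bigram_count + k) / (context_count + k * V)

# Numbers from the example: count(for) = 100, V = 10.
print(round(add_k_prob(18, 100, V=10), 3))  # "for (":      19/110 ≈ 0.173
print(round(add_k_prob(0, 100, V=10), 3))   # "for return":  1/110 ≈ 0.009
```

Note how the unseen bigram now gets a small nonzero probability, paid for by shaving mass off the observed one (0.180 → 0.173).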
Why is Laplace smoothing considered too crude for large vocabularies?
It adds 1 to every possible bigram count, and adds V to the denominator. When V is very large (common in code), the total mass redistributed to unseen events becomes enormous, stealing too much probability from actually observed n-grams. Add-k and interpolation are better alternatives.
Module 2 · Slide 22
Interpolation & Backoff
Instead of relying on a single n-gram order, blend multiple models together for more robust probability estimates.
Linear interpolation combines estimates of different orders:
PInterp(wi | wi-2 wi-1) = λ1 P(wi) + λ2 P(wi | wi-1) + λ3 P(wi | wi-2 wi-1), where λ1 + λ2 + λ3 = 1.
To set the λs, use a held-out validation set (separate from train and test). Try different λ combinations and pick those that maximize the likelihood of the held-out data.
Interpolation blends multiple models. When the trigram has data, trust it; when it does not, the unigram and bigram components provide a safety net that prevents zero probabilities.
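A minimal sketch of the blend; the λ values and the three component estimates below are hypothetical, not tuned on real data:

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram/bigram/trigram estimates.
    The lambdas (hypothetical values here) must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# The trigram estimate is zero (unseen), yet the result stays nonzero:
p = interpolate(p_uni=0.01, p_bi=0.40, p_tri=0.0)
print(round(p, 3))   # 0.1*0.01 + 0.3*0.40 + 0.6*0.0 = 0.121
```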
Module 2 · Slide 23
Stupid Backoff & Sampling
Two more practical techniques for handling unseen n-grams and generating text from language models.
Stupid Backoff
If the n-gram count is zero, back off to the (n-1)-gram:
1. Try P(wi | wi-2, wi-1) — trigram
2. If zero, try P(wi | wi-1) — bigram
3. If still zero, use P(wi) — unigram
Simple and effective in practice. Not strictly a probability distribution unless adjusted, but works well for large-scale systems.
Trigram
→
Bigram
→
Unigram
back off if count = 0
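The cascade above can be sketched with count dictionaries. The counts are invented, and the discount factor α = 0.4 is a commonly used value; as the slide notes, the resulting scores are not normalized probabilities:

```python
def stupid_backoff(w, context, tri, bi, uni, alpha=0.4):
    """Score(w | context): fall back to shorter contexts,
    discounting each backoff step by alpha. Scores are unnormalized."""
    if tri.get((*context, w), 0) > 0:
        return tri[(*context, w)] / bi[context]
    if bi.get((context[-1], w), 0) > 0:
        return alpha * bi[(context[-1], w)] / uni[context[-1]]
    return alpha * alpha * uni.get(w, 0) / sum(uni.values())

# Hypothetical counts from a tiny corpus:
uni = {"for": 10, "(": 12, "int": 5}
bi = {("for", "("): 8, ("(", "int"): 4}
tri = {("for", "(", "int"): 3}

s1 = stupid_backoff("int", ("for", "("), tri, bi, uni)  # trigram seen: 3/8
s2 = stupid_backoff("int", ("if", "("), tri, bi, uni)   # backs off: 0.4 * 4/12
print(s1, s2)
```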
Sampling from Language Models
Instead of always picking the most likely next token, choose randomly according to the probability distribution:
High-probability tokens: more likely to be sampled Low-probability tokens: less likely to be sampled
This generates more diverse and creative output while still following learned patterns.
Why Sample?
Always picking the top prediction (greedy decoding) produces repetitive and boring output. Sampling introduces variety — the model is more likely to generate high-probability sentences but can occasionally produce surprising and novel completions.
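Sampling from a next-token distribution is a standard-library call. The distribution below is a hypothetical P(next token | "("):

```python
import random

def sample_next(dist, rng=random):
    """Draw the next token with probability proportional to its weight."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

dist = {"int": 0.5, "float": 0.3, "char": 0.2}   # hypothetical P(w | "(")
random.seed(0)
samples = [sample_next(dist) for _ in range(1000)]
print(samples.count("int") / 1000)   # close to 0.5, not exactly 0.5
```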
Module 2 · Slide 24
Temperature & Sampling Strategies
Temperature controls the “sharpness” of the probability distribution when sampling from a language model. Combined with top-k and top-p, it gives fine-grained control over generation.
Temperature (T)
T = 0 (greedy): always pick the highest-probability token
T = 1: sample from the actual learned distribution
T > 1: flatten the distribution — more random/creative
T < 1: sharpen the distribution — more focused/deterministic
Top-k Sampling
Only consider the k most probable tokens. Discard the rest and renormalize. Example: top-5 keeps only the 5 best candidates.
Top-p / Nucleus Sampling
Choose the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9). Adapts to the shape of the distribution — keeps fewer tokens when the model is confident.
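One common way to apply temperature to an already-normalized distribution is to raise each probability to the power 1/T and renormalize — equivalent to dividing the logits by T before a softmax. A sketch with made-up probabilities:

```python
def apply_temperature(probs, T):
    """Rescale a normalized distribution: p_i ** (1/T), renormalized.
    Equivalent to dividing the logits by T before a softmax."""
    scaled = [p ** (1.0 / T) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.6, 0.3, 0.1]
sharp = apply_temperature(probs, 0.5)   # T < 1: the top token gains mass
flat = apply_temperature(probs, 2.0)    # T > 1: the distribution flattens
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```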
Temperature Visualizer interactive
Module 2 · Slide 25
Out-of-Vocabulary & the <UNK> Token
How do n-gram models handle tokens that were never seen during training? The open vocabulary problem is a fundamental challenge.
1
The Problem
New code uses identifiers not in training data: myCustomHandler, calcNetRevenue. These have zero probability.
2
Solution: <UNK> Token
Replace rare tokens (count < threshold) with a special <UNK> token before training.
3
Train with <UNK>
The model learns “rare token behavior” — what typically follows or precedes an uncommon identifier.
4
Map Unseen to <UNK>
At test time, any token not in the vocabulary is replaced with <UNK> and gets a nonzero probability.
The Tradeoff
Too aggressive (high threshold): many tokens become <UNK>, losing valuable information.
Too lenient (low threshold): vocabulary stays large, probabilities remain sparse. Finding the right threshold requires experimentation.
Modern Solution
The open vocabulary problem is why modern LLMs use Byte-Pair Encoding (BPE) instead of word-level tokenization. BPE breaks words into subword units, so even unseen identifiers can be represented as sequences of known subwords.
Example
Vocabulary = {for, int, if, return, ;, (, ), ...}
Input: calcNetRevenue → not in vocab → mapped to <UNK>
P(<UNK> | return) is now a valid, nonzero probability.
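The vocabulary-thresholding scheme can be sketched in a few lines; the corpus and the min_count threshold below are invented for illustration:

```python
from collections import Counter

def build_vocab(corpus, min_count=2):
    """Tokens seen fewer than min_count times are replaced by <UNK>."""
    counts = Counter(t for seq in corpus for t in seq)
    return {t for t, c in counts.items() if c >= min_count} | {"<UNK>"}

def map_unk(seq, vocab):
    return [t if t in vocab else "<UNK>" for t in seq]

# Hypothetical training corpus; "y" is too rare to keep.
corpus = [["return", "x", ";"], ["return", "y", ";"], ["return", "x", ";"]]
vocab = build_vocab(corpus)

print(map_unk(["return", "calcNetRevenue", ";"], vocab))
# ['return', '<UNK>', ';']
```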
Module 2 · Slide 26
Large Language Models vs N-grams
LLMs use neural networks rather than explicit n-gram counts. They address two fundamental problems that limit n-gram models.
Problem 1: Parameter Explosion
The number of parameters in an n-gram model grows exponentially as the n-gram order increases. A 5-gram model with vocabulary V requires up to V^5 parameters. This quickly becomes intractable.
Problem 2: No Generalization
N-grams cannot generalize from training to test unless the exact same words are used. Seeing “the cat sat” teaches nothing about “the dog sat” because the model treats every word as an independent symbol.
N-gram LMs
Count-based statistics
Fixed context window
No semantic understanding
Cannot generalize
Fast, interpretable
vs
Neural LMs
Learned representations
Long-range context
Semantic embeddings
Generalizes to new data
GPU-intensive, less interpretable
N-grams
→
FFNNs
→
RNNs
→
Transformers
→
LLMs
This is where Module 4 picks up...
Module 2 · Slide 27
Lab Challenge: Build Your Own Bigram Model
Put it all together! Using the Java dataset from Module 1, build a bigram language model in Python.
1
Load & Tokenize
Load Java methods from the corpus and tokenize them into sequences of tokens.
2
Build Bigram Count Matrix
Count all consecutive token pairs across the training set.
3
Implement Add-k Smoothing
Add k to all counts to handle unseen bigrams. Experiment with k values.
4
Compute Perplexity
Evaluate your model on a held-out test set using the perplexity metric.
5
Generate Code
Use sampling to generate a simple code sequence from your trained model.
Starter Code Skeleton
import collections, math, random

def build_bigram_counts(corpus):
    """Count all bigrams in tokenized corpus."""
    counts = collections.defaultdict(collections.Counter)
    for seq in corpus:
        for i in range(len(seq) - 1):
            counts[seq[i]][seq[i + 1]] += 1
    return counts

def bigram_prob(w1, w2, counts, V, k=1):
    """P(w2|w1) with add-k smoothing; V is the vocabulary size."""
    num = counts[w1][w2] + k
    den = sum(counts[w1].values()) + k * V
    return num / den

def perplexity(test, counts, V, k=1):
    """Compute perplexity on test set."""
    # Your implementation here
Module 2 · Slide 28
Recap & Knowledge Check
01
PLMs, MLE & Chain Rule
PLMs assign probabilities via MLE counting. The chain rule decomposes joint probability into conditionals.
02
N-gram Models
Use the Markov property with boundary tokens (<s>, </s>) to estimate probabilities from a fixed window.
03
Evaluation
Extrinsic evaluation measures downstream-task performance; perplexity and cross-entropy measure model quality intrinsically.
04
Smoothing & Backoff
Laplace, add-k, linear interpolation, and stupid backoff address the zero-count problem.
05
Sampling & Temperature
Temperature, top-k, and top-p control generation diversity. OOV tokens are handled via <UNK> or BPE.
06
Naturalness & Beyond
Code is “natural” (Hindle et al.). LLMs overcome n-gram limitations using deep learning and embeddings.
Recap Quiz
Q1: What does the chain rule allow us to compute?
The joint probability of a sequence by decomposing it into a product of conditional probabilities. — Correct
The average number of bits needed to encode a token, independent of the model. — Incorrect
The optimal vocabulary size for an n-gram model trained on source code. — Incorrect
Q2: Why is code more predictable than English text?
Because programming languages have fewer words than English. — Incorrect
Because code follows strict syntax rules, coding conventions, and repetitive idioms that create predictable patterns. — Correct
Because code is always written by a single person, reducing variability. — Incorrect
Q3: What happens to perplexity when a model assigns zero probability to a token?
Perplexity drops to zero, indicating a perfect model. — Incorrect
Perplexity remains unchanged because zero-probability tokens are ignored. — Incorrect
Perplexity goes to infinity because we divide by zero in the computation. — Correct
Q4: What are the two key limitations of n-gram models that LLMs overcome?
(1) Parameter explosion: the number of parameters grows exponentially with the n-gram order, making high-order n-grams intractable. (2) No generalization: n-gram models cannot transfer knowledge between similar but distinct n-grams (e.g., “the cat sat” teaches nothing about “the dog sat”). Neural LMs address both through learned embeddings that capture semantic similarity and shared representations.
References: Hindle et al., “On the Naturalness of Software” (ICSE 2012) • Jurafsky & Martin, “Speech and Language Processing” • “Recommending Code Tokens via N-gram Models”
🎉
Module Complete!
You've finished Probabilistic Source Code Modeling. Great work!