A Probabilistic Language Model (PLM) is a statistical approach to modeling the likelihood of words or tokens occurring in a sequence. It estimates how likely a given word or sequence of words is to appear in a particular context, based on statistical patterns observed in large corpora.
Core Goal
Estimate P(word | context) and P(sequence) — the probability of the next word given previous context, and the probability of an entire sequence of words.
Learning Objectives
By the end of this module you will be able to:
1. Apply conditional probabilities and the chain rule
2. Explain n-gram models and the Markov assumption
3. Compute and interpret perplexity
4. Describe smoothing, interpolation, and backoff
5. Apply sampling from language models
6. Explain why code is “natural” (Hindle et al.)
7. Compare n-gram LMs to neural LMs
PLM
Probabilistic Language Model — a statistical model that assigns probabilities to sequences of words/tokens based on patterns observed in training data.
Corpus
A large collection of text (or code) used to train the language model. The model learns statistical patterns from this data.
Context Window
The number of preceding tokens the model considers when predicting the next token. Larger windows capture more context but require more data.
Module 2 · Slide 02
Probability Refresher
A quick visual recap of the probability concepts you will need throughout this module.
Sample Space & Events
Sample Space (S): the set of all possible outcomes.
Example — rolling a die: S = {1, 2, 3, 4, 5, 6}
Event A = “roll an even number” = {2, 4, 6}
P(A) = |A| / |S| = 3 / 6 = 0.5
Probability of an Event
P(A) = Number of favorable outcomes / Total outcomes
Conditional Probability
P(A|B) = P(A ∩ B) / P(B)
Example: a bag has 3 red and 2 blue marbles. You draw one marble and it is red (event B). What is the probability the next draw is also red (event A)?
P(A|B) = 2/4 = 0.5 (since one red was removed)
Readiness Check
If you are comfortable with P(A|B), you are ready for this module. If not, review your probability notes — these concepts are the foundation of ALL language models.
Module 2 · Slide 03
Conditional Probabilities & the Chain Rule
To estimate the probability of an entire sentence, we need two building blocks: conditional probability and the chain rule.
The chain rule decomposes a joint probability into a product of conditional probabilities. This means we can compute the probability of any sequence by multiplying together the probability of each word given all the words that came before it.
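As a tiny illustration, the chain rule can be coded directly. The conditional probabilities below are made-up numbers for the sketch, not estimates from a real corpus:

```python
# Hypothetical conditional probabilities (invented for illustration).
cond_probs = {
    ("its",): 0.05,                  # P(its)
    ("its", "water"): 0.10,          # P(water | its)
    ("its", "water", "is"): 0.40,    # P(is | its water)
}

def sequence_prob(cond_probs, sentence):
    """Chain rule: multiply P(w_i | w_1..w_{i-1}) over the sentence."""
    p = 1.0
    for i in range(len(sentence)):
        p *= cond_probs[tuple(sentence[: i + 1])]
    return p

p = sequence_prob(cond_probs, ["its", "water", "is"])
print(round(p, 4))   # 0.05 * 0.10 * 0.40 = 0.002
```

The product shrinks quickly as the sentence grows — which is why implementations usually sum log probabilities instead of multiplying raw ones.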
Module 2 · Slide 04
Computing Probability of a Sentence
Let us apply the chain rule step by step to compute the probability of a concrete sentence.
Step-Through: P("its water is so transparent") interactive
P(its water is so transparent)
= P(its)
× P(water | its)
× P(is | its water)
× P(so | its water is)
× P(transparent | its water is so)
The Problem
To compute P(the | its water is so transparent that), we need:
Count(its water is so transparent that the) / Count(its water is so transparent that)
Exact long sequences are vanishingly rare in any corpus. Even huge corpora will never contain most long word sequences.
Why is it impractical to estimate P(word | very long context) using raw counts?
The number of possible sequences grows exponentially with length. Most long sequences will never appear in any finite corpus, so the counts will be zero. Even sequences that do appear will have counts too small to give reliable probability estimates.
Module 2 · Slide 05
Maximum Likelihood Estimation (MLE)
When we estimate probabilities by counting, we are implicitly using Maximum Likelihood Estimation — the simplest and most intuitive way to learn a language model.
1
Count occurrences
For each event (e.g., bigram), count how many times it appears in the training data.
2
Divide by total observations
Normalize by the total count of the context to get a probability.
3
That is your probability estimate!
The MLE estimate maximizes the likelihood of the observed training data.
MLE Formula
PMLE(w | context) = Count(context, w) / Count(context)
Worked Example
In a Java corpus, suppose: Count(“(”, “for”) = 100 — “for” appears 100 times after “(” Count(“(”) = 500 — “(” appears 500 times total
PMLE(“for” | “(”) = 100 / 500 = 0.20
Key Insight
MLE is the simplest estimator. It works well with lots of data but fails for rare events — if an n-gram was never observed, its MLE probability is zero. This is called the zero-count problem.
Limitation
MLE assigns P = 0 to any event not seen in training. In language modeling, this means a single unseen bigram makes the probability of an entire sentence zero!
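A minimal sketch of MLE bigram estimation on an invented token stream. Note that any bigram absent from the training counts simply has no entry — the zero-count problem described above:

```python
from collections import Counter, defaultdict

def mle_bigram(tokens):
    """MLE bigram model: count pairs, then normalize each context row."""
    counts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    return {w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
            for w1, ctr in counts.items()}

# Invented stream: "(" is followed by "for" twice and by "int" once.
tokens = ["(", "for", "(", "int", "(", "for"]
probs = mle_bigram(tokens)
print(round(probs["("]["for"], 3))   # 2 of the 3 "("-contexts continue with "for"
```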
Module 2 · Slide 06
The Data Sparsity Problem: Why N-grams?
To compute exact sequence probabilities, we would need enough data to observe every possible word combination. The number of combinations is astronomical.
Combinatorics
C(n, r) = n! / (r!(n−r)!)
1,000
Vocabulary Size
8.25 × 10^12
5-word Combos
Example
With just 1,000 unique words, the number of possible 5-word combinations is:
C(1000, 5) ≈ 8.25 × 10^12
No corpus is large enough to observe all of these!
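The figure above can be checked with the standard library (math.comb computes C(n, r)); ordered 5-token sequences are even more numerous:

```python
import math

V = 1000
combos = math.comb(V, 5)    # 5-word combinations from a 1,000-word vocabulary
print(f"{combos:,}")        # 8,250,291,250,200  (~8.25e12)
print(f"{V ** 5:.0e}")      # ordered 5-token sequences: 1e+15
```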
Data Sparsity
The number of possible word sequences grows exponentially with sequence length. Even with enormous corpora, most sequences will never be observed. This makes direct counting impractical for long contexts.
The Solution
Instead of conditioning on the entire history, approximate by conditioning on only the last few words. This is the Markov assumption, and it gives us n-gram models.
1
Full history is impractical
P(wi | w1...wi-1) requires too much data
2
Truncate the context
Approximate with only the last k words
3
N-gram model
P(wi | wi-k...wi-1) is tractable to estimate
Module 2 · Slide 07
The Markov Assumption & N-gram Models
The key idea: approximate the full conditional probability by looking at only the last N-1 tokens.
P(the | its water is so transparent that) ≈ P(the | that) — using a bigram model
Model
Approximation
Unigram
P(w1...wn) ≈ Π P(wi) — each word independent
Bigram
P(wi|w1...wi-1) ≈ P(wi|wi-1) — previous word only
Trigram
P(wi|w1...wi-1) ≈ P(wi|wi-2,wi-1)
N-gram
Condition on the last N−1 tokens (3-grams, 4-grams, 5-grams...)
Unigram
Context = 0 words. Each word is treated independently. Simplest but least accurate.
Bigram
Context = 1 word. Predicts based on the immediately preceding token.
Trigram
Context = 2 words. Captures short-range dependencies like “for (int”.
N-gram
General form. Higher N = more context but exponentially more data needed.
What is the tradeoff between using higher-order n-grams (e.g., 5-grams vs bigrams)?
Higher-order n-grams capture more context, leading to more informed predictions. However, the number of possible n-grams grows exponentially, causing severe data sparsity — most n-grams will never appear in training data, making their estimated probability zero or unreliable.
Module 2 · Slide 08
Start & End Tokens: Sentence Boundaries
Language models need to know where sequences begin and end. Special boundary tokens make this possible.
Raw Sequence
int x = 5 ;
With Boundary Markers
<s> int x = 5 ; </s>
Why <s>?
Without <s>, the model cannot learn which tokens start sequences. The bigram P(int | <s>) captures that int is a common way to begin a statement.
Why </s>?
Without </s>, the model cannot learn when to stop generating. P(</s> | ;) tells the model that a semicolon often ends a statement.
Every modern language model needs to know where sequences begin and end. These special tokens make that possible — from simple n-gram models to large Transformers.
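Boundary padding is a one-liner. A minimal sketch, using a made-up two-statement corpus:

```python
from collections import Counter

def pad(seq, n=2):
    """Add boundary markers for an n-gram model (n-1 start tokens)."""
    return ["<s>"] * (n - 1) + seq + ["</s>"]

# Hypothetical two-statement corpus.
corpus = [["int", "x", "=", "5", ";"], ["int", "y", ";"]]
bigrams = Counter()
for seq in corpus:
    padded = pad(seq)
    bigrams.update(zip(padded, padded[1:]))

print(bigrams[("<s>", "int")])   # 2 — both statements start with int
print(bigrams[(";", "</s>")])    # 2 — both end with a semicolon
```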
Module 2 · Slide 09
Zipf's Law: The Long Tail
Word frequencies in natural language (and code) follow a power-law distribution known as Zipf's Law: a small number of words occur very frequently, while a large number occur rarely.
Zipf Distribution visualization
(Axis: most frequent → least frequent)
20%
of Words
80%
of Occurrences
Pareto's Law (80/20 Rule)
20% of words account for 80% of all occurrences. In code, tokens like ;, (, ), {, }, = dominate, while most identifiers appear very few times.
Impact on N-grams
Zipf's law means that most n-grams in our vocabulary will be extremely rare or never observed, making probability estimation challenging. This is the fundamental reason we need smoothing techniques.
In Code
High frequency: ; = ( ) return if this
Low frequency: processUserData, calcTaxRate, getEmbeddedIPv4
Most custom identifiers fall in the “long tail.”
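The 80/20 pattern falls out of the Zipf assumption itself. A sketch with an idealized distribution (frequency of the rank-r word proportional to 1/r, vocabulary size chosen arbitrarily):

```python
# Idealized Zipf distribution over a 1,000-word vocabulary.
V = 1000
freqs = [1 / r for r in range(1, V + 1)]
head = sum(freqs[: V // 5]) / sum(freqs)   # mass held by the top 20% of words
print(round(head, 2))   # ~0.79 — close to the 80/20 rule
```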
Module 2 · Slide 10
The Naturalness of Software
Hindle et al. (ICSE 2012) made a groundbreaking discovery: programming languages, in theory, are complex and flexible, but real programs are mostly simple and repetitive, with predictable statistical properties that can be captured by language models.
Key Insight — Hindle et al., ICSE 2012
Programming languages are theoretically complex, flexible, and powerful. But the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models.
Natural Language
Rich vocabulary, diverse phrasing
Many ways to express the same idea
Ambiguity is common and accepted
Follows statistical patterns
Repetitive structures (grammar)
vs
Source Code
Constrained by strict syntax rules
Coding conventions enforce patterns
Repetitive idioms (for-loops, if-checks)
Also follows statistical patterns
Even MORE predictable than English!
Why This Matters
Because code is “natural” — repetitive and predictable — statistical language models like n-grams can be effectively applied to software engineering tasks: code completion, bug detection, code recommendation, and more. This insight is what makes PLMs useful for SE.
Module 2 · Slide 11
Training a Bigram Model for Code
Training an n-gram model is simply counting. Split your corpus into training (80%) and test (20%) sets. Count all bigram occurrences, then normalize to get probabilities.
public static Inet4Address getEmbeddedIPv4CA(Inet6Address ip) {
    ip1 = new IP("127.0.0.1");
    if (isCompatIPv4Address(ip)) {
        return getCompatIPv4Address(ip);
    }
    ...
}
Count Matrix
Build a table where rows = preceding token, columns = following token, and each cell = the count of that bigram in the training corpus. Then normalize each row to get probabilities.
Click any cell to see the bigram probability computation.
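The count-and-normalize procedure can be sketched in a few lines. The token stream below is a hypothetical for-loop header, standing in for a real training corpus:

```python
from collections import Counter, defaultdict

# Hypothetical training stream: tokens of a for-loop header.
train = ["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"]

counts = defaultdict(Counter)        # rows = preceding token, cols = next token
for w1, w2 in zip(train, train[1:]):
    counts[w1][w2] += 1

# Normalize the row for context "i" into probabilities.
row_total = sum(counts["i"].values())
probs = {w: c / row_total for w, c in counts["i"].items()}
print(probs)   # "i" is followed by "=", "<", "++" once each: 1/3 apiece
```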
Module 2 · Slide 12
Interactive: Bigram Probability Calculator
Select a code snippet and see how bigram counts and probabilities are computed from the token sequence.
Bigram Calculator interactive
Module 2 · Slide 13
Interactive: Code Completion with N-grams
See how a language model predicts the next token given a partial code snippet. Toggle between unigram, bigram, and trigram to see how more context improves predictions.
Code Completion Simulator interactive
Module 2 · Slide 14
Pros & Cons of N-gram Language Models
N-gram models are a foundational approach with clear strengths and known limitations.
Pros
No need for specialized hardware (GPU/TPU)
CPU-based computation is sufficient
Easy to implement — just counting and dividing
Fast to train and query
Interpretable: inspect the count tables directly
Well-understood mathematical foundations
vs
Cons
Long-distance relationships cannot be captured
Fixed context window (cannot see beyond N-1 tokens)
Data sparsity grows exponentially with N
No semantic understanding of tokens
Cannot generalize to unseen n-grams
Open vocabulary problem with code identifiers
Key Takeaway
N-gram models are powerful for local patterns but fundamentally limited by their fixed context window. They are excellent baselines and remain useful in practice, but complex code understanding requires models that can look further back.
Module 2 · Slide 15
Long-Range Dependencies in Code
Code often has dependencies that span many tokens. A variable declared at the top of a method may be used at the bottom — far beyond any n-gram window.
public static Inet4Address getEmbeddedIPv4ClientAddress(Inet6Address ip) {
    ip1 = new IP("127.0.0.1");
    if (isCompatIPv4Address(ip)) {
        return getCompatIPv4Address(ip);
    }
    if (is6to4Address(ip)) {
        return get6to4IPv4Address(ip);
    }
    if (isTeredoAddress(ip)) {
        return getTeredoInfo(ip).getClient();
    }
    return ip1.getDefault();
}
The Problem
ip1 is defined at line 2 but used at the last line — many tokens apart. A bigram or trigram model would have no knowledge of ip1 when predicting the final return statement.
Neural Networks to the Rescue
Neural networks (RNNs, Transformers) can easily handle longer context windows. They maintain hidden state or attention mechanisms that allow information from distant tokens to influence predictions.
In the code above, which dependency would a bigram model miss?
The dependency between ip1 on line 2 and ip1 on the last line. These are separated by many tokens (all the if-statements and return statements). A bigram model only sees the previous token, so when it reaches the final return, it has no memory of ip1 being declared earlier.
Module 2 · Slide 16
Model Evaluation: Extrinsic Evaluation
How do we know if a language model is good? One approach: test it on an actual downstream task and measure its performance.
1
Choose a Task
Pick a downstream task: code completion, code generation, bug detection, etc.
2
Run Both Models
Apply Model A and Model B to the same task with the same test data.
3
Measure Accuracy
How many tokens correctly predicted? How many mispredicted? Compare metrics.
4
Compare Results
The model with better task performance is the better language model for that task.
Model A
→
Code Completion downstream task
→
Accuracy: 72%
Model B
→
Code Completion downstream task
→
Accuracy: 85%
Drawback
Extrinsic evaluation is time-consuming. Running a full downstream task evaluation can take days or even weeks. We need a faster way to compare models during development.
Gold Standard
Extrinsic evaluation is the gold standard because it directly measures what we care about. But it is expensive — which motivates intrinsic metrics like perplexity.
Module 2 · Slide 17
Intrinsic Evaluation: Perplexity
Perplexity measures model quality independent of any application. Lower perplexity = better model (more confident, less “surprised” by the test data).
Perplexity (General)
PP(W) = P(w1w2...wN)^(−1/N)
= (Π i=1..N 1/P(wi | w1...wi-1))^(1/N)
For Bigrams
PP(W) = (Π i=1..N 1/P(wi | wi-1))^(1/N)
Comparison
If Model A has perplexity 20 and Model B has perplexity 5, Model B is much more confident in its predictions. It is as if Model B is choosing among only 5 equally likely options at each step, while Model A is choosing among 20.
Caveat
Perplexity provides a bad approximation unless the test data looks like the training data. It is generally only useful in pilot experiments for quick model comparison.
Warning: Zero Probabilities
If any P(wi|context) = 0, then 1/0 = infinity, and perplexity goes to infinity. A single unseen n-gram breaks the entire metric! This is why we need smoothing.
Module 2 · Slide 18
Worked Example: Computing Perplexity
Let us walk through a full numerical example before using the interactive calculator.
Sequence (4 tokens)
for ( int i
Bigram probabilities:
P(for | <s>) = 0.15
P( ( | for ) = 0.85
P( int | ( ) = 0.40
P( i | int ) = 0.30
1
Multiply probabilities
0.15 × 0.85 × 0.40 × 0.30 = 0.0153
2
Compute log2 of each
-2.74 + (-0.23) + (-1.32) + (-1.74) = -6.03
3
Average (negative)
-(-6.03) / 4 = 1.508 (this is cross-entropy H)
4
Perplexity = 2^H
2^1.508 = 2.84
Interpretation
A perplexity of 2.84 means the model is, on average, as uncertain as choosing between ~3 equally likely next tokens. Lower is better — a perfect model would have perplexity 1.
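The steps above can be reproduced in a few lines of Python, using the same four bigram probabilities:

```python
import math

# Bigram probabilities from the worked example above.
probs = [0.15, 0.85, 0.40, 0.30]   # for, (, int, i

cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)
perplexity = 2 ** cross_entropy

print(round(cross_entropy, 3))   # 1.508
print(round(perplexity, 2))      # 2.84
```

Working in log space also avoids the numerical underflow that multiplying many small probabilities would cause on longer sequences.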
Module 2 · Slide 19
Cross-Entropy: The Information Theory Connection
Cross-entropy connects language modeling to information theory. It measures how well our model Q approximates the true distribution P.
Cross-Entropy
H(P, Q) = −Σ P(x) log2 Q(x)
In Language Modeling
P = the true distribution (actual next tokens) Q = our model's predicted distribution
In practice, we approximate with the test set: H ≈ -(1/N) Σ log2 Q(wi | context)
Key Relationship
Perplexity = 2^H(P,Q)
Why Log Base 2?
We measure information in bits. One bit of entropy means one binary choice. Log base 2 gives us units of bits — the standard in information theory.
Worked Examples
If our model assigns P = 0.5 to every next token:
H = −log2(0.5) = 1 bit, Perplexity = 2^1 = 2
If our model assigns P = 0.25:
H = −log2(0.25) = 2 bits, Perplexity = 2^2 = 4
Higher entropy = more uncertainty = higher perplexity.
Intuition
Cross-entropy measures the average number of bits needed to encode the true data using our model. A better model needs fewer bits.
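The two worked examples above amount to one tiny function — the "surprise" of a token in bits, and the perplexity it implies:

```python
import math

def bits(p):
    """Surprise of a token assigned probability p, in bits."""
    return -math.log2(p)

print(bits(0.5), 2 ** bits(0.5))    # 1.0 bit  -> perplexity 2
print(bits(0.25), 2 ** bits(0.25))  # 2.0 bits -> perplexity 4
```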
Module 2 · Slide 20
Interactive: Perplexity Calculator
Adjust the bigram probabilities for each token and see how cross-entropy and perplexity change in real time. Try setting one probability to zero!
Perplexity Calculator interactive
Sequence: for ( int i = 0
Module 2 · Slide 21
Laplace & Add-k Smoothing
Zero-count n-grams break the model. Smoothing techniques redistribute probability mass to handle unseen events.
The Problem
If count(wi-1, wi) = 0, then P(wi|wi-1) = 0. A single zero probability makes the entire sequence probability zero and perplexity infinite.
Laplace (Add-1) Smoothing
Add 1 to all n-gram counts: P(wi|wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V)
where V = vocabulary size. Simple but crude — steals too much mass from observed events.
Add-k Smoothing
Generalize Laplace by adding k instead of 1: P(wi|wi-1) = (count(wi-1, wi) + k) / (count(wi-1) + kV) k is a hyperparameter — optimize it on a held-out set.
Laplace Smoothing Example
Bigram     | Raw Count | +1 Count | Raw P           | Smoothed P
for (      | 18        | 19       | 18/100 = 0.180  | 19/110 = 0.173
for return | 0         | 1        | 0/100 = 0.000   | 1/110 = 0.009
Assume count(for) = 100, V = 10
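The smoothed values in the example can be reproduced with a one-line estimator:

```python
def add_k_prob(bigram_count, context_count, V, k=1.0):
    """Add-k smoothed P(w2|w1); k = 1 gives Laplace smoothing."""
    return (bigram_count + k) / (context_count + k * V)

# Numbers from the example: count(for) = 100, V = 10.
print(round(add_k_prob(18, 100, V=10), 3))  # "for (":      19/110 ≈ 0.173
print(round(add_k_prob(0, 100, V=10), 3))   # "for return":  1/110 ≈ 0.009
```

Note how the unseen bigram now gets a small nonzero probability, paid for by shaving mass off the observed one (0.180 → 0.173).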
Why is Laplace smoothing considered too crude for large vocabularies?
It adds 1 to every possible bigram count, and adds V to the denominator. When V is very large (common in code), the total mass redistributed to unseen events becomes enormous, stealing too much probability from actually observed n-grams. Add-k and interpolation are better alternatives.
Module 2 · Slide 22
Interpolation & Backoff
Instead of relying on a single n-gram order, blend multiple models together for more robust probability estimates.
Linear interpolation combines estimates of different orders:
PInterp(wi | wi-2 wi-1) = λ1 P(wi) + λ2 P(wi | wi-1) + λ3 P(wi | wi-2 wi-1), where λ1 + λ2 + λ3 = 1.
To set the λs, use a held-out validation set (separate from train and test). Try different λ combinations and pick those that maximize the likelihood of the held-out data.
Interpolation blends multiple models. When the trigram has data, trust it; when it does not, the unigram and bigram components provide a safety net that prevents zero probabilities.
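A minimal sketch of the blend; the λ values and the three component estimates below are hypothetical, not tuned on real data:

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram/bigram/trigram estimates.
    The lambdas (hypothetical values here) must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# The trigram estimate is zero (unseen), yet the result stays nonzero:
p = interpolate(p_uni=0.01, p_bi=0.40, p_tri=0.0)
print(round(p, 3))   # 0.1*0.01 + 0.3*0.40 + 0.6*0.0 = 0.121
```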
Module 2 · Slide 23
Stupid Backoff & Sampling
Two more practical techniques for handling unseen n-grams and generating text from language models.
Stupid Backoff
If the n-gram count is zero, back off to the (n-1)-gram:
1. Try P(wi | wi-2, wi-1) — trigram
2. If zero, try P(wi | wi-1) — bigram
3. If still zero, use P(wi) — unigram
Simple and effective in practice. Not strictly a probability distribution unless adjusted, but works well for large-scale systems.
Trigram
→
Bigram
→
Unigram
back off if count = 0
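The cascade above can be sketched with count dictionaries. The counts are invented, and the discount factor α = 0.4 is a commonly used value; as the slide notes, the resulting scores are not normalized probabilities:

```python
def stupid_backoff(w, context, tri, bi, uni, alpha=0.4):
    """Score(w | context): fall back to shorter contexts,
    discounting each backoff step by alpha. Scores are unnormalized."""
    if tri.get((*context, w), 0) > 0:
        return tri[(*context, w)] / bi[context]
    if bi.get((context[-1], w), 0) > 0:
        return alpha * bi[(context[-1], w)] / uni[context[-1]]
    return alpha * alpha * uni.get(w, 0) / sum(uni.values())

# Hypothetical counts from a tiny corpus:
uni = {"for": 10, "(": 12, "int": 5}
bi = {("for", "("): 8, ("(", "int"): 4}
tri = {("for", "(", "int"): 3}

s1 = stupid_backoff("int", ("for", "("), tri, bi, uni)  # trigram seen: 3/8
s2 = stupid_backoff("int", ("if", "("), tri, bi, uni)   # backs off: 0.4 * 4/12
print(s1, s2)
```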
Sampling from Language Models
Instead of always picking the most likely next token, choose randomly according to the probability distribution:
High-probability tokens: more likely to be sampled Low-probability tokens: less likely to be sampled
This generates more diverse and creative output while still following learned patterns.
Why Sample?
Always picking the top prediction (greedy decoding) produces repetitive and boring output. Sampling introduces variety — the model is more likely to generate high-probability sentences but can occasionally produce surprising and novel completions.
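Sampling from a next-token distribution is a standard-library call. The distribution below is a hypothetical P(next token | "("):

```python
import random

def sample_next(dist, rng=random):
    """Draw the next token with probability proportional to its weight."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

dist = {"int": 0.5, "float": 0.3, "char": 0.2}   # hypothetical P(w | "(")
random.seed(0)
samples = [sample_next(dist) for _ in range(1000)]
print(samples.count("int") / 1000)   # close to 0.5, not exactly 0.5
```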
Module 2 · Slide 24
Temperature & Sampling Strategies
Temperature controls the “sharpness” of the probability distribution when sampling from a language model. Combined with top-k and top-p, it gives fine-grained control over generation.
Temperature (T)
T = 0 (greedy): always pick the highest-probability token
T = 1: sample from the actual learned distribution
T > 1: flatten the distribution — more random/creative
T < 1: sharpen the distribution — more focused/deterministic
Top-k Sampling
Only consider the k most probable tokens. Discard the rest and renormalize. Example: top-5 keeps only the 5 best candidates.
Top-p / Nucleus Sampling
Choose the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9). Adapts to the shape of the distribution — keeps fewer tokens when the model is confident.
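One common way to apply temperature to an already-normalized distribution is to raise each probability to the power 1/T and renormalize — equivalent to dividing the logits by T before a softmax. A sketch with made-up probabilities:

```python
def apply_temperature(probs, T):
    """Rescale a normalized distribution: p_i ** (1/T), renormalized.
    Equivalent to dividing the logits by T before a softmax."""
    scaled = [p ** (1.0 / T) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.6, 0.3, 0.1]
sharp = apply_temperature(probs, 0.5)   # T < 1: the top token gains mass
flat = apply_temperature(probs, 2.0)    # T > 1: the distribution flattens
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```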
Temperature Visualizer interactive
Module 2 · Slide 25
Out-of-Vocabulary & the <UNK> Token
How do n-gram models handle tokens that were never seen during training? The open vocabulary problem is a fundamental challenge.
1
The Problem
New code uses identifiers not in training data: myCustomHandler, calcNetRevenue. These have zero probability.
2
Solution: <UNK> Token
Replace rare tokens (count < threshold) with a special <UNK> token before training.
3
Train with <UNK>
The model learns “rare token behavior” — what typically follows or precedes an uncommon identifier.
4
Map Unseen to <UNK>
At test time, any token not in the vocabulary is replaced with <UNK> and gets a nonzero probability.
The Tradeoff
Too aggressive (high threshold): many tokens become <UNK>, losing valuable information.
Too lenient (low threshold): vocabulary stays large, probabilities remain sparse. Finding the right threshold requires experimentation.
Modern Solution
The open vocabulary problem is why modern LLMs use Byte-Pair Encoding (BPE) instead of word-level tokenization. BPE breaks words into subword units, so even unseen identifiers can be represented as sequences of known subwords.
Example
Vocabulary = {for, int, if, return, ;, (, ), ...}
Input: calcNetRevenue → not in vocab → mapped to <UNK>
P(<UNK> | return) is now a valid, nonzero probability.
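The vocabulary-thresholding scheme can be sketched in a few lines; the corpus and the min_count threshold below are invented for illustration:

```python
from collections import Counter

def build_vocab(corpus, min_count=2):
    """Tokens seen fewer than min_count times are replaced by <UNK>."""
    counts = Counter(t for seq in corpus for t in seq)
    return {t for t, c in counts.items() if c >= min_count} | {"<UNK>"}

def map_unk(seq, vocab):
    return [t if t in vocab else "<UNK>" for t in seq]

# Hypothetical training corpus; "y" is too rare to keep.
corpus = [["return", "x", ";"], ["return", "y", ";"], ["return", "x", ";"]]
vocab = build_vocab(corpus)

print(map_unk(["return", "calcNetRevenue", ";"], vocab))
# ['return', '<UNK>', ';']
```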
Module 2 · Slide 26
Large Language Models vs N-grams
LLMs use neural networks rather than explicit n-gram counts. They address two fundamental problems that limit n-gram models.
Problem 1: Parameter Explosion
The number of parameters in an n-gram model grows exponentially as the n-gram order increases. A 5-gram model with vocabulary V requires up to V^5 parameters. This quickly becomes intractable.
Problem 2: No Generalization
N-grams cannot generalize from training to test unless the exact same words are used. Seeing “the cat sat” teaches nothing about “the dog sat” because the model treats every word as an independent symbol.
N-gram LMs
Count-based statistics
Fixed context window
No semantic understanding
Cannot generalize
Fast, interpretable
vs
Neural LMs
Learned representations
Long-range context
Semantic embeddings
Generalizes to new data
GPU-intensive, less interpretable
N-grams
→
FFNNs
→
RNNs
→
Transformers
→
LLMs
This is where Module 4 picks up...
Module 2 · Slide 27
Lab Challenge: Build Your Own Bigram Model
Put it all together! Using the Java dataset from Module 1, build a bigram language model in Python.
1
Load & Tokenize
Load Java methods from the corpus and tokenize them into sequences of tokens.
2
Build Bigram Count Matrix
Count all consecutive token pairs across the training set.
3
Implement Add-k Smoothing
Add k to all counts to handle unseen bigrams. Experiment with k values.
4
Compute Perplexity
Evaluate your model on a held-out test set using the perplexity metric.
5
Generate Code
Use sampling to generate a simple code sequence from your trained model.
Starter Code Skeleton
import collections, math, random

def build_bigram_counts(corpus):
    """Count all bigrams in tokenized corpus."""
    counts = collections.defaultdict(collections.Counter)
    for seq in corpus:
        for i in range(len(seq) - 1):
            counts[seq[i]][seq[i + 1]] += 1
    return counts

def bigram_prob(w1, w2, counts, V, k=1):
    """P(w2|w1) with add-k smoothing; V is the vocabulary size."""
    num = counts[w1][w2] + k
    den = sum(counts[w1].values()) + k * V
    return num / den

def perplexity(test, counts, V, k=1):
    """Compute perplexity on test set."""
    # Your implementation here
Module 2 · Slide 28
Recap & Knowledge Check
01
PLMs, MLE & Chain Rule
PLMs assign probabilities via MLE counting. The chain rule decomposes joint probability into conditionals.
02
N-gram Models
Use the Markov property with boundary tokens (<s>, </s>) to estimate probabilities from a fixed window.
03
Evaluation
Extrinsic evaluation measures downstream-task performance; perplexity and cross-entropy measure model quality intrinsically.
04
Smoothing & Backoff
Laplace, add-k, linear interpolation, and stupid backoff address the zero-count problem.
05
Sampling & Temperature
Temperature, top-k, and top-p control generation diversity. OOV tokens are handled via <UNK> or BPE.
06
Naturalness & Beyond
Code is “natural” (Hindle et al.). LLMs overcome n-gram limitations using deep learning and embeddings.
Recap Quiz
Q1: What does the chain rule allow us to compute?
The joint probability of a sequence by decomposing it into a product of conditional probabilities. — Correct
The average number of bits needed to encode a token, independent of the model. — Incorrect
The optimal vocabulary size for an n-gram model trained on source code. — Incorrect
Q2: Why is code more predictable than English text?
Because programming languages have fewer words than English. — Incorrect
Because code follows strict syntax rules, coding conventions, and repetitive idioms that create predictable patterns. — Correct
Because code is always written by a single person, reducing variability. — Incorrect
Q3: What happens to perplexity when a model assigns zero probability to a token?
Perplexity drops to zero, indicating a perfect model. — Incorrect
Perplexity remains unchanged because zero-probability tokens are ignored. — Incorrect
Perplexity goes to infinity because we divide by zero in the computation. — Correct
Q4: What are the two key limitations of n-gram models that LLMs overcome?
(1) Parameter explosion: the number of parameters grows exponentially with the n-gram order, making high-order n-grams intractable. (2) No generalization: n-gram models cannot transfer knowledge between similar but distinct n-grams (e.g., “the cat sat” teaches nothing about “the dog sat”). Neural LMs address both through learned embeddings that capture semantic similarity and shared representations.
References: Hindle et al., “On the Naturalness of Software” (ICSE 2012) • Jurafsky & Martin, “Speech and Language Processing” • “Recommending Code Tokens via N-gram Models”
🎉
Module Complete!
You've finished Probabilistic Source Code Modeling. Great work!