Probabilistic Source Code Modeling — Module 2 Learning Material
Module 2 · Slide 01

Probabilistic Source Code Modeling

A Probabilistic Language Model (PLM) is a statistical approach to modeling the likelihood of words or tokens occurring in a sequence. It estimates how likely a given word or sequence of words is in a particular context, based on statistical patterns observed in large corpora.

Core Goal
Estimate P(word | context) and P(sequence) — the probability of the next word given previous context, and the probability of an entire sequence of words.
Learning Objectives
By the end of this module you will be able to:
1. Apply conditional probabilities and the chain rule
2. Explain n-gram models and the Markov assumption
3. Compute and interpret perplexity
4. Describe smoothing, interpolation, and backoff
5. Apply sampling from language models
6. Explain why code is “natural” (Hindle et al.)
7. Compare n-gram LMs to neural LMs
PLM
Probabilistic Language Model — a statistical model that assigns probabilities to sequences of words/tokens based on patterns observed in training data.
Corpus
A large collection of text (or code) used to train the language model. The model learns statistical patterns from this data.
Context Window
The number of preceding tokens the model considers when predicting the next token. Larger windows capture more context but require more data.
Module 2 · Slide 02

Probability Refresher

A quick visual recap of the probability concepts you will need throughout this module.

Sample Space & Events
Sample Space (S): the set of all possible outcomes.
Example — rolling a die: S = {1, 2, 3, 4, 5, 6}

Event A = “roll an even number” = {2, 4, 6}
P(A) = |A| / |S| = 3 / 6 = 0.5
Probability of an Event
P(A) = Number of favorable outcomes / Total outcomes
Conditional Probability
P(A|B) = P(A ∩ B) / P(B)

Example: a bag has 3 red and 2 blue marbles. You draw one marble and it is red (event B). What is the probability the next draw is also red (event A)?

P(A|B) = 2/4 = 0.5 (since one red was removed)
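The marble example can be checked by brute-force enumeration — a quick sketch that enumerates every ordered pair of draws and applies the conditional-probability definition directly:

```python
from fractions import Fraction
from itertools import permutations

# All ordered draws of two marbles from a bag of {3 red, 2 blue}.
marbles = ["R", "R", "R", "B", "B"]
draws = list(permutations(marbles, 2))  # 5 × 4 = 20 equally likely outcomes

# B = first draw is red; A ∩ B = both draws are red.
p_b = Fraction(sum(1 for d in draws if d[0] == "R"), len(draws))
p_ab = Fraction(sum(1 for d in draws if d == ("R", "R")), len(draws))

print(p_ab / p_b)  # P(A|B) = (3/10) / (3/5) = 1/2
```

The enumeration confirms the counting argument: with one red removed, 2 of the 4 remaining marbles are red.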
Readiness Check
If you are comfortable with P(A|B), you are ready for this module. If not, review your probability notes — these concepts are the foundation of ALL language models.
Module 2 · Slide 03

Conditional Probabilities & the Chain Rule

To estimate the probability of an entire sentence, we need two building blocks: conditional probability and the chain rule.

Conditional Probability
P(B | A) = P(A, B) / P(A)
Joint Probability
P(A, B) = P(A) · P(B | A)
A Sentence
s = (w1, w2, ..., wN)
Chain Rule Expansion
P(w1, ..., wN) =
  P(w1) · P(w2|w1) · P(w3|w1,w2) · ...
  · P(wN|w1, ..., wN-1)
Compact Notation
P(w1, ..., wN) = Πi=1..N P(wi | w1, ..., wi-1)
Key Insight
The chain rule decomposes a joint probability into a product of conditional probabilities. This means we can compute the probability of any sequence by multiplying together the probability of each word given all the words that came before it.
Module 2 · Slide 04

Computing Probability of a Sentence

Let us apply the chain rule step by step to compute the probability of a concrete sentence.

Step-Through: P("its water is so transparent") interactive
P("its water is so transparent")
= P(its)
× P(water | its)
× P(is | its water)
× P(so | its water is)
× P(transparent | its water is so)
The Problem
To compute P(the | its water is so transparent that), we need:
Count(its water is so transparent that the) / Count(its water is so transparent that)

Finding exact long sequences in any corpus is extremely unlikely! Even huge corpora will never contain most long word sequences.
Why is it impractical to estimate P(word | very long context) using raw counts?
The number of possible sequences grows exponentially with length. Most long sequences will never appear in any finite corpus, so the counts will be zero. Even sequences that do appear will have counts too small to give reliable probability estimates.
Module 2 · Slide 05

Maximum Likelihood Estimation (MLE)

When we estimate probabilities by counting, we are implicitly using Maximum Likelihood Estimation — the simplest and most intuitive way to learn a language model.

1

Count occurrences

For each event (e.g., bigram), count how many times it appears in the training data.

2

Divide by total observations

Normalize by the total count of the context to get a probability.

3

That is your probability estimate!

The MLE estimate maximizes the likelihood of the observed training data.

MLE Formula
PMLE(w | context) = Count(context, w) / Count(context)
Worked Example
In a Java corpus, suppose:
Count(“(”, “for”) = 100 — “for” appears 100 times after “(”
Count(“(”) = 500 — “(” appears 500 times total

PMLE(“for” | “(”) = 100 / 500 = 0.20
Key Insight
MLE is the simplest estimator. It works well with lots of data but fails for rare events — if an n-gram was never observed, its MLE probability is zero. This is called the zero-count problem.
Limitation
MLE assigns P = 0 to any event not seen in training. In language modeling, this means a single unseen bigram makes the probability of an entire sentence zero!
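The MLE formula and its zero-count failure mode can both be seen in a few lines, using the slide's Java-corpus counts:

```python
# MLE bigram estimate from raw counts (numbers from the worked example).
bigram_counts = {("(", "for"): 100}   # "for" observed 100 times after "("
context_counts = {"(": 500}           # "(" observed 500 times in total

def p_mle(context, word):
    """P_MLE(word | context) = Count(context, word) / Count(context)."""
    return bigram_counts.get((context, word), 0) / context_counts[context]

print(p_mle("(", "for"))    # 100 / 500 = 0.2
print(p_mle("(", "while"))  # 0.0 — the zero-count problem for unseen bigrams
```

Any bigram absent from the counts gets probability exactly zero, which is what smoothing (Slide 21) will fix.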
Module 2 · Slide 06

The Data Sparsity Problem: Why N-grams?

To compute exact sequence probabilities, we would need enough data to observe every possible word combination. The number of combinations is astronomical.

Combinatorics
C(n, r) = n! / (r!(n - r)!)
1,000
Vocabulary Size
8.25 × 10^12
5-word Combos
Example
With just 1,000 unique words, the number of possible 5-word combinations is:
C(1000, 5) ≈ 8.25 × 10^12

No corpus is large enough to observe all of these!
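The combinatorics claim is easy to verify with the standard library:

```python
import math

# Number of 5-word combinations from a 1,000-word vocabulary.
n_combos = math.comb(1000, 5)
print(n_combos)  # 8250291250200, i.e. about 8.25 × 10^12
```

Even at one observation per microsecond, covering every combination once would take roughly three months — and real vocabularies are far larger than 1,000 words.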
Data Sparsity
The number of possible word sequences grows exponentially with sequence length. Even with enormous corpora, most sequences will never be observed. This makes direct counting impractical for long contexts.
The Solution
Instead of conditioning on the entire history, approximate by conditioning on only the last few words. This is the Markov assumption, and it gives us n-gram models.
1

Full history is impractical

P(wi | w1...wi-1) requires too much data

2

Truncate the context

Approximate with only the last k words

3

N-gram model

P(wi | wi-k...wi-1) is tractable to estimate

Module 2 · Slide 07

The Markov Assumption & N-gram Models

The key idea: approximate the full conditional probability by looking at only the last N-1 tokens.

Markov Assumption
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-k, ..., wi-1)
Example
P(the | its water is so transparent that) ≈ P(the | that)  —  using a bigram model
Model — Approximation
Unigram P(w1...wn) ≈ Π P(wi) — each word independent
Bigram P(wi|w1...wi-1) ≈ P(wi|wi-1) — previous word only
Trigram P(wi|w1...wi-1) ≈ P(wi|wi-2,wi-1)
N-gram Condition on the last N−1 tokens (3-grams, 4-grams, 5-grams...)
Unigram
Context = 0 words. Each word is treated independently. Simplest but least accurate.
Bigram
Context = 1 word. Predicts based on the immediately preceding token.
Trigram
Context = 2 words. Captures short-range dependencies like “for (int”.
N-gram
General form. Higher N = more context but exponentially more data needed.
What is the tradeoff between using higher-order n-grams (e.g., 5-grams vs bigrams)?
Higher-order n-grams capture more context, leading to more informed predictions. However, the number of possible n-grams grows exponentially, causing severe data sparsity — most n-grams will never appear in training data, making their estimated probability zero or unreliable.
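Extracting n-grams from a token sequence is a one-liner worth internalizing — a sketch of the sliding-window idea behind all the models on this slide:

```python
def extract_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["for", "(", "int", "i", "=", "0"]
print(extract_ngrams(tokens, 2))  # [('for', '('), ('(', 'int'), ...]
print(extract_ngrams(tokens, 3))  # [('for', '(', 'int'), ...]
```

A sequence of length L yields L − n + 1 n-grams: 5 bigrams and 4 trigrams here.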
Module 2 · Slide 08

Start & End Tokens: Sentence Boundaries

Language models need to know where sequences begin and end. Special boundary tokens make this possible.

Raw Sequence
int x = 5 ;
With Boundary Markers
<s> int x = 5 ; </s>
Why <s>?
Without <s>, the model cannot learn which tokens start sequences. The bigram P(int | <s>) captures that int is a common way to begin a statement.
Why </s>?
Without </s>, the model cannot learn when to stop generating. P(</s> | ;) tells the model that a semicolon often ends a statement.
Resulting Bigram Pairs
From <s> int x = 5 ; </s>:

(<s>, int) → P(int | <s>)
(int, x) → P(x | int)
(x, =) → P(= | x)
(=, 5) → P(5 | =)
(5, ;) → P(; | 5)
(;, </s>) → P(</s> | ;)
Key Takeaway
Every modern language model needs to know where sequences begin and end. These special tokens make that possible — from simple n-gram models to large Transformers.
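The bigram-pair list above can be produced mechanically by padding and zipping — a minimal sketch:

```python
def pad_and_bigrams(tokens):
    """Wrap a token sequence in boundary markers and list its bigram pairs."""
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))

pairs = pad_and_bigrams(["int", "x", "=", "5", ";"])
print(pairs)
# [('<s>', 'int'), ('int', 'x'), ('x', '='), ('=', '5'), ('5', ';'), (';', '</s>')]
```

The boundary tokens add two extra bigrams per sequence — exactly the ones that teach the model how statements start and stop.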
Module 2 · Slide 09

Zipf's Law: The Long Tail

Word frequencies in natural language (and code) follow a power-law distribution known as Zipf's Law: a small number of words occur very frequently, while a large number occur rarely.

Zipf Distribution visualization
Most frequent Least frequent
20%
of Words
80%
of Occurrences
Pareto's Law (80/20 Rule)
20% of words account for 80% of all occurrences. In code, tokens like ;, (, ), {, }, = dominate, while most identifiers appear very few times.
Impact on N-grams
Zipf's law means that most n-grams in our vocabulary will be extremely rare or never observed, making probability estimation challenging. This is the fundamental reason we need smoothing techniques.
In Code
High frequency: ; = ( ) return if this
Low frequency: processUserData calcTaxRate getEmbeddedIPv4
Most custom identifiers fall in the “long tail.”
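A toy token stream makes the head/tail split concrete. The counts below are invented for illustration (a Zipf-like shape, not measured from a real corpus):

```python
from collections import Counter

# Toy token stream: a few punctuation tokens dominate, identifiers appear once.
tokens = ([";"] * 40 + ["("] * 25 + [")"] * 25 + ["="] * 10
          + ["processUserData", "calcTaxRate", "getEmbeddedIPv4",
             "tmpBuf", "retVal"])

freqs = Counter(tokens)
total = sum(freqs.values())

# The top 2 of 9 token types (~20%) cover most occurrences.
top = sum(count for _, count in freqs.most_common(2))
print(top / total)  # 65/105 ≈ 0.62 — the head dominates the long tail
```

On real code corpora the effect is even stronger, which is why most bigram counts involving identifiers are zero.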
Module 2 · Slide 10

The Naturalness of Software

Hindle et al. (ICSE 2012) made a groundbreaking discovery: programming languages, in theory, are complex and flexible, but real programs are mostly simple and repetitive, with predictable statistical properties that can be captured by language models.

Key Insight — Hindle et al., ICSE 2012
Programming languages are theoretically complex, flexible, and powerful. But the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models.

Natural Language

  • Rich vocabulary, diverse phrasing
  • Many ways to express the same idea
  • Ambiguity is common and accepted
  • Follows statistical patterns
  • Repetitive structures (grammar)
vs

Source Code

  • Constrained by strict syntax rules
  • Coding conventions enforce patterns
  • Repetitive idioms (for-loops, if-checks)
  • Also follows statistical patterns
  • Even MORE predictable than English!
Why This Matters
Because code is “natural” — repetitive and predictable — statistical language models like n-grams can be effectively applied to software engineering tasks: code completion, bug detection, code recommendation, and more. This insight is what makes PLMs useful for SE.
Module 2 · Slide 11

Training a Bigram Model for Code

Training an n-gram model is simply counting. Split your corpus into training (80%) and test (20%) sets. Count all bigram occurrences, then normalize to get probabilities.

public static Inet4Address getEmbeddedIPv4CA(Inet6Address ip) { ip1 = new IP("127.0.0.1"); if (isCompatIPv4Address(ip)) { return getCompatIPv4Address(ip); } ... }
Count Matrix
Build a table where rows = preceding token, columns = following token, and each cell = the count of that bigram in the training corpus. Then normalize each row to get probabilities.
Bigram Probability
P(getEmbeddedIPv4CA | Inet4Address)
= Count(Inet4Address, getEmbeddedIPv4CA) / Count(Inet4Address)
= 15 / 300 = 0.05
Simplified Count Matrix
              static  void  Inet4Address  (    return  ip
public        120     15    8             5    2       0
static        0       18    0             45   30      0
Inet4Address  0       0     0             10   0       3
if            0       0     0             220  0       0
return        0       0     5             10   0       30
Click any cell to see the bigram probability computation.
Module 2 · Slide 12

Interactive: Bigram Probability Calculator

Select a code snippet and see how bigram counts and probabilities are computed from the token sequence.

Bigram Calculator interactive
Module 2 · Slide 13

Interactive: Code Completion with N-grams

See how a language model predicts the next token given a partial code snippet. Toggle between unigram, bigram, and trigram to see how more context improves predictions.

Code Completion Simulator interactive
Module 2 · Slide 14

Pros & Cons of N-gram Language Models

N-gram models are a foundational approach with clear strengths and known limitations.

Pros

  • No need for specialized hardware (GPU/TPU)
  • CPU-based computation is sufficient
  • Easy to implement — just counting and dividing
  • Fast to train and query
  • Interpretable: inspect the count tables directly
  • Well-understood mathematical foundations
vs

Cons

  • Long-distance relationships cannot be captured
  • Fixed context window (cannot see beyond N-1 tokens)
  • Data sparsity grows exponentially with N
  • No semantic understanding of tokens
  • Cannot generalize to unseen n-grams
  • Open vocabulary problem with code identifiers
Key Takeaway
N-gram models are powerful for local patterns but fundamentally limited by their fixed context window. They are excellent baselines and remain useful in practice, but complex code understanding requires models that can look further back.
Module 2 · Slide 15

Long-Range Dependencies in Code

Code often has dependencies that span many tokens. A variable declared at the top of a method may be used at the bottom — far beyond any n-gram window.

public static Inet4Address getEmbeddedIPv4ClientAddress(Inet6Address ip) { ip1 = new IP("127.0.0.1"); if (isCompatIPv4Address(ip)) { return getCompatIPv4Address(ip); } if (is6to4Address(ip)) { return get6to4IPv4Address(ip); } if (isTeredoAddress(ip)) { return getTeredoInfo(ip).getClient(); } return ip1.getDefault(); }
The Problem
ip1 is defined at line 2 but used at the last line — many tokens apart. A bigram or trigram model would have no knowledge of ip1 when predicting the final return statement.
Neural Networks to the Rescue
Neural networks (RNNs, Transformers) can easily handle longer context windows. They maintain hidden state or attention mechanisms that allow information from distant tokens to influence predictions.
In the code above, which dependency would a bigram model miss?
The dependency between ip1 on line 2 and ip1 on the last line. These are separated by many tokens (all the if-statements and return statements). A bigram model only sees the previous token, so when it reaches the final return, it has no memory of ip1 being declared earlier.
Module 2 · Slide 16

Model Evaluation: Extrinsic Evaluation

How do we know if a language model is good? One approach: test it on an actual downstream task and measure its performance.

1

Choose a Task

Pick a downstream task: code completion, code generation, bug detection, etc.

2

Run Both Models

Apply Model A and Model B to the same task with the same test data.

3

Measure Accuracy

How many tokens correctly predicted? How many mispredicted? Compare metrics.

4

Compare Results

The model with better task performance is the better language model for that task.

Model A
Code Completion
downstream task
Accuracy: 72%
Model B
Code Completion
downstream task
Accuracy: 85%
Drawback
Extrinsic evaluation is time-consuming. Running a full downstream task evaluation can take days or even weeks. We need a faster way to compare models during development.
Gold Standard
Extrinsic evaluation is the gold standard because it directly measures what we care about. But it is expensive — which motivates intrinsic metrics like perplexity.
Module 2 · Slide 17

Intrinsic Evaluation: Perplexity

Perplexity measures model quality independent of any application. Lower perplexity = better model (more confident, less “surprised” by the test data).

Perplexity (General)
PP(W) = P(w1 w2 ... wN)^(-1/N)

= (Πi=1..N 1/P(wi | w1...wi-1))^(1/N)
For Bigrams
PP(W) = (Πi=1..N 1/P(wi | wi-1))^(1/N)
Comparison
If Model A has perplexity 20 and Model B has perplexity 5, Model B is much more confident in its predictions. It is as if Model B is choosing among only 5 equally likely options at each step, while Model A is choosing among 20.
Caveat
Perplexity provides a bad approximation unless the test data looks like the training data. It is generally only useful in pilot experiments for quick model comparison.
Warning: Zero Probabilities
If any P(wi|context) = 0, then 1/0 = infinity, and perplexity goes to infinity. A single unseen n-gram breaks the entire metric! This is why we need smoothing.
Module 2 · Slide 18

Worked Example: Computing Perplexity

Let us walk through a full numerical example before using the interactive calculator.

Sequence (4 tokens)
for ( int i

Bigram probabilities:
P(for | <s>) = 0.15
P( ( | for ) = 0.85
P( int | ( ) = 0.40
P( i | int ) = 0.30
1

Multiply probabilities

0.15 × 0.85 × 0.40 × 0.30 = 0.0153

2

Compute log2 of each

-2.74 + (-0.23) + (-1.32) + (-1.74) = -6.03

3

Average (negative)

-(-6.03) / 4 = 1.508 (this is cross-entropy H)

4

Perplexity = 2^H

2^1.508 ≈ 2.84

Interpretation
A perplexity of 2.84 means the model is, on average, as uncertain as choosing between ~3 equally likely next tokens. Lower is better — a perfect model would have perplexity 1.
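The four steps above take a few lines in Python — this sketch reproduces the worked example and also checks the equivalent product form of perplexity:

```python
import math

# Bigram probabilities from the worked example.
probs = [0.15, 0.85, 0.40, 0.30]

# Cross-entropy H = -(1/N) Σ log2 p, then perplexity = 2^H.
H = -sum(math.log2(p) for p in probs) / len(probs)
pp = 2 ** H

print(round(H, 3), round(pp, 2))  # 1.508 2.84

# Equivalent: PP = (p1 · p2 · ... · pN)^(-1/N)
pp_direct = math.prod(probs) ** (-1 / len(probs))
```

Both routes give the same number; the log-space version is preferred in practice because the raw product underflows on long sequences.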
Module 2 · Slide 19

Cross-Entropy: The Information Theory Connection

Cross-entropy connects language modeling to information theory. It measures how well our model Q approximates the true distribution P.

Cross-Entropy
H(P, Q) = −Σx P(x) log2 Q(x)
In Language Modeling
P = the true distribution (actual next tokens)
Q = our model's predicted distribution

In practice, we approximate with the test set:
H ≈ -(1/N) Σ log2 Q(wi | context)
Key Relationship
Perplexity = 2^H(P,Q)
Why Log Base 2?
We measure information in bits. One bit of entropy means one binary choice. Log base 2 gives us units of bits — the standard in information theory.
Worked Examples
If our model assigns P = 0.5 to every next token:
H = -log2(0.5) = 1 bit, Perplexity = 2^1 = 2

If our model assigns P = 0.25:
H = -log2(0.25) = 2 bits, Perplexity = 2^2 = 4

Higher entropy = more uncertainty = higher perplexity.
Intuition
Cross-entropy measures the average number of bits needed to encode the true data using our model. A better model needs fewer bits.
Module 2 · Slide 20

Interactive: Perplexity Calculator

Adjust the bigram probabilities for each token and see how cross-entropy and perplexity change in real time. Try setting one probability to zero!

Perplexity Calculator interactive
Sequence: for ( int i = 0
Module 2 · Slide 21

Laplace & Add-k Smoothing

Zero-count n-grams break the model. Smoothing techniques redistribute probability mass to handle unseen events.

The Problem
If count(wi-1, wi) = 0, then P(wi|wi-1) = 0. A single zero probability makes the entire sequence probability zero and perplexity infinite.
Laplace (Add-1) Smoothing
Add 1 to all n-gram counts:
P(wi|wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V)
where V = vocabulary size. Simple but crude — steals too much mass from observed events.
Add-k Smoothing
Generalize Laplace by adding k instead of 1:
P(wi|wi-1) = (count(wi-1, wi) + k) / (count(wi-1) + kV)
k is a hyperparameter — optimize it on a held-out set.
Laplace Smoothing Example
Bigram       Raw Count   +1 Count   Raw P            Smoothed P
for (        18          19         18/100 = 0.180   19/110 = 0.173
for return   0           1          0/100 = 0.000    1/110 = 0.009
Assume count(for) = 100, V = 10
Why is Laplace smoothing considered too crude for large vocabularies?
It adds 1 to every possible bigram count, and adds V to the denominator. When V is very large (common in code), the total mass redistributed to unseen events becomes enormous, stealing too much probability from actually observed n-grams. Add-k and interpolation are better alternatives.
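The add-k formula is one line of arithmetic; this sketch reproduces the table's numbers (count(for) = 100, V = 10, k = 1):

```python
def add_k_prob(bigram_count, context_count, V, k=1):
    """Add-k smoothed P(w | context): (count + k) / (context_count + k·V)."""
    return (bigram_count + k) / (context_count + k * V)

# Seen bigram "for (": raw count 18.
print(round(add_k_prob(18, 100, V=10), 3))  # 19/110 ≈ 0.173

# Unseen bigram "for return": raw count 0, but nonzero after smoothing.
print(round(add_k_prob(0, 100, V=10), 3))   # 1/110 ≈ 0.009
```

Note that the seen bigram's probability drops from 0.180 to 0.173 — that lost mass is exactly what got redistributed to unseen events.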
Module 2 · Slide 22

Interpolation & Backoff

Instead of relying on a single n-gram order, blend multiple models together for more robust probability estimates.

Linear Interpolation
P(w | wn-2, wn-1) =
  λ1 · Puni(w)
+ λ2 · Pbi(w | wn-1)
+ λ3 · Ptri(w | wn-2, wn-1)

where λ1 + λ2 + λ3 = 1
Choosing λ Weights
Use a held-out validation set (separate from train and test). Try different λ combinations and pick those that maximize the likelihood of the held-out data.
Worked Example
Predict next token after for (:

Puni(int) = 0.05
Pbi(int | () = 0.40
Ptri(int | for, () = 0.65

With λ1=0.1, λ2=0.3, λ3=0.6:
P(int) = 0.1(0.05) + 0.3(0.40) + 0.6(0.65)
= 0.005 + 0.12 + 0.39 = 0.515
Key Insight
Interpolation blends multiple models. When the trigram has data, trust it; when it does not, the unigram and bigram components provide a safety net that prevents zero probabilities.
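The worked example above can be sketched directly, with the constraint on the λ weights made explicit:

```python
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Worked example: predicting "int" after "for (".
p = interpolated_prob(p_uni=0.05, p_bi=0.40, p_tri=0.65)
print(round(p, 3))  # 0.005 + 0.12 + 0.39 = 0.515
```

Even if the trigram probability were zero, the unigram and bigram terms would keep the result strictly positive — the safety-net behavior described above.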
Module 2 · Slide 23

Stupid Backoff & Sampling

Two more practical techniques for handling unseen n-grams and generating text from language models.

Stupid Backoff
If the n-gram count is zero, back off to the (n-1)-gram:

1. Try P(wi | wi-2, wi-1) — trigram
2. If zero, try P(wi | wi-1) — bigram
3. If still zero, use P(wi) — unigram

Simple and effective in practice. Not strictly a probability distribution unless adjusted, but works well for large-scale systems.
Trigram
Bigram
Unigram
back off if count = 0
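The three backoff steps can be sketched as a cascade of lookups. The sketch below also applies the conventional 0.4 discount per backoff step from Brants et al.'s original "stupid backoff" formulation — an addition beyond the three steps listed above, and part of why the result is a score rather than a true probability:

```python
def stupid_backoff(w, context, tri, bi, uni, alpha=0.4):
    """Score w after context = (w1, w2), backing off when counts are missing."""
    w1, w2 = context
    if tri.get((w1, w2, w)):          # 1. trigram available
        return tri[(w1, w2, w)]
    if bi.get((w2, w)):               # 2. back off to bigram, discounted
        return alpha * bi[(w2, w)]
    return alpha * alpha * uni.get(w, 0.0)  # 3. back off to unigram

# Hypothetical score tables for illustration.
tri = {("for", "(", "int"): 0.65}
bi = {("(", "String"): 0.10}
uni = {"foo": 0.01}

print(stupid_backoff("int", ("for", "("), tri, bi, uni))     # trigram hit: 0.65
print(stupid_backoff("String", ("for", "("), tri, bi, uni))  # bigram: 0.4 × 0.10
print(stupid_backoff("foo", ("for", "("), tri, bi, uni))     # unigram: 0.16 × 0.01
```

The scores across all words no longer sum to 1, which is exactly the "not strictly a probability distribution" caveat above.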
Sampling from Language Models
Instead of always picking the most likely next token, choose randomly according to the probability distribution:

High-probability tokens: more likely to be sampled
Low-probability tokens: less likely to be sampled

This generates more diverse and creative output while still following learned patterns.
Why Sample?
Always picking the top prediction (greedy decoding) produces repetitive and boring output. Sampling introduces variety — the model is more likely to generate high-probability sentences but can occasionally produce surprising and novel completions.
Module 2 · Slide 24

Temperature & Sampling Strategies

Temperature controls the “sharpness” of the probability distribution when sampling from a language model. Combined with top-k and top-p, it gives fine-grained control over generation.

Temperature (T)
T = 0 (greedy): always pick the highest-probability token
T = 1: sample from the actual learned distribution
T > 1: flatten the distribution — more random/creative
T < 1: sharpen the distribution — more focused/deterministic
Top-k Sampling
Only consider the k most probable tokens. Discard the rest and renormalize. Example: top-5 keeps only the 5 best candidates.
Top-p / Nucleus Sampling
Choose the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9). Adapts to the shape of the distribution — keeps fewer tokens when the model is confident.
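Temperature scaling on a probability distribution amounts to raising each probability to the power 1/T and renormalizing (equivalent to dividing the logits by T before the softmax). A minimal sketch with an illustrative three-token distribution:

```python
def apply_temperature(probs, T):
    """Rescale a distribution: p_i^(1/T), renormalized.
    T > 1 flattens, T < 1 sharpens, T -> 0 approaches greedy argmax."""
    scaled = [p ** (1.0 / T) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.70, 0.20, 0.10]
print(apply_temperature(probs, 1.0))  # unchanged
print(apply_temperature(probs, 2.0))  # flatter: the top token loses mass
print(apply_temperature(probs, 0.5))  # sharper: the top token gains mass
```

Top-k and top-p then act as filters on this rescaled distribution before the final sampling step.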
Temperature Visualizer interactive
Module 2 · Slide 25

Out-of-Vocabulary & the <UNK> Token

How do n-gram models handle tokens that were never seen during training? The open vocabulary problem is a fundamental challenge.

1

The Problem

New code uses identifiers not in training data: myCustomHandler, calcNetRevenue. These have zero probability.

2

Solution: <UNK> Token

Replace rare tokens (count < threshold) with a special <UNK> token before training.

3

Train with <UNK>

The model learns “rare token behavior” — what typically follows or precedes an uncommon identifier.

4

Map Unseen to <UNK>

At test time, any token not in the vocabulary is replaced with <UNK> and gets a nonzero probability.

The Tradeoff
Too aggressive (high threshold): many tokens become <UNK>, losing valuable information.

Too lenient (low threshold): vocabulary stays large, probabilities remain sparse. Finding the right threshold requires experimentation.
Modern Solution
The open vocabulary problem is why modern LLMs use Byte-Pair Encoding (BPE) instead of word-level tokenization. BPE breaks words into subword units, so even unseen identifiers can be represented as sequences of known subwords.
Example
Vocabulary = {for, int, if, return, ;, (, ), ...}
Input: calcNetRevenue → not in vocab → mapped to <UNK>
P(<UNK> | return) is now a valid, nonzero probability.
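The four steps above reduce to two small functions — build the vocabulary from training counts, then map everything else to <UNK> (toy corpus and threshold for illustration):

```python
from collections import Counter

def build_vocab(corpus_tokens, min_count=2):
    """Keep tokens seen at least min_count times; the rest become <UNK>."""
    freqs = Counter(corpus_tokens)
    return {t for t, c in freqs.items() if c >= min_count}

def map_unk(tokens, vocab):
    return [t if t in vocab else "<UNK>" for t in tokens]

corpus = ["return", ";", "return", ";", "if", "(", "if", "(", "calcNetRevenue"]
vocab = build_vocab(corpus, min_count=2)

print(map_unk(["return", "calcNetRevenue", ";"], vocab))
# ['return', '<UNK>', ';'] — the rare identifier is mapped to <UNK>
```

In a real pipeline the same mapping is applied to the training data before counting, so the model actually learns statistics for <UNK>.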
Module 2 · Slide 26

Large Language Models vs N-grams

LLMs use neural networks rather than explicit n-gram counts. They address two fundamental problems that limit n-gram models.

Problem 1: Parameter Explosion
The number of parameters in an n-gram model grows exponentially as the n-gram order increases. A 5-gram model with vocabulary V requires up to V^5 parameters. This quickly becomes intractable.
Problem 2: No Generalization
N-grams cannot generalize from training to test unless the exact same words are used. Seeing “the cat sat” teaches nothing about “the dog sat” because the model treats every word as an independent symbol.

N-gram LMs

  • Count-based statistics
  • Fixed context window
  • No semantic understanding
  • Cannot generalize
  • Fast, interpretable
vs

Neural LMs

  • Learned representations
  • Long-range context
  • Semantic embeddings
  • Generalizes to new data
  • GPU-intensive, less interpretable
N-grams
FFNNs
RNNs
Transformers
LLMs
This is where Module 4 picks up...
Module 2 · Slide 27

Lab Challenge: Build Your Own Bigram Model

Put it all together! Using the Java dataset from Module 1, build a bigram language model in Python.

1

Load & Tokenize

Load Java methods from the corpus and tokenize them into sequences of tokens.

2

Build Bigram Count Matrix

Count all consecutive token pairs across the training set.

3

Implement Add-k Smoothing

Add k to all counts to handle unseen bigrams. Experiment with k values.

4

Compute Perplexity

Evaluate your model on a held-out test set using the perplexity metric.

5

Generate Code

Use sampling to generate a simple code sequence from your trained model.

Starter Code Skeleton
import collections, math, random

def build_bigram_counts(corpus):
    """Count all bigrams in a tokenized corpus."""
    counts = collections.defaultdict(collections.Counter)
    for seq in corpus:
        for i in range(len(seq) - 1):
            counts[seq[i]][seq[i + 1]] += 1
    return counts

def bigram_prob(w1, w2, counts, V, k=1):
    """P(w2 | w1) with add-k smoothing; V is the vocabulary size."""
    num = counts[w1][w2] + k
    den = sum(counts[w1].values()) + k * V
    return num / den

def perplexity(test, counts, V, k=1):
    """Compute perplexity on the test set."""
    # Your implementation here
Module 2 · Slide 28

Recap & Knowledge Check

01

PLMs, MLE & Chain Rule

PLMs assign probabilities via MLE counting. The chain rule decomposes joint probability into conditionals.

02

N-gram Models

Use the Markov property with boundary tokens (<s>, </s>) to estimate probabilities from a fixed window.

03

Evaluation

Extrinsic: downstream tasks. Intrinsic: cross-entropy and perplexity (2^H). Lower perplexity = better model.

04

Smoothing & Interpolation

Laplace, add-k, linear interpolation, and stupid backoff address the zero-count problem.

05

Sampling & Temperature

Temperature, top-k, and top-p control generation diversity. OOV tokens are handled via <UNK> or BPE.

06

Naturalness & Beyond

Code is “natural” (Hindle et al.). LLMs overcome n-gram limitations using deep learning and embeddings.

Recap Quiz

Q1: What does the chain rule allow us to compute?
The joint probability of a sequence by decomposing it into a product of conditional probabilities. Correct
The average number of bits needed to encode a token, independent of the model. Incorrect
The optimal vocabulary size for an n-gram model trained on source code. Incorrect
Q2: Why is code more predictable than English text?
Because programming languages have fewer words than English. Incorrect
Because code follows strict syntax rules, coding conventions, and repetitive idioms that create predictable patterns. Correct
Because code is always written by a single person, reducing variability. Incorrect
Q3: What happens to perplexity when a model assigns zero probability to a token?
Perplexity drops to zero, indicating a perfect model. Incorrect
Perplexity remains unchanged because zero-probability tokens are ignored. Incorrect
Perplexity goes to infinity because we divide by zero in the computation. Correct
Q4: What are the two key limitations of n-gram models that LLMs overcome?
(1) Parameter explosion: the number of parameters grows exponentially with the n-gram order, making high-order n-grams intractable. (2) No generalization: n-gram models cannot transfer knowledge between similar but distinct n-grams (e.g., “the cat sat” teaches nothing about “the dog sat”). Neural LMs address both through learned embeddings that capture semantic similarity and shared representations.
References: Hindle et al., “On the Naturalness of Software” (ICSE 2012) • Jurafsky & Martin, “Speech and Language Processing” • “Recommending Code Tokens via N-gram Models”
🎉

Module Complete!

You've finished Probabilistic Source Code Modeling. Great work!