Probabilistic Source Code Modeling

How do we model the likelihood of code tokens in a sequence? This module covers n-gram language models, probability estimation, perplexity, smoothing, and why code is surprisingly predictable.

Probability Refresher

A quick recap of the probability concepts you will need throughout this module.

The sample space (S) is the set of all possible outcomes. For example, rolling a die gives S = {1, 2, 3, 4, 5, 6}. An event A = “roll an even number” = {2, 4, 6}, so P(A) = 3/6 = 0.5.

P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total outcomes}}

Conditional Probability

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

Example: A bag has 3 red and 2 blue marbles. You draw one and it is red (event B). What is the probability the next draw is also red (event A)? Since one red was removed, P(A|B) = 2/4 = 0.5.

The Chain Rule

To estimate the probability of an entire sentence, we need two building blocks: conditional probability and the chain rule.

From conditional probability we get the joint probability: P(A, B) = P(A) · P(B|A). For a sentence s = (w₁, w₂, …, wₙ), the chain rule expands this to:

P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1})

Walkthrough: “its water is so transparent”

Applying the chain rule step by step:

  • P(its)
  • × P(water | its)
  • × P(is | its water)
  • × P(so | its water is)
  • × P(transparent | its water is so)

The chain rule decomposes a joint probability into a product of conditional probabilities. This means we can compute the probability of any sequence by multiplying together the probability of each word given all the words that came before it.
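
To make the decomposition concrete, here is a minimal Python sketch that multiplies the chain-rule factors together. The conditional probabilities are made-up placeholder values, not estimates from a real corpus:

Python
import math

# Hypothetical conditional probabilities, one factor per chain-rule term
factors = [
    0.02,   # P(its)
    0.10,   # P(water | its)
    0.30,   # P(is | its water)
    0.05,   # P(so | its water is)
    0.01,   # P(transparent | its water is so)
]

# Chain rule: the sequence probability is the product of the factors
p_sentence = math.prod(factors)
print(f"P(its water is so transparent) = {p_sentence:.2e}")  # 3.00e-07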

The problem: to compute P(the | its water is so transparent that), we need Count(“its water is so transparent that the”) / Count(“its water is so transparent that”). Finding exact long sequences in any corpus is extremely unlikely — even huge corpora will never contain most long word sequences.

Maximum Likelihood Estimation

When we estimate probabilities by counting, we are using Maximum Likelihood Estimation (MLE) — the simplest and most intuitive way to learn a language model. Count occurrences, divide by total observations, and that is your probability estimate.

P_{\text{MLE}}(w \mid \text{context}) = \frac{\text{Count}(\text{context}, w)}{\text{Count}(\text{context})}

Worked Example

In a Java corpus, suppose Count(“(”, “for”) = 100 and Count(“(”) = 500. Then:

P_MLE(“for” | “(”) = 100 / 500 = 0.20
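
The same count-and-divide recipe takes only a few lines of Python. This is a sketch on a tiny made-up token stream, not a real Java corpus:

Python
from collections import Counter

# Toy token stream (illustrative only)
tokens = "for ( int i = 0 ; i < n ; i ++ )".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))   # Count(context, w)
context_counts = Counter(tokens)                   # Count(context)

def p_mle(w, context):
    # MLE: relative frequency of w immediately after context
    return bigram_counts[(context, w)] / context_counts[context]

print(p_mle("int", "("))   # 1.0 — in this toy stream, "(" is always followed by "int"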

MLE assigns P=0 to any event not seen in training. A single unseen bigram makes the probability of an entire sentence zero!

The Data Sparsity Problem

To compute exact sequence probabilities, we would need enough data to observe every possible word combination. But the number of combinations is astronomical.

With just 1,000 unique words, there are C(1000, 5) ≈ 8.25 × 10¹² possible 5-word combinations — and 1000⁵ = 10¹⁵ ordered 5-word sequences. No corpus is large enough to observe all of these.

The solution: instead of conditioning on the entire history, approximate by conditioning on only the last few words. This is the Markov assumption, and it gives us n-gram models.

N-gram Models & the Markov Assumption

The key idea: approximate the full conditional probability by looking at only the last N−1 tokens.

P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})

  • Unigram: P(w₁ … wₙ) ≈ Π P(wᵢ) — each word treated as independent
  • Bigram: P(wᵢ | w₁ … wᵢ₋₁) ≈ P(wᵢ | wᵢ₋₁) — condition on the previous word only
  • Trigram: P(wᵢ | w₁ … wᵢ₋₁) ≈ P(wᵢ | wᵢ₋₂, wᵢ₋₁) — condition on the previous two words
  • N-gram: condition on the last N−1 tokens. Higher N gives more context but needs exponentially more data.

Start & End Tokens

Language models need to know where sequences begin and end. Special boundary tokens <s> and </s> make this possible. Without <s>, the model cannot learn which tokens start sequences. Without </s>, the model cannot learn when to stop generating.

From <s> int x = 5 ; </s>, the resulting bigram pairs are:

(<s>, int) · (int, x) · (x, =) · (=, 5) · (5, ;) · (;, </s>)

Each adjacent pair is one bigram: the model learns P(right token | left token) for every such pair.
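
A few lines of Python make the padding and pair extraction concrete (the helper name is ours, chosen for illustration):

Python
def to_bigrams(tokens):
    # Wrap the sequence in boundary markers, then emit every adjacent pair
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))

print(to_bigrams("int x = 5 ;".split()))
# [('<s>', 'int'), ('int', 'x'), ('x', '='), ('=', '5'), ('5', ';'), (';', '</s>')]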

Zipf’s Law & the Naturalness of Software

Word frequencies in natural language (and code) follow a power-law distribution known as Zipf’s Law: a small number of words occur very frequently, while a large number occur rarely. About 20% of words account for 80% of all occurrences.

In code, tokens like ;, =, (, ), return, if dominate, while most custom identifiers like processUserData or calcTaxRate fall in the long tail.

Token frequency follows a power law: a handful of syntax tokens (such as semicolons and parentheses) dominate, while most identifiers appear rarely in a long tail.

The Naturalness Hypothesis

Hindle et al. (ICSE 2012) made a groundbreaking discovery: “Programming languages, in theory, are complex and flexible, but real programs are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models.”

Both natural language and source code follow statistical patterns and have repetitive structures. But code is constrained by strict syntax rules and coding conventions, making it even more predictable than English.

Because code is “natural” — repetitive and predictable — statistical language models can be effectively applied to SE tasks: code completion, bug detection, code recommendation, and more.

The Naturalness Hypothesis in Depth

The theoretical foundation for applying language models to source code is the naturalness hypothesis, articulated by Hindle et al. at ICSE 2012. Their key finding was surprising: despite the theoretical expressiveness of programming languages, the code that developers actually write is highly repetitive and statistically predictable — in many cases more so than English prose.

The Key Experiment

Hindle et al. trained n-gram models on both English text (the Brown corpus and the Gutenberg corpus) and Java source code (a large open-source corpus). They measured cross-entropy on held-out test data and found a striking result:

  • English cross-entropy: ~7.5 bits per token
  • Java cross-entropy: ~4.4 bits per token
  • Cross-entropy for code: ~41% lower

Lower cross-entropy means higher predictability. The n-gram model was substantially less “surprised” by Java code than by English text. This means that at each token position, the model had fewer plausible next tokens to choose from when processing code.

Why Is Code More Predictable?

Several factors contribute to the naturalness of software:

  • Strict syntax rules: programming languages enforce rigid grammatical structure. After for (, only a handful of tokens are valid (int, var, ;, an identifier), whereas English allows far more flexibility after any two-word prefix.
  • Coding conventions: style guides, naming conventions, and project norms reduce variability. Most Java developers write for (int i = 0; i < rather than creatively varying loop syntax.
  • Repetitive idioms: common patterns like null checks, getter/setter methods, iterator loops, and try-catch blocks appear thousands of times across codebases with near-identical structure.
  • API usage patterns: frameworks and libraries impose stereotyped call sequences. Using a BufferedReader almost always follows the same open-read-close pattern.
  • Copy-paste culture: developers frequently reuse code snippets from documentation, Stack Overflow, or other parts of the same project, further increasing repetition.

Implications for Software Engineering

The naturalness hypothesis opened the door to an entire research area — applying statistical and machine learning techniques to source code. If code is predictable, then models trained on large code corpora can support practical tools:

Code Completion

Predict the next token or sequence of tokens as a developer types, reducing keystrokes and cognitive load.

Bug Detection

Flag code that has unusually low probability under the model — surprising code is often buggy code.

Code Review

Identify non-idiomatic patterns by detecting deviations from the statistical norm of the codebase.

Code Migration

Learn translation patterns between languages or API versions by exploiting statistical regularities in both source and target.

The naturalness hypothesis is not just an observation — it is a research program. It asserts that the tools and techniques of natural language processing can be systematically applied to source code, precisely because code exhibits the same kind of statistical regularity (and often more) that makes NLP work on human languages.

Training a Bigram Model

Training an n-gram model is simply counting. Split your corpus into training (80%) and test (20%) sets, count all bigram occurrences, then normalize each row to get probabilities.

Here’s what training an n-gram model looks like in practice — iterate through every token sequence, extract contexts, and count:

Python
def train(self, methods, vocab):
    self.vocab = vocab
    self.vocab_size = len(vocab)
    for tokens in methods:
        # pad() wraps the sequence in boundary markers (n-1 start markers, one end marker)
        p = pad(tokens, self.n)
        for i in range(self.n - 1, len(p)):
            # Record counts for every context length 0 .. n-1 (unigram up to
            # the full n-gram) so lower orders are available for interpolation/backoff
            for k in range(1, self.n + 1):
                ctx = tuple(p[i - k + 1 : i])
                self.counts[ctx][p[i]] += 1   # counts/totals: nested defaultdicts from __init__
                self.totals[ctx] += 1

In practice, a 3-way split is preferred: training (70%) for learning the counts, a held-out/validation set (10%) for tuning hyperparameters like interpolation weights λ, and a test set (20%) for final evaluation. The held-out corpus is specifically used for optimizing interpolation weights and smoothing parameters, and must be kept strictly separate from both training and test data — otherwise you risk overfitting your hyperparameters to the test set.

To prevent data leakage, we split by repository — all methods from one repo go into the same pool:

Python
REPO_SPLITS = {
    "train": (1, 500),
    "val":   (501, 600),
    "test":  (601, 700),
}

def get_pool_name(rank):
    for pool_name, (lo, hi) in REPO_SPLITS.items():
        if lo <= rank <= hi:
            return pool_name
    return None

Simplified Count Matrix

Count matrix excerpt: rows are context tokens (public, static, Inet4Address, if, return) and columns are next tokens (static, void, Inet4Address, (, return, ip). Each cell holds the bigram count Count(context, next token).

Example: P(getEmbeddedIPv4CA | Inet4Address) = Count(Inet4Address, getEmbeddedIPv4CA) / Count(Inet4Address) = 15 / 300 = 0.05

Worked Example: Building a Bigram Model

Let us walk through the entire process of building a bigram model from scratch, using a tiny Java corpus. This end-to-end example makes every step concrete: tokenization, counting, normalization, and prediction.

Step 1: The Corpus

Suppose our entire training corpus consists of three short Java statements:

Java
int x = 0 ;
int y = x ;
return x ;

After adding boundary markers, our token sequences are:

  • <s> int x = 0 ; </s>
  • <s> int y = x ; </s>
  • <s> return x ; </s>

Step 2: Count All Bigrams

We scan every adjacent pair and tally the counts:

Bigram           Count
<s>, int         2
<s>, return      1
int, x           1
int, y           1
x, =             1
y, =             1
=, 0             1
=, x             1
0, ;             1
x, ;             2
;, </s>          3
return, x        1

Step 3: Compute Conditional Probabilities

For each context token, divide each bigram count by the total count of that context. For example, the context <s> appears 3 times total (2 followed by int, 1 followed by return):

  • P(int | <s>) = 2/3 = 0.667
  • P(return | <s>) = 1/3 = 0.333

The context int appears 2 times (once before x, once before y):

  • P(x | int) = 1/2 = 0.500
  • P(y | int) = 1/2 = 0.500

The context = appears 2 times (once before 0, once before x):

  • P(0 | =) = 1/2 = 0.500
  • P(x | =) = 1/2 = 0.500

And ; always leads to </s>: P(</s> | ;) = 3/3 = 1.000.

Step 4: Use the Model to Predict

Given the partial input <s> int, what does our model predict next? We look up the distribution P(? | int):

  • x → 50% — P(x | int) = 0.500
  • y → 50% — P(y | int) = 0.500

The model assigns equal probability to x and y. With greedy decoding, we could pick either; with sampling, each has a 50% chance. Notice that any other token has probability zero under raw MLE — the model has never seen int followed by anything else. This illustrates the zero-count problem on even a tiny corpus.

Step 5: Score a Full Sequence

What is P(<s> int x = 0 ; </s>) under our bigram model?

P(int|<s>) × P(x|int) × P(=|x) × P(0|=) × P(;|0) × P(</s>|;)
= 0.667 × 0.500 × 0.333 × 0.500 × 1.000 × 1.000
= 0.0556
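
A short, self-contained sketch reproduces the whole walkthrough — counting, normalizing, and scoring — on the same three-statement corpus:

Python
from collections import Counter, defaultdict

corpus = ["int x = 0 ;", "int y = x ;", "return x ;"]

# Step 2: count bigrams over the boundary-padded sequences
bigram_counts = defaultdict(Counter)
for line in corpus:
    toks = ["<s>"] + line.split() + ["</s>"]
    for prev, cur in zip(toks, toks[1:]):
        bigram_counts[prev][cur] += 1

# Step 3: conditional probability from the counts
def p(w, prev):
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][w] / total if total else 0.0

print(round(p("int", "<s>"), 3))   # 0.667
print(round(p("=", "x"), 3))       # 0.333

# Step 5: score the full padded sequence
seq = "<s> int x = 0 ; </s>".split()
prob = 1.0
for prev, cur in zip(seq, seq[1:]):
    prob *= p(cur, prev)
print(round(prob, 4))              # 0.0556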

This complete walkthrough — corpus to counts to probabilities to predictions — is exactly what happens at scale. Real n-gram models use the same algorithm on millions of tokens. The only additions are smoothing (to handle zero counts) and boundary tokens (which we already included).

Pros & Cons of N-gram Models

Strengths

  • No GPU needed — CPU computation is sufficient
  • Easy to implement — just counting and dividing
  • Fast to train and query
  • Interpretable — inspect count tables directly
  • Well-understood mathematical foundations

Limitations

  • Cannot capture long-distance relationships
  • Fixed context window (cannot see beyond N−1 tokens)
  • Data sparsity grows exponentially with N
  • No semantic understanding of tokens
  • Cannot generalize to unseen n-grams

The long-range dependency problem: consider a variable ip1 declared at the top of a method but used many lines later in the return statement. A bigram or trigram model would have no knowledge of ip1 when predicting the final return — the declaration is far beyond its context window.

Java
public static Inet4Address getEmbeddedIPv4CA(Inet6Address ip) {
  IP ip1 = new IP("127.0.0.1");            // ← declared here
  if (isCompatIPv4Address(ip)) {
    return getCompatIPv4Address(ip);
  }
  if (is6to4Address(ip)) {
    return get6to4IPv4Address(ip);
  }
  if (isTeredoAddress(ip)) {
    return getTeredoInfo(ip).getClient();
  }
  return ip1.getDefault();                 // ← used here (many tokens later)
}

N-gram models are powerful for local patterns but fundamentally limited by their fixed context window. They are excellent baselines and remain useful in practice, but complex code understanding requires models that can look further back.

Evaluation: Perplexity & Cross-Entropy

Extrinsic evaluation tests a model on a downstream task (code completion, bug detection) and measures accuracy. It is the gold standard but expensive and time-consuming.

Intrinsic evaluation uses perplexity — a measure of model quality independent of any application. Lower perplexity = better model (less “surprised” by the test data).

\text{PP}(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})} \right)^{1/N}

Worked Example

Sequence: for ( int i (4 tokens)

  • P(for | <s>) = 0.15
  • P( ( | for ) = 0.85
  • P( int | ( ) = 0.40
  • P( i | int ) = 0.30

Product: 0.15 × 0.85 × 0.40 × 0.30 = 0.0153. Cross-entropy H = 1.508 bits. Perplexity = 2^1.508 ≈ 2.84 — the model is as uncertain as choosing between ~3 equally likely next tokens.

In code, perplexity is computed in log-space to avoid floating-point underflow from multiplying many small probabilities:

Python
def perplexity(self, methods):
    # Accumulate log2-probabilities over every token in the test set
    # (assumes `import math` at module level and a smoothed self.prob > 0)
    log_prob, n_tok = 0.0, 0
    for tokens in methods:
        p = pad(tokens, self.n)
        for i in range(self.n - 1, len(p)):
            ctx = tuple(p[i - self.n + 1 : i])   # the previous n-1 tokens
            log_prob += math.log2(self.prob(p[i], ctx))
            n_tok += 1
    # Perplexity = 2^(cross-entropy) = 2^(-average log2 probability)
    return 2.0 ** (-(log_prob / n_tok)) if n_tok else float("inf")

Cross-Entropy

H(P, Q) = -\sum_x P(x) \log_2 Q(x) \qquad \text{Perplexity} = 2^{H(P,Q)}

If any P(wᵢ | context) = 0, perplexity goes to infinity. A single unseen n-gram breaks the entire metric! This is why we need smoothing.

Perplexity Calculation Walkthrough

Let us take the bigram model we built in the worked example and compute perplexity on a test sentence, step by step. This makes every probability lookup and arithmetic operation explicit.

The Test Sentence

We want to evaluate our bigram model on the unseen sequence:

<s> int x = 0 ; </s>

This gives us N = 6 bigram transitions to evaluate (from <s> through </s>).

Step 1: Look Up Every Bigram Probability

Using the conditional probabilities we computed from the training corpus:

Position   Bigram        P(wᵢ | wᵢ₋₁)      log₂ P
1          <s> → int     2/3 = 0.6667      −0.585
2          int → x       1/2 = 0.5000      −1.000
3          x → =         1/3 = 0.3333      −1.585
4          = → 0         1/2 = 0.5000      −1.000
5          0 → ;         1/1 = 1.0000       0.000
6          ; → </s>      3/3 = 1.0000       0.000

Step 2: Sum the Log Probabilities

Working in log-space prevents floating-point underflow:

Σ log₂ P = (−0.585) + (−1.000) + (−1.585) + (−1.000) + 0.000 + 0.000
= −4.170

Step 3: Compute Cross-Entropy

Cross-entropy is the negative average of the log probabilities:

H = −(1/N) × Σ log₂ P = −(−4.170) / 6 = 0.695 bits

Step 4: Compute Perplexity

Perplexity is 2 raised to the cross-entropy:

PP = 2^H = 2^0.695 ≈ 1.62

  • Cross-entropy: 0.695 bits per token
  • Perplexity: 1.62 effective choices per token
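
The same arithmetic in a few lines of Python, using the six bigram probabilities from the table above:

Python
import math

probs = [2/3, 1/2, 1/3, 1/2, 1.0, 1.0]       # P(wᵢ | wᵢ₋₁) for each transition

log_sum = sum(math.log2(p) for p in probs)    # ≈ −4.170
H = -log_sum / len(probs)                     # cross-entropy ≈ 0.695 bits per token
PP = 2 ** H                                   # perplexity ≈ 1.62
print(round(log_sum, 3), round(H, 3), round(PP, 2))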

Interpreting the Result

A perplexity of 1.62 is very low — the model is almost never surprised by this test sentence. This makes sense: the sentence int x = 0 ; appeared verbatim in the training data. The model has seen every bigram before, and several transitions (like 0 → ; and ; → </s>) have probability 1.0, contributing zero uncertainty.

In practice, perplexity on held-out test data is always higher than on training data. Typical values for well-trained code models range from 5 to 50, depending on the model order, corpus size, and vocabulary.

Watch out for train/test overlap. If your test data closely mirrors the training data, perplexity will be misleadingly low. Always split by repository (not by file or by line) to ensure the model is evaluated on genuinely unseen code.

Smoothing: Handling Zero Counts

If count(wᵢ₋₁, wᵢ) = 0, then P(wᵢ|wᵢ₋₁) = 0. A single zero probability makes the entire sequence probability zero and perplexity infinite.

Laplace (Add-1) Smoothing

P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i) + 1}{\text{count}(w_{i-1}) + V}

where V = vocabulary size. Simple but crude — steals too much mass from observed events.

Bigram        Raw Count   +1 Count   Raw P            Smoothed P
for (         18          19         18/100 = 0.180   19/110 = 0.173
for return    0           1          0/100 = 0.000    1/110 = 0.009

(Assumes count(for) = 100 and V = 10.)

Add-k Smoothing

P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i) + k}{\text{count}(w_{i-1}) + kV}

Generalize Laplace by adding k instead of 1. k is a hyperparameter — optimize it on a held-out set. Laplace is too crude for large vocabularies because it adds 1 to every possible bigram, redistributing enormous mass to unseen events.

The add-k formula translates directly into a single line of Python:

Python
def prob_additive(self, token, ctx):
    return (self.counts.get(ctx, {}).get(token, 0) + self.alpha) / \
           (self.totals.get(ctx, 0) + self.alpha * self.vocab_size)

Smoothing Techniques Compared

Different smoothing methods handle the zero-count problem in different ways. Let us compare the three most important approaches using the same unseen bigram, so you can see exactly how each one redistributes probability mass.

Setup

Suppose our vocabulary has V = 1,000 tokens. The context token return appeared 200 times in training. The bigram return newList was never observed (count = 0). Meanwhile, the bigram return null was observed 80 times.

Comparison Grid

  • MLE (no smoothing): P(newList | return) = 0/200 = 0.0000 (unseen); P(null | return) = 80/200 = 0.4000 (seen). Unseen events get zero — breaks perplexity.
  • Laplace (add-1): unseen = (0+1)/(200+1000) = 0.000833; seen = (80+1)/(200+1000) = 0.0675. Steals too much mass — null drops from 0.40 to 0.07.
  • Add-k (k = 0.01): unseen = (0+0.01)/(200+10) = 0.0000476; seen = (80+0.01)/(200+10) = 0.3810. Better balance — small mass to unseen, observed events mostly preserved.
  • Kneser-Ney: unseen backed by continuation probability ≈ 0.0003; seen = count minus discount d ≈ 0.3950. Best overall — uses context diversity, not raw frequency.

Why Laplace Fails at Scale

The numbers above make the problem clear. With V = 1,000, Laplace adds 1 to every possible bigram count and adds V = 1,000 to the denominator. The observed bigram return null drops from probability 0.40 to 0.07 — an 83% reduction. Meanwhile, each of the ~920 never-seen bigrams after return receives probability 0.000833. In total, these unseen events steal the majority of the probability mass from events we actually observed.
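
A small sketch reproduces the numbers in the comparison grid; the vocabulary size, counts, and k are taken from the setup above:

Python
V = 1000        # vocabulary size
c_ctx = 200     # Count("return")
c_seen = 80     # Count("return", "null")
c_unseen = 0    # Count("return", "newList")

estimators = {
    "MLE":     lambda c: c / c_ctx,
    "Laplace": lambda c: (c + 1) / (c_ctx + V),
    "Add-k":   lambda c, k=0.01: (c + k) / (c_ctx + k * V),
}

for name, est in estimators.items():
    print(f"{name:8s} unseen={est(c_unseen):.6f}  seen={est(c_seen):.4f}")
# MLE      unseen=0.000000  seen=0.4000
# Laplace  unseen=0.000833  seen=0.0675
# Add-k    unseen=0.000048  seen=0.3810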

Kneser-Ney: The Gold Standard

Kneser-Ney smoothing goes beyond simple count adjustment. Instead of asking “how often did w appear after this context?”, it asks “how many different contexts does w appear in?” This is called the continuation probability.

The intuition: a token like Francisco has a high raw unigram count (because it appears many times in “San Francisco”), but it only ever follows one context (San). Its continuation probability is low, reflecting that it is not a versatile word. In contrast, null follows many different contexts (return, =, !=, ==), so its continuation probability is high.

P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max(\text{count}(w_{i-1}, w_i) - d,\; 0)}{\text{count}(w_{i-1})} + \lambda(w_{i-1}) \cdot P_{\text{continuation}}(w_i)

where d is a fixed discount (typically 0.75) and λ is a normalizing weight that distributes the discounted mass.
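
A bigram-only sketch of this formula (absolute discounting plus a continuation-probability back-off). It assumes the counts live in a dict of Counters like the tables built during training; Modified Kneser-Ney adds refinements not shown here:

Python
from collections import Counter

def make_kneser_ney(bigram_counts, d=0.75):
    # bigram_counts: {prev_token: Counter({next_token: count})}
    # Continuation counts: in how many distinct contexts does each word appear?
    continuation = Counter()
    for nexts in bigram_counts.values():
        continuation.update(nexts.keys())
    n_bigram_types = sum(continuation.values())

    def p_kn(w, prev):
        c_prev = sum(bigram_counts[prev].values())
        # Discounted bigram estimate (zero if the bigram was never seen)
        discounted = max(bigram_counts[prev][w] - d, 0) / c_prev
        # Mass freed by discounting, redistributed by continuation probability
        lam = d * len(bigram_counts[prev]) / c_prev
        p_cont = continuation[w] / n_bigram_types
        return discounted + lam * p_cont

    return p_kn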

In practice, Modified Kneser-Ney smoothing is the best-performing smoothing technique for n-gram models. It consistently outperforms add-k and linear interpolation across both natural language and code corpora. If you are building an n-gram model for production use, Kneser-Ney should be your default choice.

Interpolation & Backoff

Instead of relying on a single n-gram order, blend multiple models together for more robust probability estimates.

Linear Interpolation

P(w \mid w_{n-1}) = \lambda_1 \cdot P_{\text{uni}}(w) + \lambda_2 \cdot P_{\text{bi}}(w \mid w_{n-1}) + \lambda_3 \cdot P_{\text{tri}}(w \mid w_{n-2}, w_{n-1})

where λ₁ + λ₂ + λ₃ = 1. Choose weights using a held-out validation set.

Worked Example

Predict next token after for ( with λ₁=0.1, λ₂=0.3, λ₃=0.6:

  • P_uni(int) = 0.05
  • P_bi(int | “(”) = 0.40
  • P_tri(int | for, “(”) = 0.65

P(int) = 0.1(0.05) + 0.3(0.40) + 0.6(0.65) = 0.005 + 0.12 + 0.39 = 0.515
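
In its general form, interpolation uses tuned weights rather than equal ones. A tiny sketch with the hypothetical values from this example:

Python
# Component estimates and weights from the worked example (weights sum to 1)
p_uni, p_bi, p_tri = 0.05, 0.40, 0.65
l1, l2, l3 = 0.1, 0.3, 0.6

p_interp = l1 * p_uni + l2 * p_bi + l3 * p_tri
print(round(p_interp, 3))   # 0.515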

Interpolation blends all sub-orders from unigram through n-gram, each weighted equally by 1/n:

Python
def prob_interpolation(self, token, ctx):
    lam = 1.0 / self.n
    p = 0.0
    for k in range(1, self.n + 1):
        sub_ctx = ctx[-(k-1):] if k > 1 else ()
        c = self.counts.get(sub_ctx, {}).get(token, 0)
        t = self.totals.get(sub_ctx, 0)
        p += lam * (c + self.alpha) / (t + self.alpha * self.vocab_size)
    return p

Stupid Backoff

If the n-gram count is zero, back off to the (n−1)-gram: try trigram → bigram → unigram.

Trigram P(w | w₂, w₁) — count = 0? → Bigram λ · P(w | w₁) — count = 0? → Unigram λ² · P(w)

Stupid backoff cascade: try the highest-order model first, fall back with a λ = 0.4 penalty whenever the count is zero.

The formal definition from Brants et al.:

S(w_i \mid w_{i-n+1:i-1}) = \begin{cases} \dfrac{\text{count}(w_{i-n+1:i})}{\text{count}(w_{i-n+1:i-1})} & \text{if count} > 0 \\[8pt] \lambda \cdot S(w_i \mid w_{i-n+2:i-1}) & \text{otherwise} \end{cases}

Because we back off without proper renormalization, this is not strictly a probability distribution — but it works well in practice at scale. Brants et al. found λ = 0.4 works well across many tasks.
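
A recursive sketch of stupid backoff, reusing the counts and totals tables built by the train() method earlier; the tiny unigram floor for tokens never seen at all is our own addition:

Python
def stupid_backoff(token, ctx, counts, totals, lam=0.4):
    # Use the highest-order estimate whenever its count is non-zero
    if counts.get(ctx, {}).get(token, 0) > 0:
        return counts[ctx][token] / totals[ctx]
    if not ctx:
        return 1e-9                      # floor for completely unseen tokens
    # Otherwise drop the oldest context token and apply the lambda penalty
    return lam * stupid_backoff(token, ctx[1:], counts, totals, lam)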

Backoff and interpolation share the same idea: when the highest-order n-gram has data, trust it; when it doesn't, lower-order models provide a safety net that prevents zero probabilities.

Temperature & Sampling

Always picking the most likely next token (greedy decoding) produces repetitive, boring output. Sampling introduces variety by choosing randomly according to the probability distribution.

Temperature (T) controls the sharpness of the distribution:

  • T = 0 (greedy) — always pick the highest-probability token
  • T = 1 — sample from the actual learned distribution
  • T > 1 — flatten the distribution — more random/creative
  • T < 1 — sharpen the distribution — more focused/deterministic

Low temperature concentrates probability on the top prediction; high temperature spreads it across more tokens.

Top-k sampling: only consider the k most probable tokens, discard the rest and renormalize.

Top-p / nucleus sampling: choose the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9). Adapts to the shape of the distribution — keeps fewer tokens when the model is confident.
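
The sketch below combines all three strategies on top of a next-token distribution. The function and parameter names are ours; the input is assumed to be a dict mapping candidate tokens to probabilities (here, the T = 1.0 distribution from the example that follows):

Python
import random

def sample_next(dist, temperature=1.0, top_k=None, top_p=None):
    # Temperature-scale the distribution: p^(1/T), then renormalize
    scaled = {t: p ** (1.0 / temperature) for t, p in dist.items() if p > 0}
    items = sorted(scaled.items(), key=lambda kv: -kv[1])

    if top_k is not None:          # keep only the k most probable tokens
        items = items[:top_k]
    if top_p is not None:          # keep the smallest nucleus reaching cumulative prob p
        total = sum(p for _, p in items)
        kept, cum = [], 0.0
        for t, p in items:
            kept.append((t, p))
            cum += p / total
            if cum >= top_p:
                break
        items = kept

    # Renormalize what is left and draw one token at random
    total = sum(p for _, p in items)
    r, cum = random.uniform(0, total), 0.0
    for t, p in items:
        cum += p
        if r <= cum:
            return t
    return items[-1][0]

dist = {"ArrayList": 0.437, "HashMap": 0.263, "String": 0.142, "Object": 0.105, "int": 0.053}
print(sample_next(dist, temperature=0.3, top_k=3))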

Concrete Example: Temperature in Action

Suppose our model has just processed the context return new and the raw logits (unnormalized scores) for the top five candidate tokens are:

Token       Raw logit   T = 0.3 (focused)   T = 1.0 (neutral)   T = 2.0 (creative)
ArrayList   5.0         0.876               0.437               0.288
HashMap     3.5         0.101               0.263               0.241
String      2.0         0.012               0.142               0.197
Object      1.5         0.006               0.105               0.175
int         1.0         0.003               0.053               0.099

The temperature-scaled softmax formula divides each logit by T before applying softmax:

P(w_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}

Reading across the table:

  • T = 0.3: ArrayList dominates with 87.6% probability. The model is almost certain — this is nearly greedy decoding. Useful for code completion where correctness matters more than variety.
  • T = 1.0: the original learned distribution. ArrayList is still the top choice at 43.7%, but alternatives have meaningful probability. A reasonable default.
  • T = 2.0: the distribution is nearly flat. Even int (the least likely candidate) has a 9.9% chance. This produces diverse but potentially incorrect or surprising output.

Temperature is not a binary choice. In practice, code generation tasks use low temperatures (0.2–0.5) for deterministic completions and moderate temperatures (0.7–1.0) for generating diverse test cases or creative suggestions. Values above 1.5 are rarely useful for code.

Out-of-Vocabulary & the <UNK> Token

New code uses identifiers not in training data: myCustomHandler, calcNetRevenue. These have zero probability.

Solution: replace rare tokens (count < threshold) with a special <UNK> token before training. The model learns “rare token behavior” — what typically follows or precedes an uncommon identifier. At test time, any unseen token maps to <UNK> and gets a nonzero probability.

The tradeoff: threshold too high loses valuable information; too low keeps sparsity. Finding the right balance requires experimentation.

Modern solution: this is why modern LLMs use BPE (Byte Pair Encoding) instead of word-level tokenization. BPE breaks words into subword units, so even unseen identifiers can be represented as sequences of known subwords — connecting back to the tokenization concepts from Module 1.

Building a vocabulary from training data (optionally dropping rare tokens below a frequency threshold) and mapping out-of-vocabulary tokens to <UNK>:

Python
from collections import Counter

def build_vocabulary(methods, min_count=1):
    # Count token frequencies; tokens below min_count are treated as rare
    # and excluded, so they map to <UNK> along with unseen test tokens
    freq = Counter()
    for m in methods:
        freq.update(m['tokenized_code'].split())
    return {t for t, c in freq.items() if c >= min_count}

def apply_unk(methods, vocab):
    # Replace every out-of-vocabulary token with the special <UNK> symbol
    result = []
    for m in methods:
        tokens = m['tokenized_code'].split()
        new_tokens = [t if t in vocab else '<UNK>' for t in tokens]
        result.append(' '.join(new_tokens))
    return result

From N-grams to Neural Language Models

N-gram models have two fundamental limitations that neural language models overcome:

  1. Parameter explosion: the number of parameters grows exponentially with the n-gram order. A 5-gram model with vocabulary V requires up to V⁵ parameters — quickly intractable.
  2. No generalization: n-gram models cannot transfer knowledge between similar contexts. Seeing “the cat sat” teaches nothing about “the dog sat” because the model treats every word as an independent symbol.

N-gram LMs

  • Count-based statistics
  • Fixed context window
  • No semantic understanding
  • Cannot generalize
  • Fast, interpretable

Neural LMs

  • Learned representations
  • Long-range context
  • Semantic embeddings
  • Generalizes to new data
  • GPU-intensive, less interpretable

N-grams → FFNNs → RNNs → Transformers → LLMs

Why N-grams Hit a Ceiling

Despite their elegance, n-gram models face three hard limits that no amount of data or smoothing can fully overcome:

  1. Sparsity is exponential: a bigram model with vocabulary V has V² possible entries; a 5-gram model has V⁵. For a modest code vocabulary of 50,000 tokens, a 5-gram table would need 50,000⁵ ≈ 3.1 × 10²³ cells — most of which will be zero regardless of corpus size.
  2. Fixed context is blind: an n-gram model with n = 5 cannot see the variable declared 20 tokens ago, the function signature 50 tokens ago, or the import statement 200 tokens ago. Code dependencies routinely span these distances.
  3. No shared representations: to an n-gram model, ArrayList and LinkedList are completely unrelated symbols. Seeing return new ArrayList a thousand times teaches the model nothing about return new LinkedList — it must learn each combination independently from scratch.

How Neural Models Overcome These Limits

Each generation of neural architecture addressed a specific n-gram weakness:

Feed-Forward Neural LMs (Bengio, 2003)

Replace the sparse count table with a dense neural network. Each word is mapped to a continuous embedding vector, so similar words (e.g., ArrayList and LinkedList) share nearby representations. Solves the generalization problem, but still uses a fixed context window.

Recurrent Neural Networks

Process tokens one at a time, maintaining a hidden state that accumulates information from all previous tokens. In theory, the context window is unlimited. In practice, vanilla RNNs struggle with long-range dependencies due to vanishing gradients; LSTMs and GRUs partially address this.

Transformers (Vaswani et al., 2017)

Use self-attention to let every token attend directly to every other token in the sequence, regardless of distance. No recurrence needed — attention computes pairwise relevance in parallel. This solves both the fixed-window problem and the vanishing gradient problem.

Large Language Models

Scale transformers to billions of parameters, trained on trillions of tokens of code and text. Models like Codex, CodeLlama, and GPT-4 achieve perplexities that dwarf what any n-gram model can reach, and they can generate entire functions from natural language descriptions.

What Carries Forward

Despite these advances, the concepts from this module remain foundational. Neural language models still estimate P(next token | context) — the same conditional probability we have been computing. They are still evaluated using perplexity and cross-entropy. They still use temperature and top-k/top-p sampling for generation. And the naturalness hypothesis still explains why these models work so well on code.

N-gram models are not obsolete — they are the conceptual foundation. Every idea in this module (conditional probability, the chain rule, perplexity, smoothing, sampling) carries directly into neural language models. The architecture changes, but the probabilistic framework stays the same. Module 04 (Deep Learning) picks up exactly where this module leaves off.

Try It Yourself

Assignment: Recommending Code Tokens via N-gram Models

Build an n-gram language model for Java code completion. Train models for n ∈ {3, 5, 7}, implement smoothing to handle unseen n-grams, and evaluate using perplexity.

  1. Mine and prepare Java method-level data (replicate the MSR pipeline from Module 1)
  2. Create three nested training sets (T1 ≤ 15K, T2 ≤ 25K, T3 ≤ 35K), plus validation (~1K) and test (~1K) sets
  3. Build a separate vocabulary for each training set, map unseen tokens to <UNK>
  4. Train n-gram models for n ∈ {3, 5, 7} with smoothing (add-α or interpolation)
  5. Select the best configuration using validation perplexity
  6. Evaluate on both a provided test set and a self-created test set

Open the Exercise in Google Colab →

Hindle et al., “On the Naturalness of Software” (ICSE 2012) · Jurafsky & Martin, “Speech and Language Processing” · Brants, Popat, Xu, Och, Dean, “Large Language Models in Machine Translation” (Google) · “Recommending Code Tokens via N-gram Models”

Code examples in this module are adapted from an implementation by Walker Hyman, an undergraduate student in Generative AI for Software Development at William & Mary.