Module 4 · Slide 01

Deep Learning for
Software Development Foundations

This module covers the core deep learning concepts and architectures that power modern AI-driven software engineering tools — from non-generative classification tasks to generative models based on LSTMs and Transformers.

Part A — Non-Generative
Clone detection, vulnerability prediction, code smell detection — classification and prediction tasks that do not produce new code.
Part B — Embeddings & LSTMs
Word embeddings, BPE tokenization, LSTM architecture, seq2seq with attention, and beam search for code generation.
Part C — Transformers
Self-attention, multi-head attention, positional encoding, pre-training, fine-tuning, and the modern code model ecosystem.
Learning Objectives
Distinguish generative vs. non-generative DL4SE tasks. Understand key architectures (LSTM, Transformer) and how pre-trained models are fine-tuned for SE.
Module 4 · Slide 02

Non-Generative Tasks Overview

Non-generative DL4SE tasks classify or predict properties of existing code without producing new source code. The model answers a question about the code rather than writing new code.

Definition
Non-generative tasks take code as input and output a label or score — e.g., "vulnerable" / "safe", or "clone" / "not clone".
Key Examples
Code Clone Detection — find duplicate or near-duplicate code fragments.
Vulnerability Prediction — flag code likely to contain security flaws.
Code Smell Detection — identify design problems like God Class or Long Method.
Pipeline Pattern
Most non-generative tasks follow the same pattern:
Source Code
Feature Extraction
Classifier / DNN
Label / Score
Contrast
Generative tasks (code generation, summarization, translation) produce a sequence of new tokens. We cover those in Parts B and C.
Module 4 · Slide 03

Neural Network Fundamentals

A primer on the building blocks of every deep learning model — the artificial neuron and how neurons are organized into layers.

The Neuron
A single neuron computes a weighted sum of its inputs, adds a bias, and applies a non-linear activation function:

y = σ(Σ wᵢxᵢ + b)

xᵢ — input values (e.g., code metrics).
wᵢ — learnable weights (importance of each input).
b — bias term (shifts the decision boundary).
σ — activation (ReLU, sigmoid, tanh).
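As a concrete toy illustration, the neuron formula above can be written in a few lines of Python; the weights and inputs below are made-up values for demonstration, not learned ones:

```python
import math

def neuron(x, w, b, activation="sigmoid"):
    """A single artificial neuron: y = activation(sum(w_i * x_i) + b).
    Illustrative sketch; real weights are learned by training."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    if activation == "relu":
        return max(0.0, z)                        # ReLU: max(0, z)
    return 1.0 / (1.0 + math.exp(-z))             # sigmoid: output in (0, 1)

# Example: two inputs (e.g., code metrics) with hypothetical weights.
y = neuron(x=[0.5, 2.0], w=[0.8, -0.3], b=0.1)
```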
Common Activations
ReLU: max(0, x) — fast, default for hidden layers.
Sigmoid: 1/(1+e⁻ˣ) — output between 0 and 1.
Softmax: normalizes outputs to probabilities.
Layers
Neurons are stacked into layers to form a network:

Input Layer — receives raw features (e.g., token counts, complexity metrics).

Hidden Layers — learn intermediate representations. More layers = deeper network = more abstract features.

Output Layer — produces the final prediction (a class label, a probability, a score).
NETWORK STRUCTURE
Input
features
Hidden 1
learned repr.
Hidden 2
abstract feat.
Output
prediction
Key Insight
A neural network is just a function that maps inputs to outputs through layers of simple computations. The “learning” happens by adjusting the weights.
Module 4 · Slide 04

How Neural Networks Learn: Backpropagation

High-level intuition for how a neural network adjusts its weights to improve predictions — the training loop that drives all deep learning.

1

Forward Pass

Input flows through the network layer by layer, producing a prediction at the output. Each neuron applies its weights, bias, and activation function.

2

Loss Computation

Compare the prediction to the ground truth using a loss function (e.g., cross-entropy for classification, MSE for regression). The loss is a single number measuring how wrong the prediction is.

3

Backward Pass (Backpropagation)

Compute gradients — how much each weight contributed to the error. Uses the chain rule from calculus, applied systematically from output back to input.

4

Weight Update (Gradient Descent)

Adjust each weight in the direction that reduces the loss: w = w - lr * gradient. The learning rate (lr) controls step size.

5

Repeat

Iterate over the training data many times (epochs). Each pass through the full dataset refines the weights further. Stop when validation performance plateaus.

Key Insight
Backprop is just the chain rule from calculus applied systematically. You don’t need to derive it — frameworks like PyTorch do it automatically with loss.backward().
The Training Loop in Code
optimizer.zero_grad()       # clear gradients from the previous step
prediction = model(input)   # forward pass
loss = criterion(prediction, label)
loss.backward()             # backprop: compute gradients
optimizer.step()            # update weights
Module 4 · Slide 05

Training in Practice: Hyperparameters

Settings you choose before training begins. Unlike weights (which are learned), hyperparameters must be set by the practitioner.

Learning Rate
How big each weight update step is. Too high: loss diverges. Too low: training is painfully slow. Typical: 1e-3 to 1e-5. The single most important hyperparameter.
Batch Size
How many examples to process before updating weights. Common values: 32, 64, 128. Larger batches = more stable gradients but more memory.
Epochs
How many complete passes through the full training dataset. Too few: underfitting. Too many: overfitting. Monitor validation loss to decide.
Dropout
Randomly disable neurons during training (e.g., 10-50% chance). Forces the network to learn redundant representations, preventing overfitting.
Early Stopping
Stop training when validation loss starts increasing (the model is memorizing, not generalizing). Saves compute and prevents overfitting.
Optimizer
The algorithm for gradient descent. Adam is the default choice — adapts learning rate per-parameter. SGD with momentum is a common alternative.
Practical Advice
You rarely need to tune these from scratch. Start with published defaults and adjust based on validation performance. Learning rate is the first knob to turn; batch size and dropout come second.
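The learning-rate advice above can be made concrete with a toy experiment: running plain gradient descent on f(w) = w² (gradient 2w), a small step converges while an oversized one diverges. The function and values are illustrative only:

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimize f(w) = w^2, whose gradient is 2w, to show the
    effect of learning rate. Toy setup, not a real model."""
    for _ in range(steps):
        w = w - lr * 2 * w   # w = w - lr * gradient
    return w

print(gradient_descent(lr=0.1))   # converges toward the minimum at 0
print(gradient_descent(lr=1.1))   # diverges: each step overshoots and grows
```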
Module 4 · Slide 06

Code Clone Detection: Types I – IV

Code clones are pairs of code fragments that are similar. They are classified into four types based on the degree of similarity.

Type I — Exact Clone
Identical code except for whitespace, layout, and comments. A straight copy-paste.
Type II — Renamed Clone
Syntactically identical except for identifier names, literal values, or type references.
Type III — Gapped Clone
Similar structure but with statements added, removed, or modified. Near-miss clone.
Type IV — Semantic Clone
Same functionality, completely different syntax. E.g., iterative vs. recursive implementations.
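A minimal illustration of a Type IV pair, following the slide's iterative-vs-recursive example:

```python
def sum_iterative(arr):
    """Iterative implementation."""
    total = 0
    for x in arr:
        total += x
    return total

def sum_recursive(arr):
    """Recursive implementation: same behavior, different syntax."""
    if not arr:
        return 0
    return arr[0] + sum_recursive(arr[1:])

# Same functionality with little structural overlap: a Type IV clone pair.
```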
Clone Classifier Interactive

Examine the code pair and classify the clone type:

Module 4 · Slide 07

Clone Detection Approaches

Different techniques trade off precision, recall, and the ability to detect higher-type clones.

1

Text-Based

Compare raw text or normalized lines. Fast but limited to Type I / II.

2

Token-Based

Tokenize code and compare token sequences with suffix trees or hashing. Handles Type I-III.

3

AST-Based

Parse into Abstract Syntax Trees and compare subtrees. Invariant to formatting and renaming (Type I-III).

4

ML / DL-Based

Learn embeddings of code fragments and use similarity in vector space. Can detect Type IV semantic clones.

Approach    | Types          | Scalability
Text-based  | I, II          | High
Token-based | I, II, III     | High
AST-based   | I, II, III     | Medium
ML / DL     | I, II, III, IV | Medium
Key Insight
Only learning-based approaches can detect Type IV (semantic) clones, because they capture the meaning rather than the syntax of code.
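A sketch of the learning-based idea, with made-up embedding vectors and a hypothetical similarity threshold (real systems learn the embeddings from data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical learned embeddings of three code fragments.
emb = {
    "sum_iterative": [0.90, 0.10, 0.30],
    "sum_recursive": [0.85, 0.15, 0.35],  # semantically similar fragment
    "parse_json":    [0.10, 0.90, -0.20], # unrelated fragment
}

def is_clone(a, b, threshold=0.95):
    """Fragments whose embeddings are close in vector space are flagged."""
    return cosine(emb[a], emb[b]) >= threshold
```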
Module 4 · Slide 08

Vulnerability Prediction

Automatically predicting whether a code component contains a security vulnerability, using machine learning classifiers trained on historical vulnerability data.

Why It Matters
Security vulnerabilities cost billions per year. Manual code review does not scale. ML models can prioritize code for review by flagging high-risk components.
Common Vulnerability Types
Buffer Overflow — writing beyond allocated memory.
SQL Injection — unsanitized user input in queries.
XSS — injecting scripts into web pages.
Use After Free — accessing freed memory.
Integer Overflow — arithmetic exceeding type range.
ML Pipeline
The typical vulnerability prediction pipeline:
Code
Metrics & Tokens
Classifier
Vuln Score
Features Used
Software metrics: cyclomatic complexity, LOC, coupling.
Code tokens: n-grams of source code tokens.
API calls: use of dangerous functions (e.g., strcpy, eval).
Change history: past bug-fixing frequency.
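A toy sketch of the metrics-and-tokens feature extraction step; the feature set and snippet are illustrative, not a real pipeline:

```python
import re

# A small illustrative list of risky C/dynamic APIs (not exhaustive).
DANGEROUS_APIS = {"strcpy", "gets", "sprintf", "eval", "system"}

def extract_features(code: str) -> dict:
    """Extract simple features for a vulnerability classifier (toy example)."""
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    return {
        "loc": code.count("\n") + 1,                            # lines of code
        "dangerous_calls": sum(t in DANGEROUS_APIS for t in tokens),
        "token_count": len(tokens),
    }

snippet = 'char buf[8];\nstrcpy(buf, user_input);'
features = extract_features(snippet)
```

A classifier would then map such feature vectors to a vulnerability score.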
Module 4 · Slide 09

Vulnerability Prediction Pipeline

Walk through how a vulnerability prediction model extracts features and assigns a risk score to a code snippet.

Vulnerability Analyzer Interactive — inspect a code snippet and the features extracted from it.
Module 4 · Slide 10

Code Smell Detection

Code smells are symptoms of poor design choices that increase maintenance cost and defect likelihood. ML models can automatically detect them from software metrics.

Long Method
A method with too many lines of code, doing too much. Hard to understand and test.
God Class
A class that centralizes too much functionality and knows too much about the system.
Feature Envy
A method that uses data from another class more than its own. Suggests misplaced responsibility.
Data Clumps
Groups of data that frequently appear together and should be encapsulated into a class.
ML Approach
Extract software metrics (LOC, cyclomatic complexity, coupling, cohesion, depth of inheritance), then train a classifier (Random Forest, SVM, or DNN) to predict smell type.
Why It Matters
Studies show that code with smells has higher defect density and longer change times. Automated detection helps developers refactor proactively.
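For intuition, here is a rule-based baseline with illustrative thresholds; the ML approach described above replaces such hand-set cutoffs with decision boundaries learned from labeled examples:

```python
def detect_smells(metrics: dict) -> list:
    """Threshold-based smell detection. Thresholds are illustrative;
    a trained classifier would learn them from data instead."""
    smells = []
    if metrics.get("method_loc", 0) > 50:          # very long method body
        smells.append("Long Method")
    if metrics.get("class_methods", 0) > 20 and metrics.get("coupling", 0) > 10:
        smells.append("God Class")                 # large and highly coupled
    if metrics.get("foreign_accesses", 0) > metrics.get("own_accesses", 0):
        smells.append("Feature Envy")              # uses another class's data more
    return smells
```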
Module 4 · Slide 11

Interactive: Code Smell Detector

Analyze code metrics to classify the type of code smell present in a given snippet.

Smell Detector Interactive — analyze a code snippet's metrics to classify its smell.
Module 4 · Slide 12

Non-Generative Tasks Recap

Summary and comparison of the three non-generative tasks we covered.

Task                     | Input          | Output              | Features                    | Challenge
Clone Detection          | Code pair      | Clone type (I-IV)   | Tokens, AST, embeddings     | Type IV semantic clones
Vulnerability Prediction | Code component | Vulnerability score | Metrics, API calls, history | Class imbalance (few vulns)
Code Smell Detection     | Class / method | Smell type          | LOC, complexity, coupling   | Subjective thresholds
Which clone type requires semantic understanding to detect?
Type IV — same functionality, different syntax Correct
Type II — renamed identifiers Incorrect — Type II is detectable by token comparison
Type III — gapped clones Incorrect — Type III can be found with AST matching
What is the main challenge in vulnerability prediction?
Class imbalance — very few vulnerable samples vs. many safe ones Correct
Code is too short to extract features Incorrect
Module 4 · Slide 13

From Non-Generative to Generative

We now shift from classifying existing code to generating new sequences of code tokens. This requires a fundamentally different class of models.

Non-Generative

  • Input: code → Output: label / score
  • Clone detection, vulnerability prediction
  • Traditional ML often sufficient
  • Fixed-size output
vs

Generative

  • Input: code/NL → Output: code/NL sequence
  • Code generation, summarization, translation
  • Deep learning required (RNN, Transformer)
  • Variable-length output
Generative Tasks in SE
Code Generation — NL description → source code.
Code Summarization — source code → NL summary.
Code Translation — Java code → Python code.
Bug Repair — buggy code → fixed code.
Commit Message Generation — code diff → commit message.
Module 4 · Slide 14

Word Embeddings for Code

Before feeding code tokens to a neural network, we need to represent them as dense numerical vectors. Embeddings capture semantic relationships between tokens.

One-Hot Encoding
Each token gets a sparse vector with a single 1. Problem: no similarity info, huge dimensions (vocabulary can be 50K+).
Dense Embeddings (Word2Vec)
Each token mapped to a low-dimensional dense vector (e.g., 128-d). Tokens with similar contexts get similar vectors. int and float end up nearby.
Why Embeddings Work for Code
Code has strong local context: for(int i=0 is predictable. Embeddings capture that ArrayList is similar to LinkedList.
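A small sketch contrasting the two encodings on the slide's vocabulary; the dense vectors are made-up stand-ins for learned embeddings:

```python
import math

VOCAB = ["int", "float", "String", "for", "if", "return"]

def one_hot(token):
    """Sparse vector with a single 1 at the token's vocabulary index."""
    return [1.0 if t == token else 0.0 for t in VOCAB]

# Hypothetical 3-d dense embeddings: type tokens cluster, keywords sit apart.
DENSE = {
    "int":   [0.9, 0.1, 0.0],
    "float": [0.8, 0.2, 0.1],
    "for":   [0.0, 0.9, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# One-hot: every distinct pair has similarity 0 — no semantic signal.
# Dense: int and float end up close; int and for do not.
```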
Encoding Comparison Interactive

Vocabulary: int, float, String, for, if, return

Module 4 · Slide 15

Byte Pair Encoding (BPE)

Subword tokenization that splits rare identifiers into known subword units. Essential for handling code's open vocabulary (compound names like getEmbeddedIPv4ClientAddr).

1

Start with Characters

Initialize vocabulary with all individual characters in the training corpus.

2

Count Pairs

Find the most frequent adjacent pair of tokens (e.g., 'e' + 's' appearing 1000 times).

3

Merge

Replace all occurrences of the pair with a new token ('es'). Add to vocabulary.

4

Repeat

Repeat steps 2-3 for a fixed number of merges (e.g., 32K). Common words become single tokens; rare words split into subwords.
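The merge loop above can be sketched in plain Python — a simplified trainer over a toy word list, not a production tokenizer:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merge rules from a list of words (steps 1-4, simplified)."""
    # Step 1: represent each word as a sequence of characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 3: replace every occurrence of the best pair with a new token.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
        # Step 4: the loop repeats for num_merges iterations.
    return merges

merges = bpe_train(["lower", "lowest", "newer", "newest"], num_merges=4)
```

The first merge fuses 'w' and 'e' into 'we', the most frequent adjacent pair in this toy corpus.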

BPE Tokenizer Interactive
Enter an identifier:
Try: getEmbeddedIPv4 | parseHTTPResponse | calculateTotalPrice
Module 4 · Slide 16

LSTM Architecture

Long Short-Term Memory networks solve the vanishing gradient problem of vanilla RNNs by introducing gated memory cells that selectively remember and forget information.

RNN Limitation
Vanishing gradients: in long sequences, gradients shrink exponentially during backpropagation. The network cannot learn long-range dependencies (e.g., matching a { with its closing } 50 tokens later).
LSTM Solution: Gates
Three gates control information flow through the cell:
Forget Gate f_t
Decides what to erase from cell state. sigmoid(W_f * [h_{t-1}, x_t])
Input Gate i_t
Decides what new info to store. sigmoid(W_i * [h_{t-1}, x_t])
Candidate g_t
New candidate values. tanh(W_g * [h_{t-1}, x_t])
Output Gate o_t
Decides what to output. sigmoid(W_o * [h_{t-1}, x_t])
Cell State Update
c_t = f_t * c_{t-1} + i_t * g_t
The cell state is a highway: the forget gate removes old info, the input gate writes new info.

h_t = o_t * tanh(c_t)
The output gate filters the cell state to produce the hidden state.
Why LSTM for Code
Code has long-range dependencies: variable declarations referenced many lines later, matching brackets, function calls. LSTM's cell state can carry this info across hundreds of tokens.
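To make the gate formulas concrete, here is one LSTM step with scalar states and illustrative weights (real cells use weight matrices and bias terms, omitted here for brevity):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step with scalar states. Each gate sees [h_{t-1}, x_t],
    here as two scalars; W holds the per-gate weight pairs."""
    f_t = sigmoid(W["f"][0] * h_prev + W["f"][1] * x_t)    # forget gate
    i_t = sigmoid(W["i"][0] * h_prev + W["i"][1] * x_t)    # input gate
    g_t = math.tanh(W["g"][0] * h_prev + W["g"][1] * x_t)  # candidate
    o_t = sigmoid(W["o"][0] * h_prev + W["o"][1] * x_t)    # output gate
    c_t = f_t * c_prev + i_t * g_t                         # cell state update
    h_t = o_t * math.tanh(c_t)                             # hidden state
    return h_t, c_t

# Illustrative fixed weights, as in the gate-explorer slide.
W = {"f": [0.5, 0.5], "i": [0.5, 0.5], "g": [0.5, 0.5], "o": [0.5, 0.5]}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W)
```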
Module 4 · Slide 17

Interactive: LSTM Gate Explorer

Adjust the input values and observe how each gate responds. The bars show gate activations computed with fixed illustrative weights.

LSTM Gate Explorer Interactive
Sliders set the input values (e.g., 0.50, 0.30, 0.80); bars show the resulting gate activations (forget f_t, input i_t, candidate g_t, output o_t) and the outputs c_t (cell) and h_t (hidden).
Module 4 · Slide 18

GRU: A Simpler Alternative to LSTM

Gated Recurrent Units achieve similar performance to LSTMs with a simpler architecture — fewer gates, fewer parameters, and faster training.

LSTM Recap
3 Gates: Input, Forget, Output.
2 States: Cell state (c_t) + Hidden state (h_t).
Parameters: 4 weight matrices per layer.
Powerful but complex — more parameters to learn, slower to train.
GRU Design
2 Gates: Update gate (z_t) and Reset gate (r_t).
1 State: Hidden state (h_t) only — no separate cell state.
Parameters: 3 weight matrices per layer.
The update gate combines the roles of LSTM’s forget and input gates.
Property         | LSTM                      | GRU
Gates            | 3 (input, forget, output) | 2 (update, reset)
States           | 2 (cell + hidden)         | 1 (hidden only)
Parameters/layer | 4n(n+m)                   | 3n(n+m)
Training speed   | Slower                    | Faster
Long sequences   | Better                    | Good
Small datasets   | Overfits more             | Preferred
When to Use GRU
GRUs are often preferred when you have less training data, since they have fewer parameters to learn. For very long sequences or when maximum performance is critical, LSTM may still have an edge.
Module 4 · Slide 19

Interactive: RNN vs LSTM vs GRU on Code

See how a simple RNN, LSTM, and GRU retain information over a code token sequence. RNNs suffer from vanishing gradients; gated architectures preserve long-range memory.

Memory Retention Comparison Interactive
Select a code example to see how each architecture retains information across tokens:
INFORMATION RETAINED vs TOKEN DISTANCE
Simple RNN
LSTM
GRU
Module 4 · Slide 20

Seq2Seq for Code

The encoder-decoder (sequence-to-sequence) architecture maps an input sequence to an output sequence of potentially different length — ideal for code translation and repair.

Encoder
An LSTM reads the input sequence token by token and compresses it into a fixed-size context vector (the final hidden state).
Decoder
A second LSTM generates the output sequence one token at a time, conditioned on the context vector and previously generated tokens.
Teacher Forcing
During training, the decoder receives the ground-truth previous token (not its own prediction). This stabilizes and accelerates training.
Bug Repair Example:
sum(arr, n-1)
Encoder LSTM
Context Vector
Decoder LSTM
sum(arr, n)
Bottleneck Problem
The entire input sequence must be compressed into a single fixed-size vector. For long code sequences, this loses information. Solution: attention mechanism (next slide).
Module 4 · Slide 21

Attention Mechanism

Instead of relying on a single context vector, attention lets the decoder look back at all encoder hidden states and focus on the most relevant parts for each output token.

How Attention Works
1. Compute alignment scores between decoder state and each encoder state.
2. Softmax to get attention weights (sum to 1).
3. Weighted sum of encoder states = context for this step.
4. Concatenate with decoder state to predict next token.
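Steps 1-3 above in miniature, using dot-product alignment scores and toy 2-d vectors (illustrative values only):

```python
import math

def attend(decoder_state, encoder_states):
    """One attention step: scores, softmax weights, weighted-sum context."""
    # 1. Alignment scores: dot product of decoder state with each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # 2. Softmax turns scores into attention weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # 3. Context vector = weighted sum of encoder states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(len(decoder_state))]
    return weights, context

weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Here the decoder state aligns best with the first encoder state, which receives the largest weight.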
Why It Helps for Code
When fixing a bug, the decoder can directly attend to the relevant input tokens. For sum(arr, n-1) → sum(arr, n), attention learns to skip the -1.
Attention Heatmap Interactive
Input (buggy): sum ( arr , n - 1 )
Click an output token to see its attention weights:
Module 4 · Slide 22

Beam Search Decoding

At inference time, the decoder must choose tokens one at a time. Beam search keeps multiple candidate sequences (beams) to avoid getting trapped by greedy local choices.

Greedy Decoding
Always pick the highest-probability token at each step. Fast but often produces suboptimal sequences — a locally good choice can lead to a poor overall result.
Beam Search (width = k)
1. Start with the top-k tokens for position 1.
2. For each beam, expand with all possible next tokens.
3. Keep only the top-k sequences by cumulative log-probability.
4. Repeat until all beams produce <EOS>.
5. Return the highest-scoring complete sequence.
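The procedure can be sketched over a hypothetical toy next-token model; the probabilities below are invented for illustration:

```python
import math

# Toy next-token distributions keyed by prefix (illustrative values).
PROBS = {
    ():          {"return": 0.45, "if": 0.30, "int": 0.25},
    ("return",): {"result": 0.6, "null": 0.3, "<EOS>": 0.1},
    ("if",):     {"(": 0.5, "<EOS>": 0.5},
    ("int",):    {"x": 0.9, "<EOS>": 0.1},
}

def beam_search(k=2, max_len=2):
    """Keep the k best partial sequences by cumulative log-probability."""
    beams = [((), 0.0)]                           # (sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq and seq[-1] == "<EOS>":
                candidates.append((seq, lp))      # finished beam carries over
                continue
            for tok, p in PROBS.get(seq, {"<EOS>": 1.0}).items():
                candidates.append((seq + (tok,), lp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

best_seq, best_lp = beam_search()[0]
```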
Example: Beam Width = 2
Step 1: Top-2 starts
return (0.45)
if (0.30)
Step 2: Expand & keep top-2
return result (0.38)
return null (0.32)
Step 3: Best complete sequence
return result; Winner (0.35)
Trade-off
Larger beam width → better results but slower. Typical values: k=5 to k=10 for code generation tasks.
Module 4 · Slide 23

Generative Part 1 Recap

Summary of the key building blocks for LSTM-based generative models.

01

Embeddings

Dense vector representations that capture semantic similarity between code tokens. Foundation for all neural models.

02

BPE Tokenization

Subword splitting that handles code's open vocabulary. Compound identifiers become sequences of known subwords.

03

LSTM Gates

Forget, input, and output gates control memory flow. Solves vanishing gradients for long code sequences.

04

Seq2Seq

Encoder-decoder architecture maps input sequences to output sequences. Used for code translation and repair.

05

Attention

Lets the decoder focus on relevant encoder states at each step. Eliminates the information bottleneck.

06

Beam Search

Explores multiple candidate sequences during decoding. Avoids greedy local optima for better overall output.

Why does seq2seq with attention outperform vanilla seq2seq for long code sequences?
Attention lets the decoder access all encoder states, avoiding the fixed-size bottleneck Correct
Attention uses a larger context vector Incorrect — the key is dynamic access, not vector size
Module 4 · Slide 24

Self-Attention: The Key Innovation

Self-attention allows every token in a sequence to attend to every other token simultaneously, replacing the sequential processing of RNNs with parallel computation.

Query, Key, Value
Each token is projected into three vectors:
Q (Query) — "What am I looking for?"
K (Key) — "What do I contain?"
V (Value) — "What information do I provide?"
Attention = softmax(QKᵀ / sqrt(d_k)) · V
Why Replace Recurrence?
Parallelism: all positions computed simultaneously (vs. sequential RNN).
Direct paths: any two tokens are connected in one step (vs. O(n) hops in RNN).
Scalability: leverages GPU parallelism for training on massive code corpora.
Scaled Dot-Product
The scaling factor 1/sqrt(d_k) prevents dot products from becoming too large, which would push softmax into saturated regions with vanishing gradients.
Self-Attention for Code
Self-attention naturally learns structural patterns in code: ( attends to its matching ), function names attend to their arguments, and def links to :.
Input Tokens
Q, K, V Projections
Attention Scores
Weighted Sum
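The pipeline above can be run end to end on plain Python lists with toy sizes (real models execute this as batched tensor operations on GPUs):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention, softmax(QKᵀ / sqrt(d_k)) · V,
    on plain lists. Toy sizes, for illustration only."""
    d_k = len(K[0])
    out = []
    for q in Q:                                  # one output row per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]                    # scaled dot-product scores
        weights = softmax(scores)                # attention over all tokens
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])  # weighted sum of values
    return out

# Two tokens, d_k = 2: each query attends mostly to its matching key.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = self_attention(Q, K, V)
```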
Module 4 · Slide 25

Multi-Head Attention & Positional Encoding

Multiple attention heads capture different types of relationships, while positional encodings inject sequence order into the otherwise position-agnostic Transformer.

Multi-Head Attention
Run h parallel attention functions with different learned projections. Each head can specialize: one head learns bracket matching, another learns data-flow, another learns identifier co-reference. Outputs are concatenated and linearly projected.
Sinusoidal Positional Encoding
Since self-attention has no notion of order, we add position vectors using sine/cosine functions at different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
This lets the model generalize to unseen sequence lengths.
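The two formulas can be evaluated directly; this sketch computes the encoding for a single position:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding for one position:
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # even dimension 2i
        pe.append(math.cos(angle))   # odd dimension 2i+1
    return pe

# Each position gets a distinct fingerprint; all values lie in [-1, 1].
pe0 = positional_encoding(0, d_model=8)
pe3 = positional_encoding(3, d_model=8)
```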
Positional Encoding Heatmap Interactive

Each row = a position (0-7), each column = a dimension. Brighter green = higher value.

Pattern
Low dimensions oscillate rapidly (capturing local position), high dimensions change slowly (capturing global position). Each position gets a unique fingerprint.
Module 4 · Slide 26

Transformer Architecture

The Transformer stacks self-attention and feed-forward layers into encoder and decoder blocks, with residual connections and layer normalization for stable deep training.

Encoder Block (x N layers):
Multi-Head Self-Attention
Add & Layer Norm (residual)
Feed-Forward Network
Add & Layer Norm (residual)
Decoder Block (x N layers):
Masked Self-Attention
Cross-Attention (to encoder)
Feed-Forward Network
Key Components
Residual Connections: input added to sublayer output, enabling gradient flow through deep stacks.

Layer Normalization: normalizes activations per-sample, stabilizing training.

Masked Self-Attention: in the decoder, prevents attending to future tokens (preserves autoregressive property).

Cross-Attention: decoder queries attend to encoder keys/values, linking input and output.
Architecture Variants
Encoder-only (BERT, CodeBERT): bidirectional, for classification.
Decoder-only (GPT, Codex): autoregressive, for generation.
Encoder-decoder (T5, CodeT5): full seq2seq, for translation.
Module 4 · Slide 27

Autoregressive Generation: How LLMs Write Code

Large language models generate code one token at a time, left to right. Each generated token becomes part of the input for the next prediction step.

1

Start with Prompt

The user provides initial context: def sort_list(

2

Generate Next Token

Model predicts the most likely next token: arr. Context is now def sort_list(arr

3

Append and Repeat

Each new token extends the context: ): \n return sorted

4

Stop Condition

Generation stops when the model produces an end-of-sequence token (<EOS>) or reaches the maximum context length.

Token-by-Token Generation Interactive
Key Insight
Every token an LLM generates was produced one at a time, left to right, conditioned on everything that came before. This is why context window size matters — the model can only “see” a fixed number of preceding tokens.
Sampling Strategies
Greedy: always pick the top token.
Temperature: scale logits to control randomness.
Top-k / Top-p: restrict to most likely tokens.
Higher temperature = more creative but less reliable code.
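Temperature and top-k can be sketched in a few lines; the logits and vocabulary size here are illustrative:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None):
    """Sample a token id from logits with temperature scaling and optional
    top-k filtering. Toy sketch; real decoders work on large vocabularies."""
    scaled = [l / temperature for l in logits]    # lower temp sharpens the distribution
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    exps = [math.exp(s) for s in scaled]          # exp(-inf) == 0: filtered out
    probs = [e / sum(exps) for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1, -1.0]
token = sample(logits, temperature=0.7, top_k=2)  # only ids 0 or 1 possible
```

Greedy decoding is the limit of low temperature (or top_k=1): the highest-logit token always wins.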
Module 4 · Slide 28

Pre-training & Fine-tuning

Modern code models are first pre-trained on massive unlabeled code corpora, then fine-tuned on smaller task-specific datasets. This transfer learning paradigm dramatically reduces labeled data requirements.

Pre-training Objectives
MLM (Masked Language Modeling): randomly mask 15% of tokens, predict them from context. Used by BERT/CodeBERT. Learns bidirectional representations.

CLM (Causal Language Modeling): predict the next token given all previous tokens. Used by GPT/Codex. Learns to generate code left-to-right.
Fine-tuning
Add a task-specific head (classifier, decoder) on top of the pre-trained model. Train on labeled data with a small learning rate. The pre-trained weights provide a strong initialization that already understands code syntax and semantics.
Transfer Learning Benefit Interactive
With 1,000 labeled fine-tuning examples, the illustrative accuracies are 38% training from scratch vs. 76% starting from a pre-trained model.
Module 4 · Slide 29

Fine-Tuning vs. Pre-Training

Understanding the economics and workflow of the two-stage training paradigm that powers every modern code model.

Pre-Training
Goal: Learn general language / code patterns.
Data: Massive corpus (e.g., The Stack: 900GB of code).
Cost: Millions of dollars in GPU compute.
Duration: Weeks to months on hundreds of GPUs.
Who does it: Research labs (Meta, Google, BigCode).
Frequency: Done once, shared publicly.
Result
A foundation model that understands code syntax, semantics, and common patterns across many languages — but is not specialized for any particular task.
Fine-Tuning
Goal: Specialize the model for a specific task or domain.
Data: Small labeled dataset (hundreds to thousands of examples).
Cost: Cheap — hours on a single GPU.
Duration: Minutes to hours.
Who does it: Practitioners, researchers, teams.
Frequency: Done for each new task or domain.
TYPICAL PIPELINE
Pre-train on
The Stack
Fine-tune on
Java methods
Deploy for
Java completion
Practical Takeaway
You will almost never pre-train a model from scratch. Fine-tuning lets you specialize a powerful model for your task with relatively little data and compute. This is the standard workflow for all course projects.
Module 4 · Slide 30

Code Models Ecosystem

An overview of the major pre-trained models for code, their architectures, and capabilities.

Model         | Architecture    | Training Data             | Key Capability
CodeBERT      | Encoder-only    | CodeSearchNet (6 langs)   | Code search, clone detection, defect prediction
GraphCodeBERT | Encoder-only    | CodeSearchNet + data flow | Structure-aware code understanding
CodeT5        | Encoder-decoder | CodeSearchNet + C/C#      | Code generation, summarization, translation
StarCoder     | Decoder-only    | The Stack (80+ langs)     | Code completion, fill-in-the-middle
Codex / GPT   | Decoder-only    | GitHub public code        | Code generation from NL, Copilot backend
Key Trend
The field has moved from encoder-only models (good for understanding) to decoder-only models (good for generation), with encoder-decoder models offering a balance for seq2seq tasks like code translation.
Module 4 · Slide 31

HNN & CC2Vec: Learning Code Changes

Specialized architectures for understanding code changes (diffs) rather than static code snapshots.

HNN — Hierarchical Neural Network
Generates commit messages from code diffs using a two-level hierarchy: first encode individual changed hunks, then aggregate hunk representations to generate a natural language commit message.
CC2Vec — Code Change to Vector
Learns distributed representations of code changes by separately encoding added lines, removed lines, and their context. The resulting vector captures the semantics of a commit for downstream tasks (just-in-time defect prediction, commit classification).
CC2Vec Pipeline — Step Through:
Module 4 · Slide 32

From Theory to Practice: The DL4SE Toolkit

The tools that abstract away complexity and let you go from idea to working model in under 50 lines of Python.

PyTorch
The framework. Tensors, automatic differentiation (autograd), and nn.Module for building models. The dominant framework in research and increasingly in industry.
HuggingFace Transformers
Pre-trained models on demand. Load any code model (CodeBERT, CodeT5, StarCoder) with one line. Includes tokenizers, pipelines, and fine-tuning utilities.
Weights & Biases
Experiment tracking. Log metrics, visualize training curves, compare runs, and run hyperparameter sweeps. Essential for reproducible research.
Google Colab / Lightning AI
Free GPU compute. Run Jupyter notebooks with GPU/TPU acceleration in the browser. No local GPU required for course projects.
HuggingFace Datasets
Data loading made easy. Access standard SE datasets (CodeSearchNet, The Stack) with streaming support for large corpora.
Git & DVC
Version control for ML. Git for code, DVC (Data Version Control) for datasets and model checkpoints that are too large for Git.
Key Insight
These tools abstract away most of the complexity. You can load a pre-trained code model and fine-tune it in under 50 lines of Python. Focus on the research question, not the plumbing.
Module 4 · Slide 33

Module 4 Recap & Knowledge Check

01

Non-Generative Tasks

Clone detection (Types I-IV), vulnerability prediction, code smell detection — classification without generating new code.

02

Embeddings & BPE

Dense vector representations and subword tokenization form the foundation for feeding code to neural networks.

03

LSTM & Seq2Seq

Gated recurrent cells with encoder-decoder architecture for sequence-to-sequence code tasks.

04

Attention & Transformers

Self-attention replaces recurrence; multi-head attention captures diverse code patterns in parallel.

05

Pre-training & Fine-tuning

Train on massive unlabeled code (MLM/CLM), then fine-tune on small labeled datasets for specific SE tasks.

06

Code Models

CodeBERT, CodeT5, StarCoder, Codex — the ecosystem of pre-trained models powering modern AI-assisted development.

What pre-training objective does CodeBERT use?
Masked Language Modeling (MLM) — predict randomly masked tokens from bidirectional context Correct
Causal Language Modeling (CLM) — predict next token left-to-right Incorrect — CLM is used by decoder-only models like GPT/Codex
Which architecture variant is best suited for code translation (input lang → output lang)?
Encoder-decoder (e.g., CodeT5) — encodes input and generates output sequence Correct
Encoder-only (e.g., CodeBERT) — produces contextualized embeddings Incorrect — encoder-only models cannot generate output sequences
Why is BPE tokenization particularly important for code?
Code has compound identifiers (e.g., getEmbeddedIPv4) that would be OOV with word-level tokenization Correct
BPE makes code run faster at inference time Incorrect — BPE affects vocabulary coverage, not inference speed directly
Module 4 · Slide 34

What's Next?

You now have a foundation in both non-generative and generative deep learning for software engineering. Here is how this connects to the rest of the course.

Evaluating AI-enabled SE (Module 3)
How do we measure whether these models actually work? Module 3 covers evaluation metrics (BLEU, CodeBLEU, accuracy, F1), benchmarks, and experimental design for AI4SE research.
Practical Applications
These foundational models power real tools: GitHub Copilot (Codex), automated code review (CodeBERT), vulnerability scanners, and refactoring assistants.
References & Further Reading
CodeBERT: Feng et al., 2020. Pre-trained models for NL-PL understanding and generation.

CodeT5: Wang et al., 2021. Unified pre-trained encoder-decoder for code understanding and generation.

CC2Vec: Hoang et al., 2020. Distributed representations of code changes.

Attention Is All You Need: Vaswani et al., 2017. The original Transformer paper.
Key Takeaway
The shift from handcrafted features to learned representations, and from task-specific models to pre-trained + fine-tuned architectures, has transformed software engineering research and practice.
🎉

Module Complete!

You've finished DL4SE Foundations. Great work!