Module 4 · Slide 01

Deep Learning for
Software Development Foundations

This module covers the core deep learning concepts and architectures that power modern AI-driven software engineering tools — from non-generative classification tasks to generative models based on LSTMs and Transformers.

Part A — Non-Generative
Clone detection, vulnerability prediction, code smell detection — classification and prediction tasks that do not produce new code.
Part B — Embeddings & LSTMs
Word embeddings, BPE tokenization, LSTM architecture, seq2seq with attention, and beam search for code generation.
Part C — Transformers
Self-attention, multi-head attention, positional encoding, pre-training, fine-tuning, and the modern code model ecosystem.
Learning Objectives
Distinguish generative vs. non-generative DL4SE tasks. Understand key architectures (LSTM, Transformer) and how pre-trained models are fine-tuned for SE.
Module 4 · Slide 02

Non-Generative Tasks Overview

Non-generative DL4SE tasks classify or predict properties of existing code without producing new source code. The model answers a question about the code rather than writing new code.

Definition
Non-generative tasks take code as input and output a label or score — e.g., "vulnerable" / "safe", or "clone" / "not clone".
Key Examples
Code Clone Detection — find duplicate or near-duplicate code fragments.
Vulnerability Prediction — flag code likely to contain security flaws.
Code Smell Detection — identify design problems like God Class or Long Method.
Pipeline Pattern
Most non-generative tasks follow the same pattern:
Source Code
Feature Extraction
Classifier / DNN
Label / Score
Contrast
Generative tasks (code generation, summarization, translation) produce a sequence of new tokens. We cover those in Parts B and C.
Module 4 · Slide 03

Neural Network Fundamentals

A primer on the building blocks of every deep learning model — the artificial neuron and how neurons are organized into layers.

The Neuron
A single neuron computes a weighted sum of its inputs, adds a bias, and applies a non-linear activation function:

y = σ(Σ wᵢxᵢ + b)

xᵢ — input values (e.g., code metrics).
wᵢ — learnable weights (importance of each input).
b — bias term (shifts the decision boundary).
σ — activation (ReLU, sigmoid, tanh).
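As a concrete toy illustration, the neuron formula above can be written in a few lines of Python; the weights and inputs below are made-up values for demonstration, not learned ones:

```python
import math

def neuron(x, w, b, activation="sigmoid"):
    """A single artificial neuron: y = activation(sum(w_i * x_i) + b).
    Illustrative sketch; real weights are learned by training."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    if activation == "relu":
        return max(0.0, z)                        # ReLU: max(0, z)
    return 1.0 / (1.0 + math.exp(-z))             # sigmoid: output in (0, 1)

# Example: two inputs (e.g., code metrics) with hypothetical weights.
y = neuron(x=[0.5, 2.0], w=[0.8, -0.3], b=0.1)
```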
Common Activations
ReLU: max(0, x) — fast, default for hidden layers.
Sigmoid: 1/(1+e⁻ˣ) — output between 0 and 1.
Softmax: normalizes outputs to probabilities.
Layers
Neurons are stacked into layers to form a network:

Input Layer — receives raw features (e.g., token counts, complexity metrics).

Hidden Layers — learn intermediate representations. More layers = deeper network = more abstract features.

Output Layer — produces the final prediction (a class label, a probability, a score).
NETWORK STRUCTURE
Input
features
Hidden 1
learned repr.
Hidden 2
abstract feat.
Output
prediction
Key Insight
A neural network is just a function that maps inputs to outputs through layers of simple computations. The “learning” happens by adjusting the weights.
Module 4 · Slide 04

How Neural Networks Learn: Backpropagation

High-level intuition for how a neural network adjusts its weights to improve predictions — the training loop that drives all deep learning.

1

Forward Pass

Input flows through the network layer by layer, producing a prediction at the output. Each neuron applies its weights, bias, and activation function.

2

Loss Computation

Compare the prediction to the ground truth using a loss function (e.g., cross-entropy for classification, MSE for regression). The loss is a single number measuring how wrong the prediction is.

3

Backward Pass (Backpropagation)

Compute gradients — how much each weight contributed to the error. Uses the chain rule from calculus, applied systematically from output back to input.

4

Weight Update (Gradient Descent)

Adjust each weight in the direction that reduces the loss: w = w - lr * gradient. The learning rate (lr) controls step size.

5

Repeat

Iterate over the training data many times (epochs). Each pass through the full dataset refines the weights further. Stop when validation performance plateaus.

Key Insight
Backprop is just the chain rule from calculus applied systematically. You don’t need to derive it — frameworks like PyTorch do it automatically with loss.backward().
The Training Loop in Code
optimizer.zero_grad()       # clear gradients from the previous step
prediction = model(input)   # forward pass
loss = criterion(prediction, label)
loss.backward()             # backprop: compute gradients
optimizer.step()            # update weights
Module 4 · Slide 05

Training in Practice: Hyperparameters

Settings you choose before training begins. Unlike weights (which are learned), hyperparameters must be set by the practitioner.

Learning Rate
How big each weight update step is. Too high: loss diverges. Too low: training is painfully slow. Typical: 1e-3 to 1e-5. The single most important hyperparameter.
Batch Size
How many examples to process before updating weights. Common values: 32, 64, 128. Larger batches = more stable gradients but more memory.
Epochs
How many complete passes through the full training dataset. Too few: underfitting. Too many: overfitting. Monitor validation loss to decide.
Dropout
Randomly disable neurons during training (e.g., 10-50% chance). Forces the network to learn redundant representations, preventing overfitting.
Early Stopping
Stop training when validation loss starts increasing (the model is memorizing, not generalizing). Saves compute and prevents overfitting.
Optimizer
The algorithm for gradient descent. Adam is the default choice — adapts learning rate per-parameter. SGD with momentum is a common alternative.
Practical Advice
You rarely need to tune these from scratch. Start with published defaults and adjust based on validation performance. Learning rate is the first knob to turn; batch size and dropout come second.
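The learning-rate advice above can be made concrete with a toy experiment: running plain gradient descent on f(w) = w² (gradient 2w), a small step converges while an oversized one diverges. The function and values are illustrative only:

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimize f(w) = w^2, whose gradient is 2w, to show the
    effect of learning rate. Toy setup, not a real model."""
    for _ in range(steps):
        w = w - lr * 2 * w   # w = w - lr * gradient
    return w

print(gradient_descent(lr=0.1))   # converges toward the minimum at 0
print(gradient_descent(lr=1.1))   # diverges: each step overshoots and grows
```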
Module 4 · Slide 06

Code Clone Detection: Types I – IV

Code clones are pairs of code fragments that are similar. They are classified into four types based on the degree of similarity.

Type I — Exact Clone
Identical code except for whitespace, layout, and comments. A straight copy-paste.
Type II — Renamed Clone
Syntactically identical except for identifier names, literal values, or type references.
Type III — Gapped Clone
Similar structure but with statements added, removed, or modified. Near-miss clone.
Type IV — Semantic Clone
Same functionality, completely different syntax. E.g., iterative vs. recursive implementations.
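A minimal illustration of a Type IV pair, following the slide's iterative-vs-recursive example:

```python
def sum_iterative(arr):
    """Iterative implementation."""
    total = 0
    for x in arr:
        total += x
    return total

def sum_recursive(arr):
    """Recursive implementation: same behavior, different syntax."""
    if not arr:
        return 0
    return arr[0] + sum_recursive(arr[1:])

# Same functionality with little structural overlap: a Type IV clone pair.
```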
Clone Classifier Interactive

Examine the code pair and classify the clone type:

Module 4 · Slide 07

Clone Detection Approaches

Different techniques trade off precision, recall, and the ability to detect higher-type clones.

1

Text-Based

Compare raw text or normalized lines. Fast but limited to Type I / II.

2

Token-Based

Tokenize code and compare token sequences with suffix trees or hashing. Handles Type I-III.

3

AST-Based

Parse into Abstract Syntax Trees and compare subtrees. Invariant to formatting and renaming (Type I-III).

4

ML / DL-Based

Learn embeddings of code fragments and use similarity in vector space. Can detect Type IV semantic clones.

Approach    | Types          | Scalability
Text-based  | I, II          | High
Token-based | I, II, III     | High
AST-based   | I, II, III     | Medium
ML / DL     | I, II, III, IV | Medium
Key Insight
Only learning-based approaches can detect Type IV (semantic) clones, because they capture the meaning rather than the syntax of code.
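A sketch of the learning-based idea, with made-up embedding vectors and a hypothetical similarity threshold (real systems learn the embeddings from data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical learned embeddings of three code fragments.
emb = {
    "sum_iterative": [0.90, 0.10, 0.30],
    "sum_recursive": [0.85, 0.15, 0.35],  # semantically similar fragment
    "parse_json":    [0.10, 0.90, -0.20], # unrelated fragment
}

def is_clone(a, b, threshold=0.95):
    """Fragments whose embeddings are close in vector space are flagged."""
    return cosine(emb[a], emb[b]) >= threshold
```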
Module 4 · Slide 08

Vulnerability Prediction

Automatically predicting whether a code component contains a security vulnerability, using machine learning classifiers trained on historical vulnerability data.

Why It Matters
Security vulnerabilities cost billions per year. Manual code review does not scale. ML models can prioritize code for review by flagging high-risk components.
Common Vulnerability Types
Buffer Overflow — writing beyond allocated memory.
SQL Injection — unsanitized user input in queries.
XSS — injecting scripts into web pages.
Use After Free — accessing freed memory.
Integer Overflow — arithmetic exceeding type range.
ML Pipeline
The typical vulnerability prediction pipeline:
Code
Metrics & Tokens
Classifier
Vuln Score
Features Used
Software metrics: cyclomatic complexity, LOC, coupling.
Code tokens: n-grams of source code tokens.
API calls: use of dangerous functions (e.g., strcpy, eval).
Change history: past bug-fixing frequency.
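A toy sketch of the metrics-and-tokens feature extraction step; the feature set and snippet are illustrative, not a real pipeline:

```python
import re

# A small illustrative list of risky C/dynamic APIs (not exhaustive).
DANGEROUS_APIS = {"strcpy", "gets", "sprintf", "eval", "system"}

def extract_features(code: str) -> dict:
    """Extract simple features for a vulnerability classifier (toy example)."""
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    return {
        "loc": code.count("\n") + 1,                            # lines of code
        "dangerous_calls": sum(t in DANGEROUS_APIS for t in tokens),
        "token_count": len(tokens),
    }

snippet = 'char buf[8];\nstrcpy(buf, user_input);'
features = extract_features(snippet)
```

A classifier would then map such feature vectors to a vulnerability score.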
Module 4 · Slide 09

Vulnerability Prediction Pipeline

Walk through how a vulnerability prediction model extracts features and assigns a risk score to a code snippet.

Vulnerability Analyzer Interactive — inspect a code snippet and the features extracted from it.
Module 4 · Slide 10

Code Smell Detection

Code smells are symptoms of poor design choices that increase maintenance cost and defect likelihood. ML models can automatically detect them from software metrics.

Long Method
A method with too many lines of code, doing too much. Hard to understand and test.
God Class
A class that centralizes too much functionality and knows too much about the system.
Feature Envy
A method that uses data from another class more than its own. Suggests misplaced responsibility.
Data Clumps
Groups of data that frequently appear together and should be encapsulated into a class.
ML Approach
Extract software metrics (LOC, cyclomatic complexity, coupling, cohesion, depth of inheritance), then train a classifier (Random Forest, SVM, or DNN) to predict smell type.
Why It Matters
Studies show that code with smells has higher defect density and longer change times. Automated detection helps developers refactor proactively.
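For intuition, here is a rule-based baseline with illustrative thresholds; the ML approach described above replaces such hand-set cutoffs with decision boundaries learned from labeled examples:

```python
def detect_smells(metrics: dict) -> list:
    """Threshold-based smell detection. Thresholds are illustrative;
    a trained classifier would learn them from data instead."""
    smells = []
    if metrics.get("method_loc", 0) > 50:          # very long method body
        smells.append("Long Method")
    if metrics.get("class_methods", 0) > 20 and metrics.get("coupling", 0) > 10:
        smells.append("God Class")                 # large and highly coupled
    if metrics.get("foreign_accesses", 0) > metrics.get("own_accesses", 0):
        smells.append("Feature Envy")              # uses another class's data more
    return smells
```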
Module 4 · Slide 11

Interactive: Code Smell Detector

Analyze code metrics to classify the type of code smell present in a given snippet.

Smell Detector Interactive — analyze a code snippet's metrics to classify its smell.
Module 4 · Slide 12

Non-Generative Tasks Recap

Summary and comparison of the three non-generative tasks we covered.

Task                     | Input          | Output              | Features                    | Challenge
Clone Detection          | Code pair      | Clone type (I-IV)   | Tokens, AST, embeddings     | Type IV semantic clones
Vulnerability Prediction | Code component | Vulnerability score | Metrics, API calls, history | Class imbalance (few vulns)
Code Smell Detection     | Class / method | Smell type          | LOC, complexity, coupling   | Subjective thresholds
Which clone type requires semantic understanding to detect?
Type IV — same functionality, different syntax Correct
Type II — renamed identifiers Incorrect — Type II is detectable by token comparison
Type III — gapped clones Incorrect — Type III can be found with AST matching
What is the main challenge in vulnerability prediction?
Class imbalance — very few vulnerable samples vs. many safe ones Correct
Code is too short to extract features Incorrect
Module 4 · Slide 13

From Non-Generative to Generative

We now shift from classifying existing code to generating new sequences of code tokens. This requires a fundamentally different class of models.

Non-Generative

  • Input: code → Output: label / score
  • Clone detection, vulnerability prediction
  • Traditional ML often sufficient
  • Fixed-size output
vs

Generative

  • Input: code/NL → Output: code/NL sequence
  • Code generation, summarization, translation
  • Deep learning required (RNN, Transformer)
  • Variable-length output
Generative Tasks in SE
Code Generation — NL description → source code.
Code Summarization — source code → NL summary.
Code Translation — Java code → Python code.
Bug Repair — buggy code → fixed code.
Commit Message Generation — code diff → commit message.
Module 4 · Slide 14

Word Embeddings for Code

Before feeding code tokens to a neural network, we need to represent them as dense numerical vectors. Embeddings capture semantic relationships between tokens.

One-Hot Encoding
Each token gets a sparse vector with a single 1. Problem: no similarity info, huge dimensions (vocabulary can be 50K+).
Dense Embeddings (Word2Vec)
Each token mapped to a low-dimensional dense vector (e.g., 128-d). Tokens with similar contexts get similar vectors. int and float end up nearby.
Why Embeddings Work for Code
Code has strong local context: for(int i=0 is predictable. Embeddings capture that ArrayList is similar to LinkedList.
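A small sketch contrasting the two encodings on the slide's vocabulary; the dense vectors are made-up stand-ins for learned embeddings:

```python
import math

VOCAB = ["int", "float", "String", "for", "if", "return"]

def one_hot(token):
    """Sparse vector with a single 1 at the token's vocabulary index."""
    return [1.0 if t == token else 0.0 for t in VOCAB]

# Hypothetical 3-d dense embeddings: type tokens cluster, keywords sit apart.
DENSE = {
    "int":   [0.9, 0.1, 0.0],
    "float": [0.8, 0.2, 0.1],
    "for":   [0.0, 0.9, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# One-hot: every distinct pair has similarity 0 — no semantic signal.
# Dense: int and float end up close; int and for do not.
```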
Encoding Comparison Interactive

Vocabulary: int, float, String, for, if, return

Module 4 · Slide 15

Byte Pair Encoding (BPE)

Subword tokenization that splits rare identifiers into known subword units. Essential for handling code's open vocabulary (compound names like getEmbeddedIPv4ClientAddr).

1

Start with Characters

Initialize vocabulary with all individual characters in the training corpus.

2

Count Pairs

Find the most frequent adjacent pair of tokens (e.g., 'e' + 's' appearing 1000 times).

3

Merge

Replace all occurrences of the pair with a new token ('es'). Add to vocabulary.

4

Repeat

Repeat steps 2-3 for a fixed number of merges (e.g., 32K). Common words become single tokens; rare words split into subwords.
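The merge loop above can be sketched in plain Python — a simplified trainer over a toy word list, not a production tokenizer:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merge rules from a list of words (steps 1-4, simplified)."""
    # Step 1: represent each word as a sequence of characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 3: replace every occurrence of the best pair with a new token.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
        # Step 4: the loop repeats for num_merges iterations.
    return merges

merges = bpe_train(["lower", "lowest", "newer", "newest"], num_merges=4)
```

The first merge fuses 'w' and 'e' into 'we', the most frequent adjacent pair in this toy corpus.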

BPE Tokenizer Interactive
Enter an identifier:
Try: getEmbeddedIPv4 | parseHTTPResponse | calculateTotalPrice
Module 4 · Slide 16

LSTM Architecture

Long Short-Term Memory networks solve the vanishing gradient problem of vanilla RNNs by introducing gated memory cells that selectively remember and forget information.

RNN Limitation
Vanishing gradients: in long sequences, gradients shrink exponentially during backpropagation. The network cannot learn long-range dependencies (e.g., matching a { with its closing } 50 tokens later).
LSTM Solution: Gates
Three gates control information flow through the cell:
Forget Gate f_t
Decides what to erase from cell state. sigmoid(W_f * [h_{t-1}, x_t])
Input Gate i_t
Decides what new info to store. sigmoid(W_i * [h_{t-1}, x_t])
Candidate g_t
New candidate values. tanh(W_g * [h_{t-1}, x_t])
Output Gate o_t
Decides what to output. sigmoid(W_o * [h_{t-1}, x_t])
Cell State Update
c_t = f_t * c_{t-1} + i_t * g_t
The cell state is a highway: the forget gate removes old info, the input gate writes new info.

h_t = o_t * tanh(c_t)
The output gate filters the cell state to produce the hidden state.
Why LSTM for Code
Code has long-range dependencies: variable declarations referenced many lines later, matching brackets, function calls. LSTM's cell state can carry this info across hundreds of tokens.
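To make the gate formulas concrete, here is one LSTM step with scalar states and illustrative weights (real cells use weight matrices and bias terms, omitted here for brevity):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step with scalar states. Each gate sees [h_{t-1}, x_t],
    here as two scalars; W holds the per-gate weight pairs."""
    f_t = sigmoid(W["f"][0] * h_prev + W["f"][1] * x_t)    # forget gate
    i_t = sigmoid(W["i"][0] * h_prev + W["i"][1] * x_t)    # input gate
    g_t = math.tanh(W["g"][0] * h_prev + W["g"][1] * x_t)  # candidate
    o_t = sigmoid(W["o"][0] * h_prev + W["o"][1] * x_t)    # output gate
    c_t = f_t * c_prev + i_t * g_t                         # cell state update
    h_t = o_t * math.tanh(c_t)                             # hidden state
    return h_t, c_t

# Illustrative fixed weights, as in the gate-explorer slide.
W = {"f": [0.5, 0.5], "i": [0.5, 0.5], "g": [0.5, 0.5], "o": [0.5, 0.5]}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W)
```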
Module 4 · Slide 17

Interactive: LSTM Gate Explorer

Adjust the input values and observe how each gate responds. The bars show gate activations computed with fixed illustrative weights.

LSTM Gate Explorer Interactive
Sliders set the input values (e.g., 0.50, 0.30, 0.80); bars show the resulting gate activations (forget f_t, input i_t, candidate g_t, output o_t) and the outputs c_t (cell) and h_t (hidden).
Module 4 · Slide 18

GRU: A Simpler Alternative to LSTM

Gated Recurrent Units achieve similar performance to LSTMs with a simpler architecture — fewer gates, fewer parameters, and faster training.

LSTM Recap
3 Gates: Input, Forget, Output.
2 States: Cell state (c_t) + Hidden state (h_t).
Parameters: 4 weight matrices per layer.
Powerful but complex — more parameters to learn, slower to train.
GRU Design
2 Gates: Update gate (z_t) and Reset gate (r_t).
1 State: Hidden state (h_t) only — no separate cell state.
Parameters: 3 weight matrices per layer.
The update gate combines the roles of LSTM’s forget and input gates.
Property         | LSTM                      | GRU
Gates            | 3 (input, forget, output) | 2 (update, reset)
States           | 2 (cell + hidden)         | 1 (hidden only)
Parameters/layer | 4n(n+m)                   | 3n(n+m)
Training speed   | Slower                    | Faster
Long sequences   | Better                    | Good
Small datasets   | Overfits more             | Preferred
When to Use GRU
GRUs are often preferred when you have less training data, since they have fewer parameters to learn. For very long sequences or when maximum performance is critical, LSTM may still have an edge.
Module 4 · Slide 19

Interactive: RNN vs LSTM vs GRU on Code

See how a simple RNN, LSTM, and GRU retain information over a code token sequence. RNNs suffer from vanishing gradients; gated architectures preserve long-range memory.

Memory Retention Comparison Interactive
Select a code example to see how each architecture retains information across tokens:
INFORMATION RETAINED vs TOKEN DISTANCE
Simple RNN
LSTM
GRU
Module 4 · Slide 20

Seq2Seq for Code

The encoder-decoder (sequence-to-sequence) architecture maps an input sequence to an output sequence of potentially different length — ideal for code translation and repair.

Encoder
An LSTM reads the input sequence token by token and compresses it into a fixed-size context vector (the final hidden state).
Decoder
A second LSTM generates the output sequence one token at a time, conditioned on the context vector and previously generated tokens.
Teacher Forcing
During training, the decoder receives the ground-truth previous token (not its own prediction). This stabilizes and accelerates training.
Bug Repair Example:
sum(arr, n-1)
Encoder LSTM
Context Vector
Decoder LSTM
sum(arr, n)
Bottleneck Problem
The entire input sequence must be compressed into a single fixed-size vector. For long code sequences, this loses information. Solution: attention mechanism (next slide).
Module 4 · Slide 21

Attention Mechanism

Instead of relying on a single context vector, attention lets the decoder look back at all encoder hidden states and focus on the most relevant parts for each output token.

How Attention Works
1. Compute alignment scores between decoder state and each encoder state.
2. Softmax to get attention weights (sum to 1).
3. Weighted sum of encoder states = context for this step.
4. Concatenate with decoder state to predict next token.
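Steps 1-3 above in miniature, using dot-product alignment scores and toy 2-d vectors (illustrative values only):

```python
import math

def attend(decoder_state, encoder_states):
    """One attention step: scores, softmax weights, weighted-sum context."""
    # 1. Alignment scores: dot product of decoder state with each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # 2. Softmax turns scores into attention weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # 3. Context vector = weighted sum of encoder states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(len(decoder_state))]
    return weights, context

weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Here the decoder state aligns best with the first encoder state, which receives the largest weight.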
Why It Helps for Code
When fixing a bug, the decoder can directly attend to the relevant input tokens. For sum(arr, n-1) → sum(arr, n), attention learns to skip the -1.
Attention Heatmap Interactive
Input (buggy): sum ( arr , n - 1 )
Click an output token to see its attention weights:
Module 4 · Slide 22

Beam Search Decoding

At inference time, the decoder must choose tokens one at a time. Beam search keeps multiple candidate sequences (beams) to avoid getting trapped by greedy local choices.

Greedy Decoding
Always pick the highest-probability token at each step. Fast but often produces suboptimal sequences — a locally good choice can lead to a poor overall result.
Beam Search (width = k)
1. Start with the top-k tokens for position 1.
2. For each beam, expand with all possible next tokens.
3. Keep only the top-k sequences by cumulative log-probability.
4. Repeat until all beams produce <EOS>.
5. Return the highest-scoring complete sequence.
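The procedure can be sketched over a hypothetical toy next-token model; the probabilities below are invented for illustration:

```python
import math

# Toy next-token distributions keyed by prefix (illustrative values).
PROBS = {
    ():          {"return": 0.45, "if": 0.30, "int": 0.25},
    ("return",): {"result": 0.6, "null": 0.3, "<EOS>": 0.1},
    ("if",):     {"(": 0.5, "<EOS>": 0.5},
    ("int",):    {"x": 0.9, "<EOS>": 0.1},
}

def beam_search(k=2, max_len=2):
    """Keep the k best partial sequences by cumulative log-probability."""
    beams = [((), 0.0)]                           # (sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq and seq[-1] == "<EOS>":
                candidates.append((seq, lp))      # finished beam carries over
                continue
            for tok, p in PROBS.get(seq, {"<EOS>": 1.0}).items():
                candidates.append((seq + (tok,), lp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

best_seq, best_lp = beam_search()[0]
```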
Example: Beam Width = 2
Step 1: Top-2 starts
return (0.45)
if (0.30)
Step 2: Expand & keep top-2
return result (0.38)
return null (0.32)
Step 3: Best complete sequence
return result; Winner (0.35)
Trade-off
Larger beam width → better results but slower. Typical values: k=5 to k=10 for code generation tasks.
Module 4 · Slide 23

Generative Part 1 Recap

Summary of the key building blocks for LSTM-based generative models.

01

Embeddings

Dense vector representations that capture semantic similarity between code tokens. Foundation for all neural models.

02

BPE Tokenization

Subword splitting that handles code's open vocabulary. Compound identifiers become sequences of known subwords.

03

LSTM Gates

Forget, input, and output gates control memory flow. Solves vanishing gradients for long code sequences.

04

Seq2Seq

Encoder-decoder architecture maps input sequences to output sequences. Used for code translation and repair.

05

Attention

Lets the decoder focus on relevant encoder states at each step. Eliminates the information bottleneck.

06

Beam Search

Explores multiple candidate sequences during decoding. Avoids greedy local optima for better overall output.

Why does seq2seq with attention outperform vanilla seq2seq for long code sequences?
Attention lets the decoder access all encoder states, avoiding the fixed-size bottleneck Correct
Attention uses a larger context vector Incorrect — the key is dynamic access, not vector size
Module 4 · Slide 24

Self-Attention: The Key Innovation

Self-attention allows every token in a sequence to attend to every other token simultaneously, replacing the sequential processing of RNNs with parallel computation.

Query, Key, Value
Each token is projected into three vectors:
Q (Query) — "What am I looking for?"
K (Key) — "What do I contain?"
V (Value) — "What information do I provide?"
Attention = softmax(QKᵀ / sqrt(d_k)) · V
Why Replace Recurrence?
Parallelism: all positions computed simultaneously (vs. sequential RNN).
Direct paths: any two tokens are connected in one step (vs. O(n) hops in RNN).
Scalability: leverages GPU parallelism for training on massive code corpora.
Scaled Dot-Product
The scaling factor 1/sqrt(d_k) prevents dot products from becoming too large, which would push softmax into saturated regions with vanishing gradients.
Self-Attention for Code
Self-attention naturally learns structural patterns in code: ( attends to its matching ), function names attend to their arguments, and def links to :.
Input Tokens
Q, K, V Projections
Attention Scores
Weighted Sum
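The pipeline above can be run end to end on plain Python lists with toy sizes (real models execute this as batched tensor operations on GPUs):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention, softmax(QKᵀ / sqrt(d_k)) · V,
    on plain lists. Toy sizes, for illustration only."""
    d_k = len(K[0])
    out = []
    for q in Q:                                  # one output row per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]                    # scaled dot-product scores
        weights = softmax(scores)                # attention over all tokens
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])  # weighted sum of values
    return out

# Two tokens, d_k = 2: each query attends mostly to its matching key.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = self_attention(Q, K, V)
```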
Module 4 · Slide 25

Multi-Head Attention & Positional Encoding

Multiple attention heads capture different types of relationships, while positional encodings inject sequence order into the otherwise position-agnostic Transformer.

Multi-Head Attention
Run h parallel attention functions with different learned projections. Each head can specialize: one head learns bracket matching, another learns data-flow, another learns identifier co-reference. Outputs are concatenated and linearly projected.
Sinusoidal Positional Encoding
Since self-attention has no notion of order, we add position vectors using sine/cosine functions at different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
This lets the model generalize to unseen sequence lengths.
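The two formulas can be evaluated directly; this sketch computes the encoding for a single position:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding for one position:
    PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # even dimension 2i
        pe.append(math.cos(angle))   # odd dimension 2i+1
    return pe

# Each position gets a distinct fingerprint; all values lie in [-1, 1].
pe0 = positional_encoding(0, d_model=8)
pe3 = positional_encoding(3, d_model=8)
```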
Positional Encoding Heatmap Interactive

Each row = a position (0-7), each column = a dimension. Brighter green = higher value.

Pattern
Low dimensions oscillate rapidly (capturing local position), high dimensions change slowly (capturing global position). Each position gets a unique fingerprint.
Module 4 · Slide 26

Transformer Architecture

The Transformer stacks self-attention and feed-forward layers into encoder and decoder blocks, with residual connections and layer normalization for stable deep training.

Encoder Block (x N layers):
Multi-Head Self-Attention
Add & Layer Norm (residual)
Feed-Forward Network
Add & Layer Norm (residual)
Decoder Block (x N layers):
Masked Self-Attention
Cross-Attention (to encoder)
Feed-Forward Network
Key Components
Residual Connections: input added to sublayer output, enabling gradient flow through deep stacks.

Layer Normalization: normalizes activations per-sample, stabilizing training.

Masked Self-Attention: in the decoder, prevents attending to future tokens (preserves autoregressive property).

Cross-Attention: decoder queries attend to encoder keys/values, linking input and output.
Architecture Variants
Encoder-only (BERT, CodeBERT): bidirectional, for classification.
Decoder-only (GPT, Codex): autoregressive, for generation.
Encoder-decoder (T5, CodeT5): full seq2seq, for translation.
Module 4 · Slide 27

Autoregressive Generation: How LLMs Write Code

Large language models generate code one token at a time, left to right. Each generated token becomes part of the input for the next prediction step.

1

Start with Prompt

The user provides initial context: def sort_list(

2

Generate Next Token

Model predicts the most likely next token: arr. Context is now def sort_list(arr

3

Append and Repeat

Each new token extends the context: ): \n return sorted

4

Stop Condition

Generation stops when the model produces an end-of-sequence token (<EOS>) or reaches the maximum context length.

Token-by-Token Generation Interactive
Key Insight
Every token an LLM generates was produced one at a time, left to right, conditioned on everything that came before. This is why context window size matters — the model can only “see” a fixed number of preceding tokens.
Sampling Strategies
Greedy: always pick the top token.
Temperature: scale logits to control randomness.
Top-k / Top-p: restrict to most likely tokens.
Higher temperature = more creative but less reliable code.
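Temperature and top-k can be sketched in a few lines; the logits and vocabulary size here are illustrative:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None):
    """Sample a token id from logits with temperature scaling and optional
    top-k filtering. Toy sketch; real decoders work on large vocabularies."""
    scaled = [l / temperature for l in logits]    # lower temp sharpens the distribution
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    exps = [math.exp(s) for s in scaled]          # exp(-inf) == 0: filtered out
    probs = [e / sum(exps) for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1, -1.0]
token = sample(logits, temperature=0.7, top_k=2)  # only ids 0 or 1 possible
```

Greedy decoding is the limit of low temperature (or top_k=1): the highest-logit token always wins.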
Module 4 · Slide 28

Pre-training & Fine-tuning

Modern code models are first pre-trained on massive unlabeled code corpora, then fine-tuned on smaller task-specific datasets. This transfer learning paradigm dramatically reduces labeled data requirements.

Pre-training Objectives
MLM (Masked Language Modeling): randomly mask 15% of tokens, predict them from context. Used by BERT/CodeBERT. Learns bidirectional representations.

CLM (Causal Language Modeling): predict the next token given all previous tokens. Used by GPT/Codex. Learns to generate code left-to-right.
Fine-tuning
Add a task-specific head (classifier, decoder) on top of the pre-trained model. Train on labeled data with a small learning rate. The pre-trained weights provide a strong initialization that already understands code syntax and semantics.
Transfer Learning Benefit Interactive
With 1,000 labeled fine-tuning examples, the illustrative accuracies are 38% training from scratch vs. 76% starting from a pre-trained model.
Module 4 · Slide 29

Fine-Tuning vs. Pre-Training

Understanding the economics and workflow of the two-stage training paradigm that powers every modern code model.

Pre-Training
Goal: Learn general language / code patterns.
Data: Massive corpus (e.g., The Stack: 900GB of code).
Cost: Millions of dollars in GPU compute.
Duration: Weeks to months on hundreds of GPUs.
Who does it: Research labs (Meta, Google, BigCode).
Frequency: Done once, shared publicly.
Result
A foundation model that understands code syntax, semantics, and common patterns across many languages — but is not specialized for any particular task.
Fine-Tuning
Goal: Specialize the model for a specific task or domain.
Data: Small labeled dataset (hundreds to thousands of examples).
Cost: Cheap — hours on a single GPU.
Duration: Minutes to hours.
Who does it: Practitioners, researchers, teams.
Frequency: Done for each new task or domain.
TYPICAL PIPELINE
Pre-train on
The Stack
Fine-tune on
Java methods
Deploy for
Java completion
Practical Takeaway
You will almost never pre-train a model from scratch. Fine-tuning lets you specialize a powerful model for your task with relatively little data and compute. This is the standard workflow for all course projects.
Module 4 · Slide 30

Code Models Ecosystem

An overview of the major pre-trained models for code, their architectures, and capabilities.

Model         | Architecture    | Training Data             | Key Capability
CodeBERT      | Encoder-only    | CodeSearchNet (6 langs)   | Code search, clone detection, defect prediction
GraphCodeBERT | Encoder-only    | CodeSearchNet + data flow | Structure-aware code understanding
CodeT5        | Encoder-decoder | CodeSearchNet + C/C#      | Code generation, summarization, translation
StarCoder     | Decoder-only    | The Stack (80+ langs)     | Code completion, fill-in-the-middle
Codex / GPT   | Decoder-only    | GitHub public code        | Code generation from NL, Copilot backend
Key Trend
The field has moved from encoder-only models (good for understanding) to decoder-only models (good for generation), with encoder-decoder models offering a balance for seq2seq tasks like code translation.
Module 4 · Slide 31

HNN & CC2Vec: Learning Code Changes

Specialized architectures for understanding code changes (diffs) rather than static code snapshots.

HNN — Hierarchical Neural Network
Generates commit messages from code diffs using a two-level hierarchy: first encode individual changed hunks, then aggregate hunk representations to generate a natural language commit message.
CC2Vec — Code Change to Vector
Learns distributed representations of code changes by separately encoding added lines, removed lines, and their context. The resulting vector captures the semantics of a commit for downstream tasks (just-in-time defect prediction, commit classification).
CC2Vec Pipeline — Step Through:
Module 4 · Slide 32

From Theory to Practice: The DL4SE Toolkit

The tools that abstract away complexity and let you go from idea to working model in under 50 lines of Python.

PyTorch
The framework. Tensors, automatic differentiation (autograd), and nn.Module for building models. The dominant framework in research and increasingly in industry.
HuggingFace Transformers
Pre-trained models on demand. Load any code model (CodeBERT, CodeT5, StarCoder) with one line. Includes tokenizers, pipelines, and fine-tuning utilities.
Weights & Biases
Experiment tracking. Log metrics, visualize training curves, compare runs, and run hyperparameter sweeps. Essential for reproducible research.
Google Colab / Lightning AI
Free GPU compute. Run Jupyter notebooks with GPU/TPU acceleration in the browser. No local GPU required for course projects.
HuggingFace Datasets
Data loading made easy. Access standard SE datasets (CodeSearchNet, The Stack) with streaming support for large corpora.
Git & DVC
Version control for ML. Git for code, DVC (Data Version Control) for datasets and model checkpoints that are too large for Git.
Key Insight
These tools abstract away most of the complexity. You can load a pre-trained code model and fine-tune it in under 50 lines of Python. Focus on the research question, not the plumbing.
Module 4 · Slide 33

Module 4 Recap & Knowledge Check

01

Non-Generative Tasks

Clone detection (Types I-IV), vulnerability prediction, code smell detection — classification without generating new code.

02

Embeddings & BPE

Dense vector representations and subword tokenization form the foundation for feeding code to neural networks.

03

LSTM & Seq2Seq

Gated recurrent cells with encoder-decoder architecture for sequence-to-sequence code tasks.

04

Attention & Transformers

Self-attention replaces recurrence; multi-head attention captures diverse code patterns in parallel.

05

Pre-training & Fine-tuning

Train on massive unlabeled code (MLM/CLM), then fine-tune on small labeled datasets for specific SE tasks.

06

Code Models

CodeBERT, CodeT5, StarCoder, Codex — the ecosystem of pre-trained models powering modern AI-assisted development.

What pre-training objective does CodeBERT use?
Masked Language Modeling (MLM) — predict randomly masked tokens from bidirectional context Correct
Causal Language Modeling (CLM) — predict next token left-to-right Incorrect — CLM is used by decoder-only models like GPT/Codex
Which architecture variant is best suited for code translation (input lang → output lang)?
Encoder-decoder (e.g., CodeT5) — encodes input and generates output sequence Correct
Encoder-only (e.g., CodeBERT) — produces contextualized embeddings Incorrect — encoder-only models cannot generate output sequences
Why is BPE tokenization particularly important for code?
Code has compound identifiers (e.g., getEmbeddedIPv4) that would be OOV with word-level tokenization Correct
BPE makes code run faster at inference time Incorrect — BPE affects vocabulary coverage, not inference speed directly
Module 4 · Slide 34

What's Next?

You now have a foundation in both non-generative and generative deep learning for software engineering. Here is how this connects to the rest of the course.

Evaluating AI-enabled SE (Module 3)
How do we measure whether these models actually work? Module 3 covers evaluation metrics (BLEU, CodeBLEU, accuracy, F1), benchmarks, and experimental design for AI4SE research.
Practical Applications
These foundational models power real tools: GitHub Copilot (Codex), automated code review (CodeBERT), vulnerability scanners, and refactoring assistants.
References & Further Reading
CodeBERT: Feng et al., 2020. Pre-trained models for NL-PL understanding and generation.

CodeT5: Wang et al., 2021. Unified pre-trained encoder-decoder for code understanding and generation.

CC2Vec: Hoang et al., 2020. Distributed representations of code changes.

Attention Is All You Need: Vaswani et al., 2017. The original Transformer paper.
Key Takeaway
The shift from handcrafted features to learned representations, and from task-specific models to pre-trained + fine-tuned architectures, has transformed software engineering research and practice.
🎉

Module Complete!

You've finished DL4SE Foundations. Great work!