Module 4 · Slide 01
Deep Learning for Software Development Foundations
This module covers the core deep learning concepts and architectures that power modern AI-driven software engineering tools — from non-generative classification tasks to generative models based on LSTMs and Transformers.
Part A — Non-Generative
Clone detection, vulnerability prediction, code smell detection — classification and prediction tasks that do not produce new code.
Part B — Embeddings & LSTMs
Word embeddings, BPE tokenization, LSTM architecture, seq2seq with attention, and beam search for code generation.
Part C — Transformers
Self-attention, multi-head attention, positional encoding, pre-training, fine-tuning, and the modern code model ecosystem.
Learning Objectives
Distinguish generative vs. non-generative DL4SE tasks. Understand key architectures (LSTM, Transformer) and how pre-trained models are fine-tuned for SE.
Module 4 · Slide 02
Non-Generative Tasks Overview
Non-generative DL4SE tasks classify or predict properties of existing code without producing new source code. The model answers a question about the code rather than writing new code.
Definition
Non-generative tasks take code as input and output a label or score — e.g., "vulnerable" / "safe", or "clone" / "not clone".
Key Examples
Code Clone Detection — find duplicate or near-duplicate code fragments. Vulnerability Prediction — flag code likely to contain security flaws. Code Smell Detection — identify design problems like God Class or Long Method.
Pipeline Pattern
Most non-generative tasks follow the same pattern:
Source Code
→
Feature Extraction
→
Classifier / DNN
→
Label / Score
Contrast
Generative tasks (code generation, summarization, translation) produce a sequence of new tokens. We cover those in Parts B and C.
Module 4 · Slide 03
Neural Network Fundamentals
A primer on the building blocks of every deep learning model — the artificial neuron and how neurons are organized into layers.
The Neuron
A single neuron computes a weighted sum of its inputs, adds a bias, and applies a non-linear activation function:
y = σ(Σ wᵢxᵢ + b)
xᵢ — input values (e.g., code metrics). wᵢ — learnable weights (importance of each input). b — bias term (shifts the decision boundary). σ — activation (ReLU, sigmoid, tanh).
Common Activations
ReLU: max(0, x) — fast, default for hidden layers. Sigmoid: 1/(1+e⁻ˣ) — output between 0 and 1. Softmax: normalizes outputs to probabilities.
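The neuron equation and activations above fit in a few lines of Python (the inputs and weights are illustrative, not learned):

```python
import math

def neuron(xs, ws, b, activation):
    # Weighted sum of inputs plus bias, passed through an activation
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return activation(z)

relu = lambda z: max(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Two illustrative inputs (e.g., code metrics) with hand-picked weights
y = neuron([0.5, 2.0], [0.8, -0.3], b=0.1, activation=sigmoid)
```

With these values the weighted sum is 0.4 - 0.6 + 0.1 = -0.1, so the sigmoid output lands just below 0.5; swapping in `relu` would clamp it to 0.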
Layers
Neurons are stacked into layers to form a network:
Input Layer — receives raw features (e.g., token counts, complexity metrics).
Hidden Layers — learn intermediate representations. More layers = deeper network = more abstract features.
Output Layer — produces the final prediction (a class label, a probability, a score).
NETWORK STRUCTURE
Input features
→
Hidden 1 learned repr.
→
Hidden 2 abstract feat.
→
Output prediction
Key Insight
A neural network is just a function that maps inputs to outputs through layers of simple computations. The “learning” happens by adjusting the weights.
Module 4 · Slide 04
How Neural Networks Learn: Backpropagation
High-level intuition for how a neural network adjusts its weights to improve predictions — the training loop that drives all deep learning.
1
Forward Pass
Input flows through the network layer by layer, producing a prediction at the output. Each neuron applies its weights, bias, and activation function.
2
Loss Computation
Compare the prediction to the ground truth using a loss function (e.g., cross-entropy for classification, MSE for regression). The loss is a single number measuring how wrong the prediction is.
3
Backward Pass (Backpropagation)
Compute gradients — how much each weight contributed to the error. Uses the chain rule from calculus, applied systematically from output back to input.
4
Weight Update (Gradient Descent)
Adjust each weight in the direction that reduces the loss: w = w - lr * gradient. The learning rate (lr) controls step size.
5
Repeat
Iterate over the training data many times (epochs). Each pass through the full dataset refines the weights further. Stop when validation performance plateaus.
Key Insight
Backprop is just the chain rule from calculus applied systematically. You don’t need to derive it — frameworks like PyTorch do it automatically with loss.backward().
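The five steps, stripped down to a single weight and one training example (pure Python; the data and learning rate are illustrative):

```python
# Toy training loop: fit y = w * x to one example (x = 2, target = 6).
# Forward pass, loss, gradient via the chain rule, weight update, repeat.
x, target = 2.0, 6.0
w, lr = 0.0, 0.05

for epoch in range(100):
    y_pred = w * x                     # 1. forward pass
    loss = (y_pred - target) ** 2      # 2. loss (squared error)
    grad = 2 * (y_pred - target) * x   # 3. backward pass (chain rule)
    w = w - lr * grad                  # 4. weight update (gradient descent)

# w converges toward 3.0, since 3 * 2 = 6
```

In a framework, steps 3 and 4 are `loss.backward()` and `optimizer.step()`; the analytic gradient here is what autograd computes for you.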
Module 4 · Slide 05
Hyperparameters
Settings you choose before training begins. Unlike weights (which are learned), hyperparameters must be set by the practitioner.
Learning Rate
How big each weight update step is. Too high: loss diverges. Too low: training is painfully slow. Typical: 1e-3 to 1e-5. The single most important hyperparameter.
Batch Size
How many examples to process before updating weights. Common values: 32, 64, 128. Larger batches = more stable gradients but more memory.
Epochs
How many complete passes through the full training dataset. Too few: underfitting. Too many: overfitting. Monitor validation loss to decide.
Dropout
Randomly disable neurons during training (e.g., 10-50% chance). Forces the network to learn redundant representations, preventing overfitting.
Early Stopping
Stop training when validation loss starts increasing (the model is memorizing, not generalizing). Saves compute and prevents overfitting.
Optimizer
The algorithm for gradient descent. Adam is the default choice — adapts learning rate per-parameter. SGD with momentum is a common alternative.
Practical Advice
You rarely need to tune these from scratch. Start with published defaults and adjust based on validation performance. Learning rate is the first knob to turn; batch size and dropout come second.
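A quick illustration of why the learning rate is the first knob to turn, using a one-weight toy loss (all values illustrative):

```python
def final_weight(lr, steps=50):
    # Gradient descent on loss = (w * 2 - 6)^2, starting from w = 0;
    # the optimum is w = 3
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w * 2 - 6) * 2
        w -= lr * grad
    return w

good = final_weight(0.1)      # settles near the optimum, w = 3
too_high = final_weight(0.3)  # each step overshoots: |w| blows up, loss diverges
```

The same loss with a step size that is too large oscillates with growing amplitude instead of converging, which is what a diverging loss curve looks like in practice.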
Module 4 · Slide 06
Code Clone Detection: Types I – IV
Code clones are pairs of code fragments that are similar. They are classified into four types based on the degree of similarity.
Type I — Exact Clone
Identical code except for whitespace, layout, and comments. A straight copy-paste.
Type II — Renamed Clone
Syntactically identical except for identifier names, literal values, or type references.
Type III — Gapped Clone
Similar structure but with statements added, removed, or modified. Near-miss clone.
Type IV — Semantic Clone
Same functionality, completely different syntax. E.g., iterative vs. recursive implementations.
Clone Classifier Interactive
Examine the code pair and classify the clone type:
Module 4 · Slide 07
Clone Detection Approaches
Different techniques trade off precision, recall, and the ability to detect higher-type clones.
1
Text-Based
Compare raw text or normalized lines. Fast but limited to Type I / II.
2
Token-Based
Tokenize code and compare token sequences with suffix trees or hashing. Handles Type I-III.
3
AST-Based
Parse into Abstract Syntax Trees and compare subtrees. Invariant to formatting and renaming (Type I-III).
4
ML / DL-Based
Learn embeddings of code fragments and use similarity in vector space. Can detect Type IV semantic clones.
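To make the token-based approach concrete, here is a minimal sketch (the crude regex tokenizer and short keyword list are simplifications, not a production clone detector): normalizing identifiers and literals makes Type II clones compare equal.

```python
import re

def tokens(code):
    # Crude tokenizer: identifiers, integer literals, single punctuation chars
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", code)

KEYWORDS = {"int", "float", "for", "if", "return", "while"}

def normalize(toks):
    # Map identifiers to ID and literals to NUM so renamed (Type II)
    # clones produce identical token sequences
    out = []
    for t in toks:
        if t in KEYWORDS:
            out.append(t)
        elif t.isdigit():
            out.append("NUM")
        elif re.fullmatch(r"[A-Za-z_]\w*", t):
            out.append("ID")
        else:
            out.append(t)
    return out

a = "int total = 0;"
b = "int count = 42;"
normalize(tokens(a)) == normalize(tokens(b))   # True: a Type II clone pair
```

Gapped (Type III) clones need sequence alignment rather than exact equality, and Type IV clones defeat this entirely, which is where learned embeddings come in.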
Approach
Types
Scalability
Text-based
I, II
High
Token-based
I, II, III
High
AST-based
I, II, III
Medium
ML / DL
I, II, III, IV
Medium
Key Insight
Only learning-based approaches can detect Type IV (semantic) clones, because they capture the meaning rather than the syntax of code.
Module 4 · Slide 08
Vulnerability Prediction
Automatically predicting whether a code component contains a security vulnerability, using machine learning classifiers trained on historical vulnerability data.
Why It Matters
Security vulnerabilities cost billions per year. Manual code review does not scale. ML models can prioritize code for review by flagging high-risk components.
Common Vulnerability Types
Buffer Overflow — writing beyond allocated memory. SQL Injection — unsanitized user input in queries. XSS — injecting scripts into web pages. Use After Free — accessing freed memory. Integer Overflow — arithmetic exceeding type range.
ML Pipeline
The typical vulnerability prediction pipeline:
Code
→
Metrics & Tokens
→
Classifier
→
Vuln Score
Features Used
Software metrics: cyclomatic complexity, LOC, coupling. Code tokens: n-grams of source code tokens. API calls: use of dangerous functions (e.g., strcpy, eval). Change history: past bug-fixing frequency.
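A sketch of two of the feature families above (LOC and dangerous API calls), scored with illustrative hand-picked weights rather than weights learned from real vulnerability data; the function names are hypothetical:

```python
import math
import re

DANGEROUS = {"strcpy", "gets", "sprintf", "system", "eval"}

def extract_features(code):
    loc = sum(1 for ln in code.splitlines() if ln.strip())
    calls = re.findall(r"\b(\w+)\s*\(", code)   # crude call-site matcher
    return {"loc": loc, "dangerous_calls": sum(c in DANGEROUS for c in calls)}

def risk_score(feats, w_loc=0.01, w_danger=0.5):
    # Illustrative linear scoring squashed to (0, 1); a trained classifier
    # learns these weights from labeled vulnerability history
    z = w_loc * feats["loc"] + w_danger * feats["dangerous_calls"]
    return 1 / (1 + math.exp(-z))

risky = extract_features("strcpy(buf, user_input);\ngets(line);")
safe = extract_features("return a + b;")
```

The point is the pipeline shape, not the weights: code that calls `strcpy` and `gets` scores higher than code that does not.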
Module 4 · Slide 09
Vulnerability Prediction Pipeline
Walk through how a vulnerability prediction model extracts features and assigns a risk score to a code snippet.
Vulnerability Analyzer Interactive
Code Snippet:
Extracted Features:
Module 4 · Slide 10
Code Smell Detection
Code smells are symptoms of poor design choices that increase maintenance cost and defect likelihood. ML models can automatically detect them from software metrics.
Long Method
A method with too many lines of code, doing too much. Hard to understand and test.
God Class
A class that centralizes too much functionality and knows too much about the system.
Feature Envy
A method that uses data from another class more than its own. Suggests misplaced responsibility.
Data Clumps
Groups of data that frequently appear together and should be encapsulated into a class.
ML Approach
Extract software metrics (LOC, cyclomatic complexity, coupling, cohesion, depth of inheritance), then train a classifier (Random Forest, SVM, or DNN) to predict smell type.
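A rule-based baseline makes the metric-to-smell mapping concrete (the thresholds and metric names below are illustrative; a trained classifier learns these decision boundaries from labeled examples instead):

```python
def detect_smells(m):
    # m: dict of software metrics for one class or method
    smells = []
    if m.get("method_loc", 0) > 50:
        smells.append("Long Method")
    if m.get("class_loc", 0) > 500 and m.get("coupling", 0) > 10:
        smells.append("God Class")
    if m.get("foreign_accesses", 0) > m.get("own_accesses", 0):
        smells.append("Feature Envy")
    return smells

detect_smells({"method_loc": 80})                  # ['Long Method']
detect_smells({"class_loc": 900, "coupling": 25})  # ['God Class']
```

Fixed thresholds are exactly the "subjective thresholds" problem noted in the recap: a learned model replaces them with boundaries fit to human-labeled smells.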
Why It Matters
Studies show that code with smells has higher defect density and longer change times. Automated detection helps developers refactor proactively.
Module 4 · Slide 11
Interactive: Code Smell Detector
Analyze code metrics to classify the type of code smell present in a given snippet.
Smell Detector Interactive
Code Snippet:
Metrics Analysis:
Module 4 · Slide 12
Non-Generative Tasks Recap
Summary and comparison of the three non-generative tasks we covered.
Task
Input
Output
Features
Challenge
Clone Detection
Code pair
Clone type (I-IV)
Tokens, AST, embeddings
Type IV semantic clones
Vulnerability Prediction
Code component
Vulnerability score
Metrics, API calls, history
Class imbalance (few vulns)
Code Smell Detection
Class / method
Smell type
LOC, complexity, coupling
Subjective thresholds
Which clone type requires semantic understanding to detect?
Type IV — same functionality, different syntax (Correct)
Type II — renamed identifiers (Incorrect — Type II is detectable by token comparison)
Type III — gapped clones (Incorrect — Type III can be found with AST matching)
What is the main challenge in vulnerability prediction?
Class imbalance — very few vulnerable samples vs. many safe ones (Correct)
Code is too short to extract features (Incorrect)
Module 4 · Slide 13
From Non-Generative to Generative
We now shift from classifying existing code to generating new sequences of code tokens. This requires a fundamentally different class of models.
Module 4 · Slide 14
Word Embeddings
Before feeding code tokens to a neural network, we need to represent them as dense numerical vectors. Embeddings capture semantic relationships between tokens.
One-Hot Encoding
Each token gets a sparse vector with a single 1. Problem: no similarity info, huge dimensions (vocabulary can be 50K+).
Dense Embeddings (Word2Vec)
Each token mapped to a low-dimensional dense vector (e.g., 128-d). Tokens with similar contexts get similar vectors. int and float end up nearby.
Why Embeddings Work for Code
Code has strong local context: for(int i=0 is predictable. Embeddings capture that ArrayList is similar to LinkedList.
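A toy comparison of the two encodings over the slide's six-token vocabulary (the 3-d dense vectors are hand-picked for illustration; a real model learns them from context):

```python
import math

vocab = ["int", "float", "String", "for", "if", "return"]

def one_hot(token):
    v = [0.0] * len(vocab)
    v[vocab.index(token)] = 1.0
    return v

# Hand-picked 3-d "embeddings": numeric types deliberately placed nearby
dense = {
    "int":    [0.9, 0.1, 0.0],
    "float":  [0.8, 0.2, 0.1],
    "String": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# One-hot vectors of distinct tokens are always orthogonal (similarity 0);
# dense vectors place int closer to float than to String
```

This is the whole argument for embeddings in two numbers: one-hot similarity is 0 for every pair, while the dense vectors recover that `int` and `float` are related.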
Encoding Comparison Interactive
Vocabulary: int, float, String, for, if, return
Module 4 · Slide 15
Byte Pair Encoding (BPE)
Subword tokenization that splits rare identifiers into known subword units. Essential for handling code's open vocabulary (compound names like getEmbeddedIPv4ClientAddr).
1
Start with Characters
Initialize vocabulary with all individual characters in the training corpus.
2
Count Pairs
Find the most frequent adjacent pair of tokens (e.g., 'e' + 's' appearing 1000 times).
3
Merge
Replace all occurrences of the pair with a new token ('es'). Add to vocabulary.
4
Repeat
Repeat steps 2-3 for a fixed number of merges (e.g., 32K). Common words become single tokens; rare words split into subwords.
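The four steps reduce to a short merge loop (simplified: real BPE implementations operate on bytes and use end-of-word markers; the toy corpus is illustrative):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    corpus = Counter(tuple(w) for w in words)    # step 1: words as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq              # step 2: count adjacent pairs
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)         # most frequent pair (ties: first seen)
        merges.append(a + b)
        merged = Counter()
        for word, freq in corpus.items():        # step 3: merge every occurrence
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged                          # step 4: repeat
    return merges

bpe_merges(["lowest", "lower", "low", "low"], 2)   # first merges ('l','o'), then ('lo','w')
```

After 32K merges on real code, frequent identifiers survive whole while a rare name like `getEmbeddedIPv4ClientAddr` splits into known subwords.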
Module 4 · Slide 16
Long Short-Term Memory (LSTM)
Long Short-Term Memory networks solve the vanishing gradient problem of vanilla RNNs by introducing gated memory cells that selectively remember and forget information.
RNN Limitation
Vanishing gradients: in long sequences, gradients shrink exponentially during backpropagation. The network cannot learn long-range dependencies (e.g., matching a { with its closing } 50 tokens later).
LSTM Solution: Gates
Three gates control information flow through the cell:
Forget Gate f_t
Decides what to erase from cell state. sigmoid(W_f * [h_{t-1}, x_t])
Input Gate i_t
Decides what new info to store. sigmoid(W_i * [h_{t-1}, x_t])
Candidate g_t
New candidate values. tanh(W_g * [h_{t-1}, x_t])
Output Gate o_t
Decides what to output. sigmoid(W_o * [h_{t-1}, x_t])
Cell State Update
c_t = f_t * c_{t-1} + i_t * g_t
The cell state is a highway: the forget gate removes old info, the input gate writes new info.
h_t = o_t * tanh(c_t)
The output gate filters the cell state to produce the hidden state.
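A single scalar-valued LSTM step, matching the gate equations above (the weights are fixed illustrative values, not trained; real cells use weight matrices over vectors):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    # W maps each gate name to scalar weights (w_h, w_x, bias)
    f = sigmoid(W["f"][0] * h_prev + W["f"][1] * x + W["f"][2])    # forget gate
    i = sigmoid(W["i"][0] * h_prev + W["i"][1] * x + W["i"][2])    # input gate
    g = math.tanh(W["g"][0] * h_prev + W["g"][1] * x + W["g"][2])  # candidate
    o = sigmoid(W["o"][0] * h_prev + W["o"][1] * x + W["o"][2])    # output gate
    c = f * c_prev + i * g     # cell state: forget old info, write new info
    h = o * math.tanh(c)       # hidden state: output gate filters the cell
    return h, c

W = {k: (0.5, 0.5, 0.0) for k in "figo"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, W=W)
```

Note the additive cell update `c = f * c_prev + i * g`: because old information is carried by multiplication with a gate near 1 rather than repeated matrix products, gradients survive across long token spans.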
Why LSTM for Code
Code has long-range dependencies: variable declarations referenced many lines later, matching brackets, function calls. LSTM's cell state can carry this info across hundreds of tokens.
Module 4 · Slide 17
Interactive: LSTM Gate Explorer
Adjust the input values and observe how each gate responds. The bars show gate activations computed with fixed illustrative weights.
LSTM Gate Explorer Interactive
Input Values:
0.50
0.30
0.80
Outputs:
0.000
c_t (cell)
0.000
h_t (hidden)
Gate Activations:
Forget f_t
0.500
Input i_t
0.500
Candidate g_t
0.500
Output o_t
0.500
Module 4 · Slide 18
GRU: A Simpler Alternative to LSTM
Gated Recurrent Units achieve similar performance to LSTMs with a simpler architecture — fewer gates, fewer parameters, and faster training.
LSTM Recap
3 Gates: Input, Forget, Output. 2 States: Cell state (c_t) + Hidden state (h_t). Parameters: 4 weight matrices per layer.
Powerful but complex — more parameters to learn, slower to train.
GRU Design
2 Gates: Update gate (z_t) and Reset gate (r_t). 1 State: Hidden state (h_t) only — no separate cell state. Parameters: 3 weight matrices per layer.
The update gate combines the roles of LSTM’s forget and input gates.
Property
LSTM
GRU
Gates
3 (input, forget, output)
2 (update, reset)
States
2 (cell + hidden)
1 (hidden only)
Parameters/layer
4n(n+m)
3n(n+m)
Training speed
Slower
Faster
Long sequences
Better
Good
Small datasets
Overfits more
Preferred
When to Use GRU
GRUs are often preferred when you have less training data, since they have fewer parameters to learn. For very long sequences or when maximum performance is critical, LSTM may still have an edge.
Module 4 · Slide 19
Interactive: RNN vs LSTM vs GRU on Code
See how a simple RNN, LSTM, and GRU retain information over a code token sequence. RNNs suffer from vanishing gradients; gated architectures preserve long-range memory.
Memory Retention Comparison Interactive
Select a code example to see how each architecture retains information across tokens:
INFORMATION RETAINED vs TOKEN DISTANCE
Simple RNN
LSTM
GRU
Module 4 · Slide 20
Seq2Seq for Code
The encoder-decoder (sequence-to-sequence) architecture maps an input sequence to an output sequence of potentially different length — ideal for code translation and repair.
Encoder
An LSTM reads the input sequence token by token and compresses it into a fixed-size context vector (the final hidden state).
Decoder
A second LSTM generates the output sequence one token at a time, conditioned on the context vector and previously generated tokens.
Teacher Forcing
During training, the decoder receives the ground-truth previous token (not its own prediction). This stabilizes and accelerates training.
Bug Repair Example:
sum(arr, n-1)
→
Encoder LSTM
→
Context Vector
→
Decoder LSTM
→
sum(arr, n)
Bottleneck Problem
The entire input sequence must be compressed into a single fixed-size vector. For long code sequences, this loses information. Solution: attention mechanism (next slide).
Module 4 · Slide 21
Attention Mechanism
Instead of relying on a single context vector, attention lets the decoder look back at all encoder hidden states and focus on the most relevant parts for each output token.
How Attention Works
1. Compute alignment scores between decoder state and each encoder state. 2. Softmax to get attention weights (sum to 1). 3. Weighted sum of encoder states = context for this step. 4. Concatenate with decoder state to predict next token.
Why It Helps for Code
When fixing a bug, the decoder can directly attend to the relevant input tokens. For sum(arr, n-1) → sum(arr, n), attention learns to skip -1.
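The four steps above, sketched with toy 2-d vectors (plain dot-product alignment stands in for a learned scoring function; all values are illustrative):

```python
import math

def attention_context(decoder_state, encoder_states):
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]                    # 1. alignment scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]                 # 2. softmax weights
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(len(decoder_state))]          # 3. weighted sum
    return weights, context                                 # 4. context feeds the decoder

enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one hidden state per input token
weights, context = attention_context([1.0, 0.0], enc)
# weights sum to 1; encoder states aligned with the query get the most mass
```

The heatmap in the interactive below is exactly these `weights`, one row per output token.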
Attention Heatmap Interactive
Input (buggy): sum ( arr , n - 1 )
Click an output token to see its attention weights:
Module 4 · Slide 22
Beam Search Decoding
At inference time, the decoder must choose tokens one at a time. Beam search keeps multiple candidate sequences (beams) to avoid getting trapped by greedy local choices.
Greedy Decoding
Always pick the highest-probability token at each step. Fast but often produces suboptimal sequences — a locally good choice can lead to a poor overall result.
Beam Search (width = k)
1. Start with the top-k tokens for position 1. 2. For each beam, expand with all possible next tokens. 3. Keep only the top-k sequences by cumulative log-probability. 4. Repeat until all beams produce <EOS>. 5. Return the highest-scoring complete sequence.
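A minimal beam search over a hand-written toy distribution (the probability table is hypothetical; a real decoder computes next-token probabilities with the model at every step):

```python
import math

# Toy next-token distributions keyed by the sequence generated so far
P = {
    (): {"return": 0.45, "if": 0.30, "for": 0.25},
    ("return",): {"result": 0.6, "null": 0.4},
    ("if",): {"(": 0.5, "x": 0.5},
    ("return", "result"): {"<EOS>": 1.0},
    ("return", "null"): {"<EOS>": 1.0},
}

def beam_search(k, max_len=5):
    beams = [((), 0.0)]   # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq and seq[-1] == "<EOS>":
                candidates.append((seq, lp))   # finished beam carries over
                continue
            for tok, p in P[seq].items():      # expand with all next tokens
                candidates.append((seq + (tok,), lp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]  # keep top-k
        if all(seq[-1] == "<EOS>" for seq, _ in beams):
            break
    return beams[0][0]

beam_search(k=2)   # → ('return', 'result', '<EOS>')
```

Log-probabilities are summed rather than probabilities multiplied to avoid numeric underflow on long sequences, which is also why scores stay comparable across beams.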
Example: Beam Width = 2
Step 1: Top-2 starts
return (0.45)
if (0.30)
Step 2: Expand & keep top-2
return result (0.38)
return null (0.32)
Step 3: Best complete sequence
return result; (Winner, 0.35)
Trade-off
Larger beam width → better results but slower. Typical values: k=5 to k=10 for code generation tasks.
Module 4 · Slide 23
Generative Part 1 Recap
Summary of the key building blocks for LSTM-based generative models.
01
Embeddings
Dense vector representations that capture semantic similarity between code tokens. Foundation for all neural models.
02
BPE Tokenization
Subword splitting that handles code's open vocabulary. Compound identifiers become sequences of known subwords.
03
LSTM Gates
Forget, input, and output gates control memory flow. Solves vanishing gradients for long code sequences.
04
Seq2Seq
Encoder-decoder architecture maps input sequences to output sequences. Used for code translation and repair.
05
Attention
Lets the decoder focus on relevant encoder states at each step. Eliminates the information bottleneck.
06
Beam Search
Explores multiple candidate sequences during decoding. Avoids greedy local optima for better overall output.
Why does seq2seq with attention outperform vanilla seq2seq for long code sequences?
Attention lets the decoder access all encoder states, avoiding the fixed-size bottleneck (Correct)
Attention uses a larger context vector (Incorrect — the key is dynamic access, not vector size)
Module 4 · Slide 24
Self-Attention: The Key Innovation
Self-attention allows every token in a sequence to attend to every other token simultaneously, replacing the sequential processing of RNNs with parallel computation.
Query, Key, Value
Each token is projected into three vectors: Q (Query) — "What am I looking for?" K (Key) — "What do I contain?" V (Value) — "What information do I provide?"
Attention = softmax(QKᵀ / √d_k) · V
Why Replace Recurrence?
Parallelism: all positions computed simultaneously (vs. sequential RNN). Direct paths: any two tokens are connected in one step (vs. O(n) hops in RNN). Scalability: leverages GPU parallelism for training on massive code corpora.
Scaled Dot-Product
The scaling factor 1/sqrt(d_k) prevents dot products from becoming too large, which would push softmax into saturated regions with vanishing gradients.
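The full scaled dot-product computation in NumPy (random matrices stand in for the learned Q/K/V projections; shapes are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); each token is projected to a query, key, value
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # all-pairs scores, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                       # each row: weighted mix of all values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)          # shape (4, 8)
```

Every output row depends on every input row in a single matrix multiply, which is the parallelism and direct-path argument made above.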
Self-Attention for Code
Self-attention naturally learns structural patterns in code: ( attends to its matching ), function names attend to their arguments, and def links to :.
Input Tokens
→
Q, K, V Projections
→
Attention Scores
→
Weighted Sum
Module 4 · Slide 25
Multi-Head Attention & Positional Encoding
Multiple attention heads capture different types of relationships, while positional encodings inject sequence order into the otherwise position-agnostic Transformer.
Multi-Head Attention
Run h parallel attention functions with different learned projections. Each head can specialize: one head learns bracket matching, another learns data-flow, another learns identifier co-reference. Outputs are concatenated and linearly projected.
Sinusoidal Positional Encoding
Since self-attention has no notion of order, we add position vectors using sine/cosine functions at different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d)) PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
This lets the model generalize to unseen sequence lengths.
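The two formulas translate directly to code (sin on even dimensions, cos on odd; an even `d_model` is assumed):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]           # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                # PE(pos, 2i+1)
    return pe

pe = positional_encoding(8, 16)
# every entry lies in [-1, 1]; each row is a unique fingerprint for its position
```

Each resulting vector is simply added to the corresponding token embedding before the first attention layer.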
Positional Encoding Heatmap Interactive
Each row = a position (0-7), each column = a dimension. Brighter green = higher value.
Pattern
Low dimensions oscillate rapidly (capturing local position), high dimensions change slowly (capturing global position). Each position gets a unique fingerprint.
Module 4 · Slide 26
Transformer Architecture
The Transformer stacks self-attention and feed-forward layers into encoder and decoder blocks, with residual connections and layer normalization for stable deep training.
Encoder Block (x N layers):
Multi-Head Self-Attention
Add & Layer Norm (residual)
Feed-Forward Network
Add & Layer Norm (residual)
Decoder Block (x N layers):
Masked Self-Attention
Cross-Attention (to encoder)
Feed-Forward Network
Key Components
Residual Connections: input added to sublayer output, enabling gradient flow through deep stacks.
Masked Self-Attention: in the decoder, prevents attending to future tokens (preserves autoregressive property).
Cross-Attention: decoder queries attend to encoder keys/values, linking input and output.
Architecture Variants
Encoder-only (BERT, CodeBERT): bidirectional, for classification. Decoder-only (GPT, Codex): autoregressive, for generation. Encoder-decoder (T5, CodeT5): full seq2seq, for translation.
Module 4 · Slide 27
Autoregressive Generation: How LLMs Write Code
Large language models generate code one token at a time, left to right. Each generated token becomes part of the input for the next prediction step.
1
Start with Prompt
The user provides initial context: def sort_list(
2
Generate Next Token
Model predicts the most likely next token: arr. Context is now def sort_list(arr
3
Append and Repeat
Each new token extends the context: ) → : → \n → return → sorted …
4
Stop Condition
Generation stops when the model produces an end-of-sequence token (<EOS>) or reaches the maximum context length.
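The loop above, sketched with a lookup table standing in for the model (the table and token names are hypothetical; a real LLM produces a probability distribution at each step):

```python
# Toy "model": maps the last token to its single most likely successor
NEXT = {
    "def": "sort_list",
    "sort_list": "(",
    "(": "arr",
    "arr": ")",
    ")": ":",
    ":": "return",
    "return": "sorted_arr",
    "sorted_arr": "<EOS>",
}

def generate(prompt, max_tokens=20):
    tokens = list(prompt)
    while len(tokens) < max_tokens:          # context-length limit
        nxt = NEXT[tokens[-1]]               # greedy: pick the top token
        if nxt == "<EOS>":
            break                            # end-of-sequence stop condition
        tokens.append(nxt)                   # append and repeat
    return tokens

generate(["def"])
# → ['def', 'sort_list', '(', 'arr', ')', ':', 'return', 'sorted_arr']
```

A real model conditions on the entire token list rather than only the last token, but the append-and-repeat control flow is identical.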
Token-by-Token Generation Interactive
Key Insight
Every token an LLM generates was produced one at a time, left to right, conditioned on everything that came before. This is why context window size matters — the model can only “see” a fixed number of preceding tokens.
Sampling Strategies
Greedy: always pick the top token. Temperature: scale logits to control randomness. Top-k / Top-p: restrict to most likely tokens.
Higher temperature = more creative but less reliable code.
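Temperature scaling in isolation (the logits are illustrative): dividing logits by the temperature before the softmax sharpens or flattens the distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    return [e / sum(exps) for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: more randomness
# cold[0] > hot[0]: low temperature concentrates mass on the top token
```

Top-k and top-p are applied to a distribution like this afterwards, zeroing out the tail before sampling.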
Module 4 · Slide 28
Pre-training & Fine-tuning
Modern code models are first pre-trained on massive unlabeled code corpora, then fine-tuned on smaller task-specific datasets. This transfer learning paradigm dramatically reduces labeled data requirements.
Pre-training Objectives
MLM (Masked Language Modeling): randomly mask 15% of tokens, predict them from context. Used by BERT/CodeBERT. Learns bidirectional representations.
CLM (Causal Language Modeling): predict the next token given all previous tokens. Used by GPT/Codex. Learns to generate code left-to-right.
Fine-tuning
Add a task-specific head (classifier, decoder) on top of the pre-trained model. Train on labeled data with a small learning rate. The pre-trained weights provide a strong initialization that already understands code syntax and semantics.
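A sketch of the MLM corruption step on code tokens (the 15% rate follows BERT; this is simplified, omitting BERT's random-token and keep-original variants):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    # Hide ~15% of tokens behind [MASK]; pre-training asks the model to
    # recover the originals from the surrounding (bidirectional) context
    rng = random.Random(seed)
    masked, targets = [], {}
    for idx, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[idx] = tok      # the label the model must predict
        else:
            masked.append(tok)
    return masked, targets

code = "public int add ( int a , int b ) { return a + b ; }".split()
masked, targets = mask_tokens(code)
```

CLM needs no corruption step at all: the "label" for position t is simply the token at position t+1, which is why decoder-only models train directly on raw code.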
Transfer Learning Benefit Interactive
Labeled examples for fine-tuning:
1,000
From Scratch:
38%
Pre-trained:
76%
Module 4 · Slide 29
Fine-Tuning vs. Pre-Training
Understanding the economics and workflow of the two-stage training paradigm that powers every modern code model.
Pre-Training
Goal: Learn general language / code patterns. Data: Massive corpus (e.g., The Stack: 900GB of code). Cost: Millions of dollars in GPU compute. Duration: Weeks to months on hundreds of GPUs. Who does it: Research labs (Meta, Google, BigCode). Frequency: Done once, shared publicly.
Result
A foundation model that understands code syntax, semantics, and common patterns across many languages — but is not specialized for any particular task.
Fine-Tuning
Goal: Specialize the model for a specific task or domain. Data: Small labeled dataset (hundreds to thousands of examples). Cost: Cheap — hours on a single GPU. Duration: Minutes to hours. Who does it: Practitioners, researchers, teams. Frequency: Done for each new task or domain.
TYPICAL PIPELINE
Pre-train on The Stack
→
Fine-tune on Java methods
→
Deploy for Java completion
Practical Takeaway
You will almost never pre-train a model from scratch. Fine-tuning lets you specialize a powerful model for your task with relatively little data and compute. This is the standard workflow for all course projects.
Module 4 · Slide 30
Code Models Ecosystem
An overview of the major pre-trained models for code, their architectures, and capabilities.
Model
Architecture
Training Data
Key Capability
CodeBERT
Encoder-only
CodeSearchNet (6 langs)
Code search, clone detection, defect prediction
GraphCodeBERT
Encoder-only
CodeSearchNet + data flow
Structure-aware code understanding
CodeT5
Encoder-decoder
CodeSearchNet + C/C#
Code generation, summarization, translation
StarCoder
Decoder-only
The Stack (80+ langs)
Code completion, fill-in-the-middle
Codex / GPT
Decoder-only
GitHub public code
Code generation from NL, Copilot backend
Key Trend
The field has moved from encoder-only models (good for understanding) to decoder-only models (good for generation), with encoder-decoder models offering a balance for seq2seq tasks like code translation.
Module 4 · Slide 31
HNN & CC2Vec: Learning Code Changes
Specialized architectures for understanding code changes (diffs) rather than static code snapshots.
HNN — Hierarchical Neural Network
Generates commit messages from code diffs using a two-level hierarchy: first encode individual changed hunks, then aggregate hunk representations to generate a natural language commit message.
CC2Vec — Code Change to Vector
Learns distributed representations of code changes by separately encoding added lines, removed lines, and their context. The resulting vector captures the semantics of a commit for downstream tasks (just-in-time defect prediction, commit classification).
CC2Vec Pipeline — Step Through:
Module 4 · Slide 32
From Theory to Practice: The DL4SE Toolkit
The tools that abstract away complexity and let you go from idea to working model in under 50 lines of Python.
PyTorch
The framework. Tensors, automatic differentiation (autograd), and nn.Module for building models. The dominant framework in research and increasingly in industry.
HuggingFace Transformers
Pre-trained models on demand. Load any code model (CodeBERT, CodeT5, StarCoder) with one line. Includes tokenizers, pipelines, and fine-tuning utilities.
Weights & Biases
Experiment tracking. Log metrics, visualize training curves, compare runs, and run hyperparameter sweeps. Essential for reproducible research.
Google Colab / Lightning AI
Free GPU compute. Run Jupyter notebooks with GPU/TPU acceleration in the browser. No local GPU required for course projects.
HuggingFace Datasets
Data loading made easy. Access standard SE datasets (CodeSearchNet, The Stack) with streaming support for large corpora.
Git & DVC
Version control for ML. Git for code, DVC (Data Version Control) for datasets and model checkpoints that are too large for Git.
Key Insight
These tools abstract away most of the complexity. You can load a pre-trained code model and fine-tune it in under 50 lines of Python. Focus on the research question, not the plumbing.
Module 4 · Slide 33
Module 4 Recap & Knowledge Check
01
Non-Generative Tasks
Clone detection (Types I-IV), vulnerability prediction, code smell detection — classification without generating new code.
02
Embeddings & BPE
Dense vector representations and subword tokenization form the foundation for feeding code to neural networks.
03
LSTM & Seq2Seq
Gated recurrent cells with encoder-decoder architecture for sequence-to-sequence code tasks.
04
Attention & Transformers
Self-attention replaces recurrence; multi-head attention captures diverse code patterns in parallel.
05
Pre-training & Fine-tuning
Train on massive unlabeled code (MLM/CLM), then fine-tune on small labeled datasets for specific SE tasks.
06
Code Models
CodeBERT, CodeT5, StarCoder, Codex — the ecosystem of pre-trained models powering modern AI-assisted development.
What pre-training objective does CodeBERT use?
Masked Language Modeling (MLM) — predict randomly masked tokens from bidirectional context (Correct)
Causal Language Modeling (CLM) — predict next token left-to-right (Incorrect — CLM is used by decoder-only models like GPT/Codex)
Which architecture variant is best suited for code translation (input lang → output lang)?
Encoder-decoder (e.g., CodeT5) — encodes input and generates output sequence (Correct)
Why is BPE tokenization particularly important for code?
Code has compound identifiers (e.g., getEmbeddedIPv4) that would be OOV with word-level tokenization (Correct)
BPE makes code run faster at inference time (Incorrect — BPE affects vocabulary coverage, not inference speed directly)
Module 4 · Slide 34
What's Next?
You now have a foundation in both non-generative and generative deep learning for software engineering. Here is how this connects to the rest of the course.
Evaluating AI-enabled SE (Module 3)
How do we measure whether these models actually work? Module 3 covers evaluation metrics (BLEU, CodeBLEU, accuracy, F1), benchmarks, and experimental design for AI4SE research.
Practical Applications
These foundational models power real tools: GitHub Copilot (Codex), automated code review (CodeBERT), vulnerability scanners, and refactoring assistants.
References & Further Reading
CodeBERT: Feng et al., 2020. Pre-trained models for NL-PL understanding and generation.
CodeT5: Wang et al., 2021. Unified pre-trained encoder-decoder for code understanding and generation.
CC2Vec: Hoang et al., 2020. Distributed representations of code changes.
Attention Is All You Need: Vaswani et al., 2017. The original Transformer paper.
Key Takeaway
The shift from handcrafted features to learned representations, and from task-specific models to pre-trained + fine-tuned architectures, has transformed software engineering research and practice.