Module 1 · Slide1

Mining Software Repositories

Where do we find the data that feeds AI-driven software engineering? This module explores how to collect, clean, and prepare source-code data from public repositories — the critical first step before any deep learning model can learn.

100M+
Developers on GitHub
400M+
Public Repositories
3
Repo Types
Artifacts to Mine
Goal
Build AI systems that support developers in one or more SE-related tasks — by leveraging data available in software repositories.
Module 1 · Slide2

What Is a Repository?

A repository (repo) is a centralized digital store where developers manage an application's source code and related artifacts. It lets developers track code changes, edit files simultaneously, and collaborate efficiently from any location.

Version Control
A versioning system manages changes to configuration items (artifacts). It tracks what was changed, who made the change, when, and why. It enables retrieving specific revisions and managing branches.
📝
Commit
A snapshot of changes to one or more files, with a message describing what changed and why.
🌿
Branch
An independent line of development allowing parallel work on features, bug fixes, or experiments.
🔀
Merge / Pull Request
Integrating changes from one branch into another, often through a reviewed pull request.
🏷️
Tag / Release
A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
Module 1 · Slide3

Types of Repositories

Software projects generate artifacts across three main repository types. Click each tab to explore what kind of data lives inside.

📂 Source Repositories
🐛 Bug Repositories
💬 Communication Repos
Key Insight
MSR (Mining Software Repositories) leverages data available in these repositories to aid development activities. Our overarching goal is to build AI systems that can support developers in one or more SE-related tasks.
Module 1 · Slide4

Version Control Refresher

MSR depends on version control history. Here is a quick Git refresher of the concepts that become data points for AI models.

📸
Commit
A snapshot of changes to one or more files. Each commit has a unique SHA hash and records who, when, and why.
Think of it as a save point in a video game.
🌳
Branch
A parallel line of development. Developers create branches to work on features or fixes without disturbing the main codebase.
🔀
Merge / Pull Request
Combining changes from one branch into another. Pull requests add a code-review layer before merging.
🏷️
Tag
A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
📋
Diff
Shows the exact changes between two versions of a file: lines added, removed, or modified. Diffs are central to code review and change-aware AI models.
Key Insight
Every commit is a data point. MSR treats version history as a rich dataset for training AI models — commit messages become natural language labels, diffs become input features, and branches capture development workflows.
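As a minimal sketch of the "every commit is a data point" idea, commit metadata can be parsed into structured records. The log text below is an illustrative stand-in for real `git log --pretty=format:"%H|%an|%ad|%s"` output; the SHAs, names, and messages are hypothetical.

```python
# Stand-in for `git log --pretty=format:"%H|%an|%ad|%s"` output (hypothetical data).
log_text = """a1b2c3d|Alice|2024-03-01|Fix off-by-one in pagination
e4f5a6b|Bob|2024-03-02|Add retry logic to HTTP client"""

def parse_log(text):
    """Turn one-commit-per-line log text into structured records."""
    commits = []
    for line in text.splitlines():
        sha, author, date, message = line.split("|", 3)
        commits.append({"sha": sha, "author": author, "date": date, "message": message})
    return commits

commits = parse_log(log_text)
# The commit message becomes a natural-language label for the change.
print(commits[0]["message"])
```

Each record pairs a change with who made it, when, and why, which is exactly the structure downstream models consume.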
Module 1 · Slide5

Rule-based vs. Data-driven Approaches

AI systems don't have to be grounded in data-driven methods. Some systems encode expert knowledge directly as rules. But for modern SE automation, data-driven approaches (especially deep learning) dominate.

🧠 Rule-based (Expert Systems)

  • Hand-crafted IF/THEN rules
  • Domain experts encode knowledge
  • No training data needed
  • Brittle: fails on edge cases
  • Hard to scale and maintain
IF: the screen is blue
AND: there is an error message
THEN: it's all good, it's Windows
VS

🤖 Data-driven (Machine Learning)

  • Learn patterns from data automatically
  • Require large, high-quality datasets
  • Generalize to unseen examples
  • Scale with more data
  • State of the art for SE tasks
Modern Approach
Among data-driven techniques, deep learning models, particularly Large Language Models (LLMs), are the most data-hungry: they require vast amounts of high-quality data to learn and generalize effectively.
Module 1 · Slide6

Why So Much Data?

Deep learning models, especially LLMs, learn statistical patterns from enormous corpora. More data means better generalization — but the data must be high quality.

The Data Hunger
A model trained on cat images needs millions of examples to distinguish breeds. Similarly, a code model needs millions of functions to learn patterns like variable naming, control flow, and idiomatic usage.
What Type of Data?
We mine data from publicly available GitHub repositories — both source code and natural language (comments, commit messages, issues, documentation).
GitHub Copilot Example
GitHub Copilot can generate code and natural language because it was trained on massive amounts of open-source code from GitHub repositories — learning both the structure of code and the intent expressed in comments and documentation.
GitHub Repos
Mine Data
Clean & Filter
Train Model
Module 1 · Slide7

Ensuring Data Quality

Not all data is useful. We need to ensure high-quality datasets through careful repository selection and rigorous preprocessing.

Step 1 — Select Good Repositories
Use repository quality as a proxy. Filters include: minimum stars, active maintenance (recent commits), non-fork status, proper licensing, and meaningful commit history.
Step 2 — Enforce Quality Checks
Once data is collected, apply sanity checks: remove duplicates, filter by language, remove auto-generated code, check for encoding issues, and validate syntactic correctness.
Remember
Garbage in, garbage out. A model trained on low-quality data will produce low-quality predictions. Data curation is arguably the most important (and most underappreciated) step in the ML pipeline.
Module 1 · Slide8

Preprocessing Source Code

Raw code from repositories needs systematic cleaning before it can be used for training. Here are the essential preprocessing steps applied to each method/function.

  1. Remove duplicates — Their presence can hinder the learning ability of the model by creating data leakage between training and test sets.
  2. ASCII-only characters — Keep code that contains only standard ASCII characters to avoid encoding issues.
  3. Remove outliers — Define outliers as methods that are incredibly long or incredibly short (e.g., single-line getters or 1000-line methods).
  4. Remove boilerplate — Eliminate trivial code like getters, setters, and auto-generated constructors.
  5. Strip comments — Clean the method by removing all inline and block comments.
  6. Custom criteria — Remove code that doesn't fit project-specific criteria (e.g., drop methods with Cyclomatic Complexity < 5 when the goal is to train on complex code).
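A few of the steps above (deduplication, ASCII filtering, outlier removal) can be sketched in plain Python; the token thresholds and sample methods here are illustrative, not prescribed values.

```python
import hashlib

def preprocess(methods, min_tokens=3, max_tokens=100):
    """Sketch of steps 1-3 above; thresholds are illustrative."""
    seen = set()
    kept = []
    for code in methods:
        # Step 1: remove exact duplicates via content hashing.
        digest = hashlib.sha256(code.encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Step 2: keep ASCII-only code to avoid encoding issues.
        if not code.isascii():
            continue
        # Step 3: remove outliers by token count (whitespace split as a crude proxy).
        n_tokens = len(code.split())
        if not (min_tokens <= n_tokens <= max_tokens):
            continue
        kept.append(code)
    return kept

methods = [
    "int add(int a, int b) { return a + b; }",
    "int add(int a, int b) { return a + b; }",  # exact duplicate -> dropped
    "int x;",                                    # too short -> dropped
]
print(preprocess(methods))
```

A real pipeline would use a proper lexer for the token count and add the boilerplate, comment-stripping, and complexity filters from the list.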
Module 1 · Slide9

Interactive: Preprocessing Pipeline

Toggle each preprocessing step and watch the code get cleaned in real time.

Module 1 · Slide10

Tokenization

Tokenization is the process that breaks down text into smaller units (tokens) that can be analyzed separately. For code, this means converting raw source into a structured sequence of keywords, identifiers, operators, and literals.

Before Tokenization
public int addNumbers(int a, int b){
    int sum=a+b;
    return sum;
}
After Tokenization
public int addNumbers ( int a , int b ) {
    int sum = a + b ;
    return sum ;
}
🔤
Lexer
Converts raw text into a sequence of tokens — the basic building blocks of the language (keywords, operators, literals, identifiers).
🌲
Parser
Takes the token sequence from the lexer and analyzes them to understand the structure and syntax of the code (builds an AST).
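A lexer's job can be sketched with a toy regex-based tokenizer. Real pipelines would use javalang or tree-sitter; the keyword set and token classes below cover only the slide's example and are illustrative.

```python
import re

# Toy lexer for a Java-like fragment; only the constructs in the
# slide's example are covered (illustrative, not a full Java lexer).
TOKEN_SPEC = [
    ("keyword",    r"\b(?:public|int|return)\b"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("literal",    r"\d+"),
    ("operator",   r"[+\-*/=]"),
    ("separator",  r"[(){},;]"),
    ("skip",       r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(code):
    """Return (token_class, lexeme) pairs, dropping whitespace."""
    return [(m.lastgroup, m.group()) for m in PATTERN.finditer(code)
            if m.lastgroup != "skip"]

tokens = lex("public int addNumbers(int a, int b){ int sum=a+b; return sum; }")
print(" ".join(t[1] for t in tokens))
# public int addNumbers ( int a , int b ) { int sum = a + b ; return sum ; }
```

Note how the output matches the "After Tokenization" view above: the lexer separates identifiers, operators, and separators into distinct units.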
Module 1 · Slide11

Why Is Tokenization Needed?

Reducing Complexity
Tokenization divides text into smaller units, making it easier for the model to identify patterns and relationships. Instead of processing raw character streams, the model works with meaningful units.
Handling the Vocabulary
By dividing text into tokens, the model can create a numerical representation of the vocabulary, making it easier to process and understand. Each unique token gets an ID in the vocabulary.

Tokenization Tools

code-tokenize
Python · Multi-language
Works for several programming languages. Provides tokenization tailored to code structure.
javalang
Python · Java only
Java-specific lexer and parser in Python. Produces fine-grained Java tokens.
JavaParser
Java · Java only
Full Java parser that builds a complete AST. Used for structural analysis.
Pygments
Python · Multi-language
Generic syntax highlighter built on per-language lexers. Works for many languages.
Module 1 · Slide12

Interactive: Code Tokenizer

Enter Java code below and click "Tokenize" to see it broken into classified tokens.

keyword
type
identifier
literal
operator
separator
Module 1 · Slide13

Beyond Lexer Tokens: Subword Tokenization

Lexer-based tokenization is great for analysis, but modern LLMs use a different approach: BPE (Byte Pair Encoding). Understanding both is essential.

Lexer-based Tokenization
Language-aware — knows Java keywords, operators, types
Fixed vocabulary per language (keyword set + identifiers)
Whole identifiers as single tokens
Example: getMaxValue → [getMaxValue] (1 token)
Subword / BPE Tokenization
Language-agnostic — learns frequent character sequences
Learned vocabulary from training data (32K–100K tokens)
Splits identifiers into common subwords
Example: getMaxValue → [get, Max, Value] (3 tokens)
Why BPE Wins for LLMs
Modern LLMs (GPT, CodeLlama, StarCoder) use BPE because it handles any vocabulary — including unseen identifiers, mixed languages, and even natural language comments — with a fixed-size token dictionary. No out-of-vocabulary problem.
50K
GPT-2 Vocab Size
32K
CodeLlama Vocab
49K
StarCoder Vocab
100K
GPT-4 Vocab Size
Module 1 · Slide14

Interactive: BPE vs Lexer Tokenizer

Type Java code below to see side-by-side comparison: lexer tokens (colored by type) vs simulated BPE tokens (showing how identifiers get split).

Java Lexer Tokens
BPE (GPT-style) Tokens
Lexer tokens: 0 BPE tokens: 0
Module 1 · Slide15

Abstract Syntax Trees (ASTs)

While tokenization produces a flat sequence, an AST captures the hierarchical structure of code. ASTs are used for code understanding tasks and semantic analysis.

public int add(int a, int b) {
    return a + b;
}
Key Points
1. ASTs capture program structure, not just tokens
2. Used for code understanding and semantic analysis
3. Enable tree-based neural models (Tree-LSTM, code2seq)
4. Built by parsers like JavaParser, tree-sitter
MethodDeclaration
  ├─ Type: int
  ├─ Name: add
  ├─ Parameters
  │  ├─ Parameter
  │  │  ├─ Type: int
  │  │  └─ Name: a
  │  └─ Parameter
  │     ├─ Type: int
  │     └─ Name: b
  └─ Body
     └─ ReturnStatement
        └─ BinaryExpression (+)
           ├─ Left: a
           └─ Right: b
Tokens vs Trees
While n-gram models see code as flat text, ASTs preserve the hierarchical structure that makes code different from natural language. This distinction becomes important when we discuss code embeddings and transformer architectures in later modules.
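The Java method above would be parsed with JavaParser or tree-sitter; the same idea can be shown self-contained with Python's built-in `ast` module on an equivalent function.

```python
import ast

# Parse the Python equivalent of the `add` method above and walk its tree.
tree = ast.parse("def add(a, b):\n    return a + b")
func = tree.body[0]       # FunctionDef node (method declaration)
ret = func.body[0]        # Return node (the method body)
expr = ret.value          # BinOp node: the binary expression a + b
print(type(func).__name__, type(ret).__name__, type(expr).__name__)
# FunctionDef Return BinOp
```

The nesting (function → return → binary expression) mirrors the MethodDeclaration tree shown above: structure a flat token sequence cannot express.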
Module 1 · Slide16

Tools for Mining at Scale

Mining millions of repositories by hand isn't practical. Specialized APIs and search platforms make large-scale dataset construction possible.

GitHub REST & GraphQL API
Programmatic access to repository metadata, commits, issues, PRs, file contents, and more. Rate-limited (5,000 req/hour with auth). Libraries: PyGitHub (Python), Octokit (JS), go-github (Go).
# PyGitHub example
from github import Github

g = Github("access_token")
for repo in g.search_repositories(
    query="language:java stars:>100"):
    print(repo.full_name, repo.stargazers_count)
SEART-GHS
SEART GitHub Search (GHS), a search engine for GitHub repositories maintained by the SEART research group at USI. It provides advanced filtering by language, stars, commits, contributors, license, and more.
Best Practices
1. Always respect rate limits and API terms
2. Cache responses to avoid redundant requests
3. Use bulk exports (GH Archive, GHTorrent) for historical data
4. Verify license compatibility for your use case
Module 1 · Slide17

The MSR Pipeline: End to End

From raw repositories to a clean, tokenized dataset ready for model training — here's the complete pipeline.

Select Repos
Stars, activity, license
Clone & Extract
Methods, classes, files
Preprocess
Dedup, filter, clean
Tokenize
Lexer → token sequence
Dataset
Train / Val / Test
Stage | Input | Output | Key Concern
Repository Selection | GitHub / SEART-GHS | Repo list (URLs) | Quality proxy (stars, activity)
Data Extraction | Repo list | Raw methods / files | Language filtering, scope
Deduplication | Raw methods | Unique methods | Data leakage prevention
Preprocessing | Unique methods | Clean methods | Outliers, boilerplate, encoding
Tokenization | Clean methods | Token sequences | Vocabulary size, OOV handling
Split | Token sequences | Train / Val / Test | No overlap between splits
Module 1 · Slide18

Code as Data: What Makes It Special?

Source code is unlike natural language text in several important ways. Understanding these differences shapes how we build datasets and models.

Formal Syntax
Code must compile or parse. A single misplaced semicolon breaks everything. This rigid structure is both a constraint and an advantage for learning.
Executable Semantics
Code has deterministic meaning: we can run it, test it, and verify outputs. This enables automatic labeling and evaluation unlike natural language.
Multi-level Representation
The same code can be viewed as characters, tokens, AST nodes, control-flow graphs, or data-flow graphs. Each level reveals different patterns.
Bimodal Nature
Repositories contain both code and natural language (comments, docs, commit messages). Models can learn the mapping between intent and implementation.
Key Implication
Because code is formal, executable, and multi-level, MSR datasets can be richer than typical NLP corpora. We can extract not just text, but structure (ASTs), behavior (tests), and evolution (diffs) from the same repository.
Module 1 · Slide19

Real-World MSR Datasets

Researchers have curated benchmark datasets from mined repositories. These standardized datasets enable reproducible experiments and fair comparisons across techniques.

Dataset | Language(s) | Size | Primary Task | Key Paper
CodeSearchNet | 6 languages | 2M code-NL pairs | Code search & retrieval | Husain et al., 2019
Defects4J | Java | 835 real bugs | APR & testing | Just et al., 2014
BigCloneBench | Java | 8M clone pairs | Clone detection | Svajlenko et al., 2014
The Stack | 300+ languages | 6 TB | Pre-training code LLMs | Kocetkov et al., 2022
Methods2Test | Java | 780K focal-test pairs | Test generation | Tufano et al., 2022
Key Insight
These curated datasets are the bridge between raw repositories and reproducible research. Without standardized benchmarks, it would be impossible to compare different approaches fairly.
Module 1 · Slide20

Dataset Licensing & Legal Considerations

Public code is not necessarily free to use for any purpose. Ethical and legal issues are critical when building MSR datasets.

⚖️ License Compliance

MIT — permissive, almost no restrictions. Apache 2.0 — permissive with patent grants. GPL — copyleft, derivatives must also be GPL. Always check if a license permits use as training data.

⚠️ The Copilot Controversy

GitHub Copilot trained on public repos regardless of license, sparking a class-action lawsuit. Developers argued their copyleft code was used without respecting license terms.

🔒 Privacy in Code

Repositories often contain PII in comments (names, emails), hardcoded API keys, database credentials, and internal URLs. These must be scrubbed from training data.

🤝 Responsible Collection

Respect robots.txt and API rate limits. Provide attribution when possible. Consider opt-out mechanisms for developers who do not want their code used for training.

Remember
Just because code is public does not mean it is free to use for any purpose. Always verify license compatibility, scrub sensitive data, and respect the intent of open-source contributors.
Module 1 · Slide21

Data Provenance & Reproducibility

Tracking where data comes from and how it was processed is essential for scientific rigor and reproducibility in MSR research.

  1. Record source repos — Store full URLs, commit hashes, and timestamps for every repository you mine. This lets others verify and replicate your dataset.
  2. Version your filters — Document every inclusion/exclusion criterion (min stars, language, file patterns). Even small filter changes can drastically alter results.
  3. Timestamp your collection — Repositories change constantly. Code added, deleted, or relicensed after your snapshot may differ from what you collected.
  4. Share your pipeline — Publish your mining scripts, configuration files, and random seeds. A dataset without reproducible construction is scientifically weak.
  5. Use standardized formats — Store datasets in well-known formats like JSONL (one JSON object per line) or Parquet (columnar, compressed, fast). Include metadata fields.
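Steps 1 and 5 above come together in the way each mined method is stored: a JSONL line that carries both the code and its provenance. The record below is a sketch; the field names and values are illustrative, not a standard schema.

```python
import json

# Sketch of a provenance record stored alongside each mined method.
# Field names and the repo URL are illustrative, not a standard schema.
record = {
    "repo_url": "https://github.com/user/example-repo",   # hypothetical repo
    "commit_sha": "a1b2c3d4",                              # snapshot pinned by hash
    "file_path": "src/Main.java",
    "collected_at": "2024-06-01T12:00:00Z",                # collection timestamp
    "filters": {"min_stars": 100, "language": "Java"},     # versioned criteria
    "code": "int add(int a, int b) { return a + b; }",
}
line = json.dumps(record)        # one JSON object per line -> JSONL
restored = json.loads(line)
print(restored["commit_sha"])
```

Pinning the commit SHA and recording the filter configuration is what lets someone else rebuild the exact same dataset later.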
Golden Rule
A dataset without provenance is scientifically useless. Always document HOW you built it, so others can reproduce, verify, and extend your work.
Module 1 · Slide22

Interactive: Repository Filter Simulator

Configure repository selection criteria and watch how quickly the pool of usable repositories shrinks. Every filter trades quantity for quality.

Minimum Stars
0
0 · 10 · 50 · 100 · 500
Minimum Commits
1
1 · 10 · 50 · 100
Language Filter
Additional Filters
~400M
All 400M+ public GitHub repositories are included. Start adjusting filters to see the funnel effect.
Module 1 · Slide23

Deduplication: Why & How

Duplicate code inflates datasets and causes data leakage. Removing duplicates is essential for training models that generalize rather than memorize.

Exact Duplicates
Hash-based detection (MD5 or SHA-256): compute a hash for each file or method. Identical hashes mean identical content. Fast and simple — catches the same file copied across multiple repos.
Near-Duplicates
Jaccard similarity on token sets measures overlap. For scalability, use MinHash + Locality-Sensitive Hashing (LSH) to find near-duplicates across millions of files without pairwise comparison.
Cross-Split Leakage
The same function appearing in both training and test sets invalidates evaluation. The model appears to generalize but is actually recalling memorized examples.

These two snippets are near-duplicates — same logic, renamed variables:

Version A
public int calculateSum(int x, int y) { int result = x + y; return result; }
Version B
public int addNumbers(int a, int b) { int sum = a + b; return sum; }
Research Finding
Studies show 10–30% of GitHub code is duplicated. Without deduplication, models memorize rather than generalize — leading to inflated performance metrics.
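The two snippets above make the point concrete: exact hashing sees two different files, while token-set Jaccard similarity exposes the shared structure. A minimal sketch:

```python
import re, hashlib

a = "public int calculateSum(int x, int y) { int result = x + y; return result; }"
b = "public int addNumbers(int a, int b) { int sum = a + b; return sum; }"

# Exact matching: different hashes, so no duplicate detected.
exact = hashlib.sha256(a.encode()).hexdigest() == hashlib.sha256(b.encode()).hexdigest()

# Token-set Jaccard: overlap of unique tokens over their union.
def token_set(code):
    return set(re.findall(r"\w+|[^\w\s]", code))

ta, tb = token_set(a), token_set(b)
jaccard = len(ta & tb) / len(ta | tb)
print(exact, round(jaccard, 2))
```

Only the identifier names differ, so the Jaccard score lands well above what two unrelated methods would share, which is why near-duplicate detection uses token similarity rather than exact hashes.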
Module 1 · Slide24

Interactive: Duplicate Detector

Paste two code snippets below to compute similarity. See how near-duplicates can fool exact matching.

Snippet A
Snippet B
Exact match:
0%
Token Jaccard:
0%
Similarity threshold: 70%
Verdict
Adjust the snippets above to see similarity analysis.
Module 1 · Slide25

Hashing for Deduplication

Deduplication at scale relies on hashing. Here are the key concepts and how they fit into the MSR pipeline.

What Is a Hash?
A deterministic function that maps any input to a fixed-size output (digest). Same input always produces the same hash. Even a 1-character change produces a completely different hash.
MD5 / SHA-256
Fast to compute and effectively unique for deduplication purposes. Used for exact duplicate detection: hash each file, group by hash, keep one per group. MD5 is no longer collision-resistant, so SHA-256 is preferred.
MinHash
Approximates Jaccard similarity efficiently. Instead of comparing full token sets pairwise, MinHash builds compact fixed-size signatures whose comparison cost is independent of the original set sizes.
LSH (Locality-Sensitive Hashing)
Groups similar items into the same buckets with high probability. Combined with MinHash, it finds near-duplicates in sub-linear time across millions of files.

Type some code below to see a simulated hash update in real time:

Simulated SHA-256 ...
Pipeline Fit
Step 1: Use SHA-256 to remove exact duplicates (fast, O(n)). Step 2: Use MinHash+LSH to find near-duplicates (scalable, sub-quadratic). This two-phase approach handles datasets with millions of files.
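A toy MinHash can be built from a family of salted hashes: each signature position keeps the minimum hash over the token set, and the fraction of matching positions estimates Jaccard similarity. This is a sketch for intuition; production code would use a dedicated library such as datasketch.

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """Toy MinHash: one salted SHA-256 per position, minimum over the token set."""
    sig = []
    for i in range(num_hashes):
        salt = str(i).encode()
        sig.append(min(
            int(hashlib.sha256(salt + t.encode()).hexdigest(), 16)
            for t in tokens
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching signature positions approximates Jaccard similarity.
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Two token sets sharing 7 of 9 distinct tokens (true Jaccard ~ 0.78).
a = {"public", "int", "add", "(", ")", "{", "return", "}"}
b = {"public", "int", "sum", "(", ")", "{", "return", "}"}
est = estimated_jaccard(minhash_signature(a), minhash_signature(b))
print(round(est, 2))
```

Comparing two 64-number signatures is far cheaper than intersecting full token sets, and LSH then buckets similar signatures so most pairs are never compared at all.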
Module 1 · Slide26

Data Splitting Strategies

How you split data into train, validation, and test sets matters as much as the data itself. The wrong strategy can silently invalidate your results.

Random Split
Simplest approach: shuffle all methods and split. Fast but risky — methods from the same project can appear in both train and test sets, causing data leakage.
Project-Based Split
All methods from one project go into the same split. Prevents cross-project leakage since methods in the same project share style, APIs, and patterns.
Temporal Split
Train on older commits, test on newer ones. Simulates real-world deployment where the model must predict code it has never seen from the future.
Typical Ratios
Common splits: 80/10/10 or 70/15/15 (train / validation / test). Larger test sets give more reliable evaluation estimates.

10 projects split into Train / Val / Test under each strategy:

RANDOM SPLIT (leakage risk)
P1 P2 P3* P5 P6 P7 P8 P9*
P3* P4
P9* P10
* P3 and P9 appear in multiple splits = leakage
PROJECT-BASED SPLIT (recommended)
P1 P2 P3 P4 P5 P6 P7 P8
P9
P10
Each project appears in exactly one split
TEMPORAL SPLIT (real-world simulation)
commits before 2022
2022–2023
2024+
Model never sees future code during training
Train
Validation
Test
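The recommended project-based strategy above can be sketched in a few lines: shuffle projects, then assign whole projects to splits. The data and ratios below are illustrative.

```python
import random

def project_based_split(methods, train=0.8, val=0.1, seed=42):
    """Sketch: assign whole projects to splits so no project straddles two."""
    projects = sorted({m["project"] for m in methods})
    random.Random(seed).shuffle(projects)
    n = len(projects)
    n_train, n_val = int(n * train), int(n * val)
    split_of = {}
    for i, p in enumerate(projects):
        split_of[p] = "train" if i < n_train else "val" if i < n_train + n_val else "test"
    return [{**m, "split": split_of[m["project"]]} for m in methods]

# Hypothetical data: 10 projects (P0..P9) with 3 methods each.
data = [{"project": f"P{i}", "code": f"m{j}"} for i in range(10) for j in range(3)]
split = project_based_split(data)
counts = {s: sum(m["split"] == s for m in split) for s in ("train", "val", "test")}
print(counts)   # 8 / 1 / 1 projects -> 24 / 3 / 3 methods
```

Because the assignment happens at the project level, all methods from P0 land in the same split, which is precisely what prevents cross-project leakage.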
Module 1 · Slide27

Worked Example: From Repo to Dataset

Let us walk through a concrete example of building a dataset for Java code summarization — generating a natural-language description for a given method.

1
Define the task: Build a dataset of Java method – Javadoc comment pairs for code summarization.
2
Query SEART-GHS: Java repos with ≥100 stars and ≥50 commits.
3,847 repos
3
Clone & extract: Parse each repo with JavaParser, extract methods that have Javadoc comments.
2.1M pairs
4
Preprocess: Remove duplicates (→1.4M), remove trivial methods (→980K), filter by length 3–100 tokens (→820K), ASCII-only (→810K).
810K pairs
5
Tokenize: Use javalang to convert each method into a token sequence.
810K sequences
6
Project-based split: Ensure no project appears in multiple splits.
648K / 81K / 81K
The Funnel Effect
Starting from 3,847 repositories and 2.1 million raw pairs, preprocessing removes nearly 62% of the data. This is normal — quality always trumps quantity.
Module 1 · Slide28

Common Pitfalls in MSR Research

Avoid these frequent mistakes that can silently invalidate your MSR experiments and models.

Leaky Splits

Same project (or near-duplicate code) appears in both training and test sets, inflating evaluation metrics.

✓ Use project-based splitting + deduplication across splits.

Stale Data

Using outdated repository snapshots that no longer reflect current coding practices or APIs.

✓ Timestamp collections and re-mine periodically.

Selection Bias

Only mining popular repos (high stars) or English-only projects skews the dataset toward specific demographics.

✓ Document selection criteria; include diverse sources.

Ignoring Tests

Test files mixed with production source code introduce repetitive patterns and assertions into training data.

✓ Filter by file path: exclude **/test/**, *Test.java.

Auto-generated Code

Protobuf stubs, build outputs, and boilerplate generators inflate datasets with non-human code.

✓ Check for generation markers; filter by heuristics.

Missing Documentation

No record of filters, versions, or parameters used. Others cannot reproduce or verify results.

✓ Publish mining scripts, configs, and random seeds.
Module 1 · Slide29

From MSR to Model Training

This module built the foundation. Now let us connect MSR to what comes next: training probabilistic models that understand and predict code.

What We Built
Clean, tokenized,
deduplicated dataset
What's Next
Train probabilistic
models on this data
The Key Question
Can we compute
P(next_token | context)?
Preview — N-gram Language Models
In Module 2, we will use the datasets built in this module to train n-gram language models. These models estimate the probability of the next token given the previous n−1 tokens: P(t_i | t_(i−n+1), ..., t_(i−1)). This is the simplest form of code completion, and the foundation for understanding how modern LLMs work.
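The preview above can be made concrete with a tiny bigram (n = 2) counter; the two token sequences below are an illustrative stand-in for a real mined corpus.

```python
from collections import Counter, defaultdict

# Toy bigram model: estimate P(next | previous) from token-sequence counts.
# The corpus is illustrative, standing in for tokenized mined methods.
corpus = [
    ["int", "sum", "=", "a", "+", "b", ";"],
    ["int", "total", "=", "x", "+", "y", ";"],
]
bigrams = defaultdict(Counter)
for seq in corpus:
    for prev, nxt in zip(seq, seq[1:]):
        bigrams[prev][nxt] += 1

def prob(nxt, prev):
    """Maximum-likelihood estimate of P(nxt | prev)."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

print(prob("sum", "int"))   # 0.5: "int" is followed by "sum" half the time
```

Module 2 builds on exactly this counting idea, adding longer contexts and smoothing for unseen token pairs.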
The Pipeline So Far
Repos → Filter → Extract → Preprocess → Tokenize (Lexer or BPE) → Deduplicate → Split → Train Model (Module 2)
Module 1 · Slide30

Try It Yourself: Mini MSR Exercise

Put your knowledge into practice with this hands-on mini-assignment. You will use this dataset in the next module.

Assignment
1. Clone 3 Java repositories from GitHub (pick repos with 50+ stars).
2. Extract all methods using a parser (JavaParser or javalang).
3. Count the vocabulary (unique tokens after lexer tokenization).
4. Compute basic statistics: average method length (in tokens), most common tokens, number of duplicates.
5. Save the cleaned methods as a JSONL file (one method per line).
# Skeleton: mini_msr.py
import javalang, json, os, hashlib
from collections import Counter

repos = ["user/repo1", "user/repo2", "user/repo3"]

def clone_repos(repos):
    for r in repos:
        os.system(f"git clone https://github.com/{r}.git")

def extract_methods(java_file):
    # Parse with javalang, yield method bodies
    ...

def tokenize_method(code):
    # Use javalang.tokenizer to get tokens
    ...

def deduplicate(methods):
    seen = set()
    for m in methods:
        h = hashlib.sha256(m.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            yield m

# Main pipeline: clone -> extract -> tokenize -> dedup -> stats -> save
Deliverables
A JSONL file with your cleaned methods and a short report: how many repos, how many methods extracted, vocabulary size, avg method length, and how many duplicates you removed. Bring this dataset to the next class — we will use it to build n-gram models.
Module 1 · Slide31

Key Takeaways

01 · Foundation

Data Is the Foundation

AI-driven SE starts with data. Mining software repositories provides the raw material for every downstream model.

02 · Repo Types

Source, Bug, Communication

Each repository type contributes different artifacts. Version control history is a rich dataset of commits, diffs, and branches.

03 · Preprocessing

Quality Over Quantity

Deduplication (hashing, MinHash+LSH), outlier removal, boilerplate filtering, and sanity checks are essential. Garbage in, garbage out.

04 · Tokenization

Lexer Tokens and BPE

Lexer-based tokenization is language-aware; BPE is language-agnostic and used by modern LLMs. ASTs add structural understanding beyond flat tokens.

05 · Ethics & Law

Licensing & Privacy

Public code is not free to use for any purpose. Respect licenses, scrub PII, document provenance, and ensure reproducibility.

06 · Avoid Pitfalls

Leakage, Bias, Staleness

Project-based splits prevent leakage. Diverse selection avoids bias. Timestamped collections prevent staleness. Document everything.

07 · Tools

Mine at Scale

GitHub APIs, SEART-GHS, PyGitHub, javalang, Pygments, tree-sitter — a rich ecosystem exists for every pipeline stage.

08 · The Big Picture

MSR → Train → Deploy

This pipeline feeds directly into n-gram models, deep learning architectures, and pre-trained transformers for code.

Next Steps
Now that you know how to collect and prepare data from software repositories, the next module explores Probabilistic Source Code Modeling — how to model the statistical properties of code using n-grams.
Module 1 · Slide32

Knowledge Check

Test your understanding of the MSR pipeline. Click "Reveal Answer" to check your reasoning.

Which repository type would you mine to collect bug-fix pairs for training an automated program repair model?
Bug repositories (issue trackers linked to commits). You need bug reports linked to the commits that fix them — giving you a buggy → fixed code pair. Source repositories alone lack the structured bug metadata.
Why is project-based splitting preferred over random splitting?
Prevents data leakage. Methods from the same project share coding style, API usage patterns, and variable naming conventions. If they appear in both train and test, the model may appear to generalize when it is actually relying on project-specific patterns it memorized.
A dataset has 2M Java methods. After deduplication, 600K are removed. What does this suggest?
High code duplication on GitHub (~30% duplicates). This is consistent with research findings. Without removal, the model would memorize these repeated patterns rather than learning to generalize, and evaluation metrics would be inflated by test examples the model has already seen during training.
Module Complete
You have covered the full MSR pipeline: from repository selection to data extraction, deduplication, preprocessing, tokenization, and splitting. Next up: Probabilistic Source Code Modeling — how to model the statistical properties of code using n-grams.
🎉

Module Complete!

You've finished Mining Software Repositories. Great work covering the full MSR pipeline!