Where do we find the data that feeds AI-driven software engineering? This module explores how to collect, clean, and prepare source-code data from public repositories — the critical first step before any deep learning model can learn.
100M+
Developers on GitHub
400M+
Public Repositories
3
Repo Types
∞
Artifacts to Mine
Goal
Build AI systems that support developers in one or more SE-related tasks — by leveraging data available in software repositories.
Module 1 · Slide2
What Is a Repository?
A repository (repo) is a centralized digital storage space where developers make and manage changes to an application's source code and other artifacts. It lets developers track code changes, edit files simultaneously, and collaborate efficiently from any location.
Version Control
A versioning system manages changes to configuration items (artifacts). It tracks what was changed, who changed it, when, and why. It enables retrieving specific revisions and managing branches.
📝
Commit
A snapshot of changes to one or more files, with a message describing what changed and why.
🌿
Branch
An independent line of development allowing parallel work on features, bug fixes, or experiments.
🔀
Merge / Pull Request
Integrating changes from one branch into another, often through a reviewed pull request.
🏷️
Tag / Release
A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
Module 1 · Slide3
Types of Repositories
Software projects generate artifacts across three main repository types. Click each tab to explore what kind of data lives inside.
📂 Source Repositories
🐛 Bug Repositories
💬 Communication Repos
Key Insight
MSR (Mining Software Repositories) leverages data available in these repositories to aid development activities. Our overarching goal is to build AI systems that can support developers in one or more SE-related tasks.
Module 1 · Slide4
Version Control Refresher
MSR depends on version control history. Here is a quick Git refresher of the concepts that become data points for AI models.
📸
Commit
A snapshot of changes to one or more files. Each commit has a unique SHA hash and records who, when, and why.
Think of it as a save point in a video game.
🌳
Branch
A parallel line of development. Developers create branches to work on features or fixes without disturbing the main codebase.
🔀
Merge / Pull Request
Combining changes from one branch into another. Pull requests add a code-review layer before merging.
🏷️
Tag
A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
📋
Diff
Shows the exact changes between two versions of a file: lines added, removed, or modified. Diffs are central to code review and change-aware AI models.
Key Insight
Every commit is a data point. MSR treats version history as a rich dataset for training AI models — commit messages become natural language labels, diffs become input features, and branches capture development workflows.
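To make this concrete, here is a minimal sketch of turning version history into structured records. It assumes log lines in the pipe-delimited shape produced by `git log --pretty=format:"%H|%an|%ad|%s"`; the sample data is invented for illustration.

```python
# Sketch: turn one-line-per-commit `git log` output into mining records.
# Assumed input format: git log --pretty=format:"%H|%an|%ad|%s"
def parse_git_log(log_text):
    """Parse pipe-delimited git log lines into commit dicts."""
    commits = []
    for line in log_text.strip().splitlines():
        # maxsplit=3 keeps any '|' inside the commit message intact
        sha, author, date, message = line.split("|", 3)
        commits.append({"sha": sha, "author": author,
                        "date": date, "message": message})
    return commits

sample = (
    "a1b2c3|Alice|2024-01-05|Fix NPE in parser\n"
    "d4e5f6|Bob|2024-01-04|Add login feature"
)
for c in parse_git_log(sample):
    print(c["sha"], "-", c["message"])
```

Each resulting dict is one data point: the message is a natural-language label, and the sha lets you retrieve the corresponding diff.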
Module 1 · Slide5
Rule-based vs. Data-driven Approaches
AI systems don't have to be grounded in data-driven methods. Some systems encode expert knowledge directly as rules. But for modern SE automation, data-driven approaches (especially deep learning) dominate.
🧠 Rule-based (Expert Systems)
Hand-crafted IF/THEN rules
Domain experts encode knowledge
No training data needed
Brittle: fails on edge cases
Hard to scale and maintain
IF: the screen is blue
AND: there is an error message
THEN: it's all good, it's Windows
VS
🤖 Data-driven (Machine Learning)
Learn patterns from data automatically
Require large, high-quality datasets
Generalize to unseen examples
Scale with more data
State of the art for SE tasks
Modern Approach
Among data-driven techniques, deep learning models — particularly Large Language Models (LLMs) — are the most data-hungry: they require vast amounts of high-quality data to learn and generalize effectively.
Module 1 · Slide6
Why So Much Data?
Deep learning models, especially LLMs, learn statistical patterns from enormous corpora. More data means better generalization — but the data must be high quality.
The Data Hunger
A model trained on cat images needs millions of examples to distinguish breeds. Similarly, a code model needs millions of functions to learn patterns like variable naming, control flow, and idiomatic usage.
What Type of Data?
We mine data from publicly available GitHub repositories — both source code and natural language (comments, commit messages, issues, documentation).
GitHub Copilot Example
GitHub Copilot can generate code and natural language because it was trained on massive amounts of open-source code from GitHub repositories — learning both the structure of code and the intent expressed in comments and documentation.
GitHub Repos
→
Mine Data
→
Clean & Filter
→
Train Model
Module 1 · Slide7
Ensuring Data Quality
Not all data is useful. We need to ensure high-quality datasets through careful repository selection and rigorous preprocessing.
Step 1 — Select Good Repositories
Use repository quality as a proxy. Filters include: minimum stars, active maintenance (recent commits), non-fork status, proper licensing, and meaningful commit history.
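These filters can be sketched as a simple predicate over repository metadata. The field names and thresholds below are illustrative, not any specific API's schema.

```python
# Sketch: Step-1 quality filters over repository metadata.
# Field names ("stars", "is_fork", ...) and thresholds are illustrative.
def passes_quality_filters(repo, min_stars=100, min_commits=50):
    return (repo["stars"] >= min_stars
            and repo["commits"] >= min_commits
            and not repo["is_fork"]                    # non-fork only
            and repo["license"] in {"mit", "apache-2.0", "bsd-3-clause"})

repos = [
    {"name": "good/repo", "stars": 420, "commits": 900,
     "is_fork": False, "license": "mit"},
    {"name": "forked/repo", "stars": 5000, "commits": 3000,
     "is_fork": True, "license": "mit"},
    {"name": "unlicensed/repo", "stars": 300, "commits": 200,
     "is_fork": False, "license": None},
]
selected = [r["name"] for r in repos if passes_quality_filters(r)]
print(selected)  # only good/repo survives
```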
Step 2 — Enforce Quality Checks
Once data is collected, apply sanity checks: remove duplicates, filter by language, remove auto-generated code, check for encoding issues, and validate syntactic correctness.
Remember
Garbage in, garbage out. A model trained on low-quality data will produce low-quality predictions. Data curation is arguably the most important (and most underappreciated) step in the ML pipeline.
Module 1 · Slide8
Preprocessing Source Code
Raw code from repositories needs systematic cleaning before it can be used for training. Here are the essential preprocessing steps applied to each method/function.
Remove duplicates — Their presence can hinder the learning ability of the model by creating data leakage between training and test sets.
ASCII-only characters — Keep code that contains only standard ASCII characters to avoid encoding issues.
Remove outliers — methods that are extremely long or extremely short (e.g., single-line getters or 1000-line methods).
Remove boilerplate — Eliminate trivial code like getters, setters, and auto-generated constructors.
Strip comments — Clean the method by removing all inline and block comments.
Custom criteria — Remove code that doesn't fit project-specific criteria (e.g., drop methods with Cyclomatic Complexity < 5 when training on complex code).
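Several of the filters above can be sketched in a few lines of Python; the thresholds and the getter/setter heuristic here are illustrative choices, not a standard.

```python
import re

# Sketch of the preprocessing filters; thresholds are illustrative.
def strip_comments(code):
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)  # block comments
    return re.sub(r"//[^\n]*", "", code)               # line comments

def is_boilerplate(code):
    # Crude getter/setter heuristic: method name starts with get/set
    return bool(re.match(r"\s*(public\s+)?\w+\s+(get|set)[A-Z]", code))

def keep_method(code, min_tokens=5, max_tokens=500):
    if not code.isascii():                  # ASCII-only filter
        return False
    n = len(code.split())                   # crude token count
    return min_tokens <= n <= max_tokens    # outlier filter

methods = [
    "int getX() { return x; }  // trivial getter",
    "/* doc */ public int add(int a, int b) { return a + b; }",
    "int bad() { return \u00e9; }",         # contains non-ASCII char
]
clean = [strip_comments(m).strip()
         for m in methods if keep_method(m) and not is_boilerplate(m)]
print(clean)  # only the add() method survives
```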
Module 1 · Slide9
Interactive: Preprocessing Pipeline
Toggle each preprocessing step and watch the code get cleaned in real time.
Module 1 · Slide10
Tokenization
Tokenization is the process that breaks down text into smaller units (tokens) that can be analyzed separately. For code, this means converting raw source into a structured sequence of keywords, identifiers, operators, and literals.
Before Tokenization
public int addNumbers(int a, int b){ int sum=a+b; return sum; }
After Tokenization
public int addNumbers ( int a , int b ) { int sum = a + b ; return sum ; }
🔤
Lexer
Converts raw text into a sequence of tokens — the basic building blocks of the language (keywords, operators, literals, identifiers).
🌲
Parser
Takes the token sequence from the lexer and analyzes it to understand the structure and syntax of the code (builds an AST).
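A toy regex-based lexer reproduces the example above. This is a simplification — a real Java lexer also handles string literals, comments, Unicode identifiers, and multi-character operators.

```python
import re

# Toy lexer for a Java-like subset (illustrative, not a full Java lexer):
# identifiers/keywords, integer literals, '==', then single-char symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|[+\-*/=;,(){}]")

def lex(code):
    return TOKEN_RE.findall(code)

tokens = lex("public int addNumbers(int a, int b){ int sum=a+b; return sum; }")
print(tokens)
```

The output matches the "After Tokenization" sequence on this slide: `sum=a+b` becomes the five tokens `sum`, `=`, `a`, `+`, `b`.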
Module 1 · Slide11
Why Is Tokenization Needed?
Reducing Complexity
Tokenization divides text into smaller units, making it easier for the model to identify patterns and relationships. Instead of processing raw character streams, the model works with meaningful units.
Handling the Vocabulary
By dividing text into tokens, the model can create a numerical representation of the vocabulary, making it easier to process and understand. Each unique token gets an ID in the vocabulary.
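A toy vocabulary builder shows the token-to-ID mapping in a few lines:

```python
# Sketch: map each unique token to an integer ID (a toy vocabulary).
def build_vocab(token_sequences):
    vocab = {}
    for seq in token_sequences:
        for tok in seq:
            if tok not in vocab:
                vocab[tok] = len(vocab)   # next free ID
    return vocab

seqs = [["int", "a", "=", "b", "+", "c", ";"],
        ["int", "b", "=", "a", ";"]]
vocab = build_vocab(seqs)
ids = [vocab[t] for t in seqs[1]]        # encode the second sequence
print(vocab["int"], ids)
```

The model then operates on ID sequences rather than raw strings.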
Tokenization Tools
code-tokenize
Python · Multi-language
Works for several programming languages. Provides tokenization tailored to code structure.
javalang
Python · Java only
Java-specific lexer and parser in Python. Produces fine-grained Java tokens.
JavaParser
Java · Java only
Full Java parser that builds a complete AST. Used for structural analysis.
Pygments
Python · Multi-language
Generic syntax-highlighting library built on per-language lexers. Works for many languages.
Module 1 · Slide12
Interactive: Code Tokenizer
Enter Java code below and click "Tokenize" to see it broken into classified tokens.
keyword
type
identifier
literal
operator
separator
Module 1 · Slide13
Beyond Lexer Tokens: Subword Tokenization
Lexer-based tokenization is great for analysis, but modern LLMs use a different approach: BPE (Byte Pair Encoding). Understanding both is essential.
Lexer-based Tokenization
Language-aware — knows Java keywords, operators, types
Fixed vocabulary per language (keyword set + identifiers)
Whole identifiers as single tokens
Example: getMaxValue → [getMaxValue] (1 token)
Subword / BPE Tokenization
Language-agnostic — learns frequent character sequences
Learned vocabulary from training data (32K–100K tokens)
Splits identifiers into common subwords
Example: getMaxValue → [get, Max, Value] (3 tokens)
Why BPE Wins for LLMs
Modern LLMs (GPT, CodeLlama, StarCoder) use BPE because it handles any vocabulary — including unseen identifiers, mixed languages, and even natural language comments — with a fixed-size token dictionary. No out-of-vocabulary problem.
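A toy implementation of BPE's training loop makes the idea tangible: repeatedly merge the most frequent adjacent symbol pair. The five-identifier corpus and merge count below are invented for illustration; real tokenizers (GPT-2's BPE, SentencePiece) add byte-level handling and many more merges.

```python
from collections import Counter

# Toy BPE training: repeatedly merge the most frequent adjacent pair.
def train_bpe(words, num_merges):
    seqs = [list(w) for w in words]       # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for s in seqs:                    # apply the merge everywhere
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

words = ["getMax", "getMin", "getValue", "setMax", "setValue"]
merges, seqs = train_bpe(words, 6)
print(merges)  # frequent subwords like "get" emerge from the corpus
```

Even on this tiny corpus, `get` is learned as a subword — exactly why `getMaxValue` splits into familiar pieces instead of being out-of-vocabulary.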
50K
GPT-2 Vocab Size
32K
CodeLlama Vocab
49K
StarCoder Vocab
100K
GPT-4 Vocab Size
Module 1 · Slide14
Interactive: BPE vs Lexer Tokenizer
Type Java code below to see side-by-side comparison: lexer tokens (colored by type) vs simulated BPE tokens (showing how identifiers get split).
Java Lexer Tokens
BPE (GPT-style) Tokens
Lexer tokens: 0 · BPE tokens: 0
Module 1 · Slide15
Abstract Syntax Trees (ASTs)
While tokenization produces a flat sequence, an AST captures the hierarchical structure of code. ASTs are used for code understanding tasks and semantic analysis.
public int add(int a, int b) {
    return a + b;
}
Key Points
1. ASTs capture program structure, not just tokens
2. Used for code understanding and semantic analysis
3. Enable tree-based neural models (Tree-LSTM, code2seq)
4. Built by parsers like JavaParser, tree-sitter
While n-gram models see code as flat text, ASTs preserve the hierarchical structure that makes code different from natural language. This distinction becomes important when we discuss code embeddings and transformer architectures in later modules.
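As a quick hands-on illustration, Python's stdlib `ast` module (standing in here for a Java parser like JavaParser or tree-sitter) shows how flat source becomes a hierarchy:

```python
import ast

# Parse a tiny function and walk its tree:
# Module -> FunctionDef -> Return -> BinOp(Add)
tree = ast.parse("def add(a, b):\n    return a + b")
func = tree.body[0]
print(type(func).__name__)             # the function definition node
binop = func.body[0].value             # the expression inside `return`
print(type(binop).__name__, type(binop.op).__name__)
```

The same `a + b` that a lexer sees as three flat tokens is, in the AST, a single `BinOp` node whose operator and operands are explicit structure.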
Module 1 · Slide16
Tools for Mining at Scale
Mining millions of repositories by hand isn't practical. Specialized APIs and search platforms make large-scale dataset construction possible.
GitHub REST & GraphQL API
Programmatic access to repository metadata, commits, issues, PRs, file contents, and more. Rate-limited (5,000 req/hour with auth). Libraries: PyGitHub (Python), Octokit (JS), go-github (Go).
# PyGitHub example
from github import Github

g = Github("access_token")
for repo in g.search_repositories(query="language:java stars:>100"):
    print(repo.full_name, repo.stargazers_count)
SEART-GHS
SEART GitHub Search (GHS) — a search engine for GitHub repositories maintained by the SEART research group at USI. Provides advanced filtering by language, stars, commits, contributors, license, and more.
Best Practices
1. Always respect rate limits and API terms
2. Cache responses to avoid redundant requests
3. Use bulk exports (GH Archive, GHTorrent) for historical data
4. Verify license compatibility for your use case
Module 1 · Slide17
The MSR Pipeline: End to End
From raw repositories to a clean, tokenized dataset ready for model training — here's the complete pipeline.
Select Repos Stars, activity, license
→
Clone & Extract Methods, classes, files
→
Preprocess Dedup, filter, clean
→
Tokenize Lexer → token sequence
→
Dataset Train / Val / Test
Stage
Input
Output
Key Concern
Repository Selection
GitHub / SEART-GHS
Repo list (URLs)
Quality proxy (stars, activity)
Data Extraction
Repo list
Raw methods / files
Language filtering, scope
Deduplication
Raw methods
Unique methods
Data leakage prevention
Preprocessing
Unique methods
Clean methods
Outliers, boilerplate, encoding
Tokenization
Clean methods
Token sequences
Vocabulary size, OOV handling
Split
Token sequences
Train / Val / Test
No overlap between splits
Module 1 · Slide18
Code as Data: What Makes It Special?
Source code is unlike natural language text in several important ways. Understanding these differences shapes how we build datasets and models.
Formal Syntax
Code must compile or parse. A single misplaced semicolon breaks everything. This rigid structure is both a constraint and an advantage for learning.
Executable Semantics
Code has deterministic meaning: we can run it, test it, and verify outputs. This enables automatic labeling and evaluation unlike natural language.
Multi-level Representation
The same code can be viewed as characters, tokens, AST nodes, control-flow graphs, or data-flow graphs. Each level reveals different patterns.
Bimodal Nature
Repositories contain both code and natural language (comments, docs, commit messages). Models can learn the mapping between intent and implementation.
Key Implication
Because code is formal, executable, and multi-level, MSR datasets can be richer than typical NLP corpora. We can extract not just text, but structure (ASTs), behavior (tests), and evolution (diffs) from the same repository.
Module 1 · Slide19
Real-World MSR Datasets
Researchers have curated benchmark datasets from mined repositories. These standardized datasets enable reproducible experiments and fair comparisons across techniques.
Dataset
Language(s)
Size
Primary Task
Key Paper
CodeSearchNet
6 languages
2M code-NL pairs
Code search & retrieval
Husain et al., 2019
Defects4J
Java
835 real bugs
APR & testing
Just et al., 2014
BigCloneBench
Java
8M clone pairs
Clone detection
Svajlenko et al., 2014
The Stack
300+ languages
6 TB
Pre-training code LLMs
Kocetkov et al., 2022
Methods2Test
Java
780K focal-test pairs
Test generation
Tufano et al., 2022
Key Insight
These curated datasets are the bridge between raw repositories and reproducible research. Without standardized benchmarks, it would be impossible to compare different approaches fairly.
Module 1 · Slide20
Dataset Licensing & Legal Considerations
Public code is not necessarily free to use for any purpose. Ethical and legal issues are critical when building MSR datasets.
⚖️ License Compliance
MIT — permissive, almost no restrictions.
Apache 2.0 — permissive with patent grants.
GPL — copyleft, derivatives must also be GPL.
Always check if a license permits use as training data.
⚠️ The Copilot Controversy
GitHub Copilot trained on public repos regardless of license, sparking a class-action lawsuit. Developers argued their copyleft code was used without respecting license terms.
🔒 Privacy in Code
Repositories often contain PII in comments (names, emails), hardcoded API keys, database credentials, and internal URLs. These must be scrubbed from training data.
🤝 Responsible Collection
Respect robots.txt and API rate limits. Provide attribution when possible. Consider opt-out mechanisms for developers who do not want their code used for training.
Remember
Just because code is public does not mean it is free to use for any purpose. Always verify license compatibility, scrub sensitive data, and respect the intent of open-source contributors.
Module 1 · Slide21
Data Provenance & Reproducibility
Tracking where data comes from and how it was processed is essential for scientific rigor and reproducibility in MSR research.
Record source repos — Store full URLs, commit hashes, and timestamps for every repository you mine. This lets others verify and replicate your dataset.
Version your filters — Document every inclusion/exclusion criterion (min stars, language, file patterns). Even small filter changes can drastically alter results.
Timestamp your collection — Repositories change constantly. Code added, deleted, or relicensed after your snapshot may differ from what you collected.
Share your pipeline — Publish your mining scripts, configuration files, and random seeds. A dataset without reproducible construction is scientifically weak.
Use standardized formats — Store datasets in well-known formats like JSONL (one JSON object per line) or Parquet (columnar, compressed, fast). Include metadata fields.
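Writing and reading JSONL needs only the stdlib `json` module. The record fields below are illustrative; a `StringIO` buffer stands in for a real file to keep the sketch self-contained.

```python
import io
import json

# Sketch: store mined methods as JSONL with provenance metadata fields.
records = [
    {"repo": "user/repo1", "sha": "a1b2c3",
     "method": "int add(int a, int b) { return a + b; }"},
    {"repo": "user/repo2", "sha": "d4e5f6",
     "method": "void log() { }"},
]

buf = io.StringIO()                 # stands in for open("data.jsonl", "w")
for rec in records:
    buf.write(json.dumps(rec) + "\n")   # one JSON object per line

# Reading it back line by line:
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
print(len(loaded), loaded[0]["repo"])
```

Because each line is an independent object, JSONL files can be streamed, split, and appended to without parsing the whole file.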
Golden Rule
A dataset without provenance is scientifically useless. Always document HOW you built it, so others can reproduce, verify, and extend your work.
Module 1 · Slide22
Interactive: Repository Filter Simulator
Configure repository selection criteria and watch how quickly the pool of usable repositories shrinks. Every filter trades quantity for quality.
Minimum Stars
0
0 · 10 · 50 · 100 · 500
Minimum Commits
1
1 · 10 · 50 · 100
Language Filter
Additional Filters
~400M
All 400M+ public GitHub repositories are included. Start adjusting filters to see the funnel effect.
Module 1 · Slide23
Deduplication: Why & How
Duplicate code inflates datasets and causes data leakage. Removing duplicates is essential for training models that generalize rather than memorize.
Exact Duplicates
Hash-based detection (MD5 or SHA-256): compute a hash for each file or method. Identical hashes mean identical content. Fast and simple — catches the same file copied across multiple repos.
Near-Duplicates
Jaccard similarity on token sets measures overlap. For scalability, use MinHash + Locality-Sensitive Hashing (LSH) to find near-duplicates across millions of files without pairwise comparison.
Cross-Split Leakage
The same function appearing in both training and test sets invalidates evaluation. The model appears to generalize but is actually recalling memorized examples.
These two snippets are near-duplicates — same logic, renamed variables:
Version A
public int calculateSum(int x, int y) {
    int result = x + y;
    return result;
}
Version B
public int addNumbers(int a, int b) {
    int sum = a + b;
    return sum;
}
Research Finding
Studies show 10–30% of GitHub code is duplicated. Without deduplication, models memorize rather than generalize — leading to inflated performance metrics.
Module 1 · Slide24
Interactive: Duplicate Detector
Paste two code snippets below to compute similarity. See how near-duplicates can fool exact matching.
Snippet A
Snippet B
Exact match:
0%
Token Jaccard:
0%
70%
Verdict
Adjust the snippets above to see similarity analysis.
Module 1 · Slide25
Hashing for Deduplication
Deduplication at scale relies on hashing. Here are the key concepts and how they fit into the MSR pipeline.
What Is a Hash?
A deterministic function that maps any input to a fixed-size output (digest). Same input always produces the same hash. Even a 1-character change produces a completely different hash.
MD5 / SHA-256
Collision-resistant, fast to compute. Used for exact duplicate detection: hash each file, group by hash, keep one per group. SHA-256 is preferred for security.
MinHash
Approximates Jaccard similarity efficiently. Instead of comparing all token sets pairwise, MinHash creates compact signatures that can be compared in O(1).
LSH (Locality-Sensitive Hashing)
Groups similar items into the same buckets with high probability. Combined with MinHash, it finds near-duplicates in sub-linear time across millions of files.
Type some code below to see a simulated hash update in real time:
Simulated SHA-256...
Pipeline Fit
Step 1: Use SHA-256 to remove exact duplicates (fast, O(n)). Step 2: Use MinHash+LSH to find near-duplicates (scalable, sub-quadratic). This two-phase approach handles datasets with millions of files.
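A compact sketch of the MinHash idea follows, using toy token sets and SHA-256 as a seeded hash family for determinism. Production systems use faster hash functions and add the LSH banding step to avoid comparing all pairs.

```python
import hashlib

# Minimal MinHash sketch: estimate Jaccard similarity from k per-seed
# minimum hashes instead of intersecting full token sets.
def h(seed, token):
    return hashlib.sha256(f"{seed}:{token}".encode()).hexdigest()

def minhash_signature(tokens, k=200):
    # for each seed, keep the minimum hash over all tokens
    return [min(h(seed, t) for t in tokens) for seed in range(k)]

def estimate_jaccard(sig_a, sig_b):
    # fraction of positions where the two signatures agree
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"public", "int", "sum", "return", "+", "=", ";"}
b = {"public", "int", "total", "return", "+", "=", ";"}
true_j = len(a & b) / len(a | b)          # 6/8 = 0.75
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(true_j, round(est, 2))
```

Two signatures of length k compare in time proportional to k, regardless of how large the original token sets are — that is what makes the approach scale.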
Module 1 · Slide26
Data Splitting Strategies
How you split data into train, validation, and test sets matters as much as the data itself. The wrong strategy can silently invalidate your results.
Random Split
Simplest approach: shuffle all methods and split. Fast but risky — methods from the same project can appear in both train and test sets, causing data leakage.
Project-Based Split
All methods from one project go into the same split. Prevents cross-project leakage since methods in the same project share style, APIs, and patterns.
Temporal Split
Train on older commits, test on newer ones. Simulates real-world deployment where the model must predict code it has never seen from the future.
Typical Ratios
Common splits: 80/10/10 or 70/15/15 (train / validation / test). Larger test sets give more reliable evaluation estimates.
10 projects split into Train / Val / Test under each strategy:
RANDOM SPLIT (leakage risk)
P1 P2 P3* P5 P6 P7 P8 P9*
P3* P4
P9* P10
* P3 and P9 appear in multiple splits = leakage
PROJECT-BASED SPLIT (recommended)
P1 P2 P3 P4 P5 P6 P7 P8
P9
P10
Each project appears in exactly one split
TEMPORAL SPLIT (real-world simulation)
commits before 2022
2022–2023
2024+
Model never sees future code during training
Train
Validation
Test
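A project-based split can be implemented deterministically by hashing the project name into a bucket, so every method from a project lands in the same split. The ratios and names below are illustrative:

```python
import hashlib

# Sketch: deterministic project-based splitting via name hashing.
def split_for(project, train=80, val=10):
    bucket = int(hashlib.sha256(project.encode()).hexdigest(), 16) % 100
    if bucket < train:
        return "train"
    return "val" if bucket < train + val else "test"

methods = [("projA", "m1"), ("projA", "m2"), ("projB", "m3"), ("projC", "m4")]
splits = {}
for project, method in methods:
    splits.setdefault(split_for(project), []).append(method)
print(dict(sorted(splits.items())))
```

Because the assignment depends only on the project name, re-running the pipeline (or mining new methods from a known project) never moves code across splits.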
Module 1 · Slide27
Worked Example: From Repo to Dataset
Let us walk through a concrete example of building a dataset for Java code summarization — generating a natural-language description for a given method.
1
Define the task: Build a dataset of Java method – Javadoc comment pairs for code summarization.
—
2
Query SEART-GHS: Java repos with ≥100 stars and ≥50 commits.
3,847 repos
3
Clone & extract: Parse each repo with JavaParser, extract methods that have Javadoc comments.
2.1M raw pairs
4
Preprocess: Deduplicate and filter (outliers, boilerplate, encoding checks).
810K clean pairs
5
Tokenize: Use javalang to convert each method into a token sequence.
810K sequences
6
Project-based split: Ensure no project appears in multiple splits.
648K / 81K / 81K
The Funnel Effect
Starting from 3,847 repositories and 2.1 million raw pairs, preprocessing removes nearly 62% of the data. This is normal — quality always trumps quantity.
Module 1 · Slide28
Common Pitfalls in MSR Research
Avoid these frequent mistakes that can silently invalidate your MSR experiments and models.
Leaky Splits
Same project (or near-duplicate code) appears in both training and test sets, inflating evaluation metrics.
✓ Use project-based splitting + deduplication across splits.
Stale Data
Using outdated repository snapshots that no longer reflect current coding practices or APIs.
✓ Timestamp collections and re-mine periodically.
Selection Bias
Only mining popular repos (high stars) or English-only projects skews the dataset toward specific demographics.
✓ Document selection criteria; include diverse sources.
Ignoring Tests
Test files mixed with production source code introduce repetitive patterns and assertions into training data.
✓ Filter by file path: exclude **/test/**, *Test.java.
Auto-generated Code
Protobuf stubs, build outputs, and boilerplate generators inflate datasets with non-human code.
✓ Check for generation markers; filter by heuristics.
Missing Documentation
No record of filters, versions, or parameters used. Others cannot reproduce or verify results.
✓ Publish mining scripts, configs, and random seeds.
Module 1 · Slide29
From MSR to Model Training
This module built the foundation. Now let us connect MSR to what comes next: training probabilistic models that understand and predict code.
What We Built Clean, tokenized, deduplicated dataset
→
What's Next Train probabilistic models on this data
→
The Key Question Can we compute P(next_token | context)?
Preview — N-gram Language Models
In Module 2, we will use the datasets built in this module to train n-gram language models. These models estimate the probability of the next token given the previous n−1 tokens: P(t_n | t_1, ..., t_{n−1}). This is the simplest form of code completion — and the foundation for understanding how modern LLMs work.
Put your knowledge into practice with this hands-on mini-assignment. You will use this dataset in the next module.
Assignment
1. Clone 3 Java repositories from GitHub (pick repos with 50+ stars).
2. Extract all methods using a parser (JavaParser or javalang).
3. Count the vocabulary (unique tokens after lexer tokenization).
4. Compute basic statistics: average method length (in tokens), most common tokens, number of duplicates.
5. Save the cleaned methods as a JSONL file (one method per line).
# Skeleton: mini_msr.py
import javalang, json, os, hashlib
from collections import Counter

repos = ["user/repo1", "user/repo2", "user/repo3"]

def clone_repos(repos):
    for r in repos:
        os.system(f"git clone https://github.com/{r}.git")

def extract_methods(java_file):
    # Parse with javalang, yield method bodies
    ...

def tokenize_method(code):
    # Use javalang.tokenizer to get tokens
    ...

def deduplicate(methods):
    seen = set()
    for m in methods:
        h = hashlib.sha256(m.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            yield m

# Main pipeline: clone -> extract -> tokenize -> dedup -> stats -> save
Deliverables
A JSONL file with your cleaned methods and a short report: how many repos, how many methods extracted, vocabulary size, avg method length, and how many duplicates you removed. Bring this dataset to the next class — we will use it to build n-gram models.
Module 1 · Slide31
Key Takeaways
01 · Foundation
Data Is the Foundation
AI-driven SE starts with data. Mining software repositories provides the raw material for every downstream model.
02 · Repo Types
Source, Bug, Communication
Each repository type contributes different artifacts. Version control history is a rich dataset of commits, diffs, and branches.
03 · Preprocessing
Quality Over Quantity
Deduplication (hashing, MinHash+LSH), outlier removal, boilerplate filtering, and sanity checks are essential. Garbage in, garbage out.
04 · Tokenization
Lexer Tokens and BPE
Lexer-based tokenization is language-aware; BPE is language-agnostic and used by modern LLMs. ASTs add structural understanding beyond flat tokens.
05 · Ethics & Law
Licensing & Privacy
Public code is not free to use for any purpose. Respect licenses, scrub PII, document provenance, and ensure reproducibility.
06 · Tooling
A Rich Tool Ecosystem
GitHub APIs, SEART-GHS, PyGitHub, javalang, Pygments, tree-sitter — a rich ecosystem exists for every pipeline stage.
08 · The Big Picture
MSR → Train → Deploy
This pipeline feeds directly into n-gram models, deep learning architectures, and pre-trained transformers for code.
Next Steps
Now that you know how to collect and prepare data from software repositories, the next module explores Probabilistic Source Code Modeling — how to model the statistical properties of code using n-grams.
Module 1 · Slide32
Knowledge Check
Test your understanding of the MSR pipeline. Click "Reveal Answer" to check your reasoning.
Which repository type would you mine to collect bug-fix pairs for training an automated program repair model?
Bug repositories (issue trackers linked to commits). You need bug reports linked to the commits that fix them — giving you a buggy → fixed code pair. Source repositories alone lack the structured bug metadata.
Why is project-based splitting preferred over random splitting?
Prevents data leakage. Methods from the same project share coding style, API usage patterns, and variable naming conventions. If they appear in both train and test, the model may appear to generalize when it is actually relying on project-specific patterns it memorized.
A dataset has 2M Java methods. After deduplication, 600K are removed. What does this suggest?
High code duplication on GitHub (~30% duplicates). This is consistent with research findings. Without removal, the model would memorize these repeated patterns rather than learning to generalize, and evaluation metrics would be inflated by test examples the model has already seen during training.
Module Complete
You have covered the full MSR pipeline: from repository selection to data extraction, deduplication, preprocessing, tokenization, and splitting. Next up: Probabilistic Source Code Modeling — how to model the statistical properties of code using n-grams.
🎉
Module Complete!
You've finished Mining Software Repositories. Great work covering the full MSR pipeline!