Where do we find the data that feeds AI-driven software engineering? This module explores how to collect, clean, and prepare source-code data from public repositories — the critical first step before any deep learning model can learn.
100M+
Developers on GitHub
400M+
Public Repositories
3
Repo Types
∞
Artifacts to Mine
Goal
Build AI systems that support developers in one or more SE-related tasks — by leveraging data available in software repositories.
Module 1 · Slide2
What Is a Repository?
A repository (repo) is a centralized digital storage space where developers make and manage changes to an application's source code and other artifacts. It lets developers track code changes, edit files simultaneously, and collaborate efficiently from any location.
Version Control
A versioning system manages changes to configuration items (artifacts). It tracks what was changed, who changed it, when, and why. It enables retrieving specific revisions and managing branches.
📝
Commit
A snapshot of changes to one or more files, with a message describing what changed and why.
🌿
Branch
An independent line of development allowing parallel work on features, bug fixes, or experiments.
🔀
Merge / Pull Request
Integrating changes from one branch into another, often through a reviewed pull request.
🏷️
Tag / Release
A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
Module 1 · Slide3
Types of Repositories
Software projects generate artifacts across three main repository types. Click each tab to explore what kind of data lives inside.
📂 Source Repositories
🐛 Bug Repositories
💬 Communication Repos
Key Insight
MSR (Mining Software Repositories) leverages data available in these repositories to aid development activities. Our overarching goal is to build AI systems that can support developers in one or more SE-related tasks.
Module 1 · Slide4
Version Control Refresher
MSR depends on version control history. Here is a quick Git refresher of the concepts that become data points for AI models.
📸
Commit
A snapshot of changes to one or more files. Each commit has a unique SHA hash and records who, when, and why.
Think of it as a save point in a video game.
🌳
Branch
A parallel line of development. Developers create branches to work on features or fixes without disturbing the main codebase.
🔀
Merge / Pull Request
Combining changes from one branch into another. Pull requests add a code-review layer before merging.
🏷️
Tag
A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
📋
Diff
Shows the exact changes between two versions of a file: lines added, removed, or modified. Diffs are central to code review and change-aware AI models.
Key Insight
Every commit is a data point. MSR treats version history as a rich dataset for training AI models — commit messages become natural language labels, diffs become input features, and branches capture development workflows.
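To make this concrete, here is a minimal sketch of turning version history into structured records. It assumes log lines in the pipe-delimited shape produced by `git log --pretty=format:"%H|%an|%ad|%s"`; the sample data is invented for illustration.

```python
# Sketch: turn one-line-per-commit `git log` output into mining records.
# Assumed input format: git log --pretty=format:"%H|%an|%ad|%s"
def parse_git_log(log_text):
    """Parse pipe-delimited git log lines into commit dicts."""
    commits = []
    for line in log_text.strip().splitlines():
        # maxsplit=3 keeps any '|' inside the commit message intact
        sha, author, date, message = line.split("|", 3)
        commits.append({"sha": sha, "author": author,
                        "date": date, "message": message})
    return commits

sample = (
    "a1b2c3|Alice|2024-01-05|Fix NPE in parser\n"
    "d4e5f6|Bob|2024-01-04|Add login feature"
)
for c in parse_git_log(sample):
    print(c["sha"], "-", c["message"])
```

Each resulting dict is one data point: the message is a natural-language label, and the sha lets you retrieve the corresponding diff.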
Module 1 · Slide5
Rule-based vs. Data-driven Approaches
AI systems don't have to be grounded in data-driven methods. Some systems encode expert knowledge directly as rules. But for modern SE automation, data-driven approaches (especially deep learning) dominate.
🧠 Rule-based (Expert Systems)
Hand-crafted IF/THEN rules
Domain experts encode knowledge
No training data needed
Brittle: fails on edge cases
Hard to scale and maintain
IF: the screen is blue
AND: there is an error message
THEN: it's all good, it's Windows
VS
🤖 Data-driven (Machine Learning)
Learn patterns from data automatically
Require large, high-quality datasets
Generalize to unseen examples
Scale with more data
State of the art for SE tasks
Modern Approach
Among data-driven techniques, deep learning models — particularly Large Language Models (LLMs) — are the most data-hungry: they require vast amounts of high-quality data to learn and generalize effectively.
Module 1 · Slide6
Why So Much Data?
Deep learning models, especially LLMs, learn statistical patterns from enormous corpora. More data means better generalization — but the data must be high quality.
The Data Hunger
A model trained on cat images needs millions of examples to distinguish breeds. Similarly, a code model needs millions of functions to learn patterns like variable naming, control flow, and idiomatic usage.
What Type of Data?
We mine data from publicly available GitHub repositories — both source code and natural language (comments, commit messages, issues, documentation).
GitHub Copilot Example
GitHub Copilot can generate code and natural language because it was trained on massive amounts of open-source code from GitHub repositories — learning both the structure of code and the intent expressed in comments and documentation.
GitHub Repos
→
Mine Data
→
Clean & Filter
→
Train Model
Module 1 · Slide7
Ensuring Data Quality
Not all data is useful. We need to ensure high-quality datasets through careful repository selection and rigorous preprocessing.
Step 1 — Select Good Repositories
Use repository quality as a proxy. Filters include: minimum stars, active maintenance (recent commits), non-fork status, proper licensing, and meaningful commit history.
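These filters can be sketched as a simple predicate over repository metadata. The field names and thresholds below are illustrative, not any specific API's schema.

```python
# Sketch: Step-1 quality filters over repository metadata.
# Field names ("stars", "is_fork", ...) and thresholds are illustrative.
def passes_quality_filters(repo, min_stars=100, min_commits=50):
    return (repo["stars"] >= min_stars
            and repo["commits"] >= min_commits
            and not repo["is_fork"]                    # non-fork only
            and repo["license"] in {"mit", "apache-2.0", "bsd-3-clause"})

repos = [
    {"name": "good/repo", "stars": 420, "commits": 900,
     "is_fork": False, "license": "mit"},
    {"name": "forked/repo", "stars": 5000, "commits": 3000,
     "is_fork": True, "license": "mit"},
    {"name": "unlicensed/repo", "stars": 300, "commits": 200,
     "is_fork": False, "license": None},
]
selected = [r["name"] for r in repos if passes_quality_filters(r)]
print(selected)  # only good/repo survives
```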
Step 2 — Enforce Quality Checks
Once data is collected, apply sanity checks: remove duplicates, filter by language, remove auto-generated code, check for encoding issues, and validate syntactic correctness.
Remember
Garbage in, garbage out. A model trained on low-quality data will produce low-quality predictions. Data curation is arguably the most important (and most underappreciated) step in the ML pipeline.
Module 1 · Slide8
Preprocessing Source Code
Raw code from repositories needs systematic cleaning before it can be used for training. Here are the essential preprocessing steps applied to each method/function.
Remove duplicates — Their presence can hinder the learning ability of the model by creating data leakage between training and test sets.
ASCII-only characters — Keep code that contains only standard ASCII characters to avoid encoding issues.
Remove outliers — methods that are extremely long or extremely short (e.g., single-line getters or 1000-line methods).
Remove boilerplate — Eliminate trivial code like getters, setters, and auto-generated constructors.
Strip comments — Clean the method by removing all inline and block comments.
Custom criteria — Remove code that doesn't fit project-specific criteria (e.g., drop methods with Cyclomatic Complexity < 5 when training on complex code).
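Several of the filters above can be sketched in a few lines of Python; the thresholds and the getter/setter heuristic here are illustrative choices, not a standard.

```python
import re

# Sketch of the preprocessing filters; thresholds are illustrative.
def strip_comments(code):
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)  # block comments
    return re.sub(r"//[^\n]*", "", code)               # line comments

def is_boilerplate(code):
    # Crude getter/setter heuristic: method name starts with get/set
    return bool(re.match(r"\s*(public\s+)?\w+\s+(get|set)[A-Z]", code))

def keep_method(code, min_tokens=5, max_tokens=500):
    if not code.isascii():                  # ASCII-only filter
        return False
    n = len(code.split())                   # crude token count
    return min_tokens <= n <= max_tokens    # outlier filter

methods = [
    "int getX() { return x; }  // trivial getter",
    "/* doc */ public int add(int a, int b) { return a + b; }",
    "int bad() { return \u00e9; }",         # contains non-ASCII char
]
clean = [strip_comments(m).strip()
         for m in methods if keep_method(m) and not is_boilerplate(m)]
print(clean)  # only the add() method survives
```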
Module 1 · Slide9
Interactive: Preprocessing Pipeline
Toggle each preprocessing step and watch the code get cleaned in real time.
Module 1 · Slide10
Tokenization
Tokenization is the process that breaks down text into smaller units (tokens) that can be analyzed separately. For code, this means converting raw source into a structured sequence of keywords, identifiers, operators, and literals.
Before Tokenization
public int addNumbers(int a, int b){ int sum=a+b; return sum; }
After Tokenization
public int addNumbers ( int a , int b ) { int sum = a + b ; return sum ; }
🔤
Lexer
Converts raw text into a sequence of tokens — the basic building blocks of the language (keywords, operators, literals, identifiers).
🌲
Parser
Takes the token sequence from the lexer and analyzes it to understand the structure and syntax of the code (builds an AST).
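A toy regex-based lexer reproduces the example above. This is a simplification — a real Java lexer also handles string literals, comments, Unicode identifiers, and multi-character operators.

```python
import re

# Toy lexer for a Java-like subset (illustrative, not a full Java lexer):
# identifiers/keywords, integer literals, '==', then single-char symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|[+\-*/=;,(){}]")

def lex(code):
    return TOKEN_RE.findall(code)

tokens = lex("public int addNumbers(int a, int b){ int sum=a+b; return sum; }")
print(tokens)
```

The output matches the "After Tokenization" sequence on this slide: `sum=a+b` becomes the five tokens `sum`, `=`, `a`, `+`, `b`.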
Module 1 · Slide11
Why Is Tokenization Needed?
Reducing Complexity
Tokenization divides text into smaller units, making it easier for the model to identify patterns and relationships. Instead of processing raw character streams, the model works with meaningful units.
Handling the Vocabulary
By dividing text into tokens, the model can create a numerical representation of the vocabulary, making it easier to process and understand. Each unique token gets an ID in the vocabulary.
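A toy vocabulary builder shows the token-to-ID mapping in a few lines:

```python
# Sketch: map each unique token to an integer ID (a toy vocabulary).
def build_vocab(token_sequences):
    vocab = {}
    for seq in token_sequences:
        for tok in seq:
            if tok not in vocab:
                vocab[tok] = len(vocab)   # next free ID
    return vocab

seqs = [["int", "a", "=", "b", "+", "c", ";"],
        ["int", "b", "=", "a", ";"]]
vocab = build_vocab(seqs)
ids = [vocab[t] for t in seqs[1]]        # encode the second sequence
print(vocab["int"], ids)
```

The model then operates on ID sequences rather than raw strings.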
Tokenization Tools
code-tokenize
Python · Multi-language
Works for several programming languages. Provides tokenization tailored to code structure.
javalang
Python · Java only
Java-specific lexer and parser in Python. Produces fine-grained Java tokens.
JavaParser
Java · Java only
Full Java parser that builds a complete AST. Used for structural analysis.
Pygments
Python · Multi-language
Generic syntax-highlighting library built on per-language lexers. Works for many languages.
Module 1 · Slide12
Interactive: Code Tokenizer
Enter Java code below and click "Tokenize" to see it broken into classified tokens.
keyword
type
identifier
literal
operator
separator
Module 1 · Slide13
Beyond Lexer Tokens: Subword Tokenization
Lexer-based tokenization is great for analysis, but modern LLMs use a different approach: BPE (Byte Pair Encoding). Understanding both is essential.
Lexer-based Tokenization
Language-aware — knows Java keywords, operators, types
Fixed vocabulary per language (keyword set + identifiers)
Whole identifiers as single tokens
Example: getMaxValue → [getMaxValue] (1 token)
Subword / BPE Tokenization
Language-agnostic — learns frequent character sequences
Learned vocabulary from training data (32K–100K tokens)
Splits identifiers into common subwords
Example: getMaxValue → [get, Max, Value] (3 tokens)
Why BPE Wins for LLMs
Modern LLMs (GPT, CodeLlama, StarCoder) use BPE because it handles any vocabulary — including unseen identifiers, mixed languages, and even natural language comments — with a fixed-size token dictionary. No out-of-vocabulary problem.
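A toy implementation of BPE's training loop makes the idea tangible: repeatedly merge the most frequent adjacent symbol pair. The five-identifier corpus and merge count below are invented for illustration; real tokenizers (GPT-2's BPE, SentencePiece) add byte-level handling and many more merges.

```python
from collections import Counter

# Toy BPE training: repeatedly merge the most frequent adjacent pair.
def train_bpe(words, num_merges):
    seqs = [list(w) for w in words]       # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        for s in seqs:                    # apply the merge everywhere
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

words = ["getMax", "getMin", "getValue", "setMax", "setValue"]
merges, seqs = train_bpe(words, 6)
print(merges)  # frequent subwords like "get" emerge from the corpus
```

Even on this tiny corpus, `get` is learned as a subword — exactly why `getMaxValue` splits into familiar pieces instead of being out-of-vocabulary.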
50K
GPT-2 Vocab Size
32K
CodeLlama Vocab
49K
StarCoder Vocab
100K
GPT-4 Vocab Size
Module 1 · Slide14
Interactive: BPE vs Lexer Tokenizer
Type Java code below to see side-by-side comparison: lexer tokens (colored by type) vs simulated BPE tokens (showing how identifiers get split).
Java Lexer Tokens
BPE (GPT-style) Tokens
Lexer tokens: 0 · BPE tokens: 0
Module 1 · Slide15
Abstract Syntax Trees (ASTs)
While tokenization produces a flat sequence, an AST captures the hierarchical structure of code. ASTs are used for code understanding tasks and semantic analysis.
public int add(int a, int b) {
    return a + b;
}
Key Points
1. ASTs capture program structure, not just tokens
2. Used for code understanding and semantic analysis
3. Enable tree-based neural models (Tree-LSTM, code2seq)
4. Built by parsers like JavaParser, tree-sitter
While n-gram models see code as flat text, ASTs preserve the hierarchical structure that makes code different from natural language. This distinction becomes important when we discuss code embeddings and transformer architectures in later modules.
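As a quick hands-on illustration, Python's stdlib `ast` module (standing in here for a Java parser like JavaParser or tree-sitter) shows how flat source becomes a hierarchy:

```python
import ast

# Parse a tiny function and walk its tree:
# Module -> FunctionDef -> Return -> BinOp(Add)
tree = ast.parse("def add(a, b):\n    return a + b")
func = tree.body[0]
print(type(func).__name__)             # the function definition node
binop = func.body[0].value             # the expression inside `return`
print(type(binop).__name__, type(binop.op).__name__)
```

The same `a + b` that a lexer sees as three flat tokens is, in the AST, a single `BinOp` node whose operator and operands are explicit structure.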
Module 1 · Slide16
Tools for Mining at Scale
Mining millions of repositories by hand isn't practical. Specialized APIs and search platforms make large-scale dataset construction possible.
GitHub REST & GraphQL API
Programmatic access to repository metadata, commits, issues, PRs, file contents, and more. Rate-limited (5,000 req/hour with auth). Libraries: PyGitHub (Python), Octokit (JS), go-github (Go).
# PyGitHub example
from github import Github

g = Github("access_token")
for repo in g.search_repositories(query="language:java stars:>100"):
    print(repo.full_name, repo.stargazers_count)
SEART-GHS
SEART GitHub Search (GHS) — a search engine for GitHub repositories maintained by the SEART research group at USI. Provides advanced filtering by language, stars, commits, contributors, license, and more.
Best Practices
1. Always respect rate limits and API terms
2. Cache responses to avoid redundant requests
3. Use bulk exports (GH Archive, GHTorrent) for historical data
4. Verify license compatibility for your use case
Module 1 · Slide17
The MSR Pipeline: End to End
From raw repositories to a clean, tokenized dataset ready for model training — here's the complete pipeline.
Select Repos Stars, activity, license
→
Clone & Extract Methods, classes, files
→
Preprocess Dedup, filter, clean
→
Tokenize Lexer → token sequence
→
Dataset Train / Val / Test
Stage
Input
Output
Key Concern
Repository Selection
GitHub / SEART-GHS
Repo list (URLs)
Quality proxy (stars, activity)
Data Extraction
Repo list
Raw methods / files
Language filtering, scope
Deduplication
Raw methods
Unique methods
Data leakage prevention
Preprocessing
Unique methods
Clean methods
Outliers, boilerplate, encoding
Tokenization
Clean methods
Token sequences
Vocabulary size, OOV handling
Split
Token sequences
Train / Val / Test
No overlap between splits
Module 1 · Slide18
Code as Data: What Makes It Special?
Source code is unlike natural language text in several important ways. Understanding these differences shapes how we build datasets and models.
Formal Syntax
Code must compile or parse. A single misplaced semicolon breaks everything. This rigid structure is both a constraint and an advantage for learning.
Executable Semantics
Code has deterministic meaning: we can run it, test it, and verify outputs. This enables automatic labeling and evaluation unlike natural language.
Multi-level Representation
The same code can be viewed as characters, tokens, AST nodes, control-flow graphs, or data-flow graphs. Each level reveals different patterns.
Bimodal Nature
Repositories contain both code and natural language (comments, docs, commit messages). Models can learn the mapping between intent and implementation.
Key Implication
Because code is formal, executable, and multi-level, MSR datasets can be richer than typical NLP corpora. We can extract not just text, but structure (ASTs), behavior (tests), and evolution (diffs) from the same repository.
Module 1 · Slide19
Real-World MSR Datasets
Researchers have curated benchmark datasets from mined repositories. These standardized datasets enable reproducible experiments and fair comparisons across techniques.
Dataset
Language(s)
Size
Primary Task
Key Paper
CodeSearchNet
6 languages
2M code-NL pairs
Code search & retrieval
Husain et al., 2019
Defects4J
Java
835 real bugs
APR & testing
Just et al., 2014
BigCloneBench
Java
8M clone pairs
Clone detection
Svajlenko et al., 2014
The Stack
300+ languages
6 TB
Pre-training code LLMs
Kocetkov et al., 2022
Methods2Test
Java
780K focal-test pairs
Test generation
Tufano et al., 2022
Key Insight
These curated datasets are the bridge between raw repositories and reproducible research. Without standardized benchmarks, it would be impossible to compare different approaches fairly.
Module 1 · Slide20
Dataset Licensing & Legal Considerations
Public code is not necessarily free to use for any purpose. Ethical and legal issues are critical when building MSR datasets.
⚖️ License Compliance
MIT — permissive, almost no restrictions.
Apache 2.0 — permissive with patent grants.
GPL — copyleft, derivatives must also be GPL.
Always check if a license permits use as training data.
⚠️ The Copilot Controversy
GitHub Copilot trained on public repos regardless of license, sparking a class-action lawsuit. Developers argued their copyleft code was used without respecting license terms.
🔒 Privacy in Code
Repositories often contain PII in comments (names, emails), hardcoded API keys, database credentials, and internal URLs. These must be scrubbed from training data.
🤝 Responsible Collection
Respect robots.txt and API rate limits. Provide attribution when possible. Consider opt-out mechanisms for developers who do not want their code used for training.
Remember
Just because code is public does not mean it is free to use for any purpose. Always verify license compatibility, scrub sensitive data, and respect the intent of open-source contributors.
Module 1 · Slide21
Data Provenance & Reproducibility
Tracking where data comes from and how it was processed is essential for scientific rigor and reproducibility in MSR research.
Record source repos — Store full URLs, commit hashes, and timestamps for every repository you mine. This lets others verify and replicate your dataset.
Version your filters — Document every inclusion/exclusion criterion (min stars, language, file patterns). Even small filter changes can drastically alter results.
Timestamp your collection — Repositories change constantly. Code added, deleted, or relicensed after your snapshot may differ from what you collected.
Share your pipeline — Publish your mining scripts, configuration files, and random seeds. A dataset without reproducible construction is scientifically weak.
Use standardized formats — Store datasets in well-known formats like JSONL (one JSON object per line) or Parquet (columnar, compressed, fast). Include metadata fields.
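Writing and reading JSONL needs only the stdlib `json` module. The record fields below are illustrative; a `StringIO` buffer stands in for a real file to keep the sketch self-contained.

```python
import io
import json

# Sketch: store mined methods as JSONL with provenance metadata fields.
records = [
    {"repo": "user/repo1", "sha": "a1b2c3",
     "method": "int add(int a, int b) { return a + b; }"},
    {"repo": "user/repo2", "sha": "d4e5f6",
     "method": "void log() { }"},
]

buf = io.StringIO()                 # stands in for open("data.jsonl", "w")
for rec in records:
    buf.write(json.dumps(rec) + "\n")   # one JSON object per line

# Reading it back line by line:
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
print(len(loaded), loaded[0]["repo"])
```

Because each line is an independent object, JSONL files can be streamed, split, and appended to without parsing the whole file.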
Golden Rule
A dataset without provenance is scientifically useless. Always document HOW you built it, so others can reproduce, verify, and extend your work.
Module 1 · Slide22
Interactive: Repository Filter Simulator
Configure repository selection criteria and watch how quickly the pool of usable repositories shrinks. Every filter trades quantity for quality.
Minimum Stars
0
0 · 10 · 50 · 100 · 500
Minimum Commits
1
1 · 10 · 50 · 100
Language Filter
Additional Filters
~400M
All 400M+ public GitHub repositories are included. Start adjusting filters to see the funnel effect.
Module 1 · Slide23
Deduplication: Why & How
Duplicate code inflates datasets and causes data leakage. Removing duplicates is essential for training models that generalize rather than memorize.
Exact Duplicates
Hash-based detection (MD5 or SHA-256): compute a hash for each file or method. Identical hashes mean identical content. Fast and simple — catches the same file copied across multiple repos.
Near-Duplicates
Jaccard similarity on token sets measures overlap. For scalability, use MinHash + Locality-Sensitive Hashing (LSH) to find near-duplicates across millions of files without pairwise comparison.
Cross-Split Leakage
The same function appearing in both training and test sets invalidates evaluation. The model appears to generalize but is actually recalling memorized examples.
These two snippets are near-duplicates — same logic, renamed variables:
Version A
public int calculateSum(int x, int y) {
    int result = x + y;
    return result;
}
Version B
public int addNumbers(int a, int b) {
    int sum = a + b;
    return sum;
}
Research Finding
Studies show 10–30% of GitHub code is duplicated. Without deduplication, models memorize rather than generalize — leading to inflated performance metrics.
Module 1 · Slide24
Interactive: Duplicate Detector
Paste two code snippets below to compute similarity. See how near-duplicates can fool exact matching.
Snippet A
Snippet B
Exact match:
0%
Token Jaccard:
0%
70%
Verdict
Adjust the snippets above to see similarity analysis.
Module 1 · Slide25
Hashing for Deduplication
Deduplication at scale relies on hashing. Here are the key concepts and how they fit into the MSR pipeline.
What Is a Hash?
A deterministic function that maps any input to a fixed-size output (digest). Same input always produces the same hash. Even a 1-character change produces a completely different hash.
MD5 / SHA-256
Collision-resistant, fast to compute. Used for exact duplicate detection: hash each file, group by hash, keep one per group. SHA-256 is preferred for security.
MinHash
Approximates Jaccard similarity efficiently. Instead of comparing all token sets pairwise, MinHash creates compact signatures that can be compared in O(1).
LSH (Locality-Sensitive Hashing)
Groups similar items into the same buckets with high probability. Combined with MinHash, it finds near-duplicates in sub-linear time across millions of files.
Type some code below to see a simulated hash update in real time:
Simulated SHA-256...
Pipeline Fit
Step 1: Use SHA-256 to remove exact duplicates (fast, O(n)). Step 2: Use MinHash+LSH to find near-duplicates (scalable, sub-quadratic). This two-phase approach handles datasets with millions of files.
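A compact sketch of the MinHash idea follows, using toy token sets and SHA-256 as a seeded hash family for determinism. Production systems use faster hash functions and add the LSH banding step to avoid comparing all pairs.

```python
import hashlib

# Minimal MinHash sketch: estimate Jaccard similarity from k per-seed
# minimum hashes instead of intersecting full token sets.
def h(seed, token):
    return hashlib.sha256(f"{seed}:{token}".encode()).hexdigest()

def minhash_signature(tokens, k=200):
    # for each seed, keep the minimum hash over all tokens
    return [min(h(seed, t) for t in tokens) for seed in range(k)]

def estimate_jaccard(sig_a, sig_b):
    # fraction of positions where the two signatures agree
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"public", "int", "sum", "return", "+", "=", ";"}
b = {"public", "int", "total", "return", "+", "=", ";"}
true_j = len(a & b) / len(a | b)          # 6/8 = 0.75
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(true_j, round(est, 2))
```

Two signatures of length k compare in time proportional to k, regardless of how large the original token sets are — that is what makes the approach scale.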
Module 1 · Slide26
Data Splitting Strategies
How you split data into train, validation, and test sets matters as much as the data itself. The wrong strategy can silently invalidate your results.
Random Split
Simplest approach: shuffle all methods and split. Fast but risky — methods from the same project can appear in both train and test sets, causing data leakage.
Project-Based Split
All methods from one project go into the same split. Prevents cross-project leakage since methods in the same project share style, APIs, and patterns.
Temporal Split
Train on older commits, test on newer ones. Simulates real-world deployment where the model must predict code it has never seen from the future.
Typical Ratios
Common splits: 80/10/10 or 70/15/15 (train / validation / test). Larger test sets give more reliable evaluation estimates.
10 projects split into Train / Val / Test under each strategy:
RANDOM SPLIT (leakage risk)
P1 P2 P3* P5 P6 P7 P8 P9*
P3* P4
P9* P10
* P3 and P9 appear in multiple splits = leakage
PROJECT-BASED SPLIT (recommended)
P1 P2 P3 P4 P5 P6 P7 P8
P9
P10
Each project appears in exactly one split
TEMPORAL SPLIT (real-world simulation)
commits before 2022
2022–2023
2024+
Model never sees future code during training
Train
Validation
Test
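A project-based split can be implemented deterministically by hashing the project name into a bucket, so every method from a project lands in the same split. The ratios and names below are illustrative:

```python
import hashlib

# Sketch: deterministic project-based splitting via name hashing.
def split_for(project, train=80, val=10):
    bucket = int(hashlib.sha256(project.encode()).hexdigest(), 16) % 100
    if bucket < train:
        return "train"
    return "val" if bucket < train + val else "test"

methods = [("projA", "m1"), ("projA", "m2"), ("projB", "m3"), ("projC", "m4")]
splits = {}
for project, method in methods:
    splits.setdefault(split_for(project), []).append(method)
print(dict(sorted(splits.items())))
```

Because the assignment depends only on the project name, re-running the pipeline (or mining new methods from a known project) never moves code across splits.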
Module 1 · Slide27
Worked Example: From Repo to Dataset
Let us walk through a concrete example of building a dataset for Java code summarization — generating a natural-language description for a given method.
1
Define the task: Build a dataset of Java method – Javadoc comment pairs for code summarization.
—
2
Query SEART-GHS: Java repos with ≥100 stars and ≥50 commits.
3,847 repos
3
Clone & extract: Parse each repo with JavaParser, extract methods that have Javadoc comments.
2.1M raw pairs
4
Preprocess: Deduplicate and filter (outliers, boilerplate, encoding checks).
810K clean pairs
5
Tokenize: Use javalang to convert each method into a token sequence.
810K sequences
6
Project-based split: Ensure no project appears in multiple splits.
648K / 81K / 81K
The Funnel Effect
Starting from 3,847 repositories and 2.1 million raw pairs, preprocessing removes nearly 62% of the data. This is normal — quality always trumps quantity.
Module 1 · Slide28
Common Pitfalls in MSR Research
Avoid these frequent mistakes that can silently invalidate your MSR experiments and models.
Leaky Splits
Same project (or near-duplicate code) appears in both training and test sets, inflating evaluation metrics.
✓ Use project-based splitting + deduplication across splits.
Stale Data
Using outdated repository snapshots that no longer reflect current coding practices or APIs.
✓ Timestamp collections and re-mine periodically.
Selection Bias
Only mining popular repos (high stars) or English-only projects skews the dataset toward specific demographics.
✓ Document selection criteria; include diverse sources.
Ignoring Tests
Test files mixed with production source code introduce repetitive patterns and assertions into training data.
✓ Filter by file path: exclude **/test/**, *Test.java.
Auto-generated Code
Protobuf stubs, build outputs, and boilerplate generators inflate datasets with non-human code.
✓ Check for generation markers; filter by heuristics.
Missing Documentation
No record of filters, versions, or parameters used. Others cannot reproduce or verify results.
✓ Publish mining scripts, configs, and random seeds.
Module 1 · Slide29
From MSR to Model Training
This module built the foundation. Now let us connect MSR to what comes next: training probabilistic models that understand and predict code.
What We Built Clean, tokenized, deduplicated dataset
→
What's Next Train probabilistic models on this data
→
The Key Question Can we compute P(next_token | context)?
Preview — N-gram Language Models
In Module 2, we will use the datasets built in this module to train n-gram language models. These models estimate the probability of the next token given the previous n−1 tokens: P(t_n | t_1, ..., t_{n−1}). This is the simplest form of code completion — and the foundation for understanding how modern LLMs work.
Put your knowledge into practice with this hands-on mini-assignment. You will use this dataset in the next module.
Assignment
1. Clone 3 Java repositories from GitHub (pick repos with 50+ stars).
2. Extract all methods using a parser (JavaParser or javalang).
3. Count the vocabulary (unique tokens after lexer tokenization).
4. Compute basic statistics: average method length (in tokens), most common tokens, number of duplicates.
5. Save the cleaned methods as a JSONL file (one method per line).
# Skeleton: mini_msr.py
import javalang, json, os, hashlib
from collections import Counter

repos = ["user/repo1", "user/repo2", "user/repo3"]

def clone_repos(repos):
    for r in repos:
        os.system(f"git clone https://github.com/{r}.git")

def extract_methods(java_file):
    # Parse with javalang, yield method bodies
    ...

def tokenize_method(code):
    # Use javalang.tokenizer to get tokens
    ...

def deduplicate(methods):
    seen = set()
    for m in methods:
        h = hashlib.sha256(m.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            yield m

# Main pipeline: clone -> extract -> tokenize -> dedup -> stats -> save
Deliverables
A JSONL file with your cleaned methods and a short report: how many repos, how many methods extracted, vocabulary size, avg method length, and how many duplicates you removed. Bring this dataset to the next class — we will use it to build n-gram models.
Module 1 · Slide31
Key Takeaways
01 · Foundation
Data Is the Foundation
AI-driven SE starts with data. Mining software repositories provides the raw material for every downstream model.
02 · Repo Types
Source, Bug, Communication
Each repository type contributes different artifacts. Version control history is a rich dataset of commits, diffs, and branches.
03 · Preprocessing
Quality Over Quantity
Deduplication (hashing, MinHash+LSH), outlier removal, boilerplate filtering, and sanity checks are essential. Garbage in, garbage out.
04 · Tokenization
Lexer Tokens and BPE
Lexer-based tokenization is language-aware; BPE is language-agnostic and used by modern LLMs. ASTs add structural understanding beyond flat tokens.
05 · Ethics & Law
Licensing & Privacy
Public code is not free to use for any purpose. Respect licenses, scrub PII, document provenance, and ensure reproducibility.
06 · Tooling
A Rich Tool Ecosystem
GitHub APIs, SEART-GHS, PyGitHub, javalang, Pygments, tree-sitter — a rich ecosystem exists for every pipeline stage.
08 · The Big Picture
MSR → Train → Deploy
This pipeline feeds directly into n-gram models, deep learning architectures, and pre-trained transformers for code.
Next Steps
Now that you know how to collect and prepare data from software repositories, the next module explores Probabilistic Source Code Modeling — how to model the statistical properties of code using n-grams.
Module 1 · Slide32
Knowledge Check
Test your understanding of the MSR pipeline. Click "Reveal Answer" to check your reasoning.
Which repository type would you mine to collect bug-fix pairs for training an automated program repair model?
Bug repositories (issue trackers linked to commits). You need bug reports linked to the commits that fix them — giving you a buggy → fixed code pair. Source repositories alone lack the structured bug metadata.
Why is project-based splitting preferred over random splitting?
Prevents data leakage. Methods from the same project share coding style, API usage patterns, and variable naming conventions. If they appear in both train and test, the model may appear to generalize when it is actually relying on project-specific patterns it memorized.
A dataset has 2M Java methods. After deduplication, 600K are removed. What does this suggest?
High code duplication on GitHub (~30% duplicates). This is consistent with research findings. Without removal, the model would memorize these repeated patterns rather than learning to generalize, and evaluation metrics would be inflated by test examples the model has already seen during training.
Module Complete
You have covered the full MSR pipeline: from repository selection to data extraction, deduplication, preprocessing, tokenization, and splitting. Next up: Probabilistic Source Code Modeling — how to model the statistical properties of code using n-grams.
🎉
Module Complete!
You've finished Mining Software Repositories. Great work covering the full MSR pipeline!