Mining Software Repositories
Where do we find the data that feeds AI-driven software engineering? Learn how to collect, clean, and prepare source-code data from public repositories — the critical first step before any deep learning model can learn.
What is a Software Repository?
A repository is a centralized digital store that developers use to make and manage changes to an application’s source code. Version control systems like Git track what was changed, who made the change, when, and why — enabling teams to collaborate efficiently and retrieve any previous version of their code.
For AI researchers, repositories are goldmines. Every commit, every bug report, every discussion thread is a potential data point for training models that understand and generate code.
Three Types of Repositories
Source Repositories
Store the complete history of source code changes. GitHub, GitLab, Bitbucket. Contains source code files, commit messages, diffs, branch history, pull request reviews, and configuration files.
Bug Repositories
Track defects, feature requests, and tasks. Jira, BugZilla, GitHub Issues. Contains bug reports, labels, priority metadata, discussion threads, status transitions, and links to fixing commits.
Communication Repositories
Capture developer discussions. Mailing lists, Slack, Stack Overflow, IRC logs. Contains Q&A threads, chat logs, meeting notes, design documents, and announcements.
Key Terms
Rule-Based vs. Data-Driven Approaches
Not all AI systems learn from data. Some encode expert knowledge directly as hand-crafted rules — IF/THEN patterns created by domain experts. These rule-based systems require no training data but are brittle, fail on edge cases, and are hard to scale.
Modern software engineering automation has moved decisively toward data-driven approaches, particularly deep learning. These models learn patterns automatically from large datasets, generalize to unseen examples, and improve with more data. The trade-off: they require vast amounts of high-quality training data.
Why So Much Data?
Deep learning models are fundamentally statistical pattern matchers. They learn by observing millions of examples and discovering regularities — correlations between inputs and outputs that generalize to unseen data. The more complex the task, the more examples the model needs to learn robust patterns rather than memorizing surface-level noise.
Consider an analogy from computer vision: a model trained to classify cat breeds needs millions of labeled photographs to distinguish a Maine Coon from a Norwegian Forest Cat. With only a few hundred images, it might latch onto background color or image resolution instead of actual feline features. The same principle applies to code. A model that learns to summarize Java methods needs millions of method–summary pairs to understand that return a + b; is an addition regardless of whether the variables are named x and y, left and right, or salary and bonus.
What Do We Mine?
Software repositories are uniquely rich because they contain both source code and natural language, tightly interleaved:
- Source code — method bodies, class definitions, configuration files
- Comments & documentation — Javadoc, docstrings, inline annotations
- Commit messages — concise descriptions of what changed and why
- Issue reports & discussions — bug descriptions, feature requests, design debates
- Code reviews — pull request comments explaining improvements, catching mistakes
Each of these artifacts provides a different view of developer intent. By mining all of them, we can train models that understand not just what code does, but why it was written that way.
Building a Dataset: Selecting Repositories
Garbage in, garbage out. A model trained on low-quality data will produce low-quality predictions. Data curation is arguably the most important — and most underappreciated — step in the ML pipeline.
Repository selection uses quality proxies to filter the hundreds of millions of public repos down to a manageable, high-quality subset:
- Minimum stars — popularity as a quality signal
- Active maintenance — recent commits indicate a living project
- Non-fork status — avoid counting duplicated repositories
- Proper licensing — ensure legal use for training
- Meaningful commit history — enough data to be useful
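These proxies map directly onto a metadata filter. A minimal sketch, assuming repository metadata in the shape the GitHub API returns (fields like stargazers_count, fork, license, pushed_at); the thresholds here are illustrative, not a standard:

```python
from datetime import datetime, timedelta, timezone

def passes_quality_filters(repo, min_stars=100, max_inactive_days=365):
    """Apply simple quality proxies to a repository metadata dict."""
    if repo.get("fork", False):  # non-fork status
        return False
    if repo.get("stargazers_count", 0) < min_stars:  # minimum stars
        return False
    if repo.get("license") is None:  # proper licensing
        return False
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    inactive = datetime.now(timezone.utc) - pushed  # active maintenance
    return inactive <= timedelta(days=max_inactive_days)
```

Checking "meaningful commit history" requires an extra API call per repository (the commits endpoint), which is why pipelines usually apply the cheap metadata filters first.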
Tools for Mining at Scale
Specialized APIs and search platforms make large-scale dataset construction possible: the GitHub REST & GraphQL API (rate-limited, 5,000 req/hour with auth), SEART-GHS (a search engine for GitHub repos with advanced filtering), and libraries like PyGitHub (Python), Octokit (JS), and go-github (Go).
Here’s how to query GitHub’s API for high-quality Java repositories, filtering by stars and excluding forks:
import requests

def fetch_top_java_repos(num_repos=200, per_page=100):
    """Collect popular, non-fork Java repositories via the GitHub Search API."""
    repos = []
    page = 1
    while len(repos) < num_repos:
        url = "https://api.github.com/search/repositories"
        params = {
            "q": "language:java stars:>1000",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        }
        response = requests.get(url, params=params)
        response.raise_for_status()
        items = response.json().get("items", [])
        if not items:  # no more results available
            break
        for item in items:
            if item.get("fork", False):  # skip forks to avoid duplicates
                continue
            repos.append({
                "full_name": item["full_name"],
                "clone_url": item["clone_url"],
                "stars": item["stargazers_count"],
            })
        page += 1
    return repos[:num_repos]
Once we have the repo list, we shallow-clone each one — --depth 1 grabs only the latest snapshot, saving time and disk space:
import subprocess

def clone_repo(clone_url, dest_dir):
    """Shallow-clone a repository; returns True on success."""
    cmd = ["git", "clone", "--depth", "1", "--quiet", clone_url, dest_dir]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    except subprocess.TimeoutExpired:
        return False  # treat hung clones as failures
    return result.returncode == 0
Data Quality Challenges
Selecting high-quality repositories is necessary but not sufficient. The raw code extracted from even the best projects contains numerous quality issues that can corrupt model training if left unaddressed.
Common Quality Issues
- Encoding problems — files with mixed encodings (UTF-8, Latin-1, Shift-JIS) produce garbled tokens. Non-ASCII identifiers in comments or string literals can break tokenizers that expect clean ASCII input.
- Auto-generated code — protobuf stubs, IDE-generated boilerplate, ORM mapping files, and build tool outputs inflate the dataset with repetitive, formulaic code that teaches the model nothing about human programming patterns.
- Test code vs. production code — unit tests follow very different patterns from production code (setup/teardown, assertions, mocking). Depending on the downstream task, you may want to separate or exclude test files entirely.
- Dead code and commented-out blocks — abandoned code paths and large commented-out sections add noise without contributing meaningful signal.
- Minified or obfuscated code — JavaScript bundles and obfuscated releases contain valid syntax but no readable structure.
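Many of these issues can be caught with cheap file-level heuristics before any parsing. The markers, path conventions, and thresholds below are illustrative, not a standard:

```python
import re

GENERATED_MARKERS = re.compile(r"@Generated|DO NOT EDIT|Generated by|protoc",
                               re.IGNORECASE)

def looks_problematic(path, text, max_line_len=1000):
    """Return a reason to exclude a file, or None if it looks fine."""
    if GENERATED_MARKERS.search(text[:2000]):  # auto-generated header comment
        return "generated"
    name = path.lower()
    if "/test/" in name or name.endswith("test.java"):  # test code
        return "test"
    if any(len(line) > max_line_len for line in text.splitlines()):
        return "minified"  # minified bundles pack everything on one line
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return "non-ascii"  # flag encoding risks for later inspection
    return None
```

Returning a reason rather than a bare boolean makes it easy to log how much each filter removes, which is exactly the funnel accounting discussed later in this module.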
Cross-Language Considerations
Different programming languages have fundamentally different characteristics that affect dataset construction. Java is verbose with explicit type declarations, while Python is concise with dynamic typing. A method that takes 15 lines in Java might take 4 lines in Python. This means cross-language models must account for:
- Token-length variance — the same logic produces vastly different token counts across languages
- Vocabulary differences — language-specific keywords, idioms, and standard library names
- Structural conventions — Java’s class-centric design vs. Python’s module-level functions vs. Go’s package-level organization
A Filtering Example
Applied to a raw Java method dataset, these quality filters typically remove around a third of the data; a 36% reduction is common in practice. Each filter addresses a specific source of noise, and skipping any one of them can measurably degrade model performance.
Extracting and Filtering Methods
Once repositories are cloned, we need to extract individual methods or functions from the source files. For Java, this means finding .java files, parsing them with tools like javalang or JavaParser, and extracting method bodies along with their signatures.
Raw extracted methods need systematic cleaning before they can be used for training:
- Remove duplicates — their presence creates data leakage between training and test sets
- ASCII-only characters — avoid encoding issues across different systems
- Remove outliers — methods that are incredibly long (1000+ lines) or incredibly short (single-line getters)
- Remove boilerplate — trivial code like getters, setters, and auto-generated constructors
- Strip comments — remove all inline and block comments from the method body
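The comment-stripping step, for instance, can be approximated with two regular expressions. This sketch ignores the corner case of comment markers inside string literals, which a production pipeline would need to handle:

```python
import re

def strip_java_comments(source):
    """Remove /* ... */ blocks (including Javadoc) and // line comments."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    return source
```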
The core extraction logic uses brace-counting to find where each method starts and ends:
def extract_method_source(method_node, lines):
    """Recover a method's full text by matching braces from its start line.

    Note: naive brace counting is fooled by braces inside string literals
    or comments; a production pipeline should prefer the parser's own
    position information where available.
    """
    start_line = method_node.position.line - 1  # javalang positions are 1-based
    brace_count = 0
    started = False
    for i in range(start_line, len(lines)):
        for char in lines[i]:
            if char == '{':
                brace_count += 1
                started = True
            elif char == '}':
                brace_count -= 1
                if started and brace_count == 0:
                    # Found the brace that closes the method body
                    return '\n'.join(lines[start_line:i + 1])
    return '\n'.join(lines[start_line:])  # unbalanced braces: return the rest
Tokenization: From Code to Tokens
Tokenization breaks down raw source code into smaller units (tokens) that can be analyzed separately. For code, this means converting raw source into a structured sequence of keywords, identifiers, operators, and literals.
Lexer-Based vs. BPE Tokenization
Lexer-based tokenization is language-aware — it knows Java keywords, operators, and types. It produces whole identifiers as single tokens (e.g., getMaxValue → 1 token). This is great for analysis but creates a fixed, language-specific vocabulary.
BPE (Byte Pair Encoding) is language-agnostic — it learns frequent character sequences from training data and builds a vocabulary of 32K–100K subword tokens. It splits identifiers into common subwords (e.g., getMaxValue → [get, Max, Value]). This handles any vocabulary, including unseen identifiers and mixed languages.
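A toy sketch of BPE-style segmentation at inference time: greedy longest-match over a fixed subword vocabulary. Real BPE learns its merge rules from corpus statistics rather than using a hand-picked vocabulary, so treat this only as an illustration of the splitting behavior:

```python
def subword_split(identifier, vocab):
    """Greedily segment an identifier into the longest known subwords."""
    pieces, i = [], 0
    while i < len(identifier):
        for j in range(len(identifier), i, -1):  # try the longest piece first
            if identifier[i:j] in vocab:
                pieces.append(identifier[i:j])
                i = j
                break
        else:
            pieces.append(identifier[i])  # unknown character: keep it as-is
            i += 1
    return pieces

# A hand-picked vocabulary for illustration only
print(subword_split("getMaxValue", {"get", "Max", "Value", "set", "Min"}))
# ['get', 'Max', 'Value']
```

The single-character fallback is what makes subword schemes open-vocabulary: any identifier, even one never seen in training, can still be encoded.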
Abstract Syntax Trees (ASTs) offer a third perspective — capturing the hierarchical structure of code rather than a flat sequence. ASTs are used for code understanding tasks, semantic analysis, and tree-based neural models.
Each extracted method is tokenized into space-separated tokens using javalang’s lexer:
from javalang.tokenizer import tokenize

def tokenize_method(source_code):
    """Lex Java source into a single space-separated token string."""
    tokens = list(tokenize(source_code))
    token_values = [token.value for token in tokens]
    return ' '.join(token_values)
Abstract Syntax Trees
Lexer-based tokenization and BPE both produce flat sequences of tokens — they treat code as a linear stream, much like reading a sentence word by word. But code has a deeper, hierarchical structure that flat sequences discard. An Abstract Syntax Tree (AST) captures this structure explicitly, representing code as a tree where each node corresponds to a syntactic construct (declaration, expression, statement) and edges represent containment relationships.
Consider this simple Java method:
public int add(int a, int b) {
return a + b;
}
Its AST looks like this:
MethodDeclaration (name="add", returnType="int")
├── Modifier: public
├── FormalParameter (name="a", type="int")
├── FormalParameter (name="b", type="int")
└── BlockStatement
└── ReturnStatement
└── BinaryExpression (operator="+")
├── NameExpr: a
└── NameExpr: b
Why ASTs Matter for Code Mining
Structure over Surface
ASTs capture the syntactic structure of code, not its surface tokens. Two methods with different variable names but identical logic produce different token sequences but structurally similar ASTs.
Code Understanding
Tasks like clone detection, bug finding, and code classification benefit from structural representations that reveal what code does rather than how it looks.
Tree-Based Neural Models
Architectures like Tree-LSTM and code2seq operate directly on AST nodes, learning to compose meaning bottom-up from leaves to root — mirroring how compilers process code.
Parsing Tools
Libraries like JavaParser (Java), tree-sitter (multi-language), and Python’s built-in ast module make AST extraction straightforward at scale.
A flat token sequence like [public, int, add, (, int, a, ...] loses the information that a + b is the return expression, not just two identifiers near a plus sign. The AST preserves this nesting explicitly. This distinction becomes increasingly important in later modules on code embeddings and transformer architectures.
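Python's built-in ast module makes this easy to see on a Python analogue of the add method; each printed node type corresponds to a level of the tree above:

```python
import ast

tree = ast.parse("def add(a, b):\n    return a + b\n")
func = tree.body[0]                 # the FunctionDef node
ret = func.body[0]                  # its single Return statement
print(type(func).__name__)          # FunctionDef
print(type(ret).__name__)           # Return
print(type(ret.value).__name__)     # BinOp, the a + b expression
print(type(ret.value.op).__name__)  # Add
```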
Deduplication
Duplicate code inflates datasets and causes data leakage — if the same function appears in both training and test sets, the model appears to generalize but is actually recalling memorized examples. Studies show 10–30% of GitHub code is duplicated.
Exact Duplicates: SHA-256 Hashing
Compute a hash for each file or method. Identical hashes mean identical content. Fast, simple, and catches the same file copied across multiple repos.
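A minimal sketch with Python's standard hashlib; normalizing whitespace first is an illustrative choice so that reformatted copies still count as exact duplicates:

```python
import hashlib

def content_hash(code):
    """SHA-256 of whitespace-normalized content for exact-duplicate detection."""
    normalized = " ".join(code.split())  # collapse all runs of whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Reformatted copies of the same method hash identically
assert content_hash("int x = 1;") == content_hash("int  x =\n 1;")
```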
Near-Duplicates: MinHash + LSH
Jaccard similarity on token sets measures overlap. For scalability, use MinHash + Locality-Sensitive Hashing (LSH) to find near-duplicates across millions of files without expensive pairwise comparison.
These two snippets are near-duplicates — same logic, renamed variables:
Version A
public int calculateSum(int x, int y) {int result = x + y;return result;}
Version B
public int addNumbers(int a, int b) {int sum = a + b;return sum;}
An exact hash check would miss this pair entirely. MinHash + LSH catches them because their token sets overlap significantly.
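The underlying similarity measure is easy to compute directly; MinHash only approximates it at scale. For the two versions above, tokenized by hand here:

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity of two token sets: |A intersect B| / |A union B|."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b)

version_a = ("public int calculateSum ( int x , int y ) "
             "{ int result = x + y ; return result ; }").split()
version_b = ("public int addNumbers ( int a , int b ) "
             "{ int sum = a + b ; return sum ; }").split()
score = jaccard(version_a, version_b)  # 11 shared tokens of 19 total: about 0.58
```

Despite every identifier being renamed, more than half of the token vocabulary is shared, which is why set-overlap methods catch this pair while exact hashing misses it.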
We first clean malformed methods, then remove exact duplicates using a simple set-based approach:
def is_clean_method(tokenized_code):
    """Reject token strings that look truncated or contain multiple methods."""
    # More than one access modifier suggests several methods were captured together
    method_keywords = (tokenized_code.count("public ") +
                       tokenized_code.count("private ") +
                       tokenized_code.count("protected "))
    if method_keywords > 1:
        return False
    # A complete method body must end with its closing brace
    if not tokenized_code.endswith("}"):
        return False
    return True

# Exact deduplication: keep only the first occurrence of each token string
seen = set()
unique_methods = []
for m in tokenized_methods:
    if m['tokenized_code'] not in seen:
        seen.add(m['tokenized_code'])
        unique_methods.append(m)
Splitting the Dataset
How you split data into train, validation, and test sets matters as much as the data itself. The wrong strategy can silently invalidate your results.
Random Split (Risky)
Shuffle all methods and split. Fast but dangerous — methods from the same project can appear in both train and test sets, causing data leakage.
Project-Based Split (Recommended)
All methods from one project go into the same split. Prevents cross-project leakage since methods in the same project share coding style, API usage, and naming conventions.
Temporal Split
Train on older commits, test on newer ones. Simulates real-world deployment where the model must predict code it has never seen from the future.
Typical split ratios are 80/10/10 or 70/15/15 (train / validation / test). Larger test sets give more reliable evaluation estimates.
Project-Based Splitting in Practice
The idea is straightforward: group all methods by their source project, then assign entire projects (not individual methods) to splits. This ensures that no project’s coding style, API usage, or naming conventions leak from training into evaluation.
import random
def project_based_split(methods, train_ratio=0.8, val_ratio=0.1):
# Group methods by their source project
projects = {}
for m in methods:
proj = m["project"]
projects.setdefault(proj, []).append(m)
# Shuffle project names, then assign to splits
proj_names = list(projects.keys())
random.shuffle(proj_names)
total = len(methods)
train, val, test = [], [], []
count = 0
for name in proj_names:
group = projects[name]
if count < total * train_ratio:
train.extend(group)
elif count < total * (train_ratio + val_ratio):
val.extend(group)
else:
test.extend(group)
count += len(group)
return train, val, test
Code as Data: What Makes It Special?
Source code is unlike natural language text in several important ways. Understanding these differences shapes how we build datasets and models.
Formal Syntax
Code must compile or parse. A single misplaced semicolon breaks everything. This rigid structure is both a constraint and an advantage for learning.
Executable Semantics
Code has deterministic meaning: we can run it, test it, and verify outputs. This enables automatic labeling and evaluation.
Multi-level Representation
The same code can be viewed as characters, tokens, AST nodes, control-flow graphs, or data-flow graphs. Each level reveals different patterns.
Bimodal Nature
Repositories contain both code and natural language (comments, docs, commit messages). Models can learn the mapping between intent and implementation.
Real-World MSR Datasets
Researchers have curated benchmark datasets from mined repositories. These standardized datasets enable reproducible experiments and fair comparisons across techniques.
| Dataset | Languages | Size | Primary Task |
|---|---|---|---|
| CodeSearchNet | 6 languages | 2M code-NL pairs | Code search & retrieval |
| Defects4J | Java | 835 real bugs | Automated program repair & testing |
| BigCloneBench | Java | 8M clone pairs | Clone detection |
| The Stack | 300+ languages | 6 TB | Pre-training code LLMs |
| Methods2Test | Java | 780K focal-test pairs | Test generation |
Ethics, Licensing, and Provenance
Public code is not necessarily free to use for any purpose. Ethical and legal considerations are critical when building MSR datasets.
- MIT — permissive, almost no restrictions
- Apache 2.0 — permissive with patent grants
- GPL — copyleft, derivatives must also be GPL
The Copilot controversy highlighted the tension: GitHub Copilot trained on public repos regardless of license, sparking a class-action lawsuit. Developers argued their copyleft code was used without respecting license terms.
Repositories also often contain privacy risks — PII in comments (names, emails), hardcoded API keys, database credentials, and internal URLs that must be scrubbed from training data.
For reproducibility, always record source repo URLs, commit hashes, and timestamps. Version your filters, timestamp your collection, share your mining scripts, and use standardized formats like JSONL or Parquet.
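A minimal sketch of such provenance records in JSONL, using only the standard library; the field names are illustrative, not a standard schema:

```python
import datetime
import json

def make_record(code, repo_url, commit_hash):
    """Bundle a mined method with the provenance needed to reproduce it."""
    return {
        "code": code,
        "repo_url": repo_url,
        "commit_hash": commit_hash,  # pins the exact snapshot that was mined
        "collected_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

def write_jsonl(records, path):
    """JSONL: one JSON object per line, append-friendly and streamable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```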
Data Provenance and Documentation
Responsible dataset creation requires thorough documentation of how the data was collected, filtered, and prepared. Two emerging standards address this need:
- Datasheets for Datasets — a framework proposed by Gebru et al. that asks dataset creators to document motivation, composition, collection process, preprocessing, intended uses, and maintenance plans.
- Data Cards — concise summaries that accompany a dataset release, covering provenance, known biases, ethical considerations, and recommended use cases.
These documents help downstream users make informed decisions about whether a dataset is appropriate for their task and what limitations to expect.
Opt-Out Mechanisms
The Stack, a 6 TB dataset of permissively licensed source code, introduced an important precedent: developers can request that their code be removed from the dataset via a simple opt-out form. This respects contributor autonomy even when the code’s license technically permits inclusion. The opt-out mechanism acknowledges that legal permission and ethical consent are not the same thing.
GDPR and PII in Mined Data
Mined code repositories frequently contain personally identifiable information (PII) — author names in commit logs, email addresses in file headers, usernames in configuration files. Under regulations like the EU’s General Data Protection Regulation (GDPR), processing PII requires a lawful basis. Researchers working with mined data should:
- Strip or anonymize author metadata before releasing datasets
- Remove hardcoded credentials, API keys, and internal URLs
- Consider whether commit messages or code comments contain personal data
- Document what PII scrubbing was performed and what residual risks remain
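The first two steps can be approximated with regular expressions; the patterns below are illustrative and far from exhaustive, and dedicated secret scanners catch much more:

```python
import re

SCRUB_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email addresses
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),  # credentials
]

def scrub(text, placeholder="<REDACTED>"):
    """Replace likely PII and hardcoded secrets with a placeholder."""
    for pattern in SCRUB_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```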
The Complete Pipeline
The full pipeline runs from raw repositories to a clean, tokenized dataset ready for model training: select repositories, shallow-clone them, extract methods, filter and clean, tokenize, deduplicate, and split.
The Funnel Effect
A worked example for Java code summarization — generating natural-language descriptions for methods:
Starting from 3,847 repositories and 2.1 million raw pairs, preprocessing removes nearly 62% of the data. This is normal — quality always trumps quantity.
Worked Example: End-to-End MSR Pipeline
Let’s walk through a complete, concrete example of building a small dataset for Java method summarization — generating natural-language descriptions from method bodies.
Step 1: Select Repositories
We pick three well-known, actively maintained Java projects with permissive licenses:
apache/commons-lang
General-purpose utility library. Apache 2.0 license. ~2,400 stars. Rich Javadoc coverage across string, number, and date utilities.
google/guava
Core libraries for collections, caching, and I/O. Apache 2.0 license. ~50,000 stars. Extensive, well-documented API surface.
square/okhttp
HTTP client for Java and Android. Apache 2.0 license. ~46,000 stars. Production-grade networking code with clear method structure.
Step 2: Query and Clone
We use the GitHub API to verify each repository meets our criteria, then shallow-clone:
# Verify repos meet quality criteria
curl -s "https://api.github.com/repos/apache/commons-lang" \
| jq '{stars: .stargazers_count, license: .license.spdx_id, fork: .fork}'
# Shallow-clone each repo
git clone --depth 1 https://github.com/apache/commons-lang.git
git clone --depth 1 https://github.com/google/guava.git
git clone --depth 1 https://github.com/square/okhttp.git
Step 3: Extract Methods
Using javalang, we parse every .java file and extract methods that have Javadoc comments (our natural-language summaries):
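As a self-contained stand-in for the javalang-based extractor, a regular expression can pair each Javadoc block with the method signature that follows it. This sketch misses plenty of real cases (nested classes, annotations between the Javadoc and the signature), so treat it only as an approximation:

```python
import re

JAVADOC_METHOD = re.compile(
    r"/\*\*(?P<doc>.*?)\*/\s*"  # the Javadoc block
    r"(?P<sig>(?:public|private|protected)[^;{]*?\([^)]*\))\s*\{",  # the signature
    re.DOTALL,
)

def extract_documented_methods(java_source):
    """Yield (summary, signature) pairs for Javadoc-annotated methods."""
    for m in JAVADOC_METHOD.finditer(java_source):
        doc = " ".join(line.strip(" *") for line in m.group("doc").splitlines())
        yield doc.strip(), m.group("sig").strip()
```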
Step 4: Preprocessing Funnel
Each filtering step removes a specific category of noise (duplicates, encoding problems, outliers, boilerplate), shrinking the raw extraction down to a clean final set.
Step 5: Project-Based Split
We assign entire projects to splits. Since we only have three projects, one natural assignment is to place the largest project in training and the other two in validation and test.
With only three projects, the split ratios deviate from the ideal 80/10/10. In practice, you would mine dozens or hundreds of projects to achieve better balance while still maintaining strict project-level separation.
Try It Yourself
Put your knowledge into practice. Clone 3 Java repositories from GitHub (50+ stars), extract all methods using javalang, compute basic statistics (vocabulary size, average method length, duplicate count), and save the cleaned methods as a JSONL file. You will use this dataset in the next module on Source Code Modeling.
Open the Exercise in Google Colab →
Module 2: Source Code Modeling
Learn how to model the statistical properties of code using n-grams — the foundation for understanding how modern LLMs predict code.