Mining Software Repositories

Where do we find the data that feeds AI-driven software engineering? Learn how to collect, clean, and prepare source-code data from public repositories — the critical first step before any deep learning model can learn.

What is a Software Repository?

A repository is a centralized digital store that developers use to make and manage changes to an application’s source code. Version control systems like Git track what was changed, who made the change, when, and why — enabling teams to collaborate efficiently and retrieve any previous version of their code.

For AI researchers, repositories are goldmines. Every commit, every bug report, every discussion thread is a potential data point for training models that understand and generate code.

Three Types of Repositories

Source Repositories

Store the complete history of source-code changes. Examples include GitHub, GitLab, and Bitbucket. They contain source code files, commit messages, diffs, branch history, pull request reviews, and configuration files.

Bug Repositories

Track defects, feature requests, and tasks. Examples include Jira, Bugzilla, and GitHub Issues. They contain bug reports, labels, priority metadata, discussion threads, status transitions, and links to fixing commits.

Communication Repositories

Capture developer discussions. Examples include mailing lists, Slack, Stack Overflow, and IRC logs. They contain Q&A threads, chat logs, meeting notes, design documents, and announcements.

Key Terms

Commit
A snapshot of changes to one or more files, with a message describing what changed and why. Each commit has a unique SHA hash.
Branch
An independent line of development allowing parallel work on features, bug fixes, or experiments.
Merge / PR
Integrating changes from one branch into another, often through a reviewed pull request that adds a code-review layer.
Tag
A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
Diff
Shows the exact changes between two versions of a file: lines added, removed, or modified. Central to code review and change-aware AI models.
Every commit is a data point. MSR treats version history as a rich dataset for training AI models — commit messages become natural language labels, diffs become input features, and branches capture development workflows.
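To make commit mining concrete, here is a minimal sketch of turning git log output into structured records. The %H|%an|%ad|%s format string and the pipe separator are one convenient choice for illustration, not a standard:

```python
def parse_git_log(log_output):
    """Parse lines produced by: git log --pretty=format:"%H|%an|%ad|%s"."""
    commits = []
    for line in log_output.strip().splitlines():
        # maxsplit=3 keeps any '|' characters inside the commit message intact
        sha, author, date, message = line.split("|", 3)
        commits.append({"sha": sha, "author": author,
                        "date": date, "message": message})
    return commits

# hypothetical sample output for illustration
sample = (
    "a1b2c3d|Alice|2024-03-01|Fix null check in parser\n"
    "d4e5f6a|Bob|2024-02-28|Add retry logic to HTTP client"
)
commits = parse_git_log(sample)
```

Each record pairs a natural-language label (the message) with a pointer to the code change (the SHA) — exactly the structure MSR datasets exploit.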

Rule-Based vs. Data-Driven Approaches

Not all AI systems learn from data. Some encode expert knowledge directly as hand-crafted rules — IF/THEN patterns created by domain experts. These rule-based systems require no training data but are brittle, fail on edge cases, and are hard to scale.

Modern software engineering automation has moved decisively toward data-driven approaches, particularly deep learning. These models learn patterns automatically from large datasets, generalize to unseen examples, and improve with more data. The trade-off: they require vast amounts of high-quality training data.

This dependence on data is exactly why mining software repositories — the systematic collection and preparation of code data — is the essential first step in the AI4SE pipeline.

Why So Much Data?

Deep learning models are fundamentally statistical pattern matchers. They learn by observing millions of examples and discovering regularities — correlations between inputs and outputs that generalize to unseen data. The more complex the task, the more examples the model needs to learn robust patterns rather than memorizing surface-level noise.

Consider an analogy from computer vision: a model trained to classify cat breeds needs millions of labeled photographs to distinguish a Maine Coon from a Norwegian Forest Cat. With only a few hundred images, it might latch onto background color or image resolution instead of actual feline features. The same principle applies to code. A model that learns to summarize Java methods needs millions of method–summary pairs to understand that return a + b; is an addition regardless of whether the variables are named x and y, left and right, or salary and bonus.

What Do We Mine?

Software repositories are uniquely rich because they contain both source code and natural language, tightly interleaved:

  • Source code — method bodies, class definitions, configuration files
  • Comments & documentation — Javadoc, docstrings, inline annotations
  • Commit messages — concise descriptions of what changed and why
  • Issue reports & discussions — bug descriptions, feature requests, design debates
  • Code reviews — pull request comments explaining improvements, catching mistakes

Each of these artifacts provides a different view of developer intent. By mining all of them, we can train models that understand not just what code does, but why it was written that way.

Scale enables generalization. GitHub Copilot, one of the most visible applications of mined code data, was trained on billions of lines of public source code. This massive scale is what allows it to suggest contextually relevant completions across dozens of languages and frameworks — it has seen enough patterns to generalize beyond any single project or coding style.

Building a Dataset: Selecting Repositories

Garbage in, garbage out. A model trained on low-quality data will produce low-quality predictions. Data curation is arguably the most important — and most underappreciated — step in the ML pipeline.

Repository selection uses quality proxies to filter the hundreds of millions of public repos down to a manageable, high-quality subset:

  • Minimum stars — popularity as a quality signal
  • Active maintenance — recent commits indicate a living project
  • Non-fork status — avoid counting duplicated repositories
  • Proper licensing — ensure legal use for training
  • Meaningful commit history — enough data to be useful
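These proxies are straightforward to apply in code. The sketch below assumes repository metadata in the shape returned by the GitHub API (fields like stargazers_count and pushed_at); the two-year activity cutoff is an arbitrary choice for illustration:

```python
from datetime import datetime, timezone

def passes_quality_filters(repo, min_stars=100):
    """Apply the quality proxies above to a GitHub-API-style metadata dict."""
    if repo.get("fork", False):            # skip forks to avoid duplicates
        return False
    if repo.get("stargazers_count", 0) < min_stars:
        return False
    if not repo.get("license"):            # unlicensed code is risky to train on
        return False
    # "actively maintained": pushed to within the last two years (arbitrary cutoff)
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    age_days = (datetime.now(timezone.utc) - pushed).days
    return age_days < 730
```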

Tools for Mining at Scale

Specialized APIs and search platforms make large-scale dataset construction possible: the GitHub REST & GraphQL API (rate-limited, 5,000 req/hour with auth), SEART-GHS (a search engine for GitHub repos with advanced filtering), and libraries like PyGitHub (Python), Octokit (JS), and go-github (Go).

Here’s how to query GitHub’s API for high-quality Java repositories, filtering by stars and excluding forks:

Python
import requests

def fetch_top_java_repos(num_repos=200, per_page=100):
    """Search GitHub for popular Java repos, skipping forks."""
    repos = []
    page = 1
    while len(repos) < num_repos:
        url = "https://api.github.com/search/repositories"
        params = {
            "q": "language:java stars:>1000",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        }
        response = requests.get(url, params=params)
        response.raise_for_status()
        items = response.json().get("items", [])
        if not items:  # no more results — stop instead of looping forever
            break
        for item in items:
            if item.get("fork", False):
                continue
            repos.append({
                "full_name": item["full_name"],
                "clone_url": item["clone_url"],
                "stars": item["stargazers_count"],
            })
        page += 1
    return repos[:num_repos]

Once we have the repo list, we shallow-clone each one — --depth 1 grabs only the latest snapshot, saving time and disk space:

Python
import subprocess

def clone_repo(clone_url, dest_dir):
    """Shallow-clone a repository; returns True on success."""
    cmd = ["git", "clone", "--depth", "1", "--quiet", clone_url, dest_dir]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    return result.returncode == 0

Data Quality Challenges

Selecting high-quality repositories is necessary but not sufficient. The raw code extracted from even the best projects contains numerous quality issues that can corrupt model training if left unaddressed.

Common Quality Issues

  • Encoding problems — files with mixed encodings (UTF-8, Latin-1, Shift-JIS) produce garbled tokens. Non-ASCII identifiers in comments or string literals can break tokenizers that expect clean ASCII input.
  • Auto-generated code — protobuf stubs, IDE-generated boilerplate, ORM mapping files, and build tool outputs inflate the dataset with repetitive, formulaic code that teaches the model nothing about human programming patterns.
  • Test code vs. production code — unit tests follow very different patterns from production code (setup/teardown, assertions, mocking). Depending on the downstream task, you may want to separate or exclude test files entirely.
  • Dead code and commented-out blocks — abandoned code paths and large commented-out sections add noise without contributing meaningful signal.
  • Minified or obfuscated code — JavaScript bundles and obfuscated releases contain valid syntax but no readable structure.
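Some of these issues can be caught with simple heuristics. The marker strings and line-length threshold below are illustrative assumptions, not a definitive filter:

```python
GENERATED_MARKERS = ("generated by", "do not edit", "@generated", "autogenerated")

def looks_auto_generated(source):
    """Heuristic: generator tools usually announce themselves in a header comment."""
    header = source[:500].lower()  # markers almost always appear near the top
    return any(marker in header for marker in GENERATED_MARKERS)

def looks_minified(source, max_avg_line_len=200):
    """Heuristic: minified bundles pack everything onto very long lines."""
    lines = [l for l in source.splitlines() if l.strip()]
    if not lines:
        return False
    avg = sum(len(l) for l in lines) / len(lines)
    return avg > max_avg_line_len
```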

Cross-Language Considerations

Different programming languages have fundamentally different characteristics that affect dataset construction. Java is verbose with explicit type declarations, while Python is concise with dynamic typing. A method that takes 15 lines in Java might take 4 lines in Python. This means cross-language models must account for:

  • Token-length variance — the same logic produces vastly different token counts across languages
  • Vocabulary differences — language-specific keywords, idioms, and standard library names
  • Structural conventions — Java’s class-centric design vs. Python’s module-level functions vs. Go’s package-level organization

A Filtering Example

Here is a realistic breakdown of what happens when you apply quality filters to a raw Java method dataset:

Raw extracted methods                                         500,000
Remove auto-generated code (protobuf, Lombok, IDE stubs)      -62,000
Remove encoding errors and non-ASCII identifiers              -18,000
Remove methods <3 tokens or >512 tokens                       -45,000
Remove trivial getters, setters, and constructors             -38,000
Remove exact and near-duplicates                              -17,000
Clean methods remaining                                       320,000

A 36% reduction is typical. Each filter addresses a specific source of noise, and skipping any one of them can measurably degrade model performance.

Garbage in, garbage out. No model architecture can compensate for poor training data. Investing time in rigorous data cleaning consistently yields larger improvements than switching to a more complex model.

Extracting and Filtering Methods

Once repositories are cloned, we need to extract individual methods or functions from the source files. For Java, this means finding .java files, parsing them with tools like javalang or JavaParser, and extracting method bodies along with their signatures.

Raw extracted methods need systematic cleaning before they can be used for training:

  1. Remove duplicates — their presence creates data leakage between training and test sets
  2. ASCII-only characters — avoid encoding issues across different systems
  3. Remove outliers — methods that are incredibly long (1000+ lines) or incredibly short (single-line getters)
  4. Remove boilerplate — trivial code like getters, setters, and auto-generated constructors
  5. Strip comments — remove all inline and block comments from the method body
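Step 5 can be sketched with two regex passes. Note that regexes can mangle comment-like text inside string literals, so production pipelines typically use a lexer instead:

```python
import re

def strip_comments(java_source):
    """Remove block comments, then line comments, from a Java method body."""
    # caution: this can also eat comment-like text inside string literals
    no_block = re.sub(r"/\*.*?\*/", "", java_source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", no_block)
```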

The core extraction logic uses brace-counting to find where each method starts and ends:

Python
def extract_method_source(method_node, lines):
    """Recover a method's source text by counting braces from its start line."""
    start_line = method_node.position.line - 1  # javalang positions are 1-based
    brace_count = 0
    started = False
    end_line = start_line
    for i in range(start_line, len(lines)):
        for char in lines[i]:  # note: braces in strings/comments can confuse this
            if char == '{':
                brace_count += 1
                started = True
            elif char == '}':
                brace_count -= 1
        if started and brace_count == 0:  # every opened brace has been closed
            end_line = i
            break
    return '\n'.join(lines[start_line:end_line + 1])

Tokenization: From Code to Tokens

Tokenization breaks raw source code into smaller units (tokens) that can be analyzed separately. For code, this means converting the source into a structured sequence of keywords, identifiers, operators, and literals.

Lexer-Based vs. BPE Tokenization

Lexer-based tokenization is language-aware — it knows Java keywords, operators, and types. It produces whole identifiers as single tokens (e.g., getMaxValue → 1 token). This is great for analysis but creates a fixed, language-specific vocabulary.

BPE (Byte Pair Encoding) is language-agnostic — it learns frequent character sequences from training data and builds a vocabulary of 32K–100K subword tokens. It splits identifiers into common subwords (e.g., getMaxValue → [get, Max, Value]). This handles any vocabulary, including unseen identifiers and mixed languages.
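To make the subword idea concrete, here is a toy version of a single BPE merge step: count adjacent symbol pairs across the corpus, then merge the most frequent pair everywhere. Real tokenizers repeat this process thousands of times to build their vocabulary:

```python
from collections import Counter

def bpe_merge_step(words):
    """Find the most frequent adjacent pair and merge it in every word."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                out.append(w[i] + w[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged, best

# identifiers start as character sequences; repeated merges grow subwords
words = [list("getMax"), list("getMin"), list("getValue")]
merged, best = bpe_merge_step(words)
```

After enough merges, frequent fragments like "get" become single vocabulary entries — which is how unseen identifiers stay representable.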

Abstract Syntax Trees (ASTs) offer a third perspective — capturing the hierarchical structure of code rather than a flat sequence. ASTs are used for code understanding tasks, semantic analysis, and tree-based neural models.

Each extracted method is tokenized into space-separated tokens using javalang’s lexer:

Python
from javalang.tokenizer import tokenize

def tokenize_method(source_code):
    """Lex a Java method into one space-separated token string."""
    tokens = list(tokenize(source_code))
    return ' '.join(token.value for token in tokens)

Modern LLMs use BPE because it handles any vocabulary — including unseen identifiers, mixed languages, and natural language comments — with a fixed-size token dictionary. No out-of-vocabulary problem.

Abstract Syntax Trees

Lexer-based tokenization and BPE both produce flat sequences of tokens — they treat code as a linear stream, much like reading a sentence word by word. But code has a deeper, hierarchical structure that flat sequences discard. An Abstract Syntax Tree (AST) captures this structure explicitly, representing code as a tree where each node corresponds to a syntactic construct (declaration, expression, statement) and edges represent containment relationships.

Consider this simple Java method:

Java
public int add(int a, int b) {
    return a + b;
}

Its AST looks like this:

AST
MethodDeclaration (name="add", returnType="int")
├── Modifier: public
├── FormalParameter (name="a", type="int")
├── FormalParameter (name="b", type="int")
└── BlockStatement
    └── ReturnStatement
        └── BinaryExpression (operator="+")
            ├── NameExpr: a
            └── NameExpr: b

Why ASTs Matter for Code Mining

Structure over Surface

ASTs capture the syntactic structure of code, not its surface tokens. Two methods with different variable names but identical logic produce different token sequences but structurally similar ASTs.

Code Understanding

Tasks like clone detection, bug finding, and code classification benefit from structural representations that reveal what code does rather than how it looks.

Tree-Based Neural Models

Architectures like Tree-LSTM and code2seq operate directly on AST nodes, learning to compose meaning bottom-up from leaves to root — mirroring how compilers process code.

Parsing Tools

Libraries like JavaParser (Java), tree-sitter (multi-language), and Python’s built-in ast module make AST extraction straightforward at scale.
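Python’s built-in ast module makes this easy to try. Parsing the Python analogue of the add example yields the same nesting shown in the tree above:

```python
import ast

# the Python analogue of the Java add example
tree = ast.parse("def add(a, b):\n    return a + b")
func = tree.body[0]   # FunctionDef node, name "add"
ret = func.body[0]    # Return node (the ReturnStatement level)
expr = ret.value      # BinOp node with an Add operator: a + b
print(type(func).__name__, type(ret).__name__, type(expr).__name__)
# → FunctionDef Return BinOp
```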

Flat vs. hierarchical. A flat token sequence like [public, int, add, (, int, a, ...] loses the information that a + b is the return expression, not just two identifiers near a plus sign. The AST preserves this nesting explicitly. This distinction becomes increasingly important in later modules on code embeddings and transformer architectures.

Deduplication

Duplicate code inflates datasets and causes data leakage — if the same function appears in both training and test sets, the model appears to generalize but is actually recalling memorized examples. Studies show 10–30% of GitHub code is duplicated.

Exact Duplicates: SHA-256 Hashing

Compute a hash for each file or method. Identical hashes mean identical content. Fast, simple, and catches the same file copied across multiple repos.
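A minimal sketch of hash-based deduplication, with light whitespace normalization (an added assumption) so formatting-only differences do not defeat the hash:

```python
import hashlib

def content_hash(code):
    """SHA-256 of whitespace-normalized code; equal hashes mean equal content."""
    normalized = " ".join(code.split())  # collapse runs of whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Keeping only the first method seen per hash removes exact duplicates in a single pass over the dataset.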

Near-Duplicates: MinHash + LSH

Jaccard similarity on token sets measures overlap. For scalability, use MinHash + Locality-Sensitive Hashing (LSH) to find near-duplicates across millions of files without expensive pairwise comparison.

These two snippets are near-duplicates — same logic, renamed variables:

Version A

Java
public int calculateSum(int x, int y) {
    int result = x + y;
    return result;
}

Version B

Java
public int addNumbers(int a, int b) {
    int sum = a + b;
    return sum;
}

An exact hash check would miss this pair entirely. MinHash + LSH catches them because their token sets overlap significantly.
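A quick sketch of the underlying similarity measure. The regex tokenizer here is a crude stand-in for illustration; MinHash + LSH approximates exactly this Jaccard score at scale without pairwise comparison:

```python
import re

def token_set(code):
    # crude tokenizer: words/numbers plus single punctuation characters
    return set(re.findall(r"\w+|[^\w\s]", code))

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of the two token sets."""
    sa, sb = token_set(a), token_set(b)
    return len(sa & sb) / len(sa | sb)

version_a = "public int calculateSum(int x, int y) { int result = x + y; return result; }"
version_b = "public int addNumbers(int a, int b) { int sum = a + b; return sum; }"
# shared keywords and operators dominate despite the renamed identifiers
print(round(jaccard(version_a, version_b), 2))  # → 0.58
```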

We first clean malformed methods, then remove exact duplicates using a simple set-based approach:

Python
def is_clean_method(tokenized_code):
    """Reject methods that parsed badly or contain nested declarations."""
    # more than one visibility keyword suggests a nested or mis-extracted method
    method_keywords = (tokenized_code.count("public ") +
                       tokenized_code.count("private ") +
                       tokenized_code.count("protected "))
    if method_keywords > 1:
        return False
    if not tokenized_code.endswith("}"):  # truncated extraction
        return False
    return True

# exact-duplicate removal: keep only the first occurrence of each token string
seen = set()
unique_methods = []
for m in tokenized_methods:
    if m['tokenized_code'] not in seen:
        seen.add(m['tokenized_code'])
        unique_methods.append(m)

Splitting the Dataset

How you split data into train, validation, and test sets matters as much as the data itself. The wrong strategy can silently invalidate your results.

Random Split (Risky)

Shuffle all methods and split. Fast but dangerous — methods from the same project can appear in both train and test sets, causing data leakage.

Project-Based Split (Recommended)

All methods from one project go into the same split. Prevents cross-project leakage since methods in the same project share coding style, API usage, and naming conventions.

Temporal Split

Train on older commits, test on newer ones. Simulates real-world deployment where the model must predict code it has never seen from the future.

Typical split ratios are 80/10/10 or 70/15/15 (train / validation / test). Larger test sets give more reliable evaluation estimates.

Project-Based Splitting in Practice

The idea is straightforward: group all methods by their source project, then assign entire projects (not individual methods) to splits. This ensures that no project’s coding style, API usage, or naming conventions leak from training into evaluation.

Python
import random

def project_based_split(methods, train_ratio=0.8, val_ratio=0.1):
    # Group methods by their source project
    projects = {}
    for m in methods:
        proj = m["project"]
        projects.setdefault(proj, []).append(m)

    # Shuffle project names, then assign to splits
    proj_names = list(projects.keys())
    random.shuffle(proj_names)

    total = len(methods)
    train, val, test = [], [], []
    count = 0

    for name in proj_names:
        group = projects[name]
        if count < total * train_ratio:
            train.extend(group)
        elif count < total * (train_ratio + val_ratio):
            val.extend(group)
        else:
            test.extend(group)
        count += len(group)

    return train, val, test

Watch out for temporal leakage. Even with project-based splitting, a subtle form of leakage can occur if your training data includes code written after the code in your test set. In a real deployment, your model will never see future code during training. If your dataset spans multiple years, consider combining project-based and temporal splitting: assign projects to splits, and ensure the training set contains only commits before a cutoff date.

Code as Data: What Makes It Special?

Source code is unlike natural language text in several important ways. Understanding these differences shapes how we build datasets and models.

Formal Syntax

Code must compile or parse. A single misplaced semicolon breaks everything. This rigid structure is both a constraint and an advantage for learning.

Executable Semantics

Code has deterministic meaning: we can run it, test it, and verify outputs. This enables automatic labeling and evaluation.

Multi-level Representation

The same code can be viewed as characters, tokens, AST nodes, control-flow graphs, or data-flow graphs. Each level reveals different patterns.

Bimodal Nature

Repositories contain both code and natural language (comments, docs, commit messages). Models can learn the mapping between intent and implementation.

Real-World MSR Datasets

Researchers have curated benchmark datasets from mined repositories. These standardized datasets enable reproducible experiments and fair comparisons across techniques.

Dataset          Languages        Size                    Primary Task
CodeSearchNet    6 languages      2M code-NL pairs        Code search & retrieval
Defects4J        Java             835 real bugs           Automated program repair & testing
BigCloneBench    Java             8M clone pairs          Clone detection
The Stack        300+ languages   6 TB                    Pre-training code LLMs
Methods2Test     Java             780K focal-test pairs   Test generation

Ethics, Licensing, and Provenance

Public code is not necessarily free to use for any purpose. Ethical and legal considerations are critical when building MSR datasets.

  • MIT — permissive, almost no restrictions
  • Apache 2.0 — permissive with patent grants
  • GPL — copyleft, derivatives must also be GPL

The Copilot controversy highlighted the tension: GitHub Copilot trained on public repos regardless of license, sparking a class-action lawsuit. Developers argued their copyleft code was used without respecting license terms.

Repositories also often contain privacy risks — PII in comments (names, emails), hardcoded API keys, database credentials, and internal URLs that must be scrubbed from training data.

For reproducibility, always record source repo URLs, commit hashes, and timestamps. Version your filters, timestamp your collection, share your mining scripts, and use standardized formats like JSONL or Parquet.
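A minimal provenance record might look like the following. The field names and the commit SHA are illustrative, not a standard schema:

```python
import json
from datetime import datetime, timezone

def provenance_record(repo_url, commit_sha, filters_applied):
    """One JSONL line documenting where a sample came from and how it was filtered."""
    return {
        "repo_url": repo_url,
        "commit_sha": commit_sha,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "filters_applied": filters_applied,
    }

record = provenance_record(
    "https://github.com/apache/commons-lang",
    "0e2c67c0f3bf1647e21a0b9823d30cd2958a69f9",  # hypothetical SHA for illustration
    ["dedup-sha256", "ascii-only", "len-3-512"],
)
line = json.dumps(record)   # one record per line in a .jsonl file
restored = json.loads(line)
```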

Data Provenance and Documentation

Responsible dataset creation requires thorough documentation of how the data was collected, filtered, and prepared. Two emerging standards address this need:

  • Datasheets for Datasets — a framework proposed by Gebru et al. that asks dataset creators to document motivation, composition, collection process, preprocessing, intended uses, and maintenance plans.
  • Data Cards — concise summaries that accompany a dataset release, covering provenance, known biases, ethical considerations, and recommended use cases.

These documents help downstream users make informed decisions about whether a dataset is appropriate for their task and what limitations to expect.

Opt-Out Mechanisms

The Stack, a 6 TB dataset of permissively licensed source code, introduced an important precedent: developers can request that their code be removed from the dataset via a simple opt-out form. This respects contributor autonomy even when the code’s license technically permits inclusion. The opt-out mechanism acknowledges that legal permission and ethical consent are not the same thing.

GDPR and PII in Mined Data

Mined code repositories frequently contain personally identifiable information (PII) — author names in commit logs, email addresses in file headers, usernames in configuration files. Under regulations like the EU’s General Data Protection Regulation (GDPR), processing PII requires a lawful basis. Researchers working with mined data should:

  • Strip or anonymize author metadata before releasing datasets
  • Remove hardcoded credentials, API keys, and internal URLs
  • Consider whether commit messages or code comments contain personal data
  • Document what PII scrubbing was performed and what residual risks remain

Just because code is public does not mean it is free to use for any purpose. Always verify license compatibility, scrub sensitive data, and respect the intent of open-source contributors. Document your dataset’s provenance so others can make informed decisions about its use.
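A starting point for PII scrubbing is a set of regex passes. The two patterns below are illustrative only; real scrubbers such as detect-secrets apply many more rules:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# one well-known credential shape (AWS access key IDs); real tools cover far more
AWS_KEY_RE = re.compile(r"AKIA[0-9A-Z]{16}")

def scrub_pii(text):
    """Replace obvious PII and credentials with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = AWS_KEY_RE.sub("<API_KEY>", text)
    return text
```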

The Complete Pipeline

From raw repositories to a clean, tokenized dataset ready for model training:

Select Repos Clone & Extract Preprocess Tokenize Deduplicate Split

The Funnel Effect

A worked example for Java code summarization — generating natural-language descriptions for methods:

Query SEART-GHS: Java repos with ≥100 stars, ≥50 commits             3,847 repos
Clone & extract methods with Javadoc comments                        2.1M pairs
Deduplicate, remove trivial methods, filter by length, ASCII-only    810K pairs
Tokenize with javalang                                               810K sequences
Project-based split (train / val / test)                             648K / 81K / 81K

Starting from 3,847 repositories and 2.1 million raw pairs, preprocessing removes nearly 62% of the data. This is normal — quality always trumps quantity.

Worked Example: End-to-End MSR Pipeline

Let’s walk through a complete, concrete example of building a small dataset for Java method summarization — generating natural-language descriptions from method bodies.

Step 1: Select Repositories

We pick three well-known, actively maintained Java projects with permissive licenses:

apache/commons-lang

General-purpose utility library. Apache 2.0 license. ~2,400 stars. Rich Javadoc coverage across string, number, and date utilities.

google/guava

Core libraries for collections, caching, and I/O. Apache 2.0 license. ~50,000 stars. Extensive, well-documented API surface.

square/okhttp

HTTP client for Java and Android. Apache 2.0 license. ~46,000 stars. Production-grade networking code with clear method structure.

Step 2: Query and Clone

We use the GitHub API to verify each repository meets our criteria, then shallow-clone:

Bash
# Verify repos meet quality criteria
curl -s "https://api.github.com/repos/apache/commons-lang" \
  | jq '{stars: .stargazers_count, license: .license.spdx_id, fork: .fork}'

# Shallow-clone each repo
git clone --depth 1 https://github.com/apache/commons-lang.git
git clone --depth 1 https://github.com/google/guava.git
git clone --depth 1 https://github.com/square/okhttp.git

Step 3: Extract Methods

Using javalang, we parse every .java file and extract methods that have Javadoc comments (our natural-language summaries):

  • commons-lang: 1,847 methods
  • guava: 4,312 methods
  • okhttp: 1,956 methods
  • Total extracted: 8,115 methods

Step 4: Preprocessing Funnel

Each filtering step removes a specific category of noise:

Total extracted methods with Javadoc                                  8,115
Remove auto-generated and boilerplate (getters, setters, builders)   -1,430
Remove methods with <3 or >512 tokens                                  -890
Remove methods with non-ASCII identifiers or encoding issues           -145
Remove exact duplicates (SHA-256)                                      -520
Remove near-duplicates (MinHash, Jaccard ≥ 0.8)                        -310
Clean method–summary pairs remaining                                  4,820

Step 5: Project-Based Split

We assign entire projects to splits. Since we only have three projects, one natural assignment is:

Training
google/guava — 2,510 methods (52%). The largest project provides the bulk of training examples.
Validation
apache/commons-lang — 1,205 methods (25%). Used for hyperparameter tuning and early stopping.
Test
square/okhttp — 1,105 methods (23%). Held out entirely for final evaluation.

With only three projects, the split ratios deviate from the ideal 80/10/10. In practice, you would mine dozens or hundreds of projects to achieve better balance while still maintaining strict project-level separation.

What to watch out for. With few projects, a single outlier project can dominate one split. Always inspect per-project statistics (method count, average length, vocabulary size) to ensure no split is systematically different from the others. If one project is an order of magnitude larger than the rest, consider subsampling it to avoid skewing the training distribution.
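Such per-project statistics are cheap to compute. The sketch below assumes each method record carries its project name and token list (field names are illustrative):

```python
from collections import defaultdict

def per_project_stats(methods):
    """Summarize each project's share of the dataset to spot imbalance."""
    grouped = defaultdict(list)
    for m in methods:
        grouped[m["project"]].append(len(m["tokens"]))
    return {
        proj: {"count": len(lens), "avg_len": sum(lens) / len(lens)}
        for proj, lens in grouped.items()
    }

# toy records for illustration
methods = [
    {"project": "guava", "tokens": ["public", "int", "a"]},
    {"project": "guava", "tokens": ["return", "x", ";", "}"]},
    {"project": "okhttp", "tokens": ["void", "run"]},
]
stats = per_project_stats(methods)
```

Comparing counts and average lengths across splits is a quick sanity check before any training run.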

Try It Yourself

Put your knowledge into practice. Clone 3 Java repositories from GitHub (50+ stars), extract all methods using javalang, compute basic statistics (vocabulary size, average method length, duplicate count), and save the cleaned methods as a JSONL file. You will use this dataset in the next module on Source Code Modeling.

Open the Exercise in Google Colab →

Next Module →

Module 2: Source Code Modeling

Learn how to model the statistical properties of code using n-grams — the foundation for understanding how modern LLMs predict code.