Evolutionary computation meets AI-assisted software engineering. Learn how genetic algorithms can evolve prompts, test suites, and code specifications to dramatically improve the quality of LLM-generated software artifacts.
Learning Objectives
Objective 1
Understand the core mechanics of genetic algorithms: selection, crossover, mutation, and fitness evaluation
Objective 2
Apply GA concepts to real optimization problems using binary and real-valued encodings
Objective 3
Explain how GAs can evolve prompts and test descriptions for LLM-based code generation
Objective 4
Analyze the connection between search-based software testing and evolutionary prompt optimization
Objective 5
Evaluate the tradeoffs of GA parameters and their impact on convergence and solution quality
Key References
Holland 1975
John Holland — "Adaptation in Natural and Artificial Systems," foundational work on genetic algorithms
EvoSuite
Fraser & Arcuri (2011) — Search-based test generation for Java using evolutionary algorithms
EvoPrompt
Evolutionary prompt optimization for LLM-driven software engineering tasks
SBST
Search-Based Software Testing — applying metaheuristic search to test case generation
Extra · Slide 02
What Are Genetic Algorithms?
Genetic algorithms (GAs) are optimization techniques inspired by Charles Darwin's theory of natural selection. Instead of computing an exact solution, GAs evolve a population of candidate solutions over successive generations, guided by a fitness function.
Core Idea
Evolve, don't calculate. Start with a random population of possible solutions. Evaluate how "fit" each one is. Select the best, combine them, introduce small random changes, and repeat. Over many generations, the population converges toward optimal solutions.
Gradient-Based Optimization
Requires: a differentiable objective function
Method: compute gradients, follow the slope downhill
Strengths: fast convergence on smooth landscapes
Weaknesses: gets stuck in local optima; cannot handle discrete or non-differentiable problems
Brief History
John Holland (1975) formalized genetic algorithms in "Adaptation in Natural and Artificial Systems" at the University of Michigan. His Schema Theorem showed that GAs implicitly process short, low-order, high-fitness patterns (called schemata) at an exponential rate — providing the first theoretical explanation for why GAs work. Since then, GAs have been applied to scheduling, routing, design optimization, and — most recently — software engineering.
Evolutionary Optimization
Requires: only a fitness function (no gradients needed)
Method: selection, crossover, mutation on a population
Strengths: handles discrete, noisy, multi-modal search spaces
Weaknesses: slower convergence; requires more function evaluations
Extra · Slide 03
Core Concepts
Every genetic algorithm operates with the same fundamental building blocks. Understanding this vocabulary is essential before diving into the mechanics.
Population
A collection of candidate solutions (called individuals or chromosomes). The population evolves together over time. Typical sizes range from 20 to 200 individuals.
Gene
The smallest unit of a solution. A chromosome is a sequence of genes. In a binary GA, each gene is a single bit (0 or 1). In real-valued GAs, each gene is a floating-point number.
Chromosome
A complete candidate solution encoded as a string of genes. For example, the binary string [1,0,1,1,0] is a chromosome with 5 genes.
Fitness Function
A function that evaluates the quality of each individual. Higher fitness = better solution. This is the only problem-specific component of a GA — everything else is generic.
Generation
One complete cycle of evaluation, selection, crossover, and mutation. A GA typically runs for 50–500 generations, depending on problem complexity.
Genotype vs Phenotype
The genotype is the encoded representation (e.g., bits). The phenotype is the decoded solution (e.g., the actual number or structure it represents).
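In code, decoding a genotype to its phenotype is a small helper. This sketch assumes an MSB-first bit convention (the same one used in the worked x² example later in this module):

```python
def decode(genotype):
    """Map a bit-string genotype (MSB first) to its integer phenotype."""
    phenotype = 0
    for bit in genotype:
        phenotype = phenotype * 2 + bit
    return phenotype

# The genotype [1, 1, 0, 0, 0] encodes the phenotype 24
print(decode([1, 1, 0, 0, 0]))  # 24
```

The same chromosome could decode to a different phenotype under a different convention (e.g., LSB-first or Gray coding); the encoding is a design choice.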
GA vs GP — An Important Distinction
Genetic Algorithms (GA) optimize parameters — they evolve fixed-length encodings (bit strings, real vectors) that represent settings, configurations, or inputs. Genetic Programming (GP) evolves programs themselves — tree-structured representations of executable code. When we talk about "evolving code" later in this module, we primarily use GAs to optimize prompts and parameters that an LLM uses to generate code, rather than directly evolving code trees (GP). Both are part of the broader family of Evolutionary Computation.
Extra · Slide 04
The GA Lifecycle
A genetic algorithm follows a cyclical process that mimics natural evolution. Each cycle produces a new generation of candidate solutions that (on average) are fitter than the previous one.
Initialize Population
→
Evaluate Fitness
→
Selection
→
Crossover
→
Mutation
→
New Generation
↑ Repeat until termination condition met ↓
1
Initialize
Create a random population of N chromosomes. Each chromosome is a random candidate solution (e.g., random bit strings).
2
Evaluate
Compute the fitness of every individual using the fitness function. This is often the most computationally expensive step.
3
Select
Choose parent pairs based on fitness. Fitter individuals have a higher probability of being selected to produce offspring.
4
Crossover
Combine two parents to create offspring. Swap portions of parent chromosomes at random crossover points.
5
Mutate
Randomly alter genes in offspring with a small probability. This introduces new genetic material and prevents premature convergence.
6
Terminate?
Check stopping criteria: maximum generations reached, fitness threshold met, or population has converged (all individuals are similar).
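The six steps above can be sketched as one minimal, self-contained loop. Every choice here (population size, tournament size k=2, mutation rate, the count-the-ones fitness) is illustrative, not a prescribed configuration:

```python
import random

def evolve_onemax(pop_size=8, length=8, generations=60, mut_rate=0.125, seed=0):
    """Minimal GA sketch: evolve bit strings to maximize the count of 1-bits."""
    rng = random.Random(seed)
    fitness = sum                                       # fitness = number of 1-bits
    # 1. Initialize: random population
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):                        # 6. terminate after N generations
        children = [max(pop, key=fitness)]              # elitism: carry over the best
        while len(children) < pop_size:
            # 3. Select: tournament of size 2
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # 4. Crossover: single random cut point
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]
            # 5. Mutate: independent bit flips
            child = [1 - g if rng.random() < mut_rate else g for g in child]
            children.append(child)
        pop = children                                  # new generation
    return max(pop, key=fitness)

best = evolve_onemax()
print(best, sum(best))
```

With elitism, the best fitness is monotonically non-decreasing, so the loop reliably climbs toward the all-ones string.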
Extra · Slide 05
Selection Strategies
Selection determines which individuals get to reproduce. The goal is to bias reproduction toward fitter individuals while still maintaining diversity.
Roulette Wheel Selection
Each individual's probability of selection is proportional to its fitness. Like spinning a roulette wheel where fitter individuals occupy larger slices.
Tournament Selection
Randomly pick k individuals (the tournament size), then select the fittest among them. Repeat for each parent needed. Simple and efficient, with selection pressure adjustable by changing k. k=2 is a common default.
Rank-Based Selection
Sort individuals by fitness, then assign selection probability based on rank (not raw fitness). Avoids the problem of one super-fit individual dominating the population. Individual ranked #1 gets the highest probability.
Elitism
Always copy the top k individuals directly into the next generation (unchanged). This guarantees the best solution found so far is never lost. Typically k=1 or k=2. Elitism is almost always used in practice and significantly improves convergence.
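The two most common strategies fit in a few lines each; the three-bit population at the bottom is purely illustrative:

```python
import random

def tournament_select(pop, fitness, k=2, rng=random):
    """Pick k random individuals and return the fittest; larger k = more pressure."""
    return max(rng.sample(pop, k), key=fitness)

def roulette_select(pop, fitness, rng=random):
    """Selection probability proportional to fitness (assumes non-negative scores)."""
    total = sum(fitness(ind) for ind in pop)
    spin = rng.uniform(0, total)          # where the wheel stops
    running = 0.0
    for ind in pop:
        running += fitness(ind)
        if running >= spin:
            return ind
    return pop[-1]                        # guard against floating-point rounding

population = [[1, 1, 0], [0, 0, 0], [1, 1, 1]]
print(tournament_select(population, sum, k=3))  # k = population size: always the fittest
```

Note that with k equal to the population size, tournament selection degenerates into always picking the global best, which is why small k (2 or 3) is preferred in practice.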
Extra · Slide 06
Crossover (Recombination)
Crossover combines genetic material from two parents to create offspring. It is the primary driver of exploration in the search space, allowing the GA to combine good building blocks from different solutions.
Single-Point Crossover
Choose a random crossover point. Take genes from Parent A up to that point, and from Parent B after that point.
Uniform Crossover
For each gene position, flip a coin to decide which parent contributes. Each gene is inherited independently, giving maximum mixing of genetic material.
// Uniform crossover (coin flip per gene)
Parent A: [1, 0, 1, 1, 0, 1]
Parent B: [0, 1, 0, 0, 1, 1]
Coin: A B A B B A
Child: [1, 1, 1, 0, 1, 1]
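Both crossover variants are a few lines of code. This sketch reuses the six-gene parents from the example above:

```python
import random

def single_point(p1, p2, rng=random):
    """Cut both parents at one random point and swap the tails."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform(p1, p2, rng=random):
    """Coin flip per gene: each position is inherited independently."""
    return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]

a = [1, 0, 1, 1, 0, 1]
b = [0, 1, 0, 0, 1, 1]
print(single_point(a, b, random.Random(0)))
print(uniform(a, b, random.Random(0)))
```

Single-point crossover preserves contiguous blocks of genes (good when neighboring genes interact); uniform crossover mixes maximally.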
Extra · Slide 07
Mutation
Mutation introduces random changes to individual genes. While crossover combines existing solutions, mutation creates genuinely new genetic material — essential for escaping local optima and maintaining population diversity.
Bit-Flip Mutation
For binary chromosomes: flip a random bit from 0 to 1, or 1 to 0. Each gene has a small probability (typically 1/L where L is chromosome length) of being flipped.
Without mutation: the population can only recombine existing genes. If a critical gene value is missing from the initial population, it can never be discovered.
With mutation: any point in the search space can eventually be reached, guaranteeing theoretical completeness. Mutation rate must be balanced — too low limits exploration, too high becomes random search.
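Bit-flip mutation with the 1/L default rate mentioned above can be sketched as:

```python
import random

def bit_flip(chrom, rate=None, rng=random):
    """Flip each gene independently with probability rate (default 1/L)."""
    if rate is None:
        rate = 1.0 / len(chrom)          # classic choice: one expected flip per child
    return [1 - g if rng.random() < rate else g for g in chrom]

print(bit_flip([1, 1, 1, 1, 1, 1, 1, 1], rng=random.Random(42)))
```

With rate 1/L, each offspring has on average exactly one flipped gene, which perturbs solutions without degenerating into random search.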
Extra · Slide 08
Interactive: Build a GA Step by Step
Walk through one generation of a genetic algorithm. Start with a population of 6 binary chromosomes (8 bits each), then select, crossover, and mutate to create a new generation.
GA Step-by-Step Builder Interactive
Fitness = count of 1-bits (goal: maximize ones). Click the buttons in order to evolve one generation.
Extra · Slide 09
Fitness Landscapes
A fitness landscape maps every candidate solution to its fitness. The shape determines how hard the problem is and how GAs navigate it.
[Diagram: fitness landscape showing fit individuals near peaks, one individual trapped at a local optimum, low-fitness valleys, and a mutation escape path.]
Local Optima Trap
A local optimum is better than all its neighbors, but worse than the global best. Gradient methods get stuck here permanently. GAs escape via mutation (the red dashed arrow above) and population diversity. This is why even seemingly harmful mutations are valuable.
Core Tradeoff: Explore (mutation, large populations) vs. Exploit (selection, crossover)
Exploration vs Exploitation
Exploration = searching new regions (mutation, large populations). Exploitation = refining known solutions (selection, crossover). Too much exploration = random search. Too much exploitation = premature convergence.
Extra · Slide 10
Complete Example: Maximizing f(x) = x²
Walk through a simple GA that maximizes f(x) = x² for x in [0, 31], using 5-bit binary encoding. Each chromosome is a 5-bit string representing an integer from 0 to 31.
1
Encoding & Initial Population
5 bits encode integers 0–31. We create 4 random individuals:
// Generation 0
Ind. 1: [0,1,1,0,1] → x=13 → f = 169
Ind. 2: [1,1,0,0,0] → x=24 → f = 576
Ind. 3: [0,1,0,0,0] → x=8 → f = 64
Ind. 4: [1,0,0,1,0] → x=18 → f = 324
2
Roulette Wheel Selection
Individual 2 (fitness 576) gets 50.8% of the wheel (576 of the total fitness 1,133). It is most likely to be selected as a parent. After spinning twice, we select Parents: Ind. 2 & Ind. 4.
3
Crossover (point = 3)
Single-point crossover at position 3:
Parent A (Ind.2): [1,1,0,0,0]
Parent B (Ind.4): [1,0,0,1,0]
^-- cut
Child 1: [1,1,0,1,0] → x=26
Child 2: [1,0,0,0,0] → x=16
4
Mutation (bit 4 of Child 1)
Flip bit 4 (zero-indexed, the last bit) of Child 1: [1,1,0,1,0] → [1,1,0,1,1] → x=27
5
Generation 1 Results
The new population (with elitism keeping Ind. 2):
// Generation 1
Ind. 2 (elite): [1,1,0,0,0] → x=24 → f = 576
Child 1 (mut): [1,1,0,1,1] → x=27 → f = 729 ↑ NEW BEST!
Child 2: [1,0,0,0,0] → x=16 → f = 256
Ind. 1: [0,1,1,0,1] → x=13 → f = 169
Best fitness improved: 576 → 729
Convergence
After a few more generations, the population converges toward x=31 ([1,1,1,1,1]) with f(31) = 961, the global maximum. The GA found the optimum without ever computing a derivative.
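The whole worked example can be reproduced end to end. This sketch uses roulette-wheel selection and elitism as in the walkthrough; the seed and parameter values are arbitrary, so individual runs differ, but elitism keeps the best fitness monotonically non-decreasing:

```python
import random

def decode(bits):
    return int("".join(map(str, bits)), 2)   # genotype -> integer phenotype

def fitness(bits):
    return decode(bits) ** 2                 # f(x) = x^2, maximized over [0, 31]

def roulette(pop, rng):
    """Spin the wheel: selection probability proportional to fitness."""
    total = sum(fitness(ind) for ind in pop)
    spin = rng.uniform(0, total)
    running = 0.0
    for ind in pop:
        running += fitness(ind)
        if running >= spin:
            return ind
    return pop[-1]

def run(pop_size=4, n_bits=5, generations=60, seed=3):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = [max(pop, key=fitness)]                 # elitism keeps the best
        while len(nxt) < pop_size:
            p1, p2 = roulette(pop, rng), roulette(pop, rng)
            cut = rng.randrange(1, n_bits)            # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < 1 / n_bits else g
                     for g in child]                  # bit-flip mutation
            nxt.append(child)
        pop = nxt
    return decode(max(pop, key=fitness))

print(run())   # converges toward the global maximum x = 31
```

No derivative of f(x) = x² is ever computed; the GA only compares fitness values.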
Extra · Slide 11
GA Parameters & Tuning
The behavior of a genetic algorithm is controlled by several key parameters. Choosing the right values is critical — poor settings can lead to premature convergence or wasted computation.
Parameter
Typical Range
Too Low
Too High
Recommendation
Population Size
20–200
Low diversity, premature convergence
Slow evaluation, wasted computation
Start with 50–100; increase for complex problems
Crossover Rate
0.6–0.9
Limited recombination, slow progress
Disrupts good solutions too often
0.8 is a safe default
Mutation Rate
0.01–0.05
Cannot escape local optima
Becomes random search, destroys good solutions
1/L (L = chromosome length) is a classic choice
Generations
50–500
May not converge
Diminishing returns, wasted time
Use convergence detection to stop early
Selection Pressure
Tournament k=2–7
Near-random selection, no fitness bias
Only top individuals reproduce, loss of diversity
k=2 or k=3 balances exploration and exploitation
Elitism Count
1–5
Best solution can be lost
Reduces population diversity
k=1 or k=2; always use elitism
Key Insight
There is no universally optimal parameter set. The best parameters depend on the fitness landscape of the specific problem. Multi-modal landscapes with many local optima need higher mutation rates and larger populations. Smooth, unimodal landscapes can converge quickly with smaller populations and lower mutation.
Extra · Slide 12
Interactive: GA Evolution Visualizer
Watch a genetic algorithm evolve in real time. See how a population of candidates is evaluated, how parents are selected, how crossover and mutation create offspring, and how the population improves over generations.
Before diving into GA + LLM architectures, let's understand why GAs are uniquely suited to the problem of generating high-quality code.
The Search Problem
Generating optimal code is searching through an enormous space of possible programs. Traditional approaches fall short:
Manual prompting: try a few prompts by hand — inefficient, misses vast regions of the space
Grid search: systematically vary parameters — exponential cost as dimensions grow
Random sampling: generate many candidates blindly — wasteful, no learning between attempts
10^50+
Possible Programs (estimated)
~5
Manual Attempts (typical)
Nature's Search Algorithm
GAs efficiently explore large spaces using evolution:
Maintain a diverse population — don't put all eggs in one basket
Keep what works — selection preserves fit solutions
Combine good parts — crossover merges building blocks
Occasionally try something new — mutation explores the unexpected
Key Advantage
GAs are interesting because they can optimize objectives that are difficult to formalize mathematically — like "good code quality." You don't need a differentiable loss function. You just need a way to compare: is solution A better than solution B?
Extra · Slide 14
The Evaluation Bottleneck
The critical challenge in using GAs for code generation: how do you know if generated code is actually good?
1
LLM Generates Code Candidate
An LLM produces a function, class, or module based on a prompt or specification.
2
To Know If It's Good, We Need Functional Correctness
Does the code actually work? Does it handle edge cases? Does it produce correct output for all inputs?
3
Functional Correctness = Run Test Suite
The gold standard: compile the code and execute it against a comprehensive test suite. This is the only way to know if code works.
4
Industrial Test Suites Can Take HOURS per Candidate
Integration tests, system tests, performance benchmarks — real-world test suites are slow and resource-intensive.
5
1000 Candidates × Hours = Impossible
If we want to evaluate a large population across many generations, exhaustive testing is computationally infeasible.
CodeBLEU
Measures text similarity · ~0.1 s per eval · does NOT tell you if code works
Test Exec
Measures correctness · minutes to hours per eval · the gold standard
LLM Predictor
Predicts pass/fail · ~0.5 s per eval · ~95% accurate
The Core Tension
Surface-level metrics like CodeBLEU are cheap but NOT sufficient. A function can look perfect and still crash. Functional correctness remains the gold standard for code evaluation — but it is expensive. This tension is what makes the next insight so powerful.
Extra · Slide 15
Fitness Approximation: The Key Insight
What if we could PREDICT whether code passes tests without actually running them?
Traditional GA Fitness
Run tests for every candidate → accurate but SLOW.
Approximate Fitness
If a binary classifier can predict pass/fail with 95% accuracy, we tolerate the 5% error because the alternative — running all tests — is computationally infeasible at scale. Accept 5% error to gain up to five orders of magnitude of speedup. This is the fundamental insight behind fitness approximation in evolutionary code optimization.
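A back-of-envelope calculation makes the tradeoff concrete. The numbers here (200 individuals, 100 generations, 30-minute test runs, 0.5 s predictor calls, top-5 final validation) are illustrative:

```python
# Back-of-envelope: exact vs. approximate fitness evaluation (illustrative numbers).
pop, gens = 200, 100            # 20,000 fitness evaluations in total
test_minutes = 30               # exact fitness: run the real test suite
predict_seconds = 0.5           # approximate fitness: one predictor call
top_k = 5                       # real tests run only on the final top-K candidates

exact_hours = pop * gens * test_minutes / 60
approx_hours = pop * gens * predict_seconds / 3600 + top_k * test_minutes / 60
print(f"exact: {exact_hours:,.0f} h  approximate: {approx_hours:.1f} h  "
      f"speedup: {exact_hours / approx_hours:,.0f}x")
```

Even with a modest 30-minute suite, exhaustive evaluation takes on the order of 10,000 hours, while the approximate loop plus one final validation pass finishes in a few hours.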
Extra · Slide 16
Interactive: Fitness Approximation Tradeoff
Explore the tradeoff between exact and approximate fitness evaluation. Adjust the parameters to see how population size, generations, test time, and predictor accuracy affect total computation time and error rates.
Fitness Approximation Calculator Interactive
Population Size: 200
Generations: 100
Test Execution Time (minutes): 30
Predictor Accuracy (%): 95
Adjust the sliders and click "Calculate" to compare exact vs. approximate fitness evaluation.
Extra · Slide 17
From Bit Strings to Code
Now comes the key transition: applying genetic algorithm principles not to binary optimization problems, but to software engineering artifacts. The challenge is representing SE artifacts as "chromosomes" that can be evolved.
The Representation Problem
In traditional GAs, chromosomes are bit strings or numeric vectors. But software engineering artifacts — prompts, test cases, code specifications — are structured natural language or code. We need creative encodings that allow crossover and mutation to produce meaningful offspring.
What Can We Evolve?
Prompts: natural language instructions to an LLM
Test descriptions: specifications of what to test
Input parameters: values fed to test cases
Code templates: structural skeletons with variable parts
Configuration files: build/deploy settings
Traditional SE + GA
Search-Based Software Engineering (SBSE) has used evolutionary algorithms since the 2000s. EvoSuite evolves Java test suites to maximize code coverage. GenProg evolves program patches to fix bugs. The new frontier: combining these techniques with LLMs.
// Traditional GA chromosome (bit string)
[1, 0, 1, 1, 0, 1, 0, 0]
// SE chromosome: a test description
"Test that sorting an empty list returns an empty list without error"
// SE chromosome: a prompt for an LLM
"Write a function that validates email addresses using regex, handling edge cases like + aliases"
Extra · Slide 18
GA + LLMs: The Key Insight
The breakthrough idea: use LLMs as a "development function" and genetic algorithms as the optimizer that searches for the best inputs (prompts, specifications) to that function.
GA evolves prompts
→
LLM generates code
→
Tests evaluate code
→
Fitness = test results
→
Evolve better prompts
The Architecture
Chromosome = a natural language prompt or specification
Fitness function = compile the LLM output, run tests, measure coverage/correctness
Crossover = combine parts of two prompts (sentence-level recombination)
Mutation = rephrase, add detail, change keywords, adjust constraints
Why Not Just Prompt Engineering?
Manual prompt engineering requires human intuition and trial-and-error. A GA can explore thousands of prompt variations systematically, discovering phrasings and structures that no human would think to try. It optimizes the prompt for a measurable objective, not subjective quality.
Fitness Functions for Code
Compilability: does the generated code compile? (+1 point)
Test pass rate: what fraction of tests pass? (0.0–1.0)
Code coverage: what % of target code is exercised? (0–100%)
Mutation score: does the test suite detect injected faults?
Composite: weighted combination of the above
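A composite fitness might be sketched as follows; the weights and the zero-on-compile-failure rule are illustrative choices, not a standard:

```python
def composite_fitness(compiles, pass_rate, coverage, mutation_score,
                      weights=(0.2, 0.4, 0.2, 0.2)):
    """Weighted combination of the fitness signals listed above.

    compiles is a bool; pass_rate, coverage, and mutation_score are in [0, 1].
    The weights are illustrative and should be tuned per project.
    """
    if not compiles:
        return 0.0              # non-compiling code scores zero outright
    w_compile, w_pass, w_cov, w_mut = weights
    return (w_compile * 1.0 + w_pass * pass_rate
            + w_cov * coverage + w_mut * mutation_score)

print(round(composite_fitness(True, pass_rate=0.9, coverage=0.75,
                              mutation_score=0.5), 2))   # 0.81
```

Gating on compilability first keeps the GA from rewarding fluent-looking code that doesn't even build.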
GA
Optimizes Prompts
LLM
Generates Code
Tests
Measure Fitness
Extra · Slide 19
GA + LLM: The Architecture
Here is the full pipeline when we combine genetic algorithms with LLM-based fitness approximation. The key: expensive test execution happens only ONCE at the end.
1
Initialize Population
Generate N code solutions using the LLM with different prompts, temperatures, or parameter configurations. Each solution is an "individual."
2
Evaluate Fitness (CHEAP)
Call the LLM predictor for each individual — predict pass@1: 0 or 1. This is just an API call (~0.5s), NOT test execution (hours).
3
Selection
Tournament selection — pick the fittest individuals based on predicted fitness scores.
4
Crossover
Combine parts of good solutions — e.g., take function A's algorithm + function B's error handling. For prompt evolution: merge clauses from two high-fitness prompts.
5
Mutation
Randomly modify candidates: rename variables, restructure logic, change data structures, rephrase prompt clauses, add or remove constraints.
6
Re-evaluate New Population (CHEAP)
Again, just LLM predictor calls — NO test execution. Each new individual costs seconds, not hours.
7
Repeat Until Convergence
Continue for N generations or until fitness plateaus. The entire evolutionary loop uses only approximate fitness.
8
Final Validation: Run ACTUAL Tests
Run the real test suite only on the top-K best candidates from the final generation. This is the only time we pay the full cost of test execution.
The Key Insight
The expensive test execution happens ONCE at the end, on a handful of candidates, instead of at EVERY generation for EVERY individual. For each new individual during evolution, there is only the call to the LLM (seconds) but NOT the execution of test cases (hours).
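The eight steps can be condensed into one sketch. The helpers llm_generate, predict_pass, run_test_suite, and mutate are hypothetical stand-ins rather than real APIs, and crossover is omitted for brevity; the point is that the real tests run once, at the end:

```python
import random

def evolve_with_predictor(seed_prompts, llm_generate, predict_pass,
                          run_test_suite, mutate, generations=20,
                          top_k=3, rng=random):
    """GA loop over LLM-generated code using a cheap fitness predictor.

    Assumed (hypothetical) interfaces:
      llm_generate(prompt) -> code        predict_pass(code) -> score in [0, 1]
      run_test_suite(code) -> bool        mutate(code) -> code
    """
    pop = [llm_generate(p) for p in seed_prompts]          # 1. initialize
    for _ in range(generations):
        scored = [(predict_pass(c), c) for c in pop]       # 2/6. cheap predictor calls
        def select():                                      # 3. tournament, k=2
            return max(rng.sample(scored, 2))[1]
        pop = [mutate(select()) for _ in range(len(pop))]  # 4/5. variation
    best = sorted(pop, key=predict_pass, reverse=True)[:top_k]
    return [c for c in best if run_test_suite(c)]          # 8. real tests, once
```

Every generation costs only predictor calls; the expensive run_test_suite is invoked just top_k times in total.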
Extra · Slide 20
Evolving Prompts for Test Generation
One of the most promising applications of GA + LLM is automated test generation. Instead of manually writing test cases, we evolve natural language descriptions that an LLM converts into executable tests.
The Testing Problem
Verifying the correctness of AI-generated code requires comprehensive tests. Manual test writing is slow — developers spend up to 50% of their time on testing. Automated test generation using coverage-guided evolution can explore the input space far more efficiently.
GA approach: evolve natural language descriptions of test scenarios → LLM generates executable test code → measure coverage → evolve toward higher coverage
Each Individual = A Test Description
A chromosome is a natural language description like:
"Test the sort function with a list containing duplicate elements and negative numbers, expecting them to be ordered ascending."
The LLM converts this to executable code. Fitness = lines of code covered by the generated test.
The Speedup
By automating the test description evolution, we replace hours of manual test writing with minutes of automated evolution. The GA systematically explores edge cases, boundary conditions, and error paths that humans often miss. Research has shown orders of magnitude speedup in achieving high coverage.
Extra · Slide 21
The EvoPrompt Pipeline
The evolutionary prompt optimization pipeline combines genetic algorithms with LLMs in an automated feedback loop. Each generation produces better prompts that lead to better generated code.
Generate Initial Descriptions
→
LLM → Test Code
→
Run & Measure Coverage
→
Select Best
→
Crossover + Mutate
→
Next Generation
1
Generate Initial Population
Create N random test descriptions (natural language) targeting different aspects of the method under test. Each description covers different input types, edge cases, or expected behaviors.
2
LLM Converts to Executable Tests
Feed each description to an LLM (e.g., GPT-4, Claude) with a prompt: "Convert this test description into an executable JUnit/pytest test for the following method." The LLM produces compilable test code.
3
Execute and Measure Fitness
Compile and run each generated test. Measure fitness using code coverage (line, branch, or mutation coverage). Tests that cover more code get higher fitness scores.
4
Selection
Use tournament selection to pick parent descriptions. Descriptions whose tests achieved higher coverage are more likely to be selected for reproduction.
5
Crossover & Mutation
Crossover: combine clauses from two descriptions (e.g., take input specification from one, expected behavior from another). Mutation: rephrase a clause, add a constraint ("with null input"), change a value ("negative numbers" → "very large numbers").
6
Repeat for N Generations
Continue evolving until coverage plateaus or a generation limit is reached. The best descriptions from the final generation produce the final test suite.
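Step 5 applied to natural-language chromosomes might look like this sketch; the comma-clause split and the substitution table are illustrative, and in practice an LLM would do the rephrasing:

```python
import random

def crossover_descriptions(d1, d2):
    """Clause-level crossover: input clause from one parent, expectation from the other."""
    head1, _, _ = d1.partition(", expecting")
    _, sep, tail2 = d2.partition(", expecting")
    return head1 + sep + tail2 if sep else d1

MUTATIONS = {  # illustrative rephrasings; an LLM would generate these in practice
    "duplicate elements": "very large numbers",
    "negative numbers": "null input",
}

def mutate_description(d, rng=random):
    """Swap one clause for a rephrased variant, if any pattern applies."""
    applicable = [old for old in MUTATIONS if old in d]
    if not applicable:
        return d
    old = rng.choice(applicable)
    return d.replace(old, MUTATIONS[old])

a = "Test sort with duplicate elements, expecting ascending order"
b = "Test sort with an empty list, expecting an empty result"
print(crossover_descriptions(a, b))
```

Because the chromosomes stay grammatical natural language, the LLM downstream can always turn the offspring into compilable test code.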
Extra · Slide 22
Interactive: Evolve Test Descriptions
See how evolutionary optimization improves test descriptions over generations. The method under test is a simple sorting function. Click "Evaluate" to score each description, then "Evolve" to create a new generation.
Test Description Evolver Interactive
Method Under Test
public static int[] sort(int[] arr) — Sorts an integer array in ascending order. Handles null inputs, empty arrays, single-element arrays, duplicates, and negative numbers.
GENERATION 1 — Population of 4 test descriptions
Extra · Slide 23
What Can Go Wrong: Honest Limitations
GA + LLM is a promising research direction, but intellectual honesty requires acknowledging significant open challenges.
Fitness Landscape Mismatch
The LLM predictor might not capture the true fitness landscape. Candidates that "look correct" to the predictor may fail actual tests. The approximate fitness surface may have different peaks and valleys than the real one, leading the GA to converge on solutions that are only apparently good.
Constraint Modeling
Real SE problems have complex constraints: performance requirements, memory limits, concurrency safety, backward compatibility. Modeling all of these as a single fitness function is extremely difficult. At the current state of practice, the assumptions needed are often unrealistic.
Representation Problem
How do you represent code as a "chromosome" for crossover? Naive string-level operations (cut at character 50, swap tails) almost always produce syntactically invalid code. Meaningful crossover requires understanding code structure — ASTs, function boundaries, logical blocks.
Deceptive Fitness
If the predictor consistently gets the same 5% wrong, the GA will exploit those blind spots. It will evolve solutions that game the predictor rather than solving the actual problem. This is analogous to reward hacking in reinforcement learning.
The Honest Assessment
GA + LLM is a promising research direction, not a production-ready solution. The gap between proxy fitness and true correctness is real and must be acknowledged. Modeling a real SE problem with all its actual constraints using GAs requires assumptions that are often unrealistic at the current state of practice.
Extra · Slide 24
When GA + LLM Makes Sense
A practical decision framework: when should you consider GA + LLM approaches, and when are simpler methods better?
Good Fit
✓
Large search space — too many candidates for exhaustive evaluation
✓
Slow test execution — minutes to hours per candidate
✓
Reliable pass/fail predictor exists or can be trained
✓
Error tolerance — some false positives are acceptable
✓
Clear objectives — optimizing a small number of measurable goals
Bad Fit
✗
Simple generation tasks — just use Chain-of-Thought prompting
✗
Fast test execution — just run the tests directly!
✗
Safety-critical code — zero error tolerance required
✗
Highly constrained problems — too many objectives to model
✗
No reliable fitness proxy — prediction accuracy too low
Rule of Thumb
Ask yourself: "Is the cost of evaluating one candidate so high that I can't afford to evaluate many?" If yes, fitness approximation with GAs may be worth the tradeoff. If each evaluation is fast (under a minute), just run the tests — the added complexity of GA + LLM is not justified.
Extra · Slide 25
Real-World Applications
The combination of genetic algorithms and LLMs is being applied across software engineering. Here are the key application areas where evolutionary approaches enhance AI-assisted development.
SBST + LLMs
Search-Based Software Testing combined with LLMs. Use GA to evolve test inputs while LLMs generate test oracles and assertions. Achieves higher coverage than either approach alone.
EvoSuite + LLM
Enhance EvoSuite's evolutionary test generation with LLM-generated assertions and meaningful variable names. The GA optimizes structure; the LLM provides semantic understanding.
Automated Program Repair
Evolve patches for buggy code using GA. Each chromosome encodes a set of code edits. The LLM suggests semantically meaningful patches; the GA searches for the combination that passes all tests.
GA-Guided Fuzzing
Use genetic algorithms to guide input generation for fuzz testing. Evolve inputs that trigger new code paths, crashes, or security vulnerabilities. Coverage feedback drives fitness.
Test Amplification
Start with existing developer tests, use GA to evolve amplified versions that cover additional branches and edge cases. The LLM transforms assertions while the GA maximizes coverage delta.
Prompt Optimization
Beyond testing: evolve prompts for any LLM task — code generation, documentation, refactoring. The GA discovers prompt structures that consistently produce higher-quality outputs from the model.
The Common Pattern
All these applications share the same architecture: a genetic algorithm optimizes the input to an LLM, and a measurable objective (coverage, test pass rate, compilation success) provides fitness feedback. The GA handles exploration; the LLM handles generation.
Extra · Slide 26
Research Frontiers: What's Next?
GA + LLM for code is an active and fast-moving research area. Here are the most promising directions.
Surrogate-Assisted Evolution
Train increasingly accurate fitness predictors as the GA runs. Each generation provides new training data (validated solutions), so the surrogate model improves over time. Early generations use rough approximations; later generations get near-exact predictions.
LLM-Guided Mutation
Instead of random string mutations (which usually break code), use LLMs to generate semantically meaningful mutations. Ask the LLM: "Modify this function to handle edge cases better" or "Rewrite this using a different algorithm." The LLM ensures mutations are syntactically valid.
Multi-Objective Optimization
Simultaneously optimize for correctness, performance, readability, and maintainability. Use Pareto-front approaches (NSGA-II) to find solutions that represent the best tradeoffs between competing objectives. No single "best" solution — a set of optimal tradeoffs.
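The Pareto-dominance test underlying NSGA-II is compact; the two-objective candidate tuples below are illustrative:

```python
def dominates(a, b):
    """a Pareto-dominates b if a is no worse on every objective and better on one.

    Objectives are tuples to maximize, e.g. (correctness, readability).
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep solutions not dominated by any other: the set of best tradeoffs."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [(0.9, 0.2), (0.7, 0.8), (0.6, 0.5), (0.95, 0.1)]
print(pareto_front(candidates))
```

Here (0.6, 0.5) is dominated by (0.7, 0.8) and drops out; the remaining three candidates are incomparable tradeoffs, so all survive.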
Coevolution
Evolve test cases alongside code solutions in parallel populations. Code evolves to pass tests; tests evolve to break code. This arms-race dynamic produces both robust code and comprehensive test suites simultaneously.
Course Connection
This is an active research area. Several of these topics could be the subject of your course projects or future research. If you're interested in pursuing any of these directions, talk to the instructor about current open problems and available datasets.
Extra · Slide 27
Interactive: Evolution with Fitness Approximation
Compare how a GA converges with exact fitness vs. approximate fitness. Toggle between modes to see the impact of prediction errors on convergence.
Evolution Simulator Interactive
MODE: EXACT FITNESS — Slow but accurate
Live stats: Gen (generation) · Best · Avg · Div (diversity) · FP (false positives), plus the top 5 individuals.
Extra · Slide 28
Connecting the Dots: From Evaluation to Optimization
This module completes a narrative arc across the course. Here is how the pieces fit together.
M3
Module 3: Evaluation Metrics
We learned evaluation metrics — BLEU, CodeBLEU, pass@k. We discovered that functional correctness is the gold standard but text similarity metrics are cheap proxies.
M5
Module 5: Prompting Strategies
We studied prompting strategies to generate better code — Chain-of-Thought, few-shot, role prompts. Better prompts → better code, but which prompt is best?
GA
This Module: Genetic Algorithms
GAs can SEARCH for the best prompt or solution across an enormous space. Instead of manually trying a few prompts, we evolve thousands.
Key
The Fitness Approximation Insight
Predict pass@k without running tests. Use evaluation metrics as cheap fitness proxies. Accept small error for massive speedup.
Evaluation Metrics
→
Prompting Strategies
→
GA Search
→
Fitness Approximation
→
Scalable Code Generation
The Big Picture
This is why we studied evaluation metrics before optimization — you can't optimize what you can't measure (or at least approximate). The future of AI-assisted software engineering combines all these pieces: LLMs generate code, evaluation metrics assess quality, GAs search for optimal solutions, and fitness approximation makes the search computationally feasible.
Extra · Slide 29
Interactive: Knowledge Check
Test your understanding of the key concepts covered in this module. Click each question to reveal the answer.
What is the role of the fitness function in a genetic algorithm?
The fitness function evaluates the quality of each candidate solution (individual). It assigns a numeric score that determines how likely an individual is to be selected for reproduction. The fitness function is the only problem-specific component of a GA — it encodes what "good" means for the particular optimization task. Without it, the GA has no way to distinguish better solutions from worse ones.
How does crossover differ from mutation in terms of purpose?
Crossover combines existing genetic material from two parents; mutation creates new genetic material. Crossover (recombination) mixes building blocks from two fit solutions, potentially creating offspring that inherit the best traits of both parents. Mutation introduces random changes that explore entirely new regions of the search space. Crossover is the main driver of exploitation; mutation is the main driver of exploration.
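The contrast can be made concrete with bitstrings. A sketch of one-point crossover (exploitation: recombining parent genes) versus bit-flip mutation (exploration: injecting new genes):

```python
import random

def one_point_crossover(p1, p2, rng):
    """Recombine existing genetic material from two parents."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(ind, rate, rng):
    """Flip each bit with probability `rate`, creating new material."""
    return [1 - g if rng.random() < rate else g for g in ind]

rng = random.Random(42)
a, b = [1] * 8, [0] * 8
c1, c2 = one_point_crossover(a, b, rng)
print(c1, c2)                 # offspring mix contiguous blocks of both parents
print(mutate(c1, 0.1, rng))   # occasional random flips
```

Note that crossover alone can never produce a gene absent from both parents; only mutation can.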
Why use a GA to evolve prompts instead of random search?
GAs exploit the structure of the search space through selection and crossover. Random search treats every candidate independently, with no learning from previous evaluations. A GA preserves and recombines traits from the best-performing prompts, so each generation starts from a better baseline. This guided search is dramatically more efficient — finding high-quality prompts in hundreds of evaluations instead of millions.
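A toy sketch of this guided search, with heavy assumptions: prompts are sets of phrases from a hypothetical pool, and the keyword-overlap `score` is a stand-in for the real fitness (pass@k on LLM-generated code), which would be far too slow to show here.

```python
import random

# Hypothetical phrase pool; real fitness would run the LLM, not match keywords.
PHRASES = ["think step by step", "write tests first", "be concise",
           "act as a senior engineer", "explain your reasoning",
           "handle edge cases", "use type hints", "optimize for clarity"]
TARGET = {"think step by step", "handle edge cases", "use type hints"}

def score(prompt):                         # toy stand-in for pass@k
    return len(set(prompt) & TARGET)

def evolve(generations=30, pop_size=10, rng=random.Random(1)):
    pop = [rng.sample(PHRASES, 3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]                      # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            child = list(dict.fromkeys(p1[:2] + p2[1:]))[:3]  # crossover
            if rng.random() < 0.3:                            # mutation
                child[rng.randrange(len(child))] = rng.choice(PHRASES)
            children.append(child)
        pop = survivors + children
    return max(pop, key=score)

best = evolve()
print(score(best), best)
```

Unlike random search, each generation here starts from the recombined traits of the previous winners rather than from scratch.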
What makes the GA + LLM combination achieve large speedups in test generation?
Automation of both test design (GA) and test implementation (LLM). The GA automates the creative process of designing test scenarios (what to test, which edge cases, which inputs), while the LLM automates the mechanical process of writing executable test code. Together, they replace two manual steps — thinking about tests and writing them — producing comprehensive test suites in minutes instead of hours.
Name two selection strategies and explain when you would use each.
Tournament Selection: randomly pick k individuals, select the fittest. Use when you want simple, efficient selection with easily adjustable pressure (change k). Best for most practical applications.
Roulette Wheel Selection: probability proportional to fitness. Use when fitness differences between individuals are meaningful and you want them directly reflected in selection probability. Avoid when one individual has much higher fitness than others (it would dominate).
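Both strategies above can be sketched in a few lines; note how roulette wheel's dominance problem shows up whenever one fitness value dwarfs the rest.

```python
import random

def tournament_select(pop, fitnesses, k, rng):
    """Pick k individuals at random, return the fittest.
    Larger k => stronger selection pressure."""
    contestants = rng.sample(range(len(pop)), k)
    return pop[max(contestants, key=lambda i: fitnesses[i])]

def roulette_select(pop, fitnesses, rng):
    """Selection probability proportional to fitness (assumes fitness > 0)."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for ind, f in zip(pop, fitnesses):
        running += f
        if running >= pick:
            return ind
    return pop[-1]

rng = random.Random(7)
pop, fits = ["a", "b", "c", "d"], [1.0, 2.0, 3.0, 10.0]
print(tournament_select(pop, fits, k=2, rng=rng))
print(roulette_select(pop, fits, rng=rng))  # "d" wins ~62% of spins
```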
What is a fitness landscape and why does it matter?
A fitness landscape is a conceptual surface mapping every possible solution to its fitness value. The shape of this landscape determines how difficult the optimization problem is and which GA parameters work best. Smooth landscapes with a single peak (unimodal) are easy. Rugged landscapes with many peaks of different heights (multi-modal) are hard because the GA can get trapped on local optima. Understanding the landscape informs parameter choices: rugged landscapes need higher mutation rates and larger populations to maintain diversity.
Why is fitness approximation necessary for GA + LLM code optimization?
Running test suites for every candidate in every generation is computationally infeasible. If test execution takes hours and we have hundreds of candidates across many generations, the total computation time can reach years. Fitness approximation replaces test execution with a fast predictor (an LLM call taking ~0.5s) that predicts pass/fail with ~95% accuracy. The tradeoff: we accept a small error rate to gain orders of magnitude speedup, then validate only the top candidates with actual tests at the end.
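A back-of-envelope calculation makes the gap concrete. The ~0.5 s predictor figure is from the slide; the 2-hour test run, 200 candidates, and 50 generations are illustrative assumptions.

```python
# Hypothetical GA run sizes (assumed for illustration).
candidates_per_gen = 200
generations = 50
evals = candidates_per_gen * generations        # 10,000 fitness evaluations

real_hours = evals * 2                          # 2 h per full test run
approx_hours = evals * 0.5 / 3600               # ~0.5 s per LLM prediction

print(f"real tests:  {real_hours:,} h (~{real_hours / 24 / 365:.1f} years)")
print(f"approximate: {approx_hours:.1f} h")
```

Under these assumptions the gap is roughly four orders of magnitude, which is why only the final top candidates get real test runs.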
What is "deceptive fitness" and why is it a concern with approximate fitness functions?
Deceptive fitness occurs when the fitness approximator consistently misjudges certain types of solutions. If the predictor always gives high scores to a particular pattern that actually fails real tests, the GA will evolve toward that deceptive pattern — essentially "gaming" the predictor rather than solving the real problem. This is analogous to reward hacking in reinforcement learning. Mitigation strategies include periodically validating a sample of candidates with real tests and retraining the predictor.
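The spot-check mitigation can be sketched as follows; the `spot_check` helper and both callables are hypothetical stand-ins, with real tests deliberately disagreeing with an over-optimistic predictor.

```python
import random

def spot_check(population, predict, run_real_tests, sample_size, rng):
    """Estimate predictor error on a random sample of candidates.
    A rising disagreement rate suggests the GA is gaming the predictor."""
    sample = rng.sample(population, min(sample_size, len(population)))
    wrong = sum(1 for ind in sample if predict(ind) != run_real_tests(ind))
    return wrong / len(sample)

# Toy demo: the predictor wrongly approves everything, while real
# tests pass only even-numbered candidates.
rng = random.Random(0)
pop = list(range(100))
err = spot_check(pop, lambda x: True, lambda x: x % 2 == 0, 20, rng)
print(f"predictor disagreement rate: {err:.0%}")
```

In a real loop this check would run every few generations, and a high rate would trigger retraining the predictor on the freshly validated sample.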
Extra · Slide 30
Summary & Key Takeaways
Genetic algorithms provide a powerful, gradient-free optimization framework that, when combined with LLMs, opens new possibilities for automated software engineering.
01
GAs Evolve Solutions
Genetic algorithms maintain a population of candidate solutions, using selection, crossover, and mutation to iteratively improve them across generations without requiring gradients.
02
Fitness Drives Everything
The fitness function is the only problem-specific component. For SE applications, fitness can be test pass rate, code coverage, compilation success, or any measurable quality metric.
03
Balance Explore vs Exploit
Mutation and large populations drive exploration of new solutions. Selection and crossover exploit known good solutions. The right balance depends on the problem landscape.
04
GA + LLM = Synergy
GAs evolve prompts and specifications; LLMs convert them to code and tests. This combination automates both the design and implementation of SE artifacts.
05
Fitness Approximation
When test execution is expensive, approximate fitness with an LLM predictor. Accept small error for orders-of-magnitude speedup. Validate only top candidates with real tests at the end.
06
Automated Test Evolution
Evolving test descriptions with coverage-guided fitness produces comprehensive test suites faster than manual writing, systematically exploring edge cases humans miss.
07
Honest Limitations
GA + LLM is promising but not production-ready. Fitness landscape mismatch, deceptive fitness, constraint modeling, and the representation problem remain open challenges.
08
SBSE Meets AI
Search-Based Software Engineering now leverages LLMs as code generators within evolutionary loops, enabling a new generation of automated testing, repair, and optimization tools.
09
Connecting the Dots
Evaluation metrics (Module 3) enable fitness functions. Prompting strategies (Module 5) generate candidates. GAs search the space. Fitness approximation makes it scalable.
Key References
Holland 1975
John Holland — "Adaptation in Natural and Artificial Systems," University of Michigan
EvoSuite
Fraser & Arcuri (2011) — Evolutionary test generation for Java
GenProg
Le Goues et al. (2012) — Automated program repair using genetic programming
SBSE Survey
Harman & Jones (2001) — Search-based software engineering fundamentals