Evolutionary computation meets AI-assisted software engineering. Learn how genetic algorithms can evolve prompts, test suites, and code specifications to dramatically improve the quality of LLM-generated software artifacts.
Learning Objectives
Objective 1
Understand the core mechanics of genetic algorithms: selection, crossover, mutation, and fitness evaluation
Objective 2
Apply GA concepts to real optimization problems using binary and real-valued encodings
Objective 3
Explain how GAs can evolve prompts and test descriptions for LLM-based code generation
Objective 4
Analyze the connection between search-based software testing and evolutionary prompt optimization
Objective 5
Evaluate the tradeoffs of GA parameters and their impact on convergence and solution quality
Key References
Holland 1975
John Holland — "Adaptation in Natural and Artificial Systems," foundational work on genetic algorithms
EvoSuite
Fraser & Arcuri (2011) — Search-based test generation for Java using evolutionary algorithms
EvoPrompt
Evolutionary prompt optimization for LLM-driven software engineering tasks
SBST
Search-Based Software Testing — applying metaheuristic search to test case generation
Extra · Slide 02
What Are Genetic Algorithms?
Genetic algorithms (GAs) are optimization techniques inspired by Charles Darwin's theory of natural selection. Instead of computing an exact solution, GAs evolve a population of candidate solutions over successive generations, guided by a fitness function.
Core Idea
Evolve, don't calculate. Start with a random population of possible solutions. Evaluate how "fit" each one is. Select the best, combine them, introduce small random changes, and repeat. Over many generations, the population converges toward optimal solutions.
Gradient-Based Optimization
Requires: a differentiable objective function
Method: compute gradients, follow the slope downhill
Strengths: fast convergence on smooth landscapes
Weaknesses: gets stuck in local optima; cannot handle discrete or non-differentiable problems
Brief History
John Holland (1975) formalized genetic algorithms in "Adaptation in Natural and Artificial Systems" at the University of Michigan. His Schema Theorem showed that GAs implicitly process short, low-order, high-fitness patterns (called schemata) at an exponential rate — providing the first theoretical explanation for why GAs work. Since then, GAs have been applied to scheduling, routing, design optimization, and — most recently — software engineering.
Evolutionary Optimization
Requires: only a fitness function (no gradients needed)
Method: selection, crossover, mutation on a population
Strengths: handles discrete, noisy, multi-modal search spaces
Weaknesses: slower convergence; requires more function evaluations
Extra · Slide 03
Core Concepts
Every genetic algorithm operates with the same fundamental building blocks. Understanding this vocabulary is essential before diving into the mechanics.
Population
A collection of candidate solutions (called individuals or chromosomes). The population evolves together over time. Typical sizes range from 20 to 200 individuals.
Gene
The smallest unit of a solution. A chromosome is a sequence of genes. In a binary GA, each gene is a single bit (0 or 1). In real-valued GAs, each gene is a floating-point number.
Chromosome
A complete candidate solution encoded as a string of genes. For example, the binary string [1,0,1,1,0] is a chromosome with 5 genes.
Fitness Function
A function that evaluates the quality of each individual. Higher fitness = better solution. This is the only problem-specific component of a GA — everything else is generic.
Generation
One complete cycle of evaluation, selection, crossover, and mutation. A GA typically runs for 50–500 generations, depending on problem complexity.
Genotype vs Phenotype
The genotype is the encoded representation (e.g., bits). The phenotype is the decoded solution (e.g., the actual number or structure it represents).
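In code, decoding a genotype to its phenotype is a small helper. This sketch assumes an MSB-first bit convention (the same one used in the worked x² example later in this module):

```python
def decode(genotype):
    """Map a bit-string genotype (MSB first) to its integer phenotype."""
    phenotype = 0
    for bit in genotype:
        phenotype = phenotype * 2 + bit
    return phenotype

# The genotype [1, 1, 0, 0, 0] encodes the phenotype 24
print(decode([1, 1, 0, 0, 0]))  # 24
```

The same chromosome could decode to a different phenotype under a different convention (e.g., LSB-first or Gray coding); the encoding is a design choice.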
GA vs GP — An Important Distinction
Genetic Algorithms (GA) optimize parameters — they evolve fixed-length encodings (bit strings, real vectors) that represent settings, configurations, or inputs. Genetic Programming (GP) evolves programs themselves — tree-structured representations of executable code. When we talk about "evolving code" later in this module, we primarily use GAs to optimize prompts and parameters that an LLM uses to generate code, rather than directly evolving code trees (GP). Both are part of the broader family of Evolutionary Computation.
Extra · Slide 04
The GA Lifecycle
A genetic algorithm follows a cyclical process that mimics natural evolution. Each cycle produces a new generation of candidate solutions that (on average) are fitter than the previous one.
Initialize Population
→
Evaluate Fitness
→
Selection
→
Crossover
→
Mutation
→
New Generation
↑ Repeat until termination condition met ↓
1
Initialize
Create a random population of N chromosomes. Each chromosome is a random candidate solution (e.g., random bit strings).
2
Evaluate
Compute the fitness of every individual using the fitness function. This is often the most computationally expensive step.
3
Select
Choose parent pairs based on fitness. Fitter individuals have a higher probability of being selected to produce offspring.
4
Crossover
Combine two parents to create offspring. Swap portions of parent chromosomes at random crossover points.
5
Mutate
Randomly alter genes in offspring with a small probability. This introduces new genetic material and prevents premature convergence.
6
Terminate?
Check stopping criteria: maximum generations reached, fitness threshold met, or population has converged (all individuals are similar).
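The six steps above can be sketched as one minimal, self-contained loop. Every choice here (population size, tournament size k=2, mutation rate, the count-the-ones fitness) is illustrative, not a prescribed configuration:

```python
import random

def evolve_onemax(pop_size=8, length=8, generations=60, mut_rate=0.125, seed=0):
    """Minimal GA sketch: evolve bit strings to maximize the count of 1-bits."""
    rng = random.Random(seed)
    fitness = sum                                       # fitness = number of 1-bits
    # 1. Initialize: random population
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):                        # 6. terminate after N generations
        children = [max(pop, key=fitness)]              # elitism: carry over the best
        while len(children) < pop_size:
            # 3. Select: tournament of size 2
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # 4. Crossover: single random cut point
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]
            # 5. Mutate: independent bit flips
            child = [1 - g if rng.random() < mut_rate else g for g in child]
            children.append(child)
        pop = children                                  # new generation
    return max(pop, key=fitness)

best = evolve_onemax()
print(best, sum(best))
```

With elitism, the best fitness is monotonically non-decreasing, so the loop reliably climbs toward the all-ones string.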
Extra · Slide 05
Selection Strategies
Selection determines which individuals get to reproduce. The goal is to bias reproduction toward fitter individuals while still maintaining diversity.
Roulette Wheel Selection
Each individual's probability of selection is proportional to its fitness. Like spinning a roulette wheel where fitter individuals occupy larger slices.
Tournament Selection
Randomly pick k individuals (the tournament size), then select the fittest among them. Repeat for each parent needed. Simple and efficient, with selection pressure adjustable by changing k. k=2 is a common default.
Rank-Based Selection
Sort individuals by fitness, then assign selection probability based on rank (not raw fitness). Avoids the problem of one super-fit individual dominating the population. Individual ranked #1 gets the highest probability.
Elitism
Always copy the top k individuals directly into the next generation (unchanged). This guarantees the best solution found so far is never lost. Typically k=1 or k=2. Elitism is almost always used in practice and significantly improves convergence.
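The two most common strategies fit in a few lines each; the three-bit population at the bottom is purely illustrative:

```python
import random

def tournament_select(pop, fitness, k=2, rng=random):
    """Pick k random individuals and return the fittest; larger k = more pressure."""
    return max(rng.sample(pop, k), key=fitness)

def roulette_select(pop, fitness, rng=random):
    """Selection probability proportional to fitness (assumes non-negative scores)."""
    total = sum(fitness(ind) for ind in pop)
    spin = rng.uniform(0, total)          # where the wheel stops
    running = 0.0
    for ind in pop:
        running += fitness(ind)
        if running >= spin:
            return ind
    return pop[-1]                        # guard against floating-point rounding

population = [[1, 1, 0], [0, 0, 0], [1, 1, 1]]
print(tournament_select(population, sum, k=3))  # k = population size: always the fittest
```

Note that with k equal to the population size, tournament selection degenerates into always picking the global best, which is why small k (2 or 3) is preferred in practice.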
Extra · Slide 06
Crossover (Recombination)
Crossover combines genetic material from two parents to create offspring. It is the primary driver of exploration in the search space, allowing the GA to combine good building blocks from different solutions.
Single-Point Crossover
Choose a random crossover point. Take genes from Parent A up to that point, and from Parent B after that point.
Uniform Crossover
For each gene position, flip a coin to decide which parent contributes. Each gene is inherited independently, giving maximum mixing of genetic material.
// Uniform crossover (coin flip per gene)
Parent A: [1, 0, 1, 1, 0, 1]
Parent B: [0, 1, 0, 0, 1, 1]
Coin: A B A B B A
Child: [1, 1, 1, 0, 1, 1]
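Both crossover variants are a few lines of code. This sketch reuses the six-gene parents from the example above:

```python
import random

def single_point(p1, p2, rng=random):
    """Cut both parents at one random point and swap the tails."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform(p1, p2, rng=random):
    """Coin flip per gene: each position is inherited independently."""
    return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]

a = [1, 0, 1, 1, 0, 1]
b = [0, 1, 0, 0, 1, 1]
print(single_point(a, b, random.Random(0)))
print(uniform(a, b, random.Random(0)))
```

Single-point crossover preserves contiguous blocks of genes (good when neighboring genes interact); uniform crossover mixes maximally.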
Extra · Slide 07
Mutation
Mutation introduces random changes to individual genes. While crossover combines existing solutions, mutation creates genuinely new genetic material — essential for escaping local optima and maintaining population diversity.
Bit-Flip Mutation
For binary chromosomes: flip a random bit from 0 to 1, or 1 to 0. Each gene has a small probability (typically 1/L where L is chromosome length) of being flipped.
Without mutation: the population can only recombine existing genes. If a critical gene value is missing from the initial population, it can never be discovered.
With mutation: any point in the search space can eventually be reached, guaranteeing theoretical completeness. Mutation rate must be balanced — too low limits exploration, too high becomes random search.
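Bit-flip mutation with the 1/L default rate mentioned above can be sketched as:

```python
import random

def bit_flip(chrom, rate=None, rng=random):
    """Flip each gene independently with probability rate (default 1/L)."""
    if rate is None:
        rate = 1.0 / len(chrom)          # classic choice: one expected flip per child
    return [1 - g if rng.random() < rate else g for g in chrom]

print(bit_flip([1, 1, 1, 1, 1, 1, 1, 1], rng=random.Random(42)))
```

With rate 1/L, each offspring has on average exactly one flipped gene, which perturbs solutions without degenerating into random search.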
Extra · Slide 08
Interactive: Build a GA Step by Step
Walk through one generation of a genetic algorithm. Start with a population of 6 binary chromosomes (8 bits each), then select, crossover, and mutate to create a new generation.
GA Step-by-Step Builder Interactive
Fitness = count of 1-bits (goal: maximize ones). Click the buttons in order to evolve one generation.
Extra · Slide 09
Fitness Landscapes
A fitness landscape maps every candidate solution to its fitness. The shape determines how hard the problem is and how GAs navigate it.
[Diagram: fitness landscape showing fit individuals near peaks, one individual trapped at a local optimum, low-fitness valleys, and a mutation escape path.]
Local Optima Trap
A local optimum is better than all its neighbors, but worse than the global best. Gradient methods get stuck here permanently. GAs escape via mutation (the red dashed arrow above) and population diversity. This is why even seemingly harmful mutations are valuable.
Core Tradeoff: Explore (mutation, large populations) vs. Exploit (selection, crossover)
Exploration vs Exploitation
Exploration = searching new regions (mutation, large populations). Exploitation = refining known solutions (selection, crossover). Too much exploration = random search. Too much exploitation = premature convergence.
Extra · Slide 10
Complete Example: Maximizing f(x) = x²
Walk through a simple GA that maximizes f(x) = x² for x in [0, 31], using 5-bit binary encoding. Each chromosome is a 5-bit string representing an integer from 0 to 31.
1
Encoding & Initial Population
5 bits encode integers 0–31. We create 4 random individuals:
// Generation 0
Ind. 1: [0,1,1,0,1] → x=13 → f = 169
Ind. 2: [1,1,0,0,0] → x=24 → f = 576
Ind. 3: [0,1,0,0,0] → x=8 → f = 64
Ind. 4: [1,0,0,1,0] → x=18 → f = 324
2
Roulette Wheel Selection
Individual 2 (fitness 576) gets 50.8% of the wheel (576 of the total fitness 1,133). It is most likely to be selected as a parent. After spinning twice, we select Parents: Ind. 2 & Ind. 4.
3
Crossover (point = 3)
Single-point crossover at position 3:
Parent A (Ind.2): [1,1,0,0,0]
Parent B (Ind.4): [1,0,0,1,0]
^-- cut
Child 1: [1,1,0,1,0] → x=26
Child 2: [1,0,0,0,0] → x=16
4
Mutation (bit 4 of Child 1)
Flip bit 4 (zero-indexed, the last bit) of Child 1: [1,1,0,1,0] → [1,1,0,1,1] → x=27
5
Generation 1 Results
The new population (with elitism keeping Ind. 2):
// Generation 1
Ind. 2 (elite): [1,1,0,0,0] → x=24 → f = 576
Child 1 (mut): [1,1,0,1,1] → x=27 → f = 729 ↑ NEW BEST!
Child 2: [1,0,0,0,0] → x=16 → f = 256
Ind. 1: [0,1,1,0,1] → x=13 → f = 169
Best fitness improved: 576 → 729
Convergence
After a few more generations, the population converges toward x=31 ([1,1,1,1,1]) with f(31) = 961, the global maximum. The GA found the optimum without ever computing a derivative.
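The whole worked example can be reproduced end to end. This sketch uses roulette-wheel selection and elitism as in the walkthrough; the seed and parameter values are arbitrary, so individual runs differ, but elitism keeps the best fitness monotonically non-decreasing:

```python
import random

def decode(bits):
    return int("".join(map(str, bits)), 2)   # genotype -> integer phenotype

def fitness(bits):
    return decode(bits) ** 2                 # f(x) = x^2, maximized over [0, 31]

def roulette(pop, rng):
    """Spin the wheel: selection probability proportional to fitness."""
    total = sum(fitness(ind) for ind in pop)
    spin = rng.uniform(0, total)
    running = 0.0
    for ind in pop:
        running += fitness(ind)
        if running >= spin:
            return ind
    return pop[-1]

def run(pop_size=4, n_bits=5, generations=60, seed=3):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = [max(pop, key=fitness)]                 # elitism keeps the best
        while len(nxt) < pop_size:
            p1, p2 = roulette(pop, rng), roulette(pop, rng)
            cut = rng.randrange(1, n_bits)            # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < 1 / n_bits else g
                     for g in child]                  # bit-flip mutation
            nxt.append(child)
        pop = nxt
    return decode(max(pop, key=fitness))

print(run())   # converges toward the global maximum x = 31
```

No derivative of f(x) = x² is ever computed; the GA only compares fitness values.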
Extra · Slide 11
GA Parameters & Tuning
The behavior of a genetic algorithm is controlled by several key parameters. Choosing the right values is critical — poor settings can lead to premature convergence or wasted computation.
Parameter
Typical Range
Too Low
Too High
Recommendation
Population Size
20–200
Low diversity, premature convergence
Slow evaluation, wasted computation
Start with 50–100; increase for complex problems
Crossover Rate
0.6–0.9
Limited recombination, slow progress
Disrupts good solutions too often
0.8 is a safe default
Mutation Rate
0.01–0.05
Cannot escape local optima
Becomes random search, destroys good solutions
1/L (L = chromosome length) is a classic choice
Generations
50–500
May not converge
Diminishing returns, wasted time
Use convergence detection to stop early
Selection Pressure
Tournament k=2–7
Near-random selection, no fitness bias
Only top individuals reproduce, loss of diversity
k=2 or k=3 balances exploration and exploitation
Elitism Count
1–5
Best solution can be lost
Reduces population diversity
k=1 or k=2; always use elitism
Key Insight
There is no universally optimal parameter set. The best parameters depend on the fitness landscape of the specific problem. Multi-modal landscapes with many local optima need higher mutation rates and larger populations. Smooth, unimodal landscapes can converge quickly with smaller populations and lower mutation.
Extra · Slide 12
Interactive: GA Evolution Visualizer
Watch a genetic algorithm evolve in real time. See how a population of candidates is evaluated, how parents are selected, how crossover and mutation create offspring, and how the population improves over generations.
Before diving into GA + LLM architectures, let's understand why GAs are uniquely suited to the problem of generating high-quality code.
The Search Problem
Generating optimal code is searching through an enormous space of possible programs. Traditional approaches fall short:
Manual prompting: try a few prompts by hand — inefficient, misses vast regions of the space
Grid search: systematically vary parameters — exponential cost as dimensions grow
Random sampling: generate many candidates blindly — wasteful, no learning between attempts
10^50+
Possible Programs (estimated)
~5
Manual Attempts (typical)
Nature's Search Algorithm
GAs efficiently explore large spaces using evolution:
Maintain a diverse population — don't put all eggs in one basket
Keep what works — selection preserves fit solutions
Combine good parts — crossover merges building blocks
Occasionally try something new — mutation explores the unexpected
Key Advantage
GAs are interesting because they can optimize objectives that are difficult to formalize mathematically — like "good code quality." You don't need a differentiable loss function. You just need a way to compare: is solution A better than solution B?
Extra · Slide 14
The Evaluation Bottleneck
The critical challenge in using GAs for code generation: how do you know if generated code is actually good?
1
LLM Generates Code Candidate
An LLM produces a function, class, or module based on a prompt or specification.
2
To Know If It's Good, We Need Functional Correctness
Does the code actually work? Does it handle edge cases? Does it produce correct output for all inputs?
3
Functional Correctness = Run Test Suite
The gold standard: compile the code and execute it against a comprehensive test suite. This is the only way to know if code works.
4
Industrial Test Suites Can Take HOURS per Candidate
Integration tests, system tests, performance benchmarks — real-world test suites are slow and resource-intensive.
5
1000 Candidates × Hours = Impossible
If we want to evaluate a large population across many generations, exhaustive testing is computationally infeasible.
CodeBLEU
Measures text similarity · ~0.1 s per eval · does NOT tell you if code works
Test Exec
Measures correctness · minutes to hours per eval · the gold standard
LLM Predictor
Predicts pass/fail · ~0.5 s per eval · ~95% accurate
The Core Tension
Surface-level metrics like CodeBLEU are cheap but NOT sufficient. A function can look perfect and still crash. Functional correctness remains the gold standard for code evaluation — but it is expensive. This tension is what makes the next insight so powerful.
Extra · Slide 15
Fitness Approximation: The Key Insight
What if we could PREDICT whether code passes tests without actually running them?
Traditional GA Fitness
Run tests for every candidate → accurate but SLOW.
Approximate Fitness
If a binary classifier can predict pass/fail with 95% accuracy, we tolerate the 5% error because the alternative — running all tests — is computationally infeasible at scale. Accept 5% error to gain up to five orders of magnitude of speedup. This is the fundamental insight behind fitness approximation in evolutionary code optimization.
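A back-of-envelope calculation makes the tradeoff concrete. The numbers here (200 individuals, 100 generations, 30-minute test runs, 0.5 s predictor calls, top-5 final validation) are illustrative:

```python
# Back-of-envelope: exact vs. approximate fitness evaluation (illustrative numbers).
pop, gens = 200, 100            # 20,000 fitness evaluations in total
test_minutes = 30               # exact fitness: run the real test suite
predict_seconds = 0.5           # approximate fitness: one predictor call
top_k = 5                       # real tests run only on the final top-K candidates

exact_hours = pop * gens * test_minutes / 60
approx_hours = pop * gens * predict_seconds / 3600 + top_k * test_minutes / 60
print(f"exact: {exact_hours:,.0f} h  approximate: {approx_hours:.1f} h  "
      f"speedup: {exact_hours / approx_hours:,.0f}x")
```

Even with a modest 30-minute suite, exhaustive evaluation takes on the order of 10,000 hours, while the approximate loop plus one final validation pass finishes in a few hours.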
Extra · Slide 16
Interactive: Fitness Approximation Tradeoff
Explore the tradeoff between exact and approximate fitness evaluation. Adjust the parameters to see how population size, generations, test time, and predictor accuracy affect total computation time and error rates.
Fitness Approximation Calculator Interactive
Population Size: 200
Generations: 100
Test Execution Time (minutes): 30
Predictor Accuracy (%): 95
Adjust the sliders and click "Calculate" to compare exact vs. approximate fitness evaluation.
Extra · Slide 17
From Bit Strings to Code
Now comes the key transition: applying genetic algorithm principles not to binary optimization problems, but to software engineering artifacts. The challenge is representing SE artifacts as "chromosomes" that can be evolved.
The Representation Problem
In traditional GAs, chromosomes are bit strings or numeric vectors. But software engineering artifacts — prompts, test cases, code specifications — are structured natural language or code. We need creative encodings that allow crossover and mutation to produce meaningful offspring.
What Can We Evolve?
Prompts: natural language instructions to an LLM
Test descriptions: specifications of what to test
Input parameters: values fed to test cases
Code templates: structural skeletons with variable parts
Configuration files: build/deploy settings
Traditional SE + GA
Search-Based Software Engineering (SBSE) has used evolutionary algorithms since the 2000s. EvoSuite evolves Java test suites to maximize code coverage. GenProg evolves program patches to fix bugs. The new frontier: combining these techniques with LLMs.
// Traditional GA chromosome (bit string)
[1, 0, 1, 1, 0, 1, 0, 0]
// SE chromosome: a test description
"Test that sorting an empty list returns an empty list without error"
// SE chromosome: a prompt for an LLM
"Write a function that validates email addresses using regex, handling edge cases like + aliases"
Extra · Slide 18
GA + LLMs: The Key Insight
The breakthrough idea: use LLMs as a "development function" and genetic algorithms as the optimizer that searches for the best inputs (prompts, specifications) to that function.
GA evolves prompts
→
LLM generates code
→
Tests evaluate code
→
Fitness = test results
→
Evolve better prompts
The Architecture
Chromosome = a natural language prompt or specification
Fitness function = compile the LLM output, run tests, measure coverage/correctness
Crossover = combine parts of two prompts (sentence-level recombination)
Mutation = rephrase, add detail, change keywords, adjust constraints
Why Not Just Prompt Engineering?
Manual prompt engineering requires human intuition and trial-and-error. A GA can explore thousands of prompt variations systematically, discovering phrasings and structures that no human would think to try. It optimizes the prompt for a measurable objective, not subjective quality.
Fitness Functions for Code
Compilability: does the generated code compile? (+1 point)
Test pass rate: what fraction of tests pass? (0.0–1.0)
Code coverage: what % of target code is exercised? (0–100%)
Mutation score: does the test suite detect injected faults?
Composite: weighted combination of the above
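A composite fitness might be sketched as follows; the weights and the zero-on-compile-failure rule are illustrative choices, not a standard:

```python
def composite_fitness(compiles, pass_rate, coverage, mutation_score,
                      weights=(0.2, 0.4, 0.2, 0.2)):
    """Weighted combination of the fitness signals listed above.

    compiles is a bool; pass_rate, coverage, and mutation_score are in [0, 1].
    The weights are illustrative and should be tuned per project.
    """
    if not compiles:
        return 0.0              # non-compiling code scores zero outright
    w_compile, w_pass, w_cov, w_mut = weights
    return (w_compile * 1.0 + w_pass * pass_rate
            + w_cov * coverage + w_mut * mutation_score)

print(round(composite_fitness(True, pass_rate=0.9, coverage=0.75,
                              mutation_score=0.5), 2))   # 0.81
```

Gating on compilability first keeps the GA from rewarding fluent-looking code that doesn't even build.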
GA
Optimizes Prompts
LLM
Generates Code
Tests
Measure Fitness
Extra · Slide 19
GA + LLM: The Architecture
Here is the full pipeline when we combine genetic algorithms with LLM-based fitness approximation. The key: expensive test execution happens only ONCE at the end.
1
Initialize Population
Generate N code solutions using the LLM with different prompts, temperatures, or parameter configurations. Each solution is an "individual."
2
Evaluate Fitness (CHEAP)
Call the LLM predictor for each individual — predict pass@1: 0 or 1. This is just an API call (~0.5s), NOT test execution (hours).
3
Selection
Tournament selection — pick the fittest individuals based on predicted fitness scores.
4
Crossover
Combine parts of good solutions — e.g., take function A's algorithm + function B's error handling. For prompt evolution: merge clauses from two high-fitness prompts.
5
Mutation
Randomly modify candidates: rename variables, restructure logic, change data structures, rephrase prompt clauses, add or remove constraints.
6
Re-evaluate New Population (CHEAP)
Again, just LLM predictor calls — NO test execution. Each new individual costs seconds, not hours.
7
Repeat Until Convergence
Continue for N generations or until fitness plateaus. The entire evolutionary loop uses only approximate fitness.
8
Final Validation: Run ACTUAL Tests
Run the real test suite only on the top-K best candidates from the final generation. This is the only time we pay the full cost of test execution.
The Key Insight
The expensive test execution happens ONCE at the end, on a handful of candidates, instead of at EVERY generation for EVERY individual. For each new individual during evolution, there is only the call to the LLM (seconds) but NOT the execution of test cases (hours).
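The eight steps can be condensed into one sketch. The helpers llm_generate, predict_pass, run_test_suite, and mutate are hypothetical stand-ins rather than real APIs, and crossover is omitted for brevity; the point is that the real tests run once, at the end:

```python
import random

def evolve_with_predictor(seed_prompts, llm_generate, predict_pass,
                          run_test_suite, mutate, generations=20,
                          top_k=3, rng=random):
    """GA loop over LLM-generated code using a cheap fitness predictor.

    Assumed (hypothetical) interfaces:
      llm_generate(prompt) -> code        predict_pass(code) -> score in [0, 1]
      run_test_suite(code) -> bool        mutate(code) -> code
    """
    pop = [llm_generate(p) for p in seed_prompts]          # 1. initialize
    for _ in range(generations):
        scored = [(predict_pass(c), c) for c in pop]       # 2/6. cheap predictor calls
        def select():                                      # 3. tournament, k=2
            return max(rng.sample(scored, 2))[1]
        pop = [mutate(select()) for _ in range(len(pop))]  # 4/5. variation
    best = sorted(pop, key=predict_pass, reverse=True)[:top_k]
    return [c for c in best if run_test_suite(c)]          # 8. real tests, once
```

Every generation costs only predictor calls; the expensive run_test_suite is invoked just top_k times in total.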
Extra · Slide 20
Evolving Prompts for Test Generation
One of the most promising applications of GA + LLM is automated test generation. Instead of manually writing test cases, we evolve natural language descriptions that an LLM converts into executable tests.
The Testing Problem
Verifying the correctness of AI-generated code requires comprehensive tests. Manual test writing is slow — developers spend up to 50% of their time on testing. Automated test generation using coverage-guided evolution can explore the input space far more efficiently.
GA approach: evolve natural language descriptions of test scenarios → LLM generates executable test code → measure coverage → evolve toward higher coverage
Each Individual = A Test Description
A chromosome is a natural language description like:
"Test the sort function with a list containing duplicate elements and negative numbers, expecting them to be ordered ascending."
The LLM converts this to executable code. Fitness = lines of code covered by the generated test.
The Speedup
By automating the test description evolution, we replace hours of manual test writing with minutes of automated evolution. The GA systematically explores edge cases, boundary conditions, and error paths that humans often miss. Research has shown orders of magnitude speedup in achieving high coverage.
Extra · Slide 21
The EvoPrompt Pipeline
The evolutionary prompt optimization pipeline combines genetic algorithms with LLMs in an automated feedback loop. Each generation produces better prompts that lead to better generated code.
Generate Initial Descriptions
→
LLM → Test Code
→
Run & Measure Coverage
→
Select Best
→
Crossover + Mutate
→
Next Generation
1
Generate Initial Population
Create N random test descriptions (natural language) targeting different aspects of the method under test. Each description covers different input types, edge cases, or expected behaviors.
2
LLM Converts to Executable Tests
Feed each description to an LLM (e.g., GPT-4, Claude) with a prompt: "Convert this test description into an executable JUnit/pytest test for the following method." The LLM produces compilable test code.
3
Execute and Measure Fitness
Compile and run each generated test. Measure fitness using code coverage (line, branch, or mutation coverage). Tests that cover more code get higher fitness scores.
4
Selection
Use tournament selection to pick parent descriptions. Descriptions whose tests achieved higher coverage are more likely to be selected for reproduction.
5
Crossover & Mutation
Crossover: combine clauses from two descriptions (e.g., take input specification from one, expected behavior from another). Mutation: rephrase a clause, add a constraint ("with null input"), change a value ("negative numbers" → "very large numbers").
6
Repeat for N Generations
Continue evolving until coverage plateaus or a generation limit is reached. The best descriptions from the final generation produce the final test suite.
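Step 5 applied to natural-language chromosomes might look like this sketch; the comma-clause split and the substitution table are illustrative, and in practice an LLM would do the rephrasing:

```python
import random

def crossover_descriptions(d1, d2):
    """Clause-level crossover: input clause from one parent, expectation from the other."""
    head1, _, _ = d1.partition(", expecting")
    _, sep, tail2 = d2.partition(", expecting")
    return head1 + sep + tail2 if sep else d1

MUTATIONS = {  # illustrative rephrasings; an LLM would generate these in practice
    "duplicate elements": "very large numbers",
    "negative numbers": "null input",
}

def mutate_description(d, rng=random):
    """Swap one clause for a rephrased variant, if any pattern applies."""
    applicable = [old for old in MUTATIONS if old in d]
    if not applicable:
        return d
    old = rng.choice(applicable)
    return d.replace(old, MUTATIONS[old])

a = "Test sort with duplicate elements, expecting ascending order"
b = "Test sort with an empty list, expecting an empty result"
print(crossover_descriptions(a, b))
```

Because the chromosomes stay grammatical natural language, the LLM downstream can always turn the offspring into compilable test code.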
Extra · Slide 22
Interactive: Evolve Test Descriptions
See how evolutionary optimization improves test descriptions over generations. The method under test is a simple sorting function. Click "Evaluate" to score each description, then "Evolve" to create a new generation.
Test Description Evolver Interactive
Method Under Test
public static int[] sort(int[] arr) — Sorts an integer array in ascending order. Handles null inputs, empty arrays, single-element arrays, duplicates, and negative numbers.
GENERATION 1 — Population of 4 test descriptions
Extra · Slide 23
What Can Go Wrong: Honest Limitations
GA + LLM is a promising research direction, but intellectual honesty requires acknowledging significant open challenges.
Fitness Landscape Mismatch
The LLM predictor might not capture the true fitness landscape. Candidates that "look correct" to the predictor may fail actual tests. The approximate fitness surface may have different peaks and valleys than the real one, leading the GA to converge on solutions that are only apparently good.
Constraint Modeling
Real SE problems have complex constraints: performance requirements, memory limits, concurrency safety, backward compatibility. Modeling all of these as a single fitness function is extremely difficult. At the current state of practice, the assumptions needed are often unrealistic.
Representation Problem
How do you represent code as a "chromosome" for crossover? Naive string-level operations (cut at character 50, swap tails) almost always produce syntactically invalid code. Meaningful crossover requires understanding code structure — ASTs, function boundaries, logical blocks.
Deceptive Fitness
If the predictor consistently gets the same 5% wrong, the GA will exploit those blind spots. It will evolve solutions that game the predictor rather than solving the actual problem. This is analogous to reward hacking in reinforcement learning.
The Honest Assessment
GA + LLM is a promising research direction, not a production-ready solution. The gap between proxy fitness and true correctness is real and must be acknowledged. Modeling a real SE problem with all its actual constraints using GAs requires assumptions that are often unrealistic at the current state of practice.
Extra · Slide 24
When GA + LLM Makes Sense
A practical decision framework: when should you consider GA + LLM approaches, and when are simpler methods better?
Good Fit
✓
Large search space — too many candidates for exhaustive evaluation
✓
Slow test execution — minutes to hours per candidate
✓
Reliable pass/fail predictor exists or can be trained
✓
Error tolerance — some false positives are acceptable
✓
Clear objectives — optimizing a small number of measurable goals
Bad Fit
✗
Simple generation tasks — just use Chain-of-Thought prompting
✗
Fast test execution — just run the tests directly!
✗
Safety-critical code — zero error tolerance required
✗
Highly constrained problems — too many objectives to model
✗
No reliable fitness proxy — prediction accuracy too low
Rule of Thumb
Ask yourself: "Is the cost of evaluating one candidate so high that I can't afford to evaluate many?" If yes, fitness approximation with GAs may be worth the tradeoff. If each evaluation is fast (under a minute), just run the tests — the added complexity of GA + LLM is not justified.
Extra · Slide 25
Real-World Applications
The combination of genetic algorithms and LLMs is being applied across software engineering. Here are the key application areas where evolutionary approaches enhance AI-assisted development.
SBST + LLMs
Search-Based Software Testing combined with LLMs. Use GA to evolve test inputs while LLMs generate test oracles and assertions. Achieves higher coverage than either approach alone.
EvoSuite + LLM
Enhance EvoSuite's evolutionary test generation with LLM-generated assertions and meaningful variable names. The GA optimizes structure; the LLM provides semantic understanding.
Automated Program Repair
Evolve patches for buggy code using GA. Each chromosome encodes a set of code edits. The LLM suggests semantically meaningful patches; the GA searches for the combination that passes all tests.
GA-Guided Fuzzing
Use genetic algorithms to guide input generation for fuzz testing. Evolve inputs that trigger new code paths, crashes, or security vulnerabilities. Coverage feedback drives fitness.
Test Amplification
Start with existing developer tests, use GA to evolve amplified versions that cover additional branches and edge cases. The LLM transforms assertions while the GA maximizes coverage delta.
Prompt Optimization
Beyond testing: evolve prompts for any LLM task — code generation, documentation, refactoring. The GA discovers prompt structures that consistently produce higher-quality outputs from the model.
The Common Pattern
All these applications share the same architecture: a genetic algorithm optimizes the input to an LLM, and a measurable objective (coverage, test pass rate, compilation success) provides fitness feedback. The GA handles exploration; the LLM handles generation.
Extra · Slide 26
Research Frontiers: What's Next?
GA + LLM for code is an active and fast-moving research area. Here are the most promising directions.
Surrogate-Assisted Evolution
Train increasingly accurate fitness predictors as the GA runs. Each generation provides new training data (validated solutions), so the surrogate model improves over time. Early generations use rough approximations; later generations get near-exact predictions.
LLM-Guided Mutation
Instead of random string mutations (which usually break code), use LLMs to generate semantically meaningful mutations. Ask the LLM: "Modify this function to handle edge cases better" or "Rewrite this using a different algorithm." The LLM ensures mutations are syntactically valid.
Multi-Objective Optimization
Simultaneously optimize for correctness, performance, readability, and maintainability. Use Pareto-front approaches (NSGA-II) to find solutions that represent the best tradeoffs between competing objectives. No single "best" solution — a set of optimal tradeoffs.
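The Pareto-dominance test underlying NSGA-II is compact; the two-objective candidate tuples below are illustrative:

```python
def dominates(a, b):
    """a Pareto-dominates b if a is no worse on every objective and better on one.

    Objectives are tuples to maximize, e.g. (correctness, readability).
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep solutions not dominated by any other: the set of best tradeoffs."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [(0.9, 0.2), (0.7, 0.8), (0.6, 0.5), (0.95, 0.1)]
print(pareto_front(candidates))
```

Here (0.6, 0.5) is dominated by (0.7, 0.8) and drops out; the remaining three candidates are incomparable tradeoffs, so all survive.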
Coevolution
Evolve test cases alongside code solutions in parallel populations. Code evolves to pass tests; tests evolve to break code. This arms-race dynamic produces both robust code and comprehensive test suites simultaneously.
Course Connection
This is an active research area. Several of these topics could be the subject of your course projects or future research. If you're interested in pursuing any of these directions, talk to the instructor about current open problems and available datasets.
Extra · Slide 27
Interactive: Evolution with Fitness Approximation
Compare how a GA converges with exact fitness vs. approximate fitness. Toggle between modes to see the impact of prediction errors on convergence.
Evolution Simulator Interactive
MODE: EXACT FITNESS — Slow but accurate
Live stats: Gen (generation) · Best · Avg · Div (diversity) · FP (false positives), plus the top 5 individuals.
Extra · Slide 28
Connecting the Dots: From Evaluation to Optimization
This module completes a narrative arc across the course. Here is how the pieces fit together.
M3
Module 3: Evaluation Metrics
We learned evaluation metrics — BLEU, CodeBLEU, pass@k. We discovered that functional correctness is the gold standard but text similarity metrics are cheap proxies.
M5
Module 5: Prompting Strategies
We studied prompting strategies to generate better code — Chain-of-Thought, few-shot, role prompts. Better prompts → better code, but which prompt is best?
GA
This Module: Genetic Algorithms
GAs can SEARCH for the best prompt or solution across an enormous space. Instead of manually trying a few prompts, we evolve thousands.
Key
The Fitness Approximation Insight
Predict pass@k without running tests. Use evaluation metrics as cheap fitness proxies. Accept small error for massive speedup.
Evaluation Metrics
→
Prompting Strategies
→
GA Search
→
Fitness Approximation
→
Scalable Code Generation
The Big Picture
This is why we studied evaluation metrics before optimization — you can't optimize what you can't measure (or at least approximate). The future of AI-assisted software engineering combines all these pieces: LLMs generate code, evaluation metrics assess quality, GAs search for optimal solutions, and fitness approximation makes the search computationally feasible.
Extra · Slide 29
Interactive: Knowledge Check
Test your understanding of the key concepts covered in this module. Click each question to reveal the answer.
What is the role of the fitness function in a genetic algorithm?
The fitness function evaluates the quality of each candidate solution (individual). It assigns a numeric score that determines how likely an individual is to be selected for reproduction. The fitness function is the only problem-specific component of a GA — it encodes what "good" means for the particular optimization task. Without it, the GA has no way to distinguish better solutions from worse ones.
How does crossover differ from mutation in terms of purpose?
Crossover combines existing genetic material from two parents; mutation creates new genetic material. Crossover (recombination) mixes building blocks from two fit solutions, potentially creating offspring that inherit the best traits of both parents. Mutation introduces random changes that explore entirely new regions of the search space. Crossover is the main driver of exploitation; mutation is the main driver of exploration.
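The contrast can be made concrete with bitstrings. A sketch of one-point crossover (exploitation: recombining parent genes) versus bit-flip mutation (exploration: injecting new genes):

```python
import random

def one_point_crossover(p1, p2, rng):
    """Recombine existing genetic material from two parents."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(ind, rate, rng):
    """Flip each bit with probability `rate`, creating new material."""
    return [1 - g if rng.random() < rate else g for g in ind]

rng = random.Random(42)
a, b = [1] * 8, [0] * 8
c1, c2 = one_point_crossover(a, b, rng)
print(c1, c2)                 # offspring mix contiguous blocks of both parents
print(mutate(c1, 0.1, rng))   # occasional random flips
```

Note that crossover alone can never produce a gene absent from both parents; only mutation can.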
Why use a GA to evolve prompts instead of random search?
GAs exploit the structure of the search space through selection and crossover. Random search treats every candidate independently, with no learning from previous evaluations. A GA preserves and recombines traits from the best-performing prompts, so each generation starts from a better baseline. This guided search is dramatically more efficient — finding high-quality prompts in hundreds of evaluations instead of millions.
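A toy sketch of this guided search, with heavy assumptions: prompts are sets of phrases from a hypothetical pool, and the keyword-overlap `score` is a stand-in for the real fitness (pass@k on LLM-generated code), which would be far too slow to show here.

```python
import random

# Hypothetical phrase pool; real fitness would run the LLM, not match keywords.
PHRASES = ["think step by step", "write tests first", "be concise",
           "act as a senior engineer", "explain your reasoning",
           "handle edge cases", "use type hints", "optimize for clarity"]
TARGET = {"think step by step", "handle edge cases", "use type hints"}

def score(prompt):                         # toy stand-in for pass@k
    return len(set(prompt) & TARGET)

def evolve(generations=30, pop_size=10, rng=random.Random(1)):
    pop = [rng.sample(PHRASES, 3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]                      # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            child = list(dict.fromkeys(p1[:2] + p2[1:]))[:3]  # crossover
            if rng.random() < 0.3:                            # mutation
                child[rng.randrange(len(child))] = rng.choice(PHRASES)
            children.append(child)
        pop = survivors + children
    return max(pop, key=score)

best = evolve()
print(score(best), best)
```

Unlike random search, each generation here starts from the recombined traits of the previous winners rather than from scratch.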
What makes the GA + LLM combination achieve large speedups in test generation?
Automation of both test design (GA) and test implementation (LLM). The GA automates the creative process of designing test scenarios (what to test, which edge cases, which inputs), while the LLM automates the mechanical process of writing executable test code. Together, they replace two manual steps — thinking about tests and writing them — producing comprehensive test suites in minutes instead of hours.
Name two selection strategies and explain when you would use each.
Tournament Selection: randomly pick k individuals, select the fittest. Use when you want simple, efficient selection with easily adjustable pressure (change k). Best for most practical applications.
Roulette Wheel Selection: probability proportional to fitness. Use when fitness differences between individuals are meaningful and you want them directly reflected in selection probability. Avoid when one individual has much higher fitness than others (it would dominate).
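Both strategies above can be sketched in a few lines; note how roulette wheel's dominance problem shows up whenever one fitness value dwarfs the rest.

```python
import random

def tournament_select(pop, fitnesses, k, rng):
    """Pick k individuals at random, return the fittest.
    Larger k => stronger selection pressure."""
    contestants = rng.sample(range(len(pop)), k)
    return pop[max(contestants, key=lambda i: fitnesses[i])]

def roulette_select(pop, fitnesses, rng):
    """Selection probability proportional to fitness (assumes fitness > 0)."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for ind, f in zip(pop, fitnesses):
        running += f
        if running >= pick:
            return ind
    return pop[-1]

rng = random.Random(7)
pop, fits = ["a", "b", "c", "d"], [1.0, 2.0, 3.0, 10.0]
print(tournament_select(pop, fits, k=2, rng=rng))
print(roulette_select(pop, fits, rng=rng))  # "d" wins ~62% of spins
```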
What is a fitness landscape and why does it matter?
A fitness landscape is a conceptual surface mapping every possible solution to its fitness value. The shape of this landscape determines how difficult the optimization problem is and which GA parameters work best. Smooth landscapes with a single peak (unimodal) are easy. Rugged landscapes with many peaks of different heights (multi-modal) are hard because the GA can get trapped on local optima. Understanding the landscape informs parameter choices: rugged landscapes need higher mutation rates and larger populations to maintain diversity.
Why is fitness approximation necessary for GA + LLM code optimization?
Running test suites for every candidate in every generation is computationally infeasible. If test execution takes hours and we have hundreds of candidates across many generations, the total computation time can reach years. Fitness approximation replaces test execution with a fast predictor (an LLM call taking ~0.5s) that predicts pass/fail with ~95% accuracy. The tradeoff: we accept a small error rate to gain orders of magnitude speedup, then validate only the top candidates with actual tests at the end.
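A back-of-envelope calculation makes the gap concrete. The ~0.5 s predictor figure is from the slide; the 2-hour test run, 200 candidates, and 50 generations are illustrative assumptions.

```python
# Hypothetical GA run sizes (assumed for illustration).
candidates_per_gen = 200
generations = 50
evals = candidates_per_gen * generations        # 10,000 fitness evaluations

real_hours = evals * 2                          # 2 h per full test run
approx_hours = evals * 0.5 / 3600               # ~0.5 s per LLM prediction

print(f"real tests:  {real_hours:,} h (~{real_hours / 24 / 365:.1f} years)")
print(f"approximate: {approx_hours:.1f} h")
```

Under these assumptions the gap is roughly four orders of magnitude, which is why only the final top candidates get real test runs.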
What is "deceptive fitness" and why is it a concern with approximate fitness functions?
Deceptive fitness occurs when the fitness approximator consistently misjudges certain types of solutions. If the predictor always gives high scores to a particular pattern that actually fails real tests, the GA will evolve toward that deceptive pattern — essentially "gaming" the predictor rather than solving the real problem. This is analogous to reward hacking in reinforcement learning. Mitigation strategies include periodically validating a sample of candidates with real tests and retraining the predictor.
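The spot-check mitigation can be sketched as follows; the `spot_check` helper and both callables are hypothetical stand-ins, with real tests deliberately disagreeing with an over-optimistic predictor.

```python
import random

def spot_check(population, predict, run_real_tests, sample_size, rng):
    """Estimate predictor error on a random sample of candidates.
    A rising disagreement rate suggests the GA is gaming the predictor."""
    sample = rng.sample(population, min(sample_size, len(population)))
    wrong = sum(1 for ind in sample if predict(ind) != run_real_tests(ind))
    return wrong / len(sample)

# Toy demo: the predictor wrongly approves everything, while real
# tests pass only even-numbered candidates.
rng = random.Random(0)
pop = list(range(100))
err = spot_check(pop, lambda x: True, lambda x: x % 2 == 0, 20, rng)
print(f"predictor disagreement rate: {err:.0%}")
```

In a real loop this check would run every few generations, and a high rate would trigger retraining the predictor on the freshly validated sample.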
Extra · Slide 30
Summary & Key Takeaways
Genetic algorithms provide a powerful, gradient-free optimization framework that, when combined with LLMs, opens new possibilities for automated software engineering.
01
GAs Evolve Solutions
Genetic algorithms maintain a population of candidate solutions, using selection, crossover, and mutation to iteratively improve them across generations without requiring gradients.
02
Fitness Drives Everything
The fitness function is the only problem-specific component. For SE applications, fitness can be test pass rate, code coverage, compilation success, or any measurable quality metric.
03
Balance Explore vs Exploit
Mutation and large populations drive exploration of new solutions. Selection and crossover exploit known good solutions. The right balance depends on the problem landscape.
04
GA + LLM = Synergy
GAs evolve prompts and specifications; LLMs convert them to code and tests. This combination automates both the design and implementation of SE artifacts.
05
Fitness Approximation
When test execution is expensive, approximate fitness with an LLM predictor. Accept small error for orders-of-magnitude speedup. Validate only top candidates with real tests at the end.
06
Automated Test Evolution
Evolving test descriptions with coverage-guided fitness produces comprehensive test suites faster than manual writing, systematically exploring edge cases humans miss.
07
Honest Limitations
GA + LLM is promising but not production-ready. Fitness landscape mismatch, deceptive fitness, constraint modeling, and the representation problem remain open challenges.
08
SBSE Meets AI
Search-Based Software Engineering now leverages LLMs as code generators within evolutionary loops, enabling a new generation of automated testing, repair, and optimization tools.
09
Connecting the Dots
Evaluation metrics (Module 3) enable fitness functions. Prompting strategies (Module 5) generate candidates. GAs search the space. Fitness approximation makes it scalable.
Key References
Holland 1975
John Holland — "Adaptation in Natural and Artificial Systems," University of Michigan
EvoSuite
Fraser & Arcuri (2011) — Evolutionary test generation for Java
GenProg
Le Goues et al. (2012) — Automated program repair using genetic programming
SBSE Survey
Harman & Jones (2001) — Search-based software engineering fundamentals