Module 5 · Slide 01

Prompting LLMs for
Software Development Automation

Focus on In-Context Learning (ICL), few-shot prompting, and example selection for SE tasks. Learn how to adapt large language models without retraining them.

Learning Objectives
Objective 1
Understand prompting as adaptation without parameter updates
Objective 2
Apply few-shot prompting to software engineering tasks
Objective 3
Explain why example selection matters for prompt quality
Key Terms
PLM
Pre-trained Language Model: a model trained on large corpora before task-specific use
ICL
In-Context Learning: the model learns from examples provided directly in the prompt
Few-Shot
Providing a small number of input-output demonstrations in the prompt
Zero-Shot
Prompting with no demonstrations, only a task description
Demonstration
An input-output pair included in the prompt to illustrate the task pattern for the model
Module 5 · Slide 02

What is Prompting?

Prompting guides a pre-trained model to perform a task by structuring its input, often by embedding input-output examples directly in the prompt. Instead of retraining, the user shapes the context so the model infers the task pattern at inference time.

Key Insight
Instead of fine-tuning a pre-trained model, we provide: a task description, a few demonstrations of input-output behavior, and a target input for which the model generates the output.
Task Description
+
Demonstrations
+
Target Input
Model Output
Training / Fine-Tuning
Changes the model. Parameter updates via backpropagation over labeled data. The model itself is modified.
Prompting / ICL
Changes the situation. The model stays frozen. We change what the model sees at inference time. Both are forms of adaptation.
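The three-part structure above (task description, demonstrations, target input) can be sketched as a simple prompt-assembly helper. The function name and "Input:"/"Output:" labels are illustrative conventions, not a specific API:

```python
def build_icl_prompt(task_description, demonstrations, target_input):
    """Assemble an ICL prompt: task description + demos + target input.

    The model stays frozen; all adaptation happens in this string.
    """
    parts = [task_description, ""]
    for demo_in, demo_out in demonstrations:
        parts.append(f"Input: {demo_in}")
        parts.append(f"Output: {demo_out}")
        parts.append("")
    # The target input ends with an open "Output:" for the model to complete.
    parts.append(f"Input: {target_input}")
    parts.append("Output:")
    return "\n".join(parts)

prompt = build_icl_prompt(
    "Fix the bug in each Java method.",
    [("return a / b;", "if (b == 0) throw new IllegalArgumentException(); return a / b;")],
    "return arr[idx];",
)
```

With zero demonstrations this degenerates to a zero-shot prompt; with one, a one-shot prompt.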
Module 5 · Slide 03

Prompting in Action: Bug Fixing

A concrete SE automation task: given a buggy method, generate the fixed version. We show the model how via demonstrations.

Demo 1 — Buggy
public int divide(int a, int b) {
  return a / b;
}
Demo 1 — Fixed
public int divide(int a, int b) {
  if (b == 0) throw new IllegalArgumentException("divisor is zero");
  return a / b;
}
Demo 2 — Buggy
public void setUsername(String name) {
  this.username = name;
}
Demo 2 — Fixed
public void setUsername(String name) {
  if (name == null || name.isEmpty()) throw new IllegalArgumentException("invalid name");
  this.username = name;
}
Pattern
Both demonstrations show the same pattern: add a guard clause that validates input before proceeding. The model should learn to apply this pattern to new buggy methods.
Module 5 · Slide 04

Interactive: Predict the Model Output

Given the demonstrations from the previous slide, what would the model generate for this new buggy method?

New Buggy Method
public String getElement(String[] arr, int idx) {
  return arr[idx];
}
Select the most likely fix Interactive
A) Add synchronized keyword to the method Incorrect
B) Add bounds check: if (arr == null || idx < 0 || idx >= arr.length) throw new IllegalArgumentException(...) Correct
C) Change return type from String to Optional<String> Incorrect
D) Wrap the body in a try-catch block and return null Incorrect
Module 5 · Slide 05

Prompting vs Fine-Tuning

Two different strategies for adapting a pre-trained model to a downstream task. Understanding when to use each is critical.

Prompting / ICL

  • No parameter updates
  • No backpropagation
  • Adapts on the fly at inference
  • Works with just a few examples
  • Same model for multiple tasks
  • No training infrastructure needed
vs

Fine-Tuning

  • Updates model parameters
  • Requires backpropagation
  • Needs a training phase with epochs
  • Requires labeled training data
  • Produces a task-specific model
  • Needs GPUs and training pipeline
Key Distinction
Prompting adapts the input. Fine-tuning adapts the model. Both are valid forms of adaptation, but they have very different costs, requirements, and tradeoffs.
Module 5 · Slide 06

Interactive: Classify the Approach

For each statement, decide whether it describes Prompting/ICL or Fine-Tuning.

Match each statement Interactive
Requires gradient updates
Adapts at inference time
Needs a training phase
Can work with just a few demonstrations
Updates model weights
Uses the same model for multiple tasks
Requires a labeled dataset
No backpropagation at inference
Module 5 · Slide 07

Pros of Prompting

Why has prompting become so popular for SE automation? Three key advantages.

Data Scarcity
Works with just a few demonstrations. No labeled dataset required. Critical for niche SE tasks where data is scarce.
AI Democratization
Leverage large pre-trained models without building or maintaining custom training pipelines. Lower barrier to entry.
Strong Performance
State-of-the-art across many SE tasks without task-specific training. Often competitive with fine-tuned models.
Important Nuance
This does NOT mean prompting always beats fine-tuning. It is a flexible, low-friction, often high-performing strategy — but the best approach depends on the task, data availability, and deployment context.
Module 5 · Slide 08

Cons of Prompting

Prompting is not without significant limitations. The primary tradeoff involves computational cost and model dependency.

Primary Con: Computational Cost
Prompting requires very large models (billions of parameters) to work effectively. These models are expensive to host, serve, and query. Smaller models often fail at in-context learning.
Less Adaptation Overhead
More Dependence on Large Models
B+
Billions of parameters required
$$
API costs per query
Reflection
Is prompting cheaper than fine-tuning? Sometimes yes, sometimes no — it depends on usage frequency, API vs. self-hosted deployment, and model size. For infrequent use, prompting is often cheaper. For high-volume production, fine-tuning a smaller model may be more cost-effective.
Module 5 · Slide 09

Interactive: Decision Scenario

For each scenario, decide the best initial strategy. Click your choice to see the reasoning.

Choose the strategy Interactive

Scenario 1

You have 12 examples of a niche SE task (e.g., translating domain-specific DSL to Java). No large training set exists.

Scenario 2

You have 50,000 labeled code-summary pairs and a dedicated GPU cluster for training.

Scenario 3

You need to support 5 different SE tasks (bug fix, summarization, code review, test generation, documentation) with one model deployment.

Module 5 · Slide 10

Chain-of-Thought Prompting Deep Dive

Chain-of-Thought (CoT) prompting asks the model to show its reasoning step by step, dramatically improving accuracy on complex analysis tasks.

Standard Prompt
Q: "Is this code thread-safe?"

A: "No."

The model jumps to a conclusion without showing its work. Hard to verify, easy to be wrong.
CoT Prompt
Q: "Is this code thread-safe? Let's analyze step by step."

A: "Step 1: Identify shared state — counter is accessed by multiple threads.
Step 2: Check for synchronization — no locks or atomic operations found.
Step 3: Look for race conditions — increment is read-modify-write, not atomic.
Conclusion: No, because the shared counter has no synchronization."
25-40%
Accuracy improvement on code analysis
2022
Introduced by Wei et al. (Google)
100B+
Parameters needed for CoT to emerge
Why It Works
CoT works because it forces the model to show intermediate reasoning, making it harder to skip to wrong conclusions. Each step constrains the next, creating a logical chain that reduces errors on multi-step problems.
Module 5 · Slide 11

Prompt Engineering Best Practices

A systematic framework for writing effective prompts. Each principle is a lever you can pull to improve output quality.

1

Be Specific

Bad: "Write a function" → Good: "Write a Python function that takes a list of integers and returns the second largest unique value"

2

Provide Context

Include imports, class structure, coding style, and framework details. The model cannot read your mind or your codebase.

3

Specify Output Format

Request JSON output, docstring format, test structure, or specific coding conventions. Ambiguity in format wastes iterations.

4

Use Delimiters

Triple backticks, XML tags, or markers to separate code from instructions. Prevents the model from confusing code with prompt text.

5

Include Examples

One good input-output example is worth more than ten words of explanation. Show, don't just tell.

6

Iterate & Refine

Your first prompt is rarely perfect. Examine the output, identify gaps, and refine. Prompt engineering is an iterative process.

Rule of Thumb
If you have to explain to the model what went wrong after seeing the output, that explanation should have been in your original prompt. Every clarification is a missed constraint.
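As an illustration, the first four principles can be combined in one prompt string. The review task and the JSON schema here are invented for the example:

```python
code_snippet = "public int divide(int a, int b) { return a / b; }"

# Principles 1-4 combined: specific task, context, output format, delimiters.
prompt = (
    "You are reviewing Java code for a payment service.\n"              # context
    "Find defects in the method below and propose a fixed version.\n"   # specific
    'Respond as JSON: {"defects": [...], "fixed_code": "..."}\n'        # format
    "```java\n"                                                         # delimiter
    f"{code_snippet}\n"
    "```"
)
```

The delimiters make it unambiguous where instructions end and code begins, and the format spec saves a clarification round-trip.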
Module 5 · Slide 12

Case Study: Code Summarization

A research study on using few-shot prompting for automatic code summarization. Can we generate natural-language summaries of code methods using only a few demonstrations?

Motivation
Software projects exhibit project-specific linguistic phenomena: unique identifier naming conventions, domain-specific APIs, and coding patterns. A model trained on general code may miss these nuances.
Project A
Spring Boot APIs
Project B
Android UI code
Project C
Data pipelines

Different projects produce different vocabulary, APIs, and code patterns.

Research Questions
RQ1: Can few-shot learning effectively extend to code summarization?
RQ2: Does selecting demonstrations from the same project improve performance?
Module 5 · Slide 13

Few-Shot Prompt Structure

The anatomy of a few-shot prompt for code summarization. Each demonstration pairs a code method with its summary.

1

Demo 1

// Code:
public void setUsername(String name) { if (isValid(name)) this.user = name; }
// Summary: Set the user to username if the provided name is valid
2

Demo 2

// Code:
public String getUsername(int id) { if (id >= 0 && id < users.size()) return users.get(id); return null; }
// Summary: Get user if it is in range
...

Demos 3 through k

Additional code-summary pairs. The study scales up to 10-shot (10 demonstrations per prompt).

T

Target Code

// Code:
public List<String> getActiveUsers() { return users.stream().filter(User::isActive).collect(Collectors.toList()); }
// Summary: ??? ← model generates this
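The anatomy above can be sketched as a small k-shot prompt builder. The `// Code:` and `// Summary:` markers follow the slide's format; the function name is illustrative:

```python
def build_summary_prompt(demos, target_code, k=10):
    """Build a k-shot code-summarization prompt.

    demos: list of (code, summary) pairs; at most k are used.
    """
    lines = []
    for code, summary in demos[:k]:
        lines += [f"// Code:\n{code}", f"// Summary: {summary}", ""]
    # Target ends with an open "// Summary:" for the model to complete.
    lines += [f"// Code:\n{target_code}", "// Summary:"]
    return "\n".join(lines)
```

Varying `k` from 0 to 10 is exactly the zero-shot to 10-shot scale the study explores.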
Module 5 · Slide 14

Interactive: Build a Prompt

Assemble a valid few-shot prompt by clicking the blocks in the correct order. One block is a distractor that should not be included.

Click blocks in order Interactive
AVAILABLE BLOCKS
Task: Given a Java method, generate a one-line summary.
Demo 1: setUsername(...) → "Set the user to username if valid"
Demo 2: getUsername(...) → "Get user if it is in range"
Target: getActiveUsers() → ???
System: Always respond in JSON format with error codes.
YOUR PROMPT (click to place)
Slot 1 — click a block
Slot 2
Slot 3
Slot 4
Module 5 · Slide 15

Interactive: Prompt Playground

Explore how different prompting strategies affect output quality. Select a scenario, then compare a basic prompt with an optimized one.

Prompt Playground Interactive
BASIC PROMPT
Select a scenario above to begin.
BASIC OUTPUT
OPTIMIZED PROMPT
OPTIMIZED OUTPUT
Module 5 · Slide 16

Results: What the Research Found

Four key observations from studying few-shot code summarization with large language models.

A
10-shot Codex outperforms all fine-tuned models
B
Zero-shot & one-shot do NOT work well
C
Same-project demos improve performance
D
Larger models benefit more from few-shot
Observation A & B
With 10 demonstrations, Codex surpasses fine-tuned baselines. But with 0 or 1 demo, performance drops significantly. Code summarization requires enough examples to convey the expected style and level of detail.
Observation C & D
Same-project demonstrations consistently outperform cross-project ones, because they share naming conventions and API patterns. Larger models (e.g., Codex vs. CodeGen) leverage demonstrations more effectively.
Takeaway
Few-shot is not magic. It depends on model scale, prompt quality, and demonstration relevance. Getting all three right is what produces strong results.
Module 5 · Slide 17

Interactive: Interpret the Results

Why does 10-shot help while 0-shot and 1-shot fail for code summarization? Select all correct explanations.

Multi-select quiz Interactive
The task requires project-specific conventions that a single example cannot convey Correct
A single example does not expose enough variation in summary style and length Correct
Multiple demos help the model infer the expected phrasing style and level of detail Correct
10 examples are enough to fine-tune the model internally Incorrect — ICL does not fine-tune
Module 5 · Slide 18

Shot Selection Matters

It is not just about how many examples you provide — it is about which ones. This is a major transition in understanding few-shot prompting.

Key Insight
Bad examples waste prompt budget. Relevant examples substantially improve performance. The quality of your demonstrations often matters more than the quantity.
Random / Cross-Project Selection
Demonstrations pulled from unrelated projects. Different naming conventions, APIs, and coding styles. The model receives noisy signals about expected output.
Careful / Same-Project Selection
Demonstrations from the same project or similar codebases. Shared vocabulary, consistent style. The model receives clear, aligned signals.
What This Motivates
If the right examples matter so much, how do we automatically find them? This motivates the idea of retrieval — searching a corpus for the most relevant demonstrations given a target input.
Module 5 · Slide 19

Interactive: Rank the Shots

You need 3 demonstrations for a code summarization prompt. The target method processes HTTP requests using Spring Boot. Select the 3 best candidates from the 6 below.

Target Method
public ResponseEntity<User> handleGetUser(@PathVariable Long id) {
  return userService.findById(id).map(ResponseEntity::ok).orElse(ResponseEntity.notFound().build());
}
Select the 3 best demos Interactive
handlePostOrder(@RequestBody Order o) — Spring REST controller, same project
sortArray(int[] arr) — Utility method from an algorithms library
handleDeleteUser(@PathVariable Long id) — Same REST controller class
readFile(String path) — File I/O utility, different project
handleUpdateUser(@PathVariable Long id, @RequestBody User u) — Same REST controller
calculateTax(double amount) — Business logic helper, unrelated domain
Module 5 · Slide 20

Why Selection is a Retrieval Problem

Connecting shot selection to information retrieval. Finding the best demonstrations is essentially a search problem.

Target Task
Search Corpus
Best Demos
Few-Shot Prompt
The Retrieval Framing
Given a target input, find the most relevant demonstrations from a corpus of examples. This is the same fundamental problem as document retrieval in search engines, but applied to prompt construction.
Textual Overlap
Token-level similarity (e.g., BM25, Jaccard) between target code and candidate demos
AST Similarity
Structural similarity of abstract syntax trees, capturing code structure beyond surface text
Embedding Cosine
Semantic similarity using neural embeddings, capturing meaning even with different tokens
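A minimal sketch of the textual-overlap approach: Jaccard similarity over identifier tokens, standing in for a full BM25 or embedding-based retriever:

```python
import re

def tokens(code):
    """Extract identifier-like tokens from a code string."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a, b):
    """Token-level Jaccard similarity between two code snippets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_demos(target, candidates, k=3):
    """Rank candidate demonstrations by overlap with the target input."""
    return sorted(candidates, key=lambda c: jaccard(target, c), reverse=True)[:k]
```

On the Spring Boot example from the previous slide, this metric would favor the same-controller handlers over the sorting and file-I/O utilities, because they share identifiers like ResponseEntity and PathVariable.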
Transition
We now understand that example quality matters as much as quantity. How do we retrieve the most relevant examples automatically? This leads to Retrieval-Augmented Generation (RAG) — the topic of the next module.
Module 5 · Slide 21

RAG: Retrieval-Augmented Generation

RAG combines retrieval with generation: instead of hoping the model knows your codebase, retrieve relevant snippets and include them in the prompt.

1

Problem

The LLM does not know your codebase, your APIs, or your coding conventions. It hallucinates when it guesses.

2

Solution

Retrieve relevant code snippets at query time and inject them into the prompt as context before generation.

3

Pipeline

Query → Embed query → Search vector database → Retrieve top-k docs → Concatenate with prompt → Send to LLM.

Query
Embed
Vector DB
Top-k Docs
LLM + Context
When to Use RAG
Large codebases that exceed context limits. Domain-specific knowledge the model was not trained on. Up-to-date info that changes frequently (docs, APIs).
Key Insight
RAG lets you give the LLM relevant knowledge without fine-tuning. It is the most practical approach for enterprise code assistance and connects directly to the shot-selection problem we studied.
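The pipeline can be sketched end to end as a toy, with a bag-of-words counter standing in for a neural embedder and a plain list standing in for a vector database:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a neural embedder: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rag_prompt(query, corpus, k=2):
    """Query -> embed -> search -> top-k docs -> concatenate with prompt."""
    ranked = sorted(corpus, key=lambda d: cosine(embed(query), embed(d)), reverse=True)
    context = "\n---\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

A production system would swap in a real embedding model and an approximate-nearest-neighbor index, but the control flow is the same.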
Module 5 · Slide 22

Context Window Management

The context window is the maximum number of tokens a model can process at once. Managing it effectively is critical for real-world SE tasks.

128K
GPT-4 Turbo
200K
Claude 3.5
100K
Code Llama
1M
Gemini 1.5
1

Why It Matters

You cannot fit an entire codebase into a single prompt. Even 200K tokens covers only on the order of 20K lines of code; most projects are larger.

2

Chunking

Split large files into logical chunks (functions, classes, modules). Process each chunk independently or in sequence.

3

RAG / Selective Retrieval

Retrieve only the relevant code snippets instead of including everything. This is the most common production strategy.

4

Summarization

Compress prior context into summaries. Use the model to summarize earlier parts before feeding new content.

Practical Reality
Context window management is the single most important practical skill for using LLMs on real codebases. Models degrade on very long contexts even when they technically fit — important information in the middle gets overlooked ("lost in the middle" effect).
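Strategy 2 (chunking) can be sketched naively for Python source with a regex split at top-level function definitions. Real tools would use an AST or a language server rather than a regex:

```python
import re

def chunk_by_function(source):
    """Naively split Python source into chunks at top-level defs.

    Returns the preamble (imports, constants) as the first chunk,
    then one chunk per top-level function.
    """
    pieces = re.split(r"(?m)^(?=def )", source)  # zero-width split before "def "
    return [p for p in pieces if p.strip()]
```

Each chunk can then be summarized or embedded independently and retrieved on demand, keeping any single prompt well under the window limit.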
Module 5 · Slide 23

Tool Use & Function Calling

Modern LLMs can request to call external tools — transforming them from text generators into agents that interact with the real world.

What Is Tool Use?
The LLM can request to call external functions:

Web search
Code execution & testing
Database queries
API calls & CI/CD

{ "tool": "run_tests", "args": "test_sort.py" }
Why It Matters for SE
Iterative development loop:

1. LLM generates code
2. Calls run_tests() tool
3. Sees test failures
4. Fixes the code
5. Calls run_tests() again
6. All tests pass

This is how tools like Copilot Workspace, Cursor, and Claude Code work.
Key Insight
Tool use transforms LLMs from text generators into agents that can interact with the real world. Instead of guessing if code works, the model can run it and verify.
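The iterative loop above can be sketched as follows. Here `run_tests` is a fake stand-in for a real test-execution tool, and `generate` abstracts the LLM call:

```python
def run_tests(file):
    # Fake stand-in for a real test-execution tool (e.g., a pytest runner).
    return {"passed": file == "test_sort_fixed.py"}

def agent_loop(generate, max_iters=3):
    """Generate code, call the test tool, and regenerate on failure."""
    candidate = generate(feedback=None)
    for _ in range(max_iters):
        result = run_tests(candidate)
        if result["passed"]:
            return candidate
        # Feed the failure back so the next generation can fix it.
        candidate = generate(feedback=result)
    return candidate
```

The essential point is the feedback edge: instead of guessing whether code works, the model observes a real execution result and revises.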
Module 5 · Slide 24

Prompt Chaining & Multi-Step Workflows

Complex tasks benefit from decomposition. Instead of one massive prompt, chain multiple focused prompts where each step's output feeds the next.

Prompt 1
Analyze bugs
Output 1
Prompt 2
Suggest fixes
Output 2
Prompt 3
Generate tests
Final Result
Single Prompt Approach
"Analyze this code for bugs, suggest fixes for each bug, and generate tests for the fixed code."

Result: Lower quality. The model tries to do everything at once, often missing bugs or generating inconsistent fixes and tests.
Chained Approach
Each prompt is focused on one task. The output of step 1 becomes input to step 2.

Result: Higher quality. Each step can be optimized independently. Errors are caught earlier in the pipeline.
Design Principle
Complex tasks benefit from decomposition. Each prompt in the chain can be optimized independently, tested separately, and debugged in isolation. This is the foundation of agentic workflows.
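A minimal chaining harness, with the LLM call abstracted as a function argument so each step stays a focused, testable prompt:

```python
def chain(steps, initial_input, llm):
    """Run a sequence of focused prompts; each step's output feeds the next.

    steps: list of prompt templates containing an {input} placeholder.
    llm:   callable that maps a prompt string to a response string.
    """
    output = initial_input
    for template in steps:
        output = llm(template.format(input=output))
    return output
```

Because each template is independent, you can A/B test or debug one link of the chain without touching the others.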
Module 5 · Slide 25

Self-Consistency & Majority Voting

Generate multiple responses to the same prompt and take the majority vote. Trade compute for accuracy.

1

Generate N Responses

Send the same prompt N times with temperature > 0 so each response uses a different reasoning path.

2

Each Response Reasons Differently

With non-zero temperature, the model explores different solution strategies, variable names, and code structures.

3

Extract Final Answers

From each response, extract the final answer (the generated code, the bug diagnosis, the classification).

4

Majority Vote

Take the answer that appears most frequently, or run tests on all solutions and pick the best performer.

Path 1: Yes
Path 2: Yes
Path 3: No
Path 4: Yes
Path 5: No
Yes (3/5)
Connection to Code Evaluation
Self-consistency trades compute for accuracy. It is closely related to the pass@k evaluation from Module 3: generate k solutions and check whether any passes all tests. In production, generate 5 code solutions, run tests on all of them, and keep the one that passes the most.
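The four steps above reduce to a few lines once sampling is abstracted as a callable. A real implementation would call the model API n times with temperature > 0:

```python
from collections import Counter

def self_consistency(sample, extract, n=5):
    """Sample n responses and majority-vote their extracted final answers.

    sample:  callable returning one model response (non-zero temperature).
    extract: callable pulling the final answer out of a response.
    Returns (majority answer, fraction of votes it received).
    """
    answers = [extract(sample()) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n
```

On the thread-safety example in the diagram, three "Yes" paths out of five would yield ("Yes", 0.6).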
Module 5 · Slide 26

Evaluating Prompt Effectiveness

How do you know if your prompts are working? Systematic evaluation separates prompt engineering from guesswork.

Automated Metrics
pass@k: does generated code pass tests? BLEU: similarity to reference output. Test pass rate: percentage of generated tests that execute correctly.
A/B Testing
Compare two prompt versions on the same set of inputs. Measure which produces better outputs across multiple dimensions (correctness, style, completeness).
Error Analysis
Categorize failures: wrong logic, wrong syntax, wrong API usage, hallucinated functions. Each category suggests a different prompt improvement.
Regression Testing
Save your best prompts. When the underlying model updates, re-run your test suite to catch regressions. Prompts are fragile across model versions.
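For pass@k specifically, the standard unbiased estimator from the Codex evaluation work can be computed directly: with n generated samples of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples total, c of which pass all tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this estimator over a benchmark's problems gives the headline pass@k score.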
Worked Example
We tested 3 prompt variants for unit test generation:

Variant A (zero-shot): 45% test pass rate
Variant B (few-shot, 3 examples): 62% test pass rate
Variant C (few-shot + CoT + format spec): 78% test pass rate

Variant C combined examples with chain-of-thought reasoning and explicit output format constraints — each technique stacked to improve quality.
Module 5 · Slide 27

Prompting Patterns for SE Automation

A summary of the major prompting strategies used in software engineering research and practice.

Pattern | Description | When to Use | Performance | Prompt Length
Zero-Shot | Task description only, no demos | Simple, well-known tasks | Variable | Short
One-Shot | Single demonstration | Tasks with clear patterns | Moderate | Short
Few-Shot | 3–10 demonstrations | Complex SE tasks, code summarization | Strong | Medium
Chain-of-Thought | Step-by-step reasoning in demos | Debugging, code review, analysis | Strong | Long
Choosing a Pattern
Start with zero-shot to gauge baseline. If performance is insufficient, add demonstrations (few-shot). For tasks requiring logical reasoning, use chain-of-thought. Always consider the context window limit of your target model.
Module 5 · Slide 28

Recap & Knowledge Check

01

Prompting = Adaptation

Prompting adapts a model without updating its parameters. It changes the input, not the model.

02

Few-Shot Embeds Demos

Demonstrations are embedded directly in the prompt context. The model infers patterns from them.

03

CoT & Best Practices

Chain-of-Thought forces step-by-step reasoning. Specificity, context, and format specification are essential.

04

Shot Selection Matters

Which examples you choose is as important as how many. Relevance is the key signal.

05

RAG & Context Management

Retrieve relevant code for prompts. Manage context windows carefully on real projects.

06

Tool Use & Chaining

LLMs can call tools and chain prompts. This is the foundation of agentic SE workflows.

What distinguishes prompting from fine-tuning at a fundamental level?
Prompting changes the input; fine-tuning changes the model. Prompting does not update model parameters — it adapts behavior by structuring the context. Fine-tuning performs gradient-based updates to the model weights.
Why does Chain-of-Thought prompting improve accuracy on code analysis tasks?
CoT forces the model to show intermediate reasoning steps, making it harder to skip to wrong conclusions. Each step constrains the next, creating a logical chain. This mirrors how humans analyze code: step by step rather than jumping to a verdict.
When would you use RAG instead of including all code in the prompt?
When the codebase exceeds the context window, when you need domain-specific knowledge the model was not trained on, or when information changes frequently. RAG retrieves only the relevant snippets, keeping prompts focused and within token limits.
Module 5 · Slide 29

Lab Challenge: Prompt Engineering Competition

Put your prompt engineering skills to the test. Design the best prompt for a code generation task and analyze the results.

The Task
Given a code generation challenge (implement a sorting algorithm with specific requirements), design the best possible prompt to get correct, efficient, and readable code from an LLM.
Requirements
Implement a function that sorts a list of dictionaries by a specified key, with support for:
• Ascending and descending order
• Handling missing keys gracefully
• Stable sort behavior
• Type checking and error handling
• Comprehensive docstring and type hints
Submit
1) Your prompt
2) The generated output
3) Test results (write & run tests)
4) 1-paragraph analysis of your prompt design choices
Bonus
Try 3 or more prompting strategies (zero-shot, few-shot, CoT, etc.) and compare results. Which strategy produced the best code? Why?
Grading Criteria
Prompt Quality (30%): Specificity, structure, use of techniques from this module.
Output Correctness (30%): Does the generated code actually work?
Test Coverage (20%): Did you write meaningful tests?
Analysis Depth (20%): Insightful reflection on what worked and why.
Module 5 · Slide 30

What's Next

This module covered prompting and in-context learning for SE automation. Here is where the journey continues.

Next: RAG
Retrieval-Augmented Generation: automatically selecting the most relevant demonstrations from a corpus to build optimal prompts.
Later: Hallucinations
When models confidently generate wrong code. Understanding, detecting, and mitigating hallucinated outputs.
Later: Agents
Multi-step LLM workflows for SE tasks. Combining prompting with tool use, planning, and iterative refinement.
Final Reflection
Prompting is not just asking a model a question. It is a systematic technique for lightweight adaptation — a way to steer powerful pre-trained models toward specific tasks using carefully structured context, without ever changing a single weight.
Prompting
RAG
Hallucinations
Agents
🎉

Module Complete!

You've finished Prompting LLMs. Great work!