CodeLab · Module 5 (Interactive) · Prompting LLMs for Software Development Automation
Module 5 · Slide 01
Prompting LLMs for Software Development Automation
Focus on In-Context Learning (ICL), few-shot prompting, and example selection for SE tasks. Learn how to adapt large language models without retraining them.
Learning Objectives
Objective 1
Understand prompting as adaptation without parameter updates
Objective 2
Apply few-shot prompting to software engineering tasks
Objective 3
Explain why example selection matters for prompt quality
Key Terms
PLM
Pre-trained Language Model: a model trained on large corpora before task-specific use
ICL
In-Context Learning: the model learns from examples provided directly in the prompt
Few-Shot
Providing a small number of input-output demonstrations in the prompt
Zero-Shot
Prompting with no demonstrations, only a task description
Demonstration
An input-output pair included in the prompt to illustrate the task pattern for the model
Module 5 · Slide 02
What is Prompting?
Prompting guides a pre-trained model to perform a task by structuring its input, for example by embedding input-output examples directly in the prompt. Instead of retraining, the user shapes what the model sees so that it infers the task pattern from context.
Key Insight
Instead of fine-tuning a pre-trained model, we provide: a task description, a few demonstrations of input-output behavior, and a target input for which the model generates the output.
Task Description
+
Demonstrations
+
Target Input
→
Model Output
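In code, this assembly is plain string concatenation. A minimal sketch (the task wording and demonstration texts here are illustrative, not from a specific API):

```python
def build_prompt(task_description, demonstrations, target_input):
    """Assemble a few-shot prompt: task description, then demos, then the target."""
    parts = [task_description]
    for demo_in, demo_out in demonstrations:
        parts.append(f"Input: {demo_in}\nOutput: {demo_out}")
    # The target input ends with an open "Output:" for the model to complete.
    parts.append(f"Input: {target_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Fix the bug in each Java method.",
    [("int divide(int a, int b) { return a / b; }",
      "int divide(int a, int b) { if (b == 0) throw new IllegalArgumentException(); return a / b; }")],
    "String getElement(String[] arr, int idx) { return arr[idx]; }",
)
print(prompt)
```

The same builder serves any task: only the description and the demonstration pairs change.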
Training / Fine-Tuning
Changes the model. Parameter updates via backpropagation over labeled data. The model itself is modified.
Prompting / ICL
Changes the situation. The model stays frozen. We change what the model sees at inference time. Both are forms of adaptation.
Module 5 · Slide 03
Prompting in Action: Bug Fixing
A concrete SE automation task: given a buggy method, generate the fixed version. We show the model how via demonstrations.
Demo 1 — Buggy
public int divide(int a, int b) { return a / b; }
Demo 1 — Fixed
public int divide(int a, int b) { if (b == 0) throw new IllegalArgumentException("divisor is zero"); return a / b; }
Demo 2 — Buggy
public void setUsername(String name) { this.username = name; }
Demo 2 — Fixed
public void setUsername(String name) { if (name == null || name.isEmpty()) throw new IllegalArgumentException("invalid name"); this.username = name; }
Pattern
Both demonstrations show the same pattern: add a guard clause that validates input before proceeding. The model should learn to apply this pattern to new buggy methods.
Module 5 · Slide 04
Interactive: Predict the Model Output
Given the demonstrations from the previous slide, what would the model generate for this new buggy method?
New Buggy Method
public String getElement(String[] arr, int idx) { return arr[idx]; }
Select the most likely fix Interactive
A) Add synchronized keyword to the method — Incorrect
B) Add bounds check: if (arr == null || idx < 0 || idx >= arr.length) throw new IllegalArgumentException(...) — Correct
C) Change return type from String to Optional<String> — Incorrect
D) Wrap the body in a try-catch block and return null — Incorrect
Explanation
The model extracts the pattern: add input validation before the operation. Both demos added guard clauses with IllegalArgumentException. The model applies this same pattern — checking the array and index bounds before accessing the element.
Module 5 · Slide 05
Prompting vs Fine-Tuning
Two different strategies for adapting a pre-trained model to a downstream task. Understanding when to use each is critical.
Prompting / ICL
No parameter updates
No backpropagation
Adapts on the fly at inference
Works with just a few examples
Same model for multiple tasks
No training infrastructure needed
vs
Fine-Tuning
Updates model parameters
Requires backpropagation
Needs a training phase with epochs
Requires labeled training data
Produces a task-specific model
Needs GPUs and training pipeline
Key Distinction
Prompting adapts the input. Fine-tuning adapts the model. Both are valid forms of adaptation, but they have very different costs, requirements, and tradeoffs.
Module 5 · Slide 06
Interactive: Classify the Approach
For each statement, decide whether it describes Prompting/ICL or Fine-Tuning.
Match each statement Interactive
Requires gradient updates
Adapts at inference time
Needs a training phase
Can work with just a few demonstrations
Updates model weights
Uses the same model for multiple tasks
Requires a labeled dataset
No backpropagation at inference
Module 5 · Slide 07
Pros of Prompting
Why has prompting become so popular for SE automation? Three key advantages.
Data Scarcity
Works with just a few demonstrations. No labeled dataset required. Critical for niche SE tasks where data is scarce.
AI Democratization
Leverage large pre-trained models without building or maintaining custom training pipelines. Lower barrier to entry.
Strong Performance
State-of-the-art across many SE tasks without task-specific training. Often competitive with fine-tuned models.
Important Nuance
This does NOT mean prompting always beats fine-tuning. It is a flexible, low-friction, often high-performing strategy — but the best approach depends on the task, data availability, and deployment context.
Module 5 · Slide 08
Cons of Prompting
Prompting is not without significant limitations. The primary tradeoff involves computational cost and model dependency.
Primary Con: Computational Cost
Prompting requires very large models (billions of parameters) to work effectively. These models are expensive to host, serve, and query. Smaller models often fail at in-context learning.
Less Adaptation Overhead
↔
More Dependence on Large Models
B+
Billions of parameters required
$$
API costs per query
Reflection
Is prompting cheaper than fine-tuning? Sometimes yes, sometimes no — it depends on usage frequency, API vs. self-hosted deployment, and model size. For infrequent use, prompting is often cheaper. For high-volume production, fine-tuning a smaller model may be more cost-effective.
Module 5 · Slide 09
Interactive: Decision Scenario
For each scenario, decide the best initial strategy. Click your choice to see the reasoning.
Choose the strategy Interactive
Scenario 1
You have 12 examples of a niche SE task (e.g., translating domain-specific DSL to Java). No large training set exists.
Scenario 2
You have 50,000 labeled code-summary pairs and a dedicated GPU cluster for training.
Scenario 3
You need to support 5 different SE tasks (bug fix, summarization, code review, test generation, documentation) with one model deployment.
Module 5 · Slide 10
Chain-of-Thought Prompting Deep Dive
Chain-of-Thought (CoT) prompting asks the model to show its reasoning step by step, dramatically improving accuracy on complex analysis tasks.
Standard Prompt
Q: "Is this code thread-safe?"
A: "No."
The model jumps to a conclusion without showing its work. Hard to verify, easy to be wrong.
CoT Prompt
Q: "Is this code thread-safe? Let's analyze step by step."
A: "Step 1: Identify shared state — counter is accessed by multiple threads.
Step 2: Check for synchronization — no locks or atomic operations found.
Step 3: Look for race conditions — increment is read-modify-write, not atomic.
Conclusion: No, because the shared counter has no synchronization."
25-40%
Accuracy improvement on code analysis
2022
Introduced by Wei et al. (Google)
100B+
Parameters needed for CoT to emerge
Why It Works
CoT works because it forces the model to show intermediate reasoning, making it harder to skip to wrong conclusions. Each step constrains the next, creating a logical chain that reduces errors on multi-step problems.
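A few-shot CoT prompt is assembled like any other: prefix a worked reasoning demo, then end the target question with the same step-by-step trigger. A minimal sketch, with the demo condensed from the thread-safety example above:

```python
# Worked demo: shows the model the step-by-step answer format to imitate.
COT_DEMO = (
    "Q: Is this counter thread-safe? Let's analyze step by step.\n"
    "A: Step 1: Identify shared state.\n"
    "Step 2: Check for synchronization.\n"
    "Step 3: Look for race conditions.\n"
    "Conclusion: No, the shared counter has no synchronization.\n"
)

def cot_prompt(question, code):
    """Prefix the reasoning demo, then pose the target question with the
    same 'step by step' trigger so the model reproduces the format."""
    return (f"{COT_DEMO}\n"
            f"Q: {question} Let's analyze step by step.\n"
            f"Code:\n{code}\nA:")

p = cot_prompt("Is this code thread-safe?", "counter += 1")
```

The trigger phrase and demo structure are the whole technique; no model-side change is involved.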
Module 5 · Slide 11
Prompt Engineering Best Practices
A systematic framework for writing effective prompts. Each principle is a lever you can pull to improve output quality.
1
Be Specific
Bad: "Write a function" → Good: "Write a Python function that takes a list of integers and returns the second largest unique value"
2
Provide Context
Include imports, class structure, coding style, and framework details. The model cannot read your mind or your codebase.
3
Specify Output Format
Request JSON output, docstring format, test structure, or specific coding conventions. Ambiguity in format wastes iterations.
4
Use Delimiters
Triple backticks, XML tags, or markers to separate code from instructions. Prevents the model from confusing code with prompt text.
5
Include Examples
One good input-output example is worth more than ten words of explanation. Show, don't just tell.
6
Iterate & Refine
Your first prompt is rarely perfect. Examine the output, identify gaps, and refine. Prompt engineering is an iterative process.
Rule of Thumb
If you have to explain to the model what went wrong after seeing the output, that explanation should have been in your original prompt. Every clarification is a missed constraint.
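Principles 3 (specify output format) and 4 (use delimiters) combine naturally in a small helper. A minimal sketch; the instruction wording and JSON field names are illustrative:

```python
FENCE = "`" * 3  # triple backticks, built here so the fence reads clearly

def review_prompt(code):
    """Delimit the code with triple backticks (principle 4) and pin the
    output format explicitly (principle 3) so nothing is ambiguous."""
    return (
        "Review the Java method between the triple backticks for bugs.\n"
        'Respond only with JSON: {"issue": "...", "fix": "..."}.\n'
        f"{FENCE}java\n{code}\n{FENCE}"
    )

p = review_prompt("int divide(int a, int b) { return a / b; }")
```

The delimiters keep the model from reading code as instructions; the format line saves a clarification round-trip.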
Module 5 · Slide 12
Case Study: Code Summarization
A research study on using few-shot prompting for automatic code summarization. Can we generate natural-language summaries of code methods using only a few demonstrations?
Motivation
Software projects exhibit project-specific linguistic phenomena: unique identifier naming conventions, domain-specific APIs, and coding patterns. A model trained on general code may miss these nuances.
Project A Spring Boot APIs
Project B Android UI code
Project C Data pipelines
Different projects produce different vocabulary, APIs, and code patterns.
Research Questions
RQ1: Can few-shot learning effectively extend to code summarization? RQ2: Does selecting demonstrations from the same project improve performance?
Module 5 · Slide 13
Few-Shot Prompt Structure
The anatomy of a few-shot prompt for code summarization. Each demonstration pairs a code method with its summary.
1
Demo 1
// Code: public void setUsername(String name) { if (isValid(name)) this.user = name; } // Summary: Set the user to username if the provided name is valid
2
Demo 2
// Code: public String getUsername(int id) { if (id >= 0 && id < users.size()) return users.get(id); return null; } // Summary: Get user if it is in range
...
Demos 3 through k
Additional code-summary pairs. The study scales up to 10-shot (10 demonstrations per prompt).
T
Target Code
// Code: public List<String> getActiveUsers() { return users.stream().filter(User::isActive).collect(Collectors.toList()); } // Summary: ??? ← model generates this
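This structure reduces to a small builder over (code, summary) pairs. A minimal sketch of a k-shot prompt assembler in the // Code / // Summary layout above (names are illustrative):

```python
def summarization_prompt(demos, target_code, k=10):
    """Build a k-shot code-summarization prompt from (code, summary) pairs,
    ending with an open '// Summary:' for the model to complete."""
    parts = [f"// Code: {code}\n// Summary: {summary}" for code, summary in demos[:k]]
    parts.append(f"// Code: {target_code}\n// Summary:")
    return "\n\n".join(parts)

demos = [
    ("public void setUsername(String name) { if (isValid(name)) this.user = name; }",
     "Set the user to username if the provided name is valid"),
    ("public String getUsername(int id) { ... }",
     "Get user if it is in range"),
]
p = summarization_prompt(demos, "public List<String> getActiveUsers() { ... }")
```

Scaling from 1-shot to 10-shot is just a longer `demos` list; the study's k is a parameter, not a structural change.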
Module 5 · Slide 14
Interactive: Build a Prompt
Assemble a valid few-shot prompt by clicking the blocks in the correct order. One block is a distractor that should not be included.
Click blocks in order Interactive
AVAILABLE BLOCKS
Task: Given a Java method, generate a one-line summary.
Demo 1: setUsername(...) → "Set the user to username if valid"
Demo 2: getUsername(...) → "Get user if it is in range"
Target: getActiveUsers() → ???
System: Always respond in JSON format with error codes.
YOUR PROMPT (click to place)
Slot 1 — click a block
Slot 2
Slot 3
Slot 4
Module 5 · Slide 15
Interactive: Prompt Playground
Explore how different prompting strategies affect output quality. Select a scenario, then compare a basic prompt with an optimized one.
Prompt Playground Interactive
BASIC PROMPT
Select a scenario above to begin.
BASIC OUTPUT
—
OPTIMIZED PROMPT
—
OPTIMIZED OUTPUT
—
Module 5 · Slide 16
Results: What the Research Found
Four key observations from studying few-shot code summarization with large language models.
A
10-shot Codex outperforms all fine-tuned models
B
Zero-shot & one-shot do NOT work well
C
Same-project demos improve performance
D
Larger models benefit more from few-shot
Observation A & B
With 10 demonstrations, Codex surpasses fine-tuned baselines. But with 0 or 1 demo, performance drops significantly. Code summarization requires enough examples to convey the expected style and level of detail.
Observation C & D
Same-project demonstrations consistently outperform cross-project ones, because they share naming conventions and API patterns. Larger models (e.g., Codex vs. CodeGen) leverage demonstrations more effectively.
Takeaway
Few-shot is not magic. It depends on model scale, prompt quality, and demonstration relevance. Getting all three right is what produces strong results.
Module 5 · Slide 17
Interactive: Interpret the Results
Why does 10-shot help while 0-shot and 1-shot fail for code summarization? Select all correct explanations.
Multi-select quiz Interactive
The task requires project-specific conventions that a single example cannot convey — Correct
A single example does not expose enough variation in summary style and length — Correct
Multiple demos help the model infer the expected phrasing style and level of detail — Correct
10 examples are enough to fine-tune the model internally — Incorrect: ICL does not fine-tune
Explanation
The first three are correct: code summarization needs sufficient demonstrations to convey project-specific naming, summary style, and detail level. The last option is wrong because ICL never updates model weights — the model is not fine-tuned by the examples in the prompt.
Module 5 · Slide 18
Shot Selection Matters
It is not just about how many examples you provide — it is about which ones. This is a major transition in understanding few-shot prompting.
Key Insight
Bad examples waste prompt budget. Relevant examples substantially improve performance. The quality of your demonstrations often matters more than the quantity.
Random / Cross-Project Selection
Demonstrations pulled from unrelated projects. Different naming conventions, APIs, and coding styles. The model receives noisy signals about expected output.
Careful / Same-Project Selection
Demonstrations from the same project or similar codebases. Shared vocabulary, consistent style. The model receives clear, aligned signals.
What This Motivates
If the right examples matter so much, how do we automatically find them? This motivates the idea of retrieval — searching a corpus for the most relevant demonstrations given a target input.
Module 5 · Slide 19
Interactive: Rank the Shots
You need 3 demonstrations for a code summarization prompt. The target method processes HTTP requests using Spring Boot. Select the 3 best candidates from the 6 below.
Target Method
public ResponseEntity<User> handleGetUser(@PathVariable Long id) { return userService.findById(id).map(ResponseEntity::ok).orElse(ResponseEntity.notFound().build()); }
Select the 3 best demos Interactive
handlePostOrder(@RequestBody Order o) — Spring REST controller, same project
sortArray(int[] arr) — Utility method from an algorithms library
handleDeleteUser(@PathVariable Long id) — Same REST controller class
readFile(String path) — File I/O utility, different project
handleUpdateUser(@PathVariable Long id, @RequestBody User u) — Same REST controller
calculateTax(double amount) — Business logic helper, unrelated domain
Module 5 · Slide 20
Why Selection is a Retrieval Problem
Connecting shot selection to information retrieval. Finding the best demonstrations is essentially a search problem.
Target Task
→
Search Corpus
→
Best Demos
→
Few-Shot Prompt
The Retrieval Framing
Given a target input, find the most relevant demonstrations from a corpus of examples. This is the same fundamental problem as document retrieval in search engines, but applied to prompt construction.
Textual Overlap
Token-level similarity (e.g., BM25, Jaccard) between target code and candidate demos
AST Similarity
Structural similarity of abstract syntax trees, capturing code structure beyond surface text
Embedding Cosine
Semantic similarity using neural embeddings, capturing meaning even with different tokens
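Textual overlap is the simplest of the three to implement. A minimal Jaccard-over-identifiers sketch; BM25, AST, and embedding retrievers all plug into the same select-top-k shape:

```python
import re

def tokens(code):
    """Identifiers and keywords in the code, as a set, for overlap scoring."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of the token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_shots(target, candidates, k=3):
    """Rank candidate demonstrations by token overlap with the target; keep top k."""
    return sorted(candidates, key=lambda c: jaccard(target, c), reverse=True)[:k]
```

Swapping `jaccard` for an AST or embedding similarity changes only the scoring function, not the retrieval loop.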
Transition
We now understand that example quality matters as much as quantity. How do we retrieve the most relevant examples automatically? This leads to Retrieval-Augmented Generation (RAG) — the topic of the next module.
Module 5 · Slide 21
RAG: Retrieval-Augmented Generation
RAG combines retrieval with generation: instead of hoping the model knows your codebase, retrieve relevant snippets and include them in the prompt.
1
Problem
The LLM does not know your codebase, your APIs, or your coding conventions. It hallucinates when it guesses.
2
Solution
Retrieve relevant code snippets at query time and inject them into the prompt as context before generation.
3
Pipeline
Query → Embed query → Search vector database → Retrieve top-k docs → Concatenate with prompt → Send to LLM.
Query
→
Embed
→
Vector DB
→
Top-k Docs
→
LLM + Context
When to Use RAG
Large codebases that exceed context limits. Domain-specific knowledge the model was not trained on. Up-to-date info that changes frequently (docs, APIs).
Key Insight
RAG lets you give the LLM relevant knowledge without fine-tuning. It is the most practical approach for enterprise code assistance and connects directly to the shot-selection problem we studied.
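The pipeline can be sketched end to end with a toy embedding. Here `embed` is a bag-of-words stand-in for a real neural encoder, and a plain list stands in for the vector database:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a neural embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_prompt(query, corpus, k=2):
    """Slide pipeline: embed the query, rank the corpus, inject top-k as context."""
    q = embed(query)
    top = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
    return "Context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"
```

A production system replaces `embed` with a trained encoder and the `sorted` scan with an approximate-nearest-neighbor index, but the shape of the pipeline is the same.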
Module 5 · Slide 22
Context Window Management
The context window is the maximum number of tokens a model can process at once. Managing it effectively is critical for real-world SE tasks.
128K
GPT-4 Turbo
200K
Claude 3.5
100K
Code Llama
1M
Gemini 1.5
1
Why It Matters
You cannot fit an entire codebase into a single prompt. Even a 200K-token window holds only on the order of tens of thousands of lines of code; most projects are larger.
2
Chunking
Split large files into logical chunks (functions, classes, modules). Process each chunk independently or in sequence.
3
RAG / Selective Retrieval
Retrieve only the relevant code snippets instead of including everything. This is the most common production strategy.
4
Summarization
Compress prior context into summaries. Use the model to summarize earlier parts before feeding new content.
Practical Reality
Context window management is the single most important practical skill for using LLMs on real codebases. Models degrade on very long contexts even when they technically fit — important information in the middle gets overlooked ("lost in the middle" effect).
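Strategy 2 (chunking) is straightforward for languages with a parser. A minimal sketch that splits a Python source file into per-function chunks using the standard `ast` module:

```python
import ast

def chunk_by_function(source):
    """Split Python source into per-function chunks, one chunk per
    top-level function, using exact source spans from the parser."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
```

Each chunk can then be embedded, retrieved, or summarized independently; class- or module-level chunking follows the same pattern with different node types.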
Module 5 · Slide 23
Tool Use & Function Calling
Modern LLMs can request to call external tools — transforming them from text generators into agents that interact with the real world.
What Is Tool Use?
The LLM can request to call external functions:
→ Web search → Code execution & testing → Database queries → API calls & CI/CD
"tool": "run_tests", "args": "test_sort.py"
Why It Matters for SE
Iterative development loop:
1. LLM generates code
2. Calls run_tests() tool
3. Sees test failures
4. Fixes the code
5. Calls run_tests() again
6. All tests pass
This is how tools like Copilot Workspace, Cursor, and Claude Code work.
Key Insight
Tool use transforms LLMs from text generators into agents that can interact with the real world. Instead of guessing if code works, the model can run it and verify.
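The iterative loop above can be sketched generically. Here `generate_fix` and `run_tests` are stand-ins for the LLM call and the test-runner tool; real systems exchange structured tool-call messages rather than direct function calls:

```python
def agent_loop(generate_fix, run_tests, max_iters=5):
    """Generate code, run the tests, feed failures back, repeat.
    generate_fix(prev_code, failures) -> new code (the LLM call).
    run_tests(code) -> list of failure messages (the tool call)."""
    code, failures = None, None
    for _ in range(max_iters):
        code = generate_fix(code, failures)   # LLM proposes code, sees prior failures
        failures = run_tests(code)            # tool call: execute the test suite
        if not failures:
            return code                       # all tests pass: done
    return code                               # best effort after max_iters
```

The loop terminates on success or on an iteration budget; production agents add logging, cost limits, and rollback on regression.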
Module 5 · Slide 24
Prompt Chaining & Multi-Step Workflows
Complex tasks benefit from decomposition. Instead of one massive prompt, chain multiple focused prompts where each step's output feeds the next.
Prompt 1 Analyze bugs
→
Output 1
→
Prompt 2 Suggest fixes
→
Output 2
→
Prompt 3 Generate tests
→
Final Result
Single Prompt Approach
"Analyze this code for bugs, suggest fixes for each bug, and generate tests for the fixed code."
Result: Lower quality. The model tries to do everything at once, often missing bugs or generating inconsistent fixes and tests.
Chained Approach
Each prompt is focused on one task. The output of step 1 becomes input to step 2.
Result: Higher quality. Each step can be optimized independently. Errors are caught earlier in the pipeline.
Design Principle
Complex tasks benefit from decomposition. Each prompt in the chain can be optimized independently, tested separately, and debugged in isolation. This is the foundation of agentic workflows.
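The chain reduces to three focused calls, each consuming the previous output. A minimal sketch; `llm` stands in for a single model call (prompt in, text out):

```python
def chain(llm, code):
    """Three focused prompts; each step's output feeds the next step's input."""
    bugs  = llm(f"List the bugs in this code:\n{code}")
    fixes = llm(f"Suggest a fix for each bug:\n{bugs}")
    tests = llm(f"Write unit tests covering these fixes:\n{fixes}")
    return bugs, fixes, tests
```

Because each step is a separate call, each prompt can be A/B tested and debugged in isolation, which is exactly the design principle above.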
Module 5 · Slide 25
Self-Consistency & Majority Voting
Generate multiple responses to the same prompt and take the majority vote. Trade compute for accuracy.
1
Generate N Responses
Send the same prompt N times with temperature > 0 so each response uses a different reasoning path.
2
Each Response Reasons Differently
With non-zero temperature, the model explores different solution strategies, variable names, and code structures.
3
Extract Final Answers
From each response, extract the final answer (the generated code, the bug diagnosis, the classification).
4
Majority Vote
Take the answer that appears most frequently, or run tests on all solutions and pick the best performer.
Path 1: Yes
Path 2: Yes
Path 3: No
Path 4: Yes
Path 5: No
→
Yes (3/5)
Connection to Code Evaluation
Self-consistency trades compute for accuracy. It is the basis of pass@k evaluation from Module 3: generate k solutions, check if any pass all tests. In production, generate 5 code solutions, run tests on all, and pick the one that passes the most.
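Steps 1 through 4 fit in a few lines. A minimal sketch; `sample` stands in for one model call made with temperature > 0:

```python
from collections import Counter

def self_consistent_answer(sample, n=5):
    """Draw n answers from the model and majority-vote the final answer.
    Returns (winning answer, vote count)."""
    votes = Counter(sample() for _ in range(n))
    return votes.most_common(1)[0]
```

For code generation, the vote is often replaced by running tests on all n candidates and keeping the best performer, as the slide notes.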
Module 5 · Slide 26
Evaluating Prompt Effectiveness
How do you know if your prompts are working? Systematic evaluation separates prompt engineering from guesswork.
Automated Metrics
pass@k: does generated code pass tests? BLEU: similarity to reference output. Test pass rate: percentage of generated tests that execute correctly.
A/B Testing
Compare two prompt versions on the same set of inputs. Measure which produces better outputs across multiple dimensions (correctness, style, completeness).
Error Analysis
Categorize failures: wrong logic, wrong syntax, wrong API usage, hallucinated functions. Each category suggests a different prompt improvement.
Regression Testing
Save your best prompts. When the underlying model updates, re-run your test suite to catch regressions. Prompts are fragile across model versions.
Worked Example
We tested 3 prompt variants for unit test generation:
Variant A (zero-shot): 45% test pass rate
Variant B (few-shot, 3 examples): 62% test pass rate
Variant C (few-shot + CoT + format spec): 78% test pass rate
Variant C combined examples with chain-of-thought reasoning and explicit output format constraints — each technique stacked to improve quality.
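The pass@k metric mentioned under Automated Metrics is usually computed with an unbiased estimator over n generated samples of which c pass all tests. A sketch of the standard formula from the code-generation evaluation literature:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n generations with c correct, passes all tests."""
    if n - c < k:
        return 1.0  # not enough failures to fill k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Generating n > k samples and estimating this way gives lower-variance numbers than literally drawing k samples once.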
Module 5 · Slide 27
Prompting Patterns for SE Automation
A summary of the major prompting strategies used in software engineering research and practice.
Pattern
Description
When to Use
Performance
Prompt Length
Zero-Shot
Task description only, no demos
Simple, well-known tasks
Variable
Short
One-Shot
Single demonstration
Tasks with clear patterns
Moderate
Short
Few-Shot
3–10 demonstrations
Complex SE tasks, code summarization
Strong
Medium
Chain-of-Thought
Step-by-step reasoning in demos
Debugging, code review, analysis
Strong
Long
Choosing a Pattern
Start with zero-shot to gauge baseline. If performance is insufficient, add demonstrations (few-shot). For tasks requiring logical reasoning, use chain-of-thought. Always consider the context window limit of your target model.
Module 5 · Slide 28
Recap & Knowledge Check
01
Prompting = Adaptation
Prompting adapts a model without updating its parameters. It changes the input, not the model.
02
Few-Shot Embeds Demos
Demonstrations are embedded directly in the prompt context. The model infers patterns from them.
03
CoT & Best Practices
Chain-of-Thought forces step-by-step reasoning. Specificity, context, and format specification are essential.
04
Shot Selection Matters
Which examples you choose is as important as how many. Relevance is the key signal.
05
RAG & Context Management
Retrieve relevant code for prompts. Manage context windows carefully on real projects.
06
Tool Use & Chaining
LLMs can call tools and chain prompts. This is the foundation of agentic SE workflows.
What distinguishes prompting from fine-tuning at a fundamental level?
Prompting changes the input; fine-tuning changes the model. Prompting does not update model parameters — it adapts behavior by structuring the context. Fine-tuning performs gradient-based updates to the model weights.
Why does Chain-of-Thought prompting improve accuracy on code analysis tasks?
CoT forces the model to show intermediate reasoning steps, making it harder to skip to wrong conclusions. Each step constrains the next, creating a logical chain. This mirrors how humans analyze code: step by step rather than jumping to a verdict.
When would you use RAG instead of including all code in the prompt?
When the codebase exceeds the context window, when you need domain-specific knowledge the model was not trained on, or when information changes frequently. RAG retrieves only the relevant snippets, keeping prompts focused and within token limits.
Module 5 · Slide 29
Lab Challenge: Prompt Engineering Competition
Put your prompt engineering skills to the test. Design the best prompt for a code generation task and analyze the results.
The Task
Given a code generation challenge (implement a sorting algorithm with specific requirements), design the best possible prompt to get correct, efficient, and readable code from an LLM.
Requirements
Implement a function that sorts a list of dictionaries by a specified key, with support for:
• Ascending and descending order
• Handling missing keys gracefully
• Stable sort behavior
• Type checking and error handling
• Comprehensive docstring and type hints
Submit
1) Your prompt 2) The generated output 3) Test results (write & run tests) 4) 1-paragraph analysis of your prompt design choices
Bonus
Try 3 or more prompting strategies (zero-shot, few-shot, CoT, etc.) and compare results. Which strategy produced the best code? Why?
Grading Criteria
Prompt Quality (30%): Specificity, structure, use of techniques from this module. Output Correctness (30%): Does the generated code actually work? Test Coverage (20%): Did you write meaningful tests? Analysis Depth (20%): Insightful reflection on what worked and why.
Module 5 · Slide 30
What's Next
This module covered prompting and in-context learning for SE automation. Here is where the journey continues.
Next: RAG
Retrieval-Augmented Generation: automatically selecting the most relevant demonstrations from a corpus to build optimal prompts.
Later: Hallucinations
When models confidently generate wrong code. Understanding, detecting, and mitigating hallucinated outputs.
Later: Agents
Multi-step LLM workflows for SE tasks. Combining prompting with tool use, planning, and iterative refinement.
Final Reflection
Prompting is not just asking a model a question. It is a systematic technique for lightweight adaptation — a way to steer powerful pre-trained models toward specific tasks using carefully structured context, without ever changing a single weight.