CodeLab · Module 5 (Interactive) · Prompting LLMs for Software Development Automation
Module 5 · Slide 01
Prompting LLMs for Software Development Automation
Focus on In-Context Learning (ICL), few-shot prompting, and example selection for SE tasks. Learn how to adapt large language models without retraining them.
Learning Objectives
Objective 1
Understand prompting as adaptation without parameter updates
Objective 2
Apply few-shot prompting to software engineering tasks
Objective 3
Explain why example selection matters for prompt quality
Key Terms
PLM
Pre-trained Language Model: a model trained on large corpora before task-specific use
ICL
In-Context Learning: the model learns from examples provided directly in the prompt
Few-Shot
Providing a small number of input-output demonstrations in the prompt
Zero-Shot
Prompting with no demonstrations, only a task description
Demonstration
An input-output pair included in the prompt to illustrate the task pattern for the model
Module 5 · Slide 02
What is Prompting?
Prompting guides a pre-trained model to perform a task by structuring its input, for example by embedding input-output examples directly in the prompt. Instead of retraining, the user shapes what the model sees so that it infers the task pattern from context.
Key Insight
Instead of fine-tuning a pre-trained model, we provide: a task description, a few demonstrations of input-output behavior, and a target input for which the model generates the output.
Task Description
+
Demonstrations
+
Target Input
→
Model Output
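In code, this assembly is plain string concatenation. A minimal sketch (the task wording and demonstration texts here are illustrative, not from a specific API):

```python
def build_prompt(task_description, demonstrations, target_input):
    """Assemble a few-shot prompt: task description, then demos, then the target."""
    parts = [task_description]
    for demo_in, demo_out in demonstrations:
        parts.append(f"Input: {demo_in}\nOutput: {demo_out}")
    # The target input ends with an open "Output:" for the model to complete.
    parts.append(f"Input: {target_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Fix the bug in each Java method.",
    [("int divide(int a, int b) { return a / b; }",
      "int divide(int a, int b) { if (b == 0) throw new IllegalArgumentException(); return a / b; }")],
    "String getElement(String[] arr, int idx) { return arr[idx]; }",
)
print(prompt)
```

The same builder serves any task: only the description and the demonstration pairs change.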
Training / Fine-Tuning
Changes the model. Parameter updates via backpropagation over labeled data. The model itself is modified.
Prompting / ICL
Changes the situation. The model stays frozen. We change what the model sees at inference time. Both are forms of adaptation.
Module 5 · Slide 03
Prompting in Action: Bug Fixing
A concrete SE automation task: given a buggy method, generate the fixed version. We show the model how via demonstrations.
Demo 1 — Buggy
public int divide(int a, int b) { return a / b; }
Demo 1 — Fixed
public int divide(int a, int b) { if (b == 0) throw new IllegalArgumentException("divisor is zero"); return a / b; }
Demo 2 — Buggy
public void setUsername(String name) { this.username = name; }
Demo 2 — Fixed
public void setUsername(String name) { if (name == null || name.isEmpty()) throw new IllegalArgumentException("invalid name"); this.username = name; }
Pattern
Both demonstrations show the same pattern: add a guard clause that validates input before proceeding. The model should learn to apply this pattern to new buggy methods.
Module 5 · Slide 04
Interactive: Predict the Model Output
Given the demonstrations from the previous slide, what would the model generate for this new buggy method?
New Buggy Method
public String getElement(String[] arr, int idx) { return arr[idx]; }
Select the most likely fix Interactive
A) Add synchronized keyword to the method — Incorrect
B) Add bounds check: if (arr == null || idx < 0 || idx >= arr.length) throw new IllegalArgumentException(...) — Correct
C) Change return type from String to Optional<String> — Incorrect
D) Wrap the body in a try-catch block and return null — Incorrect
Explanation
The model extracts the pattern: add input validation before the operation. Both demos added guard clauses with IllegalArgumentException. The model applies this same pattern — checking the array and index bounds before accessing the element.
Module 5 · Slide 05
Prompting vs Fine-Tuning
Two different strategies for adapting a pre-trained model to a downstream task. Understanding when to use each is critical.
Prompting / ICL
No parameter updates
No backpropagation
Adapts on the fly at inference
Works with just a few examples
Same model for multiple tasks
No training infrastructure needed
vs
Fine-Tuning
Updates model parameters
Requires backpropagation
Needs a training phase with epochs
Requires labeled training data
Produces a task-specific model
Needs GPUs and training pipeline
Key Distinction
Prompting adapts the input. Fine-tuning adapts the model. Both are valid forms of adaptation, but they have very different costs, requirements, and tradeoffs.
Module 5 · Slide 06
Interactive: Classify the Approach
For each statement, decide whether it describes Prompting/ICL or Fine-Tuning.
Match each statement Interactive
Requires gradient updates
Adapts at inference time
Needs a training phase
Can work with just a few demonstrations
Updates model weights
Uses the same model for multiple tasks
Requires a labeled dataset
No backpropagation at inference
Module 5 · Slide 07
Pros of Prompting
Why has prompting become so popular for SE automation? Three key advantages.
Data Scarcity
Works with just a few demonstrations. No labeled dataset required. Critical for niche SE tasks where data is scarce.
AI Democratization
Leverage large pre-trained models without building or maintaining custom training pipelines. Lower barrier to entry.
Strong Performance
State-of-the-art across many SE tasks without task-specific training. Often competitive with fine-tuned models.
Important Nuance
This does NOT mean prompting always beats fine-tuning. It is a flexible, low-friction, often high-performing strategy — but the best approach depends on the task, data availability, and deployment context.
Module 5 · Slide 08
Cons of Prompting
Prompting is not without significant limitations. The primary tradeoff involves computational cost and model dependency.
Primary Con: Computational Cost
Prompting requires very large models (billions of parameters) to work effectively. These models are expensive to host, serve, and query. Smaller models often fail at in-context learning.
Less Adaptation Overhead
↔
More Dependence on Large Models
B+
Billions of parameters required
$$
API costs per query
Reflection
Is prompting cheaper than fine-tuning? Sometimes yes, sometimes no — it depends on usage frequency, API vs. self-hosted deployment, and model size. For infrequent use, prompting is often cheaper. For high-volume production, fine-tuning a smaller model may be more cost-effective.
Module 5 · Slide 09
Interactive: Decision Scenario
For each scenario, decide the best initial strategy. Click your choice to see the reasoning.
Choose the strategy Interactive
Scenario 1
You have 12 examples of a niche SE task (e.g., translating domain-specific DSL to Java). No large training set exists.
Scenario 2
You have 50,000 labeled code-summary pairs and a dedicated GPU cluster for training.
Scenario 3
You need to support 5 different SE tasks (bug fix, summarization, code review, test generation, documentation) with one model deployment.
Module 5 · Slide 10
Chain-of-Thought Prompting Deep Dive
Chain-of-Thought (CoT) prompting asks the model to show its reasoning step by step, dramatically improving accuracy on complex analysis tasks.
Standard Prompt
Q: "Is this code thread-safe?"
A: "No."
The model jumps to a conclusion without showing its work. Hard to verify, easy to be wrong.
CoT Prompt
Q: "Is this code thread-safe? Let's analyze step by step."
A: "Step 1: Identify shared state — counter is accessed by multiple threads.
Step 2: Check for synchronization — no locks or atomic operations found.
Step 3: Look for race conditions — increment is read-modify-write, not atomic.
Conclusion: No, because the shared counter has no synchronization."
25-40%
Accuracy improvement on code analysis
2022
Introduced by Wei et al. (Google)
100B+
Parameters needed for CoT to emerge
Why It Works
CoT works because it forces the model to show intermediate reasoning, making it harder to skip to wrong conclusions. Each step constrains the next, creating a logical chain that reduces errors on multi-step problems.
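A few-shot CoT prompt is assembled like any other: prefix a worked reasoning demo, then end the target question with the same step-by-step trigger. A minimal sketch, with the demo condensed from the thread-safety example above:

```python
# Worked demo: shows the model the step-by-step answer format to imitate.
COT_DEMO = (
    "Q: Is this counter thread-safe? Let's analyze step by step.\n"
    "A: Step 1: Identify shared state.\n"
    "Step 2: Check for synchronization.\n"
    "Step 3: Look for race conditions.\n"
    "Conclusion: No, the shared counter has no synchronization.\n"
)

def cot_prompt(question, code):
    """Prefix the reasoning demo, then pose the target question with the
    same 'step by step' trigger so the model reproduces the format."""
    return (f"{COT_DEMO}\n"
            f"Q: {question} Let's analyze step by step.\n"
            f"Code:\n{code}\nA:")

p = cot_prompt("Is this code thread-safe?", "counter += 1")
```

The trigger phrase and demo structure are the whole technique; no model-side change is involved.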
Module 5 · Slide 11
Prompt Engineering Best Practices
A systematic framework for writing effective prompts. Each principle is a lever you can pull to improve output quality.
1
Be Specific
Bad: "Write a function" → Good: "Write a Python function that takes a list of integers and returns the second largest unique value"
2
Provide Context
Include imports, class structure, coding style, and framework details. The model cannot read your mind or your codebase.
3
Specify Output Format
Request JSON output, docstring format, test structure, or specific coding conventions. Ambiguity in format wastes iterations.
4
Use Delimiters
Triple backticks, XML tags, or markers to separate code from instructions. Prevents the model from confusing code with prompt text.
5
Include Examples
One good input-output example is worth more than ten words of explanation. Show, don't just tell.
6
Iterate & Refine
Your first prompt is rarely perfect. Examine the output, identify gaps, and refine. Prompt engineering is an iterative process.
Rule of Thumb
If you have to explain to the model what went wrong after seeing the output, that explanation should have been in your original prompt. Every clarification is a missed constraint.
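Principles 3 (specify output format) and 4 (use delimiters) combine naturally in a small helper. A minimal sketch; the instruction wording and JSON field names are illustrative:

```python
FENCE = "`" * 3  # triple backticks, built here so the fence reads clearly

def review_prompt(code):
    """Delimit the code with triple backticks (principle 4) and pin the
    output format explicitly (principle 3) so nothing is ambiguous."""
    return (
        "Review the Java method between the triple backticks for bugs.\n"
        'Respond only with JSON: {"issue": "...", "fix": "..."}.\n'
        f"{FENCE}java\n{code}\n{FENCE}"
    )

p = review_prompt("int divide(int a, int b) { return a / b; }")
```

The delimiters keep the model from reading code as instructions; the format line saves a clarification round-trip.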
Module 5 · Slide 12
Case Study: Code Summarization
A research study on using few-shot prompting for automatic code summarization. Can we generate natural-language summaries of code methods using only a few demonstrations?
Motivation
Software projects exhibit project-specific linguistic phenomena: unique identifier naming conventions, domain-specific APIs, and coding patterns. A model trained on general code may miss these nuances.
Project A Spring Boot APIs
Project B Android UI code
Project C Data pipelines
Different projects produce different vocabulary, APIs, and code patterns.
Research Questions
RQ1: Can few-shot learning effectively extend to code summarization? RQ2: Does selecting demonstrations from the same project improve performance?
Module 5 · Slide 13
Few-Shot Prompt Structure
The anatomy of a few-shot prompt for code summarization. Each demonstration pairs a code method with its summary.
1
Demo 1
// Code: public void setUsername(String name) { if (isValid(name)) this.user = name; } // Summary: Set the user to username if the provided name is valid
2
Demo 2
// Code: public String getUsername(int id) { if (id >= 0 && id < users.size()) return users.get(id); return null; } // Summary: Get user if it is in range
...
Demos 3 through k
Additional code-summary pairs. The study scales up to 10-shot (10 demonstrations per prompt).
T
Target Code
// Code: public List<String> getActiveUsers() { return users.stream().filter(User::isActive).collect(Collectors.toList()); } // Summary: ??? ← model generates this
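This structure reduces to a small builder over (code, summary) pairs. A minimal sketch of a k-shot prompt assembler in the // Code / // Summary layout above (names are illustrative):

```python
def summarization_prompt(demos, target_code, k=10):
    """Build a k-shot code-summarization prompt from (code, summary) pairs,
    ending with an open '// Summary:' for the model to complete."""
    parts = [f"// Code: {code}\n// Summary: {summary}" for code, summary in demos[:k]]
    parts.append(f"// Code: {target_code}\n// Summary:")
    return "\n\n".join(parts)

demos = [
    ("public void setUsername(String name) { if (isValid(name)) this.user = name; }",
     "Set the user to username if the provided name is valid"),
    ("public String getUsername(int id) { ... }",
     "Get user if it is in range"),
]
p = summarization_prompt(demos, "public List<String> getActiveUsers() { ... }")
```

Scaling from 1-shot to 10-shot is just a longer `demos` list; the study's k is a parameter, not a structural change.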
Module 5 · Slide 14
Interactive: Build a Prompt
Assemble a valid few-shot prompt by clicking the blocks in the correct order. One block is a distractor that should not be included.
Click blocks in order Interactive
AVAILABLE BLOCKS
Task: Given a Java method, generate a one-line summary.
Demo 1: setUsername(...) → "Set the user to username if valid"
Demo 2: getUsername(...) → "Get user if it is in range"
Target: getActiveUsers() → ???
System: Always respond in JSON format with error codes.
YOUR PROMPT (click to place)
Slot 1 — click a block
Slot 2
Slot 3
Slot 4
Module 5 · Slide 15
Interactive: Prompt Playground
Explore how different prompting strategies affect output quality. Select a scenario, then compare a basic prompt with an optimized one.
Prompt Playground Interactive
BASIC PROMPT
Select a scenario above to begin.
BASIC OUTPUT
—
OPTIMIZED PROMPT
—
OPTIMIZED OUTPUT
—
Module 5 · Slide 16
Results: What the Research Found
Four key observations from studying few-shot code summarization with large language models.
A
10-shot Codex outperforms all fine-tuned models
B
Zero-shot & one-shot do NOT work well
C
Same-project demos improve performance
D
Larger models benefit more from few-shot
Observation A & B
With 10 demonstrations, Codex surpasses fine-tuned baselines. But with 0 or 1 demo, performance drops significantly. Code summarization requires enough examples to convey the expected style and level of detail.
Observation C & D
Same-project demonstrations consistently outperform cross-project ones, because they share naming conventions and API patterns. Larger models (e.g., Codex vs. CodeGen) leverage demonstrations more effectively.
Takeaway
Few-shot is not magic. It depends on model scale, prompt quality, and demonstration relevance. Getting all three right is what produces strong results.
Module 5 · Slide 17
Interactive: Interpret the Results
Why does 10-shot help while 0-shot and 1-shot fail for code summarization? Select all correct explanations.
Multi-select quiz Interactive
The task requires project-specific conventions that a single example cannot convey — Correct
A single example does not expose enough variation in summary style and length — Correct
Multiple demos help the model infer the expected phrasing style and level of detail — Correct
10 examples are enough to fine-tune the model internally — Incorrect: ICL does not fine-tune
Explanation
The first three are correct: code summarization needs sufficient demonstrations to convey project-specific naming, summary style, and detail level. The last option is wrong because ICL never updates model weights — the model is not fine-tuned by the examples in the prompt.
Module 5 · Slide 18
Shot Selection Matters
It is not just about how many examples you provide — it is about which ones. This is a major transition in understanding few-shot prompting.
Key Insight
Bad examples waste prompt budget. Relevant examples substantially improve performance. The quality of your demonstrations often matters more than the quantity.
Random / Cross-Project Selection
Demonstrations pulled from unrelated projects. Different naming conventions, APIs, and coding styles. The model receives noisy signals about expected output.
Careful / Same-Project Selection
Demonstrations from the same project or similar codebases. Shared vocabulary, consistent style. The model receives clear, aligned signals.
What This Motivates
If the right examples matter so much, how do we automatically find them? This motivates the idea of retrieval — searching a corpus for the most relevant demonstrations given a target input.
Module 5 · Slide 19
Interactive: Rank the Shots
You need 3 demonstrations for a code summarization prompt. The target method processes HTTP requests using Spring Boot. Select the 3 best candidates from the 6 below.
Target Method
public ResponseEntity<User> handleGetUser(@PathVariable Long id) { return userService.findById(id).map(ResponseEntity::ok).orElse(ResponseEntity.notFound().build()); }
Select the 3 best demos Interactive
handlePostOrder(@RequestBody Order o) — Spring REST controller, same project
sortArray(int[] arr) — Utility method from an algorithms library
handleDeleteUser(@PathVariable Long id) — Same REST controller class
readFile(String path) — File I/O utility, different project
handleUpdateUser(@PathVariable Long id, @RequestBody User u) — Same REST controller
calculateTax(double amount) — Business logic helper, unrelated domain
Module 5 · Slide 20
Why Selection is a Retrieval Problem
Connecting shot selection to information retrieval. Finding the best demonstrations is essentially a search problem.
Target Task
→
Search Corpus
→
Best Demos
→
Few-Shot Prompt
The Retrieval Framing
Given a target input, find the most relevant demonstrations from a corpus of examples. This is the same fundamental problem as document retrieval in search engines, but applied to prompt construction.
Textual Overlap
Token-level similarity (e.g., BM25, Jaccard) between target code and candidate demos
AST Similarity
Structural similarity of abstract syntax trees, capturing code structure beyond surface text
Embedding Cosine
Semantic similarity using neural embeddings, capturing meaning even with different tokens
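Textual overlap is the simplest of the three to implement. A minimal Jaccard-over-identifiers sketch; BM25, AST, and embedding retrievers all plug into the same select-top-k shape:

```python
import re

def tokens(code):
    """Identifiers and keywords in the code, as a set, for overlap scoring."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of the token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_shots(target, candidates, k=3):
    """Rank candidate demonstrations by token overlap with the target; keep top k."""
    return sorted(candidates, key=lambda c: jaccard(target, c), reverse=True)[:k]
```

Swapping `jaccard` for an AST or embedding similarity changes only the scoring function, not the retrieval loop.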
Transition
We now understand that example quality matters as much as quantity. How do we retrieve the most relevant examples automatically? This leads to Retrieval-Augmented Generation (RAG) — the topic of the next module.
Module 5 · Slide 21
RAG: Retrieval-Augmented Generation
RAG combines retrieval with generation: instead of hoping the model knows your codebase, retrieve relevant snippets and include them in the prompt.
1
Problem
The LLM does not know your codebase, your APIs, or your coding conventions. It hallucinates when it guesses.
2
Solution
Retrieve relevant code snippets at query time and inject them into the prompt as context before generation.
3
Pipeline
Query → Embed query → Search vector database → Retrieve top-k docs → Concatenate with prompt → Send to LLM.
Query
→
Embed
→
Vector DB
→
Top-k Docs
→
LLM + Context
When to Use RAG
Large codebases that exceed context limits. Domain-specific knowledge the model was not trained on. Up-to-date info that changes frequently (docs, APIs).
Key Insight
RAG lets you give the LLM relevant knowledge without fine-tuning. It is the most practical approach for enterprise code assistance and connects directly to the shot-selection problem we studied.
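The pipeline can be sketched end to end with a toy embedding. Here `embed` is a bag-of-words stand-in for a real neural encoder, and a plain list stands in for the vector database:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a neural embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_prompt(query, corpus, k=2):
    """Slide pipeline: embed the query, rank the corpus, inject top-k as context."""
    q = embed(query)
    top = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
    return "Context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"
```

A production system replaces `embed` with a trained encoder and the `sorted` scan with an approximate-nearest-neighbor index, but the shape of the pipeline is the same.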
Module 5 · Slide 22
Context Window Management
The context window is the maximum number of tokens a model can process at once. Managing it effectively is critical for real-world SE tasks.
128K
GPT-4 Turbo
200K
Claude 3.5
100K
Code Llama
1M
Gemini 1.5
1
Why It Matters
You cannot fit an entire codebase into a single prompt. Even a 200K-token window holds only on the order of tens of thousands of lines of code; most projects are larger.
2
Chunking
Split large files into logical chunks (functions, classes, modules). Process each chunk independently or in sequence.
3
RAG / Selective Retrieval
Retrieve only the relevant code snippets instead of including everything. This is the most common production strategy.
4
Summarization
Compress prior context into summaries. Use the model to summarize earlier parts before feeding new content.
Practical Reality
Context window management is the single most important practical skill for using LLMs on real codebases. Models degrade on very long contexts even when they technically fit — important information in the middle gets overlooked ("lost in the middle" effect).
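Strategy 2 (chunking) is straightforward for languages with a parser. A minimal sketch that splits a Python source file into per-function chunks using the standard `ast` module:

```python
import ast

def chunk_by_function(source):
    """Split Python source into per-function chunks, one chunk per
    top-level function, using exact source spans from the parser."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
```

Each chunk can then be embedded, retrieved, or summarized independently; class- or module-level chunking follows the same pattern with different node types.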
Module 5 · Slide 23
Tool Use & Function Calling
Modern LLMs can request to call external tools — transforming them from text generators into agents that interact with the real world.
What Is Tool Use?
The LLM can request to call external functions:
→ Web search → Code execution & testing → Database queries → API calls & CI/CD
"tool": "run_tests", "args": "test_sort.py"
Why It Matters for SE
Iterative development loop:
1. LLM generates code
2. Calls run_tests() tool
3. Sees test failures
4. Fixes the code
5. Calls run_tests() again
6. All tests pass
This is how tools like Copilot Workspace, Cursor, and Claude Code work.
Key Insight
Tool use transforms LLMs from text generators into agents that can interact with the real world. Instead of guessing if code works, the model can run it and verify.
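The iterative loop above can be sketched generically. Here `generate_fix` and `run_tests` are stand-ins for the LLM call and the test-runner tool; real systems exchange structured tool-call messages rather than direct function calls:

```python
def agent_loop(generate_fix, run_tests, max_iters=5):
    """Generate code, run the tests, feed failures back, repeat.
    generate_fix(prev_code, failures) -> new code (the LLM call).
    run_tests(code) -> list of failure messages (the tool call)."""
    code, failures = None, None
    for _ in range(max_iters):
        code = generate_fix(code, failures)   # LLM proposes code, sees prior failures
        failures = run_tests(code)            # tool call: execute the test suite
        if not failures:
            return code                       # all tests pass: done
    return code                               # best effort after max_iters
```

The loop terminates on success or on an iteration budget; production agents add logging, cost limits, and rollback on regression.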
Module 5 · Slide 24
Prompt Chaining & Multi-Step Workflows
Complex tasks benefit from decomposition. Instead of one massive prompt, chain multiple focused prompts where each step's output feeds the next.
Prompt 1 Analyze bugs
→
Output 1
→
Prompt 2 Suggest fixes
→
Output 2
→
Prompt 3 Generate tests
→
Final Result
Single Prompt Approach
"Analyze this code for bugs, suggest fixes for each bug, and generate tests for the fixed code."
Result: Lower quality. The model tries to do everything at once, often missing bugs or generating inconsistent fixes and tests.
Chained Approach
Each prompt is focused on one task. The output of step 1 becomes input to step 2.
Result: Higher quality. Each step can be optimized independently. Errors are caught earlier in the pipeline.
Design Principle
Complex tasks benefit from decomposition. Each prompt in the chain can be optimized independently, tested separately, and debugged in isolation. This is the foundation of agentic workflows.
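The chain reduces to three focused calls, each consuming the previous output. A minimal sketch; `llm` stands in for a single model call (prompt in, text out):

```python
def chain(llm, code):
    """Three focused prompts; each step's output feeds the next step's input."""
    bugs  = llm(f"List the bugs in this code:\n{code}")
    fixes = llm(f"Suggest a fix for each bug:\n{bugs}")
    tests = llm(f"Write unit tests covering these fixes:\n{fixes}")
    return bugs, fixes, tests
```

Because each step is a separate call, each prompt can be A/B tested and debugged in isolation, which is exactly the design principle above.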
Module 5 · Slide 25
Self-Consistency & Majority Voting
Generate multiple responses to the same prompt and take the majority vote. Trade compute for accuracy.
1
Generate N Responses
Send the same prompt N times with temperature > 0 so each response uses a different reasoning path.
2
Each Response Reasons Differently
With non-zero temperature, the model explores different solution strategies, variable names, and code structures.
3
Extract Final Answers
From each response, extract the final answer (the generated code, the bug diagnosis, the classification).
4
Majority Vote
Take the answer that appears most frequently, or run tests on all solutions and pick the best performer.
Path 1: Yes
Path 2: Yes
Path 3: No
Path 4: Yes
Path 5: No
→
Yes (3/5)
Connection to Code Evaluation
Self-consistency trades compute for accuracy. It is the basis of pass@k evaluation from Module 3: generate k solutions, check if any pass all tests. In production, generate 5 code solutions, run tests on all, and pick the one that passes the most.
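Steps 1 through 4 fit in a few lines. A minimal sketch; `sample` stands in for one model call made with temperature > 0:

```python
from collections import Counter

def self_consistent_answer(sample, n=5):
    """Draw n answers from the model and majority-vote the final answer.
    Returns (winning answer, vote count)."""
    votes = Counter(sample() for _ in range(n))
    return votes.most_common(1)[0]
```

For code generation, the vote is often replaced by running tests on all n candidates and keeping the best performer, as the slide notes.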
Module 5 · Slide 26
Evaluating Prompt Effectiveness
How do you know if your prompts are working? Systematic evaluation separates prompt engineering from guesswork.
Automated Metrics
pass@k: does generated code pass tests? BLEU: similarity to reference output. Test pass rate: percentage of generated tests that execute correctly.
A/B Testing
Compare two prompt versions on the same set of inputs. Measure which produces better outputs across multiple dimensions (correctness, style, completeness).
Error Analysis
Categorize failures: wrong logic, wrong syntax, wrong API usage, hallucinated functions. Each category suggests a different prompt improvement.
Regression Testing
Save your best prompts. When the underlying model updates, re-run your test suite to catch regressions. Prompts are fragile across model versions.
Worked Example
We tested 3 prompt variants for unit test generation:
Variant A (zero-shot): 45% test pass rate
Variant B (few-shot, 3 examples): 62% test pass rate
Variant C (few-shot + CoT + format spec): 78% test pass rate
Variant C combined examples with chain-of-thought reasoning and explicit output format constraints — each technique stacked to improve quality.
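The pass@k metric mentioned under Automated Metrics is usually computed with an unbiased estimator over n generated samples of which c pass all tests. A sketch of the standard formula from the code-generation evaluation literature:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n generations with c correct, passes all tests."""
    if n - c < k:
        return 1.0  # not enough failures to fill k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Generating n > k samples and estimating this way gives lower-variance numbers than literally drawing k samples once.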
Module 5 · Slide 27
Prompting Patterns for SE Automation
A summary of the major prompting strategies used in software engineering research and practice.
Pattern
Description
When to Use
Performance
Prompt Length
Zero-Shot
Task description only, no demos
Simple, well-known tasks
Variable
Short
One-Shot
Single demonstration
Tasks with clear patterns
Moderate
Short
Few-Shot
3–10 demonstrations
Complex SE tasks, code summarization
Strong
Medium
Chain-of-Thought
Step-by-step reasoning in demos
Debugging, code review, analysis
Strong
Long
Choosing a Pattern
Start with zero-shot to gauge baseline. If performance is insufficient, add demonstrations (few-shot). For tasks requiring logical reasoning, use chain-of-thought. Always consider the context window limit of your target model.
Module 5 · Slide 28
Recap & Knowledge Check
01
Prompting = Adaptation
Prompting adapts a model without updating its parameters. It changes the input, not the model.
02
Few-Shot Embeds Demos
Demonstrations are embedded directly in the prompt context. The model infers patterns from them.
03
CoT & Best Practices
Chain-of-Thought forces step-by-step reasoning. Specificity, context, and format specification are essential.
04
Shot Selection Matters
Which examples you choose is as important as how many. Relevance is the key signal.
05
RAG & Context Management
Retrieve relevant code for prompts. Manage context windows carefully on real projects.
06
Tool Use & Chaining
LLMs can call tools and chain prompts. This is the foundation of agentic SE workflows.
What distinguishes prompting from fine-tuning at a fundamental level?
Prompting changes the input; fine-tuning changes the model. Prompting does not update model parameters — it adapts behavior by structuring the context. Fine-tuning performs gradient-based updates to the model weights.
Why does Chain-of-Thought prompting improve accuracy on code analysis tasks?
CoT forces the model to show intermediate reasoning steps, making it harder to skip to wrong conclusions. Each step constrains the next, creating a logical chain. This mirrors how humans analyze code: step by step rather than jumping to a verdict.
When would you use RAG instead of including all code in the prompt?
When the codebase exceeds the context window, when you need domain-specific knowledge the model was not trained on, or when information changes frequently. RAG retrieves only the relevant snippets, keeping prompts focused and within token limits.
Module 5 · Slide 29
Lab Challenge: Prompt Engineering Competition
Put your prompt engineering skills to the test. Design the best prompt for a code generation task and analyze the results.
The Task
Given a code generation challenge (implement a sorting algorithm with specific requirements), design the best possible prompt to get correct, efficient, and readable code from an LLM.
Requirements
Implement a function that sorts a list of dictionaries by a specified key, with support for:
• Ascending and descending order
• Handling missing keys gracefully
• Stable sort behavior
• Type checking and error handling
• Comprehensive docstring and type hints
Submit
1) Your prompt 2) The generated output 3) Test results (write & run tests) 4) 1-paragraph analysis of your prompt design choices
Bonus
Try 3 or more prompting strategies (zero-shot, few-shot, CoT, etc.) and compare results. Which strategy produced the best code? Why?
Grading Criteria
Prompt Quality (30%): Specificity, structure, use of techniques from this module. Output Correctness (30%): Does the generated code actually work? Test Coverage (20%): Did you write meaningful tests? Analysis Depth (20%): Insightful reflection on what worked and why.
Module 5 · Slide 30
What's Next
This module covered prompting and in-context learning for SE automation. Here is where the journey continues.
Next: RAG
Retrieval-Augmented Generation: automatically selecting the most relevant demonstrations from a corpus to build optimal prompts.
Later: Hallucinations
When models confidently generate wrong code. Understanding, detecting, and mitigating hallucinated outputs.
Later: Agents
Multi-step LLM workflows for SE tasks. Combining prompting with tool use, planning, and iterative refinement.
Final Reflection
Prompting is not just asking a model a question. It is a systematic technique for lightweight adaptation — a way to steer powerful pre-trained models toward specific tasks using carefully structured context, without ever changing a single weight.