Guide
Designing High-Signal Prompts & Evaluations
This guide is about making models predictable. You'll learn how to design prompts that are clear, testable, and grounded in your domain—and how to wrap them in evaluation harnesses so you can tell if changes are making things better or worse.
You don't need to become a prompt "artist". You just need a small set of patterns, a way to capture examples, and a habit of running checks before you ship changes.
What we’ll cover
Use this to make your prompts less mystical and more like regular, testable software.
Prompt foundations
- Roles, instructions, and constraints
- Few-shot examples that actually help
- Structuring outputs
Evaluation harnesses
- Golden sets and test cases
- Automatic vs. human review
- When to re-run evals
Operating prompts
- Versioning and rollbacks
- Logging and feedback
- Collaborating with non-technical teams
1. Prompt foundations: structure beats cleverness
Most flaky behavior comes from vague prompts, not bad models.
A solid prompt usually has four parts:
- Role: who the model is pretending to be.
- Task: what you want, in one clear sentence.
- Constraints: format, tone, length, don't-dos.
- Examples: a few labeled inputs and outputs.
You want the model to have less room to guess. The more you can show with examples instead of adjectives, the better.
Example: classifying support tickets
Instead of: "You are an AI that classifies emails into categories."
Try something like:
You are a support triage assistant for a SaaS company.

Task:
- Read the email body and assign ONE category.

Categories:
- BUG
- BILLING
- HOW_TO
- OTHER

Constraints:
- Answer with ONLY the category name.
- If you aren't sure, answer OTHER.

Examples:
EMAIL: "I'm getting a 500 error when I try to upload a CSV."
CATEGORY: BUG

EMAIL: "Can you explain how to invite a new team member?"
CATEGORY: HOW_TO

EMAIL: "Our card was charged twice this month."
CATEGORY: BILLING
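The four parts (role, task, constraints, examples) can be assembled mechanically rather than hand-edited each time. A minimal Python sketch; `build_prompt` and its parameters are illustrative helpers, not a real library API:

```python
# Sketch: compose a triage prompt from its four parts.
# All names here are hypothetical; adapt to your own codebase.

def build_prompt(role, task, categories, constraints, examples, email_body):
    """`examples` is a list of (email, category) pairs."""
    lines = [role, "", "Task:", f"- {task}", "", "Categories:"]
    lines += [f"- {c}" for c in categories]
    lines += ["", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", "Examples:"]
    for email, category in examples:
        lines += [f'EMAIL: "{email}"', f"CATEGORY: {category}", ""]
    # The trailing "CATEGORY:" cues the model to answer with just the label.
    lines += [f'EMAIL: "{email_body}"', "CATEGORY:"]
    return "\n".join(lines)

prompt = build_prompt(
    role="You are a support triage assistant for a SaaS company.",
    task="Read the email body and assign ONE category.",
    categories=["BUG", "BILLING", "HOW_TO", "OTHER"],
    constraints=["Answer with ONLY the category name.",
                 "If you aren't sure, answer OTHER."],
    examples=[("I'm getting a 500 error when I try to upload a CSV.", "BUG")],
    email_body="Our card was charged twice this month.",
)
```

Keeping the parts as data also makes it trivial to diff and version them later.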
2. Few-shot examples and structured outputs
If you want consistent behavior, show consistent examples.
Few-shot prompting is just "here are a few examples, do it like this." The key is to keep them:
- Representative of real data
- Diverse enough to cover important edge cases
- Clean and clearly labeled
For outputs, prefer structured formats like JSON or well-labeled sections. Structured output is much easier to wire into a workflow and to evaluate later.
Pattern: "decision JSON"
Ask the model to respond with a small JSON object you can parse:
Return a JSON object with this shape:
{
"category": "BUG | BILLING | HOW_TO | OTHER",
"priority": "LOW | MEDIUM | HIGH",
"confidence": 0-100,
"notes": "short explanation"
}
Even if the JSON is occasionally imperfect, it's still far easier to work with than unstructured paragraphs.
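Because the JSON can occasionally be imperfect, it's worth parsing defensively. A minimal sketch using the category set from the triage example; `parse_decision` and its fallback values are illustrative choices:

```python
import json

# Assumed category set, matching the triage example above.
VALID_CATEGORIES = {"BUG", "BILLING", "HOW_TO", "OTHER"}

def parse_decision(raw: str) -> dict:
    """Parse the model's decision JSON; fall back to a safe default on failure."""
    fallback = {"category": "OTHER", "priority": "LOW",
                "confidence": 0, "notes": "unparseable model output"}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    # Reject outputs that invent a category outside the allowed set.
    if data.get("category") not in VALID_CATEGORIES:
        return fallback
    return data

good = parse_decision(
    '{"category": "BUG", "priority": "HIGH", "confidence": 85, "notes": "500 on upload"}'
)
bad = parse_decision("Sorry, I think this is probably a bug.")  # falls back to OTHER
```

The fallback route also gives you something to count: a rising rate of unparseable outputs is itself a useful eval signal.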
3. Building an evaluation harness
Prompts without tests are like code without tests: fine, until they’re not.
An evaluation harness is just a repeatable way to answer: "Did this change make things better or worse?"
Start simple with a golden set:
- 20–50 real examples
- The expected output for each (label, category, etc.)
- A small script or notebook to run the model and compare
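The golden-set loop above fits in a few lines of Python. In this sketch `classify` is a keyword-based stand-in for your actual model call, and the golden cases are illustrative:

```python
# Minimal golden-set harness sketch. Replace `classify` with a real model
# call; load GOLDEN_SET from real, labeled tickets.

GOLDEN_SET = [
    {"email": "I'm getting a 500 error when I try to upload a CSV.", "expected": "BUG"},
    {"email": "Our card was charged twice this month.", "expected": "BILLING"},
    {"email": "Can you explain how to invite a new team member?", "expected": "HOW_TO"},
]

def classify(email: str) -> str:
    # Placeholder stand-in for the model; keyword rules only for the demo.
    if "error" in email:
        return "BUG"
    if "charged" in email:
        return "BILLING"
    return "HOW_TO"

def run_eval(cases) -> float:
    results = [(c, classify(c["email"]) == c["expected"]) for c in cases]
    passed = sum(ok for _, ok in results)
    print(f"{passed}/{len(cases)} passed")
    for case, ok in results:
        if not ok:
            print(f"FAIL: {case['email'][:60]!r} expected {case['expected']}")
    return passed / len(cases)

accuracy = run_eval(GOLDEN_SET)
```

Printing the failures, not just the score, is what makes the harness useful: each failure is a candidate for a new example or constraint in the prompt.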
For generation tasks, you won't always have a single "correct" answer. In those cases, you can:
- Use another model to score (carefully)
- Use heuristics (length, presence of key fields)
- Do human review on a smaller sample
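The heuristic option is the cheapest to script. A sketch, where the length bounds and required fields are arbitrary choices you would tune per workflow:

```python
# Heuristic checks for generation tasks with no single correct answer.
# min_len/max_len and the required fields are illustrative defaults.

def heuristic_score(output: str, required_fields: list[str],
                    min_len: int = 40, max_len: int = 2000) -> dict:
    checks = {
        "length_ok": min_len <= len(output) <= max_len,
        "fields_present": all(f in output for f in required_fields),
    }
    checks["passed"] = all(checks.values())
    return checks

draft = ("Summary: duplicate charge reported.\n"
         "Next step: refund and confirm with the customer.")
result = heuristic_score(draft, required_fields=["Summary:", "Next step:"])
```

Heuristics won't tell you a draft is good, only that it isn't obviously broken, which is exactly the smoke-test role described below.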
When to run evals
- Before changing prompts in production
- When switching models or providers
- Before big launches or campaigns
- On a schedule (weekly / monthly) for key workflows
Think of evals as a smoke test. They don't have to be perfect, just good enough to catch obvious regressions.
4. Operating prompts in production
Prompts are living artifacts—treat them like code, not one-off experiments.
Once a prompt is live, you'll want a bit of hygiene:
- Version prompts with comments on why you changed them
- Tie versions to eval runs and results
- Keep a short change log people can actually read
- Make it easy for users to flag bad outputs
This doesn't require a huge platform. A Git repo, a small admin UI, or even a well-kept doc can go a long way at small scale.
Collaborating across teams
Your best prompts often come from pairing:
- Someone close to the customer or domain
- Someone who understands model behavior and limits
- Someone who owns the workflow / metrics
Use a shared workspace (Notion, Confluence, repo) where prompts, examples, and decisions live together. Make it searchable. Future you will be grateful.
FAQ: prompts & evaluations
Questions teams ask once they realize they need more than trial-and-error.
How many examples do we need for a good golden set?
Start with 20–50. More is nice, but even a small, curated set is far better than nothing. Expand over time as you see new patterns and edge cases.
Should we let another model grade our outputs?
Model-graded evals are powerful, but they're still fallible. Use them for quick comparisons, but periodically spot-check with humans—especially for anything high risk.
How often should we update prompts?
As rarely as you can while still improving. Frequent, untracked changes are hard to debug. Batch changes, run evals, then ship with a note on what you expect to happen.
Want help designing prompts & evals for your product?
We work with small teams and startups to turn ad-hoc prompts into repeatable, testable systems—complete with golden sets, evaluation harnesses, and change management.
If you'd rather skip months of trial-and-error, we can design the patterns and playbooks with your team and leave you with a setup you own.
Typical work includes a short discovery, a designed prompt + eval stack, and hands-on sessions with your team.