Meerkat · No. 02field guide · vol. 02
00 · cover · anatomy of a working prompt

What a workingprompt looks like.

A dissection. Four ingredients in the prompt, six classes of model behind it, and a way to figure out which one fits your job.

01 · the four parts
role · task · context · rules

Every working prompt is four bones in a row.

part 01 · ROLE

Hire the model.

Who is writing.

The model collapses to whoever you say it is. A vague prompt asks 'an AI' to do the work — you get an AI's median answer. Hand it a job title with judgment and it stops auditioning.

before

you are a helpful AI assistant

after

You are a careful editor with a bias toward plain language.

why · Specifics inherit specifics. Generic role → generic output.

part 02 · TASK

Name the job.

The verb.

One sentence. One verb. If you cannot say it in plain English, the model cannot do it at all. The task line is where most prompts go to die — buried under hedges and qualifications.

before

can you maybe help me with my email and like, make it better

after

Rewrite the email below for clarity and a calmer tone.

why · A real verb (rewrite, summarize, extract, draft) gives the model an action. 'Help' gives it nothing.

part 03 · CONTEXT

Frame the room.

Audience and constraints.

Who is reading. How long. What format. The boring details that decide whether it lands. Without context, the model defaults to the average reader of the average blog — which is rarely your audience.

before

make it good and professional

after

Audience: a board chair. Length: 3–4 sentences. Format: plain text, no markdown.

why · 'Professional' is a vibe. 'Board chair, 3 sentences, no markdown' is a target.

part 04 · RULES

Draw the lines.

Guardrails and tells.

What to do, what not to do, what counts as failure. Rules are how you say no without arguing later. Be specific — 'no clichés' is hand-waving; 'no AI clichés (delve, leverage, in today's fast-paced)' is operable.

before

don't be cringe

after

Lead with the most important point. No hedges. Banned words: delve, leverage, robust, in today's fast-paced.

why · Models love rules. They especially love rules they can check themselves against.

02 · smell tests

Six tests. No mercy.

Read your prompt out loud. Ask each question below. If any answer feels squishy, you found the bone that's about to break in production.

Most failures are visible before you ever hit run. You just have to look.

  1. test 01

    The role test

    Could a junior on the team read the role line and act on it?

    FIX Replace 'expert' with a job title plus a stance. 'Editor with a bias toward plain language' beats 'world-class writing expert' every time.

  2. test 02

    The verb test

    Is there exactly one action verb in the task line?

    FIX If you have two, split the prompt into two prompts. If you have zero, you wrote a wish, not a task.

  3. test 03

    The audience test

    Could the model produce a wrong-but-plausible answer for the wrong audience?

    FIX Name the reader. 'For a clinician' and 'for a 14-year-old' produce wildly different paragraphs from the same task.

  4. test 04

    The format test

    Did you say what shape the output should be?

    FIX JSON schema, sentence count, max words, bullet vs prose. The model will guess if you don't tell it, and it will guess wrong on the high-stakes one.

  5. test 05

    The failure test

    Did you tell it what 'bad' looks like?

    FIX Banned words, banned shapes, banned moves. 'No filler. No clichés. No "as an AI" disclaimers.' Three lines saves three rounds.

  6. test 06

    The grandma test

    Could you read the prompt to a smart human who's never seen the task and have them do it?

    FIX If they'd ask follow-up questions, the model is silently making those answers up.

03 · failure modes · the rogues' gallery
you have written all of these

Five ways your prompt is quietly lying to you.

01

The Pep Talk

symptom

'You are a world-class, award-winning, deeply experienced…'

root cause

Confusing motivation with instruction. The model has no morale.

cure

Cut the adjectives. Keep the role. Add the stance.

02

The Kitchen Sink

symptom

Twelve rules, six examples, three personas, and a poem.

root cause

Hedging against failure by piling on. The signal drowns.

cure

Pick the four rules that matter. Move the rest to a fallback prompt.

03

The Vibe Brief

symptom

'Make it good. Professional. But also fun. Not too long.'

root cause

Using adjectives where measurements belong.

cure

Replace each adjective with a number, an example, or a banned alternative.

04

The Buried Lede

symptom

The actual task appears in paragraph four, after the backstory.

root cause

Writing the prompt like a memo instead of a spec.

cure

Task line first. Context second. Rules third. Input last.

05

The Open Door

symptom

No format, no length, no failure conditions.

root cause

Trusting the model to read your mind. It can't.

cure

Close the door. State the shape. State what 'wrong' looks like.

04 · model bestiary · pick the right animal
there are six. mostly.

The model is half the prompt. Choose carefully.

A perfect prompt aimed at the wrong model still loses. These are the six classes you actually need to know — what they cost, where they shine, and the moment they quietly start lying to you.

class 01

Fast & cheap

The line cook. Volume work, instant answers.

small · 4–8B class · sub-second

  • openai/gpt-5-nano
  • anthropic/claude-haiku-4
  • google/gemini-3-flash
latency
≈ 300–800 ms
cost
fractions of a cent
ceiling
shallow reasoning

best for

  • Classification, routing, tagging
  • Extraction from clean inputs
  • Autocomplete, suggestions, rewrites in flight
  • First-pass filters before a slower model

not for

  • Anything requiring multi-step reasoning
  • Long-context analysis (>20k tokens)
  • Nuanced editorial judgment

If the task fits on a Post-it and the right answer is obvious to a human, this tier is the answer.

class 02

Balanced workhorse

The senior IC. Most production traffic lives here.

mid · general purpose

  • openai/gpt-5-mini
  • anthropic/claude-sonnet-4-5
  • google/gemini-3-pro
latency
≈ 1–3 s
cost
cents per call
ceiling
broad competence, weakens on novel logic

best for

  • Drafting, summarizing, rewriting at quality
  • Multi-turn chat with memory
  • Most tool-using agents
  • Code that fits in one file

not for

  • Hard math, formal proofs, legal edge cases
  • Tight latency budgets (<500ms)
  • Bulk classification — overkill

Default here. Step down to fast when latency or cost is the constraint. Step up to heavy only when you can prove the workhorse failed.

class 03

Heavy frontier

The principal. Big context, hard problems, infrequent calls.

flagship · 200k–1M context

  • openai/gpt-5
  • anthropic/claude-opus-4.6
  • google/gemini-3-ultra
latency
≈ 3–10 s
cost
tens of cents per call
ceiling
highest available quality

best for

  • Long-context synthesis (full books, codebases, transcripts)
  • High-stakes drafts where a human won't edit further
  • Final-pass review behind a workhorse
  • Complex tool orchestration

not for

  • Anything you'll run a thousand times a minute
  • Tasks a smaller model already nails — you're burning money

Use it as the editor, not the writer. Workhorse drafts, frontier polishes. Cost goes down, quality goes up.

class 04

Reasoning specialists

The mathematician. Thinks before it speaks. Literally.

o-series · thinking models

  • openai/o5
  • openai/o5-mini
  • anthropic/claude-opus-4.6 (extended thinking)
latency
≈ 5–60 s
cost
high — pays for hidden reasoning tokens
ceiling
best-in-class on math, logic, planning

best for

  • Multi-step math, optimization, constraint solving
  • Code that requires planning, not pattern matching
  • Debugging across files
  • Strategic plans where the steps matter

not for

  • Conversation, summarization, creative writing
  • Anything time-sensitive
  • Cost-sensitive bulk work

Hand it problems where a human would pull out a notepad. Hand it chat and it'll over-engineer a hello.

class 05

Vision & multimodal

The pair of eyes. Reads screenshots, charts, PDFs, the wild.

multimodal · image + video in

  • openai/gpt-5 (vision)
  • anthropic/claude-sonnet-4-5
  • google/gemini-3-pro (vision)
  • google/gemini-3.1-flash-image-preview
latency
≈ 2–6 s
cost
image tokens are cheaper than you think
ceiling
still misreads dense tables and tiny text

best for

  • Pulling structured data out of receipts, invoices, IDs
  • UI review, screenshot QA, accessibility audits
  • Chart and diagram interpretation
  • OCR for messy real-world documents

not for

  • Pixel-perfect tasks — measuring, color matching
  • Anything where a human would also need to zoom

If the answer is in the image, use vision. If the image is decoration, you're paying twice for nothing.

class 06

Embeddings

The librarian. Not a chatbot. A coordinate system.

vector · semantic similarity

  • openai/text-embedding-3-large
  • google/gemini-embedding-001
  • voyage/voyage-3
latency
≈ 50–200 ms per batch
cost
near-zero
ceiling
search quality, not language

best for

  • Semantic search over your own data
  • Deduplication, clustering, classification at scale
  • Retrieval (RAG) before a generative model
  • Recommendations and 'more like this'

not for

  • Generating text — that's not what they do
  • Tasks where keywords matter more than meaning

If you said 'the model should know about my data,' the answer is almost always embeddings + retrieval, not a bigger context window.

Then layer them. A stack beats a single shot.

most production traffic is two models in a row

You almost never need one big model doing one big call. The teams shipping the best work pipe a cheap model into a smart one — or a smart one into a frontier one. Four patterns cover most jobs.

pattern 01

Router → workhorse

Fast (classify) → Balanced (do the job)

Triage cheap, execute well. Stops the workhorse from wasting cycles on stuff a 4B model could route.

pattern 02

Writer → editor

Balanced (draft) → Heavy (critique & polish)

Most of the words come from the cheaper model. The expensive model only touches the final 10%.

pattern 03

Retrieve → reason

Embeddings (find context) → Balanced or Heavy (answer)

Don't stuff the prompt with everything. Search first, then hand the model the five paragraphs that matter.

pattern 04

Plan → execute

Reasoning (decompose) → Balanced (do each step)

Reasoning models are great at planning, expensive at typing. Let them outline, let the workhorse fill in.

05 · the advisor · ask, get steered

Don't know which? Ask the meerkat.

Describe the job — the volume, the latency, the stakes, what the input looks like — and the advisor recommends a tier (and a stack pattern, if it earns it). Editorial, not enthusiastic. It will push back.

  • Recommends by tier and pattern, not brand.
  • Defaults to the workhorse. Has to be talked into the frontier.
  • Ends with one model to try first, and one alternative.

Not a sales pitch. A second opinion before you wire up the wrong thing.

live · the advisor

Tell it the job. Get the model.

editorial, not enthusiastic

Describe what you're actually trying to ship — volume, latency, stakes, what kind of input — and the advisor will steer you to a tier (and a stack, if it helps). Or pick one of these to start:

06 · enough theory

Now go write one that ships.

Meerkat asks the four questions for you and assembles the prompt in the format the model actually understands. Sixty seconds, no incantations.