claude
ai
tooling

Building with Claude skills: a practical intro

Anthropic's Claude skills turn abstract 'prompts' into reusable, reviewable units of behavior. Here's how I've been shipping them at work without ceremony.

Jan 12, 2026·6 min read·updated Jan 15, 2026

When Anthropic introduced Claude skills, the framing shifted. You used to write prompts — paragraphs of careful adjectives hoping the model took the hint. A skill is closer to a small, reviewable program: a named behavior with typed inputs, explicit refusal conditions, and a clear output contract.

That shift is mostly cultural. Same model, same API. But it pays out the same way small, single-purpose functions pay out in normal code — they compose, they diff cleanly, and they make it easy to say no.

#What a skill actually is

A skill, in my day-to-day, is a file. Something like this:

skills/summarize-release-notes.ts

export const skill = {
  name: "summarize-release-notes",
  input: "changelog markdown",
  output: "5 bullet points, <=20 words each",
  refuseIf: [
    "input is empty",
    "changelog has more than 50 entries",
  ],
  model: "claude-sonnet-4-5",
  temperature: 0.2,
};

The shape doesn’t matter much. What matters is that the skill is addressable — summarize-release-notes is a thing you can import, test, and replace. That’s the whole game.

#The three disciplines

After shipping a dozen or so, I keep coming back to the same three habits:

Name the outcome, not the activity. triage-a11y-issue beats analyze-with-claude. The second tells me nothing about what it does.
Write the refusal conditions first. If you can’t list what the skill should decline, you probably don’t know what it should do.
Version the file. Diff it. Review it. A skill is code. Treat it like code.

#Where they fit in a real app

In a Shopify admin app I’m working on, two skills carry most of the weight. One triages accessibility issues coming out of an automated audit; another writes merchant-facing copy for error states. Both live in the same repo as the Liquid templates they serve, and both are wired to a tiny fixture set that regression-tests the output shape on every PR.

The rule I keep learning the hard way: if a skill’s output is going to be rendered as UI, the fixture set should include at least one example of the output you don’t want.

That’s how you find out the model loves to start sentences with "As an AI assistant" even when you beg it not to.

#Testing the unpredictable

You can’t unit-test an LLM the way you unit-test a function. What you can do is assert on the cheap, structural things: is the output valid JSON? is the bullet count correct? does it contain a forbidden phrase?

skills/summarize-release-notes.test.ts

import { run } from "./summarize-release-notes";
import { expect, test } from "vitest";
 
test("returns exactly 5 bullets", async () => {
  const out = await run(fixture);
  const bullets = out.split("\n").filter(l => l.startsWith("- "));
  expect(bullets).toHaveLength(5);
});
 
test("refuses on empty input", async () => {
  await expect(run("")).rejects.toThrow(/empty/);
});

Two tests. Neither cares about the content. Both catch 80% of the regressions I’ve hit in production.

#When to not reach for a skill

Some problems look like they want an LLM and really don’t. Anything with a deterministic right answer — format conversion, simple parsing, date math — is a bad fit. Reach for the regex first. Reach for the skill when the problem is shaped like a judgement call.

None of this is exotic. It’s just the normal habits of shipping software, applied to a medium that used to feel magical and is quietly becoming ordinary.