TL;DR

Stripe Billing was built in 2018–2019 for per-seat and fixed-tier SaaS. AI inference is usage-based, asymmetric, and needs to fail closed by default. The four primitives that actually matter (idempotent event ingestion, fail-closed budgets, per-agent margin attribution, and credit drawdown) are either missing or hard to build on Stripe. We'll show what breaks, what AI-first billing infrastructure (built on Polar) gives you instead, and why Macropay's 4.5% + $0.50 all-in is cheaper than "Stripe + custom usage layer + Stripe Tax" for most AI companies under $20M ARR.

Stripe Billing is a good product. It is not a good product for AI inference.

Those two sentences are the entire thesis of this essay. The rest is just the receipts: what specifically breaks when you put GPT-4o or Claude Sonnet on Stripe Billing, what you have to build yourself to compensate, and why a billing stack designed around metered AI usage — with margin baked in at the token level — comes out cheaper and cleaner than the Stripe-plus-custom-code path for nearly every AI company under $20M ARR.

What Stripe Billing was actually for

Stripe Billing was built in 2018–2019 to solve a very specific problem: SaaS companies with fixed monthly subscriptions, per-seat pricing, and occasional usage-based add-ons. The mental model behind it is:

  • A customer subscribes to a plan.
  • The plan has a base price.
  • The plan may have usage components measured against meters.
  • Meters get values pushed to them periodically (hourly, daily).
  • At end of period, Stripe rolls everything into an invoice.

That model works beautifully for a Slack-shaped business: seats per month, occasional message overage. It works adequately for a Datadog-shaped business: per-host plans with ingest overage. It breaks for an OpenRouter-shaped business: every API call is a billable event, prices vary by model, prices change as upstream providers move, and every customer effectively has a custom margin.

Where Stripe Billing breaks for AI

Six specific failure modes. We've hit all six. We've seen our customers hit all six.

1. Event ingest latency

Stripe's meter event endpoint has a documented 60-second ingest delay before an event is reflected in the meter total. For an AI product that needs to check a customer's remaining budget before dispatching the next inference call, 60 seconds is forever. We've seen customers burn $2,400 of credit in 90 seconds because budget enforcement was 60 seconds stale.
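
The exposure here is simple arithmetic: spend rate multiplied by how stale the budget view is. A small sketch, assuming spend is spread evenly over the burn window; the function name and the uniform-rate assumption are ours for illustration, not anything from Stripe's documentation:

```typescript
// Worst-case spend that a stale budget check cannot see, assuming spend is
// spread evenly over the burn window (an illustrative simplification).
function unseenSpendUsd(
  burnUsd: number,
  burnSeconds: number,
  stalenessSeconds: number,
): number {
  // The blind spot can never be longer than the burn itself.
  const blindSeconds = Math.min(stalenessSeconds, burnSeconds);
  // Multiply before dividing so round inputs give an exact result.
  return (burnUsd * blindSeconds) / burnSeconds;
}

// The incident above: $2,400 burned in 90 seconds, meter 60 seconds stale.
const unseen = unseenSpendUsd(2_400, 90, 60);
```

Under that assumption, at any instant up to $1,600 of the $2,400 was invisible to the budget check.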

2. No native idempotency on usage events

Stripe's standard idempotency-key model covers payment intents but not meter events. If your inference proxy retries on a network blip, you log the same usage twice. We've cleaned up customers' books to remove $40k of double-billed events in a single month.

3. Pricing is per-plan, not per-call

Stripe Billing expresses prices as plan-level tiers. AI inference prices are per-model-per-token, and they shift weekly as upstream providers reprice or as you change your margin policy. There's no clean primitive for "this specific call is billed at $0.184, of which $0.074 is upstream cost and $0.110 is margin."
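
To make the missing primitive concrete, here is a minimal sketch of per-call pricing from a per-model rate card. The `RATES` table, the `priceCall` function, and every number in it are hypothetical and chosen only for illustration; amounts are integer micro-dollars (1e-6 USD) to avoid floating-point drift.

```typescript
// Hypothetical rate card: upstream price per 1,000 tokens in micro-dollars,
// plus a markup percentage set by your margin policy.
type ModelRate = { upstreamPer1k: number; markupPct: number };

const RATES: Record<string, ModelRate> = {
  "gpt-4o": { upstreamPer1k: 5_000, markupPct: 150 }, // illustrative only
  "claude-sonnet": { upstreamPer1k: 3_000, markupPct: 100 }, // illustrative only
};

type CallPrice = { upstreamCost: number; customerCharge: number; margin: number };

// Prices one call: cost basis first, then margin on top of it.
function priceCall(model: string, tokens: number): CallPrice {
  const rate = RATES[model];
  if (!rate) throw new Error(`no rate for model ${model}`);
  const upstreamCost = Math.round((tokens * rate.upstreamPer1k) / 1000);
  const margin = Math.round((upstreamCost * rate.markupPct) / 100);
  return { upstreamCost, customerCharge: upstreamCost + margin, margin };
}

// 4,280 tokens on the hypothetical gpt-4o rate.
const call = priceCall("gpt-4o", 4_280);
```

Because the rate card is data rather than a plan definition, repricing a model is a one-line change that applies to the next call.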

4. No fail-closed budget primitive

For a payments stack, fail-open (let the call through, charge later) is the correct default. For an AI inference stack, fail-closed (block the call when the budget is empty) is the correct default. Stripe Billing has no concept of "reject this usage event if it would put the customer over budget." You have to build it yourself, and you have to build it without the 60-second latency above.

5. Refunds for outages don't work

If OpenAI has a 30-minute outage and you serve degraded results, you want to credit your customers automatically based on usage during that window. Stripe's refund primitive operates on invoices, not on meter events. You can't cleanly say "credit back the 1,840 events between 14:02 and 14:34 UTC."
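
What we wanted looks roughly like the sketch below: select events by timestamp and credit each customer what they were charged in the window. The `MeterEvent` shape, the `outageCredits` function, and the date are hypothetical; this assumes you keep your own event log with per-event charges, which is exactly what Stripe's invoice-level refunds don't give you.

```typescript
// Hypothetical event-log row; charge is in integer micro-dollars.
type MeterEvent = {
  id: string;
  customerId: string;
  at: string; // ISO 8601 timestamp, UTC
  chargeMicroUsd: number;
};

// Sums charges per customer for events inside [startIso, endIso], inclusive.
function outageCredits(
  events: MeterEvent[],
  startIso: string,
  endIso: string,
): Map<string, number> {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  const credits = new Map<string, number>();
  for (const e of events) {
    const t = Date.parse(e.at);
    if (t >= start && t <= end) {
      credits.set(e.customerId, (credits.get(e.customerId) ?? 0) + e.chargeMicroUsd);
    }
  }
  return credits;
}

// Example window: 14:02 to 14:34 UTC on an arbitrary, made-up date.
const sampleEvents: MeterEvent[] = [
  { id: "evt_1", customerId: "cus_a", at: "2025-01-15T14:01:59Z", chargeMicroUsd: 100 },
  { id: "evt_2", customerId: "cus_a", at: "2025-01-15T14:02:00Z", chargeMicroUsd: 200 },
  { id: "evt_3", customerId: "cus_b", at: "2025-01-15T14:20:00Z", chargeMicroUsd: 300 },
  { id: "evt_4", customerId: "cus_a", at: "2025-01-15T14:34:00Z", chargeMicroUsd: 400 },
  { id: "evt_5", customerId: "cus_b", at: "2025-01-15T14:35:00Z", chargeMicroUsd: 500 },
];
const credits = outageCredits(sampleEvents, "2025-01-15T14:02:00Z", "2025-01-15T14:34:00Z");
```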

6. Per-customer margin is invisible

This is the most painful one. On Stripe, you see GMV. You don't see cost basis. Which means you don't see margin per customer. Which means you can't answer the question "which 5% of our customers are eating 60% of our gross margin because they're hammering Claude Opus for $14 of inference and paying us a $20 monthly plan?"

The single insight from running this stack for two years

Stripe could meter our tokens. It couldn't tell us which customers were unprofitable. Once we could see margin per customer, we found out 6% of our customer base was responsible for negative gross margin at the company level. We weren't losing money on AI — we were losing money on six people.

The four primitives we wished we'd had on day one

We rebuilt the billing stack on top of the Polar codebase, with four new primitives that solve the failure modes above. These are what ships in Macropay today.

1. Idempotent events

Every meter event carries an idempotency_key. If the same key arrives twice — because of a retry, a clock skew, a job restart — only the first one counts. The dedup window is 24 hours and uses a Postgres unique constraint plus a Redis bloom filter for hot-path latency. Ingest is < 8ms p99.

await mp.meter("gpt-4o", {
  customer: "cus_a1b2",
  agent: "agt_research_pro",
  tokens: 4_280,
  idempotency_key: "evt_8f3a1c", // safe to replay
});
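
For intuition, here is a minimal in-memory sketch of the dedup rule: the first event for a key counts, replays inside the 24-hour window do not. The `IdempotentMeter` class and its `Map` are hypothetical stand-ins for the Postgres unique constraint and Redis bloom filter, not Macropay's implementation, and the clock is passed in so the behavior is easy to test.

```typescript
const DEDUP_WINDOW_MS = 24 * 60 * 60 * 1000; // 24 hours

class IdempotentMeter {
  // idempotency key -> timestamp (ms) at which it was first counted
  private seen = new Map<string, number>();
  private totalTokens = 0;

  // Returns true if the event was counted, false if it was a duplicate.
  record(idempotencyKey: string, tokens: number, nowMs: number): boolean {
    const firstSeen = this.seen.get(idempotencyKey);
    if (firstSeen !== undefined && nowMs - firstSeen < DEDUP_WINDOW_MS) {
      return false; // replay inside the window: ignore it
    }
    // New key, or the window has elapsed: count it and remember when.
    this.seen.set(idempotencyKey, nowMs);
    this.totalTokens += tokens;
    return true;
  }

  get total(): number {
    return this.totalTokens;
  }
}

const meter = new IdempotentMeter();
const counted = meter.record("evt_8f3a1c", 4_280, 0); // first delivery
const replayed = meter.record("evt_8f3a1c", 4_280, 5_000); // retry 5 s later
```

Here `counted` is true, `replayed` is false, and the meter total stays at 4,280 tokens.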

2. Fail-closed budgets

Every customer can set a hard budget. Every meter event runs through the budget check before dispatch, not after. The check runs inline on the request path against Redis (sub-millisecond), not as a separate API round trip. Budgets are decremented atomically, so two concurrent calls can never both spend the same remaining balance.

Setting a budget is one call:

await mp.budgets.set("cus_a1b2", {
  monthly_usd: 200,
  on_exceed: "block",       // or "throttle" or "notify"
  auto_reload: { at: 0.1, by: 100 }
});
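
The core of fail-closed enforcement is that the check and the decrement are a single step. A minimal single-process sketch of the `"block"` behavior, with hypothetical names (`FailClosedBudget`, `tryReserve`); throttling, notifications, and auto-reload are left out, and a shared store would need the same check-and-decrement to run as one atomic operation:

```typescript
type Decision = { allowed: boolean; remainingCents: number };

class FailClosedBudget {
  private remaining = new Map<string, number>(); // customer -> cents left

  set(customerId: string, monthlyCents: number): void {
    this.remaining.set(customerId, monthlyCents);
  }

  // Check and decrement in one synchronous step, so nothing can interleave
  // between them in single-threaded JavaScript.
  tryReserve(customerId: string, costCents: number): Decision {
    const left = this.remaining.get(customerId);
    // Fail closed: no budget on record, a bad cost, or not enough left
    // all block the call.
    if (left === undefined || costCents < 0 || costCents > left) {
      return { allowed: false, remainingCents: left ?? 0 };
    }
    this.remaining.set(customerId, left - costCents);
    return { allowed: true, remainingCents: left - costCents };
  }
}

const budget = new FailClosedBudget();
budget.set("cus_a1b2", 20_000); // $200.00
const big = budget.tryReserve("cus_a1b2", 19_990); // allowed, 10 cents left
const next = budget.tryReserve("cus_a1b2", 20); // blocked, still 10 cents left
```

The inference call is dispatched only when `allowed` is true, so an empty or unknown budget can never produce spend.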

3. Per-agent margin

Every meter event records two numbers: the upstream cost (what OpenAI charged us) and the customer charge (what we charged the end user). The difference is margin. Margin attributes to the calling agent — which means you can pull margin reports per agent, per customer, per cohort, per model. The schema is:

event {
  id: evt_8f3a1c
  agent_id: agt_research_pro
  customer_id: cus_a1b2
  upstream_cost: 0.074   // OpenAI charged us this
  customer_charge: 0.184 // we charged customer this
  margin: 0.110          // this is the gross margin line
  margin_pct: 0.598      // 59.8% gross margin on this call
}

Roll those up over a month and you get per-customer profitability with no extra reporting layer. The first time most AI founders see this view, they're surprised.
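
The roll-up itself is a group-by. A minimal sketch, assuming events shaped like the schema above with amounts converted to integer micro-dollars; `rollupBy`, the field names, and the second customer (`cus_heavy`) are hypothetical:

```typescript
type MarginEvent = {
  customerId: string;
  agentId: string;
  upstreamCostMicroUsd: number;
  customerChargeMicroUsd: number;
};

type Rollup = { upstream: number; charge: number; margin: number; marginPct: number };

// Groups events by any key (customer, agent, model, ...) and totals margin.
function rollupBy(
  events: MarginEvent[],
  key: (e: MarginEvent) => string,
): Map<string, Rollup> {
  const out = new Map<string, Rollup>();
  for (const e of events) {
    const r = out.get(key(e)) ?? { upstream: 0, charge: 0, margin: 0, marginPct: 0 };
    r.upstream += e.upstreamCostMicroUsd;
    r.charge += e.customerChargeMicroUsd;
    r.margin = r.charge - r.upstream;
    r.marginPct = r.charge === 0 ? 0 : r.margin / r.charge;
    out.set(key(e), r);
  }
  return out;
}

const monthEvents: MarginEvent[] = [
  // The event from the schema above: $0.074 cost, $0.184 charge.
  { customerId: "cus_a1b2", agentId: "agt_research_pro", upstreamCostMicroUsd: 74_000, customerChargeMicroUsd: 184_000 },
  // A made-up customer whose usage costs more than they pay.
  { customerId: "cus_heavy", agentId: "agt_research_pro", upstreamCostMicroUsd: 300_000, customerChargeMicroUsd: 200_000 },
];

const byCustomer = rollupBy(monthEvents, (e) => e.customerId);
const unprofitable = [...byCustomer].filter(([, r]) => r.margin < 0).map(([id]) => id);
```

Swapping the key function to `(e) => e.agentId` gives the per-agent view from the same events.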

4. Credit drawdown

Credit packs are the natural primitive for prepaid AI usage. We treat them as first-class objects: customer prepays $1,000, receives a credit balance, every meter event draws down. When the balance hits a configurable threshold, auto-reload fires against the stored payment method. Allowlists control which models / agents the credit can be spent on.

await mp.credits.create({
  customer: "cus_a1b2",
  amount_usd: 1000,
  auto_reload: { at: 100, by: 1000 },
  allowed_models: ["gpt-4o", "claude-sonnet"]
});

Drawdown is FIFO across credit packs. Expiry, refunds, and partial allocations are all first-class. When the CFO asks "how much unredeemed credit liability do we have on the balance sheet right now," the answer is one SQL query, not a quarterly spreadsheet.
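
FIFO drawdown is easy to state precisely in code. A minimal sketch with hypothetical names (`CreditPack`, `drawDownFifo`, `outstandingLiabilityCents`); packs are assumed to be ordered oldest first, and expiry, refunds, and allowlists are left out:

```typescript
type CreditPack = { id: string; remainingCents: number };
type Allocation = { packId: string; cents: number };

// Draws amountCents from the packs, oldest first. Returns null and changes
// nothing if the packs together cannot cover the full amount.
function drawDownFifo(packs: CreditPack[], amountCents: number): Allocation[] | null {
  const available = packs.reduce((sum, p) => sum + p.remainingCents, 0);
  if (amountCents > available) return null;

  const allocations: Allocation[] = [];
  let needed = amountCents;
  for (const pack of packs) {
    if (needed === 0) break;
    const take = Math.min(pack.remainingCents, needed);
    if (take > 0) {
      pack.remainingCents -= take;
      needed -= take;
      allocations.push({ packId: pack.id, cents: take });
    }
  }
  return allocations;
}

// Unredeemed credit still owed to the customer.
function outstandingLiabilityCents(packs: CreditPack[]): number {
  return packs.reduce((sum, p) => sum + p.remainingCents, 0);
}

const packs: CreditPack[] = [
  { id: "pack_old", remainingCents: 300 },
  { id: "pack_new", remainingCents: 1_000 },
];
const used = drawDownFifo(packs, 450); // 300 from pack_old, then 150 from pack_new
const liability = outstandingLiabilityCents(packs); // 850 cents left
```

Recording which pack each event drew from is what makes later refunds and expiry traceable to a specific purchase.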

The math on a $1M MRR AI company

Let's compare the cost of building this on Stripe Billing yourself vs. switching to Macropay, for an AI inference company at $1M MRR ($12M ARR):

| Cost line | Stripe + DIY billing layer | Macropay |
| --- | --- | --- |
| Card processing (avg ticket $220, 70% intl) | $508,000 (~4.23%) | $546,000 (4.55%) |
| Stripe Tax | $60,000 (0.5%) | Included |
| FX + intl card surcharge | $72,000 | Included |
| Engineer-months: idempotent metering, budget enforcement, margin reporting | ~8 engineer-months ($240k loaded) | Included |
| Ongoing maintenance (15% of dev cost/yr) | ~$36k/yr | Included |
| Disputes | $15 × ~1,100 cases = $16,500 | $30 × 1,100 = $33,000 |
| Year-1 total | ~$932,500 | ~$579,000 |
| Steady-state year | ~$692,500 | ~$579,000 |

The headline rate (4.5% + $0.50) sounds higher than Stripe's 2.9% + $0.30 until you actually price the things Stripe doesn't do. The Stripe Tax line alone makes up for the headline-rate gap on most AI companies. The eng-months gap is what closes the deal.
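
If you want to check the totals or swap in your own numbers, the table reduces to a few sums. A small sketch using the figures from the table above; the `CostLines` shape and `yearlyTotals` function are ours for illustration:

```typescript
// USD per year at $12M ARR, copied from the comparison table.
type CostLines = { recurring: number[]; oneTime: number[] };

const stripeDiy: CostLines = {
  recurring: [
    508_000, // card processing
    60_000, // Stripe Tax
    72_000, // FX + intl card surcharge
    36_000, // ongoing maintenance (15% of the build cost)
    15 * 1_100, // disputes
  ],
  oneTime: [240_000], // ~8 engineer-months, loaded
};

const macropay: CostLines = {
  recurring: [
    546_000, // card processing
    30 * 1_100, // disputes
  ],
  oneTime: [],
};

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

// Year one pays the one-time build cost; steady state does not.
function yearlyTotals(c: CostLines): { yearOne: number; steadyState: number } {
  const steadyState = sum(c.recurring);
  return { yearOne: steadyState + sum(c.oneTime), steadyState };
}

const yearOneGap = yearlyTotals(stripeDiy).yearOne - yearlyTotals(macropay).yearOne;
const steadyGap = yearlyTotals(stripeDiy).steadyState - yearlyTotals(macropay).steadyState;
```

With these inputs the gap is $353,500 in year one and $113,500 in each steady-state year.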

When Stripe still makes sense

Three cases where Stripe Billing is the right call for an AI company:

  1. You sell pure B2B annual contracts at $50k+ ACV. Per-seat is fine, usage volatility is bounded, your customer count is small. Stripe is fine.
  2. You're past $100M ARR and you have the engineering budget to maintain the custom billing layer indefinitely. At that point, the margin a merchant of record (MoR) takes matters more than the eng cost.
  3. You're in a regulated vertical where you need to be the seller of record for legal reasons (some healthcare carve-outs, some government contracts). MoR doesn't work there.

Everyone else, which is most AI companies under $20M ARR, including the ones reading this, should look hard at the cost of the "DIY billing layer" line in the table above. Eight engineer-months is eight months of feature work the team didn't ship.

"We had a senior engineer spend nine months making Stripe Billing do per-token metering. He's now back on product. The thing he built is still in production, but it runs on Macropay's SDK, not Stripe's."
Andrei Volkov, CTO, Lattix.ai

The full breakdown — including a calculator that lets you swap your own volume and ticket size in — lives at /compare. The defaults are tuned for an AI inference company at $1.2M/month. Slide them around; the math is the math.