We Replaced REST with Kafka and Cut API Failures by 90% — A Production Cutover (.NET 9)
TL;DR: At Mattrx (a multi-tenant marketing analytics SaaS on .NET 9 / ASP.NET Core and Azure SQL, ~3,200 req/sec at peak), we swapped inter-service REST calls for Kafka topics. Over 8 weeks: end-to-end ingestion failures 1.9% → 0.18% (−90%), ingestion p95 180 ms → 8 ms (−96%), events lost on downstream outage tens of thousands → 0, cascading-failure incidents ~3/month → 0, time to add a new consumer ~1 sprint → ~1 day.
3-minute video walkthrough on YouTube: https://youtu.be/O21rbuQdM1Y
Full deep-dive (architecture, code, pre-adoption checklist, when NOT to reach for Kafka) on PrepStack: https://prepstack.co.in/blog/replaced-rest-with-kafka-cut-failures-90-percent
The one mental shift
Distinguish commands and queries (need an answer → REST) from events (fire-and-forget → log). A synchronous call makes the caller's uptime a function of the callee's uptime. An event published to a durable log decouples them in time. Most "we need a message bus" problems are really "we mislabeled an event as a function call."
The old pipeline — synchronous REST chain
Customer site → Collector → Enrichment → Analytics → Persister. Each service waits on the next. The Collector can't return 200 until all three downstream services succeed. Availability multiplies. Analytics slow at month-end → Collector times out → customer's event is gone → customer retries → MORE load on the struggling service → retry storm.
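To make "availability multiplies" concrete, here is a small arithmetic sketch in Python. The 99.5% per-service figure is an illustrative assumption, not a number measured at Mattrx, and the calculation assumes the services fail independently.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a synchronous chain.

    Every hop must succeed for the request to succeed, so the individual
    availabilities multiply (assuming independent failures).
    """
    return prod(availabilities)


# Hypothetical figures: Collector, Enrichment, Analytics, Persister,
# each available 99.5% of the time.
services = [0.995, 0.995, 0.995, 0.995]
chain = chain_availability(services)

print(f"single service failure rate: {1 - services[0]:.2%}")
print(f"whole chain failure rate:    {1 - chain:.2%}")
```

Under those assumptions, four services that each fail 0.5% of the time produce a chain that fails roughly 2% of the time, before any retry storm makes it worse.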
The new pipeline — Kafka log in the middle
The collector produces a single event to events.raw and returns 202 Accepted in ~8 ms. Three independent consumer groups read the same topic in parallel. The collector's success no longer depends on any consumer being up. Analytics down? Its consumer group lags and catches up. Zero customer impact. Zero data loss.
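The shape of that pipeline can be sketched with an in-memory stand-in for a single topic partition. This is Python for illustration only, not Kafka client code or the production .NET code; the `Log` class, the `collect` function, and the group names are all hypothetical.

```python
class Log:
    """In-memory stand-in for one topic partition: an append-only list
    plus one committed offset per consumer group."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # group name -> next offset to read

    def append(self, event):
        # The producer only appends. No consumer is consulted.
        self.records.append(event)
        return len(self.records) - 1

    def poll(self, group, max_records=100):
        start = self.offsets.get(group, 0)
        return list(enumerate(self.records[start:start + max_records], start))

    def commit(self, group, offset):
        self.offsets[group] = offset + 1

    def lag(self, group):
        return len(self.records) - self.offsets.get(group, 0)


def collect(log, event):
    """Collector handler: append to the log, then answer 202 right away."""
    log.append(event)
    return 202


log = Log()
statuses = [collect(log, {"id": i}) for i in range(5)]

# Enrichment and Persister keep up; Analytics is down and reads nothing.
for group in ("enrichment", "persister"):
    for offset, _event in log.poll(group):
        log.commit(group, offset)
lag_during_outage = log.lag("analytics")

# Analytics comes back and catches up from its own committed offset.
for offset, _event in log.poll("analytics"):
    log.commit("analytics", offset)
lag_after_recovery = log.lag("analytics")
```

Every `collect` call succeeds while one group is down, and the only visible symptom of the outage is that group's lag, which drains once it resumes.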
The consumer pattern
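Three things matter. One: a consumer group, so partitions parallelize across instances. Two: manual offset commit, done only after successful processing (at-least-once delivery). Three: a retry topic for transient failures and a dead-letter topic for messages that keep failing, so one bad message cannot block the partition. Consumer lag, the distance between a group's committed offset and the end of the log, is the one health metric to watch.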
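A minimal sketch of the commit-after-processing loop with a retry topic and a dead-letter topic follows. It uses plain Python lists and a deque in place of real topics, and `MAX_ATTEMPTS`, `consume`, and the message shape are assumptions made for illustration rather than anything from a Kafka client library.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry budget


def consume(messages, handler, retry_topic, dead_letter_topic):
    """Process messages in order and return the offset to commit.

    The offset advances only once a message has either been handled or
    been parked on another topic, so a crash mid-loop replays the
    in-flight message on restart (at-least-once).
    """
    committed = 0
    for offset, msg in enumerate(messages):
        try:
            handler(msg["payload"])
        except Exception as exc:
            attempts = msg.get("attempts", 0) + 1
            parked = {**msg, "attempts": attempts, "error": str(exc)}
            if attempts >= MAX_ATTEMPTS:
                dead_letter_topic.append(parked)  # give up, keep for inspection
            else:
                retry_topic.append(parked)        # try again later
        committed = offset + 1  # commit only after the outcome is recorded
    return committed


def handler(payload):
    if payload == "poison":
        raise ValueError("cannot parse")


main = [{"payload": "ok-1"}, {"payload": "poison"}, {"payload": "ok-2"}]
retry, dlq = deque(), []
committed = consume(main, handler, retry, dlq)

# A separate retry consumer drains the retry topic the same way.
while retry:
    consume([retry.popleft()], handler, retry, dlq)
```

The poison message never blocks `ok-2`: it moves to the retry topic, fails until the budget is spent, and ends up in the dead-letter list with its attempt count and last error attached.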
The outbox pattern — atomic produce with a DB write
If the same handler writes to Azure SQL and then produces to Kafka, those are two systems with no shared transaction. A crash after the commit but before the produce loses the event; a handler that is retried after a partial failure can publish it twice. For events tied to a committed DB change, use the outbox: write the event to an outbox table in the same SQL transaction as the business data; a relay reads unsent rows and publishes them.
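Here is a minimal sketch of the outbox idea, using Python's built-in `sqlite3` as a stand-in for Azure SQL and a plain callback as a stand-in for the Kafka producer. The table names, the `conversions.recorded` topic, and both function names are hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE conversions (id INTEGER PRIMARY KEY, amount REAL NOT NULL);
    CREATE TABLE outbox (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        topic   TEXT NOT NULL,
        payload TEXT NOT NULL,
        sent    INTEGER NOT NULL DEFAULT 0
    );
""")


def record_conversion(db, conversion_id, amount):
    """Business row and outbox row commit or roll back together."""
    with db:  # one transaction: commit on success, rollback on exception
        db.execute("INSERT INTO conversions (id, amount) VALUES (?, ?)",
                   (conversion_id, amount))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("conversions.recorded",
                    json.dumps({"id": conversion_id, "amount": amount})))


def relay_once(db, publish):
    """Publish unsent rows in insertion order, marking each one sent
    after publish returns. A crash between publish and the UPDATE
    republishes that row on the next pass: at-least-once."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
record_conversion(db, 1, 49.0)
relay_once(db, lambda topic, payload: published.append((topic, payload)))
```

Because the relay can publish a row twice if it crashes before marking it sent, the outbox still relies on idempotent consumers downstream.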
Idempotent consumers — at-least-once + dedup = effectively-once
Kafka gives at-least-once delivery by default, so after a crash, a rebalance, or a producer retry a consumer can see the same message more than once. Consumers must therefore be idempotent. A dedup key with a unique constraint is the simplest tool. At-least-once + idempotent = effectively-once, and it's far simpler than chasing true exactly-once.
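The dedup-key idea can be sketched as follows, again with `sqlite3` standing in for the real database. The `processed_events` and `daily_rollup` tables, the event shape, and the `handle` function are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE daily_rollup (day TEXT PRIMARY KEY, clicks INTEGER NOT NULL);
""")


def handle(db, event):
    """Apply the event's effect once, however often it is delivered.

    Returns True if the event was applied, False if it was a duplicate.
    """
    try:
        with db:  # dedup row and side effect share one transaction
            db.execute("INSERT INTO processed_events (event_id) VALUES (?)",
                       (event["event_id"],))
            db.execute("INSERT OR IGNORE INTO daily_rollup (day, clicks) "
                       "VALUES (?, 0)", (event["day"],))
            db.execute("UPDATE daily_rollup SET clicks = clicks + 1 "
                       "WHERE day = ?", (event["day"],))
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected a repeated event_id: already applied,
        # so the consumer can safely commit the offset and move on.
        return False


event = {"event_id": "e-1", "day": "2025-01-31"}
first = handle(db, event)
second = handle(db, event)  # redelivery of the same event
clicks = db.execute("SELECT clicks FROM daily_rollup WHERE day = ?",
                    (event["day"],)).fetchone()[0]
```

The important design choice is that the dedup insert and the side effect commit together; recording the key in a separate transaction would reopen the same gap the dedup is meant to close.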
The two superpowers REST never gave us
Replay. A bug corrupted yesterday's rollup? Reset that one consumer group's offset and reprocess. The log is the source of truth.
Adding a consumer for free. New consumer group on the same topic. The producer doesn't change. ~1 day instead of ~1 sprint.
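Both superpowers come down to the same mechanism: each consumer group owns its own offset into a shared log. A toy Python sketch with an in-memory list as the log; the group names, the rollup shape, and the bug are invented for illustration.

```python
# The log: records stay put no matter who has read them.
records = [{"tenant": "a", "clicks": 2}, {"tenant": "a", "clicks": 3}]
offsets = {}  # consumer group -> next offset to read


def run(group, process):
    """Feed a group everything past its committed offset, then commit."""
    start = offsets.get(group, 0)
    for record in records[start:]:
        process(record)
    offsets[group] = len(records)


rollup = {"a": 0}


def buggy(record):  # bug: counts events instead of summing clicks
    rollup[record["tenant"]] += 1


def fixed(record):
    rollup[record["tenant"]] += record["clicks"]


run("analytics", buggy)
wrong_rollup = dict(rollup)

# Replay: discard the bad output, rewind only this group, reprocess.
rollup["a"] = 0
offsets["analytics"] = 0
run("analytics", fixed)

# New consumer: a fresh group starts at offset 0; the producer is untouched.
audit = []
run("audit", audit.append)
```

In a real cluster, replay only reaches as far back as the topic's retention allows, and the reprocessing has to overwrite or dedup its earlier output rather than add to it.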
When NOT to reach for Kafka
Don't replace request/response with Kafka — queries stay synchronous. Kafka is operational weight; use managed Kafka or weigh Service Bus / Event Hubs on Azure first. Exactly-once is a trap. Ordering is per-partition, not global. You traded synchronous errors for asynchronous lag — watch lag or you've hidden the problem.
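The per-partition ordering point is easiest to see with keyed messages. The Python sketch below hashes a key to pick a partition; real clients use their own hash functions, so `zlib.crc32`, the partition count, and the tenant keys here are stand-ins chosen for illustration.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key: str) -> int:
    # Same key -> same hash -> same partition, every time.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [("tenant-a", 1), ("tenant-b", 1), ("tenant-a", 2),
          ("tenant-b", 2), ("tenant-a", 3)]

for key, seq in events:
    partitions[partition_for(key)].append((key, seq))

for index, partition in enumerate(partitions):
    print(index, partition)
```

Events for one tenant always land in one partition and keep their relative order there; nothing orders `tenant-a` events against `tenant-b` events unless they happen to share a partition. If a workflow needs ordering, the key has to be chosen so that everything that must be ordered shares it.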
The closing mental model
A synchronous call couples you to the callee being up. An event in a log couples you to the log being up. For anything that's "this happened" rather than "tell me this," publishing to a durable log decouples producers from consumers in time.
