We Replaced REST with Kafka and Cut API Failures by 90%

TL;DR: Mattrx is a multi-tenant marketing analytics SaaS (.NET 9 / ASP.NET Core, Azure SQL, ~3,200 req/sec peak). We swapped its inter-service REST calls for Kafka topics. Over 8 weeks: end-to-end ingestion failures 1.9% → 0.18% (−90%), ingestion p95 180 ms → 8 ms (−96%), events lost on downstream outage tens of thousands → 0, cascading-failure incidents ~3/month → 0, time to add a new consumer ~1 sprint → ~1 day.

3-minute video walkthrough on YouTube: https://youtu.be/O21rbuQdM1Y

Full deep-dive (architecture, code, pre-adoption checklist, when NOT to reach for Kafka) on PrepStack: https://prepstack.co.in/blog/replaced-rest-with-kafka-cut-failures-90-percent

The one mental shift

Distinguish commands and queries (need an answer → REST) from events (fire-and-forget → log). A synchronous call makes the caller's uptime a function of the callee's uptime. An event published to a durable log decouples them in time. Most "we need a message bus" problems are really "we mislabeled an event as a function call."

The old pipeline — synchronous REST chain

Customer site → Collector → Enrichment → Analytics → Persister. Each service waits on the next. The Collector can't return 200 until all three downstream services succeed. Availability multiplies: the chain is only up when every hop is up, so its availability is the product of the individual availabilities and is lower than that of the weakest service. Analytics slows down at month-end → Collector times out → customer's event is gone → customer retries → MORE load on the struggling service → retry storm.
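To put a number on "availability multiplies", here is a small Python sketch (Python is used only for brevity; the post's stack is .NET). The 99.5% per-service availability is an assumed figure for illustration, not a measurement from Mattrx.

```python
from math import prod

def chain_availability(availabilities):
    """Availability of a synchronous chain: every hop must be up at once."""
    return prod(availabilities)

# Assumed, illustrative numbers: each hop is up 99.5% of the time.
services = {
    "collector": 0.995,
    "enrichment": 0.995,
    "analytics": 0.995,
    "persister": 0.995,
}

sync_chain = chain_availability(services.values())
weakest = min(services.values())

print(f"weakest single service: {weakest:.4f}")
print(f"synchronous chain:      {sync_chain:.4f}")
```

Under these assumptions the chain is noticeably less available than any single service, and every extra synchronous hop lowers it further.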

The new pipeline — Kafka log in the middle

The Collector produces a single event to events.raw and returns 202 Accepted in ~8 ms. Three independent consumer groups read the same topic in parallel. The Collector's success no longer depends on any consumer being up. Analytics down? Its consumer group lags and catches up once it recovers, provided that happens within the topic's retention window. Zero customer impact. Zero data loss.
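A toy model of that shape, in Python rather than the post's .NET, may help. The Topic class below is an in-memory stand-in for a single-partition Kafka topic, and collect() is a hypothetical handler name; neither is the real client API.

```python
class Topic:
    """Append-only in-memory log; a toy stand-in for one Kafka partition."""

    def __init__(self):
        self.records = []
        self.committed = {}  # consumer group -> next offset to read

    def produce(self, event):
        self.records.append(event)
        return len(self.records) - 1  # offset of the appended record

    def poll(self, group, max_records=100):
        start = self.committed.get(group, 0)
        batch = self.records[start:start + max_records]
        return list(enumerate(batch, start=start))  # (offset, event) pairs

    def commit(self, group, next_offset):
        self.committed[group] = next_offset

    def lag(self, group):
        return len(self.records) - self.committed.get(group, 0)


def collect(topic, event):
    """Hypothetical collector handler: append to the log, acknowledge, done."""
    topic.produce(event)
    return 202


events_raw = Topic()
statuses = [collect(events_raw, {"id": i}) for i in range(5)]

# Enrichment and persister keep up; analytics is "down" and reads nothing.
for group in ("enrichment", "persister"):
    batch = events_raw.poll(group)
    events_raw.commit(group, batch[-1][0] + 1)

lag_during_outage = events_raw.lag("analytics")

# Analytics recovers and resumes from its own committed offset.
caught_up = events_raw.poll("analytics")
events_raw.commit("analytics", caught_up[-1][0] + 1)
```

Each group tracks its own offset, so the analytics outage shows up only as lag on that group; the collector and the other groups never notice.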

The consumer pattern

Three things matter. One: a consumer group, so partitions parallelize across instances. Two: manual offset commit, meaning you commit only after successful processing (at-least-once delivery). Three: a retry topic and a dead-letter topic, so a message that keeps failing is parked instead of blocking the partition. Consumer lag is the one health metric to alert on.
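The commit-after-processing loop with retry and dead-letter topics can be sketched as follows. This is plain Python with lists standing in for topics, not the Kafka client API; MAX_ATTEMPTS and the record layout are assumptions made for the example.

```python
MAX_ATTEMPTS = 3  # assumed retry budget, not a figure from the post


def consume(records, committed, handler, retry_topic, dead_letter_topic):
    """Process records from the committed offset; return the new offset.

    The offset advances only after a record was handled or parked, so a
    crash before that point would replay the record (at-least-once).
    """
    offset = committed
    while offset < len(records):
        record = records[offset]
        try:
            handler(record["value"])
        except Exception as exc:
            attempts = record.get("attempts", 0) + 1
            target = dead_letter_topic if attempts >= MAX_ATTEMPTS else retry_topic
            target.append(
                {"value": record["value"], "attempts": attempts, "error": str(exc)}
            )
        offset += 1  # "commit" after the record is handled or parked
    return offset


def lag(records, committed):
    """Records written but not yet committed by this consumer group."""
    return len(records) - committed


def handler(value):
    if value == "poison":
        raise ValueError("cannot parse event")


main_topic = [{"value": "ok-1"}, {"value": "poison"}, {"value": "ok-2"}]
retry_topic, dead_letter_topic = [], []

main_committed = consume(main_topic, 0, handler, retry_topic, dead_letter_topic)
# A separate consumer drains the retry topic; repeated failures go back
# onto it until the budget is spent, then land in the dead-letter topic.
retry_committed = consume(retry_topic, 0, handler, retry_topic, dead_letter_topic)
```

The poison message never stalls the main topic: it is parked on the first failure, retried out of band, and ends up in the dead-letter topic for a human to inspect.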

The outbox pattern — atomic produce with a DB write

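If the same handler writes to Azure SQL and then produces to Kafka, those are two systems with no shared transaction. A crash between the two steps leaves a committed row with no event; a retry of the whole handler after a partial failure can publish the event twice. For events tied to a committed DB change, use the outbox: write the event to an outbox table in the same SQL transaction as the business data; a relay reads unsent rows and publishes them. The relay can itself publish a row twice if it crashes before marking it sent, so consumers still need to be idempotent.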
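A minimal sketch of the outbox idea, using Python's built-in sqlite3 as a stand-in for Azure SQL. The campaigns table, the campaigns.events topic name, and the publish callback are all hypothetical; a real relay would call a Kafka producer where publish is invoked.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE campaigns (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE outbox (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        topic   TEXT NOT NULL,
        payload TEXT NOT NULL,
        sent    INTEGER NOT NULL DEFAULT 0
    );
""")


def create_campaign(name):
    """Write the business row and its event in one transaction."""
    with db:  # commits on success, rolls back if anything raises
        cur = db.execute("INSERT INTO campaigns (name) VALUES (?)", (name,))
        campaign_id = cur.lastrowid
        event = {"type": "campaign.created", "campaign_id": campaign_id}
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("campaigns.events", json.dumps(event)),
        )
    return campaign_id


def relay_once(publish):
    """Publish unsent rows in insertion order; mark each sent afterwards.

    A crash between publish() and the UPDATE republishes that row on the
    next run, which is why delivery is at-least-once.
    """
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Usage: call create_campaign("spring-sale") in the request handler, and run relay_once(producer_callback) on a timer or background worker.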

Idempotent consumers — at-least-once + dedup = effectively-once

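With offsets committed after processing, Kafka gives at-least-once delivery: a consumer that crashes before committing sees the same message again. Consumers must therefore be idempotent. A dedup key with a unique constraint is the simplest tool: record each event ID in the same transaction as the side effect, and skip any event whose ID is already there. At-least-once + idempotent = effectively-once, and it's far simpler than chasing true exactly-once.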
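One way to sketch the dedup-key approach, again with sqlite3 standing in for Azure SQL. The processed_events and page_views tables and the event_id field are assumptions made for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE page_views (page TEXT PRIMARY KEY, views INTEGER NOT NULL);
""")


def handle(event):
    """Apply the event once; return False if its event_id was seen before."""
    try:
        with db:  # dedup row and side effect commit or roll back together
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            db.execute(
                "INSERT OR IGNORE INTO page_views (page, views) VALUES (?, 0)",
                (event["page"],),
            )
            db.execute(
                "UPDATE page_views SET views = views + 1 WHERE page = ?",
                (event["page"],),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the unique constraint rejected it


deliveries = [
    {"event_id": "e1", "page": "/pricing"},
    {"event_id": "e1", "page": "/pricing"},  # redelivered after a crash
    {"event_id": "e2", "page": "/pricing"},
]
results = [handle(event) for event in deliveries]
views = db.execute(
    "SELECT views FROM page_views WHERE page = ?", ("/pricing",)
).fetchone()[0]
```

The redelivered e1 is rejected by the primary key on processed_events, so the counter reflects two distinct events even though three messages arrived.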

The two superpowers REST never gave us

Replay. A bug corrupted yesterday's rollup? Reset that one consumer group's offset and reprocess. The log is the source of truth.
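Replay can be illustrated with a toy log in Python. The list, the offsets dictionary, and the rollup functions are invented for the example; with real Kafka the rewind is an offset reset on the consumer group.

```python
log = [
    {"campaign": "a", "clicks": 2},
    {"campaign": "b", "clicks": 5},
    {"campaign": "a", "clicks": 3},
]
offsets = {"rollup": 0, "persister": 0}  # committed offset per consumer group


def run_group(group, apply):
    """Read from the group's committed offset to the end, then commit."""
    for event in log[offsets[group]:]:
        apply(event)
    offsets[group] = len(log)


def buggy_rollup(totals):
    def apply(event):
        # Bug: overwrites the running total instead of adding to it.
        totals[event["campaign"]] = event["clicks"]
    return apply


def fixed_rollup(totals):
    def apply(event):
        totals[event["campaign"]] = totals.get(event["campaign"], 0) + event["clicks"]
    return apply


totals = {}
run_group("rollup", buggy_rollup(totals))
run_group("persister", lambda event: None)
corrupted = dict(totals)

# Replay: drop the derived state, rewind only this group's offset, and
# reprocess with the fixed code. Other groups are untouched.
totals.clear()
offsets["rollup"] = 0
run_group("rollup", fixed_rollup(totals))
```

Only the rollup group rewinds; the persister's offset stays where it was, and the producer is not involved at all.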

Adding a consumer for free. New consumer group on the same topic. The producer doesn't change. ~1 day instead of ~1 sprint.

When NOT to reach for Kafka

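Don't replace request/response with Kafka: queries stay synchronous. Kafka is operational weight; use managed Kafka or weigh Service Bus / Event Hubs on Azure first. Exactly-once is a trap: Kafka's transactional guarantees cover reads and writes inside Kafka, not side effects in external systems such as Azure SQL, so idempotent consumers are still required. Ordering is per-partition, not global, so events that must stay in order need the same partition key. You traded synchronous errors for asynchronous lag; watch lag or you've hidden the problem.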
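The per-partition ordering point is easy to see in a small Python model. The partition count, the tenant keys, and the CRC32 hash are assumptions for the sketch; real Kafka clients ship their own partitioners, and CRC32 is used here only as a stable stand-in.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the example


def partition_for(key: str) -> int:
    """Stable key hash mapped to a partition; same key, same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = [[] for _ in range(NUM_PARTITIONS)]

# (key, sequence number) in the order the producer sent them.
events = [
    ("tenant-a", 1), ("tenant-b", 1), ("tenant-a", 2),
    ("tenant-c", 1), ("tenant-b", 2), ("tenant-a", 3),
]
for tenant, seq in events:
    partitions[partition_for(tenant)].append((tenant, seq))


def sequence_seen(tenant):
    """Order in which a consumer of the tenant's partition sees its events."""
    partition = partitions[partition_for(tenant)]
    return [seq for key, seq in partition if key == tenant]
```

Events that share a key stay in send order because they share a partition. Nothing is promised about the relative order of events in different partitions, so pick the key around whatever must stay ordered.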

The closing mental model

A synchronous call couples you to the callee being up. An event in a log couples you to the log being up. For anything that's "this happened" rather than "tell me this," publishing to a durable log decouples producers from consumers in time.

If this saved you a 3 AM page, a 💖 or bookmark helps it reach more backend engineers.
