Case Study

Inventory & Promotion Platform for a 120-Store Supermarket Chain

Regional supermarket chain (APAC)

Retail — grocery

Australia

3 days → ~2h

Promotion go-live time

Self-service via console; price in stores within 2 hours of publish

9% → 3.2%

Click-and-collect cancellations

Fail-closed reservation API + near-real-time stock

6–18h → 12–18s p95

Stock data freshness

Event-driven store-edge replaces nightly file export

Duration

10 months (Phase 1–5)

Team Size

8 people

Services

3 services

Client Context

The category manager's problem was simple to state and extremely hard to solve. By the time a promotional decision was made, reviewed, emailed to store managers, and manually entered into 120 POS systems by store staff, three days had passed. In grocery retail, three days is the difference between a campaign that captures the weekend trading window and one that runs only on Tuesday and Wednesday. The chain was competing against the Coles/Woolworths duopoly on promotional speed and losing.

Their legacy Windows POS was stable and well understood: 14 years old, sitting on a private VLAN in every store, and working fine at the checkout. The problem was the integration surface. The only documented way to change a price was a vendor service ticket costing $2,400 per change request, and the POS vendor's API was an RS-232 file-drop protocol from 2009. Nobody wanted to replace 120 POS systems. The question was how to get promotion data from a category manager's decision into 120 stores in under two hours without touching the POS software at all.

The Challenge

Business Challenge

Three days from promo decision to live in-store price meant the chain couldn't run tactical promotions around competitor price moves or weather events. Stock visibility for online click-and-collect was 6–18 hours stale, causing ~9% order-line cancellations — each one a customer service call. The $2,400 vendor change-request fee meant promotions were batched rather than responsive. At 40+ promotions per quarter, that was $400K+ in change-management cost per year.

Technical Challenge

The POS estate couldn't be replaced: 120 stores × $85K per POS system = $10.2M, plus 2 years of migration risk. The legacy POS communicated only via a file-drop protocol: it wrote daily export files to an SMB share, and accepted a price-list file in a specific fixed-width format. No real-time API, no push capability. Each store had one VPN tunnel back to HQ — often saturated during trading hours. Stock data was available only in those nightly export files. The core technical challenge: how do you build a real-time inventory and promotion system on top of a polling-based, file-oriented legacy system without disrupting the checkout flow that processes $2M+ of transactions per day?

Signals Before We Started

3 days from promotion decision to live in-store price; weekend trading windows missed
~9% line cancellation rate on click-and-collect; each cancelled line = customer service call
Stock data at HQ 6–18 hours stale; online availability decisions based on yesterday's numbers
$2,400 per vendor change request; 40+ promotions/quarter = $400K+/year in change fees
No ability to run tactical promotions; must batch-plan 3 days ahead

Our Solution

Overview

An event-driven inventory and promotion platform that augments, rather than replaces, the legacy POS estate. Each store runs a lightweight Linux agent ('store-edge') on a $180 mini-PC that tails the POS export files, normalizes them into domain events, and publishes to a central Kafka cluster. In the other direction, the category manager publishes a promotion in the admin console; the engine validates it against 240 historical promotion rules; the store-edge agent polls for active price lists every 60 seconds and writes a promotion overlay file in the POS's exact fixed-width format. The POS never knows there's a new system; it just reads its price file as always.

The three hard architectural problems were:
(1) Exactly-once semantics from a file that the POS appends to continuously, solved with file-position watermarks and idempotent event IDs.
(2) The reservation API for click-and-collect, which fails closed on stale stock (reject rather than oversell, with a configurable freshness threshold).
(3) Promotion rule reproducibility as the trust gate: we did not allow any self-service promotion until the engine could reproduce all 240 historical promotions byte-for-byte.

Architecture

Java 21 Spring Boot 3 microservices on AWS EKS (ap-southeast-2, multi-AZ). Why Java and not .NET? The team inheriting this system was Java-native; we match the client's long-term maintenance capability. PostgreSQL 16 for canonical product, store, and promotion state. Apache Kafka on AWS MSK for event distribution — this is one of the few systems in our portfolio where Kafka is genuinely necessary: 120 stores × continuous sales events × real-time stock updates is a volume and fan-out problem that message queues don't solve elegantly. Redis on ElastiCache for the hot-path stock cache (inventory queries during trading hours are extremely read-heavy). Each store runs a store-edge agent: a ~3MB Java process that tails the POS SMB export, maintains a file-position watermark in a local SQLite database, normalizes rows into domain events, and publishes to MSK via mTLS. The agent buffers to local disk if MSK connectivity drops — no sales events are lost. React 18 admin console for category managers. Grafana + Loki + Tempo for observability.
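The disk-buffering behaviour described above can be sketched as follows. This is a minimal illustration in Python rather than the agent's actual Java code; `SpoolingPublisher`, the `send` callback, and the JSON-lines spool file are all hypothetical names and formats chosen for the example. The point it shows is ordering: once anything has been spooled, newer events queue behind it instead of overtaking it.

```python
import json
from pathlib import Path
from typing import Callable


class SpoolingPublisher:
    """Publish events, spooling to local disk while the broker is unreachable."""

    def __init__(self, send: Callable[[dict], None], spool: Path):
        self._send = send      # hypothetical transport, raises ConnectionError when down
        self._spool = spool    # hypothetical JSON-lines spool file

    def publish(self, event: dict) -> None:
        # If a backlog exists, append behind it so replay preserves order.
        if self._spool.exists() or not self._try_send(event):
            with self._spool.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(event) + "\n")

    def replay(self) -> int:
        """Send spooled events in order; return how many were delivered."""
        if not self._spool.exists():
            return 0
        lines = self._spool.read_text(encoding="utf-8").splitlines()
        pending = [json.loads(line) for line in lines]
        sent = 0
        for event in pending:
            if not self._try_send(event):
                break
            sent += 1
        remaining = pending[sent:]
        if remaining:
            body = "".join(json.dumps(e) + "\n" for e in remaining)
            self._spool.write_text(body, encoding="utf-8")
        else:
            self._spool.unlink()
        return sent

    def _try_send(self, event: dict) -> bool:
        try:
            self._send(event)
            return True
        except ConnectionError:
            return False
```

A crash between sending an event and rewriting the spool could resend that event, which is why a design like this is usually paired with idempotent event IDs on the consumer side.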

Approach

1. Edge-pilot in 6 metro stores: prove that a $180 mini-PC can run the agent without impacting trading
2. File-position watermark for exactly-once event extraction from appended POS export files
3. Promotion engine validated against 240 historical promotions before any self-service access granted
4. Fail-closed reservation API: reject on stale stock rather than oversell (configurable freshness threshold)
5. mTLS between every store-edge agent and MSK; tokenized loyalty IDs on Kafka (no raw PII)
6. Canary rollout for edge-agent updates: 5 stores → 24h stable → remaining stores
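The canary gate in the last step can be expressed as a small decision function. This is a sketch under stated assumptions: the function names and the "zero incidents" definition of stable are illustrative, and only the 5-store wave and the 24-hour soak come from the rollout plan above.

```python
from datetime import datetime, timedelta

CANARY_SIZE = 5                     # "5 stores" from the rollout plan
SOAK_PERIOD = timedelta(hours=24)   # "24h stable" gate


def plan_rollout(store_ids: list[str]) -> tuple[list[str], list[str]]:
    """Split the estate into a canary wave and the remaining stores."""
    return store_ids[:CANARY_SIZE], store_ids[CANARY_SIZE:]


def may_promote(canary_deployed_at: datetime, now: datetime, canary_incidents: int) -> bool:
    """Promote only after a full soak period with no incidents (assumed meaning of 'stable')."""
    return canary_incidents == 0 and now - canary_deployed_at >= SOAK_PERIOD
```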

Platform Modules

The system was delivered as the following modules — each with its own owner, integration contract and rollout plan.

Store-Edge Agent

~3MB Java process on a $180 mini-PC per store. Tails POS export files using a file-position watermark stored in local SQLite — this solves the exactly-once problem for a continuously-appended file without requiring POS modification. Buffers events to local disk if MSK is unreachable; replays in order on reconnect. Polls HQ for active price lists every 60 seconds and writes promotion overlays in the POS's fixed-width format.
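A minimal sketch of the watermark technique, written in Python for brevity rather than in the agent's Java. The table layout, the function names, and the event-ID recipe (a hash of store, file name, and byte offset) are assumptions made for illustration. Strictly speaking the pattern is at-least-once reading plus deterministic IDs: re-reading the same bytes always produces the same IDs, so consumers can discard duplicates and the result is effectively exactly-once.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_watermark_db(path: str) -> sqlite3.Connection:
    """Open (or create) the local watermark store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS watermark ("
        "file TEXT PRIMARY KEY, byte_offset INTEGER NOT NULL)"
    )
    return db


def read_new_events(db: sqlite3.Connection, store_id: str, export: Path) -> list[dict]:
    """Return events for complete lines appended since the last committed watermark."""
    row = db.execute(
        "SELECT byte_offset FROM watermark WHERE file = ?", (export.name,)
    ).fetchone()
    offset = row[0] if row else 0
    events = []
    with export.open("rb") as fh:
        fh.seek(offset)
        for raw in fh:
            if not raw.endswith(b"\n"):
                break  # partial line: the POS is still writing it
            # Deterministic ID (hypothetical recipe): same bytes, same ID.
            event_id = hashlib.sha256(
                f"{store_id}:{export.name}:{offset}".encode()
            ).hexdigest()[:16]
            events.append({
                "id": event_id,
                "offset": offset,
                "next_offset": offset + len(raw),
                "row": raw.rstrip(b"\r\n").decode("ascii"),
            })
            offset += len(raw)
    return events


def commit_watermark(db: sqlite3.Connection, export: Path, events: list[dict]) -> None:
    """Advance the watermark; call this only after the events were published."""
    if events:
        db.execute(
            "INSERT OR REPLACE INTO watermark (file, byte_offset) VALUES (?, ?)",
            (export.name, events[-1]["next_offset"]),
        )
        db.commit()
```

Committing the watermark after publishing, not before, means a crash in between causes a re-read rather than a lost sale line.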

Inventory Service (CQRS)

Write side: consumes `sale.line` and `stock.adjust` events from Kafka; maintains canonical stock per SKU per store using an event-sourced ledger. Read side: a denormalized stock-position table updated in-process, served behind Redis cache for the hot path. The separation allows the write side to handle burst ingestion without impacting read latency.
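The write/read split can be illustrated with an in-memory toy model. This is a sketch, not the service's implementation: the class and field names are hypothetical, and the real system persists the ledger in PostgreSQL and caches reads in Redis. It shows the two properties the design relies on: duplicate events are ignored by ID, and the read projection can always be rebuilt from the ledger.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StockEvent:
    event_id: str
    kind: str       # "sale.line" or "stock.adjust"
    store_id: str
    sku: str
    quantity: int   # units sold, or a signed adjustment


@dataclass
class InventoryLedger:
    events: list = field(default_factory=list)    # append-only write side
    seen_ids: set = field(default_factory=set)    # idempotency guard
    positions: dict = field(default_factory=lambda: defaultdict(int))  # read side

    @staticmethod
    def _delta(event: StockEvent) -> int:
        if event.kind == "sale.line":
            return -event.quantity
        if event.kind == "stock.adjust":
            return event.quantity
        raise ValueError(f"unknown event kind: {event.kind}")

    def apply(self, event: StockEvent) -> bool:
        """Append the event and update the projection; return False for duplicates."""
        if event.event_id in self.seen_ids:
            return False
        delta = self._delta(event)
        self.seen_ids.add(event.event_id)
        self.events.append(event)
        self.positions[(event.store_id, event.sku)] += delta
        return True

    def on_hand(self, store_id: str, sku: str) -> int:
        return self.positions.get((store_id, sku), 0)

    def rebuild(self) -> dict:
        """Recompute the read projection from the ledger alone."""
        rebuilt = defaultdict(int)
        for event in self.events:
            rebuilt[(event.store_id, event.sku)] += self._delta(event)
        return dict(rebuilt)
```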

Promotion Engine

Rule-based engine with dry-run preview, margin-impact estimation, time-window enforcement, and customer-segment targeting. The trust gate: no self-service access until the engine reproduced all 240 historical promotions byte-for-byte. Stacking conflicts are detected at publication time, not at POS receipt.
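To make the publication-time stacking check concrete, here is a simplified sketch. It assumes percentage discounts that stack multiplicatively and a per-SKU margin percentage; both are illustrative assumptions, as are the `Promotion` fields and function names. The real engine supports more rule types than this model covers.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Promotion:
    promo_id: str
    sku: str
    starts: datetime
    ends: datetime
    discount_pct: float   # e.g. 15.0 for 15% off


def overlaps(a: Promotion, b: Promotion) -> bool:
    """True when both promotions target the same SKU in overlapping time windows."""
    return a.sku == b.sku and a.starts < b.ends and b.starts < a.ends


def stacking_conflicts(
    promos: list[Promotion], margin_pct: dict[str, float]
) -> list[tuple[str, str]]:
    """Return promotion pairs whose combined discount exceeds the SKU's margin."""
    conflicts = []
    for a, b in combinations(promos, 2):
        if not overlaps(a, b):
            continue
        # Assumed stacking model: discounts compound multiplicatively.
        remaining = (1 - a.discount_pct / 100) * (1 - b.discount_pct / 100)
        combined_pct = (1 - remaining) * 100
        if combined_pct > margin_pct[a.sku]:
            conflicts.append((a.promo_id, b.promo_id))
    return conflicts
```

With a 30% margin, a 15% and a 20% promotion in the same window combine to roughly 32% off and would be flagged before publication.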

Reservation API (Fail-Closed)

Public API for click-and-collect. Checks stock at the destination store against a configurable staleness threshold (90s at peak, 5 minutes at off-peak). If canonical stock data is older than the threshold, the API rejects the reservation rather than overselling. This was counterintuitive to the business team — but the data proved that trustworthy rejections outperform optimistic overselling on customer LTV.
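The fail-closed rule reduces to a short decision function. The sketch below uses the 90-second and 5-minute thresholds stated above; the type names and rejection codes are hypothetical and not taken from the actual API contract.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

PEAK_THRESHOLD = timedelta(seconds=90)
OFF_PEAK_THRESHOLD = timedelta(minutes=5)


@dataclass(frozen=True)
class StockPosition:
    on_hand: int
    updated_at: datetime


@dataclass(frozen=True)
class ReservationResult:
    accepted: bool
    reason: str   # hypothetical codes, for illustration only


def reserve(
    position: Optional[StockPosition], quantity: int, now: datetime, peak: bool
) -> ReservationResult:
    """Reject when stock data is missing or stale, before looking at quantity."""
    threshold = PEAK_THRESHOLD if peak else OFF_PEAK_THRESHOLD
    if position is None or now - position.updated_at > threshold:
        return ReservationResult(False, "STOCK_DATA_STALE")
    if quantity > position.on_hand:
        return ReservationResult(False, "INSUFFICIENT_STOCK")
    return ReservationResult(True, "RESERVED")
```

The staleness check comes first on purpose: a stale "10 on hand" is not evidence that 10 units exist, so quantity is only compared against data the system still trusts.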

Category-Manager Console

React 18 admin app. Promotion builder with dry-run simulation, historical sales comparison, and margin-impact estimate. Explicit staging → preview → publish flow; preview shows the exact price overlay the store-edge agent will write to the POS price file. No promotion ships without a preview sign-off.

Edge Observability

Per-store Grafana dashboards: event lag (seconds behind real-time), last heartbeat, price-list sync status. Alert if any store hasn't synced in 90 seconds during trading hours or if event lag exceeds 45 seconds. The ops team can trigger a remote edge-agent restart from the dashboard without SSH access.
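The two alert conditions can be written out as plain logic. This is only an illustration of the thresholds, not Grafana alert configuration, and the alert names are hypothetical. It reads the sentence above literally: the trading-hours condition applies to the sync check only.

```python
SYNC_LIMIT_S = 90   # price-list sync age, checked during trading hours
LAG_LIMIT_S = 45    # event lag behind real time


def store_alerts(seconds_since_sync: float, event_lag_s: float, trading: bool) -> list[str]:
    """Return the alerts a store should raise for its current state."""
    alerts = []
    if trading and seconds_since_sync > SYNC_LIMIT_S:
        alerts.append("PRICE_SYNC_STALE")
    if event_lag_s > LAG_LIMIT_S:
        alerts.append("EVENT_LAG_HIGH")
    return alerts
```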

Data Flow

Each store-edge agent watches the POS export directory for new file bytes, reads from the last watermark position, parses fixed-width rows into `sale.line` and `stock.adjust` events, and publishes to per-store MSK topics via mTLS. The Inventory Service consumes these topics and updates its event-sourced ledger; a Redis-cached read projection serves the Reservation API. When the e-commerce front-end calls the Reservation API, the request succeeds only if canonical stock is within the staleness threshold and the requested quantity is available — otherwise a structured rejection is returned. Promotion changes flow in reverse: the category manager publishes in the console, the Promotion Engine validates and persists the rule, and broadcasts an update event. Each store-edge agent polls the promotion endpoint every 60 seconds, fetches the active price list, and writes a fixed-width price-overlay file to the POS share. The POS reads this file at its normal 60-second price-refresh cycle — from the POS's perspective, a price file appeared on the share as always.
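A sketch of the fixed-width handling at the end of this flow. The column layout (13-character SKU, 8-digit price in cents, 10-character promotion code) and the CRLF line endings are invented for illustration, since the real POS format is not given here. The write-to-temp-then-rename step is likewise an assumption about how to avoid the POS reading a half-written file; rename atomicity on an SMB share would need to be verified in practice.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical layout: the real POS column widths are not stated in this case study.
SKU_W, PRICE_W, PROMO_W = 13, 8, 10
ROW_W = SKU_W + PRICE_W + PROMO_W


def format_price_row(sku: str, price_cents: int, promo_code: str) -> str:
    """Render one overlay row: left-justified text, zero-padded integer cents."""
    if len(sku) > SKU_W or len(promo_code) > PROMO_W or not 0 <= price_cents < 10**PRICE_W:
        raise ValueError("field does not fit the fixed-width layout")
    return f"{sku:<{SKU_W}}{price_cents:0{PRICE_W}d}{promo_code:<{PROMO_W}}"


def parse_price_row(row: str) -> dict:
    """Slice a fixed-width row back into typed fields."""
    if len(row) != ROW_W:
        raise ValueError(f"expected {ROW_W} characters, got {len(row)}")
    return {
        "sku": row[:SKU_W].rstrip(),
        "price_cents": int(row[SKU_W:SKU_W + PRICE_W]),
        "promo_code": row[SKU_W + PRICE_W:].rstrip(),
    }


def write_overlay(target: Path, rows: list[str]) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="ascii", newline="") as fh:
        fh.write("".join(row + "\r\n" for row in rows))  # CRLF is an assumption
    os.replace(tmp, target)
```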

Integrations

Legacy Windows POS estate (120 stores) — SMB export file tailing + fixed-width price-list injection
Loyalty program — customer segment IDs (tokenized, no raw PII on Kafka topics) for targeted promotions
E-commerce front-end — Reservation API (REST + webhooks for availability changes)
Snowflake data warehouse — outbound CDC from PostgreSQL for analytics and category-manager reporting
AWS Cognito — category-manager SSO

Delivery Timeline

Phased delivery — each phase had explicit goals, measurable outcomes and a checkpoint before progression.

Phase 1 — Edge-agent pilot
Week 1–6
Goals
- Validate that a $180 mini-PC running the agent doesn't disrupt POS trading
- Solve exactly-once semantics from the continuously-appended POS export file
- Stream sales + stock events to MSK with p95 end-to-end latency < 30s
Outcomes
- ✓ File-position watermark approach solved exactly-once: zero duplicate events across 6-week pilot
- ✓ p95 end-to-end latency: 12–18 seconds (well under the 30s target)
- ✓ Zero POS incidents attributable to the agent across all 6 pilot stores
- ✓ HQ now sees pilot-store stock at near-real-time freshness vs the previous 6–18h lag
Phase 2 — Canonical catalog & promotion engine
Week 5–14
Goals
- Canonical product, store, and price entities in PostgreSQL
- Rule-based promotion engine: multi-buy, buy-X-get-Y, time-windowed, loyalty-only, category-wide
- Category-manager console with validation, dry-run preview, and margin-impact estimate
Outcomes
- ✓ Engine reproduced 240 historical promotions with 100% result fidelity, the trust gate for self-service access
- ✓ Category managers ran their first self-service promotion in week 12, shipping in ~45 minutes vs 3 days
- ✓ Dry-run mode caught 3 promotions that would have gone margin-negative due to stacking conflicts
Phase 3 — Click-and-collect reservation API
Week 12–20
Goals
- Stock reservation API for the online channel with configurable staleness threshold
- Fail-closed policy: reject rather than oversell when stock data exceeds freshness threshold
- Order-line cancellation telemetry and customer-service ticket attribution
Outcomes
- ✓ Click-and-collect cancellation rate: 9% → 3.2%; customer service tickets on missing items: -45%
- ✓ Staleness threshold set at 90s after analysis of stock-change event velocity during peak trading
- ✓ Key insight: the fail-closed policy reduced offered orders by 2.1% but increased fulfillment rate by 67%
Phase 4 — Full-chain rollout (120 stores)
Week 18–32
Goals
- Roll out edge agent to all NSW stores (~78), then VIC stores (~42)
- Canary update mechanism: 5 stores → 24h stable → remaining stores
- Decommission spreadsheet-based promotion distribution
Outcomes
- ✓ All 120 stores connected by week 30; average promotion go-live time: 3 days → ~2 hours
- ✓ Vendor change-request spend on promotions: $400K+/year → $0
- ✓ Canary mechanism caught one edge-agent bug in a VIC store before it reached the remaining 38 stores
Phase 5 — Peak prep & chaos engineering
Week 30–40
Goals
- Chaos drills: Kafka broker loss, edge-agent restart mid-trade, VPN saturation
- 2× Black-Friday load test across full store estate
- Runbook certification with the ops team
Outcomes
- ✓ Survived MSK broker rebalance + 3 simultaneous edge-agent restarts without stock event loss
- ✓ Load test: sustained 14× normal RPS for 30 minutes; stock p99 latency < 200ms
- ✓ First real production incident (unexpected MSK rebalance, week 36): recovered in 8 minutes using rehearsed runbook

Technology Stack

Java 21

Spring Boot 3

PostgreSQL 16

Apache Kafka (AWS MSK)

Redis (ElastiCache)

React 18

AWS (EKS, MSK, S3, Cognito)

Grafana / Loki / Tempo

SQLite (edge-agent watermark)

The Results

Measurable impact delivered within 10 months (Phase 1–5).

3 days → ~2h

Promotion go-live time

Self-service via console; price in stores within 2 hours of publish

9% → 3.2%

Click-and-collect cancellations

Fail-closed reservation API + near-real-time stock

6–18h → 12–18s p95

Stock data freshness

Event-driven store-edge replaces nightly file export

$400K+/yr → $0

Vendor change-request cost

Self-service promotions removed all POS vendor change tickets

Security & Compliance

✓ All customer data in ap-southeast-2; no cross-region replication
✓ Tokenized loyalty IDs on Kafka; raw PII never leaves the loyalty system
✓ mTLS between every store-edge agent and MSK; certificate rotation every 90 days
✓ Audit trail on every promotion publish, every reservation, and every price-list push
✓ Quarterly penetration test against the public Reservation API
✓ Snowflake outbound CDC uses a dedicated read-replica; no production DB access from analytics

Delivery & Operations

✓ GitHub Actions CI: promotion-engine reproducibility tests (240 historical promos as golden files), edge-agent watermark unit tests, Reservation API staleness-threshold contract tests
✓ Argo CD for cluster delivery; per-region progressive rollout (NSW canary → NSW full → VIC)
✓ Edge-agent updates via Ansible: 5-store canary → 24h stable SLO → remaining stores
✓ Quarterly chaos drills: MSK broker loss, edge-agent crash during trading, VPN saturation at peak
✓ Per-store SLO dashboards; on-call alert if any store goes dark during trading hours
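The golden-file reproducibility tests mentioned in the CI item above follow a simple pattern: render each historical promotion and compare the output bytes with the stored file. The sketch below shows that comparison only; the `render` callback and the shape of the `golden` mapping are hypothetical stand-ins for the engine and its test fixtures.

```python
from typing import Callable


def reproducibility_failures(
    render: Callable[[dict], bytes],
    golden: dict[str, tuple[dict, bytes]],
) -> list[str]:
    """Return IDs of historical promotions whose output is not byte-identical."""
    return [
        promo_id
        for promo_id, (rule, expected) in golden.items()
        if render(rule) != expected
    ]
```

A CI job built this way would fail whenever the returned list is non-empty, which is what lets the same test act as a regression canary for later engine changes.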

What we'd do again

Key Learnings

Augment, don't replace. Replacing 120 in-store POS systems would have been a $10M, 2-year program with enormous migration risk. A $180 mini-PC per store and a file-position watermark gave us the same operational outcome in 10 months without touching the POS software.
Fail-closed on stale stock is the right default for inventory reservation. The 2.1% reduction in offered orders was outweighed by a 67% improvement in fulfillment rate and the customer trust that comes from reliable delivery promises. Optimistic overselling optimizes for order count; fail-closed optimizes for fulfillment rate and LTV.
Promotion reproducibility was the trust gate that made self-service possible. We did not grant any category manager self-service access until the engine could reproduce all 240 historical promotions byte-for-byte. That reproducibility test is also the canary that catches engine regressions before they reach production.
Chaos drills paid for themselves on their first real test. An unexpected MSK broker rebalance in week 36 would have been a 2-hour incident without the rehearsed runbook. The team recovered in 8 minutes because they had run that exact scenario in a drill 6 weeks earlier.

Honest retrospective

What We'd Do Differently

Every project teaches us something we hadn't anticipated. Here's what we'd change if we were starting again.

The file-based SMB adapter was sensitive to permission changes on the POS vendor side — we had two outages in the first month from silent permission revocations we weren't notified about. We added a health-check monitor in week 6. It should have been in week 1. Any integration with a third-party file system needs an active health check that alerts before a silent failure becomes a trading impact.
We underestimated category-manager training time. The platform was intuitive to us. For someone who'd spent 10 years building promotions in spreadsheets, the promotion builder had too many options on screen at once. We ran 3 rounds of UX simplification that could have been done in 1 round if we'd run structured usability sessions during design rather than after release.
