
Inventory & Promotion Platform for a 120-Store Supermarket Chain

Regional supermarket chain (APAC)
Retail — grocery
Australia
Duration: 10 months (Phase 1–5)
Team Size: 8 people
Services: 3 services

Client Context

A 120-store regional grocery chain operating across two Australian states. Stores ran on a legacy Windows-based POS with stock managed per-store and promotions configured in a category-management spreadsheet emailed weekly to store managers. The chain wanted to compete on freshness and promotional speed against the Coles / Woolworths duopoly without replacing the POS estate.

The Challenge

Business Challenge

Promotions took on average 3 days from category-manager decision to live price in store. Stock visibility for online click-and-collect was unreliable, leading to ~9% order-line cancellations. The chain could not run national-scale dynamic promotions because rule changes had to be pushed by hand.

Technical Challenge

The legacy POS at each store had limited connectivity: a single VPN tunnel that was often saturated. There was no central inventory system, so stock numbers at head office were 6–18h stale. Promotion rules were hard-coded into the legacy POS and required a vendor service ticket to change.

Signals Before We Started

  • Average 3 days from promo decision to live in-store price

  • ~9% line cancellation rate on click-and-collect orders

  • Stock data in HQ 6–18 hours stale

  • Promotion changes required a vendor change request (~$2,400 per change)

  • Peak-hour POS latency spikes during weekend trading

Our Solution

Overview

An event-driven inventory + promotion platform that sits alongside the existing POS estate. Each store runs a lightweight edge agent that publishes sales and stock events to HQ in near real-time, while a rules-based promotion engine lets category managers ship pricing changes themselves in minutes.

Architecture

Java 21 Spring Boot microservices, with PostgreSQL 16 holding canonical product, store and promotion state and Apache Kafka handling event distribution. Redis serves hot-path caches at the store edge. The platform runs on AWS EKS in ap-southeast-2 across multiple availability zones, with a React admin console for category managers and Grafana, Loki and Tempo for observability. Each store runs a small Linux box (the 'store-edge') that publishes to Kafka via a managed MSK private endpoint.

Approach

  1. Edge-agent pilot in 6 stores before broad rollout

  2. Event-driven stock model: never block the POS

  3. Rule-based promo engine with safe-by-default validation

  4. Phased regional rollout (NSW first, then VIC)

  5. Reuse the existing POS: do not replace, only augment

  6. Click-and-collect stock guarantee via reservation API

Platform Modules

The system was delivered as the following modules — each with its own owner, integration contract and rollout plan.

Store-edge agent

Lightweight Linux service per store; tails POS export files, normalizes to domain events, publishes to Kafka. Buffers locally if HQ connectivity drops.
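
The buffer-and-drain behavior described above can be sketched as follows. This is a minimal in-memory illustration, not the agent's actual code: `EventSink` and `EdgeBuffer` are hypothetical names, events are plain strings, and a real agent would presumably back the queue with disk and wrap a Kafka producer behind the sink.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sink; in a real agent this would wrap a Kafka producer. */
interface EventSink {
    /** Returns true if HQ accepted the event, false if the link is down. */
    boolean trySend(String event);
}

/** Buffers events in arrival order and drains them once the sink accepts again. */
final class EdgeBuffer {
    private final Deque<String> pending = new ArrayDeque<>();
    private final EventSink sink;

    EdgeBuffer(EventSink sink) {
        this.sink = sink;
    }

    /** Queue the event, then try to flush; ordering is preserved across outages. */
    void publish(String event) {
        pending.addLast(event);
        drain();
    }

    /** Send from the head of the queue until the sink refuses or the queue is empty. */
    void drain() {
        while (!pending.isEmpty() && sink.trySend(pending.peekFirst())) {
            pending.removeFirst();
        }
    }

    int pendingCount() {
        return pending.size();
    }
}
```

An event is only removed from the queue after the sink reports success, so a dropped connection delays delivery rather than losing events.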

Inventory service

Maintains canonical stock per SKU per store from event stream; serves reservation requests for online channels.
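
One way to picture how an event stream becomes canonical stock is a simple fold over signed quantity changes. The sketch below assumes a hypothetical `StockEvent` shape with a signed `delta`; the real event schema and storage (PostgreSQL, with event history) are not shown in this case study.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical event shape: a signed quantity change for one SKU at one store. */
record StockEvent(String storeId, String sku, int delta) {}

/** Folds stock events into an on-hand quantity per (store, SKU). */
final class StockProjection {
    private final Map<String, Integer> onHand = new HashMap<>();

    private static String key(String storeId, String sku) {
        return storeId + "|" + sku;
    }

    /** A sale arrives as a negative delta, a delivery or count correction as a positive one. */
    void apply(StockEvent e) {
        onHand.merge(key(e.storeId(), e.sku()), e.delta(), Integer::sum);
    }

    /** Unknown store/SKU pairs report zero stock. */
    int onHand(String storeId, String sku) {
        return onHand.getOrDefault(key(storeId, sku), 0);
    }
}
```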

Promotion engine

Rule-based engine with dry-run, time windows, customer-segment targeting, and audit history.
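
As an illustration of a time-windowed rule with a dry-run, the sketch below prices a multi-buy offer and reports the discount it would have produced without publishing anything. `MultiBuyRule`, `BasketLine` and `PromoDryRun` are hypothetical names, prices are in cents, and the engine's real rule model is assumed to be richer than this.

```java
import java.time.Instant;
import java.util.List;

/** Hypothetical multi-buy rule: every buyQty units of sku cost bundlePriceCents inside the window. */
record MultiBuyRule(String sku, int buyQty, long bundlePriceCents, Instant startsAt, Instant endsAt) {
    MultiBuyRule {
        if (buyQty <= 0) throw new IllegalArgumentException("buyQty must be positive");
    }

    /** The window is start-inclusive and end-exclusive. */
    boolean activeAt(Instant t) {
        return !t.isBefore(startsAt) && t.isBefore(endsAt);
    }
}

record BasketLine(String sku, int qty, long unitPriceCents) {}

final class PromoDryRun {
    /** Price of one line under the rule; lines outside the rule or window keep list price. */
    static long priceLine(BasketLine line, MultiBuyRule rule, Instant at) {
        long listTotal = line.qty() * line.unitPriceCents();
        if (!rule.sku().equals(line.sku()) || !rule.activeAt(at)) {
            return listTotal;
        }
        long bundles = line.qty() / rule.buyQty();
        long remainder = line.qty() % rule.buyQty();
        return bundles * rule.bundlePriceCents() + remainder * line.unitPriceCents();
    }

    /** Dry-run: total discount the rule would give on these lines, with no side effects. */
    static long discountCents(List<BasketLine> lines, MultiBuyRule rule, Instant at) {
        long discount = 0;
        for (BasketLine l : lines) {
            discount += l.qty() * l.unitPriceCents() - priceLine(l, rule, at);
        }
        return discount;
    }
}
```

Because both methods are pure functions of their inputs, the same code path can be replayed against historical baskets to estimate margin impact before a rule goes live.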

Category-manager console

React admin app — design, validate and ship promotions; preview impact on margin and historical sales.

Reservation API

Public-facing API for click-and-collect; reserves stock at the destination store with a TTL; rejects rather than overselling.
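
The reserve-with-TTL behavior can be sketched for a single SKU at a single store as below. `ReservationBook` is a hypothetical in-memory stand-in; the production API is assumed to persist holds and coordinate across instances, which this sketch does not attempt.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of TTL-based holds for one SKU at one store. */
final class ReservationBook {
    private record Hold(int qty, Instant expiresAt) {}

    private final List<Hold> holds = new ArrayList<>();
    private final Duration ttl;
    private final int onHand;

    ReservationBook(int onHand, Duration ttl) {
        this.onHand = onHand;
        this.ttl = ttl;
    }

    /** Units not covered by an unexpired hold. */
    int available(Instant now) {
        // Expired holds release their stock back to the pool.
        holds.removeIf(h -> !h.expiresAt().isAfter(now));
        int held = holds.stream().mapToInt(Hold::qty).sum();
        return onHand - held;
    }

    /** Places a hold and returns true, or returns false if the request would oversell. */
    boolean reserve(int qty, Instant now) {
        if (qty <= 0 || qty > available(now)) {
            return false;
        }
        holds.add(new Hold(qty, now.plus(ttl)));
        return true;
    }
}
```

A request that exceeds available stock is refused outright rather than partially filled, which matches the reject-rather-than-oversell stance described above.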

Observability stack

Grafana + Loki + Tempo; per-store SLO dashboards; alerts for edge-agent disconnects or excessive event lag.
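
An event-lag alert of the kind described here usually reduces to a percentile check over recent samples. The sketch below uses the nearest-rank method; `LagStats` is a hypothetical helper, and in practice this calculation would more likely live in a dashboard or alert query than in application code.

```java
import java.util.Arrays;

final class LagStats {
    /** Nearest-rank percentile: the smallest sample with at least `percent`% of samples at or below it. */
    static long percentile(long[] samplesMs, int percent) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Integer ceiling of percent * n / 100 avoids floating-point rounding surprises.
        int rank = (int) (((long) percent * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }

    /** True when the p95 of the samples exceeds the alert threshold. */
    static boolean breaches(long[] samplesMs, long thresholdMs) {
        return percentile(samplesMs, 95) > thresholdMs;
    }
}
```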

Data Flow

Each store-edge agent watches POS export files and emits `sale.line` and `stock.adjust` events to Kafka. The Inventory service consumes these and updates canonical stock per SKU per store with full event history. When the e-commerce front-end calls the Reservation API, the request is served against canonical stock with a configurable freshness threshold — if the data is staler than the threshold, the API rejects rather than overselling. Promotion changes flow the other direction: category managers publish in the console, the Promotion engine validates and stores them, and the store-edge agent pulls the active price list every 60s and writes it back into the local POS via the vendor's documented price-update file format.
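
The freshness check in this flow can be sketched as a small decision function. `StockSnapshot`, `FreshnessGate` and `ReservationDecision` are hypothetical names, and the age is measured from the last event applied for the store, which is an assumption about how staleness is tracked.

```java
import java.time.Duration;
import java.time.Instant;

enum ReservationDecision { ACCEPT, REJECT_STALE, REJECT_INSUFFICIENT }

/** Hypothetical view of canonical stock plus the time of the last event applied to it. */
record StockSnapshot(int available, Instant lastEventAt) {}

/** Fail-closed gate: stale data is treated as a reason to refuse, not to guess. */
final class FreshnessGate {
    private final Duration maxAge;

    FreshnessGate(Duration maxAge) {
        this.maxAge = maxAge;
    }

    ReservationDecision decide(StockSnapshot s, int requestedQty, Instant now) {
        Duration age = Duration.between(s.lastEventAt(), now);
        // Staleness is checked first, so a stale "in stock" figure is never trusted.
        if (age.compareTo(maxAge) > 0) {
            return ReservationDecision.REJECT_STALE;
        }
        if (requestedQty > s.available()) {
            return ReservationDecision.REJECT_INSUFFICIENT;
        }
        return ReservationDecision.ACCEPT;
    }
}
```

Returning a distinct `REJECT_STALE` outcome keeps connectivity problems separable from genuine stock-outs in cancellation telemetry.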

Integrations

  • Existing Windows POS estate (export files read by the store-edge agent; price-update files written back in the vendor's documented format)

  • Loyalty program (customer segments and points balance)

  • E-commerce front-end via REST + webhook

  • AWS Cognito for category-manager SSO

  • Existing data warehouse (Snowflake) — outbound CDC for analytics

Delivery Timeline

Phased delivery — each phase had explicit goals, measurable outcomes and a checkpoint before progression.

  1. Phase 1 — Edge pilot

    Week 1–6
    Goals
    • Validate that a small Linux box per store can read POS exports without disrupting trading
    • Stream sales + stock events to HQ Kafka in <30s p95
    • Prove the model in 6 pilot stores in metro Sydney
    Outcomes
    • Edge agent shipped events at 12–18s end-to-end p95
    • Zero POS incidents attributable to the agent over 6-week pilot
    • HQ now sees pilot-store stock at near-real-time freshness
  2. Phase 2 — Canonical product & promotion engine

    Week 5–14
    Goals
    • Canonical product, store, and price entities in PostgreSQL
    • Rule-based promotion engine (buy X get Y, multi-buy, loyalty-only, time-windowed)
    • Category-manager admin console with validation + dry-run
    Outcomes
    • Engine validated against 240 historical promotions — 100% reproducible
    • Category managers ran their first self-service promo in week 12
  3. Phase 3 — Click & collect reservation

    Week 12–20
    Goals
    • Stock reservation API for the online channel
    • Stale-stock protection (rejection rather than overselling)
    • Order-line cancellation telemetry
    Outcomes
    • Click-and-collect cancellation rate dropped from 9% to 3.2%
    • Customer service tickets on missing items fell ~45%
  4. Phase 4 — Regional rollout

    Week 18–32
    Goals
    • Roll out edge agent to all NSW stores (≈78)
    • Then VIC stores (≈42)
    • Decommission spreadsheet-based promo distribution
    Outcomes
    • All 120 stores connected by week 30
    • Average promo go-live time dropped from 3 days to ~2h
    • Vendor change-request spend on promos reduced to ~$0
  5. Phase 5 — Hardening & peak prep

    Week 30–40
    Goals
    • Chaos drills against the edge agent
    • Peak-trading load test (2× Black-Friday-equivalent)
    • Runbook handover to ops team
    Outcomes
    • Survived simulated Kafka broker loss + edge-agent restart without store impact
    • Peak load test sustained at 14× normal RPS for 30 minutes
    • Ops team certified on incident runbooks

Technology Stack

Java 21
Spring Boot 3
PostgreSQL 16
Apache Kafka (MSK)
Redis
React 18
AWS (EKS, MSK, S3, Cognito)
Grafana / Loki / Tempo

The Results

Measurable impact delivered within 10 months (Phase 1–5).

Promotion go-live time
Average of 3 days down to ~2h, via self-service promotions across 120 stores
Click-and-collect cancellations
9% down to 3.2%, via the Reservation API and canonical stock
Stock data freshness at HQ
6–18 hours stale down to near real time, via event-driven sync from the store-edge
Vendor change-request spend
Reduced to ~$0; previously ~$2,400 per change × tens of changes per quarter

Security & Compliance

  • All customer data hosted in ap-southeast-2; no cross-region replication
  • Tokenized loyalty IDs in event stream — no raw PII on Kafka topics
  • Per-store mutual TLS between edge agent and MSK
  • Audit trail on every promotion publish and every reservation
  • Quarterly penetration test against the public reservation API

Delivery & Operations

  • GitHub Actions CI with promotion-engine reproducibility tests against 240 historical promos
  • Argo CD for cluster delivery; per-region progressive rollout
  • Edge-agent updates via Ansible with canary group (5 stores) → full rollout
  • On-call rotation shared between HQ ops and a Vireon platform engineer
  • Quarterly chaos drill: simulate broker loss, edge-agent crash, and connectivity outage

Key Learnings: What We'd Do Again

  • Augment, don't replace. Replacing 120 in-store POS systems would have been a 3-year program; the store-edge agent gave us the same operational outcome in months without touching the POS.

  • Reject, don't oversell. The single most impactful product decision was making the Reservation API fail-closed on stale stock — counterintuitively, fewer offered orders made customers trust the channel more.

  • Reproducibility was the gate to category-manager trust. We did not let anyone ship a self-service promo until the engine could reproduce every one of the last 240 historical promotions byte-for-byte.

  • Chaos drills paid for themselves the first time a Kafka broker rebalanced unexpectedly in production — the team recovered in under 8 minutes because they had rehearsed the exact scenario.
