November 15, 2024
Microservices Modernization: From Monolith to Observable Platform
Release cycles: weeks → days/hours + SLO-driven ops
Shortened release cycles and reduced incident impact through microservices migration and SLO-driven reliability
- Delivery speed: Teams shipped independently; release cycles moved from weeks to days/hours (representative).
- Reliability and ops: SLOs and error budgets focused reliability work on user impact. Observability (metrics/logs/traces) reduced incident impact and improved time-to-diagnosis.
- Efficiency and autonomy: Services scaled independently based on demand. Teams had end-to-end ownership with clearer accountability.
Scope & constraints
- Role: System Architect
- Scope: platform modernization, observability, and reliability
- Constraints: high traffic, risk management, and incremental migration without downtime
Role & Scope
- Team: ~15 engineers across product-aligned squads
- Responsibility: target architecture, migration strategy, reliability standards, and platform delivery foundations
- Stakeholders: product leadership, operations, and business owners for critical flows
- Constraints: high traffic, risk management, and incremental migration without downtime
Outcome metrics
- Delivery: release cycle moved from weeks to days/hours (representative)
- Reliability: SLOs + error budgets focused work on user-impacting issues
- Operations: faster diagnosis via metrics/logs/traces (representative MTTR reduction)
Context
I joined an insurtech product running as a monolithic PHP application. The system handled high traffic and business-critical user journeys (quoting, onboarding, policy updates, claims), but deployment cycles were slow (2-3 week releases) and cascading failures were becoming more frequent as both traffic and feature scope grew.
The engineering team of ~15 people was organized functionally, making it difficult to move quickly. Any change required coordination across multiple teams, and the blast radius of bugs was large: a single bad deployment could take down the entire platform.
Problem
The core challenges we faced:
Deployment bottlenecks: The monolith required coordinated releases with extensive QA cycles. Teams couldn't ship independently, leading to batched releases and delayed features.
Reliability concerns: Cascading failures were common. A memory leak in one subsystem could bring down the entire application. We lacked visibility into which components were struggling.
Scaling limitations: The monolith scaled as a single unit. We couldn't independently scale high-traffic services (like quote/pricing) without over-provisioning resources for low-traffic features.
Poor observability: We had basic monitoring (CPU, memory, response times) but lacked the granular insights needed to understand system behavior under load or diagnose issues quickly.
Approach
We took an incremental approach to modernization, prioritizing high-value services and establishing reliability practices alongside the migration:
Service decomposition strategy:
- Started with bounded contexts that had clear interfaces (quoting, customer identity/profile, policy lifecycle, claims intake)
- Used the strangler fig pattern: routing traffic to new services while keeping the monolith as a fallback
- Defined service contracts with OpenAPI specs and implemented contract testing
- Introduced a split stack where it made sense: Go for performance-sensitive services, NestJS for product APIs/BFF-style orchestration, and Next.js for the frontend
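The strangler fig routing layer can be sketched as a thin reverse proxy in Go. The path prefixes and upstream addresses below are illustrative placeholders, not the real service topology:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// migrated lists path prefixes already cut over to new services.
// Prefixes and upstream URLs are illustrative, not the real ones.
var migrated = map[string]string{
	"/quotes":   "http://quote-svc:8080",
	"/identity": "http://identity-svc:8080",
}

// upstreamFor picks the new service for a migrated path prefix and
// falls back to the monolith for everything else.
func upstreamFor(path, monolith string) string {
	for prefix, target := range migrated {
		if strings.HasPrefix(path, prefix) {
			return target
		}
	}
	return monolith
}

// newStranglerHandler proxies each request to the upstream chosen
// by upstreamFor, so the monolith keeps serving unmigrated routes.
func newStranglerHandler(monolith string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target, err := url.Parse(upstreamFor(r.URL.Path, monolith))
		if err != nil {
			http.Error(w, "bad upstream", http.StatusBadGateway)
			return
		}
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(upstreamFor("/quotes/new", "http://legacy-monolith:8080")) // → http://quote-svc:8080
	// In production: http.ListenAndServe(":80", newStranglerHandler("http://legacy-monolith:8080"))
}
```

Adding a prefix to `migrated` cuts a route over to its new service; removing it falls back to the monolith, which keeps rollback a one-line change.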
SLI/SLO framework:
- Worked with product and business stakeholders to define meaningful SLIs (availability, latency, error rate)
- Set SLOs based on user expectations and business requirements (e.g., 99.5% availability for the quote flow)
- Introduced error budgets to balance feature velocity with reliability work
Platform and delivery foundations:
- Containerized services with Docker to standardize local dev, testing, and deployments
- Moved the runtime footprint onto AWS to enable repeatable environments and safer rollouts
- Implemented CI/CD pipelines to automate builds, tests, and deployments (reducing manual release overhead and risk)
Observability stack:
- Deployed Prometheus for metrics collection with Grafana dashboards
- Centralized logging with structured logs (JSON format) aggregated in ELK stack
- Implemented distributed tracing with Jaeger to understand request flows across services
- Created runbooks and automated alerts tied to SLO violations
Organizational changes:
- Shifted to product-aligned teams with end-to-end ownership of services
- Established on-call rotations with clear escalation paths
- Ran game days to practice incident response and validate our observability tools
Outcomes
Over 18 months, we successfully migrated 8 core services to a microservices architecture:
Delivery predictability: Teams shipped independently and more frequently; release cycles moved from weeks to days/hours (representative).
Reliability and ops: SLOs + error budgets focused reliability work. Observability (metrics/logs/traces) reduced incident impact and improved time-to-diagnosis.
Efficiency and autonomy: Services scaled independently based on demand, and teams had end-to-end ownership with clearer accountability.
Stack / Constraints
Stack: PHP monolith migrated to microservices using Go and NestJS, with a Next.js frontend; Docker for containerization; AWS for hosting; CI/CD for automated build/test/deploy; OpenAPI for contracts; Prometheus/Grafana for metrics; ELK for logs; Jaeger for distributed tracing.
Constraints: High traffic volumes, risk management requirements, and need for incremental migration without downtime.
Decisions & Tradeoffs
- Incremental migration vs. rewrite: Used the strangler pattern to reduce risk and keep business continuity.
- Service boundaries vs. time: Invested in domain modeling to avoid chatty services, accepting slower early progress for long-term operability.
- Observability rollout: Prioritized metrics and alerting tied to SLOs before deep tracing, to keep adoption sustainable.
What I'd Do Differently
Earlier investment in contract testing: We encountered several integration issues that could have been caught earlier with more comprehensive contract tests between services. I would prioritize this from day one.
More gradual rollout of observability: We tried to implement metrics, logs, and tracing simultaneously, which was overwhelming for teams. A phased approach (metrics first, then logs, then tracing) would have been more sustainable.
Clearer service boundaries upfront: Some of our initial service boundaries weren't quite right, leading to chatty inter-service communication. Spending more time on domain modeling and event storming before cutting code would have helped.
Better documentation of decision-making: We made many architectural trade-offs during the migration, but didn't always document the reasoning. This made it harder for new team members to understand why certain patterns existed.
Despite these learnings, the migration was successful in achieving its core goals: improved reliability, faster deployment cycles, and better team autonomy. The SLO-driven approach to reliability continues to guide engineering priorities today.