November 15, 2024
Microservices Modernization: From Monolith to Observable Platform
Release cycles: weeks → days/hours + SLO-driven ops
Shortened release cycles and reduced incident impact through microservices migration and SLO-driven reliability
- Delivery speed: Teams shipped independently; release cycles moved from weeks to days/hours (representative).
- Reliability and ops: SLOs and error budgets focused reliability work on user impact. Observability (metrics/logs/traces) reduced incident impact and improved time-to-diagnosis.
- Efficiency and autonomy: Services scaled independently based on demand. Teams had end-to-end ownership with clearer accountability.
Scope & constraints
- Role: System Architect
- Scope: platform modernization, observability, and reliability
- Constraints: high traffic, risk management, and incremental migration without downtime
Role & Scope
- Team: ~15 engineers across product-aligned squads
- Responsibility: target architecture, migration strategy, reliability standards, and platform delivery foundations
- Stakeholders: product leadership, operations, and business owners for critical flows
- Constraints: high traffic, risk management, and incremental migration without downtime
Outcome metrics
- Delivery: release cycle moved from weeks to days/hours (representative)
- Reliability: SLOs + error budgets focused work on user-impacting issues
- Operations: faster diagnosis via metrics/logs/traces (representative MTTR reduction)
Context
I joined an insurtech product running as a monolithic PHP application. The system handled high traffic and business-critical user journeys (quoting, onboarding, policy updates, claims), but deployment cycles were slow (2-3 week releases) and cascading failures were becoming more frequent as both traffic and feature scope grew.
The engineering team of ~15 people was organized functionally, making it difficult to move quickly. Any change required coordination across multiple teams, and the blast radius of bugs was large: a single bad deployment could take down the entire platform.
Problem
The core challenges we faced:
Deployment bottlenecks: The monolith required coordinated releases with extensive QA cycles. Teams couldn't ship independently, leading to batched releases and delayed features.
Reliability concerns: Cascading failures were common. A memory leak in one subsystem could bring down the entire application. We lacked visibility into which components were struggling.
Scaling limitations: The monolith scaled as a single unit. We couldn't independently scale high-traffic services (like quote/pricing) without over-provisioning resources for low-traffic features.
Poor observability: We had basic monitoring (CPU, memory, response times) but lacked the granular insights needed to understand system behavior under load or diagnose issues quickly.
Approach
We took an incremental approach to modernization, prioritizing high-value services and establishing reliability practices alongside the migration:
Service decomposition strategy:
- Started with bounded contexts that had clear interfaces (quoting, customer identity/profile, policy lifecycle, claims intake)
- Used the strangler fig pattern: routing traffic to new services while keeping the monolith as a fallback
- Defined service contracts with OpenAPI specs and implemented contract testing
- Introduced a split stack where it made sense: Go for performance-sensitive services, NestJS for product APIs/BFF-style orchestration, and Next.js for the frontend
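The strangler fig routing layer can be sketched as a thin reverse proxy in Go. The path prefixes and upstream addresses below are illustrative placeholders, not the real service topology:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// migrated lists path prefixes already cut over to new services.
// Prefixes and upstream URLs are illustrative, not the real ones.
var migrated = map[string]string{
	"/quotes":   "http://quote-svc:8080",
	"/identity": "http://identity-svc:8080",
}

// upstreamFor picks the new service for a migrated path prefix and
// falls back to the monolith for everything else.
func upstreamFor(path, monolith string) string {
	for prefix, target := range migrated {
		if strings.HasPrefix(path, prefix) {
			return target
		}
	}
	return monolith
}

// newStranglerHandler proxies each request to the upstream chosen
// by upstreamFor, so the monolith keeps serving unmigrated routes.
func newStranglerHandler(monolith string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target, err := url.Parse(upstreamFor(r.URL.Path, monolith))
		if err != nil {
			http.Error(w, "bad upstream", http.StatusBadGateway)
			return
		}
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(upstreamFor("/quotes/new", "http://legacy-monolith:8080")) // → http://quote-svc:8080
	// In production: http.ListenAndServe(":80", newStranglerHandler("http://legacy-monolith:8080"))
}
```

Adding a prefix to `migrated` cuts a route over to its new service; removing it falls back to the monolith, which keeps rollback a one-line change.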
SLI/SLO framework:
- Worked with product and business stakeholders to define meaningful SLIs (availability, latency, error rate)
- Set SLOs based on user expectations and business requirements (e.g., 99.5% availability for the quote flow)
- Introduced error budgets to balance feature velocity with reliability work
Platform and delivery foundations:
- Containerized services with Docker to standardize local dev, testing, and deployments
- Moved the runtime footprint onto AWS to enable repeatable environments and safer rollouts
- Implemented CI/CD pipelines to automate builds, tests, and deployments (reducing manual release overhead and risk)
Observability stack:
- Deployed Prometheus for metrics collection with Grafana dashboards
- Centralized logging with structured logs (JSON format) aggregated in ELK stack
- Implemented distributed tracing with Jaeger to understand request flows across services
- Created runbooks and automated alerts tied to SLO violations
Organizational changes:
- Shifted to product-aligned teams with end-to-end ownership of services
- Established on-call rotations with clear escalation paths
- Ran game days to practice incident response and validate our observability tools
Outcomes
Over 18 months, we successfully migrated 8 core services to a microservices architecture:
Delivery predictability: Teams shipped independently and more frequently; release cycles moved from weeks to days/hours (representative).
Reliability and ops: SLOs + error budgets focused reliability work. Observability (metrics/logs/traces) reduced incident impact and improved time-to-diagnosis.
Efficiency and autonomy: Services scaled independently based on demand, and teams had end-to-end ownership with clearer accountability.
Stack / Constraints
Stack: PHP monolith migrated to microservices using Go and NestJS, with a Next.js frontend; Docker for containerization; AWS for hosting; CI/CD for automated build/test/deploy; OpenAPI for contracts; Prometheus/Grafana for metrics; ELK for logs; Jaeger for distributed tracing.
Constraints: High traffic volumes, risk management requirements, and need for incremental migration without downtime.
Decisions & Tradeoffs
- Incremental migration vs. rewrite: Used the strangler pattern to reduce risk and keep business continuity.
- Service boundaries vs. time: Invested in domain modeling to avoid chatty services, accepting slower early progress for long-term operability.
- Observability rollout: Prioritized metrics and alerting tied to SLOs before deep tracing, to keep adoption sustainable.
What I'd Do Differently
Earlier investment in contract testing: We encountered several integration issues that could have been caught earlier with more comprehensive contract tests between services. I would prioritize this from day one.
More gradual rollout of observability: We tried to implement metrics, logs, and tracing simultaneously, which was overwhelming for teams. A phased approach (metrics first, then logs, then tracing) would have been more sustainable.
Clearer service boundaries upfront: Some of our initial service boundaries weren't quite right, leading to chatty inter-service communication. Spending more time on domain modeling and event storming before cutting code would have helped.
Better documentation of decision-making: We made many architectural trade-offs during the migration, but didn't always document the reasoning. This made it harder for new team members to understand why certain patterns existed.
Despite these learnings, the migration was successful in achieving its core goals: improved reliability, faster deployment cycles, and better team autonomy. The SLO-driven approach to reliability continues to guide engineering priorities today.