Engineering a Platform from Zero to IPO: Technical Decisions That Drove Nykaa's Growth
Between 2017 and 2021, I served as Principal Engineering Consultant at Nykaa, India’s largest beauty and personal care e-commerce platform. During that period, the platform went through 500% traffic growth, a complete platform migration, and ultimately an IPO that valued the company at $13 billion. This article documents the engineering decisions that enabled that trajectory—what we got right, what we got wrong, and what I’d do differently.
This isn’t a success narrative. It’s a technical post-mortem of a scaling journey, written for engineering leaders facing similar growth challenges.
The Starting State
When I joined Nykaa in 2017, the platform was a monolithic PHP application running on a small cluster of EC2 instances. It worked, but it had the predictable problems of early-stage e-commerce platforms:
- Deployment frequency: Once a week, usually on Sundays, with the entire engineering team on standby.
- Incident rate: 2-3 production incidents per week, mostly related to database overload during flash sales.
- Feature velocity: 4-6 weeks from concept to production for standard features.
- Scalability ceiling: The platform could handle approximately 50K concurrent users before degrading.
The business was growing at 80% year-over-year. The platform couldn’t keep up. The question wasn’t whether to re-architect—it was how to do it without slowing down a business that couldn’t afford to slow down.
Decision 1: Strangler Fig Migration
The decision: Migrate from monolith to microservices incrementally using the strangler fig pattern, rather than attempting a complete rewrite.
Why this was right: A full rewrite would have taken 12-18 months and frozen feature development. With 80% growth, that was commercially unacceptable. The strangler fig approach let us extract services one at a time while the monolith continued to serve traffic.
How we did it:
- Identified extraction boundaries based on business capability, not technical convenience. The first service we extracted wasn’t the easiest—it was catalog search, because search latency directly impacted conversion.
- Built an API gateway that routed traffic between the monolith and new services. This was the critical enabler—it let us redirect traffic at the URL level, making migrations invisible to the frontend (a routing sketch follows this list).
- Established a service template with standardized observability, deployment, and error handling. Every new service started from the same base, reducing per-service setup time from weeks to hours.
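The gateway-level routing is worth a concrete illustration. The sketch below is not our actual route table (the prefixes, hostnames, and the Python framing are purely illustrative), but it captures the mechanism: longest-prefix matching sends extracted paths to new services and lets everything else fall through to the monolith.

```python
# Illustrative sketch of URL-prefix routing during a strangler fig migration.
# Prefixes and upstream hosts are hypothetical, not the actual route table.

MONOLITH = "http://monolith.internal"

# Longest-prefix-wins routing table: anything extracted into a service gets an
# entry here; everything else falls through to the monolith untouched.
ROUTES = {
    "/search":  "http://catalog-search.internal",
    "/catalog": "http://catalog-search.internal",
    "/cart":    "http://cart-service.internal",
}

def upstream_for(path: str) -> str:
    """Pick the upstream for a request path; default to the monolith."""
    for prefix, upstream in sorted(ROUTES.items(), key=lambda kv: -len(kv[0])):
        if path.startswith(prefix):
            return upstream
    return MONOLITH

if __name__ == "__main__":
    for p in ["/search/lipstick", "/checkout", "/cart/items"]:
        print(p, "->", upstream_for(p))
```

Extracting a new service then becomes a one-line addition to the route table, with no frontend change required.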
The outcome: Over 18 months, we extracted 12 core services from the monolith. Feature velocity improved from 6 weeks to 2 weeks. Deployment frequency went from weekly to daily.
What I’d do differently: We extracted services too granularly in some cases, creating distributed monolith problems. Three of our “microservices” were later consolidated because they were always deployed together and shared a data model. I’d be more aggressive about keeping related functionality in the same service.
Decision 2: Performance as a Feature
The decision: Treat page load time as a product feature with specific targets, dedicated engineering time, and business metric correlation.
The context: In 2018, our mobile web experience had a 6-second time to interactive. India’s e-commerce market is mobile-first—80%+ of Nykaa’s traffic was mobile. Six seconds was losing customers before they saw a product.
The target: Sub-2-second time to interactive on a mid-range Android device over 4G.
The approach:
- Measured the business impact first. We ran a controlled experiment: serve a percentage of users an intentionally faster (pre-cached) version of the homepage. The result—every 100ms of improvement correlated with a 0.8% increase in session duration and a 0.3% increase in add-to-cart rate. This turned performance optimization from a technical initiative into a revenue initiative.
- Attacked the critical rendering path. Server-side rendering for above-the-fold content. Lazy loading for everything below. Image optimization pipeline that served WebP with appropriate dimensions based on device.
- Built a performance budget system. Every new feature had a performance budget. If adding a recommendation widget increased page weight by 200KB, the team had to find 200KB of savings elsewhere or get explicit approval for the regression (a sketch of the CI check follows this list).
- Continuous monitoring. Real user monitoring (RUM) dashboards displayed p50/p95/p99 load times segmented by device type, network, and geography. Any regression triggered an automated alert and a mandatory investigation.
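The budget system is the piece that generalizes best, and it is easy to enforce mechanically in CI. A minimal sketch, assuming a hypothetical per-page asset manifest and budget values rather than our actual tooling:

```python
# Illustrative CI check for a per-page performance budget.
# Budget values and the manifest format are hypothetical, not the real tooling.
import json
import sys

BUDGETS_KB = {      # maximum allowed asset weight per page, in kilobytes
    "home": 350,
    "product": 400,
}

def check(manifest_path: str) -> int:
    """Return a non-zero exit code if any page exceeds its budget."""
    with open(manifest_path) as f:
        # e.g. {"home": {"js": 180, "css": 60, "img": 150}, "product": {...}}
        manifest = json.load(f)

    failed = False
    for page, assets in manifest.items():
        total_kb = sum(assets.values())
        budget = BUDGETS_KB.get(page)
        if budget is not None and total_kb > budget:
            print(f"{page}: {total_kb}KB exceeds budget of {budget}KB")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1]))
```

Failing the build is what makes the budget a budget; a dashboard alone becomes a suggestion.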
The outcome: TTI went from 6 seconds to 1.8 seconds over 6 months. During the same period, mobile conversion rate improved by 15%. The CFO became the performance team’s biggest advocate—unusual for a cost center.
The lesson: Performance work that’s connected to revenue metrics gets funded indefinitely. Performance work pitched as “technical improvement” gets one quarter of investment and then loses priority.
Decision 3: Flash Sale Architecture
The decision: Build a dedicated flash sale system with pre-computed inventory, edge caching, and queue-based checkout rather than scaling the general e-commerce platform for peak load.
The context: Nykaa’s flash sales generated 10-20x normal traffic in 15-minute windows. Scaling the entire platform for these peaks was cost-prohibitive and architecturally wasteful—the sale traffic pattern was fundamentally different from browsing traffic.
The architecture:
- Pre-computed product pages: Flash sale items were rendered to static HTML and pushed to CDN edge nodes 30 minutes before the sale. The product page served during a flash sale was essentially a static file, not a database query.
- Inventory management via Redis: Flash sale inventory was loaded into Redis with atomic decrement operations. No database involved in the hot path. When Redis inventory hit zero, the CDN served a “sold out” page (a sketch of the hot path follows this list).
- Queue-based checkout: Instead of processing checkouts in real time during the sale (which would overwhelm the payment system), users entered a checkout queue. The queue processed orders at a rate the downstream systems could handle, typically 500 orders per second.
- Graceful degradation: If any component failed, the system defaulted to “sold out” rather than erroring. A false “sold out” is recoverable (restock and re-announce). A checkout error during a flash sale is a customer service nightmare.
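The hot path reduces to two Redis operations: an atomic decrement on the inventory counter and a push onto the checkout queue. A minimal sketch using redis-py, with hypothetical key names (the production system also handled payments, fraud checks, and reconciliation):

```python
# Illustrative flash sale hot path: atomic inventory decrement in Redis plus a
# queue-based checkout. Key names and payload shape are assumptions.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def reserve(sku: str, user_id: str) -> bool:
    """Try to reserve one unit of `sku`; enqueue the checkout if successful."""
    remaining = r.decr(f"flashsale:stock:{sku}")   # atomic, no database in the hot path
    if remaining < 0:
        r.incr(f"flashsale:stock:{sku}")           # undo the overshoot; item is sold out
        return False
    # Checkout is processed asynchronously, not during the traffic spike.
    r.rpush("flashsale:checkout_queue", json.dumps({"sku": sku, "user": user_id}))
    return True

# Before the sale, inventory is pre-loaded, e.g. r.set("flashsale:stock:LIPSTICK-01", 5000)
```

A separate worker drains the queue at the rate downstream systems can handle (the 500 orders per second mentioned above), decoupling the spike from payment processing.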
The outcome: We handled 500K concurrent users during a major sale event with zero downtime. The previous architecture had crashed at 50K. Cost per transaction during flash sales dropped 85% because we weren’t over-provisioning the entire platform.
What I’d do differently: The queue-based checkout introduced a UX challenge—users didn’t know their position or estimated wait time. Showing each user their queue position and an estimated wait would have significantly reduced cart abandonment while they waited.
Decision 4: Observability-First Development
The decision: Mandate that every new service ship with full observability before accepting any traffic.
The context: By 2019, we had 12 microservices, and debugging cross-service issues was consuming 30% of senior engineering time. We had logging but not tracing. We had metrics but not correlation. Every incident required manual log-grepping across multiple services.
The standard:
Every service must ship with:
- Distributed tracing using OpenTelemetry, with trace IDs propagated across all service calls
- Structured logging in JSON format with standardized fields (request_id, user_id, service_name, latency_ms)
- RED metrics (Rate, Errors, Duration) exposed via Prometheus endpoints
- Health checks that test actual dependencies, not just process liveness
- Runbooks for the top 3 expected failure modes
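Most of this standard lived in shared middleware rather than in each service. A minimal sketch of the logging and RED-metrics half, using prometheus_client and the standard logging module with hypothetical metric and field names (OpenTelemetry trace propagation is omitted for brevity):

```python
# Illustrative shared middleware: structured JSON logs plus RED metrics.
# Metric names, label sets, and log fields are assumptions, not the real ones.
import json
import logging
import time
import uuid

from prometheus_client import Counter, Histogram

REQUESTS = Counter("requests_total", "Request rate", ["service", "route", "status"])
LATENCY = Histogram("request_duration_seconds", "Request duration", ["service", "route"])

log = logging.getLogger("service")

def handle(service_name: str, route: str, handler, *args, **kwargs):
    """Wrap a request handler with RED metrics and one structured log line."""
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    status = "ok"
    try:
        return handler(*args, **kwargs)
    except Exception:
        status = "error"
        raise
    finally:
        latency_ms = (time.monotonic() - start) * 1000
        REQUESTS.labels(service_name, route, status).inc()
        LATENCY.labels(service_name, route).observe(latency_ms / 1000)
        log.info(json.dumps({
            "request_id": request_id,
            "service_name": service_name,
            "route": route,
            "status": status,
            "latency_ms": round(latency_ms, 1),
        }))
```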
The enforcement mechanism: Services that didn’t meet the observability standard weren’t allowed to register in the service mesh. No observability, no traffic. This was controversial—it slowed initial service deployment by 2-3 days. But it eliminated the “we can add monitoring later” pattern that had created our debugging crisis.
The outcome: Mean time to detection (MTTD) for incidents dropped from 12 minutes to under 2 minutes. Mean time to resolution (MTTR) dropped from 90 minutes to 25 minutes. The 30% senior engineer time spent on debugging was redirected to feature work.
Decision 5: Database Strategy for Scale
The decision: Adopt a polyglot persistence strategy with clear guidelines for when to use each technology.
The monolith’s approach: Everything in MySQL. Product catalog, user sessions, order history, inventory, search indices—all in a single MySQL cluster.
The new strategy:
| Data Type | Technology | Why |
|---|---|---|
| Product catalog | PostgreSQL | Complex queries, JSONB for flexible attributes |
| User sessions | Redis | Ephemeral, high-frequency reads |
| Search | Elasticsearch | Full-text, faceted search |
| Order history | PostgreSQL (separate cluster) | Transactional integrity, audit requirements |
| Inventory | Redis (primary) + PostgreSQL (source of truth) | Atomic operations at flash sale speed, durable record |
| Analytics | ClickHouse | Columnar storage for high-volume event data |
| Cache | Redis + CDN | Multi-layer caching for read-heavy traffic patterns |
The critical rule: Every data store had a designated owner team and documented write patterns. No service could write to another service’s data store. This prevented the distributed data coupling that kills microservice architectures.
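The inventory row illustrates the Redis-plus-PostgreSQL pattern best: Redis serves the hot path while PostgreSQL remains the durable record. One way to keep the two aligned is a periodic reconciliation job owned by the inventory team that rewrites the Redis counters from the database; the sketch below assumes hypothetical table, column, and key names.

```python
# Illustrative reconciliation job: PostgreSQL is the source of truth for inventory,
# Redis is the hot-path counter. Table, column, and key names are assumptions.
import psycopg2
import redis

def reconcile(pg_dsn: str, redis_host: str) -> None:
    """Rewrite Redis inventory counters from the durable PostgreSQL record."""
    pg = psycopg2.connect(pg_dsn)
    r = redis.Redis(host=redis_host, decode_responses=True)
    try:
        with pg.cursor() as cur:
            cur.execute("SELECT sku, quantity FROM inventory WHERE quantity >= 0")
            for sku, quantity in cur.fetchall():
                r.set(f"inventory:{sku}", quantity)
    finally:
        pg.close()
```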
The outcome: Database-related incidents dropped from 2-3 per week to less than 1 per month. Individual data stores could be scaled, optimized, and maintained independently.
The Numbers
By the time of Nykaa’s IPO filing in 2021, the engineering metrics told the story:
| Metric | 2017 | 2021 | Change |
|---|---|---|---|
| Deploy frequency | Weekly | 15-20/day | ~100x |
| Lead time (commit to prod) | 7 days | 45 minutes | ~220x |
| Change failure rate | 18% | 2.1% | ~9x improvement |
| MTTR | 90 min | 25 min | ~3.5x improvement |
| Concurrent user capacity | 50K | 500K+ | 10x |
| Mobile TTI | 6s | 1.8s | 3.3x improvement |
| Platform uptime | 99.5% | 99.99% | — |
| Feature lead time | 6 weeks | 2 weeks | 3x |
These aren’t aspirational numbers or benchmarks. They’re the actual measurements from the monitoring systems we built.
What I Got Wrong
Over-investment in microservice granularity: Some services should have stayed together. The overhead of managing 12 services was manageable; the trajectory toward 30+ was not.
Under-investment in developer experience: We built great production systems but neglected the local development environment. Engineers spent too long setting up dependencies and too little time coding. A docker-compose-based local environment should have been a top-3 priority from day one.
Late adoption of feature flags: We should have implemented feature flagging in the first month, not the first year. The ability to decouple deployment from release would have saved us several painful rollbacks.
Insufficient documentation of trade-offs: We documented decisions but not the alternatives we rejected. When new engineers asked “why didn’t we use X?” we often couldn’t articulate the reasons, leading to re-litigation of settled decisions.
The Transferable Lessons
These insights apply beyond e-commerce:
- Connect engineering metrics to business metrics early. The moment engineering work has a revenue number attached, organizational support follows.
- Migrate incrementally, not heroically. The strangler fig pattern is slower but orders of magnitude less risky than a rewrite.
- Build for the peak, serve the baseline efficiently. Flash sale architecture is a specific example of a general principle: design for your worst case separately from your normal case.
- Observability is not optional. It’s a prerequisite for production traffic. Enforce this mechanically, not culturally.
- Boring technology choices, exceptional execution. None of our technology choices were cutting-edge. PostgreSQL, Redis, Elasticsearch—all proven, well-understood tools. The value was in how we applied them, not what we chose.
An IPO is a business milestone, not an engineering one. But the engineering decisions made years earlier determined whether the platform could support the business growth that made the IPO possible. Scaling isn’t a single decision—it’s a sequence of deliberate choices, each building on the last, each accepting specific trade-offs for specific gains.