Measuring Engineering Impact: Beyond Lines of Code and Story Points

Every engineering organization I’ve joined had metrics. Velocity charts. Sprint burndown. Lines of code. Commit frequency. And in every case, these metrics measured activity while saying nothing about impact.

At Nykaa, we had a team with the highest velocity in the organization. They completed more story points per sprint than any other team. They were also responsible for the most production incidents, the highest bug count, and the slowest time-to-resolution. High activity, negative impact.

The problem isn’t measurement—it’s measuring the wrong things. Engineering organizations need metrics that connect technical work to business outcomes. Here’s the framework I’ve developed across 15+ years of building production systems.

The Metrics Hierarchy

Engineering metrics exist at four levels. Most organizations measure only the bottom two and wonder why their metrics don’t correlate with business results.

Level 1: Activity Metrics (Least Valuable)

What they measure: How much work is happening.

  • Lines of code written
  • Commits per day
  • Story points completed
  • PRs merged
  • Tickets closed

Why they fail: Activity metrics are trivially gameable and inversely correlated with the work that matters most. The engineer who spends three days thinking through an architecture decision before writing 50 lines of elegant code delivers more value than the one who writes 500 lines of code that needs to be rewritten next quarter.

When they’re useful: Never as primary metrics. Occasionally as anomaly detectors—a sudden drop in PR activity from a usually productive engineer might indicate a blocker, burnout, or unclear requirements.

Level 2: Process Metrics (Necessary but Insufficient)

What they measure: How well the engineering process functions.

  • Lead time: Time from commit to production deployment
  • Deploy frequency: How often you ship to production
  • Change failure rate: Percentage of deployments causing incidents
  • Mean time to recovery (MTTR): How quickly you restore service after an incident

These are the DORA metrics, and they’re valuable. They measure engineering capability—the ability to ship changes quickly and safely. But they don’t measure whether those changes mattered.

A team with excellent DORA metrics that ships features nobody uses is still failing. Process metrics tell you that you can deliver. They don’t tell you that you’re delivering the right things.
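As a rough sketch of how the four DORA numbers fall out of a deployment log (the record shape here is invented for illustration, not any particular tool's schema):

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records: commit time, deploy time, whether the
# deploy caused an incident, and recovery minutes when it did.
deploys = [
    {"commit": datetime(2024, 1, 1, 9), "deploy": datetime(2024, 1, 1, 15),
     "caused_incident": False, "recovery_min": None},
    {"commit": datetime(2024, 1, 2, 10), "deploy": datetime(2024, 1, 3, 11),
     "caused_incident": True, "recovery_min": 42},
    {"commit": datetime(2024, 1, 4, 8), "deploy": datetime(2024, 1, 4, 12),
     "caused_incident": False, "recovery_min": None},
]

# Lead time: commit to production, here summarized as the median in hours.
lead_time_hours = median(
    (d["deploy"] - d["commit"]).total_seconds() / 3600 for d in deploys
)

# Deploy frequency: deploys per day over the observed window.
window_days = (max(d["deploy"] for d in deploys)
               - min(d["deploy"] for d in deploys)).days or 1
deploy_frequency = len(deploys) / window_days

# Change failure rate: fraction of deploys that caused an incident.
change_failure_rate = sum(d["caused_incident"] for d in deploys) / len(deploys)

# MTTR: mean minutes to restore service, over incident-causing deploys.
recoveries = [d["recovery_min"] for d in deploys if d["caused_incident"]]
mttr_min = sum(recoveries) / len(recoveries) if recoveries else 0.0
```

The point of keeping all four together is exactly the one above: speed (lead time, frequency) only means something alongside safety (failure rate, MTTR).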

Level 3: Output Metrics (Getting Warmer)

What they measure: What engineering produced.

  • Features shipped to users
  • API endpoints with active consumers
  • System reliability (uptime, error rates, latency percentiles)
  • Technical debt reduced (measured by concrete indicators, not gut feel)
  • Platform capabilities enabled (e.g., “we can now A/B test any checkout flow”)

Output metrics are where most sophisticated engineering organizations stop. They’re meaningful—they represent real work delivered to real users. But they still have a blind spot: they don’t measure whether the output changed anything.

Level 4: Outcome Metrics (Most Valuable)

What they measure: What changed because of engineering’s work.

  • Revenue impact of shipped features
  • User engagement changes attributable to technical improvements
  • Cost reduction from infrastructure optimization
  • Time-to-market improvement for product teams
  • Customer satisfaction changes linked to reliability improvements

Why most organizations don’t measure these: Outcome metrics require collaboration between engineering, product, and data teams. They require attribution—connecting a business result to a specific technical change. This is hard, imperfect, and sometimes contentious.

But imperfect outcome measurement is infinitely more valuable than precise activity measurement. A rough estimate that “the checkout optimization shipped by the platform team increased conversion by 0.3%” is more useful than knowing the team completed 47 story points.

Implementing the Framework

Step 1: Define Your Outcome Map

Every engineering team exists to drive specific business outcomes. Make these explicit:

  • Platform team: System reliability → user trust → retention → revenue
  • Feature team: Feature delivery → user engagement → activation → revenue
  • Infrastructure team: Developer productivity → faster shipping → more experiments → better product-market fit
  • Data team: Analytics capabilities → better decisions → optimized operations → margin improvement

At Nykaa, the platform team’s outcome map was concrete: every 10ms of homepage latency reduction correlated with a 0.1% increase in conversion rate. This made infrastructure optimization directly measurable in revenue terms. When I proposed a CDN migration that would cost $30K/month but reduce p95 latency by 80ms, the business case was trivial: the estimated conversion improvement was worth 20x the CDN cost.
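The back-of-envelope math behind that business case looks like this (the monthly revenue figure is a made-up placeholder to make the arithmetic concrete, not Nykaa's actual number):

```python
# Observed correlation from the outcome map: 0.1% conversion per 10 ms.
latency_reduction_ms = 80
conversion_lift_per_10ms = 0.001
cdn_cost_per_month = 30_000  # USD

# 80 ms of p95 reduction -> 8 x 0.1% = 0.8% conversion lift.
conversion_lift = (latency_reduction_ms / 10) * conversion_lift_per_10ms

# Hypothetical monthly revenue; revenue scales roughly linearly with conversion.
monthly_revenue = 75_000_000
revenue_gain = monthly_revenue * conversion_lift

roi_multiple = revenue_gain / cdn_cost_per_month
```

With these illustrative numbers the gain is 20x the CDN cost, which is what makes the case "trivial": the decision survives even large errors in the correlation estimate.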

Step 2: Instrument the Connections

Outcome measurement requires instrumentation at every level of the stack:

Technical instrumentation: Standard observability—latency, error rates, throughput. This is table stakes.

Product instrumentation: Feature flags with analytics. Every feature ships with tracking that measures adoption, engagement, and the product metric it’s targeting.

Business instrumentation: Revenue attribution, cost tracking, and conversion funnels that connect to technical changes.

The critical integration: deployment events correlated with business metrics. When you can overlay “deployed checkout optimization v2” on your conversion rate graph, you can see impact directly. Tools like feature flags with analytics built in make this straightforward.
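A minimal version of that overlay doesn't even need a graph; once deploy events and the business metric share a timeline, you can compare the metric before and after a deploy (dates and rates below are invented):

```python
from datetime import date

# Hypothetical daily checkout conversion rates.
conversion = {
    date(2024, 3, 1): 0.031,
    date(2024, 3, 2): 0.030,
    date(2024, 3, 3): 0.032,
    date(2024, 3, 4): 0.035,
    date(2024, 3, 5): 0.036,
    date(2024, 3, 6): 0.034,
}
deploy_date = date(2024, 3, 4)  # "deployed checkout optimization v2"

before = [r for day, r in conversion.items() if day < deploy_date]
after = [r for day, r in conversion.items() if day >= deploy_date]

# Shift in mean conversion after the deploy event.
delta = sum(after) / len(after) - sum(before) / len(before)
```

This is deliberately crude; the attribution conventions in the next step are what keep a before/after delta from being over-interpreted.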

Step 3: Establish Attribution Conventions

Perfect attribution is impossible. Multiple changes ship simultaneously. External factors affect metrics. User behavior is noisy. Establish conventions that are good enough:

Direct attribution: The feature was behind a feature flag. The flag was enabled. The metric changed. Confidence is high.

Correlated attribution: The change shipped and the metric moved in the expected direction within the expected timeframe. Confidence is medium.

Inferred attribution: The change was part of a bundle of improvements. The combined impact is measurable but individual contributions are estimated. Confidence is low but still useful for prioritization.
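One way to make the conventions mechanical is to encode them as a small lookup, so every impact report labels its own confidence the same way (this is an illustrative encoding, not a real library API):

```python
def attribution_confidence(flag_controlled: bool,
                           expected_direction: bool,
                           bundled: bool) -> str:
    """Map the three attribution conventions to a confidence label."""
    if flag_controlled:
        # Direct attribution: the flag isolates the change from everything else.
        return "high"
    if expected_direction and not bundled:
        # Correlated attribution: metric moved as expected, in the expected window.
        return "medium"
    # Inferred attribution: bundled changes, individual contribution estimated.
    return "low"
```

For example, a flagged checkout experiment gets `"high"`, while a latency improvement shipped alongside three other changes gets `"low"` even if the metric moved.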

At Hike, the ML team’s recommendation system improvements were measured through direct attribution—A/B tests with clear control groups. But the infrastructure team’s latency improvements were correlated attribution at best. We accepted this asymmetry and used it for directional guidance rather than precise accounting.

Step 4: Build the Reporting Cadence

Weekly: Process metrics (DORA). These move fast and indicate operational health.

Monthly: Output metrics. Features shipped, reliability maintained, technical debt addressed. This is the engineering team’s primary reporting cadence.

Quarterly: Outcome metrics. Business impact of engineering work over the quarter. This connects engineering effort to organizational goals and informs the next quarter’s prioritization.

Annually: Strategic review. Which engineering investments produced the highest outcome-to-effort ratio? Which areas deserve more investment? Which should be wound down?

The Metrics That Actually Changed My Teams

Beyond the framework, here are specific metrics that drove the most behavioral improvement in teams I’ve led:

Time to First Meaningful Contribution

What it measures: How long it takes a new engineer to merge their first PR that affects production behavior (not a docs fix or config change).

Why it matters: This metric is a proxy for onboarding effectiveness, codebase quality, and team collaboration culture. A team where new engineers ship in their first week has good documentation, approachable code, and supportive teammates.

Target: Under 5 business days for experienced engineers, under 10 for early-career.

At Orangewood Labs, this metric exposed that our robotics SDK had an undocumented dependency on a specific ROS version that took every new engineer 3 days to discover and resolve. Fixing the documentation cut time-to-first-contribution from 12 days to 4.
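Computing the metric is mostly a filtering problem: exclude docs and config PRs, then count business days from start date to first qualifying merge (the PR records and `kind` labels here are invented):

```python
from datetime import date, timedelta

# Hypothetical merged PRs for one new hire.
start_date = date(2024, 5, 1)
merged_prs = [
    {"merged": date(2024, 5, 2), "kind": "docs"},
    {"merged": date(2024, 5, 3), "kind": "config"},
    {"merged": date(2024, 5, 7), "kind": "production"},
]

# First PR that affects production behavior (not docs or config).
first = min(pr["merged"] for pr in merged_prs if pr["kind"] == "production")

# Business days from start to first meaningful merge, inclusive (Mon=0..Fri=4).
business_days = sum(
    1
    for offset in range((first - start_date).days + 1)
    if (start_date + timedelta(days=offset)).weekday() < 5
)
```

The hard part in practice is the `kind` classification, which usually comes from PR labels or changed-file paths rather than anything automatic.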

Escaped Defect Rate

What it measures: Bugs found in production versus bugs caught before deployment.

Why it matters: It measures the effectiveness of your entire quality pipeline—code review, testing, staging environments, and monitoring. A decreasing escaped defect rate means your prevention systems are improving.

Target: Below 10% for mature teams (90%+ of bugs caught before production).
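The computation itself is a one-liner once you tag where each bug was first found (counts below are invented):

```python
# Hypothetical quarterly counts. An "escaped" defect is one first found
# in production rather than in review, CI, or staging.
caught_pre_prod = 47
escaped_to_prod = 4

escaped_rate = escaped_to_prod / (caught_pre_prod + escaped_to_prod)
meets_target = escaped_rate < 0.10  # the mature-team target above
```

The denominator matters: dividing escaped bugs by total bugs (not by deploys) is what makes the number a statement about the quality pipeline rather than about shipping volume.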

Decision-to-Deploy Latency

What it measures: Time from “we’ve decided to build X” to “X is in production.”

Why it matters: This captures everything—planning overhead, development time, review bottlenecks, deployment pipeline speed, and organizational friction. Unlike pure lead time (which measures only the technical pipeline), this includes the human and organizational delays.

Target: Under 2 weeks for standard features in a mature organization.

Recovery Learning Rate

What it measures: For each incident category, how much faster was the team’s response the second time it occurred?

Why it matters: Every team has incidents. The best teams learn from them measurably. If your median response time for database issues was 45 minutes last quarter and 20 minutes this quarter, your learning systems are working.
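A sketch of the quarter-over-quarter comparison, per incident category (response times are invented):

```python
from statistics import median

# Hypothetical response times in minutes, grouped by category and quarter.
response_minutes = {
    "database": {"Q1": [50, 45, 40], "Q2": [25, 20, 15]},
    "cache":    {"Q1": [30, 30],     "Q2": [28, 26]},
}

# Learning rate: fractional improvement in median response time this quarter.
learning = {}
for category, quarters in response_minutes.items():
    q1, q2 = median(quarters["Q1"]), median(quarters["Q2"])
    learning[category] = (q1 - q2) / q1
```

Using the median keeps one pathological incident from masking (or faking) a trend, which matters when categories only see a handful of incidents per quarter.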

Common Pitfalls

Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.” This applies to every metric in this framework.

Mitigation: Use a balanced scorecard approach. No single metric is a target. The portfolio of metrics across all four levels provides a holistic picture that’s resistant to gaming.

Survivor Bias in Metrics

You measure the features you ship but not the features you don’t. The most impactful engineering decision might be saying no to a feature that would have been expensive to maintain.

Mitigation: Track “complexity avoided” as an explicit metric. When you simplify a design, eliminate a dependency, or reject a feature for maintenance reasons, document the estimated ongoing cost you avoided.

The McNamara Fallacy

Measuring what’s easy to measure, ignoring what’s hard, and then assuming what’s hard to measure isn’t important.

Mitigation: Accept imprecise measurement of important things over precise measurement of unimportant things. A rough revenue attribution is better than an exact commit count.

Making It Real

Start with one outcome metric per team. Don’t try to implement the entire framework at once. Pick the metric that most directly connects engineering work to a business result, instrument it, and report on it for one quarter.

When I introduced outcome metrics at Nykaa, we started with a single metric for the platform team: p95 API latency correlated with checkout conversion. This one metric changed how the team prioritized work more than any process improvement I’d implemented before. When engineers could see that their optimization work translated directly to revenue, intrinsic motivation replaced the need for velocity tracking.

That’s the real power of measuring impact: when engineers understand how their work matters, you spend less time managing and more time enabling.
