Technical Excellence

Monitoring and Observability

Monitoring and observability are critical in product development because they enable teams to build, maintain, and scale reliable, high-performing software.

Observability helps us understand why something is wrong. It provides real-time visibility into system behavior and allows us to detect issues (e.g. performance bottlenecks, errors, or failures) before they are reported. This visibility supports safer deployments and continuous delivery: we can validate the impact of new releases and roll back or remediate if anomalies are detected after deployment, which acts as a feedback loop on whether new features perform as expected. It also improves security, as real-time monitoring can catch suspicious behavior, failed auth attempts, or unauthorized data access. Finally, it enables experimentation: feature rollouts, A/B testing, and canary deployments all depend on the ability to measure outcomes and catch regressions with confidence.

Monitoring tells us when something is wrong. It involves acting on what our observability shows. For example, by tracing issues directly to their root cause (e.g. slow DB queries, memory leaks), teams can reduce MTTR (mean time to recovery). Monitoring should be a continuous, recurring process; one approach is to report on metrics in sprint and/or monthly reviews and create action items for any anomalies that appear. If possible, it is also invaluable to have an alerting system that pages on-call engineers when anomalies in metrics are detected, so they can begin remediation immediately.
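As a rough illustration of the alerting idea, the sketch below flags a metric sample as anomalous when it sits more than a chosen number of standard deviations from the recent mean. The function name `is_anomalous`, the threshold of 3, and the sample latencies are all assumptions made for illustration; in practice you would more likely rely on the alerting rules of whatever monitoring platform you use.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True when `latest` deviates from `history` by more than
    `threshold` sample standard deviations (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical values are identical, so any change is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical p95 latencies (ms) from the last few measurement windows.
recent_latency_ms = [120.0, 118.0, 125.0, 122.0, 119.0, 121.0]

if is_anomalous(recent_latency_ms, latest=310.0):
    print("ALERT: latency anomaly detected, page the on-call engineer")
```

A fixed z-score threshold is the simplest possible rule; it tends to be noisy for metrics with strong daily or weekly patterns, where a seasonal baseline may work better.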

The two sets of metrics I’ve found most valuable to monitor are DORA metrics and delivery metrics.

DORA Metrics (from the DevOps Research and Assessment group) are a set of four key performance indicators used to measure the effectiveness and efficiency of software delivery and operational performance¹. They help organizations assess how well they build, deliver, and maintain software systems.

The four DORA metrics are²:

Deployment Frequency
How often code is deployed to production or released to users.
- Why? High-performing teams deploy more frequently, enabling faster delivery of value and more continuous feedback loops.
Lead Time for Changes
The time it takes for a code commit to get into production.
- Why? Shorter lead times indicate more efficient development processes and quicker iteration on customer needs.
Change Failure Rate
The percentage of deployments that cause a failure in production.
- Why? A lower change failure rate reflects better code quality, testing, and deployment practices.
Mean Time to Recovery
The average time it takes to restore service after an incident or failure.
- Why? Faster recovery times reduce downtime and customer impact, indicating strong incident response and system resilience.
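To make the four definitions concrete, here is a minimal sketch that computes them from a list of deployment records. The `Deployment` record, its field names, and the sample data are hypothetical. Two simplifications are assumed: lead time is averaged (a median is often reported instead), and recovery time is measured from the failing deployment to service restoration rather than from the moment the incident was detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over one reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    total = len(deployments)
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]

    def avg_hours(deltas: list[timedelta]) -> float:
        if not deltas:
            return 0.0
        return sum(deltas, timedelta()).total_seconds() / 3600 / len(deltas)

    return {
        "deployments_per_day": total / period_days,
        "lead_time_hours": avg_hours(lead_times),
        "change_failure_rate": len(failures) / total,
        "mean_time_to_recovery_hours": avg_hours(recoveries),
    }


# Hypothetical week of deployments.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=8)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6),
               failed=True, restored_at=t0 + timedelta(days=2, hours=8)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=2)),
]
metrics = dora_metrics(sample, period_days=7)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In a real pipeline these records would come from your CI/CD and incident tooling; the calculation itself stays this simple.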

Monitoring DORA metrics enables the following³:

Data-Driven Improvements
Provides actionable insights into bottlenecks and areas for process optimization.
Benchmarking
Enables comparison against industry standards (e.g., elite, high, medium, low performers).
Alignment Across Teams
Promotes shared goals between engineering, product, and operations.
Balance Speed and Stability
Encourages fast delivery without sacrificing reliability.
Continuous Improvement
Guides investments in tooling, automation, and culture change.

Delivery Metrics

Monitoring velocity, lead time, cycle time, and related delivery metrics provides crucial insights into a software development team’s efficiency, predictability, and capacity for improvement⁴.

Metric	What It Measures	Why It’s Useful
Velocity	Amount of work completed in a sprint (e.g., story points)	Helps in sprint/release planning
Lead Time	Time from ticket creation to production release	Reflects total delivery efficiency
Cycle Time	Time from when work starts to when it’s completed	Indicates execution speed and flow
Throughput	Number of tasks completed in a period	Measures team delivery rate
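The table's definitions can be sketched in a few lines of code. The `Ticket` record, its field names, and the sample sprint below are assumptions for illustration; in practice these dates would be exported from your tracking tool.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class Ticket:
    created: date      # ticket creation
    started: date      # work started
    completed: date    # work completed
    released: date     # released to production
    story_points: int


def delivery_metrics(tickets: list[Ticket]) -> dict[str, float]:
    """Summarise the tickets delivered in one sprint."""
    return {
        # Velocity: amount of work completed (story points).
        "velocity_points": sum(t.story_points for t in tickets),
        # Throughput: number of tasks completed in the period.
        "throughput_tickets": len(tickets),
        # Lead time: ticket creation to production release.
        "avg_lead_time_days": mean((t.released - t.created).days for t in tickets),
        # Cycle time: work started to work completed.
        "avg_cycle_time_days": mean((t.completed - t.started).days for t in tickets),
    }


# Hypothetical sprint with three delivered tickets.
sprint = [
    Ticket(date(2024, 3, 1), date(2024, 3, 4), date(2024, 3, 6), date(2024, 3, 7), 3),
    Ticket(date(2024, 3, 2), date(2024, 3, 5), date(2024, 3, 8), date(2024, 3, 8), 5),
    Ticket(date(2024, 3, 5), date(2024, 3, 6), date(2024, 3, 7), date(2024, 3, 8), 2),
]
summary = delivery_metrics(sprint)
print(summary)
```

Averages hide outliers, so it can be worth looking at the distribution (for example, the slowest few tickets) alongside these summary numbers.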

Monitoring delivery metrics enables the following:

Understand Team Performance
Velocity shows how much work a team can consistently deliver within a sprint or timebox, and helps assess whether teams are underloaded, overloaded, or working at a sustainable pace.
Improve Forecasting and Planning
By tracking velocity and cycle time, teams can better estimate how long future work will take, and increase confidence in sprint and release planning, enabling more accurate commitments to stakeholders.
Identify Process Bottlenecks
Lead time and cycle time reveal how long it takes to deliver value from idea to production; long lead or cycle times can highlight issues like too much work-in-progress (WIP), blocked tasks, or review delays.
Drive Continuous Improvement
Regular monitoring helps teams set improvement goals (e.g., “reduce average cycle time by 20%”), and makes it easier to measure the impact of process or tooling changes.
Ensure Flow Efficiency
Helps teams achieve a smoother development flow with less context switching and rework; variations in cycle time can indicate inconsistent workflows or unclear prioritization.
Monitor Delivery Health and Trends
Trends in velocity and time-based metrics can reveal burnout, team churn, or scope creep; early detection of declining performance supports proactive intervention.
Support Agile and Lean Practices
These metrics align with Agile and Lean values like short feedback loops, incremental delivery, and sustainable pace; they reinforce team autonomy while providing structure for accountability.

Technical Debt

Monitoring and prioritizing technical debt is essential in software development because it directly affects the long-term health, scalability, and velocity of your product and engineering teams. Ignoring it can lead to slower delivery, more bugs, and developer burnout.

Technical debt slows down every future change, so prioritizing and addressing it keeps your codebase maintainable and extensible, allowing teams to build features faster and more safely over time. Messy or outdated code increases the risk of regressions, bugs, and runtime failures; cleaning up debt makes the system more predictable and testable, which leads to fewer production issues.

Prioritizing high-impact debt unlocks faster development and easier onboarding for new engineers. When debt is not addressed and accumulates, teams often hit a point where a full rewrite seems necessary; incrementally tackling debt avoids these all-or-nothing, high-risk, high-cost rewrite scenarios. It also shows that engineering quality is valued, improving job satisfaction and retention.

It is best practice to track tech debt in your backlog just like features, and to evaluate it based on impact vs. effort (e.g. what’s slowing down the team or risking system health). A common approach that has worked well for me is to reserve 10-20% of the team’s capacity per sprint for technical debt⁵. You can also run regular “Debt Discovery” sessions where engineers nominate pain points from recent sprints.

An effective approach for tracking debt is to add a “Technical Debt” label or custom issue type to your ticket or task tracking tool (e.g. Jira, Linear, Notion), where each ticket includes:

What the problem is
Its impact (speed, bugs, UX, etc.)
How often it affects the team
Suggested fix or next step
Estimated effort (optional)
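The ticket fields above, combined with the impact vs. effort evaluation, could be modelled roughly as follows. The `DebtTicket` class, the 1 to 5 scales, and the scoring formula (impact times frequency, divided by effort) are assumptions chosen for illustration, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class DebtTicket:
    problem: str              # what the problem is
    impact: int               # 1 (minor) to 5 (severe): speed, bugs, UX, etc.
    frequency: int            # 1 (rare) to 5 (daily): how often it affects the team
    next_step: str            # suggested fix or next step
    effort_days: float = 1.0  # estimated effort (optional; defaults to one day)

    def __post_init__(self) -> None:
        if self.effort_days <= 0:
            raise ValueError("effort_days must be positive")

    def priority_score(self) -> float:
        """Higher means more value per day of effort (hypothetical formula)."""
        return (self.impact * self.frequency) / self.effort_days


def rank_debt(tickets: list[DebtTicket]) -> list[DebtTicket]:
    """Order tickets from highest to lowest priority score."""
    return sorted(tickets, key=lambda t: t.priority_score(), reverse=True)


# Hypothetical backlog entries.
backlog = [
    DebtTicket("Flaky integration tests", impact=4, frequency=5,
               next_step="Quarantine and fix the worst offenders", effort_days=2),
    DebtTicket("Outdated ORM version", impact=3, frequency=2,
               next_step="Upgrade in a spike branch", effort_days=5),
    DebtTicket("Duplicated validation logic", impact=2, frequency=3,
               next_step="Extract a shared module", effort_days=1),
]
for ticket in rank_debt(backlog):
    print(f"{ticket.priority_score():5.1f}  {ticket.problem}")
```

A score like this is only a conversation starter for planning; the team's judgement about risk and timing should still decide the final order.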

A simple prioritization matrix can help you decide what to tackle first⁶:

	Urgent	Not Urgent
Important	Do – clear deadlines or consequences	Schedule – unclear deadlines, but long-term success
Not Important	Delegate – must do, but others can do it	Delete – unnecessary distraction

The Eisenhower Matrix
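Applied to a debt backlog, the matrix reduces to a lookup on two yes/no judgements. The sketch below assumes each item has already been judged important and/or urgent; the function name and the sample items are hypothetical.

```python
def eisenhower_action(important: bool, urgent: bool) -> str:
    """Map an item's importance and urgency to its quadrant's action."""
    if important and urgent:
        return "Do"
    if important:
        return "Schedule"
    if urgent:
        return "Delegate"
    return "Delete"


# Hypothetical debt items: (description, important, urgent)
items = [
    ("Security patch for a core dependency", True, True),
    ("Refactor the billing module", True, False),
    ("Rename legacy config keys before a partner deadline", False, True),
    ("Rewrite an unused admin script", False, False),
]
for description, important, urgent in items:
    print(f"{eisenhower_action(important, urgent):9} {description}")
```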

Apart from prioritizing and tracking technical debt, here are some additional strategies for managing it⁷:

Refactor incrementally.
Improve code quality practices.
Communicate and educate.
Prevent new debt.
Modernize and upgrade.
Automate where possible.

Footnotes

1. Use Four Keys metrics like change failure rate to measure your DevOps performance (Google Cloud)
2. DORA’s software delivery metrics: the four keys (DORA)
3. DORA Metrics: The Right Fit for ekino? (Ekino)
4. Agile Metrics (Adobe)
5. Broader definition of technical debt (Stepsize)
6. The Eisenhower Matrix: How to prioritize your to-do list (Asana)
7. What is technical debt, and how do you manage it? (Monday)