Monitoring and Observability

Monitoring and observability are critical in product development because they enable teams to build, maintain, and scale reliable, high-performing software. 

Observability helps us understand why something is wrong. It provides real-time visibility into system behavior and allows us to detect issues (e.g. performance bottlenecks, errors, or failures) before users report them. This visibility supports safer deployments and continuous delivery: we can validate the impact of new releases and roll back or remediate if anomalies appear after deployment, which creates a feedback loop on whether new features perform as expected. It also improves security, as real-time monitoring can catch suspicious behavior, failed auth attempts, or unauthorized data access. Finally, it enables experimentation: feature rollouts, A/B testing, and canary deployments all depend on the ability to measure outcomes and catch regressions.

Monitoring tells us when something is wrong. It involves acting on what our observability shows. For example, by tracing issues directly to their root cause (e.g. slow DB queries, memory leaks), teams can reduce MTTR (mean time to recovery). Monitoring should be a continuous, recurring process; one approach is to report on metrics in sprint and/or monthly reviews and create action items for any anomalies that appear. If possible, it is also invaluable to have an alerting system that pages on-call engineers when anomalies in metrics are detected, so they can begin remediation immediately.
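As a rough illustration of alerting on metric anomalies, here is a minimal Python sketch. It assumes a simple z-score rule (flag a reading more than three standard deviations from the recent mean); the function names, the threshold, and the `notify` callback are hypothetical, and a real setup would normally rely on the alerting features of your monitoring tool.

```python
from statistics import mean, stdev
from typing import Callable


def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Return True if `latest` is more than `z_threshold` standard
    deviations away from the mean of `history` (assumed rule)."""
    if len(history) < 2:
        return False  # too little data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline  # flat baseline: any change is unusual
    return abs(latest - baseline) / spread > z_threshold


def check_metric(name: str, history: list[float], latest: float,
                 notify: Callable[[str], None]) -> bool:
    """Call `notify` (a stand-in for a paging integration) on an anomaly."""
    if is_anomalous(history, latest):
        notify(f"{name} is anomalous: latest={latest}, baseline={mean(history):.1f}")
        return True
    return False


# Example: p95 latency in milliseconds; `print` stands in for a real pager.
latency_history = [120.0, 118.0, 125.0, 122.0, 119.0, 121.0]
check_metric("p95_latency_ms", latency_history, 480.0, notify=print)
```

Passing the notifier in as a callback keeps the detection rule separate from how the on-call engineer is actually paged.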


The two sets of metrics I’ve found most valuable to monitor are DORA metrics and delivery metrics.

DORA Metrics (from the DevOps Research and Assessment group) are a set of four key performance indicators used to measure the effectiveness and efficiency of software delivery and operational performance [1]. They help organizations assess how well they build, deliver, and maintain software systems.

The four DORA metrics are [2]:

  1. Deployment Frequency
    How often code is deployed to production or released to users.
    • Why? High-performing teams deploy more frequently, enabling faster delivery of value and more continuous feedback loops.
  2. Lead Time for Changes
    The time it takes for a code commit to get into production.
    • Why? Shorter lead times indicate more efficient development processes and quicker iteration on customer needs.
  3. Change Failure Rate
    The percentage of deployments that cause a failure in production.
    • Why? A lower change failure rate reflects better code quality, testing, and deployment practices.
  4. Mean Time to Recovery
    The average time it takes to restore service after an incident or failure.
    • Why? Faster recovery times reduce downtime and customer impact, indicating strong incident response and system resilience.
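To make the four definitions concrete, the following Python sketch computes them from a list of deployment records. The `Deployment` fields and the `dora_metrics` function are hypothetical, and the sketch assumes that recovery time is measured from the failed deployment to the moment service is restored, which is a simplification of how incidents are usually tracked.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict[str, Optional[float]]:
    """Compute the four metrics for deployments made within `period_days`."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_times = [(d.deployed_at - d.committed_at).total_seconds() / 3600
                  for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: recovery is measured from the failed deployment itself.
    recoveries = [(d.restored_at - d.deployed_at).total_seconds() / 3600
                  for d in failures if d.restored_at is not None]

    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": mean(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment has been restored in the period.
        "mean_time_to_recovery_hours": mean(recoveries) if recoveries else None,
    }
```

In practice these values would be derived from CI/CD and incident-tracking data rather than hand-built records.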

Monitoring DORA metrics enables the following [3]:

  • Data-Driven Improvements
    Provides actionable insights into bottlenecks and areas for process optimization.
  • Benchmarking
    Enables comparison against industry standards (e.g., elite, high, medium, low performers).
  • Alignment Across Teams
    Promotes shared goals between engineering, product, and operations.
  • Balance Speed and Stability
    Encourages fast delivery without sacrificing reliability.
  • Continuous Improvement
    Guides investments in tooling, automation, and culture change.

Delivery Metrics

Monitoring velocity, lead time, cycle time, and related delivery metrics provides crucial insights into a software development team’s efficiency, predictability, and capacity for improvement [4].

| Metric | What It Measures | Why It’s Useful |
| --- | --- | --- |
| Velocity | Amount of work completed in a sprint (e.g., story points) | Helps in sprint/release planning |
| Lead Time | Time from ticket creation to production release | Reflects total delivery efficiency |
| Cycle Time | Time from when work starts to when it’s completed | Indicates execution speed and flow |
| Throughput | Number of tasks completed in a period | Measures team delivery rate |
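The four definitions in the table can be sketched in Python as follows. The `Ticket` fields and the `delivery_metrics` function are hypothetical; the sketch assumes a ticket counts toward a sprint when its work is completed inside the sprint window, and that each ticket carries all four timestamps.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Ticket:
    created_at: datetime    # ticket creation
    started_at: datetime    # work begins
    completed_at: datetime  # work is finished
    released_at: datetime   # change reaches production
    story_points: int


def _days(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 86400


def delivery_metrics(tickets: list[Ticket], sprint_start: datetime,
                     sprint_end: datetime) -> dict[str, float]:
    """Velocity, throughput, lead time and cycle time for one sprint."""
    # Assumption: a ticket belongs to the sprint in which it was completed.
    done = [t for t in tickets if sprint_start <= t.completed_at < sprint_end]
    if not done:
        return {"velocity_points": 0, "throughput_tickets": 0,
                "mean_lead_time_days": 0.0, "mean_cycle_time_days": 0.0}
    return {
        "velocity_points": sum(t.story_points for t in done),
        "throughput_tickets": len(done),
        "mean_lead_time_days": mean(_days(t.created_at, t.released_at) for t in done),
        "mean_cycle_time_days": mean(_days(t.started_at, t.completed_at) for t in done),
    }
```

Keeping lead time (creation to release) and cycle time (start to completion) as separate calculations makes it easier to see whether delays come from waiting in the backlog or from the work itself.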

Monitoring delivery metrics enables the following:

  • Understand Team Performance
    Velocity shows how much work a team can consistently deliver within a sprint or timebox, and helps assess whether teams are underloaded, overloaded, or working at a sustainable pace.
  • Improve Forecasting and Planning
    By tracking velocity and cycle time, teams can better estimate how long future work will take, and increase confidence in sprint and release planning, enabling more accurate commitments to stakeholders.
  • Identify Process Bottlenecks
    Lead time and cycle time reveal how long it takes to deliver value from idea to production; long lead/cycle times can highlight issues like too much work-in-progress (WIP), blocked tasks, or review delays.
  • Drive Continuous Improvement
    Regular monitoring helps teams set improvement goals (e.g., “reduce average cycle time by 20%”), and makes it easier to measure the impact of process or tooling changes.
  • Ensure Flow Efficiency
    Helps teams achieve a smoother development flow with less context switching and rework; cycle time variations can indicate inconsistent workflows or unclear prioritization.
  • Monitor Delivery Health and Trends
    Trends in velocity and time-based metrics can reveal burnout, team churn, or scope creep; early detection of declining performance supports proactive intervention.
  • Support Agile and Lean Practices
    These metrics align with Agile and Lean values like short feedback loops, incremental delivery, and sustainable pace; they reinforce team autonomy while providing structure for accountability.

Technical Debt

Monitoring and prioritizing technical debt is essential in software development because it directly affects the long-term health, scalability, and velocity of your product and engineering teams. Ignoring it can lead to slower delivery, more bugs, and developer burnout. 

Technical debt slows down every future change, so prioritizing and addressing it keeps your codebase maintainable and extensible, allowing teams to build features faster and more safely over time. Messy or outdated code increases the risk of regressions, bugs, and runtime failures; cleaning up debt makes the system more predictable and testable, which leads to fewer production issues.

Prioritizing high-impact debt unlocks faster development and easier onboarding for new engineers. When debt is not addressed and accumulates, teams often hit a point where a full rewrite seems necessary; incrementally tackling debt avoids these all-or-nothing, high-risk, high-cost rewrite scenarios. It also shows that engineering quality is valued, improving job satisfaction and retention.

It is best practice to track tech debt in your backlog just like features, and to evaluate it based on impact vs. effort (e.g. what’s slowing down the team or risking system health). A common approach that has worked well for me is to reserve 10-20% of the team’s capacity each sprint for tackling technical debt [5]. You can also run regular “Debt Discovery” sessions where engineers nominate pain points from recent sprints.

An effective approach for tracking debt is to add a “Technical Debt” label or custom issue type to your ticket/task tracking tool (e.g. Jira, Linear, Notion), where each ticket includes:

  1. What the problem is
  2. Its impact (speed, bugs, UX, etc.)
  3. How often it affects the team
  4. Suggested fix or next step
  5. Estimated effort (optional)
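As a sketch of how the ticket fields above could feed an impact vs. effort ranking, consider the Python example below. The `DebtTicket` class, the 1 to 5 impact scale, the default effort, and the scoring formula are all illustrative assumptions, not a standard method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebtTicket:
    problem: str                         # what the problem is
    impact: int                          # assumed scale: 1 (minor) to 5 (severe)
    frequency_per_sprint: int            # how often it affects the team
    suggested_fix: str                   # suggested fix or next step
    effort_points: Optional[int] = None  # estimated effort (optional)


def priority_score(ticket: DebtTicket, default_effort: int = 3) -> float:
    """Higher score = more pain removed per unit of effort (assumed formula)."""
    effort = ticket.effort_points if ticket.effort_points is not None else default_effort
    return ticket.impact * ticket.frequency_per_sprint / max(effort, 1)


def rank_debt(tickets: list[DebtTicket]) -> list[DebtTicket]:
    """Order tickets from highest to lowest priority score."""
    return sorted(tickets, key=priority_score, reverse=True)


backlog = [
    DebtTicket("Flaky integration tests", 4, 5, "Isolate shared fixtures", 2),
    DebtTicket("Legacy billing module", 5, 1, "Extract behind an interface", 5),
    DebtTicket("Slow local builds", 2, 3, "Cache dependencies"),
]
for ticket in rank_debt(backlog):
    print(f"{priority_score(ticket):5.1f}  {ticket.problem}")
```

Multiplying impact by frequency and dividing by effort is only one possible weighting; the point is that recording the fields consistently makes tickets comparable at all.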

A simple prioritization matrix can help you decide what to tackle first [6]:

|  | Urgent | Not Urgent |
| --- | --- | --- |
| Important | Do – clear deadlines or consequences | Schedule – unclear deadlines, but long-term success |
| Not Important | Delegate – must do, but others can do it | Delete – unnecessary distraction |

The Eisenhower Matrix
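The matrix reduces to a two-question lookup, which the short Python sketch below expresses. The function name and the example debt items are hypothetical; deciding whether an item is important or urgent remains a team judgment.

```python
def eisenhower_action(important: bool, urgent: bool) -> str:
    """Map an item's importance and urgency to its matrix quadrant."""
    if important and urgent:
        return "Do"
    if important:
        return "Schedule"
    if urgent:
        return "Delegate"
    return "Delete"


# Hypothetical debt items tagged as (important, urgent).
debt_items = {
    "Expiring TLS library version": (True, True),
    "Split monolithic service": (True, False),
    "Rename inconsistent log fields": (False, True),
    "Rewrite unused admin page": (False, False),
}
for name, (important, urgent) in debt_items.items():
    print(f"{eisenhower_action(important, urgent):9}  {name}")
```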

Apart from prioritizing and tracking technical debt, here are some additional strategies for managing it [7]:

  1. Refactor incrementally. 
  2. Improve code quality practices. 
  3. Communicate and educate. 
  4. Prevent new debt. 
  5. Modernize and upgrade. 
  6. Automate where possible. 

Footnotes

  1. Use Four Keys metrics like change failure rate to measure your DevOps performance (Google Cloud)
  2. DORA’s software delivery metrics: the four keys (DORA)
  3. DORA Metrics: The Right Fit for ekino? (Ekino)
  4. Agile Metrics (Adobe)
  5. Broader definition of technical debt (Stepsize)
  6. The Eisenhower Matrix: How to prioritize your to-do list (Asana)
  7. What is technical debt, and how do you manage it? (Monday)