Microsoft recently open-sourced Lumos, a Python library for automatically detecting and diagnosing metric regressions in “web-scale” applications. In a technical paper, company researchers claim Lumos has been deployed in millions of sessions across the developer teams at Skype and Microsoft Teams, enabling engineers to detect hundreds of changes in metrics and reject thousands of false alarms surfaced by anomaly detectors.
Online services’ health is typically monitored by tracking key performance indicator (KPI) metrics over time. Regressions in these metrics require follow-up because they can indicate major problems that lead to added costs and lost users. But tracking down the root cause of every KPI regression is time-consuming, because a single anomaly can take days or weeks to investigate.
Lumos is a novel methodology that builds on existing, domain-specific anomaly detectors while reducing the false-positive alert rate by more than 90%, according to the researchers. It removes the manual work of establishing whether a change stems from a shift in population or a product update by providing a prioritized list of the variables that best explain the change in the metric value. The library also serves the wider purpose of understanding the difference in a metric between any two corpora, including bias, by comparing a control data set and a treatment data set while remaining agnostic to the time series component.
“[Lumos] provides product owners with key insights about demographics changes of their application, and … it identifies opportunities for service owners to improve their engineering system,” wrote the paper’s coauthors. “[Lumos enables engineers] to spend less time in diagnosing metric regressions … and more time on building exciting features.”
Lumos leverages the principles of A/B testing to compare pairs of data sets. Each data set is tabular: rows correspond to samples, and columns hold the variables of interest, namely those that represent the KPI, those that describe the population (e.g., platform, device type, network type, and country), and those that provide hypotheses for diagnosing metric regressions. An accompanying configuration file specifies hyperparameters (settings) for running the workflow and identifies which columns in the data sets are the metric, invariant, and hypothesis columns.
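As a rough illustration of that layout, the sketch below models two tiny data sets as lists of rows and a configuration that assigns columns to the three roles. The key names (`metric_column`, `invariant_columns`, `hypothesis_columns`, `p_value_threshold`), the column names, and the `missing_columns` helper are all hypothetical and are not taken from Lumos’s actual configuration schema.

```python
# Hypothetical configuration: these keys are illustrative only and are NOT
# Lumos's real schema. They simply mirror the three column roles in the text.
CONFIG = {
    "metric_column": "call_failed",                               # the KPI
    "invariant_columns": ["platform", "network_type", "country"],  # population
    "hypothesis_columns": ["codec", "client_version"],             # candidate causes
    "p_value_threshold": 0.05,                                     # a hyperparameter
}

# Each row is one sample (for example, one call leg); each key is one column.
control = [
    {"call_failed": 0, "platform": "ios", "network_type": "wifi",
     "country": "US", "codec": "opus", "client_version": "1.0"},
    {"call_failed": 1, "platform": "android", "network_type": "cellular",
     "country": "DE", "codec": "opus", "client_version": "1.0"},
]
treatment = [
    {"call_failed": 1, "platform": "ios", "network_type": "wifi",
     "country": "US", "codec": "opus", "client_version": "1.1"},
    {"call_failed": 1, "platform": "android", "network_type": "cellular",
     "country": "DE", "codec": "opus", "client_version": "1.1"},
]


def missing_columns(rows, config):
    """Return the sorted names of configured columns absent from any row."""
    required = {config["metric_column"]}
    required.update(config["invariant_columns"])
    required.update(config["hypothesis_columns"])
    missing = set()
    for row in rows:
        missing |= required - set(row)
    return sorted(missing)
```

Checking both data sets against the configuration before running any comparison catches a mislabeled column early, when it is cheapest to fix.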
Lumos begins by checking whether the regression in the metric between the two data sets is statistically significant. It then follows up with a population bias check and bias normalization to account for any population changes between the two data sets. If there’s no statistically significant regression in the metric after the data has been normalized, the regression can be explained by the change in the population. But if the delta in the metric is still statistically significant, the features are ranked according to their contribution to the delta in the target metric.
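The steps above can be sketched in plain Python for a binary metric such as call failure. This is a simplified illustration, not Lumos’s implementation: the two-proportion z-test, the re-weighting on a single invariant column, the crude re-test that treats the re-weighted rate as if it came from the original sample size, and the naive ranking score are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math
from collections import Counter


def p_value(fails_a, n_a, fails_b, n_b):
    """Two-sided p-value of a two-proportion z-test (normal approximation)."""
    pooled = (fails_a + fails_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both samples all-0 or all-1: no detectable difference
    z = (fails_a / n_a - fails_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def rate(rows, metric):
    """Mean of a 0/1 metric column over a non-empty list of rows."""
    return sum(r[metric] for r in rows) / len(rows)


def normalized_rate(treatment, control, metric, invariant):
    """Treatment rate re-weighted to the control's mix of `invariant` values."""
    total, weight_sum = 0.0, 0
    for value, count in Counter(r[invariant] for r in control).items():
        stratum = [r for r in treatment if r[invariant] == value]
        if stratum:  # values absent from treatment cannot be re-weighted
            total += count * rate(stratum, metric)
            weight_sum += count
    if weight_sum == 0:
        raise ValueError("control and treatment share no invariant values")
    return total / weight_sum


def delta_spread(control, treatment, metric, feature):
    """Naive score (an assumption): range of per-value metric deltas."""
    deltas = []
    for value in {r[feature] for r in treatment}:
        c = [r for r in control if r[feature] == value]
        t = [r for r in treatment if r[feature] == value]
        if c:  # skip values that never occur in the control set
            deltas.append(rate(t, metric) - rate(c, metric))
    return max(deltas) - min(deltas) if deltas else 0.0


def diagnose(control, treatment, metric, invariant, hypotheses, alpha=0.05):
    """Return (verdict, ranked hypothesis columns) for a 0/1 metric."""
    n_c, n_t = len(control), len(treatment)
    fails_c = sum(r[metric] for r in control)
    fails_t = sum(r[metric] for r in treatment)
    # Step 1: is the raw change in the metric statistically significant?
    if p_value(fails_c, n_c, fails_t, n_t) >= alpha:
        return "no significant regression", []
    # Steps 2-3: normalize for population bias, then test again.
    # Assumption: the adjusted rate is tested with the original sample size.
    adjusted = normalized_rate(treatment, control, metric, invariant)
    if p_value(fails_c, n_c, adjusted * n_t, n_t) >= alpha:
        return "explained by population shift", []
    # Step 4: rank hypothesis columns by the naive contribution score.
    ranked = sorted(
        hypotheses,
        key=lambda h: delta_spread(control, treatment, metric, h),
        reverse=True,
    )
    return "regression", ranked
```

In a toy case where the failure rate rises only because more sessions come from a high-failure network type, this sketch reports a population shift. When one client version fails more often under the same population mix, it reports a regression and ranks that column first.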
The Microsoft researchers say Lumos serves as the primary tool for scenario monitoring of hundreds of metrics related to the reliability of calling, meetings, and public switched telephone network (PSTN) services at Microsoft. It runs on Azure Databricks, the company’s Apache Spark-based big data analytics service, with multiple jobs configured based on priority, complexity, and metric type. Jobs run asynchronously: whenever an anomaly is detected, it triggers the Lumos workflow, which raises an incident alert (ticket) if the library determines the anomaly to be a legitimate issue.
“We have 15 primary metrics each of which are being monitored against key dimensions like platform, tenant, meeting type, [join, dial out, and create call], resulting in thousands of aggregated time series we track for a single metric. We have millions of call legs per day and each leg generates hundreds of telemetry fields serving as the input for Lumos,” wrote the coauthors, who claim Lumos freed up 65% to 95% of teams’ development time. “One incident that Lumos was able to detect for … involved a bug in the code that impacted video-based screen sharing. Two different teams released updates and those conflicted with each other. As a result, when users tried to use the screen sharing functionality, they experienced errors.”
The Microsoft researchers caution that Lumos isn’t guaranteed to catch all regressions in services and that it can’t provide insights without a sufficiently large amount of data. To address this, they plan to expand support for continuous metrics, perform feature ranking using multivariate features, and introduce feature clustering to tackle the problem of multicollinearity in feature ranking.