mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale
mTSBench offers the largest open benchmark for multivariate time-series anomaly detection, letting teams fairly compare 24 detectors across 344 labelled streams so they can deploy the right model before failures hit revenue.
1. What the method is
mTSBench is an open-source benchmark that unifies 344 labelled series from 19 public datasets and evaluates 24 anomaly-detection algorithms under identical preprocessing, windowing, and metric pipelines. The package ships with ready-to-run code and a public leaderboard, enabling reproducible comparison and rapid model-selection research across industries.
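To make the "identical preprocessing" idea concrete, the sketch below shows how a single multivariate series might be z-scored and sliced into fixed-length windows before being handed to any detector. The function names, window length, and toy data are hypothetical; this is a minimal illustration under those assumptions, not the benchmark's actual code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def make_windows(series: np.ndarray, window: int = 100, stride: int = 1) -> np.ndarray:
    """Slice a (T, D) multivariate series into overlapping (window, D) segments."""
    n = (len(series) - window) // stride + 1
    return np.stack([series[i * stride : i * stride + window] for i in range(n)])

def preprocess(series: np.ndarray, window: int = 100) -> np.ndarray:
    """Apply the same z-score scaling and windowing to every series,
    so all detectors see identically prepared inputs."""
    scaled = StandardScaler().fit_transform(series)  # per-channel zero mean, unit variance
    return make_windows(scaled, window=window)

# Example: a toy 5-channel stream of 1,000 time steps.
rng = np.random.default_rng(0)
windows = preprocess(rng.normal(size=(1000, 5)))
print(windows.shape)  # (901, 100, 5)
```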
2. Why the method was developed
Prior anomaly studies mixed datasets, metrics, and hyper-parameters, making results irreproducible and leaving practitioners guessing which detector to trust. The authors built mTSBench to standardise evaluation, expose each model’s blind spots, and test unsupervised selector modules that pick a detector without labels—critical for operations where ground truth is rare and distribution drift is common.
3. Who should care
- Site-reliability engineers preventing service outages
- Industrial IoT teams monitoring sensor fleets
- Data-reliability leads guarding ETL pipelines
4. How the method works
The pipeline ingests diverse datasets, applies uniform scaling, then feeds each series into statistical, classical ML, deep-learning, and LLM-based detectors. Point-level anomaly scores are threshold-swept to produce AUC curves, while a meta-learning selector extracts features—dimensionality, seasonality strength, detector variance—and predicts which algorithm will perform best on unseen data, enabling hands-off deployment in label-scarce environments.
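A minimal version of such a selector can be sketched as label-free meta-feature extraction plus a supervised meta-learner trained on which detector won each benchmark series. Everything below (the feature set, the random-forest choice, the synthetic corpus, the detector indices) is an assumption for illustration, not the paper's actual selector implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def meta_features(series: np.ndarray) -> np.ndarray:
    """Label-free descriptors of a (T, D) series: dimensionality, length,
    mean per-channel variance, and lag-1 autocorrelation as a crude
    seasonality proxy. The feature set is illustrative only."""
    t, d = series.shape
    var = series.var(axis=0).mean()
    lag1 = np.mean([np.corrcoef(series[:-1, j], series[1:, j])[0, 1] for j in range(d)])
    return np.array([d, t, var, lag1])

# Toy corpus: 40 synthetic series, each labelled with the index of the detector
# that performed best on it in (hypothetical) offline benchmark runs.
rng = np.random.default_rng(0)
train_series = [rng.normal(size=(rng.integers(200, 400), rng.integers(3, 8))) for _ in range(40)]
best_detector = rng.integers(0, 3, size=40)  # e.g. 0=IForest, 1=LSTM-AE, 2=KNN (illustrative)

X = np.stack([meta_features(s) for s in train_series])
selector = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, best_detector)

# At deployment, pick a detector for an unseen, unlabeled series.
new_series = rng.normal(size=(300, 5))
choice = selector.predict(meta_features(new_series).reshape(1, -1))[0]
print(f"Selected detector index: {choice}")
```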
5. How it was evaluated
The authors ran all detectors on the 344-series corpus, logging AUC-ROC, AUC-PR, and range-based F1. Selector quality was scored by gap-to-oracle (percentage of ideal AUC captured) and Spearman rank correlation between chosen and optimal models, providing a rigorous measure of automated model-selection reliability.
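Both selector metrics reduce to simple arithmetic once per-series detector scores are known. The sketch below uses made-up numbers to show the computation: oracle capture as the fraction of the best achievable score retained by the selector's picks, and Spearman correlation between a predicted and an observed detector ranking. The exact formulation in the paper may differ; all values here are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-series AUC-PR for 3 detectors (columns) on 4 series (rows).
scores = np.array([
    [0.81, 0.74, 0.62],
    [0.55, 0.71, 0.68],
    [0.90, 0.88, 0.40],
    [0.47, 0.52, 0.69],
])
selected = np.array([0, 2, 1, 0])  # detector the selector picked for each series

oracle = scores.max(axis=1)                        # best achievable score per series
achieved = scores[np.arange(len(scores)), selected]
oracle_capture = achieved.mean() / oracle.mean()   # fraction of ideal score captured
print(f"Captured {oracle_capture:.1%} of oracle performance")

# Rank agreement between the selector's estimated detector quality and the
# detectors' observed average performance (a simplified stand-in for the
# chosen-vs-optimal rank correlation reported in the paper).
predicted_quality = np.array([0.7, 0.6, 0.5])
rho, _ = spearmanr(predicted_quality, scores.mean(axis=0))
print(f"Spearman rank correlation: {rho:.2f}")
```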
6. How it performed
No single detector won everywhere: the best AUC-ROC per dataset ranged from 0.58 to 0.99. The best unsupervised selector captured 72% of oracle performance, leaving clear room for smarter selection strategies. LLM-driven detectors delivered strong precision but ran ten times slower than classical baselines. (Source: arXiv 2506.21550, 2025)