ProbTS: Unified benchmarking for time-series forecasting

September 28, 2024

Author: Machine Learning Group

Time-series forecasting is crucial across many industries, including health, energy, commerce, and climate. Accurate forecasts over different prediction horizons are essential for both short-term and long-term planning in these domains. For instance, during a public health emergency such as the COVID-19 pandemic, projections of infected cases and fatalities over one to four weeks are essential for allocating medical and societal resources effectively. In the energy sector, precise forecasts of electricity demand on an hourly, daily, weekly, and even monthly basis are crucial for power management and renewable energy scheduling. Logistics relies on forecasting short-term and long-term cargo volumes for adaptive route scheduling and efficient supply chain management.

In addition to covering various prediction horizons, accurate forecasting must go beyond point estimates to include distributional forecasts that quantify estimation uncertainty. Both the expected estimates and the associated uncertainties are indispensable for subsequent planning and optimization, providing a comprehensive view that informs better decision-making.

Given the critical need for accurate point and distributional forecasting across diverse prediction horizons, researchers from Microsoft Research Asia revisited existing time-series forecasting studies to assess their effectiveness in meeting these essential demands. The review encompasses state-of-the-art models developed across various research threads:

Classical time-series models: These models typically require training from scratch on each dataset, focusing on either long-term point forecasting (e.g., PatchTST, iTransformer) or short-term distributional forecasting (e.g., CSDI, TimeGrad).
Recent time-series foundation models: These models involve universal pre-training across extensive datasets and are developed by both industrial labs (e.g., TimesFM, MOIRAI, Chronos) and academic institutions (e.g., Timer, UniTS).

Despite the advancements, researchers find that existing approaches often lack a holistic consideration of all essential forecasting needs. This limitation results in “biased” methodological designs and unverified performance in untested scenarios.

To address the gaps identified in existing time-series forecasting studies, researchers developed the ProbTS tool. ProbTS serves as a unified benchmarking platform designed to evaluate how well current approaches meet essential forecasting needs. By highlighting crucial methodological differences, ProbTS provides a comprehensive understanding of the strengths and weaknesses of advanced time-series models and unveils opportunities for future research and innovation.

Repo: https://github.com/microsoft/ProbTS

Paper: https://arxiv.org/abs/2310.07446v4

Paradigm differences: Methodological analysis of time series forecasting

The benchmark study using ProbTS highlights two crucial methodological differences found in contemporary research: the forecasting paradigms for point and distributional estimation, and the decoding schemes for variable-length forecasting across different horizons.

Forecasting paradigms for point and distributional estimation

Point forecasting only: Approaches that support only point forecasting, providing expected estimates without uncertainty quantification.
Predefined distribution heads: Methods that use predefined distribution heads to generate distributional forecasts, offering a fixed structure for uncertainty estimation.
Neural distribution estimation modules: Techniques employing neural network-based modules to estimate distributions, allowing for more flexible and potentially more accurate uncertainty quantification.
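
The difference between the first two paradigms can be illustrated with a minimal Python sketch. It is not taken from ProbTS: the function names (`softplus`, `gaussian_head`, `gaussian_nll`, `sample_forecast`) and the choice of a Gaussian as the predefined distribution are assumptions made for illustration. A point forecaster would emit only the mean, while a distribution head also emits a scale, so a likelihood can be evaluated and sample paths can be drawn.

```python
import math
import random
from typing import List, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); maps any real number to a positive value."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def gaussian_head(raw_mean: float, raw_scale: float) -> Tuple[float, float]:
    """Turn two unconstrained model outputs into Gaussian parameters (mean, std).

    Hypothetical head: the Gaussian family is fixed in advance, which is what
    makes this a "predefined distribution head".
    """
    return raw_mean, softplus(raw_scale) + 1e-6


def gaussian_nll(y: float, mean: float, std: float) -> float:
    """Negative log-likelihood of y under N(mean, std^2), a typical training loss."""
    return 0.5 * math.log(2.0 * math.pi * std ** 2) + (y - mean) ** 2 / (2.0 * std ** 2)


def sample_forecast(mean: float, std: float, n_samples: int, seed: int = 0) -> List[float]:
    """Draw samples from the predicted distribution for one time step."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n_samples)]


# A point-only model would stop at `mean`; the head adds an uncertainty estimate.
mean, std = gaussian_head(raw_mean=1.5, raw_scale=-3.0)
samples = sample_forecast(mean, std, n_samples=100)
```

The third paradigm replaces the fixed family with a learned sampler (the post cites CSDI and TimeGrad as examples), which is more flexible but not something a few lines can reproduce faithfully.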

Decoding schemes for variable-length forecasting across different horizons

Autoregressive (AR) methods: These methods generate forecasts step-by-step, using previous predictions as inputs for future time steps. They are suitable for scenarios where sequential dependencies are crucial.
Non-Autoregressive (NAR) methods: These methods produce forecasts for all time steps simultaneously, offering faster predictions and potentially better performance for long-term forecasting.
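
The two decoding schemes can be sketched as follows. This is an illustrative toy, not ProbTS code: `ar_decode`, `nar_decode`, and the two linear-extrapolation models are hypothetical names and stand-ins for trained networks. The only point is the control flow: AR calls a one-step model repeatedly and feeds each prediction back in, while NAR calls a multi-step model once.

```python
from typing import Callable, List, Sequence


def ar_decode(step_model: Callable[[List[float]], float],
              context: Sequence[float], horizon: int) -> List[float]:
    """Autoregressive decoding: predict one step, append it to the input window, repeat."""
    window = list(context)
    forecast: List[float] = []
    for _ in range(horizon):
        next_value = step_model(window)
        forecast.append(next_value)
        # The model's own prediction becomes part of the next input.
        window = window[1:] + [next_value]
    return forecast


def nar_decode(multi_model: Callable[[List[float], int], List[float]],
               context: Sequence[float], horizon: int) -> List[float]:
    """Non-autoregressive decoding: emit the whole horizon in a single call."""
    forecast = list(multi_model(list(context), horizon))
    if len(forecast) != horizon:
        raise ValueError("multi-step model must return exactly `horizon` values")
    return forecast


def toy_step_model(window: List[float]) -> float:
    """Hypothetical one-step model: extend the last observed slope by one step."""
    return 2.0 * window[-1] - window[-2]


def toy_multi_model(window: List[float], horizon: int) -> List[float]:
    """Hypothetical multi-step model: extend the last observed slope for every step."""
    slope = window[-1] - window[-2]
    return [window[-1] + (k + 1) * slope for k in range(horizon)]


context = [1.0, 2.0, 3.0]
ar_forecast = ar_decode(toy_step_model, context, horizon=3)
nar_forecast = nar_decode(toy_multi_model, context, horizon=3)
assert ar_forecast == nar_forecast == [4.0, 5.0, 6.0]
```

On this noise-free toy the two schemes agree. They diverge once the one-step model is imperfect, because every AR step then consumes inputs that already contain earlier prediction errors.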

Figure 1. An overview of ProbTS, illustrating its coverage across diverse forecasting scenarios, including typical models developed in different research branches and comprehensive evaluation metrics.

The research results under the ProbTS framework reveal several key insights:

Firstly, due to customized neural architectures, long-term point forecasting approaches excel in long-term scenarios but struggle in short-term cases and with complex data distributions. The lack of uncertainty quantification leads to significant performance gaps compared to probabilistic models when dealing with complex data distributions. Conversely, short-term probabilistic forecasting methods are proficient in short-term distributional forecasting but exhibit performance degradation and efficiency issues in long-term scenarios.

Secondly, regarding the characteristics of different decoding schemes, NAR decoding is predominantly used in long-term point forecasting models, while short-term probabilistic forecasting models do not show such a biased preference. Meanwhile, AR decoding suffers from error accumulation over extended horizons but may perform better with strong seasonal patterns.

Lastly, for current time-series foundation models, the limitations of AR decoding are reaffirmed in long-term forecasting. Additionally, foundation models show limited support for distributional forecasting, highlighting the need for improved modeling of complex data distributions.

Detailed results and analysis on classical time-series models

Researchers benchmark classical time-series models across a wide range of forecasting scenarios, encompassing both short and long prediction horizons. The evaluation includes both point forecasting metrics (Normalized Mean Absolute Error, NMAE) and distributional forecasting metrics (Continuous Ranked Probability Score, CRPS). Additionally, researchers calculate a non-Gaussianity score to quantify the complexity of data distribution for each forecasting scenario.
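
For readers unfamiliar with the two metrics, here is a hedged Python sketch. The post does not spell out the formulas, so two assumptions are made: NMAE is taken to be the summed absolute error divided by the summed absolute target, and CRPS is estimated from forecast samples with the standard identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|. ProbTS may normalize or approximate these quantities differently; `nmae` and `crps_from_samples` are hypothetical helper names.

```python
from typing import Sequence


def nmae(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Normalized MAE (assumed form): sum of |y - y_hat| divided by sum of |y|."""
    abs_error = sum(abs(y - p) for y, p in zip(targets, predictions))
    scale = sum(abs(y) for y in targets)
    if scale == 0.0:
        raise ValueError("targets must not all be zero")
    return abs_error / scale


def crps_from_samples(samples: Sequence[float], y: float) -> float:
    """Sample-based CRPS estimate for one target value; lower is better."""
    n = len(samples)
    # Average distance between forecast samples and the observation.
    accuracy_term = sum(abs(x - y) for x in samples) / n
    # Average pairwise distance between samples (rewards a sharp forecast).
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy_term - 0.5 * spread_term


point_score = nmae([2.0, 2.0], [1.0, 3.0])
dist_score = crps_from_samples([0.0, 2.0], y=1.0)
```

With a single sample the spread term vanishes and CRPS reduces to the absolute error, which is why CRPS lets point and distributional forecasts be compared on a common scale.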

Based on the data presented in Figure 2, several noteworthy observations emerge:

Limitations of long-term point forecasting models: Customized neural architectures for time-series, primarily designed for long-term point forecasting, excel in long-term scenarios. However, their architectural benefits significantly diminish in short-term cases (see Figure 2(a) and 2(c)). Furthermore, their inability to quantify forecasting uncertainty results in larger performance gaps compared to probabilistic models, especially when the data distribution is complex (see Figure 2(c) and 2(d)).
Weaknesses of short-term probabilistic forecasting models: Current probabilistic forecasting models, while proficient in short-term distributional forecasting, face challenges in long-term scenarios, as evidenced by significant performance degradations (see Figure 2(a) and 2(b)). In addition to unsatisfactory performance, some models experience severe efficiency issues as the prediction horizon increases.

Figure 2. Benchmark classical time-series models with ProbTS

These observations yield several important implications. Firstly, effective architecture designs for short-term forecasting remain elusive and warrant further research. Secondly, the ability to characterize complex data distributions is crucial, as long-term distributional forecasting presents significant challenges in both performance and efficiency.

Following that, researchers compare Autoregressive (AR) and Non-Autoregressive (NAR) decoding schemes across various forecasting scenarios, highlighting their respective pros and cons in relation to forecasting horizons, trend strength, and seasonality strength.

Figure 3. Comparing AR and NAR schemes with ProbTS

Researchers find that nearly all long-term point forecasting models use the NAR decoding scheme for multi-step outputs, whereas probabilistic forecasting models exhibit a more balanced use of AR and NAR schemes. Researchers aim to elucidate this disparity and highlight the pros and cons of each scheme, as shown in Figure 3.

Error accumulation in AR decoding: Figure 3(a) shows that existing AR models experience a larger performance gap compared to NAR methods as the prediction horizon increases, suggesting that AR may suffer from error accumulation.
Impact of trend strength: Figure 3(b) connects the performance gap with trend strength, indicating that strong trending effects can lead to significant performance differences between NAR and AR models. However, there are exceptions where strong trends do not cause substantial performance degradation in AR-based models.
Impact of seasonality strength: Figure 3(c) explains these exceptions by introducing seasonality strength as a factor. Surprisingly, AR-based models perform better in scenarios with strong seasonal patterns, likely due to their parameter efficiency in such contexts.
Combined effects of trend and seasonality: Figure 3(d) demonstrates the combined effects of trend and seasonality on performance differences.
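
The post does not define trend strength or seasonality strength. One common definition from the forecasting literature decomposes a series into trend T, seasonal S, and remainder R, then uses max(0, 1 - Var(R) / Var(T + R)) for trend and max(0, 1 - Var(R) / Var(S + R)) for seasonality. The sketch below assumes that definition and uses a deliberately simple moving-average decomposition in place of a robust method such as STL; all function names are hypothetical.

```python
import math
from statistics import pvariance
from typing import List, Sequence, Tuple


def centered_mean(series: Sequence[float], half_width: int) -> List[float]:
    """Centered moving average; the window shrinks near the edges of the series."""
    n = len(series)
    out = []
    for t in range(n):
        lo, hi = max(0, t - half_width), min(n, t + half_width + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out


def decompose(series: Sequence[float], period: int) -> Tuple[List[float], List[float], List[float]]:
    """Additive decomposition into (trend, seasonal, remainder); needs len(series) >= period."""
    trend = centered_mean(series, period // 2)
    detrended = [x - t for x, t in zip(series, trend)]
    # Seasonal component: average detrended value at each position in the cycle.
    phase_means = []
    for phase in range(period):
        values = detrended[phase::period]
        phase_means.append(sum(values) / len(values))
    center = sum(phase_means) / period
    seasonal = [phase_means[t % period] - center for t in range(len(series))]
    remainder = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, remainder


def strength(component: Sequence[float], remainder: Sequence[float]) -> float:
    """max(0, 1 - Var(R) / Var(component + R)); close to 1 means the component dominates."""
    combined = [c + r for c, r in zip(component, remainder)]
    denominator = pvariance(combined)
    if denominator == 0.0:
        return 0.0
    return max(0.0, 1.0 - pvariance(remainder) / denominator)


seasonal_series = [math.sin(2.0 * math.pi * t / 12) for t in range(120)]
_, seasonal_part, remainder_part = decompose(seasonal_series, period=12)
seasonality_strength = strength(seasonal_part, remainder_part)
```

Scores of this kind allow datasets to be placed on trend and seasonality axes, which is the type of grouping the Figure 3 analysis relies on.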

Based on these analyses, researchers point out that the choice between AR and NAR decoding schemes in different research branches is primarily driven by the specific data characteristics in their focused forecasting scenarios. This explains the preference for the NAR decoding paradigm in most long-term forecasting models. However, this preference for NAR may overlook the advantages of AR, particularly its effectiveness in handling strong seasonality. Since both NAR and AR have their own strengths and weaknesses, future research should aim for a more balanced exploration, leveraging their unique advantages and addressing their limitations.

Detailed results and analysis on time-series foundation models

Researchers then extend the analysis framework to include recent time-series foundation models and examine their distributional forecasting capabilities.

Figure 4. Benchmark recent time-series foundation models with ProbTS

The results show that:

AR vs. NAR decoding in long-term forecasting: Figure 4(a) reaffirms the limitations of AR decoding over extended forecasting horizons. This suggests that time-series data, due to its continuous nature, may require special adaptations beyond those used in language modeling (which operates in a discrete space). Additionally, it is confirmed that AR-based and NAR-based models can deliver comparable performance in short-term scenarios, with AR-based models occasionally outperforming their NAR counterparts.
Distributional forecasting capabilities: Figure 4(b) compares the distributional forecasting capabilities of foundation models with CSDI, underscoring the importance of capturing complex data distributions. Current foundation models demonstrate limited support for distributional forecasting, typically using predefined distribution heads (e.g., MOIRAI) or approximated distribution modeling in a value-quantized space (e.g., Chronos).
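
To show what modeling in a value-quantized space involves, the following Python sketch scales a series and maps each value to a bin index, so a model could predict a categorical distribution over bins instead of a real number. It is only loosely inspired by the description above and is not Chronos code: the mean-absolute scaling, the bin range, the bin count, and the function names are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def quantize(values: Sequence[float], low: float, high: float,
             n_bins: int) -> Tuple[List[int], float]:
    """Scale values by their mean absolute value, then map each to a uniform bin index."""
    scale = sum(abs(v) for v in values) / len(values) or 1.0
    width = (high - low) / n_bins
    tokens = []
    for v in values:
        index = int((v / scale - low) // width)
        # Values outside [low, high) are clipped to the first or last bin.
        tokens.append(min(max(index, 0), n_bins - 1))
    return tokens, scale


def dequantize(tokens: Sequence[int], scale: float, low: float, high: float,
               n_bins: int) -> List[float]:
    """Map bin indices back to real values using bin centers and the stored scale."""
    width = (high - low) / n_bins
    return [(low + (t + 0.5) * width) * scale for t in tokens]


series = [1.0, 2.0, 3.0]
tokens, scale = quantize(series, low=-4.0, high=4.0, n_bins=16)
restored = dequantize(tokens, scale, low=-4.0, high=4.0, n_bins=16)
```

The round trip is lossy: each restored value can differ from the original by up to half a bin width times the scale. That resolution limit is one reason such a scheme yields only an approximation of the underlying continuous distribution.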

These observations lead to several important conclusions. First, while AR-based models can be effective in short-term scenarios, their performance diminishes over longer horizons, highlighting the need for further refinement. Second, time-series data may require unique treatments to optimize AR decoding, particularly for long-term forecasting. Third, the ability to accurately model complex data distributions remains a critical area for improvement in time-series foundation models.

Future directions: Evolving perspectives, models, and tools

Based on the evaluation and analysis of existing methods, researchers have proposed several important future directions for time-series forecasting models. These directions, if pursued, could significantly impact key scenarios across various industry sectors.

Future direction 1: Adopting a comprehensive perspective of forecasting demands. One primary future direction is to adopt a holistic perspective of essential forecasting demands when developing new models. This approach can help rethink the methodological choices of different models, understand their strengths and weaknesses, and foster more diverse research explorations.

Future direction 2: Designing a universal model. A fundamental question raised by these results is whether we can develop a universal model that fulfills all essential forecasting demands, or whether we should treat different forecasting demands separately and introduce specific techniques for each. While it is challenging to provide a definitive answer, the ultimate goal could be to create a universal model. Developing such a model requires considering the input representation, encoding architecture, decoding scheme, and distribution estimation module. Additionally, future research is needed to address the challenge of distributional forecasting in high-dimensional and noisy scenarios, particularly for long horizons, and to leverage the advantages of both AR and NAR decoding schemes while avoiding their weaknesses.

Future direction 3: Developing tools for future research. To support future research in these directions, researchers have made the ProbTS tool publicly available, hoping this tool will facilitate advancements in the field and encourage collective efforts from the research community.

By addressing these future directions, researchers aim to push the boundaries of time-series forecasting, ultimately developing models that are more robust, versatile, and capable of handling a wide range of forecasting challenges. This progress holds the potential to significantly impact numerous industries, leading to better decision-making, optimized operations, and improved outcomes across critical sectors.

Microsoft Research Lab – Asia