Dynamic Time Warping

It is useful to note that the majority of the posts written for this blog are in support of the mathematical analysis formulated in Mathematical Geoenergy (Wiley/AGU, 2018). As new data becomes available and new techniques for model fitting & parameter estimation — aka inverse modeling (predominantly from the machine learning community) — are suggested, an iterative process of validation ensures that the GeoEnergyMath models remain robust and accurately reflect the underlying observed behaviors. This should of course be done in conjunction with submitting significant findings to the research literature pipeline. However, as publication is pricey, my goal is to make the cross-validation so obvious that I can get an invitation for a review paper — with submission costs waived. Perhaps this post will be the deal-maker — certainly not the deal-breaker, but you can be the judge.

A comparison metric that I was only vaguely aware of, though it has shown increasing use in machine learning, is Dynamic Time Warping (DTW). The most common application of DTW is in speech recognition, where the time axis of an audio time-series is modified (stretched/compressed) so that its amplitude pattern matches against pronounced words and phonemes. So it's essentially a practical pattern-recognition tool, used to quickly find a best match among candidates by applying an efficient dynamic programming algorithm for segment optimization.
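To make the mechanics concrete, here is a minimal sketch of the dynamic-programming recurrence (in Python; this is illustrative only, not the prototype linked further below). The optional `window` argument plays the role of a band that limits how far the time axis can be warped:

```python
import numpy as np

def dtw_cost(x, y, window=None):
    """Accumulated DTW cost between 1-D sequences x and y, optionally
    restricted to a warping band of half-width `window` samples."""
    n, m = len(x), len(y)
    w = max(window if window is not None else max(n, m), abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])        # local distance
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```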

My application of DTW is on a different track, as I needed something that would optimize for the best-fitting model parameters, namely, scalar values which exist on a sliding scale. Thus, the goal is to use it as a replacement for the correlation coefficient metric which I've been using in cross-validation of ocean wave models (Chapter 12) against climate index time-series observations such as ENSO and AMO. The basic issue with the correlation coefficient as a metric is that it strongly penalizes sharp transitions in amplitude that are only slightly displaced in time by whatever error or uncertainty is associated with the measurement. As such, the correlation coefficient essentially uses a Euclidean distance metric as shown below, while DTW applies a clever algorithm to compensate for slight temporal displacements.

This situation is especially problematic for sharp spikes as found in climate indices such as ENSO. So, for example, a lone delta spike that occurs only once in several years will produce a low correlation coefficient against a model that places the spike a month off. Yet the DTW algorithm will warp the time axis slightly to produce an acceptable match. The DTW penalty is in the degree of warping applied to the data.
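A toy illustration of that spike argument (the numbers are hypothetical, and it reuses the dtw_cost sketch above):

```python
import numpy as np

data = np.zeros(120);  data[60] = 1.0    # observed spike in ten years of monthly samples
model = np.zeros(120); model[61] = 1.0   # model places the same spike one month late

print(np.corrcoef(data, model)[0, 1])    # Pearson correlation ~ -0.008: heavily penalized
print(dtw_cost(data, model, window=9))   # DTW cost 0.0: a one-step warp absorbs the offset
```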

I wanted the DTW algorithm (my prototype here) to produce something that would range from approximately 0 to 1 — as with the correlation coefficient — so I could plug it into the LTE optimization software package. Naively, I accomplished this by comparing the fit to a complement of the fit (reversing the sign of the amplitude of the model) and then calling that the Worst fit. Then the metric would be

DTW Metric = (Worst – Best) / Worst

As the Best value is the minimal cost (or error) in the match, this seems to work as a 0-to-1 scaling (provided Worst > Best), independent of the amplitude of the data. The drawback is that it takes two DTW passes (one for Worst and one for Best) instead of a single computation for Best.
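In code, the normalization amounts to something like the following sketch (again reusing dtw_cost from above; inputs are assumed to be NumPy arrays so the model's sign can be flipped directly):

```python
def dtw_metric(data, model, window=9):
    """(Worst - Best) / Worst normalization of the DTW cost, as described above."""
    best = dtw_cost(data, model, window)     # cost of the actual fit
    worst = dtw_cost(data, -model, window)   # cost against the sign-reversed (complement) model
    return (worst - best) / worst if worst > best else 0.0
```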

Approach

The proof is in the pudding in terms of how well the DTW metric works in comparison to the conventional correlation coefficient (CC) for cross-validation. In general, optimization using CC needs to be very aggressive to have the models visually match the target time-series, as only the amplitude provides a visual degree of freedom. With DTW, in contrast, the temporal jitter provides an extra degree of freedom to absorb the costs in fitting. Thus, a DTW fit will quite quickly "look" better than a comparable CC fit, leading to the observation that the CC value could be surprisingly low for the apparent quality of the fit.

Because of this nuance, over-confidence in fitting via DTW will likely be an issue, as one may be satisfied with a fit that’s actually not valid. Of course, the obvious recourse to this predicament is to rely more on cross-validation of results. In that case, one can gain confidence in the results across training runs if the test interval continually generates a validating metric value.

For evaluation purposes, the following modeling examples present the results of applying DTW-based cross-validation to three different climate indices: El Nino Modoki (EMI), Atlantic Multidecadal Oscillation (AMO), and Pacific Decadal Oscillation (PDO). IMO, the results are equally impressive even though the time-series are quite varied. The selected order of the results is a difficult decision as they each stand on their own. Each case study was performed many times from a fixed calibration to indicate that the results were statistically valid and not a fluke.

The skeletal model is the same for each: a non-linear Laplace's Tidal Equation (LTE) modulation is applied to an annual delta-impulse integration of the tidal signal in effect at that time (consider a sample-and-hold analogy). Or as a data flow diagram: {input long-period tidal forcing} → {impulse sampled annually and integrated} → {LTE sinusoidal modulation}, as shown in the schematic below.
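The same data flow can be written as a minimal code sketch (the function and parameter names here are illustrative, not the LTE software's actual interface):

```python
import numpy as np

def lte_model(t, forcing, lte_amps, lte_wavenums, lte_phases, impulse_month=0):
    """t: times in years (monthly spacing); forcing: long-period tidal forcing at t."""
    # 1. annual delta-impulse sampling of the tidal forcing
    months = np.rint(t * 12).astype(int)
    impulses = np.where(months % 12 == impulse_month, forcing, 0.0)
    # 2. integrate the impulse train (the sample-and-hold analogy)
    F = np.cumsum(impulses)
    # 3. non-linear LTE sinusoidal modulation of the integrated forcing
    return sum(a * np.sin(k * F + p)
               for a, k, p in zip(lte_amps, lte_wavenums, lte_phases))
```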

The cross-validation procedure is straightforward — whatever modulation is applied to the training interval to optimize the fit, the same modulation is applied blindly to the excluded validation interval. The tidal factors are slightly altered from their dLOD-calibrated values to provide a means of escaping locally minimized solutions. All results have been saved on a GIST page.
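A sketch of that procedure, assuming the dtw_metric function above and a boolean mask marking the training interval (names are hypothetical):

```python
def cross_validate(obs, model_values, train_mask, window=9):
    """Score the identical frozen model on the training interval and, blindly,
    on the excluded interval, using the same DTW metric."""
    test_mask = ~train_mask
    return (dtw_metric(obs[train_mask], model_values[train_mask], window),
            dtw_metric(obs[test_mask], model_values[test_mask], window))
```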

EMI

Below is a typical EMI result using a DTW window setting 9 points wide. In this case, the cross-validation testing interval is from the year 1930 to 1960 (highlighted in yellow) against JAMSTEC EMI data. Only 4 LTE sinusoidal modulation factors are significant (seen as spectral peaks in the lower panel), and a good DTW fit is readily apparent based on the initially calibrated tidal forcing (middle panel). The small inset shows that the windowed correlation is constant across the range — moderate in terms of fit, yet the test interval excluded from training appears just as good as the trained intervals.
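The windowed correlation in that inset is presumably just a sliding Pearson correlation, along the lines of this sketch (the window width used here is a guess):

```python
import numpy as np

def windowed_correlation(data, model, width=240):
    """Pearson correlation between data and model over a sliding window
    (width in samples; e.g. 240 monthly samples = 20 years)."""
    half = width // 2
    cc = np.full(len(data), np.nan)
    for i in range(half, len(data) - half):
        cc[i] = np.corrcoef(data[i - half:i + half],
                            model[i - half:i + half])[0, 1]
    return cc
```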

The ease and speed at which the model is cross-validated via DTW rivals that of conventional tidal analysis, where a correlation coefficient is routinely applied. For tidal analysis, the data is voluminous, densely packed, and accurately timed & measured so that the fitting process can be aggressive. In contrast, the El Nino Modoki data is limited to a per-month basis and likely suffers from uncertainties over the years it has been collected. That is perhaps where DTW excels — as with human speech, climate indices likely contain similar perturbations that can use a warped correction (synthesized speech would then be analogous to conventional tides — no corrections necessary).

Note the sinusoidal LTE modulation is faintly apparent in the lower right inset; this will be more apparent in the PDO and AMO fits described next.

PDO

The PDO index extends back to 1880, archived here. The training interval in this case runs from 1900 to 2000 (highlighted in yellow). The tidal forcing profile has a different phase and flatter holding characteristics but is still aligned to the calibrated tidal factors as set by the LOD data.

The excluded cross-validation testing intervals are before 1900 and after 2000. Both of these show a pattern that clearly matches the PDO behavior. Not perfect, but neither is the training fit, as can be seen with the windowed cross-correlation shown in the small inset panel.

Note that the LTE sinusoidal modulation is more gradual for the PDO model than for the EMI model, as apparent from the lower right panel. This is in keeping with the longer decadal variation in the PDO. As described in Chapter 12 of Mathematical Geoenergy, the modulation corresponds to spatially wider standing-wave modes and weaker temporal sinusoidal scaling. This will become even more apparent with the multidecadal AMO modeled next.

AMO

The AMO data was taken from here. The model optimization uses the same training and testing intervals as the PDO. Only 3 LTE sinusoidal modulations are relevant: a strong gradual fundamental, a weaker harmonic at twice that value, and a fast harmonic at 30 times the frequency. The latter causes the erratic fluctuations in the model, matching those in the AMO data.
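That 3-term modulation has the following shape (a sketch only; the base wavenumber k and amplitudes a1..a3 are hypothetical placeholders, with F the integrated tidal forcing as in the earlier sketch):

```python
import numpy as np

def amo_modulation(F, k, a1, a2, a3, p1=0.0, p2=0.0, p3=0.0):
    return (a1 * np.sin(k * F + p1)          # strong, gradual fundamental
            + a2 * np.sin(2 * k * F + p2)    # weaker harmonic at twice that value
            + a3 * np.sin(30 * k * F + p3))  # fast harmonic at 30x the frequency
```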

As for the excluded cross-validating testing intervals before 1900 and after 2000, the agreement is very good, and it extends the gradual multidecadal variation within the training interval to the testing regions.

The harmonic relationships were enforced in the fitting process for this model, as they seemed to reflect the symmetry and strength of the inferred modulation shown in the lower right panel. Compared to EMI and PDO, the gradual LTE modulation is markedly apparent. So the AMO is a good candidate with which to start the modeling process, as one can quickly determine the relevant LTE modulation factors.

Commonality

The utility of the DTW-guided optimization process across the 3 models is collectively impressive. Obviously, the LTE modulations differ, which makes each model distinctive. Yet the input tidal forcings are each closely aligned with the calibrated dLOD tidal factors, maintaining a correlation coefficient of at least 0.998 with the conventional multiple linear regression fit of the dLOD data. Below is the comparison of the original dLOD calibration (in blue) to the EMI model variation (in red). The bottom panel shows the two at a finer comparison level.

No real visible difference is seen here, yet during the optimization process, the tidal factors corresponding to the monthly synodic (29.53d) and draconic (27.21d) lunar cycles were amplified, as was the 4.4-year anomalistic perigee (1616d) cycle, along with the 3rd harmonic of the monthly anomalistic (27.55d / 3 = 9.18d) cycle. This happened consistently across the 3 models, indicating that this subtle modification is real. As the fortnightly lunar term (13.66d) dominates the dLOD variations by 2 orders of magnitude and is a symmetric north-vs-south forcing term, any monthly contribution would imply an asymmetry (imbalance) in the north-vs-south forcing. For the 4.4-year cycle, note the beat frequency in the above dLOD waveform, which is also 4.4 years. The simplest way to introduce a +/- asymmetry in the amplitude of that waveform is to apply a sine wave of that frequency. Such a correction would make the peaks slightly spikier and the valleys slightly flatter, or vice versa, depending on its sign. That's a typical 1st-order nonlinear correction for gravity, as is the 3rd harmonic of the anomalistic cycle.

The LTE modeling software generates the following parameter dump, with the 5th column showing the amplification of each of the tidal factors (1st column), expressed as a percentage relative to the calibrated levels. So the large ~13X amplification of the 1616d cycle is in relation to the small value it starts from (~0.00016), whereas the predominant 13.66d cycle is at ~0.31.

Yet, whatever the correction, it is indeed subtle.

Conclusion

Applying the DTW metric is a fast and robust approach to fit and optimize LTE models to climate index behaviors. I only regret not having the insight to use it much earlier during cross-validation experiments, as it would have prevented many pointless dead-ends caused by over-fitting using a correlation coefficient metric. It was much too easy to inadvertently chase corner cases (such as slightly offset temperature spikes in data vs model) in a time-series data set that would otherwise be readily absorbed by the DTW algorithm.

The promise of this approach is that it will be picked up by machine-learning experiments in modeling climate behaviors, whereby, as in speech recognition, algorithms such as DTW speed up the process of matching patterns of behavior. See the comments below for citations to such work.

8 thoughts on “Dynamic Time Warping”

  1. I am not familiar with the datasets you are fitting (e.g., AMO, PDO, etc.). But my understanding of DTW is different and possibly more general. I am not suggesting you are wrong in any way, just trying to align my experience and understanding with yours to try to contribute something.

    In my experience in signal processing, DTW takes two time-dependent signals s(t) and u(t), re-expresses them so their domains are each on [0, 2pi), and produces a third signal, w(t), which shifts the time-varying phase of u(t) so that DTW[u(t), w(t)] ~ s(t); that is, s(t) – DTW[u(t), w(t)] is the constant signal 0(t), or as close to it as possible.

    The thought is that each piece of w(t) applies a phase shift to the corresponding part of u(t) so it looks as much as possible like the corresponding part of s(t). Now, this doesn’t work in general. That is, it’s possible to concoct an s(t) and u(t) pair for which no w(t) that “aligns” them exists. It’s interesting to examine what the criteria are so this is satisfied. Can’t address it here and now, but one way of stating this is to impose conditions on the Fourier transforms of both s(t) and u(t). But assuming it does work, the purpose of w(t) is to provide a “phase steering” to speed u(t) up or slow it down with the goal being to bring the adjusted u(t) closer to s(t).

    For cases where this is NOT possible it might be that u(t) for some pertinent section is missing the ordinate values s(t) requires. This is not the only way to get into trouble. For instance it’s not difficult to see that there are constraints needed on the time derivatives of s(t) and u(t) as well.

    So, I guess I would get nervous if I couldn’t be assured the given datasets were capable of being “warped” in this way. Maybe they always can be. But if that’s true, what’s the physical meaning of those constraints?


  2. Jan, Much of the application of Dynamic Time Warping to Deep Learning (DL) is still maturing. As I stated in the post, it’s interesting in how it helps mitigate over-fitting. I found a paper from this year that claims a similar finding:

    ” In this paper, we propose a non-intrusive overfitting detection and prevention approach using time series classifiers trained on the training history of DL models. Our approach (when using the KNN-DTW time series classifier) has (1) better classification performance than correlation-based approaches for overfitting detection, and (2) greater accuracy than early stopping for overfitting prevention with a shorter delay. We evaluate our approach on a real-world dataset of labelled training histories collected from the papers published at top AI venues in the last 5 years. Our approach can be a useful tool for researchers and developers of DL software”

    Li, Hao, et al. “Keeping Deep Learning Models in Check: A History-Based Approach to Mitigate Overfitting.” arXiv preprint arXiv:2401.10359 (2024).

    https://arxiv.org/pdf/2401.10359.pdf

    Remember that the DTW algorithm is just a metric and it is not actually adding any phase shifts to the data or model; instead it is just estimating a cost penalty depending on the distance of matching patterns (using a dynamic programming approach to reduce the computation time). 


  3. https://dl.acm.org/doi/pdf/10.14778/1454159.1454226

    Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures (2008)

    “Inspired by the need to handle time warping in similarity computation, Berndt and Clifford [5] introduced DTW, a classic speech recognition tool, to the data mining community, in order to allow a time series to be “stretched” or “compressed” to provide a better match with another time series. Several lower bounding measures have been introduced to speed up similarity search using DTW [21, 26, 27, 44], and it has been shown that the amortized cost for computing DTW on large data sets is linear”


  4. Pingback: Are the QBO disruptions anomalous? | GeoEnergy Math


  6. Pingback: Bay of Fundy subbands | GeoEnergy Math
