Overfitting+Cross-Validation: ENSO→AMO

I presented at the 2018 AGU Fall Meeting on the topic of cross-validation. Building on those early results, I have updated a fitted-model comparison between the Pacific Ocean's ENSO time series and the Atlantic Ocean's AMO time series. The premise is that the tidal forcing is essentially the same in the two oceans, but that the standing-wave configuration differs. So the approach is to maintain a common-mode forcing in the two basins while only adjusting the Laplace's tidal equation (LTE) modulation.

For those unfamiliar with these two completely orthogonal time series, the idea that one can avoid overfitting a single data set, let alone two sets simultaneously, may seem far-fetched. (Michael Mann does not even consider the AMO a real oscillation, judging by his recent research article "Absence of internal multidecadal and interdecadal oscillations in climate model simulations".)

This is the latest product:

Read the panels in reverse order, from H back to A.

H = The two tidal forcing inputs for ENSO and AMO, which differ only by a scale factor and a slight offset

G = The constituent tidal forcing spectrum comparison of the two — primarily the expected main constituents of the Mf fortnightly tide and the Mm monthly tide (and the Mt composite of Mf × Mm), amplified by an annual impulse train which creates a repeating Brillouin zone in frequency space.

E&F = The LTE modulation for AMO, consisting essentially of one strong high-wavenumber modulation, as shown in F

C&D = The LTE modulation for ENSO, a strong low-wavenumber modulation that follows the El Niño/La Niña cycles plus a faster modulation

B = The AMO fitted model modulating H with E

A = The ENSO fitted model modulating the other H with C

Ordinarily, determining this non-linear mapping would take eons' worth of machine-learning compute time, but with knowledge of how to solve Navier-Stokes it becomes a tractable problem.
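To make the composition concrete, here is a minimal sketch (in Python) of the model structure the panels describe: a common tidal forcing gated by an annual impulse train, then passed through a basin-specific LTE modulation. All constituent periods, amplitudes, wavenumbers, and function names below are illustrative placeholders, not the values from the actual fits.

```python
import numpy as np

def tidal_forcing(t, constituents):
    """Sum of long-period tidal constituents, e.g. Mf (~13.66 d) and Mm (~27.55 d);
    the amplitudes and phases are the fitted DOF counted in panel G."""
    f = np.zeros_like(t)
    for period_yr, amp, phase in constituents:
        f += amp * np.sin(2 * np.pi * t / period_yr + phase)
    return f

def annual_impulse(t, width=0.05):
    """Narrow impulse train repeating once per year; gating the tidal forcing with
    this train produces the repeating Brillouin-zone pattern of panel G."""
    return np.exp(-((t % 1.0) / width) ** 2)

def lte_modulation(x, modes):
    """Laplace's Tidal Equation modulation g(x): a sum of sinusoids in the forcing
    amplitude x -- low wavenumber for ENSO (panels C&D), high for AMO (panels E&F)."""
    return sum(a * np.sin(k * x + p) for k, a, p in modes)

# Common-mode forcing H, differing between basins only by a scale and an offset:
t = np.arange(1880, 2020, 1 / 12.0)                 # monthly samples, in years
constituents = [(13.66 / 365.25, 1.0, 0.3),         # Mf fortnightly (illustrative)
                (27.55 / 365.25, 0.7, 1.1)]         # Mm monthly (illustrative)
f_t = annual_impulse(t) * tidal_forcing(t, constituents)

enso_model = lte_modulation(1.0 * f_t, modes=[(1.5, 1.0, 0.0), (10.0, 0.4, 0.5)])
amo_model = lte_modulation(0.8 * f_t + 0.1, modes=[(12.0, 1.0, 0.0)])
```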

Now, with that said, what does this have to do with cross-validation? The model fitted to the ENSO time series alone does indeed have many degrees of freedom (DOF), based on the number of tidal constituents shown in G. Yet by constraining the AMO fit to require essentially the same constituent tidal forcing as ENSO, the number of additional DOF introduced is minimal; note the single strong spike value in F.

Since the parsimony of a model fit is judged by information criteria that penalize the number of DOF (exactly the metric used to characterize order in the previous post), it is reasonable to conclude that fitting a waveform as complex as B with only the additional information of F cross-validates the underlying common-mode model under any information-criteria metric.
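As a rough illustration of that DOF bookkeeping (not the metric actually used in the fits), an information criterion such as AIC penalizes each added parameter, so a joint fit that reuses the ENSO forcing for AMO pays only for the few extra LTE parameters:

```python
import numpy as np

def aic(residuals, n_params):
    """Akaike information criterion for a least-squares fit:
    n * ln(RSS/n) + 2k, where k is the number of fitted DOF."""
    n = len(residuals)
    rss = np.sum(np.asarray(residuals, dtype=float) ** 2)
    return n * np.log(rss / n) + 2 * n_params

# An independent AMO fit would duplicate every tidal-constituent DOF of panel G;
# constraining AMO to the common-mode forcing adds only the single strong
# LTE spike of panel F, so the penalty term 2k barely increases.
```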

For further guidance, an informative article on model selection with regard to complexity is "A Primer for Model Selection: The Decisive Role of Model Complexity".

excerpt:

The Search for Order

Chap 10 Mathematical Geoenergy

For the LTE formulation along the equator, the analytical solution reduces to g(f(t)), where g(x) is a periodic function. Without knowing g(x) in advance, we can use the frequency-domain entropy, or spectral entropy, of the Fourier series that maps an estimated forcing amplitude x = f(t) to a measured climate-index time series such as ENSO. The frequency-domain entropy is the sum (or integral) over reciprocal space of the Shannon measure −I(f)·ln I(f), where I(f) is the power spectral density of the mapping from the modeled forcing to the time-series waveform, normalized over the frequency range.

This measures the entropy or degree of disorder of the mapping. So to maximize the degree of order, we minimize this entropy value.
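A minimal sketch of how such a spectral-entropy metric can be computed; resampling the index onto a uniform grid in the forcing amplitude x is my own assumption about the discretization, not necessarily how the LTE software implements it:

```python
import numpy as np

def spectral_entropy(x, y, n_bins=512):
    """Shannon entropy of the power spectrum of the mapping y = g(x), where x is the
    modeled forcing amplitude f(t) and y is the measured index (e.g. ENSO).
    A low value means a few sharp spikes, i.e. a highly ordered g(x)."""
    order = np.argsort(x)
    x_uniform = np.linspace(x.min(), x.max(), n_bins)
    y_uniform = np.interp(x_uniform, x[order], y[order])   # uniform grid in x
    power = np.abs(np.fft.rfft(y_uniform - y_uniform.mean())) ** 2
    p = power / power.sum()                                 # normalize I(f) to a distribution
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Minimizing this scalar over candidate forcings f(t) selects the most ordered mapping.
```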

This calculated entropy is a single scalar metric that eliminates the need to evaluate various cyclic g(x) patterns to achieve the best fit. Instead, it points to a highly ordered spectrum (top panel in the above figure), whose delta-like spikes can then be reverse-engineered to deduce the primary frequency components arising from the LTE modulation factor g(x).

The approach works particularly well once the spectral spikes begin to emerge from the background. In terms of a physical picture, what is actually emerging are the principal standing-wave solutions for particular wavenumbers. One can see this in the LTE modulation spectrum below, where panel A shows a spike at a wavenumber of 1.5 and another at around 10 (the sine and cosine spectra are shown separately rather than combined in quadrature to give the spectral intensity). This is then reverse-engineered as a fit to the actual LTE modulation g(x) in panel B. Panel D is the tidal forcing x = f(t) that minimized the Shannon entropy, thus creating the final fit g(f(t)) in panel C when the LTE modulation is applied to the forcing.
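The reverse-engineering step can be sketched the same way: locate the strongest spikes in the mapping's spectrum and use their wavenumbers to seed a reconstruction of g(x). Again this is an illustrative sketch; peak selection and amplitude refinement in the actual fitting code may differ.

```python
import numpy as np

def dominant_wavenumbers(x, y, n_bins=512, n_peaks=2):
    """Find the strongest spectral spikes of the mapping y = g(x) and return
    candidate LTE wavenumbers (compare the ~1.5 and ~10 spikes of panel A)."""
    order = np.argsort(x)
    x_u = np.linspace(x.min(), x.max(), n_bins)
    y_u = np.interp(x_u, x[order], y[order])
    spectrum = np.fft.rfft(y_u - y_u.mean())
    freqs = np.fft.rfftfreq(n_bins, d=x_u[1] - x_u[0])    # cycles per unit forcing
    peaks = np.argsort(np.abs(spectrum))[::-1][:n_peaks]
    return [(2 * np.pi * freqs[i],                        # angular wavenumber k
             2 * np.abs(spectrum[i]) / n_bins,            # rough amplitude estimate
             np.angle(spectrum[i])) for i in peaks]

def reconstruct_g(x, modes):
    """Rebuild the LTE modulation g(x) from the reverse-engineered spikes (panel B);
    the amplitudes and phases would normally be refined by a subsequent fit."""
    return sum(a * np.cos(k * x + p) for k, a, p in modes)
```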

The approach does work, which is quite a boon to the efficiency of iterative fitting, since it reduces the number of DOF involved in the calculation. Previously, a guess for the LTE modulation was required and the iterative fit had to evolve toward the optimal modulation periods. Either way works, but the entropy approach may provide a quicker and more efficient path to discovering the underlying standing-wave order.

I will eventually add this to the LTE fitting software distro available on GitHub. The idea may also apply to other measures of entropy, such as Tsallis, Rényi, multi-scale, and perhaps bispectral entropy, and I will add those alongside the conventional Shannon entropy measure as needed.
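For reference, the generalized entropies mentioned can be applied to the same normalized power spectrum p = I(f)/ΣI(f) as the Shannon measure; a sketch with illustrative parameter values:

```python
import numpy as np

def renyi_entropy(p, alpha=2.0):
    """Renyi entropy of order alpha; approaches the Shannon entropy as alpha -> 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy of index q; also recovers the Shannon entropy as q -> 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)
```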