Canonical Cross-Validation

The only hope for a non-controlled-experiment-verified model to gain acceptance is either by (1) showing repeated success in predictions, or, precluding that due to long cycle time (2) producing rock-solid cross-validation results. Why? Let ChatGPT-4 answer:

It’s likely that for any new prediction of El Nino, at least 20 years of new observations will be required to gain statistical significance for a proposed geophysical model. Why 20 years? Basing the success of a model on just one new prediction could be a fluke, just as one wrong prediction could be an anomaly, so that a string of several El Nino events — say 5 at an average interval of 4 years — would require at least 20 years of new observations. It’s counter-productive, if not silly, to wait that long to evaluate a prospective model, thus the reliance on cross-validation.

The key to a meaningful cross-validation, with regards to potential over-fitting, is to create a canonical model that has few adjustable parameters or whose parameters are tightly constrained by other factors.

For the forcing of ENSO, the goal is to prepare the tidal factors so that they best represent the measured length-of-day (LOD) deviation over time in a canonical representation. The analysis from R.D. Ray presents over 150 values of amplitude and phase (see below for the non-exhaustive list)

Fortunately, I discovered a highly-compact reduced representation that expands the fundamental tidal factors, mainly Mf and Mm, into the many non-linear harmonic and cross-harmonic terms that Ray enumerates above.

The following list of 29 terms is sufficient at the moment (the last 6 are very long-period) to do a very credible job of capturing dLOD via a few non-linear cross-mixing terms (K1, K2, K3, A2A, E2A, Q1, Q2). The factors ending in P are phase factors (so 11 each amplitude and phase, and 7 cross-mix terms). Even some of these 11 may not be required for improving a fit (time will tell).

function Tide_Sum (Template : in Data_Pairs;
Constituents : in Long_Periods_Amp_Phase;
Periods : in Long_Periods;
Ref_Time : in Long_Float := 0.0;
Scaling : in Long_Float := 1.0;
Cos_Phase : in Boolean := True;
Year_Len : in Long_Float := Year_Length;
Integ: in Long_Float := 0.0
) return Data_Pairs is
Pi : Long_Float := Ada.Numerics.Pi;
Time : Long_Float;
Res : Data_Pairs := Template;
Yr : Long_Float := Year_Len;
Drac : Long_Float := 2.0*pi*Yr/Draconic;
Trop : Long_Float := 2.0*pi*Yr/Tropical;
Anom : Long_Float := 2.0*pi*Yr/Anomalistic;
Nodal : Long_Float := Drac-Trop;
Peri : Long_Float := (Trop-Anom)/2.0;
K1 : Long_Float := Constituents(1).Value;
K2 : Long_Float := Constituents(2).Value;
TA : Long_Float := Constituents(3).Value;
TP : Long_Float := Constituents(4).Value;
AA : Long_Float := Constituents(5).Value;
AP : Long_Float := Constituents(6).Value;
T2A : Long_Float := Constituents(7).Value;
T2P : Long_Float := Constituents(8).Value;
SA : Long_Float := Constituents(9).Value;
SP : Long_Float := Constituents(10).Value;
A2A : Long_Float := Constituents(11).Value;
SYA : Long_Float := Constituents(12).Value;
SYP : Long_Float := Constituents(13).Value;
EA : Long_Float := Constituents(14).Value;
EP : Long_Float := Constituents(15).Value;
E2A : Long_Float := Constituents(16).Value;
YA : Long_Float := Constituents(17).Value;
YP : Long_Float := Constituents(18).Value;
Q1 : Long_Float := Constituents(19).Value;
Q2 : Long_Float := Constituents(20).Value;
DA : Long_Float := Constituents(21).Value;
DP : Long_Float := Constituents(22).Value;
K3 : Long_Float := Constituents(23).Value;
NA : Long_Float := Constituents(24).Value;
NP : Long_Float := Constituents(25).Value;
PA : Long_Float := Constituents(26).Value;
PP : Long_Float := Constituents(27).Value;
N2A : Long_Float := Constituents(28).Value;
N2P : Long_Float := Constituents(29).Value;
function Ds is new gsin (DA, 2.0*Drac, DP); — = A*sin(2*Drac*(x+p));
function Ts is new gsin (TA, 2.0*Trop, TP); — = A*sin(2*Trop*(x+p))
function T2s is new gsin (T2A, Trop+Drac, T2P); — = A*sin((Trop+Drac)*(x+p))
function As is new gsin (AA, Anom, AP); — = A*sin(Anom*(x+p))
function Semi is new gsin(SA, 4.0*Pi, SP); — = A*sin(4*pi*(x+p))
function Evect is new gsin(EA, 2.0*Trop-Anom-4.0*Pi, EP); — = A*sin((2*Trop-Anom-4*pi)*(x+p))
function Syn is new gsin(SYA, 2.0*Trop-4.0*Pi, SYP); — = A*sin((2*Trop-4*pi)*(x+p))
function Annual is new gsin(YA, 2.0*Pi, YP); — = A*cos(2*pi*(x+p))
function N is new gsin (NA, Nodal, NP); — = A*sin(Nodal*(x+p))
function N2 is new gsin (N2A, Nodal/2.0, N2P); — = A*sin(Nodal/2.0*(x+p))
function P is new gsin (PA, Peri, PP); — = A*sin(Peri*(x+p))
— # Define the model
function Model (Time : in Long_Float) return Long_Float is
MM : Long_Float := As(Time);
Msm : Long_Float := Evect(Time);
Mfp : Long_Float := T2s(Time + A2A*Mm + E2A*Msm);
Mf : Long_Float := Ts(Time + A2A*Mm + E2A*Msm);
Mfd : Long_Float := Ds(Time + A2A*Mm + E2A*Msm);
Msf : Long_Float := Syn(Time + Q1*Mf + Q2*Mfp);
begin
return Mf + Mfp + Mfd + Semi(Time) + Annual(Time) + Msm + Msf + N(Time) + N2(Time) + P(Time) +
As(Time + K1*(Mf+Mfp+Mfd) + K2*Mm + K3*Msf);
end Model;
begin
for I in Template'Range loop
Time := Template(I).Date – Ref_Time;
declare
TF : Long_Float := 0.0;
begin
Tf := Model (Time);
Res(I) := (Time, Scaling * TF);
end;
end loop;
return Res;
end Tide_Sum;

This code has been incorporated into the https://github.com/pukpr/GeoEnergyMath repo source code (not committed yet), and evaluated over the SST climate indices NINO4, PDO, IOD-East, IOD-West, AMO, and the Darwin and Tahiti SOI datasets.

Now comes the daring part — the training essentially takes place only over the interval 1880 to 1930 (give or take a few years). And this also excludes a test interval from 1902-1907 to monitor the training. The rest of the data interval, notably post-1932 is out-of-bounds starting from an initial parameter set.

Moreover, only one fundamental LTE modulation is included in the fitting process, with the max harmonic of this fundamental controlled externally. So, a max harmonic of 29 allows all the harmonics 2,3,4,…,28,29 to comprise the solution, using a multiple linear regression algorithm to map the forcing to the climax index intensity over the training interval. Outside of the training interval a stationary LTE modulation mapping is maintained.

Starting with NINO4 w/parameters, the fit is below (the YELLOW region is the discontinuous training interval).

The compact LOD representation prevents overfitting, as even though the values are allowed to slightly vary, they do not deviate far from the initial settings. Can see this by using the same settings and training intervals on the other modeled indices, as follows:

IOD-West w/parameters

Next is IOD-East

Next is PDO w/parameters

For AMO need to apply an annual impulse, not semi-annual, w/parameters.

Note the impulse-stimulated forcing differs but not the underlying tidal factors.

This results in all tidal factors comprising the LOD calibration aligning to a large degree for each of the climate indices modeled and fitted to this point. In the chart below, the canonical dLOD tidal forcing representing each of the 5 models are compared to a 64-point (amp+phases) calibration. Click on link to expand as deviations are slight.

The Fourier series of the above, using the NINO4 tidal factor calibration:

For completeness, the Darwin and Tahiti fits are also fitted, which are of course closely related to the NINO4 index.

Darwin w/parameters

Tahiti, reversing the sign of the impulse

keeping the sign of the impulse, but increasing the max harmonic

Other model fits are shown on the GIST page; this includes experiments with higher harmonics and restricting to odd harmonics. For each case, it’s clear that the out-of-band correlation is both striking from a visual sense as well as quantitatively, showing windowed correlation coefficients averaging above 0.5 over the entire cross-validating interval. And finally, recall from the ChatGPT the concept of robustness assessment — the fact that several different climate indices are collectively modelled quickly and using nearly identical tidal factor sets indicates confidence in the cross-validation results. ***

The basic approach for LTE modeling is described in Chapter 12 of Mathematical Geoenergy.

Notes:
*** The other climate index often considered – QBO – has a wavenumber=0 symmetry so can’t be modelled using the calibrated LOD forcing.

One thought on “Canonical Cross-Validation

  1. Pingback: No Dice | GeoEnergy Math

Leave a comment