Sequential fitting

It is not always clear how to select good hyperparameters for calculations. The second tutorial, “Getting insights about the model,” showed how to plot PCA spectra for all lambda channels and parities. This information, along with other diagnostics such as regression accuracy, can be used to select better hypers. In particular, the most straightforward approach is to select the number of PCA components so as to cover most of the variance, and to do this successively, block by block.
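
As a purely illustrative sketch of this selection rule (plain sklearn on random data; the NICE-specific spectra are plotted in the second tutorial, and the 0.99 threshold here is an arbitrary choice):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 50)  # stand-in for the features of one lambda channel
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components covering 99% of the variance
n_components = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_components)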

In this case, it is very undesirable to refit all parts of the model, including the unchanged ones, from scratch. One way around this is to do everything by hand, as described in the tutorial “Constructor or non standard_sequence,” but then there is the additional headache of packing the resulting blocks into a single model with a convenient .transform method. The NICE toolbox can do this very succinctly.

First of all, we need to get the spherical expansion coefficients in the same way as in the previous tutorials:

Preliminaries

# downloading dataset from https://archive.materialscloud.org/record/2020.110

!wget "https://archive.materialscloud.org/record/file?file_id=b612d8e3-58af-4374-96ba-b3551ac5d2f4&filename=methane.extxyz.gz&record_id=528" -O methane.extxyz.gz
!gunzip -k methane.extxyz.gz

import numpy as np
import ase.io
import tqdm
from nice.blocks import *
from nice.utilities import *
from matplotlib import pyplot as plt
from sklearn.linear_model import BayesianRidge

structures = ase.io.read('methane.extxyz', index='0:1000')

HYPERS = {
    'interaction_cutoff': 6.3,  # cutoff radius of the environments, in Å
    'max_radial': 5,  # number of radial basis functions
    'max_angular': 5,  # maximal degree of the spherical harmonics
    'gaussian_sigma_type': 'Constant',
    'gaussian_sigma_constant': 0.05,  # width of the atomic Gaussians, in Å
    'cutoff_smooth_width': 0.3,  # width of the smoothing region at the cutoff, in Å
    'radial_basis': 'GTO'
}

all_species = get_all_species(structures)

coefficients = get_spherical_expansion(structures, HYPERS, all_species)
coefficients = coefficients[1]  # keep only H-centered environments (atomic number 1)

--2022-11-04 12:38:33--  https://archive.materialscloud.org/record/file?file_id=b612d8e3-58af-4374-96ba-b3551ac5d2f4&filename=methane.extxyz.gz&record_id=528
Resolving archive.materialscloud.org (archive.materialscloud.org)... 148.187.149.49
Connecting to archive.materialscloud.org (archive.materialscloud.org)|148.187.149.49|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://object.cscs.ch/archive/b6/12/d8e3-58af-4374-96ba-b3551ac5d2f4/data?response-content-type=application%2Foctet-stream&response-content-disposition=attachment%3B%20filename%3Dmethane.extxyz.gz&Signature=m5bCM21hSbBLcX0R6CrPB8BzOMg%3D&AWSAccessKeyId=f30fe0bddb114e91abe6adf3d36c6f2e&Expires=1667561973 [following]
--2022-11-04 12:38:33--  https://object.cscs.ch/archive/b6/12/d8e3-58af-4374-96ba-b3551ac5d2f4/data?response-content-type=application%2Foctet-stream&response-content-disposition=attachment%3B%20filename%3Dmethane.extxyz.gz&Signature=m5bCM21hSbBLcX0R6CrPB8BzOMg%3D&AWSAccessKeyId=f30fe0bddb114e91abe6adf3d36c6f2e&Expires=1667561973
Resolving object.cscs.ch (object.cscs.ch)... 148.187.25.201, 148.187.25.204, 148.187.25.200, ...
Connecting to object.cscs.ch (object.cscs.ch)|148.187.25.201|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1218139661 (1.1G) [application/octet-stream]
Saving to: ‘methane.extxyz.gz’

methane.extxyz.gz   100%[===================>]   1.13G   169MB/s    in 6.9s

2022-11-04 12:38:40 (167 MB/s) - ‘methane.extxyz.gz’ saved [1218139661/1218139661]

100%|██████████| 10/10 [00:00<00:00, 124.68it/s]
100%|██████████| 2/2 [00:00<00:00, 262.07it/s]

coefficients now contains the spherical expansion coefficients for H-centered environments:

print(coefficients.shape)

(4000, 10, 6, 11)
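
These dimensions follow directly from the setup above (assuming the layout [environments, radial × species, l, m]):

n_env, n_rad, n_l, n_m = coefficients.shape
assert n_env == 1000 * 4  # 4 H atoms per methane structure
assert n_rad == HYPERS['max_radial'] * len(all_species)
assert n_l == HYPERS['max_angular'] + 1  # l = 0..max_angular
assert n_m == 2 * HYPERS['max_angular'] + 1  # m = -max_angular..max_angular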

Let’s do the first steps of the standard sequence:

even_0, odd_0 = InitialTransformer().transform(coefficients)
initial_pca = IndividualLambdaPCAsBoth()
initial_pca.fit(even_0, odd_0)
even_0_t, odd_0_t = initial_pca.transform(even_0, odd_0)

Now we can fit a couple of standard blocks:

# covariant branch only: expansioner -> (no purifier) -> PCA
block_1 = StandardBlock(ThresholdExpansioner(100), None,
                        IndividualLambdaPCAsBoth(20))
block_1.fit(even_0_t, odd_0_t, even_0_t, odd_0_t)
# block_1 has no invariants branch, so the third output is discarded
even_1, odd_1, _ = block_1.transform(even_0_t, odd_0_t, even_0_t, odd_0_t)

# invariant branch only: the fourth argument is the invariants expansioner
block_2 = StandardBlock(None, None, None,
                        ThresholdExpansioner(100, mode='invariants'))
block_2.fit(even_1, odd_1, even_0_t, odd_0_t)
# block_2 produces only invariants; the covariant outputs are discarded
_, _, even_invariants = block_2.transform(even_1, odd_1, even_0_t, odd_0_t)
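
It is worth a quick glance at the result; even_invariants is a plain array with one row per H-centered environment (the number of columns depends on the expansioner threshold):

print(even_invariants.shape)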

At this moment, we have all parts of this standard sequence fitted:

nice = StandardSequence(initial_pca=initial_pca, blocks=[block_1, block_2])
print(initial_pca.is_fitted())
print(block_1.is_fitted())
print(block_2.is_fitted())

True
True
True

What about the full model?

print(nice.is_fitted())

False

Nope.

At this point, there is a very high probability of making a mistake. In particular, one could feed StandardSequence a fitted initial_pca together with blocks that were fitted based on a different initial_pca, with a different initial_normalizer, or even on different data. In order to prevent this, it is required to pass an additional flag, guaranteed_parts_fitted_consistently=True, to the model:

nice = StandardSequence(initial_pca=initial_pca,
                        blocks=[block_1, block_2],
                        guaranteed_parts_fitted_consistently=True)
print(nice.is_fitted())

True

The model is considered fitted if 1) all its parts are fitted and 2) guaranteed_parts_fitted_consistently is set to True.

Golden rule: every time you pass guaranteed_parts_fitted_consistently=True, pause and think twice.
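
To see why, here is a deliberately broken example reusing the objects defined above: wrong_pca is fitted, but with a different number of components than the initial_pca that actually produced the inputs on which block_1 and block_2 were fitted. The flag suppresses all checks, so the model still reports itself as fitted:

wrong_pca = IndividualLambdaPCAsBoth(3)
wrong_pca.fit(even_0, odd_0)

broken = StandardSequence(initial_pca=wrong_pca,
                          blocks=[block_1, block_2],
                          guaranteed_parts_fitted_consistently=True)
print(broken.is_fitted())  # True, even though the parts are mutually inconsistent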

Let’s check consistency:

# [3] selects the body-order-3 invariants, i.e. the output of block_2
even_invariants_2 = nice.transform(coefficients,
                                   return_only_invariants=True)[3]
print(np.sum(np.abs(even_invariants - even_invariants_2)))

0.0

This also works in the other direction:

initial_pca = IndividualLambdaPCAsBoth()
block_1 = StandardBlock(ThresholdExpansioner(100), None,
                        IndividualLambdaPCAsBoth(20))
block_2 = StandardBlock(None, None, None,
                        ThresholdExpansioner(100, mode='invariants'))

print(initial_pca.is_fitted())
print(block_1.is_fitted())
print(block_2.is_fitted())

nice = StandardSequence(initial_pca=initial_pca, blocks=[block_1, block_2])
nice.fit(coefficients)

print(initial_pca.is_fitted())
print(block_1.is_fitted())
print(block_2.is_fitted())

False
False
False
True
True
True
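
Since nice.fit fitted all three parts together in a single consistent pass, it is now safe to reuse them, for example to assemble another model with the flag from above:

nice_2 = StandardSequence(initial_pca=initial_pca,
                          blocks=[block_1, block_2],
                          guaranteed_parts_fitted_consistently=True)
print(nice_2.is_fitted())  # True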

StandardBlock behaves the same way:

expansioner, pca = ThresholdExpansioner(100), IndividualLambdaPCAsBoth(20)
print(expansioner.is_fitted())
print(pca.is_fitted())

block = StandardBlock(expansioner, None, pca)
block.fit(even_0_t, odd_0_t, even_0_t, odd_0_t)

print(expansioner.is_fitted())
print(pca.is_fitted())

False
False
True
True

expansioner, pca = ThresholdExpansioner(100), IndividualLambdaPCAsBoth(20)
expansioner.fit(even_0_t, odd_0_t, even_0_t, odd_0_t)
even_1, odd_1 = expansioner.transform(even_0_t, odd_0_t, even_0_t, odd_0_t)
pca.fit(even_1, odd_1)

block = StandardBlock(expansioner,
                      None,
                      pca,
                      guaranteed_parts_fitted_consistently=True)

print(block.is_fitted())

True

There is another group of blocks, such as purifiers, that accept regressor instances, e.g. sklearn.linear_model.Ridge, at initialization. In their case, several distinct regressors need to be applied separately for each lambda channel and parity. Thus, the input regressor is cloned, and the initial instance is not touched in any way. So, the material of this tutorial does not apply to purifiers.
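
The cloning follows standard sklearn semantics; a minimal illustration with plain sklearn (independent of NICE):

from sklearn.base import clone
from sklearn.linear_model import Ridge

original = Ridge(alpha=1.0)
copy = clone(original)  # a fresh, unfitted estimator with the same hyperparameters
print(copy is original)  # False: fitting the clone never touches the original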