Sequential fitting
It is not always clear how to select good hyperparameters for calculations. The second tutorial, “Getting insights about the model,” showed how to plot PCA spectra for all lambda channels and parities. This information, along with other indicators such as regression accuracy, can help to select better hypers. In particular, the most straightforward approach is to choose the number of PCA components so that they cover most of the variance, and to do this successively, block by block.
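As a minimal illustration of this criterion (plain scikit-learn, not the NICE API; the features array and the 0.99 threshold below are placeholders), one could pick the number of components from the cumulative explained variance:
import numpy as np
from sklearn.decomposition import PCA
def n_components_for_variance(features, threshold=0.99):
    # smallest number of components whose cumulative explained
    # variance reaches the given threshold
    cumulative = np.cumsum(PCA().fit(features).explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold)) + 1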
In this case, it is very undesirable to refit all parts of the model from scratch, including the unchanged ones. One possible workaround is to do everything by hand, as described in the tutorial “Constructor or non standard_sequence,” but that brings the additional headache of packing the resulting blocks into a single model with a convenient .transform method. The NICE toolbox can do this very succinctly.
First of all, we need to get the spherical expansion coefficients in the same way as in the previous tutorials:
Preliminaries
# downloading dataset from https://archive.materialscloud.org/record/2020.110
!wget "https://archive.materialscloud.org/record/file?file_id=b612d8e3-58af-4374-96ba-b3551ac5d2f4&filename=methane.extxyz.gz&record_id=528" -O methane.extxyz.gz
!gunzip -k methane.extxyz.gz
import numpy as np
import ase.io
import tqdm
from nice.blocks import *
from nice.utilities import *
from matplotlib import pyplot as plt
from sklearn.linear_model import BayesianRidge
structures = ase.io.read('methane.extxyz', index='0:1000')
HYPERS = {
'interaction_cutoff': 6.3,
'max_radial': 5,
'max_angular': 5,
'gaussian_sigma_type': 'Constant',
'gaussian_sigma_constant': 0.05,
'cutoff_smooth_width': 0.3,
'radial_basis': 'GTO'
}
all_species = get_all_species(structures)
coefficients = get_spherical_expansion(structures, HYPERS, all_species)
coefficients = coefficients[1]
The coefficients are now the spherical expansion coefficients for H-centered environments:
print(coefficients.shape)
Let’s do the first steps of the standard sequence:
even_0, odd_0 = InitialTransformer().transform(coefficients)
initial_pca = IndividualLambdaPCAsBoth()
initial_pca.fit(even_0, odd_0)
even_0_t, odd_0_t = initial_pca.transform(even_0, odd_0)
Now we can fit a couple of standard blocks. The first three positional arguments of StandardBlock are the expansioner, purifier, and PCA for covariants; the next three are the same for invariants:
block_1 = StandardBlock(ThresholdExpansioner(100), None,
IndividualLambdaPCAsBoth(20))
block_1.fit(even_0_t, odd_0_t, even_0_t, odd_0_t)
even_1, odd_1, _ = block_1.transform(even_0_t, odd_0_t, even_0_t, odd_0_t)
block_2 = StandardBlock(None, None, None,
ThresholdExpansioner(100, mode='invariants'))
block_2.fit(even_1, odd_1, even_0_t, odd_0_t)
_, _, even_invariants = block_2.transform(even_1, odd_1, even_0_t, odd_0_t)
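The resulting even_invariants should be an ordinary 2D numpy array with environments along the first axis and features along the second (the consistency check below treats it as such), so it can be inspected directly:
print(even_invariants.shape)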
At this moment, we have all the parts of this standard sequence fitted:
nice = StandardSequence(initial_pca=initial_pca, blocks=[block_1, block_2])
print(initial_pca.is_fitted())
print(block_1.is_fitted())
print(block_2.is_fitted())
What about the full model?
print(nice.is_fitted())
Nope.
At this point, there is a high probability of making a mistake. In particular, one could feed StandardSequence a fitted initial_pca along with blocks that were fitted with a different initial_pca, a different initial_normalizer, or even on different data. To prevent this, it is required to pass an additional flag, guaranteed_parts_fitted_consistently = True, to the model:
nice = StandardSequence(initial_pca=initial_pca,
blocks=[block_1, block_2],
guaranteed_parts_fitted_consistently=True)
print(nice.is_fitted())
The model is considered fitted if 1) all its parts are fitted and 2) guaranteed_parts_fitted_consistently is set to True.
Golden rule: every time you pass guaranteed_parts_fitted_consistently = True, pause and think twice.
Let’s check consistency:
even_invariants_2 = nice.transform(coefficients,
return_only_invariants=True)[3]
print(np.sum(np.abs(even_invariants - even_invariants_2)))
The difference should be exactly zero. This also works in the other direction:
initial_pca = IndividualLambdaPCAsBoth()
block_1 = StandardBlock(ThresholdExpansioner(100), None,
IndividualLambdaPCAsBoth(20))
block_2 = StandardBlock(None, None, None,
ThresholdExpansioner(100, mode='invariants'))
print(initial_pca.is_fitted())
print(block_1.is_fitted())
print(block_2.is_fitted())
nice = StandardSequence(initial_pca=initial_pca, blocks=[block_1, block_2])
nice.fit(coefficients)
print(initial_pca.is_fitted())
print(block_1.is_fitted())
print(block_2.is_fitted())
StandardBlock behaves the same way:
expansioner, pca = ThresholdExpansioner(100), IndividualLambdaPCAsBoth(20)
print(expansioner.is_fitted())
print(pca.is_fitted())
block = StandardBlock(expansioner, None, pca)
block.fit(even_0_t, odd_0_t, even_0_t, odd_0_t)
print(expansioner.is_fitted())
print(pca.is_fitted())
expansioner, pca = ThresholdExpansioner(100), IndividualLambdaPCAsBoth(20)
expansioner.fit(even_0_t, odd_0_t, even_0_t, odd_0_t)
even_1, odd_1 = expansioner.transform(even_0_t, odd_0_t, even_0_t, odd_0_t)
pca.fit(even_1, odd_1)
block = StandardBlock(expansioner,
None,
pca,
guaranteed_parts_fitted_consistently=True)
print(block.is_fitted())
There is another group of blocks, such as purifiers, that accept regressor instances (e.g., sklearn.linear_model.Ridge) at initialization. In their case, several distinct regressors have to be applied separately to each lambda channel and parity. Thus, the input regressor is cloned, and the initial instance is not touched in any way. As a consequence, the material of this tutorial does not apply to purifiers.
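A minimal sketch of this cloning behaviour, using plain scikit-learn rather than the NICE internals: the instance passed in stays unfitted, and only its clones get fitted.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
regressor = Ridge(alpha=1.0)                # instance passed at initialization
X, y = np.random.rand(10, 3), np.random.rand(10)
fitted_clone = clone(regressor).fit(X, y)   # a fresh, independent copy gets fitted
print(hasattr(regressor, 'coef_'))          # False: the original stays untouched
print(hasattr(fitted_clone, 'coef_'))       # True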