Custom regressors into purifiers

As was already mentioned in the first tutorial, purifiers can accept arbitrary linear regressors from sklearn.linear_model. To feed a purifier a custom linear regressor, two requirements must be fulfilled. First, it should expose the same interface as the linear regressors from sklearn, namely the fit and predict methods. Second, it should satisfy the sklearn conventions that make it possible to clone it with the sklearn.base.clone function. This tutorial shows an example of such a class.
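These two requirements can be illustrated with a minimal sketch. The ConstantRegressor class below is hypothetical and exists only to show the smallest interface that both fits/predicts and survives sklearn.base.clone:

```python
import numpy as np
from sklearn.base import clone


class ConstantRegressor:
    """Hypothetical minimal regressor: always predicts the training mean."""

    def fit(self, X, y):
        # the only "learned" state: the mean of the targets
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

    # get_params and set_params are what sklearn.base.clone requires
    def get_params(self, deep=True):
        return {}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


regressor = ConstantRegressor().fit(np.zeros((4, 2)), np.array([1.0, 2.0, 3.0, 4.0]))
# clone succeeds because get_params/set_params exist;
# it returns a fresh, unfitted instance
copy = clone(regressor)
```

Note that clone returns an unfitted copy: it reconstructs the object from get_params alone, so fitted attributes (with the trailing underscore, like mean_) are not carried over.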

As before, let’s calculate spherical expansion coefficients for H environments:

Preliminaries

# downloading dataset from https://archive.materialscloud.org/record/2020.110

!wget "https://archive.materialscloud.org/record/file?file_id=b612d8e3-58af-4374-96ba-b3551ac5d2f4&filename=methane.extxyz.gz&record_id=528" -O methane.extxyz.gz
!gunzip -k methane.extxyz.gz

import numpy as np
import ase.io
import tqdm
from nice.blocks import *
from nice.utilities import *
from matplotlib import pyplot as plt
from sklearn.linear_model import BayesianRidge

structures = ase.io.read('methane.extxyz', index='0:1000')

HYPERS = {
    'interaction_cutoff': 6.3,
    'max_radial': 5,
    'max_angular': 5,
    'gaussian_sigma_type': 'Constant',
    'gaussian_sigma_constant': 0.05,
    'cutoff_smooth_width': 0.3,
    'radial_basis': 'GTO'
}

all_species = get_all_species(structures)

coefficients = get_spherical_expansion(structures, HYPERS, all_species)
coefficients = coefficients[1]  # select H environments (atomic number 1)



Our custom class looks like this:

from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Ridge


class AdaptiveRidge:
    def __init__(self):
        pass

    def fit(self, X, y):
        # select the regularization strength with the lowest
        # cross-validation error on the training data
        minimum = None
        self.best_alpha_ = None
        for alpha in np.logspace(-25, 10, 300):
            regressor = Ridge(alpha=alpha, fit_intercept=False)
            predictions = cross_val_predict(regressor, X, y)
            now = np.mean((predictions - y)**2)
            if (minimum is None) or (now < minimum):
                minimum = now
                self.best_alpha_ = alpha

        # refit on the full training data with the selected alpha
        self.ridge_ = Ridge(alpha=self.best_alpha_, fit_intercept=False)
        self.ridge_.fit(X, y)
        return self

    def predict(self, X):
        return self.ridge_.predict(X)

    # get_params and set_params are required by sklearn.base.clone
    def get_params(self, deep=True):
        return {}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

During fitting, it estimates the best value of the regularization strength by cross-validation on the training data. The additional methods get_params and set_params are required by the sklearn.base.clone function; more details are given in the sklearn developer documentation (only the cloning section is necessary reading).
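Before plugging such a class into a purifier, it can be checked in isolation. The snippet below repeats the class definition so it runs on its own (with a smaller alpha grid than above, to keep it fast) and verifies both the alpha selection on synthetic data and that sklearn.base.clone accepts it:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict


class AdaptiveRidge:
    def fit(self, X, y):
        # pick the alpha with the lowest cross-validation error
        minimum = None
        self.best_alpha_ = None
        for alpha in np.logspace(-5, 5, 50):
            regressor = Ridge(alpha=alpha, fit_intercept=False)
            predictions = cross_val_predict(regressor, X, y)
            now = np.mean((predictions - y)**2)
            if (minimum is None) or (now < minimum):
                minimum = now
                self.best_alpha_ = alpha
        # refit on the full training data with the selected alpha
        self.ridge_ = Ridge(alpha=self.best_alpha_, fit_intercept=False)
        self.ridge_.fit(X, y)
        return self

    def predict(self, X):
        return self.ridge_.predict(X)

    def get_params(self, deep=True):
        return {}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


# synthetic linear data with a little noise
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)

model = AdaptiveRidge().fit(X, y)
print(model.best_alpha_)      # the grid point with the lowest CV error
clone(model)                  # succeeds; returns a fresh, unfitted instance
```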

Let’s use it:

import warnings

from scipy.linalg import LinAlgWarning

nice = StandardSequence([
    StandardBlock(ThresholdExpansioner(50), None, IndividualLambdaPCAsBoth(20),
                  ThresholdExpansioner(50, mode='invariants'), None, None),
    StandardBlock(
        ThresholdExpansioner(50),
        CovariantsPurifierBoth(regressor=AdaptiveRidge(), max_take=20),
        IndividualLambdaPCAsBoth(10),
        ThresholdExpansioner(50, mode='invariants'),
        InvariantsPurifier(regressor=AdaptiveRidge(), max_take=20),
        InvariantsPCA(20)),
])

with warnings.catch_warnings():
    # a lot of ill conditioned matrices with super small alpha
    warnings.filterwarnings("ignore", category=LinAlgWarning)
    nice.fit(coefficients)

res = nice.transform(coefficients)

It is possible to access the best alpha parameters for all parities and lambda channels in the final model:

(convenient getters might be added in the next version of NICE)

for lambd in range(6):
    if (nice.blocks_[1].covariants_purifier_.even_purifier_.purifiers_[lambd]):
        print("parity: even; lambda: {}; best alpha: {}".format(
            lambd, nice.blocks_[1].covariants_purifier_.even_purifier_.
            purifiers_[lambd].regressor_.best_alpha_))
    if (nice.blocks_[1].covariants_purifier_.odd_purifier_.purifiers_[lambd]):
        print("parity: odd; lambda: {}; best alpha: {}".format(
            lambd, nice.blocks_[1].covariants_purifier_.odd_purifier_.
            purifiers_[lambd].regressor_.best_alpha_))

parity: even; lambda: 0; best alpha: 1.5996073018614912e-19
parity: even; lambda: 1; best alpha: 3.1744774091092e-20
parity: odd; lambda: 1; best alpha: 2.0944511431514688e-19
parity: even; lambda: 2; best alpha: 3.1744774091092e-20
parity: odd; lambda: 2; best alpha: 1e-25
parity: even; lambda: 3; best alpha: 2.4244620170823406e-20
parity: odd; lambda: 3; best alpha: 2.7423765732649412e-19
parity: even; lambda: 4; best alpha: 2.4244620170823406e-20
parity: odd; lambda: 4; best alpha: 1.2216773489967981e-19
parity: even; lambda: 5; best alpha: 1e-25
parity: odd; lambda: 5; best alpha: 1e-25

The same for InvariantsPurifier:

print("best alpha of invariants purifier: ",
      nice.blocks_[1].invariants_purifier_.regressor_.best_alpha_)

best alpha of invariants purifier: 1.381873305653628e-18