{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Constructing a machine learning potential"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need a dataset for fitting. A good example\n",
    "is [this methane dataset](https://archive.materialscloud.org/record/2020.110):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2022-11-04 12:25:01--  https://archive.materialscloud.org/record/file?file_id=b612d8e3-58af-4374-96ba-b3551ac5d2f4&filename=methane.extxyz.gz&record_id=528\n",
      "Resolving archive.materialscloud.org (archive.materialscloud.org)... 148.187.149.49\n",
      "Connecting to archive.materialscloud.org (archive.materialscloud.org)|148.187.149.49|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 FOUND\n",
      "Location: https://object.cscs.ch/archive/b6/12/d8e3-58af-4374-96ba-b3551ac5d2f4/data?response-content-type=application%2Foctet-stream&response-content-disposition=attachment%3B%20filename%3Dmethane.extxyz.gz&Signature=%2BxLnAvJ4CwQ4JY8hbo7MwpILPco%3D&AWSAccessKeyId=f30fe0bddb114e91abe6adf3d36c6f2e&Expires=1667561161 [following]\n",
      "--2022-11-04 12:25:01--  https://object.cscs.ch/archive/b6/12/d8e3-58af-4374-96ba-b3551ac5d2f4/data?response-content-type=application%2Foctet-stream&response-content-disposition=attachment%3B%20filename%3Dmethane.extxyz.gz&Signature=%2BxLnAvJ4CwQ4JY8hbo7MwpILPco%3D&AWSAccessKeyId=f30fe0bddb114e91abe6adf3d36c6f2e&Expires=1667561161\n",
      "Resolving object.cscs.ch (object.cscs.ch)... 148.187.25.204, 148.187.25.200, 148.187.25.202, ...\n",
      "Connecting to object.cscs.ch (object.cscs.ch)|148.187.25.204|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1218139661 (1.1G) [application/octet-stream]\n",
      "Saving to: ‘methane.extxyz.gz’\n",
      "\n",
      "methane.extxyz.gz   100%[===================>]   1.13G   223MB/s    in 5.7s    \n",
      "\n",
      "2022-11-04 12:25:07 (205 MB/s) - ‘methane.extxyz.gz’ saved [1218139661/1218139661]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# downloading dataset from https://archive.materialscloud.org/record/2020.110\n",
    "\n",
    "!wget \"https://archive.materialscloud.org/record/file?file_id=b612d8e3-58af-4374-96ba-b3551ac5d2f4&filename=methane.extxyz.gz&record_id=528\" -O methane.extxyz.gz\n",
    "!gunzip -k methane.extxyz.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import ase.io\n",
    "import tqdm\n",
    "from nice.blocks import *\n",
    "from nice.utilities import *\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.linear_model import BayesianRidge\n",
    "\n",
    "HARTREE_TO_EV = 27.211386245988"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The methane dataset contains a very large number of structures. It is therefore a good idea to select a smaller subset to speed up the calculations. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_subset = \"0:10000\"  #input for ase.io.read command\n",
    "test_subset = \"10000:15000\"  #input to ase.io.read command"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two of the three steps of NICE require data to be fitted. In the PCA step, atomic environments are used to determine the matrix of a linear transformation that preserves as much information as possible for **this particular dataset**. In the purifiers, the eliminated correlations are also dataset-specific. However, it is not necessary to use the same amount of data to fit the NICE transformer and to fit the subsequent machine learning model. Typically, the NICE transformer needs less data to be fitted. In addition, the fitting process requires a noticeable amount of RAM. Thus, it is a good idea to restrict the amount of data for this step, which is controlled by the ``environments_for_fitting`` variable. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "environments_for_fitting = 1000  #number of environments to fit nice transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we must define the hyperparameters for our representation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "``grid`` defines the training set sizes for which the error is estimated, in order to see how the quality of the model depends on the number of training configurations. \n",
    "(For the first few points, the NICE transformer is fitted on more data than the regression model itself, but this is just a tutorial.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "grid = [150, 200, 350, 500, 750, 1000, 1500, 2000, 3000, 5000, 7500,\n",
    "        10000]  #for learning curve"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We must also define the parameters for the initial spherical expansion. For more details, we refer the reader to the [librascal](https://github.com/cosmo-epfl/librascal) documentation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "#HYPERS for librascal spherical expansion coefficients\n",
    "HYPERS = {\n",
    "    'interaction_cutoff': 6.3,\n",
    "    'max_radial': 5,\n",
    "    'max_angular': 5,\n",
    "    'gaussian_sigma_type': 'Constant',\n",
    "    'gaussian_sigma_constant': 0.05,\n",
    "    'cutoff_smooth_width': 0.3,\n",
    "    'radial_basis': 'GTO'\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Blocks and Sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each NICE model is a sequence of standard transformations. In NICE, each transformation is called a ``StandardBlock`` and a sequence of blocks is a ``StandardSequence``."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A ``StandardBlock`` takes 6 inputs: an expansioner, a purifier, and a compressor for each of the covariant and invariant branches. For more explanation of the theory behind this, please see [the theory section](https://lab-cosmo.github.io/nice/theory.html). \n",
    "\n",
    "**Scaling** Let's imagine uniform multiplication of the spherical expansion coefficients by some constant k. In this case, covariants would change as `*= k ^(body order)`. In other words, the relative scale of different body orders would change. This might affect the subsequent regression, so it is a good idea to fix the scale in a proper way. This is done by the ``initial_scaler`` (an ``InitialScaler`` instance). It has two modes: ``signal integral`` and ``variance``. In the first mode, it scales the coefficients so that the integral of the squared corresponding signal over the ball is one. In the second mode, it ensures that the variance of the coefficient entries is one. In practice, the first mode gives better results. The second parameter of this class, ``individually``, determines whether coefficients are scaled separately for each environment or globally, the latter preserving information about the scale of signals in relation to each other. "
   ]
  },
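  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked illustration of the scaling argument (a sketch; the symbols $c$, $f^{(\\nu)}$ and $\\nu$ are introduced here for illustration and are not names from the NICE code): ignoring purification and any centering, features of body order $\\nu$ are homogeneous polynomials of degree $\\nu$ in the spherical expansion coefficients $c$, so multiplying the coefficients by $k$ gives\n",
    "\n",
    "$$f^{(\\nu)}(k\\,c) = k^{\\nu} f^{(\\nu)}(c).$$\n",
    "\n",
    "For example, with $k = 2$ the $\\nu = 1$ features double while the $\\nu = 3$ features grow by a factor of $8$, so the ratio between the two body orders changes by a factor of $4$."
   ]
  },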
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Expansioners\n",
    "\n",
    "During the expansion step in each block, features of the next body order are produced by Clebsch-Gordan iteration between features from the previous block and spherical expansion coefficients after ``initial_pca``. \n",
    "\n",
    "Full expansion (expanding the coefficients for each pair of input covariant vectors) results in an untenably large number of features; thus, we typically only store the most important coefficients. In a standard sequence, these importances are just the explained variance ratios after the PCA step. We then choose the ``num_expand`` most important features. If ``num_expand`` is not specified (or set to ``None``), we keep all coefficients. \n",
    "\n",
    "In NICE, we invoke this with ``ThresholdExpansioner``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<nice.blocks.expansioners.ThresholdExpansioner at 0x7f84a2db9790>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ThresholdExpansioner(num_expand=150)"
   ]
  },
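  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the thresholding idea concrete, here is a minimal NumPy sketch. It is not the NICE implementation: the helper ``select_pairs`` is hypothetical, and scoring a pair by the product of the importances of its two inputs is an assumption made for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def select_pairs(importances_a, importances_b, num_expand=None):\n",
    "    # Hypothetical helper, not part of the NICE API.\n",
    "    # Assumption: a pair is scored by the product of its two importances.\n",
    "    scores = np.outer(importances_a, importances_b)\n",
    "    # Sort all pairs from the highest score to the lowest.\n",
    "    order = np.argsort(scores, axis=None)[::-1]\n",
    "    if num_expand is not None:\n",
    "        order = order[:num_expand]\n",
    "    # Convert flat positions back to (index_a, index_b) pairs.\n",
    "    rows, cols = np.unravel_index(order, scores.shape)\n",
    "    return list(zip(rows.tolist(), cols.tolist()))\n",
    "\n",
    "\n",
    "pairs = select_pairs([0.6, 0.3, 0.1], [0.7, 0.2, 0.1], num_expand=3)\n",
    "print(pairs)\n",
    "```\n",
    "\n",
    "With these toy importances, the three selected pairs are ``(0, 0)``, ``(1, 0)`` and ``(0, 1)``; passing ``num_expand=None`` keeps all nine pairs."
   ]
  },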
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Purifiers\n",
    "\n",
    "In a NICE purifier, the parameter ``max_take`` controls the number of features taken from previous body orders for purification. (Features are always stored in descending order of importance, so the most important ones are used first.) If ``max_take`` is not specified (``None``), all available features are used. \n",
    "\n",
    "\n",
    "An additional parameter is the linear regressor to use. For example: \n",
    "\n",
    "``` python\n",
    "\n",
    "from sklearn.linear_model import Ridge\n",
    "CovariantsPurifierBoth(regressor = Ridge(alpha = 42, fit_intercept = False), max_take = 10)\n",
    "\n",
    "```\n",
    "\n",
    "or \n",
    "\n",
    "``` python\n",
    "\n",
    "from sklearn.linear_model import Lars\n",
    "InvariantsPurifier(regressor = Lars(n_nonzero_coefs = 7), max_take = 10)\n",
    "\n",
    "```\n",
    "\n",
    "The default is ```Ridge(alpha = 1e-12)```, without fitting an intercept for the covariants purifier and with fitting an intercept for the invariants purifier. \n",
    "\n",
    "***Important!*** Always pass ``fit_intercept = False`` to the regressor in a covariants purifier. Otherwise, the resulting features would no longer be covariants. \n",
    "\n",
    "Custom regressors can be fed into purifiers. For more details, see the tutorial [Custom regressors into purifiers](https://lab-cosmo.github.io/nice/custom_regressors_into_purifiers.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<nice.blocks.purifiers.CovariantsPurifierBoth at 0x7f84a2db9d30>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "CovariantsPurifierBoth(max_take=10)"
   ]
  },
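  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following toy example sketches what purification means: the part of the new features that is linearly predictable from lower body order features is removed, and only the residual is kept. This is a rough illustration on random data, not the NICE implementation; it uses an unregularized least-squares fit from NumPy in place of the ``Ridge`` regressor, and all variable names are made up for this example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "# Toy data: 200 environments, 5 lower-order features, 3 new features.\n",
    "old_features = rng.normal(size=(200, 5))\n",
    "mixing = rng.normal(size=(5, 3))\n",
    "new_features = old_features @ mixing + 0.1 * rng.normal(size=(200, 3))\n",
    "\n",
    "# Linear fit without an intercept, as required for covariants.\n",
    "weights, *_ = np.linalg.lstsq(old_features, new_features, rcond=None)\n",
    "# The purified features are the residual of that fit.\n",
    "purified = new_features - old_features @ weights\n",
    "\n",
    "# The residual has no linear overlap left with the old features.\n",
    "print(np.abs(old_features.T @ purified).max())\n",
    "```\n",
    "\n",
    "The printed overlap is zero up to floating point error, which is the sense in which the purified features carry only information that the lower body orders do not already contain linearly."
   ]
  },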
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compressors\n",
    "\n",
    "A compressor is like a PCA step: we group the majority of the variance into a small subset of components.\n",
    "\n",
    "The ``n_components`` parameter sets the number of output features. If it is not specified (``None``), full PCA is performed. \n",
    "\n",
    "\"Both\" in the class names indicates that the transformations are applied simultaneously to even and odd features (more details are in the tutorials \"Calculating covariants\" (what even and odd features are) and \"Constructor or non-standard sequence\" (classes that work without this separation)). \n",
    "\n",
    "\"Individual\" in ``IndividualLambdaPCAsBoth`` indicates that the transformations are independent for each lambda channel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<nice.blocks.compressors.IndividualLambdaPCAsBoth at 0x7f84a2db9730>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "IndividualLambdaPCAsBoth(n_components=50)"
   ]
  },
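  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder of what a PCA compression with ``n_components`` does, here is a small textbook PCA written with NumPy. It is only a sketch on random data: the helper ``fit_pca`` is hypothetical, and details such as the centering step are assumptions that may differ from what the NICE compressors do internally. The explained variance ratios it returns play the role of the importances mentioned above.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def fit_pca(features, n_components=None):\n",
    "    # Hypothetical helper, not the NICE implementation.\n",
    "    centered = features - features.mean(axis=0)\n",
    "    # Rows of `components` are principal directions, sorted by variance.\n",
    "    _, singular_values, components = np.linalg.svd(centered,\n",
    "                                                   full_matrices=False)\n",
    "    explained = singular_values**2 / np.sum(singular_values**2)\n",
    "    if n_components is not None:\n",
    "        components = components[:n_components]\n",
    "        explained = explained[:n_components]\n",
    "    return components, explained\n",
    "\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "# 500 samples, 6 features with strongly decreasing scales.\n",
    "scales = np.array([10.0, 5.0, 1.0, 0.5, 0.1, 0.05])\n",
    "features = rng.normal(size=(500, 6)) * scales\n",
    "components, explained = fit_pca(features, n_components=2)\n",
    "compressed = (features - features.mean(axis=0)) @ components.T\n",
    "print(compressed.shape, explained)\n",
    "```\n",
    "\n",
    "Here two components are enough to keep most of the variance, because the toy features were constructed with very different scales."
   ]
  },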
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Putting it all Together"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, the parameters of the covariant and invariant branches (such as ``num_expand`` in the expansioners) are similar, but in real-life calculations they usually differ dramatically (see the examples folder). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is not necessary to always fill all the transformation steps. For example, the following block is valid:\n",
    "\n",
    "``` python\n",
    "StandardBlock(ThresholdExpansioner(num_expand = 150),\n",
    "              None, \n",
    "              IndividualLambdaPCAsBoth(n_components = 50), \n",
    "              ThresholdExpansioner(num_expand =300, mode = 'invariants'), \n",
    "              InvariantsPurifier(max_take = 50), \n",
    "              None)\n",
    "``` \n",
    "  \n",
    "In this case, the purifying step in the covariants branch and the PCA step in the invariants branch are omitted. The covariants and invariants branches are independent. An invalid combination, such as \n",
    "\n",
    "``` python\n",
    "StandardBlock(None, \n",
    "              None,\n",
    "              IndividualLambdaPCAsBoth(n_components = 50), \n",
    "              ...)\n",
    "```             \n",
    "\n",
    "raises a value error with a description of the problem during initialization.\n",
    "\n",
    "All intermediate blocks must compute covariants. A block is considered to compute covariants if it contains a covariant expansion and a covariant PCA. The latter is required because expansioners in subsequent blocks need not only the covariants themselves but also their **importances** for thresholding. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "#our model:\n",
    "def get_nice():\n",
    "    return StandardSequence([\n",
    "        StandardBlock(ThresholdExpansioner(num_expand=150),\n",
    "                      CovariantsPurifierBoth(max_take=10),\n",
    "                      IndividualLambdaPCAsBoth(n_components=50),\n",
    "                      ThresholdExpansioner(num_expand=300, mode='invariants'),\n",
    "                      InvariantsPurifier(max_take=50),\n",
    "                      InvariantsPCA(n_components=200)),\n",
    "        StandardBlock(ThresholdExpansioner(num_expand=150),\n",
    "                      CovariantsPurifierBoth(max_take=10),\n",
    "                      IndividualLambdaPCAsBoth(n_components=50),\n",
    "                      ThresholdExpansioner(num_expand=300, mode='invariants'),\n",
    "                      InvariantsPurifier(max_take=50),\n",
    "                      InvariantsPCA(n_components=200)),\n",
    "        StandardBlock(None, None, None,\n",
    "                      ThresholdExpansioner(num_expand=300, mode='invariants'),\n",
    "                      InvariantsPurifier(max_take=50),\n",
    "                      InvariantsPCA(n_components=200))\n",
    "    ],\n",
    "                            initial_scaler=InitialScaler(\n",
    "                                mode='signal integral', individually=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this cell, we read the structures, get a set of all the species in the dataset, and calculate the spherical expansion. \n",
    "\n",
    "``all_species`` is a numpy array with ints, where 1 is H, 2 is He, and so on. \n",
    "\n",
    "The coefficients (``train_coefficients`` and ``test_coefficients``) are dictionaries where the keys are central species, 1 and 6 in our case, and the entries are numpy arrays with shape ``(environment_index, radial/specie index, l, m)``. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "all species:  [1 6]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 100/100 [00:00<00:00, 105.94it/s]\n",
      "100%|██████████| 2/2 [00:00<00:00, 29.91it/s]\n",
      "100%|██████████| 50/50 [00:00<00:00, 81.61it/s] \n",
      "100%|██████████| 2/2 [00:00<00:00, 60.53it/s]\n"
     ]
    }
   ],
   "source": [
    "train_structures = ase.io.read('methane.extxyz', index=train_subset)\n",
    "\n",
    "test_structures = ase.io.read('methane.extxyz', index=test_subset)\n",
    "\n",
    "all_species = get_all_species(train_structures + test_structures)\n",
    "print(\"all species: \", all_species)\n",
    "train_coefficients = get_spherical_expansion(train_structures, HYPERS,\n",
    "                                             all_species)\n",
    "\n",
    "test_coefficients = get_spherical_expansion(test_structures, HYPERS,\n",
    "                                            all_species)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to fit two NICE transformers on environments around both the H and C atoms separately.\n",
    "The following cells create them and perform the fitting: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "#individual nice transformers for each atomic specie in the dataset\n",
    "nice = {}\n",
    "for key in train_coefficients.keys():\n",
    "    nice[key] = get_nice()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/pozdn/.local/lib/python3.8/site-packages/nice/blocks/compressors.py:216: UserWarning: Amount of provided data is less than the desired one to fit PCA. Number of components is 200, desired number of environments is 2000, actual number of environments is 1000.\n",
      "  warnings.warn((\"Amount of provided data is less \"\n",
      "/home/pozdn/.local/lib/python3.8/site-packages/nice/blocks/compressors.py:216: UserWarning: Amount of provided data is less than the desired one to fit PCA. Number of components is 200, desired number of environments is 2000, actual number of environments is 1000.\n",
      "  warnings.warn((\"Amount of provided data is less \"\n",
      "/home/pozdn/.local/lib/python3.8/site-packages/nice/blocks/compressors.py:216: UserWarning: Amount of provided data is less than the desired one to fit PCA. Number of components is 200, desired number of environments is 2000, actual number of environments is 1000.\n",
      "  warnings.warn((\"Amount of provided data is less \"\n",
      "/home/pozdn/.local/lib/python3.8/site-packages/nice/blocks/compressors.py:216: UserWarning: Amount of provided data is less than the desired one to fit PCA. Number of components is 200, desired number of environments is 2000, actual number of environments is 1000.\n",
      "  warnings.warn((\"Amount of provided data is less \"\n",
      "/home/pozdn/.local/lib/python3.8/site-packages/nice/blocks/compressors.py:216: UserWarning: Amount of provided data is less than the desired one to fit PCA. Number of components is 200, desired number of environments is 2000, actual number of environments is 1000.\n",
      "  warnings.warn((\"Amount of provided data is less \"\n",
      "/home/pozdn/.local/lib/python3.8/site-packages/nice/blocks/compressors.py:216: UserWarning: Amount of provided data is less than the desired one to fit PCA. Number of components is 200, desired number of environments is 2000, actual number of environments is 1000.\n",
      "  warnings.warn((\"Amount of provided data is less \"\n"
     ]
    }
   ],
   "source": [
    "for key in train_coefficients.keys():\n",
    "    nice[key].fit(train_coefficients[key][:environments_for_fitting])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is not necessary to fit a separate NICE transformer for each central specie; see, for example, the QM9 examples in the examples folder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's calculate the representations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_features = {}\n",
    "for specie in all_species:\n",
    "    train_features[specie] = nice[specie].transform(\n",
    "        train_coefficients[specie], return_only_invariants=True)\n",
    "\n",
    "test_features = {}\n",
    "for specie in all_species:\n",
    "    test_features[specie] = nice[specie].transform(test_coefficients[specie],\n",
    "                                                   return_only_invariants=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a nested dictionary. The first level keys are central species, and the inner level keys are body orders. Inside are **numpy arrays** with shapes ``(environment_index, invariant_index)``.\n",
    "\n",
    "In this case, the number of training structures is 10k, and each structure contains 4 H atoms. Thus, the total number of H-centered environments is 40k:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 : (40000, 10)\n",
      "2 : (40000, 200)\n",
      "3 : (40000, 200)\n",
      "4 : (40000, 200)\n"
     ]
    }
   ],
   "source": [
    "for key in train_features[1].keys():\n",
    "    print(\"{} : {}\".format(key, train_features[1][key].shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to prepare for the subsequent linear regression. As discussed in the theory section, energy is an extensive property, and thus it is given as a sum of atomic contributions. \n",
    "Each atomic contribution depends on 1) the central specie and 2) the environment. Thus, if each atomic contribution is given by a linear combination of the previously calculated NICE features, the structural features should have the following form: for each structure, the set of features is a concatenation of representations for each specie. The representation for each specie is a sum of the NICE representations over the atoms of this specie in the structure. \n",
    "\n",
    "In our case, the representation of each environment has a size of 200 + 200 + 200 + 10 = 610, and we have two atomic species, H and C. Thus, the shape of the structural features should be ``(number_of_structures, 610 * 2 = 1220)``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10000/10000 [00:01<00:00, 8538.91it/s]\n",
      "100%|██████████| 5000/5000 [00:00<00:00, 9403.23it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(10000, 1220)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "train_features = make_structural_features(train_features, train_structures,\n",
    "                                          all_species)\n",
    "test_features = make_structural_features(test_features, test_structures,\n",
    "                                         all_species)\n",
    "print(train_features.shape)"
   ]
  },
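  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea behind the structural features can be sketched with plain NumPy. This is a toy illustration under simplifying assumptions, not the code of ``make_structural_features``: the helper ``sum_over_structures`` and all array names are hypothetical, and each specie has a single small feature block instead of several body orders.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def sum_over_structures(env_features, structure_index, n_structures):\n",
    "    # Hypothetical helper: adds up the rows that belong to one structure.\n",
    "    summed = np.zeros((n_structures, env_features.shape[1]))\n",
    "    np.add.at(summed, structure_index, env_features)\n",
    "    return summed\n",
    "\n",
    "\n",
    "n_structures = 3\n",
    "# Toy methane-like data: 4 H and 1 C environments per structure,\n",
    "# with 2 features per environment.\n",
    "h_features = np.ones((12, 2))\n",
    "h_owner = np.repeat(np.arange(n_structures), 4)\n",
    "c_features = 5.0 * np.ones((3, 2))\n",
    "c_owner = np.arange(n_structures)\n",
    "\n",
    "# Sum within each specie, then concatenate the species side by side.\n",
    "structural = np.concatenate([\n",
    "    sum_over_structures(h_features, h_owner, n_structures),\n",
    "    sum_over_structures(c_features, c_owner, n_structures),\n",
    "], axis=1)\n",
    "print(structural.shape)\n",
    "print(structural[0])\n",
    "```\n",
    "\n",
    "Each row now describes one structure: the first two entries are the H features summed over the four H atoms, and the last two are the C features. A linear model on such rows is a sum of per-atom linear contributions, which is what the extensivity argument requires."
   ]
  },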
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Energies are a part of the dataset we previously downloaded:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_energies = [structure.info['energy'] for structure in train_structures]\n",
    "train_energies = np.array(train_energies) * HARTREE_TO_EV\n",
    "\n",
    "test_energies = [structure.info['energy'] for structure in test_structures]\n",
    "test_energies = np.array(test_energies) * HARTREE_TO_EV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last step is to fit a linear regression and plot the learning curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_rmse(first, second):\n",
    "    return np.sqrt(np.mean((first - second)**2))\n",
    "\n",
    "\n",
    "def get_standard_deviation(values):\n",
    "    return np.sqrt(np.mean((values - np.mean(values))**2))\n",
    "\n",
    "\n",
    "def get_relative_performance(predictions, values):\n",
    "    return get_rmse(predictions, values) / get_standard_deviation(values)\n",
    "\n",
    "\n",
    "def estimate_performance(regressor, data_train, data_test, targets_train,\n",
    "                         targets_test):\n",
    "    regressor.fit(data_train, targets_train)\n",
    "    return get_relative_performance(regressor.predict(data_test), targets_test)"
   ]
  },
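  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Written out, the quantity returned by ``get_relative_performance`` is the RMSE divided by the standard deviation of the reference values. The symbols below are introduced here for illustration: $\\hat{y}_i$ are the predictions, $y_i$ the reference values, $\\bar{y}$ their mean, and $N$ the number of test structures.\n",
    "\n",
    "$$\\text{relative RMSE} = \\frac{\\sqrt{\\frac{1}{N}\\sum_{i=1}^{N}\\left(\\hat{y}_i - y_i\\right)^2}}{\\sqrt{\\frac{1}{N}\\sum_{i=1}^{N}\\left(y_i - \\bar{y}\\right)^2}}$$\n",
    "\n",
    "A value of $1$ corresponds to a model that always predicts the mean of the test targets, so smaller values are better."
   ]
  },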
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 12/12 [00:09<00:00,  1.32it/s]\n"
     ]
    }
   ],
   "source": [
    "errors = []\n",
    "for el in tqdm.tqdm(grid):\n",
    "    errors.append(\n",
    "        estimate_performance(BayesianRidge(), train_features[:el],\n",
    "                             test_features, train_energies[:el],\n",
    "                             test_energies))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this small setup, the best relative RMSE is about 7%:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.42465654790799834, 0.42789327645414715, 0.3493693406283348, 0.231589490323917, 0.1804705093880474, 0.16273379656134457, 0.13117080606147727, 0.1167863010740854, 0.09928117196727987, 0.08373380778918733, 0.07241337804396386, 0.07011697685671456]\n"
     ]
    }
   ],
   "source": [
    "print(errors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The learning curve looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZsAAAEKCAYAAADEovgeAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAAsTAAALEwEAmpwYAAAle0lEQVR4nO3deZyVdfn/8debQVFccSFTZNXcUEHHLRdcUrFcc3fUXJL8uuS+haWmhKWWmv5SLMMSF3ILjELMLU0FRBMQSUlALMMlSQUX5Pr98TmT4+HMcGa55z4z834+HucxM59zL9fR0Ws+9/25r0sRgZmZWZY65R2AmZm1f042ZmaWOScbMzPLnJONmZllzsnGzMwy52RjZmaZ65x3AJVqrbXWit69e+cdhplZm/Lcc8+9HRFrF4872dSjd+/eTJ48Oe8wzMzaFElzSo37MpqZmWXOycbMzDLnZGNmZplzsjEzs8w52VS4UaOgd2/o1Cl9HTUq74jMzBrPyaYFtXRiGDUKhgyBOXMgIn0dMsQJx8zaHrnFQGnV1dXRmKXPtYlh4cLPx7p2hREjoKZm2ft/+mnat+5rr73gzTeX3rZXL5g9u+zQzMxajaTnIqK6eNzP2bSQoUO/mGgg/fyd78CYMV9MIh9+uHRi+fTT8s81d27Lxm5mljUnmxZSXwL48EN48cU0y+naFVZdFdZZB1Za6fOx4lfte6ecAvPnL33Mbt1g8WLo7H97ZtZG+H9XLaRnz3RPpVivXjBjRtOO+dFHS1+a69QJ3n0XttwSfvxj+MY3QGra8c3MWosXCLSQYcPSbKSurl3TeFPV1KR7Pr16pYTSqxfcdhvcc0+67LbffrDbbjBxYvNiNzPLWodKNpL6SvqVpHta+tilEkO5iwOWddzZs2HJkvT16KPh4INh+nS48UZ46SXYbjs44giYNaslPomZWcvLPNlIqpL0vKQHm3GMWyXNlzStxHuDJc2U9KqkCxs6TkT8IyJObGocy1KcGJqbaBqy3HLpns6sWfD978PYsbDJJnDGGfD229md18ysKVpjZnMGUPKuhaTuklYpGtugxKYjgcEl9q8CbgT2ATYFjpS0qaTNJT1Y9Ore3A9SiVZZBX74Q3jlFTjuOLjhBujXD4YPX3p1nJlZXjJNNpJ6AN8AflnPJoOAByR1KWx/EvDz4o0i4gng3RL7bwu8WpixfALcBRwQEVMjYt+iV4l1Xe3Huuumy3ZTp8Kuu8L3vgdf+Qr8+tfw2Wd5R2dmHV3WM5trgfOBJaXejIjfAeOBuyXVACcAhzbi+OsBr9f5eV5hrCRJa0q6CRgo6aJ6ttlP0ogFCxY0IozKsemm8Pvfw+OPw3rrwQknwIABMG5cqkJgZpaHzJKNpH2B+RHxXEPbRcRPgI+AXwD7R8QHWcUUEe9ExMkR0S8ihtezzdiIGLLaaqtlFUar2GUXeOYZGD0aFi1KS6T32APcD87M8pDlzGZHYH9Js0mXt3aXdHvxRpJ2BvoD9wOXNPIcbwDr1/m5R2HMSKviDj00rVi7/vp0iW2bbeCoo+C11/KOzsw6ksySTURcFBE9IqI3cATwSEQcXXcbSQOBEcABwPHAmpKuaMRpJgEbSuojafnCeca0yAdoR5ZfHk4/Pa1cGzoUHngANtoIzjoL3nkn7+jMrCPI+zmbrsBhETErIpYAxwJLPYcv6U7gaWAjSfMknQgQEYuB00j3fWYAoyNieqtF38asuipccUVauXbssWm2069fqkSwaFHe0ZlZe+aqz/VobNXntmjaNLjwQvjDH6BHj5SIjj4aqqryjszM2qr6qj7nPbOxHPXvDw8+CI88koqDHnccbLUV/OlPXrlmZi3LycbYbTd49lm48054/33YZx/Yc0+YMiXvyMysvXCyMSBVkz7iiFSh+tpr4fnnYeut02U1N2ozs+ZysrEv6NIl1VebNSvdz7n33rRy7dxzU2sDM7OmcLKxklZfPdVX
+/vf03M5P/1pWrl29dWpz46ZWWM42ViD1l8/1Vd74QXYYQc477w00/ntb1N1azOzcjjZWFm22CLVV3v4YVhrrfScztZbp0ttvXunez69e8OoUXlHamaVyMnGGmWPPWDSpJRU5s1LD4TOmZOWSs+Zk9pYO+GYWTEnG2u0Tp3SfZziNtiQeugMHdr6MZlZZXOysSZ7/fXS43Pntm4cZlb5nGysyXr2bNy4mXVcTjbWZMOGLX0pTYJLL80lHDOrYE421mQ1NakVda9eKcl0754WCsxv1w24zawpnGysWWpqUjmbJUvg3/+GffdN1aPffDPvyMyskjjZWIu65ppUYcAr0sysLicba1Ff+Qp897up6oCrRptZLScba3Hf/36qMnDGGe6LY2aJk421uNVWS/dtnnwSfve7vKMxs0rgZGOZOPFE2HLLVLhz0aK8ozGzvDnZWCaqquC661I1gauvzjsaM8ubk41lZtAgOPhguPJKeOONvKMxszw52VimrroKPvsstSIws46rQyUbSX0l/UrSPXnH0lH06QPnnAO33w7PPJN3NGaWl8ySjaQVJE2U9DdJ0yVd1oxj3SppvqRpJd4bLGmmpFclNfj3c0T8IyJObGoc1jQXXQRf/nJaCu3unmYdU5Yzm4+B3SNiS2AAMFjS9nU3kNRd0ipFYxuUONZIYHDxoKQq4EZgH2BT4EhJm0raXNKDRa/uLfKprNFWXhmGD4eJE91YzayjyizZRPJB4cflCq/iR/wGAQ9I6gIg6STg5yWO9QTwbonTbAu8WpixfALcBRwQEVMjYt+iV1nlISXtJ2nEggULyvqcVp5jjoFttoELLoAPPlj29mbWvmR6z0ZSlaQXgPnAhIh4tu77EfE7YDxwt6Qa4ATg0EacYj2gbguveYWx+uJZU9JNwEBJF5XaJiLGRsSQ1VZbrRFh2LJ06pSWQv/rX2l1mpl1LJkmm4j4LCIGAD2AbSX1L7HNT4CPgF8A+9eZDWURzzsRcXJE9IuI4Vmdx0rbYYfUTvrqq1OlaDPrOFplNVpEvAc8Sun7LjsD/YH7gUsaeeg3gPXr/NyjMGYV6sor0yzn/PPzjsTMWlOWq9HWlrR64fsVgT2Bl4u2GQiMAA4AjgfWlHRFI04zCdhQUh9JywNHAGNaIHzLyPrrp/s2v/sdPP543tGYWWvJcmbzZeBRSS+SksKEiHiwaJuuwGERMSsilgDHAnOKDyTpTuBpYCNJ8ySdCBARi4HTSPd9ZgCjI2J6Zp/IWsR556Wkc+aZ6YFPM2v/FK4BX1J1dXVMnjw57zDarbvugiOPhFtugW9/O+9ozKylSHouIqqLxztUBQGrHIcfDjvumDp6epW5WfvnZGO5kODaa2H+/NT7xszaNycby011NRx/fHr+5pVX8o7GzLLkZGO5+tGPoEsXOPfcvCMxsyw52Viu1lkn3bcZMwYmTMg7GjPLipON5e7MM6FvXzjrLFi8OO9ozCwLTjaWuxVWSCVspk+Hm2/OOxozy4KTjVWEAw+E3XaDH/wA3i1V39vM2jQnG6sItUuh33sPLmtymz0zq1RONlYxttgChgyBG2+El17KOxoza0lONlZRfvjD1Nnz7LPBlZTM2g8nG6soa68Nl1wC48fDuHF5R2NmLcXJxirOqafCV76SZjeffJJ3NGbWEpxsrOIsvzz87Gfw97/DDTfkHY2ZtQQnG6tIX/86DB6c7uG89Vbe0ZhZcznZWMX66U/hgw/g+9/POxIzay4nG6tYm2yS7t/ccgv87W95R2NmzeFkYxXt0kuhW7dUP81Loc3aLicbq2jduqX7No89Bvffn3c0ZtZUTjZW8YYMgR49UitpCXr3hlGj8o7KzBrDycYq3t13pxVpte0H5sxJCcgJx6ztcLKxijd0KHz88RfHFi5M42bWNjSYbCRVSXq5tYIxK2Xu3MaNm1nlaTDZRMRnwExJPVspHrOl9Kznt2/11Vs1DDNrhnIuo3UDpkv6s6Qxta+sAzOr
NWwYdO36xbGqKvjPf9LSaC+JNqt8ncvYxs9vW65qatLXoUPTpbOePT9fDn3ZZamz57XXQiffgTSrWMtMNhHxuKQvAdsUhiZGxPxswzL7opqaz5NOrWOOgTXWgGuuSbOcW2+F5ZbLJz4za9gy/xaUdBgwETgUOAx4VtIhWQdmtiwSXHUV/OhHcPvt8M1vwqJFeUdlZqWUcxltKLBN7WxG0trAw8A9WQZmVg4JLrooVRo45ZRUKXrMGFhttbwjM7O6yrnK3anostk7Ze5n1mpOPhnuvBP++lfYbTeY7wu9ZhWlnKTxJ0njJR0n6TjgD4Ab9lrFOfxwGDsWXn4Zdt45VRows8qwrIc6BVwP3AxsUXiNiIgLWiE2s0YbPBgmTEgzm512ghkz8o7IzGDZD3UGMC4i7ouIswsv1961irbjjvD44/Dpp2mGM3ly3hGZWTmX0aZI2mbZm5lVji22gKeeglVXTfdwHn0074jMOrZyks12wNOSZkl6UdJUSS9mHVgWJPWV9CtJXknXAfTrB08+Cb16wT77wAMP5B2RWcdVzj2bIUA/YHdgP2DfwtcGSVpf0qOSXpI0XdIZTQ1S0q2S5kuaVuK9wZJmSnpV0oUNHSci/hERJzY1Dmt71l0XnngCBg6Egw+GkSPzjsisY2rwOZuICEk3RsTmTTj2YuCciJgiaRXgOUkTIuKl2g0kdQcWRcT7dcY2iIhXi441ErgB+E3dQUlVwI3AnsA8YFKhblsVMLzoGCe48kHHtMYaadHAN78Jxx+fqg2cdVbeUZl1LJnds4mIf0XElML37wMzgPWKNhsEPCCpC4Ckk4CflzjWE8C7JU6zLfBqYcbyCXAXcEBETI2IfYteTjQd2Morp2XRhxwCZ58NF1/sAp5mrancezbPNOeejaTewEDg2brjEfE7YDxwt6Qa4ARSWZxyrQe8XufneSyd0OrGsaakm4CBki6qZ5v9JI1YsGBBI8KwtqBLF7jrLvj2t1Ml6VNPhSVL8o7KrGMop1zN3s05gaSVgXuBMyPiv8XvR8RPJN0F/ALoFxEfNOd8DYmId4CTl7HNWGBsdXX1SVnFYfmpqoIRI2DNNeHHP06X1G67DZZfPu/IzNq3Zc5sImIOsD6we+H7heXsByBpOVKiGRUR99Wzzc5Af+B+4JIy4671RiG2Wj0KY2b1kuDKK1OyuesuOPDA1GbazLJTTtXnS4ALgNrLTssBt5exn4BfATMi4qf1bDMQGAEcABwPrCnpivJCB2ASsKGkPpKWB44A3NjNynL++XDLLTB+POy1F7z3Xt4RmbVf5cxQDgL2Bz4EiIh/AquUsd+OwDHA7pJeKLy+XrRNV+CwiJgVEUuAY4GlKlpJuhN4GthI0jxJJxZiWQycRrrvMwMYHRHTy4jNDEj3b+6+GyZOhF13hTffzDsis/apnHs2nxSWQAeApJXKOXBEPAloGds8VfTzp8AtJbY7soFjjMOFQa0ZDjkktSQ46KBUT23CBOjTJ++ozNqXcmY2oyXdDKxeWJr8MCUSgllbtuee8PDDqcX0TjvBdM+PzVpUOQsEriY1SrsX2Aj4QUQs9SyMWVu3/fap2kAE7LILPPvssvcxs/KUtaosIiZExHkRcW5ETMg6KLO89O+fCnh26wZ77JG6gPbuDZ06pa+jRuUdoVnb5I6bZkX69EkFPNdcMy2RnjMnzXbmzIEhQ5xwzJrCycashHXWKV1dYOFCGDq09eMxa+vKfThzRUkbZR2MWSV5o57Hg91u2qzxynmocz/gBeBPhZ8HFCorm7VrPXuWHq8tebN4cevGY9aWlTOzuZRUXfk9gIh4AfBTCNbuDRsGXbt+caxLF+jbF77zHdh889SQzdWjzZatnGTzaUQUl0D2f17W7tXUpBlMr16pnlqvXvCrX8HMmXD//Wmb2gdBn3qq4WOZdXTlJJvpko4CqiRtKOnnwF8zjsusItTUwOzZabHA7NnpZykV75w6NSWj115LCefAA2HGjHzj
NatU5SSb04HNgI+BO4AFwJkZxmTWJnTuDCedBK+8AldcAY88kp7TGTIE/vnPvKMzqyzlJJuNI2JoRGxTeF0cER9lHplZG7HSSmk59KxZcNppMHIkbLBB6gb636U6OJl1TOUkm2skzZB0uaT+mUdk1katvTZcdx28/HK6pDZsGPTrB9dfD598knd0ZvkqpzbabsBuwFvAzYW20BdnHplZG9W3L9xxB0yeDFtuCWecARtvDHfeCb/9rcvfWMekaMS6TUmbA+cDh0dEu26kW11dHZMnT847DGvjIuChh+CCC+Bvf0tJpm5lgq5d0yKDmpr8YjRrSZKei4jq4vFyHurcRNKlkqYCtSvRemQQo1m7I8Hee8OUKanWWnEJHJe/sY6inOZptwJ3A3sXunSaWSN16pR65ZQyd27rxmKWh2Umm4jYoTUCMWvvevYsXVetW7d0uU0N9rU1a9vqvYwmaXTh61RJL9Z5TZX0YuuFaNY+lCp/UzvjGTIEPv44n7jMWkNDM5szCl/3bY1AzNq72kUAQ4emS2c9e8Lll6el0j/6EUybBvfeC+uum2+cZlmoN9lExL8K354SERfUfU/Sj4ELlt7LzBpSU1N65dnAgXDccbD11nDffbCDL15bO1POQ517lhjbp6UDMevIDjkEnnkmVSMYNAhuuSXviMxaVkP3bP6vsNx5o6J7Nq8Bvmdj1sL694dJk2D33dM9nJNPduUBaz8aumdzB/BHYDhwYZ3x9yOinkWcZtYc3brBH/6Q6qpdeWWqLH3PPfDlL+cdmVnz1DuziYgFETE7Io6MiDnAIlIfm5Ul1dPD0Myaq6oKhg+Hu++GF16A6up0ic2sLSurLbSkV4DXgMeB2aQZj5ll6LDD4OmnU3fQQYNS4zaztqqcBQJXANsDf4+IPsAegP/OMmsFW2yRCnoOGgTf/jaccorv41jbVG5b6HeATpI6RcSjwFJF1swsG2usAePGwXnnwS9+AXvsAW++mXdUZo1TTrJ5T9LKwBPAKEnXAR9mG5aZ1dW5M/zkJ6lNwXPPpfs4EyfmHZVZ+cpJNgeQFgecBfwJmAXsl2VQZlbaEUfAX/8Kyy0Hu+wCv/513hGZlaec5mkfRsRnEbE4Im6LiOsLl9XMLAcDBqT7ODvtBCecAKefDp9+mndUZg1r6KHO9yX9t87r/bpfWzNIM/uiNdeEP/0JzjkHbrgBvvY1mD8/76jM6tfQczarRMSqdV6r1P3amkGa2dI6d4arr06tpSdOTHXV3FzWKlU592yQtJOk4wvfryWpT7ZhmVm5jjoKnnoqtSvYaSf4zW/yjshsaeU81HkJqcLzRYWh5YHbswzKzBpnq63SrOarX4VvfQvOPNP3cayylDOzOQjYn8Jy50Jr6FWyDMrMGm/tteGhh1Kiue462GsveOutvKMyS8pJNp9ERJDqoiFppWxDMrOm6twZfvazdCntmWfS8zhTpuQdlVl5yWa0pJuB1SWdBDwMuNuGWQU75hh48kmIgB13hP/7P+jdO93X6d07LSowa01Kk5Z63pQE9AA2BvYCBIyPiAmtE15+qqurY7KX9lgbN39+qqv28stfHO/aFUaMKN011Kw5JD0XEUuVNGuonw0REZLGRcTmQLtPMGbtTffusHDh0uMLF8LQoU421nrKuYw2RdI2mUdiZpl4/fXS43PmtG4c1rGVk2y2A56WNKvQFnqqJLeFNmsjetbT6rBTJ/jtb9N9HbOslZNs9gb6AbuTCnDuiwtxmrUZw4alezR1rbAC9O0Lxx4Lu+4K06fnEpp1IOUU4pxT6tUawZlZ89XUpMUAvXqBlL7+8pcwc2b6Om1aKu55/vnwwQd5R2vtVYOr0Toyr0azjuLtt+HCC1Pb6R490gOhBx2UEpNZY9W3Gq2s2mhm1n6ttVaa4Tz1VKomffDB8I1vwKxZeUdm7YmTjZkBqa7a5Mlw7bXpgdDNNoPLLoOPPso7MmsPnGzM7H86d4YzzkgPgR50EFx6KfTv
D+PH5x2ZtXVONma2lHXXhTvvhAkToKoKBg+GQw+FefPyjszaKicbM6vX174GL74IV1wBDz4IG28M11yT2heMGuV6a1Y+r0arh1ejmX3Ra6/Bd7+bkk6PHql9wccff/6+660ZeDWamTVTnz4wdiz8/vfwr399MdHA5/XWzEpxsjGzRtl/f/jss9LvzZ3burFY2+FkY2aN1qtX6fEvfal147C2w8nGzBqtVL01Cf79bxgyBN55J5+4rHI52ZhZo5WqtzZiBJx9Ntx6K2y0UapKsGRJ3pFapfBqtHp4NZpZ00ybBqeeCk88AdttB//v/8FWW+UdlbUWr0Yzs1bRvz889ljqlTN7NlRXp+Tzn//kHZnlycnGzFqcBEcfncrenH463HRTurQ2cqQvrXVUTjZmlpnVV08tC557DjbYAI4/HnbZJVUlsI7FycbMMjdgQKokfeutqWnbVlvBmWfCggV5R2atxcnGzFpFp05pZjNzZloeff31qdbaqFHgdUrtn5ONmbWqNdZIK9QmToT110/3dnbbDaZPzzsyy5KTjZnloroann4abr453cMZMADOOw/efz/vyCwLTjZmlpuqqnRJ7e9/h299C66+GjbZBEaP9qW19sbJxsxyt9ZaqeLA009D9+5w+OGw997p/o61Dx0i2UjqK+lXku7JOxYzq9/228OkSXDDDemezuabw/e+Bx9+mHdk1lwVn2wk3SppvqRpReODJc2U9KqkCxs6RkT8IyJOzDZSM2sJVVWp4sDMmXDUUTB8OGy6KTzwANx+u7uDtlWd8w6gDCOBG4Df1A5IqgJuBPYE5gGTJI0BqoDhRfufEBHzWydUM2spX/pSqjhw4okp+Rx0UEoytRUI5sxJ93vA3UHbgoqf2UTEE8C7RcPbAq8WZiyfAHcBB0TE1IjYt+hVdqKRNETSZEmT33rrrRb8FGbWVDvvDFOmQLduS5e6cXfQtqPik0091gNer/PzvMJYSZLWlHQTMFDSRfVtFxEjIqI6IqrXXnvtlovWzJqlc2d4773S782Z06qhWBO11WTTKBHxTkScHBH9IqL4MpuZtQE9e5Yel+DSS+tPRlYZ2mqyeQNYv87PPQpjZtZOleoOusIK6eHQyy5LCwYuu8z11ipVW002k4ANJfWRtDxwBDAm55jMLEOluoP+8pdpifTzz6eSN5dempLO5Zc76VSaik82ku4EngY2kjRP0okRsRg4DRgPzABGR4QrK5m1czU1qSHbkiXpa+0qtAED4P7700KCQYPgBz+APn3giivgv//NMWD7H7eFrofbQpu1XVOmpFnO2LFpFds558B3vwurrJJ3ZO2f20KbWYex1VYwZkyqRrDjjnDxxeny2vDhLvSZFycbM2u3qqvT7GbiRNhhh1T6pk8fuPJK+OCDvKPrWJxszKzd22YbePBBePZZ2HZbuOiilHR+8hPXXWstTjZm1mFsuy2MG5eqS1dXwwUXpKRz1VVOOllzsjGzDmf77eGPf4S//hUGDoTzz4e+feGaa1IJHGt5TjZFJO0nacQCL9I3a/d22AHGj4ennoItt4Rzz00znZ/+1EmnpTnZFImIsRExZLXVVss7FDNrJV/9Kjz0EPzlL6mHzjnnpJnOz34GixblHV374GRjZlaw007w8MPwxBOw2WZw9tkp6Vx3nZNOcznZmJkV2Xln+POf4bHHYOON4cwzoV8/uP56+OijvKNrm5xszMzqMWgQPPpoem24IZxxRko6N9zgpNNYTjZmZsuw665plvPIIynZnH46bLAB3Hhj6ibqVtXL1hbaQpuZ5U5KlaV33TUlnUsugdNOS+O1JSbdqrp+ntmYmTWCBHvskVaude/+eaKp5VbVpTnZmJk1gQRvvVX6vTlzlk5CHZ2TjZlZE9XXqhrS7GfSpNaLpdI52ZiZNVGpVtUrrgjHHgvTpqVabIcdBq+8kk98lcTJxsysiUq1qr7lFrjtNpg1K3UMHTcONt0UTj0V/v3vvCPOjzt11sOd
Os2sJbz5Jlx+eUpKXbqkUjjnntt+u4a6U2eZXIjTzFrSOuuk53Feegm+/nX44Q8/fzD0k0/yjq71ONkUcSFOM8vChhvC6NGpgdtmm6UHQzfZBO66C5YsyTu67DnZmJm1om23TQ+FjhsHK60ERx6Zxv7857wjy5aTjZlZK5Ngn33g+efhN79Jz+t87Wuw995prD1ysjEzy0lVFRxzDMycmbqETp4MW20FRx8Nr72Wd3Qty8nGzCxnK6yQeufMmgUXXgj33gsbbZRaG7z9dt7RtQwnGzOzCrH66jB8OLz6KnzrW/Dzn6fmbcOGwYcf5h1d8zjZmJlVmPXWSw+HTpsGu+8OF1+cWhrcfDMsXpx3dE3jZGNmVqE22QQeeACefDLNcE4+OS2bPvPMVK2gLfXQcbIxM6twO+6YEs7vfw8ffADXXQdz56bK0rU9dCo94TjZmJm1ARLsvz90LtHycuHClHAuvxzuuy+tbqu0y23u1Glm1oa8/nrp8YULU+HPWl26wMYbp8tum20G/funr336pMtvxUaNSk3f5s5NrROGDWvZbqNONmZmbUjPnunSWbFevWD6dJgxIy0smD49vZ58Eu644/PtVlwx3QuqTT6bbZZaIHzve7BoUdomi/bWrvpcD1d9NrNKNGpUSgQLF34+1rVrqipdX2L4739TIdDaBFSbjP75z4bP1asXzJ7duPjqq/rsmU0RSfsB+22wwQZ5h2JmtpTahNKYS16rrgrbb59edf3nPynp7Lxz6f3mzm2ZmMEzm3p5ZmNmHUXv3vVfmmupmY1Xo5mZdXCl2lt37ZrGW4qTjZlZB1eqvXVD94CawvdszMyMmpqWTS7FPLMxM7PMOdmYmVnmnGzMzCxzTjZmZpY5JxszM8ucH+qsh6S3gBKPOeVmNWBBOzhnc4/ZlP0bu0852zd3m7WAdtLw17+bzdi/MfuUu23ev5u9ImLtpUYjwq828AJGtIdzNveYTdm/sfuUs31ztwEmt/a/z6xe/t1s+v6N2afcbSv1d9OX0dqOse3knM09ZlP2b+w+5WzfUtu0B/7dbPr+jdmn3G0r8nfTl9HMciBpcpSoH2WWt6x+Nz2zMcvHiLwDMKtHJr+bntmYmVnmPLMxM7PMOdmYmVnmnGzMzCxzTjZmFUDSJpJuknSPpP/LOx6zuiStJGmypH2begwnG7OMSLpV0nxJ04rGB0uaKelVSRcCRMSMiDgZOAzYMY94reNozO9mwQXA6Oac08nGLDsjgcF1ByRVATcC+wCbAkdK2rTw3v7AH4BxrRumdUAjKfN3U9KewEvA/Oac0J06zTISEU9I6l00vC3wakT8A0DSXcABwEsRMQYYI+kPwB2tGqx1KI383VwZWImUgBZJGhcRSxp7Ticbs9a1HvB6nZ/nAdtJ2hX4JtAFz2wsHyV/NyPiNABJxwFvNyXRgJONWUWIiMeAx3IOw6xeETGyOfv7no1Z63oDWL/Ozz0KY2Z5y/R308nGrHVNAjaU1EfS8sARwJicYzKDjH83nWzMMiLpTuBpYCNJ8ySdGBGLgdOA8cAMYHRETM8zTut48vjddCFOMzPLnGc2ZmaWOScbMzPLnJONmZllzsnGzMwy52RjZmaZc7IxM7PMOdmYNZGkxyRVt8J5vitphqRRZWy7uqRTWvDcvSUd1VLHs47LycYsB5IaU5fwFGDPiKgpY9vVC9s395y1egONTjaFcvVm/+NkY+1a4S/zGZJukTRd0kOSViy897+ZiaS1JM0ufH+cpAckTZA0W9Jpks6W9LykZyStUecUx0h6QdI0SdsW9l+p0JxqYmGfA+ocd4ykR4A/l4j17MJxpkk6szB2E9AX+KOks4q236xwjhckvShpQ+BKoF9h7CpJu0r6i6QxwEuFfx7T6hzjXEmXFr7fQNLDkv4maYqkfoXj7Vw43lmFz3BDnf0fLFSsRtIHkq6R9DdgB0lH14nvZklVhdfIwmecWvyZrP1ysrGOYEPgxojYDHgPOLiM
ffqTSv5vAwwDFkbEQFKJj2PrbNc1IgaQZhO3FsaGAo9ExLbAbsBVklYqvLcVcEhEDKp7MklbA8cD2wHbAydJGljo3vlPYLeI+FlRjCcD1xXOX00qCX8hMCsiBkTEeXXOeUZEfGUZn3kU6Z/TlsBXgX8VjveXwvGKz19sJeDZwv7vAIcDOxbi+wyoAQYA60VE/4jYHPj1Mo5p7YRbDFhH8FpEvFD4/jnSpaFleTQi3gfel7QAGFsYnwpsUWe7O+F/zahWlbQ6sBewv6RzC9usAPQsfD8hIt4tcb6dgPsj4kMASfcBOwPPNxDj08BQST2A+yLiFUmltpsYEa819GElrUJKAvcXPs9HhfGGdiv2GXBv4fs9gK2BSYVjrEjq9DgW6Cvp56SupA815gTWdjnZWEfwcZ3vPyP9jw9gMZ/P7ldoYJ8ldX5ewhf/uykuLhiAgIMjYmbdNyRtB3zYqMgbEBF3SHoW+AYwTtJ3gH+U2LTuOet+Zlj6cy9LQ/t/FBGfFb4XcFtEXFR8AElbAnuTZmaHASc0MgZrg3wZzTqy2aS/vgEOaeIxDgeQtBOwICIWkKrmnq7Cn/SSBpZxnL8AB0rqWrjkdlBhrF6S+gL/iIjrgd+TZlzvA6s0sNu/ge6S1pTUBdgXoDCLmyfpwMKxu0jqWuJ4s4EBkjpJWp/USriUPwOHSOpeON4aknpJWgvoFBH3AheTLvFZB+CZjXVkVwOjJQ0hXdJpio8kPQ8sx+d/oV8OXAu8KKkT8BqF/6nXJyKmSBoJTCwM/TIiGrqEBmlWcIykT4E3gR9FxLuSniosAvgjRZ8rIj6V9MPCed4AXq7z9jHAzYX3PwUOBV4EPivc9B9Z+FyvAS+RytBPqefzvCTpYuChwj+DT4FTgUXArwtjAEvNfKx9cosBMzPLnC+jmZlZ5pxszMwsc042ZmaWOScbMzPLnJONmZllzsnGzMwy52RjZmaZc7IxM7PM/X+DVKahtkQGRQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from matplotlib import pyplot as plt\n",
    "plt.plot(grid, errors, 'bo')\n",
    "plt.plot(grid, errors, 'b')\n",
    "plt.xlabel(\"number of structures\")\n",
    "plt.ylabel(\"relative error\")\n",
    "plt.xscale('log')\n",
    "plt.yscale('log')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pipeline in this tutorial was designed to explain all the intermediate steps, but it has one drawback - at some moment, all atomic representations, along with all intermediate covariants for the whole dataset, are explicitly stored in RAM, which might become a bottleneck for big calculations. Indeed, only structural invariant features are eventually needed, and their size is much smaller than the size of all atomic representations, especially if the dataset consists of large molecules. Thus, it is a good idea to calculate structural features by small blocks and get rid of atomic representations for each block immediately. For this purpose, there is a function ``nice.utilities.transform_sequentially``. It has a parameter ``block_size`` that controls the size of each chunk. The higher this chunk, the more RAM is required for calculations. But, on the other hand, for very small chunks, slow python loops over lambda channels with invoking separate python classes for each lambda channel might become a bottleneck (all the other indices are handled either by numpy vectorization or by cython loops). The other reason for the slowdown is multiprocessing. Thus, transforming time per single environment monotonically decrease with the ``block_size`` and eventually goes to saturation. The default value for ``block_size`` parameter should be fine in most cases.\n",
    "\n",
    "The full example can be found in examples/methane_home_pc or in examples/qm9_home_pc. Other than that and the absence of markdown comments, these notebooks are almost identical to this tutorial (in qm9 single nice transformer is used for all central species). Thus, we recommend picking one of them as the code snippet. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}