Sparse accumulation

[1]:
import torch
import numpy as np
from sparse_accumulation import accumulate

Preparing dummy data

Let’s prepare some dummy data to play with:

[2]:
X1 = torch.randn(10, 20, 3)
X2 = torch.randn(10, 20, 4)

m1 = torch.LongTensor([0, 1, 1, 2])
m2 = torch.LongTensor([0, 0, 3, 1])
mu = torch.LongTensor([0, 3, 1, 2])

C = torch.FloatTensor([0.17, 0.23, 0.4, -0.9])
output_size = 42

Important: the sparse accumulation operation requires the mu tensor to be sorted in order to work correctly.

The result of the sparse accumulation operation clearly does not change under a simultaneous permutation of the tensors m1, m2, mu, and C, since the result of a summation does not depend on the order of its terms. Thus, it is always possible to have the mu tensor sorted, and one can achieve this as simply as:

[3]:
indices = np.argsort(mu)

m1 = m1[indices]
m2 = m2[indices]
mu = mu[indices]
C = C[indices]

accumulate function

The main function that performs the sparse accumulation operation is called accumulate. It can be invoked like this:

[4]:
output = accumulate(X1, X2, mu, output_size, m1, m2, C)
print(output.shape, output.device)
torch.Size([10, 20, 42]) cpu

Since the input tensors are located on the CPU, the PyTorch CPU extension was invoked internally.
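
As a quick sanity check (not part of the library API), one can reproduce the same result with a plain Python loop over the non-zero terms and compare:

# illustrative check only: accumulate the terms manually and compare
reference = torch.zeros(10, 20, output_size)
for i in range(mu.shape[0]):
    reference[:, :, mu[i]] += X1[:, :, m1[i]] * X2[:, :, m2[i]] * C[i]
print(torch.allclose(output, reference, atol=1e-6))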

Now let’s move our dummy data to the GPU:

[5]:
X1_cuda = X1.cuda()
X2_cuda = X2.cuda()
m1_cuda = m1.cuda()
m2_cuda = m2.cuda()
mu_cuda = mu.cuda()
C_cuda = C.cuda()

The call is exactly the same:

[6]:
output = accumulate(X1_cuda, X2_cuda, mu_cuda, 42, m1_cuda, m2_cuda, C_cuda)
print(output.shape, output.device)
torch.Size([10, 20, 42]) cuda:0

This time our CUDA kernel was invoked internally, since the input tensors are located on the GPU.
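
One can verify (again, a quick check rather than part of the library API) that the CPU and GPU backends agree up to floating-point round-off:

# illustrative check only: recompute on CPU and compare with the GPU result
output_cpu = accumulate(X1, X2, mu, output_size, m1, m2, C)
print(torch.allclose(output.cpu(), output_cpu, atol=1e-6))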

[optional] Clebsch-Gordan iteration

[7]:
from sparse_accumulation import get_cg_transformation_rule

Precomputing the Clebsch-Gordan transformation rule

If we want the sparse accumulation operation to perform an actual Clebsch-Gordan iteration, we need to precompute the corresponding transformation rule and populate the arrays m1, m2, mu, and C with the actual Clebsch-Gordan coefficients.

[8]:
l1 = 3
l2 = 4
l_output = 5
m1, m2, mu, C = get_cg_transformation_rule(l1, l2, l_output)
print(m1.shape, m2.shape, mu.shape, C.shape)
torch.Size([93]) torch.Size([93]) torch.Size([93]) torch.Size([93])

The sorting operation mentioned above is not required here, since it has already been performed inside get_cg_transformation_rule.

Now, given this transformation rule, the sparse accumulation operation performs an actual CG iteration, producing covariant vectors with l = l_output from covariant vectors with l = l1 and l = l2:

[9]:
X1 = torch.randn(10, 20, 2 * l1 + 1)
X2 = torch.randn(10, 20, 2 * l2 + 1)
output = accumulate(X1, X2, mu, 2 * l_output + 1, m1, m2, C)
print(output.shape)
torch.Size([10, 20, 11])

Clebsch-Gordan Calculator

It makes sense to wrap the steps above into a class, where the CG transformation rule is computed during initialization and then used in the forward method. We provide such a class, called CGCalculatorSingle:

[10]:
from sparse_accumulation import CGCalculatorSingle
[11]:
calc = CGCalculatorSingle(l1, l2, l_output)
output = calc(X1, X2)
print(output.shape, output.device)
torch.Size([10, 20, 11]) cpu

This class supports convenient reallocation to the GPU:

[12]:
calc = calc.cuda()
output = calc(X1.cuda(), X2.cuda())
print(output.shape, output.device)
torch.Size([10, 20, 11]) cuda:0

All the tensors constituting the transformation rule (m1, m2, mu, and C) are stored as buffers, not as parameters, so they will not be touched by an optimizer.
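
For illustration only, here is a minimal sketch of how such a wrapper might look internally; the class name is hypothetical and the actual CGCalculatorSingle implementation may differ:

class CGCalculatorSketch(torch.nn.Module):  # hypothetical name, not the library class
    def __init__(self, l1, l2, l_output):
        super().__init__()
        m1, m2, mu, C = get_cg_transformation_rule(l1, l2, l_output)
        # buffers move along with .cuda()/.to(), but are not returned by .parameters()
        self.register_buffer("m1", m1)
        self.register_buffer("m2", m2)
        self.register_buffer("mu", mu)
        self.register_buffer("C", C)
        self.output_size = 2 * l_output + 1

    def forward(self, X1, X2):
        return accumulate(X1, X2, self.mu, self.output_size, self.m1, self.m2, self.C)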

[todo] make the accumulate function raise a ValueError if the size of shared memory is insufficient; mention it here.
[todo] add a fallback to an alternative implementation (select which one) in CGCalculatorSingle if the size of shared memory is insufficient; mention it here.

Outdated

The goal is to compute this for all \(\mu\):

\(\text{Output}[:, :, \mu] = \sum\limits_{m_1, m_2} \text{X}_1[:, :, m_1] \cdot \text{X}_2[:, :, m_2] \cdot C_{m_1, m_2, \mu}\)

This is a subpart of the Clebsch-Gordan iteration for fixed l1, l2, and l. The first two dimensions are the “dense” ones, i.e., the same operation is performed for all indices in the first two dimensions.

Since Clebsch-Gordan coefficients are very sparse, it is worthwhile to align them into a one-dimensional tensor containing only the non-zero values; in this case, however, we need to supply this tensor with supplementary index tensors telling us the corresponding m1, m2, and \(\mu\) indices.
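
For illustration, here is a minimal sketch (with hypothetical coefficient values) of how a dense coefficient tensor could be converted into this aligned format:

# illustrative only: turn a dense C_dense[m1, m2, mu] into the aligned 1D format
C_dense = torch.zeros(3, 4, 5)
C_dense[0, 0, 0] = 0.17
C_dense[1, 0, 3] = 0.23
C_dense[1, 3, 1] = 0.4
nz = C_dense.nonzero()                         # one row of (m1, m2, mu) per non-zero entry
idx_1, idx_2, idx_output = nz[:, 0], nz[:, 1], nz[:, 2]
multipliers = C_dense[idx_1, idx_2, idx_output]
order = torch.argsort(idx_output)              # keep mu sorted, as required above
idx_1, idx_2, idx_output, multipliers = idx_1[order], idx_2[order], idx_output[order], multipliers[order]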

A reference slow Python implementation is as simple as this:

[13]:
def sparse_accumulation_loops(X1, X2, idx_output, output_size, idx_1, idx_2, multipliers):
    # all input tensors are assumed to be on the same device and of the same dtype
    device = X1.device
    dtype = X1.dtype

    output = torch.zeros([X1.shape[0], X2.shape[1], output_size], device=device, dtype=dtype)
    for index in range(idx_output.shape[0]):
        output[:, :, idx_output[index]] += X1[:, :, idx_1[index]] * X2[:, :, idx_2[index]] * multipliers[index]
    return output

Here multipliers contains the values of the Clebsch-Gordan coefficients, idx_1 is the tensor of the corresponding m1 indices, idx_2 is the tensor of the corresponding m2 indices, and idx_output is the tensor of \(\mu\) indices. output_size is just a single integer, the desired length of the output (2 * l + 1).

So the loop goes over all the terms, i.e., over all \(\mu\), m1, and m2 with non-zero Clebsch-Gordan coefficients, and each contribution is added to the output array at the proper position defined by \(\mu\), which is stored in idx_output.

The first two dense dimensions are introduced with batch and feature dimensions in mind. If you need just one, it is possible to introduce a dummy dimension of size 1, as illustrated below.
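
For illustration, a single covariant vector can be lifted into this three-dimensional layout as follows:

# illustrative only: give a single covariant vector dummy batch and feature dimensions
l = 5
x = torch.randn(2 * l + 1)
x_3d = x[None, None, :]
print(x_3d.shape)  # torch.Size([1, 1, 11])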

The transformation itself, i.e., the Clebsch-Gordan coefficients, can be precomputed once at the very beginning. This repo, among other things, contains the code for this:

[14]:
from sparse_accumulation.clebsch_gordan import ClebschGordan, get_real_clebsch_gordan
L_MAX = 5
clebsch = ClebschGordan(L_MAX).precomputed_
indices = get_real_clebsch_gordan(clebsch[L_MAX, L_MAX, L_MAX], L_MAX, L_MAX, L_MAX)

m1_aligned, m2_aligned = [], []
multipliers, mu_aligned = [], []
for mu in range(0, 2 * L_MAX + 1):
    for el in indices[mu]:
        m1, m2, multiplier = el
        m1_aligned.append(m1)
        m2_aligned.append(m2)
        multipliers.append(multiplier)
        mu_aligned.append(mu)
m1_aligned = torch.LongTensor(m1_aligned)
m2_aligned = torch.LongTensor(m2_aligned)
mu_aligned = torch.LongTensor(mu_aligned)
multipliers = torch.FloatTensor(multipliers)

indices = np.argsort(mu_aligned)

m1_aligned = m1_aligned[indices].cuda()
m2_aligned = m2_aligned[indices].cuda()
mu_aligned = mu_aligned[indices].cuda()
multipliers = multipliers[indices].cuda()

print("L_MAX: ")
print("multipliers shape: ", multipliers.shape)
print("m1_aligned shape: ", m1_aligned.shape)
print("m2_aligned shape: ", m2_aligned.shape)
print("multipliers shape: ", multipliers.shape)
L_MAX:
multipliers shape:  torch.Size([126])
m1_aligned shape:  torch.Size([126])
m2_aligned shape:  torch.Size([126])
multipliers shape:  torch.Size([126])

This is a simple wrapper over the sympy package, and the definition of the real Clebsch-Gordan coefficients is consistent with librascal real spherical harmonics, nice, wigner iterations, and rascaline.

Now we can do the Clebsch-Gordan iteration:

[15]:
X1 = torch.randn(100, 17, 2 * L_MAX + 1).cuda()
X2 = torch.randn(100, 17, 2 * L_MAX + 1).cuda()

output_loops = sparse_accumulation_loops(X1, X2, mu_aligned, 2 * L_MAX + 1, m1_aligned, m2_aligned, multipliers)
print(output_loops.shape)
torch.Size([100, 17, 11])

You can take a look at the benchmark .py files, along with their .out outputs, to get an idea of 1) how to benchmark this properly with GPU synchronization and 2) the speed of this operation compared to a naive implementation.
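
For illustration, a minimal timing sketch (assuming the CUDA tensors from the previous cell; the helper below is not part of the library) could look like this:

import time

# CUDA launches are asynchronous, so synchronize before reading the clock
def benchmark(fn, n_iter=100):
    fn()  # warm-up call (triggers any lazy allocation/compilation)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iter):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / n_iter

t = benchmark(lambda: accumulate(X1, X2, mu_aligned, 2 * L_MAX + 1,
                                 m1_aligned, m2_aligned, multipliers))
print(f"average time per call: {t:.2e} s")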
