This repository contains benchmarking results for different ways to extract diagonal entries from a sparse matrix in
PyTorch.
## Background
This repository originates from implementing the Random Walk Positional Encoding from
Dwivedi et al., 2022 in PyTorch for
[pykeen#918](https://github.com/pykeen/pykeen/pull/918).
This positional encoding is given as
`x_i = [R[i, i], R^2[i, i], ..., R^k[i, i]]`,
where `R = A D^{-1}` denotes the column-wise normalized adjacency matrix (with `D` being the diagonal
degree matrix). Since adjacency matrices are typically sparse, we want to use
[`torch.sparse`](https://pytorch.org/docs/stable/sparse.html) for calculating the matrix powers.
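To make the role of the diagonal explicit, the computation could look roughly like the sketch below. It assumes a square sparse COO adjacency matrix, uses `extract_diagonal` as a placeholder for any of the variants studied below, and relies on `torch.sparse.mm` accepting a sparse second argument; it is an illustration, not the code used in PyKEEN.
```python
import torch

def random_walk_pe(adj: torch.Tensor, k: int) -> torch.Tensor:
    """Sketch: random-walk positional encoding for a sparse COO adjacency matrix `adj`."""
    adj = adj.coalesce()
    # column-wise normalization R = A D^{-1}: divide each entry by the degree of its column
    degree = torch.sparse.sum(adj, dim=0).to_dense().clamp_min(1.0)
    col = adj.indices()[1]
    r = torch.sparse_coo_tensor(adj.indices(), adj.values() / degree[col], size=adj.shape)
    x = []
    power = r
    for _ in range(k):
        x.append(extract_diagonal(power))  # placeholder: any of the variants below
        power = torch.sparse.mm(power, r)  # sparse-sparse matrix product
    return torch.stack(x, dim=-1)  # shape: (n, k)
```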
As of PyTorch 1.11, sparse tensors are in beta status and not all operations support them.
In particular, [`torch.diagonal`](https://pytorch.org/docs/stable/generated/torch.diagonal.html)
does not support sparse matrices,
```python-console
>>> torch.diagonal(A)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: sparse tensors do not have strides
```
nor can we use advanced indexing via `A[torch.arange(n), torch.arange(n)]`:

<details><summary>Full error message</summary>

```python-console
>>> A[torch.arange(n), torch.arange(n)]
Traceback (most recent call last):
File "", line 1, in
NotImplementedError: Could not run 'aten::index.Tensor' with arguments from the 'SparseCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::index.Tensor' is only available for these backends: [CPU, QuantizedCPU, BackendSelect, Python, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, AutocastCPU, Autocast, Batched, VmapMode, Functionalize].
CPU: registered at aten/src/ATen/RegisterCPU.cpp:21063 [kernel]
QuantizedCPU: registered at aten/src/ATen/RegisterQuantizedCPU.cpp:1258 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_1.cpp:10665 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_1.cpp:11342 [kernel]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:461 [backend fallback]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1059 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:52 [backend fallback]
```
</details>
Hence, we study different variants of extracting the diagonal and measure their performance.
## Variants
### Overview
| # | Variant | COO | CSR |
|---|-------------------|:---:|:---:|
| 1 | DenseDiag | ✓ | ✓ |
| 2 | ExplicitPython | ✓ | |
| 3 | ExplicitPythonCSR | | ✓ |
| 4 | Coalesce | ✓ | |
| 5 | ManualCoalesce | ✓ | |
### Details
1. **DenseDiag**: [`torch.diagonal`](https://pytorch.org/docs/stable/generated/torch.diagonal.html) on the dense
version of the matrix, obtained via
[`torch.Tensor.to_dense`](https://pytorch.org/docs/stable/generated/torch.Tensor.to_dense.html).
Since it materializes the dense matrix, this method requires O(n^2) memory.
It is applicable to both the COO and the CSR format.
```python
d = torch.diagonal(matrix.to_dense())
```
2. **Python for-loop and item access**: Due to the Python loop, this variant is likely to be
inefficient. Moreover, it is only applicable to the COO format, and fails for CSR adjacency
matrices (cf. the details collapsible below).
```python
n = matrix.shape[0]
d = torch.zeros(n, device=matrix.device)
for i in range(n):
    d[i] = matrix[i, i]
```
<details><summary>Full error message</summary>

```python-traceback
Traceback (most recent call last):
d[i] = matrix[i, i]
NotImplementedError: Could not run 'aten::as_strided' with arguments from the 'SparseCsrCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::as_strided' is only available for these backends: [CPU, Meta, QuantizedCPU, BackendSelect, Python, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, AutocastCPU, Autocast, Batched, VmapMode, Functionalize].
CPU: registered at aten/src/ATen/RegisterCPU.cpp:21063 [kernel]
Meta: registered at aten/src/ATen/RegisterMeta.cpp:14951 [kernel]
QuantizedCPU: registered at aten/src/ATen/RegisterQuantizedCPU.cpp:1258 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: fallthrough registered at ../aten/src/ATen/core/NamedRegistrations.cpp:11 [kernel]
Conjugate: fallthrough registered at ../aten/src/ATen/ConjugateFallback.cpp:22 [kernel]
Negative: fallthrough registered at ../aten/src/ATen/native/NegateFallback.cpp:22 [kernel]
ZeroTensor: registered at aten/src/ATen/RegisterZeroTensor.cpp:167 [kernel]
ADInplaceOrView: registered at ../torch/csrc/autograd/generated/ADInplaceOrViewType_0.cpp:2566 [kernel]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_0.cpp:9932 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_0.cpp:11618 [kernel]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:461 [backend fallback]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1063 [kernel]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Functionalize: registered at aten/src/ATen/RegisterFunctionalization_0.cpp:4018 [kernel]
```

</details>
3. **Python for-loop (CSR)**: Here, we iterate over the rows, select the corresponding column indices, and
determine whether one of them corresponds to the diagonal entry. If so, we select the
corresponding value and copy it into the result.
```python
n = matrix.shape[0]
d = torch.zeros(n, device=matrix.device)
crow = matrix.crow_indices()
col = matrix.col_indices()
values = matrix.values()
for i, (start, stop) in enumerate(zip(crow, crow[1:])):
    this_col = col[start:stop]
    this_values = values[start:stop]
    # select the stored value whose column index equals the row index, i.e., the diagonal entry
    v = this_values[this_col == i]
    if v.numel():
        d[i] = v
```
4. **Coalesce**: In this variant, we perform an element-wise multiplication with a sparse identity
matrix, and then use [`torch.Tensor.values`](https://pytorch.org/docs/stable/generated/torch.Tensor.values.html)
and [`torch.Tensor.indices`](https://pytorch.org/docs/stable/generated/torch.Tensor.indices.html)
to obtain the values and indices of the non-zero elements. This operation only supports the COO format.
Moreover, it requires a coalesced sparse COO tensor, i.e., a prior call to
[`torch.Tensor.coalesce`](https://pytorch.org/docs/stable/generated/torch.Tensor.coalesce.html).
```python
n = matrix.shape[0]
d = torch.zeros(n, device=matrix.device)
# `eye` denotes a sparse COO identity matrix of shape (n, n), e.g., constructed via
# torch.sparse_coo_tensor(torch.arange(n).repeat(2, 1), torch.ones(n), size=(n, n))
d_sparse = (matrix * eye).coalesce()
indices = d_sparse.indices()
values = d_sparse.values()
# on the diagonal, row and column indices coincide, so both rows of `indices` are identical
d[indices] = values
```
5. **Manual Coalesce**: Since the most expensive part of the previous solution is the coalesce step,
we investigate a variant which directly operates on the raw, uncoalesced `_values` and `_indices`.
Thereby, we avoid having to coalesce the non-diagonal entries. Since we operate on the uncoalesced
view, there may be multiple entries for the same index, which need to be aggregated via
[`scatter_add_`](https://pytorch.org/docs/stable/generated/torch.Tensor.scatter_add_.html).
A small sanity check against the dense diagonal is shown after this list.
```python
n = matrix.shape[0]
d = torch.zeros(n, device=matrix.device)
# raw, possibly uncoalesced indices and values (private methods of sparse COO tensors)
indices = matrix._indices()
# diagonal entries have equal row and column index
mask = indices[0] == indices[1]
diagonal_values = matrix._values()[mask]
diagonal_indices = indices[0][mask]
# duplicate entries for the same position are summed up
d = d.scatter_add(dim=0, index=diagonal_indices, src=diagonal_values)
```
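As a quick sanity check (not part of the benchmark), the sparse extraction can be compared against the dense diagonal on a small random matrix. The helper `manual_coalesce_diagonal` below is a hypothetical wrapper around variant 5, written here only for illustration.
```python
import torch

def manual_coalesce_diagonal(matrix: torch.Tensor) -> torch.Tensor:
    """Hypothetical wrapper around variant 5 (manual coalesce)."""
    n = matrix.shape[0]
    d = torch.zeros(n, device=matrix.device)
    indices = matrix._indices()
    mask = indices[0] == indices[1]
    return d.scatter_add(dim=0, index=indices[0][mask], src=matrix._values()[mask])

n = 8
dense = torch.rand(n, n) * (torch.rand(n, n) < 0.3)  # small random matrix with ~30% non-zeros
sparse = dense.to_sparse()
assert torch.allclose(manual_coalesce_diagonal(sparse), torch.diagonal(dense))
```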
## Experiments
### Environment
The experiments were conducted on synthetic sparse adjacency matrices of varying sizes
(500 - 512,000 nodes) and densities (0.001% - 1% non-zero entries). The matrices may not be
fully representative of natural adjacency matrices, since the non-zero indices were chosen
uniformly at random. Moreover, we did not guarantee that there are no repeated index pairs, i.e.,
the coalesced matrices may be sparser than the original ones.
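For illustration, such a synthetic matrix could be generated roughly as follows; this is only a sketch, and `random_sparse_adjacency` is a hypothetical helper rather than the exact benchmark code.
```python
import torch

def random_sparse_adjacency(n: int, density: float, device: str = "cpu") -> torch.Tensor:
    """Sketch: sparse COO adjacency matrix with uniformly random (possibly repeated) indices."""
    nnz = int(density * n * n)
    indices = torch.randint(n, size=(2, nnz), device=device)
    values = torch.ones(nnz, device=device)
    # repeated index pairs are not removed; coalescing would sum them up
    return torch.sparse_coo_tensor(indices, values, size=(n, n), device=device)
```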
For measuring execution times, we used [PyTorch's built-in benchmarking utilities
](https://pytorch.org/docs/stable/benchmark_utils.html), which ensure proper
device synchronization and warm-up phases for experiments with GPUs.
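For a rough idea of how a single measurement with these utilities can look, consider the sketch below; the actual benchmark script may differ, and `extract_diagonal` and `matrix` are placeholders.
```python
from torch.utils import benchmark

timer = benchmark.Timer(
    stmt="extract_diagonal(matrix)",
    globals={"extract_diagonal": extract_diagonal, "matrix": matrix},
    num_threads=1,
)
measurement = timer.blocked_autorange()
print(measurement.median)  # median runtime in seconds
```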
The experiments were conducted on a server running Ubuntu 20.04 with an
Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (64 cores) and one RTX 8000 GPU. We used PyTorch 1.11
for our experiments. We did not investigate scaling effects with respect to the number of CPU threads,
and left the thread count at its default, i.e., the CPU numbers are for single-core execution.
After some initial experiments with smaller adjacency matrices, we decided to run only the
top-performing variants on the larger matrices. Moreover, we limited the size of the experiments
with densification due to their excessive memory requirements. Finally, we excluded experiments
where the matrices have more than 10M non-zero entries (i.e., for the largest matrix sizes, we only
considered the sparser settings).
## Results
The following plot shows the median runtime in seconds for the different approaches for various
matrix sizes and densities.
![Comparison](/benchmark-sparse-diagonal/img/comparison.svg)
Overall, we observe, as expected, increasing runtime with matrix size and density.
Approaches relying on Python for-loops exhibit the worst scaling behaviour, likely due to the
overhead of Python loops compared to the other variants, where the looping happens inside PyTorch.
On GPU, the densification approach is competitive in terms of runtime on smaller matrices, but
its memory requirement grows quadratically with the number of nodes. The manual coalescing approach
performs best and is up to multiple orders of magnitude faster than the full coalescing variant.
While it accesses private methods of sparse tensors (which themselves are considered beta
as of PyTorch 1.11), and thus may break with future releases, the effort of replacing it with any
of the other implementations, or, even better, a variant built into PyTorch, is negligible.
We thus opt for using the manual coalescing approach for
[PyKEEN#918](https://github.com/pykeen/pykeen/pull/918).