Lazylinop intra news

Note

The goal of this section is to present news intended for the core team of developers. It should help communicate about work currently or recently in progress. A developer news item can later become public, provided we make some adjustments (see Lazylinop news).

Summary of news

2024-03-06 Operator diagonal extraction

Below are the results obtained when measuring the computation time of several methods used to extract the diagonal of different LazyLinOp-s with the diag() function.

Information:

  • The suffix bs50 in the figure legend refers to a batch size of 50% of the operator size (see diag()).

  • Not all operators support CSC matrices as operands (e.g. fft(), fwht(), convolve()). This is indicated in the figure.

Conclusion: 'toarray' is the fastest method for reasonable dimensions, so it has been chosen as the default method. However, users are warned in the documentation that it has the largest memory cost: it can exceed the memory limit for a very large operator, while 'canonical_vectors_csc' and 'slicing' should still work easily. The extract_batch argument allows setting a trade-off between memory space and execution time.
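For illustration, here is a minimal sketch (not the library implementation) of the batched canonical-vector idea behind 'canonical_vectors_csc': multiply L by a sparse batch of canonical vectors and read the matching diagonal entries. The helper name, the default batch size and the real-valued assumption are ours.

import numpy as np
from scipy.sparse import eye as speye

def diag_by_canonical_vectors(L, extract_batch=100):
    # Columns of a sparse identity are canonical vectors; L @ (a batch of
    # them) gives a batch of columns of L, from which the diagonal entries
    # are read. L is assumed to accept a CSC operand (as a LazyLinOp does).
    n = min(L.shape)
    d = np.empty(n)                              # real-valued L assumed
    E = speye(L.shape[1], n, format='csc')
    for j1 in range(0, n, extract_batch):
        j2 = min(j1 + extract_batch, n)
        block = (L @ E[:, j1:j2])[j1:j2]         # (j2-j1) x (j2-j1) block
        block = block.toarray() if hasattr(block, 'toarray') else np.asarray(block)
        d[j1:j2] = np.diagonal(block)
    return d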

_images/extract_diag_benchmark_plot.png

Scripts to reproduce (note that the needed dependencies are located in the benchmark directory of the lazylinop project):

2024-02-01 pmatmat benchmarks on basicops (more dimensions)

The figure below shows the same kind of results as 2024-01-30 pmatmat benchmarks on basicops but for many dimensions (number of columns of A), still on the same hardware configurations as 2024-01-24 pmatmat block_diag benchmarks.

_images/pmatmat_benchmark_plot.png

Unless a computing center is available, in which case MPI can be interesting, the multithread option prevails. Some of the basicops could be parallelized by default if the number of columns of A is large enough (this will probably be discussed later). Parallelization of immediate operators such as zeros() must be avoided.

The script to reproduce the figure is pmatmat_benchmark_plot.py. The scripts for performing the benchmark are the same as in 2024-01-30 pmatmat benchmarks on basicops (note that the needed dependencies are located in the benchmark directory of the lazylinop project).

2024-01-30 pmatmat benchmarks on basicops

The same type of benchmarks as in 2024-01-24 pmatmat block_diag benchmarks has been written for other basicops of lazylinop to compare (on the same hardware configurations) the performance of the different parallelization methods.

The benchmarks measure the time to compute L @ A 30 times, with A a matrix of \(2^{15}\) columns (and L a LazyLinOp parallelized with lazylinop.wip.parallel.pmatmat()).
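For reference, the measurement can be sketched as below. This is only a hedged illustration: the import paths, the pmatmat() method keyword and the operator sizes are assumptions to be checked against the API doc and the benchmark scripts; smaller dimensions are used here so the sketch runs on modest hardware.

from timeit import default_timer as timer
import numpy as np
from lazylinop import eye
from lazylinop.wip.parallel import pmatmat

# One of the benchmarked basicops; the real benchmark uses 2**15 columns for A.
L = eye(2**12, 2**12)
A = np.random.rand(L.shape[1], 2**12)

pL = pmatmat(L, method='thread')   # assumed keyword; also 'process' or 'mpi'
start = timer()
for _ in range(30):
    R = pL @ A
print('elapsed:', timer() - start, 's')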

Analysis notes on results (table below):

  • For zeros() parallelizing is useless because the result is obviously immediate. Something could be done to detect when a user tries to do it.

  • For ones(), block_diag(), hstack(), vstack(), sum() and kron(), ‘mpi’ is the best method (provided a large amount of computing resources is available). ‘thread’ is the second best method, but in the general case (with no computing center available) it should be considered the best method (ahead of ‘process’ and, of course, the sequential method).

  • For eye() and diag() the ‘thread’ method is even better (probably because the workload is not large enough to make the ‘mpi’ method worth it).

Conclusion:

  • ‘thread’ should be the default method for the tested basicops that are not immediate (unlike zeros()).

  • A safeguard would be useful to avoid trying to parallelize an immediate LazyLinOp (e.g. zeros()).

  • The benchmarks should be extended to other dimensions before confirming this decision.

Operator      Method             Computation time (s)
------------  -----------------  --------------------
block_diag    sequential         861.24
              ‘thread’           63.46
              ‘process’          450.79
              ‘mpi’              42.11
              pylops.BlockDiag   2947.20
hstack        sequential         495.15
              ‘thread’           23.16
              ‘process’          258.62
              ‘mpi’              4.78
              pylops.HStack      2951.53
vstack        sequential         678.11
              ‘thread’           57.50
              ‘process’          480.36
              ‘mpi’              42.54
              pylops.VStack      2855.09
diag          sequential         33.25
              ‘thread’           20.15
              ‘process’          236.79
              ‘mpi’              38.21
              pylops.Diagonal    142.26
eye           sequential         36.15
              ‘thread’           19.57
              ‘process’          168.85
              ‘mpi’              38.10
              pylops.Identity    143.52
zeros         sequential         0.001
              ‘thread’           16.07
              ‘process’          150.18
              ‘mpi’              37.95
              pylops.Zero        42.86
ones          sequential         62.35
              ‘thread’           38.6
              ‘process’          1225.10
              ‘mpi’              39.60
              pylops             unavailable
sum           sequential         580.52
              ‘thread’           26.78
              ‘process’          243.42
              ‘mpi’              4.92
              pylops             unavailable
kron          sequential         486.76
              ‘thread’           55.28
              ‘process’          178.44
              ‘mpi’              41.26
              pylops.Kronecker   268.89

Benchmark scripts:

2024-01-24 pmatmat block_diag benchmarks

lazylinop.wip.parallel.pmatmat() has recently been added as a WiP module.

Benchmarks have been written to compare the performance of the different parallelization methods used on the lazylinop.block_diag() operator L.

L is composed of 20 blocks of size \(512^2\). The benchmarks measure the time to compute L @ A 30 times, with A a matrix of \(2^{15}\) columns (and L a LazyLinOp parallelized with pmatmat()).

Method             Computation time (s)
-----------------  --------------------
sequential         598.92
‘thread’           62.87
‘process’          435.28
‘mpi’              42.01
pylops.BlockDiag   2995.71

Note that sequential, ‘thread’, ‘process’ and pylops.BlockDiag were all executed on the same CBP machine apollo4air3 using 16 workers (threads or processes in the parallel cases; in the sequential case it was left to numpy how to use the resources). The apollo4air3 CPU is an Intel(R) Xeon(R) Gold 5218 @ 2.30GHz with 32 cores (64 threads with hyperthreading).

Note, however, that ‘mpi’ was tested separately at the PSMN computing center, on the Lake-flix partition. The node configuration is: (CPU) Gold 6242 @ 2.8GHz, 32 cores, 384 GiB of RAM (12 GiB/core), InfiniBand 56 Gb/s. A total of 14 nodes with 16 cores used per node, that is 224 cores, were used to compute L @ A and outperform the ‘thread’ method run on apollo4air3.

Unsurprisingly, threading is much more efficient, but of course, provided a large enough number of computing nodes (and cores), it is possible to match and even outperform the threading performance with MPI (and most likely with Dask or any distributed computing environment). The ‘process’ method is still useful for special cases where a GIL issue occurs (see lazylinop.wip.parallel.pmatmat_multithread()). The reason why the ‘process’ method is very slow (although quite a bit faster than pylops’) is most likely the IPC overhead: result slices must be transferred back to the main process, whereas this is almost free in the ‘thread’ method thanks to the memory shared between threads.

About pylops.BlockDiag, it is interesting to note that an nproc argument is available in order to parallelize computations using multiprocessing, as in our ‘process’ method. However, according to the doc, BlockDiag parallelizes L @ A by assigning the computation of block(L, i) @ A to one or several processes. This strategy differs from ours, which computes the product in parallel by assigning the partial product L @ cols(A, j1, j2) to one worker (see the sketch below).
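For clarity, here is a minimal thread-based sketch of that column-slice strategy (this is not the pmatmat() code, just an illustration of assigning L @ A[:, j1:j2] to each worker; the helper name is ours):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def matmat_by_col_slices(L, A, nworkers=4):
    # Each worker computes the partial product L @ A[:, j1:j2] and writes
    # it into the preallocated result (no inter-process transfer needed,
    # since threads share memory).
    out = np.empty((L.shape[0], A.shape[1]), dtype=A.dtype)
    bounds = np.linspace(0, A.shape[1], nworkers + 1, dtype=int)

    def work(j1, j2):
        out[:, j1:j2] = L @ A[:, j1:j2]

    with ThreadPoolExecutor(nworkers) as pool:
        for f in [pool.submit(work, bounds[k], bounds[k + 1])
                  for k in range(nworkers)]:
            f.result()   # re-raise any worker exception
    return out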

Benchmark scripts:

These benchmarks will be extended to other lazylinop basic operators.

2023-12-22 Merge of padding functions

Padding functions have been merged into lazylinop.ppadder() and evaluated in order to ensure that we keep the best performance available in the different configurations by using the proper implementation. The focus is also put on the larger feature set of the merged function.

Reference issues: #45, #5

At start, we had two functions:

1. lazylinop.zpad(), based on the LazyLinOp lazylinop.hstack() and lazylinop.vstack() operations.

2. lazylinop.basicops.kron_pad(), based on the lazylinop.kron function.

Note

The function lazylinop.wip.signal.mpad() has been considered separately and optimized with the new function lazylinop.mpad2(). See the related issue: #57.

Extra features:

  • pad() is capable of padding a LazyLinOp but wip.signal.pad() is not, because the latter cannot pad any 2D array/operator (it needs to flatten it before proceeding, but you can’t – currently – flatten a LazyLinOp without a toarray). The same holds for a scipy sparse matrix (no ravel() function).

  • pylops.Pad is not capable of padding a LazyLinOp, nor any LinearOperator (maybe this is a bug), nor a scipy sparse matrix. Example:

import numpy as np
from pylops import Pad, aslinearoperator
pad_width = ((2, 3), (4, 5))
X = np.random.rand(10, 10)
Pad(X.shape, pad_width) @ aslinearoperator(X)  # doesn't work
# while the following works
Pad(X.shape, pad_width) @ X  # works
  • Another extra feature of pad(): any constant values can be padded (with possibly different ‘before’ and ‘after’ values on the two dimensions), not only zeros as in the other functions (pylops, zpad() and wip.signal.pad()). The reference behavior is that of numpy.pad (see the example below).
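For reference, the numpy.pad behavior that this feature mirrors (plain numpy, independent of lazylinop):

import numpy as np

X = np.arange(6, dtype=float).reshape(2, 3)
# 1 row before / 2 rows after, 1 column before / 2 columns after,
# each side padded with its own constant value.
Y = np.pad(X, pad_width=((1, 2), (1, 2)),
           mode='constant', constant_values=((-1., -2.), (7., 8.)))
print(Y)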

The following Pad Benchmark Figure confirms two things:

  1. We keep the best performance with lazylinop.ppadder().

  2. We do as well as, or better than, pylops.

Pad Benchmark Figure

Legend:

  • Aliases:
    • newpad identifies lazylinop.ppadder(),

    • spad identifies lazylinop.basicops.kron_pad(),

    • pylops identifies pylops.Pad,

    • zpad() is not ambiguous.

  • Benchmarked operations:
    • mat_times: operator @ 2d-array

    • vec_times: operator @ 1d-vec

    • lop_vec_times: operator @ other-LazyLinOp @ 1d-vec

    • lop_mat_times: operator @ other-LazyLinOp @ 2d-array

    • sparse_mat_times: operator @ scipy-sparse_matrix

_images/benchmark_pad_plot.png

The benchmark script is benchmark_pad.py and the plotting script is benchmark_pad_plot.py.

2023-10-19 – Spectral norm of a LazyLinOp

We added spectral_norm(), which computes an approximate spectral norm of any LazyLinOp. Compared to scipy’s (1.10.1) estimate_spectral_norm, it allows setting an accuracy criterion (the thres argument) to stop the power iteration algorithm, which scipy also uses.

Besides, it seems to be faster on dense arrays as shown by the small test below. That’s partly due to the thres parameter.

In [1]: from numpy.random import rand

In [2]: from scipy.linalg.interpolative import estimate_spectral_norm

In [3]: from lazylinop import aslazylinop

In [4]: from lazylinop.wip.linalg import spectral_norm

In [5]: M = rand(1024, 1024)

In [6]: L = aslazylinop(M)

In [7]: timeit spectral_norm(L)
4.51 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: timeit estimate_spectral_norm(L)
16.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
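For context, here is a minimal sketch of the power-iteration-with-threshold idea (this is not the spectral_norm() implementation; its exact stopping rule and signature may differ). It works with any object supporting @, .T and .conj(), e.g. a numpy array.

import numpy as np

def power_iteration_norm2(L, thres=1e-3, maxiter=1000, seed=0):
    # Estimate ||L||_2 by power iteration on L^H L, stopping when the
    # relative change of the estimate falls below `thres`.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(L.shape[1])
    x /= np.linalg.norm(x)
    prev = 0.0
    for _ in range(maxiter):
        y = L.T.conj() @ (L @ x)
        s = np.linalg.norm(y)        # tends to ||L||_2 ** 2
        if s == 0:
            return 0.0
        x = y / s
        cur = np.sqrt(s)
        if abs(cur - prev) <= thres * cur:
            break
        prev = cur
    return cur

# e.g. power_iteration_norm2(np.random.rand(1024, 1024))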

2023-07-04 – MPILop

A new MPI-based LazyLinOp has been added to the lazylinop package. This operator, named MPILop, allows parallelizing operator-vector or operator-matrix multiplications over an arbitrary number of MPI slots/nodes. It can significantly speed up any algorithm based on these multiplications, provided the operator of interest is large enough to be worth it.

Below we give a small example in which an operator of shape 81930 x 2040 is multiplied by a vector, both with an MPILop and with an equivalent numpy array. The multiplication is repeated ten thousand times in each case. MPILop shows a speedup of approximately 3 when using 40 cores of an Intel(R) Xeon(R) Platinum 9242 CPU clocked at 2.30 GHz (this CPU provides a total of 48 cores).

The numpy script is bm_np_mul.py, the MPILop script is bm_mpilop.py.

$ python bm_np_mul.py vec
numpy time: 103.30621838569641
r: [2030.96155502 2033.24868084 2021.15729815 ... 2049.91067706 2025.83520511
 2047.50280854]
$ mpiexec -n 40 python bm_mpilop.py vec
MPI time: 36.17019081115723
r: [2030.96155502 2033.24868084 2021.15729815 ... 2049.91067706 2025.83520511
2047.50280854]
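To illustrate the principle only, here is a hand-written mpi4py sketch of the row-distribution idea (this is not the MPILop implementation; the file name and the even row split are assumptions):

# Run e.g. with: mpiexec -n 4 python mpilop_sketch.py
# Each rank owns a horizontal slice of the operator rows, computes its
# part of L @ x, and the partial results are gathered on every rank.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_rows, n_cols = 81920, 2040                 # chosen divisible by the rank count
local_rows = n_rows // size

rng = np.random.default_rng(rank)
L_local = rng.random((local_rows, n_cols))   # this rank's slice of L
x = np.ones(n_cols)                          # same vector on every rank

r_local = L_local @ x                        # partial operator-vector product
r = np.concatenate(comm.allgather(r_local))  # full L @ x on every rank
if rank == 0:
    print('result shape:', r.shape)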

See also the API doc of the MPILop class.