Lazylinop intra news
Note
The goal of this section is to present news intended for the core team of developers. It should help communicate about the work currently or recently in progress. A developer news item can later be made public provided we make some adjustments (see Lazylinop news).
Summary of news
2024-03-06 Operator diagonal extraction
Below are the results obtained when measuring the computation time of several methods used to extract the diagonal of different LazyLinOps with the function diag().
Information:
- The suffix bs50 in the figure legend refers to a batch size of 50% of the operator size (see diag()).
- Not all operators support CSC matrices as operands (e.g. fft(), fwht(), convolve()). This is indicated on the figure.
Conclusion: 'toarray' is the fastest method for reasonable dimensions, so it has been chosen as the default method. However, users are warned in the documentation that it has the largest memory cost: it can potentially exceed the memory limit for a very large operator, while 'canonical_vectors_csc' and 'slicing' should still work easily. The extract_batch argument allows setting up a trade-off between memory space and execution time.
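For reference, here is a minimal sketch of the two extraction strategies compared above (plain numpy, not the actual lazylinop implementation; the function names diag_toarray and diag_by_batches are hypothetical):

import numpy as np

def diag_toarray(L):
    # 'toarray'-style extraction: densify the whole operator, then read
    # the diagonal. Fast, but memory grows with the full operator size.
    return np.diag(L @ np.eye(L.shape[1]))

def diag_by_batches(L, batch_size):
    # Canonical-vector extraction: multiply L by batches of canonical
    # vectors and keep only the diagonal entries, so at most batch_size
    # columns are materialized at once (the memory/time trade-off that
    # extract_batch controls).
    n = min(L.shape)
    d = np.empty(n)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        E = np.zeros((L.shape[1], stop - start))
        E[np.arange(start, stop), np.arange(stop - start)] = 1
        P = L @ E
        d[start:stop] = P[np.arange(start, stop), np.arange(stop - start)]
    return d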
Scripts to reproduce (note that the needed dependencies are located in the benchmark directory of the lazylinop project):
- benchmark: extract_diag_benchmark.py
- figure: extract_diag_benchmark_plot.py
2024-02-01 pmatmat benchmarks on basicops (more dimensions)
The figure below shows the same kind of results as 2024-01-30 pmatmat benchmarks on basicops but for many dimensions (number of columns of A), still on the same hardware configurations as 2024-01-24 pmatmat block_diag benchmarks.
Unless a computing center is available, in which case MPI can be interesting, the multithread option prevails.
Some of the basicops should perhaps be parallelized by default when the number of columns of A is large enough (this will probably be discussed later).
Parallelization of immediate operators such as zeros() must be avoided.
The script to reproduce the figure is pmatmat_benchmark_plot.py. The scripts for performing the benchmark are the same as in 2024-01-30 pmatmat benchmarks on basicops (note that the needed dependencies are located in the benchmark directory of the lazylinop project).
2024-01-30 pmatmat benchmarks on basicops
The same type of benchmarks as in 2024-01-24 pmatmat block_diag benchmarks has been written for other basicops of lazylinop to compare (on the same hardware configurations) the performance of the different parallelization methods.
The benchmarks measure the time to compute 30 times L @ A with A a matrix of \(2^{15}\) columns (and L a LazyLinOp to parallelize with lazylinop.wip.parallel.pmatmat()).
Analysis notes on results (table below):
- For zeros() it is useless to parallelize because the result is obviously immediate. Something may be done to detect when a user tries to do it.
- For ones(), block_diag(), hstack(), vstack(), sum() and kron(), ‘mpi’ is the best method (provided a large amount of computing resources is available). ‘thread’ is the second best method, but in the general case (with no computing center available) it should be considered the best one (before ‘process’ and of course the sequential method).
- For eye() and diag() the ‘thread’ method is even better (probably because the throughput is not high enough to make the ‘mpi’ method worth it).
Conclusion:
- ‘thread’ should be the default method for the tested basicops that are not immediate (contrary to zeros()).
- A safeguard would be useful to avoid trying to parallelize an immediate LazyLinOp (e.g. zeros()); see the sketch after this list.
- The benchmarks should be extended to other dimensions before confirming this decision.
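As a rough illustration of such a safeguard (a hypothetical sketch: lazylinop exposes no is_immediate attribute that we rely on here, and pmatmat()'s real signature may differ), the parallelization could be short-circuited as follows:

from lazylinop.wip.parallel import pmatmat

def safe_pmatmat(L, **kwargs):
    # Hypothetical safeguard: skip parallelization for "immediate"
    # operators (e.g. zeros()-like), whose multiplication result costs
    # nothing to compute, so worker setup would only add overhead.
    if getattr(L, 'is_immediate', False):  # hypothetical attribute
        return L
    return pmatmat(L, **kwargs)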
| Operator   | Method           | Computation time (s) |
|------------|------------------|----------------------|
| block_diag | sequential       | 861.24               |
|            | ‘thread’         | 63.46                |
|            | ‘process’        | 450.79               |
|            | ‘mpi’            | 42.11                |
|            | pylops           | 2947.20              |
| hstack     | sequential       | 495.15               |
|            | ‘thread’         | 23.16                |
|            | ‘process’        | 258.62               |
|            | ‘mpi’            | 4.78                 |
|            | pylops           | 2951.53              |
| vstack     | sequential       | 678.11               |
|            | ‘thread’         | 57.50                |
|            | ‘process’        | 480.36               |
|            | ‘mpi’            | 42.54                |
|            | pylops           | 2855.09              |
| diag       | sequential       | 33.25                |
|            | ‘thread’         | 20.15                |
|            | ‘process’        | 236.79               |
|            | ‘mpi’            | 38.21                |
|            | pylops           | 142.26               |
| eye        | sequential       | 36.15                |
|            | ‘thread’         | 19.57                |
|            | ‘process’        | 168.85               |
|            | ‘mpi’            | 38.10                |
|            | pylops           | 143.52               |
| zeros      | sequential       | 0.001                |
|            | ‘thread’         | 16.07                |
|            | ‘process’        | 150.18               |
|            | ‘mpi’            | 37.95                |
|            | pylops           | 42.86                |
| ones       | sequential       | 62.35                |
|            | ‘thread’         | 38.6                 |
|            | ‘process’        | 1225.10              |
|            | ‘mpi’            | 39.60                |
|            | pylops           | unavailable          |
| sum        | sequential       | 580.52               |
|            | ‘thread’         | 26.78                |
|            | ‘process’        | 243.42               |
|            | ‘mpi’            | 4.92                 |
|            | pylops           | unavailable          |
| kron       | sequential       | 486.76               |
|            | ‘thread’         | 55.28                |
|            | ‘process’        | 178.44               |
|            | ‘mpi’            | 41.26                |
|            | pylops           | 268.89               |
Benchmarks scripts:
- sequential: pmatmat_baseline_benchmark.py,
- ‘thread’, ‘process’ and pylops: pmatmat_multithread_multiproc_benchmark.py,
- ‘mpi’: pmatmat_mpi_benchmark.py with the SLURM script to run the benchmark at PSMN sbatch_pmatmat_mpi_benchmark-lake-flix.sh.
2024-01-24 pmatmat block_diag benchmarks
lazylinop.wip.parallel.pmatmat() has been recently added as a WiP module.
Benchmarks have been written to compare the performance of the different parallelization methods used on the lazylinop.block_diag() operator L. L is composed of 20 blocks of size \(512^2\). The benchmarks measure the time to compute 30 times L @ A with A a matrix of \(2^{15}\) columns (and L a LazyLinOp to parallelize with pmatmat()).
| Method           | Computation time (s) |
|------------------|----------------------|
| sequential       | 598.92               |
| ‘thread’         | 62.87                |
| ‘process’        | 435.28               |
| ‘mpi’            | 42.01                |
| pylops.BlockDiag | 2995.71              |
Note that sequential, ‘thread’, ‘process’ and pylops.BlockDiag were all executed on the same CBP machine apollo4air3, using 16 workers (threads or processes in the parallel cases; in the sequential case it was left to numpy how to use the resources). The apollo4air3 CPU is an Intel(R) Xeon(R) Gold 5218 @ 2.30GHz with 32 cores (and 64 threads with hyperthreading).
Note, however, that ‘mpi’ was tested separately at the PSMN computing center, on the Lake-flix partition. The node configuration is: (CPU) Gold 6242 @ 2.8GHz, 32 cores, 384 GiB of RAM (12 GiB/core), InfiniBand 56 Gb/s. A total of 14 nodes with 16 cores per node, that is 224 cores, was used to compute L @ A and outperform the ‘thread’ method run on apollo4air3.
Not surprisingly, threading seems much more efficient, but of course, given a large enough number of computing nodes (and cores), it is possible to match and outperform the threading performance with MPI (and most likely with Dask or any distributed computing environment).
The ‘process’ method is still useful for special cases where a GIL issue occurs (see lazylinop.wip.parallel.pmatmat_multithread()). The reason why the ‘process’ method is very slow (although quite a bit faster than pylops’) is most likely the extra IPC cost: the result slices have to be transferred to the main process, whereas this is almost free in the ‘thread’ method thanks to the shared memory between threads.
About pylops.BlockDiag, it is interesting to note that an nproc argument is available in the function to parallelize computations using multiprocessing, as in our ‘process’ method. However, according to the doc, BlockDiag parallelizes L @ A by assigning the computation of block(L, i) @ A to one or several processes. This strategy differs from ours, which consists in computing the product in parallel by assigning the partial product L @ cols(A, j1, j2) to one process.
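For illustration, below is a minimal thread-based sketch of this column-slicing strategy (plain numpy and concurrent.futures under our own assumptions, not the actual pmatmat() implementation):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def pmatmat_threads_sketch(L, A, nworkers=16):
    # Assign each partial product L @ cols(A, j1, j2) to one thread;
    # every thread writes its own slice of the result, so nothing has
    # to be transferred between workers (threads share memory).
    bounds = np.linspace(0, A.shape[1], nworkers + 1, dtype=int)
    out = np.empty((L.shape[0], A.shape[1]))

    def work(j1, j2):
        out[:, j1:j2] = L @ A[:, j1:j2]

    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        futures = [pool.submit(work, j1, j2)
                   for j1, j2 in zip(bounds[:-1], bounds[1:])]
        for f in futures:
            f.result()  # propagate any exception raised in a worker
    return out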
Benchmarks scripts:
- sequential: blockdiag_pmatmat_baseline_benchmark.py,
- ‘thread’, ‘process’ and pylops: blockdiag_pmatmat_benchmark.py,
- ‘mpi’: blockdiag_pmatmat_mpi_benchmark.py with the SLURM script to run the benchmark at PSMN sbatch_blockdiag_pmatmat_mpi_benchmark-lake-flix.sh.
These benchmarks will be extended to other lazylinop basic operators.
2023-12-22 Merge of padding functions
Padding functions have been merged into lazylinop.ppadder() and evaluated in order to ensure that we keep the best performance available in the different configurations by using the proper implementation. The focus is also put on the larger number of features of the merged function.
At the start, we had two functions:
1. lazylinop.zpad(), based on the LazyLinOp lazylinop.hstack() and lazylinop.vstack() operations.
2. lazylinop.basicops.kron_pad(), based on the lazylinop.kron function.
Note
The function lazylinop.wip.signal.mpad() has been considered separately and optimized with the new function lazylinop.mpad2(). See the related issue: #57.
Extra features:
- pad() is capable of padding a LazyLinOp, but wip.signal.pad() is not, because it cannot pad any 2D array/operator (it needs to flatten it before proceeding, but you can't – currently – flatten a LazyLinOp without a toarray). The same is true for a scipy sparse matrix (no ravel() function).
- pylops.Pad is capable of padding neither a LazyLinOp nor any LinearOperator (maybe this is a bug), nor a scipy sparse matrix. Example:
import numpy as np
from pylops import Pad, aslinearoperator
pad_width = ((2, 3), (4, 5))
X = np.random.rand(10, 10)
Pad(X.shape, pad_width) @ aslinearoperator(X) # doesn't work
# while the following works
Pad(X.shape, pad_width) @ X # works
Another extra feature of pad(): any constant values can be padded (with possibly different ‘before’ and ‘after’ values on the two dimensions), and not only zeros as in the other functions (pylops, zpad() and wip.signal.pad()). The reference here is numpy.pad.
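For instance, following numpy.pad semantics (pure numpy below, for reference only):

import numpy as np

X = np.arange(6).reshape(2, 3)
# Two rows of -1 before / three rows of -2 after on axis 0,
# four columns of 7 before / five columns of 9 after on axis 1.
Y = np.pad(X, pad_width=((2, 3), (4, 5)),
           constant_values=((-1, -2), (7, 9)))
print(Y.shape)  # (7, 12)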
The following Pad Benchmark Figure confirms two things:
- We keep the best performance with lazylinop.ppadder().
- We do better than pylops, or at least as well.
Pad Benchmark Figure
Legend:
- Aliases:
  - newpad identifies lazylinop.ppadder(),
  - spad identifies lazylinop.basicops.kron_pad(),
  - pylops identifies pylops.Pad,
  - zpad() is not ambiguous.
- Benchmarked operations:
  - mat_times: operator @ 2d-array
  - vec_times: operator @ 1d-vec
  - lop_vec_times: operator @ other-LazyLinOp @ 1d-vec
  - lop_mat_times: operator @ other-LazyLinOp @ 2d-array
  - sparse_mat_times: operator @ scipy-sparse-matrix
The benchmark script is benchmark_pad.py and the plotting script is benchmark_pad_plot.py.
2023-10-19 – Spectral norm of a LazyLinOp
We added spectral_norm(), which computes an approximate spectral norm of any LazyLinOp.
Compared to scipy's (1.10.1) estimate_spectral_norm, it allows setting an accuracy criterion (thres argument) to end the power iteration algorithm, which is also used in scipy.
Besides, it seems to be faster on dense arrays, as shown by the small test below. That is partly due to the thres parameter.
In [1]: from numpy.random import rand
In [2]: from scipy.linalg.interpolative import estimate_spectral_norm
In [3]: from lazylinop import aslazylinop
In [4]: from lazylinop.wip.linalg import spectral_norm
In [5]: M = rand(1024, 1024)
In [6]: L = aslazylinop(M)
In [7]: timeit spectral_norm(L)
4.51 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: timeit estimate_spectral_norm(L)
16.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
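The kind of stopping criterion involved can be sketched as follows (a simplified power iteration on a dense array under our own assumptions, not the actual spectral_norm() implementation; the real thres semantics may differ):

import numpy as np

def spectral_norm_sketch(A, thres=1e-6, maxiter=1000):
    # Power iteration on A^H A; the loop ends once the relative change
    # of the norm estimate falls below thres.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x)
    s_prev = 0.0
    for _ in range(maxiter):
        y = A.conj().T @ (A @ x)   # one application of A^H A
        lam = np.linalg.norm(y)    # estimate of lambda_max(A^H A)
        x = y / lam
        s = np.sqrt(lam)           # spectral norm estimate
        if abs(s - s_prev) <= thres * s:
            break
        s_prev = s
    return s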
2023-07-04 – MPILop
A new MPI based LazyLinOp
has been added in lazylinop package.
This operator, named MPILop
, allows to parallelize operator-vector or operator-matrix
multiplications over an arbitrary number of MPI slots/nodes.
This operator can speed up significantly any algorithm based on these
multiplications assuming that your operator of interest is large enough to be worth it.
Below, we give a small example, in which the operator has a shape of 81930 x 2040 and is multiplied to a vector both using a MPILop and a numpy equivalent array. In this example, a vector is multiplied ten thousand times using a numpy array and MPILop. The latter shows a speedup of approximately 3 when using 40 cores of an Intel(R) Xeon(R) Platinum 9242 CPU cadenced at 2.30 GHz (this CPU provides a total of 48 cores).
The numpy script is bm_np_mul.py, the MPILop script is bm_mpilop.py.
$ python bm_np_mul.py vec
numpy time: 103.30621838569641
r: [2030.96155502 2033.24868084 2021.15729815 ... 2049.91067706 2025.83520511
2047.50280854]
$ mpiexec -n 40 python bm_mpilop.py vec
MPI time: 36.17019081115723
r: [2030.96155502 2033.24868084 2021.15729815 ... 2049.91067706 2025.83520511
2047.50280854]
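The distribution principle can be sketched with mpi4py as follows (a row-block illustration under our own assumptions, not MPILop's actual implementation, which may partition the work differently):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

m, n = 81930, 2040  # shape used in the benchmark above
# Each rank owns a horizontal block of the operator (random here).
rows = np.array_split(np.arange(m), size)[rank]
L_local = np.random.default_rng(0).random((len(rows), n))

# Rank 0 broadcasts the operand vector; every rank computes its
# partial product; rank 0 gathers and concatenates the pieces.
x = comm.bcast(np.ones(n) if rank == 0 else None, root=0)
y_local = L_local @ x
parts = comm.gather(y_local, root=0)
if rank == 0:
    y = np.concatenate(parts)
    print(y.shape)  # (81930,)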
See also the API doc of the MPILop class.