Parallel computing
pmatmat functions
- lazylinop.wip.parallel.pmatmat(L, method='thread', nworkers=None, use_matvec=False, **kwargs)
Frontend function to build a parallelized LazyLinOp version of L. Precisely, L.matmat/__matmul__ is parallelized according to the specified method.
This function implements the parallelization of the product L @ A, proceeding in one of two alternative ways:
1. (use_matvec=False) Assign blocks of columns of A evenly to the workers, in order to compute the product by blocks in parallel using L's pre-defined matmat.
2. (use_matvec=True) Define a parallelized matmat using L's pre-defined matvec. Each worker makes a series of sequential calls to matvec, in parallel with the other workers. The matvec calls are made on the columns of A assigned to the worker, as in 1.
Warning
The choice between methods 1 and 2 is not without consequence for computing performance. In most cases method 1 should be more efficient; it is the default method (use_matvec=False).
- Args:
- L: (LazyLinOp) The operator to parallelize.
- method: (str)
'thread': see pmatmat_multithread()
'process': see pmatmat_multiprocess()
'mpi': see pmatmat_mpi()
- nworkers: (int)
The number of workers used for the parallelization. By default, this is the number of CPUs available on the system. This parameter is ignored for the 'mpi' method (it is fixed externally by the mpiexec/mpirun command).
- use_matvec: (bool)
If True, the matvec function of L is used for the parallelization; otherwise only matmat is used.
- **kwargs: (unpacked dict)
Specialized arguments corresponding to the method used.
Warning
To use pmatmat efficiently (i.e., without oversubscribing), the multithreading of underlying libraries should be disabled, for example by setting the OMP_NUM_THREADS, MKL_NUM_THREADS or OPENBLAS_NUM_THREADS environment variables to '1' (notably, this must be done before importing numpy or any other underlying library at the start of your script). The variables to set depend on the LazyLinOp matmat/matvec being parallelized. If in doubt, all of these variables can be set to '1'. An alternative to environment variables is the threadpoolctl library (see the sketch after the example below).
- Returns:
A specialized LazyLinOp that is able to compute the product L @ A in parallel according to the chosen method.
- Example:
>>> # disable OpenMP before all things
>>> from os import environ
>>> environ['OMP_NUM_THREADS'] = '1'
>>> environ['MKL_NUM_THREADS'] = '1'
>>> environ['OPENBLAS_NUM_THREADS'] = '1'
>>> from lazylinop import aslazylinop
>>> from lazylinop.wip.parallel import pmatmat
>>> shape = (15, 20)
>>> import numpy as np
>>> M = np.random.rand(*shape)
>>> A = np.random.rand(shape[1], 32)
>>> L = aslazylinop(M)
>>> pL = pmatmat(L)
>>> # pL matmat is parallelized using the default thread method
>>> LA = L @ A  # sequential mul
>>> pLA = pL @ A  # parallel mul
>>> np.allclose(pLA, LA)
True
>>> np.allclose(pLA, M @ A)
True
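As mentioned in the warning above, the threadpoolctl library can be used instead of environment variables to limit BLAS threading at runtime. A minimal sketch, assuming the threadpoolctl package is installed (the nworkers and use_matvec values are only illustrative):
>>> from threadpoolctl import threadpool_limits
>>> from lazylinop import aslazylinop
>>> from lazylinop.wip.parallel import pmatmat
>>> import numpy as np
>>> M = np.random.rand(15, 20)
>>> A = np.random.rand(20, 32)
>>> pL = pmatmat(aslazylinop(M), nworkers=2, use_matvec=True)
>>> with threadpool_limits(limits=1):  # disable BLAS multithreading for this block
...     pLA = pL @ A
>>> np.allclose(pLA, M @ A)
True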
- lazylinop.wip.parallel.pmatmat_multithread(L, nworkers=None, use_matvec=False)
Implements pmatmat() using Python threads (threading package) for parallelization.
Warning
Please note that the GIL can significantly slow down the computation, depending on how L.matmat/matvec is implemented. If GIL issues cannot be avoided, you might use the process-based methods of pmatmat().
- Returns:
A thread-parallelized LazyLinOp.
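For illustration, a minimal usage sketch (equivalent to pmatmat(L, method='thread'); the nworkers value is illustrative, and BLAS multithreading should be disabled first as explained for pmatmat()):
>>> from os import environ
>>> environ['OMP_NUM_THREADS'] = '1'
>>> import numpy as np
>>> from lazylinop import aslazylinop
>>> from lazylinop.wip.parallel import pmatmat_multithread
>>> M = np.random.rand(15, 20)
>>> A = np.random.rand(20, 32)
>>> pL = pmatmat_multithread(aslazylinop(M), nworkers=2)
>>> np.allclose(pL @ A, M @ A)
True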
- lazylinop.wip.parallel.pmatmat_multiprocess(L, nworkers=None, use_matvec=False, procs_created_once=False)
Implements pmatmat() using Python multiprocessing.
- Args:
- L:
see pmatmat()
- nworkers:
see pmatmat()
- use_matvec:
see pmatmat()
- procs_created_once: (bool)
If True, processes are created beforehand and then consume/compute multiplications on the fly. Otherwise (False), processes are created each time a multiplication is computed. procs_created_once=True hardly outperforms the False case, because it needs IPC (Inter-Process Communication) to send slices of A (when computing L @ A) to the processes created beforehand. With procs_created_once=False, A is automatically shared by process forking (without any copy), but the resulting multiplication slices still require IPC. A usage sketch is given below.
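A minimal usage sketch of this process-based variant (equivalent to pmatmat(L, method='process'); procs_created_once=True and nworkers=2 are shown only to illustrate the options discussed above):
>>> from os import environ
>>> environ['OMP_NUM_THREADS'] = '1'
>>> import numpy as np
>>> from lazylinop import aslazylinop
>>> from lazylinop.wip.parallel import pmatmat_multiprocess
>>> M = np.random.rand(15, 20)
>>> A = np.random.rand(20, 32)
>>> pL = pmatmat_multiprocess(aslazylinop(M), nworkers=2, procs_created_once=True)
>>> np.allclose(pL @ A, M @ A)
True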
MPILop operator
- class lazylinop.wip.parallel.MPILop(*args, **kwargs)
Implements a LazyLinOp based on MPI to compute the multiplication (matmat and rmatmat).
This operator basically distributes slices of the underlying matrix to MPI processes that compute their part of the multiplication, then the multiplication result slices are aggregated by the rank-0 MPI process.
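As a purely conceptual illustration of this distribution scheme (plain numpy, no MPI, assuming for illustration that the matrix is sliced by rows; this is not the actual implementation):
>>> import numpy as np
>>> # split M into row slices, compute each partial product separately,
>>> # then aggregate the result slices as the rank-0 process would
>>> M = np.random.rand(15, 20)
>>> A = np.random.rand(20, 32)
>>> nprocs = 4
>>> row_slices = np.array_split(M, nprocs, axis=0)    # one slice per "process"
>>> partial_results = [S @ A for S in row_slices]     # each computes its part
>>> LA = np.vstack(partial_results)                   # aggregation on "rank 0"
>>> np.allclose(LA, M @ A)
True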
Note
This operator is not to be confused with pmatmat_mpi().
See also: pmatmat().
MPILop constructor.
- Args:
- shape:
the shape of the operator (must match the mat_npz array; it is needed because only the rank-0 MPI process reads the file).
- mat_npz:
the file (in npz format) to load the array from.
- mat_dtype:
the dtype corresponding to mat_npz (it is needed because only the rank-0 MPI process reads the file).
- bcast_op:
True (default) to broadcast the multiplication operand of the operator to all MPI processes (from the rank-0 process), or False to let each process load it locally. Typically, if only the rank-0 process gets the operand (for example through a prompt that a user fills with the information defining the operand), use bcast_op=True; otherwise, if all processes have access to the operand, there is no need to broadcast it.
Note: This operator should be used with OpenMP disabled, otherwise the distributed multiplication won't be computed efficiently: mixing multithreading (from numpy/BLAS) with multiprocessing is counterproductive.
- Example:
>>> # disable OpenMP before all things
>>> from os import environ
>>> environ['OMP_NUM_THREADS'] = '1'
>>> from lazylinop.wip import MPILop
>>> shape = (15, 20)
>>> import numpy as np
>>> M = np.random.rand(*shape)
>>> np.savez('M.npz', M)
>>> MPILop(shape, 'M.npz', np.dtype('float64'), bcast_op=False)
<15x20 MPILop with unspecified dtype>
See also bm_np_mul.py and bm_mpilop.py for valuable examples and 2023-07-04 – MPILop.
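For completeness, a sketch of a script using the operator under MPI; it would be launched with something like mpiexec -n 4 python sketch.py. It assumes mpi4py is available for rank handling and relies on the behaviour described above (the rank-0 process reads the file and aggregates the result); the file name sketch.py and the operand sizes are illustrative.
# sketch.py -- launch with, e.g.: mpiexec -n 4 python sketch.py
from os import environ
environ['OMP_NUM_THREADS'] = '1'  # keep BLAS single-threaded under MPI
import numpy as np
from mpi4py import MPI
from lazylinop.wip import MPILop

shape = (15, 20)
rank = MPI.COMM_WORLD.Get_rank()
if rank == 0:
    M = np.random.rand(*shape)
    np.savez('M.npz', M)  # only the rank-0 process reads this file back
L = MPILop(shape, 'M.npz', np.dtype('float64'), bcast_op=True)
A = np.random.rand(shape[1], 32)  # with bcast_op=True, the rank-0 operand is broadcast
LA = L @ A  # each process computes its slice; results are aggregated on rank 0
if rank == 0:
    print(np.allclose(LA, M @ A))  # expected: True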
- lazylinop.wip.parallel.MPILop.__init__(self, shape, mat_npz, mat_dtype, bcast_op=True)
MPILop constructor.
- Args:
- shape:
the shape of the operator (must match the mat_npz array; it is needed because only the rank-0 MPI process reads the file).
- mat_npz:
the file (in npz format) to load the array from.
- mat_dtype:
the dtype corresponding to mat_npz (it is needed because only the rank-0 MPI process reads the file).
- bcast_op:
True (default) to broadcast the multiplication operand of the operator to all MPI processes (from the rank-0 process), or False to let each process load it locally. Typically, if only the rank-0 process gets the operand (for example through a prompt that a user fills with the information defining the operand), use bcast_op=True; otherwise, if all processes have access to the operand, there is no need to broadcast it.
Note: This operator should be used with OpenMP disabled, otherwise the distributed multiplication won't be computed efficiently: mixing multithreading (from numpy/BLAS) with multiprocessing is counterproductive.
- Example:
>>> # disable OpenMP before all things
>>> from os import environ
>>> environ['OMP_NUM_THREADS'] = '1'
>>> from lazylinop.wip import MPILop
>>> shape = (15, 20)
>>> import numpy as np
>>> M = np.random.rand(*shape)
>>> np.savez('M.npz', M)
>>> MPILop(shape, 'M.npz', np.dtype('float64'), bcast_op=False)
<15x20 MPILop with unspecified dtype>
See also bm_np_mul.py and bm_mpilop.py for valuable examples and 2023-07-04 – MPILop.