Parallel computing

pmatmat functions

lazylinop.wip.parallel.pmatmat(L, method='thread', nworkers=None, use_matvec=False, **kwargs)

Builds the parallelized LazyLinOp version of L.

L’s matmat/__matmul__ is parallelized according to the selected method. To compute L @ A in parallel, the columns of A are “evenly” assigned to several parallel/concurrent workers. Each worker computes its part of the product.

Note that L.H/L.T multiplication is also parallelized in the same way.

For a better understanding and efficient use please see the pmatmat notebook.
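To illustrate the column-splitting idea, here is a conceptual, thread-based sketch (not the library's actual implementation; the function name naive_pmatmat is hypothetical):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def naive_pmatmat(L, A, nworkers=4):
    # Compute L @ A by "evenly" assigning column blocks of A to workers.
    out = np.empty((L.shape[0], A.shape[1]), dtype=A.dtype)
    col_blocks = np.array_split(np.arange(A.shape[1]), nworkers)
    def work(cols):
        out[:, cols] = L @ A[:, cols]  # each worker computes its part of the product
    with ThreadPoolExecutor(max_workers=nworkers) as executor:
        list(executor.map(work, col_blocks))
    return out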

Args:
L: LazyLinOp

The operator to parallelize.

method: str

The parallelization method: 'thread' (default), 'process' or 'mpi', implemented respectively by pmatmat_multithread(), pmatmat_multiprocess() and pmatmat_mpi().

nworkers: int

The number of workers used for parallelization. By default, this is the number of CPUs/threads available on the system (os.cpu_count()). This parameter is ignored for the 'mpi' method (the number of workers is fixed externally by the mpiexec/mpirun command).
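For instance, a minimal sketch selecting the process-based method with a fixed worker count ('process' as a method value is an assumption, matching pmatmat_multiprocess()):

import numpy as np
from lazylinop import aslazylinop
from lazylinop.wip.parallel import pmatmat

L = aslazylinop(np.random.rand(15, 20))
pL = pmatmat(L, method='process', nworkers=4)  # 4 worker processes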

use_matvec: bool

If True, the matvec function of L is used for parallelization; otherwise only matmat is used (default).

With use_matvec=True, a parallelized matmat is automatically defined using L's pre-defined matvec. Each worker makes a series of sequential calls to matvec, in parallel with (or concurrently with) the other workers, on the columns it has been assigned (a minimal sketch follows the warning below).

Warning

Using matvec for parallelization has consequences for computing performance. In most cases use_matvec=False should be more efficient; this is the default.
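If matvec-based parallelization is still wanted, a minimal sketch (reusing the operator L built in the sketch above):

pL = pmatmat(L, use_matvec=True)  # workers call L's matvec on their assigned columns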

**kwargs: unpacked dict

Specialized arguments corresponding to the method used.

Warning

To avoid CPU oversubscription, it can be useful to disable the multithreading of underlying libraries. For example, NumPy multithreading is disabled by setting the OMP_NUM_THREADS, MKL_NUM_THREADS or OPENBLAS_NUM_THREADS environment variable to '1' (this must be done before importing NumPy). An alternative to environment variables is to use the threadpoolctl or numthreads libraries.
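For example, a sketch combining both approaches (threadpool_limits is threadpoolctl's context manager for limiting thread pools at runtime):

from os import environ
environ['OMP_NUM_THREADS'] = '1'  # must be set before NumPy is first imported
import numpy as np
from lazylinop import aslazylinop
from lazylinop.wip.parallel import pmatmat

L = aslazylinop(np.random.rand(15, 20))
A = np.random.rand(20, 32)
pL = pmatmat(L)

# Alternative: limit the underlying thread pools at runtime with threadpoolctl.
from threadpoolctl import threadpool_limits
with threadpool_limits(limits=1):
    pLA = pL @ A  # parallel workers no longer compete with NumPy's own threads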

SciPy matrices

The parallelization of matmat only supports NumPy array operands. A way to handle L @ S, with S a SciPy matrix, is to convert S to a NumPy array first (e.g., with S.toarray()), as sketched below.
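A sketch of this workaround (reusing pL from the sketch above):

from scipy.sparse import random as sprand
S = sprand(20, 32, density=0.1, format='csr')  # a SciPy sparse operand
pLS = pL @ S.toarray()  # densify S so the parallelized matmat gets a NumPy array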

Returns:

a LazyLinOp that is able to compute the product L @ A in parallel according to the chosen method.

Warning

The example below is not efficient. The only point is to illustrate how to basically use the function. For a better understanding and efficient use please see the pmatmat notebook.

Example:
>>> from lazylinop import aslazylinop
>>> from lazylinop.wip.parallel import pmatmat
>>> shape = (15, 20)
>>> import numpy as np
>>> M = np.random.rand(*shape)
>>> A = np.random.rand(shape[1], 32)
>>> L = aslazylinop(M)
>>> pL = pmatmat(L)
>>> # pL matmat is parallelized using default thread method
>>> LA = L @ A # sequential mul
>>> pLA = pL @ A # parallel mul
>>> np.allclose(pLA, LA)
True
>>> np.allclose(pLA, M @ A)
True
lazylinop.wip.parallel.pmatmat_multithread(L, nworkers=None, use_matvec=False)

Implements pmatmat() using Python threads (threading package) for parallelization.

Warning

Please note that the GIL (Global Interpreter Lock) can significantly slow down the computation, depending on how L.matmat/matvec is implemented. If GIL issues cannot be avoided, you might use the process-based methods of pmatmat() instead.
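For example, switching to a process-based method (with L the operator to parallelize; 'process' as a method value is an assumption, see pmatmat()):

from lazylinop.wip.parallel import pmatmat
pL = pmatmat(L, method='process')  # process-based workers sidestep the GIL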

Args:
L:

see pmatmat()

nworkers:

see pmatmat()

use_matvec:

see pmatmat()

Returns:

A thread-parallelized LazyLinOp.

lazylinop.wip.parallel.pmatmat_multiprocess(L, nworkers=None, use_matvec=False, procs_created_once=False)

Implements pmatmat() using Python multiprocessing.

Args:
L:

see pmatmat()

nworkers:

see pmatmat()

use_matvec:

see pmatmat()

procs_created_once: (bool)

If True, processes are created beforehand and then consume/compute multiplications on the fly. Otherwise (False), processes are created each time a multiplication is computed. procs_created_once=True hardly outperforms the False case, because it needs IPC (Inter-Process Communication) to send the slices of A (when computing L @ A) to the processes created beforehand. With procs_created_once=False, A is automatically shared with the forked processes (without any copy), but the resulting slices of the multiplication still require IPC.
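A minimal usage sketch (with L and A as in the pmatmat() example):

from lazylinop.wip.parallel import pmatmat_multiprocess
pL = pmatmat_multiprocess(L, nworkers=4, procs_created_once=True)
pLA = pL @ A  # the pre-created processes consume the multiplications on the fly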

lazylinop.wip.parallel.pmatmat_mpi(L, use_matvec=False, scatter_A=False)

Implements pmatmat() using MPI (mpi4py).

Args:
L:

see pmatmat()

use_matvec:

see pmatmat()

scatter_A: (bool)

If True, the array A is scattered by slices to the MPI slots/cores from the rank-0 MPI slot, which is then the only one that needs to know the full A. This represents an extra cost, which is why the option is False by default.
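A hypothetical end-to-end sketch (the script name is illustrative; run it with, e.g., mpiexec -n 4 python pmatmat_mpi_example.py):

# pmatmat_mpi_example.py
import numpy as np
from lazylinop import aslazylinop
from lazylinop.wip.parallel import pmatmat_mpi

L = aslazylinop(np.random.rand(15, 20))
pL = pmatmat_mpi(L, scatter_A=True)  # rank 0 scatters slices of A to the other slots
A = np.random.rand(20, 32)  # in a real run, only rank 0 would need to build the full A
pLA = pL @ A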

MPILop operator

class lazylinop.wip.parallel.MPILop(*args, **kwargs)

Implements a LazyLinOp based on MPI to compute the multiplication (matmat and rmatmat).

This operator basically distributes slices of the underlying matrix to MPI processes that compute their part of the multiplication, then the multiplication result slices are aggregated by the rank-0 MPI process.

Note

This operator is not to be confused with pmatmat_mpi(). Besides, the function pmatmat_mpi() should not be used to further parallelize an MPILop.

See also: pmatmat().

MPILop constructor.

Args:
shape:

the shape of the operator (it must match the mat_npz array; it is needed because only the rank-0 MPI process reads the file).

mat_npz:

the file (in npz format) to load the array from.

mat_dtype:

the dtype corresponding to mat_npz (it is needed because only the rank-0 MPI process reads the file).

bcast_op:

True (default) to broadcast the multiplication operand of the operator to all MPI processes (from the rank-0 process), False to load it locally for each process. Typically, if only the rank-0 process gets the operand (for example, through a prompt that a user fills with information that defines the operand), use bcast_op=True. Otherwise, if all processes have access to the operand, there is no need to broadcast it.

Note

This operator should be used with OpenMP disabled, otherwise the distributed multiplication won't be computed efficiently: mixing multithreading (from NumPy) with multiprocessing is often counterproductive.

Example:
>>> # disable OpenMP before all things
>>> from os import environ
>>> environ['OMP_NUM_THREADS'] = '1'
>>> from lazylinop.wip.parallel import MPILop
>>> shape = (15, 20)
>>> import numpy as np
>>> M = np.random.rand(*shape)
>>> np.savez('M.npz', M)
>>> MPILop(shape, 'M.npz', np.dtype('float64'), bcast_op=False)
<15x20 MPILop with unspecified dtype>
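Continuing the example, a hedged sketch of an actual multiplication (the script must be launched with mpiexec, e.g., mpiexec -n 4 python script.py):

lop = MPILop(shape, 'M.npz', np.dtype('float64'), bcast_op=False)
A = np.random.rand(shape[1], 32)
y = lop @ A  # each MPI process computes its slice; rank 0 aggregates the result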

See also bm_np_mul.py and bm_mpilop.py for valuable examples, and 2023-07-04 – MPILop.