Parallel computing

pmatmat functions

lazylinop.wip.parallel.pmatmat(L, method='thread', nworkers=None, use_matvec=False, **kwargs)

Frontend function to build a parallelized LazyLinOp version of L. Precisely, L.matmat/__matmul__ is parallelized according to the specified method.

This function parallelizes the product L @ A in one of two ways (a minimal sketch of the underlying column splitting follows the list):

1. (use_matvec=False) Evenly assign blocks of columns of A to the workers, which compute the product block by block in parallel using L's pre-defined matmat.

2. (use_matvec=True) Define a parallelized matmat using L's pre-defined matvec. Each worker makes a series of sequential matvec calls, in parallel with the other workers, on the columns of A assigned to it as in 1.
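
The following is only an illustration of this column splitting using a plain thread pool; it is not the library's implementation, and the even block assignment via numpy.array_split is an assumption made for the sake of the example.

>>> import numpy as np
>>> from multiprocessing.pool import ThreadPool
>>> rng = np.random.default_rng(0)
>>> L = rng.random((15, 20))  # stands in for any operator providing matmat/__matmul__
>>> A = rng.random((20, 32))
>>> nworkers = 4
>>> col_blocks = np.array_split(np.arange(A.shape[1]), nworkers)  # one block of columns per worker
>>> with ThreadPool(nworkers) as pool:
...     parts = pool.map(lambda cols: L @ A[:, cols], col_blocks)  # method 1: one matmat per block
>>> np.allclose(np.hstack(parts), L @ A)
True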

Warning

The choice between methods 1 and 2 has consequences for computing performance. In most cases method 1 should be more efficient; it is the default (use_matvec=False).

Args:
L: (LazyLinOp)

The operator to parallelize.

method: (str)

The parallelization method to use ('thread' by default; the process-based and 'mpi' methods correspond to the specialized functions documented below).

nworkers: (int)

The number of workers used for parallelization. By default, it is the number of CPUs available on the system. This parameter is ignored for the 'mpi' method (the number of workers is fixed externally by the mpiexec/mpirun command).

use_matvec: (bool)

If True, the matvec function of L is used for parallelization; otherwise only matmat is used.

**kwargs: (unpacked dict)

Specialized arguments corresponding to the method used.

Warning

To use pmatmat efficiently (i.e. without oversubscribing), the multithreading of the underlying libraries should be disabled. For example, set the OMP_NUM_THREADS, MKL_NUM_THREADS or OPENBLAS_NUM_THREADS environment variables to '1' (this must be done before importing numpy or any other underlying library, at the start of your script). The variables to set depend on the LazyLinOp matmat/matvec being parallelized; if in doubt, set all of them to '1'. An alternative to environment variables is the threadpoolctl library (see the snippet after the example below).

Returns:

A specialized LazyLinOp that is able to compute the product L @ A in parallel according to the chosen method.

Example:
>>> # disable OpenMP before all things
>>> from os import environ
>>> environ['OMP_NUM_THREADS'] = '1'
>>> environ['MKL_NUM_THREADS'] = '1'
>>> environ['OPENBLAS_NUM_THREADS'] = '1'
>>> from lazylinop import aslazylinop
>>> from lazylinop.wip.parallel import pmatmat
>>> shape = (15, 20)
>>> import numpy as np
>>> M = np.random.rand(*shape)
>>> A = np.random.rand(shape[1], 32)
>>> L = aslazylinop(M)
>>> pL = pmatmat(L)
>>> # pL's matmat is parallelized using the default 'thread' method
>>> LA = L @ A # sequential mul
>>> pLA = pL @ A # parallel mul
>>> np.allclose(pLA, LA)
True
>>> np.allclose(pLA, M @ A)
True
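
The warning above mentions threadpoolctl as an alternative to the environment variables set at the top of this example. Assuming threadpoolctl is installed, the same limit can be applied locally around the product (reusing pL, A and LA from above):

>>> from threadpoolctl import threadpool_limits
>>> with threadpool_limits(limits=1):  # limit BLAS/OpenMP threads for this block only
...     pLA2 = pL @ A
>>> np.allclose(pLA2, LA)
True
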
lazylinop.wip.parallel.pmatmat_multithread(L, nworkers=None, use_matvec=False)

Implements pmatmat() using Python threads (threading package) for parallelization.

Warning

Please note that the GIL can significantly slow down the computation, depending on how L.matmat/matvec is implemented. If GIL issues cannot be avoided, you might use the process-based methods of pmatmat().

Args:
L:

see pmatmat()

nworkers:

see pmatmat()

use_matvec:

see pmatmat()

Returns:

A thread-parallelized LazyLinOp.
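
A short usage sketch, assuming the thread-limiting environment variables from the pmatmat() example have been set beforehand; the choices nworkers=2 and use_matvec=True are arbitrary and only for illustration.

>>> import numpy as np
>>> from lazylinop import aslazylinop
>>> from lazylinop.wip.parallel import pmatmat_multithread
>>> L = aslazylinop(np.random.rand(15, 20))
>>> A = np.random.rand(20, 32)
>>> ptL = pmatmat_multithread(L, nworkers=2, use_matvec=True)
>>> np.allclose(ptL @ A, L @ A)
True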

lazylinop.wip.parallel.pmatmat_multiprocess(L, nworkers=None, use_matvec=False, procs_created_once=False)

Implements pmatmat() using Python multiprocessing.

Args:
L:

see pmatmat()

nworkers:

see pmatmat()

use_matvec:

see pmatmat()

procs_created_once: (bool)

If True, processes are created beforehand and then consume/compute multiplications on the fly. Otherwise (False), processes are created each time a multiplication is computed. procs_created_once=True hardly outperforms the False case, because it needs IPC (Inter-Process Communication) to send slices of A (if we compute L @ A) to the processes created beforehand. With procs_created_once=False, A is automatically shared by process forking (without any copy), but the result slices of the multiplication still require IPC.
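
A hedged usage sketch along the same lines (nworkers=2 is arbitrary; as for the other methods, the multithreading of underlying libraries is assumed to be disabled beforehand).

>>> import numpy as np
>>> from lazylinop import aslazylinop
>>> from lazylinop.wip.parallel import pmatmat_multiprocess
>>> L = aslazylinop(np.random.rand(15, 20))
>>> A = np.random.rand(20, 32)
>>> ppL = pmatmat_multiprocess(L, nworkers=2, procs_created_once=False)
>>> np.allclose(ppL @ A, L @ A)
True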

lazylinop.wip.parallel.pmatmat_mpi(L, use_matvec=False, scatter_A=False)

Implements pmatmat() using MPI (mpi4py).

Note

Do not confuse this function with MPILop, which parallelizes over blocks of rows of L (and must therefore hide a full array). Besides, pmatmat_mpi should not be used to further parallelize an MPILop.

Args:
L:

see pmatmat()

nworkers:

see pmatmat() (for MPI, the number of workers is fixed externally by the mpiexec/mpirun command)

use_matvec:

see pmatmat()

scatter_A: (bool)

If True, the array A is scattered to the MPI slots/cores by slices from the rank-0 MPI slot, which should then be the only one to know the full A. This represents an extra cost, which is why the option is False by default.
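
As a rough sketch, a script using pmatmat_mpi could look as follows (launched with, e.g., mpiexec -n 4 python pmatmat_mpi_example.py). The default scatter_A=False is kept so that every rank builds the same A; the assumptions that all ranks call the product and that the rank-0 process ends up with the full result are illustrative, not guaranteed by this documentation.

# hypothetical script pmatmat_mpi_example.py
import numpy as np
from mpi4py import MPI
from lazylinop import aslazylinop
from lazylinop.wip.parallel import pmatmat_mpi

rng = np.random.default_rng(42)       # fixed seed: every rank builds the same data
L = aslazylinop(rng.random((15, 20)))
A = rng.random((20, 32))

pL = pmatmat_mpi(L, scatter_A=False)  # default: A is not scattered, every rank knows it
pLA = pL @ A                          # the product is computed in parallel over the MPI slots

if MPI.COMM_WORLD.rank == 0:          # assumption: the rank-0 process holds the aggregated result
    print(np.allclose(pLA, L @ A))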

MPILop operator

class lazylinop.wip.parallel.MPILop(*args, **kwargs)

Implements a LazyLinOp based on MPI to compute the multiplication (matmat and rmatmat).

This operator basically distributes slices of the underlying matrix to the MPI processes, which compute their part of the multiplication; the result slices are then aggregated by the rank-0 MPI process.

Note

This operator is not to be confused with pmatmat_mpi().

See also: pmatmat().

MPILop constructor.

Args:
shape:

the shape of the operator (it must match the mat_npz array; the shape is needed explicitly because only the rank-0 MPI process reads the file).

mat_npz:

the file (in npz format) to load the array from.

mat_dtype:

the dtype of the mat_npz array (needed explicitly because only the rank-0 MPI process reads the file).

bcast_op:

True (the default) to broadcast the multiplication operand of the operator to all MPI processes (from the rank-0 process), or False to let each process load it locally. Typically, if only the rank-0 process gets the operand (for example through a prompt that a user fills with the information defining the operand), use bcast_op=True. Otherwise, if all processes have access to the operand, there is no need to broadcast it.

Note: This operator should be used with OpenMP disabled, otherwise the distributed multiplication won't be efficient: mixing multithreading (from numpy/BLAS) with multiprocessing is counterproductive.

Example:
>>> # disable OpenMP before all things
>>> from os import environ
>>> environ['OMP_NUM_THREADS'] = '1'
>>> from lazylinop.wip import MPILop
>>> shape = (15, 20)
>>> import numpy as np
>>> M = np.random.rand(*shape)
>>> np.savez('M.npz', M)
>>> MPILop(shape, 'M.npz', np.dtype('float64'), bcast_op=False)
<15x20 MPILop with unspecified dtype>

See also bm_np_mul.py and bm_mpilop.py for valuable examples and 2023-07-04 – MPILop.
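
As a complement, a rough sketch of an actual multiplication with the operator, in a script launched with, e.g., mpiexec -n 4 python mpilop_example.py. bcast_op=False is used, so every rank is assumed to build the same operand locally; the file creation on rank 0 and the barrier are illustrative details, not requirements stated by this documentation.

# hypothetical script mpilop_example.py
from os import environ
environ['OMP_NUM_THREADS'] = '1'      # disable OpenMP before importing numpy
import numpy as np
from mpi4py import MPI
from lazylinop.wip import MPILop

shape = (15, 20)
if MPI.COMM_WORLD.rank == 0:          # only rank 0 creates the npz file
    np.savez('M.npz', np.random.rand(*shape))
MPI.COMM_WORLD.barrier()              # wait until the file exists

Lop = MPILop(shape, 'M.npz', np.dtype('float64'), bcast_op=False)
rng = np.random.default_rng(0)        # same seed: every rank builds the same operand
A = rng.random((shape[1], 3))
LA = Lop @ A                          # distributed matmat; rank 0 aggregates the result slices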

lazylinop.wip.parallel.MPILop.__init__(self, shape, mat_npz, mat_dtype, bcast_op=True)

MPILop constructor. The arguments, notes and example are the same as in the MPILop class documentation above.