In this notebook we'll take a quick look at how to leverage GPU computing power with pyfaust.
Since pyfaust 2.9.0 the API has been extended to make the GPU available directly from the Python wrapper.
Indeed, an independent GPU module (aka gpu_mod) has been developed for this purpose.
The first question you might ask is: does it work on my computer? Here is the answer: the loading of this module is quite transparent. If an NVIDIA GPU is available and CUDA is properly installed on your system, you normally have nothing to do except installing pyfaust to get the GPU implementations at your fingertips. At the end of this notebook we'll see how to load the module manually and how to get further information in case of an error.
It is worth noting two drawbacks of the pyfaust GPU support:
In addition to these drawbacks, please note that the GPU module is still considered to be in beta status, as the code is relatively young and still evolving. However, the API shouldn't change much in the near future.
Let's start with some basic Faust creations on the GPU. Almost all the ways of creating a Faust object in CPU memory are also available to create a GPU Faust.
First of all, creating a Faust using the constructor works seamlessly on GPU; you only need to specify the dev keyword argument, as follows:
from pyfaust import Faust
from numpy.random import rand
M, N = rand(10,10), rand(10,15)
gpuF = Faust([M, N], dev='gpu')
gpuF
- GPU FACTOR 0 (double) DENSE size 10 x 10, addr: 0xdcbf4e0, density 1.000000, nnz 100 - GPU FACTOR 1 (double) DENSE size 10 x 15, addr: 0xdd8ca60, density 1.000000, nnz 150
It's clearly indicated in the output that the Faust object is instantiated in GPU memory (the M and N numpy arrays are copied from CPU to GPU memory). However, it's also possible to check this programmatically:
gpuF.device
'gpu'
While for a CPU Faust you'll get:
Faust([M, N], dev='cpu').device
'cpu'
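Under the hood a Faust simply stores its list of factors; conceptually it represents their product. The following plain numpy sketch (an illustration only, not pyfaust code) mimics what gpuF represents and how the per-factor nnz and density reported above are computed:

```python
import numpy as np

# A Faust is just a list of factors; evaluating it multiplies them out.
M, N = np.random.rand(10, 10), np.random.rand(10, 15)
dense = M @ N                # what Faust([M, N]).toarray() would return
nnz = np.count_nonzero(M)    # "nnz" reported per factor above
density = nnz / M.size       # "density" = nnz / (rows * cols)
```

Since M is a dense random matrix, its density is 1.0 and its nnz is 100, matching the factor listing in the output above.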
In gpuF the factors are dense matrices, but it's also perfectly possible to instantiate sparse matrices on the GPU, just as you can on the CPU side.
from pyfaust import Faust
from scipy.sparse import random, csr_matrix
S, T = csr_matrix(random(10, 15, density=0.25)), csr_matrix(random(15, 10, density=0.05))
sparse_gpuF = Faust([S, T], dev='gpu')
sparse_gpuF
- GPU FACTOR 0 (double) SPARSE size 10 x 15, addr: 0xd4f7cd0, density 0.253333, nnz 38 - GPU FACTOR 1 (double) SPARSE size 15 x 10, addr: 0x14fc6ff0, density 0.053333, nnz 8
You can also create a GPU Faust by explicitly copying a CPU Faust to GPU memory. Actually, at any time you can copy a CPU Faust to the GPU and conversely; the clone() member function exists precisely for this purpose. Below we copy gpuF to CPU and back again to GPU in the new Faust gpuF2.
cpuF = gpuF.clone('cpu')
gpuF2 = cpuF.clone('gpu')
gpuF2
- GPU FACTOR 0 (double) DENSE size 10 x 10, addr: 0x147bd9a0, density 1.000000, nnz 100 - GPU FACTOR 1 (double) DENSE size 10 x 15, addr: 0x14c675e0, density 1.000000, nnz 150
Many of the functions for generating a Faust object on CPU are available on GPU too. It is always the same: you set the dev argument to 'gpu' and you get a GPU Faust instead of a CPU Faust.
For example, the code below successively creates a random GPU Faust, a Hadamard transform GPU Faust, an identity GPU Faust and finally a DFT GPU Faust.
from pyfaust import rand as frand, eye as feye, wht, dft
print("Random GPU Faust:", frand(10,10, num_factors=11, dev='gpu'))
print("Hadamard GPU Faust:", wht(32, dev='gpu'))
print("Identity GPU Faust:", feye(16, dev='gpu'))
print("DFT GPU Faust:", dft(32, dev='gpu'))
Random GPU Faust: - GPU FACTOR 0 (double) SPARSE size 10 x 10, addr: 0xd6c6120, density 0.500000, nnz 50 - GPU FACTOR 1 (double) SPARSE size 10 x 10, addr: 0xdd946d0, density 0.500000, nnz 50 - GPU FACTOR 2 (double) SPARSE size 10 x 10, addr: 0x14c6c890, density 0.500000, nnz 50 - GPU FACTOR 3 (double) SPARSE size 10 x 10, addr: 0x14c6d740, density 0.500000, nnz 50 - GPU FACTOR 4 (double) SPARSE size 10 x 10, addr: 0x14c6e610, density 0.500000, nnz 50 - GPU FACTOR 5 (double) SPARSE size 10 x 10, addr: 0x14c6f530, density 0.500000, nnz 50 - GPU FACTOR 6 (double) SPARSE size 10 x 10, addr: 0x14c70400, density 0.500000, nnz 50 - GPU FACTOR 7 (double) SPARSE size 10 x 10, addr: 0x14c712d0, density 0.500000, nnz 50 - GPU FACTOR 8 (double) SPARSE size 10 x 10, addr: 0x14c721a0, density 0.500000, nnz 50 - GPU FACTOR 9 (double) SPARSE size 10 x 10, addr: 0x14c73070, density 0.500000, nnz 50 - GPU FACTOR 10 (double) SPARSE size 10 x 10, addr: 0x14c73f60, density 0.500000, nnz 50 Hadamard GPU Faust: - GPU FACTOR 0 (double) SPARSE size 32 x 32, addr: 0x14c6f530, density 0.062500, nnz 64 - GPU FACTOR 1 (double) SPARSE size 32 x 32, addr: 0x14c6e610, density 0.062500, nnz 64 - GPU FACTOR 2 (double) SPARSE size 32 x 32, addr: 0x14c6d740, density 0.062500, nnz 64 - GPU FACTOR 3 (double) SPARSE size 32 x 32, addr: 0x14c6c890, density 0.062500, nnz 64 - GPU FACTOR 4 (double) SPARSE size 32 x 32, addr: 0xdd946d0, density 0.062500, nnz 64 Identity GPU Faust: - GPU FACTOR 0 (double) SPARSE size 16 x 16, addr: 0xdd946d0, density 0.062500, nnz 16 DFT GPU Faust: - GPU FACTOR 0 (complex) SPARSE size 32 x 32, addr: 0x14c6a3c0, density 0.062500, nnz 64 - GPU FACTOR 1 (complex) SPARSE size 32 x 32, addr: 0x14c6c890, density 0.062500, nnz 64 - GPU FACTOR 2 (complex) SPARSE size 32 x 32, addr: 0x14c6d740, density 0.062500, nnz 64 - GPU FACTOR 3 (complex) SPARSE size 32 x 32, addr: 0x14c6e610, density 0.062500, nnz 64 - GPU FACTOR 4 (complex) SPARSE size 32 x 32, addr: 0x14c6f530, density 
0.062500, nnz 64 - GPU FACTOR 5 (complex) SPARSE size 32 x 32, addr: 0xd6c6120, density 0.031250, nnz 32
Once you've created GPU Faust objects, you can perform operations on them while staying in the GPU world (that is, with no array transfer to CPU memory). Of course, that's not always possible. For example, consider Faust-scalar multiplication and the Faust-matrix product: in the first case the scalar is copied to GPU memory, and likewise in the second case the matrix is copied from CPU to GPU in order to carry out the computation. In both cases, however, the Faust factors stay in GPU memory and don't move during the computation.
# Faust-scalar multiplication
2*gpuF
- GPU FACTOR 0 (double) DENSE size 10 x 10, addr: 0x14c6e670, density 1.000000, nnz 100 - GPU FACTOR 1 (double) DENSE size 10 x 15, addr: 0xdd8ca60, density 1.000000, nnz 150
As you can see, the first factor's address has changed in the result compared to what it was in gpuF. Indeed, in a scalar multiplication only one factor is actually multiplied; the others don't change and are shared between the Faust being multiplied and the resulting Faust. This is an optimization, and to go further in this direction the factor chosen to be multiplied is the smallest one in memory (not necessarily the first one).
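This optimization relies on scalar multiplication being associative over the factor product: scaling a single factor scales the whole product. A quick numpy check of the identity pyfaust exploits (a sketch, not pyfaust's actual code):

```python
import numpy as np

F1, F2 = np.random.rand(10, 10), np.random.rand(10, 15)
# Scaling the evaluated product...
lhs = 2 * (F1 @ F2)
# ...equals scaling only one factor (here the second one):
rhs = F1 @ (2 * F2)
assert np.allclose(lhs, rhs)
```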
# Faust-matrix product (the matrix is copied to GPU
# then the multiplication is performed on GPU)
gpuF@rand(gpuF.shape[1],15)
array([[12.03377659, 16.38928566, 10.99943303, 14.64638585, 15.10180097, 15.31577758, 14.78077452, 20.52957295, 15.92204739, 18.06766027, 15.7963145 , 14.21081761, 13.15456605, 15.96658355, 15.35098426], [14.05170417, 19.56342933, 13.04819932, 16.70816225, 17.46326246, 18.56864722, 17.34764863, 23.83110741, 18.35370574, 20.54300363, 18.66927476, 16.93012863, 15.70669665, 18.80183248, 17.5896307 ], [18.29575425, 24.92036498, 16.84187022, 22.38577625, 22.45371319, 23.43575176, 22.44748548, 31.0921777 , 24.14895513, 27.08199129, 24.30461575, 21.70208805, 20.03751162, 23.90379561, 23.01133409], [12.0659893 , 16.8673387 , 11.28410304, 14.48887525, 14.7691039 , 16.22373177, 14.8497166 , 20.57766858, 15.70442625, 17.30160988, 16.34628828, 14.45728898, 13.4733362 , 15.57131619, 14.81371431], [14.53799204, 19.6400743 , 12.87698767, 17.95153605, 16.88523763, 18.10321043, 18.19935167, 24.34954868, 18.67210392, 21.52473004, 19.35171877, 16.569914 , 15.59375974, 19.08834618, 17.94800186], [21.28334326, 28.34910167, 19.26842249, 26.21405791, 24.95494806, 27.03503927, 26.00247181, 35.28832206, 27.7446279 , 30.70565365, 28.26693527, 25.12514283, 23.0166663 , 27.97430027, 26.64179637], [13.73410592, 17.88512478, 11.68289297, 16.68226848, 15.26903287, 16.77936704, 16.82946142, 22.42035338, 17.49186224, 19.52779557, 17.97049292, 15.15780818, 14.38081105, 17.92879243, 16.71139883], [13.53226238, 18.36404472, 12.10727412, 16.8842778 , 15.74382448, 17.18578195, 17.11428967, 22.43308374, 17.59617255, 19.68465526, 17.94934552, 15.96907774, 14.82568442, 18.20769237, 17.14133334], [17.01211071, 23.58650292, 15.4334701 , 20.91107463, 20.45996867, 21.41103798, 21.47499384, 28.72080482, 22.29891498, 25.31573514, 22.52557223, 20.07959372, 18.82644436, 23.09548567, 21.56291862], [13.6850099 , 19.11797231, 12.77059877, 16.86645235, 16.74130503, 17.79811476, 17.13957774, 23.33261701, 17.80037305, 20.41449683, 18.44258085, 16.40666374, 15.22653354, 18.16563817, 17.17479868]])
On the contrary, and that matters for optimization, there is no CPU-GPU transfer at all when you create another GPU Faust, named for example gpuF2, directly on the GPU and multiply the two of them like this:
from pyfaust import rand as frand
gpuF2 = frand(gpuF.shape[1],18, dev='gpu')
gpuF3 = gpuF@gpuF2
gpuF3
- GPU FACTOR 0 (double) DENSE size 10 x 10, addr: 0xdcbf4e0, density 1.000000, nnz 100 - GPU FACTOR 1 (double) DENSE size 10 x 15, addr: 0xdd8ca60, density 1.000000, nnz 150 - GPU FACTOR 2 (double) SPARSE size 15 x 17, addr: 0xd6c6120, density 0.294118, nnz 75 - GPU FACTOR 3 (double) SPARSE size 17 x 15, addr: 0x14c6f530, density 0.333333, nnz 85 - GPU FACTOR 4 (double) SPARSE size 15 x 16, addr: 0x14c6e610, density 0.312500, nnz 75 - GPU FACTOR 5 (double) SPARSE size 16 x 15, addr: 0x14c6d740, density 0.333333, nnz 80 - GPU FACTOR 6 (double) SPARSE size 15 x 18, addr: 0x14c6c890, density 0.277778, nnz 75
Besides, it's important to note that the factors of gpuF3 are not duplicated in memory, because they already exist for gpuF and gpuF2; that's an extra optimization: gpuF3 is just a memory view of the factors of gpuF and gpuF2 (the same GPU arrays are shared between Faust objects). The same mechanism applies to CPU Faust objects.
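This works because multiplying two Fausts conceptually just concatenates their factor lists; no factor is multiplied out or copied. A plain numpy sketch of this idea, representing each "Faust" as a list of factors (an illustration only, not pyfaust internals):

```python
import numpy as np
from functools import reduce

# Two "Fausts" as plain factor lists:
A = [np.random.rand(10, 10), np.random.rand(10, 15)]
B = [np.random.rand(15, 12), np.random.rand(12, 18)]

# Their product is just the concatenation of the factor lists...
C = A + B
# ...and it evaluates to the same dense matrix as multiplying
# the two evaluated Fausts:
assert np.allclose(reduce(np.matmul, C),
                   reduce(np.matmul, A) @ reduce(np.matmul, B))
```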
Finally, please note that CPU Faust objects are not directly interoperable with GPU Faust objects. You can try, but it'll end with an error.
cpuF = frand(5,5,5, dev='cpu')
gpuF = frand(5,5,6, dev='gpu')
try:
    print("A first try to multiply a CPU Faust with a GPU one...")
    cpuF@gpuF
except Exception:
    print("it doesn't work, you must either convert cpuF to a GPU Faust or gpuF to a CPU Faust before multiplying.")
print("A second try using conversion as needed...")
print(cpuF.clone('gpu')@gpuF) # this is what you should do
print("Now it works!")
A first try to multiply a CPU Faust with a GPU one... it doesn't work, you must either convert cpuF to a GPU Faust or gpuF to a CPU Faust before multiplying. A second try using conversion as needed... - GPU FACTOR 0 (double) SPARSE size 5 x 5, addr: 0x14c7c040, density 1.000000, nnz 25 - GPU FACTOR 1 (double) SPARSE size 5 x 5, addr: 0x14c7ced0, density 1.000000, nnz 25 - GPU FACTOR 2 (double) SPARSE size 5 x 5, addr: 0x14c7dd60, density 1.000000, nnz 25 - GPU FACTOR 3 (double) SPARSE size 5 x 5, addr: 0x14c7ebf0, density 1.000000, nnz 25 - GPU FACTOR 4 (double) SPARSE size 5 x 5, addr: 0x14c7fa80, density 1.000000, nnz 25 - GPU FACTOR 5 (double) SPARSE size 5 x 5, addr: 0x14c6a3c0, density 1.000000, nnz 25 - GPU FACTOR 6 (double) SPARSE size 5 x 5, addr: 0xcd695a0, density 1.000000, nnz 25 - GPU FACTOR 7 (double) SPARSE size 5 x 5, addr: 0x14c6cf30, density 1.000000, nnz 25 - GPU FACTOR 8 (double) SPARSE size 5 x 5, addr: 0x14c793b0, density 1.000000, nnz 25 - GPU FACTOR 9 (double) SPARSE size 5 x 5, addr: 0x14c7a280, density 1.000000, nnz 25 - GPU FACTOR 10 (double) SPARSE size 5 x 5, addr: 0x14c7b170, density 1.000000, nnz 25 Now it works!
Of course, when we run some code on GPU rather than on CPU, it is clearly to enhance performance. So let's try your GPU and find out whether it is worth it compared to your CPU.
First, measure how much time it takes on CPU to compute a Faust norm and the dense array corresponding to the product of its factors:
from pyfaust import rand as frand
cpuF = frand(1024, 1024, num_factors=10, fac_type='dense')
%timeit cpuF.norm(2)
%timeit cpuF.toarray()
239 ms ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 714 ms ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now let's make some GPU heat with norms and matrix products!
gpuF = cpuF.clone(dev='gpu')
%timeit gpuF.norm(2)
%timeit gpuF.toarray()
18.1 ms ± 52.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 137 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Of course not all GPUs are equal; below are the results I got using a Tesla V100:
6.85 ms ± 9.06 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.82 ms ± 90.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Likewise let's compare the performance obtained for a sparse Faust:
from pyfaust import rand as frand
cpuF2 = frand(1024, 1024, num_factors=10, fac_type='sparse', density=.2)
gpuF2 = cpuF2.clone(dev='gpu')
print("CPU times:")
%timeit cpuF2.norm(2)
%timeit cpuF2.toarray()
print("GPU times:")
%timeit gpuF2.norm(2)
%timeit gpuF2.toarray()
CPU times: 105 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 1.07 s ± 236 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) GPU times: 123 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 132 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
On a Tesla V100 it gives these results:
9.86 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
13.8 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
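The speedups above come from never forming the dense product: for instance, the 2-norm can be estimated by power iteration, applying the factors one by one to a vector. A rough numpy sketch of this idea (an illustration of the principle, not pyfaust's actual implementation):

```python
import numpy as np

def faust_norm2(factors, n_iter=50):
    """Estimate ||F1 F2 ... Fk||_2 by power iteration, applying the
    factors one at a time instead of forming the dense product."""
    x = np.random.rand(factors[-1].shape[1])
    for _ in range(n_iter):
        y = x
        for f in reversed(factors):   # y = (F1 ... Fk) x
            y = f @ y
        for f in factors:             # y = (F1 ... Fk)^T y
            y = f.T @ y
        x = y / np.linalg.norm(y)
    y = x
    for f in reversed(factors):
        y = f @ y
    return np.linalg.norm(y)

factors = [np.random.rand(64, 64) for _ in range(5)]
dense = np.linalg.multi_dot(factors)
```

Each iteration costs a handful of small matrix-vector products instead of one large dense matrix-vector product, which is where the factorized format (on CPU or GPU) saves time.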
Some of the FAµST algorithms implemented in the C++ core are now also available in pure GPU mode. For example, let's compare the factorization times taken by the hierarchical factorization algorithm when launched on CPU and on GPU. When running on GPU, the matrix to factorize is copied into GPU memory and almost all operations executed during the algorithm don't involve the CPU in any manner (the only exception at this stage of development is the proximal operators, which still run on CPU only).
Warning: THE COMPUTATION CAN LAST THIRTY MINUTES OR SO ON CPU
from scipy.io import loadmat
from pyfaust.demo import get_data_dirpath
d = loadmat(get_data_dirpath()+'/matrix_MEG.mat')
def factorize_MEG(dev='cpu'):
    from pyfaust.fact import hierarchical
    from pyfaust.factparams import ParamsHierarchicalRectMat
    from time import time
    from numpy.linalg import norm
    MEG = d['matrix'].T
    num_facts = 9
    k = 10
    s = 8
    t_start = time()
    p = ParamsHierarchicalRectMat.createParams(MEG, ['rectmat', num_facts, k, s])
    p.factor_format = 'dense'
    MEG16 = hierarchical(MEG, p, backend=2020, on_gpu=dev=='gpu')
    total_time = time()-t_start
    err = norm(MEG16.toarray()-MEG)/norm(MEG)
    return MEG16, total_time, err
It seems FAuST data is already available locally. To renew the download please empty the directory: /home/hhadjdji/pyfaust_data
gpuMEG16, gpu_time, gpu_err = factorize_MEG(dev='gpu')
print("GPU time, error:", gpu_time, gpu_err)
Faust::hierarchical: 1/8 Faust::hierarchical: 2/8 Faust::hierarchical: 3/8 Faust::hierarchical: 4/8 Faust::hierarchical: 5/8 Faust::hierarchical: 6/8 Faust::hierarchical: 7/8 Faust::hierarchical: 8/8 GPU time, error: 147.81562995910645 0.13022356291246556
cpuMEG16, cpu_time, cpu_err = factorize_MEG(dev='cpu')
print("CPU time, error:", cpu_time, cpu_err)
Faust::hierarchical: 1/8 Faust::hierarchical: 2/8 Faust::hierarchical: 3/8 Faust::hierarchical: 4/8 Faust::hierarchical: 5/8 Faust::hierarchical: 6/8 changed mul. CPU time, error: 616.1654939651489 0.13008384395275446
Depending on your GPU card and CPU the results may vary, so below are some results obtained on specific hardware.
Implementation | Hardware | Time (s) | Relative error (Faust vs MEG matrix) |
---|---|---|---|
CPU | Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz | 616.16 | .130 |
GPU | NVIDIA GTX980 | 147.81 | .130 |
GPU | RTX2080 | 73.88 | .130 |
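For reference, the timings in the table correspond to roughly a 4x speedup on the GTX980 and an 8x speedup on the RTX2080 over this CPU:

```python
# Times (s) taken from the table above
cpu, gtx980, rtx2080 = 616.16, 147.81, 73.88
speedup_gtx980 = cpu / gtx980      # ~4.2x
speedup_rtx2080 = cpu / rtx2080    # ~8.3x
```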
If something goes wrong when trying to use the GPU pyfaust extension, here is how to manually load the module and obtain more information.
The key is the function enable_gpu_mod. This function lets you retry loading gpu_mod with verbose mode enabled.
import pyfaust
pyfaust.enable_gpu_mod(silent=False, fatal=True)
Warning: gm_lib is already loaded (can't reload it).
Afterwards you can call pyfaust.is_gpu_mod_enabled() to verify in your script whether it worked.
Below are copies of the outputs showing what it looks like when it doesn't work:
1) If you requested a fatal error using enable_gpu_mod(silent=False, fatal=True), an exception will be raised and your code won't be able to continue after this call:
python -c "import pyfaust; pyfaust.enable_gpu_mod(silent=False, fatal=True)"
WARNING: you must call enable_gpu_mod() before using GPUModHandler singleton.
loading libgm
libcublas.so.9.2: cannot open shared object file: No such file or directory
[...]
Exception: Can't load gpu_mod library, maybe the path (/home/test/venv_pyfaust-2.10.14/lib/python3.7/site-packages/pyfaust/lib/libgm.so) is not correct or the backend (cuda) is not installed or configured properly so the libraries are not found.
2) If you just want a warning, you must use enable_gpu_mod(silent=False); the code will continue without gpu_mod enabled, but you'll get some information about what is going wrong (here the CUDA toolkit version 9.2 is not installed):
python -c "import pyfaust; pyfaust.enable_gpu_mod(silent=False)"
WARNING: you must call enable_gpu_mod() before using GPUModHandler singleton.
loading libgm
libcublas.so.9.2: cannot open shared object file: No such file or directory
Thanks for reading this notebook! Many others are available at faust.inria.fr.
Note: this notebook was executed using the following pyfaust version:
import pyfaust
pyfaust.version()
'3.9.1'