Experiments with GPU CUDA acceleration...sort of #220 (ggml-org/whisper.cpp)

@Topping1

The CUDA Toolkit documentation states that NVBLAS is a drop-in BLAS replacement.
It also states: "The NVBLAS Library is a GPU-accelerated Library that implements BLAS (Basic Linear Algebra Subprograms). It can accelerate most BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs present in the system, when the characteristics of the call make it speed up on a GPU." One of those Level-3 routines is sgemm (single-precision general matrix multiplication), which ggml.c uses extensively.
In theory, IF CORRECTLY CONFIGURED, NVBLAS can intercept the calls to the OpenBLAS function cblas_sgemm and accelerate them on a CUDA-compatible graphics card installed in the system.
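For reference, sgemm computes C = alpha·A·B + beta·C on single-precision matrices. The pure-Python sketch below reproduces that math for the no-transpose, row-major case, which is the kind of call NVBLAS would have to intercept. `sgemm_reference` is a hypothetical name for illustration, not the OpenBLAS or NVBLAS API, and Python floats are double precision rather than single.

```python
def sgemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-major lists of lists.

    a is m x k, b is k x n, c is m x n. This mirrors the math of
    cblas_sgemm with no transposes; it is a reference sketch only.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of a and b differ")
    n = len(b[0])
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must be m x n")
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            # Dot product of row i of a with column j of b.
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            row.append(alpha * acc + beta * c[i][j])
        out.append(row)
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
print(sgemm_reference(1.0, A, B, 0.0, C))  # [[19.0, 22.0], [43.0, 50.0]]
```

The work grows as m·n·k multiply-adds, which is why large matrix products are the calls that stand to gain from being routed to a GPU.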
There is not much information about the specific steps to enable it, but I pieced together the following steps:

1-Install the CUDA Toolkit from the official download page.

2-Create the file /etc/nvblas.conf with the following contents:

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL

/usr/lib/x86_64-linux-gnu/libopenblas.so is the location of libopenblas.so on my system; you have to point it to the correct location on yours (it should not be very different).
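Since the only machine-specific value is the OpenBLAS path, the config can be generated after looking for the library. A minimal sketch, assuming OpenBLAS lives in one of a few common directories (the extra candidate paths and both helper names are guesses, not anything NVBLAS defines):

```python
from pathlib import Path


def render_nvblas_conf(cpu_blas_lib, logfile="nvblas.log", gpu_list="ALL"):
    """Return nvblas.conf text using the same three keys as step 2."""
    return (
        f"NVBLAS_LOGFILE {logfile}\n"
        f"NVBLAS_CPU_BLAS_LIB {cpu_blas_lib}\n"
        f"NVBLAS_GPU_LIST {gpu_list}\n"
    )


def first_existing(paths):
    """Return the first path in `paths` that exists on disk, or None."""
    for path in paths:
        if Path(path).exists():
            return path
    return None


if __name__ == "__main__":
    # Guessed locations; only the first one comes from this write-up.
    candidates = [
        "/usr/lib/x86_64-linux-gnu/libopenblas.so",
        "/usr/lib64/libopenblas.so",
        "/usr/lib/libopenblas.so",
    ]
    lib = first_existing(candidates)
    if lib is None:
        print("libopenblas.so not found in the guessed locations")
    else:
        # Review the output, then save it as /etc/nvblas.conf (needs root).
        print(render_nvblas_conf(lib), end="")
```

The script only prints the text, so nothing under /etc is touched until you copy it there yourself.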

3-Create an environment variable pointing to nvblas.conf:
export NVBLAS_CONFIG_FILE=/etc/nvblas.conf

4-Preload libnvblas.so by setting the LD_PRELOAD environment variable:
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.11
Here it is not clear which .so file is needed. For example, on my system I can find the following:
/usr/local/cuda/lib64/libnvblas.so
/usr/local/cuda/lib64/libnvblas.so.11
/usr/local/cuda/lib64/libnvblas.so.11.11.3.6
/usr/local/cuda-11.8/lib64/libnvblas.so
/usr/local/cuda-11.8/lib64/libnvblas.so.11
/usr/local/cuda-11.8/lib64/libnvblas.so.11.11.3.6
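To see whether the entries above are really different libraries, each path can be resolved through its symlinks. On many CUDA installs /usr/local/cuda is a symlink to the versioned directory and the shorter .so names are symlinks to the fully versioned file, so the list may collapse to a single library. The sketch below (the helper name is hypothetical) groups the paths by their resolved target:

```python
import os
from collections import defaultdict


def group_by_real_path(paths):
    """Map each resolved file to the list of given paths that point at it.

    Paths that do not exist are skipped. os.path.realpath follows every
    symlink along the path, so directory links and .so version links are
    both collapsed to the final target.
    """
    groups = defaultdict(list)
    for p in paths:
        if os.path.exists(p):
            groups[os.path.realpath(p)].append(p)
    return dict(groups)


if __name__ == "__main__":
    candidates = [
        "/usr/local/cuda/lib64/libnvblas.so",
        "/usr/local/cuda/lib64/libnvblas.so.11",
        "/usr/local/cuda/lib64/libnvblas.so.11.11.3.6",
        "/usr/local/cuda-11.8/lib64/libnvblas.so",
        "/usr/local/cuda-11.8/lib64/libnvblas.so.11",
        "/usr/local/cuda-11.8/lib64/libnvblas.so.11.11.3.6",
    ]
    for target, aliases in group_by_real_path(candidates).items():
        print(target)
        for alias in aliases:
            print("    <-", alias)
```

If everything prints under one target, any of the names should be equivalent for LD_PRELOAD.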

5-Download the whisper.cpp source code with
git clone https://github.com/ggerganov/whisper.cpp

6-Inside the whisper.cpp folder, execute
cmake -DWHISPER_SUPPORT_OPENBLAS=ON .

7-Inside the whisper.cpp folder, execute
make
You should now have a compiled main executable with BLAS support turned on.

8-Now, at least in my case, when I run a test transcription, the program confirms that it is using BLAS (BLAS = 1), but NVBLAS does not seem to be intercepting the calls: nvtop shows no GPU usage and no nvblas.log is created.
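Exported variables only apply to the shell where they were set and to the processes started from it, so one simple failure mode is steps 3 and 4 not reaching the process that runs ./main; another is a wrong path in nvblas.conf. A hypothetical preflight check (the function names and messages are illustrative, not part of NVBLAS or whisper.cpp) can rule those out. It only inspects variables and paths and cannot tell whether NVBLAS actually intercepts cblas_sgemm.

```python
import os
from pathlib import Path


def parse_nvblas_conf(text):
    """Parse 'KEY value' lines from nvblas.conf, skipping blanks and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def preflight(env):
    """Return a list of problems found in the NVBLAS setup described by `env`."""
    problems = []
    conf_path = env.get("NVBLAS_CONFIG_FILE")
    if not conf_path:
        problems.append("NVBLAS_CONFIG_FILE is not set")
    elif not Path(conf_path).is_file():
        problems.append(f"config file not found: {conf_path}")
    else:
        conf = parse_nvblas_conf(Path(conf_path).read_text())
        cpu_lib = conf.get("NVBLAS_CPU_BLAS_LIB")
        if not cpu_lib or not Path(cpu_lib).exists():
            problems.append(f"NVBLAS_CPU_BLAS_LIB missing or not found: {cpu_lib}")
    # LD_PRELOAD must name the NVBLAS library for the interception to happen.
    if "libnvblas" not in env.get("LD_PRELOAD", ""):
        problems.append("LD_PRELOAD does not mention libnvblas")
    return problems


if __name__ == "__main__":
    # Run this from the same shell that will launch ./main, so it sees the
    # same exported variables.
    for problem in preflight(dict(os.environ)) or ["no obvious problems found"]:
        print(problem)
```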

If someone can figure out how to make this work, it could substantially speed up transcription on x64.

Labels: performance (CPU and memory usage - results and comparisons)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions