-
Notifications
You must be signed in to change notification settings - Fork 13.4k
Loading models directly into VRAM, norm calculation on GPUs, broadcasting for ggml_mul #1483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The CUDA code looks fine, but I would suggest some changes to avoid an explosion of
@ggerganov may disagree about this, wait for his opinion. |
There is a CPU implementation for broadcasting of tensors in ggml_mul. It's just that I did not implement a version that utilizes hardware acceleration on CPUs so I thought I would disable it by default. |
Not sure what you mean by "hardware acceleration on CPUs" - do you mean it is not SIMD-fied? If yes, then it is not a problem. I think the way you've implemented it is fine and should be enabled by default on the CPU as @slaren recommended Make sure this does not break the backward computation - run
Yes, this is better |
This is the output of baby-llama:
I think this is what it's supposed to look like? |
I've quickly tried moving the calls for CUDA functions to a new function |
af005ce
to
bcbf634
Compare
With the new broadcasting for ggml_mul building on macOS seems to fail. I don't have any Apple devices on which I could debug this; should I just disable it by default again? |
you can still check the CI build logs. |
I know but that won't tell me anything about whether or not my implementation produces correct results. |
bcbf634
to
a272e71
Compare
I implemented loading models directly into VRAM and pushed it to this PR; sorry for the feature creep. I'm getting 1.7 t/s for 65b q4_0 with a 3090 and 32 GB RAM @ 3200 MHz. This implementation works with EXT4 and NTFS but not BTRFS. Models are loaded via cuFile. This allows the GPU to directly access the disk when loading weights. Originally I wanted to have the implementation entirely in Question: when are there multiple input files? For simplicity I assumed there to be only one. |
33b models with 16 GB RAM and 8 GB VRAM still do not seem to be viable. On my headless Linux server I'm ~1 GB short. |
When #1508 is ready, it might be just enough to fit it
I'll fix this over the weekend |
Thanks, I'll leave it to you. |
On my Linux server with an i5-4570s, 1600 MHz RAM, and a GTX 1070 when loading 7b q4_0 from a SATA SSD with an empty file system cache: On the second run: |
Maintaining these changes in llama.cpp or supporting a different backend like opencl is going to be a nightmare. IMO this should be abstracted to the absolute minimum number of dependencies with CUDA and ggml-cuda, and if a dependency cannot be avoided reasonably, it should be clearly isolated from the rest of the code. |
afaik, cufile is also linux only? |
Thank you for your input. I understand my PRs to be points of discussion and am willing to implement changes (since that's relatively easy once you have a working version). To make sure there is no misunderstanding: cuFile is included in the CUDA toolkit and would not require installing additional software.
|
cmake support for cuFile requires at least 3.25 |
I collected some data with my main machine (RTX 3090, Ryzen 3700X, 32 GB RAM @ 3200 MHz). The test command I used is
The difference seems to be ~1.5 s for 7b, ~3 s for 13b, and ~6 s for 33b. |
9acc42f
to
71b053c
Compare
I force-pushed a version based on the feedback. This version keeps any cuFile-related code entirely in I designed the new version to make it easy to either remove cuFile-related code entirely or to turn it into a compilation option. Opinions regarding this decisions would be appreciated. The original version can be found here. |
I vote for this. Has anyone a direct storage system to test this? Either with "Magnum IO GPUDirect Storage" or maybe windows direct storage or something. would be interesting to see. |
nice! Is the system with 32G RAM and 11G VRAM expected to run the 65b model? Or 24G VRAM? |
On a headless Linux server it should be possible on Winbloats maybe not so much. |
* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0 * llama : bump LLAMA_FILE_VERSION to 3 * cuda : update Q4 and Q8 dequantize kernels * ggml : fix AVX dot products * readme : update performance table + hot topics
* Fix name shadowing and C4146 * Fix if macros not using defined when required * Update llama-util.h Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Update llama-util.h Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Code style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feature: add blis support * feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927 * fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake * Fix typo in INTEGER Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…l-org#1502)" This reverts commit 07e9ace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've update the llama
inference to use broadcasted ggml_mul()
always.
Cannot do it in baby-llama
yet because the backward pass is not implemented (added asserts to prevent from anyone using it for now).
I think we have to drop cufile
- it requires CMake 3.25 and I think currently breaks Windows build for some reason.
Does not seem worth it given the latest benchmarks.
P.S. Hope I didn't mess up something during the rebase that I just did
I removed cuFile and cleaned up the rebase as best as I could. Give me a little more time and I'll clean up the code a little now that I no longer need to consider cuFile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
llama.cpp
Outdated
// "output" tensor | ||
{ | ||
ggml_backend backend_output; | ||
if (n_gpu_layers > int(n_layer)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: if with identical then and else branches [bugprone-branch-clone]
if (n_gpu_layers > int(n_layer)) {
^
Additional context
llama.cpp:1024: else branch starts here
} else {
^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
llama.cpp
Outdated
// "output" tensor | ||
{ | ||
ggml_backend backend_output; | ||
if (n_gpu_layers > int(n_layer)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: if with identical then and else branches [bugprone-branch-clone]
if (n_gpu_layers > int(n_layer)) {
^
Additional context
llama.cpp:1022: else branch starts here
} else {
^
I moved the loop for loading weights from |
Alright, thanks for the help everyone. |
Currently my implementation of GPU acceleration has inefficient memory management: weights are loaded into RAM and then copied to VRAM. My current goal is to fix this. This PR is a step towards that goal.
Ideally you would directly load parameters from disk into VRAM. However, currently not all parameters can be loaded into VRAM. Between the weight matrices there are two norms per layer. Managing this would make the implementation more complicated. But if all weights in a layer are in VRAM the implementation will be simpler. So this PR implements GPU acceleration for norms (or rather ggml_mul) even though this in and of itself is of rather low priority; for full GPU acceleration we would have needed this eventually though.
On master the norms are first repeated and then multiplied with another tensor. To keep the CUDA code simpler I have extended ggml_mul (for both CPU and CUDA) to allow the broadcasting of values: if
src1
is smaller thansrc0
in some dimensions its values are being repeated during the multiplication. That way a repeat CUDA kernel is not needed. The ggml graph is also a little smaller which in theory reduces overhead.There don't seem to be significant performance differences for generating tokens: