Support for different data types (float16, float32) by kris-jusiak · Pull Request #93 · karpathy/llama2.c · GitHub

kris-jusiak

Problem:

  • Only float32 is currently supported, which requires 2x the memory of float16 and is slower to execute because inference is memory bound.

Solution:

  • Add ability to export model to float16.
  • Add support to inference with float16.

Note:

  • It's quite hard to do this generically in pure C (without templates), so to avoid adding too much complexity a compile-time option has been chosen. Ideally the data type would be picked at run time based on a value stored in the config, but that requires additional complexity which I wanted to avoid. It can still be explored with a proper solution.
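A minimal sketch of what a compile-time data type option can look like in C. The `DTYPE` macro name matches the one used in this PR's diff; the `matmul` signature and the choice to accumulate in `float` are assumptions for illustration, not necessarily what the PR does.

```c
#include <stddef.h>

/* Element type of the stored weights. Override at build time, for example
   with -DDTYPE=_Float16 on a compiler that supports that type. */
#ifndef DTYPE
#define DTYPE float
#endif

/* out (d) = W (d x n, row-major) * x (n).
   Weights are read as DTYPE; the accumulator stays in float (an assumption
   made here to limit rounding error, not something the PR states). */
static void matmul(float *out, const float *x, const DTYPE *w, size_t n, size_t d) {
    for (size_t i = 0; i < d; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++) {
            acc += (float)w[i * n + j] * x[j];
        }
        out[i] = acc;
    }
}
```

Because the type is fixed when the binary is built, a float16 build can only read float16 checkpoints, which is the trade-off against a run-time pick described in the note.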

@vgoklani

@krzysztof-jusiak Hey there, what approach would be used for BF16?

thanks!

@axrwl

axrwl commented Jul 26, 2023

For clang, -DDTYPE=_Float16 will work.

@xefoci7612

@krzysztof-jusiak: Hi, do you think it makes sense to just move from float32 to float16? It seems float16 is always faster than float32.

Using a single data type would be simpler, both in the code and in the number of modelxxx.bin files (already 3 + llama 7B and counting...)

The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts, the 13B export currently doesn't work for unknown reaons (accepting PRs for fix). We can run the model as normal:
The export will take ~1 minute or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts, the 13B export currently doesn't work for unknown reaons (accepting PRs for fix). We can run the model as normal:
reaons -> reasons

@kris-jusiak
Author

@xefoci7612 I think there is a need for different data types, as they provide different accuracy, differ in size, etc. Ideally we would operate on maximum-precision floating point only, but that's not possible due to performance/size limitations. However, smaller models or better hardware allow the use of higher-precision versions. Since inference is memory bound, quantization is usually also applied for faster speeds. It compresses the model even further, to for example 4 bits, but that comes with accuracy trade-offs, which I don't think can be made by default for everyone.

@kroggen
Contributor

kroggen commented Jul 26, 2023

@xefoci7612

No, float16 is not always faster than float32. If the processor does not natively support FP16 arithmetic, then it will be emulated in software.

https://stackoverflow.com/questions/56697332/float16-is-much-slower-than-float32-and-float64-in-numpy
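To make "emulated in software" concrete, here is a sketch of the kind of work a CPU without native FP16 support has to do for every weight it reads: unpack the 16-bit pattern into a 32-bit float with integer operations before any arithmetic can happen. The function name `half_to_float` is hypothetical and this is not code from the PR; it follows the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits).

```c
#include <stdint.h>
#include <string.h>

/* Convert an IEEE 754 binary16 bit pattern to float using only integer ops. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                              /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit leading 1 appears. */
            uint32_t shift = 0;
            while ((mant & 0x400u) == 0) {
                mant <<= 1;
                shift++;
            }
            mant &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);     /* infinity or NaN */
    } else {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);                      /* reinterpret the bits */
    return f;
}
```

With hardware support this whole function collapses into a single conversion instruction, which is why the speed of float16 depends so much on the target processor.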

@karpathy
Owner

will def take a look here. one thing to be careful with is that if you want to inference in fp16 you must train in fp16 (with gradient scalers), and not in fp32 or bf16. Otherwise the range of activations can overflow.
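The overflow concern comes from fp16's narrow range: the largest finite IEEE 754 binary16 value is 65504, whereas bf16 keeps the same 8-bit exponent as fp32. A rough way to check whether a given buffer of fp32 activations would be safe to store in fp16 is sketched below; the helper name `count_fp16_overflow` is hypothetical and not part of the PR.

```c
#include <stddef.h>

#define FP16_MAX 65504.0f  /* largest finite IEEE 754 binary16 value */

/* Count the values whose magnitude exceeds the largest finite fp16 value.
   Such values cannot be stored exactly in fp16, and anything at or beyond
   65520 rounds to infinity. */
static size_t count_fp16_overflow(const float *x, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] > FP16_MAX || x[i] < -FP16_MAX) {
            count++;
        }
    }
    return count;
}
```

Running a check like this over the activations of a model trained in fp32 or bf16 would show whether any of them fall outside what fp16 inference can represent.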

@karpathy
Owner

@krzysztof-jusiak what are the benefits of fp16?

  • clearly checkpoints shrink to half the size
  • how much faster is fp16? e.g. on your computer, or on a MacBook Air M1 or so

also my understanding is this would invalidate all our previous checkpoints because they don't contain dtype in the config. Which is on me for having chosen a dumb serialization :D

@kris-jusiak
Author

Thanks for the hint about the training, it's a really valid point.

Regarding the benefits of fp16, it's mainly performance: the weights are smaller, and since the computations are memory bound the speedup should be noticeable.

That said, I've not noticed a huge difference between fp16 and fp32 on my machine:
fp16: 2.3 tok/s
fp32: 2.0 tok/s

That doesn't add up with my previous experiments with other LLMs; I'd expect a somewhat bigger difference.
I'm verifying whether the code is actually memory bound yet. Maybe it's still CPU bound, which would explain why the speedup isn't bigger.
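A back-of-the-envelope way to frame the memory-bound question: if every weight byte has to be read once per generated token, memory bandwidth divided by model size gives an upper bound on tokens per second. The helper below is a hypothetical sketch, and any bandwidth figure plugged into it is an assumption, not a measurement from this thread.

```c
/* Upper bound on tokens per second when each token requires reading
   every weight byte once from memory (a simplifying assumption). */
static double max_tokens_per_sec(double model_bytes, double bandwidth_bytes_per_sec) {
    return bandwidth_bytes_per_sec / model_bytes;
}
```

Under this model, halving the checkpoint (for example from the 26GB quoted for the 7B float32 export to about 13GB) doubles the bound whatever the bandwidth is. The measured ratio above is 2.3 / 2.0 = 1.15x, well short of 2x, which is consistent with the suspicion that the run is not yet purely memory bound.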

@axrwl

axrwl commented Jul 27, 2023

> also my understanding is this would invalidate all our previous checkpoints because they don't contain dtype in the config

Backwards compatibility can be maintained by assuming that a missing dtype implies fp32.
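The comment does not say how a loader would notice that the dtype is missing. One possible way, purely an assumption here, is to infer the element size from the file itself: the config already determines how many weights the checkpoint holds, so the payload size divided by that count reveals the bytes per weight. The helper `infer_weight_size` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Infer bytes per weight from the checkpoint size.
   file_bytes   - total size of the checkpoint file
   header_bytes - size of the config header at the start of the file
   n_weights    - number of weights implied by the config
   Returns 4 (float32), 2 (float16), or 0 if the size matches neither. */
static size_t infer_weight_size(size_t file_bytes, size_t header_bytes, size_t n_weights) {
    if (n_weights == 0 || file_bytes < header_bytes) {
        return 0;
    }
    size_t payload = file_bytes - header_bytes;
    if (payload == n_weights * 4) {
        return 4;
    }
    if (payload == n_weights * 2) {
        return 2;
    }
    return 0;
}
```

Existing float32 checkpoints would keep loading unchanged under this scheme, since their size already matches the 4-byte case.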

#include <fcntl.h>
#include <sys/mman.h>

#ifndef DTYPE

It's better to typedef dtype rather than spreading DTYPE macro all over.

Suggested change (replacing `#ifndef DTYPE`):
#ifdef DTYPE
typedef DTYPE dtype;
#else
typedef float dtype;
#endif

@karpathy
Owner

karpathy commented Aug 5, 2023

Ok I took a look but I think the required changes are a little bit too yucky, and we're not seeing strong evidence of much better results. I do like that the files would have been smaller by ~half... Anyway I'm leaning to not include this atm but ty. If someone can demonstrate a solid higher throughput I think it's worth revisiting.

@karpathy karpathy closed this Aug 5, 2023