[Opt] Using filter and kernel level pipeline to optimize lookup kernels #136
Conversation
Documentation preview: https://nvidia-merlin.github.io/HierarchicalKV/review/pr-136
README.md
Outdated
```
## Benchmark & Performance(W.I.P)

* GPU: 1 x NVIDIA A100 80GB PCIe: 8.0
* GPU: 1 x NVIDIA A100-SXM4-80GB: 8.0
```
Better to keep PCIe as our benchmark baseline.
include/merlin_hashtable.cuh
Outdated
```
// Only bucket_size = 128
// On A100, the maximum dim which Pipeline support is 224 floats
if (options_.max_bucket_size == 128 &&
    value_size <= (224 * sizeof(float))) {
```
We'd better avoid the magic number and make it a private member of HashTable, or find a better form. If the 224 depends on the GPU hardware, we need to calculate it at the initialization phase.
OK, I will make the arch information a template type to select the kernel-related config at compile time.
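The idea above — selecting the kernel config by architecture at compile time — could look roughly like this. A host-compilable C++ sketch; the tag names (`Sm80`, `Sm70`), the `PipelineConfig` struct, and the Sm70 numbers are assumptions for illustration, not the PR's actual code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical arch tags; the real PR may use SM numbers or another scheme.
struct Sm80 {};  // A100
struct Sm70 {};  // V100 (assumed smaller budget, for illustration only)

// Per-arch kernel config selected at compile time, so magic numbers like the
// 224-float pipeline dim limit live in one place instead of inline checks.
template <typename ArchTag> struct PipelineConfig;

template <> struct PipelineConfig<Sm80> {
  static constexpr std::size_t kMaxPipelineDim = 224;  // floats
  static constexpr std::size_t kBucketSize = 128;
};
template <> struct PipelineConfig<Sm70> {
  static constexpr std::size_t kMaxPipelineDim = 128;  // assumed
  static constexpr std::size_t kBucketSize = 128;
};

// Replaces the inline `== 128 && <= 224 * sizeof(float)` check.
template <typename ArchTag>
bool can_use_pipeline(std::size_t max_bucket_size, std::size_t value_size) {
  using Cfg = PipelineConfig<ArchTag>;
  return max_bucket_size == Cfg::kBucketSize &&
         value_size <= Cfg::kMaxPipelineDim * sizeof(float);
}
```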
include/merlin/core_kernels.cuh
Outdated
```
for (size_t i = 0; i < bucket_max_size; i++)
  new (buckets[start + tid].keys(i))
      AtomicKey<K>{static_cast<K>(EMPTY_KEY)};
K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
```
const K
include/merlin/core_kernels.cuh
Outdated
```
new (buckets[start + tid].keys(i))
    AtomicKey<K>{static_cast<K>(EMPTY_KEY)};
K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
uint8_t digest = static_cast<uint8_t>(hashed_key >> 32);
```
const uint8_t
include/merlin/core_kernels.cuh
Outdated
```
new (buckets[start + tid].keys(i))
    AtomicKey<K>{static_cast<K>(EMPTY_KEY)};
K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
uint8_t digest = static_cast<uint8_t>(hashed_key >> 32);
```
Could you confirm which header we should use for uint8_t?
sys/types.h
or <stdint.h>
or https://nvidia.github.io/cutlass/structcutlass_1_1TypeTraits_3_01uint8__t_01_4.html
I mean, do we need to add a special header explicitly for uint8_t?
I think it's <cstdint> or <stdint.h>, which is already included in our code via the header types.cuh.
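For context, the digest under discussion is one byte taken from the 64-bit hashed key. A host-compilable sketch, using the public MurmurHash3 `fmix64` finalizer as a stand-in for `Murmur3HashDevice` (an assumption; the device hash may differ):

```cpp
#include <cstdint>  // uint8_t, uint64_t -- the <cstdint> header discussed above

// Stand-in for Murmur3HashDevice: the MurmurHash3 64-bit finalizer (fmix64).
inline uint64_t murmur3_fmix64(uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}

// One-byte digest: bits 32..39 of the hashed key, matching the PR's
// static_cast<uint8_t>(hashed_key >> 32).
inline uint8_t digest_of(uint64_t key) {
  const uint64_t hashed_key = murmur3_fmix64(key);
  return static_cast<uint8_t>(hashed_key >> 32);
}
```

A one-byte digest lets a lookup reject most non-matching slots by comparing 1 byte instead of an 8-byte key, which is the memory-traffic filter this PR introduces.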
include/merlin/core_kernels.cuh
Outdated
```
local_size = buckets_size[new_bkt_idx];
if (rank == src_lane) {
  K hashed_key = Murmur3HashDevice(key);
  uint8_t target_digest = static_cast<uint8_t>(hashed_key >> 32);
```
const
OK, I will find all of these and add const. Thanks for your review.
include/merlin/core_kernels.cuh
Outdated
```
if (rank == 0) {
  K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
  uint8_t target_digest = static_cast<uint8_t>(hashed_key >> 32);
  bucket->digests[key_idx] = target_digest;
```
For EMPTY_KEY, we'd better define a separate macro for its corresponding digest.
include/merlin/core_kernels.cuh
Outdated
```
if (g.thread_rank() == src_lane) {
  const int key_pos =
      (start_idx + tile_offset + src_lane) & (bucket_max_size - 1);
  K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
```
EMPTY_DIGEST
```
#include "../utils.cuh"

// if i % 2 == 0, select buffer 0, else buffer 1
#define SAME_BUF(i) (((i)&0x01) ^ 0)
```
Unused?
No, it is used to select the buffer in the pipeline kernel. For example:
`V* v_src = sm_vector[SAME_BUF(i)][groupID];` in kernel lookup_kernel_with_io_pipeline_v2.
Sorry, I didn't expand lookup_kernels.cuh.
So there is a potential issue here: the macro name is so common that it may pollute end-users' namespaces. If there is no performance loss, can we change them to a `__forceinline__ __device__ func(..)`?
Or, at least, `#undef` them after the last reference in this file.
Or use a special prefix like `MERLIN_xxx`.
OK, got it.
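A sketch of the suggested alternative: a scoped inline function instead of the macro. Shown here as a host-compilable `constexpr` function; in the CUDA source it would carry `__forceinline__ __device__`, and the `merlin` namespace is an assumption:

```cpp
#include <cassert>

namespace merlin {
// if i % 2 == 0, select buffer 0, else buffer 1
// (replacement for: #define SAME_BUF(i) (((i)&0x01) ^ 0))
constexpr int same_buf(int i) { return i & 0x01; }
}  // namespace merlin
```

Call sites would change from `sm_vector[SAME_BUF(i)][groupID]` to `sm_vector[merlin::same_buf(i)][groupID]`, and no macro is left to leak into user code.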
```
__forceinline__ __device__ static S lgs(S* src) { return src[0]; }

__forceinline__ __device__ static void stg(S* dst, S score_) {
```
stg(const S* dst, const S score_)
```
  __pipeline_memcpy_async(dst, src, sizeof(S));
}

__forceinline__ __device__ static S lgs(S* src) { return src[0]; }
```
Using const may help the compiler optimize the code.
include/merlin/core_kernels.cuh
Outdated
```
using namespace cooperative_groups;
namespace cg = cooperative_groups;
#include "core_kernels/kernel_utils.cuh"
```
Please modify the Bazel build file at the same time; its location is ./include/merlin/BUILD. Please try building with Bazel after you're done (there are no CI cases for it currently).
```
int idx_block = groupID * GROUP_SIZE + rank;
K target_key = keys[key_idx_base + rank];
sm_target_keys[idx_block] = target_key;
K hashed_key = Murmur3HashDevice(target_key);
```
Try to use const where possible.
```
template <typename K = uint64_t, typename V = float, typename S = uint64_t,
          typename CopyScore = CopyScoreEmpty<S, K, 128>,
          typename CopyValue = CopyValueTwoGroup<float, float4, 32>>
__global__ void lookup_kernel_with_io_pipeline_v1(
```
/blossom-ci
```
int find_number = __popc(find_result);
int group_base = 0;
if (find_number > 0) {
  group_base = atomicAdd(sm_counts + key_idx_block, find_number);
```
It looks like atomicAdd_block is enough here.
Yes, I agree with you; __atomicAdd_block is more appropriate.
However, when I use __atomicAdd_block, I get the error: identifier "__atomicAdd_block" is undefined.
I think it's related to CMakeLists.txt: set_target_properties(xxx PROPERTIES CUDA_ARCHITECTURES OFF).
And, according to the CUDA docs, atomicAdd supports shared memory, so I think using atomicAdd is the cheapest way at present.
atomicAdd works for both global and shared memory
include/merlin/core_kernels.cuh
Outdated
```
CUDA_CHECK(cudaMalloc(&((*table)->buckets[i].keys_), bucket_memory_size));
(*table)->buckets[i].scores_ = reinterpret_cast<AtomicScore<S>*>(
    (*table)->buckets[i].keys_ + bucket_max_size);
(*table)->buckets[i].digests = reinterpret_cast<uint8_t*>(
```
If we need to move the digests ahead of keys_, then find should always read the digests first.
OK, that's reasonable.
And I think we can store just the key and value addresses, as the addresses of the scores and digests can be inferred from keys_ and bucket_max_size.
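A host-compilable sketch of that layout idea: one allocation per bucket ordered [keys | scores | digests], with the scores and digests pointers derived from `keys_` and `bucket_max_size` rather than stored. `BucketView`, `Key`, and `Score` are stand-ins for the real `AtomicKey`/`AtomicScore` types:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cassert>

using Key = uint64_t;    // stand-in for AtomicKey<K>
using Score = uint64_t;  // stand-in for AtomicScore<S>

// One base pointer; the other two sections are pure pointer arithmetic.
struct BucketView {
  Key* keys_;
  std::size_t bucket_max_size;

  Score* scores() const {
    // scores section starts right after bucket_max_size keys
    return reinterpret_cast<Score*>(keys_ + bucket_max_size);
  }
  uint8_t* digests() const {
    // digests section starts right after bucket_max_size scores
    return reinterpret_cast<uint8_t*>(scores() + bucket_max_size);
  }
  static std::size_t bytes(std::size_t n) {
    return n * (sizeof(Key) + sizeof(Score) + sizeof(uint8_t));
  }
};
```

In the real table the single allocation would come from cudaMalloc, but the offset arithmetic is the same, and the Bucket struct shrinks by two pointers.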
```
S score_ = CopyScore::lgs(sm_target_scores + key_idx_block);
CopyValue::lds_stg(rank, v_dst, v_src, dim);
founds[key_idx_grid] = true;
CopyScore::stg(scores + key_idx_grid, score_);
```
The score and found results are staged in shared memory so that they can be written to global memory at the end of the kernel with coalesced memory accesses, reducing memory traffic.
include/merlin/types.cuh
Outdated
```
/// TODO: compute the pointer of scores and digests using bucket_max_size
AtomicScore<S>* scores_;
/// @brief not visible to users
uint8_t* digests;
```
Inspired by a discussion with you: we can potentially reduce the memory consumption of the Bucket struct by dropping the separate pointers for keys_, scores_, and digests_, because only one start pointer is needed for all three.
So could you switch digests to a function like this?

```
__forceinline__ __device__ uint8_t* digests(int index) const {
  return digests_ + index;
}
```

This will benefit the refactoring I mentioned.
```
constexpr int GROUP_SIZE = 32;
constexpr int RESERVE = 16;
constexpr int DIM_BUF = 224;
constexpr int BLOCK_SIZE = 128;
constexpr int BUCKET_SIZE = 128;
```
@jiashuy Are they configurable? How do you decide their values?
At present:
- `BUCKET_SIZE` is fixed to 128. This is what users commonly use, which can be confirmed by @rhdong.
- I think `BLOCK_SIZE` should be as small as possible to reduce uneven workload, but too small a value makes the grid size too large, so I chose 128.
- `GROUP_SIZE` is set according to the profiler. When dim is small, using 16 threads to handle one key cooperatively is more effective (using 8 would consume more registers); when dim is large, using 32 threads per key lets us put larger values into shared memory (the group count is smaller, so less shared memory is needed for the double buffers). The only difference between kernel v1 and v2 is the `GROUP_SIZE`.
- `DIM_BUF` is configurable according to the shared memory size of the SM (which differs per arch). I've already finished this and will commit it today.
- `RESERVE` is the reserved size for candidate keys (digest = target digest). From statistics on contiguous keys, 16 is enough for `RESERVE`, but I use 8 in `lookup_kernel_with_io_pipeline_v2` to reduce shared memory usage, preserving correctness by trading time (latency) for space. The frequency of the reserve size actually needed follows a power-law distribution.
In summary, `BUCKET_SIZE`, `BLOCK_SIZE`, `GROUP_SIZE`, and `RESERVE` are fixed for a specific kernel.
`BLOCK_SIZE` is set according to subjective experience;
`RESERVE` and `GROUP_SIZE` are set based on profiler results and measured performance;
`DIM_BUF` is configurable and has been implemented.
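To illustrate how a shared-memory budget bounds `DIM_BUF`, here is a hedged host-side sketch. The formula counts only the double-buffered value staging (floats per group per buffer); the real kernel also spends shared memory on keys, digests, and counters, so its actual bound is tighter. The budget number below is an assumption chosen to reproduce the 224 figure, not a measured value:

```cpp
#include <cassert>

constexpr int BLOCK_SIZE  = 128;
constexpr int GROUP_SIZE  = 32;
constexpr int NUM_GROUPS  = BLOCK_SIZE / GROUP_SIZE;  // 4 groups per block
constexpr int NUM_BUFFERS = 2;                        // double buffering

// Largest per-key dim (in floats) whose double-buffered copies fit in a
// given shared-memory budget devoted to value staging.
constexpr int dim_buf_for(int value_smem_bytes) {
  return value_smem_bytes / (NUM_BUFFERS * NUM_GROUPS * (int)sizeof(float));
}
```

Under this simplified model, a 7 KiB value-staging budget (7168 bytes, assumed) yields a dim of 224 floats; a larger-shared-memory arch would permit a larger `DIM_BUF`, which is why it is derived per arch rather than hard-coded.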
```
__pipeline_commit();  // padding
__pipeline_commit();  // padding
```
@jiashuy Why do you need these paddings?
I think `__pipeline_wait_prior(x)` waits for the (x+1)-th preceding `__pipeline_commit()`.
You can observe the `__pipeline_wait_prior(3)` at line 109.
So, in the first loop iteration, we need to wait for `sm_probing_digests` to be written back during the pipeline loading stage.
I pad with `__pipeline_commit()` to avoid having to check for this in every loop iteration.
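To make the padding argument concrete, here is a host-side counter model of the stated commit/wait semantics (this is not the CUDA API, only a model of the behavior described above: `wait_prior(N)` completes all but the newest N commits):

```cpp
#include <cassert>

// Model: commit() enqueues an async batch; wait_prior(n) blocks until at
// most n committed batches remain outstanding.
struct PipelineModel {
  int outstanding = 0;  // commits not yet completed
  int completed = 0;    // commits whose copies have landed
  void commit() { ++outstanding; }
  void wait_prior(int n) {
    while (outstanding > n) { --outstanding; ++completed; }
  }
};
```

With two padding commits issued up front, the first real iteration's `wait_prior(3)` already has enough prior commits in flight that its target batch is well defined, so the loop needs no first-iteration special case.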
2b7e893 to 9055d9a
On pure HBM mode:
1. Use digests (some bits of the hashed keys) as a filter to reduce memory traffic.
2. Use a kernel-level pipeline to overlap memory accesses and hide latency.
3. Add unit tests for the lookup kernels that use the filter and pipeline.
4. Make the dim supported by the pipelined lookup kernel configurable.
5. Put common kernels into the core_kernels folder, and modify the BUILD file used for the Bazel build.
6. Change the way digests are addressed.
7. When initializing the hash table, check bucket_max_size so that keys and scores meet the cache line size.
/blossom-ci
/blossom-ci