[Opt] Using filter and kernel level pipeline to optimize lookup kernels #136
Conversation
Documentation preview: https://nvidia-merlin.github.io/HierarchicalKV/review/pr-136
README.md
Outdated
```
## Benchmark & Performance(W.I.P)

* GPU: 1 x NVIDIA A100 80GB PCIe: 8.0
* GPU: 1 x NVIDIA A100-SXM4-80GB: 8.0
```
Better to keep PCIe as our benchmark baseline.
include/merlin_hashtable.cuh
Outdated
```
// Only bucket_size = 128
// On A100, the maximum dim which Pipeline support is 224 floats
if (options_.max_bucket_size == 128 &&
    value_size <= (224 * sizeof(float))) {
```
We'd better avoid the magic number and make it a private member of HashTable, or find a better form. If the 224 depends on the GPU hardware, we need to calculate it at the initialization phase.
OK, I will make the arch information a template type to select the kernel-related config at compile time.
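The idea above — selecting the kernel config by architecture at compile time — could look roughly like this. A host-compilable C++ sketch; the tag names (`Sm80`, `Sm70`), the `PipelineConfig` struct, and the Sm70 numbers are assumptions for illustration, not the PR's actual code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical arch tags; the real PR may use SM numbers or another scheme.
struct Sm80 {};  // A100
struct Sm70 {};  // V100 (assumed smaller budget, for illustration only)

// Per-arch kernel config selected at compile time, so magic numbers like the
// 224-float pipeline dim limit live in one place instead of inline checks.
template <typename ArchTag> struct PipelineConfig;

template <> struct PipelineConfig<Sm80> {
  static constexpr std::size_t kMaxPipelineDim = 224;  // floats
  static constexpr std::size_t kBucketSize = 128;
};
template <> struct PipelineConfig<Sm70> {
  static constexpr std::size_t kMaxPipelineDim = 128;  // assumed
  static constexpr std::size_t kBucketSize = 128;
};

// Replaces the inline `== 128 && <= 224 * sizeof(float)` check.
template <typename ArchTag>
bool can_use_pipeline(std::size_t max_bucket_size, std::size_t value_size) {
  using Cfg = PipelineConfig<ArchTag>;
  return max_bucket_size == Cfg::kBucketSize &&
         value_size <= Cfg::kMaxPipelineDim * sizeof(float);
}
```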
include/merlin/core_kernels.cuh
Outdated
```
for (size_t i = 0; i < bucket_max_size; i++)
  new (buckets[start + tid].keys(i))
      AtomicKey<K>{static_cast<K>(EMPTY_KEY)};
K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
```
const K
include/merlin/core_kernels.cuh
Outdated
```
new (buckets[start + tid].keys(i))
    AtomicKey<K>{static_cast<K>(EMPTY_KEY)};
K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
uint8_t digest = static_cast<uint8_t>(hashed_key >> 32);
```
const uint8_t
include/merlin/core_kernels.cuh
Outdated
```
new (buckets[start + tid].keys(i))
    AtomicKey<K>{static_cast<K>(EMPTY_KEY)};
K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
uint8_t digest = static_cast<uint8_t>(hashed_key >> 32);
```
Could you confirm which header we should use for uint8_t?
sys/types.h
or <stdint.h>
or https://nvidia.github.io/cutlass/structcutlass_1_1TypeTraits_3_01uint8__t_01_4.html
I mean, do we need to add a special header explicitly for uint8_t?
I think it's <cstdint> or <stdint.h>, which is already included in our code via the header types.cuh.
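For context, the digest under discussion is one byte taken from the 64-bit hashed key. A host-compilable sketch, using the public MurmurHash3 `fmix64` finalizer as a stand-in for `Murmur3HashDevice` (an assumption; the device hash may differ):

```cpp
#include <cstdint>  // uint8_t, uint64_t -- the <cstdint> header discussed above

// Stand-in for Murmur3HashDevice: the MurmurHash3 64-bit finalizer (fmix64).
inline uint64_t murmur3_fmix64(uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}

// One-byte digest: bits 32..39 of the hashed key, matching the PR's
// static_cast<uint8_t>(hashed_key >> 32).
inline uint8_t digest_of(uint64_t key) {
  const uint64_t hashed_key = murmur3_fmix64(key);
  return static_cast<uint8_t>(hashed_key >> 32);
}
```

A one-byte digest lets a lookup reject most non-matching slots by comparing 1 byte instead of an 8-byte key, which is the memory-traffic filter this PR introduces.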
include/merlin/core_kernels.cuh
Outdated
```
local_size = buckets_size[new_bkt_idx];
if (rank == src_lane) {
  K hashed_key = Murmur3HashDevice(key);
  uint8_t target_digest = static_cast<uint8_t>(hashed_key >> 32);
```
const
OK, I will find all of these and add const. Thanks for your review.
include/merlin/core_kernels.cuh
Outdated
```
if (rank == 0) {
  K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
  uint8_t target_digest = static_cast<uint8_t>(hashed_key >> 32);
  bucket->digests[key_idx] = target_digest;
```
For EMPTY_KEY, we'd better define a separate macro for its corresponding digest.
include/merlin/core_kernels.cuh
Outdated
```
if (g.thread_rank() == src_lane) {
  const int key_pos =
      (start_idx + tile_offset + src_lane) & (bucket_max_size - 1);
  K hashed_key = Murmur3HashDevice(static_cast<K>(EMPTY_KEY));
```
EMPTY_DIGEST
```
#include "../utils.cuh"

// if i % 2 == 0, select buffer 0, else buffer 1
#define SAME_BUF(i) (((i)&0x01) ^ 0)
```
Unused?
No, it is used to select the buffer in the pipeline kernel. For example:
`V* v_src = sm_vector[SAME_BUF(i)][groupID];` in kernel lookup_kernel_with_io_pipeline_v2.
Sorry, I didn't expand lookup_kernels.cuh.
So there is a potential issue here: the macro name is so common that it may pollute end-users' namespaces. If there is no performance loss, can we change them to a `__forceinline__ __device__ func(..)`?
Or, at least, `#undef` them after the last reference in this file.
Or use a special prefix like `MERLIN_xxx`.
OK, got it.
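A sketch of the suggested alternative: a scoped inline function instead of the macro. Shown here as a host-compilable `constexpr` function; in the CUDA source it would carry `__forceinline__ __device__`, and the `merlin` namespace is an assumption:

```cpp
#include <cassert>

namespace merlin {
// if i % 2 == 0, select buffer 0, else buffer 1
// (replacement for: #define SAME_BUF(i) (((i)&0x01) ^ 0))
constexpr int same_buf(int i) { return i & 0x01; }
}  // namespace merlin
```

Call sites would change from `sm_vector[SAME_BUF(i)][groupID]` to `sm_vector[merlin::same_buf(i)][groupID]`, and no macro is left to leak into user code.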
```
__forceinline__ __device__ static S lgs(S* src) { return src[0]; }

__forceinline__ __device__ static void stg(S* dst, S score_) {
```
stg(const S* dst, const S score_)
```
  __pipeline_memcpy_async(dst, src, sizeof(S));
}

__forceinline__ __device__ static S lgs(S* src) { return src[0]; }
```
Using const may help the compiler optimize the code.
include/merlin/core_kernels.cuh
Outdated
```
using namespace cooperative_groups;
namespace cg = cooperative_groups;
#include "core_kernels/kernel_utils.cuh"
```
Please modify the Bazel build file at the same time; its location is ./include/merlin/BUILD. Please try building with Bazel after you're done (there are no CI cases for it currently).
```
int idx_block = groupID * GROUP_SIZE + rank;
K target_key = keys[key_idx_base + rank];
sm_target_keys[idx_block] = target_key;
K hashed_key = Murmur3HashDevice(target_key);
```
Try to use const where possible.
```
template <typename K = uint64_t, typename V = float, typename S = uint64_t,
          typename CopyScore = CopyScoreEmpty<S, K, 128>,
          typename CopyValue = CopyValueTwoGroup<float, float4, 32>>
__global__ void lookup_kernel_with_io_pipeline_v1(
```
/blossom-ci
```
int find_number = __popc(find_result);
int group_base = 0;
if (find_number > 0) {
  group_base = atomicAdd(sm_counts + key_idx_block, find_number);
```
It looks like atomicAdd_block is enough here.
Yes, I agree with you; __atomicAdd_block is more appropriate.
However, when I use __atomicAdd_block, I get the error: identifier "__atomicAdd_block" is undefined.
I think it's related to CMakeLists.txt: set_target_properties(xxx PROPERTIES CUDA_ARCHITECTURES OFF).
And, according to the CUDA docs, atomicAdd supports shared memory, so I think using atomicAdd is the cheapest way at present.
atomicAdd works for both global and shared memory
include/merlin/core_kernels.cuh
Outdated
```
CUDA_CHECK(cudaMalloc(&((*table)->buckets[i].keys_), bucket_memory_size));
(*table)->buckets[i].scores_ = reinterpret_cast<AtomicScore<S>*>(
    (*table)->buckets[i].keys_ + bucket_max_size);
(*table)->buckets[i].digests = reinterpret_cast<uint8_t*>(
```
If we need to move the digests ahead of keys_, then find should always read the digests first.
OK, that's reasonable.
And I think we can store just the key and value addresses, as the addresses of the scores and digests can be inferred from keys_ and bucket_max_size.
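A host-compilable sketch of that layout idea: one allocation per bucket ordered [keys | scores | digests], with the scores and digests pointers derived from `keys_` and `bucket_max_size` rather than stored. `BucketView`, `Key`, and `Score` are stand-ins for the real `AtomicKey`/`AtomicScore` types:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cassert>

using Key = uint64_t;    // stand-in for AtomicKey<K>
using Score = uint64_t;  // stand-in for AtomicScore<S>

// One base pointer; the other two sections are pure pointer arithmetic.
struct BucketView {
  Key* keys_;
  std::size_t bucket_max_size;

  Score* scores() const {
    // scores section starts right after bucket_max_size keys
    return reinterpret_cast<Score*>(keys_ + bucket_max_size);
  }
  uint8_t* digests() const {
    // digests section starts right after bucket_max_size scores
    return reinterpret_cast<uint8_t*>(scores() + bucket_max_size);
  }
  static std::size_t bytes(std::size_t n) {
    return n * (sizeof(Key) + sizeof(Score) + sizeof(uint8_t));
  }
};
```

In the real table the single allocation would come from cudaMalloc, but the offset arithmetic is the same, and the Bucket struct shrinks by two pointers.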
```
S score_ = CopyScore::lgs(sm_target_scores + key_idx_block);
CopyValue::lds_stg(rank, v_dst, v_src, dim);
founds[key_idx_grid] = true;
CopyScore::stg(scores + key_idx_grid, score_);
```
The score and found results are staged in shared memory so that they can be written to global memory at the end of the kernel with coalesced memory accesses, reducing memory traffic.
include/merlin/types.cuh
Outdated
```
/// TODO: compute the pointer of scores and digests using bucket_max_size
AtomicScore<S>* scores_;
/// @brief not visible to users
uint8_t* digests;
```
Inspired by a discussion with you: we can potentially reduce the memory consumption of the Bucket struct by dropping the separate pointers for keys_, scores_, and digests_, because only one start pointer is needed for all three.
So could you switch digests to a function like this?

```
__forceinline__ __device__ uint8_t* digests(int index) const {
  return digests_ + index;
}
```

This will benefit the refactoring I mentioned.
```
constexpr int GROUP_SIZE = 32;
constexpr int RESERVE = 16;
constexpr int DIM_BUF = 224;
constexpr int BLOCK_SIZE = 128;
constexpr int BUCKET_SIZE = 128;
```
@jiashuy Are they configurable? How do you decide their values?
At present:
- `BUCKET_SIZE` is fixed to 128. This is what users commonly use, which can be confirmed by @rhdong.
- I think `BLOCK_SIZE` should be as small as possible to reduce uneven workload, but too small a value makes the grid size too large, so I chose 128.
- `GROUP_SIZE` is set according to the profiler. When dim is small, using 16 threads to handle one key cooperatively is more effective (using 8 would consume more registers); when dim is large, using 32 threads per key lets us put larger values into shared memory (the group count is smaller, so less shared memory is needed for the double buffers). The only difference between kernel v1 and v2 is the `GROUP_SIZE`.
- `DIM_BUF` is configurable according to the shared memory size of the SM (which differs per arch). I've already finished this and will commit it today.
- `RESERVE` is the reserved size for candidate keys (digest = target digest). From statistics on contiguous keys, 16 is enough for `RESERVE`, but I use 8 in `lookup_kernel_with_io_pipeline_v2` to reduce shared memory usage, preserving correctness by trading time (latency) for space. The frequency of the reserve size actually needed follows a power-law distribution.
In summary, `BUCKET_SIZE`, `BLOCK_SIZE`, `GROUP_SIZE`, and `RESERVE` are fixed for a specific kernel.
`BLOCK_SIZE` is set according to subjective experience;
`RESERVE` and `GROUP_SIZE` are set based on profiler results and measured performance;
`DIM_BUF` is configurable and has been implemented.
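To illustrate how a shared-memory budget bounds `DIM_BUF`, here is a hedged host-side sketch. The formula counts only the double-buffered value staging (floats per group per buffer); the real kernel also spends shared memory on keys, digests, and counters, so its actual bound is tighter. The budget number below is an assumption chosen to reproduce the 224 figure, not a measured value:

```cpp
#include <cassert>

constexpr int BLOCK_SIZE  = 128;
constexpr int GROUP_SIZE  = 32;
constexpr int NUM_GROUPS  = BLOCK_SIZE / GROUP_SIZE;  // 4 groups per block
constexpr int NUM_BUFFERS = 2;                        // double buffering

// Largest per-key dim (in floats) whose double-buffered copies fit in a
// given shared-memory budget devoted to value staging.
constexpr int dim_buf_for(int value_smem_bytes) {
  return value_smem_bytes / (NUM_BUFFERS * NUM_GROUPS * (int)sizeof(float));
}
```

Under this simplified model, a 7 KiB value-staging budget (7168 bytes, assumed) yields a dim of 224 floats; a larger-shared-memory arch would permit a larger `DIM_BUF`, which is why it is derived per arch rather than hard-coded.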
```
__pipeline_commit();  // padding
__pipeline_commit();  // padding
```
@jiashuy Why do you need these paddings?
I think `__pipeline_wait_prior(x)` waits for the (x+1)-th preceding `__pipeline_commit()`.
You can observe the `__pipeline_wait_prior(3)` at line 109.
So, in the first loop iteration, we need to wait for `sm_probing_digests` to be written back during the pipeline loading stage.
I pad with `__pipeline_commit()` to avoid having to check for this in every loop iteration.
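To make the padding argument concrete, here is a host-side counter model of the stated commit/wait semantics (this is not the CUDA API, only a model of the behavior described above: `wait_prior(N)` completes all but the newest N commits):

```cpp
#include <cassert>

// Model: commit() enqueues an async batch; wait_prior(n) blocks until at
// most n committed batches remain outstanding.
struct PipelineModel {
  int outstanding = 0;  // commits not yet completed
  int completed = 0;    // commits whose copies have landed
  void commit() { ++outstanding; }
  void wait_prior(int n) {
    while (outstanding > n) { --outstanding; ++completed; }
  }
};
```

With two padding commits issued up front, the first real iteration's `wait_prior(3)` already has enough prior commits in flight that its target batch is well defined, so the loop needs no first-iteration special case.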
2b7e893 to 9055d9a
On pure HBM mode:
1. Use digests (some bits of the hashed keys) as a filter to reduce memory traffic.
2. Use a kernel-level pipeline to overlap memory accesses and hide latency.
3. Add unit tests for the lookup kernels that use the filter and pipeline.
4. Make the dim supported by the pipelined lookup kernel configurable.
5. Put common kernels into the core_kernels folder, and modify the BUILD file used for the Bazel build.
6. Change the way digests are addressed.
7. When initializing the hash table, check bucket_max_size so that keys and scores meet the cache line size.
/blossom-ci
/blossom-ci