KEMBAR78
[Feat]: find API with missing parameters by dakabang · Pull Request #183 · NVIDIA-Merlin/HierarchicalKV · GitHub
Skip to content

Conversation

dakabang
Copy link
Contributor

@dakabang dakabang commented Jan 17, 2024

In our inference service, we found that the find API will have better performance by directly writing the missed keys when the hit rate is high, which it usually is.

@github-actions
Copy link

@rhdong rhdong changed the title [Feat]: find API with add missing parameters [Feat]: find API with missing parameters Jan 17, 2024
: founds(founds_) {}

__forceinline__ __device__ void operator()(const int idx, const K /*key*/,
const bool found) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please format the code by referring the guidance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@rhdong rhdong requested a review from jiashuy January 17, 2024 20:40
const K* __restrict keys, V** __restrict values, S* __restrict scores,
K* __restrict missed_keys, int* __restrict missed_indices,
int* __restrict missed_size, int* __restrict dst_offset, size_t N) {
FoundFunctorV2<K> found_functor(missed_keys, missed_indices, missed_size);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atomicadd in FoundFunctorV2 runs on all warps when the key is not found, which can cause performance degradation when there are lots of missed key. Isn't it better to split the missed_keys size used by found_functor from n to the bucket size, and then merge according to missed_size at last kernel finishing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll work on refining this code.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll work on refining this code.

What was the result? Does this optimization show up in performance tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for the late reply, but I have verified your suggestion. Due to the additional calculations brought by the bucket separation, the benchmark results did not show performance improvement.

@rhdong
Copy link
Member

rhdong commented Feb 13, 2024

/blossom-ci

@rhdong rhdong merged commit 2da863f into NVIDIA-Merlin:master Feb 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants