- 
                Notifications
    You must be signed in to change notification settings 
- Fork 6.1k
8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) #14227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… float and double arrays)
| 👋 Welcome back vamsi-parasa! A progress list of the required criteria for merging this PR into  | 
| /label hotspot-compiler | 
| @vamsi-parasa  | 
| What happens to really short arrays? Your patch should include macro benchmarks for e.g. 50 and 10. | 
| 
 Thanks for the suggestion. Please see the performance for small array sizes below: 
 | 
| I notice that This is not quite right,  | 
| Hi @vamsi-parasa ! | 
Co-authored-by: Andrew Haley <aph-open@littlepinkcloud.com>
| 
 Hi @merykitty | 
| 
 Hi @franz1981, thank you for the suggestions! The algorithm was tested to sort only a part of the array with non-zero offsets and length. I will upstream those benchmarks/tests as well. | 
| @vamsi-parasa this pull request can not be integrated into  git checkout avx512sort
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push | 
| @sviswa7 @vamsi-parasa Pushed as commit a4e9168. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. | 
| Hi @vamsi-parasa, May be too late but there is one question. We have 2 new methods Methods  | 
| When the numpy news came out I analyzed the intel implementation here https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md. I think it's relevant here as well. I assume the new version of x86-simd-sort contains improvements, I hope they have addressed the easy to hit quadratic runtime. | 
| Hi @vamsi-parasa, If DualPivotQuicksort.java is updated, can you improve the class a little bit? Please find my suggested changes: 
 I guess new PR should be opened for all mentioned changes. Could you please go ahead? | 
| Mailing list message from Florian Weimer on hotspot-runtime-dev: I believe this has introduced a build failure with GCC 12.2 on Debian 12.1: Building target 'jdk' in configuration '/home/fw/build/jdk' | 
| @vamsi-parasa | 
| A discussion on Reddit raised that this had the potential to regress sort performance on AMD Zen 4. The poster didn't have access to Zen 4 hardware, so I took a moment to run the benchmark, comparing against a baseline with  This is explained in the linked comments/commits and this Twitter thread: | 
| @DanielThomas thank you, i was not aware of the limitation on AMD Zen 4. We need to address this and it appears the fix is simple and localized. Thankfully we are early into the JDK 22 development cycle so we have time to shake this out. | 
| 
 I believe it due to an issue in GCC 12 that has yet to be resolved: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593 We (Oracle) currently build with GCC 11.2.0 (https://wiki.openjdk.org/display/Build/Supported+Build+Platforms) so i expect we will encounter the same problem when upgrading to the 12 toolchain. | 
| 
 Hello Florian (@fweimer), As Paul mentioned, this is an issue in GCC 12. Thanks, | 
| 
 Hello Daniel (@DanielThomas ), Thank you for sharing the data. Will issue a new PR to restrict the AVX512 sort to Intel CPUs only for the time being. Thanks, | 
| 
 Hello Paul (@PaulSandoz) and Florian (@fweimer), Just did a successful OpenJDK build (without any of the errors reported) using the latest GCC 12 (12.3.1) . Also, reproduced the reported build failure using GCC 12.2.0. Thanks, | 
| @IntrinsicCandidate | ||
| @ForceInline | ||
| private static <A> void sort(Class<?> elemType, A array, long offset, int low, int high, SortOperation<A> so) { | ||
| so.sort(array, low, high); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm late to the party, but how does the fallback implementation work (or is intended to) for off-heap case (null + absolute base address)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, for on-heap case the fallback implementation is equivalent to intrinsified case only when offset points at the 0th element of the array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iwanowww Yes, you are late to the party :). The fallback implementation could be similar to the vectorizedMismatch regarding base/offset for non heap case. Please see java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java.
Yes, the fallback implementation for non intrinsic case is kept to be equivalent to what was before the SIMD sort PR. This is done to not affect the fallback performance on other platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with fallback implementation as it is now is that it doesn't take into account the offset. It's completely broken for off-heap case and works for on-heap case only because the implementation always passes primitive array base offset.
| 
 hello, I tested this feature in openjdk 22.19, but it didn't seem to work. public class JDKSort {
    public static void main(String[] args) {
        new JDKSort().sort();
    }
    public void sort() {
        // 100 million pieces of data
        int size = 100000000;
        int[] arr = new int[size];
        java.util.Random rand = new java.util.Random();
        for (int i = 0; i < size; ++i) {
            arr[i] = rand.nextInt();
        }
        long startTime = System.currentTimeMillis();
        java.util.Arrays.sort(arr);
        long elapseTime = System.currentTimeMillis() - startTime;
        System.out.println("elapse time -> " + elapseTime + " ms");
    }
}test result: 
 this feature looks like it's merged into 22.19: | 
| @himichael Please refer to this question for how to correctly benchmark Java code. | 
| 
 thanks for your reply, but I think this has nothing to do with benchmark. in open jdk 22.19 , run command: my question is that this feature should improve performance several times, but it doesn't look like there's much difference between open jdk 22.19 and jdk 8. | 
| @himichael Please try java -Xbatch -XX:-TieredCompilation JDKSort. The method DualPivotQuickSort.sort() needs to be C2 compiled for you to see the benefit. | 
| 
 Hello @himichael, Using your code snippet, please see the output below using the latest JDK and JDK 20 (which does not have AVX512 sort): JDK 20 (without AVX512 sort): elapse time -> 7501 ms JDK 22 (with AVX512 sort) It shows 4.66x speedup. | 
| 
 Hello, @vamsi-parasa /data/soft/jdk1.8.0_371/bin/javac  JDKSort.java
/data/soft/jdk1.8.0_371/bin/java  JDKSortelapse time -> 15309 ms use OpenJDK 22.19(with AVX512 sort) /data/soft/jdk-22/bin/javac JDKSort.java
/data/soft/jdk-22/bin/java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
CompileCommand: CompileThresholdScaling java/util/DualPivotQuicksort.sort double CompileThresholdScaling = 0.000100elapse time -> 11687 ms Not much seems to have changed. My JDK info: /data/soft/jdk-22/bin/java -version
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)JDK 8: /data/soft/jdk1.8.0_371/bin/java -version
java version "1.8.0_371"
Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)I tested Intel's x86-simd-sort, my code as follow: #include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include "src/avx512-32bit-qsort.hpp"
int main() {
    // 100 million records
    const int size = 100000000;
    std::vector<int> random_array(size);
    for (int i = 0; i < size; ++i) {
        random_array[i] = rand();
    }
    auto start_time = std::chrono::steady_clock::now();
    avx512_qsort(random_array.data(), size);
    auto end_time = std::chrono::steady_clock::now();
    auto elapse_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();
    std::cout << "elapse time -> " << elapse_time << " ms" << std::endl;
    return 0;
}compile commands: elapse time -> 1151 ms Here is my cpu information: Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             8
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel Xeon Processor (Skylake, IBRS)
Stepping:              4
CPU MHz:               2394.374
BogoMIPS:              4788.74
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 md_clear spec_ctrl
 
 | 
| @himichael What do you mean by this having nothing to do with benchmark. You are trying to execute some code to measure its execution time, which is benchmarking. And you are doing that on only 1 simple function, which makes your benchmark micro. To be more specific, this is a C2-specific optimisation, so only C2-compiled code is benefitted from it. As a result, you need to have the function compiled BEFORE starting the clock. Typically, this is ensured by executing the function repeatedly for several iterations (the current default value is 20000), starting the clock, executing the function several more times, stopping the clock and calculating the average throughput. As this is quite complex and contains non-trivial caveats, it is recommended to use JMH for microbenchmarks. | 
| @himichael , could you check if the libsimdsort.so is being loaded by running the command below? Here is my output: | 
| 
 Hello, @vamsi-parasa, I see why, I run command  [0.071s][info][library] Loaded library libjsvml.so, handle 0x00007f3a5c0223e0
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)There is no information about  run  [0.044s][info][library] Loaded library libjsvml.so, handle 0x00007f00b40223e0
[0.067s][info][library] Failed to find JNI_OnLoad_nio in library with handle 0x00007f016a4f6150
[0.068s][info][library] Loaded library /data/soft/jdk-22/lib/libnio.so, handle 0x00007f0160196da0
[0.068s][info][library] Found JNI_OnLoad in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_init in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_getcwd in library with handle 0x00007f0160196da0
[0.069s][info][library] Failed to find JNI_OnLoad_jimage in library with handle 0x00007f016a4f6150
[0.069s][info][library] Loaded library /data/soft/jdk-22/lib/libjimage.so, handle 0x00007f0160005450
[0.069s][info][library] Failed to find JNI_OnLoad in library with handle 0x00007f0160005450
[0.069s][info][library] Failed to find Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007f0160005450It shows some bad failures, but it doesn't seem to have anything to do with libsimdsort. | 
| can I use this function for sorting excel file data with multiple columns in it? | 
| 
 @himichael Did you ever resolve your issue? I am using JDK22 from SDKMan and also do not have  However, I checked my  | 
| The answer for slow performance of AVX512 version of x86-simd-sort on Zen 4 is most probably explained in AMD manuals which could be found at: https://www.amd.com/en/search/documentation/hub.html#q=software%20optimization%20guide%20for%20the%20amd%20microarchitecture&f-amd_document_type=Software%20Optimization%20Guides Software Optimization Guide for the AMD Zen4 Microarchitecture has following remark in "2.11.2 Code recommendations" chapter: 
 Software Optimization Guide for the AMD Zen5 Microarchitecture doesn't have any remark about COMPRESS instructions. Could you add some code that disables the AVX512 version on Zen4, but keeps it enabled on Zen5 and future Zen architectures? | 
| 
 Or as you suggest here revert to AVX2. I updated JDK-8317976 with that suggestion, which is simpler to maintain. The HotSpot C++ class  | 
The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the performance data below.
Arrays.sort performance data using JMH benchmarks for arrays with random data
This is the first in the series of PRs related to vectorized sorting. Future planned steps after the integration of this PR:
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14227/head:pull/14227$ git checkout pull/14227Update a local copy of the PR:
$ git checkout pull/14227$ git pull https://git.openjdk.org/jdk.git pull/14227/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 14227View PR using the GUI difftool:
$ git pr show -t 14227Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14227.diff
Webrev
Link to Webrev Comment