ARM C Language Extensions - Al Grant
ARM C Language Extensions - Al Grant
Release 2.0
Abstract
This document specifies the ARM C Language Extensions to enable C/C++ programmers to exploit the ARM
architecture with minimal restrictions on source code portability.
Keywords
ACLE, ABI, C, C++, compiler, armcc, gcc, intrinsic, macro, attribute, NEON, SIMD, atomic
Confidentiality status
This document is Non-Confidential.
Proprietary Notice
This document is protected by copyright and other related rights and the practice or implementation of the
information contained in this document may be protected by one or more patents or pending patent applications.
No part of this document may be reproduced in any form by any means without the express prior written
permission of ARM. No license, express or implied, by estoppel or otherwise to any intellectual property
rights is granted by this document unless specifically stated.
Your access to the information in this document is conditional upon your acceptance that you will not use or permit
others to use the information for the purposes of determining whether implementations infringe any third party
patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, ARM makes no representation
with respect to, and has undertaken no analysis to identify or understand the scope and content of, third party
patents, copyrights, trade secrets, or other rights.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 1 of 81
Non-Confidential
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY,
ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication
or disclosure of this document complies fully with any relevant export laws and regulations to assure that this
document or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the
word “partner” in reference to ARM‟s customers is not intended to create or refer to any partnership relationship
with any other company. ARM may make changes to this document at any time and without notice.
If any of the provisions contained in these terms conflict with any of the provisions of any click through or signed
written agreement covering this document with ARM, then the click through or signed written agreement prevails
over and supersedes the conflicting provisions of these terms. This document may be translated into other
languages for convenience, and you agree that if there is any conflict between the English version of this
document and any translation, the terms of the English version of the Agreement shall prevail.
Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM Limited or its affiliates in
the EU and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the
trademarks of their respective owners. Please follow ARM‟s trademark usage guidelines at
http://www.arm.com/about/trademark-usage-guidelines.php.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 2 of 81
Non-Confidential
Contents
1.2 References 8
2 SCOPE 10
3 INTRODUCTION 11
4 C LANGUAGE EXTENSIONS 13
4.3 Intrinsics 14
4.3.1 Constant arguments to intrinsics 14
4.5 Attributes 15
5.1 Introduction 16
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 3 of 81
Non-Confidential
6 FEATURE TEST MACROS 19
6.1 Introduction 19
6.3 Endianness 19
7.5 Alignment 30
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 4 of 81
Non-Confidential
7.5.1 Alignment attribute 30
7.5.2 Alignment of static objects 30
7.5.3 Alignment of stack objects 31
7.5.4 Procedure calls 31
7.5.5 Alignment of C heap storage 31
7.5.6 Alignment of C++ heap allocation 31
8.1 Introduction 33
8.4 Hints 35
8.5 Swap 36
8.7 NOP 38
9 DATA-PROCESSING INTRINSICS 39
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 5 of 81
Non-Confidential
9.5.5 Packing and unpacking 45
9.5.6 Parallel selection 45
9.5.7 Parallel 8-bit addition and subtraction 45
9.5.8 Sum of 8-bit absolute differences 46
9.5.9 Parallel 16-bit addition and subtraction 46
9.5.10 Parallel 16-bit multiplication 48
9.5.11 Examples 49
11 INSTRUCTION GENERATION 54
12 NEON INTRINSICS 57
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 6 of 81
Non-Confidential
12.3.7 NEON loads and stores 64
12.3.7.1 Examples 65
12.3.7.2 Alignment assertions 66
12.3.8 NEON lane-by-lane operations 67
12.3.10 NEON Vector Additions to AArch32 in ARMv8 75
12.3.11 NEON vector reductions 76
12.3.12 NEON vector rearrangements 77
12.3.13 NEON vector table lookup 78
12.3.14 Crypto Intrinsics 78
13 FUTURE DIRECTIONS 80
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 7 of 81
Non-Confidential
1 ABOUT THIS DOCUMENT
1.2 References
This document refers to the following documents.
Ref Doc No Author(s) Title
ARMARM ARM DDI 0406C ARM ARM Architecture Reference Manual (7-A / 7-R)
ARMARMv8 ARM DDI0487A.B ARM ARMv8-A Reference Manual (Issue A.b)
ARMv7M ARM DDI 0403C ARM ARM Architecture Reference Manual (7-M)
AAPCS ARM IHI 0042D ARM Procedure Call Standard
AAPCS64 ARM IHI0055C-BETA ARM Procedure Call Standard (AArch64)
BA ARM IHI 0045C ARM EABI Addenda and Errata – Build Attributes
C++11 ISO/IEC 14882:2011 ISO Standard C++ (based on draft N3337)
C11 ISO/IEC 9899:2011 ISO Standard C (based on draft N1570)
C99 ISO 9899:1999 ISO Standard C (“C99”)
cxxabi http://mentorembedded.github. Code- Itanium C++ ABI (rev. 1.86)
com/cxx-abi/abi.html Sourcery
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 8 of 81
Non-Confidential
G.191 T-REC-G.191-200508-I ITU-T Software Tool Library 2005 User‟s Manual
GNUC http://gcc.gnu.org/onlinedocs GNU/FSF GNU C Compiler Collection
IA-64 245370-003 Intel Intel Itanium Processor-Specific ABI
IEEE-FP IEEE 754-2008 IEEE IEEE floating-point
POSIX IEEE 1003.1 IEEE / TOG The Open Group base specifications
Warren ISBN 0-201-91465-4 H. Warren “Hacker‟s Delight”, pub. Addison-Wesley 2003
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 9 of 81
Non-Confidential
2 SCOPE
The ARM C Language Extensions (ACLE) specification specifies source language extensions and implementation
choices that C/C++ compilers can implement in order to allow programmers to better exploit the ARM architecture.
The extensions include:
Predefined macros that provide information about the functionality of the target architecture (for example,
whether it has hardware floating-point)
Intrinsic functions
Attributes that can be applied to functions, data and other entities
This specification does not standardize command-line options, diagnostics or other external behavior of compilers.
The intended users of this specification are:
Application programmers wishing to adapt or hand-optimize applications and libraries for ARM targets
System programmers needing low-level access to ARM targets beyond what C/C++ provides for
Compiler implementors, who will implement this specification
Implementors of IDEs, static analysis tools etc. who wish to deal with the C/C++ source language
extensions when encountered in source code
Some of the material – specifically, the architecture/CPU namings, and the feature test macros – may also be
applicable to assemblers and other tools.
ACLE is not a hardware abstraction layer (HAL), and does not specify a library component – but it may make it
easier to write a HAL or other low-level library in C rather than assembler.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 10 of 81
Non-Confidential
3 INTRODUCTION
Modern computer architectures (such as ARM) include architectural features that go beyond the set of operations
available in C/C++. These features may include SIMD and saturating instructions. Exploiting these features to
improve program efficiency has in the past caused “lock-in” to compilers, or to individual CPUs.
The intention of the ARM C Language Extensions (ACLE) is to allow the writing of applications and middleware
code that is portable across compilers, and across ARM architecture variants, while exploiting the unique features
of the ARM architecture family.
The design principles for ACLE can be summarized as:
Defined new NEON Intrinsics for ARMv8 AArch32 and AArch64 NEON[ACLE-3]
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 11 of 81
Non-Confidential
3.3 Portable Binary Objects
In AArch32, the ABI for the ARM Architecture defines a set of build attributes [BA]. These attributes are intended
to facilitate generating cross-platform portable binary object files by providing a mechanism to determine the
compatibility of object files. In AArch64, the ABI does not define a standard set of build attributes and takes the
approach that binaries are, in general, not portable across platforms. References to build attributes in this
document should be interpreted as applying only to AArch32.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 12 of 81
Non-Confidential
4 C LANGUAGE EXTENSIONS
When pointers are 32 bits, the „long‟ type is 32 bits (ILP32 model).
When pointers are 64 bits, the „long‟ type may be either 64 bits (LP64 model) or 32 bits (LLP64 model).
Whether the underlying type of an enumeration is minimal or at least 32-bit, is implementation-defined. The
predefined macro __ARM_SIZEOF_MINIMAL_ENUM should be defined as 1 or 4 according to the size of a minimal
enumeration type such as enum { X=0 }. An implementation that conforms to the ARM ABI must reflect its
choice in the Tag_ABI_enum_size build attribute.
wchar_t may be 2 or 4 bytes. The predefined macro __ARM_SIZEOF_WCHAR_T should be defined as the same
number. An implementation that conforms to the ARM ABI must reflect its choice in the Tag_ABI_PCS_wchar_t
build attribute.
16-bit floating point is a storage and interchange format only. Values of __fp16 type promote to (at least) float
when used in arithmetic operations, in the same way that values of char or short types promote to int. There
is no arithmetic directly on 16-bit values.
Conversion from 64-bit to 16-bit, i.e. from double to __fp16, must round only once. (With round-to-nearest,
converting first to 32-bit and then to 16-bit could give an incorrectly rounded result.) Because in current ARM
hardware floating-point architectures this is not a primitive operation, it may be faster to convert first to single-
precision and then to half-precision:
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 13 of 81
Non-Confidential
double xd;
__fp16 xs = (float)xd;
rather than:
double xd;
__fp16 xs = xd;
In some older implementations, __fp16 cannot be used as an argument or result type, though it can be used as a
field in a structure passed as an argument or result, or passed via a pointer. The predefined macro
__ARM_FP16_ARGS should be defined if __fp16 can be used as an argument and result. C++ name mangling is
“Dh” as defined in [cxxabi], and is the same for both the IEEE and alternative formats.
4.3 Intrinsics
ACLE standardizes intrinsics to access the NEON (Advanced SIMD) extension. These intrinsics are intended to
be compatible with existing implementations. Before using the NEON intrinsics or data types, the <arm_neon.h>
header must be included. The NEON intrinsics are defined in section 12. Note that the NEON intrinsics and data
types are in the user namespace.
ACLE also standardizes other intrinsics to access ARM instructions which do not map directly to C operators –
generally either for optimal implementation of algorithms, or for accessing specialist system-level features.
Intrinsics are defined further in various following sections.
Before using the non-NEON intrinsics, the <arm_acle.h> header should be included.
Whether intrinsics are macros, functions or built-in operators is unspecified. For example:
it is unspecified whether applying #undef to an intrinsic removes the name from visibility
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 14 of 81
Non-Confidential
range, or sometimes a string literal. An implementation should produce a diagnostic if the argument does not meet
the requirements.
<arm_neon.h> is provided to define the NEON intrinsics. As these intrinsics are in the user namespace, an
implementation would not normally define them until the header is included. The __ARM_NEON macro should be
tested before including the header:
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif /* __ARM_NEON */
These headers behave as standard library headers; repeated inclusion has no effect beyond the first include.
It is unspecified whether the ACLE headers include the standard headers <assert.h>, <stdint.h> or
<inttypes.h>. However, the ACLE headers will not define the standard type names (uint32_t etc.) except by
inclusion of the standard headers. Programmers are recommended to include the standard headers explicitly if the
associated types and macros are needed.
In C++, the following source code fragments are expected to work correctly:
#include <stdint.h>
// UINT64_C not defined here since we did not set __STDC_FORMAT_MACROS
...
#include <arm_neon.h>
and
#include <arm_neon.h>
...
#define __STDC_FORMAT_MACROS
#include <stdint.h>
// ... UINT64_C is now defined
4.5 Attributes
GCC-style attributes are provided to annotate types, objects and functions with extra information, such as
alignment. These attributes are defined in section 7.
Alternatively, <arm_acle.h> could define the ACLE intrinsics in terms of already supported features of the
implementation, e.g. compiler intrinsics with other names, or inline functions using inline assembler.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 15 of 81
Non-Confidential
5 ARCHITECTURE AND CPU NAMES
5.1 Introduction
The intention of this section is to standardize architecture names, e.g. for use in compiler command lines.
Toolchains should accept these names case-insensitively where possible, or use all lowercase where not
possible. Tools may apply local conventions such as using hyphens instead of underscores.
(Note: processor names, including from the ARM Cortex™ family, are used as illustrative examples. This
specification is applicable to any processors implementing the ARM architecture.)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 16 of 81
Non-Confidential
Name Features ARM Thumb Example processor
VFPv3_D16_FP16 VFPv3 with 16 D-registers and FP16 Cortex-A9 (without NEON), Cortex-R7
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 17 of 81
Non-Confidential
Name Features Example processor
VFPv4_D16 VFPv4 (including FMA and FP16) with 16 D-registers Cortex-A5 (VFP option)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 18 of 81
Non-Confidential
6 FEATURE TEST MACROS
6.1 Introduction
The feature test macros allow programmers to determine the availability of ACLE or subsets of it, or of target
architectural features. This may indicate the availability of some source language extensions (e.g. intrinsics) or the
likely level of performance of some standard C features, such as integer division and floating-point.
Several macros are defined as numeric values to indicate the level of support for particular features. These
macros are undefined if the feature is not present. (Aside: in Standard C/C++, references to undefined macros
expand to 0 in preprocessor expressions, so a comparison such as
#if __ARM_ARCH >= 7
will have the expected effect of evaluating to false if the macro is not defined.)
All ACLE macros begin with the prefix __ARM_. All ACLE macros expand to integral constant expressions suitable
for use in an #if directive, unless otherwise specified. Syntactically, they must be primary-expressions –
generally this means an implementation should enclose them in parentheses if they are not simple constants.
6.3 Endianness
__ARM_BIG_ENDIAN is defined as 1 if data is stored by default in big-endian format. If the macro is not set, data is
stored in little-endian format. (Aside: the “mixed-endian” format for double-precision numbers, used on some very
old ARM FPU implementations, is not supported by ACLE or the ARM ABI.)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 19 of 81
Non-Confidential
6.4.1 ARM/Thumb instruction set architecture
__ARM_ARCH is defined as an integer value indicating the current ARM instruction set architecture (e.g. 7 for the
ARM v7-A architecture implemented by Cortex-A8 or the ARM v7-M architecture implemented by Cortex-M3 or 8
for the ARM v8-A architecture implemented by Cortex-A57). Since ACLE only supports the ARM architecture, this
macro would always be defined in an ACLE implementation.
Note that the __ARM_ARCH macro is defined even for cores which only support the Thumb instruction set.
__ARM_ARCH_ISA_ARM is defined to 1 if the core supports the ARM instruction set. It is not defined for M-profile
cores.
__ARM_ARCH_ISA_THUMB is defined to 1 if the core supports the original Thumb instruction set (including the v6-M
architecture) and 2 if it supports the Thumb-2 instruction set as found in the v6T2 architecture and all v7
architectures.
__ARM_ARCH_ISA_A64 is defined to 1 if the core supports AArch64‟s A64 instruction set.
This macro corresponds to the Tag_CPU_arch_profile object build attribute. It may be useful to writers of system
code. It is expected in most cases programmers will use more feature-specific tests.
Values „R‟, „M‟ and „S‟ are unsupported for architectural targets with __ARM_ARCH > 7.
The macro is undefined for architectural targets which predate the use of architectural profiles.
6.4.4 LDREX/STREX
This feature is deprecated in ACLE 2.0. It is strongly recommended that C11/C++11 atomics be used instead.
__ARM_FEATURE_LDREX is defined if the load/store-exclusive instructions (LDREX/STREX) are supported. Its value
is a set of bits indicating available widths of the access, as powers of 2. The following bits are used:
Bit Value Access width Instruction
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 20 of 81
Non-Confidential
Bit Value Access width Instruction
6.4.5 CLZ
__ARM_FEATURE_CLZ is defined to 1 if the CLZ (count leading zeroes) instruction is supported in hardware. Note
that ACLE provides the __clz() family of intrinsics (see 9.2) even when __ARM_FEATURE_CLZ is not defined.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 21 of 81
Non-Confidential
6.4.9 32-bit SIMD instructions
__ARM_FEATURE_SIMD32 is defined to 1 if the 32-bit SIMD instructions are supported and the intrinsics defined in
9.5 are available. This also implies support for the GE global flags which indicate byte-by-byte comparison results.
__ARM_FEATURE_SIMD32 is deprecated in ACLE 2.0 for A-profile. Users are encouraged to use NEON Intrinscs as
an equivalent for the 32-bit SIMD intrinsics functionality. However they are fully supported for M and R-profiles.
This is defined for AArch32 only.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 22 of 81
Non-Confidential
Support for 16-bit floating-point language extensions (see 6.5.2) is only required to be available if supported in
hardware. Hardware support for 16-bit floating-point is limited to conversions. Values are promoted to 32-bit
(single-precision) type for arithmetic.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 23 of 81
Non-Confidential
6.5.7 Crypto Extension
__ARM_FEATURE_CRYPTO is defined to 1 if the Crypto instructions are supported and the intrinsics defined in
12.3.14 are available. These instructions include AES{E, D}, SHA1{C, P, M} etc. This is only available when
__ARM_ARCH >= 8.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 24 of 81
Non-Confidential
6.7 Procedure call standard
__ARM_PCS is defined to 1 if the default procedure calling standard for the translation unit conforms to the “base
PCS” defined in [AAPCS]. This is supported on AArch32 only.
__ARM_PCS_VFP is defined to 1 if the default is to pass floating-point parameters in hardware floating-point
registers using the “VFP variant PCS” defined in [AAPCS]. This is supported on AArch32 only.
__ARM_PCS_AAPCS64 is defined to 1 if the default procedure calling standard for the translation unit conforms to
the [AAPCS64].
Note that this should reflect the implementation default for the translation unit. Implementations which allow the
PCS to be set for a function, class or namespace are not expected to redefine the macro within that scope.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 25 of 81
Non-Confidential
6.8 Mapping of object build attributes to predefines
This section is provided for guidance. Details of build attributes can be found in [BA].
Tag no. Tag Predefined macro
6 Tag_CPU_arch __ARM_ARCH,
__ARM_FEATURE_DSP
7 Tag_CPU_arch_profile __ARM_PROFILE
8 Tag_ARM_ISA_use __ARM_ISA_ARM
9 Tag_THUMB_ISA_use __ARM_ISA_THUMB
11 Tag_WMMX_arch __ARM_WMMX
18 Tag_ABI_PCS_wchar_t __ARM_SIZEOF_WCHAR_T
20 Tag_ABI_FP_denormal
21 Tag_ABI_FP_exceptions
22 Tag_ABI_FP_user_exceptions
23 Tag_ABI_FP_number_model
26 Tag_ABI_enum_size __ARM_SIZEOF_MINIMAL_ENUM
34 Tag_CPU_unaligned_access __ARM_FEATURE_UNALIGNED
36 Tag_FP_HP_extension __ARM_FP16_FORMAT_IEEE,
__ARM_FP16_FORMAT_ALTERNATIVE
38 Tag_ABI_FP_16bit_format __ARM_FP16_FORMAT_IEEE,
__ARM_FP16_FORMAT_ALTERNATIVE
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 26 of 81
Non-Confidential
6.9 Summary of predefined macros
Macro name Meaning Example See section
__ARM_32BIT_STATE code is for AArch32 state 1 6.4.1
__ARM_64BIT_STATE code is for AArch64 state 1 6.4.1
__ARM_ACLE indicates ACLE implemented 101 6.2
__ARM_ALIGN_MAX_PWR log of maximum alignment of static object 20 7.5.2
__ARM_ALIGN_MAX_STACK_PWR log of maximum alignment of stack object 3 7.5.3
__ARM_ARCH ARM architecture level 7 6.4.1
__ARM_ARCH_ISA_A64 AArch64 ISA present 1 6.4.1
__ARM_ARCH_ISA_ARM ARM instruction set present 1 6.4.1
__ARM_ARCH_ISA_THUMB Thumb instruction set present 2 6.4.1
__ARM_ARCH_PROFILE architecture profile „A‟ 6.4.2
__ARM_BIG_ENDIAN memory is big-endian 1 6.3
__ARM_FEATURE_CLZ CLZ instruction 1 6.4.5, 9.2
__ARM_FEATURE_CRC32 CRC32 extension 1 6.5.8
__ARM_FEATURE_CRYPTO Crypto extension 1 6.5.7
__ARM_FEATURE_DIRECTED_ROUNDING Directed Rounding 1 12.3.10
__ARM_FEATURE_DSP DSP instructions (ARM v5E) (32-bit-only) 1 6.4.6, 9.4
__ARM_FEATURE_FMA floating-point fused multiply-accumulate 1 6.5.3, 9.6
__ARM_FEATURE_IDIV Hardware Integer Divide 1 6.4.10
__ARM_FEATURE_LDREX(Deprecated) load/store exclusive instructions 0x0F 6.4.4, 8
__ARM_FEATURE_NUMERIC_MAXMIN Numeric Maximum and Minimum 1 12.3.10
__ARM_FEATURE_QBIT Q (saturation) flag (32-bit-only) 1 6.4.6, 9.1.1
__ARM_FEATURE_SAT width-specified saturation instructions (32- 1 6.4.8, 9.4.1
bit-only)
__ARM_FEATURE_SIMD32 32-bit SIMD instructions (ARM v6) (32-bit- 1 6.4.8, 9.5
only)
__ARM_FEATURE_UNALIGNED hardware support for unaligned access 1 6.4.3
__ARM_FP hardware floating-point 0x0C 6.5.1
__ARM_FP16_ARGS __fp16 argument and result 1 6.5.11
__ARM_FP16_FORMAT_ALTERNATIVE 16-bit floating-point, alternative format 1 6.5.2
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 27 of 81
Non-Confidential
__ARM_FP16_FORMAT_IEEE 16-bit floating-point, IEEE format 1 6.5.2
__ARM_FP_FAST accuracy-losing optimizations 1 6.6
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 28 of 81
Non-Confidential
7 ATTRIBUTES AND PRAGMAS
attribute A applies to both x and y; B and C apply to x only, and D and E apply to y only. Programmers are
recommended to keep declarations simple if attributes are used.
Unless otherwise stated, all attribute arguments must be compile-time constants.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 29 of 81
Non-Confidential
7.4 Weak linkage
__attribute__((weak)) can be attached to declarations and definitions to indicate that they have weak static
linkage (STB_WEAK in ELF objects). As definitions, they can be overridden by other definitions of the same
symbol. As references, they do not need to be satisfied and will be resolved to zero if a definition is not present.
7.5 Alignment
The new standards for C [C11 6.7.5] and C++ [C++11 7.6.2] add syntax for aligning objects and types. ACLE
provides an alternative syntax described in this section.
the type of &x is “char *” and the type of &y is “int *”. The following declarations are equivalent:
struct S x __attribute__((aligned(16))); /* ACLE */
struct S _Alignas(16) x; /* C11 */
#include <stdalign.h> /* C11 (alternative) */
struct S alignas(16) x;
struct S alignas(16) x; /* C++11 */
Since an alignment request on an object does not change its type or size, x in this example would have type int
and size 4.
There is in principle no limit on the alignment of static objects, within the constraints of available memory. In the
ARM ABI an object with a requested alignment would go into an ELF section with at least as strict an alignment
requirement. However, an implementation supporting position-independent dynamic objects or overlays may need
to place restrictions on their alignment demands.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 30 of 81
Non-Confidential
7.5.3 Alignment of stack objects
It must be possible to align any local object up to the stack alignment as specified in the AAPCS for AArch32 (i.e.
8 bytes) or as specified in AAPCS64 for AArch64 (i.e. 16 bytes) this being also the maximal alignment of any
native type.
An implementation may, but is not required to, permit the allocation of local objects with greater alignment, e.g. 16
or 32 bytes for AArch32. (This would involve some runtime adjustment such that the object address was not a
fixed offset from the stack pointer on entry.)
If a program requests alignment greater than the implementation supports, it is recommended that the compiler
warn but not fault this. Programmers should expect over-alignment of local objects to be treated as a hint.
The macro __ARM_ALIGN_MAX_STACK_PWR indicates (as the exponent of a power of 2) the maximum available
stack alignment. For example, a value of 3 indicates 8-byte alignment.
which means that in AArch32 AAPCS the second parameter is in R2/R3 rather than R1/R2.
as defined in [POSIX], or
void *aligned_alloc(size_t alignment, size_t size);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 31 of 81
Non-Confidential
that the cookie is not necessarily at the address seen by the allocation and deallocation functions.
Implementations will need to make some adjustments before and after calls to the ABI-defined C++ runtime, or
may provide additional non-standard runtime helper functions.) Example:
struct float4 {
void *operator new[](size_t s) {
void *p;
posix_memalign(&p, 16, s);
return p;
}
float data[4];
} __attribute__((aligned(16)));
If the user has not provided their own allocation function, the behavior is implementation-defined.
The generic itanium C++ ABI, which we use in AArch64, already handles arrays with arbitrarily aligned elements
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 32 of 81
Non-Confidential
8 SYNCHRONIZATION, BARRIER AND HINT INTRINSICS
8.1 Introduction
This section provides intrinsics for managing data that may be accessed concurrently between processors, or
between a processor and a device. Some intrinsics atomically update data, while others place barriers around
accesses to data to ensure that accesses are visible in the correct order.
Memory prefetch intrinsics are also described in this section.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 33 of 81
Non-Confidential
Argument Mnemonic Domain Ordered Accesses (before-after)
Generates a DMB (data memory barrier) instruction or equivalent CP15 instruction. DMB ensures the observed
ordering of memory accesses. Memory accesses of the specified type issued before the DMB are guaranteed to
be observed (in the specified scope) before memory accesses issued after the DMB. For example, DMB should
be used between storing data, and updating a flag variable that makes that data available to another core.
The __dmb() intrinsic also acts as a compiler memory barrier of the appropriate type.
void __dsb(/*constant*/ unsigned int);
Generates a DSB (data synchronization barrier) instruction or equivalent CP15 instruction. DSB ensures the
completion of memory accesses. A DSB behaves as the equivalent DMB and has additional properties. After a
DSB instruction completes, all memory accesses of the specified type issued before the DSB are guaranteed to
have completed.
The __dsb() intrinsic also acts as a compiler memory barrier of the appropriate type.
void __isb(/*constant*/ unsigned int);
Generates an ISB (instruction synchronization barrier) instruction or equivalent CP15 instruction. This instruction
flushes the processor pipeline fetch buffers, so that following instructions are fetched from cache or memory. An
ISB is needed after some system maintenance operations.
An ISB is also needed before transferring control to code that has been loaded or modified in memory, for
example by an overlay mechanism or just-in-time code generator. (Note that if instruction and data caches are
separate, privileged cache maintenance operations would be needed in order to unify the caches.)
The only supported argument for the __isb() intrinsic is 15, corresponding to the SY (full system) scope of the
ISB instruction.
8.3.1 Examples
In this example, process P1 makes some data available to process P2 and sets a flag to indicate this.
P1:
value = x;
/* issue full-system memory barrier for previous store:
setting of flag is guaranteed not to be observed before
write to value */
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 34 of 81
Non-Confidential
__dmb(14);
flag = true;
P2:
/* busy-wait until the data is available */
while (!flag) {}
/* issue full-system memory barrier: read of value is guaranteed
not to be observed by memory system before read of flag */
__dmb(15);
use value;
P2:
/* busy-wait until work item appears */
while (!(work = queue_head)) {}
/* no barrier needed: load of payload is data-dependent */
use work->payload
8.4 Hints
The intrinsics in this section are available for all targets. They may be no-ops (i.e. generate no code, but possibly
act as a code motion barrier in compilers) on targets where the relevant instructions do not exist. On targets where
the relevant instructions exist but are implemented as no-ops, these intrinsics generate the instructions.
void __wfi(void);
Generates a WFI (wait for interrupt) hint instruction, or nothing. The WFI instruction allows (but does not require)
the processor to enter a low-power state until one of a number of asynchronous events occurs.
void __wfe(void);
Generates a WFE (wait for event) hint instruction, or nothing. The WFE instruction allows (but does not require)
the processor to enter a low-power state until some event occurs such as a SEV being issued by another
processor.
void __sev(void);
Generates a SEV (send a global event) hint instruction. This causes an event to be signaled to all processors in a
multiprocessor system. It is a NOP on a uniprocessor system.
void __sevl(void);
Generates a “send a local event” hint instruction. This causes an event to be signaled to only the processor
executing this instruction. In a multiprocessor system, it is not required to affect the other processors.
void __yield(void);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 35 of 81
Non-Confidential
Generates a YIELD hint instruction. This enables multithreading software to indicate to the hardware that it is
performing a task, for example a spin-lock, that could be swapped out to improve overall system performance.
void __dbg(/*constant*/ unsigned int);
Generates a DBG instruction. This provides a hint to debugging and related systems. The argument must be a
constant integer from 0 to 15 inclusive. See implementation documentation for the effect (if any) of this instruction
and the meaning of the argument. This is available only when compliling for AArch32.
8.5 Swap
__swp is available for all targets. This intrinsic expands to a sequence equivalent to the deprecated (and possibly
unavailable) SWP instruction.
uint32_t __swp(uint32_t, volatile void *);
unconditionally stores a new value at the given address, and returns the old value.
As with the IA-64/GCC primitives described in 0, the __swp intrinsic is polymorphic. The second argument must
provide the address of a byte-sized object or an aligned word-sized object and it must be possible to determine
the size of this object from the argument expression.
This intrinsic is implemented by LDREX/STREX (or LDREXB/STREXB) where available, as if by
uint32_t __swp(uint32_t x, volatile uint32_t *p) {
uint32_t v;
/* use LDREX/STREX intrinsics not specified by ACLE */
do v = __ldrex(p); while (__strex(x, p));
return v;
}
or alternatively,
uint32_t __swp(uint32_t x, uint32_t *p) {
uint32_t v;
/* use IA-64/GCC atomic builtins */
do v = *p; while (!__sync_bool_compare_and_swap(p, v, x));
return v;
}
Only if load-store exclusive instructions are not available will the intrinsic use the SWP/SWPB instructions.
It is strongly recommended to use standard and flexible atomic primitives such as those available in the C++
<atomic> header. __swp is provided solely to allow straightforward (and possibly automated) replacement of
explicit use of SWP in inline assembler. SWP is obsolete in the ARM architecture, and in recent versions of the
architecture, may be configured to be unavailable in user-mode. (Aside: unconditional atomic swap is also less
powerful as a synchronization primitive than load-exclusive/store-conditional.)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 36 of 81
Non-Confidential
8.6.1 Data prefetch
void __pld(void const volatile *addr);
Generates a data prefetch instruction, if available. The argument should be any expression that may designate a
data address. The data is prefetched to the innermost level of cache, for reading.
void __pldx(/*constant*/ unsigned int /*access_kind*/,
/*constant*/ unsigned int /*cache_level*/,
/*constant*/ unsigned int /*retention_policy*/,
void const volatile *addr);
Generates a data prefetch instruction. This intrinsic allows the specification of the expected access kind (read or
write), the cache level to load the data, the data retention policy (temporal or streaming), The relevant arguments
can only be one of the following values.
Access Kind Value Summary
KEEP 0 Temporal fetch of the addressed location (i.e. allocate in cache normally)
STRM 1 Streaming fetch of the addressed location (i.e. memory used only once)
__pldx and __plix arguments „cache level‟ and „retention policy‟ are ignored on unsupported targets.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 37 of 81
Non-Confidential
8.7 NOP
void __nop(void);
generates an unspecified no-op instruction. Note that not all architectures provide a distinguished NOP instruction.
On those that do, it is unspecified whether this intrinsic generates it or another instruction. It is not guaranteed that
inserting this instruction will increase execution time.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 38 of 81
Non-Confidential
9 DATA-PROCESSING INTRINSICS
The intrinsics in this section are provided for algorithm optimization.
The <arm_acle.h> header should be included before using these intrinsics.
Implementations are not required to introduce precisely the instructions whose names match the intrinsics.
However, implementations should aim to ensure that a computation expressed compactly with intrinsics will
generate a similarly compact sequence of machine code. In general, C‟s “as-if rule” [C99 5.1.2.3] applies,
meaning that the compiled code must behave as if the instruction had been generated.
In general, these intrinsics are aimed at DSP algorithm optimization on M-profile and R-profile. Use on A-profile is
deprecated. However, the miscellaneous intrinsics and CRC32 intrinsics described in 9.2 and 9.7 respectively are
suitable for all profiles.
Sets or resets the Q flag according to the LSB of the value. __set_saturation_occurred(0) might be used
before performing a sequence of operations after which the Q flag is tested. (In general, the Q flag cannot be
assumed to be unset at the start of a function.)
void __ignore_saturation(void);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 39 of 81
Non-Confidential
This intrinsic is a hint and may be ignored. It indicates to the compiler that the value of the Q flag is not live
(needed) at or subsequent to the program point at which the intrinsic occurs. It may allow the compiler to remove
preceding instructions, or to change the instruction sequence in such a way as to result in a different value of the
Q flag. (A specific example is that it may recognize clipping idioms in C code and implement them with an
instruction such as SSAT that may set the Q flag.)
ACLE does not support changing the DN, FZ or AHP bits at runtime.
VFP “short vector” mode (enabled by setting the Stride and Len bits) is deprecated, and is unavailable on later
VFP implementations. ACLE provides no support for this mode.
The 64-bit versions of these intrinsics („ll‟ suffix) are new in ACLE 1.1. For completeness and to aid portability
between LP64 and LLP64 models, ACLE 1.1 also defines intrinsics with „l‟ suffix.
uint32_t __ror(uint32_t x, uint32_t y);
unsigned long __rorl(unsigned long x, uint32_t y);
uint64_t __rorll(uint64_t x, uint32_t y);
rotates the argument x right by y bits. y can take any value. These intrinsics are available on all targets.
unsigned int __clz(uint32_t x);
unsigned int __clzl(unsigned long x);
unsigned int __clzll(uint64_t x);
returns the number of leading zero bits in x. When x is zero it returns the argument width, i.e. 32 or 64. These
intrinsics are available on all targets. On targets without the CLZ instruction it should be implemented as an
instruction sequence or a call to such a sequence. A suitable sequence can be found in [Warren] (fig. 5-7).
Hardware support for these intrinsics is indicated by __ARM_FEATURE_CLZ.
unsigned int __cls(uint32_t x);
unsigned int __clsl(unsigned long x);
unsigned int __clsll(uint64_t x);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 40 of 81
Non-Confidential
returns the number of leading sign bits in x. When x is zero it returns the argument width, i.e. 32 or 64. These
intrinsics are available on all targets. On targets without the CLZ instruction it should be implemented as an
instruction sequence or a call to such a sequence. Fast hardware implementation (using a CLS instruction or a
short code sequence involving the CLZ instruction) is indicated by __ARM_FEATURE_CLZ. New in ACLE 1.1.
uint32_t __rev(uint32_t);
unsigned long __revl(unsigned long);
uint64_t __revll(uint64_t);
reverses the byte order within a word or doubleword. These intrinsics are available on all targets and should be
expanded to an efficient straight-line code sequence on targets without byte reversal instructions.
uint32_t __rev16(uint32_t);
unsigned long __rev16l(unsigned long);
uint64_t __rev16ll(uint64_t);
reverses the byte order within each halfword of a word. For example, 0x12345678 becomes 0x34127856. These
intrinsics are available on all targets and should be expanded to an efficient straight-line code sequence on targets
without byte reversal instructions.
int16_t __revsh(int16_t);
reverses the byte order in a 16-bit value and returns the (sign-extended) result. For example, 0x00000080
becomes 0xFFFF8000. This intrinsic is available on all targets and should be expanded to an efficient straight-line
code sequence on targets without byte reversal instructions.
uint32_t __rbit(uint32_t x);
unsigned long __rbitl(unsigned long x);
uint64_t __rbitll(uint64_t x);
reverses the bits in x. These intrinsics are only available on targets with the RBIT instruction.
9.2.1 Examples
#ifdef __ARM_BIG_ENDIAN
#define htonl(x) (uint32_t)(x)
#define htons(x) (uint16_t)(x)
#else /* little-endian */
#define htonl(x) __rev(x)
#define htons(x) (uint16_t)__revsh(x)
#endif /* endianness */
#define ntohl(x) htonl(x)
#define ntohs(x) htons(x)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 41 of 81
Non-Confidential
9.3 16-bit multiplications
The intrinsics in this section provide direct access to the 16x16 and 16x32 bit multiplies introduced in ARM v5E.
Compilers are also encouraged to exploit these instructions from C code. These intrinsics are available when
__ARM_FEATURE_DSP is defined, and are not available on non-5E targets. These multiplies cannot overflow.
int32_t __smulbb(int32_t, int32_t);
Multiplies two 16-bit signed integers, i.e. the low halfwords of the operands.
int32_t __smulbt(int32_t, int32_t);
Multiplies the low halfword of the first operand and the high halfword of the second operand.
int32_t __smultb(int32_t, int32_t);
Multiplies the high halfword of the first operand and the low halfword of the second operand.
int32_t __smultt(int32_t, int32_t);
Multiplies the 32-bit signed first operand with the low halfword (as a 16-bit signed integer) of the second operand.
Return the top 32 bits of the 48-bit product.
int32_t __smulwt(int32_t, int32_t);
Multiplies the 32-bit signed first operand with the high halfword (as a 16-bit signed integer) of the second operand.
Return the top 32 bits of the 48-bit product.
Saturates a signed integer to the given bit width in the range 1 to 32. For example, the result of saturation to 8-bit
width will be in the range -128 to 127. The Q flag is set if the operation saturates.
uint32_t __usat(int32_t, /*constant*/ unsigned int);
Saturates a signed integer to an unsigned (non-negative) integer of a bit width in the range 0 to 31. For example,
the result of saturation to 8-bit width is in the range 0 to 255, with all negative inputs going to zero. The Q flag is
set if the operation saturates.
Adds two 32-bit signed integers, with saturation. Sets the Q flag if the addition saturates.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 42 of 81
Non-Confidential
int32_t __qsub(int32_t, int32_t);
Subtracts two 32-bit signed integers, with saturation. Sets the Q flag if the subtraction saturates.
int32_t __qdbl(int32_t);
Doubles a signed 32-bit number, with saturation. __qdbl(x) is equal to __qadd(x,x) except that the argument
x is evaluated only once. Sets the Q flag if the addition saturates.
Multiplies two 16-bit signed integers, the low halfwords of the first two operands, and adds to the third operand.
Sets the Q flag if the addition overflows. (Note that the addition is the usual 32-bit modulo addition which wraps on
overflow, not a saturating addition. The multiplication cannot overflow.)
int32_t __smlabt(int32_t, int32_t, int32_t);
Multiplies the low halfword of the first operand and the high halfword of the second operand, and adds to the third
operand, as for __smlabb.
int32_t __smlatb(int32_t, int32_t, int32_t);
Multiplies the high halfword of the first operand and the low halfword of the second operand, and adds to the third
operand, as for __smlabb.
int32_t __smlatt(int32_t, int32_t, int32_t);
Multiplies the high halfwords of the first two operands and adds to the third operand, as for __smlabb.
int32_t __smlawb(int32_t, int32_t, int32_t);
Multiplies the 32-bit signed first operand with the low halfword (as a 16-bit signed integer) of the second operand.
Adds the top 32 bits of the 48-bit product to the third operand. Sets the Q flag if the addition overflows. (See note
for __smlabb.)
int32_t __smlawt(int32_t, int32_t, int32_t);
Multiplies the 32-bit signed first operand with the high halfword (as a 16-bit signed integer) of the second operand
and adds the top 32 bits of the 48-bit result to the third operand as for __smlawb.
9.4.4 Examples
The ACLE DSP intrinsics can be used to define ETSI/ITU-T basic operations [G.191]:
#include <arm_acle.h>
inline int32_t L_add(int32_t x, int32_t y) { return __qadd(x, y); }
inline int32_t L_negate(int32_t x) { return __qsub(0, x); }
inline int32_t L_mult(int16_t x, int16_t y) { return __qdbl(x*y); }
inline int16_t add(int16_t x, int16_t y) { return (int16_t)(__qadd(x<<16, y<<16) >> 16); }
inline int16_t norm_l(int32_t x) { return __clz(x ^ (x<<1)) & 31; }
...
This example assumes the implementation preserves the Q flag on return from an inline function.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 43 of 81
Non-Confidential
9.5 32-bit SIMD intrinsics
9.5.1 Availability
ARM v6 introduced instructions to perform 32-bit SIMD operations (i.e. two 16-bit operations or four 8-bit
operations) on the ARM general-purpose registers. These instructions are not related to the much more versatile
Advanced SIMD (NEON) extension, whose support is described in section 12.
The 32-bit SIMD intrinsics are available on targets featuring ARM v6 and upwards, including the A and R profiles.
In the M profile they are available in the v7E-M architecture. Availability of the 32-bit SIMD intrinsics implies
availability of the saturating intrinsics.
Availability of the SIMD intrinsics is indicated by the __ARM_FEATURE_SIMD32 predefine.
The SIMD intrinsics generally operate on and return 32-bit words consisting of two 16-bit or four 8-bit values.
These are represented as int16x2_t and int8x4_t below for illustration. Some intrinsics also feature scalar
accumulator operands and/or results.
When defining the intrinsics, implementations can define SIMD operands using a 32-bit integral type (such as
„unsigned int‟).
The header <arm_acle.h> defines typedefs int16x2_t, uint16x2_t, int8x4_t and uint8x4_t. These should be
defined as 32-bit integral types of the appropriate sign. There are no intrinsics provided to pack or unpack values
of these types. This can be done with shifting and masking operations.
The explicit saturation operations __ssat and __usat set the Q flag if saturation occurs. Similarly, __ssat16
and __usat16 set the Q flag if saturation occurs in either lane.
Some instructions, such as __smlad, set the Q flag if overflow occurs on an accumulation, even though the
accumulation is not a saturating operation (i.e. does not clip its result to the limits of the type).
In the following descriptions of intrinsics, if the description does not mention whether the intrinsic affects the Q
flag, the intrinsic does not affect it.
Saturates two 16-bit signed values to a width in the range 1 to 16. The Q flag is set if either operation saturates.
int16x2_t __usat16(int16x2_t, /*constant */ unsigned int);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 44 of 81
Non-Confidential
Saturates two 16-bit signed values to a bit width in the range 0 to 15. The input values are signed and the output
values are non-negative, with all negative inputs going to zero. The Q flag is set if either operation saturates.
Two values (at bit positions 0..7 and 16..23) are extracted from the second operand, sign-extended to 16 bits, and
added to the first operand.
int16x2_t __sxtb16(int8x4_t);
Two values (at bit positions 0..7 and 16..23) are extracted from the first operand, sign-extended to 16 bits, and
returned as the result.
uint16x2_t __uxtab16(uint16x2_t, uint8x4_t);
Two values (at bit positions 0..7 and 16..23) are extracted from the second operand, zero-extended to 16 bits, and
added to the first operand.
uint16x2_t __uxtb16(uint8x4_t);
Two values (at bit positions 0..7 and 16..23) are extracted from the first operand, zero-extended to 16 bits, and
returned as the result.
Selects each byte of the result from either the first operand or the second operand, according to the values of the
GE bits. For each result byte, if the corresponding GE bit is set then the byte from the first operand is used,
otherwise the byte from the second operand is used. Because of the way that int16x2_t operations set two
(duplicate) GE bits per value, the __sel intrinsic works equally well on (u)int16x2_t and (u)int8x4_t data.
4x8-bit signed addition. The GE bits are set according to the results.
int8x4_t __shadd8(int8x4_t, int8x4_t);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 45 of 81
Non-Confidential
4x8-bit signed subtraction, halving the results.
int8x4_t __ssub8(int8x4_t, int8x4_t);
4x8-bit signed subtraction. The GE bits are set according to the results.
uint8x4_t __uadd8(uint8x4_t, uint8x4_t);
4x8-bit unsigned addition. The GE bits are set according to the results.
uint8x4_t __uhadd8(uint8x4_t, uint8x4_t);
4x8-bit unsigned subtraction. The GE bits are set according to the results.
Performs 4x8-bit unsigned subtraction, and adds the absolute values of the differences together, returning the
result as a single unsigned integer.
uint32_t __usada8(uint8x4_t, uint8x4_t, uint32_t);
Performs 4x8-bit unsigned subtraction, adds the absolute values of the differences together, and adds the result to
the third operand.
Exchanges halfwords of second operand, adds high halfwords and subtracts low halfwords, saturating in each
case.
int16x2_t __qsax(int16x2_t, int16x2_t);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 46 of 81
Non-Confidential
Exchanges halfwords of second operand, subtracts high halfwords and adds low halfwords, saturating in each
case.
int16x2_t __qsub16(int16x2_t, int16x2_t);
2x16-bit signed addition. The GE bits are set according to the results.
int16x2_t __sasx(int16x2_t, int16x2_t);
Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords. The GE bits are
set according to the results.
int16x2_t __shadd16(int16x2_t, int16x2_t);
Exchanges halfwords of the second operand, adds high halfwords and subtract low halfwords, halving the results.
int16x2_t __shsax(int16x2_t, int16x2_t);
Exchanges halfwords of the second operand, subtracts high halfwords and add low halfwords, halving the results.
int16x2_t __shsub16(int16x2_t, int16x2_t);
Exchanges halfwords of the second operand, subtracts high halfwords and adds low halfwords. The GE bits are
set according to the results.
int16x2_t __ssub16(int16x2_t, int16x2_t);
2x16-bit signed subtraction. The GE bits are set according to the results.
uint16x2_t __uadd16(uint16x2_t, uint16x2_t);
2x16-bit unsigned addition. The GE bits are set according to the results.
uint16x2_t __uasx(uint16x2_t, uint16x2_t);
Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords. The GE bits are
set according to the results of unsigned addition.
uint16x2_t __uhadd16(uint16x2_t, uint16x2_t);
Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords, halving the
results.
uint16x2_t __uhsax(uint16x2_t, uint16x2_t);
Exchanges halfwords of the second operand, subtracts high halfwords and adds low halfwords, halving the
results.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 47 of 81
Non-Confidential
uint16x2_t __uhsub16(uint16x2_t, uint16x2_t);
Exchanges halfwords of the second operand, and performs saturating unsigned addition on the high halfwords
and saturating unsigned subtraction on the low halfwords.
uint16x2_t __uqsax(uint16x2_t, uint16x2_t);
Exchanges halfwords of the second operand, and performs saturating unsigned subtraction on the high halfwords
and saturating unsigned addition on the low halfwords.
uint16x2_t __uqsub16(uint16x2_t, uint16x2_t);
Exchanges the halfwords of the second operand, subtracts the high halfwords and adds the low halfwords. Sets
the GE bits according to the results of unsigned addition.
uint16x2_t __usub16(uint16x2_t, uint16x2_t);
2x16-bit unsigned subtraction. The GE bits are set according to the results.
Performs 2x16-bit multiplication and adds both results to the third operand. Sets the Q flag if the addition
overflows. (Overflow cannot occur during the multiplications.)
int32_t __smladx(int16x2_t, int16x2_t, int32_t);
Exchanges the halfwords of the second operand, performs 2x16-bit multiplication, and adds both results to the
third operand. Sets the Q flag if the addition overflows. (Overflow cannot occur during the multiplications.)
int64_t __smlald(int16x2_t, int16x2_t, int64_t);
Performs 2x16-bit multiplication and adds both results to the 64-bit third operand. Overflow in the addition is not
detected.
int64_t __smlaldx(int16x2_t, int16x2_t, int64_t);
Exchanges the halfwords of the second operand, performs 2x16-bit multiplication and adds both results to the 64-
bit third operand. Overflow in the addition is not detected.
int32_t __smlsd(int16x2_t, int16x2_t, int32_t);
Performs two 16-bit signed multiplications. Takes the difference of the products, subtracting the high-halfword
product from the low-halfword product, and adds the difference to the third operand. Sets the Q flag if the addition
overflows. (Overflow cannot occur during the multiplications or the subtraction.)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 48 of 81
Non-Confidential
int32_t __smlsdx(int16x2_t, int16x2_t, int32_t);
Performs two 16-bit signed multiplications. The product of the high halfword of the first operand and the low
halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high
halfword of the second operand, and the difference is added to the third operand. Sets the Q flag if the addition
overflows. (Overflow cannot occur during the multiplications or the subtraction.)
int64_t __smlsld(int16x2_t, int16x2_t, int64_t);
Perform two 16-bit signed multiplications. Take the difference of the products, subtracting the high-halfword
product from the low-halfword product, and add the difference to the third operand. Overflow in the 64-bit addition
is not detected. (Overflow cannot occur during the multiplications or the subtraction.)
int64_t __smlsldx(int16x2_t, int16x2_t, int64_t);
Perform two 16-bit signed multiplications. The product of the high halfword of the first operand and the low
halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high
halfword of the second operand, and the difference is added to the third operand. Overflow in the 64-bit addition is
not detected. (Overflow cannot occur during the multiplications or the subtraction.)
int32_t __smuad(int16x2_t, int16x2_t);
Perform 2x16-bit signed multiplications, adding the products together. Set the Q flag if the addition overflows.
int32_t __smuadx(int16x2_t, int16x2_t);
Exchange the halfwords of the second operand (or equivalently, the first operand), perform 2x16-bit signed
multiplications, and add the products together. Set the Q flag if the addition overflows.
int32_t __smusd(int16x2_t, int16x2_t);
Perform two 16-bit signed multiplications. Take the difference of the products, subtracting the high-halfword
product from the low-halfword product.
int32_t __smusdx(int16x2_t, int16x2_t);
Perform two 16-bit signed multiplications. The product of the high halfword of the first operand and the low
halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high
halfword of the second operand.
9.5.11 Examples
Taking the elementwise maximum of two SIMD values each of which consists of four 8-bit signed numbers:
int8x4_t max8x4(int8x4_t x, int8x4_t y) { __ssub8(x, y); return __sel(x, y); }
As described in section 9.5.6, where SIMD values consist of two 16-bit unsigned numbers:
int16x2_t max16x2(int16x2_t x, int16x2_t y) { __usub16(x, y); return __sel(x, y); }
Note that even though the result of the subtraction is not used, the compiler must still generate the instruction,
because of its side-effect on the GE bits which are tested by the __sel() intrinsic.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 49 of 81
Non-Confidential
The __sqrt intrinsics compute the square root of their operand. They have no effect on errno. Negative values
will produce a default NaN result and possible floating-point exception as described in [ARM ARM A2.7.7].
double __fma(double x, double y, double z);
float __fmaf(float x, float y, float z);
The __fma intrinsics compute (x*y)+z, without intermediate rounding. These intrinsics are available only if
__ARM_FEATURE_FMA is defined. On a Standard C implementation it should not normally be necessary to use
these intrinsics, as the fma functions defined in [C99 7.12.13] should expand directly to the instructions if
available.
float __rintnf (float);
double __rintn (double);
The __rintn intrinsics perform a floating point round to integral, to nearest with ties to even. The __rintn intrinsic
is available when __ARM_FEATURE_DIRECTED_ROUNDING is defined to 1. For other rounding modes like „to nearest
with ties to away‟ it is strongly recommended that C99 standard functions be used. To achieve a floating point
convert to integer, rounding to „nearest with ties to even‟ operation, use these rounding functions with a type-cast
to integral values, eg.
(int) __rintnf (a);
Will map to a floating point convert to signed integer, rounding to nearest with ties to even operation.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 50 of 81
Non-Confidential
10 SYSTEM REGISTER ACCESS
The values of the register specifiers will be as described in [ARM ARM] or the Technical Reference Manual (TRM)
for the specific processor.
So to read MIDR:
unsigned int midr = __arm_rsr("cp15:0:c0:c0:0");
ACLE does not specify predefined strings for the system coprocessor register names documented in the ARM
ARM (e.g. “MIDR”).
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 51 of 81
Non-Confidential
10.2.2 AArch32 32-bit system register
When specifying a 32-bit system register to __arm_rsr, __arm_rsrp, __arm_wsr, or __arm_wsrp, one of:
the values accepted in the spec_reg field of the MRS instruction [ARMARM-AR B6.1.5], e.g. “CPSR”
the values accepted in the spec_reg field of the MSR (immediate) instruction [ARMARM B6.1.6]
the values accepted in the spec_reg field of the VMRS instruction [ARMARM B6.1.14], e.g. “FPSID”
the values accepted in the spec_reg field of the VMSR instruction [ARMARM B6.1.15], e.g. “FPSCR”
the values accepted in the spec_reg field of the MSR and MRS instructions with virtualization extensions
[ARM ARM B1.7], e.g. “ELR_Hyp”
the values specified in „Special register encodings used in ARMv7-M system instructions.‟ [ARMv7M
B5.1.1], e.g. “PRIMASK”
"o0:op1:CRn:CRm:op2"
where:
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 52 of 81
Non-Confidential
when writing to a read-only register, or a register that is undefined on the architecture being compiled for
when reading or writing to a register which the implementation models by some other means (this covers
– but is not limited to – reading/writing cp10 and cp11 registers when VFP is enabled, and reading/writing
the CPSR)
when reading or writing a register using one of these intrinsics with an inappropriate type for the value
being read or written to
when writing to a co-processor register that carries out a "System operation"
when using a register specifier which doesn't apply to the targetted architecture.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 53 of 81
Non-Confidential
11 INSTRUCTION GENERATION
Architecture „8‟ means ARMv8-A AArch32 and AArch64, „8-32‟ means ARMv8-AArch32 only.
in the sequence of ARM architectures { 5, 5TE, 6, 6T2, 7 } each architecture includes its predecessor
instruction set
in the sequence of Thumb-only architectures { 6-M, 7-M, 7E-M } each architecture includes its
predecessor instruction set
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 54 of 81
Non-Confidential
Instruction Flgs Arch. Intrinsic or C code SBFX 8,6T2, 7M C
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 55 of 81
Non-Confidential
SMMUL 6, 7EM C UHADD16 6, 7EM __uhadd16
UDIV 7M+ C
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 56 of 81
Non-Confidential
12 NEON INTRINSICS
Not all types can be used in all operations. Generally, the operations available on a type correspond to the
operations available on the corresponding scalar type.
ACLE does not define whether int64x1_t is the same type as int64_t, or whether uint64x1_t is the same
type as uint64_t, or whether poly64x1_t is the same as poly64_t e.g. for C++ overloading purposes.
float16 types are only available when the __fp16 type is defined, i.e. when supported by the hardware. As with
scalar (VFP) operations, 16-bit floating-point types cannot be used in arithmetic operations. They can be used in
conversions to and from 32-bit floating-point types, in loads and stores, and in reinterpret operations.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 57 of 81
Non-Confidential
<type>_t the corresponding array type is <type>x<length>_t. Concretely, an array type is a structure containing
a single array element called val.
Note that this array of two 64-bit vector types is distinct from the 128-bit vector type int16x8_t.
poly8_t and poly16_t are defined as unsigned integer types. It is unspecified whether these are the same type
as uint8_t and uint16_t for overloading and mangling purposes.
is not portable. Use the vreinterpret intrinsics to convert from one vector type to another without changing
representation, and use the vcvt intrinsics to convert between integer and floating types; for example:
int32x4_t x;
uint32x4_t y = vreinterpretq_u32_s32(x);
float32x4_t z = vcvt_f32_s32(x);
is not portable. Use the vcreate or vdup intrinsics to construct values from scalars.
In C++, ACLE does not define whether NEON data types are POD types or whether they can be inherited from.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 58 of 81
Non-Confidential
include “arm_neon.h”
...
const int temp[2] = {0, 1};
uint32x2_t x = vld1_s32 (temp);
uint32_t y = vget_lane_s32 (x, 0);
reductions
T is a vector type such as int16x4_t for a vector of four lanes of signed 16-bit integers
Q is the string “q” if T is a 128-bit vector type, the empty string otherwise. This is used in forming the
names of intrinsics
C is the string „b‟, „h‟, „s‟ or „d‟ if T is an Advanced SIMD scalar type of width 8-bit, 16-bit, 32-
bit or 64-bit.ST is the short-form name of the lane type of a vector type T, such as s16 for a signed 16-bit
integer
UST is the unsigned short-form name of the lane type of a vector type T, such as u16 for an unsigned 16-
bit integer.
DT, for a 64-bit vector type T, is the 128-bit vector type with lanes twice as wide as T (where this exists).
Where T is 128-bit vector type(„_high_‟ widening intrinsics), DT is a 128-bit vector type where the lane is
twice as wide as lane type in T and half the number of elements in T. It basically represents the widened
top half of T.
HNT, for 128-bit vector type T, is the 128-bit vector type with lanes half as wide but twice in number. This
is used in narrowing operations. UHNT is the same as HNT, but unsigned type.
HT, for a 128-bit vector type T, is the 64-bit vector type with lanes half as wide as T (where this exists).
This is used in narrowing operations. There are no types with 4-bit lanes. UT is the vector type of the
same size and lane size as T but whose lane type is an unsigned integer. This is used as the result of
comparison operations and signed-to-unsigned saturation operations, and as an operand in selection
operations
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 59 of 81
Non-Confidential
IT is the vector type of the same size and lane size as T but whose lane type is a signed integer.
UHT, for a 128-bit vector type T, is the 64-bit vector type with lanes half as wide as T and of unsigned
type. This is used as the result of signed-to-unsigned narrowing saturation operations
T64 is the 64-bit vector type with the same lane type as T
T2, for a 64-bit vector type T, is the 128-bit vector type with the same lane type as T
FT, for a vector type T of 32-bit integral lane type, is the type with lanes of 32-bit floating type
TxN for N from 2 to 4 is an array of T, so where T is an int8x8_t, Tx3 is int8x8x3_t; this is used in
intrinsics which return multiple results, or where input operands consist of multiple vectors. Where N is 1,
the array type is simply T.
directly supplied as a scalar operand. These intrinsics are identified with the string “_n” in their name.
Depending on the intrinsic, this operand may be a compile-time integral constant (e.g. a shift count), or it
may be a general expression (usually of the same type as the vector lanes).
from one lane of an input vector. These intrinsics are identified with the string “_lane” in their name. The
lane number is the last argument and must be a compile-time constant and within range. The input vector
from which the scalar operand is taken is the preceding operand and is always a 64-bit vector.
where
q indicates a saturating operation(with the exception of vqtb[l][x] in AArch64 operations where the q
indicates 128-bit index and result operands)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 60 of 81
Non-Confidential
name is the descriptive name of the basic operation
x indicates a Advanced SIMD scalar operation in AArch64. It can be one of „b‟, „h‟, „s‟, „d‟.
In AArch64, „_high‟ is used for widening and narrowing operations involving 128-bit operands. For
widening 128-bit operands, „high‟ refers to the top 64-bits of the source operand(s) and for narrowing,
it refers to the top 64-bits of the destination operand.
_laneq indicates a scalar operand taken from the lane of an input vector of 128-bit width.
sint64 1 int64
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 61 of 81
Non-Confidential
class count Types
int/64/poly 10 int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16
arith/64 9 int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32
f16 1 float16
f32 1 float32
f32,u32 1 float32,uint32
f64 1 float64
any 12* int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16, poly64, float32,
(float16)
any/f64 13* int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16, poly64, float32, float64,
(float16)
* Note: float16 is only available if supported in target hardware.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 62 of 81
Non-Confidential
T vcreate_ST(uint64_t a);
creates a vector by reinterpreting a 64-bit value. T can be any 64-bit vector type. ARMv8 adds 2 more for
poly64_t and float64_t.
T vdupQ_n_ST(ET value);
T vmovQ_n_ST(ET value);
creates a vector by duplicating a scalar value across all lanes. T can be any vector type for which ET exists. There
are 22 intrinsics. ARMv8 adds 4 more intrinsics for 64-bit and 128-bit vectors of float64_t and poly64_t.
T vdupQ_lane_ST(T64 vec, const int lane);
creates a vector by duplicating one lane of a source vector. T can be any vector type. T64 is the 64-bit vector type
corresponding to T. The scalar value is obtained from a designated lane of the input vector. There are 22
intrinsics. ARMv8 adds 4 more intrinsics for 64 and 128-bit vectors with float64_t and poly64_t elements.
AArch64 supports 128-bit lane-vectors. If the target supports float16_t, this adds 2 more intrinsics.
T vdupQ_laneq_ST(T2 vec, const int lane);
creates a vector by duplicating one lane of a source vector. T can be any vector type. The scalar value is
obtained from a designated lane of the input vector. The lane type of vector type T can be 8-bit, 16-bit, 32-bit, 64-
bit integers, 8-bit, 16-bit, 64-bit polynomial, float32_t and float64_t. There are 26 intrinsics. If the target
supports float16_t, this adds 2 more intrinsics. These are only available for AArch64.
T2 vcombine_ST(T low, T high);
creates a 128-bit vector by combining two 64-bit vectors. T can be any 64-bit vector type. There are 12 intrinsics.
ARMv8 adds 2 more for poly64_t and float64_t.
T vget_high_ST(T2 a);
T vget_low_ST(T2 a);
gets the high, or low, half of a 128-bit vector. There are 24 intrinsics. ARMv8 adds 4 more intrinsics for 128-bit
vectors with float64_t and poly64_t lane type.
T vsetQ_lane_ST(ET value, T vec, const int lane);
sets the specified lane of an input vector to be a new value. There are 24 intrinsics. ARMv8 adds 4 intrinsics for
64-bit and 128-bit vectors for float64_t and poly64_t lane type.
ET vgetQ_lane_ST(T vec, const int lane);
gets the value from the specified lane of an input vector. There are 24 intrinsics. ARMv8 adds 4 intrinsics for 64-bit
and 128-bit vectors for float64_t and poly64_t lane type.
These intrinsics are part of the AdvSIMD scalar intrinsics and are available only on AArch64.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 63 of 81
Non-Confidential
T‟ vreinterpretQ_ST‟_ST(T a);
reinterprets a vector of one type T as a vector of another type T‟, without any operation taking place. The lane
sizes may be the same or different. For example, vreinterpretq_s8_f32() reinterprets a vector of four 32-bit
floating-point elements as a vector of sixteen 8-bit signed integer elements.
ARMv8 adds new intrinsics to reinterpret 64-bit and 128-bit vectors of float64_t, poly64_t and poly128_t
values to other types and other types to float64_t, poly64_t and poly128_t. This adds 132 new
intrinsics. Reinterprets for poly128_t are of the form
poly128_t vreinterpretq_p128_ST (...);
T vreinterpretq_ST_p128 (poly128_t);
12.3.6.1 Examples
The following “no-op” expressions demonstrate some relationships between these intrinsics:
vcombine_ST(vget_low_ST(a), vget_high_ST(a))
vset_lane_ST(vget_lane_ST(a, N), a, N)
vreinterpret_ST_u8(vreinterpret_u8_ST(a))
for N from 2 to 4, loads N vectors from an array, with de-interleaving. The array consists of a sequence of sets of
N values. The first element of the array is placed in the first lane of the first vector, the second element in the first
lane of the second vector, and so on. For example, vld3_s32 will load the six 32-bit elements { A, B, C, D, E, F }
into the three 64-bit vectors { DA, EB, FC }. Not available for 64-bit lanes when T is a 128-bit vector type in
AArch32. AArch64 adds support for 64-bit lanes when T is a 128-bit vector type.
TxN vldNQ_dup_ST(ET const *ptr);
for N from 2 to 4, loads a single N-element structure to all lanes of N vectors. N values are loaded, then duplicated
across all lanes. For example, vld3_dup_s16 will load the three consecutive 16-bit elements { A, B, C } and
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 64 of 81
Non-Confidential
produce the three 64-bit vectors { AAAA, BBBB, CCCC }, while the 128-bit vector form vld3q_dup_s16 will
produce the three 128-bit vectors { AAAAAAAA, BBBBBBBB, CCCCCCCC }. Not available for 64-bit lanes when T
is a 128-bit vector type in AArch32. AArch64 adds support for 64-bit lanes when T is a 128-bit vector type.
void vstNQ_ST(ET *ptr, TxN val);
for N from 2 to 4, stores N vectors to an array, with interleaving. Every element of each vector is stored. Not
available for 64-bit lanes when T is a 128-bit vector type in AArch32. AArch64 adds support for 64-bit lanes when
T is a 128-bit vector type.
TxN vldNQ_lane_ST(ET const *ptr, TxN src, const int lane);
for N from 2 to 4, loads a single N-element structure to the designated lane of N vectors. Not available for 64-bit
lanes; not available for 8-bit lanes and 128-bit vectors in AArch32. AArch64 adds support for 8-bit and 64-bit lanes
when T is a 128-bit vector.
void vstNQ_lane_ST(ET *ptr, TxN val, const int lane);
for N from 2 to 4, stores a single N-element structure from the designated lane of N vectors. Not available for 64-
bit lanes; not available for 8-bit lanes and 128-bit vectors in AArch32. AArch64 adds support for 8-bit and 64-bit
lanes when T is a 128-bit vector.
TxN vld1Q_ST_xN(ET const *ptr);
for N from 2 to 4, loads N vectors from an array without de-interleaving. The first element (at the lowest address)
of the array is placed in the first lane of the first vector, the second element in the second lane of the first vector
and so on. For example, vld1_s32_x4 will load the eight 32-bit array elements {A, B, C, D, E, F, G, H} into the
four 64-bit vectors {BA, DC, FE, HG},
for N from 2 to 4, stores N vectors from a register to an array without de-interleaving. The first element (at LSB) of
the register is placed in the lowest address of the array, the second lane of the first vector in the second element
of the array and so on. For example, vst1_s32_x4 will store four 64-bit vectors {BA, DC, FE, HG} into the eight 32-
bit array elements {A, B, C, D, E, F, G, H}.
stores a poly128_t value. This is available only on ARMv8 AArch32 and AArch64.
12.3.7.1 Examples
This is an example of iterating through an array, with fixup code for any elements left over:
void scale_values(float *a, int n, float scale) {
int i;
for (i = 0; i < (n & ~3); i+=4) {
vst1q_f32(&a[i], vmulq_n_f32(vld1q_f32(&a[i]), scale));
}
if (i & 2) {
vst1_f32(&a[i], vmul_n_f32(vld1_f32(&a[i]), scale));
i += 2;
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 65 of 81
Non-Confidential
}
if (i & 1) {
a[i] *= scale;
}
}
If the array is known to contain an integral number of whole vectors, fixup code is not necessary.
The fixup code could also be written using Advanced SIMD scalar intrinsics. For example,
void add_values(int64_t *a, int64_t *b, int n) {
int i;
for (i = 0; i < (n & ~3); i+=4) {
vst1q_s64(&a[i], vaddq_s64(vld1q_s64(&a[i]), vld1q_s64(&b[i])));
}
if (i & 2) {
vst1q_s64(&a[i], vaddq_s64(vld1q_s64(&a[i]), vld1q_s64(&b[i])));
i += 2;
}
if (i & 1) {
a[i] = vaddd_s64(a[i], b[i]);
}
}
void qadd_values(int32_t *a, int32_t *b, int n) {
int i;
for (i = 0; i < (n & ~3); i+=4) {
vst1q_s32(&a[i], vqaddq_s32(vld1q_s32(&a[i]), vld1q_s32(&b[i])));
}
if (i & 2) {
vst1_s32(&a[i], vqadd_s32(vld1_s32(&a[i]), vld1_s32(&b[i])));
i += 2;
}
if (i & 1) {
a[i] = vqadds_s32(a[i], b[i]);
}
}
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 66 of 81
Non-Confidential
Alignment hints are not supported by AArch64.
Saturation clips the result of an operation to the output range when it is narrowed to a smaller result type or shifted
left. Signed-to-unsigned saturation clips a signed result to an unsigned range, so that negative results go to zero.
Rounding is used when a value is shifted right, or when the high part of a result is taken. It effectively adds a value
equivalent to 0.5 bits to the value before truncating it, so values are rounded towards positive infinity.
Variable shift operations are bidirectional, i.e. a shift count is encoded as a signed integer. A shift operation may
be both saturating (when the value is shifted left, or narrowed) and rounding (when the value is shifted right).
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 67 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 68 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 69 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 70 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 71 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 72 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 73 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 74 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)
Are available for float32_t on AArch32 and both float32_t and float64_t on AArch64, ARMv8 onwards.
These intrinsics are available when __ARM_FEATURE_NUMERIC_MAXMIN is defined.
T vcvtRQC_ST_f32(FT a)
T vcvtRQC_ST_f64(FT a)
Are available for int32_t and uint32_t on AArch32 and int32_t , uint32_t, int64_t, uint64_t on
AArch64
T vrndRXQ_ST(T a)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 75 of 81
Non-Confidential
Are available for float32_t on AArch32 and float32_t and float64_t on AArch64.
float32_t vrndns_f32(float32_t a)
is available on AArch32 and AArch64 for round with „Exact Ties to Even‟.
These intrinsics are available when __ARM_FEATURE_DIRECTED_ROUNDING is defined.
poly64_t intrinsics for AArch64 also apply to AArch32 in ARMv8 except where explicitly mentioned otherwise.
performs a pairwise add operation. For example, given the two input vectors ABCD and EFGH, the result is
{A+B,C+D,E+F,G+H}. The lane type of T can be any 8-bit, 16-bit,32-bit or 64-bit integer type or float32_t,
float64_t. AArch32 supports only 64-bit vectors while AArch64 supports both 64-bit and 128-bit vectors. There
is no support for 64-bit lane size when vector is 64-bit long.
RT vpaddlQ_ST(T a);
RT vpadalQ_ST(RT a, T b);
adds elements pairwise in the input vector, with a long result. The input elements can be 8-bit, 16-bit or 32-bit
integers. The result vector type RT is the same size as the input vector, with half as many lanes, each of twice the
size. For example, given an int16x4_t input vector {A,B,C,D}, the output vector is the int32x2_t vector
{A+B,C+D}. The vpadal() form accumulates the result with another vector.
T vpmaxQ_ST(T a, T b);
T vpminQ_ST(T a, T b);
performs pairwise maximum or minimum on a pair of input vectors. AArch64 supports input vectors of both 64-bit
and 128-bit, while AArch32 supports only 64-bit input vectors. In AArch32, the input elements can be 8-bit, 16-bit
or 32-bit integers, or float32_t. AArch64 adds support for float64_t. Given inputs {A,B,C,D} and {E,F,G,H},
the output vector (for vpmax) is {max(A,B),max(C,D),max(E,F),max(G,H)}. AArch64 also adds support for two
more variants – vpmaxnm and vpminnm that only support 64 and 128-bit vectors of float32_t and
float64_t. There is no support for 64-bit lane size when vector is 64-bit long.
AArch64 offers new vector reduce intrinsics that operate on vectors and return a scalar quantity.
ET vaddvQ_ST (T v);
Performs addition across lanes of the input vector „v‟ and returns a scalar value. 64-bit and 128-bit vectors are
supported. The lane type of T can be any of 8-bit, 16-bit, 32-bit, 64-bit integers, float32_t or float64_t . 64-
integers and float64_t are supported for 128-bit vectors only.
EDT vaddlvQ_ST (T v);
Performs widened addition across lanes of the input vector „v‟ and returns a scalar value that is twice as wide as
the lane type of T. Both 64-bit and 128-bit vectors are supported. The input elements can be 8-bit, 16-bit and 32-
bit integers.
ET vmaxvQ_ST (T v);
ET vminvQ_ST (T v);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 76 of 81
Non-Confidential
Perform maximum and minimum across the lanes of input vector „v‟. Both 64-bit and 128-bit vectors are
supported. The lanes of T can be any of 8-bit, 16-bit or 32-bit integers. They also support 64-bit and 128-bit
vectors of float32_t and 128-bit vectors of float64_t. 64-bit vectors of float64_t are not supported.
ET vmaxnmvQ_ST (T v);
ET vminnmvQ_ST (T v);
Perform numeric maximum and minimum across the lanes of input vector „v‟. They support 64-bit and 128-bit
vectors of float32_t and 128-bit vectors of float64_t. 64-bit vectors of float64_t are not supported.
extracts one vector from a pair of concatenated input vectors, starting at a given lane position. Given inputs ABCD
and EFGH, and a lane position of 3, the concatenation EFGHABCD is formed and a vector is extracted at lane 3
to produce a result of FGHA. ARMv8 adds support for float64_t and poly64_t.
T vrevBQ_ST(T vec);
reverses the order of lanes within B-bit sets. For example, vrev32_s8 reverses the order of 8-bit lanes within 32-
bit groups of four lanes in an int8x8_t vector, so that the input ABCDEFGH would result in DCBAHGFE. (At the
machine level, this can also be understood as a SIMD operation on 32-bit elements, reversing the byte order in
each, but to use the vrev intrinsic with int32x2_t vectors it would be necessary to reinterpret the input and
output vector types.) B must be greater than the lane size: i.e. for 8-bit lanes B must be 16, 32 or 64; for 16-bit
lanes B must be 32 or 64; and for 32-bit lanes B must be 64.
Tx2 vzipQ_ST(T a, T b);
interleaves elements pairwise from two vectors, returning a pair (i.e. a 2-element array) of vectors.The inputs
ABCD and EFGH result in AEBF and CGDH. Not available for 64-bit lanes.
Tx2 vuzpQ_ST(T a, T b);
de-interleaves elements from two vectors. The inputs ABCD and EFGH result in ACEG and BDFH. Not available
for 64-bit lanes.
Tx2 vtrnQ_ST(T a, T b);
transposes elements from two vectors, treating them as 2x2 matrices. Not available for 64-bit lanes.
In AArch64, ZIP, UZP and TRN are split into two instructions. They are available for ARMv7 and ARMv8.
The following additional intrinsics are provided to support these operations which are available only on AArch64.
T vzip1Q_ST(T a, T b);
interleaves the elements from lower half of a and b into the result and returns the result which is of the same size
as the input vectors.
T vzip2Q_ST(T a, T b);
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 77 of 81
Non-Confidential
de-interleaves the elements from lower half of a and b into the result and returns the result which is of the same
size as the input vectors.
T vuzp2Q_ST(T a, T b);
Transposes the elements of lower half of a and b and stores into the result vector.
T vtrn2Q_ST(T a, T b);Is similar to vtrn1, but for the upper half of a and b. Available for 64-bit and 128-
bit vectors of lanes 8, 16, 32 and 64-bits except for 64-bit vectors when lane size is 64 bits.
performs 8-bit table lookup. T must be a 64-bit vector type with 8-bit lanes, i.e. int8x8_t, uint8x8_t or
poly8x8_t. The table is supplied in a as an array of 1 to 4 vectors, treated as one large vector consisting of
(respectively) 8 to 32 table entries. The output is formed by using the vector b as a vector of indexes into the table,
and mapping each index by its table entry, or zero if the index is out of range. This operation can be thought of as
either
a lane-by-lane table-lookup operation on b, where the index value in each lane is replaced by the
corresponding table value
performs an extended table lookup operation. In contrast to vtbl, for vtbx, if the index is out of range, the
resulting lane value is taken from the corresponding lane in the vector a, rather than zero.
In AArch64, the table operations are similar in operation to AArch32, but the table size is always 128-bit and the
index vector can either be 64-bit or 128-bit.
T vqtblNQ_ST (T2xN t, UT idx);
Is similar in operation to ARMv7‟s vtbl, but T2 is always 128-bit. T can be 64-bit or 128-bit i.e. int8x8_t,
uint8x8_t, poly8x8_t or int8x16_t, uint8x16_t or poly8x16_t.
T vqtbxNQ_ST (T a, T2xN t, UT idx);
Is similar in operation to ARMv7‟s vtbx, but T2 is always 128-bit. T can be 64-bit or 128-bit i.e. int8x8_t,
uint8x8_t, poly8x8_t or int8x16_t, uint8x16_t or poly8x16_t.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 78 of 81
Non-Confidential
uint8x16_t vaesdq_u8 (uint8x16_t data, uint8x16_t key);
Performs widening polynomial multiplication on double-words low part. Available on ARMv8 AArch32 and
AArch64.
poly128_t vmull_high_p64 (poly64x2_t, poly64x2_t);
Performs widening polynomial multiplication on double-words high part. Available on ARMv8 AArch32 and
AArch64.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 79 of 81
Non-Confidential
13 FUTURE DIRECTIONS
to define small (inline) functions defined in terms of expressions involving intrinsics, which provide
abstractions or emulate other intrinsic families; it is desirable for such functions to have the same well-
defined effects on the Q/GE bits as the corresponding intrinsics
This would indicate that the result registers should be used as if the type had been passed as the first argument.
The implementation should not complain if the attribute is applied inappropriately (i.e. where insufficient registers
are available) – it might be a template instance.
using additional argument registers, e.g. passing an argument in R5, R7, R12 etc.
using additional result registers, e.g. R0 and R1 for a combined divide-and-remainder routine (note that
some implementations may be able to support this by means of a “value in registers” structure return)
When calling the function, arguments and results would be marshalled according to the AAPCS, the only
difference being that the call would be invoked as a trap instruction rather than a branch-and-link.
One issue is that some calls may have non-standard calling conventions. (For example, ARM Linux system calls
expect the code number to be passed in R7.)
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 80 of 81
Non-Confidential
Another issue is that the code may vary between ARM and Thumb state. This issue could be addressed by
allowing two numeric parameters in the attribute.
IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 81 of 81
Non-Confidential