KEMBAR78
ARM C Language Extensions - Al Grant | PDF | C (Programming Language) | 64 Bit Computing
0% found this document useful (0 votes)
118 views81 pages

ARM C Language Extensions - Al Grant

The ARM C Language Extensions document (Release 2.0) outlines specifications for C/C++ programmers to utilize the ARM architecture while maintaining source code portability. It includes details on data types, intrinsic functions, and various features of the ARM architecture. The document is non-confidential and provides guidelines for reporting defects and accessing the latest releases.

Uploaded by

armen
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
118 views81 pages

ARM C Language Extensions - Al Grant

The ARM C Language Extensions document (Release 2.0) outlines specifications for C/C++ programmers to utilize the ARM architecture while maintaining source code portability. It includes details on data types, intrinsic functions, and various features of the ARM architecture. The document is non-confidential and provides guidelines for reporting defects and accessing the latest releases.

Uploaded by

armen
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 81

ARM® C Language Extensions

Release 2.0

Document number: IHI 0053C


Date of Issue: 09/05/2014

Abstract
This document specifies the ARM C Language Extensions to enable C/C++ programmers to exploit the ARM
architecture with minimal restrictions on source code portability.

Keywords
ACLE, ABI, C, C++, compiler, armcc, gcc, intrinsic, macro, attribute, NEON, SIMD, atomic

How to find the latest release of this specification or report a defect in it


Please check the ARM Information Center (http://infocenter.arm.com/) for a later release if your copy is more than
one year old. This document may be found under “Developer Guides and Articles”, “Software Development”.
Please report defects in this specification to arm dot acle at arm dot com.

Confidentiality status
This document is Non-Confidential.

Proprietary Notice
This document is protected by copyright and other related rights and the practice or implementation of the
information contained in this document may be protected by one or more patents or pending patent applications.
No part of this document may be reproduced in any form by any means without the express prior written
permission of ARM. No license, express or implied, by estoppel or otherwise to any intellectual property
rights is granted by this document unless specifically stated.

Your access to the information in this document is conditional upon your acceptance that you will not use or permit
others to use the information for the purposes of determining whether implementations infringe any third party
patents.

THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, ARM makes no representation
with respect to, and has undertaken no analysis to identify or understand the scope and content of, third party
patents, copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 1 of 81
Non-Confidential
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY,
ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.

This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication
or disclosure of this document complies fully with any relevant export laws and regulations to assure that this
document or any portion thereof is not exported, directly or indirectly, in violation of such export laws. Use of the
word “partner” in reference to ARM‟s customers is not intended to create or refer to any partnership relationship
with any other company. ARM may make changes to this document at any time and without notice.

If any of the provisions contained in these terms conflict with any of the provisions of any click through or signed
written agreement covering this document with ARM, then the click through or signed written agreement prevails
over and supersedes the conflicting provisions of these terms. This document may be translated into other
languages for convenience, and you agree that if there is any conflict between the English version of this
document and any translation, the terms of the English version of the Agreement shall prevail.

Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM Limited or its affiliates in
the EU and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the
trademarks of their respective owners. Please follow ARM‟s trademark usage guidelines at
http://www.arm.com/about/trademark-usage-guidelines.php.

Copyright © 2014. ARM Limited or its affiliates. All rights reserved.

ARM Limited. Company 02557590 registered in England.


110 Fulbourn Road, Cambridge, England CB1 9NJ.
LES-PRE-20349

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 2 of 81
Non-Confidential
Contents

1 ABOUT THIS DOCUMENT 8

1.1 Change control 8


1.1.1 Current status and anticipated changes 8
1.1.2 Change history 8

1.2 References 8

1.3 Terms and abbreviations 9

2 SCOPE 10

3 INTRODUCTION 11

3.1 ACLE and 64-bit architectures 11

3.2 Change history 11


3.2.1 Changes between ACLE 1.1 and ACLE 2.0 11
3.2.2 General changes 11

3.3 Portable Binary Objects 12

4 C LANGUAGE EXTENSIONS 13

4.1 Fundamental data types 13


4.1.1 Implementation-defined type properties 13
4.1.2 Half-precision floating-point 13

4.2 Predefined macros 14

4.3 Intrinsics 14
4.3.1 Constant arguments to intrinsics 14

4.4 Header files 15

4.5 Attributes 15

4.6 Implementation strategies 15

5 ARCHITECTURE AND CPU NAMES 16

5.1 Introduction 16

5.2 Architecture names 16


5.2.1 CPU architecture 16
5.2.2 FPU architecture 17

5.3 CPU names 18

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 3 of 81
Non-Confidential
6 FEATURE TEST MACROS 19

6.1 Introduction 19

6.2 Testing for ARM C Language Extensions 19

6.3 Endianness 19

6.4 ARM and Thumb instruction set architecture and features 19


6.4.1 ARM/Thumb instruction set architecture 20
6.4.2 Architectural profile (A, R, M or pre-Cortex) 20
6.4.3 Unaligned access supported in hardware 20
6.4.4 LDREX/STREX 20
6.4.5 CLZ 21
6.4.6 Q (saturation) flag 21
6.4.7 DSP instructions 21
6.4.8 Saturation instructions 21
6.4.9 32-bit SIMD instructions 22
6.4.10 Hardware Integer Divide 22

6.5 Floating-point and Advanced SIMD (NEON) hardware 22


6.5.1 Hardware floating point 22
6.5.2 Half-precision (16-bit) floating-point format 23
6.5.3 Fused multiply-accumulate (FMA) 23
6.5.4 Advanced SIMD architecture extension (NEON) 23
6.5.5 NEON floating-point 23
6.5.6 Wireless MMX 23
6.5.7 Crypto Extension 24
6.5.8 CRC32 Extension 24
6.5.9 Directed Rounding 24
6.5.10 Numeric Maximum and Minimum 24
6.5.11 Half-precision argument and result 24

6.6 Floating-point model 24

6.7 Procedure call standard 25

6.8 Mapping of object build attributes to predefines 26

6.9 Summary of predefined macros 27

7 ATTRIBUTES AND PRAGMAS 29

7.1 Attribute syntax 29

7.2 Hardware/software floating-point calling convention 29

7.3 Target selection 29

7.4 Weak linkage 30


7.4.1 Patchable constants 30

7.5 Alignment 30

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 4 of 81
Non-Confidential
7.5.1 Alignment attribute 30
7.5.2 Alignment of static objects 30
7.5.3 Alignment of stack objects 31
7.5.4 Procedure calls 31
7.5.5 Alignment of C heap storage 31
7.5.6 Alignment of C++ heap allocation 31

7.6 Other attributes 32

8 SYNCHRONIZATION, BARRIER AND HINT INTRINSICS 33

8.1 Introduction 33

8.2 Atomic update primitives 33


8.2.1 C/C++ standard atomic primitives 33
8.2.2 IA-64/GCC atomic update primitives 33

8.3 Memory barriers 33


8.3.1 Examples 34

8.4 Hints 35

8.5 Swap 36

8.6 Memory prefetch intrinsics 36


8.6.1 Data prefetch 37
8.6.2 Instruction prefetch 37

8.7 NOP 38

9 DATA-PROCESSING INTRINSICS 39

9.1 Programmer’s model of global state 39


9.1.1 The Q (saturation) flag 39
9.1.2 The GE flags 40
9.1.3 Floating-point environment 40

9.2 Miscellaneous data-processing intrinsics 40


9.2.1 Examples 41

9.3 16-bit multiplications 42

9.4 Saturating intrinsics 42


9.4.1 Width-specified saturation intrinsics 42
9.4.2 Saturating addition and subtraction intrinsics 42
9.4.3 Accumulating multiplications 43
9.4.4 Examples 43

9.5 32-bit SIMD intrinsics 44


9.5.1 Availability 44
9.5.2 Data types for 32-bit SIMD intrinsics 44
9.5.3 Use of the Q flag by 32-bit SIMD intrinsics 44
9.5.4 Parallel 16-bit saturation 44

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 5 of 81
Non-Confidential
9.5.5 Packing and unpacking 45
9.5.6 Parallel selection 45
9.5.7 Parallel 8-bit addition and subtraction 45
9.5.8 Sum of 8-bit absolute differences 46
9.5.9 Parallel 16-bit addition and subtraction 46
9.5.10 Parallel 16-bit multiplication 48
9.5.11 Examples 49

9.6 Floating-point data-processing intrinsics 49

9.7 CRC32 intrinsics 50

10 SYSTEM REGISTER ACCESS 51

10.1 Special register intrinsics 51

10.2 Special register designations 51


10.2.1 AArch32 32-bit coprocessor register 51
10.2.2 AArch32 32-bit system register 52
10.2.3 AArch32 64-bit coprocessor register 52
10.2.4 AArch64 system register 52
10.2.5 AArch64 processor state field 52

10.3 Unspecified behavior 52

11 INSTRUCTION GENERATION 54

11.1 Instruction generation, arranged by instruction 54

12 NEON INTRINSICS 57

12.1 Availability of NEON intrinsics 57


12.1.1 16-bit floating-point availability 57
12.1.2 Fused multiply-accumulate availability 57

12.2 NEON data types 57


12.2.1 Vector data types 57
12.2.2 Advanced SIMD Scalar data types 57
12.2.3 Vector array data types 57
12.2.4 Scalar data types 58
12.2.5 Operations on data types 58
12.2.6 Compatibility with other vector programming models 58

12.3 Specification of NEON intrinsics 59


12.3.1 Introduction 59
12.3.2 Explanation of NEON intrinsics templates 59
12.3.2.1 Examples of template type parameters 60
12.3.3 Intrinsics with scalar operands 60
12.3.4 Summary of intrinsic naming conventions 60
12.3.5 Lane type classes 61
12.3.6 Constructing and deconstructing NEON vectors 62
12.3.6.1 Examples 64

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 6 of 81
Non-Confidential
12.3.7 NEON loads and stores 64
12.3.7.1 Examples 65
12.3.7.2 Alignment assertions 66
12.3.8 NEON lane-by-lane operations 67
12.3.10 NEON Vector Additions to AArch32 in ARMv8 75
12.3.11 NEON vector reductions 76
12.3.12 NEON vector rearrangements 77
12.3.13 NEON vector table lookup 78
12.3.14 Crypto Intrinsics 78

13 FUTURE DIRECTIONS 80

13.1 Extensions under consideration 80


13.1.1 Procedure calls and the Q / GE bits 80
13.1.2 Returning a value in registers 80
13.1.3 Custom calling conventions 80
13.1.4 Traps: system calls, breakpoints etc. 80
13.1.5 Mixed-endian data 81
13.1.6 Memory access with non-temporal hints. 81

13.2 Features not considered for support 81


13.2.1 VFP vector mode 81
13.2.2 Bit-banded memory access 81

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 7 of 81
Non-Confidential
1 ABOUT THIS DOCUMENT

1.1 Change control


1.1.1 Current status and anticipated changes
This document is release 2.0 of the ARM C Language Extensions (ACLE).
Anticipated changes to this document include:
Update to include extensions for ARMv8 AArch32 and AArch64 execution states.
Typographical corrections.
Clarifications.
Compatible extensions.

1.1.2 Change history


Issue Date By Change
A 11/11/11 AG First release
B 13/11/13 AG Version 1.1. Editorial changes. Corrections and
completions to intrinsics as detailed in 3.3. Updated for
C11/C++11.

C 09/05/14 TB Version 2.0. Updated for ARMv8 AArch32 and AArch64.

1.2 References
This document refers to the following documents.
Ref Doc No Author(s) Title
ARMARM ARM DDI 0406C ARM ARM Architecture Reference Manual (7-A / 7-R)
ARMARMv8 ARM DDI0487A.B ARM ARMv8-A Reference Manual (Issue A.b)
ARMv7M ARM DDI 0403C ARM ARM Architecture Reference Manual (7-M)
AAPCS ARM IHI 0042D ARM Procedure Call Standard
AAPCS64 ARM IHI0055C-BETA ARM Procedure Call Standard (AArch64)
BA ARM IHI 0045C ARM EABI Addenda and Errata – Build Attributes
C++11 ISO/IEC 14882:2011 ISO Standard C++ (based on draft N3337)
C11 ISO/IEC 9899:2011 ISO Standard C (based on draft N1570)
C99 ISO 9899:1999 ISO Standard C (“C99”)
cxxabi http://mentorembedded.github. Code- Itanium C++ ABI (rev. 1.86)
com/cxx-abi/abi.html Sourcery

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 8 of 81
Non-Confidential
G.191 T-REC-G.191-200508-I ITU-T Software Tool Library 2005 User‟s Manual
GNUC http://gcc.gnu.org/onlinedocs GNU/FSF GNU C Compiler Collection
IA-64 245370-003 Intel Intel Itanium Processor-Specific ABI
IEEE-FP IEEE 754-2008 IEEE IEEE floating-point
POSIX IEEE 1003.1 IEEE / TOG The Open Group base specifications
Warren ISBN 0-201-91465-4 H. Warren “Hacker‟s Delight”, pub. Addison-Wesley 2003

1.3 Terms and abbreviations


This document uses the following terms and abbreviations.
Term Meaning
AAPCS ARM Procedure Call Standard, part of the ABI, defined in [AAPCS]
ABI ARM Application Binary Interface
ACLE ARM C Language Extensions, as defined in this document
Advanced SIMD a 64-bit/128-bit SIMD instruction set defined as part of the ARM architecture
build attributes object build attributes indicating configuration, as defined in [BA]
ILP32 a 32-bit address mode where „long‟ is a 32-bit type

LLP64 a 64-bit address mode where „long‟ is a 32-bit type

LP64 a 64-bit address mode where „long‟ is a 64-bit type

NEON™ an implementation of the ARM Advanced SIMD extensions


SIMD any instruction set that operates simultaneously on multiple elements of a vector
data type
®
Thumb the Thumb instruction set extension to ARM
VFP the original ARM non-SIMD floating-point instruction set
word a 32-bit quantity, in memory or a register

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 9 of 81
Non-Confidential
2 SCOPE
The ARM C Language Extensions (ACLE) specification specifies source language extensions and implementation
choices that C/C++ compilers can implement in order to allow programmers to better exploit the ARM architecture.
The extensions include:
Predefined macros that provide information about the functionality of the target architecture (for example,
whether it has hardware floating-point)
Intrinsic functions
Attributes that can be applied to functions, data and other entities
This specification does not standardize command-line options, diagnostics or other external behavior of compilers.
The intended users of this specification are:
Application programmers wishing to adapt or hand-optimize applications and libraries for ARM targets
System programmers needing low-level access to ARM targets beyond what C/C++ provides for
Compiler implementors, who will implement this specification
Implementors of IDEs, static analysis tools etc. who wish to deal with the C/C++ source language
extensions when encountered in source code
Some of the material – specifically, the architecture/CPU namings, and the feature test macros – may also be
applicable to assemblers and other tools.
ACLE is not a hardware abstraction layer (HAL), and does not specify a library component – but it may make it
easier to write a HAL or other low-level library in C rather than assembler.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 10 of 81
Non-Confidential
3 INTRODUCTION
Modern computer architectures (such as ARM) include architectural features that go beyond the set of operations
available in C/C++. These features may include SIMD and saturating instructions. Exploiting these features to
improve program efficiency has in the past caused “lock-in” to compilers, or to individual CPUs.
The intention of the ARM C Language Extensions (ACLE) is to allow the writing of applications and middleware
code that is portable across compilers, and across ARM architecture variants, while exploiting the unique features
of the ARM architecture family.
The design principles for ACLE can be summarized as:

be implementable in (or as an addition to) current C/C++ implementations

build on and standardize existing practice where possible


Notably, ACLE standardizes the NEON (Advanced SIMD) intrinsics.
ACLE incorporates some language extensions introduced in the GCC C compiler. Current GCC documentation
[GCC] can be found at http://gcc.gnu.org/onlinedocs/gcc. Formally it should be assumed that ACLE refers to the
documentation for GCC 4.5.1: http://gcc.gnu.org/onlinedocs/gcc-4.5.1/gcc/.
Some of the ACLE extensions are not specific to the ARM architecture but have proven to be of particular benefit
in low-level and systems programming; examples include features for controlling the alignment and packing of
data, and some common operations such as word rotation and reversal. As and when features become available
in international standards (and implementations), it is recommended to use these in preference to ACLE. When
implementations are widely available, any ACLE-specific features can be expected to be deprecated.

3.1 ACLE and 64-bit architectures


This revision of ACLE is intended to be efficiently implementable on 64-bit architectures and to aid in writing code
that runs efficiently on both 32-bit and 64-bit architectures.

3.2 Change history


The following sections highlight changes which implementors should be aware of. For tracking purposes the
internal defect references (e.g. “[ACLE-123]”) are given.

3.2.1 Changes between ACLE 1.1 and ACLE 2.0


Most changes in ACLE 2.0 are updates to support features introduced in ARMv8 AArch32 and AArch64.

3.2.2 General changes


Introduced new predefines to support ARMv8 32-bit and 64-bit execution states[ACLE-4]

Deprecated DSP-style intrinsics for A-profile cores[ACLE-1]

Defined new NEON Intrinsics for ARMv8 AArch32 and AArch64 NEON[ACLE-3]

Defined new Crypto intrinsics for ARMv8 AArch32 and AArch64[ACLE-40]

Defined new CRC32 intrinsics for ARMv8 AArch32 and AArch64[ACLE-58]

Introduced poly128_t type[ACLE-68]

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 11 of 81
Non-Confidential
3.3 Portable Binary Objects
In AArch32, the ABI for the ARM Architecture defines a set of build attributes [BA]. These attributes are intended
to facilitate generating cross-platform portable binary object files by providing a mechanism to determine the
compatibility of object files. In AArch64, the ABI does not define a standard set of build attributes and takes the
approach that binaries are, in general, not portable across platforms. References to build attributes in this
document should be interpreted as applying only to AArch32.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 12 of 81
Non-Confidential
4 C LANGUAGE EXTENSIONS

4.1 Fundamental data types


This section overlaps with the specification of the ARM Procedure Call Standard, particularly [AAPCS 4.1]. ACLE
extends C by providing some types not present in Standard C and defining how they are dealt with by the AAPCS.
It also extends some of the guarantees of C, allowing assumptions to be made in source code beyond those
permitted by Standard C.
Plain „char‟ is unsigned, as specified in the ABI [AAPCS and AAPCS64 7.1.1].

When pointers are 32 bits, the „long‟ type is 32 bits (ILP32 model).

When pointers are 64 bits, the „long‟ type may be either 64 bits (LP64 model) or 32 bits (LLP64 model).

4.1.1 Implementation-defined type properties


ACLE and the ARM ABI allow implementations some freedom in order to conform to long-standing conventions in
various environments. It is suggested that implementations set suitable defaults for their environment but allow the
default to be overridden.
The signedness of a plain „int‟ bit-field is implementation-defined.

Whether the underlying type of an enumeration is minimal or at least 32-bit, is implementation-defined. The
predefined macro __ARM_SIZEOF_MINIMAL_ENUM should be defined as 1 or 4 according to the size of a minimal
enumeration type such as enum { X=0 }. An implementation that conforms to the ARM ABI must reflect its
choice in the Tag_ABI_enum_size build attribute.

wchar_t may be 2 or 4 bytes. The predefined macro __ARM_SIZEOF_WCHAR_T should be defined as the same
number. An implementation that conforms to the ARM ABI must reflect its choice in the Tag_ABI_PCS_wchar_t
build attribute.

4.1.2 Half-precision floating-point


The __fp16 type denotes half-precision (16-bit) floating-point. It is not required to be provided when not
implemented in hardware. The recommended way to test for this hardware support is to test bit 1 in __ARM_FP.
Implementations which support 16-bit floating-point support two formats: the “binary16” format defined in [IEEE-
FP], and an alternative format, defined by ARM, which extends the range by removing support for infinities and
NaNs. Both formats are described in [ARM ARM A2.7.4][ARM ARMv8 A1.4.2]. Toolchains are not required to
support the alternative format. The format in use can be selected at runtime but ACLE assumes it is fixed for the
life of a program. If 16-bit floating-point is available, one of __ARM_FP16_FORMAT_IEEE and
__ARM_FP16_FORMAT_ALTERNATIVE will be defined to indicate the format in use. An implementation conforming to
the ARM ABI will set the Tag_ABI_FP_16bit_format build attribute.

16-bit floating point is a storage and interchange format only. Values of __fp16 type promote to (at least) float
when used in arithmetic operations, in the same way that values of char or short types promote to int. There
is no arithmetic directly on 16-bit values.
Conversion from 64-bit to 16-bit, i.e. from double to __fp16, must round only once. (With round-to-nearest,
converting first to 32-bit and then to 16-bit could give an incorrectly rounded result.) Because in current ARM
hardware floating-point architectures this is not a primitive operation, it may be faster to convert first to single-
precision and then to half-precision:

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 13 of 81
Non-Confidential
double xd;
__fp16 xs = (float)xd;

rather than:
double xd;
__fp16 xs = xd;

In some older implementations, __fp16 cannot be used as an argument or result type, though it can be used as a
field in a structure passed as an argument or result, or passed via a pointer. The predefined macro
__ARM_FP16_ARGS should be defined if __fp16 can be used as an argument and result. C++ name mangling is
“Dh” as defined in [cxxabi], and is the same for both the IEEE and alternative formats.

In this example, the floating-point addition is done in single (32-bit) precision:


void add(__fp16 *z, __fp16 const *x, __fp16 const *y, int n) {
int i;
for (i = 0; i < n; ++i) z[i] = x[i] + y[i];
}

4.2 Predefined macros


Several predefined macros are defined. Generally these define features of the ARM architecture being targeted,
or how the C/C++ implementation uses the architecture. These macros are detailed in section 6. All ACLE
predefined macros start with the prefix __ARM.

4.3 Intrinsics
ACLE standardizes intrinsics to access the NEON (Advanced SIMD) extension. These intrinsics are intended to
be compatible with existing implementations. Before using the NEON intrinsics or data types, the <arm_neon.h>
header must be included. The NEON intrinsics are defined in section 12. Note that the NEON intrinsics and data
types are in the user namespace.
ACLE also standardizes other intrinsics to access ARM instructions which do not map directly to C operators –
generally either for optimal implementation of algorithms, or for accessing specialist system-level features.
Intrinsics are defined further in various following sections.
Before using the non-NEON intrinsics, the <arm_acle.h> header should be included.

Whether intrinsics are macros, functions or built-in operators is unspecified. For example:

it is unspecified whether applying #undef to an intrinsic removes the name from visibility

it is unspecified whether it is possible to take the address of an intrinsic


However, each argument must be evaluated at most once. So this definition is acceptable:
#define __rev(x) __builtin_bswap32(x)

but this is not:


#define __rev(x) ((((x) & 0xff) << 24) | (((x) & 0xff00) << 8) | \
(((x) & 0xff0000) >> 8) | ((x) >> 24))

4.3.1 Constant arguments to intrinsics


Some intrinsics may require arguments that are constant at compile-time, to supply data that is encoded into the
immediate fields of an instruction. Typically, these intrinsics require an integral-constant-expression in a specified

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 14 of 81
Non-Confidential
range, or sometimes a string literal. An implementation should produce a diagnostic if the argument does not meet
the requirements.

4.4 Header files


<arm_acle.h> is provided to make the non-NEON intrinsics available. These intrinsics are in the C
implementation namespace and begin with double underscores. It is unspecified whether they are available
without the header being included. The __ARM_ACLE macro should be tested before including the header:
#ifdef __ARM_ACLE
#include <arm_acle.h>
#endif /* __ARM_ACLE */

<arm_neon.h> is provided to define the NEON intrinsics. As these intrinsics are in the user namespace, an
implementation would not normally define them until the header is included. The __ARM_NEON macro should be
tested before including the header:
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif /* __ARM_NEON */

These headers behave as standard library headers; repeated inclusion has no effect beyond the first include.
It is unspecified whether the ACLE headers include the standard headers <assert.h>, <stdint.h> or
<inttypes.h>. However, the ACLE headers will not define the standard type names (uint32_t etc.) except by
inclusion of the standard headers. Programmers are recommended to include the standard headers explicitly if the
associated types and macros are needed.
In C++, the following source code fragments are expected to work correctly:
#include <stdint.h>
// UINT64_C not defined here since we did not set __STDC_FORMAT_MACROS
...
#include <arm_neon.h>

and
#include <arm_neon.h>
...
#define __STDC_FORMAT_MACROS
#include <stdint.h>
// ... UINT64_C is now defined

4.5 Attributes
GCC-style attributes are provided to annotate types, objects and functions with extra information, such as
alignment. These attributes are defined in section 7.

4.6 Implementation strategies


An implementation may choose to define all the ACLE non-NEON intrinsics as true compiler intrinsics, i.e. built-in
functions. The <arm_acle.h> header would then have no effect.

Alternatively, <arm_acle.h> could define the ACLE intrinsics in terms of already supported features of the
implementation, e.g. compiler intrinsics with other names, or inline functions using inline assembler.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 15 of 81
Non-Confidential
5 ARCHITECTURE AND CPU NAMES

5.1 Introduction
The intention of this section is to standardize architecture names, e.g. for use in compiler command lines.
Toolchains should accept these names case-insensitively where possible, or use all lowercase where not
possible. Tools may apply local conventions such as using hyphens instead of underscores.
(Note: processor names, including from the ARM Cortex™ family, are used as illustrative examples. This
specification is applicable to any processors implementing the ARM architecture.)

5.2 Architecture names


5.2.1 CPU architecture
The recommended CPU architecture names are as specified under Tag_CPU_arch in [BA]. For details of how to
use predefined macros to test architecture in source code, see 6.4.1.
®
The following table lists the architectures and the ARM and Thumb instruction set versions.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 16 of 81
Non-Confidential
Name Features ARM Thumb Example processor

ARMv4 ARM v4 4 DEC/Intel StrongARM

ARMv4T ARM v4 with Thumb instruction set 4 2 ARM7TDMI

ARMv5T ARM v5 with Thumb instruction set 5 2 ARM10TDMI

ARMv5TE ARM v5T with DSP extensions 5 2 ARM9E, Intel XScale


®
ARMv5TEJ ARM v5TE with Jazelle extensions 5 2 ARM926EJ

ARMv6 ARM v6 (includes TEJ) 6 2 ARM1136J r0

ARMv6K ARM v6 with kernel extensions 6 2 ARM1136J r1

ARMv6T2 ARM v6 with Thumb-2 architecture 6 3 ARM1156T2


®
ARMv6Z ARM v6K with TrustZone extensions (includes K) 6 2 ARM1176JZ-S

ARMv6-M Thumb-1 only (M-profile) 2 Cortex-M0, Cortex-M1

ARMv7-A ARM v7 application profile 7 4 Cortex-A8, Cortex-A9

ARMv7-R ARM v7 realtime profile 7 4 Cortex-R4

ARMv7-M ARM v7 microcontroller profile: Thumb-2 4 Cortex-M3


instructions only

ARMv7E-M ARM v7-M with DSP extensions 4 Cortex-M4

ARMv8-A ARM v8 application profile 8 4 Cortex-A57, Cortex-


AArch32 A53

ARMv8-A ARM v8 application profile 8 Cortex-A57, Cortex-


AArch64 A53
Note that there is some architectural variation that is not visible through ACLE; either because it is only relevant at
the system level (e.g. the large physical address extension) or because it would be handled by the compiler (e.g.
hardware divide might or might not be present in the ARM v7-A architecture).

5.2.2 FPU architecture


For details of how to test FPU features in source code, see 6.5. In particular, for testing which precisions are
supported in hardware, see 6.5.1.
Name Features Example processor

VFPv2 VFPv2 ARM1136JF-S

VFPv3 VFPv3 Cortex-A8

VFPv3_FP16 VFPv3 with FP16 Cortex-A9 (with NEON)

VFPv3_D16 VFPv3 with 16 D-registers Cortex-R4F

VFPv3_D16_FP16 VFPv3 with 16 D-registers and FP16 Cortex-A9 (without NEON), Cortex-R7

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 17 of 81
Non-Confidential
Name Features Example processor

VFPv3_SP_D16 VFPv3 with 16 D-registers, single-precision only Cortex-R5 with SP-only

VFPv4 VFPv4 (including FMA and FP16) Cortex-A15

VFPv4_D16 VFPv4 (including FMA and FP16) with 16 D-registers Cortex-A5 (VFP option)

FPv4_SP FPv4 with single-precision only Cortex-M4.fp

5.3 CPU names


ACLE does not standardize CPU names for use in command-line options and similar contexts. Standard vendor
product names should be used.
Object producers should place the CPU name in the Tag_CPU_name build attribute.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 18 of 81
Non-Confidential
6 FEATURE TEST MACROS

6.1 Introduction
The feature test macros allow programmers to determine the availability of ACLE or subsets of it, or of target
architectural features. This may indicate the availability of some source language extensions (e.g. intrinsics) or the
likely level of performance of some standard C features, such as integer division and floating-point.
Several macros are defined as numeric values to indicate the level of support for particular features. These
macros are undefined if the feature is not present. (Aside: in Standard C/C++, references to undefined macros
expand to 0 in preprocessor expressions, so a comparison such as
#if __ARM_ARCH >= 7

will have the expected effect of evaluating to false if the macro is not defined.)
All ACLE macros begin with the prefix __ARM_. All ACLE macros expand to integral constant expressions suitable
for use in an #if directive, unless otherwise specified. Syntactically, they must be primary-expressions –
generally this means an implementation should enclose them in parentheses if they are not simple constants.

6.2 Testing for ARM C Language Extensions


__ARM_ACLE is defined to the version of this specification implemented, as 100*major version + minor_version. An
implementation implementing version 2.0 (this version) of the ACLE specification will define __ARM_ACLE as 200.

6.3 Endianness
__ARM_BIG_ENDIAN is defined as 1 if data is stored by default in big-endian format. If the macro is not set, data is
stored in little-endian format. (Aside: the “mixed-endian” format for double-precision numbers, used on some very
old ARM FPU implementations, is not supported by ACLE or the ARM ABI.)

6.4 ARM and Thumb instruction set architecture and features


References to “the target architecture” refer to the target as configured in the tools, for example by appropriate
command-line options. This may be a subset or intersection of actual targets, in order to produce a binary that
runs on more than one real architecture. For example, use of specific features may be disabled.
In some cases, hardware features may be accessible from only one or other of ARM or Thumb instruction state.
For example, in the v5TE and v6 architectures, “DSP” instructions and (where available) VFP instructions, are
only accessible in ARM state, while in the v7-R architecture, hardware divide is only accessible from Thumb state.
Where both states are available, the implementation should set feature test macros indicating that the hardware
feature is accessible. To provide access to the hardware feature, an implementation might override the
programmer‟s preference for target instruction set, or generate an interworking call to a helper function. This
mechanism is outside the scope of ACLE. In cases where the implementation is given a hard requirement to use
only one state (e.g. to support validation, or post-processing) then it should set feature test macros only for the
hardware features available in that state – as if compiling for a core where the other instruction set was not
present.
An implementation that allows a user to indicate which functions go into which state (either as a hard requirement
or a preference) is not required to change the settings of architectural feature test macros.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 19 of 81
Non-Confidential
6.4.1 ARM/Thumb instruction set architecture
__ARM_ARCH is defined as an integer value indicating the current ARM instruction set architecture (e.g. 7 for the
ARM v7-A architecture implemented by Cortex-A8 or the ARM v7-M architecture implemented by Cortex-M3 or 8
for the ARM v8-A architecture implemented by Cortex-A57). Since ACLE only supports the ARM architecture, this
macro would always be defined in an ACLE implementation.
Note that the __ARM_ARCH macro is defined even for cores which only support the Thumb instruction set.

__ARM_ARCH_ISA_ARM is defined to 1 if the core supports the ARM instruction set. It is not defined for M-profile
cores.
__ARM_ARCH_ISA_THUMB is defined to 1 if the core supports the original Thumb instruction set (including the v6-M
architecture) and 2 if it supports the Thumb-2 instruction set as found in the v6T2 architecture and all v7
architectures.
__ARM_ARCH_ISA_A64 is defined to 1 if the core supports AArch64‟s A64 instruction set.

__ARM_32BIT_STATE is defined to 1 if code is being generated for AArch32.

__ARM_64BIT_STATE is defined to 1 if code is being generated for AArch64.

6.4.2 Architectural profile (A, R, M or pre-Cortex)


__ARM_ARCH_PROFILE is defined as „A‟, „R‟, „M‟ or „S‟, or unset, according to the architectural profile of the target.
„S‟ indicates the common subset of „A‟ and „R‟. The common subset of „A‟, „R‟ and „M‟ is indicated by
__ARM_ARCH == 7 && !defined (__ARM_ARCH_PROFILE)

This macro corresponds to the Tag_CPU_arch_profile object build attribute. It may be useful to writers of system
code. It is expected in most cases programmers will use more feature-specific tests.
Values „R‟, „M‟ and „S‟ are unsupported for architectural targets with __ARM_ARCH > 7.

The macro is undefined for architectural targets which predate the use of architectural profiles.

6.4.3 Unaligned access supported in hardware


__ARM_FEATURE_UNALIGNED is defined if the target supports unaligned access in hardware, at least to the extent
of being able to load or store an integer word at any alignment with a single instruction. (There may be restrictions
on load-multiple and floating-point accesses.) Note that whether a code generation target permits unaligned
access will in general depend on the settings of system register bits, so an implementation should define this
macro to match the user‟s expectations and intentions. For example, a command-line option might be provided to
disable the use of unaligned access, in which case this macro would not be defined.

6.4.4 LDREX/STREX
This feature is deprecated in ACLE 2.0. It is strongly recommended that C11/C++11 atomics be used instead.
__ARM_FEATURE_LDREX is defined if the load/store-exclusive instructions (LDREX/STREX) are supported. Its value
is a set of bits indicating available widths of the access, as powers of 2. The following bits are used:
Bit Value Access width Instruction

0 0x01 byte LDREXB/STREXB

1 0x02 halfword LDREXH/STREXH

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 20 of 81
Non-Confidential
Bit Value Access width Instruction

2 0x04 word LDREX/STREX

3 0x08 doubleword LDREXD/STREXD


Other bits are reserved.
The following values of __ARM_FEATURE_LDREX may occur:
Macro value Access widths Example architecture

(undefined) none ARM v5, ARM v6-M

0x04 word ARM v6

0x07 word, halfword, byte ARM v7-M

0x0F doubleword, word, halfword, byte ARM v6K, ARM v7-A/R


Other values are reserved.
The LDREX/STREX instructions are introduced in recent versions of the ARM architecture and supersede the
SWP instruction. Where both are available, ARM strongly recommends programmers to use LDREX/STREX
rather than SWP. Note that platforms may choose to make SWP unavailable in user mode and emulate it through
a trap to a platform routine, or fault it.

6.4.5 CLZ
__ARM_FEATURE_CLZ is defined to 1 if the CLZ (count leading zeroes) instruction is supported in hardware. Note
that ACLE provides the __clz() family of intrinsics (see 9.2) even when __ARM_FEATURE_CLZ is not defined.

6.4.6 Q (saturation) flag


__ARM_FEATURE_QBIT is defined to 1 if the Q (saturation) global flag exists and the intrinsics defined in 9.1.1 are
available. This flag is used with the DSP saturating-arithmetic instructions (such as QADD) and the width-specified
saturating instructions (SSAT and USAT). Note that either of these classes of instructions may exist without the
other: for example, v5E has only QADD while v7-M has only SSAT.
Intrinsics associated with the Q-bit and their feature macro __ARM_FEATURE_QBIT are deprecated in ACLE 2.0
for A-profile. They are fully supported for M-profile and R-profile. This macro is defined for AArch32 only.

6.4.7 DSP instructions


__ARM_FEATURE_DSP is defined to 1 if the DSP (v5E) instructions are supported and the intrinsics defined in 9.4
are available. These instructions include QADD, SMULBB etc. This feature also implies support for the Q flag.
__ARM_FEATURE_DSP and its associated intrinsics are deprecated in ACLE 2.0 for A-profile. They are fully
supported for M and R-profiles. This macro is defined for AArch32 only.

6.4.8 Saturation instructions


__ARM_FEATURE_SAT is defined to 1 if the SSAT and USAT instructions are supported and the intrinsics defined in
9.4.1 are available. This feature also implies support for the Q flag.
__ARM_FEATURE_SAT and its associated intrinsics are deprecated in ACLE 2.0 for A-profile. They are fully
supported for M and R-profiles. This macro is defined for AArch32 only.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 21 of 81
Non-Confidential
6.4.9 32-bit SIMD instructions
__ARM_FEATURE_SIMD32 is defined to 1 if the 32-bit SIMD instructions are supported and the intrinsics defined in
9.5 are available. This also implies support for the GE global flags which indicate byte-by-byte comparison results.
__ARM_FEATURE_SIMD32 is deprecated in ACLE 2.0 for A-profile. Users are encouraged to use NEON Intrinscs as
an equivalent for the 32-bit SIMD intrinsics functionality. However they are fully supported for M and R-profiles.
This is defined for AArch32 only.

6.4.10 Hardware Integer Divide


__ARM_FEATURE_IDIV is defined to 1 if the target has hardware support for 32-bit integer division in all available
instruction sets. Signed and unsigned versions are both assumed to be available. The intention is to allow
programmers to choose alternative algorithm implementations depending on the likely speed of integer division.
Some older R-profile targets have hardware divide available in the Thumb instruction set only. This can be tested
for using the following test:
#if __ARM_FEATURE_IDIV || (__ARM_ARCH_PROFILE == ‟R‟)

6.5 Floating-point and Advanced SIMD (NEON) hardware


6.5.1 Hardware floating point
__ARM_FP is set if hardware floating-point is available. The value is a set of bits indicating the floating-point
precisions supported. The following bits are used:
Bit Value Precision

1 0x02 half (16-bit) – data type only

2 0x04 single (32-bit)

3 0x08 double (64-bit)


Bits 0 and 4..31 are reserved
Currently, the following values of __ARM_FP may occur (assuming the processor configuration option for
hardware floating-point support is selected where available):
Value Precisions Example processor

(undefined) none any processor without hardware floating-point support

0x04 single Cortex-R5 when configured with SP only

0x06 single, half Cortex-M4.fp

0x0C double, single ARM9, ARM11, Cortex-A8, Cortex-R4

0x0E double, single, half Cortex-A9, Cortex-A15, Cortex-R7


Other values are reserved.
Standard C implementations support single and double precision floating-point irrespective of whether floating-
point hardware is available. However, an implementation might choose to offer a mode to diagnose or fault use of
floating-point arithmetic at a precision not supported in hardware.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 22 of 81
Non-Confidential
Support for 16-bit floating-point language extensions (see 6.5.2) is only required to be available if supported in
hardware. Hardware support for 16-bit floating-point is limited to conversions. Values are promoted to 32-bit
(single-precision) type for arithmetic.

6.5.2 Half-precision (16-bit) floating-point format


__ARM_FP16_FORMAT_IEEE is defined to 1 if the IEEE 754-2008 [IEEE-FP] 16-bit floating-point format is used.

__ARM_FP16_FORMAT_ALTERNATIVE is defined to 1 if the ARM alternative [ARMARM] 16-bit floating-point format is


used. This format removes support for infinities and NaNs in order to provide an extra exponent bit.
At most one of these macros will be defined. See 4.1.2 for details of half-precision floating-point types.

6.5.3 Fused multiply-accumulate (FMA)


__ARM_FEATURE_FMA is defined to 1 if the hardware floating-point architecture supports fused floating-point
multiply-accumulate, i.e. without intermediate rounding. Note that C implementations are encouraged [C99 7.12]
to ensure that <math.h> defines FP_FAST_FMAF or FP_FAST_FMA, which can be tested by portable C code. A C
implementation on ARM might define these macros by testing __ARM_FEATURE_FMA and __ARM_FP.

6.5.4 Advanced SIMD architecture extension (NEON)


__ARM_NEON is defined to a value indicating the Advanced SIMD (NEON) architecture supported. The only current
value is 1.
In principle, for AArch32, the NEON architecture can exist in an integer-only version. To test for the presence of
NEON floating-point vector instructions, test __ARM_NEON_FP. When NEON does occur in an integer-only version,
the VFP scalar instruction set is also not present. See [ARMARM table A2-4] for architecturally permitted
combinations.
__ARM_NEON is always set to 1 for AArch64.

6.5.5 NEON floating-point


__ARM_NEON_FP is defined as a bitmap to indicate floating-point support in the NEON architecture. The meaning of
the values is the same as for __ARM_FP. This macro is undefined when the NEON extension is not present or
does not support floating-point.
Current AArch32 NEON implementations do not support double-precision floating-point even when it is present in
VFP. 16-bit floating-point format is supported in NEON if and only if it is supported in VFP. Consequently, the
definition of __ARM_NEON_FP is the same as __ARM_FP except that the bit to indicate double-precision is not set
for AArch32. Double-precision is always set for AArch64.
If __ARM_FEATURE_FMA and __ARM_NEON_FP are both defined, fused-multiply instructions are available in
NEON also.

6.5.6 Wireless MMX


If Wireless MMX operations are available on the target, __ARM_WMMX is defined to a value that indicates the level
of support, corresponding to the Tag_WMMX_arch build attribute.
This specification does not further define source-language features to support Wireless MMX.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 23 of 81
Non-Confidential
6.5.7 Crypto Extension
__ARM_FEATURE_CRYPTO is defined to 1 if the Crypto instructions are supported and the intrinsics defined in
12.3.14 are available. These instructions include AES{E, D}, SHA1{C, P, M} etc. This is only available when
__ARM_ARCH >= 8.

6.5.8 CRC32 Extension


__ARM_FEATURE_CRC32 is defined to 1 if the CRC32 instructions are supported and the intrinsics defined in 9.7 are
available. These instructions include CRC32B, CRC32H etc. This is only available when __ARM_ARCH >= 8.

6.5.9 Directed Rounding


__ARM_FEATURE_DIRECTED_ROUNDING is defined to 1 if the directed rounding and conversion vector instructions
are supported and rounding and conversion intrinsics defined in 12.3.10 are available. This is only available when
__ARM_ARCH >= 8.

6.5.10 Numeric Maximum and Minimum


__ARM_FEATURE_NUMERIC_MAXMIN is defined to 1 if the IEEE 754-2008 compliant floating point maximum and
minimum vector instructions are supported and intrinsics defined in 12.3.10 are available. This is only available
when __ARM_ARCH >= 8.

6.5.11 Half-precision argument and result


__ARM_FP16_ARGS is defined to 1 if __fp16 can be used as an argument and result.

6.6 Floating-point model


These macros test the floating-point model implemented by the compiler and libraries. The model determines the
guarantees on arithmetic and exceptions.
__ARM_FP_FAST is defined to 1 if floating-point optimizations may occur such that the computed results are
different from those prescribed by the order of operations according to the C standard. Examples of such
optimizations would be reassociation of expressions to reduce depth, and replacement of a division by constant
with multiplication by its reciprocal.
__ARM_FP_FENV_ROUNDING is defined to 1 if the implementation allows the rounding to be configured at runtime
using the standard C fesetround() function and will apply this rounding to future floating-point operations. The
rounding mode applies to both scalar floating-point and NEON.
The floating-point implementation might or might not support denormal values. If denormal values are not
supported then they are flushed to zero.
Implementations may also define the following macros in appropriate floating-point modes:
__STDC_IEC_559__ is defined if the implementation conforms to IEC 559. This implies support for floating-point
exception status flags, including the inexact exception. This macro is specified by [C99 6.10.8].
__SUPPORT_SNAN__ is defined if the implementation supports signalling NaNs. This macro is specified by the C
standards proposal WG14 N965 “Optional support for Signaling NaNs”. (Note: this was not adopted into C11.)

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 24 of 81
Non-Confidential
6.7 Procedure call standard
__ARM_PCS is defined to 1 if the default procedure calling standard for the translation unit conforms to the “base
PCS” defined in [AAPCS]. This is supported on AArch32 only.
__ARM_PCS_VFP is defined to 1 if the default is to pass floating-point parameters in hardware floating-point
registers using the “VFP variant PCS” defined in [AAPCS]. This is supported on AArch32 only.
__ARM_PCS_AAPCS64 is defined to 1 if the default procedure calling standard for the translation unit conforms to
the [AAPCS64].
Note that this should reflect the implementation default for the translation unit. Implementations which allow the
PCS to be set for a function, class or namespace are not expected to redefine the macro within that scope.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 25 of 81
Non-Confidential
6.8 Mapping of object build attributes to predefines
This section is provided for guidance. Details of build attributes can be found in [BA].
Tag no. Tag Predefined macro

6 Tag_CPU_arch __ARM_ARCH,
__ARM_FEATURE_DSP

7 Tag_CPU_arch_profile __ARM_PROFILE

8 Tag_ARM_ISA_use __ARM_ISA_ARM

9 Tag_THUMB_ISA_use __ARM_ISA_THUMB

11 Tag_WMMX_arch __ARM_WMMX

18 Tag_ABI_PCS_wchar_t __ARM_SIZEOF_WCHAR_T

20 Tag_ABI_FP_denormal

21 Tag_ABI_FP_exceptions

22 Tag_ABI_FP_user_exceptions

23 Tag_ABI_FP_number_model

26 Tag_ABI_enum_size __ARM_SIZEOF_MINIMAL_ENUM

34 Tag_CPU_unaligned_access __ARM_FEATURE_UNALIGNED

36 Tag_FP_HP_extension __ARM_FP16_FORMAT_IEEE,
__ARM_FP16_FORMAT_ALTERNATIVE

38 Tag_ABI_FP_16bit_format __ARM_FP16_FORMAT_IEEE,
__ARM_FP16_FORMAT_ALTERNATIVE

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 26 of 81
Non-Confidential
6.9 Summary of predefined macros
Macro name Meaning Example See section
__ARM_32BIT_STATE code is for AArch32 state 1 6.4.1
__ARM_64BIT_STATE code is for AArch64 state 1 6.4.1
__ARM_ACLE indicates ACLE implemented 101 6.2
__ARM_ALIGN_MAX_PWR log of maximum alignment of static object 20 7.5.2
__ARM_ALIGN_MAX_STACK_PWR log of maximum alignment of stack object 3 7.5.3
__ARM_ARCH ARM architecture level 7 6.4.1
__ARM_ARCH_ISA_A64 AArch64 ISA present 1 6.4.1
__ARM_ARCH_ISA_ARM ARM instruction set present 1 6.4.1
__ARM_ARCH_ISA_THUMB Thumb instruction set present 2 6.4.1
__ARM_ARCH_PROFILE architecture profile „A‟ 6.4.2
__ARM_BIG_ENDIAN memory is big-endian 1 6.3
__ARM_FEATURE_CLZ CLZ instruction 1 6.4.5, 9.2
__ARM_FEATURE_CRC32 CRC32 extension 1 6.5.8
__ARM_FEATURE_CRYPTO Crypto extension 1 6.5.7
__ARM_FEATURE_DIRECTED_ROUNDING Directed Rounding 1 12.3.10
__ARM_FEATURE_DSP DSP instructions (ARM v5E) (32-bit-only) 1 6.4.6, 9.4
__ARM_FEATURE_FMA floating-point fused multiply-accumulate 1 6.5.3, 9.6
__ARM_FEATURE_IDIV Hardware Integer Divide 1 6.4.10
__ARM_FEATURE_LDREX(Deprecated) load/store exclusive instructions 0x0F 6.4.4, 8
__ARM_FEATURE_NUMERIC_MAXMIN Numeric Maximum and Minimum 1 12.3.10
__ARM_FEATURE_QBIT Q (saturation) flag (32-bit-only) 1 6.4.6, 9.1.1
__ARM_FEATURE_SAT width-specified saturation instructions (32- 1 6.4.8, 9.4.1
bit-only)
__ARM_FEATURE_SIMD32 32-bit SIMD instructions (ARM v6) (32-bit- 1 6.4.8, 9.5
only)
__ARM_FEATURE_UNALIGNED hardware support for unaligned access 1 6.4.3
__ARM_FP hardware floating-point 0x0C 6.5.1
__ARM_FP16_ARGS __fp16 argument and result 1 6.5.11
__ARM_FP16_FORMAT_ALTERNATIVE 16-bit floating-point, alternative format 1 6.5.2

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 27 of 81
Non-Confidential
__ARM_FP16_FORMAT_IEEE 16-bit floating-point, IEEE format 1 6.5.2
__ARM_FP_FAST accuracy-losing optimizations 1 6.6

__ARM_FP_FENV_ROUNDING rounding is configurable at runtime 1 6.6

__ARM_NEON Advanced SIMD (NEON) extension 1 6.5.4


__ARM_NEON_FP Advanced SIMD (NEON) floating-point 0x04 6.5.5
__ARM_PCS ARM procedure call standard (32-bit-only) 1 6.7
__ARM_PCS_AAPCS64 ARM PCS for AArch64. 1 6.7
__ARM_PCS_VFP ARM PCS hardware FP variant in use 1 6.7
(32-bit-only)
__ARM_SIZEOF_MINIMAL_ENUM size of minimal enumeration type: 1 or 4 1 4.1.1
__ARM_SIZEOF_WCHAR_T size of wchar_t: 2 or 4 2 4.1.1

__ARM_WMMX Wireless MMX extension (32-bit-only) 1 6.5.6

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 28 of 81
Non-Confidential
7 ATTRIBUTES AND PRAGMAS

7.1 Attribute syntax


The general rules for attribute syntax are described in the GCC documentation
http://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html. Briefly, for this declaration:
A int B x C, D y E;

attribute A applies to both x and y; B and C apply to x only, and D and E apply to y only. Programmers are
recommended to keep declarations simple if attributes are used.
Unless otherwise stated, all attribute arguments must be compile-time constants.

7.2 Hardware/software floating-point calling convention


The AArch32 PCS defines a base standard, as well as several variants.
On targets with hardware FP the AAPCS provides for procedure calls to use either integer or floating-point
argument and result registers. ACLE allows this to be selectable per function.
__attribute__((pcs("aapcs")))

applied to a function, selects software (integer) FP calling convention.


__attribute__((pcs("aapcs-vfp")))

applied to a function, selects hardware FP calling convention.


The AArch64 PCS standard variants do not change how parameters are passed, so no PCS attributes are
supported.
The pcs attribute applies to functions and function types. Implementations are allowed to treat the procedure call
specification as part of the type, i.e. as a “language linkage” in the sense of [C++ 7.5#1].

7.3 Target selection


The following target selection attributes are supported:
__attribute__((target("arm")))

when applied to a function, forces ARM state code generation.


__attribute__((target("thumb")))

when applied to a function, forces Thumb state code generation.


The implementation must generate code in the required state unless it is impossible to do so. For example, on an
ARM v5 or v6 target with VFP (and without the Thumb2 instruction set), if a function is forced to Thumb state, any
floating-point operations or intrinsics that are only available in ARM state must be generated as calls to library
functions or compiler-generated functions.
This attribute does not apply to AArch64.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 29 of 81
Non-Confidential
7.4 Weak linkage
__attribute__((weak)) can be attached to declarations and definitions to indicate that they have weak static
linkage (STB_WEAK in ELF objects). As definitions, they can be overridden by other definitions of the same
symbol. As references, they do not need to be satisfied and will be resolved to zero if a definition is not present.

7.4.1 Patchable constants


In addition, this specification requires that weakly defined initialized constants are not used for constant
propagation, allowing the value to be safely changed by patching after the object is produced.

7.5 Alignment
The new standards for C [C11 6.7.5] and C++ [C++11 7.6.2] add syntax for aligning objects and types. ACLE
provides an alternative syntax described in this section.

7.5.1 Alignment attribute


__attribute__((aligned(N))) can be associated with data, functions, types and fields. N must be an integral
constant expression and must be a power of 2, e.g. 1, 2, 4, 8. The maximum alignment depends on the storage
class of the object being aligned. The size of a data type is always a multiple of its alignment. This is a
consequence of the rule in C that the spacing between array elements is equal to the element size.
The aligned attribute does not act as a type qualifier. For example, given
char x __attribute__((aligned(8)));
int y __attribute__((aligned(1)));

the type of &x is “char *” and the type of &y is “int *”. The following declarations are equivalent:
struct S x __attribute__((aligned(16))); /* ACLE */
struct S _Alignas(16) x; /* C11 */
#include <stdalign.h> /* C11 (alternative) */
struct S alignas(16) x;
struct S alignas(16) x; /* C++11 */

7.5.2 Alignment of static objects


The macro __ARM_ALIGN_MAX_PWR indicates (as the exponent of a power of 2) the maximum available alignment
of static data – for example 4 for 16-byte alignment. So the following is always valid:
int x __attribute__((aligned(1 << __ARM_ALIGN_MAX_PWR)));

or, using the C11/C++11 syntax:


alignas(1 << __ARM_ALIGN_MAX_PWR) int x;

Since an alignment request on an object does not change its type or size, x in this example would have type int
and size 4.
There is in principle no limit on the alignment of static objects, within the constraints of available memory. In the
ARM ABI an object with a requested alignment would go into an ELF section with at least as strict an alignment
requirement. However, an implementation supporting position-independent dynamic objects or overlays may need
to place restrictions on their alignment demands.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 30 of 81
Non-Confidential
7.5.3 Alignment of stack objects
It must be possible to align any local object up to the stack alignment as specified in the AAPCS for AArch32 (i.e.
8 bytes) or as specified in AAPCS64 for AArch64 (i.e. 16 bytes) this being also the maximal alignment of any
native type.
An implementation may, but is not required to, permit the allocation of local objects with greater alignment, e.g. 16
or 32 bytes for AArch32. (This would involve some runtime adjustment such that the object address was not a
fixed offset from the stack pointer on entry.)
If a program requests alignment greater than the implementation supports, it is recommended that the compiler
warn but not fault this. Programmers should expect over-alignment of local objects to be treated as a hint.
The macro __ARM_ALIGN_MAX_STACK_PWR indicates (as the exponent of a power of 2) the maximum available
stack alignment. For example, a value of 3 indicates 8-byte alignment.

7.5.4 Procedure calls


For procedure calls, where a parameter has aligned type, data should be passed as if it was a basic type of the
given type and alignment. For example, given the aligned type
struct S { int a[2]; } __attribute__((aligned(8)));

the second argument of


f(int, struct S);

should be passed as if it were


f(int, long long);

which means that in AArch32 AAPCS the second parameter is in R2/R3 rather than R1/R2.

7.5.5 Alignment of C heap storage


The standard C allocation functions [C99 7.20.3], such as malloc(), return storage aligned to the normal
maximal alignment, i.e. the largest alignment of any (standard) type.
Implementations may, but are not required to, provide a function to return heap storage of greater alignment.
Suitable functions are
int posix_memalign(void **memptr, size_t alignment, size_t size);

as defined in [POSIX], or
void *aligned_alloc(size_t alignment, size_t size);

as defined in [C11 7.22.3.1].

7.5.6 Alignment of C++ heap allocation


In C++, an allocation (with „new‟) knows the object‟s type. If the type is aligned, the allocation should also be
aligned. There are two cases to consider depending on whether the user has provided an allocation function.
If the user has provided an allocation function for an object or array of over-aligned type, it is that function‟s
responsibility to return suitably aligned storage. The size requested by the runtime library will be a multiple of the
alignment (trivially so, for the non-array case).
(The AArch32 C++ ABI does not explicitly deal with the runtime behavior when dealing with arrays of alignment
greater than 8. In this situation, any „cookie‟ will be 8 bytes as usual, immediately preceding the array; this means

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 31 of 81
Non-Confidential
that the cookie is not necessarily at the address seen by the allocation and deallocation functions.
Implementations will need to make some adjustments before and after calls to the ABI-defined C++ runtime, or
may provide additional non-standard runtime helper functions.) Example:
struct float4 {
void *operator new[](size_t s) {
void *p;
posix_memalign(&p, 16, s);
return p;
}
float data[4];
} __attribute__((aligned(16)));

If the user has not provided their own allocation function, the behavior is implementation-defined.
The generic itanium C++ ABI, which we use in AArch64, already handles arrays with arbitrarily aligned elements

7.6 Other attributes


The following attributes should be supported and their definitions follow [GCC]. These attributes are not specific to
ARM or the ARM ABI.
alias, common, nocommon, noinline, packed, section, visibility, weak

Some specific requirements on the weak attribute are detailed in 7.4.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 32 of 81
Non-Confidential
8 SYNCHRONIZATION, BARRIER AND HINT INTRINSICS

8.1 Introduction
This section provides intrinsics for managing data that may be accessed concurrently between processors, or
between a processor and a device. Some intrinsics atomically update data, while others place barriers around
accesses to data to ensure that accesses are visible in the correct order.
Memory prefetch intrinsics are also described in this section.

8.2 Atomic update primitives


8.2.1 C/C++ standard atomic primitives
The new C and C++ standards [C11 7.17, C++11 clause 29] provide a comprehensive library of atomic operations
and barriers, including operations to read and write data with particular ordering requirements. Programmers are
recommended to use this where available.

8.2.2 IA-64/GCC atomic update primitives


The __sync family of intrinsics (introduced in [IA-64 section 7.4], and as documented in the GCC documentation)
may be provided, especially if the C/C++ atomics are not available, and are recommended as being portable and
widely understood. These may be expanded inline, or call library functions. Note that, unusually, these intrinsics
are polymorphic – they will specialize to instructions suitable for the size of their arguments.

8.3 Memory barriers


Memory barriers ensure specific ordering properties between memory accesses. For more details on memory
barriers, see ARM ARM [v7 section A3.8.3]. The intrinsics in this section are available for all targets. They may be
no-ops (i.e. generate no code, but possibly act as a code motion barrier in compilers) on targets where the
relevant instructions do not exist, but only if the property they guarantee would have held anyway. On targets
where the relevant instructions exist but are implemented as no-ops, these intrinsics generate the instructions.
The memory barrier intrinsics take a numeric argument indicating the scope and access type of the barrier, as
shown in the following table. (The assembler mnemonics for these numbers, as shown in the table, are not
available in the intrinsics.) The argument should be an integral constant expression within the required range –
see section 4.3.1.
Argument Mnemonic Domain Ordered Accesses (before-after)

15 SY Full system Any-Any

14 ST Full system Store-Store

13 LD Full system Load-Load, Load-Store

11 ISH Inner shareable Any-Any

10 ISHST Inner shareable Store-Store

9 ISHLD Inner shareable Load-Load, Load-Store

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 33 of 81
Non-Confidential
Argument Mnemonic Domain Ordered Accesses (before-after)

7 NSH or UN Non-shareable Any-Any

6 NSHST Non-shareable Store-Store

5 NSHLD Non-shareable Load-Load, Load-Store

3 OSH Outer shareable Any-Any

2 OSHST Outer shareable Store-Store

1 OSHLD Outer shareable Load-Load, Load-Store

The following memory barrier intrinsics are available:


void __dmb(/*constant*/ unsigned int);

Generates a DMB (data memory barrier) instruction or equivalent CP15 instruction. DMB ensures the observed
ordering of memory accesses. Memory accesses of the specified type issued before the DMB are guaranteed to
be observed (in the specified scope) before memory accesses issued after the DMB. For example, DMB should
be used between storing data, and updating a flag variable that makes that data available to another core.
The __dmb() intrinsic also acts as a compiler memory barrier of the appropriate type.
void __dsb(/*constant*/ unsigned int);

Generates a DSB (data synchronization barrier) instruction or equivalent CP15 instruction. DSB ensures the
completion of memory accesses. A DSB behaves as the equivalent DMB and has additional properties. After a
DSB instruction completes, all memory accesses of the specified type issued before the DSB are guaranteed to
have completed.
The __dsb() intrinsic also acts as a compiler memory barrier of the appropriate type.
void __isb(/*constant*/ unsigned int);

Generates an ISB (instruction synchronization barrier) instruction or equivalent CP15 instruction. This instruction
flushes the processor pipeline fetch buffers, so that following instructions are fetched from cache or memory. An
ISB is needed after some system maintenance operations.
An ISB is also needed before transferring control to code that has been loaded or modified in memory, for
example by an overlay mechanism or just-in-time code generator. (Note that if instruction and data caches are
separate, privileged cache maintenance operations would be needed in order to unify the caches.)
The only supported argument for the __isb() intrinsic is 15, corresponding to the SY (full system) scope of the
ISB instruction.

8.3.1 Examples
In this example, process P1 makes some data available to process P2 and sets a flag to indicate this.
P1:
value = x;
/* issue full-system memory barrier for previous store:
setting of flag is guaranteed not to be observed before
write to value */

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 34 of 81
Non-Confidential
__dmb(14);
flag = true;

P2:
/* busy-wait until the data is available */
while (!flag) {}
/* issue full-system memory barrier: read of value is guaranteed
not to be observed by memory system before read of flag */
__dmb(15);
use value;

In this example, process P1 makes data available to P2 by putting it on a queue.


P1:
work = new WorkItem;
work->payload = x;
/* issue full-system memory barrier for previous store:
consumer cannot observe work item on queue before write to
work item‟s payload
__dmb(14);
queue_head = work;

P2:
/* busy-wait until work item appears */
while (!(work = queue_head)) {}
/* no barrier needed: load of payload is data-dependent */
use work->payload

8.4 Hints
The intrinsics in this section are available for all targets. They may be no-ops (i.e. generate no code, but possibly
act as a code motion barrier in compilers) on targets where the relevant instructions do not exist. On targets where
the relevant instructions exist but are implemented as no-ops, these intrinsics generate the instructions.
void __wfi(void);
Generates a WFI (wait for interrupt) hint instruction, or nothing. The WFI instruction allows (but does not require)
the processor to enter a low-power state until one of a number of asynchronous events occurs.
void __wfe(void);
Generates a WFE (wait for event) hint instruction, or nothing. The WFE instruction allows (but does not require)
the processor to enter a low-power state until some event occurs such as a SEV being issued by another
processor.
void __sev(void);
Generates a SEV (send a global event) hint instruction. This causes an event to be signaled to all processors in a
multiprocessor system. It is a NOP on a uniprocessor system.
void __sevl(void);
Generates a “send a local event” hint instruction. This causes an event to be signaled to only the processor
executing this instruction. In a multiprocessor system, it is not required to affect the other processors.
void __yield(void);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 35 of 81
Non-Confidential
Generates a YIELD hint instruction. This enables multithreading software to indicate to the hardware that it is
performing a task, for example a spin-lock, that could be swapped out to improve overall system performance.
void __dbg(/*constant*/ unsigned int);
Generates a DBG instruction. This provides a hint to debugging and related systems. The argument must be a
constant integer from 0 to 15 inclusive. See implementation documentation for the effect (if any) of this instruction
and the meaning of the argument. This is available only when compliling for AArch32.

8.5 Swap
__swp is available for all targets. This intrinsic expands to a sequence equivalent to the deprecated (and possibly
unavailable) SWP instruction.
uint32_t __swp(uint32_t, volatile void *);

unconditionally stores a new value at the given address, and returns the old value.
As with the IA-64/GCC primitives described in 0, the __swp intrinsic is polymorphic. The second argument must
provide the address of a byte-sized object or an aligned word-sized object and it must be possible to determine
the size of this object from the argument expression.
This intrinsic is implemented by LDREX/STREX (or LDREXB/STREXB) where available, as if by
uint32_t __swp(uint32_t x, volatile uint32_t *p) {
uint32_t v;
/* use LDREX/STREX intrinsics not specified by ACLE */
do v = __ldrex(p); while (__strex(x, p));
return v;
}
or alternatively,
uint32_t __swp(uint32_t x, uint32_t *p) {
uint32_t v;
/* use IA-64/GCC atomic builtins */
do v = *p; while (!__sync_bool_compare_and_swap(p, v, x));
return v;
}

It is recommended that compilers should produce a downgradeable/upgradeable warning on encountering the


__swp intrinsic.

Only if load-store exclusive instructions are not available will the intrinsic use the SWP/SWPB instructions.
It is strongly recommended to use standard and flexible atomic primitives such as those available in the C++
<atomic> header. __swp is provided solely to allow straightforward (and possibly automated) replacement of
explicit use of SWP in inline assembler. SWP is obsolete in the ARM architecture, and in recent versions of the
architecture, may be configured to be unavailable in user-mode. (Aside: unconditional atomic swap is also less
powerful as a synchronization primitive than load-exclusive/store-conditional.)

8.6 Memory prefetch intrinsics


Intrinsics are provided to prefetch data or instructions. The size of the data or function is ignored. Note that the
intrinsics may be implemented as no-ops (i.e. not generate a prefetch instruction, if none is available). Also, even
where the architecture does provide a prefetch instruction, a particular implementation may implement the
instruction as a no-op (i.e. the instruction has no effect).

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 36 of 81
Non-Confidential
8.6.1 Data prefetch
void __pld(void const volatile *addr);
Generates a data prefetch instruction, if available. The argument should be any expression that may designate a
data address. The data is prefetched to the innermost level of cache, for reading.
void __pldx(/*constant*/ unsigned int /*access_kind*/,
/*constant*/ unsigned int /*cache_level*/,
/*constant*/ unsigned int /*retention_policy*/,
void const volatile *addr);

Generates a data prefetch instruction. This intrinsic allows the specification of the expected access kind (read or
write), the cache level to load the data, the data retention policy (temporal or streaming), The relevant arguments
can only be one of the following values.
Access Kind Value Summary

PLD 0 Fetch the addressed location for reading

PST 1 Fetch the addressed location for writing

Cache Level Value Summary

L1 0 Fetch the addressed location to L1 cache

L2 1 Fetch the addressed location to L2 cache

L3 2 Fetch the addressed location to L3 cache

Retention Policy Value Summary

KEEP 0 Temporal fetch of the addressed location (i.e. allocate in cache normally)

STRM 1 Streaming fetch of the addressed location (i.e. memory used only once)

8.6.2 Instruction prefetch


void __pli(T addr);
Generates a code prefetch instruction, if available. If a specific code prefetch instruction is not available, this
intrinsic may generate a data-prefetch instruction to fetch the addressed code to the innermost level of unified
cache. It will not fetch code to data-cache in a split cache level.
void __plix(/*constant*/ unsigned int /*cache_level*/,
/*constant*/ unsigned int /*retention_policy*/,
T addr);
Generates a code prefetch instruction. This intrinsic allows the specification of the cache level to load the code,
the retention policy (temporal or streaming). The relevant arguments can have the same values as in __pldx.

__pldx and __plix arguments „cache level‟ and „retention policy‟ are ignored on unsupported targets.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 37 of 81
Non-Confidential
8.7 NOP
void __nop(void);

generates an unspecified no-op instruction. Note that not all architectures provide a distinguished NOP instruction.
On those that do, it is unspecified whether this intrinsic generates it or another instruction. It is not guaranteed that
inserting this instruction will increase execution time.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 38 of 81
Non-Confidential
9 DATA-PROCESSING INTRINSICS
The intrinsics in this section are provided for algorithm optimization.
The <arm_acle.h> header should be included before using these intrinsics.
Implementations are not required to introduce precisely the instructions whose names match the intrinsics.
However, implementations should aim to ensure that a computation expressed compactly with intrinsics will
generate a similarly compact sequence of machine code. In general, C‟s “as-if rule” [C99 5.1.2.3] applies,
meaning that the compiled code must behave as if the instruction had been generated.
In general, these intrinsics are aimed at DSP algorithm optimization on M-profile and R-profile. Use on A-profile is
deprecated. However, the miscellaneous intrinsics and CRC32 intrinsics described in 9.2 and 9.7 respectively are
suitable for all profiles.

9.1 Programmer’s model of global state


9.1.1 The Q (saturation) flag
The Q flag is a cumulative („sticky‟) saturation bit in the APSR (Application Program Status Register) indicating
that an operation saturated, or in some cases, overflowed. It is set on saturation by most intrinsics in the DSP and
SIMD intrinsic sets, though some SIMD intrinsics feature saturating operations which do not set the Q flag.
[AAPCS 5.1.1] states:
The N, Z, C, V and Q flags (bits 27-31) and the GE[3:0] bits (bits 16-19) are undefined on entry to or
return from a public interface.
Note that this does not state that these bits (in particular the Q flag) are undefined across any C/C++ function call
boundary – only across a “public interface”. The Q and GE bits could be manipulated in well-defined ways by local
functions, for example when constructing functions to be used in DSP algorithms.
Implementations must avoid introducing instructions (such as SSAT/USAT, or SMLABB) which affect the Q flag, if
the programmer is testing whether the Q flag was set by explicit use of intrinsics and if the implementation‟s
introduction of an instruction may affect the value seen. The implementation might choose to model the definition
and use (liveness) of the Q flag in the way that it models the liveness of any visible variable, or it might suppress
introduction of Q-affecting instructions in any routine in which the Q flag is tested.
ACLE does not define how or whether the Q flag is preserved across function call boundaries. (This is seen as an
area for future specification.)
In general, the Q flag should appear to C/C++ code in a similar way to the standard floating-point cumulative
exception flags, as global (or thread-local) state that can be tested, set or reset through an API.
The following intrinsics are available when __ARM_FEATURE_QBIT is defined:
int __saturation_occurred(void);

Returns 1 if the Q flag is set, 0 if not.


void __set_saturation_occurred(int);

Sets or resets the Q flag according to the LSB of the value. __set_saturation_occurred(0) might be used
before performing a sequence of operations after which the Q flag is tested. (In general, the Q flag cannot be
assumed to be unset at the start of a function.)
void __ignore_saturation(void);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 39 of 81
Non-Confidential
This intrinsic is a hint and may be ignored. It indicates to the compiler that the value of the Q flag is not live
(needed) at or subsequent to the program point at which the intrinsic occurs. It may allow the compiler to remove
preceding instructions, or to change the instruction sequence in such a way as to result in a different value of the
Q flag. (A specific example is that it may recognize clipping idioms in C code and implement them with an
instruction such as SSAT that may set the Q flag.)

9.1.2 The GE flags


The GE (Greater than or Equal to) flags are four bits in the APSR. They are used with the 32-bit SIMD intrinsics
described in section 9.5.
There are four GE flags, one for each 8-bit lane of a 32-bit SIMD operation. Certain non-saturating 32-bit SIMD
intrinsics set the GE bits to indicate overflow of addition or subtraction. For 4x8-bit operations the GE bits are set
one for each byte. For 2x16-bit operations the GE bits are paired together, one for the high halfword and the other
pair for the low halfword. The only supported way to read or use the GE bits (in this specification) is by using the
__sel intrinsic.

9.1.3 Floating-point environment


An implementation should implement the features of <fenv.h> for accessing the floating-point runtime
environment. Programmers should use this rather than accessing the VFP FPSCR directly. For example, on a
target supporting VFP the cumulative exception flags (IXC, OFC etc.) can be read from the FPSCR by using the
fetestexcept() function, and the rounding mode (RMode) bits can be read using the fegetround() function.

ACLE does not support changing the DN, FZ or AHP bits at runtime.
VFP “short vector” mode (enabled by setting the Stride and Len bits) is deprecated, and is unavailable on later
VFP implementations. ACLE provides no support for this mode.

9.2 Miscellaneous data-processing intrinsics


The following intrinsics perform general data-processing operations. They have no effect on global state.
[Note: documentation of the __nop intrinsic has moved to 8.7.]

The 64-bit versions of these intrinsics („ll‟ suffix) are new in ACLE 1.1. For completeness and to aid portability
between LP64 and LLP64 models, ACLE 1.1 also defines intrinsics with „l‟ suffix.
uint32_t __ror(uint32_t x, uint32_t y);
unsigned long __rorl(unsigned long x, uint32_t y);
uint64_t __rorll(uint64_t x, uint32_t y);

rotates the argument x right by y bits. y can take any value. These intrinsics are available on all targets.
unsigned int __clz(uint32_t x);
unsigned int __clzl(unsigned long x);
unsigned int __clzll(uint64_t x);

returns the number of leading zero bits in x. When x is zero it returns the argument width, i.e. 32 or 64. These
intrinsics are available on all targets. On targets without the CLZ instruction it should be implemented as an
instruction sequence or a call to such a sequence. A suitable sequence can be found in [Warren] (fig. 5-7).
Hardware support for these intrinsics is indicated by __ARM_FEATURE_CLZ.
unsigned int __cls(uint32_t x);
unsigned int __clsl(unsigned long x);
unsigned int __clsll(uint64_t x);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 40 of 81
Non-Confidential
returns the number of leading sign bits in x. When x is zero it returns the argument width, i.e. 32 or 64. These
intrinsics are available on all targets. On targets without the CLZ instruction it should be implemented as an
instruction sequence or a call to such a sequence. Fast hardware implementation (using a CLS instruction or a
short code sequence involving the CLZ instruction) is indicated by __ARM_FEATURE_CLZ. New in ACLE 1.1.
uint32_t __rev(uint32_t);
unsigned long __revl(unsigned long);
uint64_t __revll(uint64_t);

reverses the byte order within a word or doubleword. These intrinsics are available on all targets and should be
expanded to an efficient straight-line code sequence on targets without byte reversal instructions.
uint32_t __rev16(uint32_t);
unsigned long __rev16l(unsigned long);
uint64_t __rev16ll(uint64_t);

reverses the byte order within each halfword of a word. For example, 0x12345678 becomes 0x34127856. These
intrinsics are available on all targets and should be expanded to an efficient straight-line code sequence on targets
without byte reversal instructions.
int16_t __revsh(int16_t);

reverses the byte order in a 16-bit value and returns the (sign-extended) result. For example, 0x00000080
becomes 0xFFFF8000. This intrinsic is available on all targets and should be expanded to an efficient straight-line
code sequence on targets without byte reversal instructions.
uint32_t __rbit(uint32_t x);
unsigned long __rbitl(unsigned long x);
uint64_t __rbitll(uint64_t x);

reverses the bits in x. These intrinsics are only available on targets with the RBIT instruction.

9.2.1 Examples
#ifdef __ARM_BIG_ENDIAN
#define htonl(x) (uint32_t)(x)
#define htons(x) (uint16_t)(x)
#else /* little-endian */
#define htonl(x) __rev(x)
#define htons(x) (uint16_t)__revsh(x)
#endif /* endianness */
#define ntohl(x) htonl(x)
#define ntohs(x) htons(x)

/* Count leading sign bits */


inline unsigned int count_sign(int32_t x) { return __clz(x ^ (x << 1)); }

/* Count trailing zeroes */


inline unsigned int count_trail(uint32_t x) {
#if (__ARM_ARCH >= 6 && __ARM_ISA_THUMB >= 2) || __ARM_ARCH >= 7
/* RBIT is available */
return __clz(__rbit(x));
#else
unsigned int n = __clz(x & -x); /* get the position of the last bit */
return n == 32 ? n : (31-n);
#endif
}

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 41 of 81
Non-Confidential
9.3 16-bit multiplications
The intrinsics in this section provide direct access to the 16x16 and 16x32 bit multiplies introduced in ARM v5E.
Compilers are also encouraged to exploit these instructions from C code. These intrinsics are available when
__ARM_FEATURE_DSP is defined, and are not available on non-5E targets. These multiplies cannot overflow.
int32_t __smulbb(int32_t, int32_t);

Multiplies two 16-bit signed integers, i.e. the low halfwords of the operands.
int32_t __smulbt(int32_t, int32_t);

Multiplies the low halfword of the first operand and the high halfword of the second operand.
int32_t __smultb(int32_t, int32_t);

Multiplies the high halfword of the first operand and the low halfword of the second operand.
int32_t __smultt(int32_t, int32_t);

Multiplies the high halfwords of the operands.


int32_t __smulwb(int32_t, int32_t);

Multiplies the 32-bit signed first operand with the low halfword (as a 16-bit signed integer) of the second operand.
Return the top 32 bits of the 48-bit product.
int32_t __smulwt(int32_t, int32_t);

Multiplies the 32-bit signed first operand with the high halfword (as a 16-bit signed integer) of the second operand.
Return the top 32 bits of the 48-bit product.

9.4 Saturating intrinsics


9.4.1 Width-specified saturation intrinsics
These intrinsics are available when __ARM_FEATURE_SAT is defined. They saturate a 32-bit value at a given bit
position. The saturation width must be an integral constant expression – see section 4.3.1.
int32_t __ssat(int32_t, /*constant*/ unsigned int);

Saturates a signed integer to the given bit width in the range 1 to 32. For example, the result of saturation to 8-bit
width will be in the range -128 to 127. The Q flag is set if the operation saturates.
uint32_t __usat(int32_t, /*constant*/ unsigned int);

Saturates a signed integer to an unsigned (non-negative) integer of a bit width in the range 0 to 31. For example,
the result of saturation to 8-bit width is in the range 0 to 255, with all negative inputs going to zero. The Q flag is
set if the operation saturates.

9.4.2 Saturating addition and subtraction intrinsics


These intrinsics are available when __ARM_FEATURE_DSP is defined.
The saturating intrinsics operate on 32-bit signed integer data. There are no special „saturated‟ or „fixed point‟
types.
int32_t __qadd(int32_t, int32_t);

Adds two 32-bit signed integers, with saturation. Sets the Q flag if the addition saturates.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 42 of 81
Non-Confidential
int32_t __qsub(int32_t, int32_t);

Subtracts two 32-bit signed integers, with saturation. Sets the Q flag if the subtraction saturates.
int32_t __qdbl(int32_t);

Doubles a signed 32-bit number, with saturation. __qdbl(x) is equal to __qadd(x,x) except that the argument
x is evaluated only once. Sets the Q flag if the addition saturates.

9.4.3 Accumulating multiplications


These intrinsics are available when __ARM_FEATURE_DSP is defined.
int32_t __smlabb(int32_t, int32_t, int32_t);

Multiplies two 16-bit signed integers, the low halfwords of the first two operands, and adds to the third operand.
Sets the Q flag if the addition overflows. (Note that the addition is the usual 32-bit modulo addition which wraps on
overflow, not a saturating addition. The multiplication cannot overflow.)
int32_t __smlabt(int32_t, int32_t, int32_t);

Multiplies the low halfword of the first operand and the high halfword of the second operand, and adds to the third
operand, as for __smlabb.
int32_t __smlatb(int32_t, int32_t, int32_t);

Multiplies the high halfword of the first operand and the low halfword of the second operand, and adds to the third
operand, as for __smlabb.
int32_t __smlatt(int32_t, int32_t, int32_t);

Multiplies the high halfwords of the first two operands and adds to the third operand, as for __smlabb.
int32_t __smlawb(int32_t, int32_t, int32_t);

Multiplies the 32-bit signed first operand with the low halfword (as a 16-bit signed integer) of the second operand.
Adds the top 32 bits of the 48-bit product to the third operand. Sets the Q flag if the addition overflows. (See note
for __smlabb.)
int32_t __smlawt(int32_t, int32_t, int32_t);

Multiplies the 32-bit signed first operand with the high halfword (as a 16-bit signed integer) of the second operand
and adds the top 32 bits of the 48-bit result to the third operand as for __smlawb.

9.4.4 Examples
The ACLE DSP intrinsics can be used to define ETSI/ITU-T basic operations [G.191]:
#include <arm_acle.h>
inline int32_t L_add(int32_t x, int32_t y) { return __qadd(x, y); }
inline int32_t L_negate(int32_t x) { return __qsub(0, x); }
inline int32_t L_mult(int16_t x, int16_t y) { return __qdbl(x*y); }
inline int16_t add(int16_t x, int16_t y) { return (int16_t)(__qadd(x<<16, y<<16) >> 16); }
inline int16_t norm_l(int32_t x) { return __clz(x ^ (x<<1)) & 31; }
...
This example assumes the implementation preserves the Q flag on return from an inline function.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 43 of 81
Non-Confidential
9.5 32-bit SIMD intrinsics
9.5.1 Availability
ARM v6 introduced instructions to perform 32-bit SIMD operations (i.e. two 16-bit operations or four 8-bit
operations) on the ARM general-purpose registers. These instructions are not related to the much more versatile
Advanced SIMD (NEON) extension, whose support is described in section 12.
The 32-bit SIMD intrinsics are available on targets featuring ARM v6 and upwards, including the A and R profiles.
In the M profile they are available in the v7E-M architecture. Availability of the 32-bit SIMD intrinsics implies
availability of the saturating intrinsics.
Availability of the SIMD intrinsics is indicated by the __ARM_FEATURE_SIMD32 predefine.

To access the intrinsics, the <arm_acle.h> header should be included.

9.5.2 Data types for 32-bit SIMD intrinsics


The header <arm_acle.h> should be included before using these intrinsics.

The SIMD intrinsics generally operate on and return 32-bit words consisting of two 16-bit or four 8-bit values.
These are represented as int16x2_t and int8x4_t below for illustration. Some intrinsics also feature scalar
accumulator operands and/or results.
When defining the intrinsics, implementations can define SIMD operands using a 32-bit integral type (such as
„unsigned int‟).

The header <arm_acle.h> defines typedefs int16x2_t, uint16x2_t, int8x4_t and uint8x4_t. These should be
defined as 32-bit integral types of the appropriate sign. There are no intrinsics provided to pack or unpack values
of these types. This can be done with shifting and masking operations.

9.5.3 Use of the Q flag by 32-bit SIMD intrinsics


Some 32-bit SIMD instructions may set the Q flag described in section 9.1.1. The behavior of the intrinsics
matches that of the instructions.
Generally, instructions that perform lane-by-lane saturating operations do not set the Q flag. For example,
__qadd16 does not set the Q flag, even if saturation occurs in one or more lanes.

The explicit saturation operations __ssat and __usat set the Q flag if saturation occurs. Similarly, __ssat16
and __usat16 set the Q flag if saturation occurs in either lane.

Some instructions, such as __smlad, set the Q flag if overflow occurs on an accumulation, even though the
accumulation is not a saturating operation (i.e. does not clip its result to the limits of the type).
In the following descriptions of intrinsics, if the description does not mention whether the intrinsic affects the Q
flag, the intrinsic does not affect it.

9.5.4 Parallel 16-bit saturation


These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. They saturate two 16-bit values to a given
bit width as for the __ssat and __usat intrinsics defined in 9.4.1.
int16x2_t __ssat16(int16x2_t, /*constant*/ unsigned int);

Saturates two 16-bit signed values to a width in the range 1 to 16. The Q flag is set if either operation saturates.
int16x2_t __usat16(int16x2_t, /*constant */ unsigned int);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 44 of 81
Non-Confidential
Saturates two 16-bit signed values to a bit width in the range 0 to 15. The input values are signed and the output
values are non-negative, with all negative inputs going to zero. The Q flag is set if either operation saturates.

9.5.5 Packing and unpacking


These intrinsics are available when __ARM_FEATURE_SIMD32 is defined.
int16x2_t __sxtab16(int16x2_t, int8x4_t);

Two values (at bit positions 0..7 and 16..23) are extracted from the second operand, sign-extended to 16 bits, and
added to the first operand.
int16x2_t __sxtb16(int8x4_t);

Two values (at bit positions 0..7 and 16..23) are extracted from the first operand, sign-extended to 16 bits, and
returned as the result.
uint16x2_t __uxtab16(uint16x2_t, uint8x4_t);

Two values (at bit positions 0..7 and 16..23) are extracted from the second operand, zero-extended to 16 bits, and
added to the first operand.
uint16x2_t __uxtb16(uint8x4_t);

Two values (at bit positions 0..7 and 16..23) are extracted from the first operand, zero-extended to 16 bits, and
returned as the result.

9.5.6 Parallel selection


This intrinsic is available when __ARM_FEATURE_SIMD32 is defined.
uint8x4_t __sel(uint8x4_t, uint8x4_t);

Selects each byte of the result from either the first operand or the second operand, according to the values of the
GE bits. For each result byte, if the corresponding GE bit is set then the byte from the first operand is used,
otherwise the byte from the second operand is used. Because of the way that int16x2_t operations set two
(duplicate) GE bits per value, the __sel intrinsic works equally well on (u)int16x2_t and (u)int8x4_t data.

9.5.7 Parallel 8-bit addition and subtraction


These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. Each intrinsic performs 8-bit parallel
addition or subtraction. In some cases the result may be halved or saturated.
int8x4_t __qadd8(int8x4_t, int8x4_t);

4x8-bit addition, saturated to the range -2**7 to 2**7-1.


int8x4_t __qsub8(int8x4_t, int8x4_t);

4x8-bit subtraction, with saturation.


int8x4_t __sadd8(int8x4_t, int8x4_t);

4x8-bit signed addition. The GE bits are set according to the results.
int8x4_t __shadd8(int8x4_t, int8x4_t);

4x8-bit signed addition, halving the results.


int8x4_t __shsub8(int8x4_t, int8x4_t);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 45 of 81
Non-Confidential
4x8-bit signed subtraction, halving the results.
int8x4_t __ssub8(int8x4_t, int8x4_t);

4x8-bit signed subtraction. The GE bits are set according to the results.
uint8x4_t __uadd8(uint8x4_t, uint8x4_t);

4x8-bit unsigned addition. The GE bits are set according to the results.
uint8x4_t __uhadd8(uint8x4_t, uint8x4_t);

4x8-bit unsigned addition, halving the results.


uint8x4_t __uhsub8(uint8x4_t, uint8x4_t);

4x8-bit unsigned subtraction, halving the results.


uint8x4_t __uqadd8(uint8x4_t, uint8x4_t);

4x8-bit unsigned addition, saturating to the range 0 to 2**8-1.


uint8x4_t __uqsub8(uint8x4_t, uint8x4_t);

4x8-bit unsigned subtraction, saturating to the range 0 to 2**8-1.


uint8x4_t __usub8(uint8x4_t, uint8x4_t);

4x8-bit unsigned subtraction. The GE bits are set according to the results.

9.5.8 Sum of 8-bit absolute differences


These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. They perform an 8-bit sum-of-absolute
differences operation, typically used in motion estimation.
uint32_t __usad8(uint8x4_t, uint8x4_t);

Performs 4x8-bit unsigned subtraction, and adds the absolute values of the differences together, returning the
result as a single unsigned integer.
uint32_t __usada8(uint8x4_t, uint8x4_t, uint32_t);

Performs 4x8-bit unsigned subtraction, adds the absolute values of the differences together, and adds the result to
the third operand.

9.5.9 Parallel 16-bit addition and subtraction


These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. Each intrinsic performs 16-bit parallel
addition and/or subtraction. In some cases the result may be halved or saturated.
int16x2_t __qadd16(int16x2_t, int16x2_t);

2x16-bit addition, saturated to the range -2**15 to 2**15-1.


int16x2_t __qasx(int16x2_t, int16x2_t);

Exchanges halfwords of second operand, adds high halfwords and subtracts low halfwords, saturating in each
case.
int16x2_t __qsax(int16x2_t, int16x2_t);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 46 of 81
Non-Confidential
Exchanges halfwords of second operand, subtracts high halfwords and adds low halfwords, saturating in each
case.
int16x2_t __qsub16(int16x2_t, int16x2_t);

2x16-bit subtraction, with saturation.


int16x2_t __sadd16(int16x2_t, int16x2_t);

2x16-bit signed addition. The GE bits are set according to the results.
int16x2_t __sasx(int16x2_t, int16x2_t);

Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords. The GE bits are
set according to the results.
int16x2_t __shadd16(int16x2_t, int16x2_t);

2x16-bit signed addition, halving the results.


int16x2_t __shasx(int16x2_t, int16x2_t);

Exchanges halfwords of the second operand, adds high halfwords and subtract low halfwords, halving the results.
int16x2_t __shsax(int16x2_t, int16x2_t);

Exchanges halfwords of the second operand, subtracts high halfwords and add low halfwords, halving the results.
int16x2_t __shsub16(int16x2_t, int16x2_t);

2x16-bit signed subtraction, halving the results.


int16x2_t __ssax(int16x2_t, int16x2_t);

Exchanges halfwords of the second operand, subtracts high halfwords and adds low halfwords. The GE bits are
set according to the results.
int16x2_t __ssub16(int16x2_t, int16x2_t);

2x16-bit signed subtraction. The GE bits are set according to the results.
uint16x2_t __uadd16(uint16x2_t, uint16x2_t);

2x16-bit unsigned addition. The GE bits are set according to the results.
uint16x2_t __uasx(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords. The GE bits are
set according to the results of unsigned addition.
uint16x2_t __uhadd16(uint16x2_t, uint16x2_t);

2x16-bit unsigned addition, halving the results.


uint16x2_t __uhasx(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, adds high halfwords and subtracts low halfwords, halving the
results.
uint16x2_t __uhsax(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, subtracts high halfwords and adds low halfwords, halving the
results.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 47 of 81
Non-Confidential
uint16x2_t __uhsub16(uint16x2_t, uint16x2_t);

2x16-bit unsigned subtraction, halving the results.


uint16x2_t __uqadd16(uint16x2_t, uint16x2_t);

2x16-bit unsigned addition, saturating to the range 0 to 2**16-1.


uint16x2_t __uqasx(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, and performs saturating unsigned addition on the high halfwords
and saturating unsigned subtraction on the low halfwords.
uint16x2_t __uqsax(uint16x2_t, uint16x2_t);

Exchanges halfwords of the second operand, and performs saturating unsigned subtraction on the high halfwords
and saturating unsigned addition on the low halfwords.
uint16x2_t __uqsub16(uint16x2_t, uint16x2_t);

2x16-bit unsigned subtraction, saturating to the range 0 to 2**16-1.


uint16x2_t __usax(uint16x2_t, uint16x2_t);

Exchanges the halfwords of the second operand, subtracts the high halfwords and adds the low halfwords. Sets
the GE bits according to the results of unsigned addition.
uint16x2_t __usub16(uint16x2_t, uint16x2_t);

2x16-bit unsigned subtraction. The GE bits are set according to the results.

9.5.10 Parallel 16-bit multiplication


These intrinsics are available when __ARM_FEATURE_SIMD32 is defined. Each intrinsic performs two 16-bit
multiplications.
int32_t __smlad(int16x2_t, int16x2_t, int32_t);

Performs 2x16-bit multiplication and adds both results to the third operand. Sets the Q flag if the addition
overflows. (Overflow cannot occur during the multiplications.)
int32_t __smladx(int16x2_t, int16x2_t, int32_t);

Exchanges the halfwords of the second operand, performs 2x16-bit multiplication, and adds both results to the
third operand. Sets the Q flag if the addition overflows. (Overflow cannot occur during the multiplications.)
int64_t __smlald(int16x2_t, int16x2_t, int64_t);

Performs 2x16-bit multiplication and adds both results to the 64-bit third operand. Overflow in the addition is not
detected.
int64_t __smlaldx(int16x2_t, int16x2_t, int64_t);

Exchanges the halfwords of the second operand, performs 2x16-bit multiplication and adds both results to the 64-
bit third operand. Overflow in the addition is not detected.
int32_t __smlsd(int16x2_t, int16x2_t, int32_t);

Performs two 16-bit signed multiplications. Takes the difference of the products, subtracting the high-halfword
product from the low-halfword product, and adds the difference to the third operand. Sets the Q flag if the addition
overflows. (Overflow cannot occur during the multiplications or the subtraction.)

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 48 of 81
Non-Confidential
int32_t __smlsdx(int16x2_t, int16x2_t, int32_t);

Performs two 16-bit signed multiplications. The product of the high halfword of the first operand and the low
halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high
halfword of the second operand, and the difference is added to the third operand. Sets the Q flag if the addition
overflows. (Overflow cannot occur during the multiplications or the subtraction.)
int64_t __smlsld(int16x2_t, int16x2_t, int64_t);

Perform two 16-bit signed multiplications. Take the difference of the products, subtracting the high-halfword
product from the low-halfword product, and add the difference to the third operand. Overflow in the 64-bit addition
is not detected. (Overflow cannot occur during the multiplications or the subtraction.)
int64_t __smlsldx(int16x2_t, int16x2_t, int64_t);

Perform two 16-bit signed multiplications. The product of the high halfword of the first operand and the low
halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high
halfword of the second operand, and the difference is added to the third operand. Overflow in the 64-bit addition is
not detected. (Overflow cannot occur during the multiplications or the subtraction.)
int32_t __smuad(int16x2_t, int16x2_t);

Perform 2x16-bit signed multiplications, adding the products together. Set the Q flag if the addition overflows.
int32_t __smuadx(int16x2_t, int16x2_t);

Exchange the halfwords of the second operand (or equivalently, the first operand), perform 2x16-bit signed
multiplications, and add the products together. Set the Q flag if the addition overflows.
int32_t __smusd(int16x2_t, int16x2_t);

Perform two 16-bit signed multiplications. Take the difference of the products, subtracting the high-halfword
product from the low-halfword product.
int32_t __smusdx(int16x2_t, int16x2_t);

Perform two 16-bit signed multiplications. The product of the high halfword of the first operand and the low
halfword of the second operand is subtracted from the product of the low halfword of the first operand and the high
halfword of the second operand.

9.5.11 Examples
Taking the elementwise maximum of two SIMD values each of which consists of four 8-bit signed numbers:
int8x4_t max8x4(int8x4_t x, int8x4_t y) { __ssub8(x, y); return __sel(x, y); }
As described in section 9.5.6, where SIMD values consist of two 16-bit unsigned numbers:
int16x2_t max16x2(int16x2_t x, int16x2_t y) { __usub16(x, y); return __sel(x, y); }
Note that even though the result of the subtraction is not used, the compiler must still generate the instruction,
because of its side-effect on the GE bits which are tested by the __sel() intrinsic.

9.6 Floating-point data-processing intrinsics


The intrinsics in this section provide direct access to selected floating-point instructions. They are defined only if
the appropriate precision is available in hardware, as indicated by __ARM_FP (6.5.1).
double __sqrt(double x);
float __sqrtf(float x);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 49 of 81
Non-Confidential
The __sqrt intrinsics compute the square root of their operand. They have no effect on errno. Negative values
will produce a default NaN result and possible floating-point exception as described in [ARM ARM A2.7.7].
double __fma(double x, double y, double z);
float __fmaf(float x, float y, float z);

The __fma intrinsics compute (x*y)+z, without intermediate rounding. These intrinsics are available only if
__ARM_FEATURE_FMA is defined. On a Standard C implementation it should not normally be necessary to use
these intrinsics, as the fma functions defined in [C99 7.12.13] should expand directly to the instructions if
available.
float __rintnf (float);
double __rintn (double);
The __rintn intrinsics perform a floating point round to integral, to nearest with ties to even. The __rintn intrinsic
is available when __ARM_FEATURE_DIRECTED_ROUNDING is defined to 1. For other rounding modes like „to nearest
with ties to away‟ it is strongly recommended that C99 standard functions be used. To achieve a floating point
convert to integer, rounding to „nearest with ties to even‟ operation, use these rounding functions with a type-cast
to integral values, eg.
(int) __rintnf (a);
Will map to a floating point convert to signed integer, rounding to nearest with ties to even operation.

9.7 CRC32 intrinsics


CRC32 intrinsics provide direct access to CRC32 instructions CRC32{C}{B, H, W, X} in both ARMv8 AArch32
and AArch64 execution states. These intrinsics are available when__ARM_FEATURE_CRC32 is defined.
uint32_t __crc32b (uint32_t a, uint8_t b);

Performs CRC-32 checksum from bytes.


uint32_t __crc32h (uint32_t a, uint16_t b);

Performs CRC-32 checksum from half-words.


uint32_t __crc32w (uint32_t a, uint32_t b);

Performs CRC-32 checksum from words.


uint32_t __crc32d (uint32_t a, uint64_t b);

Performs CRC-32 checksum from double words.


uint32_t __crc32cb (uint32_t a, uint8_t b);

Performs CRC-32C checksum from bytes.


uint32_t __crc32ch (uint32_t a, uint16_t b);

Performs CRC-32C checksum from half-words.


uint32_t __crc32cw (uint32_t a, uint32_t b);

Performs CRC-32C checksum from words.


uint32_t __crc32cd (uint32_t a, uint64_t b);

Performs CRC-32C checksum from double words.

To access these intrinsics, <arm_acle.h> should be included.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 50 of 81
Non-Confidential
10 SYSTEM REGISTER ACCESS

10.1 Special register intrinsics


Intrinsics are provided to read and write system and coprocessor registers, collectively referred to as “special
registers”.
uint32_t __arm_rsr(const char *special_register);
reads a 32-bit system register.
uint64_t __arm_rsr64(const char *special_register);
reads a 64-bit system register.
void* __arm_rsrp(const char *special_register);
reads a system register containing an address.
void __arm_wsr(const char *special_register, uint32_t value);
writes a 32-bit system register.
void __arm_wsr64(const char *special_register, uint64_t value);
writes a 64-bit system register.
void __arm_wsrp(const char *special_register, const void *value);
writes a system register containing an address.

10.2 Special register designations


The special_register parameter must be a compile time string literal. This means that the implementation
can determine the register being accessed at compile-time and produce the correct instruction without having to
resort to self-modifying code. All register specifiers are case-insensitive (so "apsr" is equivalent to "APSR"). The
string literal should have one of the forms described below.

10.2.1 AArch32 32-bit coprocessor register


When specifying a 32-bit coprocessor register to __arm_rsr, __arm_rsrp, __arm_wsr, or __arm_wsrp:
cp<coprocessor>:<opc1>:c<CRn>:c<CRm>:<opc2>
or (equivalently)
p<coprocessor>:<opc1>:c<CRn>:c<CRm>:<opc2>
where:
<coprocessor> is a decimal integer in the range [0, 15]
<opc1>, <opc2> are decimal integers in the range [0, 7]
<CRn>, <CRm> are decimal integers in the range [0, 15].

The values of the register specifiers will be as described in [ARM ARM] or the Technical Reference Manual (TRM)
for the specific processor.
So to read MIDR:
unsigned int midr = __arm_rsr("cp15:0:c0:c0:0");

ACLE does not specify predefined strings for the system coprocessor register names documented in the ARM
ARM (e.g. “MIDR”).

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 51 of 81
Non-Confidential
10.2.2 AArch32 32-bit system register
When specifying a 32-bit system register to __arm_rsr, __arm_rsrp, __arm_wsr, or __arm_wsrp, one of:

the values accepted in the spec_reg field of the MRS instruction [ARMARM-AR B6.1.5], e.g. “CPSR”

the values accepted in the spec_reg field of the MSR (immediate) instruction [ARMARM B6.1.6]

the values accepted in the spec_reg field of the VMRS instruction [ARMARM B6.1.14], e.g. “FPSID”

the values accepted in the spec_reg field of the VMSR instruction [ARMARM B6.1.15], e.g. “FPSCR”

the values accepted in the spec_reg field of the MSR and MRS instructions with virtualization extensions
[ARM ARM B1.7], e.g. “ELR_Hyp”

the values specified in „Special register encodings used in ARMv7-M system instructions.‟ [ARMv7M
B5.1.1], e.g. “PRIMASK”

10.2.3 AArch32 64-bit coprocessor register


When specifying a 64-bit coprocessor register to __arm_rsr64 or __arm_wsr64:
cp<coprocessor>:<opc1>:c<CRm>
or (equivalently)
p<coprocessor>:<opc1>:c<Rm>
where:
<coprocessor> is a decimal integer in the range [0, 15]
<opc1> is a decimal integer in the range [0, 7]
<CRm> is a decimal integer in the range [0, 15]

10.2.4 AArch64 system register


When specifying a system register to __arm_rsr, __arm_rsr64, __arm_rsrp, __arm_wsr, __arm_wsr64 or
__arm_wsrp:

"o0:op1:CRn:CRm:op2"

where:

<o0> is a decimal integer in the range [0, 1]


<op1>, <op2> are decimal integers in the range [0, 7]
<CRm>, <CRn> are decimal integers in the range [0, 15]

10.2.5 AArch64 processor state field


When specifying a processor state field to __arm_rsr, __arm_rsp, __arm_wsr, or __arm_wsrp, one of the
values accepted in the pstatefield of the MSR (immediate) instruction [ARMARM-v8 C5.6.130].

10.3 Unspecified behavior


ACLE does not specify how the implementation should behave in the following cases:

when merging multiple reads/writes of the same register

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 52 of 81
Non-Confidential
when writing to a read-only register, or a register that is undefined on the architecture being compiled for

when reading or writing to a register which the implementation models by some other means (this covers
– but is not limited to – reading/writing cp10 and cp11 registers when VFP is enabled, and reading/writing
the CPSR)

when reading or writing a register using one of these intrinsics with an inappropriate type for the value
being read or written to
when writing to a co-processor register that carries out a "System operation"
when using a register specifier which doesn't apply to the targetted architecture.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 53 of 81
Non-Confidential
11 INSTRUCTION GENERATION

11.1 Instruction generation, arranged by instruction


The following table indicates how instructions may be generated by intrinsics, and/or C code. The table includes
integer data processing and certain system instructions.
Compilers are encouraged to use opportunities to combine instructions, or to use shifted/rotated operands where
available. In general, intrinsics are not provided for accumulating variants of instructions in cases where the
accumulation is a simple addition (or subtraction) following the instruction.
The table indicates which architectures the instruction is supported on, as follows:

Architecture „8‟ means ARMv8-A AArch32 and AArch64, „8-32‟ means ARMv8-AArch32 only.

architecture „7‟ means ARM v7-A and ARM v7-R

in the sequence of ARM architectures { 5, 5TE, 6, 6T2, 7 } each architecture includes its predecessor
instruction set

in the sequence of Thumb-only architectures { 6-M, 7-M, 7E-M } each architecture includes its
predecessor instruction set

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 54 of 81
Non-Confidential
Instruction Flgs Arch. Intrinsic or C code SBFX 8,6T2, 7M C

BKPT 5 none SDIV 7M+ C

BFC 6T2, 7M C SEL (GE) 6, 7EM __sel

BFI 6T2, 7M C SETEND 6 n/a

CLZ 5 __clz, __builtin_clz SEV 8,6K __sev

DBG 7, 7M __dbg SHADD16 6, 7EM __shadd16

DMB 8,7, 6M __dmb SHADD8 6, 7EM __shadd8

DSB 8, 7, 6M __dsb SHASX 6, 7EM __shasx

ISB 8, 7, 6M __isb SHSAX 6, 7EM __shsax

LDREX 6, 7M __sync_xxx SHSUB16 6, 7EM __shsub16

LDRT all none SHSUB8 6, 7EM __shsub8

MCR/MRC all see 10 SMC 8,6Z, T2 none

MSR/MRS 6M see 10 SMI 6Z, T2 none

PKHBT 6 C SMLABB Q 5E, 7EM __smlabb

PKHTB 6 C SMLABT Q 5E, 7EM __smlabt

PLD 8-32,5TE, __pld SMLAD Q 6, 7EM __smlad


7M
PLDW 7MP __pldx SMLADX Q 6, 7EM __smladx

PLI 8-32,7 __pli SMLAL all, 7M C

QADD Q 5E, 7EM __qadd SMLALBB 5E, 7EM __smulbb and C

QADD16 6, 7EM __qadd16 SMLALBT 5E, 7EM __smulbt and C

QADD8 6, 7EM __qadd8 SMLALTB 5E, 7EM __smultb and C

QASX 6, 7EM __qasx SMLALTT 5E, 7EM __smultt and C

QDADD Q 5E, 7EM __qadd(__qdbl) SMLALD 6, 7EM __smlald

QDSUB Q 5E, 7EM __qsub(__qdbl) SMLALDX 6, 7EM __smlaldx

QSAX 6, 7EM __qsax SMLATB Q 5E, 7EM __smlatb

QSUB Q 5E, 7EM __qsub SMLATT Q 5E, 7EM __smlatt

QSUB16 6, 7EM __qsub16 SMLAWB Q 5E, 7EM __smlawb

QSUB8 6, 7EM __qsub8 SMLAWT Q 5E, 7EM __smlawt

RBIT 8,6T2, 7M __rbit, __builtin_rbit SMLSD Q 6, 7EM __smlsd

REV 8,6, 6M __rev, __builtin_bswap32 SMLSDX Q 6, 7EM __smlsdx

REV16 8,6, 6M __rev16 SMLSLD 6, 7EM __smlsld

REVSH 6, 6M __revsh SMLSLDX 6, 7EM __smlsldx

ROR all __ror SMMLA 6, 7EM C

SADD16 GE 6, 7EM __sadd16 SMMLAR 6, 7EM C

SADD8 GE 6, 7EM __sadd8 SMMLS 6, 7EM C

SASX GE 6, 7EM __sasx SMMLSR 6, 7EM C

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 55 of 81
Non-Confidential
SMMUL 6, 7EM C UHADD16 6, 7EM __uhadd16

SMMULR 6, 7EM C UHADD8 6, 7EM __uhadd8

SMUAD Q 6, 7EM __smuad UHASX 6, 7EM __uhasx

SMUADX Q 6, 7EM __smuadx UHSAX 6, 7EM __uhsax

SMULBB 5E, 7EM __smulbb; C UHSUB16 6, 7EM __uhsub16

SMULBT 5E, 7EM __smulbt ; C UHSUB8 6, 7EM __uhsub8

SMULTB 5E, 7EM __smultb; C UMAAL 6, 7EM C

SMULTT 5E, 7EM __smultt; C UMLAL all, 7M acc += (uint64_t)x * y

SMULL all, 7M C UMULL all, 7M C

SMULWB 5E, 7EM __smulwb; C UQADD16 6, 7EM __uqadd16

SMULWT 5E, 7EM __smulwt; C UQADD8 6, 7EM __uqadd8

SMUSD 6, 7EM __smusd UQASX 6, 7EM __uqasx

SMUSDX 6, 7EM __smusd UQSAX 6, 7EM __uqsax

SSAT Q 6, 7M __ssat UQSUB16 6, 7EM __uqsub16

SSAT16 Q 6, 7EM __ssat16 UQSUB8 6, 7EM __uqsub8

SSAX GE 6, 7EM __ssax USAD8 6, 7EM __usad8

SSUB16 GE 6, 7EM __ssub16 USADA8 6, 7EM __usad8 + acc

SSUB8 GE 6, 7EM __ssub8 USAT Q 6, 7M __usat

STREX 6, 7M __sync_xxx USAT16 Q 6, 7EM __usat16

STRT all none USAX 6, 7EM __usax

SVC all none USUB16 6, 7EM __usub16

SWP ARM only __swp [deprecated; see USUB8 6, 7EM __usub8


8.5]
SXTAB 6, 7EM (int8_t)x + a UXTAB 6, 7EM (uint8_t)x + i

SXTAB16 6, 7EM __sxtab16 UXTAB16 6, 7EM __uxtab16

SXTAH 6, 7EM (int16_t)x + a UXTAH 6, 7EM (uint16_t)x + i

SXTB 8,6, 6M (int8_t)x UXTB16 6, 7EM __uxtb16

SXTB16 6, 7EM __sxtb16 UXTH 8,6, 6M (uint16_t)x

SXTH 8,6, 6M (int16_t)x VFMA VFPv4 fma, __fma

UADD16 GE 6, 7EM __uadd16 VSQRT VFP sqrt, __sqrt

UADD8 GE 6, 7EM __uadd8 WFE 8,6K, 6M __wfe

UASX GE 6, 7EM __uasx WFI 8,6K, 6M __wfi

UBFX 8,6T2, 7M C YIELD 8,6K, 6M __yield

UDIV 7M+ C

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 56 of 81
Non-Confidential
12 NEON INTRINSICS

12.1 Availability of NEON intrinsics


The NEON intrinsics correspond to operations on the ARM NEON extension. This architectural extension provides
for arithmetic, logical and saturated arithmetic operations on 8-bit, 16-bit and 32-bit integers (and sometimes on
64-bit integers) and on 32-bit and 64-bit floating-point data, arranged in 64-bit and 128-bit vectors.
NEON intrinsics are available if the __ARM_NEON macro is predefined (see 6.5.4), but in order to access NEON
intrinsics it is necessary to include the <arm_neon.h> header.

12.1.1 16-bit floating-point availability


When the 16-bit floating-point data type is available in the scalar VFP instruction set, it is also available in NEON.
This will be indicated by the setting of bit 1 in __ARM_NEON_FP (see 6.5.5).

12.1.2 Fused multiply-accumulate availability


Fused multiply-accumulate is available in the NEON extension when available in the scalar instruction set, as
indicated by __ARM_FEATURE_FMA. When fused multiply-accumulate is available, extra NEON intrinsics are
defined to access it.

12.2 NEON data types


12.2.1 Vector data types
Vector data types are named as a lane type and a multiple. Lane type names are based on the types defined in
<stdint.h>. For example,. int16x4_t is a vector of four int16_t values. The base types are int8_t,
uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float16_t, float32_t,
poly8_t, poly16_t. The multiples are such that the resulting vector types are 64-bit and 128-bit. In AArch64,
float64_t, poly64_t and poly128_t are also base types.

Not all types can be used in all operations. Generally, the operations available on a type correspond to the
operations available on the corresponding scalar type.
ACLE does not define whether int64x1_t is the same type as int64_t, or whether uint64x1_t is the same
type as uint64_t, or whether poly64x1_t is the same as poly64_t e.g. for C++ overloading purposes.

float16 types are only available when the __fp16 type is defined, i.e. when supported by the hardware. As with
scalar (VFP) operations, 16-bit floating-point types cannot be used in arithmetic operations. They can be used in
conversions to and from 32-bit floating-point types, in loads and stores, and in reinterpret operations.

12.2.2 Advanced SIMD Scalar data types


AArch64 supports Advanced SIMD scalar operations that work on standard scalar data types viz. int8_t,
uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float32_t, float64_t.

12.2.3 Vector array data types


Array types are defined for multiples of 2, 3 or 4 of all the vector types, for use in load and store operations, in
table-lookup operations, and as the result type of operations that return a pair of vectors. For a vector type

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 57 of 81
Non-Confidential
<type>_t the corresponding array type is <type>x<length>_t. Concretely, an array type is a structure containing
a single array element called val.

For example an array of two int16x4_t types is int16x4x2_t, and is represented as


struct int16x4x2_t { int16x4_t val[2]; };

Note that this array of two 64-bit vector types is distinct from the 128-bit vector type int16x8_t.

12.2.4 Scalar data types


For consistency, <arm_neon.h> defines some additional scalar data types to match the vector types.

float32_t is defined as an alias for float.

If the __fp16 type is defined, float16_t is defined as an alias for it.

poly8_t and poly16_t are defined as unsigned integer types. It is unspecified whether these are the same type
as uint8_t and uint16_t for overloading and mangling purposes.

12.2.5 Operations on data types


ACLE does not define implicit conversion between different data types. E.g.
int32x4_t x;
uint32x4_t y = x; // No representation change
float32x4_t z = x; // Conversion of integer to floating type

is not portable. Use the vreinterpret intrinsics to convert from one vector type to another without changing
representation, and use the vcvt intrinsics to convert between integer and floating types; for example:
int32x4_t x;
uint32x4_t y = vreinterpretq_u32_s32(x);
float32x4_t z = vcvt_f32_s32(x);

ACLE does not define static construction of vector types. E.g.


int32x4_t x = { 1, 2, 3, 4 };

is not portable. Use the vcreate or vdup intrinsics to construct values from scalars.

In C++, ACLE does not define whether NEON data types are POD types or whether they can be inherited from.

12.2.6 Compatibility with other vector programming models


Programmers should take particular care when combining the Neon Intrinsics API with alternative vector
programming models; ACLE does not specify how the NEON Intrinsics API interoperates with them.
For instance, the GCC vector extension permits
include “arm_neon.h”
...
uint32x2_t x = {0, 1}; // GCC extension.
uint32_t y = vget_lane_s32 (x, 0); // ACLE NEON Intrinsic.
but with this code the value stored in „y‟ will depend on both the target architecture (AArch32 or AArch64) and
whether the program is running in big- or little-endian mode.
It is recommended that NEON Intrinsics be used consistently:

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 58 of 81
Non-Confidential
include “arm_neon.h”
...
const int temp[2] = {0, 1};
uint32x2_t x = vld1_s32 (temp);
uint32_t y = vget_lane_s32 (x, 0);

12.3 Specification of NEON intrinsics


12.3.1 Introduction
The intrinsics are presented in the following order:

construction and deconstruction of vectors

loading and storing vectors

lane-by-lane (SIMD) operations

reductions

rearrangements and table lookups

12.3.2 Explanation of NEON intrinsics templates


In order to present the NEON intrinsics in a compact form, they are specified here in a generic way, as templates.
In this specification, all capital letters in intrinsic names and types are template parameters. Both names and types
are specified with placeholders to be filled in with a vector type parameter T or some type or string derived from T,
as follows:

T is a vector type such as int16x4_t for a vector of four lanes of signed 16-bit integers

Q is the string “q” if T is a 128-bit vector type, the empty string otherwise. This is used in forming the
names of intrinsics

C is the string „b‟, „h‟, „s‟ or „d‟ if T is an Advanced SIMD scalar type of width 8-bit, 16-bit, 32-
bit or 64-bit.ST is the short-form name of the lane type of a vector type T, such as s16 for a signed 16-bit
integer

UST is the unsigned short-form name of the lane type of a vector type T, such as u16 for an unsigned 16-
bit integer.

ET is the element type of the vector type T

EDT is twice the width of the element type of vector type T.

DT, for a 64-bit vector type T, is the 128-bit vector type with lanes twice as wide as T (where this exists).
Where T is 128-bit vector type(„_high_‟ widening intrinsics), DT is a 128-bit vector type where the lane is
twice as wide as lane type in T and half the number of elements in T. It basically represents the widened
top half of T.

HNT, for 128-bit vector type T, is the 128-bit vector type with lanes half as wide but twice in number. This
is used in narrowing operations. UHNT is the same as HNT, but unsigned type.

HT, for a 128-bit vector type T, is the 64-bit vector type with lanes half as wide as T (where this exists).
This is used in narrowing operations. There are no types with 4-bit lanes. UT is the vector type of the
same size and lane size as T but whose lane type is an unsigned integer. This is used as the result of
comparison operations and signed-to-unsigned saturation operations, and as an operand in selection
operations

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 59 of 81
Non-Confidential
IT is the vector type of the same size and lane size as T but whose lane type is a signed integer.

UHT, for a 128-bit vector type T, is the 64-bit vector type with lanes half as wide as T and of unsigned
type. This is used as the result of signed-to-unsigned narrowing saturation operations

T64 is the 64-bit vector type with the same lane type as T

T2, for a 64-bit vector type T, is the 128-bit vector type with the same lane type as T

FT, for a vector type T of 32-bit integral lane type, is the type with lanes of 32-bit floating type

TxN for N from 2 to 4 is an array of T, so where T is an int8x8_t, Tx3 is int8x8x3_t; this is used in
intrinsics which return multiple results, or where input operands consist of multiple vectors. Where N is 1,
the array type is simply T.

B, for a vector type T, is the number of bits in one element of T.

N, for a vector type T, is the number of elements in T.

12.3.2.1 Examples of template type parameters


T Q lanes ET ST DT HT UT T64 T2

int8x8_t 0..7 int8_t s8 int16x8_t n/a uint8x8_t int8x8_t int8x16_t

uint16x8_t q 0..7 uint16_t u16 n/a uint8x8_t uint16x8_t uint16x4_t

float32x4_t q 0..3 float32_t f32 n/a n/a uint32x4_t float32x2_t

12.3.3 Intrinsics with scalar operands


Some NEON vector operations use a scalar (non-vector) value. Depending on the intrinsic, scalar values may be
obtained in one of two ways:

directly supplied as a scalar operand. These intrinsics are identified with the string “_n” in their name.
Depending on the intrinsic, this operand may be a compile-time integral constant (e.g. a shift count), or it
may be a general expression (usually of the same type as the vector lanes).

from one lane of an input vector. These intrinsics are identified with the string “_lane” in their name. The
lane number is the last argument and must be a compile-time constant and within range. The input vector
from which the scalar operand is taken is the preceding operand and is always a 64-bit vector.

12.3.4 Summary of intrinsic naming conventions


All capital letters in intrinsic descriptions in this specification are template parameters. Names are modelled after
the NEON instruction set and generally follow the following scheme:
v[p][q][r]name[u][n][q][x][_high][_lane | laneq][_n][_result]_type

where

q indicates a saturating operation(with the exception of vqtb[l][x] in AArch64 operations where the q
indicates 128-bit index and result operands)

p indicates a pairwise operation.

r indicates a rounding operation

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 60 of 81
Non-Confidential
name is the descriptive name of the basic operation

u indicates signed-to-unsigned saturation

n indicates a narrowing operation

q postfixing the name indicates an operation on 128-bit vectors

x indicates a Advanced SIMD scalar operation in AArch64. It can be one of „b‟, „h‟, „s‟, „d‟.

In AArch64, „_high‟ is used for widening and narrowing operations involving 128-bit operands. For
widening 128-bit operands, „high‟ refers to the top 64-bits of the source operand(s) and for narrowing,
it refers to the top 64-bits of the destination operand.

_n indicates a scalar operand supplied as an argument

_lane indicates a scalar operand taken from the lane of a vector

_laneq indicates a scalar operand taken from the lane of an input vector of 128-bit width.

result is the result type in short form

type is the primary operand type in short form

12.3.5 Lane type classes


class count Types

int 6 int8, int16, int32, uint8, uint16, uint32

int/64 8 int8, int16, int32, int64, uint8, uint16, uint32, uint64

nint 6 int16, int32, int64, uint16, uint32, uint64

snint 3 int16, int32, int64

unint 3 uint16, uint32, uint64

sint 3 int8, int16, int32

sint/64 4 int8, int16, int32, int64

uint/64 4 uint8, uint16, uint32, uint64

sint/f32 4 int8, int16, int32, float32

int16/32 4 int16, uint16, int32, uint32

int32/64 4 int32, uint32, int64, uint64

sint16/32 2 int16, int32

int32 2 int32, uint32

sint64 1 int64

int64 2 int64, uint64

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 61 of 81
Non-Confidential
class count Types

intpoly64 2 int64, uint64, poly64

int64/float 2 int64, uint64, float32, float64

8bit 3 int8, uint8, poly8

int/poly8 7 int8, int16, int32, uint8, uint16, uint32, poly8

int/poly 8 int8, int16, int32, uint8, uint16, uint32, poly8, poly16

int/64/poly 10 int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16

arith 7 int8, int16, int32, uint8, uint16, uint32, float32

arith16/32 5 int16, int32, uint16, uint32, float32

arith/64 9 int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32

arith/poly8 8 int8, int16, int32, uint8, uint16, uint32, poly8, float32

f16 1 float16

f32 1 float32

f32,u32 1 float32,uint32

sintfloat64 1 int64, float64

f64 1 float64

float 2 float32, float64

sint/64/float 6 int8, int16, int32, int64,float32, float64

intfloat64 3 int64, uint64, float64

intfloatpoly64 4 int64, uint64, float64, poly64

sint64/float 3 int64, float32, float64

any 12* int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16, poly64, float32,
(float16)

any/f64 13* int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16, poly64, float32, float64,
(float16)
* Note: float16 is only available if supported in target hardware.

12.3.6 Constructing and deconstructing NEON vectors


The intrinsics in this section construct and deconstruct NEON vectors. In some cases, they may be “free”
operations, in the sense that a compiler can achieve the effect (e.g. of combining two 64-bit vectors into a 128-bit
vector) by register allocation. Also, in many cases the most natural way to construct a NEON vector is to load it
from an array.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 62 of 81
Non-Confidential
T vcreate_ST(uint64_t a);

creates a vector by reinterpreting a 64-bit value. T can be any 64-bit vector type. ARMv8 adds 2 more for
poly64_t and float64_t.
T vdupQ_n_ST(ET value);
T vmovQ_n_ST(ET value);

creates a vector by duplicating a scalar value across all lanes. T can be any vector type for which ET exists. There
are 22 intrinsics. ARMv8 adds 4 more intrinsics for 64-bit and 128-bit vectors of float64_t and poly64_t.
T vdupQ_lane_ST(T64 vec, const int lane);

creates a vector by duplicating one lane of a source vector. T can be any vector type. T64 is the 64-bit vector type
corresponding to T. The scalar value is obtained from a designated lane of the input vector. There are 22
intrinsics. ARMv8 adds 4 more intrinsics for 64 and 128-bit vectors with float64_t and poly64_t elements.
AArch64 supports 128-bit lane-vectors. If the target supports float16_t, this adds 2 more intrinsics.
T vdupQ_laneq_ST(T2 vec, const int lane);

creates a vector by duplicating one lane of a source vector. T can be any vector type. The scalar value is
obtained from a designated lane of the input vector. The lane type of vector type T can be 8-bit, 16-bit, 32-bit, 64-
bit integers, 8-bit, 16-bit, 64-bit polynomial, float32_t and float64_t. There are 26 intrinsics. If the target
supports float16_t, this adds 2 more intrinsics. These are only available for AArch64.
T2 vcombine_ST(T low, T high);

creates a 128-bit vector by combining two 64-bit vectors. T can be any 64-bit vector type. There are 12 intrinsics.
ARMv8 adds 2 more for poly64_t and float64_t.
T vget_high_ST(T2 a);
T vget_low_ST(T2 a);

gets the high, or low, half of a 128-bit vector. There are 24 intrinsics. ARMv8 adds 4 more intrinsics for 128-bit
vectors with float64_t and poly64_t lane type.
T vsetQ_lane_ST(ET value, T vec, const int lane);

sets the specified lane of an input vector to be a new value. There are 24 intrinsics. ARMv8 adds 4 intrinsics for
64-bit and 128-bit vectors for float64_t and poly64_t lane type.
ET vgetQ_lane_ST(T vec, const int lane);

gets the value from the specified lane of an input vector. There are 24 intrinsics. ARMv8 adds 4 intrinsics for 64-bit
and 128-bit vectors for float64_t and poly64_t lane type.

ET vdupC_lane_ST(T vec, const int lane);


and
ET vdupC_laneq_ST(T vec, const int lane);

where T is defined for Advanced SIMD scalar „int/64‟ are aliases of


ET vgetQ_lane_ST(T vec, const int lane);

These intrinsics are part of the AdvSIMD scalar intrinsics and are available only on AArch64.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 63 of 81
Non-Confidential
T‟ vreinterpretQ_ST‟_ST(T a);

reinterprets a vector of one type T as a vector of another type T‟, without any operation taking place. The lane
sizes may be the same or different. For example, vreinterpretq_s8_f32() reinterprets a vector of four 32-bit
floating-point elements as a vector of sixteen 8-bit signed integer elements.
ARMv8 adds new intrinsics to reinterpret 64-bit and 128-bit vectors of float64_t, poly64_t and poly128_t
values to other types and other types to float64_t, poly64_t and poly128_t. This adds 132 new
intrinsics. Reinterprets for poly128_t are of the form
poly128_t vreinterpretq_p128_ST (...);
T vreinterpretq_ST_p128 (poly128_t);

12.3.6.1 Examples
The following “no-op” expressions demonstrate some relationships between these intrinsics:
vcombine_ST(vget_low_ST(a), vget_high_ST(a))
vset_lane_ST(vget_lane_ST(a, N), a, N)
vreinterpret_ST_u8(vreinterpret_u8_ST(a))

12.3.7 NEON loads and stores


There are separate load and store intrinsics for each lane type, but they are implemented as common instructions
determined by lane size and vector size.
T vld1Q_ST(ET const *ptr);

loads a vector elementwise from an array.


T vld1Q_lane_ST(ET const *ptr, T vec, int lane);

loads one lane of a vector.


T vld1Q_dup_ST(ET const *ptr);

loads a single element of type ET and duplicates it toall lanes of a vector.


void vst1Q_ST(ET *ptr, T val);

stores a vector elementwise into an array.


void vst1Q_lane_ST(ET *ptr, T val, const int lane);

stores one element of a vector.


TxN vldNQ_ST(ET const *ptr);

for N from 2 to 4, loads N vectors from an array, with de-interleaving. The array consists of a sequence of sets of
N values. The first element of the array is placed in the first lane of the first vector, the second element in the first
lane of the second vector, and so on. For example, vld3_s32 will load the six 32-bit elements { A, B, C, D, E, F }
into the three 64-bit vectors { DA, EB, FC }. Not available for 64-bit lanes when T is a 128-bit vector type in
AArch32. AArch64 adds support for 64-bit lanes when T is a 128-bit vector type.
TxN vldNQ_dup_ST(ET const *ptr);

for N from 2 to 4, loads a single N-element structure to all lanes of N vectors. N values are loaded, then duplicated
across all lanes. For example, vld3_dup_s16 will load the three consecutive 16-bit elements { A, B, C } and

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 64 of 81
Non-Confidential
produce the three 64-bit vectors { AAAA, BBBB, CCCC }, while the 128-bit vector form vld3q_dup_s16 will
produce the three 128-bit vectors { AAAAAAAA, BBBBBBBB, CCCCCCCC }. Not available for 64-bit lanes when T
is a 128-bit vector type in AArch32. AArch64 adds support for 64-bit lanes when T is a 128-bit vector type.
void vstNQ_ST(ET *ptr, TxN val);

for N from 2 to 4, stores N vectors to an array, with interleaving. Every element of each vector is stored. Not
available for 64-bit lanes when T is a 128-bit vector type in AArch32. AArch64 adds support for 64-bit lanes when
T is a 128-bit vector type.
TxN vldNQ_lane_ST(ET const *ptr, TxN src, const int lane);

for N from 2 to 4, loads a single N-element structure to the designated lane of N vectors. Not available for 64-bit
lanes; not available for 8-bit lanes and 128-bit vectors in AArch32. AArch64 adds support for 8-bit and 64-bit lanes
when T is a 128-bit vector.
void vstNQ_lane_ST(ET *ptr, TxN val, const int lane);

for N from 2 to 4, stores a single N-element structure from the designated lane of N vectors. Not available for 64-
bit lanes; not available for 8-bit lanes and 128-bit vectors in AArch32. AArch64 adds support for 8-bit and 64-bit
lanes when T is a 128-bit vector.
TxN vld1Q_ST_xN(ET const *ptr);

for N from 2 to 4, loads N vectors from an array without de-interleaving. The first element (at the lowest address)
of the array is placed in the first lane of the first vector, the second element in the second lane of the first vector
and so on. For example, vld1_s32_x4 will load the eight 32-bit array elements {A, B, C, D, E, F, G, H} into the
four 64-bit vectors {BA, DC, FE, HG},

void vst1Q_ST_xN(ET *ptr, TxN vec);

for N from 2 to 4, stores N vectors from a register to an array without de-interleaving. The first element (at LSB) of
the register is placed in the lowest address of the array, the second lane of the first vector in the second element
of the array and so on. For example, vst1_s32_x4 will store four 64-bit vectors {BA, DC, FE, HG} into the eight 32-
bit array elements {A, B, C, D, E, F, G, H}.

poly128_t vldrq_p128(const poly128_t *ptr);

Loads a poly128_t value.

poly128_t vstrq_p128(const poly128_t *ptr);

stores a poly128_t value. This is available only on ARMv8 AArch32 and AArch64.

12.3.7.1 Examples
This is an example of iterating through an array, with fixup code for any elements left over:
void scale_values(float *a, int n, float scale) {
int i;
for (i = 0; i < (n & ~3); i+=4) {
vst1q_f32(&a[i], vmulq_n_f32(vld1q_f32(&a[i]), scale));
}
if (i & 2) {
vst1_f32(&a[i], vmul_n_f32(vld1_f32(&a[i]), scale));
i += 2;

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 65 of 81
Non-Confidential
}
if (i & 1) {
a[i] *= scale;
}
}

If the array is known to contain an integral number of whole vectors, fixup code is not necessary.
The fixup code could also be written using Advanced SIMD scalar intrinsics. For example,
void add_values(int64_t *a, int64_t *b, int n) {
int i;
for (i = 0; i < (n & ~3); i+=4) {
vst1q_s64(&a[i], vaddq_s64(vld1q_s64(&a[i]), vld1q_s64(&b[i])));
}
if (i & 2) {
vst1q_s64(&a[i], vaddq_s64(vld1q_s64(&a[i]), vld1q_s64(&b[i])));
i += 2;
}
if (i & 1) {
a[i] = vaddd_s64(a[i], b[i]);
}
}
void qadd_values(int32_t *a, int32_t *b, int n) {
int i;
for (i = 0; i < (n & ~3); i+=4) {
vst1q_s32(&a[i], vqaddq_s32(vld1q_s32(&a[i]), vld1q_s32(&b[i])));
}
if (i & 2) {
vst1_s32(&a[i], vqadd_s32(vld1_s32(&a[i]), vld1_s32(&b[i])));
i += 2;
}
if (i & 1) {
a[i] = vqadds_s32(a[i], b[i]);
}
}

12.3.7.2 Alignment assertions


The AArch32 NEON load and store instructions provide for alignment assertions, which may speed up access to
aligned data (and will fault access to unaligned data). The NEON intrinsics as defined in this document do not
directly provide a means for asserting alignment. Implementations may be able to introduce these assertions by
analyzing the alignment of types or data. In this C++ example, the type alignment of the Point type would allow the
compiler to assert alignment on the NEON loads and stores:
#if !(__cplusplus >= 201103L)
#define alignas(X) __attribute__((aligned(X)))
#endif
struct Point alignas(16) { float32x4_t point; };

void scale_points(Point *a, int n, float scale) {


for (int i = 0; i < n; ++i) {
a[i].point = vmulq_n_f32(a[i].point, scale);
}
}

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 66 of 81
Non-Confidential
Alignment hints are not supported by AArch64.

12.3.8 NEON lane-by-lane operations


The operations in this table perform lane-by-lane operations.
Comparison operations result in a lane of all 1s where the condition is true, all 0s otherwise. The resulting bit
vector is typically used with the vbsl intrinsic.

Saturation clips the result of an operation to the output range when it is narrowed to a smaller result type or shifted
left. Signed-to-unsigned saturation clips a signed result to an unsigned range, so that negative results go to zero.
Rounding is used when a value is shifted right, or when the high part of a result is taken. It effectively adds a value
equivalent to 0.5 bits to the value before truncating it, so values are rounded towards positive infinity.
Variable shift operations are bidirectional, i.e. a shift count is encoded as a signed integer. A shift operation may
be both saturating (when the value is shifted left, or narrowed) and rounding (when the value is shifted right).
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

T vaddQC_ST(T a, T b) 22 arith/64 f64 int64 (F)ADD

DT vaddl_ST(T a, T b) 6 int - - (S/U)ADDLq

DT vaddl_high_ST(T a, T b) 6 - int - (S/U)ADDWq

DT vaddw_ST(DT a, T b) 6 int - - (S/U)ADDWq

DT vaddw_high_ST(DT a, T b) 6 - int - (S/U)ADDWq

T vhaddQ_ST(T a, T b) 12 int - - (S/U)HADD

T vrhaddQ_ST(T a, T b) 12 int - - (S/U)RHADD

T vqaddQC_ST(T a, T b) 24 int/64 - int/64 (S/U)QADD

T vuqaddQC_ST(T a, UT b) 12 - sint/64 sint/64 SUQADD

UT vsqaddQC_UST(UT a, T b) 12 - sint/64 sint/64 USQADD

HT vaddhn_ST(T a, T b) 6 nint - - ADDHN

HNT vaddhn_high_ST(HT r, T a, T b) 6 - nint - ADDHN2

HT vraddhn_ST(T a, T b) 6 nint - - RADDHN

HNT vraddhn_high_ST(HT r, T a, T b) 6 - nint - RADDHN2

T vmulQ_ST(T a, T b) 18 arith/poly8 f64 - (F)MUL

T vmulxQC_ST(T a, T b) 6 - float float FMULX

T vmulxQC_lane_ST(T a, T64 v, 0..N-1) 6 - float float FMULX

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 67 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

T vmulxQC_laneq_ST(T a, T2 v, 0..N-1) 6 - float float FMULX

T vdivQ_ST(T a, T b) 4 - float - FDIV

T vmlaQ_ST(T a, T b, T c) 16 arith f64 - float mla Impl


defined./ MLA

DT vmlal_ST(DT a, T b, T c) 6 int - - (S/U)MLALq

DT vmlal_high_ST(DT a, T b, T c) 6 - int - (S/U)MLALq

T vmlsQ_ST(T a, T b, T c) 16 arith f64 - Float mls Impl


defined./ MLS

DT vmlsl_ST(DT a, T b, T c) 6 int - - (S/U)MLSLq

DT vmlsl_high_ST(DT a, T b, T c) 6 - int - (S/U)MLSLq

T vfmaQ_ST(T a, T b, T c) 4 f32 f64 - FMLA

T vfmaQ_n_ST(T a, T b, ET c) 4 f32 f64 - FMLA

T vfmaQC_lane_ST(T a, T b, T64 v, 0..N-1) 6 - float float FMLA

T vfmaQC_laneq_ST(T a, T b, T2 v, 0..N-1) 6 - float float FMLA

T vfmsQ_ST(T a, T b, T c) 4 f32 f64 - FMLS

T vfmsQ_n_ST(T a, T b, ET c) 4 f32 f64 - FMLS

T vfmsQC_lane_ST(T a, T b, T64 v, 0..N-1) 6 - float float FMLS

T vfmsQC_laneq_ST(T a, T b, T2 v, 0..N-1) 6 - float float FMLS

T vqdmulhQC_ST(T a, T b) 6 sint16/32 - sint16/32 SQDMULH

T vqrdmulhQC_ST(T a, T b) 6 sint16/32 - sint16/32 SQRDMULH

DT vqdmlalC_ST(DT a, T b, T c) 4 sint16/32 - sint16/32 SQDMLALq

DT vqdmlal_high_ST(DT a, T b, T c) 2 - sint16/32 - SQDMLALq

DT vqdmlslC_ST(DT a, T b, T c) 4 sint16/32 - sint16/32 SQDMLSL

DT vqdmlsl_high_ST(DT a, T b, T c) 2 - sint16/32 - SQDMLSL

DT vmull_ST(T a, T b) 7 int/poly8 - - (S/U/P)MULLq

DT vmull_high_ST(T a, T b) 7 - int/poly8 - (S/U/P)MULLq

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 68 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

DT vqdmullC_ST(T a, T b) 4 sint16/32 - sint16/32 SQDMULLq

DT vqdmull_high_ST(T a, T b) 2 - sint16/32 - SQDMULLq

T vsubQC_ST(T a, T b) 22 arith/64 f64 int64 (F)SUB

DT vsubl_ST(T a, T b) 6 int - - (S/U)SUBLq

DT vsubl_high_ST(T a, T b) 6 - int - (S/U)SUBLq

DT vsubw_ST(DT a, T b) 6 int - - (S/U)SUBWq

DT vsubw_high_ST(DT a, T b) 6 - int - (S/U)SUBWq

T vhsubQ_ST(T a, T b) 12 int - - (S/U)HSUB

T vqsubQC_ST(T a, T b) 24 int/64 - int/64 (S/U)QSUB

HT vsubhn_ST(T a, T b) 6 nint - - SUBHN

HNT vsubhn_high_ST(HT r, T a, T b) 6 - nint - SUBHN2

HT vrsubhn_ST(T a, T b) 6 nint - - RSUBHN

HNT vrsubhn_high_ST(HT r, T a, T b) 6 - nint - RSUBHN2

UT vceqQC_ST(T a, T b) 28 arith/poly8 intfloatpoly64 int64/float (F)CMEQ

UT vceqzQC_ST(T a) 28 arith/poly8 intfloatpoly64 int64/float (F)CMEQ

UT vcgeQC_ST(T a, T b) 24 arith intfloat64 int64/float CMGE

UT vcgezQC_ST(T a) 15 - sint/64/float sint64/float (F)CMGE#0

UT vcleQC_ST(T a, T b) 24 arith intfloat64 int64/float CMGT

UT vclezQC_ST(T a) 15 - sint/64/float sint64/float (F)CMLE #0

UT vcgtQC_ST(T a, T b) 24 arith intfloat64 int64/float CMGT

UT vcgtzQC_ST(T a) 15 - sint/64/float sint64/float (F)CMGT #0

UT vcltQC_ST(T a, T b) 24 arith intfloat64 int64/float CMGT

UT vcltzQC_ST(T a) 15 - sint/64/float sint64/float (F)CMLT #0

UT vcageQC_ST(T a, T b) 6 f32 f64 float FACGE

UT vcaleQC_ST(T a, T b) 6 f32 f64 float FACGT

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 69 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

UT vcagtQC_ST(T a, T b) 6 f32 f64 float FACGT

UT vcaltQC_ST(T a, T b) 6 f32 f64 float FACGE

UT vtstQC_ST(T a, T b) 24 int/poly intpoly64 int64 CMTST

T vabdQC_ST(T a, T b) 18 arith f64 float (S/U)ABD

DT vabdl_ST(T a, T b) 6 int - - (S/U)ABDLq

DT vabdl_high_ST(T a, T b) 6 - int - (S/U)ABDLq

T vabaQ_ST(T a, T b, T c) 12 int - - (S/U)ABA

DT vabal_ST(DT a, T b, T c) 6 int - - (S/U)ABALq

DT vabal_high_ST(DT a, T b, T c) 6 - int - (S/U)ABALq

T vmaxQ_ST(T a, T b) 16 arith f64 - (S/U/F)MAX

T vminQ_ST(T a, T b) 16 arith f64 - (S/U/F)MIN

T vmaxnmQ_ST(T a, T b) 4 - float - FMAXNM

T vminnmQ_ST(T a, T b) 4 - float - FMINNM

T vshlQC_ST(T a, IT b) 18 int/64 - int64 (S/U)SHL

T vqshlQC_ST(T a, IT b) 24 int/64 - int/64 (S/U)QSHL

T vrshlQC_ST(T a, IT b) 18 int/64 - int64 (S/U)RSHL

T vqrshlQC_ST(T a, IT b) 24 int/64 - int/64 (S/U)QRSHL

T vshrQC_n_ST(T a, 1..B) 18 int/64 - int64 (S/U)SHR

T vshlQC_n_ST(T a, 0..B-1) 18 int/64 - int64 SHL

T vrshrQC_n_ST(T a, 1..B) 18 int/64 - int64 (S/U)RSHR

T vsraQC_n_ST(T a, T b, 1..B) 18 int/64 - int64 (S/U)SRA

T vrsraQC_n_ST(T a, T b, 1..B) 18 int/64 - int64 (S/U)RSRA

T vqshlQC_n_ST(T a, 0..B-1) 24 int/64 - int/64 (S/U)QSHL

UT vqshluQC_n_ST(T a, 0..B-1) 12 sint/64 - sint/64 SQSHLU

HT vshrn_n_ST(T a, 1..B/2) 6 nint - - SHRN

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 70 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

HNT vshrn_high_n_ST(HT r, T a, 1..B/2) 6 - nint - SHRN2

UHT vqshrunC_n_ST(T a, 1..B/2) 6 snint - snint SQSHRUN

UHNT vqshrun_high_n_ST(UHT r, T a, 1..B/2) 3 - snint - SQSHRUN2

UHT vqrshrunC_n_ST(T a, 1..B/2) 6 snint - snint SQRSHRUN

UHNT vqrshrun_high_n_ST(UHT r, T a, 1..B/2) 3 - snint - SQRSHRUN2

HT vqshrnC_n_ST(T a, 1..B/2) 12 nint - nint (S/U)QSHRN

HNT vqshrn_high_n_ST(HT r, T a, 1..B/2) 6 - nint - (S/U)QSHRN2

HT vrshrn_n_ST(T a, 1..B/2) 6 nint - - RSHRN

HNT vrshrn_high_n_ST(HT r, T a, 1..B/2) 6 - nint - RSHRN2

HT vqrshrnC_n_ST(T a, 1..B/2) 12 nint - nint (S/U)QRSHRN

HNT vqrshrn_high_n_ST(HT r, T a, 1..B/2) 6 - nint - (S/U)QRSHRN2

DT vshll_n_ST(T a, 0..B) 6 int - - (S/U)SHLLq

DT vshll_high_n_ST(T a, 0..B) 6 - int - (S/U)SHLLq

T vsriQC_n_ST(T a, T b, 1..B) 22 int/64/poly - intpoly64 SRI

T vsliQC_n_ST(T a, T b, 0..B-1) 22 int/64/poly - intpoly64 SLI

T vcvtRQC_ST_f32(FT a) 30 - int32 int32 FCVTr(S/U)

T vcvtRQC_ST_f64(FT a) 30 - int64 int64 FCVTr(S/U)

T vcvtQC_n_ST_f32(FT a, 1..32) 6 int32 - int32 FCVTZ(S/U)

T vcvtQC_n_ST_f64(FT a, 1..64) 6 - int64 12.3.9 int FCVTZ(S/U)


64

FT vcvtQC_f32_ST(T a) 6 int32 - int32 (S/U)CVTF

FT vcvtQC_f64_ST(T a) 6 - int64 int64 (S/U)CVTF

FT vcvtQC_n_f32_ST(T a, 1..32) 6 int32 - int32 (S/U)CVTF

FT vcvtQC_n_f64_ST(T a, 1..64) 6 - int64 int64 (S/U)CVTF

HT vcvt_f16_f32(T a) 1 f32 - - FCVTN

HNT vcvt_high_f16_f32(HT r, T a) 1 - f32 - FCVTN2

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 71 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

HT vcvt_f32_f64(T a) 1 - f64 - FCVTN

HNT vcvt_high_f32_f64(HT r, T a) 1 - f64 - FCVTN2

DT vcvt_f32_f16(T a) 1 f16 - - FCVTLq

DT vcvt_high_f32_f16(T a) 1 - f16 - FCVTLq

DT vcvt_f64_f32(T a) 1 - f32 - FCVTLq

DT vcvt_high_f64_f32(T a) 1 - f32 - FCVTLq

HT vcvtxC_f32_f64(T a) 2 - f64 f64 FCVTXN

HNT vcvtx_high_f32_f64(HT r, T a) 1 - f64 - FCVTXN2

T vrndRXQ_ST(T a) 28 - float - FRINTr

HT vmovn_ST(T a) 6 nint - - XTN

HNT vmovn_high_ST(HT r, T a) 6 - nint - XTN2

DT vmovl_ST(T a) 6 int - - (S/U)SHLLq

DT vmovl_high_ST(T a) 6 - int - (S/U)SHLLq

HT vqmovnC_ST(T a) 12 nint - nint SQXTN

HNT vqmovn_high_ST(HT r, T a) 6 - nint - (S/U)QXTN2

UHT vqmovunC_ST(T a) 6 snint - snint SQXTUN

UHNT vqmovun_high_ST(UHT r, T a) 3 - snint - SQXTUN2

T vmlaQ_lane_ST(T a, T b, T64 v, 0..N-1) 10 arith16/32 - - float mla Impl


defined./ MLA

T vmlaQ_laneq_ST(T a, T b, T2 v, 0..N-1) 10 - arith16/32 - float mla Impl


defined./ MLA

DT vmlal_lane_ST(DT a, T b, T64 v, 0..N-1) 4 int16/32 - - (S/U)MLALq

DT vmlal_high_lane_ST(DT a, T b, T64 v, 0..N-1) 4 - int16/32 - (S/U)MLALq

DT vmlal_laneq_ST(DT a, T b, T2 v, 0..N-1) 4 - int16/32 - (S/U)MLALq

DT vmlal_high_laneq_ST(DT a, T b, T2 v, 0..N-1) 4 - int16/32 - (S/U)MLALq

DT vqdmlalC_lane_ST(DT a, T b, T64 v, 0..N-1) 4 sint16/32 - sint16/32 SQDMLALq

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 72 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

DT vqdmlal_high_lane_ST(DT a, T b, T64 v, 0..N-1) 2 - sint16/32 - SQDMLALq

DT vqdmlalC_laneq_ST(DT a, T b, T2 v, 0..N-1) 4 - sint16/32 sint16/32 SQDMLALq

DT vqdmlal_high_laneq_ST(DT a, T b, T2 v, 0..N-1) 2 - sint16/32 - SQDMLALq

T vmlsQ_lane_ST(T a, T b, T64 v, 0..N-1) 10 arith16/32 - - MLS

T vmlsQ_laneq_ST(T a, T b, T2 v, 0..N-1) 10 - arith16/32 - MLS

DT vmlsl_lane_ST(DT a, T b, T64 v, 0..N-1) 4 int16/32 - - (S/U)MLSLq

DT vmlsl_high_lane_ST(DT a, T b, T64 v, 0..N-1) 4 - int16/32 - (S/U)MLSLq

DT vmlsl_laneq_ST(DT a, T b, T2 v, 0..N-1) 4 - int16/32 - (S/U)MLSLq

DT vmlsl_high_laneq_ST(DT a, T b, T2 v, 0..N-1) 4 - int16/32 - (S/U)MLSLq

DT vqdmlslC_lane_ST(DT a, T b, T64 v, 0..N-1) 4 sint16/32 - sint16/32 SQDMLSLq

DT vqdmlsl_high_lane_ST(DT a, T b, T64 v, 0..N-1) 2 - sint16/32 - SQDMLSLq

DT vqdmlslC_laneq_ST(DT a, T b, T2 v, 0..N-1) 4 - sint16/32 sint16/32 SQDMLSLq

DT vqdmlsl_high_laneq_ST(DT a, T b, T2 v, 0..N-1) 2 - sint16/32 - SQDMLSLq

T vmulQ_n_ST(T a, ET b) 12 arith16/32 f64 - MUL

T vmulQC_lane_ST(T a, T64 b, 0..N-1) 14 arith16/32 f64 float MUL

T vmulQC_laneq_ST(T a, T2 b, 0..N-1) 14 arith16/32 f64 float MUL

DT vmull_n_ST(T a, ET b) 4 int16/32 - - (S/U)MULLq

DT vmull_high_n_ST(T a, ET b) 4 - int16/32 - (S/U)MULLq

DT vmull_lane_ST(T a, T64 b, 0..N-1) 4 int16/32 - - (S/U)MULLq

DT vmull_high_lane_ST(T a, T64 b, 0..N-1) 4 - int16/32 - (S/U)MULLq

DT vmull_laneq_ST(T a, T2 b, 0..N-1) 4 - int16/32 - (S/U)MULLq

DT vmull_high_laneq_ST(T a, T2 b, 0..N-1) 4 - int16/32 - (S/U)MULLq

DT vqdmull_n_ST(T a, ET b) 2 sint16/32 - - SQDMULL2

DT vqdmull_high_n_ST(T a, ET b) 2 - sint16/32 - SQDMULL2

DT vqdmullC_lane_ST(T a, T64 b, 0..N-1) 4 sint16/32 - sint16/32 SQDMULLq

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 73 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

DT vqdmull_high_lane_ST(T a, T64 b, 0..N-1) 2 - sint16/32 - SQDMULLq

DT vqdmullC_laneq_ST(T a, T2 b, 0..N-1) 4 - sint16/32 sint16/32 SQDMULLq

DT vqdmull_high_laneq_ST(T a, T2 b, 0..N-1) 2 - sint16/32 - SQDMULLq

T vqdmulhQ_n_ST(T a, ET b) 4 sint16/32 - - SQDMULH

T vqdmulhQC_lane_ST(T a, T64 b, 0..N-1) 6 sint16/32 - sint16/32 SQDMULH

T vqdmulhQC_laneq_ST(T a, T2 b, 0..N-1) 6 - sint16/32 sint16/32 SQDMULH

T vqrdmulhQ_n_ST(T a, ET b) 4 sint16/32 - - SQRDMULH

T vqrdmulhQC_lane_ST(T a, T64 b, 0..N-1) 6 sint16/32 - sint16/32 SQRDMULH

T vqrdmulhQC_laneq_ST(T a, T2 b, 0..N-1) 6 - sint16/32 sint16/32 SQRDMULH

T vmlaQ_n_ST(T a, T b, ET c) 10 arith16/32 - - MLA/Impl defined


for floats

DT vmlal_n_ST(DT a, T b, ET c) 4 int16/32 - - (S/U)MLALq

DT vmlal_high_n_ST(DT a, T b, ET c) 4 - int16/32 - (S/U)MLALq

DT vqdmlal_n_ST(DT a, T b, ET c) 2 sint16/32 - - SQDMLALq

DT vqdmlal_high_n_ST(DT a, T b, ET c) 2 - sint16/32 - SQDMLALq

T vmlsQ_n_ST(T a, T b, ET c) 10 arith16/32 - - MLS/Impl.


Defined for floats

DT vmlsl_n_ST(DT a, T b, ET c) 4 int16/32 - - (S/U)MLSLq

DT vmlsl_high_n_ST(DT a, T b, ET c) 4 - int16/32 - (S/U)MLSLq

DT vqdmlsl_n_ST(DT a, T b, ET c) 2 sint16/32 - - SQDMLSLq

DT vqdmlsl_high_n_ST(DT a, T b, ET c) 2 - sint16/32 - SQDMLSLq

T vabsQC_ST(T a) 12 sint/f32 sintfloat64 sint64 (F)ABS

T vqabsQC_ST(T a) 12 sint sint64 sint/64 SQABS

T vnegQC_ST(T a) 12 sint/f32 sintfloat64 sint64 (F)NEG

T vqnegQC_ST(T a) 12 sint sint64 sint/64 SQNEG

T vclsQ_ST(T a) 6 sint - - CLS

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 74 of 81
Non-Confidential
Template count Types Additional Additional Instruction
supported in Types Types (AArch64
AArch32 supported in supported format)
(vector) AArch64 in
(vector) AArch64
(AdvSIMD
scalar)

T vclzQ_ST(T a) 12 int - - CLZ

T vcntQ_ST(T a) 6 8bit - - CNT

T vrecpeQC_ST(T a) 8 f32,u32 f64 float (F/U)RECPE

T vrecpsQC_ST(T a, T b) 6 f32 f64 float FRECPS

T vsqrtQ_ST(T a) 4 - float - FSQRT

T vrsqrteQC_ST(T a) 8 f32,u32 f64 float (F/U)RSQRTE

T vrsqrtsQC_ST(T a, T b) 6 f32 f64 float FRSQRTS

T vmvnQ_ST(T a) 14 int/poly8 - - MVN

T vandQ_ST(T a, T b) 16 int/64 - - AND

T vorrQ_ST(T a, T b) 16 int/64 - - ORR

T veorQ_ST(T a, T b) 16 int/64 - - EOR

T vbicQ_ST(T a, T b) 16 int/64 - - BIC

T vornQ_ST(T a, T b) 16 int/64 - - ORN

T vbslQ_ST(UT a, T b, T c) 24 any f64 - BSL

T vcopyQ_lane_ST(T a, 0..N-1, T64 b, 0..N-1) 24 - any/f64 - INS

T vcopyQ_laneq_ST(T a, 0..N-1, T2 b, 0..N-1) 24 - any/f64 - INS

T vrbitQ_ST (T a) 6 - 8bit - RBIT

12.3.10 NEON Vector Additions to AArch32 in ARMv8


T vmaxnmQ_ST(T a, T b)
T vminnmQ_ST(T a, T b)

Are available for float32_t on AArch32 and both float32_t and float64_t on AArch64, ARMv8 onwards.
These intrinsics are available when __ARM_FEATURE_NUMERIC_MAXMIN is defined.
T vcvtRQC_ST_f32(FT a)
T vcvtRQC_ST_f64(FT a)

Are available for int32_t and uint32_t on AArch32 and int32_t , uint32_t, int64_t, uint64_t on
AArch64
T vrndRXQ_ST(T a)

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 75 of 81
Non-Confidential
Are available for float32_t on AArch32 and float32_t and float64_t on AArch64.
float32_t vrndns_f32(float32_t a)

is available on AArch32 and AArch64 for round with „Exact Ties to Even‟.
These intrinsics are available when __ARM_FEATURE_DIRECTED_ROUNDING is defined.

poly64_t intrinsics for AArch64 also apply to AArch32 in ARMv8 except where explicitly mentioned otherwise.

12.3.11 NEON vector reductions


Some of these reduction intrinsics are defined only for AArch64.
T vpaddQ_ST(T a, T b);

performs a pairwise add operation. For example, given the two input vectors ABCD and EFGH, the result is
{A+B,C+D,E+F,G+H}. The lane type of T can be any 8-bit, 16-bit,32-bit or 64-bit integer type or float32_t,
float64_t. AArch32 supports only 64-bit vectors while AArch64 supports both 64-bit and 128-bit vectors. There
is no support for 64-bit lane size when vector is 64-bit long.
RT vpaddlQ_ST(T a);
RT vpadalQ_ST(RT a, T b);

adds elements pairwise in the input vector, with a long result. The input elements can be 8-bit, 16-bit or 32-bit
integers. The result vector type RT is the same size as the input vector, with half as many lanes, each of twice the
size. For example, given an int16x4_t input vector {A,B,C,D}, the output vector is the int32x2_t vector
{A+B,C+D}. The vpadal() form accumulates the result with another vector.
T vpmaxQ_ST(T a, T b);
T vpminQ_ST(T a, T b);

performs pairwise maximum or minimum on a pair of input vectors. AArch64 supports input vectors of both 64-bit
and 128-bit, while AArch32 supports only 64-bit input vectors. In AArch32, the input elements can be 8-bit, 16-bit
or 32-bit integers, or float32_t. AArch64 adds support for float64_t. Given inputs {A,B,C,D} and {E,F,G,H},
the output vector (for vpmax) is {max(A,B),max(C,D),max(E,F),max(G,H)}. AArch64 also adds support for two
more variants – vpmaxnm and vpminnm that only support 64 and 128-bit vectors of float32_t and
float64_t. There is no support for 64-bit lane size when vector is 64-bit long.

AArch64 offers new vector reduce intrinsics that operate on vectors and return a scalar quantity.
ET vaddvQ_ST (T v);

Performs addition across lanes of the input vector „v‟ and returns a scalar value. 64-bit and 128-bit vectors are
supported. The lane type of T can be any of 8-bit, 16-bit, 32-bit, 64-bit integers, float32_t or float64_t . 64-
integers and float64_t are supported for 128-bit vectors only.
EDT vaddlvQ_ST (T v);

Performs widened addition across lanes of the input vector „v‟ and returns a scalar value that is twice as wide as
the lane type of T. Both 64-bit and 128-bit vectors are supported. The input elements can be 8-bit, 16-bit and 32-
bit integers.

ET vmaxvQ_ST (T v);
ET vminvQ_ST (T v);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 76 of 81
Non-Confidential
Perform maximum and minimum across the lanes of input vector „v‟. Both 64-bit and 128-bit vectors are
supported. The lanes of T can be any of 8-bit, 16-bit or 32-bit integers. They also support 64-bit and 128-bit
vectors of float32_t and 128-bit vectors of float64_t. 64-bit vectors of float64_t are not supported.

ET vmaxnmvQ_ST (T v);
ET vminnmvQ_ST (T v);

Perform numeric maximum and minimum across the lanes of input vector „v‟. They support 64-bit and 128-bit
vectors of float32_t and 128-bit vectors of float64_t. 64-bit vectors of float64_t are not supported.

12.3.12 NEON vector rearrangements


Like loads and stores, the intrinsics which rearrange vectors are defined for all relevant lane data types, but are
implemented by the same generic instructions.
T vextQ_ST(T a, T b, const int c); where 0 <= c <= (N – 1)

extracts one vector from a pair of concatenated input vectors, starting at a given lane position. Given inputs ABCD
and EFGH, and a lane position of 3, the concatenation EFGHABCD is formed and a vector is extracted at lane 3
to produce a result of FGHA. ARMv8 adds support for float64_t and poly64_t.
T vrevBQ_ST(T vec);

reverses the order of lanes within B-bit sets. For example, vrev32_s8 reverses the order of 8-bit lanes within 32-
bit groups of four lanes in an int8x8_t vector, so that the input ABCDEFGH would result in DCBAHGFE. (At the
machine level, this can also be understood as a SIMD operation on 32-bit elements, reversing the byte order in
each, but to use the vrev intrinsic with int32x2_t vectors it would be necessary to reinterpret the input and
output vector types.) B must be greater than the lane size: i.e. for 8-bit lanes B must be 16, 32 or 64; for 16-bit
lanes B must be 32 or 64; and for 32-bit lanes B must be 64.
Tx2 vzipQ_ST(T a, T b);

interleaves elements pairwise from two vectors, returning a pair (i.e. a 2-element array) of vectors.The inputs
ABCD and EFGH result in AEBF and CGDH. Not available for 64-bit lanes.
Tx2 vuzpQ_ST(T a, T b);

de-interleaves elements from two vectors. The inputs ABCD and EFGH result in ACEG and BDFH. Not available
for 64-bit lanes.
Tx2 vtrnQ_ST(T a, T b);

transposes elements from two vectors, treating them as 2x2 matrices. Not available for 64-bit lanes.
In AArch64, ZIP, UZP and TRN are split into two instructions. They are available for ARMv7 and ARMv8.
The following additional intrinsics are provided to support these operations which are available only on AArch64.
T vzip1Q_ST(T a, T b);

interleaves the elements from lower half of a and b into the result and returns the result which is of the same size
as the input vectors.
T vzip2Q_ST(T a, T b);

Is like vzip1 but for the top halves of a and b.


T vuzp1Q_ST(T a, T b);

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 77 of 81
Non-Confidential
de-interleaves the elements from lower half of a and b into the result and returns the result which is of the same
size as the input vectors.
T vuzp2Q_ST(T a, T b);

Is like vuzp1 but for the top halves of a and b.


T vtrn1Q_ST(T a, T b);

Transposes the elements of lower half of a and b and stores into the result vector.
T vtrn2Q_ST(T a, T b);Is similar to vtrn1, but for the upper half of a and b. Available for 64-bit and 128-
bit vectors of lanes 8, 16, 32 and 64-bits except for 64-bit vectors when lane size is 64 bits.

12.3.13 NEON vector table lookup


T vtblN_ST(TxN a, UT b);

performs 8-bit table lookup. T must be a 64-bit vector type with 8-bit lanes, i.e. int8x8_t, uint8x8_t or
poly8x8_t. The table is supplied in a as an array of 1 to 4 vectors, treated as one large vector consisting of
(respectively) 8 to 32 table entries. The output is formed by using the vector b as a vector of indexes into the table,
and mapping each index by its table entry, or zero if the index is out of range. This operation can be thought of as
either

a lane-by-lane table-lookup operation on b, where the index value in each lane is replaced by the
corresponding table value

or as a general permutation/selection operation on data in a, where the data is rearranged, selected or


duplicated according to the steering information in b.
T vtbxN_ST(T a, TxN b, UT c);

performs an extended table lookup operation. In contrast to vtbl, for vtbx, if the index is out of range, the
resulting lane value is taken from the corresponding lane in the vector a, rather than zero.

In AArch64, the table operations are similar in operation to AArch32, but the table size is always 128-bit and the
index vector can either be 64-bit or 128-bit.
T vqtblNQ_ST (T2xN t, UT idx);

Is similar in operation to ARMv7‟s vtbl, but T2 is always 128-bit. T can be 64-bit or 128-bit i.e. int8x8_t,
uint8x8_t, poly8x8_t or int8x16_t, uint8x16_t or poly8x16_t.
T vqtbxNQ_ST (T a, T2xN t, UT idx);

Is similar in operation to ARMv7‟s vtbx, but T2 is always 128-bit. T can be 64-bit or 128-bit i.e. int8x8_t,
uint8x8_t, poly8x8_t or int8x16_t, uint8x16_t or poly8x16_t.

12.3.14 Crypto Intrinsics


Crypto extension instructions are part of the Advanced SIMD instruction set. These intrinsics are available
when__ARM_FEATURE_CRYPTO is defined.
uint8x16_t vaeseq_u8 (uint8x16_t data, uint8x16_t key);

Performs AES single round encryption.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 78 of 81
Non-Confidential
uint8x16_t vaesdq_u8 (uint8x16_t data, uint8x16_t key);

Performs AES single round decryption.


uint8x16_t vaesmcq_u8 (uint8x16_t data);

Performs AES mix columns.


uint8x16_t vaesimcq_u8 (uint8x16_t data);

Performs AES inverse mix columns.

uint32x4_t vsha1<cpm>q_u32 (uint32x4_t hash_abcd, uint32_t hash_e, uint32x4_t wk);

Performs SHA1 hash update (choose, parity or majority forms).


uint32_t vsha1h_u32 (uint32_t hash_e);

Performs SHA1 fixed rotate.


uint32x4_t vsha1su0q_u32 (uint32x4_t w0_3, uint32x4_t w4_7, uint32x4_t w8_11);

Performs SHA1 schedule update 0.


uint32x4_t vsha1su1q_u32 (uint32x4_t tw0_3, uint32x4_t w12_15);

Performs SHA1 schedule update 1.

uint32x4_t vsha256hq_u32 (uint32x4_t hash_abcd, uint32x4_t hash_efgh, uint32x4_t wk);

Perfroms SHA256 hash update (part 1).


uint32x4_t vsha256h2q_u32 (uint32x4_t hash_efgh, uint32x4_t hash_abcd, uint32x4_t wk);

Performs SHA256 hash update (part 2)


uint32x4_t vsha256su0q_u32 (uint32x4_t w0_3, uint32x4_t w4_7);

Performs SHA256 schedule update 0


uint32x4_t vsha256su1q_u32 (uint32x4_t tw0_3, uint32x4_t w8_11, uint32x4_t w12_15);

Performs SHA256 schedule update 1

poly128_t vmull_p64 (poly64_t, poly64_t);

Performs widening polynomial multiplication on double-words low part. Available on ARMv8 AArch32 and
AArch64.
poly128_t vmull_high_p64 (poly64x2_t, poly64x2_t);

Performs widening polynomial multiplication on double-words high part. Available on ARMv8 AArch32 and
AArch64.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 79 of 81
Non-Confidential
13 FUTURE DIRECTIONS

13.1 Extensions under consideration


13.1.1 Procedure calls and the Q / GE bits
The ARM procedure call standard [AAPCS] says that the Q and GE bits are undefined across public interfaces,
but in practice it is desirable to return saturation status from functions. There are at least two common use cases:

to define small (inline) functions defined in terms of expressions involving intrinsics, which provide
abstractions or emulate other intrinsic families; it is desirable for such functions to have the same well-
defined effects on the Q/GE bits as the corresponding intrinsics

DSP library functions


Options being considered are to define an extension to the “pcs” attribute to indicate that Q is meaningful on the
return, and possibly also to infer this in the case of functions marked as inline.

13.1.2 Returning a value in registers


As a type attribute this would allow things like
struct __attribute__((value_in_regs)) Point { int x[2]; };

This would indicate that the result registers should be used as if the type had been passed as the first argument.
The implementation should not complain if the attribute is applied inappropriately (i.e. where insufficient registers
are available) – it might be a template instance.

13.1.3 Custom calling conventions


Some interfaces may use calling conventions that depart from the AAPCS. Examples include:

using additional argument registers, e.g. passing an argument in R5, R7, R12 etc.

using additional result registers, e.g. R0 and R1 for a combined divide-and-remainder routine (note that
some implementations may be able to support this by means of a “value in registers” structure return)

returning results in the condition flags

preserving and possibly setting the Q (saturation) bit

13.1.4 Traps: system calls, breakpoints etc.


This release of ACLE does not define how to invoke a SVC (supervisor call), BKPT (breakpoint) etc.
One option would be to mark a function prototype with an attribute, e.g.
int __attribute__((svc(0xAB))) system_call(int code, void const *params);

When calling the function, arguments and results would be marshalled according to the AAPCS, the only
difference being that the call would be invoked as a trap instruction rather than a branch-and-link.
One issue is that some calls may have non-standard calling conventions. (For example, ARM Linux system calls
expect the code number to be passed in R7.)

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 80 of 81
Non-Confidential
Another issue is that the code may vary between ARM and Thumb state. This issue could be addressed by
allowing two numeric parameters in the attribute.

13.1.5 Mixed-endian data


Extensions for accessing data in different endianness have been considered. However, this is not an issue
specific to the ARM architecture, and it seems better to wait for a lead from language standards.

13.1.6 Memory access with non-temporal hints.


Supporting memory access with cacheability hints through language extensions is being investigated. Eg.
int *__attribute__((nontemporal)) p;
as a type attribute, will allow indirection of „p‟ with non-temporal cacheability hint.

13.2 Features not considered for support


13.2.1 VFP vector mode
The “short vector” mode of the original VFP architecture is now deprecated, and unsupported in recent
implementations of the ARM floating-point instructions set. There is no plan to support it through C extensions.

13.2.2 Bit-banded memory access


The bit-banded memory feature of certain Cortex-M cores is now regarded as being outside the architecture, and
there is no plan to standardize its support.

IHI 0053C Copyright © 2011-2014 ARM Limited. All rights reserved. Page 81 of 81
Non-Confidential

You might also like