Android NDK & ARM NEON instruction set extension support
--------------------------------------------------------

Introduction:
-------------

Android NDK r3 added support for the new 'armeabi-v7a' ARM-based ABI
that allows native code to use two useful instruction set extensions:

- Thumb-2, which provides performance comparable to 32-bit ARM
  instructions with similar compactness to Thumb-1

- VFPv3, which provides hardware FPU registers and computations,
  to boost floating point performance significantly.

  More specifically, by default 'armeabi-v7a' only supports
  VFPv3-D16 which only uses/requires 16 hardware FPU 64-bit registers.

For more information, see docs/CPU-ARCH-ABIS.html.

The ARMv7 Architecture Reference Manual also defines another optional
instruction set extension known as "ARM Advanced SIMD", nick-named
"NEON". It provides:

- A set of interesting scalar/vector instructions and registers
  (the latter are mapped to the same chip area as the FPU ones),
  comparable to MMX/SSE/3DNow! in the x86 world.

- VFPv3-D32 as a requirement (i.e. 32 hardware FPU 64-bit registers,
  instead of the minimum of 16).

Not all ARMv7-based Android devices will support NEON, but those that
do may benefit in significant ways from the scalar/vector instructions.

The NDK can compile whole modules, or even individual source files, with
NEON support. In practice, a specific compiler flag enables both the GCC
ARM NEON intrinsics and VFPv3-D32 at the same time. The intrinsics are
described here:

    http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
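
To give a feel for what the intrinsics look like, here is a minimal sketch
that adds two float arrays. The function name 'add_f32' is hypothetical and
not part of the NDK. The NEON path is guarded by the compiler's predefined
__ARM_NEON / __ARM_NEON__ macros, so the same file still builds (using the
scalar loop only) when NEON is not enabled:

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* NEON path: handle four floats per iteration in 128-bit registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load a[i..i+3] */
        float32x4_t vb = vld1q_f32(b + i);      /* load b[i..i+3] */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* store the four sums */
    }
#endif

    /* Scalar path: the remaining tail, or everything in a non-NEON build. */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Note that the guard above is a compile-time check only. It says nothing
about whether the device running the code actually supports NEON.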


LOCAL_ARM_NEON:
---------------

Define LOCAL_ARM_NEON to 'true' in your module definition, and the NDK
will build all its source files with NEON support. This can be useful if
you want to build a static or shared library that specifically contains
NEON code paths.


Using the .neon suffix:
-----------------------

When listing source files in your LOCAL_SRC_FILES variable, you now have
the option of using the .neon suffix to indicate that you want the
corresponding source(s) to be built with NEON support. For example:

  LOCAL_SRC_FILES := foo.c.neon bar.c

This builds only 'foo.c' with NEON support; 'bar.c' is built without it.

Note that the .neon suffix can be used with the .arm suffix too (used to
specify the 32-bit ARM instruction set for non-NEON instructions), but must
appear after it.

In other words, 'foo.c.arm.neon' works, but 'foo.c.neon.arm' does NOT.


Build Requirements:
-------------------

NEON support only works when targeting the 'armeabi-v7a' ABI; otherwise the
NDK build scripts will complain and abort. It is important to use checks like
the following in your Android.mk:

   # define a static library containing our NEON code
   ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
      include $(CLEAR_VARS)
      LOCAL_MODULE    := mylib-neon
      LOCAL_SRC_FILES := mylib-neon.c
      LOCAL_ARM_NEON  := true
      include $(BUILD_STATIC_LIBRARY)
   endif # TARGET_ARCH_ABI == armeabi-v7a


Runtime Detection:
------------------

As said previously, NOT ALL ARMv7-BASED ANDROID DEVICES WILL SUPPORT NEON!
It is thus crucial to perform runtime detection to determine whether the
NEON-capable machine code can be run on the target device.

To do that, use the 'cpufeatures' library that comes with this NDK. To learn
more about it, see docs/CPU-FEATURES.html.

You should explicitly check that android_getCpuFamily() returns
ANDROID_CPU_FAMILY_ARM, and that android_getCpuFeatures() returns a value
that has the ANDROID_CPU_ARM_FEATURE_NEON flag set, as in:

    #include <cpu-features.h>

    ...
    ...

    if (android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
        (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0)
    {
        // use NEON-optimized routines
        ...
    }
    else
    {
        // use non-NEON fallback routines instead
        ...
    }

    ...
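
One possible way to organize this check is to perform it once and keep a
function pointer to the selected routine. The sketch below assumes a
hypothetical 'sum_neon' routine built in a separate NEON-enabled source
file, and a hypothetical HAVE_SUM_NEON macro defined only when that file
is part of the build. Only android_getCpuFamily(), android_getCpuFeatures()
and the two ANDROID_CPU_* constants come from the 'cpufeatures' library;
every other name is made up for illustration:

```c
#include <stddef.h>

#ifdef __ANDROID__
#include <cpu-features.h>   /* provided by the NDK 'cpufeatures' library */
#endif

typedef float (*sum_fn)(const float *v, size_t n);

/* Portable fallback (hypothetical name). */
static float sum_c(const float *v, size_t n)
{
    float total = 0.0f;
    size_t i;
    for (i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}

#ifdef HAVE_SUM_NEON
/* Hypothetical NEON routine, assumed to live in a file built with NEON. */
float sum_neon(const float *v, size_t n);
#endif

/* Returns non-zero if the running device reports NEON support. */
int device_has_neon(void)
{
#ifdef __ANDROID__
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#else
    return 0;   /* assumption: outside Android, always use the fallback */
#endif
}

/* Picks the routine to use; call once and cache the result. */
sum_fn select_sum(void)
{
#ifdef HAVE_SUM_NEON
    if (device_has_neon()) {
        return sum_neon;
    }
#endif
    return sum_c;
}
```

A caller would typically store the result of select_sum() at startup and
call through the pointer afterwards, so the feature check is not repeated
in hot loops.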

Sample code:
------------

Look at the source code for the "hello-neon" sample in this NDK for an example
of how to use the 'cpufeatures' library and NEON intrinsics at the same time.

The sample implements a tiny benchmark of a FIR filter loop, with a plain C
version and a NEON-optimized version for devices that support it.
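
For readers who have not met FIR filters before, the following is a rough
sketch of the kind of loop involved. It is NOT the code from the
"hello-neon" sample: the names 'fir_c' and 'fir_neon', the use of float
data and the argument layout are all assumptions made for illustration.
Each output is the dot product of 'taps' coefficients with a sliding
window of the input, so 'in' must hold at least n_out + taps - 1 samples:

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Plain C version: out[i] = sum over k of in[i + k] * coef[k]. */
void fir_c(float *out, const float *in, const float *coef,
           size_t n_out, size_t taps)
{
    size_t i, k;
    for (i = 0; i < n_out; i++) {
        float sum = 0.0f;
        for (k = 0; k < taps; k++) {
            sum += in[i + k] * coef[k];
        }
        out[i] = sum;
    }
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
/* NEON version: accumulates four taps at a time, then sums the lanes. */
void fir_neon(float *out, const float *in, const float *coef,
              size_t n_out, size_t taps)
{
    size_t i, k;
    for (i = 0; i < n_out; i++) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        float32x2_t half;
        float sum;

        for (k = 0; k + 4 <= taps; k += 4) {
            /* acc += in[i+k .. i+k+3] * coef[k .. k+3], lane by lane */
            acc = vmlaq_f32(acc, vld1q_f32(in + i + k), vld1q_f32(coef + k));
        }

        /* Horizontal add of the four accumulator lanes. */
        half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        half = vpadd_f32(half, half);
        sum = vget_lane_f32(half, 0);

        /* Leftover taps when 'taps' is not a multiple of four. */
        for (; k < taps; k++) {
            sum += in[i + k] * coef[k];
        }
        out[i] = sum;
    }
}
#endif
```

Because floating-point additions happen in a different order in the two
versions, their results may differ slightly in the last bits. A benchmark
comparing them should allow for a small tolerance.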