Wiki Tools:

CoreFunc

From CorePy

Jump to: navigation, search

CoreFunc is a framework for implementing NumPy's ufunc operations using CorePy. This project was developed by Andrew Friedley as a Google Summer of Code 2009 project. The project consists of three major components:

  • The CoreFunc Framework: abstracts aspects of code generation and ufunc creation that all ufunc operations share in common.
  • Existing Ufunc Implementations: Use the framework to implement existing ufunc operations, taking advantage of x86_64 hardware features such as SSE and multi-core. The framework also implements a ufunc invocation wrapper that optionally splits the computational work across multiple cores via threading.
  • Custom Ufunc Implementations: Use the framework to implement more complex, customized ufunc operations to more fully utilize the hardware and obtain higher performance.


The code is currently available on a branch, though will eventually be included in the main CorePy codebase:

SVN: https://svn.osl.iu.edu/svn/corepy/branches/afriedle-gsoc-ufunc

Trac: http://svn.osl.iu.edu/trac/corepy/browser/branches/afriedle-gsoc-ufunc


During the course of the project sem-regular updates were posted on this blog:

http://numcorepy.blogspot.com


Contents

[edit] Compiling and Using CoreFunc

First, obtain the source code using the SVN url above. Then, run setup.py in the 'corefunc' directory like the following:

$ cd corefunc
$ python setup.py build_ext -i -f

Also, make sure to add the corefunc directory to your PYTHONPATH:

$ export PYTHONPATH=$PYTHONPATH:`pwd`/corefunc

Once the framework is compiled, use it like the following:

>>> import numpy
>>> import corefunc
 
>>> a = numpy.arange(10, dtype=numpy.int32)
>>> b = numpy.arange(10, dtype=numpy.int32)
 
# NumPy ufunc
>>> numpy.add(a, b)
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18], dtype=int32)
 
# CorePy ufunc
>>> corefunc.add(a, b)
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18], dtype=int32)
 
# Reduction works the same too: 
>>> corefunc.add.reduce(a)
45


[edit] Using Operations included with CoreFunc

Several of the existing ufunc operations were implemented primarily to aid in testing and design of the framework itself. Furthermore, they acted as experiments to find various ways in which the x86_64 architecture can be used to achieve higher performance. They should also be useful to users, acting as drop-in replacements for the existing NumPy ufuncs.

The following Ufuncs have been implemented:

  • Addition: Supports int{8,16,32,64} and float{32,64} types and reduction.
>>> c = corefunc.add(a, b)
  • Multiplication: Same as addition
>>> c = corefunc.multiply(a, b)
  • Maximum: supports int{32,64} and float{32,64} types and reduction.
>>> c = corefunc.maximum(a, b)
  • Greater: supports int32 type; reduction is not supported.
>>> c = corefunc.greater(a, b)
  • Cosine: Two different implementations are available, both supporting float{32,64} types. Reduction does not apply to unary operations. The 'cos' ufunc implements a polynomial approximation of cosine using SSE, trading accuracy for performance. The 'cos_x87' ufunc uses the x87 cosine instructions and is highly accurate but much slower.
>>> c = corefunc.cos(a)
>>> d = corefunc.cos_x87(b)


[edit] Multicore Support

The corefunc framework supports automatic parallelization of ufunc operations across multiple cores. However this is off by default. The number of desired cores should be specified using the set_threads() method:

>>> import corefunc
>>> corefunc.set_threads(2)

Only 1, 2, 4, or 8 cores/threads are currently supported.


[edit] Testing Harness

A testing harness has been developed for the included ufunc operations, contained in 'test.py' at the top level of the CoreFunc code. Running the script will test each of the operations and indicate a pass/fail status. Several failures are to be expected; this is due primarily to rounding error in floating point types incurred due to differences in the ufunc implementations (the faster/less accurate cosine for example) and a different order of operations for reductions when multiple threads/cores are used.


[edit] Using CoreFunc to Develop Custom Ufunc Operations

The real power of the CoreFunc framework is the ability to develop new, arbitrary ufunc operations. In particular, code can be written to combine several operations into a single ufunc. The result is potentially huge gains in performance due to reduced memory accesses and use of specialized x86_64 instructions.

A couple of custom ufuncs are included with the framework:

  • Add/Subtract: Takes two arrays as input and outputs two arrays. The first output array is the sum of each respective element, while the second output array is the difference of each respective element in the two source arrays. int32 and float32 types are supported; the main purpose of this ufunc was to test multiple-output functionality.
  • Vector Normalization: Computes the equivalent of the following NumPy-based code:
def npy_normal(a, b):
l = numpy.sqrt(a**2 + b**2)
return (a / l, b / l)
In other words, takes each respective pair of elements from two input arrays as a 2-D vector and normalizes them to a length of 1. Two arrays are output, one containing each component of the vectors. float{32,64} types are supported. The code for the float32 case is particularly interesting; the reciprocal square-root instruction was combined with a slightly modified iteration of newton-raphson to compute the square root and multiply the two vector components.


[edit] Particle Simulation Demo

A particle simulation demo is included with CoreFunc. This demo was originally written by Chris Mueller, and had two implementations -- one using Numeric, and the other using CorePy and PowerPC/Altivec. The point was to show what kind of performance gains can be had using CorePy. The version of this demo included with CoreFunc has been updated for NumPy, and includes a custom ufunc that performs all of the particle update calculation in one pass over the position & velocity arrays. The performance difference observed on a quad-core xeon system with an ATI 3970x2 graphics card is roughly 100x when simulating 200,000 particles.


[edit] The CoreFunc API

The framework interface for creating new ufuncs consists of three functions:

[edit] gen_ufunc(num_inputs, num_outputs, main_fn, special_fn, iter_elems, reduce_fn = None, ident_fn = None, data_type = None, data_size = None, init_fn = None, final_fn = None)

Generate a CorePy synthetic program for a single ufunc operation and data type. Parameters are as follows:

  • num_inputs Number of array inputs to the ufunc
  • num_outputs Number of array outputs from the ufunc
  • main_fn Synthetic component generating code for the main (SIMD) operation loop body. Expected to have this prototype:
def main_fn(prgm, reg):
Should return an InstructionStream containing a loop body processing iter_elems sequential elements at a time. See the function notes below for information on passed parameters.
  • special_fn Synthetic component generating code for the special-case operation loop body. The special case handles arrays with length less than iter_elems, and remaining elements when the length is not a multiple of iter_elems. Prototype:
def special_fn(prgm, reg):
Should return an InstructionStream containing a loop body processing a single element at a time. See the function notes below for information on passed parameters.
  • iter_elems Number of elements the main operation loop body should process at a time. Must have a value of 2, 4, 8, 16, or 32.
  • reduce_fn A two-operand instruction to be used for the reduction case. Normally an ISA instruction is specified directly, however a custom instruction can be specified as a function with the following prototype:
def reduce_fn(r_acc, r_src):
The r_acc argument is the register containing the reduction result accumulated so far, r_src contains the value to reduce into the accumulator. This function should assume that active code is set, and generate code into that InstructionStream.
Since reduction does not make sense for all operations, reduce_fn is an optional parameter. If reduce_fn is not specified, the ident_fn, data_type, and data_size parameters need not be specified either (they only apply to reduction).
  • ident_fn Identity function for setting the initial value of the reduction accumulator. Several pre-defined functions are available:
    • ident_gp_0 Set 0 in a general-purpose (integer) register.
    • ident_gp_1 Set 1 in a general-purpose (integer) register.
    • ident_xmm_0 Set 0 in a SSE (any type) register.
    • ident_xmm32_1 Set 1 in a SSE (32-bit float) register.
    • ident_xmm64_1 Set 1 in a SSE (64-bit float) register.
  • data_type Register type to use for the reduction. Must be one of GPRegister{8,16,32,64} or XMMRegister.
  • data_size Size of the register to use for the reduction, in bits. Must be one of 8, 16, 32, or 64.
  • init_fn Optional function to be called after the ufunc initializes, but before determining any special cases for the operation. The following prototype is expected:
def init_fn(prgm, regs):
Should return an InstructionStream containing any ufunc-specific initialization code. See the function notes below for information on passed parameters.
  • final_fn Similar to init_fn, but called after the ufunc operation is carried out, just before returning. Prototype:
def final_fn(prgm, regs):
Should return an InstructionStream containing any ufunc-specific initialization code. See the function notes below for information on passed parameters.
[edit] Function Notes:

The main_fn, special_fn, init_fn and final_fn parameters are expected to have a specific function prototype; they should expect prgm, a CorePy program object, and reg, a dictionary of registers with the following string keys:

  • args Array of registers containing pointers to the next element(s) in their respective arrays to process.
  • steps Array of registers containing the number of bytes to step forward in memory to access the next element.
  • count Loop counter register containing the index of the next element(s) to process.

The registers of the args and steps arrays are in order corresponding to the input and output arrays of the ufunc with input arrays being first. For a ufunc with 2 inputs and one output, args[0] and args[1] are pointers to the input array, while args[2] is the output array.

[edit] Synthetic Program Register Allocation

A number of general-purpose registers are used internally by the ufunc framework code, and should not be modified by user-defined code. The table below indicates which registers may be used in what cases. Registers that are reserved during reduction may be used freely except by reduction code. If one of these registers is initialized in init_fn(), it will be overwritten by the reduction and be undefined when final_fn() is called.

The state of the x87 floating point unit must be returned to its initial state when the ufunc completes; i.e. all registers must be popped.

The stack is also available for use; user-defined code is responsible for returning the stack (rsp) to its initial state. The frame pointer may also be used (rbp) as long as its value is restored when the ufunc completes.

Register(s) Availability
rax,rbx Reserved during reduction on GPRegister{8,16,32,64} and XMMRegister types
rcx,rdx Always available
rdi,rsi,r{8-15} Always reserved
rbp,rsp Always available; must be returned to original state before ufunc completes
xmm{0,1} Reserved during reduction on XMMRegister types
xmm{2-15} Always available
mm{0-7} Always available
st{0-7} Always available


[edit] Return Value

A CorePy Program object implementing a ufunc operation for a specific data type is returned from gen_ufunc(). Called from C, the synthetic program's function prototype matches the PyUfuncGenericFunction typdef defined in the Ufunc C API:

void ufunc(char** args, npy_intp* dimensions, npy_intp* steps, void* data);

To create a useable ufunc, the synthetic program should be passed to the create_ufunc() method, documented below:


[edit] create_ufunc(name, identity, num_inputs, num_outputs, typeinfo)

Create a new ufunc using a set of synthetic programs. Parameters:

  • name Name of the ufunc as a string. 'add' and 'multiply' are handled specially; inputs of integer type smaller than the native word size are internally upcast to the integer native word size by NumPy.
  • identity An identity constant NPY_ONE or NPY_ZERO.
  • num_inputs Number of input arrays.
  • num_outputs Number of output arrays.
  • typeinfo A tuple containing a set of pairs (tuples) of the following:
    • prgm A synthetic program for this type.
    • data_Type A NumPy constant representing this data type; e.g. NPY_INT32or NPY_FLOAT64.

The specified synthetic programs are wrapped by a C function that first splits the array into suitable work segments for the number of threads specified by the last call to set_threads(). The synthetic programs are then invoked with the appropriate arguments to execute over the respective work segments.

A callable ufunc is returned as a Python object.


[edit] Using the CoreFunc API (Examples)

Using the API is pretty straight forward. Below is the code to generate a synthetic program performing addition with an int64 data type:

def gen_ufunc_add_int64():
  def main_fn(prgm, reg):
    code = prgm.get_stream()
    # This code uses xmm0/xmm1 explicitly
    
    code += x86.movdqa(xmm0, MemRef(reg['args'][0], data_size = 128))
    code += x86.movdqa(xmm1, MemRef(reg['args'][0], 16, data_size = 128))
    code += x86.paddq(xmm0, MemRef(reg['args'][1], data_size = 128))
    code += x86.paddq(xmm1, MemRef(reg['args'][1], 16, data_size = 128))
    code += x86.movntdq(MemRef(reg['args'][2], data_size = 128), xmm0)
    code += x86.movntdq(MemRef(reg['args'][2], 16, data_size = 128), xmm1)
    return code
    
  def special_fn(prgm, reg): 
    code = prgm.get_stream()
    # This code uses rax explicitly
    
    code += x86.mov(rax, MemRef(reg['args'][0]))
    code += x86.add(rax, MemRef(reg['args'][1]))
    code += x86.movnti(MemRef(reg['args'][2]), rax)
    return code    
  
  return gen_ufunc(2, 1, main_fn, special_fn, 8, x86.add, ident_gp_0, GPRegister64, 64)

First main_fn() is defined, matching the prototype expected by gen_ufunc(). A new InstructionStream object is obtained from the program, into which the an unrolled and vectorized loop body is generated. In this particular case, the addition operation is applied to 4 elements at a time. Each XMM (SSE) register holds only two elements, so the code is duplicated once and interleaved to compute 4 elements simultaneously. In other words, both SIMD vectorization and partial loop unrolling are being applied here.

The special_fn() method creates an InstructionStream and generates scalar addition code, summing one pair of elements at a time. main_fn() is used to process as many elements as possible, followed by special_fn() to process any remaining elements if the array lengths are not a multiple of the number of elements main_fn() processes per iteration.

A complete ufunc implementation will have a set of synthetic programs, defined similarly to the example above, for every supported data type. A ufunc is created from these synthetic programs by doing the following:

# Generate the synthetic programs
# Repeat for each data type
__add_int32_code = gen_ufunc_add_int32()
__add_int64_code = gen_ufunc_add_int64()
__add_float32_code = gen_ufunc_add_float32()
__add_float64_code = gen_ufunc_add_float64()
 
# Create the ufunc:
add = create_ufunc("add", corefuncint.NPY_ZERO, 2, 1,
                        # Specify each synthetic program and data type
                        ((__add_int32_code, corefuncint.NPY_INT32),
                         (__add_int64_code, corefuncint.NPY_INT64),
                         (__add_float32_code, corefuncint.NPY_FLOAT32),
                         (__add_float64_code, corefuncint.NPY_FLOAT64),
                        ))

The above example should be self-explanatory along with the supporting documentation above. First the synthetic programs for each data type are created. Then, the ufunc itself is created by calling create_ufunc() and passing in all the necessary parameters.

A more complicated example, vector normalization, follows:

# Computes L = sqrt(a**2 + b**2), x = a / L, y = b / L
    
def gen_ufunc_normalize_float32():
  # Create an array in memory containing several constant used by the code.
  def init_fn(prgm):
    # rbx is used as a GP base register for the memory references
    import corepy.lib.extarray as extarray
    code = prgm.get_stream()
 
    # Create a dictionary to hold MemRef's and an array to hold the data    
    const = {}
    data = extarray.extarray('f', 8)
    const['_data'] = data
 
    prgm.add_storage('normal_const', const)
 
    # Set up the rbx register to point to the array in memory
    addr = data.buffer_info()[0]
    code.add(x86.mov(rbx, addr))
 
    # Duplicate the constant values 4x for a SIMD vector
    data[0:4] = (0.5,) * 4
    data[4:8] = (3.0,) * 4
 
    # Packed memrefs
    const['0.5'] = MemRef(rbx, data_size = 128)
    const['3.0'] = MemRef(rbx, 16, data_size = 128)
 
    # Scalar memrefs
    const['s 0.5'] = MemRef(rbx, data_size = 32)
    const['s 3.0'] = MemRef(rbx, 16, data_size = 32)
    return code
 
 
  def main_fn(prgm, reg):
    code = prgm.get_stream()
    const = prgm.get_storage('normal_const')
    # This code uses xmm{0,1,2,3,8,9,10} explicitly
 
 
    # Load values
    code += x86.movaps(xmm0, MemRef(reg['args'][0], data_size = 128))
    code += x86.movaps(xmm1, MemRef(reg['args'][1], data_size = 128))
 
    # Square the x and y components
    code += x86.movaps(xmm2, xmm0)
    code += x86.mulps(xmm2, xmm2)
    code += x86.movaps(xmm3, xmm1)
    code += x86.mulps(xmm3, xmm3)
 
    # x**2 + y**2
    code += x86.addps(xmm2, xmm3)
 
    # Approximate 1/sqrt(x2)
    code += x86.rsqrtps(xmm8, xmm2)         # x8 = 1/sqrt(x2)
 
    # Do an iteration of newton-raphson as suggested by AMD optimization manual:
    # y = 0.5 * x2 * x8 * (3.0 - x2 * x8 * x8)
 
    code += x86.mulps(xmm2, xmm8)           # x2 *= x8
 
    code += x86.movaps(xmm9, xmm2)          # x9 = (x2 * x8)
    code += x86.mulps(xmm9, xmm8)           # x9 *= x8
 
    code += x86.mulps(xmm8, const['0.5'])   # x2 *= 0.5
 
    code += x86.movaps(xmm10, const['3.0']) #x10 = 3.0
    code += x86.subps(xmm10, xmm9)          # x10 -= x9
 
    code += x86.mulps(xmm8, xmm10)          # x2 *= x10
 
    # Scale the x and y components
    code += x86.mulps(xmm0, xmm8)
    code += x86.mulps(xmm1, xmm8)
 
    # Store the values
    code += x86.movntps(MemRef(reg['args'][2], data_size = 128), xmm0)
    code += x86.movntps(MemRef(reg['args'][3], data_size = 128), xmm1)
 
    return code
 
 
  def special_fn(prgm, reg):
    code = prgm.get_stream()
    # This code uses xmm{0,1,2,3} explicitly
 
    code += x86.movss(xmm0, MemRef(reg['args'][0], data_size = 32))
    code += x86.movss(xmm1, MemRef(reg['args'][1], data_size = 32))
    code += x86.movss(xmm2, xmm0)
    code += x86.movss(xmm3, xmm1)
 
    code += x86.mulss(xmm0, xmm0)
    code += x86.mulss(xmm1, xmm1)
 
    code += x86.addss(xmm0, xmm1)
 
    # NOTE: Here, sqrt/div are used instead of 1/sqrt and mul.
    #  This keeps the example shorter and shows an alternative way
    #  of doing things, though one approach or the other should be
    #  used for both main_fn() and special_fn().
    code += x86.sqrtss(xmm0, xmm0)
 
    code += x86.divss(xmm2, xmm0)
    code += x86.divss(xmm3, xmm0)
 
    code += x86.movss(MemRef(reg['args'][2], data_size = 32), xmm2)
    code += x86.movss(MemRef(reg['args'][3], data_size = 32), xmm3)
 
    return code
 
  return gen_ufunc(2, 2, main_fn, special_fn, 4, init_fn = init_fn)

The general structure of this example is the same as the addition ufunc shown above, except that the code is more complex, and an init_fn() is defined to set up some constant values in memory. A common use of init_fn(), as shown here, is to set up a set of constant values to be used later in main_fn() and special_fn(). Although here the values are left in memory, constants could also be stored in registers. However my (Andrew Friedley) observation is that leaving constant values in memory little/no real impact on performance, and helps to reduce register pressure.

Again, the main loop body processes 4 elements at a time. Performance could likely be improved by unrolling another iteration of the main loop body and interleaving the instructions. The benefits come from two sources: first, the loop control code is executed fewer times. Second, interleaving independent loop bodies takes advantage of software pipelining and superscalar execution in the processor, hiding instruction latencies.


[edit] Tips for obtaining performance speedups

The first thing to do would be to get familiar with the Intel/AMD software optimization (and ISA) manuals:

AMD Software Guides & Manuals

Intel Software Developer's Manuals

Ufuncs are primarily memory-bound, so the greatest speedup will be obtained by reducing the number of memory accesses. Custom ufuncs developed with CoreFunc can combine a large set of complex operations and allow temporary data to be stored in registers, meaning input elements are read only once, and output elements are written only once. Temporary arrays with intermediate values are avoided.

CorePy gives developers access to the entire processor architecture. x86_64 contains a wide variety of instructions that rarely get used by compilers because the source languages do not contain semantics that can be effectively mapped to these instructions. Being familiar with all that the ISA has to offer and creatively developing tricks can generate significant performance gains.

SIMD vectorization is possible using the SSE extensions, combining SSE with software pipelining techniques results in superscalar code execution.

[edit] Performance Analysis

The results shown in the graphs were obtained on a dual-socket quad-core Xeon L5230 1.86GHz system. For each input size, the wall-clock time to execute 15 iterations was measured. Results are shown in Millions of operations per second. The work involved in one operation varies depending on the ufunc; this is the computation required for one element of each of the respective input/output arrays. For example, an addition operation consists of two reads, an addition, and a write.

[edit] Addition (float32)

Image:corefunc-addition-perf.png

This graph contains a number of artifacts which aren't fully understood. The main conclusions that can be drawn are the following:

  • Single-core CoreFunc performance mostly mirrors that of NumPy, though is somewhat faster starting around ~110,000 elements. This indicates performance is primarily memory-bound.
  • Multi-core incurs a significant performance penalty until around 75,000 elements. The speedup appears to be due to having effectively a larger CPU cache; the multi-core performance continues to increase until the array length exceeds the available cache (8mb in total).
  • 8 cores provides little/no benefit, at any size. This makes sense if the cache is responsible for increased performance; 4 cores is enough to make use of all the available CPU caches.


[edit] Addition-Reduction (float32)

Image:corefunc-addition-reduction-perf.png

  • CoreFunc performance is significantly better for larger array sizes. One key difference is that NumPy is accumulating its result in a memory location, while CoreFunc uses a register.
  • Multiple cores are beneficial past about ~50,000 elements. Again, likely due to having more cache available. The performance gains extend further than with normal addition (and overall performance is higher) because only 1 read access is required per element (as opposed to two reads and a write for normal addition). For some sizes, 8 cores are beneficial, though it is not fully understood why.


[edit] Vector Normalization (float32)

Image:corefunc-normal-perf2.png

This graph shows the result of progressively optimizing the vector normalization code. Performance for a C implementation compiled with GCC is also shown.

  • Call overhead dominates performance for both NumPy and CoreFunc on small arrays. The scalar CoreFunc implementation uses an instruction sequence very similar to that generated by GCC from the C code; it is not suprising that performance is effectively identical for larger array sizes. NumPy lags some, most likely due to the use of temporary arrays.
  • The CoreFunc vector implementation uses the same instruction sequence as the scalar version, but converts the instructions to their SIMD vector forms, yielding significant speedup.
  • The fully optimized CoreFunc implementation replaces the square-root and division instructions with a reciprocal square-root and multiplication. Since the reciprocal square-root instruction is far less accurate than the square root, an iteration of newton-raphson is used to obtain more accuracy. This approach is faster than the square-root instruction, but slightly less accurate.

[edit] How to Reproduce the Performance Numbers

The testing harness included in the codebase first runs a set of correctness tests, then starts running peformance tests. Running performance tests can take a while for all operations. If only a subset of operations are desired, the unwanted operations/types can be commented out in the list near the top of test.py. The set of input sizes and numbers of cores can be adjusted as well.