CoreFunc is a framework for implementing NumPy's ufunc operations using CorePy. This project was developed by Andrew Friedley as a Google Summer of Code 2009 project. The project consists of three major components:
The code is currently available on a branch, though will eventually be included in the main CorePy codebase:
SVN: https://svn.osl.iu.edu/svn/corepy/branches/afriedle-gsoc-ufunc
Trac: http://svn.osl.iu.edu/trac/corepy/browser/branches/afriedle-gsoc-ufunc
During the course of the project sem-regular updates were posted on this blog:
First, obtain the source code using the SVN url above. Then, run setup.py in the 'corefunc' directory like the following:
$ cd corefunc $ python setup.py build_ext -i -f
Also, make sure to add the corefunc directory to your PYTHONPATH:
$ export PYTHONPATH=$PYTHONPATH:`pwd`/corefunc
Once the framework is compiled, use it like the following:
>>> import numpy >>> import corefunc >>> a = numpy.arange(10, dtype=numpy.int32) >>> b = numpy.arange(10, dtype=numpy.int32) # NumPy ufunc >>> numpy.add(a, b) array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=int32) # CorePy ufunc >>> corefunc.add(a, b) array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=int32) # Reduction works the same too: >>> corefunc.add.reduce(a) 45
Several of the existing ufunc operations were implemented primarily to aid in testing and design of the framework itself. Furthermore, they acted as experiments to find various ways in which the x86_64 architecture can be used to achieve higher performance. They should also be useful to users, acting as drop-in replacements for the existing NumPy ufuncs.
The following Ufuncs have been implemented:
>>> c = corefunc.add(a, b)
>>> c = corefunc.multiply(a, b)
>>> c = corefunc.maximum(a, b)
>>> c = corefunc.greater(a, b)
>>> c = corefunc.cos(a)
>>> d = corefunc.cos_x87(b)
The corefunc framework supports automatic parallelization of ufunc operations across multiple cores. However this is off by default. The number of desired cores should be specified using the set_threads() method:
>>> import corefunc >>> corefunc.set_threads(2)
Only 1, 2, 4, or 8 cores/threads are currently supported.
A testing harness has been developed for the included ufunc operations, contained in 'test.py' at the top level of the CoreFunc code. Running the script will test each of the operations and indicate a pass/fail status. Several failures are to be expected; this is due primarily to rounding error in floating point types incurred due to differences in the ufunc implementations (the faster/less accurate cosine for example) and a different order of operations for reductions when multiple threads/cores are used.
The real power of the CoreFunc framework is the ability to develop new, arbitrary ufunc operations. In particular, code can be written to combine several operations into a single ufunc. The result is potentially huge gains in performance due to reduced memory accesses and use of specialized x86_64 instructions.
A couple of custom ufuncs are included with the framework:
def npy_normal(a, b):
A particle simulation demo is included with CoreFunc. This demo was originally written by Chris Mueller, and had two implementations -- one using Numeric, and the other using CorePy and PowerPC/Altivec. The point was to show what kind of performance gains can be had using CorePy. The version of this demo included with CoreFunc has been updated for NumPy, and includes a custom ufunc that performs all of the particle update calculation in one pass over the position & velocity arrays. The performance difference observed on a quad-core xeon system with an ATI 3970x2 graphics card is roughly 100x when simulating 200,000 particles.
The framework interface for creating new ufuncs consists of three functions:
Generate a CorePy synthetic program for a single ufunc operation and data type. Parameters are as follows:
def main_fn(prgm, reg):
def special_fn(prgm, reg):
def reduce_fn(r_acc, r_src):
def init_fn(prgm, regs):
def final_fn(prgm, regs):
The main_fn, special_fn, init_fn and final_fn parameters are expected to have a specific function prototype; they should expect prgm, a CorePy program object, and reg, a dictionary of registers with the following string keys:
The registers of the args and steps arrays are in order corresponding to the input and output arrays of the ufunc with input arrays being first. For a ufunc with 2 inputs and one output, args[0] and args[1] are pointers to the input array, while args[2] is the output array.
A number of general-purpose registers are used internally by the ufunc framework code, and should not be modified by user-defined code. The table below indicates which registers may be used in what cases. Registers that are reserved during reduction may be used freely except by reduction code. If one of these registers is initialized in init_fn(), it will be overwritten by the reduction and be undefined when final_fn() is called.
The state of the x87 floating point unit must be returned to its initial state when the ufunc completes; i.e. all registers must be popped.
The stack is also available for use; user-defined code is responsible for returning the stack (rsp) to its initial state. The frame pointer may also be used (rbp) as long as its value is restored when the ufunc completes.
| Register(s) | Availability |
|---|---|
| rax,rbx | Reserved during reduction on GPRegister{8,16,32,64} and XMMRegister types |
| rcx,rdx | Always available |
| rdi,rsi,r{8-15} | Always reserved |
| rbp,rsp | Always available; must be returned to original state before ufunc completes |
| xmm{0,1} | Reserved during reduction on XMMRegister types |
| xmm{2-15} | Always available |
| mm{0-7} | Always available |
| st{0-7} | Always available |
A CorePy Program object implementing a ufunc operation for a specific data type is returned from gen_ufunc(). Called from C, the synthetic program's function prototype matches the PyUfuncGenericFunction typdef defined in the Ufunc C API:
void ufunc(char** args, npy_intp* dimensions, npy_intp* steps, void* data);
To create a useable ufunc, the synthetic program should be passed to the create_ufunc() method, documented below:
Create a new ufunc using a set of synthetic programs. Parameters:
The specified synthetic programs are wrapped by a C function that first splits the array into suitable work segments for the number of threads specified by the last call to set_threads(). The synthetic programs are then invoked with the appropriate arguments to execute over the respective work segments.
A callable ufunc is returned as a Python object.
Using the API is pretty straight forward. Below is the code to generate a synthetic program performing addition with an int64 data type:
def gen_ufunc_add_int64(): def main_fn(prgm, reg): code = prgm.get_stream() # This code uses xmm0/xmm1 explicitly code += x86.movdqa(xmm0, MemRef(reg['args'][0], data_size = 128)) code += x86.movdqa(xmm1, MemRef(reg['args'][0], 16, data_size = 128)) code += x86.paddq(xmm0, MemRef(reg['args'][1], data_size = 128)) code += x86.paddq(xmm1, MemRef(reg['args'][1], 16, data_size = 128)) code += x86.movntdq(MemRef(reg['args'][2], data_size = 128), xmm0) code += x86.movntdq(MemRef(reg['args'][2], 16, data_size = 128), xmm1) return code def special_fn(prgm, reg): code = prgm.get_stream() # This code uses rax explicitly code += x86.mov(rax, MemRef(reg['args'][0])) code += x86.add(rax, MemRef(reg['args'][1])) code += x86.movnti(MemRef(reg['args'][2]), rax) return code return gen_ufunc(2, 1, main_fn, special_fn, 8, x86.add, ident_gp_0, GPRegister64, 64)
First main_fn() is defined, matching the prototype expected by gen_ufunc(). A new InstructionStream object is obtained from the program, into which the an unrolled and vectorized loop body is generated. In this particular case, the addition operation is applied to 4 elements at a time. Each XMM (SSE) register holds only two elements, so the code is duplicated once and interleaved to compute 4 elements simultaneously. In other words, both SIMD vectorization and partial loop unrolling are being applied here.
The special_fn() method creates an InstructionStream and generates scalar addition code, summing one pair of elements at a time. main_fn() is used to process as many elements as possible, followed by special_fn() to process any remaining elements if the array lengths are not a multiple of the number of elements main_fn() processes per iteration.
A complete ufunc implementation will have a set of synthetic programs, defined similarly to the example above, for every supported data type. A ufunc is created from these synthetic programs by doing the following:
# Generate the synthetic programs # Repeat for each data type __add_int32_code = gen_ufunc_add_int32() __add_int64_code = gen_ufunc_add_int64() __add_float32_code = gen_ufunc_add_float32() __add_float64_code = gen_ufunc_add_float64() # Create the ufunc: add = create_ufunc("add", corefuncint.NPY_ZERO, 2, 1, # Specify each synthetic program and data type ((__add_int32_code, corefuncint.NPY_INT32), (__add_int64_code, corefuncint.NPY_INT64), (__add_float32_code, corefuncint.NPY_FLOAT32), (__add_float64_code, corefuncint.NPY_FLOAT64), ))
The above example should be self-explanatory along with the supporting documentation above. First the synthetic programs for each data type are created. Then, the ufunc itself is created by calling create_ufunc() and passing in all the necessary parameters.
A more complicated example, vector normalization, follows:
# Computes L = sqrt(a**2 + b**2), x = a / L, y = b / L def gen_ufunc_normalize_float32(): # Create an array in memory containing several constant used by the code. def init_fn(prgm): # rbx is used as a GP base register for the memory references import corepy.lib.extarray as extarray code = prgm.get_stream() # Create a dictionary to hold MemRef's and an array to hold the data const = {} data = extarray.extarray('f', 8) const['_data'] = data prgm.add_storage('normal_const', const) # Set up the rbx register to point to the array in memory addr = data.buffer_info()[0] code.add(x86.mov(rbx, addr)) # Duplicate the constant values 4x for a SIMD vector data[0:4] = (0.5,) * 4 data[4:8] = (3.0,) * 4 # Packed memrefs const['0.5'] = MemRef(rbx, data_size = 128) const['3.0'] = MemRef(rbx, 16, data_size = 128) # Scalar memrefs const['s 0.5'] = MemRef(rbx, data_size = 32) const['s 3.0'] = MemRef(rbx, 16, data_size = 32) return code def main_fn(prgm, reg): code = prgm.get_stream() const = prgm.get_storage('normal_const') # This code uses xmm{0,1,2,3,8,9,10} explicitly # Load values code += x86.movaps(xmm0, MemRef(reg['args'][0], data_size = 128)) code += x86.movaps(xmm1, MemRef(reg['args'][1], data_size = 128)) # Square the x and y components code += x86.movaps(xmm2, xmm0) code += x86.mulps(xmm2, xmm2) code += x86.movaps(xmm3, xmm1) code += x86.mulps(xmm3, xmm3) # x**2 + y**2 code += x86.addps(xmm2, xmm3) # Approximate 1/sqrt(x2) code += x86.rsqrtps(xmm8, xmm2) # x8 = 1/sqrt(x2) # Do an iteration of newton-raphson as suggested by AMD optimization manual: # y = 0.5 * x2 * x8 * (3.0 - x2 * x8 * x8) code += x86.mulps(xmm2, xmm8) # x2 *= x8 code += x86.movaps(xmm9, xmm2) # x9 = (x2 * x8) code += x86.mulps(xmm9, xmm8) # x9 *= x8 code += x86.mulps(xmm8, const['0.5']) # x2 *= 0.5 code += x86.movaps(xmm10, const['3.0']) #x10 = 3.0 code += x86.subps(xmm10, xmm9) # x10 -= x9 code += x86.mulps(xmm8, xmm10) # x2 *= x10 # Scale the x and y components code += x86.mulps(xmm0, xmm8) code += x86.mulps(xmm1, xmm8) # Store the values code += x86.movntps(MemRef(reg['args'][2], data_size = 128), xmm0) code += x86.movntps(MemRef(reg['args'][3], data_size = 128), xmm1) return code def special_fn(prgm, reg): code = prgm.get_stream() # This code uses xmm{0,1,2,3} explicitly code += x86.movss(xmm0, MemRef(reg['args'][0], data_size = 32)) code += x86.movss(xmm1, MemRef(reg['args'][1], data_size = 32)) code += x86.movss(xmm2, xmm0) code += x86.movss(xmm3, xmm1) code += x86.mulss(xmm0, xmm0) code += x86.mulss(xmm1, xmm1) code += x86.addss(xmm0, xmm1) # NOTE: Here, sqrt/div are used instead of 1/sqrt and mul. # This keeps the example shorter and shows an alternative way # of doing things, though one approach or the other should be # used for both main_fn() and special_fn(). code += x86.sqrtss(xmm0, xmm0) code += x86.divss(xmm2, xmm0) code += x86.divss(xmm3, xmm0) code += x86.movss(MemRef(reg['args'][2], data_size = 32), xmm2) code += x86.movss(MemRef(reg['args'][3], data_size = 32), xmm3) return code return gen_ufunc(2, 2, main_fn, special_fn, 4, init_fn = init_fn)
The general structure of this example is the same as the addition ufunc shown above, except that the code is more complex, and an init_fn() is defined to set up some constant values in memory. A common use of init_fn(), as shown here, is to set up a set of constant values to be used later in main_fn() and special_fn(). Although here the values are left in memory, constants could also be stored in registers. However my (Andrew Friedley) observation is that leaving constant values in memory little/no real impact on performance, and helps to reduce register pressure.
Again, the main loop body processes 4 elements at a time. Performance could likely be improved by unrolling another iteration of the main loop body and interleaving the instructions. The benefits come from two sources: first, the loop control code is executed fewer times. Second, interleaving independent loop bodies takes advantage of software pipelining and superscalar execution in the processor, hiding instruction latencies.
The first thing to do would be to get familiar with the Intel/AMD software optimization (and ISA) manuals:
Intel Software Developer's Manuals
Ufuncs are primarily memory-bound, so the greatest speedup will be obtained by reducing the number of memory accesses. Custom ufuncs developed with CoreFunc can combine a large set of complex operations and allow temporary data to be stored in registers, meaning input elements are read only once, and output elements are written only once. Temporary arrays with intermediate values are avoided.
CorePy gives developers access to the entire processor architecture. x86_64 contains a wide variety of instructions that rarely get used by compilers because the source languages do not contain semantics that can be effectively mapped to these instructions. Being familiar with all that the ISA has to offer and creatively developing tricks can generate significant performance gains.
SIMD vectorization is possible using the SSE extensions, combining SSE with software pipelining techniques results in superscalar code execution.
The results shown in the graphs were obtained on a dual-socket quad-core Xeon L5230 1.86GHz system. For each input size, the wall-clock time to execute 15 iterations was measured. Results are shown in Millions of operations per second. The work involved in one operation varies depending on the ufunc; this is the computation required for one element of each of the respective input/output arrays. For example, an addition operation consists of two reads, an addition, and a write.
This graph contains a number of artifacts which aren't fully understood. The main conclusions that can be drawn are the following:
This graph shows the result of progressively optimizing the vector normalization code. Performance for a C implementation compiled with GCC is also shown.
The testing harness included in the codebase first runs a set of correctness tests, then starts running peformance tests. Running performance tests can take a while for all operations. If only a subset of operations are desired, the unwanted operations/types can be commented out in the list near the top of test.py. The set of input sizes and numbers of cores can be adjusted as well.