CorePy CAL

CorePy support for CAL IL is still in development, so some aspects of the interface may not yet be finalized.

Processor

To run CAL programs using a CorePy Processor object, several additional parameters must be specified. First, when creating a Processor, a device number must be specified (if only one device is present, it will be numbered 0). Second, when executing a program, a "domain" must be specified. The domain determines the number of threads that will be used and which indices of the input buffer(s) will be used. It is specified as a 4-element tuple giving the begin and end x and y coordinates, for example (0, 0, 128, 128).
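As a small illustration of the domain tuple, the helper below builds a domain that starts at the origin and covers a whole buffer. The name full_domain is hypothetical (it is not part of CorePy), and it only mirrors the tuples that appear on this page; the CorePy calls are left in comments because they need a CAL device.

```python
def full_domain(width, height):
    """Return a domain tuple starting at (0, 0) that spans width x height.

    Hypothetical helper, not part of CorePy. It mirrors the tuples used on
    this page, such as (0, 0, 128, 128).
    """
    return (0, 0, width, height)


domain = full_domain(128, 128)

# With a CAL device available, the tuple would be used roughly like this:
#   import corepy.arch.cal.platform as env
#   proc = env.Processor(0)       # device number 0
#   proc.execute(code, domain)    # 'code' is an InstructionStream
```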

Memory Management

All memory to be accessed by the GPU must be specially allocated and then "bound" to a GPU buffer. This applies both to memory on the video card used by the GPU program (called "local" memory) and to main memory used by the GPU program (called "remote" memory). Note that remote memory includes any memory used for input or output.

To use memory, call the alloc_remote() or alloc_local() method of Processor, which returns an extarray. Both methods take a type parameter, the number of components per entry (1, 2, or 4), and the x and y dimensions of the memory. Before running the program, bind the memory to a specific buffer or to an input or output register with InstructionStream.set_remote_binding() (or set_local_binding()). Finally, when done with the memory, call Processor.free_remote() (or free_local()).

For example:

import corepy.arch.cal.platform as env
import corepy.arch.cal.isa as cal
import corepy.arch.cal.types.registers as reg

# Create a Processor for device 0.
proc = env.Processor(0)

# Allocate three 16x1 remote buffers of type 'I' with 4 components per entry.
input1_mem = proc.alloc_remote('I', 4, 16, 1)
input2_mem = proc.alloc_remote('I', 4, 16, 1)
output_mem = proc.alloc_remote('I', 4, 16, 1)

# Fill both inputs: 16 entries times 4 components each.
for i in range(16*4):
    input1_mem[i] = i
    input2_mem[i] = i

code = env.InstructionStream()
cal.set_active_code(code)

# Declarations: window coordinate, two input resources, one output.
cal.dclpi('00--', reg.vWinCoord0)
cal.dcl_resource(0, cal.pixtex_type.oned, cal.fmt.uint, UNNORM=True)
cal.dcl_resource(1, cal.pixtex_type.oned, cal.fmt.uint, UNNORM=True)
cal.dcl_output(reg.o0, USAGE=cal.usage.generic)

# Sample resources 0 and 1 at the x window coordinate, then add the results.
cal.sample(0, 0, reg.r0, reg.vWinCoord0.x)
cal.sample(1, 0, reg.r1, reg.vWinCoord0.x)
cal.iadd(reg.o0, reg.r0, reg.r1)

# Bind the allocated memory to the input and output registers.
code.set_remote_binding('i0', input1_mem)
code.set_remote_binding('i1', input2_mem)
code.set_remote_binding('o0', output_mem)

# Execute over the 16x1 domain.
domain = (0, 0, 16, 1)
proc.execute(code, domain)

print(input1_mem)
print(input2_mem)
print(output_mem)

# Release the memory when done.
proc.free_remote(input1_mem)
proc.free_remote(input2_mem)
proc.free_remote(output_mem)

Components and Indexing

The memory returned by CorePy's alloc_remote() is raw memory, so indexing works as specified in section 3.2 of the Stream Computing User Guide.

Thus, when using memory with 2 or 4 components (which will usually be the case), the number of components must be considered when determining the index of an individual element. For example, suppose that the global buffer is being used with 4 components; then g[5] starts at index 4*5 = 20 on the Python side. Furthermore, for one-dimensional memory, the width of the memory on the Python side is the width in elements times the number of components. For two-dimensional memory the size is more complicated; see below.
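The index arithmetic for one-dimensional memory can be sketched in plain Python. The helper names are hypothetical, a list stands in for the extarray, and the sketch assumes the components of one element are stored next to each other, as the g[5] example suggests.

```python
def flat_index_1d(element, num_components, component=0):
    """Python-side index of one component of a 1-D element."""
    return element * num_components + component


def element_slice(element, num_components):
    """Slice covering all components of a 1-D element."""
    start = flat_index_1d(element, num_components)
    return slice(start, start + num_components)


# Stand-in for a 16-entry, 4-component buffer: 16 * 4 = 64 raw values.
buf = list(range(16 * 4))

first = flat_index_1d(5, 4)            # index where g[5] starts
g5 = buf[element_slice(5, 4)]          # all four components of g[5]
```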

Two-dimensional Memory

With CAL, two-dimensional memory is generally preferred for performance reasons. However, when using two-dimensional memory, the "pitch" of the memory must be taken into account. See the Stream Computing User Guide for more information on this.

In order to find the pitch of the memory allocated by CorePy, use the gpu_pitch attribute.

For example:

input_mem = proc.alloc_remote('I', 4, 16, 16) # type, number of components, width, height
output_mem = proc.alloc_remote('I', 4, 16, 16)
 
pitch = input_mem.gpu_pitch

Pitch must also be taken into account when calculating the index of an element: if the number of components is nc, element (x, y) is at index x*nc + y*nc*pitch. Fortunately, when the width is greater than 64, the pitch is usually the same as the width.
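The formula above can be written as a small helper. This is a sketch in plain Python: flat_index_2d is a hypothetical name, and the pitch value of 64 is only an example. With CorePy the pitch would come from the gpu_pitch attribute of the allocated memory.

```python
def flat_index_2d(x, y, num_components, pitch, component=0):
    """Python-side index of element (x, y): x*nc + y*nc*pitch (+ component)."""
    return x * num_components + y * num_components * pitch + component


nc = 4        # components per element
pitch = 64    # example value; in CorePy: pitch = input_mem.gpu_pitch

idx = flat_index_2d(3, 2, nc, pitch)

# When the pitch equals the width, rows are packed back to back.
packed = flat_index_2d(3, 2, nc, 16)
```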

Code

Instructions

In CAL, instructions have fields besides the usual operands. For example, the sample instruction requires a resource and a sampler field, and has an optional aoffinmi field. In CorePy, mandatory fields are generally specified as the first arguments to the instruction, before the operands, and optional fields are specified using keyword arguments. So, a CAL instruction:
sample_resource(0)_sampler(0)_aoffinmi(0, 0, 0) r0, v0.x
would become in CorePy:
sample(0, 0, r0, v0.x, AOFFINMI=(0,0,0))
The one exception to this system so far is the zeroop field, which is mandatory in CAL. In CorePy it defaults to "inf_else_max" and must be specified using the ZEROOP keyword if it is to be set to any other value.
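To make the calling convention concrete, the toy function below takes the mandatory fields as leading positional arguments and the optional field as a keyword, and produces the CAL text from the example above. It is only a hypothetical illustration of the argument layout; it is not how CorePy generates code, and sample_il is not a CorePy function.

```python
def sample_il(resource, sampler, dst, src, AOFFINMI=None):
    """Build the CAL IL text for a sample instruction (toy illustration)."""
    # Mandatory fields come first, as positional arguments.
    text = 'sample_resource(%d)_sampler(%d)' % (resource, sampler)
    # The optional field is added only when its keyword is given.
    if AOFFINMI is not None:
        text += '_aoffinmi(%d, %d, %d)' % tuple(AOFFINMI)
    return '%s %s, %s' % (text, dst, src)


with_offset = sample_il(0, 0, 'r0', 'v0.x', AOFFINMI=(0, 0, 0))
without_offset = sample_il(0, 0, 'r0', 'v0.x')
```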

There are definitions for values for these fields and operands in corepy.arch.cal.isa:

  • isa.relop
  • isa.zeroop
  • isa.logicop
  • isa.usage
  • isa.pixtex_type (for dcl_resource "type" field)
  • isa.fmt (for dcl_resource "fmt" field)
  • isa.sharingMode (for lds_read_vec "sharing" field)
  • isa.interp
  • isa.output_topology (for dcl_output_topology operand)

For example:

dcl_resource_id(0)_type(2d, unnorm)_fmtx(uint)_fmty(uint)_fmtz(uint)_fmtw(uint)
becomes
cal.dcl_resource(0, cal.pixtex_type.twod, cal.fmt.uint, UNNORM=True)
or
cal.dcl_resource(0, cal.pixtex_type.twod, (cal.fmt.uint, cal.fmt.uint, cal.fmt.uint, cal.fmt.uint), UNNORM=True)

Extended instructions

corepy.arch.cal.lib.cal_extended contains many "extended instructions" for CAL. Because CAL has no immediate values, cal_extended provides immediate variants of most instructions where one would be useful. For example, cal_extended.addi() is like isa.add() except that the third operand should be an immediate; CorePy automatically generates CAL code to load the immediate value and add it. CorePy does this by allocating a literal register containing the value, so be aware that using these instructions will increase the register usage of your program.

Registers

There are many different registers used in CAL. The most common ones are pre-defined as CorePy register objects. These include the temp or r registers, the literal or l registers, vWinCoord0, v0, a0, the global buffer g, and several input and output registers (i0, i1, o0, o1, etc.) and constant buffers (cb0, cb1, etc.).

InstructionStream.acquire_register() will return a temp register. If a 4-tuple is passed as a parameter, CorePy will automatically insert a dcl_literal instruction into the InstructionStream and return a literal register.

Operands for CAL instructions can have several source modifiers as well as "swizzles", and target operands can have a mask, which is similar to a swizzle. In CorePy, modifiers and swizzles/masks are specified by calling the register objects with appropriate arguments. Additionally, swizzles and masks can be specified using attributes.

Swizzles allow rearranging the components of the source register. Additionally, components can be replicated, or forced to 0 or 1. For example, r0.x in CAL refers to only the x component of r0. r0.xxxx replicates the x component 4 times. r0.wzyx reverses the components. r0.0000 is an operand with 0s in all four slots. Masks are similar, except that additionally, an '_' character can be used to prevent writing to certain components. If r0.x___ were to be used as a target operand, only the first component would be written.
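The effect of swizzles and masks can be modeled on plain 4-tuples. This is a sketch of the behavior described above, not CorePy or CAL code; apply_swizzle and apply_mask are hypothetical names.

```python
COMPONENTS = 'xyzw'


def apply_swizzle(vec, swizzle):
    """Select, replicate, or force components of a 4-component source."""
    out = []
    for ch in swizzle:
        if ch == '0':
            out.append(0)              # component forced to 0
        elif ch == '1':
            out.append(1)              # component forced to 1
        else:
            out.append(vec[COMPONENTS.index(ch)])
    return tuple(out)


def apply_mask(dest, value, mask):
    """Write value into dest, skipping components marked with '_'."""
    return tuple(d if m == '_' else v for d, v, m in zip(dest, value, mask))


r0 = (10, 20, 30, 40)
reversed_r0 = apply_swizzle(r0, 'wzyx')
replicated = apply_swizzle(r0, 'xxxx')
zeros = apply_swizzle(r0, '0000')
masked = apply_mask(r0, (9, 9, 9, 9), 'x___')
```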

In CorePy, the swizzle can be specified with the first argument to a register object. r0('x') is the same as the standard r0.x in CAL assembly syntax. If the swizzle or mask begins with any character other than '0' or '1', CorePy also supports using the dot syntax (i.e. as attributes of the register object) to specify the swizzle. That is, r0('x') and r0.x are the same in CorePy. Note that swizzles beginning with numbers can only be specified as parameters, and not using the dot syntax.

Source modifiers are specified using parameters to a called register. For example, the x2 modifier doubles the value of an operand. To use this with the r0 register, the syntax would be r0(x2=True). Source modifiers abs, bias, bx2, invert, and sign are specified the same way. The negate modifier is specified as neg='xyzw' (or with some substring thereof). The divcomp keyword can be used with 'y', 'z', or 'w' as a value.

Examples:

CAL          CorePy
r0.x         r0.x or r0('x')
r0.000w      r0('000w')
r0_x2.x      r0(x2=True).x or r0('x', x2=True)
r0_neg(yw)   r0(neg='yw')
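The toy class below shows how a Python object can accept both the call syntax and the dot syntax, and why swizzles that begin with a digit need the call form: r0.000w is not valid Python, while r0('000w') is. Reg is a hypothetical stand-in, not CorePy's register class, and rendering multiple modifiers in sorted order is an assumption made only for this sketch.

```python
class Reg(object):
    """Toy register operand that renders itself as CAL-style text."""

    def __init__(self, name, swizzle=None, mods=None):
        self.name = name
        self.swizzle = swizzle
        self.mods = dict(mods or {})

    def __call__(self, swizzle=None, **mods):
        # Call syntax: optional swizzle first, modifiers as keywords.
        merged = dict(self.mods)
        merged.update(mods)
        return Reg(self.name, swizzle or self.swizzle, merged)

    def __getattr__(self, attr):
        # Dot syntax: reached only when normal attribute lookup fails.
        if attr and all(c in 'xyzw01_' for c in attr):
            return Reg(self.name, attr, self.mods)
        raise AttributeError(attr)

    def il(self):
        text = self.name
        for key in sorted(self.mods):
            value = self.mods[key]
            if value is True:
                text += '_%s' % key                 # e.g. _x2
            else:
                text += '_%s(%s)' % (key, value)    # e.g. _neg(yw)
        if self.swizzle:
            text += '.' + self.swizzle
        return text


r0 = Reg('r0')
rows = [r0.x.il(), r0('x').il(), r0('000w').il(),
        r0(x2=True).x.il(), r0('x', x2=True).il(), r0(neg='yw').il()]
```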

Destination Modifiers

In CAL assembly, modifications to the destination are also possible, but (other than masks) they are specified on the instruction and not on the target operand. In CorePy, destination modifiers are handled similarly. In CAL there are several modifiers that allow multiplication or division of the result, and one that causes the result to be saturated. CorePy reduces this to two kinds of parameters: a boolean SAT and a SHIFT parameter that takes a string matching the CAL destination modifiers: 'x2', 'x4', 'x8', 'd2', 'd4', 'd8'.

Example: add(r0, r0, r1, SHIFT='x2')
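As a rough model of what these parameters do to a result, the sketch below scales a scalar by the SHIFT factor and then optionally saturates it. The function is hypothetical and is not CorePy code. It rests on two assumptions that are not stated on this page: saturation clamps to the range 0.0 to 1.0, and the shift is applied before saturation.

```python
# Scale factor for each SHIFT string: 'x' multiplies, 'd' divides.
SHIFT_FACTORS = {'x2': 2.0, 'x4': 4.0, 'x8': 8.0,
                 'd2': 0.5, 'd4': 0.25, 'd8': 0.125}


def apply_dest_modifiers(value, SHIFT=None, SAT=False):
    """Model destination modifiers on a scalar result (assumptions above)."""
    if SHIFT is not None:
        value = value * SHIFT_FACTORS[SHIFT]
    if SAT:
        value = min(max(value, 0.0), 1.0)   # assumed clamp range
    return value


# Models add(r0, r0, r1, SHIFT='x2') for r0 = 1.5 and r1 = 0.25.
doubled = apply_dest_modifiers(1.5 + 0.25, SHIFT='x2')
saturated = apply_dest_modifiers(1.5 + 0.25, SHIFT='x2', SAT=True)
```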

Relative Addressing

Certain register and buffer types, such as the global buffer, g, support relative addressing. CorePy supports a syntax nearly identical to CAL IL for relative addressing. For example, in CAL IL one might write g[r0.x]; in CorePy, the syntax is identical. Offsets also work as they do in CAL: g[r0.x + 1] is an acceptable operand in both CAL and CorePy.
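Python's operator overloading is what allows this syntax to match CAL IL so closely. The toy classes below are hypothetical stand-ins, not CorePy's implementation; they show how indexing and the + operator can be captured and rendered back as text.

```python
class Operand(object):
    """Toy operand that records how it was built as text."""

    def __init__(self, text):
        self.text = text

    def __add__(self, offset):
        # r0.x + 1 yields a new operand carrying the offset.
        return Operand('%s + %d' % (self.text, offset))

    def __str__(self):
        return self.text


class Buffer(object):
    """Toy buffer that supports indexing with an operand."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, index):
        return Operand('%s[%s]' % (self.name, index))


g = Buffer('g')
r0_x = Operand('r0.x')

plain = str(g[r0_x])
with_offset = str(g[r0_x + 1])
```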