See the Links to Processor ISA References for pointers to architecture manuals.
In addition to the normal return value, a stop code may also be returned from an SPU program. The value of this stop code originates as a 14-bit value used as the operand for the SPU stop instruction that stopped the program. By default, the value is 0. However, user-synthesized code may contain stop instructions with any desired value, for any reason. To access the stop code from an SPU program, use the keyword argument stop = True when calling Processor.execute(). An example:
    >>> import corepy.arch.spu.isa as spu
    >>> import corepy.arch.spu.platform as env
    # Platform: linux_spufs.spre_linux_spu
    >>> code = env.InstructionStream()
    >>> code.add(spu.il(code.gp_return, 123))
    >>> code.add(spu.stop(456))
    >>> proc = env.Processor()
    >>> ret = proc.execute(code, mode = 'int', stop = True)
    >>> print ret
    (123, 456)
In the example, notice how a 2-tuple is returned instead of what would normally just be the integer return code. The first element of the tuple is the integer return code, while the second element is the SPU stop code.
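Because the stop operand is a 14-bit field, only values from 0 through 0x3FFF can be carried back this way. The sketch below is plain Python and does not touch CorePy; check_stop_code() and unpack_result() are hypothetical helper names, not part of the CorePy API, and the code assumes only the (return code, stop code) tuple shape described above.

```python
STOP_CODE_BITS = 14
STOP_CODE_MAX = (1 << STOP_CODE_BITS) - 1   # 0x3FFF


def check_stop_code(value):
    """Return value unchanged if it fits in the 14-bit stop operand."""
    if not 0 <= value <= STOP_CODE_MAX:
        raise ValueError("stop code %r does not fit in 14 bits" % (value,))
    return value


def unpack_result(result):
    """Give names to the 2-tuple returned when stop = True is used."""
    return_code, stop_code = result
    return {"return_code": return_code, "stop_code": stop_code}
```

A value could be passed through check_stop_code() before it is handed to spu.stop(), so that an out-of-range value is caught on the Python side rather than surfacing as an unexpected stop code later.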
In addition to executing instruction streams asynchronously, CorePy makes it easy to execute an instruction stream in parallel on multiple SPUs. Two things are done in the example below to enable parallel execution. First, a ParallelInstructionStream object is used instead of the usual InstructionStream. Second, a special keyword argument, n_spus, indicates how many SPUs the code should run on. The execute method then returns a tuple containing the return code from each SPU.
    >>> import corepy.arch.spu.isa as spu
    >>> import corepy.arch.spu.platform as env
    # Platform: linux_spufs.spre_linux_spu
    >>> code = env.ParallelInstructionStream()
    >>> code.add(spu.il(code.gp_return, 123))
    >>> code.add(spu.stop(456))
    >>> proc = env.Processor()
    >>> ret = proc.execute(code, mode = 'int', n_spus = 4)
    >>> print ret
    (123, 123, 123, 123)
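When every SPU is expected to return the same value, the tuple of return codes can be checked in one pass. The helper below is a plain-Python sketch; mismatched_spus() is a hypothetical name and works on any tuple of integers. It reports positions within the tuple and makes no assumption about which physical SPU each position corresponds to.

```python
def mismatched_spus(results, expected):
    """Return the positions in results whose return code differs from expected."""
    return [i for i, code in enumerate(results) if code != expected]
```

For the run above, mismatched_spus(ret, 123) would be an empty list, since all four return codes are 123.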
TODO - figure out the raw_data_size and rank/parameter stuff that seems to exist, and document it
The Cell features an explicitly managed, on-chip, per-SPU memory called the local store. The SPU itself accesses only local store directly with load/store instructions, and issues asynchronous DMA commands to move data between local store and main memory. While moving data between registers and local store is straightforward, using DMA commands to move data between local store and main memory is more involved. As a result, CorePy has several different tools for managing memory and DMA transfers.
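Part of what makes DMA more involved is that the hardware restricts transfer sizes and address alignment. The sketch below encodes the commonly documented MFC rules as a plain-Python check: a transfer is 1, 2, 4, or 8 bytes (naturally aligned, with both addresses at the same offset within a 16-byte quadword) or a multiple of 16 bytes up to 16 KB with both addresses 16-byte aligned. is_valid_dma() is a hypothetical helper, not part of CorePy, and the rules should be verified against the Cell Broadband Engine Architecture manual before relying on them.

```python
MAX_DMA_SIZE = 16 * 1024   # largest single transfer, in bytes (assumed MFC limit)


def is_valid_dma(ls_addr, mm_addr, size):
    """Check a local store / main memory transfer against the MFC rules above."""
    if size in (1, 2, 4, 8):
        # Small transfers: naturally aligned, same offset within a quadword
        return (ls_addr % size == 0 and mm_addr % size == 0
                and ls_addr % 16 == mm_addr % 16)
    if size <= 0 or size > MAX_DMA_SIZE or size % 16 != 0:
        return False
    # Larger transfers: both addresses must be quadword aligned
    return ls_addr % 16 == 0 and mm_addr % 16 == 0
```

A check like this can be run on the Python side before any DMA instructions are synthesized, for instance on the address and length of the array that is about to be transferred.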
First is a fairly low-level API that mimics the libspe2 MFC interface. The source code is the most complete documentation, and may be found in corepy/arch/spu/lib/dma.py.
Below is an example program that loads an array of 32 integers into local store, then stores the data in a second array. The Extended Array class is used here to avoid memory alignment issues, as well as the load_word() utility for loading constant values into registers. On the SPU, different instructions are used to load constants depending on their value; load_word() automatically generates the best instruction sequence.
    import corepy.lib.extarray as extarray
    import corepy.arch.spu.isa as spu
    import corepy.arch.spu.platform as env
    import corepy.arch.spu.lib.dma as dma
    from corepy.arch.spu.lib.util import load_word

    a = extarray.extarray('i', range(0, 32))
    b = extarray.extarray('i', [0 for i in range(0, 32)])

    code = env.InstructionStream()
    proc = env.Processor()
    spu.set_active_code(code)

    r_lsa = code.acquire_register()    # Local Store address
    r_mma = code.acquire_register()    # Main Memory address
    r_size = code.acquire_register()   # Size in bytes
    r_tag = code.acquire_register()    # DMA Tag

    # Set the parameters for a GET command
    abi = a.buffer_info()
    spu.il(r_lsa, 0x1000)                   # Local Store address 0x1000
    load_word(code, r_mma, abi[0])          # Main Memory address of array a
    spu.il(r_size, a.itemsize * abi[1])     # Size of array a in bytes
    spu.il(r_tag, 12)                       # DMA tag 12

    # Issue a DMA GET command
    dma.mfc_get(code, r_lsa, r_mma, r_size, r_tag)

    # Wait for completion
    # Set the completion mask; here we complete tag 12
    spu.il(r_tag, 1 << 12)
    dma.mfc_write_tag_mask(code, r_tag)
    dma.mfc_read_tag_status_all(code)

    # Set the parameters for a PUT command
    bbi = b.buffer_info()
    spu.il(r_lsa, 0x1000)                   # Local Store address 0x1000
    load_word(code, r_mma, bbi[0])          # Main Memory address of array b
    spu.il(r_size, b.itemsize * bbi[1])     # Size of array b in bytes
    spu.il(r_tag, 12)                       # DMA tag 12

    # Issue a DMA PUT command
    dma.mfc_put(code, r_lsa, r_mma, r_size, r_tag)

    # Wait for completion
    # Set the completion mask; here we complete tag 12
    spu.il(r_tag, 1 << 12)
    dma.mfc_write_tag_mask(code, r_tag)
    dma.mfc_read_tag_status_all(code)

    code.release_register(r_lsa)
    code.release_register(r_mma)
    code.release_register(r_size)
    code.release_register(r_tag)

    # Execute the code
    proc.execute(code)
In the example above, issuing DMA commands requires a fairly significant amount of code. To help with that, some higher-level DMA utility routines are also available (located in corepy/arch/spu/lib/dma.py). The example program below does the same thing as the example above, except that it uses the higher-level routines.
    import corepy.lib.extarray as extarray
    import corepy.arch.spu.isa as spu
    import corepy.arch.spu.platform as env
    import corepy.arch.spu.lib.dma as dma
    from corepy.arch.spu.lib.util import load_word

    a = extarray.extarray('i', range(0, 32))
    b = extarray.extarray('i', [0 for i in range(0, 32)])

    code = env.InstructionStream()
    proc = env.Processor()
    spu.set_active_code(code)

    # Issue a DMA GET command and wait for completion
    abi = a.buffer_info()
    dma.mem_get(code, 0x1000, abi[0], abi[1] * a.itemsize, 12)
    dma.mem_complete(code, 12)

    # Issue a DMA PUT command and wait for completion
    bbi = b.buffer_info()
    dma.mem_put(code, 0x1000, bbi[0], bbi[1] * b.itemsize, 12)
    dma.mem_complete(code, 12)

    # Execute the code
    proc.execute(code)
Note that values are passed directly to the mem_get(), mem_put(), and mem_complete() routines instead of being loaded into registers first. In this case the routines automatically acquire registers and initialize them for you. If this is not desirable (e.g. for performance reasons), registers pre-initialized with the desired values may be passed in instead. If a register is passed to the mem_complete() routine, it is assumed to hold a tag completion mask (e.g. 1 << 12) rather than a tag value (just 12). This way the high-level routines still allow multiple tags to be completed simultaneously.
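A completion mask for several tags is the bitwise OR of 1 << tag for each tag. The plain-Python sketch below builds such a mask; tag_mask() is a hypothetical helper, not part of CorePy, and it assumes tag numbers run from 0 to 31.

```python
def tag_mask(*tags):
    """Build a tag completion mask with one bit set per DMA tag."""
    mask = 0
    for tag in tags:
        if not 0 <= tag <= 31:
            raise ValueError("DMA tag %r is outside 0..31" % (tag,))
        mask |= 1 << tag
    return mask
```

For tags 12 and 13 this gives 0x3000. A mask covering high tag numbers will not fit in the 16-bit immediate accepted by spu.il(), so load_word() is the safer way to place it in a register before passing that register to mem_complete().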
TODO - DMA with memory desc, iterators
An SPU's local store can be accessed directly from Python, even while the SPU is running. This is possible because the Linux kernel provides a means of mapping local store directly into the process's virtual address space. Once mapped, both the PPU and other SPUs have full read/write access to an SPU's local store by reading/writing a special region of memory, using the same mechanisms that would normally be used for memory access.
A pointer to this region, represented in Python as an integer, is contained in the 'spuls' member variable of the identifier returned by Processor.execute() when asynchronous mode is enabled (async = True). An ExtArray object referring to an SPU's local store can be set up as follows:
    import corepy.lib.extarray as extarray
    import corepy.arch.spu.platform as env

    proc = env.Processor()

    # code is an InstructionStream created elsewhere
    id = proc.execute(code, async = True)

    arr = extarray.extarray('I', 262144 / 4)
    arr.set_memory(id.spuls)
A similar thing can be done with ExtBuffer objects, allowing anything that supports buffer objects to be used to represent local store. The example below creates a NumPy array using an ExtBuffer to access local store.
    import corepy.lib.extarray as extarray
    import numpy

    # ctx is the identifier returned by proc.execute(code, async = True)
    buf = extarray.extbuffer(262144, memory = ctx.spuls)
    array = numpy.frombuffer(buf, dtype=numpy.int32)

    # Change the array shape so that the second dimension
    # represents the elements of each vector; each element in
    # the first dimension represents the same data as an
    # SPU register.
    array.shape = (16384, 4)
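With the (16384, 4) shape, each row covers one 16-byte quadword of local store and each column one 32-bit word within it. The plain-Python sketch below converts a local store byte address into a (row, column) index for such an array; ls_index() is a hypothetical helper, not part of CorePy or NumPy.

```python
LS_SIZE = 262144   # local store size in bytes (256 KB)
QUADWORD = 16      # bytes per SPU register / array row
WORD = 4           # bytes per 32-bit element


def ls_index(addr):
    """Map a word-aligned local store address to a (row, column) index."""
    if not 0 <= addr < LS_SIZE:
        raise ValueError("address 0x%X is outside local store" % addr)
    if addr % WORD != 0:
        raise ValueError("address 0x%X is not word aligned" % addr)
    return addr // QUADWORD, (addr % QUADWORD) // WORD
```

For example, data transferred to local store address 0x1000 starts at row 256, column 0, so array[ls_index(0x1000)] would be its first word.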
Each SPU has several mailboxes that may be read/written, as well as two incoming signal channels. Routines for reading/writing local mailboxes & signals on the SPU are available in corepy/arch/spu/lib/dma.py. The SPU environment module provides access to routines for reading/writing SPU mailboxes from the PPU as well.
The example below first synthesizes an SPU program that writes a value to its outbound mailbox and waits for a signal before terminating. While the program is executing, the mailbox value is read on the PPU and a signal is sent to the SPU.
    import corepy.arch.spu.isa as spu
    import corepy.arch.spu.platform as env
    import corepy.arch.spu.lib.dma as dma
    from corepy.arch.spu.lib.util import load_word

    code = env.InstructionStream()
    proc = env.Processor()

    # Grab a register and initialize it
    reg = code.acquire_register()
    load_word(code, reg, 0xCAFEBABE)

    # Write the value to the outbound mailbox
    dma.spu_write_out_mbox(code, reg)

    # Wait for a signal
    sig = dma.spu_read_signal1(code)

    code.release_register(sig)
    code.release_register(reg)

    # Start the synthesized SPU program
    id = proc.execute(code, async = True)

    # Spin until the mailbox can be read
    while env.spu_exec.stat_out_mbox(id) == 0: pass
    value = env.spu_exec.read_out_mbox(id)

    # Signal the SPU
    env.spu_exec.write_signal(id, 1, 0x1234)

    # Wait for the SPU program to complete
    proc.join(id)

    print "value 0x%X" % value
The stat_out_mbox(), read_out_mbox(), and write_signal() routines are C routines compiled into a SWIG module (the 'spu_exec' module). Rather than importing the SWIG module directly, the recommended way is to access it via the platform environment module, as done in the example above. corepy/arch/spu/platform/linux_spufs/spu_exec.h contains the PPU-side mailbox routines. For mailbox & signal communication, the following routines are implemented:
read_out_mbox(id)
read_out_ibox(id)
Return an entry from the respective mailbox for SPU id if available.
write_in_mbox(id, data)
Write data to the inbound mailbox for SPU id.
stat_out_mbox(id)
stat_out_ibox(id)
stat_in_mbox(id)
Return the number of entries in the respective mailbox for SPU id.
write_signal(id, which, data)
Write signal value 'data' to signal channel 'which' (1 or 2) on SPU id.
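A bare spin loop on a stat routine never exits if the SPU program does not write to its mailbox. The sketch below shows one way to poll with a timeout. It is plain Python and takes the stat and read operations as callables, so it makes no assumptions about the spu_exec routines beyond the behavior listed above; wait_for_entry() and its parameters are hypothetical names.

```python
import time


def wait_for_entry(stat, read, timeout=1.0, interval=0.001):
    """Poll stat() until it reports an entry, then return read().

    Returns None if no entry shows up within timeout seconds.
    """
    deadline = time.time() + timeout
    while stat() == 0:
        if time.time() >= deadline:
            return None
        time.sleep(interval)
    return read()
```

With the names from the mailbox example, the call might look like wait_for_entry(lambda: env.spu_exec.stat_out_mbox(id), lambda: env.spu_exec.read_out_mbox(id)). Sleeping between polls trades a little latency for lower PPU load; pass interval=0 to keep the behavior of a tight spin.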
An obvious way to time a synthesized SPU program might be to wrap a Processor.execute() call with a pair of Python time.time() calls and compute the difference. However, this includes SPU startup/shutdown overhead, which is likely not desirable. Instead, a timing source called the SPU decrementer is available. On the Cell chips in Sony PS3 consoles, the decrementer register ticks at a rate of 79.8 MHz (visible in /proc/cpuinfo). In CorePy, several routines for managing the SPU decrementer can be found in corepy/arch/spu/lib/dma.py. Here is an example illustrating how to time a block of code:
    import corepy.arch.spu.isa as spu
    import corepy.arch.spu.platform as env
    import corepy.arch.spu.lib.dma as dma

    code = env.InstructionStream()
    proc = env.Processor()

    # Set the initial decrementer value to 2^31 - 1 and start it
    dma.spu_write_decr(code, 0x7FFFFFFFl)
    dma.spu_start_decr(code)

    # Code to be timed goes here

    reg = dma.spu_read_decr(code)
    dma.spu_stop_decr(code)

    # Copy the decrementer value into the return register
    code.add(spu.ori(code.gp_return, reg, 0))
    code.release_register(reg)

    # Run the code and grab the ticks as the return value
    ticks = proc.execute(code, mode = 'int')

    # Compute wall time in milliseconds
    ticks = 0x7FFFFFFFl - ticks
    ticktime = (ticks / 79800000.0) * 1000.0

    print "SPU Time: %0.5f ms, %d ticks" % (ticktime, ticks)
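The tick arithmetic at the end of the example can be factored into a small helper. The sketch below is plain Python; elapsed_ms() is a hypothetical name, and the default rate is the 79.8 MHz PS3 figure quoted above, which should be replaced with the value from /proc/cpuinfo on other hardware. It assumes the decrementer counts down as a 32-bit value, so the difference is masked to 32 bits to stay correct if the counter wraps past zero once.

```python
DECREMENTER_HZ = 79800000.0   # PS3 decrementer rate; adjust for other systems


def elapsed_ms(start, end, hz=DECREMENTER_HZ):
    """Return (ticks, milliseconds) between two decrementer readings.

    start is the value written before the timed code and end is the value
    read afterwards; the decrementer counts down, so ticks = start - end.
    """
    ticks = (start - end) & 0xFFFFFFFF
    return ticks, ticks / hz * 1000.0
```

Timing an empty block first and subtracting that tick count from later measurements is one way to remove the cost of the read and stop instructions themselves.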