CorePy Cell/SPU

See the Links to Processor ISA References for pointers to architecture manuals.

Execution Environment

Returning Stop Codes

In addition to the normal return value, a stop code may also be returned from an SPU program. The value of this stop code originates as a 14-bit value used as the operand for the SPU stop instruction that stopped the program. By default, the value is 0. However, user-synthesized code may contain stop instructions with any desired value, for any reason. To access the stop code from an SPU program, use the keyword argument stop = True when calling Processor.execute(). An example:

>>> import corepy.arch.spu.isa as spu
>>> import corepy.arch.spu.platform as env
# Platform: linux_spufs.spre_linux_spu
 
>>> code = env.InstructionStream()
>>> code.add(spu.il(code.gp_return, 123))
>>> code.add(spu.stop(456))
 
>>> proc = env.Processor()
>>> ret = proc.execute(code, mode = 'int', stop = True)
>>> print ret
(123, 456)

In the example, notice how a 2-tuple is returned instead of what would normally just be the integer return code. The first element of the tuple is the integer return code, while the second element is the SPU stop code.
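
Because the stop code is described above as a 14-bit operand, values passed to the stop instruction are limited to the range 0 to 0x3FFF. The sketch below illustrates that range and how the returned 2-tuple can be unpacked. The helper is_valid_stop_code() and the stand-in result tuple are hypothetical and are not part of CorePy.

```python
# Hypothetical helpers for illustration; they are not part of CorePy.
STOP_CODE_BITS = 14
STOP_CODE_MAX = (1 << STOP_CODE_BITS) - 1   # 0x3FFF, the largest 14-bit value


def is_valid_stop_code(value):
    """Return True if value fits in the 14-bit stop instruction operand."""
    return 0 <= value <= STOP_CODE_MAX


# Stand-in for the tuple returned by
# proc.execute(code, mode = 'int', stop = True) in the example above.
result = (123, 456)
return_value, stop_code = result
```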

Executing on Multiple SPUs in Parallel

In addition to executing instruction streams asynchronously, CorePy makes it easy to execute an instruction stream in parallel on multiple SPUs. Two things are done in the example below to enable parallel execution. First, a ParallelInstructionStream object is used instead of the usual InstructionStream. Second, a special keyword argument, n_spus, indicates how many SPUs the code should run on. The execute method then returns a tuple containing the return code from each SPU.

>>> import corepy.arch.spu.isa as spu
>>> import corepy.arch.spu.platform as env
# Platform: linux_spufs.spre_linux_spu
 
>>> code = env.ParallelInstructionStream()
>>> code.add(spu.il(code.gp_return, 123))
>>> code.add(spu.stop(456))
 
>>> proc = env.Processor()
>>> ret = proc.execute(code, mode = 'int', n_spus = 4)
>>> print ret
(123, 123, 123, 123)
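
Since one return code comes back per SPU, it can be convenient to check them all at once. The sketch below is a plain Python helper (hypothetical, not part of CorePy) that lists the tuple positions whose value differs from the expected one. Whether a tuple position corresponds to a particular physical SPU is an assumption that the text above does not confirm.

```python
def unexpected_results(results, expected):
    """Return (index, value) pairs for entries that differ from expected."""
    return [(i, r) for i, r in enumerate(results) if r != expected]


# Stand-in for a tuple returned by proc.execute(code, mode = 'int', n_spus = 4);
# the third value is deliberately different for illustration.
results = (123, 123, 7, 123)
bad = unexpected_results(results, 123)
```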

TODO - figure out the raw_data_size and rank/parameter stuff that seems to exist, and document it

Working With The Local Store

The Cell features an explicitly managed, on-chip, per-SPU memory called the local store. The SPU itself directly accesses only local store memory, using load/store instructions, and issues asynchronous DMA commands to move data between local store and main memory. While moving data between registers and local store is straightforward, using DMA commands to move data between local store and main memory is more involved. As a result, CorePy has several different tools for managing memory and DMA transfers.

First is a fairly low-level API that mimics the libspe2 MFC interface. The source code is the most complete documentation, and may be found in corepy/arch/spu/lib/dma.py.

Below is an example program that loads an array of 32 integers into local store, then stores the data in a second array. The Extended Array class is used here to avoid memory alignment issues, and the load_word() utility is used to load constant values into registers. On the SPU, different instructions are used to load constants depending on their value; load_word() automatically generates the best instruction sequence.

import corepy.lib.extarray as extarray
import corepy.arch.spu.isa as spu
import corepy.arch.spu.platform as env
import corepy.arch.spu.lib.dma as dma
from corepy.arch.spu.lib.util import load_word
 
a = extarray.extarray('i', range(0, 32))
b = extarray.extarray('i', [0 for i in range(0, 32)])                         
code = env.InstructionStream()                                                
proc = env.Processor()                                                        
  
spu.set_active_code(code)                                                     
  
r_lsa = code.acquire_register()   # Local Store address                       
r_mma = code.acquire_register()   # Main Memory address                       
r_size = code.acquire_register()  # Size in bytes                             
r_tag = code.acquire_register()   # DMA Tag                                   
  
# Set the parameters for a GET command                                        
abi = a.buffer_info()                                                         
  
spu.il(r_lsa, 0x1000)               # Local Store address 0x1000
load_word(code, r_mma, abi[0])      # Main Memory address of array a          
spu.il(r_size, a.itemsize * abi[1]) # Size of array a in bytes                
spu.il(r_tag, 12)                   # DMA tag 12                              
  
# Issue a DMA GET command
dma.mfc_get(code, r_lsa, r_mma, r_size, r_tag)                                
  
# Wait for completion
# Set the completion mask; here we complete tag 12                            
spu.il(r_tag, 1 << 12)
dma.mfc_write_tag_mask(code, r_tag)                                           
dma.mfc_read_tag_status_all(code)                                             
                                                                                
  
# Set the parameters for a PUT command                                        
bbi = b.buffer_info()                                                         
  
spu.il(r_lsa, 0x1000)               # Local Store address 0x1000
load_word(code, r_mma, bbi[0])      # Main Memory address of array b          
spu.il(r_size, b.itemsize * bbi[1]) # Size of array b in bytes                
spu.il(r_tag, 12)                   # DMA tag 12                              
  
# Issue a DMA PUT command
dma.mfc_put(code, r_lsa, r_mma, r_size, r_tag)                                
  
# Wait for completion
# Set the completion mask; here we complete tag 12                            
spu.il(r_tag, 1 << 12)
dma.mfc_write_tag_mask(code, r_tag)                                           
dma.mfc_read_tag_status_all(code)                                             
                                                                                
code.release_register(r_lsa)
code.release_register(r_mma)
code.release_register(r_size)
code.release_register(r_tag)
 
# Execute the code                                                            
proc.execute(code)
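
The address and size arithmetic in the example can be tried without Cell hardware. The sketch below uses the standard library array module as a stand-in, on the assumption (suggested by the buffer_info() and itemsize usage above) that extarray follows the same interface. It also computes how far the buffer is from a 16-byte boundary; a standard array makes no alignment promise, which is presumably the kind of issue the Extended Array class avoids.

```python
import array

# Standard library stand-in for extarray.extarray('i', range(0, 32))
a = array.array('i', range(0, 32))

# buffer_info() returns (address, number of elements)
address, length = a.buffer_info()

# Transfer size in bytes, computed as in the example above
size = a.itemsize * length

# Offset from the previous 16-byte boundary; 0 means 16-byte aligned
misalignment = address % 16
```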

In the example above, issuing DMA commands requires a fairly significant amount of code. To help with that, some higher-level DMA utility routines are also available (located in corepy/arch/spu/lib/dma.py). The example program below does the same thing as the example above, except that it uses the higher-level routines.

import corepy.lib.extarray as extarray
import corepy.arch.spu.isa as spu
import corepy.arch.spu.platform as env
import corepy.arch.spu.lib.dma as dma
from corepy.arch.spu.lib.util import load_word
 
a = extarray.extarray('i', range(0, 32))
b = extarray.extarray('i', [0 for i in range(0, 32)])
code = env.InstructionStream()
proc = env.Processor()
 
spu.set_active_code(code)
 
# Issue a DMA GET command and wait for completion
abi = a.buffer_info()
dma.mem_get(code, 0x1000, abi[0], abi[1] * a.itemsize, 12)
dma.mem_complete(code, 12)
 
# Issue a DMA PUT command and wait for completion
bbi = b.buffer_info()
dma.mem_put(code, 0x1000, bbi[0], bbi[1] * b.itemsize, 12)
dma.mem_complete(code, 12)
 
# Execute the code
proc.execute(code)

Note that values are passed directly to the mem_get(), mem_put(), and mem_complete() routines instead of initializing and passing registers. In this case, the routines automatically acquire registers and initialize them for you. If this is not desirable (e.g. for performance reasons), registers pre-initialized with the desired values may be passed in instead. If a register is passed to the mem_complete() routine, it is assumed to hold a tag completion mask (e.g. 1 << 12) rather than a tag value (just 12). This way the high-level routines still allow multiple tags to be completed simultaneously.
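
A completion mask covering several tags is the bitwise OR of 1 << tag for each tag. The sketch below shows that computation in plain Python. The helper tag_mask() is hypothetical and not part of CorePy, and the 0 to 31 range check reflects the 32 MFC tag groups on the Cell.

```python
def tag_mask(tags):
    """Build a tag completion mask with one bit set per DMA tag."""
    mask = 0
    for tag in tags:
        if not 0 <= tag <= 31:
            raise ValueError("DMA tag out of range: %r" % (tag,))
        mask |= 1 << tag
    return mask


single = tag_mask([12])       # same value as 1 << 12
combined = tag_mask([3, 12])  # completes tags 3 and 12 together
```

A mask with high bits set may not fit in the immediate field used by spu.il(), so loading it into a register would presumably call for load_word() instead.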

TODO - DMA with memory desc, iterators

Direct Local Store Access

An SPU's local store can be accessed directly from Python, even while the SPU is running. This is possible because the Linux kernel provides a means of mapping local store directly into the process's virtual address space. Once mapped, both the PPU and other SPUs have full read/write access to an SPU's local store by reading/writing a special region of memory, using the same mechanisms that would normally be used for memory access.

A pointer to this region, represented in Python as an integer, is contained in the 'spuls' member variable of the identifier returned by Processor.execute() when asynchronous mode is enabled (async = True). An ExtArray object referring to an SPU's local store can be set up as follows:

import corepy.lib.extarray as extarray
import corepy.arch.spu.platform as env
 
proc = env.Processor()
 
# code is an InstructionStream created elsewhere
id = proc.execute(code, async = True)
 
arr = extarray.extarray('I', 262144 / 4)
arr.set_memory(id.spuls)
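
With the local store mapped as an array of 4-byte words, a local store byte address corresponds to an array index of address / 4. The sketch below shows that conversion in plain Python. The helper ls_word_index() is hypothetical and not part of CorePy, and the 262144-byte size is taken from the example above.

```python
LOCAL_STORE_SIZE = 262144   # bytes, as used in the example above
WORD_SIZE = 4               # bytes per 'I' element


def ls_word_index(ls_address):
    """Convert a local store byte address into an index into the word array."""
    if not 0 <= ls_address < LOCAL_STORE_SIZE:
        raise ValueError("address outside local store: %r" % (ls_address,))
    if ls_address % WORD_SIZE != 0:
        raise ValueError("address is not word aligned: %r" % (ls_address,))
    return ls_address // WORD_SIZE


# Data transferred to local store address 0x1000 (as in the DMA examples)
# would start at this index of the array.
index = ls_word_index(0x1000)
```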

A similar thing can be done with ExtBuffer objects, allowing anything that supports buffer objects to be used to represent local store. The example below creates a NumPy array using an ExtBuffer to access local store.

import corepy.lib.extarray as extarray
import numpy
 
# ctx is the identifier returned by proc.execute(code, async = True)
buf = extarray.extbuffer(262144, memory = ctx.spuls)
array = numpy.frombuffer(buf, dtype=numpy.int32)
 
# Change the array shape so that the second dimension
# represents the elements of each vector; each entry in
# the first dimension represents the same data as an
# SPU register.
array.shape = (16384, 4)
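
With the (16384, 4) shape, each row covers 16 bytes of local store and each column is one 32-bit element of that vector. The sketch below maps a local store byte address to a (row, column) pair in such a view. The helper ls_vector_position() is hypothetical and not part of CorePy or NumPy.

```python
def ls_vector_position(ls_address):
    """Map a word-aligned local store address to (row, column) in a (16384, 4) view."""
    word = ls_address // 4    # index of the 32-bit element
    return divmod(word, 4)    # (16-byte vector index, element within the vector)


first = ls_vector_position(0x1000)
second = ls_vector_position(0x1004)
```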

Inter-SPU Communication

Mailboxes & Signals

Each SPU has several mailboxes that may be read/written, as well as two incoming signal channels. Routines for reading/writing local mailboxes & signals on the SPU are available in corepy/arch/spu/lib/dma.py. The SPU environment module provides access to routines for reading/writing SPU mailboxes from the PPU as well.

The example below first synthesizes an SPU program that writes a value to its outbound mailbox and waits for a signal before terminating. While the program is executing, the mailbox value is read on the PPU and a signal is sent to the SPU.

import corepy.arch.spu.isa as spu
import corepy.arch.spu.platform as env
import corepy.arch.spu.lib.dma as dma
from corepy.arch.spu.lib.util import load_word
 
code = env.InstructionStream()
proc = env.Processor()
 
# Grab a register and initialize it
reg = code.acquire_register()
load_word(code, reg, 0xCAFEBABE)
 
# Write the value to the outbound mailbox
dma.spu_write_out_mbox(code, reg)
 
# Wait for a signal
sig = dma.spu_read_signal1(code)
 
code.release_register(sig)
code.release_register(reg)
 
 
# Start the synthesized SPU program
id = proc.execute(code, async = True)
 
# Spin until the mailbox can be read
while env.spu_exec.stat_out_mbox(id) == 0: pass
value = env.spu_exec.read_out_mbox(id)
 
# Signal the SPU
env.spu_exec.write_signal(id, 1, 0x1234)
 
# Wait for the SPU program to complete
proc.join(id)
 
print "value 0x%X" % value

Note that stat_out_mbox(), read_out_mbox(), and write_signal() are C routines compiled into a SWIG module (the 'spu_exec' module). Rather than importing the SWIG module directly, the recommended way is to access it via the platform environment module, as done in the example above. The file corepy/arch/spu/platform/linux_spufs/spu_exec.h contains the PPU-side mailbox routines. For mailbox & signal communication, the following routines are implemented:

read_out_mbox(id)
read_out_ibox(id)

Return an entry from the respective mailbox for SPU id if available.

write_in_mbox(id, data)

Write data to the inbound mailbox for SPU id.

stat_out_mbox(id)
stat_out_ibox(id)
stat_in_mbox(id)

Return the number of entries in the respective mailbox for SPU id.

write_signal(id, which, data)

Write signal value 'data' to signal channel 'which' (1 or 2) on SPU id.


Performance Measurement

An obvious way to time a synthesized SPU program is to wrap a Processor.execute() call in a pair of Python time.time() calls and compute the difference. However, this includes SPU startup/shutdown overhead, which is usually not desirable. Instead, a timing source called the SPU decrementer is available. On the Cell chips in the Sony PS3, the decrementer register ticks at a rate of 79.8 MHz (visible in /proc/cpuinfo). In CorePy, several routines for managing the SPU decrementer can be found in corepy/arch/spu/lib/dma.py. Here is an example illustrating how to time a block of code:

import corepy.arch.spu.isa as spu
import corepy.arch.spu.platform as env
import corepy.arch.spu.lib.dma as dma
 
code = env.InstructionStream()
proc = env.Processor()
 
# Set the initial decrementer value to 2^31 - 1 and start it
dma.spu_write_decr(code, 0x7FFFFFFFl)
dma.spu_start_decr(code)
 
# Code to be timed goes here
 
reg = dma.spu_read_decr(code)
dma.spu_stop_decr(code)
 
code.add(spu.ori(code.gp_return, reg, 0))
code.release_register(reg)
 
# Run the code and grab the ticks as the return value
ticks = proc.execute(code, mode = 'int')
 
# Compute wall time in milliseconds
ticks = 0x7FFFFFFFl - ticks
ticktime = (ticks / 79800000.0) * 1000.0
 
print "SPU Time: %0.5f ms, %d ticks" % (ticktime, ticks)
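
The tick arithmetic can be factored into small helpers and checked without Cell hardware. The sketch below is plain Python; the helper names are hypothetical and not part of CorePy. It assumes the 79.8 MHz PS3 rate quoted above, and it masks the difference to 32 bits on the assumption that the decrementer is a 32-bit counter that counts down and wraps at most once during the measurement.

```python
DECREMENTER_HZ = 79800000.0   # PS3 decrementer rate quoted above


def elapsed_ticks(start, end):
    """Ticks elapsed on a down-counting 32-bit decrementer (one wrap tolerated)."""
    return (start - end) & 0xFFFFFFFF


def ticks_to_ms(ticks, hz=DECREMENTER_HZ):
    """Convert decrementer ticks to milliseconds."""
    return ticks / hz * 1000.0


# Stand-in values: the counter started at 0x7FFFFFFF and was read 79800 ticks later.
ticks = elapsed_ticks(0x7FFFFFFF, 0x7FFFFFFF - 79800)
ms = ticks_to_ms(ticks)
```

Starting from 0x7FFFFFFF at this rate, the counter takes roughly 27 seconds to reach zero, so longer code regions would need a different starting value or wrap handling.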