IBM CELL/B.E. CHALLENGE 2007
Brain Circuit Bottom-Up Engine Simulation and Acceleration using IBM CELL Processor for Vision Applications

Introduction
Humans outperform computers on many natural tasks, including vision. Given the human ability to recognize objects rapidly and almost effortlessly, it is pragmatically sensible to study and attempt to imitate the algorithms used by the brain. Analysis of the anatomical structure and physiological operation of brain circuits has led to the derivation of novel algorithms that, in initial studies, successfully address issues of known difficulty in visual processing. These algorithms are slow on conventional uniprocessors, but, as might be expected of algorithms designed for highly parallel brain architectures, they are intrinsically parallel and lend themselves to efficient implementation across multiple processors. This paper presents an implementation of such parallel algorithms on a CELL processor and demonstrates the potential to support the massive parallelism inherent in these algorithms by exploiting the CELL's parallel instruction set and by further extending the implementation to low-cost clusters built from Sony PlayStation 3 (PS3) consoles. The results show that a parallel implementation can achieve more than a 100x performance improvement on a cluster of three PS3s. These initial findings, while highly promising in their own right, also provide a new platform enabling extended investigation of much larger-scale brain circuit models. Early prototyping of such large-scale models has yielded evidence of their efficacy in the scale-invariant recognition of time-varying, partially occluded objects in arbitrary scenes.
Background
The modeling exercises performed in our work focus on simulating a region sensitive to particular shapes formed by three line segments (line triples), because preliminary studies suggest their utility, although two or four line segments could just as easily have been used (and indeed provide a subject for future study). These regions represent the crossover from line segment extraction (a subject well studied in contemporary machine vision) to detecting conjunctions of line segments, which form the first levels of the hierarchical representations of the neocortex. Although our understanding of these hierarchies is still in its infancy, they are a crucial milestone toward a successful implementation of the further downstream brain areas that operate on abstract concepts and perform decision making.

The most well understood encoding scheme used in the brain is the sparse population code (SC, also referred to here as LSH). In this scheme, two shapes are only as similar as the number of activated axons they share in their neural activation patterns. Because the code is sparse, most neurons are inactive, so an activation pattern can be represented as a bit-vector with mostly zero entries. This scheme has been observed to represent information such as joint position, movement intention, location, and, most importantly for this work, the angle of visual shape orientation.

The neocortical computational module whose operation enjoys the most agreement is termed the competitive network. The model is derived from the canonical local circuit of the neocortex. Neurons activate in response to input stimuli falling within their receptive fields (RF), that is, the specific shapes to which they respond. Roughly speaking, visual stimuli are first sent to excitatory neurons, each of which activates at a level determined by its synapses (its memory). These neurons represent the most basic elements, such as arrangements of line triples that are common to very many objects; they behave as shape detectors. After a training period, these neurons no longer need to learn. This training period corresponds to child development, a process during which half of all synapses die off, and presumably only the strongest remain. Hence, synapses can be modeled as either present or absent, and each neuron's level of activation can be modeled as a bit-vector.
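To make the encoding concrete, the sketch below is a hypothetical illustration (not code from the model): it assumes a 1280-bit vector, as in the RF1 encoding of Figure 1, packed into 64-bit words, and a GCC-style popcount builtin. It shows how similarity under a sparse population code reduces to counting the active bits two patterns share.

#include <stdint.h>

/* Illustrative sketch only: the type and function names are hypothetical,
 * and the 1280-bit width (20 x 64-bit words) follows Figure 1's RF1 encoding. */
#define SC_WORDS 20                      /* 20 x 64 = 1280 bits */

typedef struct { uint64_t bits[SC_WORDS]; } sc_vector;

/* Similarity of two sparse activation patterns: the number of active
 * bits (shared activated axons) present in both patterns. */
static int sc_overlap(const sc_vector *a, const sc_vector *b)
{
    int shared = 0;
    for (int i = 0; i < SC_WORDS; i++)
        shared += __builtin_popcountll(a->bits[i] & b->bits[i]);
    return shared;
}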
In our model, we use 8192 neurons to represent this set and term it RF1. Fig. 1 shows how this component computes activations and can choose the best K neural matches, those with the highest activation levels. This selection is performed by the K-Winner-Take-All (K-WTA) method, which our model incorporates (see the K-WTA block in Fig. 1). Finally, these K winners activate an additional set of excitatory neurons, whose activations can be used in top-down processing to determine what type of objects are present in the field of view. In our version, we use 1024 of these neurons, term them RF2, and use them to define a particular class of objects (see Fig. 1's RF2 Compute block). As new objects are viewed, these neurons learn. Additional classes can be defined by additional sets of neurons. In the actual brain, many classes of objects are recognizable, even new ones, and the number of neurons used is much larger.
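The K-WTA selection itself does not require sorting. The following sketch is hedged: the function and array names are hypothetical, and the 0-160 match range follows Figure 1. It builds a histogram of match values and lowers the threshold until roughly K = 512 of the 8192 RF1 neurons pass, mirroring the K-WTA Compute step in the Appendix pseudocode.

#include <string.h>

#define N_RF1      8192
#define MAX_MATCH  160     /* maximum possible RF1 match value (see Figure 1) */
#define K          512

/* Mark the ~K most active RF1 neurons: histogram the match values, then
 * scan down from the top value until about K neurons have been accepted. */
static void kwta_select(const int match[N_RF1], unsigned char winner[N_RF1])
{
    int hist[MAX_MATCH + 1];
    memset(hist, 0, sizeof hist);
    for (int i = 0; i < N_RF1; i++)
        hist[match[i]]++;

    int accepted = 0, threshold = MAX_MATCH;
    while (threshold > 0 && accepted < K) {
        accepted += hist[threshold];
        threshold--;                    /* lower the bar until ~K neurons pass */
    }

    for (int i = 0; i < N_RF1; i++)
        winner[i] = (match[i] > threshold) ? 1 : 0;
}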
Application Analysis
In this section we describe our systematic approach to mapping the application onto the CELL processor. A simple functional model of the application on the IBM CELL is given in Figure 3 (see the Appendix for the pseudocode and Figure 1 for a detailed description of each stage). The bottom-up (BU) engine was executed on a 2.13 GHz Intel Core2 (E6400) CPU. A fractional breakdown of execution time into functional bottom-up code blocks is shown in Figure 4. Approximately 1.89 ms was required to execute the BU computation for a single line segment triple. This corresponds to a BU computation throughput of about 526 line segment triples per second (526 LST/sec). In Figure 4 we can observe that the RF1 vector computations and RF2 vector computations take more than 95% of the overall execution time. These are the critical functions that need to be optimized and parallelized to increase the throughput of the BU computation. From these results, the runtime for processing an entire video frame with 30,000 line triples can be estimated at approximately one minute, crucially limiting the experiments that can be run on the system and restricting study of the downstream brain areas. To enable experiments using interactive robots, the recognition time of humans would need to be approached (approximately 500 ms), requiring a speedup of about 120x.
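The 120x target follows directly from these measurements. The short calculation below is only an illustration of that arithmetic (assuming 30,000 line triples per frame, as stated above); it is not part of the system.

#include <stdio.h>

int main(void)
{
    const double ms_per_triple     = 1.89;    /* measured per-triple BU time on the Core2 E6400 */
    const double triples_per_frame = 30000.0; /* assumed typical frame, as in the text */
    const double target_ms         = 500.0;   /* approximate human recognition time */

    double throughput = 1000.0 / ms_per_triple;            /* ~529 LST/sec, close to the 526 reported */
    double frame_ms   = ms_per_triple * triples_per_frame; /* ~56,700 ms, i.e. about one minute */
    double speedup    = frame_ms / target_ms;              /* ~113x, i.e. roughly the 120x quoted */

    printf("throughput %.0f LST/s, frame time %.1f s, required speedup %.0fx\n",
           throughput, frame_ms / 1000.0, speedup);
    return 0;
}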
Optimization
Beyond the higher-level parallel models on the CELL, we implemented various programming optimizations to improve performance, namely:
• DMA alignment optimization: DMA operations on the CELL can have a size of 1, 2, 4, 8, or 16 bytes, or multiples of 16 bytes. If a particular transaction's address crosses a 128-byte boundary, the transfer must be completed through additional DMA transactions. Hence, by carefully aligning important data that is communicated regularly, the overall communication bandwidth required by the application can be significantly reduced. However, if DMA alignment optimization is carried out on too many data sets, it results in significant wastage of precious SPE local store (LS) memory.
• Mailbox vs. signaling mechanism optimization: The CELL provides several means of communication and synchronization, such as regular DMA operations, mailboxes, and signaling mechanisms. The mailbox mechanism allows 32-bit communication but takes about 0.1 us per 32-bit transaction (taking into account setup and various application code overheads). The signal notification mechanism allows 1-bit synchronization between the SPE and PPE at much higher speed. Hence, selecting this mechanism results in lower synchronization cost.
• Branch optimization: The SPE does not have a branch prediction unit and assumes sequential program flow. Thus, pipeline stalls due to mispredicted branches can be very costly (on the order of 18 to 19 cycles). Techniques such as function inlining, loop unrolling, and branch hinting help to reduce the branch misprediction overhead.
• SIMD instruction optimization: After the above optimizations, and using special 128-bit SIMD instructions such as absdiff, abs_sumb, spu_add, and spu_cntb (see IBM (2005)), one RF1 inner-loop computation takes on average 31.2 cycles (the desktop PC version takes 164 cycles), and one RF2 inner-loop computation takes about 246 cycles (the desktop PC version takes about 2879 cycles). Thus, with the superior 128-bit SIMD instruction set and SPE instruction fine-tuning, impressive speedups are possible; a small sketch of such an inner loop appears after this list.
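As an illustration of the kind of SIMD mapping involved, the sketch below shows how the RF2 step (which, per the Appendix pseudocode, reduces to an AND followed by a population count over 8192 bits) might be expressed with 128-bit SPU intrinsics. This is a hedged, simplified sketch rather than the tuned loop measured above: the buffer and function names are hypothetical, the scalar reduction would itself be vectorized in tuned code, and the alignment attribute stands in for the DMA alignment optimization discussed in the first bullet.

#include <spu_intrinsics.h>

#define QWORDS_PER_RF2 64   /* 64 x 128 bits = 8192 bits */

/* Local-store copy of the 8192-bit K-WTA output vector.  Declaring it
 * 128-byte aligned keeps the DMA transfer that fills it from straddling
 * a 128-byte boundary (hypothetical buffer name). */
static vector unsigned char mid_ls[QWORDS_PER_RF2] __attribute__((aligned(128)));

/* Sketch of one RF2 inner loop: popcount(midVector AND RF2[i]),
 * processed one 128-bit quadword at a time. */
static unsigned int rf2_match(const vector unsigned char rf2[QWORDS_PER_RF2])
{
    unsigned int match = 0;
    int q, b;
    for (q = 0; q < QWORDS_PER_RF2; q++) {
        vector unsigned char shared = spu_and(mid_ls[q], rf2[q]); /* shared active bits */
        vector unsigned char counts = spu_cntb(shared);           /* per-byte popcounts */
        for (b = 0; b < 16; b++)
            match += spu_extract(counts, b);                      /* scalar reduction */
    }
    return match;
}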
Performance
• It should be noted that no special optimizations using SSE2 or MMX instructions were carried out on the desktop PC running the Intel Core2 (E6400).
TABLE 1: Actual performance of BDV on the desktop PC versus the PS3 with a single CELL
Function name | Serial model on 2.13 GHz Intel Core2 (E6400) (us) | Optimized parallel model on a single CELL (PS3) (us)
Figure 1: This figure depicts the process of converting line-triple representations into detected shapes, with vectors depicted as circles and operations as horizontal lines with a centered symbol. Step 1 is the output of the thalamocortical and corticocortical loop processes (see text), which convert line segments into a somewhat scale-invariant and translation-invariant bit-vector representation. The representation is a sparse encoding, 160 on-bits out of 1280 bits, representing the angles to the line segment endpoints from some reference point for the line triple (e.g., a corner of the bounding box). Slight changes in the orientation of the line segments, or movement of a line endpoint, lead to decreasing similarity between the bit-vectors derived from the naïve and modified shapes. Row 3 depicts the approximately 8,192 vectors of 1280 bits each (RF1 vectors), each of which was previously and maximally trained on a single input. Row 2 indicates that the dot product is computed between the input vector and all RF1 vectors. Row 4 is the resulting matches (8,192 match values, each between 0 and 160). Row 5 depicts the application of a threshold for the K-WTA operation, with K=512. Row 6 shows the resulting 512-hot, 8,192-bit vector, which provides input into the next set of 1,024 vectors (RF2 vectors, 1024-2048 hot of 8,192, Row 8, trained on multiple inputs and performing more complex shape detection). Row 7 indicates the dot-product operation between the input in Row 6 and all RF2 vectors. Row 9 indicates the match values, ranging between 0 and 512, indicating how well each RF2 vector matched the input. The process is repeated for each line triple in a frame, typically numbering 20,000-90,000 for frames with 300-600 line segments. Depending on downstream activity, different match values will suffice to support backward processes searching for particular shapes.
Figure 2: The hierarchical organization of shape detectors operating on the optical character for the number eight. At the first level, line segments are detected, similar to the early visual areas. The second level is the portion of the model most studied in this work; it involves converting line triples into bit-vectors via an approximation of thalamocortical and corticocortical loops, and two layers of feedforward competitive networks. Levels 3 and 4 are under preliminary study and were implemented such that viewing a recognizable object allowed the activity of shape detectors in level 2 to be predicted from the activity of other shape detectors in level 2. Each individual grid square represents a shape processor. Only a few are depicted as active in this example, but all could be active simultaneously, performing the high degree of parallel processing responsible for the speed of human visual recognition. Each shape processor in level 2 was modeled as detecting the same set of shapes, reducing memory requirements.
Figure 3: Functional model of the bottom-up engine without task-level parallelism. Stages: (A) Compute SC, (B) RF1 Process Input, (C) RF1 Activate Pattern, (D) RF2 Process Input, (E) Output Stage.
Figure 4: Performance comparison between the different function blocks of the bottom-up engine on an Intel Core2 (E6400) running at 2.13 GHz.
Figure 5: Parallel model for the brain-derived vision algorithm; its blocks include Compute SC, Global histogram generate, Partial to Global activation Pattern, multiple parallel RF1 Process Input tasks, RF1 Activate Pattern, and RF2 Process Input. The shaded box corresponds to the serial portion of the code.
References
1. Ambros-Ingerson, J., Granger, R., and Lynch, G. (1990). Simulation of paleocortex performs hierarchical clustering. Science, 247: 1344-1348.
2. Duda, R., Hart, P., Stork, D. (2000). Pattern Classification. John Wiley & Sons, Inc.
3. Carmichael, O. (2003). Discriminative Techniques for the Recognition of Complex-Shaped Objects. PhD Thesis, The Robotics Institute, Carnegie Mellon University, Technical Report CMU-RI-TR-03-34.
4. Carmichael, O., Hebert, M. WORD: Wiry Object Recognition Database. Carnegie Mellon University; http://www.cs.cmu.edu/~owenc/word.htm, retrieved December 20, 2006.
5. Coultrip, R., Granger, R., and Lynch, G. (1992). A cortical model of winner-take-all competition via lateral inhibition. Neural Networks, 5: 47-54.
6. de Cock, E. (2000). YAPI: application modeling for signal processing systems. DAC-2000.
7. Eichenberger, A., O'Brien, K., et al. (2005). Optimizing Compiler for a CELL Processor. IEEE PACT 2005.
8. Gschwind, M., Hofstee, H.P., et al. (2006). Synergistic Processing in CELL's Multicore Architecture. IEEE Micro.
Figure 6: Sample output from the recognition of a stool using the WORD database, running on the PS3/IBM CELL.
9. IBM (2005). CELL BE Programming Handbook and Programming Tutorial.
10. Lee, E., Parks, T. (1995). Dataflow Process Networks. Proceedings of the IEEE, May 1995.
11. Kovesi, P. D. MATLAB and Octave Functions for Computer Vision and Image Processing. School of Computer Science & Software Engineering, The University of Western Australia; http://www.csse.uwa.edu.au/~pk/research/matlabfns/, retrieved December 2006.
12. Perrett, D., Oram, M. (1993). Neurophysiology of shape processing. Image Vis. Comput. 11, 317–333.
13. Rodriguez, A., Whitson, J., Granger, R. (2004). Derivation and analysis of basic computational operations of thalamocortical circuits. Journal of Cognitive Neuroscience, 16: 856-877.
14. Wallis, G., Rolls, E. (1997). A model of invariant object recognition in the visual system. Prog. Neurobiol. 51, 167–194.
15. Wiskott, L., Fellous, J., Kruger, N., von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. In Gerald Sommer, Kostas Daniilidis, and Josef Pauli, editors, Proc. 7th Intern. Conf. on Computer Analysis of Images and Patterns, CAIP'97, Kiel, number 1296, pages 456–463, Heidelberg. Springer-Verlag.
Appendix: Pseudo-code
Initialize:
    load RF1 with 8192 entries of size 120 bits (ROM)
    load RF2 with 1024 entries of size 8192 bits (RAM)

Capture Video Frame:
    extract 400 to 600 line segments (32 bits each)
    group 20,000 to 100,000 line triples (30 bits each),
        depending on scene complexity

Calculate LSH:
    let lshVector be a locality sensitive hash (LSH)
        based on frame and line data

RF1 Compute:
    // Computes the dot product of the RF1 neurons with a 120-bit vector
    // based on input frame data; additionally builds a histogram used to
    // set a threshold so that about 512 RF1 neurons are active
    popRAM[] = array of 8192 elements of 7 bits each
    sums[]   = array of 121 elements of 13 bits each
    for i = 1 ... 8192
        popCount = 0
        v = lshVector            // working copies; six bits per field
        r = RF1[i]
        for j = 1 ... 20
            popCount = popCount + Max(0, 8 - Abs((v & 0x3F) - (r & 0x3F)))
            v = v >> 6
            r = r >> 6
        popRAM[i] = popCount
        sums[popCount] = sums[popCount] + 1

K-WTA Compute:
    // Computes a vector indicating which RF1 neurons are receptive
    totalSum = 0
    i = 120
    while totalSum < 512 && i != 0
        totalSum  = totalSum + sums[i]
        threshold = i
        i = i - 1
    threshold2 = popCount / 4
    for i = 1 ... 8192
        if popRAM[i] > threshold
            midVector[i] = 1
        else
            midVector[i] = 0

RF2 Compute:
    // Computes dot products between the resulting midVector and all RF2
    // neurons; outputs the RF2 index if the population count of the dot
    // product is over threshold2
    popCountROM[] = array of 2^8192 elements of 13 bits each
        // (ROM can be distributed to reduce size)
    currentMax = 0
    for i = 1 ... 1024
        popCount = popCountROM[midVector & RF2[i]]
        if popCount > currentMax
            currentMax = popCount
        if popCount >= threshold2
            output index i
    output threshold2
    output currentMax

Evaluate Learning:
    input results from RF2 Compute
    analyze highly receptive neurons to determine if they should learn
    modify RF2 neurons (to be similar to frame data) based on learning

RF2 Update:
    if applicable, send updates to RF2 as a result of learning