Chapter 2
High-Performance Computer Architectures for Remote Sensing Data Analysis: Overview and Case Study
Antonio Plaza,
University of Extremadura, Spain
Chein-I Chang,
University of Maryland, Baltimore
Contents
2.1 Introduction
2.2 Related Work
2.2.1 Evolution of Cluster Computing in Remote Sensing
2.2.2 Heterogeneous Computing in Remote Sensing
2.2.3 Specialized Hardware for Onboard Data Processing
2.3 Case Study: Pixel Purity Index (PPI) Algorithm
2.3.1 Algorithm Description
2.3.2 Parallel Implementations
2.3.2.1 Cluster-Based Implementation of the PPI Algorithm
2.3.2.2 Heterogeneous Implementation of the PPI Algorithm
2.3.2.3 FPGA-Based Implementation of the PPI Algorithm
2.4 Experimental Results
2.4.1 High-Performance Computer Architectures
2.4.2 Hyperspectral Data
2.4.3 Performance Evaluation
2.4.4 Discussion
2.5 Conclusions and Future Research
2.6 Acknowledgments
References
Advances in sensor technology are revolutionizing the way remotely sensed data are collected, managed, and analyzed. In particular, many current and future applications of remote sensing in earth science, space science, and soon in exploration science require real- or near-real-time processing capabilities. In recent years, several efforts have been directed towards the incorporation of high-performance computing (HPC) models to remote sensing missions. In this chapter, an overview of recent efforts in the design of HPC systems for remote sensing is provided. The chapter also includes an application case study in which the pixel purity index (PPI), a well-known remote sensing data processing algorithm, is implemented in different types of HPC platforms, such as a massively parallel multiprocessor, a heterogeneous network of distributed computers, and a specialized field programmable gate array (FPGA) hardware architecture. Analytical and experimental results are presented in the context of a real application, using hyperspectral data collected by NASA's Jet Propulsion Laboratory over the World Trade Center area in New York City, right after the terrorist attacks of September 11th. Combined, these parts deliver an excellent snapshot of the state of the art of HPC in remote sensing, and offer a thoughtful perspective on the potential and emerging challenges of adapting HPC paradigms to remote sensing problems.
2.1 Introduction
The development of computationally efficient techniques for transforming the massive amount of remote sensing data into scientific understanding is critical for space-based earth science and planetary exploration [1]. The wealth of information provided by latest-generation remote sensing instruments has opened groundbreaking perspectives in many applications, including environmental modeling and assessment for Earth-based and atmospheric studies, risk/hazard prevention and response including wildland fire tracking, biological threat detection, monitoring of oil spills and other types of chemical contamination, target detection for military and defense/security purposes, urban planning and management studies, etc. [2]. Most of the above-mentioned applications require analysis algorithms able to provide a response in real or near-real time. This is quite an ambitious goal in most current remote sensing missions, mainly because the price paid for the rich information available from latest-generation sensors is the enormous amount of data that they generate [3, 4, 5].

A relevant example of a remote sensing application in which the use of HPC technologies such as parallel and distributed computing is highly desirable is hyperspectral imaging [6], in which an imaging spectrometer collects hundreds or even thousands of measurements (at multiple wavelength channels) for the same area on the surface of the Earth (see Figure 2.1). The scenes provided by such sensors are often called "data cubes," to denote the extremely high dimensionality of the data. For instance, the NASA Jet Propulsion Laboratory's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) [7] is now able to record the visible and near-infrared spectrum (wavelength region from 0.4 to 2.5 micrometers) of the reflected light of an area 2 to 12 kilometers wide and several kilometers long using 224 spectral bands (see Figure 2.1). The resulting cube is a stack of images in which each pixel (vector) has an associated spectral signature or 'fingerprint' that uniquely characterizes the underlying objects, and the resulting data volume typically comprises several GBs per flight.
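To give a rough sense of scale, consider a hypothetical scene of $2048 \times 2048$ pixels (the spatial dimensions here are assumed purely for illustration and are not AVIRIS specifications) stored at 2 bytes per measurement:

$$2048 \times 2048 \text{ pixels} \times 224 \text{ bands} \times 2 \text{ bytes} \approx 1.9 \text{ GB},$$

so a flight line comprising several such scenes quickly reaches the multi-GB data volumes quoted above.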
Figure 2.1 The concept of hyperspectral imaging in remote sensing. [The figure shows spectral signatures, reflectance versus wavelength (300-2400 nm), for a pure pixel (water) and a mixed pixel (soil + rocks).]
Although hyperspectral imaging is a good example of the computational requirements introduced by remote sensing applications, there are many other remote sensing areas in which high-dimensional data sets are also produced (several of them are covered in detail in this book). However, the extremely high computational requirements already introduced by hyperspectral imaging applications (and the fact that these systems will continue increasing their spatial and spectral resolutions in the near future) make them an excellent case study to illustrate the need for HPC systems in remote sensing, and they will be used in this chapter for demonstration purposes.

Specifically, the utilization of HPC systems in hyperspectral imaging applications has become more and more widespread in recent years. The idea developed by the computer science community of using COTS (commercial off-the-shelf) computer equipment, clustered together to work as a computational "team," is a very attractive solution [8]. This strategy is often referred to as Beowulf-class cluster computing [9] and has already offered access to greatly increased computational power, but at a low cost (commensurate with falling commercial PC costs), in a number of remote sensing applications [10, 11, 12, 13, 14, 15]. In theory, the combination of commercial forces driving down cost and positive hardware trends (e.g., CPU peak power doubling every 18-24 months, storage capacity doubling every 12-18 months, and networking bandwidth doubling every 9-12 months) offers supercomputing performance that can now be applied to a much wider range of remote sensing problems.
Although most parallel techniques and systems for image information processing employed by NASA and other institutions during the last decade have chiefly been homogeneous in nature (i.e., they are made up of identical processing units, thus simplifying the design of parallel solutions adapted to those systems), a recent trend in the design of HPC systems for data-intensive problems is to utilize highly heterogeneous computing resources [16]. This heterogeneity is seldom planned, arising mainly as a result of technology evolution over time and computer market sales and trends. In this regard, networks of heterogeneous COTS resources can realize a very high level of aggregate performance in remote sensing applications [17], and the pervasive availability of these resources has resulted in the current notion of grid computing [18], which endeavors to make such distributed computing platforms easy to utilize in different application domains, much like the World Wide Web has made it easy to distribute Web content. It is expected that grid-based HPC systems will soon represent the tool of choice for the scientific community devoted to very high-dimensional data analysis in remote sensing and other fields.
Finally, although remote sensing data processing algorithms generally map quite nicely to parallel systems made up of commodity CPUs, these systems are generally expensive and difficult to adapt to onboard remote sensing data processing scenarios, in which low-weight and low-power integrated components are essential to reduce mission payload and obtain analysis results in real time, i.e., at the same time as the data are collected by the sensor. In this regard, an exciting new development in the field of commodity computing is the emergence of programmable hardware devices such as field programmable gate arrays (FPGAs) [19, 20, 21] and graphics processing units (GPUs) [22], which can bridge the gap towards onboard and real-time analysis of remote sensing data. FPGAs are now fully reconfigurable, which allows one to adaptively select a data processing algorithm (out of a pool of available ones) to be applied onboard the sensor from a control station on Earth.

On the other hand, the emergence of GPUs (driven by the ever-growing demands of the video-game industry) has allowed these systems to evolve from expensive application-specific units into highly parallel and programmable commodity components. Current GPUs can deliver a peak performance on the order of 360 Gigaflops (Gflops), more than seven times the performance of the fastest x86 dual-core processor (around 50 Gflops). The ever-growing computational demands of remote sensing applications can fully benefit from compact hardware components and take advantage of the small size and relatively low cost of these units as compared to clusters or networks of computers.
The main purpose of this chapter is to provide an overview of different HPC paradigms in the context of remote sensing applications. The chapter is organized as follows:

- Section 2.2 describes relevant previous efforts in the field, such as the evolution of cluster computing in remote sensing applications, the emergence of distributed networks of computers as a cost-effective means to solve remote sensing problems, and the exploitation of specialized hardware architectures in remote sensing missions.
- Section 2.3 provides an application case study: the well-known Pixel Purity Index (PPI) algorithm [23], which has been widely used to analyze hyperspectral images and is available in commercial software. The algorithm is first briefly described and several issues encountered in its implementation are discussed. Then, we provide HPC implementations of the algorithm, including a cluster-based parallel version, a variation of this version specifically tuned for heterogeneous computing environments, and an FPGA-based implementation.
- Section 2.4 also provides an experimental comparison of the proposed implementations of PPI using several high-performance computing architectures. Specifically, we use Thunderhead, a massively parallel Beowulf cluster at NASA's Goddard Space Flight Center; a heterogeneous network of distributed workstations; and a Xilinx Virtex-II FPGA device. The considered application is based on the analysis of hyperspectral data collected by the AVIRIS instrument over the World Trade Center area in New York City right after the terrorist attacks of September 11th.
- Finally, Section 2.5 concludes with some remarks and plausible future research lines.
2.2 Related Work
This section first provides an overview of the evolution of cluster computing architectures in the context of remote sensing applications, from the initial developments in Beowulf systems at NASA centers to the current systems being employed for remote sensing data processing. Then, an overview of recent advances in heterogeneous computing systems is given. These systems can be applied for the sake of distributed processing of remotely sensed data sets. The section concludes with an overview of hardware-based implementations for onboard processing of remote sensing data sets.
2.2.1 Evolution of Cluster Computing in Remote Sensing
Beowulf clusters were originally developed with the purpose of creating a cost-effective parallel computing system able to satisfy specific computational requirements in the earth and space sciences communities. Initially, the need for large amounts of computation was identified for processing multispectral imagery with only a few bands [24]. As sensor instruments incorporated hyperspectral capabilities, it was soon recognized that computer mainframes and mini-computers could not provide sufficient power for processing these kinds of data. The Linux operating system introduced the potential of being quite reliable due to the large number of developers and users. Later it became apparent that large numbers of developers could be a disadvantage as well as an advantage.

In 1994, a team was put together at NASA's Goddard Space Flight Center (GSFC) to build a cluster consisting only of commodity hardware (PCs) running Linux, which resulted in the first Beowulf cluster [25]. It consisted of 16 100 MHz 486DX4-based PCs connected with two hub-based Ethernet networks tied together with channel bonding software, so that the two networks acted like one network running at twice the speed. The next year Beowulf-II, a 16-PC cluster based on 100 MHz Pentium PCs, was built and performed about 3 times faster, but also demonstrated much higher reliability. In 1996, a Pentium-Pro cluster at Caltech demonstrated a sustained Gigaflop on a remote sensing-based application. This was the first time a commodity cluster had shown high-performance potential.

Up until 1997, Beowulf clusters were in essence engineering prototypes; that is, they were built by those who were going to use them. However, in 1997, a project was started at GSFC to build a commodity cluster that was intended to be used by those who had not built it: the HIVE (highly parallel virtual environment) project. The idea was to have workstations distributed among different locations and a large number of compute nodes (the compute core) concentrated in one area. The workstations would share the compute core as though it was a part of each. Although the original HIVE only had one workstation, many users were able to access it from their own workstations over the Internet. The HIVE was also the first commodity cluster to exceed a sustained 10 Gigaflops on a remote sensing algorithm.
Currently, an evolution of the HIVE is being used at GSFC for remote sensing data processing calculations. The system, called Thunderhead (see Figure 2.2), is a 512-processor homogeneous Beowulf cluster composed of 256 dual 2.4 GHz Intel Xeon nodes, each with 1 GB of memory and 80 GB of local disk space. The total peak performance of the system is 2457.6 Gflops. Along with the 512-processor compute core, Thunderhead has several nodes attached to the core via 2 Gbps optical fibre Myrinet.

Figure 2.2 Thunderhead Beowulf cluster (512 processors) at NASA's Goddard Space Flight Center in Maryland.
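The quoted peak figure is consistent with each of the 512 processors completing two floating-point operations per clock cycle (an assumption on our part, made only to show where the number comes from):

$$512 \text{ processors} \times 2.4 \text{ GHz} \times 2 \text{ flops/cycle} = 2457.6 \text{ Gflops}.$$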
NASA is currently supporting additional massively parallel clusters for remote sensing applications, such as the Columbia supercomputer at NASA Ames Research Center, a 10,240-CPU SGI Altix supercluster with Intel Itanium 2 processors, 20 terabytes of total memory, and heterogeneous interconnects including an InfiniBand network and 10-gigabit Ethernet. This system is listed as #8 in the November 2006 version of the Top500 list of supercomputer sites, available online at http://www.top500.org. Among many other examples of HPC systems included in the list that are currently being exploited for remote sensing and earth science-based applications, we cite three relevant systems for illustrative purposes. The first one is MareNostrum, an IBM cluster with 10,240 processors, 2.3 GHz Myrinet connectivity, and 20,480 GB of main memory, available at Barcelona Supercomputing Center (#5 in Top500). Another example is Jaws, a Dell PowerEdge cluster with 3 GHz Infiniband connectivity, 5,200 GB of main memory, and 5,200 processors, available at Maui High-Performance Computing Center (MHPCC) in Hawaii (#11 in Top500). A final example is NEC's Earth Simulator Center, a 5,120-processor system developed by Japan's Aerospace Exploration Agency and the Agency for Marine-Earth Science and Technology (#14 in Top500). It is highly anticipated that many new supercomputer systems will be specifically developed in forthcoming years to support remote sensing applications.
2.2.2 Heterogeneous Computing in Remote Sensing
In the previous subsection, we discussed the use of cluster technologies based on multiprocessor systems as a high-performance and economically viable tool for efficient processing of remotely sensed data sets. With the commercial availability of networking hardware, it soon became obvious that networked groups of machines distributed among different locations could be used together by one single parallel remote sensing code as a distributed-memory machine [26]. Of course, such networks were originally designed and built to connect heterogeneous sets of machines. As a result, heterogeneous networks of workstations (NOWs) soon became a very popular tool for distributed computing with essentially unbounded sets of machines, in which the number and locations of machines may not be explicitly known [16], as opposed to cluster computing, in which the number and locations of nodes are known and relatively fixed.

An evolution of the concept of distributed computing described above resulted in the current notion of grid computing [18], in which the number and locations of nodes are relatively dynamic and have to be discovered at run-time. It should be noted that this section specifically focuses on distributed computing environments without meta-computing or grid computing, which aims at providing users access to services distributed over wide-area networks. Several chapters of this volume provide detailed analyses of the use of grids for remote sensing applications, and this issue is not further discussed here.
There are currently several ongoing research efforts aimed at efficient distributed processing of remote sensing data. Perhaps the simplest example is the use of heterogeneous versions of data processing algorithms developed for Beowulf clusters, for instance, by resorting to heterogeneity-aware variations of homogeneous algorithms, able to capture the inherent heterogeneity of a NOW and to load-balance the computation among the available resources [27]. This framework allows one to easily port an existing parallel code developed for a homogeneous system to a fully heterogeneous environment, as will be shown in the following subsection.

Another example is the Common Component Architecture (CCA) [28], which has been used as a plug-and-play environment for the construction of climate, weather, and ocean applications through a set of software components that conform to standardized interfaces. Such components encapsulate much of the complexity of the data processing algorithms inside a black box and expose only well-defined interfaces to other components. Among several other available efforts, another distributed application framework specifically developed for earth science data processing is the Java Distributed Application Framework (JDAF) [29]. Although the two main goals of JDAF are flexibility and performance, we believe that the Java programming language is not mature enough for high-performance computing of large amounts of data.
2.2.3 Specialized Hardware for Onboard Data Processing
Over the last few years, several research efforts have been directed towards the incorporation of specialized hardware for accelerating remote sensing-related calculations aboard airborne and satellite sensor platforms. Enabling onboard data processing introduces many advantages, such as the possibility to reduce the data down-link bandwidth requirements at the sensor by both preprocessing data and selecting data to be transmitted based upon predetermined content-based criteria [19, 20]. Onboard processing also reduces the cost and the complexity of ground processing systems so that they can be affordable to a larger community. Other remote sensing applications that will soon greatly benefit from onboard processing are future web sensor missions as well as future Mars and planetary exploration missions, for which onboard processing would enable autonomous decisions to be made onboard.

Despite the appealing perspectives introduced by specialized data processing components, current hardware architectures including FPGAs (on-the-fly reconfigurability) and GPUs (very high performance at low cost) still present some limitations that need to be carefully analyzed when considering their incorporation to remote sensing missions [30]. In particular, the very fine granularity of FPGAs is still not efficient, with extreme situations in which only about 1% of the chip is available for logic while 99% is used for interconnect and configuration. This usually results in a penalty in terms of speed and power. On the other hand, both FPGAs and GPUs are still difficult to radiation-harden (currently available radiation-tolerant FPGA devices have two orders of magnitude fewer equivalent gates than commercial FPGAs).
2.3 Case Study: Pixel Purity Index (PPI) Algorithm
This section provides an application case study that is used in this chapter to illustrate different approaches for efficient implementation of remote sensing data processing algorithms. The algorithm selected as a case study is the PPI [23], one of the most widely used algorithms in the remote sensing community. First, the serial version of the algorithm available in commercial software is described. Then, several parallel implementations are given.
2.3.1 Algorithm Description
The PPI algorithm was originally developed by Boardman et al. [23] and was soon incorporated into Kodak's Research Systems ENVI, one of the most widely used commercial software packages by remote sensing scientists around the world. The underlying assumption behind the PPI algorithm is that the spectral signature associated with each pixel vector measures the response of multiple underlying materials at each site. For instance, it is very likely that the pixel vectors shown in Figure 2.1 would actually contain a mixture of different substances (e.g., different minerals, different types of soils, etc.). This situation, often referred to as the "mixture problem" in hyperspectral analysis terminology [31], is one of the most crucial and distinguishing properties of spectroscopic analysis.
Mixed pixels exist for one of two reasons [32]. Firstly, if the spatial resolution of the sensor is not fine enough to separate different materials, these can jointly occupy a single pixel, and the resulting spectral measurement will be a composite of the individual spectra. Secondly, mixed pixels can also result when distinct materials are combined into a homogeneous mixture. This circumstance occurs independent of the spatial resolution of the sensor. A hyperspectral image is often a combination of the two situations, where a few sites in a scene are pure materials, but many others are mixtures of materials.

[Figure 2.3: toy example in a 2-dimensional space in which the data points are projected onto skewers 1, 2, and 3, and the extreme points in the direction of each skewer are marked.]
To deal with the mixture problem in hyperspectral imaging, spectral unmixing techniques have been proposed as an inversion technique in which the measured spectrum of a mixed pixel is decomposed into a collection of spectrally pure constituent spectra, called endmembers in the literature, and a set of correspondent fractions, or abundances, that indicate the proportion of each endmember present in the mixed pixel [6]. The PPI algorithm is a tool to automatically search for endmembers, which are assumed to be the vertices of a convex hull [23]. The algorithm proceeds by generating a large number of random, $N$-dimensional unit vectors called "skewers" through the data set. Every data point is projected onto each skewer, and the data points that correspond to extrema in the direction of a skewer are identified and placed on a list (see Figure 2.3). As more skewers are generated, the list grows, and the number of times a given pixel is placed on this list is also tallied. The pixels with the highest tallies are considered the final endmembers.
The inputs to the algorithm are a hyperspectral data cube $F$ with $N$ dimensions; a maximum number of endmembers to be extracted, $E$; the number of random skewers to be generated during the process, $k$; a cut-off threshold value, $t_v$, used to select as final endmembers only those pixels that have been selected as extreme pixels at least $t_v$ times throughout the PPI process; and a threshold angle, $t_a$, used to discard redundant endmembers during the process. The output of the algorithm is a set of $E$ final endmembers $\{\mathbf{e}_e\}_{e=1}^{E}$. The algorithm can be summarized by the following steps:
1. Skewer generation. Produce a set of $k$ randomly generated unit vectors $\{\text{skewer}_j\}_{j=1}^{k}$.

2. Extreme projections. For each $\text{skewer}_j$, all sample pixel vectors $\mathbf{f}_i$ in the original data set $F$ are projected onto $\text{skewer}_j$ via dot products $|\mathbf{f}_i \cdot \text{skewer}_j|$ to find sample vectors at its extreme (maximum and minimum) projections, thus forming an extrema set for $\text{skewer}_j$ that is denoted by $S_{\text{extrema}}(\text{skewer}_j)$. Despite the fact that a different $\text{skewer}_j$ would generate a different extrema set $S_{\text{extrema}}(\text{skewer}_j)$, it is very likely that some sample vectors may appear in more than one extrema set. In order to deal with this situation, we define an indicator function of a set $S$, denoted by $I_S(\mathbf{x})$, to denote membership of an element $\mathbf{x}$ to that particular set as follows:

$$I_S(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \in S \\ 0 & \text{if } \mathbf{x} \notin S \end{cases} \qquad (2.1)$$
3. Calculation of PPI scores. Using the indicator function above, we calculate the PPI score associated with the sample pixel vector $\mathbf{f}_i$ (i.e., the number of times that given pixel has been selected as extreme in step 2) using the following equation:

$$N_{\text{PPI}}(\mathbf{f}_i) = \sum_{j=1}^{k} I_{S_{\text{extrema}}(\text{skewer}_j)}(\mathbf{f}_i) \qquad (2.2)$$
4. Endmember selection. Find the pixel vectors with scores of $N_{\text{PPI}}(\mathbf{f}_i)$ that are above $t_v$ and form a unique set of endmembers $\{\mathbf{e}_e\}_{e=1}^{E}$ by calculating the spectral angle distance (SAD) for all possible vector pairs and discarding those pixels that result in an angle value below $t_a$. It should be noted that the SAD between a pixel vector $\mathbf{f}_i$ and a different pixel vector $\mathbf{f}_j$ is a standard similarity metric for remote sensing operations, mainly because it is invariant to the multiplication of the input vectors by constants and, consequently, is invariant to unknown multiplicative scalings that may arise due to differences in illumination and sensor observation angle:

$$\text{SAD}(\mathbf{f}_i, \mathbf{f}_j) = \cos^{-1}\left(\frac{\mathbf{f}_i \cdot \mathbf{f}_j}{\|\mathbf{f}_i\| \, \|\mathbf{f}_j\|}\right) \qquad (2.3)$$
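To make the four steps concrete, the following C++ fragment sketches a possible serial implementation of steps 1-3 plus the SAD metric of step 4. It is a minimal sketch under our own assumptions (pixel-major flat storage, Gaussian-sampled skewers, single-threaded execution), not the ENVI implementation described above:

```cpp
// Minimal serial sketch of the PPI (steps 1-3) and the SAD metric (step 4).
// Assumption: the cube is stored pixel-major as T vectors of N bands each.
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Spectral angle distance between two N-band pixel vectors (Eq. 2.3).
float sadAngle(const float* a, const float* b, std::size_t N) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t n = 0; n < N; ++n) {
        dot += a[n] * b[n]; na += a[n] * a[n]; nb += b[n] * b[n];
    }
    return std::acos(dot / (std::sqrt(na) * std::sqrt(nb)));
}

// Tally how many times each pixel is an extreme projection over k skewers.
std::vector<int> ppiScores(const std::vector<float>& image,
                           std::size_t T, std::size_t N, std::size_t k,
                           unsigned seed = 0) {
    std::mt19937 gen(seed);
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    std::vector<int> score(T, 0);
    std::vector<float> skewer(N);
    for (std::size_t j = 0; j < k; ++j) {
        // Step 1: draw a random unit vector (skewer).
        float norm = 0.0f;
        for (std::size_t n = 0; n < N; ++n) {
            skewer[n] = gauss(gen);
            norm += skewer[n] * skewer[n];
        }
        norm = std::sqrt(norm);
        for (std::size_t n = 0; n < N; ++n) skewer[n] /= norm;
        // Step 2: project every pixel onto the skewer and track the extrema.
        std::size_t argMax = 0, argMin = 0;
        float pMax = -1e30f, pMin = 1e30f;
        for (std::size_t i = 0; i < T; ++i) {
            float dot = 0.0f;
            for (std::size_t n = 0; n < N; ++n)
                dot += image[i * N + n] * skewer[n];
            if (dot > pMax) { pMax = dot; argMax = i; }
            if (dot < pMin) { pMin = dot; argMin = i; }
        }
        // Step 3: increment the PPI score of the two extreme pixels.
        ++score[argMax];
        ++score[argMin];
    }
    return score; // step 4 keeps pixels with score > t_v, pruned via sadAngle()
}
```

The dominant cost is the $T$ dot products of length $N$ repeated for each of the $k$ skewers, i.e., $O(k \cdot T \cdot N)$ multiply-accumulate operations, which is precisely what motivates the parallel implementations developed later in this section.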
From a computational standpoint, given the high dimensionality of hyperspectral data (e.g., AVIRIS scenes with $N = 224$ spectral bands), the PPI generally requires a very high number of skewers (on the order of $k = 10^4$ or $k = 10^5$) to produce an accurate final set of endmembers [32], and results in processing times above one hour when the algorithm is run on a latest-generation desktop PC. Such response times are unacceptable in most remote sensing applications. In the following section, we provide an overview of HPC paradigms applied to speed up the computational performance of the PPI using different kinds of parallel and distributed computing architectures.
2.3.2 Parallel Implementations
This section first develops a parallel implementation of the PPI algorithm that has been specifically developed to be run on massively parallel, homogeneous Beowulf clusters. Then, the parallel version is transformed into a heterogeneity-aware implementation by introducing an adaptive data partitioning algorithm specifically developed to capture in real time the specificities of a heterogeneous network of distributed workstations. Finally, an FPGA implementation aimed at onboard PPI-based processing is provided.
2.3.2.1 Cluster-Based Implementation of the PPI Algorithm

In this subsection, we describe a master-slave parallel version of the PPI algorithm. To reduce code redundancy and enhance reusability, our goal was to reuse much of the code for the sequential algorithm in the parallel implementation. For that purpose, we adopted a spatial-domain decomposition approach [34, 35] that subdivides the image cube into multiple blocks made up of entire pixel vectors, and assigns one or more blocks to each processing element (see Figure 2.4).
[Figure 2.4: the original image is scattered into local partitions, one per processing node; each node runs the PPI algorithm on its partition to obtain local PPI scores, which are then gathered into global PPI scores.]
It should be noted that the PPI algorithm is mainly based on projecting pixel vectors that are always treated as a whole. This is a result of the convex geometry process implemented by the PPI, which is based on the spectral "purity" or "convexity" of the entire spectral signature associated with each pixel. Therefore, a spectral-domain partitioning scheme (which subdivides the whole multi-band data into blocks made up of contiguous spectral bands or sub-volumes, and assigns one or more sub-volumes to each processing element) is not appropriate in our application [8]. This is because the latter approach breaks the spectral identity of the data, since each pixel vector is split amongst several processing elements.

A further reason that justifies the above decision is that, in spectral-domain partitioning, the calculations made for each hyperspectral pixel need to originate from several processing elements, and thus require intensive inter-processor communication. Therefore, in our proposed implementation, a master-worker spatial-domain decomposition paradigm is adopted, where the master processor sends partial data to the workers and coordinates their actions. Then, the master gathers the partial results provided by the workers and produces a final result.
parti-As it was the case with the serial version, the inputs to our cluster-based
imple-mentation of the PPI algorithm are a hyperspectral data cube F with N dimensions; a
maximum number of endmembers to be extracted, p; the number of random skewers
to be generated during the process, k; a cut-off threshold value, t v; and a threshold
angle, t a The output of the algorithm is a set of E endmembers{ee}E
e=1 The parallelalgorithm is given by the following steps:
1. Data partitioning. Produce a set of $L$ spatial-domain homogeneous partitions of $F$ and scatter all partitions by indicating all partial data structure elements that are to be accessed and sent to each of the workers.

2. Skewer generation. Generate $k$ random unit vectors $\{\text{skewer}_j\}_{j=1}^{k}$ in parallel and broadcast the entire set of skewers to all the workers.

3. Extreme projections. For each $\text{skewer}_j$, project all the sample pixel vectors at each local partition $l$ onto $\text{skewer}_j$ to find sample vectors at its extreme projections, and form an extrema set for $\text{skewer}_j$ that is denoted by $S_{\text{extrema}}^{(l)}(\text{skewer}_j)$. Now calculate the number of times each pixel vector $\mathbf{f}_i^{(l)}$ in the local partition is selected as extreme using the following expression:

$$N_{\text{PPI}}^{(l)}\left(\mathbf{f}_i^{(l)}\right) = \sum_{j=1}^{k} I_{S_{\text{extrema}}^{(l)}(\text{skewer}_j)}\left(\mathbf{f}_i^{(l)}\right) \qquad (2.4)$$

4. Candidate selection. Select those pixels with $N_{\text{PPI}}^{(l)}\left(\mathbf{f}_i^{(l)}\right) > t_v$ and send them to the master node.

5. Endmember selection. The master gathers all the individual endmember sets provided by the workers and forms a unique set $\{\mathbf{e}_e\}_{e=1}^{E}$ by calculating the SAD for all possible pixel vector pairs in parallel and discarding those pixels that result in angle values below $t_a$.
It should be noted that the proposed parallel algorithm has been implemented in the C++ programming language, using calls to the message passing interface (MPI) [36]. We emphasize that, in order to implement step 1 of the parallel algorithm, we resorted to MPI-derived data types to directly scatter hyperspectral data structures, which may be stored non-contiguously in memory, in a single communication step. As a result, we avoid creating all partial data structures on the root node (thus making better use of memory resources and compute power).
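As a rough illustration of how steps 1-4 map onto MPI (a sketch under our own assumptions, not the chapter's actual code: for readability it scatters a flat float buffer instead of the MPI-derived data types mentioned above, and replaces the explicit skewer broadcast of step 2 with a shared random seed):

```cpp
// Hypothetical master-worker skeleton for the cluster-based PPI.
#include <mpi.h>
#include <cstddef>
#include <vector>

// Serial routine sketched in Section 2.3.1; runs on each local partition.
std::vector<int> ppiScores(const std::vector<float>& image, std::size_t T,
                           std::size_t N, std::size_t k, unsigned seed);

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 224;       // spectral bands (AVIRIS-like; assumed)
    const int T = 614 * 512; // total pixels (illustrative scene size)
    const int k = 10000;     // number of skewers

    // Step 1: homogeneous spatial-domain partitioning into equal pixel blocks.
    std::vector<int> counts(size), displs(size);
    int base = T / size, rem = T % size, off = 0;
    for (int p = 0; p < size; ++p) {
        counts[p] = (base + (p < rem ? 1 : 0)) * N; // floats per partition
        displs[p] = off;
        off += counts[p];
    }
    std::vector<float> cube;
    if (rank == 0) cube.assign(static_cast<std::size_t>(T) * N, 0.0f); // load data here
    std::vector<float> local(counts[rank]);
    MPI_Scatterv(cube.data(), counts.data(), displs.data(), MPI_FLOAT,
                 local.data(), counts[rank], MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Steps 2-3: every node derives identical skewers from a common seed and
    // computes local extreme projections and PPI scores on its partition.
    std::vector<int> score = ppiScores(local, counts[rank] / N, N, k, 42u);

    // Steps 4-5 (not shown): pixels with score > t_v would be sent to the
    // master, which performs the final SAD-based endmember selection.
    MPI_Finalize();
    return 0;
}
```

In the actual implementation described above, the scatter of step 1 relies on MPI-derived data types so that non-contiguous partitions travel in a single communication step; the flat-buffer MPI_Scatterv shown here is only a simplified stand-in.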
2.3.2.2 Heterogeneous Implementation of the PPI Algorithm

In this subsection, we provide a simple application case study in which the standard MPI-based implementation of the PPI is adapted to a heterogeneous environment by reutilizing most of the code available for the cluster-based system [27]. This approach is generally preferred due to the relatively large amount of data processing algorithms and parallel software developed for homogeneous systems. Before introducing our implementation of the PPI algorithm for heterogeneous NOWs, we must first formulate a general optimization problem in the context of fully heterogeneous systems (composed of different-speed processors that communicate through links at different capacities) [16]. Such a computing platform can be modeled as a complete graph $G = (P, E)$, where each node models a computing resource $p_i$ weighted by its relative cycle-time $w_i$. Each edge in the graph models a communication link weighted by its relative capacity, where $c_{ij}$ denotes the maximum capacity of the slowest link in the path of physical communication links from $p_i$ to $p_j$ (we assume that the system has symmetric costs, i.e., $c_{ij} = c_{ji}$).
With the above assumptions in mind, processor $p_i$ should accomplish a share $\alpha_i \cdot W$ of the total workload, denoted by $W$, to be performed by a certain algorithm, with $\alpha_i \geq 0$ for $1 \leq i \leq P$ and $\sum_{i=1}^{P} \alpha_i = 1$. An abstract view of our problem can then be simply stated in the form of a master-worker architecture, much like the commodity cluster-based homogeneous implementation described in the previous section. However, in order for such parallel algorithms to also be effective in fully heterogeneous systems, the master program must be modified to produce a set of $L$ spatial-domain heterogeneous partitions of $F$ in step 1.
In order to balance the load of the processors in the heterogeneous environment, each processor should execute an amount of work that is proportional to its speed. Therefore, two major goals of our partitioning algorithm should be: (i) to obtain an appropriate set of workload fractions $\{\alpha_i\}_{i=1}^{P}$ that best fit the heterogeneous environment; and (ii) to translate the chosen set of values into a suitable decomposition of the input data, taking into account the properties of the heterogeneous system.

To accomplish the above goals, we use a workload estimation algorithm (WEA) that assumes that the workload of each processor $p_i$ must be directly proportional to its local memory and inversely proportional to its cycle-time $w_i$. Below, we provide a description of the WEA algorithm, which replaces step 1 in the implementation of the PPI provided in the previous section. Steps 2-5 of the parallel algorithm in the previous section would be executed immediately after WEA and remain the same as those outlined in the algorithmic description provided in the previous section (thus greatly enhancing code reutilization). The input to WEA is $F$, an $N$-dimensional data cube, and the output is a set of $L$ spatial-domain heterogeneous partitions of $F$:
1. Obtain necessary information about the heterogeneous system, including the number of available processors $P$, each processor's identification number, and each processor's relative cycle-time $w_i$.

2. Set an initial approximation of $\{\alpha_i\}_{i=1}^{P}$ so that the amount of work assigned to each processor is proportional to its speed and $\alpha_i \cdot w_i \approx \text{const}$ for all processors.

3. Iteratively increment some $\alpha_i$ until the set $\{\alpha_i\}_{i=1}^{P}$ best approximates the total workload to be completed, $W$; i.e., for $m = \sum_{i=1}^{P} \alpha_i$ to $W$, find $k \in \{1, \cdots, P\}$ so that $w_k \cdot (\alpha_k + 1) = \min\{w_i \cdot (\alpha_i + 1)\}_{i=1}^{P}$, and then set $\alpha_k = \alpha_k + 1$.
4. Once the set $\{\alpha_i\}_{i=1}^{P}$ has been obtained, a further objective is to produce $P$ partitions of the input hyperspectral data set. To do so, we proceed as follows (a minimal sketch of steps 2-3 appears after this list):

- Obtain a first partitioning of the hyperspectral data set so that the number of rows in each partition is proportional to the values of $\{\alpha_i\}_{i=1}^{P}$.
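The following C++ fragment sketches steps 2 and 3 of WEA under our own simplifying assumptions (workload measured in pixel rows, cycle-times already measured, and memory constraints ignored); the function name weaShares is hypothetical and this is an illustration of the load-balancing idea, not the chapter's implementation:

```cpp
// Minimal sketch of the WEA workload-partitioning idea (steps 2-3).
#include <cstddef>
#include <vector>

// Given relative cycle-times w (time per unit of work) and a total workload W
// (e.g., number of pixel rows), return integer shares alpha with sum == W.
std::vector<long> weaShares(const std::vector<double>& w, long W) {
    const std::size_t P = w.size();
    // Step 2: initial shares proportional to speed 1/w_i, rounded down,
    // so that alpha_i * w_i is approximately constant across processors.
    double invSum = 0.0;
    for (double wi : w) invSum += 1.0 / wi;
    std::vector<long> alpha(P);
    long assigned = 0;
    for (std::size_t i = 0; i < P; ++i) {
        alpha[i] = static_cast<long>((W / invSum) / w[i]);
        assigned += alpha[i];
    }
    // Step 3: iteratively give one more unit of work to the processor that
    // would finish it earliest, until the whole workload W is covered.
    while (assigned < W) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < P; ++i)
            if (w[i] * (alpha[i] + 1) < w[best] * (alpha[best] + 1)) best = i;
        ++alpha[best];
        ++assigned;
    }
    return alpha; // row counts for the spatial-domain partitions of step 4
}
```

Each processor $p_i$ then receives $\alpha_i$ rows of the image, so that faster processors (smaller $w_i$) obtain proportionally larger spatial partitions.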
For comparative purposes, we have also experimented with a heterogeneous version of MPI that automatically optimizes the workload assigned to each heterogeneous processor (i.e., this implementation automatically determines the load distribution accomplished by our proposed WEA algorithm). Experimentally, we tested that both implementations resulted in very similar results, and, hence, the experimental validation provided in the following section will be based on the performance analysis achieved by the first implementation (i.e., using our proposed WEA algorithm to estimate the workloads).
2.3.2.3 FPGA-Based Implementation of the PPI Algorithm

In this subsection, we describe a hardware-based parallel strategy for implementation of the PPI algorithm that is aimed at enhancing replicability and reusability of slices in FPGA devices through the utilization of systolic array design [38]. One of the main advantages of systolic array-based implementations is that they are able to provide a systematic procedure for system design that allows for the derivation of a well-defined processing element-based structure and an interconnection pattern that can then be easily ported to real hardware configurations. Using this procedure, we can also calculate the data dependencies prior to the design, and in a very straightforward manner. Our proposed design intends to maximize the computational power of the hardware and minimize the cost of communications. These goals are particularly relevant in our specific application, where hundreds of data values will be handled for each intermediate result, a fact that may introduce problems related to limited resource availability and inefficiencies in hardware replication and reusability.
[Figure 2.5: systolic array for the PPI, composed of dot nodes that compute the skewer projections and max/min nodes that collect the extreme projections; pixel band values f_i^(n) enter from the top and skewer values enter from the left.]
In our design, pixel vectors are input to the systolic array from top to bottom, while skewer vectors are fed to the systolic array from left to right. Figure 2.5 illustrates this principle, in which local results remain static at each processing element as the data flow through the array. In Figure 2.5, asterisks represent delays, while $\text{skewer}_j^{(n)}$ denotes the value of the $n$-th band of the $j$-th skewer, with $j \in \{1, \cdots, K\}$ and $n \in \{1, \cdots, N\}$, where $N$ is the number of bands of the input hyperspectral scene. Similarly, $\mathbf{f}_i^{(n)}$ denotes the reflectance value of the $n$-th band of the $i$-th pixel, with $i \in \{1, \cdots, T\}$, where $T$ is the total number of pixels in the input image. The processing nodes labeled dot in Figure 2.5 perform the individual products for the skewer projections. On the other hand, the nodes labeled max and min respectively compute the maxima and minima projections after the dot product calculations have been completed. In fact, the max and min nodes can be respectively seen as part of a 1-dimensional systolic array that avoids broadcasting the pixel while simplifying the collection of the results.
Basically, a systolic cycle in the architecture described in Figure 2.5 consists of computing a single dot product between a pixel and a skewer. A full vector dot-product