Image Retrieval — Overview of Key Techniques
YING LI and C.-C. JAY KUO
University of Southern California, Los Angeles, California
10.1 INTRODUCTION

The amount of image and multimedia data available to users has grown rapidly in recent years. This growth is similar to the rapid increase in the amount of alphanumeric data during the early days of computing, which led to the development of database management systems (DBMSs). Traditional DBMSs were designed to organize alphanumeric data into interrelated collections so that information retrieval and storage could be done conveniently and efficiently. However, this technology is not well suited to the management of multimedia information. The diversity of data types and formats, the large size of media objects, and the difficulties in automatically extracting semantic meanings from data are entirely foreign to traditional database management techniques. To use this widely available multimedia information effectively, efficient methods for storage, browsing, indexing, and retrieval [1,2] must be developed. Different multimedia data types may require specific indexing and retrieval tools and methodologies. In this chapter, we present an overview of indexing and retrieval methods for image data.
Since the 1970s, image retrieval has been a very active research area within two major research communities: database management and computer vision. These research communities study image retrieval from two different angles. The first is primarily text-based, whereas the second relies on visual properties of the data [3].
Text-based image retrieval can be traced back to the late 1970s. At that time, images were annotated with key words, which were stored as retrieval keys in traditional databases. Some relevant research in this field can be found in Refs. [4,5]. Two problems render manual annotation ineffective when the size of image databases becomes very large. The first is the prohibitive amount of labor involved in image annotation. The other, probably more essential, results from the difficulty of capturing the rich content of images using a small number of key words, a difficulty which is compounded by the subjectivity of human perception.
In the early 1990s, because of the emergence of large-scale image collections, content-based image retrieval (CBIR) was proposed as a way to overcome these difficulties. In CBIR, images are automatically indexed by summarizing their visual contents through automatically extracted quantities or features such as color, texture, or shape. Thus, low-level numerical features, extracted by a computer, are substituted for higher-level, text-based manual annotations or key words. Since the inception of CBIR, many techniques have been developed along this direction, and many retrieval systems, both research and commercial, have been built [3].
Note that, ideally, CBIR systems should automatically extract (and index) the semantic content of images to meet the requirements of specific application areas. Although it seems effortless for a human being to pick out photos of horses from a collection of pictures, automatic object recognition and classification are still among the most difficult problems in image understanding and computer vision. This is the main reason why low-level features such as colors [6–8], textures [9–11], and shapes of objects [12,13] are widely used for content-based image retrieval. However, in specific applications, such as medical or petroleum imaging, low-level features play a substantial role in defining the content of the data.
A typical content-based image retrieval system is depicted in Figure 10.1 [3]. The image collection database contains raw images for the purpose of visual display. The visual feature repository stores visual features extracted from images, needed to support content-based image retrieval. The text annotation repository contains key words and free-text descriptions of images. Multidimensional indexing is used to achieve fast retrieval and to make the system scalable to large image collections.
The retrieval engine includes a query interface and a query-processing unit. The query interface, typically employing graphical displays and direct manipulation techniques, collects information from users and displays retrieval results. The query-processing unit translates user queries into an internal form, which is then submitted to the DBMS. Moreover, in order to bridge the gap between low-level visual features and high-level semantic meanings, users are usually allowed to communicate with the search engine in an interactive way.
We will address each part of this structure in more detail in later sections.

Figure 10.1. An image retrieval system architecture.

This chapter is organized as follows. In Section 10.2, the extraction and integration of some commonly used features, such as color, texture, shape, and object spatial relationships, are briefly discussed. Similarity functions and commonly used feature descriptors are discussed in Section 10.3. Feature indexing techniques are reviewed in Section 10.4. Section 10.5 provides the key concepts of interactive content-based image retrieval and briefly discusses several main components of a CBIR system. Section 10.6 introduces a new work item of the ISO/MPEG family, the "Multimedia Content Description Interface," or MPEG-7 for short, which defines a standard for describing multimedia content features and descriptors. Finally, concluding remarks are drawn in Section 10.7.
10.2 FEATURE EXTRACTION AND INTEGRATION
Feature extraction is the basis of CBIR. Features can be categorized as general or domain-specific. General features typically include color, texture, shape, sketch, spatial relationships, and deformation, whereas domain-specific features are applicable in specialized domains such as human face recognition or fingerprint recognition.

Each feature may have several representations. For example, the color histogram and color moments are both representations of the image color feature. Moreover, numerous variations of the color histogram itself have been proposed, each of which differs in the selected color-quantization scheme.
10.2.1 Feature Extraction
10.2.1.1 Color

Color is one of the most recognizable elements of image content [14] and is widely used in image retrieval because of its invariance with respect to image scaling, translation, and rotation. The key issues in color feature extraction include the choice of color space, the color-quantization scheme, and the choice of similarity function.
Color Spaces. The commonly used color spaces include the RGB, YCbCr, HSV, CIELAB, CIEL*u*v*, and Munsell spaces. The CIELAB and CIEL*u*v* color spaces usually give better performance because of their improved perceptual uniformity with respect to RGB [15]. MPEG-7 XM V2 supports the RGB, YCbCr, and HSV color spaces, as well as some linear transformation matrices with reference to RGB [16].
Color Quantization. Color quantization is used to reduce the color resolution of an image. Using a quantized color map can considerably decrease the computational complexity of image retrieval. The commonly used color-quantization schemes include uniform quantization, vector quantization, tree-structured vector quantization, and product quantization [17–19]. In MPEG-7 XM V2 [16], three quantization types are supported: linear, nonlinear, and lookup table.
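As an illustration, the following Python sketch implements uniform quantization, the simplest of these schemes. The choice of eight bins per channel (a 512-color map) is ours for illustration and is not a value prescribed by MPEG-7.

```python
import numpy as np

def uniform_quantize(image, bins_per_channel=8):
    """Map each 8-bit RGB pixel to one of bins_per_channel**3 color-map indices."""
    step = 256 // bins_per_channel
    idx = (image // step).astype(np.int64)   # per-channel bin index, 0 .. bins-1
    # Combine the three channel indices into a single color index.
    return (idx[..., 0] * bins_per_channel + idx[..., 1]) * bins_per_channel + idx[..., 2]

# Example: a random "image" quantized to a 512-color map.
img = np.random.randint(0, 256, size=(240, 320, 3), dtype=np.uint8)
qmap = uniform_quantize(img)                 # shape (240, 320), values in 0..511
```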
10.3 SIMILARITY FUNCTIONS
A similarity function maps a pair of feature vectors to a positive real number chosen to represent the visual similarity between two images. Let us take the color histogram as an example. There are two main approaches to histogram formation. The first is based on the global color distribution across the entire image, whereas the second consists of computing the local color distribution for a certain partition of the image. These two techniques are suitable for different types of queries. If users are concerned only with the overall colors and their amounts, regardless of their spatial arrangement in the image, then indexing using the global color distribution is useful. However, if users also want to take into consideration the positional arrangement of colors, the local color histogram will be a better choice.
A global color histogram represents an image I by an N-dimensional vector, H(I) = [H(I, j), j = 1, 2, ..., N], where N is the number of quantized colors and H(I, j) is the number of pixels having color j. The similarity of two images can easily be computed on the basis of this representation. The four common types of similarity measurements are the L1 norm [20], the L2 norm [21], the color histogram intersection [7], and the weighted distance metric [22]. The L1 norm has the lowest computational complexity. However, it was shown in Ref. [23] that it can produce false negatives (not all similar images are retrieved). The L2 norm (i.e., the Euclidean distance) is probably the most widely used metric.
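The following sketch, assuming a quantized color map such as the one produced above, computes the global histogram H(I) and three of the four measurements just listed (the weighted distance metric is omitted, as it additionally requires a color-similarity matrix):

```python
import numpy as np

def global_histogram(qmap, n_colors=512):
    """H(I, j) = number of pixels with color j, normalized to sum to 1."""
    h = np.bincount(qmap.ravel(), minlength=n_colors).astype(float)
    return h / h.sum()

def l1_distance(h1, h2):
    return np.abs(h1 - h2).sum()

def l2_distance(h1, h2):
    return np.sqrt(((h1 - h2) ** 2).sum())

def histogram_intersection(h1, h2):
    # A similarity (not a distance) in [0, 1]; higher means more alike.
    return np.minimum(h1, h2).sum()
```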
Local color histograms are used to retrieve images on the basis of their color similarity in local spatial regions. One natural approach is to partition the whole image into several regions and then extract color features from each of them [25,26]. In this case, the similarity of two images is determined by the similarity of the corresponding regions. Of course, the two images should have the same number of partitions of the same size; if they happen to have different aspect ratios, normalization will be required.
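A minimal sketch of this partitioning approach, using a fixed 4 × 4 grid (an illustrative choice) and averaging per-region L1 distances:

```python
import numpy as np

def local_histograms(qmap, n_colors=512, grid=(4, 4)):
    """Split a quantized image into grid cells and histogram each cell."""
    cells = []
    for band in np.array_split(qmap, grid[0], axis=0):
        for cell in np.array_split(band, grid[1], axis=1):
            h = np.bincount(cell.ravel(), minlength=n_colors).astype(float)
            cells.append(h / max(h.sum(), 1.0))
    return np.stack(cells)                   # shape (grid[0] * grid[1], n_colors)

def local_distance(c1, c2):
    # Average the L1 distance over corresponding regions.
    return np.abs(c1 - c2).sum(axis=1).mean()
```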
10.3.1 Some Color Descriptors
A compact color descriptor, called a binary representation of the image histogram, was proposed in Ref. [27]. With this approach, each region is represented by a binary signature, a binary sequence generated by a two-level quantization of the wavelet coefficients obtained by applying the two-dimensional (2D) Haar transform to the 2D color histogram. In Ref. [28], a scalable blob histogram was proposed, where the term blob denotes a group of pixels with homogeneous color. One advantage of this descriptor is that images containing objects with different sizes and shapes can be easily distinguished without color segmentation. A region-based image retrieval approach was presented in Ref. [29]. The main idea of this work is to adaptively segment the whole image into sets of regions according to the local color distribution [30] and then compute the similarity on the basis of each region's dominant colors, which are extracted by applying color quantization.
Some other commonly used color feature representations in image retrieval include color moments and color sets. For example, in Ref. [31], Stricker and Dimai extracted the first three color moments from five partially overlapped fuzzy regions. In Ref. [32], Stricker and Orengo proposed the use of color moments to overcome undesirable quantization effects. To speed up the retrieval process in a very large image database, Smith and Chang approximated the color histogram with a selection of colors (color sets) from a prequantized color space [33,34].
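In the spirit of these moment-based descriptors, the sketch below computes the first three moments (mean, standard deviation, and skewness) of each color channel; the exact moment definitions used in Refs. [31,32] may differ in detail.

```python
import numpy as np

def color_moments(image):
    """Return a 9-dimensional vector: three moments for each of three channels."""
    pixels = image.reshape(-1, 3).astype(float)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    # Cube root of the third central moment keeps skewness in pixel units.
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])
```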
10.3.2 Texture
Texture refers to visual patterns with properties of homogeneity that do not result from the presence of only a single color or intensity [35]. Tree bark, clouds, water, bricks, and fabrics are examples of texture. Typical textural features include contrast, uniformity, coarseness, roughness, frequency, density, and directionality. Texture features usually contain important information about the structural arrangement of surfaces and their relationship to the surrounding environment [36]. To date, a large amount of research in texture analysis has been done as a result of the usefulness and effectiveness of this feature in application areas such as pattern recognition, computer vision, and image retrieval.
There are two basic classes of texture descriptors: statistical model-based and transform-based. The first approach explores the gray-level spatial dependence of textures and then extracts meaningful statistics as the texture representation. In Ref. [36], Haralick and coworkers proposed the co-occurrence matrix representation of texture features, in which they explored the gray-level spatial dependence of texture. They also studied line-angle-ratio statistics by analyzing the spatial relationships of lines and the properties of their surroundings. Interestingly, Tamura and coworkers addressed this topic from a totally different viewpoint [37]. They showed, on the basis of psychological measurements, that six basic textural features are coarseness, contrast, directionality, line-likeness, regularity, and roughness. This approach selects numerical features that correspond to characteristics of the human visual system rather than to statistical measures of the data and, therefore, seems well suited to the retrieval of natural images. Two well-known CBIR systems, the QBIC system [38] and the MARS system [39,40], adopted Tamura's texture representation and made some further improvements. Liu and Picard [10] and Niblack and coworkers [11,41] used a subset of the six features mentioned above, namely the contrast, coarseness, and directionality models, to achieve texture classification and recognition.
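A minimal sketch of the co-occurrence approach: a gray-level co-occurrence matrix for a single displacement, from which two Haralick-style statistics (contrast and energy) are computed. Haralick's method [36] aggregates several displacements and orientations, and the quantization to eight gray levels here is an illustrative choice.

```python
import numpy as np

def glcm(gray, levels=8, dx=1, dy=0):
    """Normalized co-occurrence counts of gray levels at offset (dy, dx)."""
    g = (gray.astype(int) * levels) // 256       # quantize to `levels` gray levels
    h, w = g.shape
    a = g[0:h - dy, 0:w - dx]                    # reference pixels
    b = g[dy:h, dx:w]                            # neighbors at the given offset
    m = np.zeros((levels, levels))
    np.add.at(m, (a.ravel(), b.ravel()), 1)      # accumulate pair counts
    return m / m.sum()

def contrast(p):
    i, j = np.indices(p.shape)
    return ((i - j) ** 2 * p).sum()              # large for strong local variation

def energy(p):
    return (p ** 2).sum()                        # large for uniform textures
```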
A human texture-perception study, conducted by Rao and Lohse [42], indicated that the three most important orthogonal dimensions are "repetitiveness," "directionality," and "granularity and complexity."
Some commonly used transforms for transform-based texture extraction are the discrete cosine transform (DCT), the Fourier-Mellin transform, the polar Fourier transform, and the Gabor and wavelet transforms. Alata and coworkers [43] proposed classifying rotated and scaled textures by using the combination of a Fourier-Mellin transform and a parametric 2D spectrum-estimation method called harmonic mean horizontal vertical (HMHV). Wan and Kuo [44] extracted texture features in the Joint Photographic Experts Group (JPEG) compressed domain by analyzing the AC coefficients of the DCT. The Gabor filters proposed by Manjunath and Ma [45] offer texture descriptors with a set of "optimum joint bandwidth." A tree-structured wavelet transform presented by Chang and Kuo [46] provides a natural and effective way to describe textures that have dominant middle- or high-frequency subbands. In Ref. [47], Nevel developed a texture feature-extraction method by matching the first- and second-order statistics of wavelet subbands.
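To make the transform-based approach concrete, the following sketch filters an image with a small bank of real Gabor kernels and collects the mean and standard deviation of each output, roughly in the spirit of Ref. [45]; the kernel size, frequencies, and orientations are illustrative choices, not the tuned "optimum joint bandwidth" parameters of that work.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(freq, theta, sigma=2.0, size=15):
    """A real (even-symmetric) Gabor kernel at the given frequency and angle."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * freq * xr)

def gabor_features(gray, freqs=(0.1, 0.2, 0.3), n_orient=4):
    """Mean and standard deviation of each filtered output, concatenated."""
    feats = []
    for f in freqs:
        for k in range(n_orient):
            out = convolve2d(gray, gabor_kernel(f, np.pi * k / n_orient), mode='valid')
            feats += [out.mean(), out.std()]
    return np.array(feats)                       # 2 * len(freqs) * n_orient values
```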
10.3.3 Shape
Two major steps are involved in shape feature extraction: object segmentation and shape representation.
10.3.3.1 Object Segmentation

Image retrieval based on object shape is considered to be one of the most difficult aspects of content-based image retrieval because of the difficulties of low-level image segmentation and the variety of ways a given three-dimensional (3D) object can be projected into 2D shapes. Several segmentation techniques have been proposed so far, including the global threshold-based technique [21], the region-growing technique [48], the split-and-merge technique [49], the edge-detection-based technique [41,50], the texture-based technique [51], the color-based technique [52], and the model-based technique [53]. Generally speaking, precise segmentation is difficult owing to the complexity of individual object shapes, the existence of shadows, noise, and so on.
10.3.3.2 Shape Representation

Once objects are segmented, their shape features can be represented and indexed. In general, shape representations can be classified into three categories [54]:
• Boundary-Based Representations (Based on the Outer Boundary of the Shape). The commonly used descriptors of this class include the chain code [55], the Fourier descriptor [55], and the UNL descriptor [56].
• Region-Based Representations (Based on the Entire Shape Region). Descriptors of this class include moment invariants [57], Zernike moments [55], the morphological descriptor [58], and pseudo-Zernike moments [56].
• Combined Representations. We may consider the integration of several basic representations, such as moment invariants with the Fourier descriptor or moment invariants with the UNL descriptor.
The Fourier descriptor is extracted by applying the Fourier transform to the parameterized 1D boundary. Because digitization noise can significantly affect this technique, robust approaches have been developed, such as the one described in Ref. [54], which is also invariant to geometric transformations. Region-based moments are invariant with respect to affine transformations of images; details can be found in Refs. [57,59,60]. Recent work in shape representation includes the finite element method (FEM) [61], the turning function developed by Arkin and coworkers [62], and the wavelet descriptor developed by Chuang and Kuo [63]. Chamfer matching is the most popular shape-matching technique. It was first proposed by Barrow and coworkers [64] for comparing two collections of shape fragments and was then further improved by Borgefors in Ref. [65].
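A minimal sketch of a Fourier descriptor, assuming the object boundary is available as an ordered array of (x, y) points; normalizing by the first harmonic and discarding phase gives invariance to translation, scale, and starting point, though the details vary across Refs. [54,55].

```python
import numpy as np

def fourier_descriptor(boundary, n_coeffs=16):
    """boundary: (n_points x 2) array of ordered (x, y) boundary samples."""
    z = boundary[:, 0] + 1j * boundary[:, 1]  # parameterized 1D complex boundary
    coeffs = np.fft.fft(z)
    coeffs[0] = 0                             # drop the DC term (translation)
    mags = np.abs(coeffs) / np.abs(coeffs[1]) # divide by first harmonic (scale)
    return mags[1:n_coeffs + 1]               # low-order magnitudes (phase dropped)
```

Matching then reduces to comparing such descriptor vectors, for instance with the L2 norm.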
Besides the aforementioned work in 2D shape representation, some research has focused on 3D shape representations. For example, Borgefors and coworkers [66] used binary pyramids in 3D space to improve shape and topology preservation in lower-resolution representations. Wallace and Mitchell [67] presented a hybrid structural or statistical local shape-analysis algorithm for 3D shape representation.
10.3.4 Spatial Relationships
There are two classes of spatial relationships. The first class, containing topological relationships, captures the relations between element boundaries. The second class, containing orientation or directional relationships, captures the relative positions of elements with respect to each other. Examples of topological relationships are "near to," "within," and "adjacent to." Examples of directional relationships are "in front of," "on the left of," and "on top of." A well-known method to describe spatial relationships is the attributed relational graph (ARG) [68], in which objects are represented by nodes and an arc between two nodes represents a relationship between them.
So far, spatial-based modeling has been widely addressed, mostly in the literature on spatial reasoning, for application areas such as geographic information systems [69,70]. We can distinguish two main categories, called qualitative and quantitative spatial modeling, respectively.
A typical application of the qualitative spatial model to image databases, based on symbolic projection theory, was proposed by Chang [71]; it allows a bidimensional arrangement of a set of objects to be encoded into a sequential structure called a 2D string. Because the 2D string structure reduces the matching complexity from a quadratic function to a linear one, the approach has been adopted in several other works [72,73].
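A toy sketch of 2D-string encoding: object symbols are ordered by the x and y coordinates of their centroids. Ties and the full operator set of symbolic projection theory are omitted for brevity.

```python
def two_d_string(objects):
    """objects: list of (symbol, x, y) triples; returns the (u, v) 2D string."""
    by_x = sorted(objects, key=lambda o: o[1])   # left-to-right order
    by_y = sorted(objects, key=lambda o: o[2])   # bottom-to-top order
    u = ' < '.join(sym for sym, _, _ in by_x)
    v = ' < '.join(sym for sym, _, _ in by_y)
    return u, v

# A sun above and between a tree (left) and a house (right):
print(two_d_string([('tree', 10, 5), ('house', 40, 5), ('sun', 25, 50)]))
# -> ('tree < sun < house', 'tree < house < sun')
```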
Compared to qualitative modeling, quantitative spatial modeling can provide a more continuous relationship between perceived spatial arrangements and their representations by using numeric quantities as classification thresholds [74,75]. Lee and Hsu [74] proposed a quantitative modeling technique that enables the comparison of the mutual position of a pair of extended regions. In this approach, the spatial relationship between an observer and an observed object is represented by a finite set of equivalence classes based on the dense sets of possible paths leading from any pixel of one object to any pixel of the other.
10.3.5 Features of Nonphotographic Images
The discussion in the previous section focused on features for indexing and retrieving natural images. Nonphotographic images, such as medical and satellite images, can be retrieved more effectively using special-purpose features, owing to their special content and their complex and variable characteristics.
10.3.5.1 Medical Images

Medical images include diagnostic X-ray images, ultrasound images, computer-aided tomographic images, magnetic resonance images, and nuclear medicine images. Typical medical images contain many complex, irregular objects. These exhibit a great deal of variability, due to differences in modality, equipment, procedure, and patient [76]. This variability poses a big challenge to efficient image indexing and retrieval.
Features suitable for medical images can be categorized into two basic classes: text-based and content-based.

Text-Based Features. Usually, these features are incorporated into labels, which are digitally or physically affixed to the images and then used as the primary indexing key in medical imaging libraries.
Content-Based Features. Two commonly used content-based features are shape and object spatial relationship, which are very useful in helping physicians locate images containing the objects of their interest. In Ref. [76], Cabral and coworkers proposed a new feature called anatomic labels. This descriptor is associated with the anatomy and pathology present in the image and provides a means for assigning Unified Medical Language System (UMLS) labels to images or to specific locations within images.
10.3.5.2 Satellite Images

Recent advances in sensor and communication technologies have made it practical to launch an increasing number of space platforms for a variety of Earth-science studies. The large volume of data generated by the instruments on these platforms has posed significant challenges for data transmission, storage, retrieval, and dissemination. Efficient image storage, indexing, and retrieval systems are required to make this vast quantity of data useful.
The research community has devoted a significant amount of effort to this area [77–80]. In CBIR systems for satellite imagery, different image features are extracted depending on the type of satellite image and the research purpose. For example, in a system used for analyzing aurora image data [79], the authors extract two types of features: global features, which include the aurora area, the magnetic flux, the total intensity, and the variation of intensity; and radial features along a radial line from geomagnetic north, such as the average width and the variation of width. In Ref. [77], shape and spatial-relationship features are extracted from a National Oceanographic and Atmospheric Administration (NOAA) satellite image database. In a database system for Earth-observing satellite images [80], Li and Chen proposed an algorithm to progressively extract and compare different texture features, such as the fractal dimension, coarseness, entropy, circular Moran autocorrelation functions, and spatial gray-level difference (SGLD) statistics, between an image and a target template. In Ref. [78], Barros and coworkers explored techniques for the exploitation of spectral distribution information in a satellite image database.
10.3.6 Some Additional Features
Some additional features that have been used in the image retrieval process are discussed below.
10.3.6.1 Angular Spectrum

The visual properties of an image are mainly related to the largest objects it contains. In describing an object, shape, texture, and orientation play a major role. In many cases, because shape can also be defined in terms of the presence and distribution of oriented subcomponents, the orientation of objects within an image becomes a key attribute in defining similarity to other images. On the basis of this assumption, Lecce and Celentano [81] defined a metric for image classification in 2D space that is quantified by signatures composed of the angular spectra of image components. In Ref. [82], an image's Fourier transform was analyzed to find the directional distribution of lines.
10.3.6.2 Edge Directionality

Edge directionality is another commonly used feature. In Ref. [82], Lecce and Celentano detected edges within an image by using the Canny algorithm [83] and then applied the Hough transform [84], which maps a line in Cartesian coordinate space to a point in polar coordinate space, to each edge point. The results were then analyzed to detect the main directions of edges in each image.
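A sketch of this edge-directionality pipeline using OpenCV's Canny and Hough implementations; the thresholds and the number of angle bins are illustrative choices.

```python
import cv2
import numpy as np

def edge_direction_histogram(gray, n_bins=18):
    """gray: 8-bit grayscale image; returns a normalized histogram of line angles."""
    edges = cv2.Canny(gray, 100, 200)                  # Canny edge map [83]
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 80)  # (rho, theta) line parameters
    hist = np.zeros(n_bins)
    if lines is not None:
        for rho, theta in lines[:, 0]:
            hist[int(theta / np.pi * n_bins) % n_bins] += 1
    return hist / max(hist.sum(), 1.0)
```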
10.3.7 Feature Integration
Experience shows that the use of a single class of descriptors to index an image database does not generally produce results that are adequate for real applications, and retrieval results are often unsatisfactory even for a research prototype. A strategy to potentially improve image retrieval, both in terms of speed and quality of results, is to combine multiple heterogeneous features.
We can categorize feature integration as either sequential or parallel. Sequential feature integration, also called feature filtering, is a multistage process in which different features are sequentially used to prune a candidate image set. In the parallel feature-integration approach, several features are used concurrently in the retrieval process. In the latter case, appropriate weights need to be assigned to the different features, because different features have different discriminating powers, depending on the application and the specific task. The feature-integration approach appears to be superior to using individual features and, as a consequence, is implemented in most current CBIR systems. The original Query by Image Content (QBIC) system [85] allowed the user to select the relative importance of color, texture, and shape. Smith and Chang [86] proposed a spatial and feature (SaFe) system to integrate content-based features with spatial query methods, thus allowing users to specify a query in terms of a set of regions with desired characteristics and simple spatial relations.
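The parallel approach can be sketched as a weighted combination of per-feature distances. The normalization scheme below (dividing by the maximum observed distance) is one of several reasonable choices and is our assumption, not the scheme of any particular system.

```python
import numpy as np

def combined_distance(query_feats, db_feats, weights):
    """
    query_feats: dict mapping feature name -> query vector
    db_feats:    dict mapping feature name -> (n_images x dim) matrix
    weights:     dict mapping feature name -> weight (summing to 1)
    """
    n_images = next(iter(db_feats.values())).shape[0]
    total = np.zeros(n_images)
    for name, w in weights.items():
        d = np.linalg.norm(db_feats[name] - query_feats[name], axis=1)
        total += w * d / (d.max() + 1e-12)   # normalize each feature's distances
    return total                             # rank images by ascending value
```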
Srihari [20] developed a system for identifying human faces in newspaper photographs by integrating visual features extracted from the images with text obtained from the associated descriptive captions. A similar system based on textual and image-content information was also described in Ref. [87]. Extensive experiments show that the use of only one kind of information cannot produce satisfactory results. In the newest version of the QBIC system [85], text-based key word search is integrated with content-based similarity search, which leads to an improvement in overall system performance. The Virage system [88] allows queries to be built by combining color, composition (color layout), texture, and structure (object boundary information).
The main limitation of feature integration in most existing CBIR systems is the heavy involvement of the user, who must not only select the features to be used for each individual query but must also specify their relative weights. A successful system built upon feature integration requires a good understanding of how the matching of each feature is performed and how the query engine uses weights to produce the final results. Sometimes even sophisticated users find it difficult to construct queries that return satisfactory results. An interactive CBIR system can be designed to alleviate this problem, as discussed in Section 10.5.
10.4 FEATURE INDEXING
Normally, image descriptors are represented by multidimensional vectors, and the similarity of two images is measured by computing the distance between their descriptors in the feature space. When the number of images in the database is small, a sequential linear search can provide reasonable performance. However, with large-scale image databases, indexing support for similarity-based queries becomes necessary. In regular database systems, data are indexed by key entities, which are selected to support the most common types of searches. Similarly, the indexing of an image database should support efficient search based on image contents or extracted features. In traditional relational database management systems (RDBMSs), the most popular class of indexing techniques is the B-tree family, most commonly the B+-tree [89]. B-trees allow extremely efficient searches when the key is a scalar. However, they are not suitable for indexing the content of images represented by high-dimensional features. The R-tree [90] and its variants are probably the best-known multidimensional indexing techniques.
10.4.1 Dimension Reduction
Experiments indicate that R-trees [90] and R∗-trees [91] work well for similarity retrieval only when the dimension of the indexing key is less than 20. For higher-dimensional spaces, the performance of these tree-structured indices degrades rapidly. Although the dimension of a feature vector obtained by concatenating multiple descriptors is often on the order of 100, the number of nonredundant dimensions is typically much lower. Thus, dimension reduction should be performed before indexing the feature vectors with a multidimensional indexing technique. There are two widely used approaches to dimension reduction: the Karhunen-Loeve transform (KLT) [92] and column-wise clustering [93]. The KLT and its variations have been used for dimension reduction in many areas, such as features for facial recognition, eigenimages, and principal component analysis. Because the KLT is a computationally intensive algorithm, recent work has been devoted to the efficient computation of approximations suitable for similarity indexing, including the fast approximation to the KLT [94], the low-rank singular value decomposition (SVD) update algorithm [95], and others.
Clustering is another useful tool for achieving dimension reduction. The key idea of clustering is to group a set of objects with similar features into one class. Clustering has been developed in the context of lossy data compression (where it is known as vector quantization) and in pattern recognition (where it is the subject of research on unsupervised learning). It has also been used to group similar objects (patterns and documents) together for information-retrieval applications, and it can be used to reduce the dimensionality of a feature space, as discussed in Ref. [93].
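A minimal sketch of the KLT-based approach described above (equivalently, principal component analysis): feature vectors are projected onto the top-k eigenvectors of their covariance matrix. The full eigendecomposition shown here is exactly the cost that the fast approximations [94,95] seek to avoid.

```python
import numpy as np

def klt_reduce(features, k):
    """features: (n_samples x dim) matrix; returns an (n_samples x k) projection."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :k]          # top-k principal directions
    return centered @ basis
```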
10.4.2 Indexing Techniques
The universe of multidimensional indexing techniques is extremely vast and rich. Here, we limit ourselves to citing the bucketing algorithm, the k-d tree [96], the priority k-d tree [97], the quad-tree [98], the K-D-B tree [99], the hB-tree [96], and the R-tree [90,100] and its variants, the R+-tree [101] and the R∗-tree [91]. Among them, the R-tree and its variants are the most popular. An R-tree is a B-tree-like indexing structure in which each internal node represents a k-dimensional hyperrectangle rather than a scalar. Thus, it is suitable for multidimensional indexing and, in particular, for range queries. The main disadvantage of the R-tree is that the rectangles can overlap, so that more than one subtree under a node may have to be visited during a search, which degrades performance.
In 1990, Beckmann and coworkers [91] proposed what is arguably the best dynamic R-tree variant, the R∗-tree, which minimizes the overlap among nodes and thus yields improved performance.
Recent research work includes the development of new techniques, such as the VAM k-d tree, the VAMSplit R-tree, and the 2D h-tree, as well as comparisons of the performance of various existing indexing techniques in image retrieval [92,97].
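As a small illustration of index-supported similarity search, the sketch below uses SciPy's k-d tree; this is a k-d tree rather than an R-tree variant, and like all such structures it performs best on low-dimensional (e.g., dimension-reduced) keys.

```python
import numpy as np
from scipy.spatial import cKDTree

features = np.random.rand(10000, 12)   # 10,000 images, 12-D reduced feature keys
tree = cKDTree(features)               # build the index once

query = np.random.rand(12)
dists, idx = tree.query(query, k=5)    # five nearest neighbors in one call
print(idx)                             # indices of the most similar images
```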
10.5 INTERACTIVE CONTENT-BASED IMAGE RETRIEVAL
In the early stages of the development of CBIR, research was primarily focused on exploring various feature representations, in the hope of finding a "best" representation for each feature. In these systems, users first select some visual features of interest and then specify a weight for each representation. This burdens the user by requiring comprehensive knowledge of the low-level feature representations used in the retrieval system. There are two further reasons why these systems are limited: the difficulty of representing semantics by means of low-level features and the subjectivity of the human visual system.