C. Ancuti, P. Bekaert
Hasselt University, Expertise Centre for Digital Media
Transnationale Universiteit Limburg- School of Information Technology
Wetenschapspark 2, 3590 Diepenbeek, Belgium
cosmin.ancuti@uhasselt.be
Identifying an important number of correct corresponding points in images of the same scene/object is an essential task of many computer vision applications. The matching problem is based on finding many reliable corresponding points. A considerable research has been invested in finding ways to describe local feature points as distinctive as possible.
In this work we propose a simple method to increase the number of correct correspondences extracted from a pair of images. The features are extracted and described with the original SIFT (Scale Invariant Feature Transform) method. One of the main reasons is that SIFT has been proven to be the most reliable solution in the matching problems. Even if in general the detected DoG (Difference of Gaussian) feature points have a good repeatability score an important part of them are not found as good correspondences in the original matching procedure.
In this paper we try to improve this lack by considering the color information as additional information in the process of matching. The proposed algorithm calculates the cross correlation and the histogram intersection between neighbour keypoints regions that has been warped based on the estimated scale and orientation between feature patches. Our experimental results show that the new method performs better than the original matching procedure.
Keywords: matching, descriptor, detector, feature point, scale space, SIFT, cross correlation, color histogram.
Matching feature points extracted from two images of the same scene/object viewed from different positions can be seen as a twofold problem. First we need to detect the same point independently in both images; this is accomplished by an invariant and repeatable detector. Second, the detected feature points need to be described in a distinctive way in order to be able to match correctly as many as possible of them.
Generally speaking, as much as the detected regions are more invariant, the less information is transported by the descriptor. The relation between detector and descriptor is very important and not all the combination between detectors and different descriptors give similar results.
In this paper we introduce an effective method to increase the number of correct matched keypoints extracted with the SIFT operator .The approach computes the normalized cross correlation (NCC) between the regions centered on the SIFT keypoint.
Overview: The paper is organized as follows: in the following section the related work is presented. The next section contains a briefly review of the SIFT algorithm. In section 4 we present how the new matching approach is implemented. Finally, in the last two sections the experimental results and the conclusions are summarized.

a) The selected gray regions look very similar when the color information is discarded.
b) Failure of matching in absence of the color. In this example are shown the first 10% best matches obtained with the original SIFT procedure.
Fig.1: Importance of considering the color information
In the last decades a lot of techniques have been applied in order to obtain more distinctive and more robust descriptors. A first class is the distribution-based descriptors. One representation is the spin image introduced in [Johnson et al. 1997] in the context of object recognition. The spin descriptor has been adapted to images in [Lazebnik et al. 2003]. Another important class is based on the frequency content of the image. The Fourier transform cannot represent distinctive local regions because the spatial relation between points is not enough explicit. The Gabor transform [Gabor 1946] is maybe the most representative but this technique requires a large number of Gabor filter to capture the signal disparity.
Steerable filters introduced by Freeman [Freeman et al. 1991] guides the derivatives in particular direction and are obtained by convolution with Gaussian derivatives. Another technique is based on the generalized moment invariants [Van Gool et al. 1996]. The moments can be easy to compute but they are sensitive to geometric and photometric distortions.
Scale invariant feature transform SIFT has been introduced by [Lowe 1999] and has been proved to be the most robust and distinctive among the other local invariant descriptors taking into consideration different geometric changes [Mikolajczyk et al. 2004]. Despite of its impressive capacity to distinct similar local features, it does not take into consideration the color information, being developed mainly for gray images.
However, recently a color version of SIFT has been introduced in vision literature. CSIFT [Hackim et al. 2006] combines SIFT with color invariance introduced by [Geusebroek et al. 2001]. Even if they use a relative complex mathematical approach the presented results have not improved considerable the original version. Moreover CSIFT has been tested only on the images affected by the variation of the illumination (more complex transformation like scale and rotation are ignored).
Our method also used color but in a different way and relative more easy to implement. After adapting the neighbor patches to rotation and scale invariance the color histogram of the regions combined with NCC computed for a part of the feature points is used as additional information to increase the distinctness of the SIFT descriptor.
Due to the high popularity it is not surprising that a number of SIFT variants have been developed meanwhile. One of them is PCA-SIFT [ Ke et al. 2004 ] that apply the Principal Component Analysis to the normalized gradient patch in order to reduce the size of the descriptor vector from 128 to 36.
Gradient location and orientation histogram GLOH [Mikolajczyk et al. 2004] is an extension of the SIFT. GLOH computes SIFT descriptor for a log-polar location grid with three bins in radial direction and then descriptor size is reduced with the PCA.
Speed Up Robust Features (SURF) [Bay et al. 2006] uses Hessian matrix and relies on the integral images to increase the computation efficiency for extracting feature points. Feature points are described by the responses of the Haar-wavelets responses within the interest point neighborhood. Even if it is faster then SIFT, In general the number of good corresponding points matched by SURF is lower than SIFT.
For detecting feature point SIFT is based on the DoG, a close approximation of the LoG. The LoG has been used by Lindeberg [Lindeberg 1998] and [Lindeberg 1999] to find blob like feature points. More recently [Mikolajczyk et al. 2003] the Harris Laplacian detector has been introduced. As its name disclosed, this operator is based on Harris detector [Harris et al. 1988] that is used for 2D localization of features and than using the multi scale representation the scale invariant features are extracted.

All this detectors are scale invariant and are based on the automatic scale detection principle which represents the base for the majority scale invariant detectors. The Harris Laplacian seems to have the highest repeatability score among the other invariant detectors.
Color is a very important property of images that make them more distinctness. Several color system have been proposed during the time[Wyszecki et al. 1982]. Various search system schemes based on the color has been proposed. The more used methods include color histograms [Swain et al. 1991], invariant moments [Geusebroek et al. 2001], color co-occurrence matrix [Seong-O Shim et al. 2003].
A common method to describe the texture features used mainly for image retrieval, object recognition and classification is the color histograms [Swain et al. 1991].
Color histograms are efficient and relative easy to compute being robust to translation and rotation transformations and quasi invariant to scale and view angle transformations. Even if is widely used in image retrieval and indexing problems, color histograms have a major drawback: do not take the spatial information of pixels into consideration, thus very different images can have similar color distributions.
To overcome this problem in our approach is used the invariant to scale and to rotation normalized cross correlation. To make the method computational efficient only a very reduced part of the extracted keypoints are compared with NCC measure.
Finding extrema points: In the first stage the image pyramid is
built convolving a given image I(x,y) with a Gaussian kernel G(x,y,σ). The pyramid can be seen as a 3-dimensional space where
the image represents the first two dimensions and the standard deviation σ of the Gaussian kernel is the 3rd dimension.
Scale space can be represented by :![]()
where the Gaussian kernel is :

The standard deviation of the Gaussian kernel controls
the level of the image blurring.
Feature localization: In the next stage all selected candidate points
are verified for the stability. In the initial implementation [Lowe 1999] the
candidate keypoints are identified at the location and scale of the central
sample point. Following the results of [Brown et al. 2003] an important
improvement for stability has been achieved fitting a 3D quadratic function to
the local sample points in order to determine the interpolated location of the
maximum. As a function is used the quadratic

The following equation is solved in order to set the offset:

where
is the offset vector ,and D derivatives are computed using local pixel differences around the point
location.Lowe find very useful to
filter out the unstable points with low contrast and also the points with
strong response along edges. Therefore an approach inspired from Harris feature
points detection [Harris et al. 1988] is used. Keypoints that have a small
value of the principal curvatures will be rejected. The principal curvature is
estimated computing the trace and determinant of the Hessian matrix H of the D function. Features descriptor: The SIFT signature is based on the image
gradient magnitudes and orientations sampled around the feature point location
after the characteristic scale of the feature point is used to select the level
of the image in the scale space. The original implementation computes a 4x4
grid of orientation histogram built on the 4x4 sub-region of the feature point
from a 16x16 region centered on the feature point location. Each histogram has
8 bins corresponding for every 45є(see Figure 3).To avoid boundary influence, trilinear
interpolation is used to distribute the value of each gradient sample into
adjacent histogram bins. In order to have invariance to rotation a
dominant orientation need to be assigned to every feature point. First, the
image gradient orientation and magnitude of every sample point is computed.

Fig. 3: a) the 16x16 patch gradients computed around every keypoint; b) the resulted 4x4 descriptor(128 vector values); each cell consists of histogram with 8 bins corresponding to every 450
Then, a 36 bin (one bin for every 10º) orientation histogram is constructed using the gradient orientation in the neighbor region around the sample point. Every points from that region contributes to the histogram bins by its gradient orientation weighted by its gradient magnitude and by a Gaussian-weighted circular window.
Even if in general the
detected SIFT feature points have a repeatability score greater than 40 % important
proportion of them are not identified as good corresponding points in the SIFT
matching procedure (depends on the complexity of the applied image
transformation).
4.1 Color as an additional discriminative measure
Color received a significant attention in computer vision literature due of its powerful discrimination properties. Color space is in general specified in three dimensions which makes it higher distinctive comparing with the gray images (one dimension representation of the color). The gray value system (G) is computed from the RGB information that corresponds to red, green and blue images provided by a CCD color camera. The intensity (gray) is not perceived uniform and is highly vulnerable to the image conditions:
![]()
Several color space
representation have been proposed during the years [Wyszecki et al. 1982]. The most known
is the RGB representation. RGB is a color system based on the
primary colors red (R=700nm), green (G=546.1nm) and blue (B=435.8nm). XYS system inherits the RGB and is based on the three primaries colors X,Y and Z that cannot perceive or
produced by human eyes. A different approach is L*a*b* color space representation where L* includes the luminance information, a* the combined red-green components and b* reflects the yellow-blue content. The NTSC (National Television
System Committee) introduced the YUV color system that conveys separately the luminance of the color(Y), the hue(I) and the saturation(Q)
of the color. The corresponding color system for
![]()
Comparing with RGB the rgb presents several advantages being not sensitive to surface
orientation, illumination direction and illumination intensity [Geusebroek et al 2001].This is because the normalized system (rgb) depends only to the ratio of the RGB coordinates.A traditional method to
describe the texture is color histogram [Swain et al. 1991]. Color histograms
are efficient and relative easy to compute being robust to translation and
rotation transformations and quasi invariant to scale and view angle transformations.
Even if is widely used in image retrieval and indexing problems, color
histograms have a major drawback: do not take the spatial information of pixels into
consideration, thus very different images can have similar color distributions.
A simple method to increase the distinctness of color histograms is to divide the image into sub-patches and calculate a histogram for each of them.
Unfortunately, increasing the number of sub-patches, affects seriously the
memory and computational time.
![]()
where the color channels are r,g,b and N represents the number of
pixels of the considered image.The similarity of two color histogram can be
compared by several distances [Puzicha et al. 1999]. In this paper the Minkowski-form distance L2 is
used. Considering two histograms h1and h2, the formal equation of the distance (known also as histogram intersection) is:

As discussed before, the color histograms suffer due of not considering the relation between adjacent pixels. To put out this disadvantage we compute the normalized cross correlation (NCC) between the regions centered on the feature points. Because NCC is computational expensive we avoid the computation of the whole sets of feature points based on a reliable and efficient method described in section 4.3.
4.2 Scale and rotation invariant Normalized Cross Correlation
Normalized cross correlation (NCC) is a classical measure to match images. It takes into consideration the simplest descriptor: the vector of patch pixels. Unfortunately it gives reliable results only for the small baseline case. When significant transformation (scale, rotation, translation, illumination, view angle) affects images this measure cannot find good correspondences. Let I1 and I2 the intensity values of two regions (windows) W1 and W2 of the same size (2r+1)x(2r+1) centered on 2 keypoints p1 and p2 from a considered pair of images. The normalized cross correlation between the windows is computed with the following mathematical expression:

represent the means of
intensity values of considered windows. A similar measure is the sum
of squared differences (SSD) that measures the squared
Euclidean distances between I1 and I2.The relation between SSD and NCC is immediately and the properties are very similar. The values of NCC lies between 0 and 1, with 1 indicates a perfect matching. Even if SSD is computational more inexpensive the NCC is preferred in many situations because it is invariant to linear brightness and contrast variations. Our approach is based on the NCC. The method uses the NCC to compare the neighbor patches as an additional constraint for the selected feature points. To reduce the computation time, only a very reduced part of the feature points (5% of the extracted feature points) is compared with NCC.However, in our case where important transformation affect images the NCC should be adapted. Thus, the region vectors centered at every keypoint are adapted in order that corresponding patches to be invariant for rotation and scale transformations
Characteristic scale: One of the most challenging problem is the
invariance to scale modifications. The scale space theory [Lindeberg 1998]
addresses this problem and has been proved an reliable solution to estimate
scale ratio between images. Real world objects appear in
different ways depending of the selected observation scale. Scale space
representation is defined as a solution to the diffusion equation which is
equivalent with the convolution of the signal with Gaussian kernel.
Dominant orientation: The rotation invariance problem is handle by
computing a dominant orientation of the extracted feature points neighbor
regions. The approach is inspired by [Thurnhofer et al. 1996] and is used also in SIFT
descriptor [Lowe 2004]. To determine a prominent orientation for every
keypoint the gradient magnitude and orientation is precomputed at every level
of scale using pixel differences. Let be δx and δy the
finite differences across x and y directions for a considered pixel.

The gradient orientation
histogram is computed in the neighborhood of every keypoint. In our experiments
the histogram is composed by 36 bins with every bin covering 10 degree range of
orientations.

4.3 Our approach
The method proposed in this paper compares the information of the neighbour regions centered on the extracted keypoints. First, the keypoints are tried to be matched using the procedure based on the ambiguity rejection [Lowe 2004]. This procedure selects good matches only if the ratio of the closest match to the second one is lesser than 0.8. For the rest of the unmatched keypoints, we increase the discriminative information of their neighbour regions by computing the color histograms and attempting to match them as was detailed in the previous subsections. The results of this paper have been generated using a size of 16x16 pixels of the neighbour patches centered on the SIFT keypoints. In our experiments, taken into consideration the size of the keypoints regions, we quantized the rgb color space using 5 bits (32 levels of colors). We observed that this level of quantization is a good compromise between computation and matching performance.The algorithm can be resumed in the following pseudo code lines:


To validate our method a number of real images have been tested. The color images have been taken with an ordinary digital camera(Kodak Z740) and transformations like rotation and scale have been synthetically simulated. A part of the tested images are presented in the Figure 4. The number of valid correspondence are identified based on the computed homography (rates a unique relation between points from two images), estimated using the algorithm from [Hartley and Zisermann 2003]. We limit the analyze to the planar scenes.Supposing that x1 and x2 represent the projected points of a 3D space point the relation between them is given by the homography expression:
![]()
Whether the homography matrix is known the repeatability criterion [Schmid et al. 2000] determines the number of good correspondences. The score of repeatability for a pair of images represents the ratio between the number of point-to-point correspondences and the minimum number of points detected in images. Note that only the points located in the same scene part visible in both of the images are considered. For evaluation, we compare the performance of our method with the original SIFT. The SIFT is computed for the gray version of the considered color images applying the equation 5 transformation.To evaluate our approach the ratio between the number of good detected matched points and the number of maximum possible matches of two images is computed.The image groups from the Figure 4 are evaluated and the results are presented in the following graphic.As can be seen our approach performs better than original SIFT, finding in general with approximately 7% more valid matches. Additionally, in the figure 6 and figure 7 more examples with detailed information are examined.


In this paper we presented a new method that increase the efficiency of matching images using SIFT operator. Because the original SIFT do not use the color information being mainly designed for gray images we introduce the color as additional discriminative information. Therefore the normalized cross correlation between neighbor patches centered on the feature points is computed. NCC is a very efficient measure only for the small baseline case. When more complex transformation affects images the NCC may be adapted the compared patches to scale and orientation after the ratio scale and difference of orientation is computed. Additionally, to increase the distinctness of the SIFT descriptor the color histogram of considered feature points regions are compared using the histogram intersection measure.
The experimental results show that our approach improves the number of good matched feature points using the SIFT detector/descriptor and also the stability of extracted keypoints.
In future work we would like to focus more on the color in order to explore more efficient the color space properties. Moreover we would like to extend our approach for more important affine transformations.
The authors acknowledge
financial support on a structural basis from the ERDF (European Regional
Development Fund) and the Flemish Government and the Flemish Interdisciplinary
institute for BroadBand Technology(IBBT). The first author also acknowledges
financial support by a research grant from the
Bay, H. and Tuytelaars, T. and Van Gool, L.J., 2006. SURF: Speeded Up Robust Features, ECCV06 pp. 404-417
Freeman, W. and Adelson, E. 1991. The Design and Use of Steerable Filters. IEEE Transactions on Pattern Analysis and Machine Intelligence,vol. 13, no. 9, pp. 891-906.
Gabor, D. 1946. Theory of Communication, J. Inst. Electr. Engineering London 93, vol. 3, no. 93, pp. 429-457.
Richard Hartley, Andrew Zisserman, Multiple View Geometry in Computer Vision, Publisher Cambridge University Press; 2Rev Ed edition (30 Nov 2003)
Geusebroek, J. M., Burghouts, G. J.and Smeulders, A. W. M. 2005. The Amsterdam Library of Object Images. Int. J. Comput. Vision, 61(1):103–112.
Geusebroek, J. M., van den Boomgaard, R. Smeulders, A. W. M. and Geerts, H. 2001. Color Invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12):1338–1350.
Abdel-Hakim, A.E.; Farag, A.A., 2006. CSIFT: A SIFT Descriptor with Color Invariant Characteristics , In Proceedings of Computer Vision and Pattern Recognition Conference, pp. 1978- 1983.
Harris, C.G. and Stephens, M. 1988. A Combined Corner and Edge Detector, In Proceedings of Fourth Alvey Vision Conference, vol. 18, pp.147-151.
Jing Huang Kumar, Mitra, S.R., Wei-Jing Zhu, Zabih, R. 1997. Image Indexing Using Color Correlograms. In Proceedings of Conference IEEE Computer Vision and Pattern Recognition, pp. 768-762.
Johnson, A. and Hebert, M. 1997. Object Recognition by Matching Oriented Points. In Proceedings of Computer Vision and Pattern Recognition Conference, pp. 684-689.
Ke, Y. and Sukthankar, R. 2004. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In Proceedings of Conference IEEE Computer Vision and Pattern Recognition, vol. 2, pp. 506-513.
Koenderink, J.J. 1984. The structure of images. Journal Biological Cybernetics, vol. 50, pp. 363-370.
Young, R. A. and Lesperance, R. M. 2001. The Gaussian Derivative Model For Spatial-Temporal Vision: II. Cortical Data. Spatial Vision, 14(3-4): pp. 321-389.
Lazebnik, S., Schmid, C. and Ponce , J. 2003. Sparse Texture Representation Using Affine-Invariant Neighborhoods. In Proceedings of Computer Vision and Pattern Recognition, pp. 319-324.
Lindeberg, T. 1998. Feature Detection with Automatic Scale Selection, International Journal of Computer Vision, vol. 30, no. 2, pp. 77-116.
Lindeberg, T. 1999. Methods for Automatic Selection. Handbook on Computer Vision and Applications, volume 2, Academic Press, Boston , pp. 239-274.
Lowe, D. 2004. Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol. 2, no. 60, pp. 91-110.
Lowe, D. 1999. Object Recognition from Local Scale-Invariant Features, In Proceedings of Seventh International Conference on Computer Vision, pp. 1150-1157.
Mikolajczyk, K. and Schmid, C. 2003. A Performance Evaluation of Local Descriptors. International Conference on Computer Vision and Pattern Recognition, pp. 257-263.
Mikolajczyk, K. and Schmid, C. 2004. Scale and Affine Invariant Interest Point Detectors. International Journal of Computer Vision, vol. 1, no. 60, pp. 63-86.
Puzicha, J., Rubner, Y., Tomasi, C. and Buhmann, J. 1999. Empirical evaluation of dissimilarity measures for color and texture. In Proceedings of the Seventh IEEE International Conference on Computer Vision , vol 2, pp. 1165-1172.
Peng Chang, Krumm, J. 1999.Object recognition with color cooccurrence histograms .In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition vol. 2, pp. 504.
Schmid, C., Mohr, R., and Bauckhage, C. 2000. Evaluation of Interest Point Detectors. International Journal of Computer Vision, 37(2): pp. 151–172.
Seong-O Shim, Tae-Sun Choi, 2003. Image Indexing by Modified Color Co-occurrence Matrix. IEEE International Conference on Image Processing, vol. 3, pp. 493-496.
Swain, M. and Ballard, D. 1991. Color indexing: International Journal of Computer Vision, vol. 7(1), pp.11-32.
Thurnhofer, S., Mitra, S. 1996. Edge-enhanced image zooming. Optical Engineering, 35(07): pp.1862–1870. Brian J. Thompson; Ed.
Van Gool, L., Moons,T. and Ungureanu, D. 1996. Affine/Photometric Invariants for Planar Intensity Patterns. In Proceedings of the Fourth European Conference on Computer Vision, vol. 1, pp. 642-651.
Wyszecki, G. and Stiles, W. S. 1982. Color Science: Concepts and Methods, Quantitative Data and Formulae. John Wiley & sons, second edition.