Bruce D. LucasCarnegie-Mellon University

Takeo KanadeCarnegie-Mellon University

Abstract

Image registration finds a variety of applications in computer vision. Unfortunately, traditional image registration techniques tend to be costly. We present a new image registration technique that makes use of the spatial intensity gradient of the images to find a good match using a type of Newton-Raphson iteration. Our technique is faster because it examines far fewer potential matches between the images than existing techniques. Furthermore, this registration technique can be generalized to handle rotation, scaling and shearing. We show show our technique can be adapted for use in a stereo vision system.

Introduction

Image registration finds a variety of applications in computer vision, such as image matching for stereo vision, pattern recognition, and motion analysis. Untortunately, existing techniques for image registration tend to be costly. Moreover, they generally fail to deal with rotation or other distortions of the images.

In this paper we present a new image registration technique that uses spatial intensity gradient information to direct the search for the position that yields the best match. By taking more information about the images into account, this technique is able to find the best match between two images with far fewer comparisons of images than techniques which examine the possible positions of registration in some fixed order. Our technique takes advantage of the fact that in many applications the two images are already in approximate registration. This technique can be generalized to deal with arbitrary linear distortions of the image, including rotation. We then describe a stereo vision system that uses this registration technique, and suggest some further avenues for research toward making effective use of this method in stereo image understanding.

The registration problem

The translational image registration problem can be characterized as follows: We are given functions and which give the respective pixel values at each location in two images, where is a vector. We wish to find the disparity vector which minimizes some measure of the difference between and , for in some region of interest . (See ).

Typical measures of the difference between and are:

negative of normalized correlation

= \frac{-\sum_{x \varepsilon R} F(x + h) G(x)}{(\sum_{x \varepsilon R})^\frac{1}{2} (\sum_{x \varepsilon R} G(x) ^2)^ \frac{1}{2}}

We will propose a more general measure of image difference, of which both the norm and the correlation are special cases. The norm is chiefly of interest as an inexpensive approximation to the norm.

Existing techniques

An obvious technique for registering two images is to calculate a measure of the difference between the images at all possible values of the disparity vector —that is, to exhaustively search the space of possible values of . This technique is very time consuming: if the size of the picture is , and the region of possible values of is of size , then this method requires time to compute.

Speedup at the risk of possible failure to find the best can be achieved by using a hill-climbing technique. This technique begins with an initial estimate of the disparity. To obtain the next guess from the current guess , one evaluates the difference function at all points in a small (say, ) neighborhood of and takes as the next guess that point which minimizes the difference function. As with all hill-climbing techniques, this method suffers from the problem of false peaks: the local optimum that one attains may not be the global optimum. This technique operates in time on the average, for and as above.

Another technique, known as the sequential similarity detection algorithm (SSDA) [], only estimates the error for each disparity vector . In SSDA, the error function must be a cumulative one such as the or norm. One stops accumulating the error for the current under investigation when it becomes apparent that the current is not likely to give the best match. Criteria for stopping include a fixed threshold such that when the accumulated error exceeds this threshold one goes on to the next , and a variable threshold which increases with the number of pixels in whose contribution to the total error have been added. SSDA leaves unspecified the order in which the ’s are examined.

Note that in SSDA if we adopt as our threshold the minimum error we have found among the examined so far, we obtain an algorithm similar to alpha-beta pruning in minmax game trees []. Here we take advantage of the fact that in evaluating , where is the contribution of pixel at disparity to the total error, the can only increase as we look at more ’s (more pixels).

Some registration algorithms employ a coarse-fine search strategy. See [] for an example. One of the techniques discussed above is used to find the best registration for the images at low resolution, and the low resolution match is then used to constrain the region of possible matches examined at higher resolution. The coarse-fine strategy is adopted implicitly by some image understanding systems which work with a pyramid of images of the same scene at various resolutions.

It should be nated that some of the techniques mentioned so far can be combined because they concern orthogonal aspects of the image registration problem. Hill climbing and exhaustive search concern only the order in which the algorithm searches for the best match, and SSDA specifies only the method used to calculate (an estimate of) the difference function. Thus for example, one could use the SSDA technique with either hill climbing or exhaustive search, in addition a coarse-fine strategy may be adopted.

The algorithm we present specifies the order in which to search the space of possible ’s. In particular, our technique starts with an initial estimate of , and it uses the spatial intensity gradient at each point of the image to modify the current estimate of to obtain an which yields a better match. This process is repeated in a kind of Newton-Raphson iteration. If the iteration converses, it will do so in steps on the average. This registration technique can be combined with a coarse-fine strategy,since is requires an initial estimate of the approximate disparity .

The registration algorithm

In this section we first derive an intuitive solution to the one dimensional registration problem, and then we derive an alternative solution which we generalize to multiple dimensions. We then show how our technique generalizes to other kinds of registration. We also discuss implementation and performance of the algorithm.

One dimensional case

In the one-dimensional registration problem, we wish to find the horizontal disparity between two curves and . This is illustrated in .

Our solution to this problem depends on , a linear approximation to the behavior of in the neighborhood of , as do all subsequent solutions in this paper. In particular, for small ,

\begin{split} F'(x) &\approx \frac{F(x + h) - F(x)}{h} \\ &= \frac{G(x) - F(x)}{h} \end{split}

so that

h = \frac{G(x) - F(x)}{F'(x)}

The success of our algorithm requires to be small enough that this approximation is adequate. In section we will show how to extend the range of ’s over which this approximation is adequate by smoothing the images.

The approximation to given in depends on . A natural method for combining the various estimates of at various values of would be to simply average them:

h \approx \sum_x \frac{G(x) - F(x)}{F'(x)} / \sum_x 1

We can improve this average by realizing that the linear approximation in is good where is nearly linear, and conversely is worse where is large. Thus we could weight the contribution of each term to the average in in inverse proportion to an estimate of . One such estimate is

F''(x) \approx \frac{G'(x) - F'(x)}{h}

Since our estimate is to be used as a weight in an average, we can drop the constant factor of in , and use as our weighting function

w(x) = \frac{1}{|G'(x) - F(x)|}

This in fact appeals to our intuition: for example, in , where the two curves cross, the estimate of provided by is , which is bad; fortunately, the weight given to this estimate in the average is small, since the difference between and at this point is large. The average with weighting is

h \approx \sum_x \frac{w(x)[G(x) - F(x)]}{F'(x)} / \sum_{x} w(x)

where is given by .

Having obtained this estimate. we can then move by our estimate of , and repeat this procedure, yielding a type of Newton-Raphson iteration. Ideally, our sequence of estimates of will converge to the best . This iteration is expressed by

h_0 = 0,

h_{k+1} = h_k + \sum_{x} \frac{w(x)[G(x) - F(x + h_k)]}{F'(x + h_k)} / \sum_x w(x)

An alternative derivation

The derivation given above does not generalize well to two dimensions because the two-dimensional linear approximation occurs in a different form. Moreover, is undefined where , i.e. where the curve is level. Both of these problems can be corrected by using the linear approximation of equation in the form

F(x + h) \approx F(x) + h F'(x)

to find the which minimizes the norm measure of the difference between the curves:

E = \sum_x [F(x + h) - G(x)]^2

To minimize the error with respect to , we set

\begin{split} 0 &= \frac{\partial E}{\partial h} \\ &\approx \frac{\partial}{\partial h} \sum_x [F(x) + h F'(x) - G(x)]^2 \\ &= \sum_x 2 F'(x)[F(x) + hF'(x) - G(x)] \end{split}

from which

h \approx \frac{\sum_x F'(x)[G(x) - F(x)]}{\sum_x F'(x)^2}

This is essentially the same solution that we derived in , but with the weighting function . As we will see the form of the linear approximation we have used here generalizes to two or more dimensions. Moreover, we have avoided the problem of dividing by , since in we will divide by only if everywhere (in which case really is undefined), whereas in we will divide by if anywhere.

The iterative form with weighting corresponding to is

h_0 = 0,

h_{k+1} = h_k + \frac{\sum_x w(x) F'(x + h_k) [G(x) - F(x + h_k)]}{\sum_x w(x) F'(x + h_k)^2}

where is given by .

Performance

A natural question to ask is under what conditions and how fast the sequence of ’s converges to the real . Consider the case:

It can be shown that both versions of the registration algorithm given above will converge to the correct for , that is, for initial misregistrations as large as one-half wavelength. This suggests that we can improve the range of convergence of the algorithm by suppressing high spatial frequencies in the image, which can be accomplished by smoothing the image, i.e. by replacing each pixel of the image by a weighted average of neighboring pixels. The tradeoff is that smoothing suppresses small details, and thus makes the match less accurate. If the smoothing window is much larger than the size of the object that we are trying to match, the object may be suppressed entirely, and so no match will be possible.

Since lowpass filtered images can be sampled at lower resolution with no loss of information, the above observation suggests that we adopt a coarse-fine strategy. We can use a low resolution smoothed version of the image to obtain an approximate match. Applying the algorithm to higher resolution images will refine the match obtained at lower resolution.

While the effect of smoothing is to extend the range of convergence, the weighting function serves to improve the accuracy of the approximation, and thus to speed up the convergence. Without weighting, i.e. with , the calculated disparity of the first iteration of with falls off to zero as the disparity approaches one-half wavelength. However, with as in , the calculation of disparity is much more accurate, and only falls off to zero at a disparity very near one-half wavelength. Thus with as in convergence is faster for large disparities.

Implementation

Implementing requires calculating the weighted sums of the quantities , , and over the region of interest . We cannot calculate exactly, but for the purposes of this algorithm, we can estimate it by

F'(x) \approx \frac{F(x + \Delta x) - F(x)}{\Delta x}

and similarly for , where we choose appropriately small (e.g. one pixel). Some more sophisticated technique could be used for estimating the first derivatives, but in general such techniques are equivalent to first smoothing the function, which we have proposed doing for other reasons, and then taking the difference.

Generalization to multiple dimensions

The one-dimensional registration algorithm given above can be generalized to two or more dimensions. We wish to minimize the norm measure of error:

E = \sum_{x \varepsilon R} [F(x + h) - G(x)]^2

where and are -dimensional row vectors. We make a linear approximation analogous to that in ,

F(x + h) \approx F(x) + h \frac{\partial}{\partial x} F(x)

where is the gradient operator with respect to , as a column vector:

\frac{\partial}{\partial x} = \left[ \frac{\partial}{\partial x_1} \frac{\partial}{\partial x_2} \dots \frac{\partial}{\partial x_n} \right]^\top

Using this approximation, to minimize , we set

\begin{split} 0 &= \frac{\partial}{\partial h} E \\ &\approx \frac{\partial}{\partial h} \sum_x \left[ F(x) + h \frac{\partial F}{\partial x} - G(x) \right]^2 \\ &= \sum_x 2 \frac{\partial F}{\partial x} \left[ F(x) + h \frac{\partial F}{\partial x} - G(x) \right] \end{split}

from which

h = \left[ \sum_x \left( \frac{\partial F}{\partial x} \right)^\top [G(x) - F(x)] \right] \left[ \sum_x \left( \frac{\partial F}{\partial x} \right)^\top \frac{\partial F}{\partial x} \right]^{-1}

which has much the same form as the one-dimensional version in .

The discussions above of iteration, weighting, smoothing, and the coarse-fine technique with respect to the onedimensional case apply to the n-dimensional case as well. Calculating our estimate of in the two-dimensional case requires accumulating the weighted sum of five products over the region , as opposed to accumulating one product for correlation or the norm. However, this is more than compensated for, especially in high-resolution images, by evaluating these sums at fewer values of .

Further generalizations

Our technique can be extended to registration between two images related not by a simple translation, but by an arbitrary linear transformation, such as rotation, scaling, and shearing. Such a relationship is expressed by

G(x) = F(x A + h)

where is a matrix expressing the linear spatial tranformation between and . The quantity to be minimized in this case is

E = \sum_x [F(xA + h) - G(x)]^2

To determine the amount to adjust and the amount to adjust , we use the linear approximation

\begin{split} & F(x(A + \Delta A) + (h + \Delta h)) \\ \approx& F(xA + h) + (x \Delta A + \Delta h) \frac{\partial}{\partial x} F(x) \end{split}

When we use this approximation the error expression again becomes quadratic in the quantities to be minimized with respect to. Differentiating with respect to these quantities and setting the results equal to zero yields a set of linear equations to be solved simultaneously.

This generalization is useful in applications such as stereovision, where the two different views of the object will be different views, due to the difference of the viewpoints of the cameras or to differences in the processing of the two images. If we model this difference as a linear transformation, we have (ignoring the registration problem for the moment)

F(x) = \alpha G(x) + \beta

where may be thought of as a contrast adjustment and as a brightness adjustment. Combining this with the general linear transformation registration problem, we obtain

E = \sum_x [F(xA + h) - (\alpha G(x) + \beta)]^2

as the quantity to minimize with respect to , , , and . The minimization of this quantity, using the linear approximation in equation , is straightforward. This is the general form promised in section . If we ignore , minimizing this quantity is equivalent to maximizing the correlation coefficient (see, for example, []); if we ignore and as well, minimizing this form is equivalent to minimizing the norm.

Application to stereo vision

In this section we show how the generalized registration algorithm described above can be applied to extracting depth information from stereo images.

The stereo problem

The problem of extracting depth information from a stereo pair has in principle four components: finding objects in the pictures, matching the objects in the two views, determining the camera parameters, and determining the distances from the camera to the objects. Our approach is to combine object matching with solving for the camera parameters and the distances of the objects by using a form of the fast registration technique described above.

Techniques for locating objects include an interest operator [], zero crossings in bandpass-filtered images [], and linear features []. One might also use regions found by an image segmentation program as objects.

Stereo vision systems which work with features at the pixel level can use one of the registration techniques discussed above. Systems whose objects are higher-level features must use some difference measure and some search technique suited to the particular feature being used. Our registration algorithm provides a stereo vision system with a fast method of doing pixel-level matching.

Many stereo vision systems concern themselves only with calculating the distances to the matched objects. One must also be aware that in any real application of stereo vision the relative positions of the cameras will not be known with perfect accuracy. has shown how to simultaneously solve for the camera parameters and the distances of objects.

A mathematical characterization

The notation we use is illustrated in . Let be the vector of camera parameters that describe the orientation and position of camera 2 with respect to camera 1’s coordinate system. These parameters are azimuth, elevation, pan, tilt, and roll, as defined in . Let denote the position of an image in the camera 1 film plane of an object. Suppose the object is at a distance from camera 1. Given the position in picture 1 and distance of the object, we could directly calculate the position that it must have occupied in three-space. We express with respect to camera 1’s coordinate system so that does not depend on the orientation of camera 1. The object would appear on camera 2’s film plane at a position that is dependent on the object’s position in three-space and on the camera parameters . Let be the intensity value of pixel in picture 1, and let the intensity value of pixel in picture 2. The goal of a stereo vision system is to invert the relationship described above and solve for and , given , and .

Applying the registration algorithm

First consider the case where we know the exact camera parameters , and we wish to discover the distance of an object. Suppose we have an estimate of the distance . We wish to see what happens to the quality of our match between and as we vary by an amount . The linear approximation that we use here is:

where

\frac{\partial F}{\partial z} = \frac{\partial p}{\partial z} \frac{\partial q}{\partial p} \frac{\partial F}{\partial q}

This equation is due to the chain rule of the gradient operator; is a matrix of partial derivatives of the components of with respect to the components of , and is the spatial intensity gradient of the image . To update our estimate of , we want to find the which satisfies

0 = \frac{\partial}{\partial \Delta z} E \approx \frac{\partial}{\partial \Delta z} \sum_x [F + \Delta z \frac{\partial F}{\partial \Delta z} - G]^2

Solving for , we obtain

\delta z = \sum_x \frac{\partial F}{\partial \Delta z} [G - F] / \sum_x \left( \frac{\partial F}{\partial z} \right)^2

where is given by .

On the other hand. suppose we know the distances , of each of objects from camera 1, but we don’t know the exact camera parameters . We wish to determine the effect of changing our estimate of the camera parameters by an amount . Using the linear approximation

F(c + \Delta c) \approx F(c) + \Delta c \frac{\partial q}{\partial c} \frac{\partial F}{\partial q}

we solve the minimization of the error function with respect to by setting

0 = \frac{\partial}{\partial \Delta c} \sum_i \sum_{x \varepsilon R_i} [F(c + \Delta c) - G]^2 \approx \frac{\partial}{\partial \Delta c} \sum_i \sum_{x} [F + \Delta c \frac{\partial q}{\partial c} \frac{\partial F}{\partial q} - G]^2

obtaining

\Delta c \approx \left[\sum_x \left(\frac{\partial q}{\partial c} \frac{\partial F}{\partial q} \right)^\top [G - F] \right] \left[ \left(\frac{\partial q}{\partial c} \frac{\partial F}{\partial q} \right)^\top \left(\frac{\partial q}{\partial c} \frac{\partial F}{\partial q} \right) \right]^{-1}

As with the other techniques derived in this paper, weighting and iteration improve the solutions for and derived above.

An implementation

We have implemented the technique described above in a system which functions well under human supervision. Our program is capable of solving for the distances to the objects, the five camera parameters described above, and a brightness and contrast parameter for the entire scene, or any subset of these parameters. As one would expect from the discussion in section , the algorithm will converge to the correct distances and camera parameters when the initial estimates of the ’s and are sufficiently accurate that we know the position in the camera 2 film plane of each object to within a distance on the order of the size of the object.

A session with this program is illustrated in figures 4 through 10. The original stereo pair is presented in . (Readers who can view stereo pairs cross-eyed will want to hold the pictures upside down so that each eye receives the correct view). The camera parameters were determined independently by hand-selecting matching points and solving for the parameters using the program described in [].

and are bandpass-flitered versions of the pictures in . Bandpass-filtered images are preferred to lowpass-filtered images in finding matches because very low spatial frequencies tend to be a result of shading differences and carry no (or misleading) depth information. The two regions enclosed in rectangles in the left view of have been hand-selected and assigned an initial depth of in units of the distance between cameras. If these were the actual depths, the corresponding objects would be found in the right view at the positions indicated . After seven depth-adjustment iterations, the program found the matches shown in . The distances are for object 1 and for object 2.

and are bandpass-filtered with a band one octave higher than and . Five new points have been hand-selected in the left view, reflecting the different features which have become visible in this spatial frequency range. Each has been assigned an initial depth equal to that found for the corresponding larger region in . The predicted position corresponding to these depths is shown in the right view of . After five depth-adjustment iterations, the matches shown in were found. The corresponding depths are for object 1, for object 2, for object 3, for object 4, and for object 5.

and are bandpass-filtered with a band yet another octave higher than and . Again five new points have been hand-selected in the left view, reflecting the different features which have become visible in this spatial frequency range. Each has been assigned an initial depth equal to that found for the corresponding region in . The predicted position corresponding to these depths is shown in the right view of . After four depthadjustment iterations, the matches shown in were found. The corresponding depths are for object 1, for object 2, For object 3, for object 4, and for object 5.

Future research

The system that we have implemented at present requires considerable hand-guidance. The following are the issues we intend to investigate toward the goal of automating the process.

Providing initial depth estimates for objects: one should be able to use approximate depths obtained from low resolution images to provide initial depth estimates for nearby objects visible only at higher resolutions. This suggests a coarse-fine paradigm not just for the problem of finding individual matches but for the problem of extracting depth infortnation as a whole.
Constructing a depth map: one could construct a depth map from depth measurements by some interpolation method, and refine the depth map with depth measurements obtained from successively higher resolution views.
Selecting points of interest: the various techniques mentioned in section should be explored.
Tracking sudden depth changes: the sudden depth changes found at the edges of objects require some set of higher-level heuristics to keep the matching algorithm on track at object boundaries.
Compensating for the different appearances of objects in the two views: the general form of the matching algorithm that allows for arbitrary linear transformations should be useful here.

Acknowledgements

We would like to thank Michael Horowitz, Richard Korf, and Pradeep Sindhu for their helpful comments on early drafts of this paper.

References

Baker, H. H. (1980). Edge Based Stereo Correlation. DARPA Image Understanding Workshop, 168–175.
Barnea, D. I., & Silverman, H. F. (1972). A Class of Algorithms for Fast Digital Image Registration. IEEE Transactions on Computers, C–21, 179–186. https://doi.org/10.1109/TC.1972.5008923
Dudewicz, E. J. (1976). Introduction to statistics and probability. Holt, Rinehart.
Gennery, D. B. (1979). Stereo-Camera Calibration. DARPA Image Understanding Workshop, 101–107.
Marr, D., & Poggio, T. A. (1979). A computational theory of human stereo vision. Proceedings of the Royal Society of London. Series B, Biological Sciences, 204 1156, 301–328. https://doi.org/10.1016/B978-1-4832-1446-7.50046-7
Moravec, H. P. (1979). Visual Mapping by a Robot Rover. IJCAI. https://doi.org/10.5555/1624861.1624997
Nilsson, N. J. (1971). Problem-solving methods in artificial intelligence. McGraw-Hill Computer Science Series. https://doi.org/10.1145/1056578.1056583