Project 4: Warping and Mosaicing

Manual Labelling

Warping

Given two sets of four points, we can compute a "homography" that captures the projective transformation from one set of points to the other. Additional pairs of points let us compute a least-squares homography that is less sensitive to errors in the placement of individual points.
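A rough sketch of this computation (the function name and point format below are my own, not part of the project code): the homography can be estimated from four or more point pairs with a single least-squares solve.

    import numpy as np

    def compute_homography(pts1, pts2):
        """Estimate the 3x3 homography H mapping pts1 onto pts2.

        pts1, pts2: (N, 2) arrays of corresponding (x, y) points, N >= 4.
        With exactly four pairs the system is solved exactly; with more,
        np.linalg.lstsq returns the least-squares solution.
        """
        A, b = [], []
        for (x, y), (xp, yp) in zip(pts1, pts2):
            # Two equations per correspondence; h33 is fixed to 1.
            A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
            A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
            b.extend([xp, yp])
        h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
        return np.append(h, 1).reshape(3, 3)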

To obtain a "rectification" of an image, we label four points in the image. Then we compute the homography between the labelled points and a rectangle of our choice. Applying the homography to the image produces a warped version of the image in which the region defined by the labelled points appears rectangular.
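A minimal sketch of the warp itself, assuming a three-channel float image and an inverse-warping approach (the helper name is my own):

    import numpy as np
    from scipy.ndimage import map_coordinates

    def warp_image(img, H, out_shape):
        """Inverse-warp img with homography H onto a canvas of out_shape.

        For each output pixel we apply H^-1 to find its source location
        in img and sample that location with bilinear interpolation.
        """
        H_inv = np.linalg.inv(H)
        ys, xs = np.indices(out_shape[:2])
        coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
        src = H_inv @ coords
        src_x, src_y = src[0] / src[2], src[1] / src[2]
        out = np.zeros(out_shape)
        for c in range(img.shape[2]):
            out[..., c] = map_coordinates(
                img[..., c], [src_y, src_x], order=1, cval=0.0
            ).reshape(out_shape[:2])
        return out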

img img img

[A photo of my desk (room_left.jpg); with labels; warped so that the computer monitor appears rectangular.]

img img img

[A photo of the door to my house (door.jpg); with labels; warped so that the door appears rectangular.]

For the photo of my desk, I labelled the four corners of the computer monitor and mapped these points to a 16:9 rectangle (the monitor has a 16:9 aspect ratio). For the photo of the door, I labelled the four corners of the doorframe and mapped these points to a 1:2 rectangle.

Mosaicing

Rather than mapping the labelled points of an image to a rectangle, we can map the labelled points of an image to corresponding points in a second image (that was taken from approximately the same point in space). Applying the resulting homography to the first image warps it to match the perspective of the second image, and combining the warped first image with the original second image produces a seamless "mosaic."

A homography defined by hand-picked pairs of corresponding points may be imprecise. To improve the homography, we perform gradient descent over the eight parameters of the homography. (For a particular homography, our gradient-descent loss function computes the per-pixel difference between the warped first image and the original second image.)
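A sketch of this refinement, reusing the warp_image helper above and using finite-difference gradients; the step size, iteration count, and epsilon are illustrative guesses rather than the values actually used.

    import numpy as np

    def refine_homography(H0, im1, im2, lr=1e-7, steps=200, eps=1e-4):
        """Gradient descent over the eight free parameters of H0 (H[2, 2]
        stays fixed at 1), minimizing the per-pixel difference between the
        warped first image and the original second image."""
        def loss(params):
            H = np.append(params, 1).reshape(3, 3)
            warped = warp_image(im1, H, im2.shape)
            covered = warped.sum(axis=2) > 0    # compare only covered pixels
            return np.mean((warped[covered] - im2[covered]) ** 2)

        params = (H0 / H0[2, 2]).ravel()[:8]
        for _ in range(steps):
            grad = np.zeros_like(params)
            for i in range(len(params)):        # finite-difference gradient
                bump = np.zeros_like(params)
                bump[i] = eps
                grad[i] = (loss(params + bump) - loss(params - bump)) / (2 * eps)
            params = params - lr * grad
        return np.append(params, 1).reshape(3, 3)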

img img img img

[Two photos room_left.jpg and room_right.jpg; with matching labels.]

img

[The result of warping room_left.jpg to fit the corresponding points of room_right.jpg.]

Once a precise homography is obtained, the resulting warped first image and original second image must be combined in a way that prevents visual artifacts. We divide the combined image into three regions: one in which the warped first image is the sole contributor, one in which the original second image is the sole contributor, and one in which both images contribute. For each pixel in the region of overlap, we calculate the Manhattan distance between the pixel and each of the exclusive regions. These distances decide the weights for our weighted-average computation of the RGB values of the pixel in the combined image.
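A sketch of the blending step, assuming both images have already been placed on a common canvas; the helper name and the use of SciPy's taxicab distance transform are my own choices.

    import numpy as np
    from scipy.ndimage import distance_transform_cdt

    def blend(canvas1, canvas2):
        """Blend two images that share a canvas.  Exclusive pixels are
        copied directly; overlap pixels are a weighted average whose
        weights come from the Manhattan distance to each exclusive region."""
        cov1 = canvas1.sum(axis=2) > 0
        cov2 = canvas2.sum(axis=2) > 0
        only1, only2 = cov1 & ~cov2, cov2 & ~cov1
        overlap = cov1 & cov2

        # Distance from every pixel to the nearest exclusively-1 / -2 pixel.
        d1 = distance_transform_cdt(~only1, metric='taxicab').astype(float)
        d2 = distance_transform_cdt(~only2, metric='taxicab').astype(float)

        # Pixels far from the "only image 2" region lean on image 1, and vice versa.
        w1 = d2 / (d1 + d2 + 1e-9)
        out = np.zeros_like(canvas1)
        out[only1] = canvas1[only1]
        out[only2] = canvas2[only2]
        out[overlap] = (w1[overlap][:, None] * canvas1[overlap]
                        + (1 - w1[overlap])[:, None] * canvas2[overlap])
        return out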

img

[A mosaic of room_left.jpg and room_right.jpg.]

img img img

[Two photos couch_left.jpg and couch_right.jpg; a mosaic of the two photos.]

img img img

[Two photos plushies_left.jpg and plushies_right.jpg; a mosaic of the two photos.]

Automated Labelling

We can automate mosaicing by writing code that automatically determines the corresponding points used to compute a homography. Our approach is a simplified version of the approach described in "Multi-Image Matching using Multi-Scale Oriented Patches" (Brown et al.).

Harris Corners

The project spec provides us with a get_harris_corners function. This function takes a grayscale image as input and outputs all integer-coordinate points that are local maxima of "corner strength". (A point has high corner strength if the image's derivatives with respect to x and with respect to y are both large at that point.)
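The sketch below approximates what get_harris_corners does, using scikit-image's corner detector; the project's own implementation may differ in its thresholds.

    import numpy as np
    from skimage.feature import corner_harris, peak_local_max

    def harris_corners(gray):
        """Return (row, col) local maxima of the Harris corner-strength
        map, along with their strengths."""
        strength = corner_harris(gray)
        coords = peak_local_max(strength, min_distance=1)
        return coords, strength[coords[:, 0], coords[:, 1]]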

img img

[The results of calling get_harris_corners on room_left.jpg and room_right.jpg.]

Calling get_harris_corners gives us thousands of points because the threshold for any given point being a local maximum is very low. Some points, such as the points on the corners of the computer monitor, make sense. But most of them do not correspond to visible corners. We need to eliminate most of the returned points and keep only the best ones.

Adaptive Non-Maximal Suppression

We want to keep the returned points with the greatest corner strength. However, simply keeping the greatest-strength points is not good enough, as explained by Brown et al.: this naive strategy tends to keep points that are closely clustered. Instead, we want to keep points that are roughly evenly distributed, so that two overlapping images are likely to have many corresponding points in their overlap.

The Adaptive Non-Maximal Suppression (ANMS) strategy calculates the "suppression radius" of each returned point. The point with the highest corner strength has an infinite suppression radius; for each other point p, the suppression radius of p is the minimum distance to a point with "significantly greater" corner strength (in our case, at least 11.1% more). ANMS keeps the 500 points with the greatest suppression radii.

(It turns out that calculating suppression radii is the most computationally expensive part of automated mosaicing. I found it helpful to ignore all but the 5000 points with greatest corner strength while calculating suppression radii and deciding which points to keep.)
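A sketch of ANMS under those choices (500 kept points, an ~11.1% strength margin, and a 5000-point prefilter); the function name and array layout are my own.

    import numpy as np
    from scipy.spatial.distance import cdist

    def anms(coords, strengths, keep=500, c_robust=0.9, prefilter=5000):
        """Adaptive non-maximal suppression.

        A point's suppression radius is its distance to the nearest point
        whose strength is "significantly greater" (strength_i < c_robust *
        strength_j, i.e. at least ~11.1% stronger).  We keep the `keep`
        points with the largest radii, first dropping all but the
        `prefilter` strongest points to bound the O(n^2) computation."""
        order = np.argsort(-strengths)[:prefilter]
        coords, strengths = coords[order], strengths[order]

        dists = cdist(coords, coords)
        not_stronger = strengths[None, :] * c_robust <= strengths[:, None]
        dists[not_stronger] = np.inf          # radius ignores such points
        radii = dists.min(axis=1)             # inf for the strongest point
        return coords[np.argsort(-radii)[:keep]]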

img img

[The results of applying ANMS to the previous images.]

In the post-ANMS examples above, we can see that the points on the corners of the computer monitor and other "sensible" points remain. But most of the arbitrary-looking points are gone. Furthermore, the remaining points are distributed fairly evenly.

Feature Descriptors and Feature Matching

Once we have identified points that correspond to meaningful corners in an image, we obtain the "feature descriptor" (f.d.) of each point. The feature descriptor for point p samples the 40x40 pixels surrounding p and compresses them into an 8x8 grayscale image.
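A sketch of the descriptor computation for a single point p = (row, col), assuming points within 20 pixels of the image border were discarded earlier and assuming the descriptor is bias/gain-normalized as in Brown et al.

    import numpy as np

    def feature_descriptor(gray, p):
        """8x8 descriptor for p: take the 40x40 patch around p, average
        it down to 8x8 (each cell covers a 5x5 block), then normalize."""
        r, c = p
        patch = gray[r - 20:r + 20, c - 20:c + 20]
        desc = patch.reshape(8, 5, 8, 5).mean(axis=(1, 3))   # 40x40 -> 8x8
        return (desc - desc.mean()) / (desc.std() + 1e-9)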

With the post-ANMS points and feature descriptors for two overlapping images, we can now perform "feature matching" to find corresponding points between the two images. We measure the error between two feature descriptors as the squared norm of their difference, which is easily calculated in NumPy.

For each post-ANMS point p in image im1, we compute the nearest neighbor of p in image im2; that is, the point q1 in im2 whose feature descriptor has the least error with the feature descriptor of p. We also compute the second-nearest neighbor q2 of p.

Let e1 be the error between the f.d.s of p and q1; let e2 be the error between the f.d.s of p and q2. From "Distinctive image features from scale-invariant keypoints" (Lowe), we know that the ratio e1 / e2 is closely linked to whether p and q1 are corresponding points. In particular, Brown et al. tell us that p and q1 are overwhelmingly likely to be corresponding points when e1 / e2 < 0.4. However, our RANSAC method for computing a least-squares homography works even if we incorrectly identify many pairs of points as corresponding, so we choose a more generous threshold of e1 / e2 < 0.7.

We conclude feature matching by identifying all pairs of points (p, q) that are "symmetrically" satisfactory; that is, p and q are each other's nearest neighbors and both points satisfy the e1 / e2 < 0.7 threshold.
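A sketch of the matching step, assuming each image's descriptors have been flattened into the rows of a matrix (one row per post-ANMS point); the mutual-nearest-neighbor check and the 0.7 ratio follow the description above.

    import numpy as np
    from scipy.spatial.distance import cdist

    def match_features(desc1, desc2, ratio=0.7):
        """Return index pairs (i, j) such that i and j are mutual nearest
        neighbors and both directions pass Lowe's ratio test e1/e2 < ratio."""
        err = cdist(desc1, desc2, metric='sqeuclidean')

        def nearest(e):
            nn = np.argsort(e, axis=1)
            rows = np.arange(len(e))
            e1, e2 = e[rows, nn[:, 0]], e[rows, nn[:, 1]]
            return nn[:, 0], e1 / (e2 + 1e-12) < ratio

        nn12, ok12 = nearest(err)      # best match in im2 for each im1 point
        nn21, ok21 = nearest(err.T)    # best match in im1 for each im2 point
        return [(i, j) for i, j in enumerate(nn12)
                if ok12[i] and ok21[j] and nn21[j] == i]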

img img

[All pairs of points identified by matching the f.d.s of room_left.jpg and room_right.jpg.]


img img

[The f.d.s for the pair of points at the bottom-left corner of the computer monitor, labelled "2".]

img img

[The f.d.s for the pair of points at the top-right corner of the computer monitor, labelled "3".]

img img

[The f.d.s for the pair of points at the top-right corner of the yellow water bottle, labelled "32".]

RANSAC

The final step in automating our mosaicing process is to compute a least-squares homography from our pairs of corresponding points. Not all of our pairs are guaranteed to be correct, so we cannot simply use all of them in the homography computation.

Our strategy is to repeatedly and randomly select four pairs of corresponding points. For each selection, we compute the homography H implied by the four pairs. Then we count the "inliers" of H; that is, we count how many of our pairs are correctly mapped by H. This computation is very quick for an individual selection, so we can iterate over thousands of random selections.

After inspecting a sufficient number of selections, we choose the selection which has the maximum "inliers" count. The homography H for this selection is likely very close to the correct homography. We refine H by greedily introducing additional pairs of corresponding points to its least-squares computation until we can no longer improve the "inliers" count for our current H.
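A sketch of the RANSAC loop, reusing the compute_homography sketch above; the iteration count and inlier tolerance are illustrative, and the final step here simply refits over the winning inlier set rather than performing the greedy refinement described above.

    import numpy as np

    def ransac_homography(pts1, pts2, iters=5000, tol=2.0):
        """Try random 4-point samples and keep the homography whose
        inlier set (pairs mapped to within tol pixels) is largest."""
        def inliers_of(H):
            ones = np.ones((len(pts1), 1))
            mapped = (H @ np.hstack([pts1, ones]).T).T
            mapped = mapped[:, :2] / mapped[:, 2:3]
            return np.linalg.norm(mapped - pts2, axis=1) < tol

        rng = np.random.default_rng(0)
        best = np.zeros(len(pts1), dtype=bool)
        for _ in range(iters):
            sample = rng.choice(len(pts1), size=4, replace=False)
            H = compute_homography(pts1[sample], pts2[sample])
            inliers = inliers_of(H)
            if inliers.sum() > best.sum():
                best = inliers
        # Final least-squares fit over the inliers of the best sample.
        return compute_homography(pts1[best], pts2[best])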

Once we have settled on a final H, we proceed with warping and overlapping. Unlike in manual mosaicing, we do not need to improve H with gradient descent.

Results

img img
img img
img img

[The results of manual (left) and automated (right) mosaicing.]

Bells and Whistles

Two improvements for automated mosaicing.

Scale Invariance

For two overlapping images with drastically different resolutions, the feature-matching process described above will not work well because feature descriptors for the larger image will appear "zoomed in" compared to feature descriptors for the smaller image. We solve this problem by sampling Harris corners from a Gaussian pyramid, as mentioned by Brown et al.

We create a Gaussian pyramid for each input image by repeatedly smoothing and downsampling the image. Lower-resolution levels in the pyramid let us capture more pixels from the original image when we compute feature descriptors. The pyramid should not be too deep, however, because the sub-pixel errors inherent to get_harris_corners grow enormously when we convert from the coordinates of a low-resolution level to the coordinates of the original level.

In our code for obtaining the initial points of interest, we call get_harris_corners on each level of a pyramid. Then we combine the results of these calls (while recording the "origin level" of each point) and proceed normally. The origin level of a point matters primarily for computing its feature descriptor.
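A sketch of this multi-scale collection step, reusing the harris_corners sketch above; the two-level pyramid and factor-of-two downsampling per level are assumptions that match the examples below.

    import numpy as np
    from skimage.transform import rescale

    def pyramid_corners(gray, levels=2):
        """Collect Harris corners from each level of a Gaussian pyramid.
        Coordinates are mapped back to level-0 pixels, and each point
        remembers its origin level so its descriptor can later be sampled
        at the right scale."""
        points = []
        for level in range(levels):
            scaled = gray if level == 0 else rescale(
                gray, 0.5 ** level, anti_aliasing=True)   # smooth + downsample
            coords, strengths = harris_corners(scaled)
            for (r, c), s in zip(coords, strengths):
                points.append((r * 2 ** level, c * 2 ** level, s, level))
        return points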

img img

[Harris corners for the zeroth and first levels of the Gaussian pyramid derived from a doubled-resolution room_left.jpg.]

img

[The automated mosaic between room_left.jpg and room_right.jpg; the automated mosaic between a doubled-resolution room_left.jpg and the original room_right.jpg.]

We can see that the automated mosaic becomes slightly worse when we introduce a doubled-resolution room_left.jpg. This makes sense because our reliance on the first level of the Gaussian pyramid effectively doubles the errors of get_harris_corners.

Unordered Set Processing

We can use feature matching not just to find corresponding points between two overlapping images, but also to determine whether two images overlap at all. We define the constant min_pairs = 15. If feature matching between images im1 and im2 yields fewer than min_pairs pairs of corresponding points, then it is very likely that im1 and im2 do not overlap.

Given an unordered set of images, we find the pair of images for which feature matching yields the most corresponding points, remove both from the set, combine them via automated mosaicing, and append the result to the set. We repeat this process until no pair of images in the set has at least min_pairs pairs of corresponding points.
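A sketch of this loop, where match_count and make_mosaic stand in for the feature-matching and mosaicing steps described above (both names are my own):

    def mosaic_unordered(images, min_pairs=15):
        """Repeatedly merge the pair of images with the most matched
        points until no pair reaches min_pairs matches."""
        images = list(images)
        while True:
            best = None
            for i in range(len(images)):
                for j in range(i + 1, len(images)):
                    n = match_count(images[i], images[j])
                    if n >= min_pairs and (best is None or n > best[0]):
                        best = (n, i, j)
            if best is None:
                return images               # nothing overlaps any more
            _, i, j = best
            merged = make_mosaic(images[i], images[j])
            images = [im for k, im in enumerate(images) if k not in (i, j)]
            images.append(merged)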

img img
img img

[The result of processing a shuffled list containing the two room photos, the two couch photos, the two plushie photos, and the door photo.]

In the results above, we can see that door.jpg, which does not overlap with any of the other input images, is unchanged.