Apple Patent | Method of providing image feature descriptors

Patent: Method of providing image feature descriptors

Publication Number: 10192145

Publication Date: 2019-01-29

Applicants: Apple

Abstract

A method of providing a set of feature descriptors configured to be used in matching an object in an image of a camera is provided. The method includes: a) providing at least two images of a first object; b) extracting in at least two of the images at least one feature from the respective image; c) providing at least one descriptor for an extracted feature, and storing the descriptors; d) matching descriptors in the first set of descriptors; e) computing a score parameter based on the result of the matching process; f) selecting at least one descriptor based on its score parameter; g) adding the selected descriptor(s) to a second set of descriptors; and h) updating the score parameter of descriptors in the first set according to the selection process and to the result of the matching process.

Background

Such a method may be used, among other applications, in a method of determining the position and orientation of a camera with respect to an object. A common approach to determine the position and orientation of a camera with respect to an object with a known geometry and visual appearance uses 2D-3D correspondences gained by means of local feature descriptors, such as SIFT described in D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal on Computer Vision, 60(2):91-110, 2004. In an offline step, one or more views of the object are used as reference images. Given these images, local features are detected and then described, resulting in a set of reference feature descriptors with known 3D positions. For a live camera image, the same procedure is performed to gain current feature descriptors with 2D image coordinates. A similarity measure, such as the reciprocal of the Euclidean distance of the descriptors, can be used to determine the similarity of two features. Matching the current feature descriptors with the set of reference descriptors results in 2D-3D correspondences between the current camera image and the reference object. The camera pose with respect to the object is then determined based on these correspondences and can be used in Augmented Reality applications to overlay virtual 3D content registered with the real object. Note that, analogously, the position and orientation of the object can be determined with respect to the camera coordinate system.
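As an illustration, the nearest-neighbour matching step with the reciprocal of the Euclidean distance as the similarity measure could be sketched as follows (a minimal sketch; the 4-D toy descriptors stand in for e.g. 128-D SIFT vectors, and all names are illustrative):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(current, reference):
    """Match each current descriptor to its most similar reference descriptor.

    Similarity is the reciprocal of the Euclidean distance, as in the text.
    Returns a list of (current_index, reference_index, similarity) tuples.
    """
    matches = []
    for i, c in enumerate(current):
        dists = [euclidean(c, r) for r in reference]
        j = min(range(len(dists)), key=dists.__getitem__)  # nearest neighbour
        sim = 1.0 / dists[j] if dists[j] > 0 else math.inf
        matches.append((i, j, sim))
    return matches

# Toy example: one current descriptor against two reference descriptors.
reference = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]
current = [(0.1, 0.0, 0.0, 0.0)]
matches = match_descriptors(current, reference)
```

Here the current feature 0 is matched to reference descriptor 0 with similarity 1/0.1 = 10; in the localization method the reference index would then yield the corresponding known 3D position.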

Commonly, both feature detectors and feature description methods need to be invariant to changes in the viewpoint up to a certain extent. Affine-invariant feature detectors that estimate an affine transformation to normalize the neighborhood of a feature exist, as described in K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine region detectors. Int. Journal Computer Vision, 65:43-72, 2005, but they are currently too expensive for real-time applications on mobile devices. Instead, usually only a uniform scale factor and an in-plane rotation are estimated, resulting in true invariance to these two transformations only. The feature description methods then use the determined scale and orientation of a feature to normalize the support region before computing the descriptor. Invariance to out-of-plane rotations, however, is usually fairly limited and left to the description method itself.

If auxiliary information is available, this can be used to compensate for out-of-plane rotations. Provided with the depth of the camera pixels, the 3D normal vector of a feature can be determined to create a viewpoint-invariant patch, as described in C. Wu, B. Clipp, X. Li, J.-M. Frahm, and M. Pollefeys. 3d model matching with viewpoint-invariant patches (VIP). In Proc. IEEE CVPR, 2008, of the feature. For horizontal surfaces, the gravity vector measured with inertial sensors enables the rectification of the camera image prior to feature description, as described in D. Kurz and S. Benhimane Gravity-Aware Handheld Augmented Reality. In Proc. IEEE/ACM ISMAR, 2011.

If such data is not available, rendering techniques, such as image warping, can be employed to create a multitude of synthetic views, i.e. images, of a feature. For descriptors providing a low invariance to viewpoint variations or in-plane rotations but enabling very fast descriptor matching, such synthetic views are used to create different descriptors for different viewpoints and/or rotations to support larger variations, as described in S. Taylor, E. Rosten, and T. Drummond. Robust feature matching in 2.3 ms. In IEEE CVPR Workshop on Feature Detectors and Descriptors, 2009; M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua. Brief: Computing a local binary descriptor very fast. IEEE Trans. Pattern Anal. Mach. Intell, 34:1281-1298, 2012.

However, with an increasing number of reference feature descriptors, the time to match a single current feature descriptor increases, making real-time processing impossible at some point. Additionally, the amount of reference data, which potentially needs to be transferred via mobile networks, increases which results in longer loading times.

In addition to invariance to spatial transformations resulting from a varying viewpoint, it is also crucial that feature descriptors (and feature classifiers) provide invariance to changes in illumination, noise, and other non-spatial transformations. Approaches exist that employ learning to find ideal feature descriptor layouts within a defined design space, as described in M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 33(1):43-57, 2011, based on a ground truth dataset containing corresponding image patches of features under greatly varying pose and illumination conditions. Analogously, classifiers can be provided in the training phase with warped patches that additionally contain synthetic noise, blur, or similar effects. Thanks to a training stage provided with different appearances of a feature, classifiers in general provide good invariance to the transformations that were synthesized during training. However, the probabilities that need to be stored for feature classifiers require a lot of memory, which makes them infeasible for a large number of features, in particular on memory-limited mobile devices.

Using different synthetic views, i.e. images, of an object to simulate different appearances has been shown to provide good invariance to out-of-plane rotations. However, the existing methods making use of this produce a large amount of descriptor data, making them almost infeasible on mobile devices.

It would therefore be beneficial to provide a method of providing a set of feature descriptors which can be used in methods of matching features of an object in a camera image on devices with limited memory capacity.

Summary

Aspects of the invention are provided according to the independent claims.

According to an aspect, there is disclosed a method of providing a set of feature descriptors configured to be used in matching at least one feature of an object in an image of a camera, comprising the steps of: a) providing at least two images of a first object or of multiple instances of a first object, wherein the multiple instances provide different appearances or different versions of an object, b) extracting in at least two of the images at least one feature from the respective image, c) providing at least one descriptor for an extracted feature, and storing the descriptors for a plurality of extracted features in a first set of descriptors, d) matching a plurality of the descriptors of the first set of descriptors against a plurality of the descriptors of the first set of descriptors, e) computing a score parameter for a plurality of the descriptors based on the result of the matching process, f) selecting among the descriptors at least one descriptor based on its score parameter in comparison with score parameters of other descriptors, g) adding the selected descriptor to a second set of descriptors, h) updating the score parameter of a plurality of the descriptors in the first set of descriptors according to any preceding selection process and to the result of the matching process, and i) performing steps f) and g) again, wherein the second set of descriptors is configured to be used in matching at least one feature of the first object or of a second object in an image of a camera.
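Steps d) through i) amount to an iterative, greedy selection loop over the first set of descriptors. A minimal sketch in Python, assuming a precomputed table of correct matches within the first set; the concrete scoring, update, and stopping criteria are left open by the method and are illustrative here:

```python
def select_representative_descriptors(correct_match, target_size):
    """Greedy sketch of steps d)-i): repeatedly pick the descriptor with the
    highest score (here: number of correct matches not yet accounted for),
    add it to the second set, and update the remaining scores.

    `correct_match[i][j]` is True if descriptor i correctly matches descriptor j
    within the first set. The second set is stored as a list of indices into
    the first set, so no descriptor data is copied.
    """
    n = len(correct_match)
    covered = [False] * n          # matches already accounted for by a selection
    second_set = []
    while len(second_set) < target_size:
        # f) select the not-yet-selected descriptor with the highest score
        best, best_score = None, 0
        for i in range(n):
            if i in second_set:
                continue
            score = sum(1 for j in range(n) if correct_match[i][j] and not covered[j])
            if score > best_score:
                best, best_score = i, score
        if best is None:           # i) stop once no selection improves the set
            break
        second_set.append(best)    # g) add to the second set (by index)
        # h) update scores: matches covered by the selection no longer count
        for j in range(n):
            if correct_match[best][j]:
                covered[j] = True
    return second_set

# Toy example: descriptor 0 correctly matches 0, 1, and 2; descriptor 3 only itself.
correct_match = [
    [True,  True,  True,  False],
    [True,  True,  False, False],
    [True,  False, True,  False],
    [False, False, False, True],
]
second = select_representative_descriptors(correct_match, 2)  # -> [0, 3]
```

Storing the second set as indices also illustrates the remark below that a selected descriptor need not be physically copied in memory.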

The term “view” of an object means an image of an object which can either be captured using a real camera or synthetically created using an appropriate synthetic view creation method, as explained in more detail later.

Our method in general creates a first set of descriptors and then adds descriptors from the first set of descriptors to a second set of descriptors. It is known to the expert that this can be implemented in many different ways and does not necessarily mean that a descriptor is physically copied from a certain position in memory in the first set to a different location in memory in the second set of descriptors. Instead, the second set can for example be implemented by marking descriptors in the first set as part of the second set, e.g. by modifying a designated parameter of the descriptor. Another possible implementation would be to store memory addresses, pointers, references, or indices of the descriptors belonging to the second set of descriptors without modifying the descriptor in memory at all.

Particularly, according to an embodiment, there is presented a method to automatically determine a set of feature descriptors that describes an object such that it can be matched and/or localized under a variety of conditions. These conditions may include changes in viewpoint, illumination, and camera parameters such as focal length, focus, exposure time, signal-to-noise-ratio, etc. Based on a set of, e.g. synthetically, generated views of the object, preferably under different conditions, local image features are detected, described and aggregated in a database. The proposed method evaluates matches between these database features to eventually find a reduced, preferably minimal set of most representative descriptors from the database. Using this scalable offline process, the matching and/or localization success rate can be significantly increased without adding computational load to the runtime method.

For example, steps h) and i) are repeatedly processed until the number of descriptors in the second set of descriptors has reached a particular value or the number of descriptors in the second set of descriptors stops varying.

According to an embodiment, step g) may be preceded by modifying the at least one selected descriptor based on the selection process.

For example, the modification of the selected descriptor comprises updating the descriptor as a combination of the selected descriptor and other descriptors in the first set of descriptors.

According to an embodiment, the usage of the result of the matching process in the update step h) is restricted to the result of the matching process of the at least one selected descriptor, or to the result of the matching process of the descriptors that match with the at least one selected descriptor.

According to another aspect of the disclosure, there is provided a method of providing at least two sets of feature descriptors configured to be used in matching at least one feature of an object in an image of a camera, comprising the steps of: a) providing at least two images of a first object or of multiple instances of a first object, wherein the multiple instances provide different appearances or different versions of an object, wherein each of the images is generated by a respective camera having a known orientation with respect to gravity when generating the respective image, b) extracting in at least two of the images at least one feature from the respective image, c) providing at least one descriptor for an extracted feature, and storing the descriptors for a plurality of extracted features in multiple sets of descriptors with at least a first set of descriptors and a second set of descriptors, wherein the first set of descriptors contains descriptors of features which were extracted from images corresponding to a first orientation zone with respect to gravity of the respective camera, and the second set of descriptors contains descriptors of features which were extracted from images corresponding to a second orientation zone with respect to gravity of the respective camera, d) matching a plurality of the descriptors of the first set of descriptors against a plurality of the descriptors of the first set of descriptors, and matching a plurality of the descriptors of the second set of descriptors against a plurality of the descriptors of the second set of descriptors, e) computing a score parameter for a plurality of the descriptors based on the result of the matching process, f) selecting within the first set of descriptors at least one descriptor based on its score parameter in comparison with score parameters of other descriptors, and selecting within the second set of descriptors at least another descriptor based on its score parameter in comparison with 
score parameters of other descriptors, g) adding the at least one selected descriptor from the first set to a third set of descriptors and adding the at least one selected descriptor from the second set to a fourth set of descriptors, h) updating the score parameter of a plurality of descriptors in the first and/or second set of descriptors according to any preceding selection process and to the result of the matching process, and i) performing steps f) and g) again, wherein the third and/or fourth set of descriptors are configured to be used in matching at least one feature of the first object or of a second object in an image of a camera.

Thus, if, e.g., camera localization is performed with respect to objects at a known orientation of the camera with respect to gravity, it is proposed to create multiple reference descriptor sets for different orientation zones of the camera. For example, different angles between camera rays and a measured gravity vector may be used, as set out in more detail below. This approach is particularly suited for handheld devices with built-in inertial sensors (which may be used to measure an orientation with respect to gravity) and enables matching against a reference dataset only containing the information relevant for camera poses that are consistent with the measured orientation.
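A minimal sketch of how descriptors might be partitioned into such orientation zones with respect to gravity; the fixed zone width and the representation of the camera orientation as a single angle between the viewing direction and the measured gravity vector are assumptions, as the text only requires distinct zones:

```python
def gravity_zone(gravity_angle_deg, zone_width_deg=30.0):
    """Assign a camera orientation (angle in degrees between the camera's
    viewing direction and the measured gravity vector) to a discrete zone."""
    return int(gravity_angle_deg // zone_width_deg)

def partition_by_gravity(descriptors_with_angles, zone_width_deg=30.0):
    """Build one descriptor set per orientation zone (steps a)-c) of this aspect).

    `descriptors_with_angles` is a sequence of (descriptor, gravity_angle_deg)
    pairs; returns a dict mapping zone index to its list of descriptors.
    """
    sets = {}
    for desc, angle in descriptors_with_angles:
        sets.setdefault(gravity_zone(angle, zone_width_deg), []).append(desc)
    return sets

# Example: three descriptors from views at 10, 40, and 50 degrees to gravity.
zones = partition_by_gravity([("d0", 10.0), ("d1", 40.0), ("d2", 50.0)])
```

At runtime, the orientation measured by the inertial sensors would select the single zone whose reference set is consistent with the current camera pose, so only that subset needs to be matched.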

Therefore, the presented approach aims at benefiting from multiple, e.g. synthetic, views of an object without increasing the memory consumption. The method (which may be implemented as a so-called offline method that does not need to run while the application is running) therefore first creates a larger database of descriptors from a variety of views, i.e. images of the object, and then determines a preferably most representative subset of those descriptors which enables matching and/or localization of the object under a variety of conditions.

For example, steps h) and i) are repeatedly processed until the number of descriptors in the third and/or fourth set of descriptors has reached a particular value or the number of descriptors in the third and/or fourth set of descriptors stops varying.

According to an embodiment, step g) is preceded by modifying the at least one selected descriptor based on the selection process.

For example, the modification of the selected descriptor comprises updating the descriptor as a combination of the selected descriptor and other descriptors in the first or second set of descriptors.

For example, in the above methods, steps h) and i) are processed iteratively multiple times until the number of descriptors stored in the second, third and/or fourth set of descriptors has reached a particular value.

According to an embodiment, step d) includes determining for each of the descriptors which were matched whether they were correctly or incorrectly matched, and step e) includes computing the score parameter dependent on whether the descriptors were correctly or incorrectly matched.

For example, the score parameter is indicative of the number of matches the respective descriptor has been correctly matched with any other of the descriptors. Then, in step f) at least one descriptor with a score parameter indicative of the highest number of matches within the first set of descriptors is selected, and step h) reduces the score parameter of the at least one selected descriptor and the score parameter of the descriptors that match with the at least one selected descriptor.

According to another aspect of the invention, there is disclosed a method of matching at least one feature of an object in an image of a camera, comprising providing at least one image with an object captured by a camera, extracting current features from the at least one image and providing a set of current feature descriptors with at least one current feature descriptor provided for an extracted feature, providing a second set of descriptors according to the method as described above, and comparing the set of current feature descriptors with the second set of descriptors for matching at least one feature of the object in the at least one image.

According to a further aspect of the invention, there is disclosed a method of matching at least one feature of an object in an image of a camera, comprising providing at least one image with an object captured by a camera, extracting current features from the at least one image and providing a set of current feature descriptors with at least one current feature descriptor provided for an extracted feature, providing a third and a fourth set of descriptors according to the method as described above, and comparing the set of current feature descriptors with the third and/or fourth set of descriptors for matching at least one feature of the object in the at least one image.

For example, the method may further include determining a position and orientation of the camera which captures the at least one image with respect to the object based on correspondences of feature descriptors determined in the matching process. For instance, the method may be part of a tracking method for tracking a position and orientation of the camera with respect to an object of a real environment.

According to an embodiment, the method of providing a set of feature descriptors is applied in connection with an augmented reality application and, accordingly, is a method of providing a set of feature descriptors configured to be used in localizing an object in an image of a camera in an augmented reality application.

According to an embodiment, the method of matching at least one feature of an object in an image of a camera is applied in an augmented reality application and, accordingly, is a method of localizing an object in an image of a camera in an augmented reality application.

For example, step a) of the above method includes providing the different images of the first object under different conditions which includes changes from one of the images to another one of the images in at least one of the following: viewpoint, illumination, camera parameters such as focal length, focus, exposure time, signal-to-noise-ratio.

According to an embodiment, step a) may include providing the multiple images of the first object by using a synthetic view creation algorithm creating the multiple images by respective virtual cameras as respective synthetic views. Alternatively, one or more of the multiple images may be generated by a real camera.

For example, the synthetic view creation algorithm includes a spatial transformation which projects a 3D model onto the image plane of a respective synthetic view, and a rendering method is applied which is capable of simulating properties of a real camera, particularly such as defocus, motion blur, noise, exposure time, brightness, and contrast, and of simulating different environments, particularly by using virtual light sources, shadows, reflections, lens flares, blooming, or environment mapping.
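A minimal sketch of such camera-property simulation applied to a rendered view, here reduced to a box-filter defocus blur, Gaussian sensor noise, and a brightness change; the parameters and the grayscale list-of-lists image representation are illustrative, and a real renderer as described above would model many more effects:

```python
import random

def simulate_camera(image, noise_sigma=2.0, brightness=1.1, seed=0):
    """Simulate simple camera properties on a grayscale image (list of rows
    of 0-255 values): 3x3 box blur as defocus, Gaussian noise, brightness."""
    h, w = len(image), len(image[0])

    def px(y, x):  # clamp-to-edge sampling for the blur at the borders
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    # Defocus: 3x3 box blur.
    blurred = [[sum(px(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
                for x in range(w)] for y in range(h)]

    # Brightness change plus Gaussian sensor noise, clamped to valid range.
    rng = random.Random(seed)
    return [[max(0, min(255, int(v * brightness + rng.gauss(0.0, noise_sigma))))
             for v in row] for row in blurred]

# Example: a flat mid-gray 8x8 synthetic view.
view = [[128] * 8 for _ in range(8)]
noisy_view = simulate_camera(view)
```

Features would then be detected and described on `noisy_view` rather than on the clean rendering, so that the resulting descriptors already reflect the simulated camera conditions.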

According to an embodiment, step c) includes storing the descriptor for an extracted feature together with an index of the image from which the feature has been extracted.

Particularly, the above described methods are performed on a computer system which may have any desired configuration. Advantageously, as a result of reducing the size of the set of descriptors, the methods using such reduced set of descriptors are capable of being applied on mobile devices, such as mobile phones, which have only limited memory capacities.

In another aspect, there is provided a computer program product adapted to be loaded into the internal memory of a digital computer system, and comprising software code sections by means of which the steps of a method as described above are performed when said product is running on said computer system.
