Google Patent | Adaptive Control For Immersive Experience Delivery
Publication Number: 10440407
Publication Date: 20191008
A combined video of a scene may be generated for applications such as virtual reality or augmented reality. In one method, a data store may store video data with a first portion having a first importance metric, and a second portion having a second importance metric, denoting that viewing of the first portion is more likely and/or preferential to viewing of the second portion. A subset of the video data may be retrieved and used to generate viewpoint video from a virtual viewpoint corresponding to a viewer’s viewpoint. The viewpoint video may be displayed on a display device. One of storing the video data, retrieving the subset, and using the subset to generate the viewpoint video may include, based on the difference between the first and second importance metrics, expediting and/or enhancing performance of the step for the first portion, relative to the second portion.
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application is related to U.S. application Ser. No. 15/590,841 for “Vantage Generation and Interactive Playback,” filed on May 9, 2017, the disclosure of which is incorporated herein by reference.
The present application is also related to U.S. application Ser. No. 15/590,877 for “Spatial Random Access Enabled Video System with a Three-Dimensional Viewing Volume,” filed on May 9, 2017, the disclosure of which is incorporated herein by reference.
The present application is also related to U.S. application Ser. No. 15/590,951 for “Wedge-Based Light-Field Video Capture,” filed on May 9, 2017, the disclosure of which is incorporated herein by reference.
The present document relates to the use of importance metrics to streamline the capture, storage, delivery, and/or rendering of video data for an immersive experience such as virtual reality or augmented reality.
As better and more immersive display devices are created for providing virtual reality (VR) and augmented reality (AR) environments, it is desirable to be able to capture high quality imagery and video for these systems. In a stereo VR environment, a user sees separate views for each eye; also, the user may turn and move his or her head while viewing. As a result, it is desirable that the user receive high-resolution stereo imagery that is consistent and correct for any viewing position and orientation in the volume within which a user may move his or her head.
The most immersive virtual reality and augmented reality experiences have six degrees of freedom, parallax, and view-dependent lighting. The resulting video data can be quite voluminous, requiring significant resources in terms of storage, delivery bandwidth, and/or processing power. These resources are often constrained, for example, by the processing power of the user’s computer, the storage capacity of the user’s computer, the bandwidth of the user’s connection to a data source, and/or other factors. Such factors significantly limit the quality of the viewer’s experience.
Various embodiments of the described system and method utilize importance metrics to indicate the relative likelihood and/or desirability of viewing different portions of video data. For example, a first portion of the video data for a virtual reality or augmented reality experience may have a first importance metric, and a second portion of the video data may have a second importance metric. A difference between the first and second importance metrics may denote that the first portion is more likely to be viewed and/or preferred for viewing, relative to the second portion.
A subset of the video data may be retrieved and used to generate viewpoint video from a virtual viewpoint corresponding to a viewer’s actual viewpoint. Storage, retrieval, and/or generation of the viewpoint video may be carried out with respect to the importance metrics, such that one or more of these tasks are expedited and/or enhanced for the first portion, relative to the second portion.
In some embodiments, the video data may be divided into a plurality of vantage video data sets, each of which represents a view from one of a plurality of vantages within a viewing volume containing the virtual viewpoint. The position of the viewer’s viewpoint may be used to determine which vantage video data sets will be used to generate the viewpoint video. The first and second portions of the video data may each include one or more of the vantages, such that some vantages are expedited and/or enhanced for storage, retrieval, and/or processing, relative to other vantages.
Additionally or alternatively, the first portion of the video data may be for a first region of the viewing volume, and the second portion of the video data may be for a second region of the viewing volume. Various parameters such as a number of vantages, a density of vantages, locations of vantages, a number of vantages used to generate the viewpoint video, lighting applied to vantages, and resolution of vantages may be enhanced for the first region, relative to the second region.
Further, each vantage may be divided into a plurality of tiles, each of which represents the view from the vantage along a viewing direction. The orientation of the viewer’s viewpoint may be used to determine which tiles will be used to generate the viewpoint video. The first and second portions of the video data may each include one or more of the tiles for each of a plurality of the vantages, such that some tiles are expedited and/or enhanced for storage, retrieval, and/or processing, relative to other tiles.
Additionally or alternatively, the first portion of the video data may be for a first set of tiles oriented along a first set of viewing directions, and the second portion of the video data may be for a second set of tiles oriented along a second set of viewing directions. Various parameters such as tile spatial resolution, tile temporal resolution, tile color depth, and tile bit rate may be enhanced for the first set of tiles, relative to the second set of tiles.
The importance metrics may be established in a wide variety of ways. For example, the importance metric may be based on viewing data indicating which portions of the experience have been viewed or preferred by more viewers, user input from an author of the experience indicating which portions are more likely or desirable for viewing and/or which portions correspond to other stimuli presented as part of the experience, and/or analysis of the video data and/or accompanying audio data that determines which portions are more likely or desirable for viewing.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings illustrate several embodiments. Together with the description, they serve to explain the principles of the embodiments. One skilled in the art will recognize that the particular embodiments illustrated in the drawings are merely exemplary, and are not intended to limit scope.
FIG. 1 is a flow diagram depicting a method for delivering video for a virtual reality or augmented reality experience, according to one embodiment.
FIG. 2 is a screenshot diagram depicting a frame from a viewpoint video of a virtual reality experience, according to one embodiment.
FIG. 3 is a screenshot diagram depicting the screenshot diagram of FIG. 2, overlaid with a viewing volume for each of the eyes, according to one embodiment.
FIG. 4 is a screenshot diagram depicting the view after the headset has been moved forward, toward the scene of FIG. 2, according to one embodiment.
FIG. 5 is a screenshot diagram depicting the color channel from a single vantage, such as one of the vantages of FIG. 3, according to one embodiment.
FIG. 6 is a diagram depicting the manner in which the tiles of a vantage, such as one of the vantages of FIG. 3, may be selected, according to one embodiment.
FIG. 7 is a screenshot diagram depicting the depth channel from the vantage used to provide the screenshot diagram of FIG. 5, according to one embodiment.
FIG. 8 is a diagram depicting a portion of a scene in which two objects are positioned such that an occluded area exists behind the objects, according to one embodiment.
FIG. 9 is a diagram depicting the portion of the scene of FIG. 8, in which another vantage has been added to enhance viewing of the occluded area, according to one embodiment.
FIG. 10 is a screenshot diagram depicting the vantages traversed by a single viewer and accumulated over time, according to one embodiment.
FIG. 11 is a screenshot diagram depicting the vantages traversed by multiple viewers and accumulated over time, according to one embodiment.
FIG. 12 is a diagram depicting a vantage, according to one embodiment.
Multiple methods for capturing image and/or video data in a light-field volume and creating virtual views from such data are described. The described embodiments may provide for capturing continuous or nearly continuous light-field data from many or all directions facing away from the capture system, which may enable the generation of virtual views that are more accurate and/or allow viewers greater viewing freedom.
For purposes of the description provided herein, the following definitions are used:
Augmented reality: an immersive viewing experience in which images presented to the viewer are based on the location and/or orientation of the viewer’s head and/or eyes, and are presented in conjunction with the viewer’s view of actual objects in the viewer’s environment.
Conventional image: an image in which the pixel values are not, collectively or individually, indicative of the angle of incidence at which light is received on the surface of the sensor.
Depth: a representation of distance between an object and/or corresponding image sample and the entrance pupil of the optics of the capture system.
Image: a two-dimensional array of pixel values, or pixels, each specifying a color.
Importance metric: an indicator of the importance of a subset of video data.
Input device: any device that receives input from a user.
Light-field camera: any camera capable of capturing light-field images.
Light-field data: data indicative of the angle of incidence at which light is received on the surface of the sensor.
Light-field image: an image that contains a representation of light-field data captured at the sensor, which may be a four-dimensional sample representing information carried by ray bundles received by a single light-field camera.
Light-field volume: the combination of all light-field images that represents, either fully or sparsely, light rays entering the physical space defined by the light-field volume.
Processor: any processing device capable of processing digital data, which may be a microprocessor, ASIC, FPGA, or other type of processing device.
Ray bundle, “ray,” or “bundle”: a set of light rays recorded in aggregate by a single pixel in a photosensor.
Scene: an arrangement of objects and/or people to be filmed.
Sensor, “photosensor,” or “image sensor”: a light detector in a camera capable of generating images based on light received by the sensor.
Stereo virtual reality: an extended form of virtual reality in which each eye is shown a different view of the virtual world, enabling stereoscopic three-dimensional perception.
Tile: a portion of a vantage video data set corresponding to a particular viewing direction.
Vantage: a position in three-dimensional space with associated video data.
Vantage video data set: the portion of video data associated with a particular vantage.
Video data: a collection of data comprising imagery and/or audio components that capture a scene.
Viewing data: data that records aspects of viewing of an experience by one or more viewers.
Viewing volume: a three-dimensional region from within which virtual views of a scene may be generated.
Viewpoint video: imagery and/or sound comprising one or more virtual views.
Virtual reality: an immersive viewing experience in which images presented to the viewer are based on the location and/or orientation of the viewer’s head and/or eyes.
Virtual view: a reconstructed view, typically for display in a virtual reality or augmented reality headset, which may be generated by resampling and/or interpolating data from a captured light-field volume.
Virtual viewpoint: the location, within a coordinate system and/or light-field volume, from which a virtual view is generated.
In addition, for ease of nomenclature, the term “camera” is used herein to refer to an image capture device or other data acquisition device. Such a data acquisition device can be any device or system for acquiring, recording, measuring, estimating, determining and/or computing data representative of a scene, including but not limited to two-dimensional image data, three-dimensional image data, and/or light-field data. Such a data acquisition device may include optics, sensors, and image processing electronics for acquiring data representative of a scene, using techniques that are well known in the art. One skilled in the art will recognize that many types of data acquisition devices can be used in connection with the present disclosure, and that the disclosure is not limited to cameras. Thus, the use of the term “camera” herein is intended to be illustrative and exemplary, but should not be considered to limit the scope of the disclosure. Specifically, any use of such term herein should be considered to refer to any suitable device for acquiring image data.
In the following description, several systems and methods for capturing video are described. One skilled in the art will recognize that these various systems and methods can be performed singly and/or in any suitable combination with one another. Further, many of the configurations and techniques described herein are applicable to conventional imaging as well as light-field imaging. Further, although the ensuing description focuses on video capture for use in virtual reality or augmented reality, the systems and methods described herein may be used in a much wider variety of video applications.
Importance Metrics
As described previously, delivery of a virtual reality or augmented reality experience may push the limits of bandwidth, storage, and/or processing capabilities of known computing and display systems. Accordingly, it is desirable to give priority, in terms of such system resources, to the content that is most desirable and/or most likely to be viewed by the viewer. This may be accomplished, in some embodiments, by assigning different importance metrics to different portions of video data for a virtual reality or augmented reality experience. The importance metrics may denote which portions are most important, and therefore should be prioritized for delivery to the viewer. One exemplary method for using such importance metrics will be shown and described in connection with FIG. 1.
Referring to FIG. 1, a flow diagram depicts a method 100 for delivering video for a virtual reality or augmented reality experience, according to one embodiment. As shown, the method 100 may start 110 with a step 120 in which video data is stored. The video data may encompass video from multiple viewpoints and/or viewing directions within a viewing volume that can be selectively delivered to the viewer based on the position and/or orientation of the viewer’s head within the viewing volume, thus providing an immersive experience for the viewer.
The video data may be divided into a plurality of vantages, each of which is for one of a plurality of positions within the viewing volume. Each vantage may be divided into a plurality of tiles, each of which is for one of a plurality of possible viewing directions. Vantages and tiles will be described in greater detail subsequently, and are also described in the above-cited related U.S. application Ser. No. 15/590,877 for “Spatial Random Access Enabled Video System with a Three-Dimensional Viewing Volume,” filed on May 9, 2017, the disclosure of which is incorporated herein by reference in its entirety.
In a step 130, importance metrics may be assigned to different portions of the video data. For example, the video data may be broken down into different regions within the viewing volume, some of which receive higher priority than others. Additionally or alternatively, the video data may be broken down into different sets of tiles, representing different viewing directions. The regions and/or sets of tiles may be determined based on one or more factors, which may include, but are not limited to, the following:
The likelihood that the viewer will position his or her head in the region and/or orient his or her head along the viewing direction;
The locations of any points of interest likely to be visited by the viewer, including but not limited to featured portions of a virtual reality experience and real-life locations likely to be of interest in an augmented reality experience;
The quality of the experience as viewable from within the region and/or from along the viewing direction, which may include factors such as the visual quality of the content, the degree of parallax, the level of interactivity for an augmented reality experience, and the like; and
The presence or absence of additional sensory content, such as an auditory, olfactory, or tactile stimulus coordinated with the region and/or along the viewing direction.
The determination of the importance metric for any given region and/or set of tiles may be made, for example, through the use of one or more approaches including, but not limited to:
Viewing data obtained from historical viewings of the experience, indicating that viewers prefer, or more frequently view, one region and/or set of tiles over another;
Receipt of input from a user, such as a viewer or director, setting importance metrics for one or more regions and/or tile sets; and
Analysis of the video data to determine the quality and/or likelihood of viewing of a given region and/or set of tiles.
These approaches will each be described in greater detail below.
In order to capture viewing data, the movement of a viewer’s head may be measured as he or she views a particular segment of the experience. Viewing position and/or direction may be tracked and logged as the viewer experiences the segment. Where the video data is divided into vantages and/or tiles, the particular vantage and/or tile being delivered to the viewer may be logged.
Less frequently viewed vantages and/or tiles may receive a lesser importance metric (an importance metric indicating they are less important than other vantages and/or tiles). Based on the lesser importance metric, these vantages and/or tiles may be compressed with lesser quality, delivered after other vantages and/or tiles, processed after other vantages and/or tiles, omitted from the experience altogether, and/or otherwise de-prioritized in the delivery of the experience.
If desired, the actions of the same and/or different viewers may be measured across time and/or across multiple viewings to determine behavior. It is likely that a viewer will look at different places on their second or third viewing of the same content. Viewing statistics, such as vantage and/or tile viewing statistics, may be gathered offline and/or online to inform the compression algorithm and/or other modules that control the capture, storage, delivery, and/or processing of the experience.
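By way of illustration only, the following Python sketch shows one way logged viewing data could be turned into importance metrics. The log format, the function name `importance_from_view_log`, and the normalization to the range 0 to 1 are assumptions made for this example and are not specified in the present disclosure.

```python
from collections import Counter

def importance_from_view_log(view_log, all_tiles):
    """Map each (vantage_id, tile_id) key to a score in [0, 1].

    view_log: iterable of (vantage_id, tile_id) pairs, one per logged delivery.
    all_tiles: every key in the experience, so that unviewed tiles score 0.
    """
    counts = Counter(view_log)
    peak = max(counts.values(), default=0)
    if peak == 0:
        return {key: 0.0 for key in all_tiles}
    # Normalizing by the most-viewed tile is an assumption of this sketch.
    return {key: counts.get(key, 0) / peak for key in all_tiles}

# Hypothetical log: one entry per tile delivered to a viewer.
log = [(0, 3), (0, 3), (0, 4), (1, 3), (0, 3), (0, 4)]
tiles = [(0, 3), (0, 4), (1, 3), (1, 4)]
scores = importance_from_view_log(log, tiles)
```

Logs from several viewers or several viewings could be concatenated before being passed in, which corresponds to accumulating statistics across time and across viewings.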
Some viewers may have a special preference for a particular actor, athlete, or other performer in a sports broadcast, movie, concert, or other event. Compression quality and/or other delivery parameters may be set for the viewer based on his or her preferences. This may be done automatically by observing the viewing actions taken by the viewer.
In the alternative, explicit input may be received from the viewer to indicate such preferences. For example, the viewer may select particular performers, particular types of scenes, particular portions of an experience, and/or other aspects of an experience to be prioritized or de-prioritized for viewing.
In some examples, the viewer may explicitly select regions of the experience to be prioritized. A testing method such as A/B testing may be used to determine which resource allocation parameters provide the best experience for the viewer.
According to still other embodiments, a content producer or other person involved with the generation of the experience may provide input to explicitly assign importance metrics. For example, a director may indicate the optimal or intended viewing experience and assign higher importance metrics to the vantages and/or tiles along that path. This may be used to drive the viewer to the path chosen by the director. Thus, importance metrics may be used to subtly encourage viewers to view the content as indicated by the director.
In some embodiments, importance metrics may be assigned based on the presence or absence of additional sensory content. Such additional sensory content may include, but is not limited to, sounds, smells, tactile content such as haptic feedback or other vibrations, and the like. Such sensory content may be timed to coincide with a key portion of the video data, which may have a high likelihood of being viewed by a viewer, or may desirably be rendered with higher quality. Such sensory content may further provide an impulse to the viewer to look in a particular direction, strengthening the likelihood that the associated video data will be viewed by the viewer.
As another alternative, the video data may be analyzed, for example, by a computing device, to automatically set importance metrics. The importance metrics may be calculated using one or more objective measures. Such objective measures may be computed using attributes such as, but not limited to, the following:
Pixel coverage of the three-dimensional scene, for example, as in Vazquez, P. P., Feixas, M., Sbert, M. and Heidrich, W. (2003), Automatic View Selection Using Viewpoint Entropy and its Application to Image-Based Modelling. Computer Graphics Forum, 22: 689-700. doi:10.1111/j.1467-8659.2003.00717.x;
Quality metrics, such as PSNR; SSIM, for example, as in Wang, Zhou, et al. “Image quality assessment: from error visibility to structural similarity.” IEEE transactions on image processing 13.4 (2004): 600-612; AQM, for example, as in Myszkowski, Karol, Przemyslaw Rokita, and Takehiro Tawara. “Perception-based fast rendering and antialiasing of walkthrough sequences.” IEEE Transactions on Visualization and Computer Graphics 6.4 (2000): 360-379; and absolute differences;
Saliency, for example, as in Itti, Laurent, Christof Koch, and Ernst Niebur. “A model of saliency-based visual attention for rapid scene analysis.” IEEE Transactions on pattern analysis and machine intelligence 20.11 (1998): 1254-1259 or Lee, Chang Ha, Amitabh Varshney, and David W. Jacobs. “Mesh saliency.” ACM transactions on graphics (TOG). Vol. 24. No. 3. ACM, 2005;
Contrast sensitivity analysis, for example, as in Robson, J. G. “Spatial and temporal contrast-sensitivity functions of the visual system.” Josa 56.8 (1966): 1141-1142;
Image entropy; and
Motion.
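Two of the listed measures, image entropy and PSNR, are simple enough to sketch directly. The following Python example is a minimal illustration under stated assumptions: tiles are represented as flat sequences of 8-bit integer pixel values, and the function names are hypothetical rather than taken from the present disclosure.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Entropy, in bits, of a non-empty sequence of integer pixel values."""
    counts = Counter(pixels)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def psnr(reference, degraded, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, degraded)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Hypothetical 16-pixel tiles: one uniform, one alternating black and white.
flat_tile = [128] * 16
busy_tile = [0, 255] * 8
flat_bits = shannon_entropy(flat_tile)   # 0.0 bits: a single value carries no information
busy_bits = shannon_entropy(busy_tile)   # 1.0 bit: two equally likely values
```

Under this sketch, a tile with higher entropy might be treated as carrying more visual detail, and PSNR might be used to measure how much a given compression setting degrades a tile relative to its uncompressed form.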
The importance metric may be calculated in a wide variety of ways, through the use of data obtained from any of the foregoing methods. Such methods may be carried out across the vantages of the video data by comparing vantages with each other, and/or across the tiles of one or more vantages by comparing the tiles of each vantage with each other. For example, for a given tile, the importance metric may be calculated as the change in quality measure when resource allocation parameters of the given tile are changed. One can also formulate the resource allocation as an optimization problem with the importance metrics as the objective function and the resource allocation parameters as the input parameters, constrained by available system resources. Such optimization may be carried out as set forth in Everett III, Hugh. “Generalized Lagrange multiplier method for solving problems of optimum allocation of resources.” Operations research 11.3 (1963): 399-417.
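The optimization described above can be sketched for a simple discrete case. In the following Python example, each tile is assumed to offer a small set of (rate, quality) operating points, and a single multiplier trades quality against rate across all tiles. The data layout, the function names, the bisection search over the multiplier, and its upper bound are assumptions of this sketch; the example illustrates the general Lagrange multiplier idea and is not a reproduction of any particular implementation.

```python
def pick_points(tiles, lam):
    """For a fixed multiplier, each tile independently maximizes quality - lam * rate."""
    return [max(points, key=lambda p: p[1] - lam * p[0]) for points in tiles]

def allocate(tiles, budget, iterations=50):
    """Bisect the multiplier until the total rate fits within the budget.

    tiles: list of per-tile option lists; each option is (rate, quality).
    Returns the chosen (rate, quality) option for each tile.
    """
    lo, hi = 0.0, 1e6                  # hi is an assumed upper bound on the multiplier
    best = pick_points(tiles, hi)      # lowest-rate fallback if nothing else fits
    for _ in range(iterations):
        mid = (lo + hi) / 2
        choice = pick_points(tiles, mid)
        if sum(rate for rate, _ in choice) <= budget:
            best, hi = choice, mid     # feasible: try penalizing rate less
        else:
            lo = mid                   # over budget: penalize rate more
    return best

# Hypothetical tiles: the first gains far more quality per unit of rate than the second.
tiles = [
    [(1, 30.0), (2, 38.0), (4, 42.0)],
    [(1, 20.0), (2, 22.0), (4, 23.0)],
]
plan = allocate(tiles, budget=6)       # [(4, 42.0), (2, 22.0)]
```

The slope between adjacent operating points of a tile is the change in quality per unit of added resource, which corresponds to the per-tile importance described above. A known property of this approach is that it may leave part of the budget unused when no multiplier lands exactly on the constraint.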
Importance metrics may be in any of a variety of forms, including but not limited to: Numeric scores such as one through ten, which may be rounded to the nearest integer or expressed as a floating point number; Importance categories such as least important, moderately important, and most important; and Letter scores, such as a, b, c, and d.
Importance metrics may assign vantages and/or tiles to two categories such as “more important” and “less important.” Alternatively, more than two distinct importance metrics may be assignable to provide a broader spectrum of importance levels. Importance metrics may be stored in association with the video data, for example, in metadata stored in files of the video data.
In some embodiments, the step 130 may be carried out prior to the step 120, and the step 120 may then be carried out with reference to the importance metrics of the video data. For example, some or all of the video data may be compressed, and the compression used for video data with a high importance metric may be different from that of the video data with a low importance metric. Specifically, the video data with a higher importance metric may be encoded and/or compressed in a manner that provides higher quality, faster retrieval, faster processing, and/or the like, by comparison with the video data with a lower importance metric.
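As a minimal sketch of importance-dependent compression, the following Python function maps a numeric importance score of one through ten to an encoder quality setting. The linear mapping, the 30 to 90 quality range, and the function name are hypothetical choices for illustration; an actual encoder would expose its own quality or bit-rate parameters.

```python
def quality_for_importance(score, min_quality=30, max_quality=90):
    """Linearly map an importance score in [1, 10] to an encoder quality value."""
    clamped = min(max(score, 1), 10)           # out-of-range scores are clamped
    fraction = (clamped - 1) / 9               # 0.0 for least, 1.0 for most important
    # The quality range is an assumption of this sketch, not an encoder requirement.
    return round(min_quality + fraction * (max_quality - min_quality))

low = quality_for_importance(1)     # 30
mid = quality_for_importance(5.5)   # 60
high = quality_for_importance(10)   # 90
```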
Additionally or alternatively, the video data may be captured in a manner that references the importance metrics. For example, more video data may be captured proximate locations of interest in the immersive experience, or the video data may be captured at higher quality at those locations. This may enhance the quality of the video data that is available for generation of virtual viewpoints proximate the locations of interest.
Once the importance metrics have been assigned to the video data, the virtual reality or augmented reality experience may be initiated. In a step 140, viewpoint data may be received to indicate the position and/or orientation of the viewer’s head, indicating the viewer’s actual viewpoint. The actual viewpoint may be converted into a virtual viewpoint within the viewing volume of the video data.
In a step 150, a subset of the video data may be retrieved. The subset may be selected to include all of the video data likely to be needed to render a virtual view of the scene captured by the video data, from the virtual viewpoint corresponding to the viewpoint data received in the step 140. The contents of the subset may be determined based, in part, on the importance metrics; video data corresponding to particular vantages and/or tiles may optionally be excluded from the subset if the importance metric for those vantages and/or tiles is below a threshold. Additionally or alternatively, the order in which the video data within the subset is retrieved may be determined based on the importance metrics, with the video data within the subset having lower importance metrics retrieved after that having higher importance metrics.
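The thresholding and ordering described for the step 150 can be sketched as follows. The function name `plan_retrieval`, the use of string keys for tiles, and the default score of zero for tiles without a stored metric are assumptions of this Python example.

```python
def plan_retrieval(candidates, importance, threshold=0.0):
    """Return the candidate keys worth fetching, most important first.

    candidates: keys of the vantages and/or tiles needed for the current viewpoint.
    importance: mapping from key to importance metric; missing keys count as 0.0.
    """
    kept = [key for key in candidates if importance.get(key, 0.0) >= threshold]
    return sorted(kept, key=lambda key: importance.get(key, 0.0), reverse=True)

# Hypothetical tiles and metrics; "t3" has no stored metric and is treated as 0.0.
needed = ["t0", "t1", "t2", "t3"]
metrics = {"t0": 0.9, "t1": 0.1, "t2": 0.5}
order = plan_retrieval(needed, metrics, threshold=0.2)   # ["t0", "t2"]
```

With the default threshold of zero, nothing is excluded and the function only orders the retrieval.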
In a step 160, the subset of the video data retrieved in the step 150 may be used to generate viewpoint video data, representing a view of the scene from the viewer’s viewpoint. Generation of the viewpoint video may also be carried out with reference to the importance metrics. For example, decompression of the portion of the subset having a higher importance metric may be carried out before and/or with higher quality than decompression of the portion of the subset having a lower importance metric. Additionally or alternatively, the portion of the subset with a higher importance metric may be rendered for viewing before and/or with higher quality than rendering of the portion of the subset having a lower importance metric.
In a step 170, the viewpoint video may be displayed for the viewer on a display device. In some embodiments, the display device may be part of a virtual reality or augmented reality headset. The viewpoint video may include sound, which may be played for the viewer via an output device such as one or more speakers or headphones. As indicated previously, the experience may include additional sensory content such as sounds, smells, tactile content, and the like. If desired, virtual reality or augmented reality equipment may include other output devices, such as vibration or scent-producing elements, that provide such additional sensory content.
Pursuant to a query 180, a determination may be made as to whether the experience has been completed. If not, the method 100 may return to the step 140, in which the viewpoint data may again be captured to obtain the position and/or orientation for a new virtual viewpoint from which the scene is to be rendered for display for the viewer. The step 140, the step 150, the step 160, and the step 170 may be repeated until the query 180 is answered in the affirmative, representing that the experience is complete. The method 100 may then end 190.
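The repeated portion of the method 100 can be summarized as a loop. In the following Python sketch, each step is supplied as a callable so that the control flow is visible without committing to any particular tracking, storage, or rendering implementation. The function and parameter names are hypothetical.

```python
def run_experience(read_viewpoint, retrieve_subset, render, display, is_complete):
    """Repeat steps 140 through 170 until query 180 reports completion."""
    frames = 0
    while True:
        viewpoint = read_viewpoint()          # step 140: receive viewpoint data
        subset = retrieve_subset(viewpoint)   # step 150: retrieve subset of video data
        frame = render(subset, viewpoint)     # step 160: generate viewpoint video
        display(frame)                        # step 170: display for the viewer
        frames += 1
        if is_complete():                     # query 180
            return frames                     # end 190

# Stub callables standing in for real headset, storage, and renderer components.
poses = iter([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)])
shown = []
count = run_experience(
    read_viewpoint=lambda: next(poses),
    retrieve_subset=lambda viewpoint: ["tile-a", "tile-b"],
    render=lambda subset, viewpoint: (viewpoint, len(subset)),
    display=shown.append,
    is_complete=lambda: len(shown) >= 3,
)
```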
The steps of the method 100 may be reordered, omitted, replaced with alternative steps, and/or supplemented with additional steps not specifically described herein. The steps set forth above will be described in greater detail subsequently in the discussion of vantages and tiles.
Virtual Reality Display
Referring to FIG. 2, a screenshot diagram 200 depicts a frame from a viewpoint video of a virtual reality experience, according to one embodiment. As shown, the screenshot diagram 200 depicts a left headset view 210, which may be displayed for the viewer’s left eye, and a right headset view 220, which may be displayed for the viewer’s right eye. The differences between the left headset view 210 and the right headset view 220 may provide a sense of depth, enhancing the viewer’s perception of immersion in the scene.
As indicated previously, the video data for a virtual reality or augmented reality experience may be divided into a plurality of vantages, each of which represents the view from one location in the viewing volume. More specifically, a vantage is a view of a scene from a single point in three-dimensional space. A vantage can have any desired field-of-view (e.g. 90° horizontal × 90° vertical, or 360° horizontal × 180° vertical) and pixel resolution. A viewing volume may be populated with vantages in three-dimensional space at some density.
Based on the position of the viewer’s head, which may be determined by measuring the position of the headset worn by the viewer, the system may interpolate from a set of vantages to render the viewpoint video in the form of the final left and right eye view, such as the left headset view 210 and the right headset view 220 of FIG. 2. A vantage may contain extra data such as depth maps, edge information, and/or the like to assist in interpolation of the vantage data to generate the viewpoint video.
The vantage density may be uniform throughout the viewing volume, or may be non-uniform. A non-uniform vantage density may enable the density of vantages in any region of the viewing volume to be determined based on the likelihood the associated content will be viewed, the quality of the associated content, and/or the like. Thus, if desired, importance metrics may be used to establish vantage density for any given region of a viewing volume.
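One simple way to derive a non-uniform vantage density from importance metrics is to split a fixed vantage budget across regions in proportion to their importance. The following Python sketch does this with largest-remainder rounding. The region names, the proportional rule, and the function name are assumptions for illustration only.

```python
def vantages_per_region(importance, total_vantages):
    """Split a vantage budget across regions in proportion to importance."""
    weight_sum = sum(importance.values())
    exact = {r: total_vantages * w / weight_sum for r, w in importance.items()}
    counts = {r: int(x) for r, x in exact.items()}
    leftover = total_vantages - sum(counts.values())
    # Hand the remaining vantages to the regions with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda r: exact[r] - counts[r], reverse=True)
    for region in by_fraction[:leftover]:
        counts[region] += 1
    return counts

# Hypothetical regions of a viewing volume with relative importance weights.
weights = {"center": 6, "edge": 3, "rear": 1}
allocation = vantages_per_region(weights, total_vantages=25)
```

Dividing each count by the volume of its region would then give a density for that region.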
Referring to FIG. 3, a screenshot diagram 300 depicts the screenshot diagram 200 of FIG. 2, overlaid with a viewing volume 310 for each of the eyes, according to one embodiment. Each viewing volume 310 may contain a plurality of vantages 320, each of which defines a point in three-dimensional space from which the scene may be viewed by the viewer. Viewing from between the vantages 320 may also be carried out by combining and/or extrapolating data from vantages 320 adjacent to the viewpoint.
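For a viewpoint that falls between vantages, one possible way to combine data from nearby vantages is to weight each by the inverse of its distance to the viewpoint. The following Python sketch computes such weights. Inverse-distance weighting, the choice of the `k` nearest vantages, and the function name are assumptions of this example; the present disclosure does not prescribe a particular blending rule.

```python
import math

def blend_weights(viewpoint, vantages, k=4, eps=1e-9):
    """Return {vantage_index: weight} for the k vantages nearest the viewpoint.

    Weights are proportional to inverse distance and sum to 1.
    """
    distances = sorted((math.dist(viewpoint, pos), i) for i, pos in enumerate(vantages))
    # eps avoids division by zero when the viewpoint coincides with a vantage.
    raw = {i: 1.0 / (d + eps) for d, i in distances[:k]}
    total = sum(raw.values())
    return {i: w / total for i, w in raw.items()}

# Hypothetical vantage positions; the viewpoint sits midway between the first two.
positions = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5)]
weights = blend_weights((0.5, 0, 0), positions, k=2)
```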
Referring to FIG. 4, a screenshot diagram 400 depicts the view after the headset has been moved forward, toward the scene of FIG. 2, according to one embodiment. Again, a left headset view 410 and a right headset view 420 are shown, with the vantages 320 of FIG. 3 superimposed. Further, for each eye, currently and previously traversed vantages 430 are highlighted, as well as the current viewing direction 440.
Referring to FIG. 5, a screenshot diagram 500 depicts the color channel from a single vantage, such as one of the vantages 320 of FIG. 3, according to one embodiment. As shown, each vantage 320 may have a wide angle field-of-view of the scene, encompassing many possible viewing directions. For a full 360° horizontal × 180° vertical vantage, the viewer is only looking at a certain portion of the vantage at any given time. The portion may be defined by the headset’s field-of-view for each eye.
To be efficient for rendering performance, data input/output performance, and/or data compression/decompression, vantages may be tiled into smaller areas. Uniformly-sized rectangular tiles may be used in some embodiments. For example, as depicted in FIG. 5, the color channel for one of the vantages 320 may be divided into a rectangular grid of tiles 520, with thirty-two columns of tiles, and sixteen rows of tiles. This is merely exemplary, as different numbers of tiles may be used, such as sixteen columns by eight rows. The tiles 520 are also depicted in rectilinear space, but may, in alternative embodiments, be defined in the latitudinal/longitudinal space defined by wrapping the screenshot diagram 500 around a sphere.
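As a minimal sketch of the tiling described above, the following maps a viewing direction to a tile in the thirty-two by sixteen grid. It assumes that the vantage image is an equirectangular 360° × 180° projection with longitude running left to right and latitude running top to bottom; that layout, and the function name tile_index, are assumptions made for illustration.

```python
def tile_index(lon_deg, lat_deg, cols=32, rows=16):
    """Map a direction (longitude, latitude in degrees) to a (col, row) tile.

    Assumes an equirectangular image: longitude -180..180 spans the width,
    latitude +90 (top) to -90 (bottom) spans the height.
    """
    u = ((lon_deg + 180.0) % 360.0) / 360.0   # 0 at left edge, wraps around
    v = (90.0 - lat_deg) / 180.0              # 0 at top edge
    col = min(int(u * cols), cols - 1)
    row = min(max(int(v * rows), 0), rows - 1)  # clamp at the poles
    return col, row
```

With the default grid, each tile spans 11.25° in each direction; passing cols=16 and rows=8 gives the alternative grid mentioned above.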
Referring to FIG. 12, a diagram 1200 depicts a vantage 1210 according to one embodiment. The vantage 1210 may have a center 1220 and image data, such as an RGB channel and/or a depth channel, which may define a sphere 1230 encircling the center 1220. A field-of-view may be represented by four vectors 1240 extending outward from the center 1220 to pass through the surface of the sphere 1230. A semispherical area 1250 (shown in red hatching) on the surface of the sphere 1230, between the locations at which the vectors 1240 pass through the surface of the sphere 1230, may represent the portion of the RGB channel of the vantage 1210 that is to be viewed currently and/or used in combination with other vantage data to generate viewpoint video.
Referring to FIG. 6, a diagram 600 depicts the manner in which the tiles of a vantage, such as one of the vantages 320 of FIG. 3, may be selected, according to one embodiment. The vantages 320 may have any of a wide variety of shapes, including but not limited to spherical and cylindrical shapes. The diagram 600 depicts vantages 320 as having spherical shapes, by way of example.
The top row depicts four side views of a sphere representing the vantage 320. A field-of-view 610 is oriented along the viewing direction 620 currently being viewed by the viewer. The field-of-view 610 is depicted in the same orientation in each view of the top row because the field-of-view 610 is depicted, in each case, from its left side. The middle row depicts four top views of the sphere representing the vantage 320, depicting the field-of-view 610 in various orientations.
The bottom row depicts the color channel 630 for the vantage 320, divided into tiles as in FIG. 5. A subset 640 of the tiles of the color channel 630 may be fetched to correspond to the viewer’s viewpoint, permitting the viewpoint video to be rendered. The subset 640 may move, for example, to the right, within the color channel 630, as the viewer pivots his or her head to the right, as can be seen by viewing the first column, then the second column, then the third column, and then the fourth column of FIG. 6. Tiles may also be fetched from other vantages proximate the viewer’s viewpoint and combined with the subset 640 to render the viewpoint video.
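The tile selection described above might be sketched as follows, purely by way of example. The sketch assumes an equirectangular tile grid and approximates the field-of-view as a longitude/latitude window centered on the viewing direction, which ignores the distortion that occurs near the poles; the function name visible_tiles and its parameters are hypothetical.

```python
import math

def visible_tiles(yaw_deg, pitch_deg, hfov_deg, vfov_deg, cols=32, rows=16):
    """Return the set of (col, row) tiles overlapped by a lon/lat window.

    Simplifying assumption: the field-of-view is treated as a rectangle in
    longitude/latitude, so coverage near the poles is only approximate.
    """
    tile_w = 360.0 / cols   # degrees of longitude per tile column
    tile_h = 180.0 / rows   # degrees of latitude per tile row

    # Horizontal extent, measured from the left edge of the image.
    left = yaw_deg - hfov_deg / 2.0 + 180.0
    right = yaw_deg + hfov_deg / 2.0 + 180.0
    first_col = math.floor(left / tile_w)
    last_col = max(first_col, math.ceil(right / tile_w) - 1)
    if last_col - first_col + 1 >= cols:
        col_ids = range(cols)
    else:
        # The modulo wraps the window across the left/right image seam.
        col_ids = [c % cols for c in range(first_col, last_col + 1)]

    # Vertical extent, measured from the top edge and clamped at the poles.
    top = max(0.0, 90.0 - (pitch_deg + vfov_deg / 2.0))
    bottom = min(180.0, 90.0 - (pitch_deg - vfov_deg / 2.0))
    first_row = min(math.floor(top / tile_h), rows - 1)
    last_row = max(first_row, math.ceil(bottom / tile_h) - 1)

    return {(c, r) for c in col_ids for r in range(first_row, last_row + 1)}
```

In this sketch, increasing the yaw shifts the selected columns to the right, consistent with the movement of the subset 640 as the viewer pivots his or her head to the right. A practical system would likely fetch a margin of additional tiles around the window to tolerate head motion between fetches.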
In alternative embodiments, non-uniformly sized and/or non-rectangular tiles may be used. The sizes and/or shapes of the tiles may be dependent on the content depicted in those tiles. For example, more tiles may be positioned in areas of vantages with higher importance metrics than the surrounding areas, enabling the more important viewing directions to be rendered in greater detail.
* Depth Channel*
Referring to FIG. 7, a screenshot diagram 700 depicts the depth channel from the vantage used to provide the screenshot diagram 500 of FIG. 5, according to one embodiment. Depth information may be encoded into each vantage to provide proper parallax and/or other visual effects.
* Vantage-Based and Tile-Based Usage of Importance Metrics*
As indicated previously, vantages, such as the vantages 320 of FIG. 3, may have different importance metrics to indicate the relative importance of the vantages 320. Further, tiles, such as the tiles 520 of FIG. 5, can have different importance metrics indicating the relative importance of the tiles. The importance metrics may be used in a variety of ways to prioritize and/or enhance delivery of more important content to the viewer.
In some embodiments, the importance metrics may be used to guide the compression algorithms to allocate more bits/quality to the more important portions and fewer bits/less quality to the less important portions of the viewpoint provided by a vantage 320. As each of the tiles 520 represents a direction into the scene as well as a position in space, importance levels of tiles may vary along either or both of the X and Y axes.
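One simple way in which importance metrics might guide bit allocation is sketched below. The text does not prescribe an allocation rule; a proportional split with an optional per-tile floor is assumed here, and the function name allocate_bits and the min_bits parameter are hypothetical.

```python
def allocate_bits(importance, total_bits, min_bits=0):
    """Split total_bits across tiles in proportion to their importance.

    Assumed rule (not from the text): every tile first receives min_bits,
    and the remainder is divided in proportion to the importance values.
    """
    n = len(importance)
    budget = total_bits - min_bits * n
    if budget < 0:
        raise ValueError("total_bits cannot cover the per-tile minimum")
    total = sum(importance)
    if total == 0:
        # No tile is marked important: fall back to an equal split.
        return [total_bits / n] * n
    return [min_bits + budget * w / total for w in importance]
```

For example, two tiles with importance values 3 and 1 sharing 1000 bits with a floor of 100 bits each would receive 700 and 300 bits, respectively. An actual encoder would typically translate such targets into quantization or rate-control settings.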
Such importance metrics may guide a vantage-based video system in the allocation of resources such as, but not limited to, the number of vantages, vantage density, vantage placement, bits and encoding/decoding complexity, in order to meet system constraints, such as bandwidth, storage, CPU resources, and/or GPU resources.
For example, to maximize perceived quality, the system can allocate more resources to more important regions of the viewing volume and/or more important tiles than less important regions and/or tiles. If a limit on the maximal number of vantages must be adhered to in order to meet system requirements, the importance metrics may be used to determine the optimal location of the vantages. Thus, importance metrics may be used to place vantages or tiles in the video data in the step 120 of the method 100 of FIG. 1.
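As an illustrative sketch of working within a vantage limit, the following distributes a fixed vantage budget across regions of the viewing volume according to their importance metrics. The division of the viewing volume into discrete regions, the largest-remainder rounding, and the function name vantages_per_region are assumptions made for this example only.

```python
def vantages_per_region(importance, max_vantages):
    """Distribute max_vantages across regions in proportion to importance.

    Uses largest-remainder rounding so the counts sum exactly to
    max_vantages. The per-region importance values are assumed inputs.
    """
    total = sum(importance)
    if total <= 0:
        raise ValueError("at least one region must have positive importance")
    exact = [max_vantages * w / total for w in importance]
    counts = [int(x) for x in exact]          # round every share down
    leftover = max_vantages - sum(counts)
    # Give the remaining vantages to the largest fractional shares.
    by_fraction = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts
```

The resulting counts determine only how many vantages each region receives; where the vantages are positioned within a region remains a separate, content-dependent decision.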
The importance metric for vantage tiles may be applied to caching strategy for playback, for example, on a personal computer or mobile device. In such applications, where disk input/output and/or network streaming bandwidth may be constrained, it may be desirable to pre-fetch the most important vantage tiles ahead of time. Any number of known predictive caching techniques may be used to accomplish this. The importance metrics may be referenced to prioritize more important video data for predictive caching.
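By way of example, prioritizing tiles for predictive caching might be sketched as follows. The sketch assumes that each candidate tile carries an importance value and a size in bytes, and that the cache has a fixed byte budget; the function name plan_prefetch and the tuple layout are hypothetical, and the greedy rule is only one of many possible caching strategies.

```python
import heapq

def plan_prefetch(tiles, budget_bytes):
    """Return tile ids to pre-fetch, most important first, within a budget.

    tiles: iterable of (tile_id, importance, size_bytes) tuples (assumed
    layout). Tiles that do not fit in the remaining budget are skipped.
    """
    # heapq is a min-heap, so importance is negated to pop the largest first.
    heap = [(-importance, size, tile_id) for tile_id, importance, size in tiles]
    heapq.heapify(heap)
    plan, used = [], 0
    while heap:
        _, size, tile_id = heapq.heappop(heap)
        if used + size <= budget_bytes:
            plan.append(tile_id)
            used += size
    return plan
```

In practice, such a plan would likely be combined with a prediction of the viewer's upcoming head position, so that importance is weighed against the likelihood that a tile will be needed soon.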
A system can utilize an importance map to allocate system resources for capturing, encoding, decoding, storing, pre-processing, post-processing, delivering, and/or playing content. The parameters used for resource allocation may include, but are not limited to, the following, and may be applied to each individual tile or vantage, a subset of tiles or vantages, and/or globally:
The number of vantages in the viewing volume and/or a region of the viewing volume;
Vantage density in the viewing volume and/or a region of the viewing volume;
The position of vantages relative to content;
The location of vantages;
The number, complexity, and location of view-dependent variations, such as variations in lighting and resolution;
The spatial resolution of tiles;
The temporal resolution of tiles;
The color/depth bit-sampling of tiles;
The bitrate of tiles;
The quality and/or rate of rendering;
The number of vantages used for generating a viewpoint;
The density of meshes, for example, for rendered three-dimensional models;
The density of cameras used to capture the scene;
The manner in which various portions of the video data are prioritized (which portions of the video data to process, store, render, and/or transmit when resources are constrained);
The extent of pre-processing to be carried out for various portions of the video data; and
Other codec-related parameters used for encoding/decoding image data.
Those of skill in the art will recognize that the list set forth above is merely exemplary. The parameters listed above may be modified singly or in combination with each other. In other embodiments, other system resource parameters may be modified based on the importance metrics of the corresponding video data.
* Vantage Density and Position*
The density with which vantages are arranged (uniformly or non-uniformly) within a viewing volume, or a region of a viewing volume, may be an important resource allocation parameter. Based on how important a particular vantage and/or tile is, more vantages can be allocated to that region of the viewing volume. Optimal positions may be found to cover disocclusions that may be very content-dependent and/or scene-dependent. Parallax and view-dependent lighting may be taken into account in the assignment of importance metrics, since having the correct vantage density and position may greatly enhance provision of parallax and view dependent lighting.
Referring to FIG. 8, a diagram 800 depicts a portion of a scene in which two objects 810 are positioned such that an occluded area 820 exists behind the objects 810, according to one embodiment. A keyhole 830 may exist between the objects 810, through which the occluded area 820 may be viewable from a viewing volume 840. A plurality of vantages 850 within the viewing volume 840 are positioned proximate the objects 810. The vantages 850 may be cylindrical or spherical vantages, or may have any other shape, as discussed in connection with FIG. 6.
None of the vantages 850 within the viewing volume 840 are aligned with the keyhole 830, as shown by the fields-of-view 860 centered at the vantages 850 and oriented toward the objects 810. Accordingly, the corresponding video data may not contain accurate imagery depicting the occluded area 820. A viewer positioning his or her head between the vantages 850 in an attempt to view the occluded area 820 may view viewpoint video that lacks detail regarding the occluded area 820 because the viewpoint video may be generated based on the tiles from the vantages 850; none of these tiles effectively depicts the occluded area 820.
Referring to FIG. 9, a diagram 900 depicts the portion of the scene of FIG. 8, in which another vantage 950 has been added to enhance viewing of the occluded area 820, according to one embodiment. A field-of-view 960 from the vantage 950, oriented toward the keyhole 830, enables the viewer to view a portion of the occluded area 820.
This illustration of keyhole disocclusion depicts one manner in which vantage density and/or placement may help to determine the quality of the viewing experience. Each additional vantage adds to the quantity of video data that needs to be stored, retrieved, and/or processed; accordingly, it is beneficial to conserve system resources by using a smaller vantage density for less important portions of the video data.
One method of optimizing vantage density and position is for content creators to place vantages manually and adjust them based on quick feedback. A fixed vantage density may initially be used, and the output may be viewed in a virtual reality headset. Then, the vantages may be manually moved and adjusted, either singly or in groups, until the final output quality is satisfactory. This process may be repeated for each frame in time.
To save editing time, the content creator may “pin” a set of vantages to a particular region of the viewing volume. Optical flow methods and/or the like may be applied to track these regions over time to provide the content creators with a better starting set of vantage positions. This may reduce the amount of editing that needs to be done.
Another method is to place the vantages automatically using a software algorithm that analyzes the scene and generates the optimal vantage density and/or position for the final output. In some embodiments, a mixture of the two methods (manual and automated vantage placement) may be carried out. For example, the automated method may generate a starting vantage placement for each frame. The content creator may make further adjustments in each frame, if necessary.
* Viewing Data*
As mentioned above, viewing data may be collected and used to set importance metrics. The viewing data may come from viewing by the content creator to set vantage densities and/or positions, as set forth in the preceding section. Alternatively, the viewing data may come from other viewers (such as consumers who are unaffiliated with the content creator) who view the experience subsequent to its creation, as described in connection with FIG. 1. The viewing data may be used not just to set vantage position and/or density, but also to set any of the system resource parameters listed previously.
Referring to FIG. 10, a screenshot diagram 1000 depicts the vantages 1010 traversed by a single viewer and accumulated over time, according to one embodiment. The vantages 1010 traversed by the viewer may be assigned higher importance metrics, relative to the importance metrics assigned to vantages that the viewer did not traverse.
In some embodiments, other aspects of viewing data may be recorded in connection with the information presented in FIG. 10. For example, the number of times a vantage was traversed by the viewer, the amount of time the viewer spent traversing each vantage, and/or viewing data for particular tiles of the vantages 1010 may be recorded and factored into the importance metrics to be assigned.
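As a minimal sketch of how such viewing data might be factored into importance metrics, the following combines traversal counts and dwell times for a single viewer. The log format, the linear weighting, the normalization to the range 0 to 1, and the function name importance_from_log are all assumptions made for illustration.

```python
from collections import defaultdict

def importance_from_log(events, count_weight=1.0, dwell_weight=1.0):
    """Score vantages from (vantage_id, dwell_seconds) traversal events.

    Assumed formula: score = count_weight * visits + dwell_weight * seconds,
    normalized so the most-viewed vantage scores 1.0. Vantages that were
    never traversed do not appear in the result.
    """
    visits = defaultdict(int)
    dwell = defaultdict(float)
    for vantage_id, seconds in events:
        visits[vantage_id] += 1
        dwell[vantage_id] += seconds
    raw = {
        v: count_weight * visits[v] + dwell_weight * dwell[v] for v in visits
    }
    peak = max(raw.values(), default=0.0)
    return {v: (score / peak if peak else 0.0) for v, score in raw.items()}
```

Vantages absent from the returned mapping could be assigned a default low importance metric, consistent with the treatment of untraversed vantages described above.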
In some embodiments, more explicit viewer feedback may also be received and recorded. For example, a viewer may fill out a survey indicating which aspects of the virtual reality or augmented reality experience were the most enjoyable. Additionally or alternatively, biometric data (such as pulse rate, blood pressure, brain activity, etc.) may be tracked to glean information regarding the viewer’s level of engagement with each portion of the experience.
In some examples, viewing data from multiple viewers may be recorded and aggregated to assign importance metrics. One such example will be shown and described in connection with FIG. 11.
Referring to FIG. 11, a screenshot diagram 1100 depicts the vantages 1110 traversed by multiple viewers and accumulated over time, according to one embodiment. Vantages 1120 that have been viewed more frequently and/or for longer periods of time may be shown in a darker color, thereby presenting vantage viewing in the form of a “heat map.” This may be extended in time so that changes over time in the volume and/or “heat” of the heat map may be visualized. Such information may be used to facilitate assignment of the importance metrics.
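The aggregation of viewing data from multiple viewers into a time-varying heat map might be sketched as follows, by way of example. The session format, the fixed time window, and the function name heat_map are assumptions; the sketch counts views per vantage per time window and does not address how the counts would be drawn.

```python
from collections import Counter

def heat_map(sessions, window_s=1.0):
    """Aggregate per-viewer samples into {window_index: Counter of vantages}.

    sessions: one list per viewer of (timestamp_seconds, vantage_id) samples
    (assumed format). Higher counts correspond to "hotter" vantages.
    """
    heat = {}
    for session in sessions:
        for timestamp, vantage_id in session:
            window = int(timestamp // window_s)
            heat.setdefault(window, Counter())[vantage_id] += 1
    return heat
```

Comparing the counters of successive windows would indicate how the "heat" shifts over the course of the experience, and the counts could be normalized to serve as importance metrics.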
The above description and referenced drawings set forth particular details with respect to possible embodiments. Those of skill in the art will appreciate that the techniques described herein may be practiced in other embodiments. First, the particular naming of the components, capitalization of terms, the attributes, data structures, or any other programming or structural aspect is not mandatory or significant, and the mechanisms that implement the techniques described herein may have different names, formats, or protocols. Further, the system may be implemented via a combination of hardware and software, as described, or entirely in hardware elements, or entirely in software elements. Also, the particular division of functionality between the various system components described herein is merely exemplary, and not mandatory; functions performed by a single system component may instead be performed by multiple components, and functions performed by multiple components may instead be performed by a single component.
Reference in the specification to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Some embodiments may include a system or a method for performing the above-described techniques, either singly or in any combination. Other embodiments may include a computer program product comprising a non-transitory computer-readable storage medium and computer program code, encoded on the medium, for causing a processor in a computing device or other electronic device to perform the above-described techniques.
Some portions of the above are presented in terms of algorithms and symbolic representations of operations on data bits within a memory of a computing device. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps (instructions) leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical, magnetic or optical signals capable of being stored, transferred, combined, compared and otherwise manipulated. It is convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. Furthermore, it is also convenient at times, to refer to certain arrangements of steps requiring physical manipulations of physical quantities as modules or code devices, without loss of generality.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “displaying” or “determining” or the like, refer to the action and processes of a computer system, or similar electronic computing module and/or device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Certain aspects include process steps and instructions described herein in the form of an algorithm. It should be noted that the process steps and instructions described herein can be embodied in software, firmware and/or hardware, and when embodied in software, can be downloaded to reside on and be operated from different platforms used by a variety of operating systems.
Some embodiments relate to an apparatus for performing the operations described herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computing device. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, flash memory, solid state drives, magnetic or optical cards, application specific integrated circuits (ASICs), and/or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Further, the computing devices referred to herein may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
The algorithms and displays presented herein are not inherently related to any particular computing device, virtualized system, or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will be apparent from the description provided herein. In addition, the techniques set forth herein are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the techniques described herein, and any references above to specific languages are provided for illustrative purposes only.
Accordingly, in various embodiments, the techniques described herein can be implemented as software, hardware, and/or other elements for controlling a computer system, computing device, or other electronic device, or any combination or plurality thereof. Such an electronic device can include, for example, a processor, an input device (such as a keyboard, mouse, touchpad, trackpad, joystick, trackball, microphone, and/or any combination thereof), an output device (such as a screen, speaker, and/or the like), memory, long-term storage (such as magnetic storage, optical storage, and/or the like), and/or network connectivity, according to techniques that are well known in the art. Such an electronic device may be portable or nonportable. Examples of electronic devices that may be used for implementing the techniques described herein include: a mobile phone, personal digital assistant, smartphone, kiosk, server computer, enterprise computing device, desktop computer, laptop computer, tablet computer, consumer electronic device, television, set-top box, or the like. An electronic device for implementing the techniques described herein may use any operating system such as, for example: Linux; Microsoft Windows, available from Microsoft Corporation of Redmond, Wash.; Mac OS X, available from Apple Inc. of Cupertino, Calif.; iOS, available from Apple Inc. of Cupertino, Calif.; Android, available from Google, Inc. of Mountain View, Calif.; and/or any other operating system that is adapted for use on the device.
In various embodiments, the techniques described herein can be implemented in a distributed processing environment, networked computing environment, or web-based computing environment. Elements can be implemented on client computing devices, servers, routers, and/or other network or non-network components. In some embodiments, the techniques described herein are implemented using a client/server architecture, wherein some components are implemented on one or more client computing devices and other components are implemented on one or more servers. In one embodiment, in the course of implementing the techniques of the present disclosure, client(s) request content from server(s), and server(s) return content in response to the requests. A browser may be installed at the client computing device for enabling such requests and responses, and for providing a user interface by which the user can initiate and control such interactions and view the presented content.
Any or all of the network components for implementing the described technology may, in some embodiments, be communicatively coupled with one another using any suitable electronic network, whether wired or wireless or any combination thereof, and using any suitable protocols for enabling such communication. One example of such a network is the Internet, although the techniques described herein can be implemented using other networks as well.
While a limited number of embodiments has been described herein, those skilled in the art, having benefit of the above description, will appreciate that other embodiments may be devised which do not depart from the scope of the claims. In addition, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure is intended to be illustrative, but not limiting.