APPARATUS AND METHOD FOR AUDIOVISUAL RENDERING

Title:

APPARATUS AND METHOD FOR AUDIOVISUAL RENDERING

Document Type and Number:

WIPO Patent Application WO/2024/023126

Kind Code:

Abstract:

An apparatus comprises a receiver (101) receiving audiovisual data representing a scene. Sources (105, 107) provide a vehicle motion signal indicative of a motion of a vehicle and a relative user motion signal indicative of a motion of a user relative to the vehicle. A predictor (109) generates a predicted relative user motion signal by applying a prediction model to the vehicle motion signal A residual signal generator (111) generates a residual user motion signal indicative of the residual difference between the predicted and received relative user motion. A view pose determiner (113) determines a view pose with different dependencies on the predicted relative user motion signal and the residual user motion signal. A renderer (103) renders an audiovisual signal for the view pose from the audiovisual data. The approach may provide enhanced user experiences that may compensate or include effects of user motion caused by vehicle motion.

Inventors:

VAREKAMP CHRISTIAAN (NL)
KROON BART (NL)
OOMEN ARNOLDUS WERNER JOHANNES (NL)

Application Number:

PCT/EP2023/070653

Publication Date:

February 01, 2024

Filing Date:

July 26, 2023

Export Citation:

Click for automatic bibliography generation Help

Assignee:

KONINKLIJKE PHILIPS NV (NL)

International Classes:

B60W50/00

Foreign References:

US20200180647A1	2020-06-11
US9396588B1	2016-07-19

Other References:

YUAN HANG: "A review on image-based rendering", GUO-PING ANG, VIRTUAL REALITY & INTELLIGENT HARDWARE, vol. 1, 1 February 2019 (2019-02-01), pages 39 - 54, Retrieved from the Internet
SHUM; KANG: "A Review of Image-Based Rendering Techniques", PROCEEDINGS OF SPIE - THE INTERNATIONAL SOCIETY FOR OPTICAL ENGINEERING, vol. 4067, May 2000 (2000-05-01), pages 2 - 13
TAKALATAPIOJAMES, HAHN: "Sound Rendering", SIGGRAPH COMPUT. GRAPH., vol. 26, pages 211 - 220
DAVID A: "Techniques for Low Cost Spatial Audio by Burgess", PROCEEDINGS OF THE 5TH ANNUAL ACM SYMPOSIUM ON USER INTERFACE SOFTWARE AND TECHNOLOGY, pages 53 - 59
PSYCHOACOUSTIC MUSIC SOUND FIELD SYNTHESISZIEMER, TIM: "Current Research in Systematic Musicology", vol. 7, CHAM: SPRINGER, pages: 287

Attorney, Agent or Firm:

PHILIPS INTELLECTUAL PROPERTY & STANDARDS (NL)

Download PDF:

View/Download PDF PDF Help

Claims:

CLAIMS:

Claim 1. An apparatus for audiovisual rendering, the apparatus comprising: a receiver (101) arranged to receive audiovisual data representing a scene; a first source (105) providing a vehicle motion signal indicative of a motion of a vehicle; a second source (107) providing a relative user motion signal indicative of a motion of a user relative to the vehicle; a predictor (109) arranged to generate a predicted relative user motion signal by applying a prediction model to the vehicle motion signal; a residual signal generator (111) arranged to generate a residual user motion signal indicative of at least a component of a difference between the relative user motion signal and the predicted relative user motion signal; a view pose determiner (113) arranged to determine a view pose in dependence on the residual user motion signal and the predicted relative user motion signal, a dependency of the view pose on the residual user motion signal being different than a dependency of the view pose on the predicted relative user motion signal; and a Tenderer (103) arranged to render an audiovisual signal for the view pose from the audiovisual data.

Claim 2. The apparatus of claim 1 wherein the prediction model comprises a temporal filtering of the vehicle motion signal.

Claim 3. The apparatus of claim 2 wherein the predictor (109) is arranged to apply a temporal filtering to the relative user motion signal to generate a filtered relative user motion signal, and to predict the predicted relative user motion signal in dependence on the filtered relative user motion signal.

Claim 4. The apparatus of any previous claim wherein the prediction model comprises a biomechanical model.

Claim 5. The apparatus of any of the previous claims wherein the view pose determiner (113) is arranged to apply a different weighting to the predicted relative user motion signal than to the residual user motion signal. Claim 6. The apparatus of any of the previous claims wherein the view pose determiner (113) is arranged to apply a different temporal fdtering to the predicted relative user motion signal than to the residual user motion signal.

Claim 7. The apparatus of any previous claim wherein the view pose determiner (113) is arranged to determine a first view pose contribution from the predicted relative user motion signal and a second view pose contribution from the residual user motion signal, and to generate the view pose by combining the first view pose contribution and the second view pose contribution.

Claim 8. The apparatus of any previous claim wherein the view pose determiner (113) is arranged to extract a first motion component from the predicted relative user motion signal by attenuating temporal frequencies, and to determine the view pose to include a contribution from the first motion component.

Claim 9. The apparatus of any previous claim wherein the view pose determiner (113) is arranged to detect a user gesture motion component in the residual user motion signal and to extract a first motion component from the residual user motion signal in dependence on the user gesture motion component, and to determine the view pose to include a contribution from the user gesture motion component.

Claim 10. The apparatus of any previous claim wherein the predictor (109) is arranged to determine the predicted relative user motion signal in response to a correlation of the relative user motion signal and the vehicle motion signal.

Claim 11. The apparatus of any previous claim wherein at least one of the predicted relative user motion signal and the residual user motion signal is indictive of a plurality of pose components, and the view pose determiner is arranged to determine the view pose with different dependencies on at least a first pose component and a second pose component of the plurality of pose components.

Claim 12. The apparatus of any previous claim wherein the receiver (101) is arranged to receive processing data indicative of a processing of at least one of the predicted relative user motion signal and the residual user motion signal; and the view pose determiner (113) is arranged to determine the view pose in accordance with processing instructions of the processing data.

Claim 13. The apparatus of any previous claim wherein the predictor (109) is arranged to generate the predicted relative user motion signal in dependence on the residual user motion signal.

Claim 14. A method of audiovisual rendering, the method comprising: receiving audiovisual data representing a scene; providing a vehicle motion signal indicative of a motion of a vehicle; providing a relative user motion signal indicative of a motion of a user relative to the vehicle; generating a predicted relative user motion signal by applying a prediction model to the vehicle motion signal; generating a residual user motion signal indicative of at least a component of a difference between the relative user motion signal and the predicted relative user motion signal; determining a view pose in dependence on the residual user motion signal and the predicted relative user motion signal, a dependency of the view pose on the residual user motion signal being different than a dependency of the view pose on the predicted relative user motion signal; and rendering an audiovisual signal for the view pose from the audiovisual data.

Claim 15. A computer program product comprising computer program code means adapted to perform all the steps of claim 14 when said program is run on a computer.

Description:

APPARATUS AND METHOD FOR AUDIOVISUAL RENDERING

FIELD OF THE INVENTION

The invention relates to an apparatus and method for audiovisual rendering, such as in particular, but not exclusively, to rendering of an audiovisual signal for an extended Reality application for a user subjected to a vehicle motion.

BACKGROUND OF THE INVENTION

The variety and range of image and video applications have increased substantially in recent years with new services and ways of utilizing and consuming video being continuously developed and introduced.

For example, one service being increasingly popular is the provision of image sequences in such a way that the viewer is able to actively and dynamically interact with the system to change parameters of the rendering. A very appealing feature in many applications is the ability to change the effective viewing position and viewing direction of the viewer, such as for example allowing the viewer to move and look around in the scene being presented.

Such a feature can specifically allow a virtual reality experience to be provided to a user. This may allow the user to e.g. (relatively) freely move about in a virtual environment and dynamically change his position and where he is looking. Typically, such Virtual Reality (VR) applications are based on a three-dimensional model of the scene with the model being dynamically evaluated to provide the specific requested view. This approach is well known from e.g. gaming applications, such as in the category of first person shooters, for computers and consoles. Other examples include Augmented Reality (AR) or Mixed Reality (MR) applications. Such applications are often in common referred to as extended Reality (XR) applications.

An important feature in many e.g. XR applications is the determination of user movements in the real world and adapting the audiovisual representation of virtual features to reflect the user movements.

Detection of user movements may include what is referred to as Outside-in and Inside-out Tracking. Outside-in VR tracking uses cameras or other sensors placed in a stationary location and oriented towards the tracked object (e.g., a headset) that moves freely inside a predetermined area that is covered by the sensors.

Inside-out tracking differentiates itself from outside-in tracking in that the sensors are attached to the object (e.g., integrated in the headset). Image data as captured by multiple camera sensors are used to reconstruct 3D features in the visual surrounding world. Assuming that this world is static, the headset is positioned relative to this world system. Next to cameras, other sensors such as accelerometers can be used to improve accuracy and robustness of the headset pose estimate.

When consuming A/V content in a stationary environment with a VR headset or headphones, the head rotations and optionally translations (e.g. 3DoF or 6DoF (Degrees of Freedom)) (e.g., using one or more cameras) are continuously measured. The measured motions are fed back to the rendering process such that the user’s motions relative to the real world are taken into account. For example, when the user turns his head left, relevant objects will turn right. When the user moves slightly sideways, relevant objects are rendered such that the user is able to look from the side and maybe even look around an object to view a previously occluded object. An important assumption for correct operation of such a system is that world objects that are used as reference to determine the user’s movements do not move themselves (Called independent motion in the context of visual Structure from Motion algorithms). If for instance, the world reference system would move but not the user, then the user would observe a moving virtual scene while the user is not moving. This would be an unwanted effect. The inside-out tracking therefore needs to be robust for independently moving objects in the real-world scene that surrounds the user.

It has been proposed to provide XR applications and services that are not only suitable for consuming in a mobile environment, but which may also be adapted to reflect or compensate for the movement. However, this provides a number of challenges and difficulties and in particular as the preferred operation and experience may depend on the specific desires and preferences of the individual application.

In some cases, when the user is traveling in a mobile environment, for example a train or car, it may be desirable to not classify the vehicle as an independently moving object. For example, visual features inside the vehicle may be considered to define the reference system and motions of the world outside the vehicle (relative to the vehicle) may be ignored.

If the vehicle travels with a purely constant speed this may be relatively easy to achieve. However, as soon as the vehicle changes velocity or vibrates, it may introduce a motion difference between a passenger and the vehicles internal reference system (typically determined by the visual appearance and geometry of the vehicle) which may cause complex relationships that may substantially complicate the operation of the application. Typically, such motion difference may be picked up by an inside-out tracking device and used to render the virtual scene. Such effects may result in a limited and/or suboptimal user experience and may for example result in presentations that are not fully consistent with the perceived motion.

Further, operations and algorithms considering such more complex motions tend to be suboptimal in terms of complexity, accuracy, computational resource usage etc.

Hence, an improved approach for audiovisual rendering suitable for a user subjected to vehicle motion would be advantageous. In particular, an approach that allows improved operation, increased flexibility, an improved user experience, reduced complexity, facilitated implementation, improved rendering quality, improved and/or facilitated rendering, improved and/or facilitated adaptation to user and vehicle motions, and/or improved performance and/or operation would be advantageous.

SUMMARY OF THE INVENTION

Accordingly, the Invention seeks to preferably mitigate, alleviate or eliminate one or more of the above-mentioned disadvantages singly or in any combination.

According to an aspect of the invention, there is provided an apparatus for audiovisual rendering, the apparatus comprising: a receiver arranged to receive audiovisual data representing a scene; a first source providing a vehicle motion signal indicative of a motion of a vehicle; a second source providing a relative user motion signal indicative of a motion of a user relative to the vehicle; a predictor arranged to generate a predicted relative user motion signal by applying a prediction model to the vehicle motion signal; a residual signal generator arranged to generate a residual user motion signal indicative of at least a component of a difference between the relative user motion signal and the predicted relative user motion signal; a view pose determiner arranged to determine a view pose in dependence on the residual user motion signal and the predicted relative user motion signal, a dependency of the view pose on the residual user motion signal being different than a dependency of the view pose on the predicted relative user motion signal; and a Tenderer arranged to render an audiovisual signal for the view pose from the audiovisual data.

The invention may in many applications and scenarios allow improved generation of an audiovisual signal for a user subjected to movement of a vehicle. It may in many scenarios and applications provide an improved user perception of a scene and may for example provide more desirable user experiences. The approach may specifically allow a differentiated adaptation of the rendered audiovisual experience to different types of motion. It may for example allow a user experience which is less affected by vehicle motion, and which may be perceived more similar to an experience in scenarios where the user is not subjected to vehicle movement.

In some scenarios, the approach may for example provide a more flexible and/or improved compensation for the effect of the vehicle movement. It may for example provide an experience that may compensate or adapt the rendered audiovisual signal depending on the vehicle motion such that e.g. motion induced sickness (this typically arises from conflict between the sensory inputs of different senses, such as the sense of balance and the visual sense) or discomfort may be reduced while still allowing the user’s intentional motion to be determined and accounted for in the rendering.

For example, it may allow an adaptation of a VR or XR experience where some user motion may be reflected in the audiovisual cues (images and/or sound) provided to the user while compensation of some vehicle motion and its impact on the user may be compensated for.

In many scenarios, an efficient and/or low complexity and/or resource demanding operation may be achieved. The approach may e.g. as a specific example be used in a situation where a user may be travelling in a vehicle while trying to use a VR application that presents a 3D image to the user. The 3D image may show a view of a 3D scene and as the user moves his head, the view of the scene may change correspondingly. For example, if the user turns his head to the left, the 3D image is updated to show the view of the scene that a person in the scene would see by turning his head to the left. However, if the user is in a moving vehicle, his head movement is likely to reflect different components. One is the intentional movements that are actively performed by the user, such as moving his head to the side. However, another component is the movements of the user’s head that are caused by the movement of the car. For example, going over a speedbump may cause the head of a passenger in a car to make sudden vertical movements relative to the car. This movement is involuntary and may simply be determined by mechanical dynamics of the car (seat suspension etc) as well as of the user’s body (e.g. elasticity provided by the user’s neck). Thus, the motion of the vehicle may have an impact on the user which causes the user’s head to move relative to the vehicle. If such movement is detected by the VR headset, and the image is adapted accordingly, the user may experience an unrealistic virtual world experience. The apparatus for audiovisual rendering may provide improved performance and an improved user experience in many such/similar scenarios and cases. The apparatus may be arranged to differentiate between movements that are intentionally caused by the user (e.g. by turning his head) or which are caused by the movement of the vehicle without the user desiring to make such head movements. The rendering of the scene, and specifically the view pose of the user in the scene, may then be reflected to have a different dependency on motion that is intentional/voluntary and movement that is involuntary/ not intentional and which simply happens as a function of the movement of the vehicle.

The view pose determiner may in many embodiments be arranged to determine a view pose signal in dependence on the residual user motion signal and the predicted relative user motion signal, a dependency of the view pose on the residual user motion signal being different than a dependency of the view pose on the predicted relative user motion signal. The Tenderer may be arranged to render an audiovisual signal for the (view poses of a) view pose signal from the audiovisual data.

User motions may specifically be indicative of motions of a head (or possibly) eye of the user. A motion may be a time derivative of a pose. A signal may be one or more values that have a temporal aspect/ are time dependent/ may vary with time.

According to an optional feature of the invention, the prediction model comprises a temporal filtering of the vehicle motion signal.

This may provide particularly advantageous operation and/or rendering in many embodiments. It may allow a prediction that is particularly suitable for adapting the rendering of an audiovisual signal reflecting user movement for a user subjected to a vehicle movement. The temporal filtering may be a high pass filtering of the vehicle motion signal.

According to an optional feature of the invention, the predictor is arranged to apply a temporal filtering to the relative user motion signal to generate a filtered relative user motion signal, and to predict the predicted relative user motion signal in dependence on the filtered relative user motion signal.

This may provide particularly advantageous operation and/or rendering in many embodiments. It may allow a more accurate prediction of relative user motion that is e.g. directly resulting from vehicle motion.

According to an optional feature of the invention, the prediction model comprises a biomechanical model.

According to an optional feature of the invention, the view pose determiner is arranged to apply a different weighting to the predicted relative user motion signal than to the residual user motion signal.

This may provide particularly advantageous effects in many embodiments and scenarios and may specifically provide improved adaptation and differentiation of different motions.

According to an optional feature of the invention, the view pose determiner is arranged to apply a different temporal filtering to the predicted relative user motion signal than to the residual user motion signal.

This may provide particularly advantageous effects in many embodiments and scenarios and may specifically provide improved adaptation and differentiation of different motions.

According to an optional feature of the invention, the view pose determiner is arranged to determine a first view pose contribution from the predicted relative user motion signal and a second view pose contribution from the residual user motion signal, and to generate the view pose by combining the first view pose contribution and the second view pose contribution.