Title:
METHOD BASED ON INVERSE BLURRING FOR IMMERSIVE MEDIA DEVICES TO MITIGATE VERGENCE ACCOMMODATION CONFLICT
Document Type and Number:
WIPO Patent Application WO/2024/089497
Kind Code:
A1
Abstract:
Method of showing a distorted stereoscopic digital image on a display of an immersive media device with a stereoscopic head-wearable unit, comprising the steps of: a. Receiving a focused stereoscopic digital image; b. Receiving information about a depth level of at least an object of interest within the image to identify a virtual depth of the object of interest; c. Processing the image with a deconvolution distortion algorithm for deblurring based on a point-spread function calculated via said depth and a focal plane placed on the display so as to obtain a distorted image. d. Showing the processed distorted image on the display of the immersive media device so that the user, via said deconvolution of the focused input stereoscopic image, perceives a focused area from the display, which is outside the focal plane on the display for mitigating the vergence accommodation conflict.

Inventors:
HUSSAIN RAZEEN (IT)
SOLARI FABIO (IT)
CHESSA MANUELA (IT)
Application Number:
PCT/IB2023/059732
Publication Date:
May 02, 2024
Filing Date:
September 29, 2023
Assignee:
UNIV DEGLI STUDI GENOVA (IT)
International Classes:
H04N13/128; G06T7/593
Foreign References:
US 9269012 B2 (2016-02-23)
Other References:
KOHEI OSHIMA ET AL: "SharpView: Improved clarity of defocused content on optical see-through head-mounted displays", 2016 IEEE SYMPOSIUM ON 3D USER INTERFACES (3DUI), IEEE, 19 March 2016 (2016-03-19), pages 173 - 181, XP032894874, DOI: 10.1109/3DUI.2016.7460049
Attorney, Agent or Firm:
KARAGHIOSOFF, Giorgio A. (IT)
Claims:
CLAIMS

1. Method of showing a distorted stereoscopic digital image on a display of an immersive media device with a stereoscopic head-wearable unit, comprising the steps of: a. Receiving a focused stereoscopic digital image; b. Receiving information about a depth level of at least an object of interest within the image to identify a virtual depth of the object of interest; c. Processing the image with a deconvolution distortion algorithm for deblurring based on a point-spread function calculated via said depth and a focal plane placed on the display so as to obtain a distorted image. d. Showing the processed distorted image on the display of the immersive media device so that the user, via said deconvolution of the focused input stereoscopic image, perceives a focused area from the display, which is outside the focal plane on the display for mitigating the vergence accommodation conflict.

2. Method of showing a distorted stereoscopic digital image according to claim 1, wherein the step of processing the image with a deconvolution deblurring distortion algorithm is also based on a signal-to-noise ratio function (SNR) accounting for the noise present in the system.

3. Method of showing a distorted stereoscopic digital image according to claim 2, wherein a Wiener filtering is applied in the deconvolution deblurring distortion algorithm, such Wiener filtering processing, for each stereoscopic digital image, both the calculated point-spread function and the signal-to-noise ratio parameter (SNR).

4. Method of showing a distorted stereoscopic digital image according to claim 3, wherein the calculated point spread function has a circular shape representing the blur provided to the user by the non-distorted image on the focal plane on the display, wherein the shape of such calculated point spread function is defined through the calculation of the radius R of the circle that a point out of the focal plane produces on the user's retina, the signal-to-noise ratio parameter (SNR) being associated to a given R value.

5. Method of showing a distorted stereoscopic digital image according to claims 1 to 4, comprising an eye tracking algorithm to select a virtual depth of a region of interest to calculate the point spread function.

6. Method of showing a distorted stereoscopic digital image according to claims 1 to 4, wherein the processing step of the image distortion based on an assumed user gaze further comprises: receiving filter kernels in the spatial domain based on the relative distance of the objects within the image with respect to the focal plane, so as to define image sections having one or more pixels to which a filter kernel applies; processing each pixel of the image so that each pixel is convolved with a different filter kernel based on the difference between the pixel depth and the focal plane to provide a distortion of the image.

7. Method of showing a distorted stereoscopic digital image according to claim 6, wherein filter kernels are pre-computed offline so as to provide a group of filter kernels, each one selected during the image processing step based on the relative distance of the objects within the image with respect to the focal plane.

8. Method of showing a distorted stereoscopic digital image according to any of the preceding claims, wherein the step of processing the image based on a deconvolution deblurring distortion algorithm adopts one of an inverse filtering, a Wiener filtering, an iterative Richardson-Lucy algorithm.

9. Immersive media system comprising an immersive media display configured to show a distorted stereoscopic digital image, a stereoscopic head-wearable unit to view such distorted digital image with a stereoscopic vision, a memory and a processor coupled in data exchange with the memory programmed to execute the method of claim 1.

10. Immersive media system according to claim 9, wherein the head-wearable unit comprises a display configured to show a distorted stereoscopic digital image.

Description:
UNIVERSITA’ DEGLI STUDI DI GENOVA

“METHOD BASED ON INVERSE BLURRING FOR IMMERSIVE MEDIA DEVICES TO MITIGATE VERGENCE ACCOMMODATION CONFLICT”

DESCRIPTION

TECHNICAL FIELD

The present invention refers to the field of immersive media technologies, more specifically in stereoscopic displays, non-optical-see-through near eye displays and virtual reality.

PRIOR ART

In recent years, there has been significant development within the field of immersive media technologies. However, a major stumbling block for the widespread usage of immersive technology is the fact that users tend to feel discomfort or eyestrain after prolonged usage. This problem exists in almost all immersive media devices, whether it be 3D cinema or virtual reality (VR). A widely accepted reason for this is the sensory conflict caused by vergence and accommodation. In particular, humans, when visually perceiving their environment, use two cues to estimate object distances: they converge their eyes inwards to the object(s) their gaze is addressed to, while the ciliary muscle deforms the lens so as to allow the eyes to see such object(s) in sharp focus based on the distance at which the eyes converge. The former is referred to as convergence or vergence, while the latter is referred to as accommodation. In almost all immersive media devices, a near-eye display is used. When humans view virtual content on such a display, convergence works in a similar way to real-world viewing, i.e. the eyes are converged inwards to the object(s) where the user addresses his/her gaze. However, accommodation does not. Since the virtual content is displayed on a display placed at a fixed distance from the user's eyes, accommodation does not change with the virtual content, causing users to experience mismatching cues, as they are requested to see the virtual content with a convergence that does not have the corresponding accommodation in the real world. As humans grow and learn how to perceive their environment, they instinctively trigger accommodation and convergence simultaneously, so asking users to separate the two leads to a wide range of issues, from fatigue to nausea and headaches that last for hours afterwards.

Since this problem has been known for many years, many attempts have been made to rectify it. The most prevalent approach is to use focus-tunable lenses, such as those used in multi-focal displays. However, a main obstacle to this technology being widely introduced into modern immersive media device systems is the fact that it is a hardware-intensive approach, and the current generation of focus-tunable lenses is not compact enough to be introduced into consumer headsets. Other attempts have been made at altering the projected image to cater for different focus distances, simulating this focus blur inside headsets. However, studies have shown that while this depth-of-field effect can contribute to a lower level of induced sickness, the improvement in depth perception has been limited.

There is therefore a felt need to provide a solution to these mismatching cues, not only to minimize the onset of visual discomfort but also to obtain better depth perception.

SCOPE AND SUMMARY OF THE INVENTION

The scope of the present invention is to satisfy, at least partially, the needs described above. Such scope is achieved through a method of showing a distorted stereoscopic digital image on a display of an immersive media device, which is perceived by the eyes as focused, the distortion accounting for the out-of-focus caused by the mismatch between the perceived depth of a virtual object of interest within the stereoscopic image and the position of the display, i.e. the focal plane, with respect to the eyes of the user. In particular, the method of showing a distorted stereoscopic digital image on a display of an immersive media device with a stereoscopic head-wearable unit comprises the steps of:

Receiving a focused stereoscopic digital image, which may be either shown on a single display and provide stereoscopic effect for the right and left eye when seen via filters such as chromatic filters e.g. red/blue or polarized filters or shown on right and left displays of a headset;

Receiving information about a depth level of at least an object of interest within the image to identify a virtual depth of the object of interest; indeed, most frameworks provide this information with the stereoscopic image through a depth texture, which essentially consists of a matrix containing high-precision pixel-wise depth values ranging between 0 and 1 with a non-linear distribution. In case this information is not available, it can be computed using the disparity present in the stereo image pair.

Processing the image with a deconvolution distortion algorithm for deblurring based on a point-spread function calculated via said depth and a focal plane on the display, so as to obtain a distorted image;

Showing the processed distorted image on the display of the immersive media device so that the user, via said deconvolution of the focused input stereoscopic image, perceives a focused area from the display, which is outside the focal plane on the display for mitigating the vergence accommodation conflict.

In particular, the method described in the present disclosure provides a computer-implemented solution to the problem of vergence-accommodation conflict by applying a point-spread function that is related to accommodation to a stereoscopic image, which provides for the perception of depth.

According to the invention, such a method makes it possible to show a distorted stereoscopic digital image on a display of an immersive media device, e.g. 3D displays with stereoscopic glasses or headsets having left and right stereoscopic displays, so as to provide improved depth perception to the user and to avoid the known discomfort problems, e.g. nausea, deriving from the use of such devices for prolonged periods.

In particular, the deconvolution operation is applied to a sharp stereoscopic digital image projected on an immersive media device display so as to provide the user's brain with a distorted stereoscopic digital image. Since accommodation is constant in such immersive media devices, the stereoscopic digital image being projected on a display placed at a fixed distance from the user's eyes, a distortion of the stereoscopic image allows the user to view in sharp focus the object(s) his/her gaze is addressed to, while the remaining objects are perceived as blurred. Such distortion not only provides improved depth perception among the objects displayed on the immersive media device, but it also synchronizes the user's accommodation with the vergence, mitigating the above-mentioned discomfort issues.

According to a preferred embodiment of the present invention, the step of processing the image with a deconvolution distortion algorithm for deblurring is also based on a signal-to-noise function accounting for the noise present in the system.

According to a preferred embodiment of the present invention, Wiener filtering is applied in the deconvolution distortion algorithm, since it reduces the computational effort that immersive media devices have to perform compared to other deconvolution algorithms, especially in real-time applications. Moreover, such Wiener filtering processes each stereoscopic digital image taking into account both the calculated point-spread function and the signal-to-noise ratio, so as to obtain an optimized distorted stereoscopic digital image that further enhances the depth perception and accommodation of the objects within the image.

BRIEF DESCRIPTION OF THE DRAWINGS

The content of each step of the method and related deblurring algorithm adopted can be better understood from the detailed description below, wherein reference is made to the attached figures which represent preferred and non-limiting embodiments, wherein:

• Fig. 1 shows a flowchart of the method of showing a distorted stereoscopic digital image in immersive media devices without an eye tracking system;

• Fig. 2 shows the processing steps of the deconvolution distortion algorithm for immersive media devices without an eye tracking system;

• Fig. 3 shows a flowchart of the method of showing a distorted stereoscopic digital image in immersive media devices with an eye tracking system;

• Fig. 4 shows the processing steps of the deconvolution distortion algorithm for immersive media devices with an eye tracking system;

• Fig. 5 shows a plot presenting optimal SNR tuning for different R by using the PSNR quality image metric;

• Fig. 6 shows a plot presenting optimal SNR tuning for different R by using the mean-SSIM quality image metric;

• Fig. 7 shows a schematic layout of the experiment for testing the method of showing a distorted stereoscopic digital image in immersive media devices with an eye tracking system, specifically used in example 1 of the experimental validation section;

• Fig. 8 shows a schematic layout of the experiment for testing the method of showing a distorted stereoscopic digital image in immersive media devices without an eye tracking system, specifically used in example 3 of the experimental validation section;

• Fig. 9 shows a plot presenting the discrimination sensitivity obtained under two experimental conditions shown in example 2 of experimental validation section;

• Fig. 10 shows a plot presenting the discrimination sensitivity obtained under two experimental conditions shown in example 3 of experimental validation section.

DETAILED DESCRIPTION OF THE INVENTION

It is known that, when light rays enter the eyes through the cornea of the human eye, the eyes diffract the light rays to form a focused image on the retina, and the diffraction pattern can be modelled as a Point Spread Function (PSF). If this PSF is known, it is possible to identify the optical requirement of an optical device, e.g. corrective lenses, that is necessary to adjust the light rays entering the eyes.

In the image processing domain, instead, this operation involving the PSF can be analogously expressed by a deconvolution operator. Deconvolution is the inverse of the convolution process; its application in the image processing domain, e.g. in image deblurring, is described below: an input digital image that would be perceived as blurred when not distorted is appropriately deconvoluted, i.e. distorted, so as to be perceived as deblurred. In particular, deconvolution is a computationally intensive process that can be used to recover the blurring in a perceived image; such a process can also be referred to as inverse blurring or deblurring.

Generally, given an image i, the convolution operation with a blurring filter f can be defined as:

b = f * i + n

where * is the convolution operator, n is the noise in the system (which in some simpler cases can be neglected) and b is the resulting blurred image. In the Fourier or frequency domain, the above equation can be written as:

B = F · I + N

where B, F, I and N are the Fourier transforms of b, f, i and n respectively. Typically, this blurred image can be corrected through the inverse procedure (neglecting the noise term):

I' = B / F

Such deconvolution can be mathematically executed via inverse filtering, Wiener filtering, Richardson-Lucy algorithm or the like.
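By way of illustration only (not part of the claimed method), a minimal numpy sketch of this blur model and of naive inverse filtering in the frequency domain could look as follows; the image, the PSF and the regularisation constant eps are placeholder inputs:

```python
import numpy as np

def blur_frequency_domain(image, psf, noise_sigma=0.0):
    """Apply the blur model B = F*I + N in the frequency domain (circular convolution)."""
    F = np.fft.fft2(psf, s=image.shape)                 # transfer function of the blur
    I = np.fft.fft2(image)
    noise = noise_sigma * np.random.randn(*image.shape)
    return np.real(np.fft.ifft2(F * I)) + noise

def inverse_filter(blurred, psf, eps=1e-3):
    """Naive inverse procedure I' = B / F; eps avoids division by near-zero frequencies."""
    F = np.fft.fft2(psf, s=blurred.shape)
    B = np.fft.fft2(blurred)
    return np.real(np.fft.ifft2(B / (F + eps)))
```

In practice the division amplifies noise at frequencies where F is small, which is why the Wiener filtering described below is preferred.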

According to the invention applied to an immersive media device, a focused stereoscopic input image is deconvoluted taking a given PSF into account, in order to generate a distorted image that would otherwise be perceived as blurred because, via the stereoscopic image, a region of interest is located at a perceived virtual distance (depth) while the display (focal plane) emitting light is physically located at a real distance different from the virtual distance. By applying such deconvolution to the stereoscopic input image, the inventors found a mitigation of the vergence accommodation conflict (VAC). A stereoscopic input image may be an image displayed on a single screen viewed by the user with a stereoscopic device, such as left-right polarized filter glasses, left-right chromatic filter glasses or the like; such a screen may be far from the user (see e.g. the embodiments discussed below), the light filter glasses being the headset worn by the user, or the single display may be carried by the headset in an appropriate head-mounted support device. Alternatively, the stereoscopic device has left-right head-mounted screens, where each screen shows a frame for the corresponding left/right eye in order to generate a stereoscopic image. The PSF accounts for the virtual depth of the virtual object the user is looking at, i.e. where the gaze of the user is pointing within the stereoscopic image. It is worth noting that most frameworks provide depth information with the stereoscopic image through a depth texture; in case this information is not available, it can be computed using the disparity present in the stereo image pair.

When deconvolution includes noise, there is a risk of amplifying such noise; according to a preferred embodiment of the present invention, a Wiener filter is adopted for image distortion, as it is insensitive to small variations in the signal power spectrum. In particular, the Wiener filter assumes that the image is modelled as a random process whose 2nd-order statistics, along with the noise variances, are known. The image restoration model can be written as:

I' = H_w · B

where I' is the estimate of the original image and H_w is the Wiener filter. Assuming the PSF is real and symmetric and the power spectra of the original image and of the noise are unknown, the Wiener filter can be defined as:

H_w = H / (H^2 + 1/SNR)

where SNR is the Signal-to-Noise Ratio and H is the estimate of the PSF of the blur perceived by the user from the non-distorted image because, as previously mentioned, the virtual depth of the object of interest is different from the distance of the display from the eyes. Furthermore, for out-of-focal-plane distortions, such as those naturally present in the human visual system (e.g. blurring of the background when the eyes focus on a single close finger), a circular PSF is considered a good approximation, even if the exact shape varies depending on the performance of the user's eyes (e.g. myopic, hyperopic, etc.). Such a PSF can be defined by only one parameter R, which is the radius of the circle that a point outside the focal plane produces on the user's retina. Therefore, in the embodiment where noise is not to be amplified, after the determination of the parameter R, it is possible to define the parameter SNR to implement the Wiener deconvolution.
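For illustration, a minimal numpy sketch of Wiener deconvolution with a circular PSF of radius R is given below; it is a simplified single-channel version under the stated assumptions, not the Unity/shader implementations referred to in the examples, and the kernel size and SNR value are placeholders:

```python
import numpy as np

def disk_psf(radius_px, size=31):
    """Circular (pillbox) PSF of radius R in pixels, normalised to sum to 1."""
    y, x = np.mgrid[-(size // 2):size - size // 2, -(size // 2):size - size // 2]
    psf = (x**2 + y**2 <= radius_px**2).astype(float)
    return psf / psf.sum()

def wiener_deconvolve(image, psf, snr):
    """Wiener deconvolution H_w = conj(H) / (|H|^2 + 1/SNR); for a real, symmetric PSF
    this reduces to H / (H^2 + 1/SNR) as in the text."""
    padded = np.zeros(image.shape[:2])
    h, w = psf.shape
    padded[:h, :w] = psf
    padded = np.roll(padded, (-(h // 2), -(w // 2)), axis=(0, 1))   # centre kernel at (0, 0)
    H = np.fft.fft2(padded)
    H_w = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * H_w))
```

As stated below, in the actual method this operation is applied independently to each RGB channel.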

According to an example, the described method implements the Wiener filter using the OpenCV for Unity package by Enox Software. In a second example, the approximation of the Wiener filter in the spatial domain is precomputed using MATLAB and the subsequent deconvolution operation is done using a shader implementation. Alternatively, the Wiener filter and the transformations to the frequency domain can be done using a custom shader implementation.

According to one embodiment, related to the latter example, the gaze of the user is assumed to be at the centre of the stereoscopic image and therefore the depth is that of the object shown in the centre of the stereoscopic image. As an alternative, the gaze is assumed to be at the focus distance of the head-mounted display. Such distance may not be on the display but rather a distance away from the display through the use of optics. As explained below, depth and R are related. A flowchart of this embodiment is shown in figures 1 and 2 and, in view of the pre-defined gaze of the user, some steps can be pre-computed before the streaming of images.

The former example is particularly effective in a further embodiment where an eye tracking algorithm is present in the system. In the known art, eye tracking technology comprises techniques such as the scleral search coil technique, infrared oculography (IOG), electro-oculography (EOG) and video oculography (VOG), which rely on several eye tracking algorithms, e.g. shape based, feature based or a combination of them; such an embodiment is shown in figures 3 and 4. This embodiment is more computationally intensive during the streaming of images because, for each image, the gaze of the user is detected in real time and then R is determined and the deconvolution, possibly including the SNR, is applied.

However, both embodiments are equivalent in terms of performance and effectiveness in reducing the vergence-accommodation conflict. In other examples, such a method can be implemented using a C# or shader implementation of the convolution and Discrete Fourier Transform (DFT) functions.

According to the invention, it is possible to show a distorted stereoscopic digital image on a display of an immersive media device, e.g. a device having near-eye stereoscopic displays, one for each eye of the user, so that in use the user can have better depth perception and does not feel the above-described discomfort, e.g. nausea, after prolonged use of the device. In particular, such a method for the preferred embodiment with eye tracker comprises the steps of:

1) Image resizing and transformation to the frequency domain. Each virtual scene frame is processed as an independent image, wherein each image needs to be transformed to the frequency domain, preferably using a Discrete Fourier Transform (DFT). Furthermore, since the DFT is often applied to images with n×n resolution while some immersive media devices that can perform such a method, e.g. XR devices, operate on images with m×n resolution, the described method performs a pre-processing step in which an image resizing operation is performed, e.g. in case an image with m×n resolution has to be processed. In this way, the image resizing process transforms the image into an n×n image and down-samples it to half the original resolution. The resulting image is then transformed into the frequency domain.

This approach has the advantage that it can be incorporated into the down-sampling process that is already present in the system to lower computation costs, so that, by processing at half resolution, the frame processing times can be sped up. Also, less information is lost, as the restored image can be resized back to the original size, e.g. m×n, during the up-scaling post-processing described below.
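As a rough sketch of this pre-processing step (assuming OpenCV is available for the resizing; the choice of the square size is illustrative and not prescribed by this disclosure):

```python
import numpy as np
import cv2

def resize_and_transform(frame):
    """Resize an m x n frame to a square image at half the original resolution,
    then transform it to the frequency domain with a DFT."""
    m, n = frame.shape[:2]
    side = min(m, n) // 2                         # square side, half the original resolution
    resized = cv2.resize(frame, (side, side), interpolation=cv2.INTER_AREA)
    spectrum = np.fft.fft2(resized, axes=(0, 1))  # per-channel DFT if the frame is RGB
    return spectrum, (n, m)                       # keep the original size for later up-scaling
```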

2) Computing Wiener filter(s) based on user gaze towards the image wherein an object of interest is identified.

In order to compute the Wiener filter, it is necessary, according to a preferred embodiment of the invention, to determine the optimal values of parameters R and SNR by performing a tuning process. Since the R parameter contributes more towards the accommodation process of the image, it should be computed first, and then SNR can be tuned to obtain the optimal value.

2a) Detecting user gaze to calculate R via object virtual depth

According to an embodiment, in order to select the optimal value of parameter R, it is fundamental to determine where the user is looking. For instance, head-mounted VR and non-see-through or video AR devices, i.e. head-mounted devices configured to show recorded or synthetic video feeds or real-time video feeds from on-board cameras respectively, already come with an in-built tracking system and can reliably track the user's eye position in the virtual world. However, in case eye tracking is not available in the immersive media device, other devices may be used, e.g. a Microsoft Kinect, that perform real-time approximations of where the user is looking by detecting other parameters, e.g. where the user's head is oriented. When the detection or estimation of the gaze is performed during streaming on the display, the method of the invention is more computationally intensive during such streaming.

A less computationally intensive embodiment pre-defines where the user is looking, either statically, e.g. based on the assumption that a user always looks at the centre of the screen, or dynamically, e.g. because the depth of the objects within the image is given or is estimated by algorithms analysing the content of the virtual scene or through head movements. Based on such a pre-definition, part of the image processing can be performed before the streaming of images, as shown in figures 1 and 2.

2b) Receiving the distance of the user focus plane.

The user focal plane is where the user's eyes are focused, e.g. the display of the headset when the user wears such displays. Its location is dependent on the type of device. Furthermore, the device is designed in such a way that the focus distance is maintained at a fixed distance and can be extracted from the technical data of the device. For example, the focus distances of most VR headsets currently available on the market range between 1 and 3 m. It should be noted that, in the case of head-mounted displays, this focus distance is not on the VR display but rather a distance away from the screen through the use of optics. However, in the case of a 3D screen, this focus distance is on the screen itself and can be calculated as the distance between the user and the screen.

2c) Determining parameter “R” of the point spread function

The parameter R is dependent on the distance between the user focus plane and the virtual depth of the object under viewing. To select the optimal value of parameter R, the method exploits on one side the depth-of-field (DoF) concept. In particular, DoF is the distance between the nearest and farthest objects that are in acceptably sharp focus in an image. To estimate, on the other side, how much blur is naturally present in the system, the method uses the Circle of Confusion (CoC) concept from the field of optics. That is, when the eye is focused at an object at a distance df, a circle with diameter C is imaged on the retina by an object placed at distance dp. This diameter can be calculated using:

C = A · s · (1/df − 1/dp)

where A is the aperture of the eye and s is the posterior nodal distance.

Therefore, R can be calculated as follows taking the depth map and the focal plane into consideration:

2R = k · A · s · (1/focus distance − 1/virtual depth)

So, since in stereoscopic digital images the relative depths among the objects are known a priori, i.e. defined during stereoscopic digital image creation, the proposed method is able to compute an estimate of parameter R based on where the user is looking or on the pre-defined gaze. It follows that, once the gaze position of the user is detected or predefined, the relative depths of the object(s) are processed so as to provide an estimate of parameter R.
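A minimal sketch of this computation is shown below; the aperture (pupil diameter) of 4 mm, the posterior nodal distance of 17 mm and the scaling factor k are illustrative assumptions, not values given in this disclosure:

```python
def psf_radius(focus_distance_m, virtual_depth_m,
               aperture_m=4e-3, nodal_distance_m=17e-3, k=1.0):
    """Estimate the PSF radius R from 2R = k*A*s*(1/focus_distance - 1/virtual_depth);
    the magnitude is taken so that objects in front of or behind the focal plane are treated alike."""
    diameter = k * aperture_m * nodal_distance_m * (1.0 / focus_distance_m - 1.0 / virtual_depth_m)
    return abs(diameter) / 2.0

# e.g. focal plane at 2 m, virtual object perceived at 0.5 m
R = psf_radius(2.0, 0.5)
```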

2d) Tuning parameter “SNR”

Once parameter R has been identified, the corresponding value of parameter SNR should be used as determined by a tuning process.

To tune parameter SNR, several images containing virtual objects were created. They were custom blurred using a smoothing filter. This ensures that parameter R is known for the blurred image, and then parameter SNR can be tuned accordingly. To select the optimal values of parameter SNR, image quality metrics such as PSNR, mean-SSIM and FovVideoVDP are used. In particular, PSNR refers to the Peak Signal-to-Noise Ratio and is the ratio between the power of a signal and the noise corrupting the signal, while SSIM is the Structural Similarity Index Measure and is a perceptual model used to measure the similarity between two images. Instead of computing SSIM globally on an image, a more effective approach is to apply it locally to regions of the image and then take the mean. FovVideoVDP is a visual difference metric that models the temporal aspects of vision and accounts for foveated viewing. Other image quality metrics such as BRISQUE and Visual Information Fidelity (VIF) can also be used; however, they are based on natural scene statistics, so they don't always work well with the synthetic/artificial scenes used by some immersive media devices such as VR/AR devices.

During this tuning process, the original unblurred image is already known. The blurring parameters are also known, so the only unknown is the noise, which is tuned through the parameter SNR. In the tuning process, parameter SNR values can be obtained for each parameter R value using the image quality metrics (PSNR and mean-SSIM). Using these, a range of parameter SNR values can be identified.

For different values of parameter R (in centimetres), a range of parameter SNR values is considered and the inverse blurring can be applied to the blurred images. PSNR and mean-SSIM are calculated for the resulting images. The obtained values can be used to plot a best-fit curve as shown in Fig. 5 and Fig. 6. In Fig. 5 and 6, for example, cut-off values of 25 and 0.8 were used for PSNR and mean-SSIM respectively to determine a finer range of SNR. Step 2d is an optimization of the mathematical model of the invention and is performed before broadcasting or streaming of content on the display, e.g. a video feed for the amusement of the user.
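A possible sketch of such a tuning loop is shown below, reusing the wiener_deconvolve sketch given earlier and scikit-image quality metrics; the candidate list and the cut-off values mirror those mentioned above, and the images are assumed to be single-channel arrays in [0, 1]:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def tune_snr(reference, blurred, psf, snr_candidates, psnr_cutoff=25.0, ssim_cutoff=0.8):
    """Sweep candidate SNR values for a known blur (known R / PSF) and keep those whose
    restored image passes both quality cut-offs."""
    accepted = []
    for snr in snr_candidates:
        # wiener_deconvolve is the illustrative function from the earlier sketch
        restored = np.clip(wiener_deconvolve(blurred, psf, snr), 0.0, 1.0)
        psnr = peak_signal_noise_ratio(reference, restored, data_range=1.0)
        ssim = structural_similarity(reference, restored, data_range=1.0)
        if psnr >= psnr_cutoff and ssim >= ssim_cutoff:
            accepted.append((snr, psnr, ssim))
    return accepted
```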

3) Calculating the Wiener filter based on the detected or predefined R and, possibly, SNR, and applying the Wiener deconvolution to the image so as to obtain a perceived focused image outside the user focal plane, i.e. the display. In particular, Wiener filters are computed in real time for each frame where the eye tracker or gaze estimator is present, on the basis of the R obtained from the eye tracker or gaze estimator.

4) Transforming the image back to the spatial domain, preferably using an Inverse Discrete Fourier Transform (IDFT), wherein images resized to n×n resolution are resized back to the original resolution, e.g. m×n.

5) Showing the processed distorted image on the immersive media device display so that an induced deblurred image is perceived by the user.

The overall process is illustrated in Fig. 3 and in Fig. 4 where an eye tracker or real time eye track estimator is present. Furthermore, the process is performed individually on each RGB channel. The proposed method does not affect the standard frame rate required for XR applications.

It should be noted that the optimal values of parameters R and SNR depend on the type of application, and it is recommended to fine-tune such values for each application using the method explained above. Although the method allows each RGB channel to have different parameters, the same values are used for each channel.

Another preferred embodiment of the method (i.e. for media devices without eye tracker) comprises the steps of:

1) Computing Wiener filters based on pre-defined user focal points in the frequency domain.

In most immersive devices, such as VR/AR devices, the virtual content is shown at the display screen. Through optics, the focal distance is located at a fixed distance from the user. Instead of adjusting the PSF based on where the user is looking, the filter is applied based on object distances to the focal plane, i.e. parts of the image are convolved with different estimations of the PSF. In this way a space-variant implementation is introduced where, in a single image or frame, if there is a single object on a neutral background, e.g. a single balloon floating in a clear sky, the Wiener filter is calculated using the R referring to that object. Where at least two objects are present in the frame or image, a first and a second Wiener filter are calculated with the corresponding first and second R referring to each object, such filters being applied to the corresponding portions of the image where the pixels of the objects are. The same procedure can be iteratively applied for several objects.

For optimal performance, immersive media devices require high processing power and resources. Computing the DFT and IDFT of a large image each frame can also be quite expensive. To overcome this, many of the processing steps can be done offline. For this purpose, the filter kernels are computed offline and only the deconvolution takes place in real time. To compute the filter kernels in the frequency domain, the steps explained in the previous method are to be followed.

2) Transforming the Wiener filter into the spatial domain, preferably using an Inverse Discrete Fourier Transform (IDFT). Alternatively, the Wiener filter can be computed directly in the spatial domain.

3) Extracting the pre-defined user focal plane and the depth map of the virtual scene.

Using the object distances (depths) in the virtual space (i.e. in fig. 8), the depth map of the virtual scene can be computed. The difference of each object distance to the focal plane defined by the display can be used for calculating the parameter R and, possibly, SNR. The depth map of the scene signifies the distances of the objects shown in the image. This information is used to define sections in the image, based on object distances to the pre-defined user focal plane, where each section of the image can be convolved with a different Wiener filter kernel.

4) Applying the space-variant Wiener filter to the image using a shader implementation of the convolution process.

Using a shader in the linear colour space, each pixel of the image is convolved with a different filter kernel. The choice of the filter kernel is dependent on the difference between the pixel depth and the focal plane distance. To this aim, filter kernels are computed at different depth levels in the Fourier domain. Their equivalent filter kernels in the spatial domain are approximated.

The filter kernel representing the depth level of each pixel is extracted from the precomputed data. Each pixel is convolved with the respective filter kernel determined by the depth level of the object the pixel represents. In this way a space-variant implementation is achieved which no longer requires an eye tracker and can offer faster processing times, since the computationally expensive elements of the technique are no longer performed in real time.
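An illustrative CPU-side sketch of this per-pixel, space-variant convolution is given below; in the embodiment described above this is done in a shader, and the names kernel_bank and depth_levels are hypothetical placeholders for the precomputed spatial-domain kernels and the depth levels at which they were computed:

```python
import numpy as np

def space_variant_deblur(image, depth_map, focal_distance, kernel_bank, depth_levels):
    """Convolve each pixel with the precomputed kernel whose depth level is closest to the
    difference between that pixel's depth and the focal plane (single-channel version)."""
    half = kernel_bank[0].shape[0] // 2
    padded = np.pad(image, half, mode='edge')
    out = np.empty_like(image, dtype=float)
    levels = np.asarray(depth_levels)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            offset = abs(depth_map[y, x] - focal_distance)
            kernel = kernel_bank[int(np.argmin(np.abs(levels - offset)))]  # nearest depth level
            patch = padded[y:y + 2 * half + 1, x:x + 2 * half + 1]
            out[y, x] = np.sum(patch * kernel)
    return out
```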

5) Showing the processed distorted image on the immersive media device display so that an induced deblurred image is perceived by the user outside the focal plane thereof.

The overall process is illustrated in Fig. 1 and Fig. 2. Similar to the previous preferred embodiment, the process is performed individually on each RGB channel, and the proposed method does not affect the standard frame rate required for XR applications.

Eye trackers can help understand where the user is looking, which can ultimately help estimate the PSF of the retinal blurring. However, most currently available commercial VR/AR devices come without an integrated eye tracker. In case an eye tracker is available, the corresponding preferred embodiment presented above is more suitable, while in case an eye tracker is unavailable to predict the value of parameter R, the other preferred embodiment is more suitable. It is pertinent to mention that both methods are equivalent and offer the same performance in reducing the conflict between convergence and accommodation.

EXPERIMENTAL VALIDATION

In order to evaluate whether the described method can help reduce the effects of the vergence accommodation mismatch, two validation tasks on depth perception are presented. The first one focuses on a reaching task, while the other one is related to a spatial awareness task. A 3D display is adopted as the media device for examples 1 and 2, and a head-mounted display as the media device for example 3. It should be noted that the described method can also be performed with other immersive media devices.

EXAMPLE 1 — reaching task with a 3D screen

The described method was implemented using Unity, operating on an Intel Core i7-9700K processor equipped with an NVIDIA GeForce 1080 graphics card. A 47-inch LG 3D TV which supports 1080p resolution at 60 Hz was used for interacting with the user. Users wore 3D polarized glasses to view the virtual objects in 3D. A Microsoft Kinect Sensor v2 was used to measure the reaching positions. The 3D display was placed on an office desk and the user was seated 125 cm away from the display. The user viewpoint was set at the middle of the 3D display. The Kinect was fixed below the display. An illustration of the experimental setup is shown in Fig. 7.

A simple virtual 3D environment was created, containing a spherical object of radius 1 cm that spawned at different locations. The spherical object, or ball, acts as the target position that the user has to try to reach. From the literature, it is known that the average human adult's IPD is 63 mm. However, it is also known that females have a lower mean IPD compared to males. For this purpose, two settings for the IPD were chosen (63 mm and 58 mm). Furthermore, the Kinect v2 SDK for Unity was used for tracking the user's finger. The Kinect SDK offers tracking of 25 body joints covering the whole body. For the purpose of the experiment, only two joints were tracked, namely the Head and the HandTipRight joints. The Head joint acted as a reference to ensure the user stayed at the defined viewing position. The HandTipRight joint represented the position of the right-hand index finger. Using a homogeneous transformation, it is possible to convert object positions from the Kinect reference frame to the Unity reference frame and vice versa. This transformation can be achieved through the homogeneous transformation matrix:

u_kT = [ R  t ; 0  1 ]

where k and u represent the Kinect and Unity reference frames, respectively, and R and t are the rotation matrix and the translation vector, respectively.

To perform the transformation, a simple matrix multiplication can be used to map a point k_p expressed in the Kinect reference frame to the same point u_p expressed in the Unity reference frame:

u_p = u_kT · k_p
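As a simple illustration of this change of reference frame (a sketch only; the rotation matrix R and translation vector t would come from the calibration of the actual setup):

```python
import numpy as np

def kinect_to_unity(point_kinect, R, t):
    """Map a 3-D point from the Kinect frame to the Unity frame via T = [R t; 0 1]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    p = np.append(np.asarray(point_kinect, dtype=float), 1.0)   # homogeneous coordinates
    return (T @ p)[:3]
```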

To ensure that the perceived distances corresponded to the actual distances, a pre-experiment session was conducted in which three people took part. A cube of size 5x5x5 cm was created in the virtual scene and a similar one was created in the real world. The virtual cube was shown to the users at different depths and the users were asked to place the real cube where they saw the virtual cube. Using a tape measure, the position of the real cube was measured and compared with the computed cube position in the virtual scene.

Data was collected from 23 users (13 males and 10 females) aged from 23 to 54 years (mean 30.65 ± 7.15). All participants were volunteers and received no reward. All users had normal or corrected-to-normal acuity. Users who normally wore corrective glasses or lenses wore them underneath the polarized glasses.

The target positions were vertices of a 20x20x20 cm cube, located 5 cm apart in depth. Therefore, the total number of possible positions was 125. In each session, there were 50 trials. The sequence of the target positions was randomly generated without repetition. The user was asked to reach the position of the ball with their right-hand index finger. Once they felt that they had reached the target position, they were asked to hold their finger steady and press a button on the keyboard with their left hand to register the position. The users were asked to undergo the sessions using the normal IPD setting. If the user reported issues fusing the stimuli, the session was stopped and the setting was changed to the lower IPD setting.

Two conditions were considered: normal viewing and inverse blurring viewing. In the normal viewing, the input stimuli were presented in full fidelity, i.e. focused. This session acted as the control group, providing a reference performance. The stimuli in the other session were presented using the developed inverse blurring method. All users underwent the experimental conditions in random order, i.e. half performed the normal session first and half performed the inverse blurring session first. This was done to ensure no bias was present in the system. For quantitative analysis, the finger positions were used.

Data from 5 users was discarded. Three of these had a very high mean error (>25cm) and two had all their reaching positions in the same depth plane. This indicated that these users were not able to fuse the stimuli properly.

The error between the expected finger position and the perceived finger position was calculated. The mean errors along with their standard deviations are reported below. It can be seen that there is a small difference between the performance in the X (horizontal) and Y (vertical) planes. However, there is an improvement of around 2.23 cm in the Z (depth) plane. The error in Euclidean space was also calculated and a decrease in the error can be noticed. To understand whether the experimental results have statistical significance, a t-test was performed. In the depth plane, the error reduction obtained by the proposed method was statistically significant. The p-values from the statistical analysis are also reported in the table below.
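As an illustration of this comparison, a minimal sketch is shown below; it assumes the per-user mean depth errors for the two conditions are available as paired numpy arrays (the disclosure does not specify whether a paired or independent t-test was used; a paired test per user is assumed here):

```python
import numpy as np
from scipy.stats import ttest_rel

def depth_error_significance(errors_normal, errors_inverse_blur):
    """Paired t-test on per-user mean depth (Z) errors under the two viewing conditions."""
    t_stat, p_value = ttest_rel(np.asarray(errors_normal), np.asarray(errors_inverse_blur))
    return t_stat, p_value
```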

EXAMPLE 2 — spatial awareness task with a 3D screen

Since some of the people reported issues with fusing the stimuli in the reaching experiment, an alternative experiment based on spatial awareness was performed to verify the improvement in the performance with the inverse blurring condition.

In particular, the described method was implemented using the same setup as the reaching experiment. Instead of the Kinect, an on-screen Eye Tribe tracker was used to track the user's eye movements. A virtual environment similar to that of the reaching experiment was created. Two virtual textured cubes of size 10x10x10 cm were placed equally distant from the centre of the display (one towards the left and the other towards the right). The distance between the cubes was 60 cm in the horizontal plane and 0 cm in the vertical plane. Ten depth levels were created with 5 cm intervals. A plus sign was placed at the centre of the screen.

Data was collected from 24 users (13 males and 10 females) aged from 23 to 54 years (mean 30.65 ± 7.00). All participants were volunteers and received no reward. All users had normal or corrected-to-normal acuity. Users who normally wore corrective glasses or lenses wore them underneath the polarized glasses.

The user was seated 80 cm from the 3D display. The user was told to fixate on the cross. A stimulus containing the two cubes was shown. The depth level of each cube was randomly selected. Each user session lasted for 50 trials. The stimuli were shown for 800 ms. This time was chosen based on studies found in the literature, which suggest that humans take around 500-800 ms to respond and fuse the stimuli depending on the distance. When the stimuli disappeared, the users were asked to select which of the two cubes was closer to them. The users made the selection by pressing the arrow keys on a keyboard, i.e. the left arrow key if they perceived the left cube as closer to them or the right arrow key if they perceived the right cube as closer. The choice was forced: even if they perceived the two cubes at the same depth, they had to make a selection. This approach was based on the Two-Alternative Forced Choice (2AFC) paradigm.

In each trial, before showing the stimuli, the users were asked to fixate on the plus sign. They were given 500 ms to do this. This was done to ensure that the starting gaze condition was similar for all trials and also to give the users some time to focus back on the screen in case they had looked at the keyboard to make the selection. Similar to the previous experiment, two conditions were considered during the experiment: normal viewing and inverse blurring viewing.

The probability of getting the correct answer when choosing randomly is 50%. For this reason, a threshold was set. Four users had more than a 50% error rate for both conditions, indicating either that they did not fully understand the task or that they were guessing randomly. Their data was therefore discarded from the analysis.

The numbers of correct and incorrect answers for all users were computed. The means of the group are summarized in the table below. It can be observed that the error is much lower in the inverse blurring condition, indicating that the described method lowers the conflict caused by accommodation and convergence in stereo displays. It should be noted that in some trials the two cubes appeared at the same depth. Those trials were considered neither correct nor incorrect.

To understand whether the results have statistical significance, the discrimination sensitivity can be computed for the 2AFC task. The data for each user and condition was converted into a discrimination sensitivity d'. A bootstrapping procedure was used to compute the group confidence levels on the d' measurements. Mean d' values were computed for each user and condition from the original data re-sampled with replacement 5000 times. These bootstrapped distributions were then collapsed across observers to obtain group distributions for each condition. The group distributions were fitted with a Gaussian distribution, from which the 2.5th and 97.5th quantiles were taken as the 95% confidence interval. In particular, Fig. 9 shows the discrimination for the two experimental conditions. A mean discrimination of 1.46 was observed for the normal viewing session, whereas the discrimination significantly increased to 2.02 with the inverse blurring method (fig. 9).
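A possible sketch of this analysis is shown below; it converts 2AFC proportion correct into d' via the standard relation d' = sqrt(2)·z(p) and pools the bootstrapped per-user means, taking percentiles of the pooled distribution directly rather than fitting a Gaussian as described above:

```python
import numpy as np
from scipy.stats import norm

def dprime_2afc(p_correct):
    """Discrimination sensitivity for a 2AFC task: d' = sqrt(2) * z(p_correct)."""
    p = np.clip(p_correct, 1e-3, 1 - 1e-3)           # avoid infinite z-scores
    return np.sqrt(2.0) * norm.ppf(p)

def bootstrap_group_dprime(correct_per_user, n_resamples=5000, seed=0):
    """correct_per_user: list of 0/1 arrays (one per user). Resample each user's trials
    with replacement, compute a mean d' per resample, pool across users and return the
    group mean with a 95% confidence interval."""
    rng = np.random.default_rng(seed)
    pooled = []
    for trials in correct_per_user:
        trials = np.asarray(trials)
        idx = rng.integers(0, len(trials), size=(n_resamples, len(trials)))
        pooled.append(dprime_2afc(trials[idx].mean(axis=1)))
    pooled = np.concatenate(pooled)
    return pooled.mean(), np.percentile(pooled, [2.5, 97.5])
```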

EXAMPLE 3 — spatial awareness task with a VR device

The described second preferred embodiment was also tested using the spatial awareness task. The experimental setup and procedure were identical to the previous experiment. The only differences were the absence of an eye tracking device and the use of an HTC Vive Pro device instead of the 3D screen (see Fig. 8).

Data was collected from 18 users (12 males and 6 females) aged from 20 to 54 years (mean 25.89 ± 7.52). All participants were volunteers and received no reward. All users had normal or corrected-to-normal acuity. Users who normally wore corrective glasses or lenses wore them underneath the VR headset. The users were asked to wear the VR device whilst being seated on a chair. The users held an HTC Vive Pro controller in each hand. The controllers acted as the input source. The user pressed the trigger on one of the controllers to make the selection, i.e. if the user judged that the cube on the left was closer, they pressed the trigger held in the left hand, and vice versa.

The stimulus was shown to the user on the VR device for a short time. The duration of the stimulus was identical to the experiment on the 3D screen. The depth levels were randomly selected. Each user session lasted for 60 trials. Once the stimulus disappeared, the users were asked to make the selection regarding the cube which was perceived to be closer to the user using the controllers. Like the previous experiments, two conditions were considered: normal viewing and inverse blurring viewing. The inverse blurring condition used the embodiment without eye-tracker.

The numbers of correct and incorrect answers were computed for all the users. No data was discarded, as all the users achieved more than 50% correct answers. The means of the group are summarized in the table below. Similar to the experiment on the 3D screen, it can be observed that the error is much lower in the inverse blurring setting, indicating that the technique performs similarly on different devices and that the second implementation (i.e. the one for media devices without eye tracker) is also effective in lowering the conflict between convergence and accommodation.

To understand whether the results have statistical significance, the discrimination sensitivity was computed for the 2AFC task. The data for each user and condition was converted into a discrimination sensitivity d'. A bootstrapping procedure similar to that of the previous experiment was performed and the group distributions were fitted with a Gaussian distribution, from which the 2.5th and 97.5th quantiles were taken as the 95% confidence interval. Fig. 10 shows the discrimination for the two experimental conditions. A mean discrimination of 1.77 was observed for the normal viewing session, whereas the discrimination significantly increased to 2.68 with the inverse blurring method (fig. 10).

The experimental analysis supports the notion that the described method can improve the user's perception of virtual object distances by minimizing the mismatch between accommodation and vergence.