GENERATION OF MULTICHANNEL AUDIO SIGNAL

Title:

GENERATION OF MULTICHANNEL AUDIO SIGNAL

Document Type and Number:

WIPO Patent Application WO/2024/056359

Abstract:

An audio apparatus comprises a receiver (101) arranged to receive a downmix audio signal for a multichannel audio signal and upmix parametric data for upmixing the downmix audio signal. A first artificial neural network (107) generates a set of feature values for the downmix audio signal from samples of the downmix audio signal. A second artificial neural network (109) has input nodes receiving second samples of the downmix audio signal and nodes receiving feature values from the set of feature values. Based on these inputs, the second artificial neural network (109) generates samples of an auxiliary audio signal for the downmix audio signal. A generator (105) generates the multichannel audio signal from the downmix signal and the auxiliary audio signal in dependence on the upmix parametric data. In many embodiments, the operation may be subband based with separate artificial neural networks being used for different subbands.

Inventors:

SCHUIJERS ERIK GOSUINUS PETRUS (NL)
GALLUCCI ALESSIO (NL)

Application Number:

PCT/EP2023/073575

Publication Date:

March 21, 2024

Filing Date:

August 29, 2023

Assignee:

KONINKLIJKE PHILIPS NV (NL)

International Classes:

G10L19/008; G10L19/02

Other References:

"Spatial Audio Processing", 1 January 2007, JOHN WILEY & SONS, LTD, England, article JEROEN BREEBAART ET AL: "Spatial Audio Processing - Ch. 6 MPEG Surround", pages: 93 - 115, XP055152635
CHUN CHAN JUN ET AL: "Extension of Monaural to Stereophonic Sound Based on Deep Neural Networks", AES CONVENTION 139; OCTOBER 2015, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 23 October 2015 (2015-10-23), XP040672253
E. SCHUIJERS, W. OOMEN, B. DEN BRINKER, J. BREEBAART: "Advances in Parametric Coding for High-Quality Audio", 114TH AES CONVENTION, AMSTERDAM, THE NETHERLANDS, 2003, PREPRINT 5852
E. SCHUIJERS, J. BREEBAART, H. PURNHAGEN, J. ENGDEGARD: "Low Complexity Parametric Stereo Coding", 116TH AES CONVENTION, BERLIN, GERMANY, 2004, PREPRINT 6073
XAVIER GLOROT, ANTOINE BORDES, YOSHUA BENGIO: "Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics", PMLR, vol. 15, 2011, pages 315 - 323
OORD, AARON VAN DEN, SANDER DIELEMAN, HEIGA ZEN, KAREN SIMONYAN, ORIOL VINYALS, ALEX GRAVES, NAL KALCHBRENNER, ANDREW SENIOR, KORAY KAVUKCUOGLU: "Wavenet: A generative model for raw audio", ARXIV: 1609.03499, 2016

Attorney, Agent or Firm:

PHILIPS INTELLECTUAL PROPERTY & STANDARDS (NL)

Claims:

CLAIMS:

1. An apparatus for generating a multichannel audio signal, the apparatus comprising: a receiver (101) arranged to receive a downmix audio signal for the multichannel audio signal and upmix parametric data for upmixing the downmix audio signal; a first artificial neural network (107) arranged to generate a set of feature values for the downmix audio signal, the first artificial neural network having input nodes for receiving first samples of the downmix audio signal and output nodes for providing the set of feature values; a second artificial neural network (109) having input nodes for receiving second samples of the downmix audio signal and output nodes arranged to provide samples of an auxiliary audio signal for the downmix audio signal, the second artificial neural network (109) further comprising nodes receiving feature values from the set of feature values; and a generator (105) arranged to generate the multichannel audio signal from the downmix signal and the auxiliary audio signal in dependence on the upmix parametric data.

2. The apparatus of claim 1 comprising a first filter bank (401) for generating a frequency subband representation of the downmix audio signal; and wherein at least some of the second samples of the downmix audio signal are subband samples of the frequency subband representation.

3. The apparatus of claim 2 wherein the second artificial neural network (109) is an artificial neural network of a first plurality of subband artificial neural networks, each subband artificial neural network of the first plurality of subband artificial neural networks being arranged to generate subband samples for a subset of subbands of a frequency subband representation of the auxiliary audio signal.

4. The apparatus of claim 3 wherein the plurality of subband artificial neural networks includes an artificial neural network for each subband of the frequency subband representation of the auxiliary audio signal.

5. The apparatus of any of claims 2 to 4 wherein the generator (105) is arranged to generate a frequency subband representation of the multichannel audio signal by applying a subband matrix operation to the frequency subband representation of the auxiliary audio signal and the frequency subband representation of the downmix audio signal, and to transform the frequency subband representation of the multichannel audio signal to a time domain representation of the multichannel audio signal.

6. The apparatus of any of previous claims 2 to 5 wherein the set of feature values generated by a subband artificial neural network of the first plurality of subband artificial neural networks is common for a plurality of subbands of the frequency subband representation of the downmix audio signal.

7. The apparatus of any of previous claims 2 to 6 wherein a number of input nodes for artificial neural networks of the first plurality of subband artificial neural networks is monotonically decreasing for increasing frequency.

8. The apparatus of any previous claim comprising a second filter bank (401) for generating a frequency subband representation of the downmix audio signal; and wherein at least some of the first samples of the downmix audio signal are subband samples of the frequency subband representation.

9. The apparatus of claim 8 as dependent on any of claims 3 to 7 wherein the first artificial neural network (107) is an artificial neural network of a second plurality of subband artificial neural networks, each subband artificial neural network of the second plurality of subband artificial neural networks being arranged to generate subband samples for a subset of artificial neural networks of the first plurality of artificial neural networks.

10. The apparatus of claim 9 as dependent on any of the claims 3 to 7 wherein subband samples of the second subband samples for at least one artificial neural network of the second plurality of artificial neural networks include a plurality of subband samples for multiple processing time intervals of an artificial neural network of the first plurality of artificial neural networks.

11. The apparatus of claim 9 or 10 wherein subband samples of the second subband samples for at least one artificial neural network of the second plurality of artificial neural networks include at least one subband sample for a subband of the subband representation of the downmix audio signal for which the at least one artificial neural network does not generate subband samples for the subband representation of the auxiliary audio signal.

12. The apparatus of any previous claim wherein the first artificial neural network and the second artificial neural network are trained by a joint training process based on training data comprising sets of samples of a downmix audio signal generated by downmixing a training multichannel audio signal and a target audio signal determined from a residual signal generated for the downmix audio signal, and using a cost function indicative of a difference of a generated auxiliary audio signal for the training multichannel audio signal and the target audio signal.

13. The apparatus of any previous claim further comprising generating feature values for the set of feature values from analytical analysis of the downmix audio signal.

14. A method of generating a multichannel audio signal, the method comprising: receiving a downmix audio signal for the multichannel audio signal and upmix parametric data for upmixing the downmix audio signal; a first artificial neural network (107) generating a set of feature values for the downmix audio signal, the first artificial neural network having input nodes for receiving first samples of the downmix audio signal and output nodes for providing the set of feature values; a second artificial neural network (109) having input nodes for receiving second samples of the downmix audio signal and output nodes providing samples of an auxiliary audio signal for the downmix audio signal, the second artificial neural network (109) further comprising nodes receiving feature values from the set of feature values; and generating the multichannel audio signal from the downmix signal and the auxiliary audio signal in dependence on the upmix parametric data.

15. A computer program product comprising computer program code means adapted to perform all the steps of claim 14 when said program is run on a computer.

Description:

GENERATION OF MULTICHANNEL AUDIO SIGNAL

FIELD OF THE INVENTION

The invention relates to generation of multichannel audio signals and in particular, but not exclusively, to generation of stereo signals from upmixing of a mono downmix signal using upmix parametric data.

BACKGROUND OF THE INVENTION

Spatial audio applications have become numerous and widespread and increasingly form at least part of many audiovisual experiences. Indeed, new and improved spatial experiences and applications are continuously being developed which results in increased demands on the audio processing and rendering.

For example, in recent years, Virtual Reality (VR) and Augmented Reality (AR) have received increasing interest and a number of implementations and applications are reaching the consumer market. Indeed, equipment is being developed for both rendering the experience as well as for capturing or recording suitable data for such applications. For example, relatively low-cost equipment is being developed for allowing gaming consoles to provide a full VR experience. It is expected that this trend will continue and indeed will increase in speed with the market for VR and AR reaching a substantial size within a short time scale. In the audio domain, a prominent field explores the reproduction and synthesis of realistic and natural spatial audio. The ideal aim is to produce natural audio sources such that the user cannot recognize the difference between a synthetic and an original one.

A lot of research and development effort has focused on providing efficient and high-quality audio encoding and audio decoding for spatial audio. Frequently used spatial audio representations are multichannel audio representations, including stereo, and efficient encoding of such multichannel audio, based on downmixing multichannel audio signals to downmix signals with fewer channels, has been developed. One of the main advances in low bit-rate audio coding has been the use of parametric multichannel coding where a downmix signal is generated together with parametric data that can be used to upmix the downmix signal to recreate the multichannel audio signal.

In particular, instead of traditional mid-side or intensity coding, in parametric multichannel audio coding a multichannel input signal is downmixed to a lower number of channels (e.g. two to one) and multichannel image (stereo) parameters are extracted. Then the downmix signal is encoded using a more traditional audio coder (e.g. a mono audio encoder). The bitstream of the downmix is multiplexed with the encoded multichannel image parameter bitstream. This bitstream is then transmitted to the decoder, where the process is inverted. First the downmix audio signal is decoded, after which the multichannel audio signal is reconstructed guided by the encoded multichannel image upmix parameters.
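The encoder-side steps described above can be illustrated with a minimal sketch. It assumes a passive two-to-one downmix and two per-frame stereo parameters (an interchannel intensity difference and an interchannel correlation); the function name, the frame length and the exact parameter definitions are assumptions made for illustration and are not taken from any particular codec. The encoding of the downmix and the multiplexing of the bitstreams are omitted.

```python
import numpy as np

def encode_parametric_stereo(left, right, frame_len=1024):
    """Downmix a stereo pair to mono and extract per-frame stereo parameters."""
    n_frames = len(left) // frame_len
    left = left[: n_frames * frame_len].reshape(n_frames, frame_len)
    right = right[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Simple passive two-to-one downmix (an assumption of this sketch).
    downmix = 0.5 * (left + right)

    eps = 1e-12  # guards against division by zero in silent frames
    p_left = np.sum(left ** 2, axis=1) + eps
    p_right = np.sum(right ** 2, axis=1) + eps

    # Interchannel intensity difference per frame, in dB.
    iid_db = 10.0 * np.log10(p_left / p_right)
    # Interchannel correlation per frame (normalized cross-correlation at lag zero).
    icc = np.sum(left * right, axis=1) / np.sqrt(p_left * p_right)

    return downmix.reshape(-1), iid_db, icc
```

For a right channel that is a half-amplitude copy of the left channel, this sketch would report an intensity difference of about 6 dB and a correlation close to one in every frame.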

An example of stereo coding is described in E. Schuijers, W. Oomen, B. den Brinker, J. Breebaart, “Advances in Parametric Coding for High-Quality Audio”, 114th AES Convention, Amsterdam, The Netherlands, 2003, Preprint 5852. In the described approach, the downmixed mono signal is parametrized by exploiting the natural separation of the signal into three components (objects): transients, sinusoids, and noise. In E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegard, “Low Complexity Parametric Stereo Coding”, 116th AES Convention, Berlin, Germany, 2004, Preprint 6073, more details are provided describing how parametric stereo was realized with a low (decoder) complexity when combining it with Spectral Band Replication (SBR).

In the described approaches, the decoding is based on the use of a so-called decorrelation process. The decorrelation process generates a decorrelated helper signal from the monaural signal. In the stereo reconstruction process, both the monaural signal and the decorrelated helper signal are used to generate the upmixed stereo signal based on the upmix parameters. Specifically, the two signals may be multiplied by a time- and frequency-dependent 2x2 matrix having coefficients determined from the upmix parameters to provide the output stereo signal.
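As an illustration of the matrix step, the following sketch applies one 2x2 matrix per time frame to the pair formed by the monaural signal and the helper signal. It is a simplified, time-dependent-only example: the frequency dependence is left out, the helper signal is assumed to be supplied by some decorrelator, and the function name and array layout are hypothetical.

```python
import numpy as np

def upmix_with_matrices(mono, helper, matrices, frame_len):
    """Apply one 2x2 matrix per frame to the (mono, helper) signal pair.

    matrices has shape (n_frames, 2, 2); the result has shape (2, n_samples),
    with row 0 holding the left channel and row 1 the right channel.
    """
    n_frames = matrices.shape[0]
    n = n_frames * frame_len
    # Arrange the inputs as (n_frames, 2, frame_len): row 0 mono, row 1 helper.
    x = np.stack([mono[:n], helper[:n]]).reshape(2, n_frames, frame_len)
    x = x.transpose(1, 0, 2)
    # Batched product: each frame is multiplied by its own 2x2 matrix.
    y = matrices @ x
    # Back to (2, n_samples) with the frames concatenated in time order.
    return y.transpose(1, 0, 2).reshape(2, n)
```

With the matrix [[1, 1], [1, -1]] in a frame, the left output is the sum of the two inputs and the right output is their difference, which corresponds to a plain sum/difference reconstruction for that frame.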

However, although Parametric Stereo (PS) and similar downmix encoding/decoding approaches were a leap forward from traditional stereo and multichannel coding, the approach is not optimal in all scenarios. In particular, known encoding and decoding approaches tend to introduce some distortion, changes, artefacts etc. that may introduce differences between the (original) multichannel audio signal input to the encoder and the multichannel audio signal recreated at the decoder. Typically, the audio quality may be degraded and imperfect recreation of the multichannel audio signal occurs. Further, the data rate may still be higher than desired and/or the complexity/resource usage of the processing may be higher than preferred.

Hence, an improved approach would be advantageous. In particular, an approach allowing increased flexibility, improved adaptability, an improved performance, increased audio quality, improved audio quality to data rate trade-off, reduced complexity and/or resource usage, reduced computational load, facilitated implementation and/or an improved spatial audio experience would be advantageous.

SUMMARY OF THE INVENTION

Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.

According to an aspect of the invention there is provided an apparatus for generating a multichannel audio signal, the apparatus comprising: a receiver arranged to receive a downmix audio signal for the multichannel audio signal and upmix parametric data for upmixing the downmix audio signal; a first artificial neural network arranged to generate a set of feature values for the downmix audio signal, the first artificial neural network having input nodes for receiving first samples of the downmix audio signal and output nodes for providing the set of feature values; a second artificial neural network having input nodes for receiving second samples of the downmix audio signal and output nodes arranged to provide samples of an auxiliary audio signal for the downmix audio signal, the second artificial neural network further comprising nodes receiving feature values from the set of feature values; and a generator arranged to generate the multichannel audio signal from the downmix signal and the auxiliary audio signal in dependence on the upmix parametric data.
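The signal flow of this aspect can be sketched as follows. This is a minimal illustration only: the layer sizes, the use of small fully connected networks with ReLU activations, the random (untrained) weights, and the choice to feed the feature values into the second network by concatenating them with its input samples are all assumptions of the sketch. The text above only requires that some nodes of the second network receive the feature values.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Random (untrained) weights and zero biases for one fully connected layer."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in), np.zeros(n_out)

def mlp(x, layers):
    """Feed x through the layers, with ReLU on all but the last layer."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

# Hypothetical sizes: first samples, second samples, feature values, output samples.
N_FIRST, N_SECOND, N_FEAT, N_OUT = 64, 16, 8, 16

feature_net = [dense(N_FIRST, 32), dense(32, N_FEAT)]        # first network
aux_net = [dense(N_SECOND + N_FEAT, 32), dense(32, N_OUT)]   # second network

def generate_auxiliary(first_samples, second_samples):
    """First network yields feature values; second network yields auxiliary samples."""
    features = mlp(first_samples, feature_net)
    conditioned = np.concatenate([second_samples, features])
    return mlp(conditioned, aux_net)

def generate_stereo(downmix_block, aux_block, upmix_matrix):
    """Generator: 2x2 matrix applied to the (downmix, auxiliary) pair."""
    return upmix_matrix @ np.stack([downmix_block, aux_block])
```

In this sketch the first network sees a longer window of downmix samples than the second, so the feature values can summarize slower-varying context while the second network works on the most recent samples; this split is a design choice of the example, not a requirement stated above.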

The approach may provide an improved audio experience in many embodiments. For many signals and scenarios, the approach may provide improved generation/reconstruction of a multichannel audio signal with an improved perceived audio quality. The approach may provide a particularly advantageous arrangement which may in many embodiments and scenarios allow a facilitated and/or improved possibility of utilizing artificial neural networks in audio processing, including typically audio encoding and/or decoding. The approach may allow an advantageous employment of artificial neural network(s) in generating a multichannel audio signal from a downmix audio signal.

The approach may provide an efficient implementation and may in many embodiments allow a reduced complexity and/or resource usage. The approach may in many scenarios allow a reduced data rate for data representing a multichannel audio signal using a downmix signal.

The first samples and the second samples may be the same samples or may be different samples (or may be partially the same samples). The first samples and the second samples may be time domain samples, may be frequency domain samples, or may span a particular time and frequency range (specifically subband domain samples). The samples of an auxiliary audio signal may be time domain samples, may be frequency domain samples, or may span a particular time and frequency range (specifically subband domain samples).

The upmix parametric data may comprise parameter (values) relating properties of the downmix signal to properties of the multichannel audio signal. The upmix parametric data may comprise data being indicative of relative properties between channels of the multichannel audio signal. The upmix parametric data may comprise data being indicative of differences in properties between channels of the multichannel audio signal. The upmix parametric data may comprise data being perceptually relevant for the synthesis of the multichannel audio signal. The properties may for example be differences in phase and/or intensity and/or timing and/or correlation. The upmix parametric data may in some embodiments and scenarios represent abstract properties not directly understandable by a human person/expert (but may typically facilitate a better reconstruction/lower data rate etc). The upmix parametric data may comprise data including at least one of interchannel intensity differences, interchannel timing differences, interchannel correlations and/or interchannel phase differences for channels of the multichannel audio signal.

The first and second artificial neural networks are trained artificial neural networks. The first and/or second artificial neural network may be a trained artificial neural network(s) trained by training data including training downmix audio signals and training upmix parametric data generated from training multichannel audio signals; the training employing a cost function comparing the training multichannel audio signals to upmixed multi-channel signals generated, using the training upmix parametric data, from the training downmix signals and generated auxiliary audio signals. The first and/or second artificial neural network may be a trained artificial neural network(s) trained by training data including training data representing a range of relevant audio sources including recording of videos, movies, telecommunications, etc.

The first and/or second artificial neural network may be a trained artificial neural network(s) trained by training data having training input data comprising training downmix audio signals of training multichannel audio signals, and using a cost function including a contribution indicative of a difference between training auxiliary audio signals generated by the second artificial neural network in response to the training data and training residual signals for the training downmix audio signals.
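A cost contribution of this kind can be sketched as a mean squared error between the generated auxiliary signal and a residual target. The sketch assumes a stereo training signal, a passive downmix, and the half-difference (side) signal as the residual; how the residual is actually derived is not specified in the passage above, so these definitions and the function names are illustrative assumptions.

```python
import numpy as np

def residual_target(left, right):
    """Return (downmix, residual) for a stereo training signal.

    Assumption of this sketch: the downmix is the half-sum and the residual
    is the half-difference, so left = downmix + residual and
    right = downmix - residual.
    """
    downmix = 0.5 * (left + right)
    residual = 0.5 * (left - right)
    return downmix, residual

def auxiliary_cost(generated_aux, target_residual):
    """Mean squared difference between generated auxiliary and target residual."""
    return float(np.mean((generated_aux - target_residual) ** 2))
```

During training, the downmix would be the network input and the cost would be minimized over the weights of both networks together; the optimizer and any additional cost terms are outside the scope of this example.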

The generator may be arranged to generate the multichannel audio signal by applying a matrix multiplication to the downmix signal and the auxiliary audio signal with the coefficients of the matrix being determined as a function of parameters of the upmix parametric data. The matrix may be time- and frequency-dependent.
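One possible way of determining the matrix coefficients from upmix parameters is sketched below for the stereo case. It is a simplified mapping derived under the assumption that the downmix and the auxiliary signal are uncorrelated and have equal power; it is not claimed to be the mapping used in this document or in any standard. The function name and the gain normalization are hypothetical choices.

```python
import numpy as np

def upmix_matrix(iid_db, icc):
    """Build a 2x2 upmix matrix from an intensity difference (dB) and a correlation.

    Assuming uncorrelated, equal-power inputs (downmix, auxiliary), the outputs
    left  = g_left  * (cos(a) * downmix + sin(a) * auxiliary)
    right = g_right * (cos(a) * downmix - sin(a) * auxiliary)
    have level ratio g_left / g_right and correlation cos(2a).
    """
    c = 10.0 ** (iid_db / 20.0)              # linear amplitude ratio left/right
    g_right = np.sqrt(2.0 / (1.0 + c * c))   # normalized so g_left^2 + g_right^2 = 2
    g_left = c * g_right
    a = 0.5 * np.arccos(np.clip(icc, -1.0, 1.0))
    return np.array([
        [g_left * np.cos(a), g_left * np.sin(a)],
        [g_right * np.cos(a), -g_right * np.sin(a)],
    ])
```

Under the stated assumption, the covariance of the two outputs is proportional to the matrix multiplied by its own transpose, so the target intensity difference and correlation can be checked directly from that product.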

The audio apparatus may specifically be an audio decoder apparatus.

According to an optional feature of the invention, the apparatus comprises a first filter bank for generating a frequency subband representation of the downmix audio signal; and wherein at least some of the second samples of the downmix audio signal are subband samples of the frequency subband representation.

Subband processing may provide a particularly advantageous operation in many embodiments. The arrangement may be particularly suited for subband processing which may allow reduced complexity and/or an improved multichannel audio signal to be generated.

According to an optional feature of the invention, the second artificial neural network is an artificial neural network of a first plurality of subband artificial neural networks, each subband artificial neural network of the first plurality of subband artificial neural networks being arranged to generate subband samples for a subset of subbands of a frequency subband representation of the auxiliary audio signal.

A particular advantage of the approach is that it may allow highly efficient subband processing thereby allowing partitioning of the required processing into a plurality of smaller artificial neural networks. This may typically allow reduced complexity and/or an improved multichannel audio signal to be generated.

In many embodiments, each (or at least some) subband neural network(s) is arranged to generate subband samples for one subband of the frequency subband representation of the auxiliary audio signal. According to an optional feature of the invention, the plurality of subband artificial neural networks includes an artificial neural network for each subband of the frequency subband representation of the auxiliary audio signal.

This may in many embodiments and scenarios provide a highly advantageous and efficient implementation and/or operation and/or performance.
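The partitioning into one small network per subband can be sketched as follows. The number of subbands, the network shapes, the random (untrained) weights and the use of real-valued subband samples are assumptions of the sketch; the input lengths are chosen to shrink with increasing frequency, in line with the claim that the number of input nodes may decrease monotonically with frequency.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SUBBANDS = 4                   # hypothetical number of subbands
INPUT_LENGTHS = [16, 12, 8, 4]   # hypothetical input sizes, shrinking with frequency

def make_subband_net(n_in, n_hidden=8):
    """Random (untrained) weights for a one-hidden-layer network with one output."""
    return {
        "w1": rng.standard_normal((n_in, n_hidden)) / np.sqrt(n_in),
        "w2": rng.standard_normal((n_hidden, 1)) / np.sqrt(n_hidden),
    }

def run_subband_net(net, history):
    """Map the recent downmix samples of one subband to one auxiliary sample."""
    hidden = np.maximum(history @ net["w1"], 0.0)
    return (hidden @ net["w2"])[0]

# One independent network per subband of the auxiliary signal.
subband_nets = [make_subband_net(n) for n in INPUT_LENGTHS]

def generate_aux_subbands(downmix_subbands):
    """downmix_subbands: array of shape (N_SUBBANDS, n_slots), real-valued here.

    Returns one new auxiliary subband sample per subband.
    """
    out = np.empty(N_SUBBANDS)
    for k, net in enumerate(subband_nets):
        history = downmix_subbands[k, -INPUT_LENGTHS[k]:]
        out[k] = run_subband_net(net, history)
    return out
```

Because each network in this sketch reads only its own subband, changing the input of one subband leaves the outputs of the other subbands untouched, which is what keeps the individual networks small.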

According to an optional feature of the invention, the generator is arranged to generate a frequency subband representation of the multichannel audio signal by applying a subband matrix operation to the frequency subband representation of the auxiliary audio signal and the frequency subband representation of the downmix audio signal, and to transform the frequency subband representation of the multichannel audio signal to a time domain representation of the multichannel audio signal.

This may in many embodiments and scenarios provide a highly advantageous and efficient implementation and/or operation and/or performance.
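The subband matrix operation followed by the transform back to the time domain can be sketched with a block-wise FFT standing in for the filter bank. This stand-in is an assumption made to keep the example short; practical codecs typically use overlapping filter banks such as QMF banks. The function name, the array layout and the use of one fixed matrix per frequency bin are likewise illustrative.

```python
import numpy as np

def subband_upmix(downmix, aux, matrices, frame_len):
    """Upmix in a frequency representation and return time-domain channels.

    matrices has shape (n_bins, 2, 2) with n_bins = frame_len // 2 + 1.
    Returns an array of shape (2, n_samples).
    """
    n_frames = len(downmix) // frame_len
    n = n_frames * frame_len
    # Analysis: block-wise real FFT used as a stand-in filter bank.
    d = np.fft.rfft(downmix[:n].reshape(n_frames, frame_len), axis=1)
    a = np.fft.rfft(aux[:n].reshape(n_frames, frame_len), axis=1)
    x = np.stack([d, a], axis=-2)                 # (n_frames, 2, n_bins)
    # Subband matrix operation: one 2x2 matrix per frequency bin.
    y = np.einsum("kij,tjk->tik", matrices, x)    # (n_frames, 2, n_bins)
    # Synthesis: inverse transform per frame, then concatenate the frames.
    out = np.fft.irfft(y, n=frame_len, axis=-1)   # (n_frames, 2, frame_len)
    return out.transpose(1, 0, 2).reshape(2, n)
```

When every bin uses the matrix [[1, 1], [1, -1]], the result reduces to the time-domain sum and difference of the two inputs, which gives a simple way to sanity-check the analysis and synthesis steps.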

According to an optional feature of the invention, the set of feature values generated by a subband artificial neural network of the first plurality of subband artificial neural networks is common for a plurality of subbands of the frequency subband representation of the downmix audio signal.