

Title:
PROCESSING AUDIO SIGNALS
Document Type and Number:
WIPO Patent Application WO/2018/234618
Kind Code:
A1
Abstract:
A method, apparatus and computer-readable medium are disclosed for using a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and assigning a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

Inventors:
RÄMÖ ANSSI (FI)
VILERMO MIIKKA (FI)
VIRTANEN TUOMAS (FI)
NIKUNEN JOONAS (FI)
Application Number:
PCT/FI2018/050396
Publication Date:
December 27, 2018
Filing Date:
May 25, 2018
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04S7/00; G10K15/12; H03H17/02; H03H17/04; H03H21/00; H04R5/04; H04R29/00
Domestic Patent References:
WO2006024850A2 (2006-03-09)
WO2015058818A1 (2015-04-30)
WO2007060443A2 (2007-05-31)
Foreign References:
EP3128766A2 (2017-02-08)
US6246760B1 (2001-06-12)
US5548642A (1996-08-20)
US20060002547A1 (2006-01-05)
EP2365630A1 (2011-09-14)
US20160140950A1 (2016-05-19)
US20070270988A1 (2007-11-22)
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:

1. A method, comprising:
using a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and
assigning a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

2. A method according to claim 1, wherein the recording parameter is a reverberation time of the recording space.

3. A method according to claim 2, wherein the filter length is proportional to the reverberation time.

4. A method according to claim 2 or claim 3, wherein the reverberation time is an RT60 reverberation time and the filter length is within the range RT60/8 to RT60/2.

5. A method according to claim 1, wherein the recording parameter is determined by determining spectral content of the near-field audio signal.

6. A method according to claim 5, wherein determining spectral content of the near-field audio signal comprises identifying a spectral centroid of the near-field audio signal.

7. A method according to claim 5 or claim 6, wherein the filter length is inversely proportional to the frequency of the spectral centroid of the near-field audio signal.

8. A method according to claim 1, wherein the filter length varies for each frequency band within the near-field audio signal.

9. A method according to claim 8, wherein the filter length decreases exponentially with increasing frequency.

10. A method according to claim 1, wherein the recording parameter is associated with the recording space and is determined using a forgetting factor dependent on the frequency of the near-field audio signal.

11. A method according to claim 1, wherein assigning a filter length to the room impulse response filter comprises truncating the filter length of a determined room impulse response filter.

12. A method according to any preceding claim, wherein the far-field audio signal is obtained from a microphone array.

13. A method according to any preceding claim, wherein the near-field audio signal is obtained from a Lavalier microphone.

14. Apparatus configured to perform a method according to any preceding claim.

15. Computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method as claimed in any of claims 1 to 13.

16. An apparatus comprising:
at least one processor; and
at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to:
use a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and
assign a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

17. An apparatus according to claim 16, wherein the recording parameter is a reverberation time of the recording space.

18. An apparatus according to claim 17, wherein the filter length is proportional to the reverberation time.

19. An apparatus according to claim 17 or claim 18, wherein the reverberation time is an RT60 reverberation time and the filter length is within the range RT60/8 to RT60/2.

20. An apparatus according to claim 16, wherein the recording parameter is determined by determining spectral content of the near-field audio signal.

21. An apparatus according to claim 20, wherein determining spectral content of the near-field audio signal comprises identifying a spectral centroid of the near-field audio signal.

22. An apparatus according to claim 20 or claim 21, wherein the filter length is inversely proportional to the frequency of the spectral centroid of the near-field audio signal.

23. An apparatus according to claim 16, wherein the filter length varies for each frequency band within the near-field audio signal.

24. An apparatus according to claim 23, wherein the filter length decreases exponentially with increasing frequency.

25. An apparatus according to claim 16, wherein the recording parameter is associated with the recording space and is determined using a forgetting factor dependent on the frequency of the near-field audio signal.

26. An apparatus according to claim 16, wherein assigning a filter length to the room impulse response filter comprises truncating the filter length of a determined room impulse response filter.

27. An apparatus according to any one of claims 16-26, wherein the far-field audio signal is obtained from a microphone array.

28. An apparatus according to any one of claims 16-27, wherein the near-field audio signal is obtained from a Lavalier microphone.

29. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:
using a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and
assigning a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

30. Apparatus comprising:
means for using a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and
means for assigning a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

Description:
Processing Audio Signals

Field

This specification relates to processing audio signals and, more specifically, to processing audio signals for mixing.

Background

Spatial audio signals are being used more often to produce a more immersive audio experience. A stereo or multi-channel recording can be passed from the recording or capture apparatus to a listening apparatus and replayed using a suitable multi-channel output such as a multi-channel loudspeaker arrangement and, with virtual surround processing, a pair of stereo headphones or headset.

As the possibilities for using such immersive audio functionality become more widespread, there is a need to ensure that audio signals are mixed in such a way as to complement the virtual reality environment of the user. For example, if a user is in a virtual reality environment, there is a requirement that audio content from a particular source sounds as though it is coming from a location corresponding to the location of that source in virtual reality.

Summary

In a first aspect, this specification describes a method comprising: using a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and assigning a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

The recording parameter may be a reverberation time of the recording space. The filter length may be proportional to the reverberation time.

The reverberation time may be an RT60 reverberation time and the filter length may be within the range RT60/8 to RT60/2. The recording parameter may be determined by determining spectral content of the near-field audio signal. Determining spectral content of the near-field audio signal may comprise identifying a spectral centroid of the near-field audio signal.

The filter length may be inversely proportional to the frequency of the spectral centroid of the near-field audio signal.

The filter length may vary for each frequency band within the near-field audio signal.

The filter length may decrease exponentially with increasing frequency.

The recording parameter may be associated with the recording space and may be determined using a forgetting factor dependent on the frequency of the near-field audio signal.

Assigning a filter length to the room impulse response filter may comprise truncating the filter length of a determined room impulse response filter.

The far-field audio signal may be obtained from a microphone array.

The near-field audio signal may be obtained from a Lavalier microphone.

In a second aspect, this specification describes apparatus configured to perform a method according to any preceding claim.

In a third aspect, this specification describes computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method in accordance with the first aspect of this specification.

In a fourth aspect, this specification describes an apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: use a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and assign a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

In a fifth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least: using a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and assigning a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

In a sixth aspect, this specification describes an apparatus comprising: means for using a linear transformation of a far-field audio signal and a linear transformation of a near-field audio signal to determine a room impulse response filter relating to a recording space; and means for assigning a filter length to the room impulse response filter, the filter length being dependent on a recording parameter.

Brief description of the drawings

So that the invention may be fully understood, embodiments thereof will now be described with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram of an audio mixing system and a recording space;
Figure 2 is a schematic block diagram of elements of certain embodiments;
Figure 3 is a flow chart illustrating operations carried out in certain embodiments;
Figure 4 is an illustration of a recording space;
Figure 5 is a schematic diagram of an audio mixing system and a recording space;
Figure 6 is a schematic diagram of an audio mixing system and a recording space as a target source is replaced with a replacement source;
Figure 7 is a schematic diagram of an audio mixing system and a recording space as a new source is introduced to an audio mixture; and
Figures 8A and 8B are graphs illustrating filter lengths.

Detailed description

In the description and drawings, like reference numerals refer to like elements throughout.

Embodiments of the present invention provide room impulse response (RIR) filters of differing lengths based on a recording parameter. In a typical indoor environment, the reverberation time (RT60) of a particular signal is longer for low frequencies, since they are not as easily absorbed when interacting with a reflecting/absorbing boundary. For example, the reverberation time below 100 Hz can be several seconds, whereas for high frequencies above 4000 Hz the reverberation time may only be a fraction of a second. Due to the varying RT60 over a frequency range, sources with different spectral characteristics may require different RIR filter lengths in terms of short time Fourier transform (STFT) frames for accurate modelling and projection. For example, a kick drum and a bass guitar may require very long RIR filter lengths in order to model all the reverberation at low frequencies.

Additionally, within one broadband source occupying both relatively low and high frequencies it is useful to allow different RIR filter lengths for different frequencies. Typically, at high frequencies, the filter length is substantially shorter, since signal energy is absorbed and reverberation abates faster, leaving less reverberation to be modelled using the RIR data and projection. Additionally, excessively long filters cannot be used for an entire frequency range since they can cause over-modelling/overfitting effects due to too flexible a model, which leads to decreased subjective performance in the projection and removal. Also, long filters cause unnecessary computation and may cause performance issues in real-time implementation.

Embodiments of the present invention relate to mixing audio signals received from both a near-field microphone and from a far-field microphone. Example near-field microphones include Lavalier microphones, which may be worn by a user to allow hands-free operation, and handheld microphones. In some embodiments, the near-field microphone may be location tagged. The near-field signals obtained from near-field microphones may be termed "dry signals", in that they have little influence from the recording space and have a relatively high signal-to-noise ratio (SNR).

Far-field microphones are microphones that are located relatively far away from a sound source. In some embodiments, an array of far-field microphones may be provided, for example in a mobile phone or in a Nokia Ozo (RTM) or similar audio recording apparatus. Devices having multiple microphones may be termed multi-channel devices and can detect an audio mixture comprising audio components received from the respective channels.

The microphone signals from far-field microphones may be termed "wet signals", in that they have significant influence from the recording space (for example from ambience, reflections, echoes, reverberation, and other sound sources). Wet signals tend to have relatively low SNR. In essence, the near-field and far-field signals are in different "spaces", near-field signals in a "dry space" and far-field signals in a "wet space".

When the originally "dry" audio content from the sound sources reaches the far-field microphone array, the audio signals have changed because of the effect of the recording space. That is to say, the signal becomes "wet" and has a relatively low SNR. The near-field microphones are much closer to the sound sources than the far-field microphone array. This means that the audio signals received at the near-field microphones are much less affected by the recording space. The dry signal has a much higher signal-to-noise ratio and lower cross-talk with respect to other sound sources. Therefore, the near-field and far-field signals are very different, and mixing the two ("dry" and "wet") results in audible artefacts or non-natural sounding audio content.

Further problems arise if a signal from outside the system needs to be inserted into the audio mixture. For example, an audio stream from an external player such as a professional audio recorder may be mixed with audio content recorded in a particular recording space. These signals need to be mixed together because only the microphone array can provide spatial audio content, for example for a virtual reality (VR) or augmented reality (AR) audio delivery system. However, this cannot be done with simply mixed sound sources, due to artefacts or at least due to the virtual presence aspect being lost in listening. Furthermore, future six degrees of freedom (6DoF) audio production systems require ways to estimate room impulse responses.

Additionally, mixing or editing of the multi-channel array signal is not straightforward due to the low SNR, cross-talk and spatial artefacts that editing might cause. Editing of the near-field microphone and pre-recorded signals is relatively straightforward due to high SNR and isolation between individual channels. However, near-field signals only provide audio content without spatial information. The resulting mix quality is up to personal preferences and use case demands; however, some amount of spatial information insertion capability is often needed.

A new problem arises when a totally new "dry" signal is introduced into the audio mixture, for example from a sound source located externally with respect to the recording space. Since the new audio signal has no room impulse response (RIR) data available for the current room and environment, realistic sounding mixing is not possible without a database of RIR values from all around the space used for the original audio capture.

Current audio mixing systems often rely on an expert audio mixer's personal abilities, and spatial information may be added to the "dry" near-field signal with signal processors that create artificial spatial information. Examples include reverb processors that generate spatial information with an algorithm, for differently sounding and tunable spaces, or that rely on real impulse responses (convolution processors), with some amount of manual modification to parameters such as panning, volume, equalization, pre-echo, decay time and residual noise floor adjustments. More information may be found at http://www.nongnu.org/freeverb3/.

Hitherto, there have been no known methods that use a collected RIR database together with position data and/or models of the recording space to render realistic-sounding VR, AR or 6DoF audio playback.

Embodiments of this invention provide a database where estimated RIR values are collected around the place of performance based on the captured "dry" and "wet" signals as well as available position data of the near-field microphones (which corresponds to the position of the sound source). The RIR data are estimated based on the dry-to-wet signal transfer function at every relevant position within the recording space. There may be one or more "wet" multi-channel arrays as well as one or more "dry" sound sources contributing to the RIR database at the same time.

In some embodiments, the RIR database may be collected during an initial calibration phase where a sound source (for example white noise, a talking human, an acoustic instrument, a flying drone with a speaker, etc.) is moving or is moved around the recording space either manually or automatically. The benefit of having calibration recordings and database collection prior to the actual performance is that the RIR database can be used during the performance to insert additional sound sources into the audio mix in real-time. Also, the recording space might offer higher SNR in some circumstances, for example when a studio audience is absent; the use of special signals such as white noise will also provide more accurate room impulse responses for the whole frequency range.

In other embodiments, continuous collection of new RIR data is performed during the recording itself, the new RIR data being inserted into the database as the actual performance occurs. Additional RIR data that is inserted into a pre-existing RIR database can also be collected during the actual performance.

Collection of RIR data during a performance can be made in order to add more data points to make the database denser. There are multiple dimensions that can be enhanced in the database. For example, the position grid can be made denser: data may be acquired for a 10 centimetre (cm) grid instead of an originally calibrated 20 cm grid so that more data points are gathered. Spectral coverage can also be improved: if calibration was initially performed quickly by walking around the vicinity of the far-field microphone array, all further captured signals will decrease the spectral sparseness of the RIR database.

Since the acoustic environment may change during the performance, the RIR database can contain time varying RIR values. To capture time varying responses, RIR measurements need to be captured over an extended period of time for optimal quality. For example, when more people enter the recording space a damping of the recording space occurs which affects the acoustic properties of that recording space.

Figure 1 shows an audio mixing system 100 which comprises a far-field audio recording device 101, such as a video/audio capture device, and one or more near-field audio recording devices 102, such as Lavalier microphones. The far-field audio recording device 101 comprises an array of far-field microphones and may be a mobile phone, a stereoscopic video/audio capture device or similar recording apparatus such as the Nokia Ozo (RTM). The near-field audio recording devices 102 may be worn by a user, for example a singer or actor. The far-field audio recording device 101 and the near-field audio recording devices 102 are located within a recording space 103.

The far-field audio recording device 101 is in communication with an RIR processing apparatus 104 either via a wired or wireless connection. The RIR processing apparatus 104 may be located within the recording space 103 or outside the recording space 103. The RIR processing apparatus 104 has access to an RIR database 105 containing RIR data relating to the recording space 103. The RIR database 105 may be physically incorporated with the RIR processing apparatus 104. Alternatively, the RIR database 105 may be maintained remotely with respect to the RIR processing apparatus 104.

Figure 2 is a schematic block diagram of the RIR processing apparatus 104. The RIR processing apparatus 104 may be incorporated within a general-purpose computer. Alternatively, the RIR processing apparatus 104 may be a standalone apparatus. The RIR processing apparatus 104 may comprise a short-time Fourier transform (STFT) module 201 for determining short-time Fourier transforms of received audio signals. The RIR processing apparatus 104 comprises an RIR estimator 202 and a projection module 203. The RIR processing apparatus 104 comprises a processor 204 which controls the STFT module 201, the RIR estimator 202 and the projection module 203. The RIR processing apparatus 104 comprises a memory 205. The memory comprises a volatile memory 206 such as random access memory (RAM). The memory also comprises non-volatile memory 207, such as read-only memory (ROM).

The RIR processing apparatus 104 further comprises input/output 208 to enable communication with the far-field audio recording device 101 and with the RIR database 105 as well as any other remote entities. The input/output 208 comprises hardware, software and/or firmware that allows the RIR processing apparatus 104 to communicate with the far-field audio recording device 101 and with other remote entities via wired or wireless connection using communication protocols known in the art.

Some further details of components and features of the above-described RIR processing apparatus 104 and alternatives will now be described. The RIR processing apparatus 104 comprises a processor 204 communicatively coupled with memory 205. The memory 205 has computer readable instructions stored thereon, which when executed by the processor 204 cause the processor 204 to cause performance of various ones of the operations described with reference to Figure 3. The RIR processing apparatus 104 may in some instances be referred to, in general terms, as "apparatus".

The RIR processing apparatus 104 may be of any suitable composition. For example, the processor 204 may be a programmable processor that interprets computer program instructions and processes data. The processor 204 may include plural programmable processors. Alternatively, the processor 204 may be, for example, programmable hardware with embedded firmware. The processor 204 may be termed processing means. The processor 204 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processor 204 may be referred to as computing apparatus.

The processor 204 is coupled to the memory (or one or more storage devices) 205 and is operable to read/write data to/from the memory 205. The memory 205 may comprise a single memory unit or a plurality of memory units, upon which the computer readable instructions (or code) are stored. For example, the memory 205 may comprise both volatile memory and non-volatile memory. For example, the computer readable instructions/program code may be stored in the non-volatile memory and may be executed by the processor 204 using the volatile memory for temporary storage of data or data and instructions. Examples of volatile memory include RAM, DRAM, and SDRAM etc. Examples of non-volatile memory include ROM, PROM, EEPROM, flash memory, optical storage, magnetic storage, etc. The memories in general may be referred to as non-transitory computer readable memory media.

The term 'memory', in addition to covering memory comprising both non-volatile memory and volatile memory, may also cover one or more volatile memories only, one or more non-volatile memories only, or one or more volatile memories and one or more non-volatile memories.

The computer readable instructions/program code may be pre-programmed into the RIR processing apparatus 104. Alternatively, the computer readable instructions may arrive at the RIR processing apparatus 104 via an electromagnetic carrier signal or may be copied from a physical entity such as a computer program product, a memory device or a record medium such as a CD-ROM or DVD. The computer readable instructions may provide the logic and routines that enable the devices/apparatuses to perform the functionality described above. The combination of computer-readable instructions stored on memory (of any of the types described above) may be referred to as a computer program product.

Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "memory" or "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.

Reference to, where relevant, "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing apparatus" etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor, or firmware such as the programmable content of a hardware device, whether instructions for a processor or configuration settings for a fixed function device, gate array, programmable logic device, etc.

Overall algorithm description

The following is a description of one way in which far-field audio signals may be processed to obtain a short-time Fourier transform (STFT). The far-field audio recording device 101 comprises a microphone array composed of far-field microphones with indexes $c = 1, \ldots, C$, which captures a mixture of source signals with indexes $p = 1, \ldots, P$ and their signals $x^{(p)}(n)$ sampled at discrete time instances indexed by $n$. The sound sources may be moving and have time-varying mixing properties, denoted by the room impulse response (RIR) $h_{c,n}^{(p)}(\tau)$ for each channel $c$ at each time index $n$. Some of the sound sources (e.g. a speaker, a car, a piano or any other sound source) have Lavalier microphones 102 close to them. The resulting mixture signal can be given as:

$$y_c(n) = \sum_{p=1}^{P} \sum_{\tau} x^{(p)}(n - \tau)\, h_{c,n}^{(p)}(\tau) + n_c(n) \quad \text{(Equation 1)}$$

wherein:

$y_c(n)$ is the audio mixture in time domain for each channel index $c$ of the far-field audio recording device 101, i.e. the signal received at each far-field microphone;

$x^{(p)}(n)$ is the $p$th near-field source signal in time domain (source index $p$);

$h_{c,n}^{(p)}(\tau)$ is the partial impulse response in time domain (sample delay index $\tau$), i.e. the room impulse response;

$n_c(n)$ is the noise signal in time domain.

Applying the short time Fourier transform (STFT) to the time-domain array signal allows expressing the capture in the time-frequency domain as:

$$y_{ft} = \sum_{p=1}^{P} \sum_{d=0}^{D-1} h_{ftd}^{(p)} x_{f,t-d}^{(p)} + n_{ft} = \sum_{p=1}^{P} \hat{x}_{ft}^{(p)} + n_{ft} \quad \text{(Equation 2)}$$

wherein:

$y_{ft}$ is the STFT of the array mixture (frequency and frame indices $f$, $t$);

$x_{ft}^{(p)}$ is the STFT of the $p$th near-field source signal (source index $p$);

$h_{ftd}^{(p)}$ is the room impulse response (RIR) in STFT domain (frame delay index $d$);

$\hat{x}_{ft}^{(p)}$ is the STFT of the $p$th reverberated (filtered/projected) source signal;

$n_{ft}$ is the STFT of the noise signal.

The STFT of the array signal is denoted by $y_{ft} = [y_{ft1}, \ldots, y_{ftC}]^T$ where $f$ and $t$ are frequency and time frame index, respectively. The source signal as captured by the microphone array of the far-field audio recording device 101 is modelled by convolution between the near-field source STFT $x_{ft}^{(p)}$ and its frequency-domain RIR $h_{ftd}^{(p)} = [h_{ftd1}, \ldots, h_{ftdC}]^T$. The length of the convolutive frequency-domain RIR is $D$ timeframes, which can vary from a few timeframes to several tens of frames depending on the STFT window length and the maximum effective amount of reverberation components in the recording environment. This model differs from the usual assumption of instantaneous mixing in the frequency domain, with mixing consisting of complex-valued weights only for the current timeframe. The additive uncorrelated noise is denoted by $n_{ft} = [n_{ft1}, \ldots, n_{ftC}]^T$.
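For illustration, the convolutive model of Equation 2 can be sketched in a few lines of Python. The following fragment is not part of the patent; NumPy, the dimensions and the random test data are assumptions chosen purely for demonstration. It synthesizes a far-field capture for a single source ($P = 1$) from its near-field STFT and a frequency-domain RIR:

```python
import numpy as np

# Dimensions (assumed for illustration): F frequency bins, T frames,
# C array channels, D RIR length in STFT frames.
F, T, C, D = 513, 200, 4, 8

rng = np.random.default_rng(0)
x = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))   # near-field STFT x_ft
h = 0.1 * rng.standard_normal((F, T, D, C))                          # RIR h_ftd per channel
n = 0.01 * (rng.standard_normal((F, T, C)) + 1j * rng.standard_normal((F, T, C)))

# Equation 2: y_ft = sum_d h_ftd * x_{f,t-d} + n_ft (single source, p = 1)
y = np.zeros((F, T, C), dtype=complex)
for t in range(T):
    for d in range(min(D, t + 1)):      # frames before t = 0 are unavailable
        y[:, t, :] += h[:, t, d, :] * x[:, t - d, None]
y += n
```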

The reverberated source signals are denoted by $\hat{x}_{ft}^{(p)}$. The way in which RIR measurements are obtained in accordance with various embodiments will now be explained with reference to Figure 3, which is a flow chart illustrating various steps taken in embodiments of the invention. The process starts at step 3.1.

At step 3.2 an audio signal $y_c(n)$ is received from the far-field audio recording device 101. At step 3.3 an audio signal $x^{(p)}(n)$ is received from the near-field audio recording device 102 for those sound sources provided with a near-field audio recording device 102.

At step 3.4, the location of the mobile source is determined. The location can be determined using information received from a tag with which the mobile source is provided. Alternatively, the location may be calculated using multilateration techniques described below.

At step 3.5, a short-time Fourier transform (STFT) is applied to both far-field and near- field audio signals. Alternative transforms may be applied to the audio signals as described below.

In some embodiments, time differences between the near-field and far-field audio signals can be taken into account. However, if the time differences are large (several hundreds of milliseconds or more) a rough alignment may be done prior to the process commencing. For example, if a wireless connection between a near-field microphone and RIR processor causes a delay, the delay may be manually fixed by delaying the other signals in the RIR processor or by an external delay processor which may be implemented as hardware or software.

A signal activity detection (SAD) may be estimated from the near-field signal in order to determine when the RIR estimate is to be updated. For example, if a source does not emit any signal over a time period, its RIR value does not need to be estimated.

The STFT values $y_{ft}$ and $x_{ft}^{(p)}$ are input to the RIR estimator 202 at RIR estimation step 3.6. The RIR estimation may be performed using a block-wise linear least squares (LS) projection in offline operation mode, that is, where the RIR estimation is performed as part of a calibration operation. Alternatively, a recursive least squares (RLS) algorithm may be used in real-time operation mode, that is, where the RIR estimation occurs during a performance itself. In other embodiments, the RLS algorithm may be used in offline operation instead of the block-wise linear LS algorithm. In any case, as a result, a set of RIR filters in the time-frequency domain is obtained. The process ends at step 3.7.
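A possible form of the signal activity detection mentioned above is a simple per-frame energy gate. This is a minimal sketch under our own assumptions (the patent does not prescribe a particular SAD, and the threshold value is arbitrary):

```python
import numpy as np

def frame_activity(x_stft: np.ndarray, threshold_db: float = -50.0) -> np.ndarray:
    """Return a boolean mask of active frames from a near-field STFT (F x T).

    A frame is considered active when its energy exceeds `threshold_db`
    relative to the loudest frame; inactive frames need no RIR update.
    """
    energy = np.sum(np.abs(x_stft) ** 2, axis=0)        # per-frame energy
    energy_db = 10.0 * np.log10(energy + 1e-12)
    return energy_db > (energy_db.max() + threshold_db)
```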

Block-wise linear least squares projection

The RIR $h_{ftd}^{(p)}$ can be thought of as a projection operator from the near-field signal space (i.e. "dry" signals) to the far-field signal space (array capture in the case of multiple channels, i.e. "wet" signals). The projection is time, frequency and channel dependent. The parameters of the RIR can be estimated using linear least squares (LS) regression, which is equivalent to finding the projection between the near-field and far-field signal spaces.

The method of LS regression for estimating RIR values may be applied for moving sound sources by processing the input signal in blocks of approximately 500 ms, and the RIR values may be assumed to be stationary within each block. Block-wise processing with moving sources assumes that the difference between RIR values associated with adjacent frames is relatively small and remains stable within the analysed block. This is valid for sound sources that move at low speeds in an acoustic environment where small changes in source position with respect to the receiver do not cause substantial change in the RIR value.

The method of LS regression is applied individually for each source signal in each channel of the array. Additionally, the RIR values are frequency dependent and each frequency bin of the STFT is processed individually. Thus, in the following discussion it should be understood that the processing is repeated for all channels and all frequencies. Assuming a block of STFT frames with indices $t, \ldots, t + T$ where the RIR is assumed stationary inside the block, the mixture signal STFT with the convolutive frequency-domain mixing can be given as:

$$\mathbf{y} = \mathbf{X}\mathbf{h} \quad \text{(Equation 3)}$$

wherein $\mathbf{y}$ is a vector of far-field STFT coefficients obtained from the far-field audio recording device 101 from frame $t$ to $t + T$; $\mathbf{X}$ is a matrix containing the near-field STFT coefficients starting from frame $t - 0$ and the delayed versions starting from $t - 1, \ldots, t - (D - 1)$; and $\mathbf{h}$ is the RIR to be estimated.

The length of the RIR filter to be estimated is $D$ STFT frames. The block length is $T + 1$ frames, and $T + 1 > D$ in order to avoid overfitting due to an overdetermined model.

The above equation (3) can be expressed as:

$$\begin{bmatrix} y_t \\ y_{t+1} \\ \vdots \\ y_{t+T} \end{bmatrix} = \begin{bmatrix} x_t & x_{t-1} & \cdots & x_{t-D+1} \\ x_{t+1} & x_t & \cdots & x_{t-D+2} \\ \vdots & & \ddots & \vdots \\ x_{t+T} & x_{t+T-1} & \cdots & x_{t+T-D+1} \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{D-1} \end{bmatrix} \quad \text{(Equation 4)}$$

and, assuming that data before the first frame index $t$ is not available, the model becomes:

$$\begin{bmatrix} y_t \\ y_{t+1} \\ \vdots \\ y_{t+T} \end{bmatrix} = \begin{bmatrix} x_t & 0 & \cdots & 0 \\ x_{t+1} & x_t & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ x_{t+T} & x_{t+T-1} & \cdots & x_{t+T-D+1} \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{D-1} \end{bmatrix} \quad \text{(Equation 5)}$$

The linear LS solution minimizes:

$$\min_{\mathbf{h}} \sum_{t} \left( y_t - \sum_{d=0}^{D-1} x_{t-d} h_d \right)^2 = \min_{\mathbf{h}} \lVert \mathbf{y} - \mathbf{X}\mathbf{h} \rVert^2 \quad \text{(Equation 6)}$$

and the minimum is achieved as:

$$\hat{\mathbf{h}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \quad \text{(Equation 7)}$$

The projected source signal for a single block can be trivially obtained as:

$$\hat{x}_t = \sum_{d=0}^{D-1} x_{t-d} \hat{h}_d \quad \text{(Equation 8)}$$

A subsequent removal of a particular source signal from the audio mixture is a simple subtraction:

$$\hat{y}_t = y_t - \hat{x}_t \quad \text{(Equation 9)}$$

Equation 9 demonstrates the removal of a particular source signal from the audio mixture. As well as removing a source from the audio mixture, it is also possible to add the effect of a source to the audio mix. This may be done by using addition instead of subtraction with a user-specified gain.
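Equations 5 to 9 can be realized per frequency bin and per channel in compact form. The sketch below is a minimal illustration under the block-stationarity assumption of the text; the function and variable names are our own, and `numpy.linalg.lstsq` is used in place of the explicit normal equations of Equation 7 for numerical stability (for complex STFT data the transposes become conjugate transposes):

```python
import numpy as np

def estimate_rir_block(x: np.ndarray, y: np.ndarray, D: int) -> np.ndarray:
    """Block-wise LS RIR estimate for one frequency bin and one channel.

    x, y: complex STFT coefficient sequences of the near-field source and the
    far-field mixture over one block of T+1 frames (T + 1 > D).
    Returns h, the length-D frequency-domain RIR (Equation 7).
    """
    T1 = len(x)
    X = np.zeros((T1, D), dtype=complex)
    for d in range(D):                  # delayed versions, zeros before frame t (Equation 5)
        X[d:, d] = x[:T1 - d]
    h, *_ = np.linalg.lstsq(X, y, rcond=None)
    return h

def project_and_remove(x: np.ndarray, y: np.ndarray, h: np.ndarray):
    """Equations 8 and 9: project the near-field signal and subtract it."""
    D = len(h)
    X = np.zeros((len(x), D), dtype=complex)
    for d in range(D):
        X[d:, d] = x[:len(x) - d]
    x_proj = X @ h                      # Equation 8
    return x_proj, y - x_proj           # Equation 9
```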

System calibration and RIR database collection

The RIR estimation presented in embodiments of the present invention allows removal of a target source from the audio mixture or addition of a source to the audio mixture of the far-field audio recording device 101. Based on target source direction of arrival (DOA) trajectory or location estimates of the target source, the signal emitted by the source can be replaced by augmenting separate content to the array mixture of the far-field audio recording device 101. The problem of augmenting separate signals using the RIR values estimated from the target source in prior approaches lies in the fact that the source signal is not broadband and estimates of RIR values from frequencies with no signal energy emitted are unreliable. Having different spectral content (source signal frequency occupancy in each frame) leads to poor subjective quality of the synthesized augmented source since accurate RIR data for all frequencies are not available.

To overcome this problem, embodiments herein described provide a calibration method with a constant broadband signal which is used to estimate and store RIR values from substantially all possible locations of the recording space. The purpose of the calibration stage is that reliable broadband RIR data from all positions of the recording space are captured before the actual operation (i.e. before an audio recording or broadcast). The location data may be either relative or absolute, such as GPS coordinates.

During the operation stage itself (i.e. during a recording or broadcast), the target source is removed from the mixture using the block-wise LS or RLS method described above. The direction of arrival (DOA) is estimated either acoustically or using other localization techniques.

There is a variety of ways in which the DOA may be estimated. In some embodiments, the estimated RIR value in the time domain relating to each channel of the array of the far-field audio device 101 is analysed. The first received RIR sample that is above a threshold gives an estimate of the delay at which the sound arrives at the nearest microphone of the far-field audio device 101. Comparing the delays from all microphones of the far-field audio device 101 provides the time differences of arrival (TDOA) between microphones in the array of the far-field audio device 101. From these values the direction can be calculated using multilateration methods that are known in the art.
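As a rough sketch of the TDOA step just described (the threshold and the multilateration details are our own assumptions; the text leaves them open), the per-channel delay can be taken as the first time-domain RIR sample above a fraction of the peak, and a far-field direction derived from the delay difference of one microphone pair:

```python
import numpy as np

def first_arrival(rir: np.ndarray, fs: float, rel_threshold: float = 0.2) -> float:
    """Delay (seconds) of the first RIR sample above a fraction of the peak."""
    idx = np.argmax(np.abs(rir) > rel_threshold * np.max(np.abs(rir)))
    return idx / fs

def pairwise_doa(delay_a: float, delay_b: float, mic_distance: float,
                 c: float = 343.0) -> float:
    """Far-field DOA (radians) for one microphone pair from a TDOA."""
    # sin(theta) = c * tdoa / d, clipped to the valid range
    s = np.clip(c * (delay_a - delay_b) / mic_distance, -1.0, 1.0)
    return float(np.arcsin(s))
```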

The augmented source is synthesized using the target source DOA estimates for retrieving the RIR corresponding to each DOA from the database generated in the calibration stage. The length of the calibration stage depends on the size of the recording space and the required density of the database. The length of the calibration stage may vary from around 10 seconds to several minutes.

Figure 4 is a plan view of a recording space 103 in accordance with an embodiment whereby audio data is recorded as part of a calibration stage. A speaker 400 is provided with a near-field microphone 102 such as a Lavalier microphone or a handheld microphone. The speaker 400 may also be provided with a location tag 401. A far-field audio recording device 101 is provided towards the centre of the recording space 103. During the calibration stage, the speaker 400 walks around the recording space 103 along a trajectory T. The speaker 400 speaks so that audio data is recorded by both the far-field audio recording device 101 and the near-field microphone 102. The person may also be playing an instrument or carrying a sound producing loudspeaker.

The room impulse response (RIR) data are collected around the place of performance based on the captured "dry" and "wet" signals as well as available position data from the location tag 401. The RIR data are estimated based on the dry to wet signal transfer function at every relevant position with a processing unit using one of the algorithms described above.

Figure 5 is a plan view of a recording space 103 in accordance with another embodiment whereby audio data is recorded as part of a calibration stage. In this embodiment, two drones 500 are provided. Each drone 500 is provided with a near-field microphone 102. Each of the drones 500 emits a noise, either through a loudspeaker or merely from the drone rotors. Two or more far-field audio recording devices 101 are also provided.

The RIR database 105 may be collected during an initial calibration phase where an audio source of wideband noise, for example white noise, MLSA sequence, pseudo random noise, or a talking human, an acoustic instrument, a flying drone with speaker or a ground based robot, is moving or is moved around the recording space 103 either manually or automatically.

The benefit of having some calibration recordings and database collection prior to an actual performance is that the pre-existing RIR database 105 can be used during the performance to insert additional sound sources to the audio mix in real-time.

Additionally, when wideband noise is used for calibration, the RIR data are more accurate over the whole spectrum. The recording stage will also have higher SNR available, for example when the audience is missing from the recording space 103. This may provide more accurate and/or faster RIR measurements.

In other embodiments, RIR data may be collected during the performance itself. This may be instead of the calibration phase described above or in addition to the calibration phase. In the latter scenario, the reliability of the RIR data captured during the calibration process described above using the block-wise linear least squares projection may be improved by capturing further RIR data during the performance itself. As mentioned above, estimated RIR data are generally valid only for the frequency indices at which the source produced meaningful acoustic output. Usually RIR data are applied to the same close-field signal and no mismatch between time-frequency content and RIR data occurs. However, for example in the case of augmenting a completely new near-field signal which is very different from the RIR data available, the RIR data need to be broadband and valid at least for the STFT frequency indices where the augmented signal has significant energy.

In order to avoid the active calibration with a known broadband signal, a method for passive online RIR database collection is provided in some embodiments. RIR data estimated at each position of the recording space 103 are used to gradually build a database of broadband RIR data by combining estimates at different times from the same location within the recording space 103. The recent magnitude spectrum of the near-field signal can be used as an indicator of the reliability of the RIR data, and only frequency indices with substantial signal energy are updated in the database. The database update can vary from a simple weighted average to more advanced combinations based on probabilistic modelling and machine learning in general.
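One possible realization of the weighted-average database update described above is sketched below. The data layout (a dictionary keyed by position) and the reliability threshold are our own assumptions; the text also permits probabilistic or machine-learned combination rules:

```python
import numpy as np

def update_rir_database(db: dict, position: tuple, h_new: np.ndarray,
                        x_mag: np.ndarray, energy_floor_db: float = -40.0,
                        alpha: float = 0.8) -> None:
    """Merge a new RIR estimate (F x D) into the database entry for `position`.

    x_mag (length F) is the recent near-field magnitude spectrum, used as a
    reliability indicator: only bins with substantial energy are updated.
    """
    level_db = 20.0 * np.log10(x_mag + 1e-12)
    reliable = level_db > (level_db.max() + energy_floor_db)   # bins worth trusting
    if position not in db:
        db[position] = h_new.copy()
        return
    h_old = db[position]
    # exponentially weighted average on reliable bins only
    h_old[reliable] = alpha * h_old[reliable] + (1.0 - alpha) * h_new[reliable]
```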

In some embodiments, real-time RIR estimation may be performed by using a recursive least squares (RLS) algorithm. The signal model, consisting of convolutive mixing in the time-frequency domain, may be defined as:

$$y_{ft} = \sum_{p=1}^{P} \sum_{d=0}^{D-1} h_{ftd}^{(p)} x_{f,t-d}^{(p)} + n_{ft} = \sum_{p=1}^{P} \hat{x}_{ft}^{(p)} + n_{ft} \quad \text{(Equation 10)}$$

In real-time operation the filter weights vary for each time frame $t$ and, again dropping the frequency index $f$ and the channel dimension, the filtering equation for a single source at time frame $t$ may be specified as:

$$\hat{x}_t = \sum_{d=0}^{D-1} x_{t-d} h_{t,d} = \bar{\mathbf{x}}_t^T \mathbf{h}_t \quad \text{(Equation 11)}$$

where $\bar{\mathbf{x}}_t = [x_t, x_{t-1}, \ldots, x_{t-D+1}]^T$ contains the current and delayed near-field STFT coefficients.

Efficient real-time operation can be achieved with recursive estimation of the RIR filter weights $\mathbf{h}_t$ using the recursive least squares (RLS) algorithm. The modelling error for timeframe $t$ may be specified as:

$$e_t = y_t - \hat{x}_t \quad \text{(Equation 12)}$$

where $y_t$ is the observed/desired mixture signal.

The cost function to be minimized with respect to the filter weights may be expressed as:

$$C(\mathbf{h}_t) = \sum_{j=0}^{t} \lambda^{t-j} e_j^2, \quad 0 < \lambda \leq 1 \quad \text{(Equation 13)}$$

which accumulates the estimation error from past frames with exponential weight $\lambda^{t-j}$. The weight of the cost function can be thought of as a forgetting factor which determines how much past frames contribute to the estimation of the RIR filter weights at the current frame. RLS algorithms where $\lambda < 1$ may be referred to in the art as exponentially weighted RLS, and $\lambda = 1$ may be referred to as growing window RLS.

The RLS algorithm minimizing Equation 13 is based on recursive estimation of the inverse correlation matrix $\mathbf{P}_t$ of the close-field signal and the optimal filter weights $\mathbf{h}_t$, and can be summarized as:

Initialization:

$$\mathbf{h}_0 = \mathbf{0}, \quad \mathbf{P}_0 = \delta^{-1} \mathbf{I}$$

Repeat for $t = 1, 2, \ldots$:

$$\mathbf{g}_t = \frac{\mathbf{P}_{t-1} \bar{\mathbf{x}}_t}{\lambda + \bar{\mathbf{x}}_t^T \mathbf{P}_{t-1} \bar{\mathbf{x}}_t}, \quad a_t = y_t - \bar{\mathbf{x}}_t^T \mathbf{h}_{t-1}, \quad \mathbf{h}_t = \mathbf{h}_{t-1} + a_t \mathbf{g}_t, \quad \mathbf{P}_t = \lambda^{-1}\left(\mathbf{P}_{t-1} - \mathbf{g}_t \bar{\mathbf{x}}_t^T \mathbf{P}_{t-1}\right) \quad \text{(Equations 14)}$$

The initial regularization of the inverse autocorrelation matrix is achieved by defining $\delta$ using a small positive constant, typically from $10^{-2}$ to $10^{1}$. A small $\delta$ value causes faster convergence, whereas a larger $\delta$ value constrains the initial convergence to happen over a longer time period (for example, over a few seconds).
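A direct transcription of Equations 14 for one frequency bin and one channel might look as follows. This is a sketch only: the conjugations follow the standard complex-valued RLS convention (the text uses real-valued transposes), and the values of λ and δ are examples:

```python
import numpy as np

def rls_rir(x: np.ndarray, y: np.ndarray, D: int,
            lam: float = 0.98, delta: float = 1e-2) -> np.ndarray:
    """Exponentially weighted RLS estimate of a length-D RIR (Equations 14).

    x, y: complex STFT coefficient sequences (one bin, one channel) of the
    near-field source and the far-field mixture.
    """
    h = np.zeros(D, dtype=complex)
    P = np.eye(D, dtype=complex) / delta       # P_0 = delta^-1 I
    xbar = np.zeros(D, dtype=complex)          # [x_t, x_{t-1}, ..., x_{t-D+1}]
    for t in range(len(x)):
        xbar = np.roll(xbar, 1)
        xbar[0] = x[t]
        xc = np.conj(xbar)
        g = P @ xc / (lam + xbar @ P @ xc)     # gain vector
        a = y[t] - xbar @ h                    # a priori error (Equation 12)
        h = h + a * g
        P = (P - np.outer(g, xbar @ P)) / lam
    return h
```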

The contribution of past frames to the RIR filter estimate at the current frame $t$ may be varied over frequency. Generally, the forgetting factor $\lambda$ acts in a similar way to the analysis window shape in the truncated block-wise least squares algorithm. However, small changes in source position can cause substantial changes in the RIR filter values at high frequencies due to highly reflected and more diffuse sound propagation paths. Therefore, the contribution of past frames at high frequencies needs to be lower than at low frequencies. It is assumed that the RIR parameters change slowly at lower frequencies and source evidence can be integrated over longer periods, meaning that the exponential weight $\lambda^{t-j}$ can have substantial values for frames up to 1.5 seconds in the past.

A similar regularization as described above with reference to block-wise LS may also be adopted for the RLS algorithm. The regularization is done to achieve a similar effect as in block-wise LS: to improve robustness towards low-frequency crosstalk between near-field signals and to avoid excessively large RIR weights. The near-field microphones are generally not directive at low frequencies and can pick up a fair amount of low-frequency signal content generated by noise sources, for example traffic, loudspeakers, etc.

In order to specify regularization of the RIR filter estimates, the RLS algorithm is given in direct form. In other words, the RLS algorithm is given without using the matrix inversion lemma to derive updates directly to the inverse autocorrelation matrix $\mathbf{P}_t$, but instead for the autocorrelation matrix $\mathbf{R}_t$ ($\mathbf{R}_t^{-1} = \mathbf{P}_t$). The formulation can be found, for example, in T. van Waterschoot, G. Rombouts, and M. Moonen, "Optimally regularized recursive least squares for acoustic echo cancellation," in Proceedings of the Second Annual IEEE BENELUX/DSP Valley Signal Processing Symposium (SPS-DARTS 2006), Antwerp, Belgium, 2005, pp. 28-29.

The direct-form RLS algorithm updates are specified as:

Initialization:

$$\mathbf{h}_0 = \mathbf{0}, \quad \mathbf{R}_0 = \delta^{-1} \mathbf{I}$$

Repeat for $t = 1, 2, \ldots$:

$$\mathbf{R}_t = \lambda \mathbf{R}_{t-1} + \bar{\mathbf{x}}_t \bar{\mathbf{x}}_t^T, \quad \mathbf{h}_t = \mathbf{h}_{t-1} + \mathbf{R}_t^{-1} \bar{\mathbf{x}}_t a_t \quad \text{(Equation 15)}$$

where $a_t = y_t - \bar{\mathbf{x}}_t^T \mathbf{h}_{t-1}$ is the a priori error as before. This algorithm would give the same result as the RLS algorithm discussed above but requires an operation for calculating the inverse of the autocorrelation matrix, and is thus computationally more expensive; however, it does allow regularization of that matrix. The autocorrelation matrix update with Levenberg-Marquardt regularization (LMR), according to van Waterschoot et al. (cited above), is:

$$\mathbf{R}_t = \lambda \mathbf{R}_{t-1} + \bar{\mathbf{x}}_t \bar{\mathbf{x}}_t^T + (1 - \lambda)\beta_{\mathrm{LMR}} \mathbf{I} \quad \text{(Equation 16)}$$

where $\beta_{\mathrm{LMR}}$ is obtained from the regularization kernel $k_f$, which increases towards low frequencies, weighted by the inverse average log-spectrum of the close-field signal $(1 - e_f)$, as discussed above with respect to the block-wise LS algorithm.

Another type of regularization is Tikhonov regularization (TR), as also introduced in the case of block-wise LS, which can be defined for the RLS algorithm as:

$$\mathbf{R}_t = \lambda \mathbf{R}_{t-1} + \bar{\mathbf{x}}_t \bar{\mathbf{x}}_t^T + (1 - \lambda)\beta_{\mathrm{TR}} \mathbf{I} \quad \text{(Equation 17)}$$

$$\mathbf{h}_t = \mathbf{h}_{t-1} + \mathbf{R}_t^{-1}\left(\bar{\mathbf{x}}_t a_t - (1 - \lambda)\beta_{\mathrm{TR}} \mathbf{h}_{t-1}\right) \quad \text{(Equation 18)}$$

Similarly as before, $\beta_{\mathrm{TR}}$ is based on the regularization kernel and the inverse average log-spectrum of the close-field signal. It should be noted that the kernel $k_f$ needs to be modified to account for the differences between the block-wise LS and RLS algorithms, and can depend on the level difference between the close-field signal and the far-field mixtures.
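A single direct-form update step with the LMR regularization of Equation 16 (and the filter update of Equation 15) could be sketched as follows; β would in practice come from the regularization kernel and close-field log-spectrum described above, and the conjugations are again the standard complex-data convention:

```python
import numpy as np

def direct_rls_step(h: np.ndarray, R: np.ndarray, xbar: np.ndarray,
                    y_t: complex, lam: float, beta_lmr: float):
    """One direct-form RLS update with Levenberg-Marquardt regularization.

    Implements the R_t update of Equation 16 and the filter update of
    Equation 15; beta_lmr is the frequency-dependent regularization weight.
    """
    D = len(h)
    R = lam * R + np.outer(np.conj(xbar), xbar) + (1.0 - lam) * beta_lmr * np.eye(D)
    a = y_t - xbar @ h                        # a priori error
    h = h + np.linalg.solve(R, np.conj(xbar) * a)
    return h, R
```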

In addition to the regularization weight being adjusted based on the average log-spectrum, it can also be varied based on the RMS level difference between near-field and far-field signals. The RMS levels of these signals might not be calibrated in real-time operation, and thus an additional regularization weight strategy is required. A trivial low-pass filter applied to the RMS of each individual STFT frame can be used to track the varying RMS level of the close-field and far-field signals. The estimated RMS level is used to adjust the regularization weights $\beta_{\mathrm{LMR}}$ or $\beta_{\mathrm{TR}}$ in order to achieve a similar regularization impact as with the RMS-calibrated signals assumed in the earlier equations.

Additional RIR data to be inserted into the RIR database 105 may be collected during the actual performance. This can be done in order to add more data points, for example to make the RIR position database grid denser, or to sense time-varying responses, for example when more people come inside the room and dampen it. Time-varying responses may also be useful in post-production if some original performances are edited and later added back to the original recording space 103.

Figure 6 illustrates a recording environment whereby a target source 601 is removed from the audio mixture and replaced with a replacement source 602 at the same position. Based on the target source DOA trajectory or location estimates obtained from a location tag of the target source 601, the signal emitted by the target source 601 can be replaced by augmenting separate content to the array mixture. An example scenario of this simple method, replacing a speaker inside a room with another person, is shown in Figure 6. The replacement of a target source may be done using real-time RIR estimation, where no RIR database 105 need be used. Alternatively, a calibration phase may be performed with respect to the recording space 103, as described above.

A drawback of augmenting separate signals using the RIR data estimated from the target source 601 in real time lies in the fact that the target source signal may not be broadband, and estimates of RIR data at frequencies with no emitted signal energy may be unreliable. Where the target source 601 and the replacement source 602 have different spectral content (i.e. source signal frequency occupancy in each frame), poor subjective quality of the synthesized augmented source may result, since accurate RIR data for all frequencies may not be available.

In other embodiments a calibration phase is used to build up an RIR database 105, as described above. The RIR data in the RIR database 105 that are collected with wideband noise are accurate and reliable over the whole frequency spectrum. Using these pre-collected RIR data enables higher quality replacement of the audio source.

A selection of a position within the recording space is received. This may be the position of the target source 601 received from any location determination method described above. A near-field audio signal is received from the target source 601.

An RIR filter relating to the position of the target source is identified.

The identified room impulse response filter is then applied to the near-field audio signal of the target source to project the near-field audio signal of the target source into a far-field space. As explained above, this RIR filter may be calculated in real-time.

The projected near-field audio signal may then be removed from the audio mixture, as shown in Equation 9 above. A near-field audio signal from the replacement source 602 is received.

A room impulse response filter relating to the position within the recording space is identified. This may be the same room impulse response filter used to remove the target source. Alternatively, the room impulse response filter applied to the near-field audio signal of the replacement source 602 may be retrieved from a room impulse response filter database collected during a calibration phase.

The selected room impulse response filter is then applied to the near-field audio signal of the replacement source 602 to obtain a projected near-field audio signal of the replacement source 602. The audio mixture of the far-field microphone device may then be augmented by adding the projected near-field audio signal of the replacement source 602 to the audio mixture. As such, the target source 601 is removed and replaced with the replacement source 602.
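Pulling the preceding steps together, the replacement workflow can be summarized in a short sketch (one bin and one channel; the helper and its names are our own, reusing the projection of Equation 8 and the subtraction of Equation 9):

```python
import numpy as np

def project(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Equation 8: convolve the near-field STFT sequence with the RIR frames."""
    D = len(h)
    X = np.zeros((len(x), D), dtype=complex)
    for d in range(D):
        X[d:, d] = x[:len(x) - d]
    return X @ h

def replace_source(y, x_target, x_repl, h, gain=1.0):
    """Remove the target source from the mixture and add the replacement,
    both projected through the same position-specific RIR (one bin/channel)."""
    y_clean = y - project(x_target, h)            # Equation 9: remove target
    return y_clean + gain * project(x_repl, h)    # augment with replacement
```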

Figure 7 illustrates a recording environment whereby a completely new near-field signal recorded from a new source 701 located outside the recording space 103 is inserted into the audio mix of the far-field audio recording device 101. In this case of adding a completely new near-field signal to the augmented mix, the RIR data need to be broadband and valid at least for the STFT frequency indices where the augmenting signal has significant energy. A user may wish for the new source 701 to be added to the recording space 103 at a particular virtual location within the recording space 103. Based on this specified virtual location, the new signal can be used to augment the content to the audio mixture recorded by the far-field microphone array of the far-field audio recording device 101.

For example, a virtual person can be visually rendered to an AR view and at the same time the audio can be rendered in such a way that it sounds as though the new source 701 is standing at the location at which the source appears visually in AR. An example scenario of this method to add a virtual speaker to a room is shown in Figure 7.

Rendering a virtual speaker to a room using some advanced AR, VR or 6DoF rendering requires a large amount of RIR data. For example, there may be more than one far-field audio recording device 101-1 and 101-2 and several new sound sources 701 to be rendered at the same time. The 6DoF usage scenario requires that rendering from a first position to a second position is possible (in 6DoF, the listener, for whom the audio is being rendered in playback, can move freely anywhere in the virtual environment).

Embodiments of the invention use the RIR data from the RIR database 105 to render the audio objects with naturally sounding presence in any location within the scene.

Having time varying RIR responses may be useful in post-production if some original performances are edited and later added back to the original recording space 103. In practice this requires that the most recent time-stamped RIR data are obtained from the RIR database 105 in addition to the selected position.

A near-field audio signal from the new source 701 is received, to be added to a far-field audio mixture at a selected position of a recording space 103. A room impulse response filter relating to the selected position within the recording space is identified. The room impulse response filter applied to the near-field audio signal of the new source 701 may be retrieved from a room impulse response filter database collected during a calibration phase.

The selected room impulse response filter is then applied to the near-field audio signal of the new source 701 to obtain a projected near-field audio signal of the new source 701. The audio mixture of the far-field microphone devices 101 may then be augmented by adding the projected near-field audio signal of the new source 701 to the audio mixture.

Herein, the term filter length refers to the length of an RIR filter applied to a particular source. The filter length is variable across different sources and also depends on the frequency bin. Different sources may have different RIR filter lengths due to their spectral content and absolute volume. Furthermore, the RIR filter length may have to be longer at low frequencies compared to higher frequencies due to the physical properties of acoustic spaces. In practice, the RIR filter length is longer at lower frequencies and reduces gradually when going towards higher frequencies. This may be due to the physical properties of the recording space.

Hitherto, the selection of RIR filter length for a particular signal in a particular acoustic space has relied solely on expertise or trial and error. For example, loud, bass-heavy signals in large halls may require very long RIR filter lengths. On the other hand, a quiet whisper in a well-insulated recording booth may only require very short filter lengths.

For example, a bass drum in a night club may require an RIR filter length of 64 frames, which corresponds to 683 milliseconds. Whispering in a vocal booth may require a filter length of fewer than 4 frames (i.e. 43 milliseconds). In both of these cases, the maximum RIR length of either 64 or 4 frames is mostly needed at lower frequencies. Starting from about half of the frequency range (e.g. 12 kHz, or frequency bin 256), the RIR filter length can be reduced significantly; in the above cases, to 24 frames and 2 frames respectively. This reduction of RIR filter length helps both the calculation complexity and the performance, since it prevents overmodelling. In one embodiment of the invention, the RIR filter length may be adjusted based on a reverberation time. Additionally, or as an alternative, the RIR filter length may be dependent on the dimensions of the recording space 103. For example, a small, well-insulated room may have a much shorter reverberation time than a large room covered mainly in hard materials.

The RIR filter lengths D_pf for optimal subjective projection quality may be automatically estimated based on the reverberation time of the recording space 103. R. Ratnam, D. L. Jones, B. C. Wheeler, W. D. O'Brien Jr, C. R. Lansing, and A. S. Feng, "Blind estimation of reverberation time," The Journal of the Acoustical Society of America, vol. 114, no. 5, pp. 2877-2892, 2003, and J. Y. Wen, E. A. Habets, and P. A. Naylor, "Blind estimation of reverberation time based on the distribution of signal decay rates," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 329-332, describe methods of blind reverberation time determination which may be used in embodiments of the present invention.

Blind reverberation time estimation may be used to determine the required filter length for the RIR database 105, in order to match the RIR filter length with the reverberation characteristics of the recording space 103. The relation between the reverberation time RT60 in seconds and the RIR filter length can typically be such that the length of the RIR filter in absolute time (seconds) is within the range RT60/8 to RT60/2. The RIR filters are defined in the STFT domain with overlapping frames, and the length in frames can easily be converted to absolute time. Using RIR filters as long as the full 60 dB attenuation time (RT60) is unnecessary because, in a typical noisy recording scene, interference from other sources causes the effective dynamic range to rarely exceed 30 dB. In some embodiments, RT30 (the time before the signal has decayed below the ~30 dB level) should be measured and used instead of RT60. However, if RT30 is not available, then with the assumption of linear decay, RIR filter lengths can be estimated with a length corresponding to RT30 equal to RT60/2.

The reverberation time may either be measured prior to a recording session as part of a calibration phase, or the RIR length may be estimated based on the room size and reverberation properties of the room in comparison with the reverberation properties of similar recording spaces. Some experimental filter lengths are shown in Table 1. In Table 1, the sampling frequency is assumed to be 48 kHz and the STFT frame length is 1024 samples with an overlap of 512 samples. This means that the length of one RIR filter frame is 512/48000 seconds, i.e. 10.66 ms. For example, an RIR filter length of 16 frames translates to a 171 ms decay time that can be removed/modified by the algorithm.

Table 1
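A small sketch of the frame arithmetic used above and in Table 1, assuming the stated 48 kHz sampling frequency and 512-sample overlap (the function name and the fraction parameter are illustrative):

    def rir_filter_frames(rt60_s, sample_rate=48000, hop_samples=512,
                          fraction=0.5):
        # One RIR filter frame spans hop_samples / sample_rate seconds
        # (512 / 48000 s, i.e. ~10.66 ms, for the values in Table 1).
        # `fraction` picks a point in the RT60/8..RT60/2 range:
        # 0.5 -> RT60/2, 0.125 -> RT60/8.
        frame_s = hop_samples / sample_rate
        return max(1, round(rt60_s * fraction / frame_s))

    # e.g. rir_filter_frames(0.17) -> 8 frames; a 0.17 s RT60 at RT60/2
    # gives ~85 ms, i.e. 8 frames of ~10.66 ms each.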

In other embodiments, the RIR filter length is adjusted based on the spectral content of different near-field audio signals. Furthermore, the absolute sound energy may affect the RIR filter length: louder volumes echo for longer and thus require a longer RIR filter length. The RIR filter lengths D_pf for optimal subjective projection quality may be automatically estimated based on an analysis of the near-field signal spectral content. Features that may be automatically analysed for automatic estimation of the RIR filter length include one or more of: the spectral centroid, and the spectral mass at frequency bands, for example 1/3-octave bands or uniform bands on the Mel frequency scale. The lower the spectral centroid, or the more the near-field signal energy is concentrated below 100-200 Hz, the longer the RIR filters are required to be at lower frequencies, for example up to RT60/2 for STFT frequency bins corresponding to 500 Hz. In contrast, sources with no significant low frequency content can be assigned an RIR filter length at 500 Hz of only the order of RT60/4 to RT60/8.
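As a sketch, the spectral centroid and the low-frequency mass mentioned above could be computed from an averaged magnitude spectrum as follows; the mapping to an RT60 fraction follows the ranges given above, but the specific 0.5 mass threshold is an assumption for illustration:

    import numpy as np

    def spectral_features(mag, sample_rate, n_fft):
        # mag: time-averaged magnitude spectrum of the near-field signal,
        # length n_fft // 2 + 1.  Returns the spectral centroid in Hz and
        # the fraction of spectral mass below 200 Hz.
        freqs = np.arange(len(mag)) * sample_rate / n_fft
        total = max(mag.sum(), 1e-12)
        centroid = float((freqs * mag).sum() / total)
        low_mass = float(mag[freqs < 200.0].sum() / total)
        return centroid, low_mass

    def low_band_rt60_fraction(low_mass, threshold=0.5):
        # Illustrative mapping: strongly bass-weighted sources get RIR
        # filters up to RT60/2 at the low bins, others RT60/4..RT60/8.
        return 0.5 if low_mass > threshold else 0.25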

For example, a bass drum echoes for a longer time in a given space than a clarinet. Louder sounds reverberate for a longer absolute audible time (i.e. have content above the system noise floor) than quieter sounds. Furthermore, the RT60 reverberation time is longer at low frequencies, where room modes are strong. Quieter signals fade below the noise floor fairly soon and thus need not be processed for as long.

Table 2 shows how the RIR filter length can be adjusted based on different spaces and different signal types.

Table 2

In other embodiments, the RIR filter length is varied for each frequency component of each source.

In general, high frequency signals decay faster than low frequency signals; thus, for example, the formula below can be used to model the required RIR filter length for each frequency band.

D_pf = D_min + ⌊(D_max − D_min) × ((L_fft − f) / L_fft)^3⌋    (19)

where D_pf is the length of the RIR filter at STFT frequency index f; D_min is the minimum RIR filter length at high frequencies; D_max is the maximum RIR filter length at low frequencies; L_fft is the length of the STFT transform; and f is the STFT bin index, ranging from 0 to L_fft − 1.

In embodiments, the filter length decreases exponentially in relation to the linear frequency scale as the frequency index f increases. The minimum filter length is typically achieved around half of the frequency range (e.g. 12 kHz and upwards). Typical values of the filter length with respect to frequency are illustrated in Figure 8A, decreasing from a length of 28 frames to 16 frames in a cubic relation. The minimum filter length of 16 frames is achieved around 15 kHz. Figure 8B shows that at the lowest frequencies a maximum RIR filter length of D_max = 28 blocks is used. The STFT analysis filter length L_fft is 1024 samples. Since there is an overlap of 50%, the total RIR filter length is 512 × 28 samples, i.e. approximately 300 ms when the sampling frequency is 48 kHz. At the Nyquist frequency the filter length is set to D_min = 16 blocks, i.e. ~170 ms.
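A direct implementation of equation (19) as reconstructed above, using the D_max = 28, D_min = 16 and L_fft = 1024 values from Figures 8A and 8B; the floor and the exact form of the cubic term follow that reconstruction, which is an assumption about the original equation:

    import math

    def rir_length_at_bin(f, l_fft=1024, d_min=16, d_max=28):
        # Per-bin RIR filter length in STFT frames: D_max at bin 0,
        # falling cubically towards D_min at the top of the range.
        frac = (l_fft - f) / l_fft
        return d_min + math.floor((d_max - d_min) * frac ** 3)

    # rir_length_at_bin(0) -> 28 frames (~300 ms at 48 kHz, 512-sample hop);
    # rir_length_at_bin(1023) -> 16 frames (~170 ms).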

In some embodiments, recursive least squares (RLS) estimation of the RIR coefficients uses a frequency-dependent forgetting factor λ.

Higher frequencies decay faster in real life, and the algorithmic decay time is similarly adjusted for optimal estimation performance. The contribution of past frames to the RIR filter estimate at the current frame t may therefore be varied with frequency. Generally, the forgetting factor λ acts in a similar way to the analysis window shape in truncated block-wise least squares (LS). However, small changes in source position can cause substantial changes in the RIR values at high frequencies, owing to the highly reflected and more diffuse sound propagation path, and therefore the contribution of past frames at high frequencies needs to be lower than at low frequencies. It is assumed that the RIR parameters change slowly at lower frequencies, so source evidence can be integrated over longer periods, meaning that the exponential weight λ^(t−t') may have substantial values for frames up to 1.5 seconds in the past. In contrast, only past frames up to 0.5-0.8 seconds old can be reliably used to update the filter weights at high frequencies, and the error weight should be close to zero for frames older than that.
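A sketch of how these integration times might be turned into per-bin forgetting factors, assuming the ~10.66 ms frame hop used earlier; the choice of a 1% residual weight and the linear interpolation of horizons across bins are assumptions:

    import numpy as np

    def forgetting_factors(n_bins, frame_s=512 / 48000,
                           t_low=1.5, t_high=0.65, eps=0.01):
        # Choose lambda per bin so that the exponential weight
        # lambda ** (t - t') has decayed to eps after t_low seconds at
        # the lowest bin and t_high seconds at the highest bin
        # (matching the ~1.5 s and 0.5-0.8 s horizons discussed above).
        horizons_s = np.linspace(t_low, t_high, n_bins)
        horizons_frames = horizons_s / frame_s
        return eps ** (1.0 / horizons_frames)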

The total cumulative error minimised for each frequency bin may be expressed as

C(h_t) = Σ_{t'=0}^{t} λ^(t−t') e_{t'}^2,  0 < λ ≤ 1    (20)

In words, this equation means that the total cumulative error is a weighted sum of past squared errors for each frequency bin and past time instance. The weights depend on the frequency and time so that, at lower frequencies, cumulative error is accounted for over a longer time than at higher frequencies, where the error weight decays faster in time.

When removing a signal from the audio mixture, this may be done using the RIR filter estimated using the process described above with reference to Figure 3. This may be referred to as a full-length RIR. When adding a near-field source to the audio mixture, an RIR filter may be applied for clarity or artistic reasons, for example to reduce the ambience of the main vocalist. The output of an RIR filter having a truncated filter length (compared with the filter length of the original RIR filter) would still contain significant ambience, but shortening the long tail of the echo would make the vocal content easier to comprehend.

This allows the rendering of a "drier" synthesized output than a projection with the originally captured and estimated RIR. It keeps the main spatial cues intact (i.e. the direction and first echo reflection components) and clarifies an otherwise "muddy" output by rejecting the long tail part of the reverberation. This "dry-ified" projection is a very useful feature in post-production and when mixing live content.
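A minimal sketch of such a truncation, assuming the STFT-domain tap layout used in the projection sketch above:

    def truncate_rir(H, keep_frames):
        # H: RIR filter taps, shape (freq_bins, filter_frames).
        # Keeping only the first few taps preserves the direct path and
        # early reflections while discarding the long reverberant tail.
        return H[:, :keep_frames]

    # e.g. H_dry = truncate_rir(H_full, 8) keeps ~85 ms of the response
    # at a ~10.66 ms frame hop.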

Whilst the above embodiments have been described with reference to short-time Fourier transforms, it should be appreciated that there are several different linear transforms from the time domain to the frequency domain which can be used instead. Examples include discrete cosine transforms, wavelet transforms, or Mel or Bark scale modified implementations.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagram of Figure 3 is an example only and that various operations depicted therein may be omitted, reordered and/or combined.

Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that, while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.




 