

Title:
PROCESSING AUDIO SIGNALS
Document Type and Number:
WIPO Patent Application WO/2018/234617
Kind Code:
A1
Abstract:
A method, apparatus and computer-readable medium are disclosed for receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determining location information relating to the mobile source; transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

Inventors:
RÄMÖ ANSSI (FI)
VILERMO MIIKKA (FI)
VASILACHE ADRIANA (FI)
MATE SUJEET SHYAMSUNDAR (FI)
NIKUNEN JOONAS (FI)
Application Number:
PCT/FI2018/050395
Publication Date:
December 27, 2018
Filing Date:
May 25, 2018
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H03H21/00; G06F17/30; H04R1/32; H04R3/04
Foreign References:
US20140161270A12014-06-12
Other References:
KOKKINIS, E. ET AL.: "Identification of a Room Impulse Response Using a Close-Microphone Reference Signal", AES, May 2010 (2010-05-01), XP055557294, Retrieved from the Internet [retrieved on 20180910]
AVARGEL, Y. ET AL.: "System Identification in the Short-Time Fourier Transform Domain With Crossband Filtering", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 15, no. 4, May 2007 (2007-05-01), XP011177194, Retrieved from the Internet [retrieved on 20180910]
CROCCO, M. ET AL.: "Uncalibrated 3D Room Reconstruction from Sound", arXiv:1606.06258v1, XP080709673, Retrieved from the Internet [retrieved on 20180910]
WATERSCHOOT, T. ET AL.: "Optimally regularized recursive least squares for acoustic echo cancellation", IN: PROCEEDINGS OF THE SECOND ANNUAL IEEE BENELUX/DSP VALLEY SIGNAL PROCESSING SYMPOSIUM (SPS-DARTS 2006), March 2006 (2006-03-01), Antwerp, Belgium, pages 31-34, XP055557300, Retrieved from the Internet [retrieved on 20180910]
See also references of EP 3642957A4
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:
Claims

1. A method comprising:

receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device;

receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source;

determining location information relating to the mobile source;

transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and

using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

2. A method according to claim 1, further comprising:

receiving a selection of a position within the recording space;

receiving a second near-field audio signal associated with a source;

identifying a room impulse response filter relating to the selected position within the recording space from the set of room impulse response filters of the recording space;

applying the selected room impulse response filter to the second near-field audio signal to obtain a projected second near-field audio signal; and

augmenting the audio mixture of the far-field microphone device by adding the projected second near-field audio signal to the audio mixture.

3. A method according to claim 2, wherein the room impulse response filter applied to the second near-field audio signal is retrieved from a room impulse response filter database.

4. A method according to claim 3, wherein the room impulse response filter database contains room impulse response filters obtained from a broadband signal within the recording space.

5. A method according to any preceding claim, further comprising:

receiving a selection of a position within the recording space;

receiving a third near-field audio signal, the third near-field audio signal being associated with the selected position;

identifying a room impulse response filter relating to the selected position within the recording space from the set of room impulse response filters of the recording space;

applying the identified room impulse response filter to the third near-field audio signal associated with the selected position to project the third near-field audio signal into a far-field space; and

removing the projected near-field audio signal from the audio mixture.

6. A method according to claim 5, wherein the room impulse response filter applied to the third near-field audio signal is calculated using the first near-field audio signal and the far-field audio mixture.

7. A method according to any preceding claim, wherein the set of room impulse response filters is determined using a block-wise linear least squares projection algorithm applied to a broadband calibration signal.

8. A method according to any preceding claim, wherein the set of room impulse response filters is collected during a calibration phase and stored in a room impulse response database.

9. A method according to any of claims 1-7, wherein the set of room impulse response filters is determined using far-field and near-field audio signals obtained in real-time.

10. A method according to any preceding claim, wherein the mobile source moves around the recording space either manually or automatically.

11. A method according to any preceding claim, wherein the set of room impulse response filters is obtained using a recursive least squares algorithm.

12. A method according to any preceding claim, wherein the near-field microphone is provided with a location tag and the location information is received from the location tag.

13. A method according to any of claims 1-11, wherein the location information is determined using multilateration.

14. A method according to any preceding claim, further comprising determining a signal activity detection signal.

15. Apparatus configured to perform a method according to any preceding claim.

16. Computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method as claimed in any of claims 1 to 14.

17. An apparatus comprising:

at least one processor; and

at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to:

receive, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device;

receive, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source;

determine location information relating to the mobile source;

transform the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and

use the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

18. An apparatus according to claim 17, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:

receive a selection of a position within the recording space;

receive a second near-field audio signal associated with a source;

identify a room impulse response filter relating to the selected position within the recording space from the set of room impulse response filters of the recording space;

apply the selected room impulse response filter to the second near-field audio signal to obtain a projected second near-field audio signal; and

augment the audio mixture of the far-field microphone device by adding the projected second near-field audio signal to the audio mixture.

19. An apparatus according to claim 18, wherein the room impulse response filter applied to the second near-field audio signal is retrieved from a room impulse response filter database.

20. An apparatus according to claim 19, wherein the room impulse response filter database contains room impulse response filters obtained from a broadband signal within the recording space.

21. An apparatus according to any one of claims 17-20, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:

receive a selection of a position within the recording space;

receive a third near-field audio signal, the third near-field audio signal being associated with the selected position;

identify a room impulse response filter relating to the selected position within the recording space from the set of room impulse response filters of the recording space;

apply the identified room impulse response filter to the third near-field audio signal associated with the selected position to project the third near-field audio signal into a far-field space; and

remove the projected near-field audio signal from the audio mixture.

22. An apparatus according to claim 21, wherein the room impulse response filter applied to the third near-field audio signal is calculated using the first near-field audio signal and the far-field audio mixture.

23. An apparatus according to any one of claims 17-22, wherein the set of room impulse response filters is determined using a block-wise linear least squares projection algorithm applied to a broadband calibration signal.

24. An apparatus according to any one of claims 17-23, wherein the set of room impulse response filters is collected during a calibration phase and stored in a room impulse response database.

25. An apparatus according to any one of claims 17-23, wherein the set of room impulse response filters is determined using far-field and near-field audio signals obtained in real-time.

26. An apparatus according to any one of claims 17-25, wherein the mobile source moves around the recording space either manually or automatically.

27. An apparatus according to any one of claims 17-26, wherein the set of room impulse response filters is obtained using a recursive least squares algorithm.

28. An apparatus according to any one of claims 17-27, wherein the near-field microphone is provided with a location tag and the location information is received from the location tag.

29. An apparatus according to any one of claims 17-27, wherein the location information is determined using multilateration.

30. An apparatus according to any one of claims 17-29, wherein the computer program code, when executed by the at least one processor, causes the apparatus to determine a signal activity detection signal.

31. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:

receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device;

receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source;

determining location information relating to the mobile source;

transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and

using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

32. Apparatus comprising:

means for receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device;

means for receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source;

means for determining location information relating to the mobile source;

means for transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and

means for using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

Description:
Processing Audio Signals

Field

This specification relates to processing audio signals and, more specifically, to processing audio signals for mixing audio signals.

Background

Spatial audio signals are being used more often to produce a more immersive audio experience. A stereo or multi-channel recording can be passed from the recording or capture apparatus to a listening apparatus and replayed using a suitable multi-channel output such as a multi-channel loudspeaker arrangement and, with virtual surround processing, a pair of stereo headphones or headset.

As the possibilities for using such immersive audio functionality become more widespread, there is a need to ensure that audio signals are mixed in such a way as to complement the virtual reality environment of the user. For example, if a user is in a virtual reality environment, there is a requirement that audio content from a particular source sounds as though it is coming from a location corresponding to the location of that source in virtual reality.

Summary

In a first aspect, this specification describes a method comprising: receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determining location information relating to the mobile source; transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

The method may further comprise: receiving a selection of a position within the recording space; receiving a second near-field audio signal associated with a source; identifying a room impulse response filter relating to the selected position within the recording space from the set of room impulse response filters of the recording space; applying the selected room impulse response filter to the second near-field audio signal to obtain a projected second near-field audio signal; and augmenting the audio mixture of the far-field microphone device by adding the projected second near-field audio signal to the audio mixture.

The room impulse response filter applied to the second near-field audio signal may be retrieved from a room impulse response filter database.

The room impulse response filter database may contain room impulse response filters obtained from a broadband signal within the recording space.

The method may further comprise: receiving a selection of a position within the recording space; receiving a third near-field audio signal, the third near-field audio signal being associated with the selected position; identifying a room impulse response filter relating to the selected position within the recording space from the set of room impulse response filters of the recording space; applying the identified room impulse response filter to the third near-field audio signal associated with the selected position to project the third near-field audio signal into a far-field space; and removing the projected near-field audio signal from the audio mixture.

The room impulse response filter applied to the third near-field audio signal may be calculated using the first near-field audio signal and the far-field audio mixture.

The set of room impulse response filters may be determined using a block-wise linear least squares projection algorithm applied to a broadband calibration signal.

The set of room impulse response filters may be collected during a calibration phase and stored in a room impulse response database.

The set of room impulse response filters may be determined using far-field and near-field audio signals obtained in real-time.

The mobile source may move around the recording space either manually or automatically.

The set of room impulse response filters may be obtained using a recursive least squares algorithm.

The near-field microphone may be provided with a location tag and the location information may be received from the location tag. The location information may be determined using multilateration.

The method may further comprise determining a signal activity detection signal.

In a second aspect, this specification describes an apparatus configured to perform a method according to the first aspect.

In a third aspect, this specification describes computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method according to the first aspect of the specification.

In a fourth aspect, this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receive, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determine location information relating to the mobile source; transform the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and use the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

In a fifth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least: receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determining location information relating to the mobile source; transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

In a sixth aspect, this specification describes apparatus comprising: means for receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; means for receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; means for determining location information relating to the mobile source; means for transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and means for using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space.

Brief description of the drawings

So that the invention may be fully understood, embodiments thereof will now be described with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram of an audio mixing system and a recording space;

Figure 2 is a schematic block diagram of elements of certain embodiments;

Figure 3 is a flow chart illustrating operations carried out in certain embodiments;

Figure 4 is an illustration of a recording space;

Figure 5 is a schematic diagram of an audio mixing system and a recording space;

Figure 6 is a schematic diagram of an audio mixing system and a recording space as a target source is replaced with a replacement source; and

Figure 7 is a schematic diagram of an audio mixing system and a recording space as a new source is introduced to an audio mixture.

Detailed description

In the description and drawings, like reference numerals refer to like elements throughout. Embodiments of the present invention relate to mixing audio signals received from both a near-field microphone and from a far-field microphone. Example near-field microphones include Lavalier microphones, which may be worn by a user to allow hands-free operation, and handheld microphones. In some embodiments, the near-field microphone may be location tagged. The near-field signals obtained from near-field microphones may be termed "dry signals", in that they have little influence from the recording space and have a relatively high signal-to-noise ratio (SNR).

Far-field microphones are microphones that are located relatively far away from a sound source. In some embodiments, an array of far-field microphones may be provided, for example in a mobile phone or in a Nokia Ozo (RTM) or similar audio recording apparatus. Devices having multiple microphones may be termed multichannel devices and can detect an audio mixture comprising audio components received from the respective channels. The microphone signals from far-field microphones may be termed "wet signals", in that they have significant influence from the recording space (for example from ambience, reflections, echoes, reverberation, and other sound sources). Wet signals tend to have relatively low SNR. In essence, the near-field and far-field signals are in different "spaces", near-field signals in a "dry space" and far-field signals in a "wet space".

When the originally "dry" audio content from the sound sources reaches the far-field microphone array, the audio signals have changed because of the effect of the recording space. That is to say, the signal becomes "wet" and has a relatively low SNR. The near-field microphones are much closer to the sound sources than the far-field microphone array. This means that the audio signals received at the near-field microphones are much less affected by the recording space. The dry signal has a much higher signal-to-noise ratio and lower cross-talk with respect to other sound sources. Therefore, the near-field and far-field signals are very different, and mixing the two ("dry" and "wet") results in audible artefacts or non-natural sounding audio content.

Further problems arise if a signal from outside the system needs to be inserted into the audio mixture. For example, an audio stream from an external player such as a professional audio recorder may be mixed with audio content recorded in a particular recording space. These signals need to be mixed together because only the microphone array can provide spatial audio content, for example for a virtual reality (VR) or augmented reality (AR) audio delivery system. However, simply mixing the sound sources cannot achieve this without introducing artefacts, or at least without the sense of virtual presence being lost in listening. Furthermore, future six degrees of freedom (6DoF) audio production systems require ways to estimate room impulse responses.

Additionally, mixing or editing of the multi-channel array signal is not straightforward due to the low SNR, cross-talk and spatial artefacts that editing might cause. Editing of the near-field microphone and pre-recorded signals is relatively straightforward due to the high SNR and isolation between individual channels. However, near-field signals only provide audio content without spatial information. The resulting mix quality is down to personal preferences and use-case demands; however, some amount of spatial information insertion capability is often needed. A new problem arises when a totally new "dry" signal is introduced into the audio mixture, for example from a sound source located externally with respect to the recording space. Since the new audio signal has no room impulse response (RIR) data available for the current room and environment, realistic-sounding mixing is not possible without a database of RIR values from all around the space used for the original audio capture.

Current audio mixing systems often rely on the expert audio mixer's personal abilities, and spatial information may be added to the "dry" near-field signal with signal processors that create artificial spatial information. Examples include reverb processors that generate spatial information with an algorithm for different-sounding and tunable spaces, or that rely on real impulse responses (convolution processors) with some amount of manual modification to parameters such as panning, volume, equalization, pre-echo, decay time and residual noise floor. More information may be found at http://www.nongnu.org/freeverb3/. Hitherto, there are no known methods available that use a collected RIR database together with position data and/or models of the recording space to render realistic-sounding VR, AR or 6DoF audio playback.

Embodiments of this invention provide a database in which estimated RIR values are collected around the place of performance based on the captured "dry" and "wet" signals as well as available position data of the near-field microphones (which correspond to the position of the sound source). The RIR data are estimated based on the dry-to-wet signal transfer function at every relevant position within the recording space. There may be one or more "wet" multi-channel arrays as well as one or more "dry" sound sources collected in the RIR database at the same time.

In some embodiments, the RIR database may be collected during an initial calibration phase in which a sound source (for example white noise, a talking human, an acoustic instrument, or a flying drone with a speaker) moves or is moved around the recording space either manually or automatically. The benefit of performing calibration recordings and database collection prior to the actual performance is that the RIR database can be used during the performance to insert additional sound sources into the audio mix in real-time. The recording space may also offer higher SNR in some circumstances, for example when a studio audience is absent, and special signals such as white noise can be used, which provide more accurate room impulse responses over the whole frequency range.

In other embodiments, new RIR data is collected continuously during the recording itself and inserted into the database as the actual performance occurs. Additional RIR data can also be collected during the actual performance and inserted into a pre-existing RIR database, in order to add more data points and make the database denser. There are multiple dimensions in which the database can be enhanced. For example, the position grid can be made denser: data may be acquired on a 10 centimetre (cm) grid instead of an originally calibrated 20 cm grid, so that more data points are gathered. Spectral points can also be added: if calibration was initially performed quickly by walking around the vicinity of the far-field microphone array, all further captured signals will decrease the spectral sparseness of the RIR database.

Since the acoustic environment may change during the performance, the RIR database can contain time-varying RIR values. To capture time-varying responses, RIR measurements need to be captured over an extended period of time for optimal quality. For example, when more people enter the recording space, a damping of the recording space occurs which affects its acoustic properties.

Figure 1 shows an audio mixing system 100 which comprises a far-field audio recording device 101, such as a video/audio capture device, and one or more near-field audio recording devices 102, such as Lavalier microphones. The far-field audio recording device 101 comprises an array of far-field microphones and may be a mobile phone, a stereoscopic video/audio capture device or similar recording apparatus such as the Nokia Ozo (RTM). The near-field audio recording devices 102 may be worn by a user, for example a singer or actor. The far-field audio recording device 101 and the near-field audio recording devices 102 are located within a recording space 103. The far-field audio recording device 101 is in communication with an RIR processing apparatus 104 via either a wired or wireless connection. The RIR processing apparatus 104 may be located within the recording space 103 or outside the recording space 103. The RIR processing apparatus 104 has access to an RIR database 105 containing RIR data relating to the recording space 103. The RIR database 105 may be physically incorporated with the RIR processing apparatus 104. Alternatively, the RIR database 105 may be maintained remotely with respect to the RIR processing apparatus 104.

Figure 2 is a schematic block diagram of the RIR processing apparatus 104. The RIR processing apparatus 104 may be incorporated within a general purpose computer. Alternatively, the RIR processing apparatus 104 may be a standalone apparatus.

The RIR processing apparatus 104 may comprise a short-time Fourier transform (STFT) module 201 for determining short-time Fourier transforms of received audio signals. The RIR processing apparatus 104 comprises an RIR estimator 202 and a projection module 203. The RIR processing apparatus 104 comprises a processor 204 which controls the STFT module 201, the RIR estimator 202 and the projection module 203. The RIR processing apparatus 104 comprises a memory 205. The memory comprises a volatile memory 206 such as random access memory (RAM). The memory also comprises non-volatile memory 207, such as read-only memory (ROM). The RIR processing apparatus 104 further comprises input/output 208 to enable communication with the far-field audio recording device 101 and with the RIR database 105 as well as any other remote entities. The input/output 208 comprises hardware, software and/or firmware that allows the RIR processing apparatus 104 to communicate with the far-field audio recording device 101 and with other remote entities via wired or wireless connection using communication protocols known in the art.

Some further details of components and features of the above-described RIR processing apparatus 104 and alternatives will now be described.

The RIR processing apparatus 104 comprises a processor 204 communicatively coupled with a memory 205. The memory 205 has computer readable instructions stored thereon which, when executed by the processor 204, cause the processor 204 to cause performance of various ones of the operations described with reference to Figure 3. The RIR processing apparatus 104 may in some instances be referred to, in general terms, as "apparatus".

The RIR processing apparatus 104 may be of any suitable composition. For example, the processor 204 may be a programmable processor that interprets computer program instructions and processes data. The processor 204 may include plural programmable processors. Alternatively, the processor 204 may be, for example, programmable hardware with embedded firmware. The processor 204 may be termed processing means. The processor 204 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processor 204 may be referred to as computing apparatus.

The processor 204 is coupled to the memory (or one or more storage devices) 205 and is operable to read/write data to/from the memory 205. The memory 205 may comprise a single memory unit or a plurality of memory units upon which the computer readable instructions (or code) are stored. For example, the memory 205 may comprise both volatile memory and non-volatile memory. For example, the computer readable instructions/program code may be stored in the non-volatile memory and may be executed by the processor 204 using the volatile memory for temporary storage of data or data and instructions. Examples of volatile memory include RAM, DRAM and SDRAM. Examples of non-volatile memory include ROM, PROM, EEPROM, flash memory, optical storage and magnetic storage. The memories in general may be referred to as non-transitory computer readable memory media.

The term 'memory', in addition to covering memory comprising both non-volatile memory and volatile memory, may also cover one or more volatile memories only, one or more non-volatile memories only, or one or more volatile memories and one or more non-volatile memories.

The computer readable instructions/program code may be pre-programmed into the RIR processing apparatus 104. Alternatively, the computer readable instructions may arrive at the RIR processing apparatus 104 via an electromagnetic carrier signal or may be copied from a physical entity such as a computer program product, a memory device or a record medium such as a CD-ROM or DVD. The computer readable instructions may provide the logic and routines that enables the devices/apparatuses to perform the functionality described above. The combination of computer-readable instructions stored on memory (of any of the types described above) may be referred to as a computer program product.

Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "memory" or "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.

Reference to, where relevant, "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing apparatus" etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor, or firmware such as the programmable content of a hardware device, whether instructions for a processor or configured/configuration settings for a fixed-function device, gate array, programmable logic device, etc.

Overall algorithm description

The following is a description of one way in which far-field audio signals may be processed to obtain a short-time Fourier transform (STFT). The far-field audio recording device 101, comprising a microphone array composed of far-field microphones with indexes $c = 1, \ldots, C$, captures a mixture of source signals with indexes $p = 1, \ldots, P$ and their signals $x^{(p)}(n)$ sampled at discrete time instances indexed by $n$. The sound sources may be moving and have time-varying mixing properties, denoted by a room impulse response (RIR) $h_{cn}^{(p)}(\tau)$ for each channel $c$ at each time index $n$. Some of the sound sources (e.g. a speaker, car, piano or any other sound source) have Lavalier microphones 102 close to them. The resulting mixture signal can be given as:

$$y_c(n) = \sum_{p=1}^{P} \sum_{\tau} x^{(p)}(n - \tau)\, h_{cn}^{(p)}(\tau) + n_c(n) \quad \text{(Equation 1)}$$

wherein:

$y_c(n)$ is the audio mixture in the time domain for each channel index $c$ of the far-field audio recording device 101, i.e. the signal received at each far-field microphone;

$x^{(p)}(n)$ is the $p$-th near-field source signal in the time domain (source index $p$);

$h_{cn}^{(p)}(\tau)$ is the room impulse response in the time domain (sample delay index $\tau$);

$n_c(n)$ is the noise signal in the time domain.
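For illustration, Equation 1 can be simulated directly. The following minimal Python sketch assumes a single far-field channel, static (non-time-varying) RIRs and Gaussian noise; all function and parameter names are illustrative only and do not appear in this specification:

    import numpy as np

    def simulate_far_field(sources, rirs, noise_std=0.01):
        """Simulate Equation 1 for one far-field channel c: the mixture is
        the sum of each near-field source convolved with its RIR, plus noise."""
        n_samples = len(sources[0])
        y = np.zeros(n_samples)
        for x_p, h_p in zip(sources, rirs):
            # convolution over the sample delay index tau
            y += np.convolve(x_p, h_p)[:n_samples]
        y += noise_std * np.random.randn(n_samples)  # noise term n_c(n)
        return y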

Applying the short-time Fourier transform (STFT) to the time-domain array signal allows expressing the capture in the time-frequency domain as:

$$\mathbf{y}_{ft} = \sum_{p=1}^{P} \sum_{d=0}^{D-1} \mathbf{h}_{ftd}^{(p)}\, x_{f,t-d}^{(p)} + \mathbf{n}_{ft} = \sum_{p=1}^{P} \tilde{\mathbf{x}}_{ft}^{(p)} + \mathbf{n}_{ft} \quad \text{(Equation 2)}$$

wherein:

$\mathbf{y}_{ft}$ is the STFT of the array mixture (frequency and frame indices $f$, $t$);

$x_{ft}^{(p)}$ is the STFT of the $p$-th near-field source signal (source index $p$);

$\mathbf{h}_{ftd}^{(p)}$ is the room impulse response (RIR) in the STFT domain (frame delay index $d$);

$\tilde{\mathbf{x}}_{ft}^{(p)}$ is the STFT of the $p$-th reverberated (filtered/projected) source signal;

$\mathbf{n}_{ft}$ is the STFT of the noise signal.

The STFT of the array signal is denoted by $\mathbf{y}_{ft} = [y_{ft1}, \ldots, y_{ftC}]^T$, where $f$ and $t$ are the frequency and time frame indices, respectively. The source signal as captured by the microphone array of the far-field audio recording device 101 is modelled by convolution between the near-field source STFT and its frequency-domain RIR $\mathbf{h}_{ftd}^{(p)} = [h_{ftd1}, \ldots, h_{ftdC}]^T$.

The length of the convolutive frequency-domain RIR is $D$ timeframes, which can vary from a few timeframes to several tens of frames depending on the STFT window length and the maximum effective amount of reverberation components in the recording environment. This model differs from the usual assumption of instantaneous mixing in the frequency domain, where the mixing consists of complex-valued weights only for the current timeframe. The additive uncorrelated noise is denoted by $\mathbf{n}_{ft} = [n_{ft1}, \ldots, n_{ftC}]^T$.
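The convolutive time-frequency model of Equation 2 can likewise be sketched for a single source and channel. The following assumes scipy's one-sided STFT, so the frequency-domain RIR array must have nperseg // 2 + 1 rows (one per bin) and one column per frame delay; all names are illustrative:

    import numpy as np
    from scipy.signal import stft

    def reverberate_stft(x, h_fd, fs=48000, nperseg=1024):
        """Apply a convolutive frequency-domain RIR (Equation 2, one source,
        one channel). h_fd has shape (F, D): D frame delays per frequency bin."""
        _, _, X = stft(x, fs=fs, nperseg=nperseg)  # X has shape (F, T)
        F, T = X.shape
        D = h_fd.shape[1]
        X_rev = np.zeros((F, T), dtype=complex)
        for d in range(D):
            # delay the source STFT by d frames and weight by h_{f,d}
            X_rev[:, d:] += h_fd[:, d:d + 1] * X[:, :T - d]
        return X_rev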

The reverberated source signals are denoted by $\tilde{\mathbf{x}}_{ft}^{(p)}$. The way in which RIR measurements are obtained in accordance with various embodiments will now be explained with reference to Figure 3, which is a flow chart illustrating various steps taken in embodiments of the invention. The process starts at step 3.1.

At step 3.2 an audio signal $y_c(n)$ is received from the far-field audio recording device 101. At step 3.3 an audio signal $x^{(p)}(n)$ is received from the near-field audio recording device 102 for those sound sources provided with a near-field audio recording device 102.

At step 3.4, the location of the mobile source is determined. The location can be determined using information received from a tag with which the mobile source is provided. Alternatively, the location may be calculated using multilateration techniques described below.

At step 3.5, a short-time Fourier transform (STFT) is applied to both the far-field and near-field audio signals. Alternative transforms may be applied to the audio signals, as described below.

In some embodiments, time differences between the near-field and far-field audio signals can be taken into account. However, if the time differences are large (several hundreds of milliseconds or more), a rough alignment may be done prior to the process commencing. For example, if a wireless connection between a near-field microphone and the RIR processor causes a delay, the delay may be manually fixed by delaying the other signals in the RIR processor or by an external delay processor, which may be implemented as hardware or software.

A signal activity detection (SAD) signal may be estimated from the near-field signal in order to determine when the RIR estimate is to be updated. For example, if a source does not emit any signal over a time period, its RIR value does not need to be estimated.
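A minimal sketch of such a signal activity detector, using a simple frame-energy threshold on the near-field signal (the frame length and threshold values are illustrative assumptions, not values from this specification):

    import numpy as np

    def signal_activity(x, frame_len=1024, threshold_db=-50.0):
        """Return one boolean per frame: True where the near-field signal is
        active, i.e. where the RIR estimate should be updated."""
        n_frames = len(x) // frame_len
        frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
        energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        return energy_db > threshold_db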

The STFT values $\mathbf{y}_{ft}$ and $x_{ft}^{(p)}$ are input to the RIR estimator 202 at RIR estimation step 3.6. The RIR estimation may be performed using a block-wise linear least squares (LS) projection in offline operation mode, that is, where the RIR estimation is performed as part of a calibration operation. Alternatively, a recursive least squares (RLS) algorithm may be used in real-time operation mode, that is, where the RIR estimation occurs during a performance itself. In other embodiments, the RLS algorithm may be used in offline operation instead of the block-wise linear LS algorithm. In any case, as a result, a set of RIR filters in the time-frequency domain is obtained. The process ends at step 3.7.

Block-wise linear least squares projection

The RIR $\mathbf{h}_{ftd}^{(p)}$ can be thought of as a projection operator from the near-field signal space (i.e. "dry" signals) to the far-field signal space (the array capture in the case of multiple channels, i.e. "wet" signals).

The projection is time, frequency and channel dependent. The parameters of the RIR $\mathbf{h}_{ftd}^{(p)}$ can be estimated using linear least squares (LS) regression, which is equivalent to finding the projection between the near-field and far-field signal spaces. The method of LS regression for estimating RIR values may be applied to moving sound sources by processing the input signal in blocks of approximately 500 ms, with the RIR values assumed to be stationary within each block. Block-wise processing with moving sources assumes that the difference between RIR values associated with adjacent frames is relatively small and remains stable within the analysed block. This is valid for sound sources that move at low speeds in an acoustic environment where small changes in source position with respect to the receiver do not cause a substantial change in the RIR value.

The method of LS regression is applied individually to each source signal in each channel of the array. Additionally, the RIR values are frequency dependent and each frequency bin of the STFT is processed individually. Thus, in the following discussion it should be understood that the processing is repeated for all channels and all frequencies. Assuming a block of STFT frames with indices $t, \ldots, t+T$ in which the RIR is assumed stationary, the mixture signal STFT with the convolutive frequency-domain mixing can be given as:

$$y_t = \sum_{d=0}^{D-1} x_{t-d}\, h_d, \qquad \mathbf{y} = \mathbf{X}\mathbf{h} \quad \text{(Equation 3)}$$

wherein $\mathbf{y}$ is a vector of far-field STFT coefficients obtained from the far-field audio recording device 101 from frame $t$ to $t+T$; $\mathbf{X}$ is a matrix containing the near-field STFT coefficients starting from frame $t - 0$ and the delayed versions starting from $t - 1, \ldots, t - (D-1)$; and $\mathbf{h}$ is the RIR to be estimated.

The length of the RIR filter to be estimated is $D$ STFT frames. The block length is $T + 1$ frames, and $T + 1 > D$ in order to avoid overfitting due to an overdetermined model.

The above Equation (3) can be expressed in matrix form as:

$$\begin{bmatrix} y_t \\ y_{t+1} \\ \vdots \\ y_{t+T} \end{bmatrix} = \begin{bmatrix} x_t & x_{t-1} & \cdots & x_{t-D+1} \\ x_{t+1} & x_t & \cdots & x_{t-D+2} \\ \vdots & & \ddots & \vdots \\ x_{t+T} & x_{t+T-1} & \cdots & x_{t+T-D+1} \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{D-1} \end{bmatrix} \quad \text{(Equation 4)}$$

and, assuming that data before the first frame index $t$ is not available, the model becomes the same system with the coefficients from frames earlier than $t$ replaced by zeros:

$$\begin{bmatrix} y_t \\ y_{t+1} \\ \vdots \\ y_{t+T} \end{bmatrix} = \begin{bmatrix} x_t & 0 & \cdots & 0 \\ x_{t+1} & x_t & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ x_{t+T} & x_{t+T-1} & \cdots & x_{t+T-D+1} \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{D-1} \end{bmatrix} \quad \text{(Equation 5)}$$

The linear LS minimization

$$\min_{\mathbf{h}} \lVert \mathbf{y} - \mathbf{X}\mathbf{h} \rVert^2 \quad \text{(Equation 6)}$$

is achieved as:

$$\mathbf{h} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \quad \text{(Equation 7)}$$

The projected source signal for a single block can be trivially obtained as:

$$\tilde{x}_t = \sum_{d=0}^{D-1} x_{t-d}\, h_d \quad \text{(Equation 8)}$$

A subsequent removal of a particular source signal from the audio mixture is a simple subtraction:

$$\hat{y}_t = y_t - \tilde{x}_t \quad \text{(Equation 9)}$$

Equation 9 demonstrates the removal of a particular source signal from the audio mixture. As well as removing a source from the audio mixture, it is also possible to add the effect of a source to the audio mix. This may be done by using addition instead of subtraction, with a user-specified gain.
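Equations 5 to 9 can be condensed into a short sketch for a single frequency bin, channel and block; in practice the same steps are repeated for every bin, channel and block (numpy assumed, names illustrative):

    import numpy as np

    def blockwise_ls_rir(x_nf, y_ff, D):
        """Estimate a D-tap frequency-domain RIR for one bin and one channel,
        then project and remove the source (Equations 5, 7, 8 and 9).

        x_nf: complex near-field STFT coefficients for the block (length T+1)
        y_ff: complex far-field STFT coefficients for the block (length T+1)
        """
        n = len(x_nf)
        # Build X as in Equation 5: delayed copies, zeros before the first frame
        X = np.zeros((n, D), dtype=complex)
        for d in range(D):
            X[d:, d] = x_nf[:n - d]
        # Linear LS solution of Equations 6 and 7 (lstsq handles complex data)
        h, *_ = np.linalg.lstsq(X, y_ff, rcond=None)
        x_proj = X @ h             # projected ("wet") source, Equation 8
        y_removed = y_ff - x_proj  # source removed from the mixture, Equation 9
        return h, x_proj, y_removed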

System calibration and RIR database collection

The RIR estimation presented in embodiments of the present invention allows removal of a target source from the audio mixture or addition of a source to the audio mixture of the far-field audio recording device 101. Based on the target source direction of arrival (DOA) trajectory or location estimates of the target source, the signal emitted by the source can be replaced by augmenting separate content to the array mixture of the far-field audio recording device 101.

The problem of augmenting separate signals using the RIR values estimated from the target source in prior approaches lies in the fact that the source signal is not broadband and estimates of RIR values from frequencies with no signal energy emitted are unreliable. Having different spectral content (source signal frequency occupancy in each frame) leads to poor subjective quality of the synthesized augmented source since accurate RIR data for all frequencies are not available.

To overcome this problem, embodiments herein described provide a calibration method with a constant broadband signal which is used to estimate and store RIR values for substantially all possible locations of the recording space. The purpose of the calibration stage is that reliable broadband RIR data from all positions of the recording space are captured before the actual operation (i.e. before an audio recording or broadcast). The location data may be either relative or absolute, such as GPS coordinates. During the operation stage itself (i.e. during a recording or broadcast), the target source is removed from the mixture using the block-wise LS or RLS method described above. The direction of arrival (DOA) is estimated either acoustically or using other localization techniques.

There is a variety of ways in which the DOA may be estimated. In some embodiments, the estimated RIR value in the time domain relating to each channel of the array of the far-field audio device 101 is analysed. The first received RIR sample that is above a threshold gives an estimate of the delay at which the sound arrives at the nearest microphone of the far-field audio device 101. Comparing the delays from all microphones of the far-field audio device 101 provides the time differences of arrival (TDOA) between microphones in the array of the far-field audio device 101. From these values the direction can be calculated using multilateration methods that are known in the art.
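A sketch of this TDOA estimation from per-channel time-domain RIR estimates (the detection threshold, expressed relative to the RIR peak, is an illustrative assumption):

    import numpy as np

    def tdoa_from_rirs(rirs, fs, threshold_ratio=0.2):
        """Estimate time differences of arrival from per-channel RIRs.

        rirs: array of shape (C, L), one time-domain RIR per microphone.
        Returns delays in seconds relative to the earliest microphone.
        """
        delays = []
        for h in rirs:
            thr = threshold_ratio * np.max(np.abs(h))
            first = np.argmax(np.abs(h) >= thr)  # first sample above threshold
            delays.append(first / fs)
        delays = np.array(delays)
        return delays - delays.min()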

The augmented source is synthesized using the target source DOA estimates for retrieving the RIR corresponding to each DOA from the database generated in the calibration stage. The length of the calibration stage depends on the size of the recording space and the required density of the database. The length of the calibration stage may vary from around 10 seconds to several minutes.

Figure 4 is a plan view of a recording space 103 in accordance with an embodiment whereby audio data is recorded as part of a calibration stage. A speaker 400 is provided with a near-field microphone 102 such as a Lavalier microphone or a handheld microphone. The speaker 400 may also be provided with a location tag 401. A far-field audio recording device 101 is provided towards the centre of the recording space 103. During the calibration stage, the speaker 400 walks around the recording space 103 along a trajectory T. The speaker 400 speaks so that audio data is recorded by both the far-field audio recording device 101 and the near-field microphone 102. The person may also be playing an instrument or carrying a sound-producing loudspeaker.

The room impulse response (RIR) data are collected around the place of performance based on the captured "dry" and "wet" signals as well as available position data from the location tag 401. The RIR data are estimated based on the dry-to-wet signal transfer function at every relevant position with a processing unit using one of the algorithms described above.

Figure 5 is a plan view of a recording space 103 in accordance with another embodiment whereby audio data is recorded as part of a calibration stage. In this embodiment, two drones 500 are provided. Each drone 500 is provided with a near-field microphone 102. Each of the drones 500 emits a noise, either through a loudspeaker or merely from the drone rotors. Two or more far-field audio recording devices 101 are also provided.

The RIR database 105 may be collected during an initial calibration phase where an audio source of wideband noise (for example white noise, an MLSA sequence or pseudo-random noise) or another source such as a talking human, an acoustic instrument, a flying drone with a speaker or a ground-based robot is moving or is moved around the recording space 103 either manually or automatically.

The benefit of having some calibration recordings and database collection prior to an actual performance is that the pre-existing RIR database 105 can be used during the performance to insert additional sound sources into the audio mix in real-time.

Additionally, when wideband noise is used for calibration, the RIR data are more accurate over the whole spectrum. The recording stage will also have higher SNR available, for example when the audience is missing from the recording space 103. This may provide more accurate and/or faster RIR measurements.

In other embodiments, RIR data may be collected during the performance itself. This may be instead of the calibration phase described above or in addition to it. In the latter scenario, the reliability of the RIR data captured during the calibration process described above using the block-wise linear least squares projection may be improved by capturing further RIR data during the performance itself.

As mentioned above, estimated RIR data are generally valid only for the frequency indices at which the source produced meaningful acoustic output. Usually RIR data are applied to the same close-field signal, and no mismatch between time-frequency content and RIR data occurs. However, for example in the case of augmenting a completely new near-field signal whose spectral content is very different from that used to estimate the available RIR data, the RIR data need to be broadband and valid at least for the STFT frequency indices where the augmented signal has significant energy.

In order to avoid active calibration with a known broadband signal, a method for passive online RIR database collection is provided in some embodiments. RIR data estimated at each position of the recording space 103 are used to gradually build a database of broadband RIR data by combining estimates made at different times from the same location within the recording space 103. The recent magnitude spectrum of the near-field signal can be used as an indicator of the reliability of the RIR data, and only frequency indices with substantial signal energy are updated in the database. The database update can vary from a simple weighted average to more advanced combinations based on probabilistic modelling and machine learning in general.
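As a simple instance of such an update, the following sketch blends a new RIR estimate into the stored one per frequency bin, using the recent near-field magnitude spectrum as the reliability gate (the energy floor and averaging weight are illustrative assumptions):

    import numpy as np

    def update_rir_database(h_db, h_new, nf_magnitude,
                            energy_floor=1e-4, alpha=0.9):
        """Weighted-average database update for one position.

        h_db, h_new:  complex RIR arrays of shape (F, D)
        nf_magnitude: recent near-field magnitude spectrum, shape (F,)
        Only bins where the near-field signal had substantial energy are updated.
        """
        reliable = nf_magnitude > energy_floor
        h_db[reliable] = alpha * h_db[reliable] + (1 - alpha) * h_new[reliable]
        return h_db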

In some embodiments, real-time RIR estimation may be performed by using a recursive least squares (RLS) algorithm. The signal model, consisting of convolutive mixing in time-frequency domain, may be defined as: , _ yP yD - i h W Y (p) , _ yP W . n

Vft - L p =i Ld=o n ftd x ft-d + n ft - j p=^i x ft + n ft

(Equation 10)

In real-time operation the filter weights vary for each time frame $t$ and, again dropping the frequency index $f$ and the channel dimension, the filtering equation for a single source at time frame $t$ may be specified as:

$$\tilde{x}_t = \sum_{d=0}^{D-1} x_{t-d}\, h_{td} = \mathbf{x}_t^T \mathbf{h}_t \quad \text{(Equation 11)}$$

where $\mathbf{x}_t = [x_t, x_{t-1}, \ldots, x_{t-D+1}]^T$ and $\mathbf{h}_t = [h_{t0}, h_{t1}, \ldots, h_{t,D-1}]^T$.

Efficient real-time operation can be achieved with recursive estimation of the RIR filter weights $\mathbf{h}_t$ using the recursive least squares (RLS) algorithm. The modelling error for timeframe $t$ may be specified as:

$$e_t = y_t - \tilde{x}_t \quad \text{(Equation 12)}$$

where $y_t$ is the observed/desired mixture signal.

The cost function to be minimized with respect to the filter weights may be expressed as:

$$C(\mathbf{h}_t) = \sum_{i=1}^{t} \lambda^{t-i} \lvert e_i \rvert^2, \qquad 0 < \lambda \le 1 \quad \text{(Equation 13)}$$

which accumulates the estimation error from past frames with exponential weight $\lambda^{t-i}$. The weight of the cost function can be thought of as a forgetting factor which determines how much past frames contribute to the estimation of the RIR filter weights at the current frame. RLS algorithms where $\lambda < 1$ may be referred to in the art as exponentially weighted RLS, and $\lambda = 1$ may be referred to as growing window RLS.

The RLS algorithm minimizing Equation 13 is based on recursive estimation of the inverse correlation matrix $\mathbf{P}_t$ of the close-field signal and the optimal filter weights $\mathbf{h}_t$, and can be summarized as:

Initialization:

$$\mathbf{P}_0 = \delta^{-1}\mathbf{I}, \qquad \mathbf{h}_0 = \mathbf{0}$$

Repeat for $t = 1, 2, \ldots$:

$$\alpha_t = y_t - \mathbf{x}_t^T \mathbf{h}_{t-1}$$

$$\mathbf{g}_t = \frac{\mathbf{P}_{t-1}\mathbf{x}_t}{\lambda + \mathbf{x}_t^T \mathbf{P}_{t-1}\mathbf{x}_t}$$

$$\mathbf{P}_t = \lambda^{-1}\mathbf{P}_{t-1} - \lambda^{-1}\mathbf{g}_t \mathbf{x}_t^T \mathbf{P}_{t-1}$$

$$\mathbf{h}_t = \mathbf{h}_{t-1} + \alpha_t \mathbf{g}_t \quad \text{(Equations 14)}$$

The initial regularization of the inverse autocorrelation matrix is achieved by defining $\delta$ using a small positive constant, typically from $10^{-2}$ to $10^{-1}$. A small $\delta$ value causes faster convergence, whereas a larger $\delta$ value constrains the initial convergence to happen over a longer time period (for example, over a few seconds). The contribution of past frames to the RIR filter estimate at the current frame $t$ may be varied over frequency. Generally, the forgetting factor $\lambda$ acts in a similar way to the analysis window shape in the truncated block-wise least squares algorithm. However, small changes in source position can cause substantial changes in the RIR filter values at high frequencies due to highly reflected and more diffuse sound propagation paths. Therefore, the contribution of past frames at high frequencies needs to be lower than at low frequencies. It is assumed that the RIR parameters change slowly at lower frequencies and source evidence can be integrated over longer periods, meaning that the exponential weight $\lambda^{t-i}$ can have substantial values for frames up to 1.5 seconds in the past.
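Written out in code, the recursion of Equations 14 for one frequency bin and channel might look as follows; for clarity the sketch uses real arithmetic (complex STFT data would require conjugate transposes), and the forgetting factor and delta are tuning parameters as discussed above:

    import numpy as np

    def rls_rir(x_frames, y_frames, D, lam=0.99, delta=0.01):
        """Exponentially weighted RLS estimate of a D-tap RIR (Equations 14)."""
        P = np.eye(D) / delta   # P_0: initial inverse autocorrelation estimate
        h = np.zeros(D)
        x_buf = np.zeros(D)     # x_t = [x_t, x_{t-1}, ..., x_{t-D+1}]^T
        for x_t, y_t in zip(x_frames, y_frames):
            x_buf = np.roll(x_buf, 1)
            x_buf[0] = x_t
            alpha = y_t - x_buf @ h                    # a priori error alpha_t
            g = P @ x_buf / (lam + x_buf @ P @ x_buf)  # gain vector g_t
            P = (P - np.outer(g, x_buf @ P)) / lam     # update P_t
            h = h + alpha * g                          # h_t = h_{t-1} + alpha_t g_t
        return h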

A similar regularization as described above with reference to block-wise LS may also be adopted for the RLS algorithm. The regularization is done to achieve a similar effect as in block-wise LS: to improve robustness towards low-frequency crosstalk between near-field signals and to avoid excessively large RIR weights. The near-field microphones are generally not directive at low frequencies and can pick up a fair amount of low-frequency signal content generated by noise sources, for example traffic, loudspeakers, etc.

In order to specify regularization of the RIR filter estimates, the RLS algorithm is given in a direct form. In other words, the RLS algorithm is given without using the matrix inversion lemma to derive updates directly for the inverse autocorrelation matrix $\mathbf{P}_t$, but instead for the autocorrelation matrix $\mathbf{R}_t$ ($\mathbf{R}_t^{-1} = \mathbf{P}_t$). The formulation can be found, for example, in T. van Waterschoot, G. Rombouts, and M. Moonen, "Optimally regularized recursive least squares for acoustic echo cancellation," in Proceedings of the Second Annual IEEE BENELUX/DSP Valley Signal Processing Symposium (SPS-DARTS 2006), Antwerp, Belgium, 2005, pp. 28-29. The direct-form RLS algorithm updates are specified as:

Initialization:

$$\mathbf{R}_0 = \delta\mathbf{I}, \qquad \mathbf{h}_0 = \mathbf{0}$$

Repeat for $t = 1, 2, \ldots$:

$$\mathbf{R}_t = \lambda\mathbf{R}_{t-1} + \mathbf{x}_t \mathbf{x}_t^T$$

$$\alpha_t = y_t - \mathbf{x}_t^T \mathbf{h}_{t-1}$$

$$\mathbf{h}_t = \mathbf{h}_{t-1} + \mathbf{R}_t^{-1}\mathbf{x}_t \alpha_t \quad \text{(Equations 15)}$$

This algorithm would give the same result as the RLS algorithm discussed above but requires operation for calculating the inverse of the autocorrelation matrix, and is thus computationally more expensive, but does allow regularization of it. The

autocorrelation matrix update with Levenberg-Marquardt regularization (LMR) according to T. van Waterschoot, G. Rombouts, andM. Moonen, "Optimally regularized recursive least squares for acoustic echo cancellation, " in Proceedings of The second annual IEEE BENELUX/DSP Valley Processing Symposium (SPS-DARTS 2006) , Antwerp, Belgium, 2005, pp. 28-29 is:

$$\mathbf{R}_t = \lambda \mathbf{R}_{t-1} + \mathbf{x}_t \mathbf{x}_t^{T} + (1 - \lambda) \beta_{\mathrm{LMR}} \mathbf{I} \qquad \text{(Equation 16)}$$

where $\beta_{\mathrm{LMR}}$ is obtained from the regularization kernel $k_f$, which increases towards low frequencies, weighted by the inverse average log-spectrum of the close-field signal $(1 - e_f)$, as discussed above with respect to the block-wise LS algorithm.

Another type of regularization is Tikhonov regularization (TR), as also introduced in the case of block-wise LS, which can be defined for the RLS algorithm as:

$$\mathbf{R}_t = \lambda \mathbf{R}_{t-1} + \mathbf{x}_t \mathbf{x}_t^{T} + (1 - \lambda) \beta_{\mathrm{TR}} \mathbf{I} \qquad \text{(Equation 17)}$$

$$\mathbf{h}_t = \mathbf{h}_{t-1} + \mathbf{R}_t^{-1} \left( \mathbf{x}_t \alpha_t - (1 - \lambda) \beta_{\mathrm{TR}} \mathbf{h}_{t-1} \right) \qquad \text{(Equation 18)}$$

As before, $\beta_{\mathrm{TR}}$ is based on the regularization kernel and the inverse average log-spectrum of the close-field signal. It should be noted that the kernel $k_f$ needs to be modified to account for the differences between the block-wise LS and RLS algorithms, and can depend on the level difference between the close-field signal and the far-field mixtures.
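As a hedged illustration, a single direct-form update of Equations 15 with either regularizer might be sketched as follows. The scalar beta stands in for the frequency-dependent weight $\beta_{\mathrm{LMR}}$ or $\beta_{\mathrm{TR}}$ (whose derivation from the kernel $k_f$ is described above but not reproduced here), and the conjugation convention is an assumption.

import numpy as np

def direct_rls_step(h, R, x, y, lam=0.98, beta=0.0, mode="LMR"):
    """One direct-form RLS step (Equations 15) for one frequency bin, with
    Levenberg-Marquardt (Equation 16) or Tikhonov (Equations 17 and 18)
    regularization of the autocorrelation matrix."""
    L = h.shape[0]
    # regularized autocorrelation update (Equation 16 or 17)
    R = lam * R + np.outer(x, x.conj()) + (1.0 - lam) * beta * np.eye(L)
    alpha = y - np.vdot(h, x)                    # a priori error
    if mode == "TR":
        # Tikhonov additionally pulls the filter norm towards zero (Equation 18)
        h = h + np.linalg.solve(R, x * np.conj(alpha) - (1.0 - lam) * beta * h)
    else:
        # LMR leaves the filter update of Equations 15 unchanged
        h = h + np.linalg.solve(R, x * np.conj(alpha))
    return h, R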

In addition to the regularization weight being adjusted based on the average log-spectrum, it can also be varied based on the RMS level difference between the near-field and far-field signals. The RMS levels of these signals might not be calibrated in real-time operation, and thus an additional regularization weight strategy is required. A simple low-pass filter applied to the RMS of each individual STFT frame can be used to track the varying RMS levels of the close-field and far-field signals. The estimated RMS level is used to adjust the regularization weights $\beta_{\mathrm{LMR}}$ or $\beta_{\mathrm{TR}}$ in order to achieve a similar regularization impact as with the RMS-calibrated signals assumed in the earlier equations.
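A sketch of such level tracking, assuming a one-pole smoother and an illustrative mapping from the level ratio to the regularization weight (neither constant is specified above):

import numpy as np

def smoothed_rms(frames, coeff=0.9):
    """Track the slowly varying per-frame RMS with a one-pole low-pass filter."""
    level, out = 0.0, []
    for frame in frames:                          # frame: one STFT column
        rms = np.sqrt(np.mean(np.abs(frame) ** 2))
        level = coeff * level + (1.0 - coeff) * rms
        out.append(level)
    return np.asarray(out)

# illustrative use: scale the regularization weight by the (uncalibrated)
# far-to-near level ratio so its impact matches the calibrated case
# beta_t = beta_base * (far_rms[t] / max(near_rms[t], 1e-12)) ** 2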

Additional RIR data to be inserted into the RIR database 105 may be collected during the actual performance. This can be done in order to add more data points, for example to make the RIR position database grid denser, or to sense time-varying responses, for example when a larger crowd comes inside the room and dampens it. Time-varying responses may also be useful in post-production if some original performances are edited and later added back to the original recording space 103. Figure 6 illustrates a recording environment whereby a target source 601 is removed from the audio mixture and replaced with a replacement source 602 at the same position. Based on the target source DOA trajectory or location estimates obtained from a location tag of the target source 601, the signal emitted by the target source 601 can be replaced by augmenting separate content to the array mixture. An example scenario of this simple method to replace a speaker inside a room with another person is shown in Figure 6. The replacement of a target source may be done using real-time RIR estimation, in which case no RIR database 105 need be used. Alternatively, a calibration phase may be performed with respect to the recording space 103, as described above. A drawback of augmenting separate signals using the RIR data estimated from the target source 601 in real time lies in the fact that the target source signal may not be broadband, and estimates of RIR data at frequencies with no emitted signal energy may be unreliable. Where the target source 601 and the replacement source 602 have different spectral content (i.e. source signal frequency occupancy in each frame), poor subjective quality of the synthesized augmented source may result, since accurate RIR data for all frequencies may not be available.
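One way to act on this caveat, sketched here under assumed names and an illustrative threshold, is to flag the STFT bins where the near-field source actually emits energy and treat real-time RIR estimates outside those bins as unreliable:

import numpy as np

def reliable_bins(S_near, margin_db=40.0):
    """Flag frequency bins where the near-field STFT S_near (F, T) carries
    meaningful energy; returns a boolean mask of shape (F,)."""
    power = np.mean(np.abs(S_near) ** 2, axis=1)      # average per-bin power
    power_db = 10.0 * np.log10(power + 1e-12)
    return power_db > power_db.max() - margin_db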

In other embodiments, a calibration phase is used to build up a RIR database 105, as described above. The RIR data in the RIR database 105 that is collected with wideband noise is accurate and reliable over the whole frequency spectrum. Using this pre-collected RIR data enables higher quality replacement of the audio source. A selection of a position within the recording space is received. This may be the position of the target source 601, received from any location determination method described above. A near-field audio signal is received from the target source 601.

A RIR filter related to the position of the target source is identified.

The identified room impulse response filter is then applied to the near-field audio signal of the target source to project the near-field audio signal of the target source into a far-field space. As explained above, this RIR filter may be calculated in real time.
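For illustration, applying a per-bin RIR filter of L taps along the STFT frame axis, consistent with the convolutive model described above, might look like the following sketch; the array shapes and names are assumptions.

import numpy as np

def project_near_field(S, H):
    """Convolve a near-field STFT signal with per-bin RIR taps along the
    frame axis, projecting it into the far-field space.
    S: (F, T) near-field STFT; H: (F, L) RIR taps per frequency bin."""
    F, T = S.shape
    Y = np.zeros((F, T), dtype=complex)
    for tau in range(H.shape[1]):
        # each tap delays the source by tau frames before weighting
        Y[:, tau:] += H[:, tau:tau + 1] * S[:, :T - tau]
    return Y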

The projected near-field audio signal may then be removed from the audio mixture, as shown in Equation 9 above.

A near-field audio signal from the replacement source 602 is received.

A room impulse response filter relating to the position within the recording space is identified. This may be the same room impulse response filter used to remove the target source. Alternatively, the room impulse response filter applied to the near-field audio signal of the replacement source 602 may be retrieved from a room impulse response filter database collected during a calibration phase.

The selected room impulse response filter is then applied to the near-field audio signal of the replacement source 602 to obtain a projected near-field audio signal of the replacement source 602.

The audio mixture of the far-field microphone device may then be augmented by adding the projected near-field audio signal of the replacement source 602 to the audio mixture. As such, the target source 601 is removed and replaced with the replacement source 602.
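Under the assumptions of the earlier projection sketch, the whole replacement step reduces to a few lines; all variable names are illustrative and a single far-field channel is shown.

# Y_mix:        (F, T) far-field mixture STFT (one channel)
# S_tgt, S_rep: (F, T) near-field STFTs of target and replacement sources
# H_pos:        (F, L) RIR filter for the shared position
Y_clean = Y_mix - project_near_field(S_tgt, H_pos)    # remove target (cf. Equation 9)
Y_aug   = Y_clean + project_near_field(S_rep, H_pos)  # insert replacement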

Figure 7 illustrates a recording environment whereby a completely new near-field signal, recorded from a new source 701 located outside the recording space 103, is inserted into the audio mix of the far-field audio recording device 101. In this case of adding a completely new near-field signal to the augmented mix, the RIR data needs to be broadband and valid at least for the STFT frequency indices where the augmenting signal has significant energy. A user may wish for the new source 701 to be added to the recording space 103 at a particular virtual location within the recording space 103. Based on this specified virtual location, the new signal can be used to augment the audio mixture recorded by the far-field microphone array of the far-field audio recording device 101.

For example, a virtual person can be visually rendered to an AR view and at the same time the audio can be rendered in such a way that it sounds as though the new source 701 is standing at the location at which the source appears visually in AR. An example scenario of this method to add a virtual speaker to a room is shown in Figure 7.

Rendering a virtual speaker to a room using advanced AR, VR or 6DoF rendering requires a large amount of RIR data. For example, there may be more than one far-field audio recording device 101-1 and 101-2 and several new sound sources 701 to be rendered at the same time. The 6DoF usage scenario requires that rendering from a first position to a second position is possible (in 6DoF, the listener, for whom the audio is being rendered in playback, can move freely anywhere in the virtual environment).

Embodiments of the invention use the RIR data from the RIR database 105 to render the audio objects with natural-sounding presence in any location within the scene.

Having time-varying RIR responses may be useful in post-production if some original performances are edited and later added back to the original recording space 103. In practice, this requires that the most recent time-stamped RIR data is obtained from the RIR database 105, in addition to the selected position.

A near-field audio signal from the new source 701 is received, to be added to a far-field audio mixture at a selected position of a recording space 103. A room impulse response filter relating to the selected position within the recording space is identified. The room impulse response filter applied to the near-field audio signal of the new source 701 may be retrieved from a room impulse response filter database collected during a calibration phase, as sketched below.
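A minimal sketch of such a retrieval, assuming a database layout of entries with "position", "time" and "rir" fields (the patent does not prescribe a storage format); the nearest grid point is selected, preferring the most recent entry when a timestamp is supplied:

import numpy as np

def lookup_rir(database, position, timestamp=None):
    """Pick the RIR filter at the grid point nearest to `position`."""
    entries = database
    if timestamp is not None:
        entries = [e for e in database if e["time"] <= timestamp]
    best = min(entries, key=lambda e: (
        np.linalg.norm(np.asarray(e["position"]) - np.asarray(position)),
        -e["time"]))
    return best["rir"]                               # (F, L) filter taps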

The selected room impulse response filter is then applied to the near-field audio signal of the new source 701 to obtain a projected near-field audio signal of the new source 701. The audio mixture of the far-field microphone devices 101 may then be augmented by adding the projected near-field audio signal of the new source 701 to the audio mixture.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagram of Figure 3 is an example only and that various operations depicted therein may be omitted, reordered and/or combined.

Whilst the above embodiments have been described with reference to short-time Fourier transforms, it should be appreciated that there are several different linear transforms from the time domain to the frequency domain which can be used instead. Examples include discrete cosine transforms, wavelet transforms, or mel- or Bark-scale modified implementations.
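For completeness, a minimal analysis/synthesis pair using scipy's STFT, as one transform that could serve this role; the sample rate, window length and placeholder signal are illustrative only.

import numpy as np
from scipy.signal import stft, istft

fs = 48000
x_time = np.random.randn(fs)                 # placeholder one-second signal
f, t, S = stft(x_time, fs=fs, nperseg=1024)  # analysis: time -> time-frequency
_, x_rec = istft(S, fs=fs, nperseg=1024)     # synthesis back to the time domain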

Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.




 