

Title:
CROSSTALK CANCELLATION AND ADAPTIVE BINAURAL FILTERING FOR LISTENING SYSTEM USING REMOTE SIGNAL SOURCES AND ON-EAR MICROPHONES
Document Type and Number:
WIPO Patent Application WO/2023/192317
Kind Code:
A1
Abstract:
A listening system includes a first microphone device and a second microphone device that generate a first electronic signal and a second electronic signal corresponding to sound within audio detection range. Control logic of the first microphone device detects a crosstalk audio signal from a direction of the second microphone device that matches the second electronic signal. The first electronic signal includes a mixture that includes the crosstalk audio signal. An ear playback device is associated with the second microphone device. A processing device receives the first electronic signal and the second electronic signal, removes the second electronic signal from the first electronic signal to generate a cleansed first electronic signal, and processes the cleansed first electronic signal to integrate the cleansed first electronic signal into an output signal to the ear playback device.

Inventors:
COREY RYAN (US)
SINGER ANDREW (US)
Application Number:
PCT/US2023/016619
Publication Date:
October 05, 2023
Filing Date:
March 28, 2023
Assignee:
UNIV ILLINOIS (US)
International Classes:
H04M9/08; H04R3/00; H04R25/00; H04W84/18
Foreign References:
EP3312839A1 (2018-04-25)
EP3333850A1 (2018-06-13)
US10438605B1 (2019-10-08)
US10182160B2 (2019-01-15)
Attorney, Agent or Firm:
GREENE, Nathan, O. (US)
Claims:
What is claimed is:

1. A listening system comprising: a first microphone device and a second microphone device that are co-located in an area and to generate a first electronic signal and a second electronic signal, respectively, corresponding to sound within audio detection range; control logic associated with the first microphone device, the control logic to detect a crosstalk audio signal from a direction of the second microphone device that matches the second electronic signal, wherein the first electronic signal comprises a mixture that includes the crosstalk audio signal; an ear playback device associated with the second microphone device; and a processing device communicatively coupled to the first and second microphone devices, to the control logic, and to the ear playback device, the processing device to: receive the first electronic signal and the second electronic signal; remove the second electronic signal from the first electronic signal to generate a cleansed first electronic signal; and process the cleansed first electronic signal to integrate the cleansed first electronic signal into an output signal to the ear playback device.

2. The listening system of claim 1, wherein the second microphone device is one of an in-ear microphone integrated within the ear playback device or a microphone integrated within a mobile device of a user of the ear playback device.

3. The listening system of claim 1, wherein the first microphone device is an audio detection system that includes at least a portion of the control logic and the processing device.

4. The listening system of claim 1, wherein, to remove the second electronic signal, the processing device is to apply an adaptive cancellation filter to the first electronic signal with respect to the second electronic signal.

5. The listening system of claim 4, wherein, in response to the control logic identifying a first audio signal indicative of speech from a first user, the processing device is further to disable the adaptive cancellation filter.

6. The listening system of claim 4, wherein the processing device is further to continuously update the adaptive cancellation filter to perform crosstalk cancellation optimization.

7. The listening system of claim 1, wherein the first and second microphone devices are instantiated within a single audio detection device that uses beamforming to detect a first audio signal from the first microphone device and the crosstalk audio signal, wherein the audio detection device further comprises at least a portion of the control logic.

8. The listening system of claim 1, further comprising a third microphone device that is co-located with the first and second microphone devices, the third microphone device to generate a third electronic signal corresponding to sound detected within the audio detection range and communicatively coupled to the processing device, wherein: the control logic is further to detect a second crosstalk audio signal from a direction of the third microphone device that matches the third electronic signal, wherein the first electronic signal includes a mixture that includes the crosstalk audio signal and the second crosstalk audio signal; and the processing device is further to receive and remove the third electronic signal from the first electronic signal to generate the cleansed first electronic signal.

9. The listening system of claim 8, wherein, to remove the second crosstalk signal, the processing device is to apply an adaptive cancellation filter to the first electronic signal with respect to the third electronic signal.

10. The listening system of claim 1, wherein, to process the cleansed first electronic signal, the processing device is to apply a set of audio filters comprising a first audio filter to process the cleansed first electronic signal with a first error signal, which is based on an output of a first ear microphone of the ear playback device, to generate the output signal.

11. An electronic assembly comprising: an ear playback device; a first microphone device associated with the ear playback device; and a processing device communicatively coupled to the first microphone device, to the ear playback device, and to a second microphone device that is co-located in an area with the first microphone device, the processing device to: receive a first electronic signal from the first microphone device; receive a second electronic signal from the second microphone device, wherein the second electronic signal comprises a mixture that includes the first electronic signal due to crosstalk between the first and second microphone devices; remove the first electronic signal from the second electronic signal to generate a cleansed second electronic signal; and process the cleansed second electronic signal to integrate the cleansed second electronic signal into an output signal to the ear playback device.

12. The electronic assembly of claim 11, wherein the first microphone device is integrated within the ear playback device and the processing device is integrated within a mobile device that is paired with the ear playback device.

13. The electronic assembly of claim 11, wherein the first microphone device is integrated within a mobile device that includes the processing device and the ear playback device is paired with the mobile device.

14. The electronic assembly of claim 11, further comprising a second ear playback device in which is integrated the second microphone device.

15. The electronic assembly of claim 11, wherein, to remove the first electronic signal, the processing device is to apply an adaptive cancellation filter to the second electronic signal with respect to the first electronic signal.

16. The electronic assembly of claim 15, wherein the processing device is further to continuously update the adaptive cancellation filter to perform crosstalk cancellation optimization.

17. The electronic assembly of claim 11, further comprising a third microphone device that is co-located with the first and second microphone devices and communicatively coupled to the processing device, wherein the processing device is further to: receive a third electronic signal from the third microphone device, wherein the second electronic signal further includes the third electronic signal due to crosstalk between the first and third microphone devices; and remove the third electronic signal from the second electronic signal to generate the cleansed second electronic signal.

18. The electronic assembly of claim 17, wherein, to remove the third electronic signal, the processing device is to apply an adaptive cancellation filter to the second electronic signal with respect to the third electronic signal.

19. The electronic assembly of claim 17, wherein, to process the cleansed second electronic signal, the processing device is to apply a set of audio filters comprising a first audio filter to process the cleansed second electronic signal with a first error signal, which is based on an output of a first ear microphone of the ear playback device, to generate the output signal.

20. A non-transitory computer-readable storage medium storing instructions, which when executed by a processing device that is communicatively coupled to an ear playback device, a first microphone device, and a second microphone device co-located in an area with the first microphone device, cause the processing device to perform operations comprising: receiving a first electronic signal from the first microphone device; receiving a second electronic signal from the second microphone device, wherein the second electronic signal comprises a mixture that includes the first electronic signal due to crosstalk between the first and second microphone devices; removing the first electronic signal from the second electronic signal to generate a cleansed second electronic signal; and processing the cleansed second electronic signal to integrate the cleansed second electronic signal into an output signal to the ear playback device.

Description:
CROSSTALK CANCELLATION AND ADAPTIVE BINAURAL FILTERING FOR LISTENING SYSTEM USING REMOTE SIGNAL SOURCES AND ON-EAR MICROPHONES

RELATED APPLICATIONS

[0001] The present application claims the benefit under 35 U.S.C. § 119(e) of U.S.

Provisional Application No. 63/324,983 filed March 29, 2022, which is incorporated herein by reference.

TECHNICAL FIELD

[0002] Embodiments of the disclosure relate generally to listening systems, and more specifically, relate to crosstalk cancellation and adaptive binaural filtering for a listening system using remote signal sources and on-ear microphones.

BACKGROUND

[0003] Listening devices such as hearing aids and cochlear implants often perform poorly in noisy environments. Remote microphones, which transmit sound directly from a distant talker to the ears of a listener, have been shown to improve intelligibility in adverse environments. The signal from a remote microphone has less noise and reverberation than the signals captured by the earpieces of a listening device, effectively bringing the talker closer.

[0004] Although remote microphones can dramatically improve intelligibility, remote microphones often sound artificial. In commercial devices, the signal from the remote microphone is generally presented diotically, e.g., without accounting for delay between the ears. This signal matches the spectral coloration of the remote microphones rather than that of microphones in the earpieces, and lacks interaural time and level differences that humans use to localize sounds. Some modern efforts to resolve these issues are either too processing intensive to be practical and/or employ external microphones that are not sufficiently close to talkers of interest, necessitating beamforming to achieve strong noise reduction. Such systems can be difficult/expensive to implement and are sensitive to motion.

[0005] Further, when remote microphones are employed near talkers, at least some of whom are using such listening devices, crosstalk is possible in group conversations. For example, participants in the conversation may hear a delayed copy of their own speech that was picked up by the microphone of a nearby (or sufficiently close) participant. While muting the inactive microphone is an option, this option can often be distracting and cause participants to miss parts of the conversation, e.g., the first syllables of users that had been previously muted. In a fast-moving conversation, this could be especially annoying.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] A more particular description of the disclosure briefly described above will be rendered by reference to the appended drawings. Understanding that these drawings only provide information concerning typical embodiments and are not therefore to be considered limiting of its scope, the disclosure will be described and explained with additional specificity and detail through the use of the accompanying drawings.

[0007] FIG. 1 is a block diagram of a listening system that integrates filtering electronic signals from remote signal sources with locally-captured audio signals, according to an embodiment.

[0008] FIG. 2 is a simplified flow chart of a method for filtering the electronic signals with the locally-captured audio signals, according to an embodiment.

[0009] FIG. 3A is a block diagram of an example set of single-input, binaural-output (SIBO) audio filters to perform adaptive filtration on multiple electronic signals from remote signal sources according to an embodiment.

[0010] FIG. 3B is a block diagram of an example set of multiple-input, binaural-output (MIBO) audio filters to perform adaptive filtration on multiple electronic signals from remote signal sources according to an embodiment.

[0011] FIG. 4A is a block diagram of an experimental setup with a moving human talker with multiple signal sources and a non-moving listener according to an embodiment.

[0012] FIG. 4B is a block diagram of the experimental setup with three loudspeaker signal sources and a moving listener according to an embodiment.

[0013] FIG. 5 is a set of graphs illustrating filter performance for a single moving talker according to an embodiment.

[0014] FIGs. 6A-6D are a set of graphs illustrating apparent interaural time delays (ITDs) from either near signal sources or far signal sources varied between the filters of FIG. 4A and FIG. 4B according to various embodiments.

[0015] FIG. 7 is a block diagram illustrating an exemplary listening system involving remote microphones that are co-located in an area and associated with a group conversation according to various embodiments.

[0016] FIG. 8 is a simplified block diagram of an example of crosstalk cancellation as between two remote microphones associated with two users illustrated in FIG. 7 according to at least one embodiment.

[0017] FIG. 9 is a flow chart of a method of crosstalk cancellation as between two remote microphones associated with the first and second users of FIG. 7 according to at least one embodiment.

[0018] FIG. 10 is a simplified block diagram of an example crosstalk cancellation as between three remote microphones associated with three users illustrated in FIG. 7 according to at least one embodiment.

[0019] FIG. 11 is a graph illustrating noise reduction performance at a left earpiece of a listener (where the higher values are better) according to experimental embodiments.

[0020] FIG. 12A is a graph illustrating own-speech crosstalk suppression performance at a left earpiece of a listener using a head microphone adapted to perform voice activity detection (VAD) according to experimental embodiments.

[0021] FIG. 12B is a graph illustrating own-speech crosstalk suppression performance at a left earpiece of a listener using a lapel microphone adapted to perform VAD according to experimental embodiments.

[0022] FIG. 13A is a graph illustrating high-frequency interaural level differences of other talkers at ears of a listener where subjects take turns speaking while moving to face each other according to experimental embodiments.

[0023] FIG. 13B is a graph illustrating high-frequency interaural level differences of other talkers at ears of a listener simulated with double-talk and triple-talk with subjects facing forward according to experimental embodiments.

[0024] FIG. 14 is a block diagram of an example computer system in which embodiments of the present disclosure can operate.

DETAILED DESCRIPTION

[0025] By way of introduction, the present disclosure relates to crosstalk cancellation and adaptive binaural filtering for a listening system that uses remote signal sources and on-ear microphones to enhance received sound realism in all kinds of sound environments. Many listening systems or devices can stream sound from external sources such as smartphones, televisions, or wireless microphones placed on a talker. These streamed signals have lower noise than the sound picked up by microphones integrated within earpieces and so are helpful in noisy environments, but they lack spatial cues that help users to tell where sounds are coming from, and they usually only work with one sound source (e.g., talker) at a time.

[0026] Aspects of the present disclosure address the above and other deficiencies by combining one or more remote signal sources that generate one or more electronic signals, which correspond to sound sources in an ambient environment, with a combination of audio signals detected locally by an ear microphone. The combination of audio signals can include, for example, ambient sound and one or more propagated audio signals, which correspond to the same sound sources as the one or more electronic signals but which have propagated through the air acoustically to ears of the listener. A processing device that is coupled to the one or more remote audio sources and to the ear microphone can then apply a set of audio filters to this combination of electronic signals and audio signals. For example, a respective audio filter can process a respective electronic signal of the one or more electronic signals with an error signal, which is based on an output of the ear microphone, to generate an output signal to an ear playback device. In these embodiments, acoustic cue components of the output signal match corresponding acoustic cue components of the combination of audio signals. In this way, the disclosed listening system or device helps the human brain to separate out sounds from different sources, such as two people talking at the same time, and to do so binaurally. Thus, the present disclosure makes it easier for users to hear in group conversations and to do so realistically in a noisy ambient environment.
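For illustration only, the following is a minimal Python/NumPy sketch of this idea for a single remote signal and a single ear microphone; the function and variable names (nlms_enhance, remote, ear_ref) are hypothetical, and a normalized LMS update stands in for whatever adaptive algorithm a given implementation would use. The filter is driven like an echo canceller, but its output, rather than the cancellation residual, is what is played back to the listener.

```python
import numpy as np

def nlms_enhance(remote, ear_ref, filt_len=128, mu=0.5, eps=1e-8):
    """Adapt an FIR filter so that filtering the clean remote signal
    approximates the (noisy) ear-microphone reference. The filter output,
    not the cancellation residual, is delivered to the listener."""
    w = np.zeros(filt_len)
    buf = np.zeros(filt_len)
    out = np.zeros(len(remote))
    for t in range(len(remote)):
        buf = np.roll(buf, 1)
        buf[0] = remote[t]
        y = w @ buf                              # filtered remote signal (enhanced output)
        e = ear_ref[t] - y                       # error against the ear-mic reference
        w += mu * e * buf / (buf @ buf + eps)    # NLMS update
        out[t] = y
    return out, w
```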

[0027] In various embodiments, the disclosed listening system and devices work by processing the clean signals from the remote signal sources to match the sound captured by ear (or earpiece) microphones of the listening devices. Because the ear microphones are next to the ears, these ear microphones provide useful acoustic cues. The processing device applies an adaptive filter, which can be employed as a set of audio filters, similar to the kind used for echo cancellation, but that enhances the sound instead of canceling the sound. The audio filters are updated as the talkers and listener move. Thus, the current listening system and devices are especially adapted to help listeners hear remote signal sources more accurately and with a more immersive experience, despite being within a noisy environment.

[0028] FIG. 1 is a block diagram of a listening system 100 (or listening device) that integrates filtering electronic signals from remote signal sources with locally-captured audio signals, according to an embodiment. According to some embodiments, the listening system 100 includes one or more remote signal sources 102, e.g., as a remote signal source 102A, a remote signal source 102B, a remote signal source 102C, and a remote signal source 102N. The one or more remote signal sources 102, for example, can include one or more microphones (e.g., a remote microphone placed on or near each of multiple talkers), a microphone that is part of a wearable listening device such as headphones, earbuds, a hearing aid, or the like, an array of microphones (e.g., placed near, or focusing on, a group of talkers), one or more audio signal transmitters, one or more broadcast devices, one or more sound systems, or a combination thereof.

[0029] A microphone placed on or near each talker, for example, would provide a reliable electronic signal from its wearer (e.g., each participant in a panel discussion), while an array of microphones can be used to enhance all nearby sounds, or to focus on specific sounds of interest, making the array well-suited for dynamic environments where talkers may freely join or leave the conversation within the listening range of the array (e.g., a small group of people discussing a poster presentation in a large noisy convention center). In some embodiments, the array of microphones is a ceiling mounted array of beamforming microphones designed to pick up on individual talkers that are moving around in certain zones of interest.

[0030] In some embodiments, the signal sources 102 include sound-system-generated electronic signals while speakers of the sound system produce corresponding audio signals that arrive at a listener as propagated audio signals. Certain venues such as theaters and churches may employ telecoil induction loops and radio-frequency or infrared broadcasts so that the transmitted signal appears to originate from the sound system of the venue.

[0031] In these embodiments, the listening system 100 further includes a pair of listening devices, such as a first ear listening device 110 (also referred to herein as associated with the right ear (or R) for ease of explanation) and a second ear listening device 120 (also referred to herein as the left ear (or L) for ease of explanation). The first ear listening device 110 can further include a first ear microphone 112 and a first ear playback device 116. The first ear microphone 112 can detect a first combination of audio signals including ambient sound (including noise) and one or more propagated audio signals, corresponding to the one or more electronic signals, received at a first ear of a listener. The second ear listening device 120 can further include a second ear microphone 122 and a second ear playback device 126. The second ear microphone 122 can detect a second combination of audio signals including ambient sound and one or more propagated audio signals, corresponding to the one or more electronic signals, received at a second ear of the listener that is different than the first ear.

[0032] In various embodiments, the first and second ear listening devices 110 and 120 are hearing aids, cochlear implants, earbuds, headphones, bone-conduction devices, or other types of in-ear, behind-the-ear, or over-the-ear listening devices. Thus, the first and second ear playback devices 116 and 126 can each be a receiver (e.g., in a hearing aid or a cochlear implant), a loudspeaker (e.g., of a headphone, headset, earbuds, or the like), or other sound playback device generally delivering sound to the first and/or second ears, either acoustically, via bone-conduction, or other manner of mechanical transduction.

[0033] In some alternative embodiments, the reference signals from the first and second ear microphones 112 and 122 can be derived from “virtual microphones” inferred from other physical signals, for example using a linear prediction filter or other means of linear estimation. For example, a multiple-input, binaural-output linear prediction filter could predict the signal at the ears based on signals captured by a microphone array surrounding the head. Such a prediction filter could be derived from prior measurements using on-ear microphones and then applied in the field. This use of virtual microphones is not restricted to “prediction” and may involve a variety of methods of estimating the sound that would appear at a microphone at the ear through measurements from other physical signals.

[0034] In various embodiments, the listening system 100 further includes a mobile device 140, which can be any type of mobile processing device such as a mini-computer, a programmed processing device, a smart phone, a mini-tablet, or the like. The mobile device 140 can include a processing device 150, one or more audio detectors 155, a user interface 160, which can be integrated within a graphical user interface displayable on a screen, for example, and a communication interface 170. In some embodiments, the processing device 150 is at least partially located within either of the first ear listening device 110 or the second ear listening device 120, or both. In at least some embodiments, the processing device 150 is coupled to the one or more remote signal sources 102, to the first ear microphone 112, to the first ear playback device 116, to the second ear microphone 122, and to the second ear playback device 126.

[0035] In some embodiments, one or more of the audio detectors 155 are located within either of the first ear listening device 110 or the second ear listening device 120, or both, where optional locations are illustrated in dashed lines. In some embodiments, the communication interface 170 is adapted to communicate over networks such as a personal area network (PAN), a Body Area Network (BAN), or a local area network (LAN) using technology protocols such as, for example, Bluetooth®, Wi-Fi®, Zigbee®, or a similar protocol that may be developed in the future and that is sufficiently low-latency for electronic audio signal transmission.

[0036] In at least some embodiments, the listening system 100 includes a first hearing device containing the first ear microphone 112 and connected to the first ear playback device 116 and a second hearing device containing the second ear microphone 122 and connected to the second ear playback device 126. The processing device 150 can be located within one of the first hearing device, the second hearing device, or the mobile device 140 communicatively coupled to the first hearing device and the second hearing device.

[0037] FIG. 2 is a simplified flow chart of a method 200 for filtering the electronic signals with the locally-captured audio signals, according to an embodiment. The method 200 can be performed by processing logic that can include hardware (e.g., processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of a device, integrated circuit, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, the method 200 is performed by the processing device 150 of FIG. 1, e.g., in conjunction with other hardware components of the listening system 100 (or device). Although shown in a particular sequence or order, unless otherwise specified, the order of the processes can be modified. Thus, the illustrated embodiments should be understood only as examples, and the illustrated processes can be performed in a different order, and some processes can be performed in parallel. Additionally, one or more processes can be omitted in various embodiments. Thus, not all processes are required in every embodiment. Other process flows are possible.

[0038] At operation 210, the processing logic detects a combination of audio signals including ambient sound and one or more propagated audio signals, corresponding to the one or more electronic signals, received at an ear of a listener from the one or more remote signal sources 102.

[0039] At operation 220, the processing logic applies a set of audio filters to respective ones of the one or more electronic signals.

[0040] At operation 230, the processing logic applies an audio filter (of the set of audio filters) to a respective electronic signal of the one or more electronic signals with an error signal, which is based on (e.g., a function of) an output of the ear microphone, to generate a first output signal to the ear playback device.

[0041] At operation 240, the set of audio filters causes acoustic cue components of the output signal to match corresponding acoustic cue components of the combination of audio signals. Spatial cues are especially helpful with multiple conversation partners, as they help listeners to distinguish signals from different talkers.

[0042] With additional reference to FIG. 1, when at least the same number of remote signal sources 102 (such as microphones) as talkers are employed within the listening system 100, the spatial cues of the multiple talkers are best preserved. One advantage of integrating filtering of electronic signals from the remote signal sources 102 and the propagated audio signals detected by the first ear microphone 112 and the second ear microphone 122 is that ambient noise is weakly correlated between the remote signal sources 102 and the local ear microphones. This property has been used to identify the acoustic channel between talkers of interest and the microphones of an array of remote microphones. Here, this correlation property can be exploited to match the magnitude and/or phase of the electronic signals to the propagated audio signals received by the first and second ear microphones 112 and 122.

[0043] In various embodiments, as will be explained in more detail with reference to FIGs. 3A-3B, adaptive filters use the electronic signals as inputs and the combined (received) audio signals as references for the desired outputs. If the noise is uncorrelated between the input electronic signals and the reference signals, then the filter matches the cues of the signals of interest. This adaptive approach need not explicitly estimate the acoustic channel or attempt to separate the sources. The present disclosure proposes two variants of adaptive filtering. A first variant can be a set of independently-adapted single-input, binaural-output (SIBO) filters for wearable microphones on spatially separated moving talkers, which will be discussed in more detail with reference to FIG. 3A. The second variant can be a jointly-adapted multiple-input, binaural-output (MIBO) filter suitable for arrays and closely-grouped talkers, which will be discussed in more detail with reference to FIG. 3B.

[0044] For purposes of explanation, assume there are M of the remote signal sources 102 (which for the present examples are assumed to be remote microphones) placed near N talkers of interest. The reader can assume that the present example can be expanded to different remote signal sources 102 that generate electronic signals and audio signals, e.g., from speakers of a sound system or another venue example referred to herein. For purposes of the mathematical formulation, assume that the electronic signals from the remote signal sources 102 are available instantaneously and synchronously to the first and second ear listening devices 110 and 120, for example.

[0045] Let s_1[t], ..., s_N[t] be the sampled speech signals produced by the talkers of interest. Consider a short time interval during which the talkers, listener, and microphones do not move, or whose movement is sufficiently small that its effects can be ignored. The discrete-time signals received by the first and second ear microphones 112 and 122 and received (or generated) by the remote signal sources 102 are given by

x_e[t] = \sum_{n=1}^{N} a_{e,n}[t] * s_n[t] + z_e[t]    (1)

x_r[t] = \sum_{n=1}^{N} a_{r,n}[t] * s_n[t] + z_r[t]    (2)

where * denotes linear convolution, and a_{e,n}[t] and a_{r,n}[t] are equivalent discrete-time acoustic impulse responses: i) between source n and the first and/or second ear microphones 112 and 122; and ii) between source n and the remote signal sources 102, respectively, for n = 1, ..., N. Further, z_e[t] and z_r[t] are additive noise at the first and/or second ear microphones 112 and 122 and the remote signal sources 102, respectively. While the adaptive filters referred to herein are generally described mathematically herein as implemented in the time domain for ease of explanation, these adaptive filters can also be implemented in the time-frequency domain using, e.g., the short-time Fourier transform, a filter bank, or other appropriate filter structure for adaptive acoustic filters.
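As a rough illustration of the mixture model of Equations (1) and (2), the following Python/NumPy sketch synthesizes ear-microphone and remote-microphone signals from stand-in speech, hypothetical decaying impulse responses, and additive noise; all names and numeric values (sample rate, decay constants, noise levels) are illustrative assumptions rather than values from this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, dur, N, M = 16000, 2.0, 2, 2            # sample rate, seconds, talkers, remote mics
T = int(fs * dur)
s = rng.standard_normal((N, T))              # stand-in speech signals s_n[t]

# Hypothetical acoustic impulse responses: a_e (2 ears x N sources, longer/reverberant),
# a_r (M remote mics x N sources, shorter because the mics are near the talkers).
a_e = rng.standard_normal((2, N, 256)) * np.exp(-np.arange(256) / 40)
a_r = rng.standard_normal((M, N, 64)) * np.exp(-np.arange(64) / 10)

def mix(a, s):
    """x_i[t] = sum_n a_{i,n}[t] * s_n[t], i.e., Equation (1)/(2) without the noise term."""
    chans, srcs, _ = a.shape
    x = np.zeros((chans, s.shape[1]))
    for i in range(chans):
        for n in range(srcs):
            x[i] += np.convolve(s[n], a[i, n])[: s.shape[1]]
    return x

x_e = mix(a_e, s) + 0.1 * rng.standard_normal((2, T))    # ear signals plus noise z_e
x_r = mix(a_r, s) + 0.01 * rng.standard_normal((M, T))   # remote signals plus noise z_r
```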

[0046] In various embodiments, the listening system 100 produces a binaural output given by

y[t] = \sum_{m=1}^{M} w_m[t] * x_{r,m}[t]    (3)

where w_m[t] is a discrete-time binaural filter for inputs m = 1, ..., M. Unlike in a binaural beamformer, the propagated audio signals received at the ears are not inputs to the filter used to generate the output signal to the right and left ear playback devices 116 and 126, e.g., y[t]. However, the propagated audio signals could be mixed with y[t] if desired to improve spatial awareness of ambient noise, as will be further discussed.

[0047] In at least some embodiments, the listening system 100 is designed to be perceptually transparent so that the binaural output approximates the signal captured by the first and/or second ear microphones 112 and 122 but with less noise. Mathematically, the desired output can be given by

d[t] = \sum_{n=1}^{N} g_n[t] * a_{e,n}[t] * s_n[t]    (4)

where g_n[t] is the desired processing to be applied to each source n. The g_n's can be used to apply different amplification and spectral shaping to each source, for example based on distance. The binaural impulse responses a_{e,n} encode the effects of room acoustics on the spectrum of each speech signal as well as the interaural time and level differences used to localize sounds.

[0048] It may be convenient to analyze the filters in the frequency domain. Let W(ω), A_e(ω), A_r(ω), and G(ω) be the discrete-time Fourier transforms of the respective impulse responses, where G is a diagonal matrix of desired responses for the N sources. To preserve the spectral and spatial cues of the N distinct sources, the filter should satisfy

W(ω) A_r(ω) = A_e(ω) G(ω).    (5)

For arbitrary A_r, the filter can meet this condition if M ≥ N, that is, there are at least as many remote signal sources as talkers.

[0049] Adaptive filters are often designed to minimize a mean square error (MSE) between the output and desired signals. If the speech sources and noise were wide-sense stationary random processes with known second-order statistics and if the acoustic impulse responses were known, one could directly minimize the MSE loss

L(W) = E[ || d[t] - y[t] ||^2 ]    (6)

where E denotes statistical expectation.

[0050] In various embodiments, if the filters are allowed to be non-causal and to have infinite length, then the linear minimum-mean-square-error (MMSE) filter can be readily computed in the frequency domain. Assume that all signals have zero mean and that the speech signals are uncorrelated with the noise signals. Let Φ_s(ω), Φ_{z_e}(ω), and Φ_{z_r}(ω) be the power spectral density matrices for s[t], z_e[t], and z_r[t], respectively, and let Φ_{z_e z_r}(ω) be the cross-power spectral density between z_e[t] and z_r[t]. Then the MMSE filter is given by

W_MMSE = A_e G Φ_s A_r^H ( A_r Φ_s A_r^H + Φ_{z_r} )^{-1}.    (7)

If A_r has full column rank, then the Woodbury identity can be used to show that the MMSE filter satisfies Equation (5) in the high-signal-to-noise-ratio (SNR) limit. In the remainder of the disclosure, the frequency variable ω is omitted for brevity.
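For illustration, a per-frequency evaluation of an MMSE filter of the form in Equation (7) might look like the following Python/NumPy sketch, assuming the transfer functions and power spectral densities at that frequency are known (which is rarely the case in practice, motivating the adaptive filters described below); the function name and arguments are hypothetical.

```python
import numpy as np

def mmse_filter_bin(A_e, A_r, G, Phi_s, Phi_zr):
    """Single-frequency MMSE filter: W = A_e G Phi_s A_r^H (A_r Phi_s A_r^H + Phi_zr)^-1.
    Shapes: A_e (2, N), A_r (M, N), G (N, N) diagonal, Phi_s (N, N), Phi_zr (M, M)."""
    num = A_e @ G @ Phi_s @ A_r.conj().T
    den = A_r @ Phi_s @ A_r.conj().T + Phi_zr
    return num @ np.linalg.inv(den)          # (2, M) binaural filter for this frequency bin
```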

[0051] The MMSE filter relies on the signal statistics and the transfer functions between the remote signal sources 102 and the first and second ear microphones 112 and 122, which can be difficult to estimate. Fortunately, when remote microphones are close to the sources, they provide high-quality reference signals that eliminate the need for complex source separation algorithms. Ambient noise signals are often mostly uncorrelated between on-ear and remote microphones. The processing device 150 may use this property to efficiently estimate the relative transfer function between the remote signal sources 102 and the first and second ear microphones 112 and 122 using the noisy mixture. This same principle can be applied to the adaptive filtering problem, replacing the desired signal d[t] with the noisy propagated audio signal received at the ear microphone(s), as will be discussed in more detail with reference to FIGs. 3A-3B.

[0052] In various embodiments, the above adaptive filter formulation automatically time-aligns the signals, e.g., adds delays to the remote electronic signals, which travel faster than sound. Further, the adaptive filter formulation matches the magnitude and/or phase between the remote electronic signals and the propagated audio signals that arrive at the speed of sound through the air at the ears of the listener. These features help to prevent echoes and distortion.
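As a back-of-the-envelope example of the time alignment involved, a talker 3 m from the listener produces a propagated audio signal that arrives roughly 9 ms after the nearly instantaneous electronic signal, which the adaptive filter absorbs as a delay on the order of 140 samples at a 16 kHz rate (all values here are illustrative):

```python
fs = 16000                # sample rate (illustrative)
distance_m = 3.0          # talker-to-listener distance (illustrative)
speed_of_sound = 343.0    # m/s, approximate value at room temperature
delay_samples = round(distance_m / speed_of_sound * fs)
print(delay_samples)      # -> 140
```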

[0053] Further, in at least some embodiments, a listener is enabled access to tuning options via the user interface 160. For example, the user interface 160 can display a number of menu selection items, from which the listener can choose to, e.g., listen only to the remote signal sources (e.g., distant talkers), or choose to hear everything in the environment for situational awareness.

[0054] FIG. 3A is a block diagram of example single-input, binaural-output (SIBO) audio filters 350A to perform adaptive filtration on multiple electronic signals from the remote signal sources 102 according to an embodiment. The SIBO audio filters 350A are labeled W m,R for a first set of audio filters for the right ear and W m,L for a second set of audio filters for the left ear. Each SIBO audio filter can process a separate incoming remote electronic signal (coming from the one or more remote signal sources 102) together with an error signal that is fed back as reference from an output of the first or second ear microphone 112 or 122, respectively. The adaptive filtering of FIG. 3A can be best suited for restricting the listening system 300 to only certain talkers, for applying different amplification to different talkers, or for situations in which remote microphones move so that coherent combining is difficult. In these situations, where individual talkers are tracked, a remote microphone can be placed on or near each individual talker as one of the remote signal sources 102.

[0055] More specifically, in some embodiments, the processing device 150 applies a first set of audio filters (W m,R ) including an audio filter (e.g., W 1,R ) to process a respective electronic signal of the one or more electronic signals with a first error signal 301A, which is based on an output of the first ear microphone 112, to generate a first output signal 340A to the first ear playback device 116. In some embodiments, acoustic cue components of the first output signal 340A match corresponding acoustic cue components of the first combination of audio signals received by the first ear microphone 112. In at least some embodiments, the processing device 150 further applies a second set of audio filters (W m,L ) including an audio filter (e.g., W 1,L ) to process a respective electronic signal of the one or more electronic signals with a second error signal 301B, which is based on an output (x e,L ) of the second ear microphone 122, to generate a second output signal 340B to the second ear playback device 126. In some embodiments, acoustic cue components of the second output signal 340B match corresponding acoustic cue components of the second combination of audio signals received by the second ear microphone 122.

[0056] Each filter w_m is designed to reproduce the speech of talker m, so that w_m[t] * x_{r,m}[t] ≈ g_m[t] * a_{e,m}[t] * s_m[t] for m = 1, ..., M, where the symbol ≈ denotes “approximately equal”. Each filter is computed separately to minimize its own loss function

L_m(w_m) = E[ || g_m[t] * x_e[t] - w_m[t] * x_{r,m}[t] ||^2 ].    (8)

[0057] The solution is given in the frequency domain by the 2 x 1 filter

w_m = G_m ( A_e Φ_s A_{r,m}^H + Φ_{z_e z_{r,m}} ) ( A_{r,m} Φ_s A_{r,m}^H + Φ_{z_{r,m}} )^{-1}    (9)

where A_{r,m} is the row of A_r corresponding to microphone m. If the speech sources are uncorrelated, then the SIBO filter can be expressed as

w_m = G_m ( \sum_{n=1}^{N} φ_{s_n} A_{e,n} A_{r,m,n}^* + Φ_{z_e z_{r,m}} ) / ( \sum_{n=1}^{N} φ_{s_n} |A_{r,m,n}|^2 + φ_{z_{r,m}} ).    (10)
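For illustration, a time-domain Python/NumPy sketch of independently adapted SIBO filters in the spirit of FIG. 3A follows; each remote signal drives its own two-channel FIR filter with its own error against the ear-microphone references, and the per-talker intermediate outputs are then summed, as at the summers 310A and 310B. The names and the NLMS update are assumptions for illustration.

```python
import numpy as np

def sibo_filters(x_r, x_e, filt_len=128, mu=0.5, eps=1e-8):
    """Independently adapted single-input, binaural-output filters.
    x_r: (M, T) remote signals; x_e: (2, T) ear-microphone references.
    Returns the summed binaural output (2, T) and the per-talker intermediates (M, 2, T)."""
    M, T = x_r.shape
    w = np.zeros((M, 2, filt_len))          # one FIR filter per (remote input, ear)
    bufs = np.zeros((M, filt_len))
    inter = np.zeros((M, 2, T))             # per-talker intermediate outputs
    for t in range(T):
        bufs = np.roll(bufs, 1, axis=1)
        bufs[:, 0] = x_r[:, t]
        for m in range(M):
            for ear in range(2):
                y = w[m, ear] @ bufs[m]
                e = x_e[ear, t] - y          # each filter has its own error signal
                w[m, ear] += mu * e * bufs[m] / (bufs[m] @ bufs[m] + eps)
                inter[m, ear, t] = y
    return inter.sum(axis=0), inter
```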

[0058] It can be seen from Equation (10) that the interaural cues are distorted by crosstalk among the remote microphones as well as by correlated noise. Crosstalk can also produce unintended interference effects, such as comb-filtering distortion, when the SIBO filter outputs are summed.

[0059] In some embodiments, intermediate outputs 303 of the audio filters are optionally further processed, e.g., by other_processing_1L, other_processing_2L, and other_processing_3L for each of the illustrated three audio filters of the first set of audio filters or other_processing_1R, other_processing_2R, or other_processing_3R for each of the three audio filters of the second set of audio filters. In these embodiments, each of these “other_processing” blocks includes the same or different signal processing, such as frequency-selective gain, feedback suppression, noise reduction, and dynamic range compression. For example, dynamic range compression could operate independently on each of the intermediate outputs 303, which may prevent certain types of distortion.

[0060] In at least some embodiments, the one or more electronic signals include multiple electronic signals, each audio filter of the first set of audio filters is to generate an intermediate output signal corresponding to a respective electronic signal, and the processing device 150 further combines (e.g., at a first summer 310A) the intermediate output signals to generate the first output signal 340A. In at least some embodiments, each audio filter of the second set of audio filters is to generate an intermediate output signal corresponding to a respective electronic signal, and the processing device 150 further combines (e.g., at a second summer 310B) the intermediate output signals to generate the second output signal 340B.

[0061] In additional embodiments, the processing device imparts additional processing before the output signals 340A and 340B are generated. More specifically, in some embodiments, the processing device 150 processes (e.g., with other_processing_4R) the output of the first ear microphone 112 to generate a first processed microphone signal and mixes, into the first output signal 340A, the first processed microphone signal. Further, in these embodiments, the processing device 150 processes (e.g., with other_processing_4L) the output of the second ear microphone 122 to generate a second processed microphone signal and mixes, into the second output signal 340B, the second processed microphone signal. The mixing can occur at the first and second summers 310A and 310B, respectively. Further, the processing device 150 can control the relative levels of the live audio signals from the first and/or second ear microphones 112 and 122 compared to the processed audio signals, e.g., for a selected trade-off between signal-to-noise ratio improvement and environmental awareness. This type of mixing can also reduce distortion of binaural cues for non-target sound sources.
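A one-line Python sketch of the mixing just described, with a hypothetical awareness_db parameter standing in for the selected trade-off between signal-to-noise-ratio improvement and environmental awareness:

```python
def mix_awareness(enhanced, ear_mic, awareness_db=-12.0):
    """Mix an attenuated copy of the (optionally processed) ear-microphone signal
    into the enhanced output; enhanced and ear_mic are binaural arrays of shape (2, T)."""
    g = 10.0 ** (awareness_db / 20.0)
    return enhanced + g * ear_mic
```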

[0062] In some embodiments, the processing device 150 further processes the first combination of the intermediate output signals with post-processing (e.g., other_processing_5R) that corresponds to an audio parameter before generating the first output signal 340A. In some embodiments, the processing device 150 further processes the second combination of the intermediate output signals with post-processing (e.g., other_processing_5L) that corresponds to the audio parameter before generating the second output signal 340B. In some embodiments, the processing device receives, via the user interface 160, a menu selection to adjust the audio parameter, and adjusts the audio parameter of the post-processing according to the menu selection. The audio parameter can include, for example, different volume levels at respective ear playback devices 116 and 126, different volume levels imparted to the electronic signals from the one or more signal sources 102 compared to the volume level of the propagated audio signals received at the first and second ear microphones 112 and 122, and the like.

[0063] In at least some embodiments, instead of enhancing signals of interest, the listening system 100 could be used to remove unwanted sound source signals. The output of one or more of the set of audio filters can be subtracted from the live sound captured by the first and/or second ear microphones 112 and 122. Such a system could, for example, reduce the level of music audio (or other public-address sounds) in a public venue using a copy of the audio signal of the music transmitted by the sound system. The unwanted sound signal could also be deliberately introduced in order to protect the privacy of a conversation between users of the system, either as music, white-noise, or other background noise.
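For the unwanted-sound case, the same adaptive machinery can be run in the conventional echo-cancellation direction, subtracting the filtered copy of the transmitted audio from the live ear-microphone signal; a minimal Python/NumPy sketch with hypothetical names follows.

```python
import numpy as np

def remove_unwanted(ear_mic, unwanted_ref, filt_len=256, mu=0.3, eps=1e-8):
    """Adaptive cancellation of an unwanted source (e.g., venue music) whose clean
    electronic copy is available: the filter output is subtracted from the live
    ear-microphone signal, and the residual is what the listener hears."""
    w = np.zeros(filt_len)
    buf = np.zeros(filt_len)
    cleaned = np.zeros(len(ear_mic))
    for t in range(len(ear_mic)):
        buf = np.roll(buf, 1)
        buf[0] = unwanted_ref[t]
        e = ear_mic[t] - w @ buf                 # residual after subtracting the estimate
        w += mu * e * buf / (buf @ buf + eps)    # NLMS update
        cleaned[t] = e
    return cleaned
```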

[0064] In at least some embodiments, the first set of audio filters (W m,R ) and the second set of audio filters (W m,L ) are defined by a parametric model that is to separately, for each of the remote signal sources 102A ... 102N, at least one of apply an equalization filter that is shared by both the first output signal and the second output signal or encode interaural time and amplitude level differences between the first output signal and the second output signal used to localize sounds.

[0065] In at least some embodiments, the first set of audio filters ( W m,R ) is defined by a parametric model that is to separately, for each of the remote signal sources 102A ... 102N, at least one of: encode a delay time for the first output signal; perform parametric equalization for the first output signal; encode a set of effects of ambient acoustics on a spectrum of the one or more electronic signals; define a filter that has an impulse response of a particular length; or define a filter described by a set of poles and zeros. The parametric model can also be applied to the second set of audio filters ( W m,L ).

[0066] In at least some embodiments, by way of example, the first error signal 301A includes a difference between the output of the first audio filter ( W 1,R ) and the output of the first ear microphone 112. In these embodiments, the processing device 150 is further to input a first electronic audio signal, from a first remote signal source (e.g., x r,1 ), to the first audio filter. The processing device 150 may cause a first relative transfer function of the first audio filter to adaptively minimize the first error signal in a first intermediate output signal, where it is understood that the mean-square value of the error or a variety of other functions of the error are to be minimized. The processing device 150 can input a second electronic signal, from a second remote signal source (e.g., x r,2 ), to a second audio filter (W 2,R ) of the first set of audio filters. The second error signal includes a difference between the output of the second audio filter (W 2,R ) and the output of the first ear microphone 112. The processing device 150 may then cause a second relative transfer function of the second audio filter (W 2,R ) to adaptively minimize the second error signal in a second intermediate output signal, and combine, to generate the first output signal 340A, the first intermediate output signal with the second intermediate output signal, where it is understood that the mean-square value of the error or other function of the error can be minimized. These operations can be extended to additional ones of the remote signal sources 102.

[0067] In some embodiments, the processing device 150 further applies a first processing variable to the first intermediate output signal (e.g., other_processing_1R) and applies a second processing variable (e.g., other_processing_2R) to the second intermediate output signal. With additional reference to FIG. 1, in some embodiments, a first audio detector (of the one or more audio detectors 155) is coupled to the processing device 150, which disables the first audio filter ( W 1,R ) in response to sound from the first remote signal source not satisfying a threshold magnitude. Further, in these embodiments, a second audio detector (of the one or more audio detectors 155) is coupled to the processing device 150, which disables the second audio filter (W 2,R ) in response to sound from the second remote signal source not satisfying the threshold magnitude. To satisfy the threshold magnitude, the sound is to be greater than or equal to the threshold magnitude. This threshold magnitude may be set within a certain audio range that statistically determines that the remote signal source is not active, e.g., the remote talker is not talking or not talking sufficiently loudly into a remote microphone. In this way, disabling each respective audio filter that is coupled with a particular remote signal source 102 that is inactive helps to improve the performance of that filter and to conserve processing power needed by the processing device 150.
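A simple audio detector of the kind described above might, for example, gate adaptation on a frame RMS level; the following Python/NumPy sketch uses a hypothetical threshold value and frame length purely for illustration.

```python
import numpy as np

def frame_active(remote_frame, threshold_rms=0.01):
    """Report whether a frame of a remote signal satisfies the threshold magnitude
    (here an RMS level), so the corresponding audio filter can be disabled
    while its talker is silent."""
    rms = np.sqrt(np.mean(remote_frame ** 2))
    return rms >= threshold_rms

# Usage sketch (hypothetical frame length of 160 samples):
# if frame_active(x_r[m, t - 160:t]):
#     ...run the NLMS update for filter m...
# else:
#     ...freeze (disable) filter m for this frame...
```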

[0068] FIG. 3B is a block diagram of an example set of multiple-input, binaural-output (MIBO) audio filters 350B to perform adaptive filtration on multiple electronic signals from remote signal sources according to an embodiment. As discussed, the MIBO audio filters 350B may be especially suited for arrays of remote signal sources 102 (e.g., a microphone array, speaker array, or the like) and closely-grouped talkers that may come and go within an area of interest to which a microphone array is pointing.

[0069] In at least some embodiments, the MIBO audio filters 350B are similarly labeled W m,R for the first set of audio filters for the right ear and W m,L for the second set of audio filters for the left ear, and thus have similarities to the SIBO audio filters 350A. Different from the SIBO audio filter embodiment (FIG. 3A), however, the first set of audio filters (W m,R ) jointly process the one or more remote electronic signals with a single first feedback signal 302A received from the first ear microphone 112 to generate first intermediate output signals 303A. The processing device 150 may combine (e.g., using a summer 305A) the first intermediate output signals 303A to generate the first output signal 340A. Further, the second set of audio filters (W m,L ) jointly process the one or more remote electronic signals with a single second feedback signal 302B received from the second ear microphone 122 to generate second intermediate output signals 303B. The processing device 150 may combine (e.g., using a summer 305B) the second intermediate output signals 303B to generate the second output signal 340B.

[0070] In some embodiments, the first error signal 301A includes a difference between a combination of the first intermediate output signals 303A and the output of the first ear microphone 112, and a single acoustic loss function of the first set of audio filters is to adaptively minimize the mean-square value of the first error signal, where it is understood that the mean-square value or other function of the error can be minimized. In some embodiments, the second error signal 301B includes a difference between a combination of the second intermediate output signals 303B and the output of the second ear microphone 122, and a single acoustic loss function of the second set of audio filters is to adaptively minimize the mean-square value of the second error signal, where it is understood that the mean-square value or other function of the error can be minimized.

[0071] To discuss the MIBO audio filters 350B mathematically, suppose that a desired response is the same for all talkers, that is, g_n[t] = g[t] for all n and G(ω) = G(ω)I for all ω. Instead of minimizing the true MSE, the processing device 150 can minimize the loss function

L(W) = E[ || g[t] * x_e[t] - y[t] ||^2 ].    (11)

[0072] In some embodiments, if the signals are wide-sense stationary, then the linear MMSE filter that minimizes L is given in the frequency domain by

W = G ( A_e Φ_s A_r^H + Φ_{z_e z_r} ) ( A_r Φ_s A_r^H + Φ_{z_r} )^{-1}.    (12)

[0073] This MIBO audio filter attempts to replicate both the desired speech and the unwanted noise at the ears, e.g., as delivered within the output signals 340A and 340B. However, if the noise is uncorrelated between the combined propagated audio signals and remote electronic signals, then Φ_{z_e z_r} = 0 and the adaptive filter of Equation (12) is identical to the MMSE filter of Equation (7). That is, the MIBO audio filter cannot use the remote electronic signals to predict the noise, only the propagated audio of the talkers (or other signal sources) of interest.
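For illustration, a time-domain Python/NumPy sketch of a jointly adapted MIBO filter in the spirit of FIG. 3B follows; all M remote signals feed one filter bank per ear, and a single shared error per ear (the difference between the summed intermediate outputs, as at the summers 305A and 305B, and the ear-microphone signal) drives the update. The names and the NLMS update are assumptions for illustration.

```python
import numpy as np

def mibo_nlms(x_r, x_e, filt_len=128, mu=0.3, eps=1e-8):
    """Jointly adapted multiple-input, binaural-output filter.
    x_r: (M, T) remote signals; x_e: (2, T) ear-microphone references.
    Returns the binaural output y of shape (2, T)."""
    M, T = x_r.shape
    w = np.zeros((2, M, filt_len))           # one FIR filter per (ear, remote input)
    bufs = np.zeros((M, filt_len))
    y = np.zeros((2, T))
    for t in range(T):
        bufs = np.roll(bufs, 1, axis=1)
        bufs[:, 0] = x_r[:, t]
        for ear in range(2):
            y[ear, t] = np.sum(w[ear] * bufs)      # sum of all M filter outputs
            e = x_e[ear, t] - y[ear, t]            # single shared error per ear
            norm = np.sum(bufs * bufs) + eps
            w[ear] += mu * e * bufs / norm         # joint NLMS update across all inputs
    return y
```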

[0074] With correlated noise, the spatial cues of the target are distorted by those of the noise, as can be readily seen in the special case where M = N = 1:

W = G ( A_e φ_s A_r^* + Φ_{z_e z_r} ) / ( φ_s |A_r|^2 + φ_{z_r} ).    (13)

[0075] In the numerator of Equation (13), the noise at the ear microphones distorts interaural cues to the extent that it is correlated with the noise at the remote microphone (or other signal source). In the denominator of Equation (13), the magnitude of the noise at the remote microphone alters the magnitude of the output, just as it would for the MMSE filter. Thus, system performance may strongly depend on placement of the remote microphone relative to the remote speakers.

[0076] In some embodiments, a property of the MIBO audio filters 350B is that the combined adaptive filter processing does not separate the sources of interest, nor does each MIBO audio filter 350B explicitly model their acoustic transfer functions. Since the inputs to each MIBO audio filter 350B can be combinations of the speech signals of interest, the MIBO audio filter 350B is suitable for systems with significant crosstalk, such as wearable microphones on nearby talkers or a microphone array placed near a group of talkers. It can also adapt easily as talkers move around the area near the microphones or as they enter and leave a conversation, where preferably no more than M talkers participate at a time.

[0077] In additional embodiments, the processing device imparts additional processing before the output signals 340A and 340B are generated. In at least some embodiments, the processing device 150 further optionally processes (e.g., with other_processing_1R) the combination of the first intermediate output signals 303A to generate the first output signal 340A. Further, the processing device 150 optionally processes (e.g., with other_processing_1L) the combination of the second intermediate output signals 303B to generate the second output signal 340B.

[0078] In at least some embodiments, the processing device 150 optionally processes (e.g., with other_processing_2R) the output of the first ear microphone 112 to generate a first processed microphone signal. The processing device 150 can further mix (e.g., via a first summer 310A) the first processed microphone signal into the first output signal 340A. Further, in these embodiments, the processing device 150 optionally processes (e.g., with other_processing_2L) the output of the second ear microphone 122 to generate a second processed microphone signal. The processing device 150 can further mix (e.g., via a second summer 310B) the second processed microphone signal into the second output signal 340B.

[0079] In at least some embodiments, the processing device 150 further optionally processes (e.g., with other_processing_3R) the first output signal 340A before outputting the first output signal 340A to the first ear playback device 116. In these embodiments, the processing device 150 further optionally processes (e.g., with other_processing_3L) the second output signal 340B before outputting the second output signal 340B to the second ear playback device 126. In some embodiments, the processing device 150 can apply the various further processing (e.g., designated as “other_processing” herein) separately or in combination, as related to either of the SIBO audio filters 350A (FIG. 3A) or the MIBO audio filter 350B (FIG. 3B).

[0080] In some embodiments, if the processing device 150 is unsuccessful with source separation of the remote signal sources 102 within a SIBO mode of applying the SIBO audio filters 350A for each ear (FIG. 3A), the processing device 150 switches to a MIBO mode of applying the MIBO audio filter 350B for each ear (FIG. 3B). While the adaptive filtration of the SIBO audio filters 350A and the MIBO audio filter 350B, respectively, are explained above to use the least mean squares algorithm, in other embodiments, a different adaptive algorithm is employed, such as recursive least squares or normalized least mean squares, or others that are known in the field of adaptive filtering. The samples of the audio signals could be processed in blocks, and the learning rate can also be changed over time. Thus, the listening system 100 does not depend on the specific adaptive algorithm used.

[0081] In some embodiments, instead of an arbitrary audio filter or a specific parametric model, the adaptive filtering of an applied set of audio filters could choose the filter that best matches the observed (e.g., processed) audio from a set of possible audio filters. This set of possible audio filters could include, for example, a database of generic human head-related impulse responses, like those used for virtual-reality audio; a database of personalized head-related impulse responses for the listener user, which have been measured directly or inferred based upon the head shape of the listener user; a database of head-related impulse responses augmented by interpolation techniques; or a manifold of head-related impulse responses generated by a manifold learning algorithm. These databases/manifolds could be refined based upon room acoustics of the ambient environment. For example, the system could select from one set in a strongly reverberant room and another set in a weakly reverberant room.
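One simple way to realize this database-selection idea is to filter the remote signal through each candidate binaural impulse response and keep the candidate whose output best matches the observed ear-microphone signals; the Python/NumPy sketch below uses hypothetical names and a plain squared-error match as the selection criterion.

```python
import numpy as np

def pick_best_hrir(remote, ear_ref, hrir_candidates):
    """Choose, from candidate binaural impulse responses, the one whose filtered
    remote signal best matches the observed ear-microphone signals.
    remote: (T,), ear_ref: (2, T), hrir_candidates: iterable of (2, L) arrays."""
    best, best_err = None, np.inf
    for h in hrir_candidates:
        pred = np.stack([np.convolve(remote, h[ear])[: len(remote)] for ear in range(2)])
        err = np.sum((ear_ref - pred) ** 2)      # residual against both ears
        if err < best_err:
            best, best_err = h, err
    return best
```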

[0082] In some embodiments, one or more audio filter(s) applied by the processing device 150 is initialized (and occasionally re-initialized) using either a parametric filter model or a filter chosen from a database or manifold. The chosen filter can then be fine-tuned using an adaptive filtering algorithm such as those discussed herein.

[0083] In some embodiments, one or more audio filter(s) applied by the processing device 150 is constrained based on a physical or perceptual model. For example, the adaptive algorithm could impose upper and lower bounds on the magnitude of the frequency response within different bands or on the variation of the magnitude across bands. The audio filters at the two ears can also be constrained so that they do not deviate from each other by more than the expected delay or attenuation due to the head of a listener.

[0084] In some embodiments, the adaptation of the one or more audio filter(s) is aided by position information from a head-tracking system or other motion-capture devices. For example, a head-related impulse response can be selected based not only on the audio data, but also on the direction of the talker relative to the head orientation of the listener user.
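As a non-limiting sketch of one way the frequency-response bounds described in paragraph [0083] could be imposed, the following example clips the magnitude response of an adapted filter between per-frequency lower and upper bounds while leaving the phase unchanged; the bound values, FFT length, and function name are illustrative assumptions rather than the claimed implementation.

```python
import numpy as np

def constrain_filter_magnitude(coefficients, lower_db, upper_db, n_fft=512):
    """Project an adapted FIR filter so that its magnitude response stays
    within per-frequency lower/upper bounds (in dB); the phase response is
    left unchanged. lower_db and upper_db are arrays of length n_fft//2 + 1.
    """
    spectrum = np.fft.rfft(coefficients, n_fft)
    magnitude_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    clipped_db = np.clip(magnitude_db, lower_db, upper_db)
    constrained = (10.0 ** (clipped_db / 20.0)) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(constrained, n_fft)[: len(coefficients)]
```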

[0085] In some embodiments, the listening system 100 can improve the quality of the first and second output signals 340A and 340B (FIGs. 3A-3B) in a reverberant environment by the processing device 150 performing reverberation-reducing processing, such as truncating the impulse response (or the part of the impulse response following the direct path) to a prescribed length in order to reduce reverberation. When selecting filters from a database or manifold, the processing device 150 can further choose equivalent filters that share the same spatial cues as the propagated audio signal received at the first and/or second ear microphones 112 and 122, but have milder reverberation. The processing device 150 can further adjust gain and reverberation levels based on the distance from the talker to the listener user. The distance can be inferred from the acoustic time of flight of the signal or measured directly using range-finding technology built into the processing device 150. The processing device 150 can further switch off the binaural filter when the talker is far away. Beyond a prescribed distance, the listening system 100 would function like a conventional remote microphone and present the remote signal diotically.
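The following is a minimal sketch, under illustrative assumptions, of the impulse-response truncation described above: samples up to a prescribed length after the direct-path peak are retained and the remainder is discarded, with a short fade to avoid an abrupt cutoff. The parameter names and fade length are assumptions, not part of the claimed system.

```python
import numpy as np

def truncate_after_direct_path(impulse_response, keep_ms, sample_rate=16000):
    """Reduce reverberation by keeping only the portion of an impulse response
    near the direct-path peak and fading out the remainder."""
    direct_index = int(np.argmax(np.abs(impulse_response)))   # direct-path peak
    keep_samples = int(keep_ms * sample_rate / 1000)
    end = min(len(impulse_response), direct_index + keep_samples)
    truncated = np.zeros(len(impulse_response))
    truncated[:end] = impulse_response[:end]
    fade = min(64, end)                                       # short linear fade-out
    truncated[end - fade:end] *= np.linspace(1.0, 0.0, fade)
    return truncated
```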

[0086] People with hearing loss often have difficulty hearing people who are not facing them. If the processing device 150 detects that the talker is facing away from the listener, for example based on the slope of the magnitude response of the acoustic transfer function, the processing device 150 can substitute a head-related impulse response for a remote signal source 102 in the same location, but facing toward the listener.

[0087] In some embodiments, the listener user could use a control interface displayed within the user interface 160 (e.g., that includes physical knobs, a smartphone app, voice commands, gestures, and the like) to adjust the relative levels of different sounds. The options could include the individual sounds corresponding to remote microphones, the live mixture at the ears, and external signals such as playback from a personal electronic device. In some embodiments, sound levels could be adjusted relative to the magnitude level at the ears, rather than the magnitude level at the source or some absolute measure of sound pressure level. For example, the user could make a conversation partner "twice as loud as real life" or annoying music "half as loud as real life." The listener user could also directly change reverberation levels of each source, or choose upper or lower bounds on acceptable reverberation levels. To provide a more intuitive user experience, instead of providing separate controls for gain and reverberation, the control interface could allow the listener user to change the perceived distance of the remote signal source 102. If the listener user wishes to "move the sound closer," the processing device 150 can increase gain and decrease reverberation by a corresponding amount, for example.
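As a rough illustration of the perceived-distance control described above, the following sketch couples gain and reverberation through a single distance factor; the inverse-distance gain rule and the specific scaling of the reverberation level are simplifying assumptions for illustration only, not the claimed implementation.

```python
def move_source_closer(gain, reverb_level, distance_factor):
    """Couple gain and reverberation through a single apparent-distance factor:
    values below 1.0 move the source closer (louder, relatively drier) and
    values above 1.0 move it farther away."""
    new_gain = gain / distance_factor            # inverse-distance gain rule
    new_reverb = reverb_level * distance_factor  # relatively less reverberant when closer
    return new_gain, new_reverb

# Example: make the talker appear half as far away.
print(move_source_closer(gain=1.0, reverb_level=0.5, distance_factor=0.5))
```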

[0088] With additional reference to FIG. 1, and in at least one embodiment, the one or more remote signal sources 102 include one or more remote microphones that detect one or more local audio signals and transmit one or more electronic signals corresponding to the one or more local audio signals, as well as one or more remote signal sources that generate one or more additional electronic signals. The one or more remote signal sources can include one or more audio signal transmitters, one or more broadcast devices, one or more sound systems, or a combination thereof. In this embodiment, the first ear microphone 112 detects a first combination of audio signals including ambient sound and propagated audio signals, corresponding to the one or more electronic signals and the one or more additional electronic signals, received at a first ear of a listener. In this embodiment, the second ear microphone 122 detects a second combination of audio signals including ambient sound and propagated audio signals, corresponding to the one or more electronic signals and the one or more additional electronic signals, received at a second ear of the listener that is different than the first ear.

[0089] In this at least one embodiment, the processing device 150 applies a first set of audio filters (e.g., W m,R ) to generate the first output signal 340A to the first ear playback device 116. The first set of audio filters can include at least a first audio filter (e.g., W 1,R ) to process a respective electronic signal of the one or more electronic signals with a first error signal, which is based on an output of the first ear microphone 112. At least a second audio filter (e.g., W 2,R ) is included to process a respective electronic signal of the one or more additional electronic signals with one of the first error signal (for a MIBO audio filter) or a second error signal (for a set of SIBO audio filters), respectively, based on the output of the first ear microphone. In some embodiments, acoustic cue components of the first output signal match corresponding acoustic cue components of the first combination of audio signals.

[0090] In at least this embodiment, the processing device applies a second set of audio filters (e.g., W m,L ) to generate the second output signal 340B to the second ear playback device 126. The second set of filters can include at least a third audio filter (e.g., W 1,L ) to process a respective electronic signal of the one or more electronic signals with a third error signal, which is based on an output of the second ear microphone 122. At least a fourth audio filter (e.g., W 2,L ) is included to process a respective electronic signal of the one or more additional electronic signals with one of the third error signal (for a MIBO audio filter) or a fourth error signal (for a SIBO audio filter) based on the output of the second ear microphone 122. In some embodiments, acoustic cue components of the second output signal 340B match corresponding acoustic cue components of the second combination of audio signals.

[0091] With additional reference to FIGs. 3A-3B, the frequency-domain analysis assumes that the audio filters can be non-causal and can have infinite length. In a real listening system, the audio filters are causal and have finite length. Fortunately, because the remote microphones are placed near the talkers, the binaural filters should closely resemble the acoustic impulse responses between the talkers and listener. As long as the group delay of the desired responses plus any transmission delay between the remote microphones (or other remote signal sources 102) and the first and second ear microphones 112 and 122 is smaller than the acoustic time of flight between talkers and listener, it should be possible to design causal binaural filters.

[0092] The above analysis also assumes that the acoustic listening system 100 (or devices) is stationary. In reality, human talkers and listeners move constantly. To adapt to changing conditions, the SIBO and MIBO audio filters can be designed to be time-varying. Let $w_{m,\tau}[t]$ be the filter coefficients at time $t$ for $m = 1, \ldots, M$ and $\tau = 0, \ldots, L-1$, where $L$ is the length of each filter. The filter output is given by

$$\hat{y}[t] = \sum_{m=1}^{M} \sum_{\tau=0}^{L-1} w_{m,\tau}[t]\, x_m[t-\tau]. \qquad (14)$$

[0093] In some embodiments, Equation (14) can be written as a matrix-vector multiplication,

$$\hat{y}[t] = \mathbf{w}^{\mathsf{T}}[t]\, \mathbf{x}[t], \qquad (15)$$

where $\mathbf{w}[t]$ stacks the coefficients $w_{m,\tau}[t]$ and $\mathbf{x}[t]$ stacks the corresponding delayed input samples $x_m[t-\tau]$ for $m = 1, \ldots, M$ and $\tau = 0, \ldots, L-1$.

[0094] In the experiments in this work, we update the filter coefficients with the least mean squares (LMS) algorithm. The MIBO update is given by

$$\mathbf{w}[t+1] = \mathbf{w}[t] + \mu\, e[t]\, \mathbf{x}[t], \qquad (16)$$

where $\mu$ is a tunable step size parameter and $e[t]$ is the error signal based on the output of the corresponding ear microphone.

[0095] The SIBO updates have the same form except that each audio filter is adapted independently:

$$\mathbf{w}_m[t+1] = \mathbf{w}_m[t] + \mu\, e_m[t]\, \mathbf{x}_m[t], \qquad m = 1, \ldots, M, \qquad (17)$$

where $e_m[t]$ is the error signal for the corresponding single-input filter.

[0096] With additional reference to FIG. 1, other possible remote signal sources 102 are envisioned, alone or in combination with other remote signal sources, that are capable of generating electronic signals representing sound that is also being passed as propagated audio signals through the air. In some embodiments, the remote source signals could be any low-noise mixture of the talkers of interest. For example, the output of a source separation or enhancement algorithm (such as independent vector analysis or a deep neural network) could be connected to the input of the MIBO audio filter 350B. The advantage of the proposed approach is that the input to the adaptive filters can be any combination of the sources of interest. Thus, a source separation algorithm could be useful even if it suffers from a permutation ambiguity, that is, if there is crosstalk in its output.
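For clarity, the following sketch shows the time-domain filtering and LMS updates of Equations (14)-(17) in simplified form; the array shapes, step-size value, and the definition of the error as the difference between the ear-microphone sample and the filter output are assumptions consistent with the description above, not a definitive implementation.

```python
import numpy as np

def mibo_lms_step(weights, history, ear_sample, step_size):
    """One LMS update of a multiple-input filter in the form of Equations
    (14)-(16). `weights` and `history` both have shape (M, L): the current
    coefficients and the most recent L samples of each remote signal."""
    output = np.sum(weights * history)                 # Equation (14)/(15)
    error = ear_sample - output                        # mismatch at the ear microphone
    weights = weights + step_size * error * history    # Equation (16)
    return weights, output, error

def sibo_lms_step(weight_m, history_m, error_m, step_size):
    """Single-input update of Equation (17); each remote signal's filter is
    adapted independently against its own error signal."""
    return weight_m + step_size * error_m * history_m
```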

[0097] As another example, the outputs of a set of beamformers, such as those used in many commercial teleconferencing audio capture systems, could be used as inputs to the MIBO audio filter 350B. The adaptive filter would add utility to the listening system 100 by restoring spatial cues and compensating for any spectral distortion caused by the beamformer. Furthermore, talkers would be able to move between the beams and the MIBO audio filter 350B would continue to produce the correct spatial cues without extensive adaptation, as the MIBO audio filter 350B adapts based on the beams from beamformer microphones, not the talker positions.

[0098] The listening system 100, which employs the disclosed adaptive filtering, was evaluated experimentally using a binaural dummy head in an acoustically treated laboratory (T 60 ≈ 250 ms). Speech signals were either produced by a human talker or derived from the VCTK dataset and played back over loudspeakers. Each talker was recorded separately and the recordings were mixed to simulate simultaneous speech. For each experiment, the adaptive filter coefficients were computed based on the mixture but applied separately to each source recording in order to track the effect of the system on each component signal. The filters were about 20 ms in length and were designed to be transparent for the source(s) of interest. The step size μ was tuned manually. For each experiment, the wideband signal-to-noise ratio (SNR) was computed after high-pass filtering at 200 Hz to exclude mechanical noise in the laboratory. The apparent interaural time delays (ITDs) were computed by finding the peak of the cross-correlation within overlapping 5-second windows.

[0099] The experiments are summarized in FIGs. 4A-4B and Table 1. FIG. 4A is a block diagram of an experimental setup with a moving human talker with multiple signal sources and a non-moving listener according to an embodiment. FIG. 4B is a block diagram of the experimental setup with three loudspeaker signal sources and a moving listener according to an embodiment. Table 1 illustrates wideband SNR in decibels for acoustic experiments. Input and filter output SNRs are measured at the left ear for experimental purposes.

Table 1

[00100] FIG. 5 is a set of graphs illustrating filter performance for a single moving talker according to an embodiment. The top graph illustrates SNR at the left ear. The bottom graph illustrates apparent ITD of the target source in the filter output. The dotted curve shows the true ITD. In the first experiment, which simulates the typical use case for remote microphone systems today, a lapel microphone was worn by a moving human talker. Noise was produced by seven loudspeakers placed around the room. The human subject followed the same route during each source recording so that sound and motion are roughly synchronized. The top plot of FIG. 5 shows the wideband input and output SNR at the left ear and the input SNR at the remote microphone. The SNR varied as the talker moved among the interfering loudspeakers. The output SNR closely tracks the remote microphone input SNR, as expected. The bottom plot shows the apparent ITD of the target speech at the output of the binaural filter compared to that of the clean signal at the ears. The adaptive filter is able to track the spatial cues as the talker moves from center to left to right and back again. Thus, the filter output matches the SNR of the remote microphone and the spatial cues of the earpieces.
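As an illustration of the ITD measurement described for the experiments, the following sketch locates the peak of the cross-correlation between the left and right ear signals of one analysis window; the plausible-lag limit, window handling, and function name are illustrative assumptions.

```python
import numpy as np

def apparent_itd(left, right, sample_rate, max_delay_ms=1.0):
    """Estimate the apparent interaural time delay of one analysis window by
    locating the peak of the left/right cross-correlation within a range of
    physically plausible lags. Returns the peak lag in seconds."""
    max_lag = int(max_delay_ms * sample_rate / 1000)
    full = np.correlate(left, right, mode="full")      # correlation at all lags
    center = len(right) - 1                            # index of zero lag
    window = full[center - max_lag : center + max_lag + 1]
    best_lag = int(np.argmax(window)) - max_lag        # lag of the peak, in samples
    return best_lag / sample_rate
```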

[00101] FIGs. 6A-6D are a set of graphs illustrating apparent interaural time delays (ITDs) from either near signal sources or far signal sources, as varied between the filters of FIG. 4A and FIG. 4B, according to various embodiments. A second experiment simulated a multiple-talker application with a moving listener. The dummy head was placed on a motorized turntable, which made one rotation during the one-minute recording, starting from the FIG. 4B scenario. Loudspeakers simulated three talkers of interest and five unwanted speech sources. The remote microphones were three end-address cardioid vocal microphones. First, to simulate personal remote microphones, each remote microphone was placed about 30 cm in front of its corresponding speaker. Second, to simulate an array, the three remote microphones were grouped together about 60 cm from the talkers.

[00102] The SNR results are shown in Table 1 and the apparent ITDs are shown in FIGs. 6A-6D for the four combinations of filter type and microphone placement. When the remote microphones were close to the talkers, the SIBO filters and MIBO filter both performed well, with the MIBO filter achieving a slightly higher SNR and better preserving interaural cues. When the remote microphones were farther from the talkers, the MIBO filter still preserved interaural cues but also reproduced more unwanted noise. The SIBO filters were better at rejecting noise, but crosstalk between sources caused distortion of the interaural cues.

[00103] FIG. 7 is a block diagram illustrating an exemplary listening system 700 (or electronic assembly) involving remote microphones that are co-located in an area and associated with a group conversation according to various embodiments. In these embodiments, the listening system 700 includes several listening devices such as the listening devices 110, 120 already discussed (see FIG. 1), e.g., hearing aids, cochlear implants, or ear buds of different kinds, as well as microphone devices 720, e.g., mobile devices such as smartphones, tablets, and other mobile computing devices that include an integrated microphone. In this sense, the listening devices 110, 120 may also be considered to be microphone devices 720. For example, a microphone device may be an in-ear microphone integrated within an ear playback device or may be a microphone integrated within a mobile device of the user.

[00104] In these embodiments, the microphone devices 720 may include a first microphone device 720A owned (e.g., carried) by a first user, a second microphone device 720B owned (e.g., carried) by a second user, and a third microphone device 720C owned (e.g., carried) by a third user that are co-located in an area and to generate a first electronic signal, a second electronic signal, and a third electronic signal, respectively. Thus, to be "co-located" connotes being within audio detection range for speech, and these electronic signals correspond to sound within such audio detection range.

[00105] In at least some embodiments, each of these microphone devices 720 (illustrated in detail with respect to the third microphone device 720C for purposes of explanation) includes a microphone 722, a playback device 726 (e.g., speaker, paired speaker, paired ear buds, paired ear phones, or the like), a processing device 750, an audio detector 755 that includes control logic 757, and a communication interface 770. In some embodiments, as discussed with reference to FIG. 1, at least a portion of the processing device 750 may be incorporated within one of the listening devices 110, 120. Thus, in at least one embodiment, the first microphone device 720A (or other of the disclosed microphone devices) is an audio detection system that includes at least a portion of the control logic 757 (implementing audio detection) and at least a portion of the processing device 750. While only three users and their respective microphone devices 720A, 720B, 720C are illustrated, this disclosure contemplates N users and N microphone devices, as will be discussed in more detail.

[00106] In various embodiments, the pairing of a playback device 726 or one of the listening devices 110, 120 to one or more of the microphone devices 720 may be performed over a respective communication interface 770 of a microphone device over a network 715. In different embodiments, the network 715 is a personal area network (PAN), a Body Area Network (BAN), or a local area network (LAN). The technology used for such pairing and communication over the network 715 may include technologies such as Bluetooth®, Near-Field Communication (NFC), Wi-Fi®, Zigbee®, or a similar protocol that enables generation of a personal area network (PAN) 715. It is envisioned that a future network 715 will be configured to handle communication at the speed of sound to facilitate the audio-based intercommunication of the microphone devices 720 and the listening devices 110, 120.

[00107] In various embodiments, the audio detector 755 may be a voice activity detector that is coupled to the microphone 722 (or other voice detection hardware) and is able to detect speech from the different users as an audio signal. An audio signal is to be distinguished herein from noise in an ambient environment of the users. Audio signals may be combined (e.g., mixed) with such noise to generate the first, second, and third electronic signals from the three users of FIG. 7. In some embodiments, one or more of the first microphone device 720A, the second microphone device 720B, and the third microphone device 720C are instantiated within a single audio detection device (e.g., an audio puck with an array of microphones pointed in different directions) that uses beamforming to detect audio signals and crosstalk audio signals from multiple sources. Other microphone arrays are envisioned as well.

[00108] Voice activity detection (or VAD) is a technique in which presence or absence of human speech is detected, e.g., identifying or classifying audio as human speech as opposed to other ambient sounds or noise. Although VAD is commonly performed using logic (e.g., the control logic 757) coupled to a microphone, VAD may also be employed in conjunction with an accelerometer (such as present in AirPods®) or a vibration detection device attached to the throat area of a user. Thus, a VAD device may be employed as a part of an intelligent or smart audio detection device or system, which can be embedded within any single device or a combination of ear listening devices 110, 120 or microphone devices 720 discussed herein in various embodiments.

[00109] In embodiments, VAD is used to trigger one or more processes, as will be discussed in more detail, performed by the processing device 750. For example, VAD can be applied in speech-controlled applications and devices like smartphones (among other smart devices that are employed in homes, offices, and vehicles), which can be operated by using speech commands. Further, some of the main uses of VAD are in speaker diarization, speech coding, and speech recognition. VAD can facilitate speech processing and be used to deactivate some processes during non-speech sections of an audio session or during a speech section of the audio session when in a group conversation environment, as will be discussed.

[00110] Oftentimes, the intelligibility of group conversations in noisy environments such as a restaurant, networking meeting, or the like is poor. In embodiments, the listening system 700 is employed to improve the intelligibility by aggregating signals from the mobile and wearable devices, referred to herein as the microphone devices 720 and listening devices 110, 120, of the participants (e.g., Users 1-3). In disclosed embodiments, the listening system 700 uses a microphone device 720 placed near each talker to capture a low-noise speech signal. Instead of muting inactive microphones, which can be distracting and lead to the loss of some speech at transitions from muting, the processing device 750 can employ adaptive crosstalk cancellation filters to remove the speech of other users, including delayed auditory feedback of the listener's own speech. Next, the processing device 750 can employ adaptive spatialization filters that process the low-noise signals to generate binaural outputs that match the spatial and spectral cues at the ears of each listener. These adaptive spatialization filters were discussed in detail with reference to FIGs. 1-6D.

[00111] Conventional listening devices, such as hearing aids, work poorly in noisy environments because their microphones have the same SNR as the unaided ears. However, a network of several microphone-equipped devices spread around the group could achieve greater spatial diversity, providing better noise reduction performance than any single device. Herein is proposed a group conversation enhancement system according to various embodiments that aggregates signals from the mobile and wearable devices of conversation participants. Wireless sensor networks and distributed microphone arrays have been proposed for spatial sound acquisition. For example, mobile phones near talkers can help fixed microphone arrays transcribe a meeting. A distributed beamforming algorithm for nonmoving hearing aid networks can be employed. Real-world human listening enhancement systems pose additional challenges. These challenges include, for example, that the system is to operate in real time with imperceptible delay, generally several milliseconds; that the system is to preserve the spatial cues that humans use to localize and separate sounds, such as interaural time and level differences; and that the system is to contend with continuous motion of both sound sources and microphones.

[00112] In embodiments, modern listening devices are paired with a wireless remote microphone (RM) accessory that transmits low-noise speech directly from a talker to the ears of a listener, and low-latency wireless standards may soon allow smartphones to act as convenient RMs. Well-placed RMs can greatly improve intelligibility of a single distant talker in noise, but current systems are unsuitable for group conversations because they support only one talker at a time and do not preserve interaural cues. Some researchers have proposed applying spatialization filters to RM signals based on the estimated direction of arrival. As discussed with reference to FIGs. 1-6D, earpiece microphones are used as reference signals for an adaptive filter, eliminating the need for explicit source localization. This approach may also be employed in binaural beamforming systems, either using earpieces alone or in combination with external microphones.

[00113] Starting with FIG. 7, this disclosure extends the adaptive spatialization techniques of FIGs. 1-6D to address the challenges of close group conversations. Because the devices are closely spaced, there may exist significant crosstalk between microphones, which can cause distortion of spatial cues and delayed auditory feedback of the listener's own speech; such feedback can be disturbing and can impede speech production. A common solution to crosstalk and own-speech echo is to disable all but one microphone at a time. However, frequent muting and unmuting of microphones can be distracting in a fast-paced group conversation and, if there is delay in the voice activity detection (VAD), can cause listeners to miss the first few syllables from a talker that was previously silent. Instead, the listening system 700 may be configured with crosstalk cancellation filters to suppress echoes, e.g., crosstalk. The listening system 700 provides a more natural listening experience in group conversations that may include frequent interruptions and double-talk.

[00114] A further challenge in group conversations is that users move constantly, causing acoustic channel parameters to change during and between utterances. In embodiments, therefore, the processing device 750 continuously updates the adaptive filters while in use. In this work, stationary mobile devices are used as the remote signal sources because their acoustic channel parameters are more stable than those of wearable microphones, allowing the adaptive filters to converge more quickly as users move. Meanwhile, earpieces and other wearable devices that move with the users are helpful for VAD and as references for tracking interaural cues.

[00115] Consider a group of N ≥ 2 talkers and N remote microphones, as shown in FIG. 7, numbered such that RM $n$ is placed near talker $n$ for $n = 1, \ldots, N$. Let $s_n[t]$ be the discrete-time speech signal from talker $n$ as captured by RM $n$. Consider a short time interval during which the acoustic channels from talkers to microphones can be considered time-invariant. Let $a_{n,m}[t]$ be the relative impulse response (RIR) describing the acoustic channel from talker $n$ to RM $m$ relative to RM $n$, and let $z_{r,m}[t]$ be the ambient noise at RM $m$. Then the mixture captured by RM $m$ is given by

$$x_{r,m}[t] = \sum_{n=1}^{N} (a_{n,m} * s_n)[t] + z_{r,m}[t], \qquad (18)$$

where $*$ denotes linear convolution. Note that because each $s_n$ is defined with respect to RM $n$, each $a_{n,n}$ is the unit impulse $\delta[t]$. If each RM is placed close to its corresponding talker, then the RIRs of the other microphones should be well modeled by causal filters executed by the processing device 750.

[00116] In addition to the remote microphones, in embodiments, each user wears a binaural listening device containing a left microphone (e.g., microphone 122) and a right microphone (e.g., microphone 112). Let $\mathbf{a}_{e,n,m}[t]$ be the vector of RIRs from talker $n$ to the left and right ears of listener $m$ for $m, n = 1, \ldots, N$, and let $\mathbf{z}_{e,m}[t]$ be the ambient noise at those earpiece microphones 112, 122. Then the mixture captured by the earpieces of listener $m$ is given by

$$\mathbf{x}_{e,m}[t] = \sum_{n=1}^{N} (\mathbf{a}_{e,n,m} * s_n)[t] + \mathbf{z}_{e,m}[t]. \qquad (19)$$

[00117] In at least some embodiments, the listening system 700 performs conversation enhancement by removing ambient noise and own-speech echoes while preserving the speech of other talkers with correct spatial cues. The removal of ambient noise, for example, may be performed in removing the own-speech echoes and the ambient noise mixed with those own-speech echoes, providing a significant improvement in the clarity of the desired audio signal being received from other microphone devices 720 (of other talkers). The desired output for listener $m$ is given by

$$\mathbf{y}_m[t] = \sum_{n \neq m} (\mathbf{a}_{e,n,m} * s_n)[t]. \qquad (20)$$

[00118] This binaural output may be amplified, equalized, compressed, or otherwise processed before it is presented to the listener, which may include the spatialization filtering discussed herein. In some embodiments, the enhanced signals y m are mixed with the earpiece signals x e,m to better preserve situational awareness. Because the spatialized signals will be mixed with live signals — either electronically within the device or acoustically in the ear — the post-cancellation processing endeavors to generate an output with near-zero delay relative to the live signal at the corresponding ear.

[00119] FIG. 8 is a simplified block diagram of an example of crosstalk cancellation as between two remote microphones associated with two of the users illustrated in FIG. 7 according to some embodiments. FIG. 10 is a simplified block diagram of an example of crosstalk cancellation as between three remote microphones associated with three of the users illustrated in FIG. 7 according to some embodiments. In various embodiments, one of the processing devices 750 (e.g., of user m) performs the processing as illustrated in FIG. 8 (for User1 and User2) and as illustrated in FIG. 10 (for all three users) from the perspective of User2. Specifically, the processed audio in both FIG. 8 and FIG. 10 is delivered to playback device(s) of User2, who owns the second microphone device 720B. (FIG. 8 and FIG. 10 will be discussed in more detail later.) In embodiments, this processing is partitioned into two main stages. The first main stage is crosstalk cancellation to improve separation and suppress echoes of the listener's own speech. The second main stage is spatialization to preserve realistic spatial and acoustic cues, which was discussed more thoroughly with reference to FIGs. 1-6D.

[00120] In various embodiments, both crosstalk suppression and spatialization rely on accurate VAD to determine or identify which users are speaking. Wearable devices are attractive for VAD because they are physically attached to users. Earpieces can use hardware features such as bone-conduction microphones to perform reliable VAD even in strong noise. In experiments, two wearable VAD implementations were compared, including a more-reliable VAD using headset microphones and a less-reliable VAD using lapel microphones. Speech was detected using a multivariate Gaussian likelihood ratio test in the short-time Fourier transform domain. Second-order statistics were estimated using training data and time-frequency log-likelihood ratios were averaged from 0 to 1 kHz in half-second time windows. The resulting statistics were compared against a manually-tuned threshold. The headset and lapel VADs were 90% and 82% accurate, respectively, in a one-at-a-time conversation with moving talkers.
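The following is a simplified, single-channel sketch of a likelihood-ratio VAD of the kind described above; the multivariate statistics and exact training procedure used in the experiments are not reproduced, and the parameter names and default values are assumptions.

```python
import numpy as np

def vad_decisions(stft_power, noise_var, speech_var, freqs, frame_rate,
                  threshold, max_freq=1000.0, window_s=0.5):
    """Average per-bin Gaussian log-likelihood ratios ('speech plus noise'
    versus 'noise only') over 0 to max_freq Hz and over short windows, then
    compare against a tuned threshold.

    stft_power : array (frames, bins) of magnitude-squared STFT values
    noise_var, speech_var : per-bin variances estimated from training data
    freqs      : center frequency of each bin in Hz
    frame_rate : number of STFT frames per second
    """
    band = freqs <= max_freq
    total_var = noise_var + speech_var
    # Log-likelihood ratio of each time-frequency bin under the two models.
    llr = np.log(noise_var / total_var) + stft_power * (1.0 / noise_var - 1.0 / total_var)
    llr_per_frame = llr[:, band].mean(axis=1)
    frames_per_window = max(1, int(window_s * frame_rate))
    decisions = []
    for start in range(0, len(llr_per_frame), frames_per_window):
        score = llr_per_frame[start:start + frames_per_window].mean()
        decisions.append(score > threshold)
    return np.array(decisions)
```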

[00121] In a group conversation, the talkers are close together so that each microphone of each microphone device 720 captures speech from all users. Instead of muting the microphones of users who are not speaking, which could be distracting and cause listeners to miss parts of the conversation, the processing device 750 is configured to keep the microphones on at all times, but uses adaptive cancellation filters to remove crosstalk. The processed microphone signals $\hat{s}_n[t]$ are given by

$$\hat{s}_n[t] = x_{r,n}[t] - \sum_{m \neq n} (u_{n,m} * x_{r,m})[t] \qquad (21)$$

for $n = 1, \ldots, N$, where each $u_{n,m}$ is a finite-impulse-response filter. In a low-noise environment, each $u_{n,m}$ models the corresponding RIR $a_{m,n}$. In embodiments, the filter is disabled when user $n$ is speaking to prevent target signal cancellation; merely pausing adaptation was found to be ineffective, presumably due to motion. Note that the filter cancelling source $m$ at microphone $n$ remains active even when user $m$ is quiet in order to avoid echoes in case of false negatives from the VAD. When user $m$ is quiet, the filter will help to suppress noise from the direction of that user.

[00122] Because human talkers move frequently, the crosstalk cancellation filters $u_{n,m}$ are updated continuously when the crosstalk cancellation filters are active. When user $n$ is quiet, $\hat{s}_n[t]$ is a linear prediction error signal with the other microphone signals $x_{r,m}$, $m \neq n$, as the reference signals. The filters are adapted to perform crosstalk cancellation optimization as

$$\min_{u_{n,m}} \; \mathbb{E}\!\left[\hat{s}_n[t]^2\right], \qquad (22)$$

where $\mathbb{E}$ denotes statistical expectation. In experiments, Equation (22) was iteratively solved using the normalized least-mean-squares (NLMS) algorithm with first-order prewhitening.

[00123] It is instructive to compare the behavior of the cancellation system to that of a muting system with an imperfect VAD. Consider $N = 2$ users and zero ambient noise. When User1 is speaking and User2 is quiet, the cancellation filter converges to the Wiener solution so that User1 is perfectly cancelled and $\hat{s}_2[t] \approx 0$, just as in a muting system. Suppose that User2 interrupts and the VAD of User2 does not immediately detect the interruption. In the muting system, the speech of User2 would be inaudible. In the listening system 700, the output immediately following the interruption can be expressed as:

$$\hat{s}_2[t] = x_{r,2}[t] - (u_{2,1} * x_{r,1})[t] \qquad (23)$$

$$= s_2[t] + (a_{1,2} * s_1)[t] - \bigl(a_{1,2} * (s_1 + a_{2,1} * s_2)\bigr)[t] \qquad (24)$$

$$= s_2[t] - (a_{1,2} * a_{2,1} * s_2)[t] \qquad (25)$$

[00124] The speech from User1 is still cancelled correctly and the speech from User2 is audible but distorted. The severity of the distortion depends on the crosstalk channels between microphones. With well-positioned directional microphones, the RIRs $a_{2,1}$ and $a_{1,2}$ should both have magnitude responses much smaller than one ("1") so that the distortion has little effect on $s_2$. In a system with strong crosstalk, such as a compact microphone array, the listening system 700 may cause distortion, in which case a linearly constrained beamformer may be more appropriate.
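A minimal sketch of the crosstalk cancellation of Equations (21)-(22), assuming a normalized-LMS update and omitting the first-order prewhitening used in the experiments, is given below; the variable names, step-size value, and the full-disable behavior while the microphone's owner is speaking are illustrative assumptions consistent with the description above.

```python
import numpy as np

def crosstalk_cancel(x_own, x_others, filters, user_is_speaking,
                     step_size=0.1, eps=1e-8):
    """Adaptive crosstalk cancellation in the spirit of Equations (21)-(22):
    the other users' remote-microphone signals are filtered and subtracted
    from this user's microphone signal, and the filters are adapted with a
    normalized-LMS rule to minimize the residual power while this user is quiet.

    x_own            : 1-D array, this user's remote-microphone signal x_r,n
    x_others         : array (K, T), the other users' remote-microphone signals
    filters          : array (K, L), cancellation filter coefficients u_n,m
    user_is_speaking : if True, cancellation is disabled entirely and the raw
                       microphone signal is passed through, mirroring how the
                       filter is switched off to prevent target-signal cancellation
    """
    if user_is_speaking:
        return x_own.copy(), filters
    K, L = filters.shape
    cleaned = np.zeros(len(x_own))
    history = np.zeros((K, L))                    # recent samples of each reference
    for t in range(len(x_own)):
        history = np.roll(history, 1, axis=1)
        history[:, 0] = x_others[:, t]
        estimate = np.sum(filters * history)      # predicted crosstalk
        cleaned[t] = x_own[t] - estimate          # Equation (21)
        norm = np.sum(history ** 2) + eps         # NLMS normalization
        filters = filters + (step_size / norm) * cleaned[t] * history  # toward Eq. (22)
    return cleaned, filters
```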

[00125] With additional reference to FIG. 8, the control logic 757 (e.g., of the audio detector 755 of the first microphone device 720A) performs voice activity detection, including to detect no first audio signal from the first microphone device 720A and detect a crosstalk audio signal from a direction of the second microphone device 720B that matches the second electronic signal (x r,2 ). The lack of audio signal can mean that User1 is quiet (the VAD does not detect speech), so there still may be noise detected by the first microphone device 720A. The term "matches" here refers to being substantially the same audio signal, except for some differences in associated noise and delay. Thus, in at least some embodiments, the first electronic signal (x r,1 ) includes a mixture that includes something closely resembling the crosstalk audio signal (e.g., x r,2 ) and any ambient noise that is detected.

[00126] In these embodiments, an ear playback device (e.g., receiver or speaker) such as the ear playback device 116 of the first ear listening device 110 or the ear playback device 126 of the second ear listening device 120 is associated (e.g., paired) with the second microphone device 720B. In these embodiments, the processing device 750 (e.g., of the second microphone device 720B) is communicatively coupled to the first and second microphone devices 720A and 720B, to the control logic 757, and to the ear playback device, e.g., via the network 715. In at least one embodiment, the processing device 750 receives the first electronic signal and the second electronic signal and performs crosstalk cancellation (e.g., the application of crosstalk filters 802) to remove the second electronic signal from the first electronic signal to generate a cleansed first electronic signal (ŝ1). In embodiments, to remove the second electronic signal, the processing device 750 applies an adaptive cancellation filter to the first electronic signal with respect to the second electronic signal, for example.

[00127] As was mentioned in the N-user embodiments, the processing device 750 disables the cancellation filter when user n is speaking to prevent target signal cancellation. Applying this concept to the specific two-user example of FIG. 8, recall that unlike in many multi-talker systems that only activate one microphone at a time, the processing device 750 of the listening system 700 is configured to leave the microphones on even when their respective talkers are quiet (e.g., VAD detects no speech). When a talker is quiet, the microphone of the talker runs the crosstalk cancellation algorithm instead of shutting off. When the talker starts talking, the crosstalk cancellation is disabled.

[00128] Thus, by way of example in FIG. 8, according to at least some embodiments, when User2 is talking and User1 is quiet, the illustrated crosstalk cancellation is active. This allows User2 to talk without hearing annoying crosstalk (e.g., which sounds like an echo), but still allows User1 to interrupt at any time without waiting for the microphone to reactivate. When User1 is talking and User2 is quiet, the illustrated crosstalk cancellation is disabled, as unnecessary. Thus, in this example, in response to the control logic 757 detecting the first audio signal x r,1 indicative of speech from the first user (User1), the processing device 750 disables the adaptive cancellation filter according to an embodiment. In this situation, the crosstalk cancellation may be performed on behalf of User1 instead of User2 if User1 is also a listener in the listening system 700 (but this cancellation is not illustrated). When both users are talking at the same time, crosstalk cancellation is also off, so User2 may hear some unwanted own-voice echo (if loud enough to reach the microphone of User1), but this is unavoidable when disabling the crosstalk cancellation in this scenario. The spatialization filtering can, in these situations, provide additional contextual processing to improve the received audio, despite that User2 might hear some crosstalk.
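As a small illustration of the control behavior just described for the two-user example of FIG. 8, the following sketch maps the VAD decisions to enable/disable flags for the two cancellers; the flag and function names are illustrative assumptions, not part of the claimed system.

```python
def crosstalk_control(user1_speaking, user2_speaking):
    """Enable the canceller that cleans a user's microphone only while that
    microphone's owner is quiet; it stays on even when the other user is quiet,
    to guard against VAD false negatives."""
    return {
        "cancel_crosstalk_in_mic1": not user1_speaking,  # cleans x_r,1 (played to User2)
        "cancel_crosstalk_in_mic2": not user2_speaking,  # cleans x_r,2 (played to User1)
    }

# Example: User2 is talking and User1 is quiet, so only mic 1's canceller runs.
print(crosstalk_control(user1_speaking=False, user2_speaking=True))
```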

[00129] In some embodiments, the processing device 750 further processes the cleansed first electronic signal ŝ1 to integrate the cleansed first electronic signal into an output signal to the ear playback device 116 or 126, e.g., to a receiver of the first ear listening device 110 and/or the second ear listening device 120, respectively. In embodiments, this further processing includes applying spatialization filters 804A (e.g., at least a first audio filter of a set of audio filters) to the cleansed first electronic signal ŝ1 with a first error signal, which is based on an output of the first ear microphone 112 of the first ear listening device 110, to generate the output signal to the first ear playback device 116. In embodiments, this further processing optionally also includes applying spatialization filters 804B (e.g., at least a second audio filter of the set of audio filters) to the cleansed first electronic signal ŝ1 with a second error signal, which is based on an output of the second ear microphone 122 of the second ear listening device 120, to generate the output signal to the second ear playback device 126.

[00130] In various embodiments, with additional specificity, the spatialization filters process the low-noise source estimates to match the spatial and acoustic cues at the ears of each listener, including interaural time and level differences, spectral shaping, and early reflections. The binaural output mixture for listener $m$ is given by

$$\mathbf{y}_m[t] = \sum_{n=1}^{N} (\mathbf{w}_{m,n} * \hat{s}_n)[t], \qquad (26)$$

where each $\mathbf{w}_{m,n}$ is a causal finite-impulse-response filter. The filters for each listener $m$ are updated to solve

$$\min_{\{\mathbf{w}_{m,n}\}} \; \mathbb{E}\!\left[\left\| \mathbf{x}_{e,m}[t] - \mathbf{y}_m[t] \right\|^2\right]. \qquad (27)$$

[00131] In conducted experiments, this cost function is minimized iteratively using the NLMS algorithm with first-order prewhitening. Unlike the crosstalk cancellation filters, the spatialization filters are always active, even when their respective users are not speaking. However, each $\mathbf{w}_{m,n}$ is updated only while user $n$ is speaking. If the filters were updated continuously, then the filters would amplify nearby noise sources during speech pauses.

[00132] When multiple users are speaking simultaneously, the spatialization filter coefficients are updated jointly. They therefore act as a multiple-input, binaural-output (MIBO) filter that maps from input mixtures to output mixtures. It was shown with reference to FIG. 3B that an N-input MIBO filter can preserve the spatial cues of up to N sources. MIBO filters do not require that the sources be separated and are unaffected by residual crosstalk, making them well suited for closely spaced talkers. However, they do rely on accurate VAD: false negatives would cause them to blend the cues of multiple active talkers, while false positives would cause them to amplify a nearby noise source in place of the missing talker. The crosstalk cancellation stage therefore helps to mitigate spatial-cue distortion with an unreliable VAD.
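The following sketch illustrates the spatialization stage of Equations (26)-(27) with a normalized-LMS update, again omitting the first-order prewhitening used in the experiments; array shapes, names, and the step-size value are illustrative assumptions, and each source's filter pair is adapted only while that source's talker is detected as speaking, as described above.

```python
import numpy as np

def spatialize_block(cleaned_sources, ear_signals, filters, speaking,
                     step_size=0.1, eps=1e-8):
    """Binaural spatialization in the form of Equations (26)-(27) with a
    normalized-LMS update.

    cleaned_sources : array (N, T), crosstalk-cancelled source estimates
    ear_signals     : array (2, T), left/right earpiece microphone signals
    filters         : array (N, 2, L), per-source binaural FIR filters
    speaking        : length-N boolean array from the VAD
    """
    N, _, L = filters.shape
    T = cleaned_sources.shape[1]
    output = np.zeros((2, T))
    history = np.zeros((N, L))
    for t in range(T):
        history = np.roll(history, 1, axis=1)
        history[:, 0] = cleaned_sources[:, t]
        # Equation (26): binaural output is the sum of the filtered source estimates.
        output[:, t] = np.einsum("ncl,nl->c", filters, history)
        error = ear_signals[:, t] - output[:, t]     # residual of Equation (27)
        norm = np.sum(history ** 2) + eps
        for n in range(N):
            if speaking[n]:                          # adapt only while talker n speaks
                filters[n] += (step_size / norm) * np.outer(error, history[n])
    return output, filters
```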

[00133] FIG. 9 is a flow chart of a method 900 of crosstalk cancellation as between two remote microphones associated with the first and second users of FIG. 7 according to at least one embodiment. The method 900 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (such as instructions running on the processor), firmware, or a combination thereof. In one embodiment, the processing device 750 of the listening system 700 performs the method 900. Alternatively, other components of a computing device or cloud server may perform some or all of the operations of the method 900.

[00134] Although shown in a particular sequence or order, unless otherwise specified, the order of the processes can be modified. Thus, the illustrated embodiments should be understood only as examples, and the illustrated processes can be performed in a different order, and some processes can be performed in parallel. Additionally, one or more processes can be omitted in various embodiments. Thus, not all processes are required in every embodiment. Other process flows are possible.

[00135] At operation 910, the processing logic receives a first electronic signal from a first microphone device.

[00136] At operation 920, the processing logic receives a second electronic signal from a second microphone device. For example, the second electronic signal includes a mixture that includes the first electronic signal due to crosstalk between the first and second microphone devices.

[00137] At operation 930, the processing logic removes the first electronic signal from the second electronic signal to generate a cleansed second electronic signal.

[00138] At operation 940, the processing logic processes the cleansed second electronic signal to integrate the cleansed second electronic signal into an output signal to the ear playback device.

[00139] With additional reference to FIG. 10, and as an extension to the embodiment of FIG. 8, the third microphone device 720C is now also in play, as User3 has joined the group conversation. In embodiments, the third microphone device 720C is co-located with the first and second microphone devices 720A and 720B. In these embodiments, the third microphone device 720C generates a third electronic signal corresponding to sound detected within the audio detection range and is communicatively coupled to the processing device 750. In embodiments, the control logic 757 is further to detect a second crosstalk audio signal from a direction of the third microphone device 720C that matches the third electronic signal, for example.

[00140] Thus, in this embodiment, the first electronic signal x r,1 includes a mixture that includes the first crosstalk audio signal (from the direction of the second microphone device 720B) and the second crosstalk audio signal (from the direction of the third microphone device 720C). In embodiments, the first electronic signal also includes some ambient noise. In at least one embodiment, the processing device 750 receives and removes, e.g., using crosstalk cancellation filters 1002A, both the second electronic signal x r,2 and the third electronic signal x r,3 from the first electronic signal x r,1 to generate the cleansed first electronic signal ŝ1. In embodiments, to remove the second electronic signal, the processing device 750 applies an adaptive cancellation filter to the first electronic signal with respect to the second electronic signal, and to remove the second crosstalk audio signal, the processing device 750 applies the adaptive cancellation filter to the first electronic signal with respect to the third electronic signal. (As discussed previously, however, in response to detecting that User1 starts talking, the processing device 750 disables the adaptive cancellation filter in some embodiments.) In at least some embodiments, this processing includes applying spatialization filters 1004A (e.g., at least a first audio filter of a set of audio filters) to the cleansed first electronic signal ŝ1 with a first error signal, which is based on an output of the first ear microphone 112 of the first ear listening device 110, to generate a first output signal to the first ear playback device 116, for example.

[00141] In at least some embodiments, because of the third user, additional crosstalk may be captured within the third electronic signal (x r,3 ), which substantially includes a mixture that includes the first electronic signal x r,1 and the second electronic signal x r,2 . In these embodiments, the processing device 750 receives and removes, e.g., using crosstalk cancellation filters 1002B, both the first electronic signal x r,1 and the second electronic signal x r,2 from the third electronic signal x r,3 to generate a cleansed third electronic signal ŝ 3 . In embodiments, to remove the second electronic signal, the processing device 750 applies an adaptive cancellation filter to the third electronic signal with respect to the second electronic signal, and to remove the first electronic signal, the processing device 750 applies the adaptive cancellation filter to the third electronic signal with respect to the first electronic signal. (As discussed previously, however, in response to detecting that User3 starts talking, the processing device 750 disables the adaptive cancellation filter in some embodiments.) In at least some embodiments, this processing includes applying spatialization filters 1004B (e.g., at least a second audio filter of a set of audio filters) to the cleansed third electronic signal ŝ 3 with a second error signal, which is based on an output of the second ear microphone 122 of the second ear listening device 120, to generate a second output signal to the second ear playback device 126, for example.

[00142] EXPERIMENTS: The listening system 700 was evaluated with three live human subjects seated around a table in an acoustically treated laboratory (T 60 ≈ 150 ms). Each subject wore an omnidirectional lavalier microphone behind each ear to simulate behind-the-ear hearing aids. Another such microphone was affixed to the table in front of each subject to simulate a mobile phone. Each subject also wore lapel and headset microphones, which were used only for VAD. Noise was produced by a set of six loudspeakers playing clips from the VCTK speech corpus of the Centre for Speech Technology Voice Cloning Toolkit.

[00143] To simulate a group conversation, the subjects took turns reading from a script for 60 seconds. In one recording, the subjects looked straight ahead and tried not to move. In another, they turned to look at each other and gestured while speaking. A third recording with moderate motion was used for VAD training. To quantify the input and output SNR of the system, the noise was recorded separately and added to the live speech recordings. The noise was therefore recorded with a different motion pattern than the live speech. Likewise, double-talk and triple-talk mixtures were simulated by combining separate recordings. The microphones were sampled synchronously at 48 kHz and processed at 16 kHz. The results shown here are for the left-ear output of a listening device of one user.

[00144] The SNR improvement of the proposed conversation enhancement system is illustrated in FIG. 11. Because the listening system 700 does not perform beamforming or other noise reduction processing, the SNR improvement depends strongly on the placement of the remote microphones. The smartphone-like tabletop microphones had higher input SNR and lower crosstalk than the earpiece and lapel microphones, especially at high frequencies. Using the MIBO spatialization filters of FIG. 3B without crosstalk cancellation improves the high-frequency SNR at the left ear by up to 10 dB. The crosstalk filter helps to further suppress noise when nearby users are not speaking, providing another 2-5 dB benefit to SNR. A conventional remote microphone system that mutes all but one microphone achieves the best average output SNR, but is too distracting to be practical. The plot of FIG. 11 shows performance using the headset-based VAD for nonmoving subjects. The results for other experimental conditions were similar and so are not reported. VAD accuracy and user motion appear to have little effect on ambient noise reduction.

[00145] FIG. 12A is a graph illustrating own-speech crosstalk suppression performance at a left earpiece of a listener using a headset microphone adapted to perform voice activity detection (VAD) according to experimental embodiments. FIG. 12B is a graph illustrating own-speech crosstalk suppression performance at a left earpiece of a listener using a lapel microphone adapted to perform VAD according to experimental embodiments. Thus, FIGs. 12A-12B show the crosstalk reduction performance of the system for the listener's own speech. The curves show the crosstalk level relative to the direct acoustic path to the earpiece. Because the users are seated close together, the own-speech crosstalk in the baseline spatialization-only system is just 2-5 dB weaker than the direct path. The crosstalk cancellation filters were able to suppress own-speech echoes by up to 15 dB more than the baseline system, but their performance depends on talker motion and on VAD accuracy. The residual crosstalk levels for moving talkers (solid curves) are higher than those for stationary talkers (dashed curves), especially at high frequencies for which source positions may suddenly change by multiple acoustic wavelengths. Echo suppression was also worse for the less-reliable lapel-based VAD (FIG. 12B) compared to the more-reliable headset-based VAD (FIG. 12A). The performance of the muting system depends entirely upon VAD performance. With the reliable VAD, the muting system removed virtually all echoes; with the unreliable VAD, the muting system performed little better than the cancellation system at most frequencies in the motion experiment.

[00146] One can evaluate spatialization performance by comparing the interaural cues of the system output to the cues of the noise-free speech signals at the ears. FIG. 13A is a graph illustrating high-frequency interaural level differences of other talkers at ears of a listener where subjects take turns speaking while moving to face each other according to experimental embodiments. FIG. 13B is a graph illustrating high-frequency interaural level differences of other talkers at ears of a listener simulated with double-talk and triple-talk with subjects facing forward according to experimental embodiments. Thus, FIGs. 13A-13B illustrate the input and output interaural level differences (ILD) of speech from the two other talkers at the ears of the listener. The ILDs are averaged over 1.5-second windows from 1-8 kHz and color-coded to show the active talker(s). When the talkers take turns (FIG. 13A), only one spatialization filter adapts at a time. The output cues closely match the input cues, even as the listener turns their head. When both other talkers speak simultaneously (FIG. 13B, 30-42 s), two filters adapt jointly, preserving the spatial cues of both sources despite residual crosstalk. When the listener and talker(s) speak simultaneously (FIG. 13B, 42-60 s), crosstalk cancellation is disabled and the spatialization filters are unable to distinguish the listener's own speech from that of the other talkers, so their spatial cues are blended. Thus, a user might have trouble localizing conversation partners while interrupting them.

[00147] FIG. 14 is a block diagram of an example computer system 1400 in which embodiments of the present disclosure can operate. The system 1400 may represent the mobile device 140 or another device or system referred to herein or capable of executing the embodiments disclosed herein. The computer system 1400 may include an ordered listing of a set of instructions 1402 that may be executed to cause the computer system 1400 to perform any one or more of the methods or computer-based functions disclosed herein. The computer system 1400 may operate as a stand-alone device or may be connected to other computer systems or peripheral devices, e.g., by using a network 1410.

[00148] In a networked deployment, the computer system 1400 may operate in the capacity of a server or as a client-user computer in a server-client user network environment, or as a peer computer system in a peer-to-peer (or distributed) network environment. The computer system 1400 may also be implemented as or incorporated into various devices, such as a personal computer or a mobile computing device capable of executing a set of instructions 1402 that specify actions to be taken by that machine, including, but not limited to, accessing the internet or web through any form of browser. Further, each of the systems described may include any collection of sub-systems that individually or jointly execute a set, or multiple sets, of instructions to perform one or more computer functions.

[00149] The computer system 1400 may include a memory 1404 on a bus 1420 for communicating information. Code operable to cause the computer system to perform any of the acts or operations described herein may be stored in the memory 1404. The memory 1404 may be a random-access memory, read-only memory, programmable memory, hard disk drive, solid-state disk drive, or other type of volatile or non-volatile memory or storage device.

[00150] The computer system 1400 may include a processor 1408, such as a central processing unit (CPU) and/or a graphics processing unit (GPU) and may include additional logic such as the audio detector 755 discussed with reference to FIG. 7. The processor 1408 may include one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, digital circuits, optical circuits, analog circuits, combinations thereof, or other now known or later-developed devices for analyzing and processing data. The processor 1408 may implement the set of instructions 1402 or other software program, such as manually-programmed or computer-generated code for implementing logical functions. The logical function or system element described may, among other functions, process and/or convert an analog data source such as an analog electrical, audio, or video signal, or a combination thereof, to a digital data source for audio- visual purposes or other digital processing purposes such as for compatibility for computer processing.

[00151] The computer system 1400 may also include a disk (or optical) drive unit 1415. The disk drive unit 1415 may include a non-transitory computer-readable storage medium 1440 in which one or more sets of instructions 1402, e.g., software, can be embedded or stored. Further, the instructions 1402 may perform one or more of the operations as described herein. The instructions 1402 may reside completely, or at least partially, within the memory 1404 and/or within the processor 1408 during execution by the computer system 1400.

[00152] The memory 1404 and the processor 1408 also may include non-transitory computer-readable media as discussed above. A "computer-readable medium," "computer-readable storage medium," "machine readable medium," "propagated-signal medium," and/or "signal-bearing medium" may include any device that includes, stores, communicates, propagates, or transports software for use by or in connection with an instruction executable system, apparatus, or device. The machine-readable medium may selectively be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium.

[00153] Additionally, the computer system 1400 may include an input device 1425, such as a keyboard or mouse, configured for a user to interact with any of the components of system 1400. It may further include a display 1430, such as a liquid crystal display (LCD), a cathode ray tube (CRT), or any other display suitable for conveying information. The display 1430 may act as an interface for the user to see the functioning of the processor 1408, or specifically as an interface with the software stored in the memory 1404 or the drive unit 1415.

[00154] The computer system 1400 may include a communication interface 1436 that enables communications via the communications network 1410. The communications network 1410 may include wired networks, wireless networks, or combinations thereof. The communication interface 1436 may enable communications via a number of communication standards, such as 802.11, 802.17, 802.20, WiMax, cellular telephone standards, or other communication standards.

[00155] Accordingly, the method and system may be realized in hardware, software, or a combination of hardware and software. The method and system may be realized in a centralized fashion in at least one computer system or in a distributed fashion where different elements are spread across several interconnected computer systems. A computer system or other apparatus adapted for carrying out the methods described herein is suited to the present disclosure. A typical combination of hardware and software may be a general-purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein. Such a programmed computer may be considered a special-purpose computer.

[00156] The method and system may also be embedded in a computer program product, which includes all the features enabling the implementation of the operations described herein and which, when loaded in a computer system, is able to carry out these operations. A computer program in the present context means any expression, in any language, code, or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function, either directly or after either or both of the following: a) conversion to another language, code, or notation; b) reproduction in a different material form.

[00157] The disclosure also relates to an apparatus for performing the operations herein. This apparatus can be specially constructed for the intended purposes, or it can include a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program can be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, SD-cards, solid-state drives, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.

[00158] The algorithms, operations, and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems can be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the method. The structure for a variety of these systems will appear from the description above. In addition, the disclosure is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages can be used to implement the teachings of the disclosure as described herein.

[00159] The disclosure can be provided as a computer program product, or software, that can include a machine-readable medium having stored thereon instructions, which can be used to program a computer system (or other electronic devices) to perform a process according to the disclosure. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). In some embodiments, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium such as a read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory components, solid-state memory components, etc.

[00160] The words “example” or “exemplary” are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words “example” or “exemplary” is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or.” That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. That is, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims may generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Moreover, use of the term “an implementation” or “one implementation” or “an embodiment” or “one embodiment” or the like throughout is not intended to mean the same implementation or embodiment unless described as such. One or more implementations or embodiments described herein may be combined in a particular implementation or embodiment. The terms “first,” “second,” “third,” “fourth,” etc. as used herein are meant as labels to distinguish among different elements and may not necessarily have an ordinal meaning according to their numerical designation.

[00161] In the foregoing specification, embodiments of the disclosure have been described with reference to specific example embodiments thereof. It will be evident that various modifications can be made thereto without departing from the broader spirit and scope of embodiments of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.