

Title:
ACOUSTIC PROCESSING DEVICE FOR MIMO ACOUSTIC ECHO CANCELLATION
Document Type and Number:
WIPO Patent Application WO/2022/048736
Kind Code:
A1
Abstract:
An acoustic processing device (300) for performing MIMO acoustic echo cancellation is disclosed. The acoustic processing device (300) comprises a first signal reception unit (303) adapted to receive a plurality of loudspeaker signals and a second signal reception unit (305) adapted to receive a plurality of microphone signals. Moreover, the acoustic processing device (300) comprises a processing circuitry (301) adapted to enable echo reduction, the processing circuitry (301) being configured to determine for each microphone signal an estimated echo signal, wherein the estimated echo signal comprises an estimated direct echo signal and an estimated residual echo signal. The processing circuitry (301) is further configured to determine a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

Inventors:
HAUBNER THOMAS (DE)
HALIMEH MODAR (DE)
KELLERMANN WALTER (DE)
TAGHIZADEH MOHAMMAD (DE)
Application Number:
PCT/EP2020/074427
Publication Date:
March 10, 2022
Filing Date:
September 02, 2020
Assignee:
HUAWEI TECH CO LTD (CN)
UNIV FRIEDRICH ALEXANDER ER (DE)
International Classes:
H04M9/08; G10L21/02; H04R1/40; H04R3/00; H04R3/02; H04S7/00
Foreign References:
US 6246760 B1 (2001-06-12)
US 2013/0230184 A1 (2013-09-05)
EP 2701145 A1 (2014-02-26)
Other References:
G. Enzner et al.: "Acoustic Echo Control", Academic Press Library in Signal Processing, 2014
M. Sondhi et al.: "Stereophonic acoustic echo cancellation - an overview of the fundamental problem", IEEE Signal Processing Letters, vol. 2, no. 8, pp. 148-151, August 1995
D. R. Morgan et al.: "A better understanding and an improved solution to the specific problem of stereophonic acoustic echo cancellation", IEEE Transactions on Speech and Audio Processing, 2001
C. Hofmann and W. Kellermann: "Acoustic Echo Cancellation for Surround Sound using Perceptually Motivated Convergence Enhancement", IEEE International Conference on Acoustics, Speech and Signal Processing, March 2016
H. Buchner et al.: "Generalized multichannel Frequency-Domain Adaptive Filtering: efficient realization and application to hands-free speech communication", Signal Processing, 2005
M. Schneider and W. Kellermann: "The Generalized Frequency-Domain Adaptive Filtering algorithm as an approximation of the block recursive least-squares algorithm", EURASIP Journal on Advances in Signal Processing, 2016
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. An acoustic processing device (300), comprising: a first signal reception unit (303) adapted to receive a plurality of loudspeaker signals; a second signal reception unit (305) adapted to receive a plurality of microphone signals; and a processing circuitry (301) adapted to enable echo reduction, the processing circuitry (301) being configured to determine an estimated echo signal, wherein the estimated echo signal comprises an estimated direct echo signal and an estimated residual echo signal, wherein the processing circuitry (301) is further configured to determine a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

2. The acoustic processing device (300) of claim 1, wherein the processing circuitry (301) is configured to determine the respective echo reduced microphone signal as a difference between the respective microphone signal and the estimated echo signal.

3. The acoustic processing device (300) of claim 1 or 2, wherein the processing circuitry (301) is configured to determine the respective estimated direct echo signal on the basis of the plurality of loudspeaker signals and one or more pre-defined MIMO FIR filter templates, each MIMO FIR filter template having a plurality of pre-defined filter coefficients.

4. The acoustic processing device (300) of claim 3, wherein the one or more predefined MIMO FIR filter templates comprise a plurality of pre-defined MIMO FIR filter templates and wherein the processing circuitry (301) is configured to determine the respective estimated direct echo signal on the basis of the plurality of loudspeaker signals and a linear combination of the plurality of pre-defined MIMO FIR filter templates, each of the plurality of pre-defined MIMO FIR filter templates weighted by an adjustable weighting coefficient.


5. The acoustic processing device (300) of claim 4, wherein the processing circuitry (301) is configured to adjust the plurality of adjustable weighting coefficients on the basis of the plurality of microphone signals and the plurality of estimated echo signals.

6. The acoustic processing device (300) of claim 5, wherein the processing circuitry (301) is configured to adjust the plurality of adjustable weighting coefficients on the basis of the plurality of estimated direct echo signals and a plurality of direct path reference signals, wherein the processing circuitry (301) is configured to determine the plurality of direct path reference signals on the basis of the plurality of microphone signals, one or more selected loudspeaker signals of the plurality of loudspeaker signals and the adaptive MIMO FIR filter.

7. The acoustic processing device (300) of claim 6, wherein the processing circuitry (301) is further configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of estimated echo signals and a plurality of residual reference signals, wherein the processing circuitry (301) is configured to determine each of the plurality of residual reference signals as a difference between the respective microphone signal and the estimated direct echo signal.

8. The acoustic processing device (300) of claims 5 or 6, wherein each microphone signal and each estimated echo signal comprises a plurality of samples divided into different blocks covering different time portions of the respective signal and wherein the processing circuitry (301) is configured to adjust the plurality of adjustable weighting coefficients on the basis of the plurality of microphone signals and the plurality of estimated echo signals using a block-recursive least squares algorithm.

9. The acoustic processing device (300) of any one of the preceding claims, wherein the processing circuitry (301) is configured to determine the respective estimated residual echo signal on the basis of one or more selected loudspeaker signals of the plurality of loudspeaker signals and an adaptive MIMO FIR filter, the adaptive MIMO FIR filter having a plurality of adaptive filter coefficients.

10. The acoustic processing device (300) of claim 9, wherein the processing circuitry (301) is configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of respective microphone signals and the plurality of estimated echo signals.

11 . The acoustic processing device (300) of claim 9, wherein the processing circuitry (301) is configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of estimated echo signals and a plurality of residual reference signals, wherein the processing circuitry (301) is configured to determine each of the plurality of residual reference signals as a difference between the respective microphone signal and the estimated direct echo signal.

12. The acoustic processing device (300) of any one of claims 9 to 11, wherein the processing circuitry (301) is configured to determine for each loudspeaker signal a signal level measure and/or a correlation measure and to determine the one or more selected loudspeaker signals of the plurality of loudspeaker signals on the basis of the plurality of signal level measures and/or correlation measures.

13. An acoustic device (500), comprising: the acoustic processing device (300) according to any one of the preceding claims; a plurality of loudspeakers (501a-h), wherein each loudspeaker (501a-h) is configured to be driven by one of the plurality of loudspeaker signals; and/or a plurality of microphones (503a-g), wherein each microphone (503a-g) is configured to detect one of the plurality of microphone signals.

14. The acoustic device (500) of claim 13, wherein the acoustic device (500) further comprises a mixing unit (100) configured to generate the plurality of loudspeaker signals on the basis of an input signal, in particular a stereo input signal.

15. An acoustic processing method (400), comprising: receiving (401) a plurality of loudspeaker signals; receiving (403) a plurality of microphone signals; determining (405) an estimated echo signal, wherein the estimated echo signal comprises an estimated direct echo signal and an estimated residual echo signal; and determining (407) an echo reduced microphone signal based on the microphone signal and the estimated echo signal.

16. A computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method (400) of claim 15, when the program code is executed by the computer or the processor.


Description:
ACOUSTIC PROCESSING DEVICE FOR MIMO ACOUSTIC ECHO CANCELLATION

TECHNICAL FIELD

The present disclosure relates to acoustic sound processing in general. More specifically, the disclosure relates to an acoustic processing device for MIMO acoustic echo cancellation (AEC).

BACKGROUND

Acoustic Echo Cancellation (AEC) addresses the problem of suppressing undesired acoustic coupling between sound reproduction and sound acquisition in an acoustic device (see G. Enzner et al., “Acoustic Echo Control”, Academic Press Library in Signal Processing, 2014). AEC is often used, for instance, for full-duplex hands-free acoustic human-machine interfaces, such as smart speaker devices with multiple loudspeakers and multiple microphones. AEC with multiple loudspeakers and multiple microphones is referred to as MIMO AEC.

A typical MIMO AEC scenario is illustrated in Figure 1, where P loudspeaker signals x(n) ∈ ℝ^P are rendered from a common stereo source signal g(n) ∈ ℝ^2 by a time-varying (TV) mixing unit 100 to drive P loudspeakers 110. The acoustic pressure variations generated by the P loudspeakers 110 in accordance with the P loudspeaker signals x(n) travel to Q microphones 120 via a direct path 141 and one or more echo paths 143 caused by reflections at the walls or ceiling of the room 140 in which the system is located. In addition to the pressure variations caused by the P loudspeakers 110, the Q microphones 120 detect a desired signal or target signal, for instance a voice signal, via a path 145. The Q microphone signals y(n) ∈ ℝ^Q detected by the Q microphones 120 are subsequently processed by a linear time-varying (LTV) speech enhancement stage 130, e.g., a beamformer 130.
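As a numerical illustration of this scenario, the following NumPy sketch simulates the echo contribution at the microphones; the channel counts, filter length and random echo paths are hypothetical, not values from this disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

P, Q = 4, 2      # number of loudspeakers and microphones (illustrative)
L_rir = 64       # assumed room impulse response length in samples
N = 1000         # number of time samples

# Hypothetical loudspeaker signals x(n) and random echo paths h[q, p]
# from loudspeaker p to microphone q (direct path plus reflections).
x = rng.standard_normal((P, N))
h = 0.1 * rng.standard_normal((Q, P, L_rir))

# Each microphone signal y_q(n) is the superposition of all P
# loudspeaker signals convolved with the corresponding echo path.
y = np.zeros((Q, N))
for q in range(Q):
    for p in range(P):
        y[q] += np.convolve(x[p], h[q, p])[:N]
```

In a real device, y(n) would additionally contain the near-end target signal and sensor noise.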

A block diagram illustrating a typical architecture of the time-varying (TV) mixing unit 100 (also referred to as "rendering system") is shown in Figure 2, which comprises two main stages. The first stage is a low-pass/high-pass decomposition of the stereo source signal g(n). While a subwoofer signal s_LP(n) ∈ ℝ is computed by low-pass filtering 107 the sum 105 of the stereo channels, the tweeter signals are computed by a time-varying tweeter rendering unit 103 which uses the high-pass filtered 101 stereo signals as input. Due to the time-variance of the mixing unit 100, the tweeter signals usually exhibit a high cross-correlation and often also significantly different power levels. These properties complicate the task of robust acoustic system identification which is desired for AEC, as described, for instance, in M. Sondhi et al., "Stereophonic acoustic echo cancellation - an overview of the fundamental problem", IEEE Signal Processing Letters, vol. 2, no. 8, pp. 148-151, Aug. 1995 and J. Benesty et al., "A better understanding and an improved solution to the specific problem of stereophonic acoustic echo cancellation", IEEE Transactions on Speech and Audio Processing, 1998. As illustrated in Figure 2, the time-varying tweeter rendering unit 103 can be decomposed into two blocks R_1 103a and R_2 103b with the intermediate signal s_HP(n). While the first block 103a contains a rapidly time-varying nonlinear spectral processing, the second block 103b contains a slowly time-varying spatial processing.
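The two-stage decomposition can be sketched as follows; the moving-average low-pass, its complementary high-pass, and the static rendering matrix R are crude stand-ins for the actual crossover filters and the time-varying tweeter rendering:

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.standard_normal((2, 512))   # stereo source signal g(n)
mono = g[0] + g[1]                  # sum of the stereo channels

# Crude complementary crossover: a moving-average low-pass and the
# high-pass defined as (unit impulse - low-pass); real crossover
# filters would be used in practice.
lp = np.ones(8) / 8.0
hp = -lp.copy()
hp[0] += 1.0

# Subwoofer path: low-pass filter the channel sum.
s_lp = np.convolve(mono, lp)[:512]

# Tweeter path: high-pass filter each channel, then map the 2 channels
# to P = 4 tweeter signals with a (here static) rendering matrix R.
g_hp = np.stack([np.convolve(ch, hp)[:512] for ch in g])
R = rng.standard_normal((4, 2))
x_tweeter = R @ g_hp
```

Because the same matrix R acts on both high-passed channels, the resulting tweeter signals are linear mixtures of only two sources, which makes their high cross-correlation apparent.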

Most conventional approaches deal with the problem of high cross-correlation described above by preprocessing the loudspeaker signals x(n) to promote uncorrelatedness between the tweeter signals. This can be achieved, for example, by nonlinear distortions (see, for instance, D. R. Morgan et al., "Investigation of several types of nonlinearities for use in stereo acoustic echo cancellation", IEEE Transactions on Speech and Audio Processing, 2001), decorrelation filters or phase modulation filter banks (see, for instance, J. Herre et al., "Acoustic Echo Cancellation for Surround Sound using Perceptually Motivated Convergence Enhancement", IEEE International Conference on Acoustics, Speech and Signal Processing, 2007). However, inherent to all of these approaches is a degradation of the sound reproduction quality, which is undesired for high-end sound reproduction systems.

A different approach aims at exploiting the high cross-correlation instead of removing it, by directly using the input signal of the rendering system, i.e. the mixing unit 100, as reference for the AEC system, as suggested in C. Hofmann and W. Kellermann, "Source-specific system identification", in IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, Mar. 2016. This facilitates computationally efficient and fast converging algorithms. However, the identified acoustic system would then depend on the rendering algorithm, which results in problems for rapidly time-varying rendering algorithms, which are commonly used in personal audio applications.

SUMMARY

It is an object of the invention to provide an improved acoustic processing device for MIMO acoustic echo cancellation.

The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect an acoustic processing device for MIMO acoustic echo cancellation (AEC) is provided. The acoustic processing device comprises a first signal reception unit adapted to receive a plurality of loudspeaker signals and a second signal reception unit adapted to receive a plurality of microphone signals. Moreover, the acoustic processing device comprises a processing circuitry adapted to enable echo reduction, wherein the processing circuitry is configured to determine for each microphone signal an estimated echo signal, wherein the estimated echo signal comprises, i.e. is a combination of, an estimated direct echo signal and an estimated residual echo signal. The processing circuitry is further configured to determine a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

Advantageously, the acoustic processing device for MIMO AEC and its different implementation forms allow a robust system identification for acoustic scenarios challenged by highly cross-correlated loudspeaker signals, time-varying loudspeaker rendering systems and/or different power levels of excitation signals. Moreover, the acoustic processing device for MIMO AEC and its different implementation forms do not suffer from an impaired sound reproduction quality (intrinsic to conventional channel decorrelation approaches).

In a further possible implementation form of the first aspect, the processing circuitry is configured to determine for each microphone signal the respective echo reduced microphone signal as a difference between the respective microphone signal and the estimated echo signal. Advantageously, this allows the acoustic processing device to extract the desired target signal, namely the respective echo reduced microphone signal, from the respective microphone signal.

In a further possible implementation form of the first aspect, the processing circuitry is configured to determine for each microphone signal the respective estimated direct echo signal on the basis of the plurality of loudspeaker signals and one or more pre-defined, i.e. fixed, MIMO FIR (finite impulse response) filter templates, wherein each MIMO FIR filter template has, i.e. is defined by, a plurality of pre-defined filter coefficients. In an implementation form the one or more fixed MIMO FIR filter templates may comprise one or more room impulse responses (RIRs).

In a further possible implementation form of the first aspect, the one or more pre-defined, i.e. fixed, MIMO FIR filter templates comprise a plurality of, i.e. at least two, pre-defined MIMO FIR filter templates, wherein the processing circuitry is configured to determine for each microphone signal the respective estimated direct echo signal on the basis of the plurality of loudspeaker signals and a linear combination of the plurality of pre-defined MIMO FIR filter templates, wherein each of the plurality of pre-defined MIMO FIR filter templates is weighted by an adjustable weighting coefficient. Thus, advantageously, the processing circuitry of the acoustic processing device is configured to efficiently determine a respective estimated direct echo signal on the basis of a weighted linear combination of the pre-defined MIMO FIR filter templates.
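A minimal sketch of this direct-path model for a single microphone channel; the template count, filter length and weight values below are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(3)
P, L, N, K = 3, 32, 400, 4   # loudspeakers, template length, samples, templates

x = rng.standard_normal((P, N))              # loudspeaker signals
templates = rng.standard_normal((K, P, L))   # K pre-measured MIMO FIR templates (one mic)
w = np.array([0.7, 0.2, 0.05, 0.05])         # adjustable weighting coefficients

# Effective direct-path filter: weighted linear combination of the
# K templates, giving one FIR filter per loudspeaker channel.
h_dir = np.tensordot(w, templates, axes=1)   # shape (P, L)

# Estimated direct echo at this microphone: sum over loudspeakers of
# the loudspeaker signal convolved with its combined filter.
y_dir = sum(np.convolve(x[p], h_dir[p])[:N] for p in range(P))
```

By linearity of convolution, combining the templates first and filtering once is equivalent to filtering with each template and weighting the outputs, which is what makes the low-dimensional weight estimation possible.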

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust the plurality of adjustable weighting coefficients of the weighted linear combination of the predefined MIMO FIR filter templates on the basis of the plurality of microphone signals and the plurality of estimated echo signals. In other words, in an implementation form the processing circuitry is configured to adjust, e.g. optimize the plurality of adjustable weighting coefficients using the microphone signals as reference signals.

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust the plurality of adjustable weighting coefficients on the basis of the plurality of estimated direct echo signals and a plurality of direct path reference signals, wherein the processing circuitry is configured to determine the plurality of direct path reference signals on the basis of the plurality of microphone signals, one or more selected loudspeaker signals of the plurality of loudspeaker signals and the adaptive MIMO FIR filter.

In a further possible implementation form of the first aspect, the processing circuitry is further configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of estimated echo signals and a plurality of residual reference signals, wherein the processing circuitry is configured to determine each of the plurality of residual reference signals as a difference between the respective microphone signal and the estimated direct echo signal. Thus, in an implementation form the processing circuitry is configured to first adjust, i.e. optimize, the plurality of adjustable weighting coefficients of the weighted linear combination of the pre-defined MIMO FIR filter templates and, thereafter, adjust, i.e. optimize, the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter.

In a further possible implementation form of the first aspect, each microphone signal and each estimated echo signal comprises a plurality of samples divided into different blocks covering different time portions of the respective signal, wherein the processing circuitry is configured to adjust the plurality of adjustable weighting coefficients of the weighted linear combination of the predefined MIMO FIR filter templates on the basis of the plurality of microphone signals and the plurality of estimated echo signals using a block-recursive least squares algorithm.
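The weight adjustment can be illustrated by a single-block least-squares solve: with only K adjustable weights, each block reduces to a small K x K normal-equation system. A block-recursive variant would additionally accumulate these statistics over successive blocks, e.g. with a forgetting factor; the noiseless setup below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
P, L, N, K = 2, 16, 600, 3   # loudspeakers, template length, block size, templates

x = rng.standard_normal((P, N))
templates = rng.standard_normal((K, P, L))
w_true = np.array([0.9, -0.3, 0.1])   # ground-truth weights for this experiment

def template_echo(k):
    # Echo component produced by template k alone (single microphone).
    return sum(np.convolve(x[p], templates[k, p])[:N] for p in range(P))

# Columns of D are the per-template echo components; the microphone
# block is modelled as their weighted sum (noiseless for illustration).
D = np.stack([template_echo(k) for k in range(K)], axis=1)   # shape (N, K)
y = D @ w_true

# Least-squares weight estimate for this block: solve the K x K
# normal equations D^T D w = D^T y.
w_hat = np.linalg.solve(D.T @ D, D.T @ y)
```
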

In a further possible implementation form of the first aspect, the processing circuitry is configured to determine for each microphone signal the respective estimated residual echo signal on the basis of one or more selected loudspeaker signals of the plurality of loudspeaker signals and an adaptive MIMO FIR filter, the adaptive MIMO FIR filter having a plurality of adaptive filter coefficients.

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of respective microphone signals and the plurality of estimated echo signals. In other words, in an implementation form the processing circuitry is configured to adjust, e.g. optimize the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter using the microphone signals as reference signals.

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of estimated echo signals and a plurality of residual reference signals, wherein the processing circuitry is configured to determine each of the plurality of residual reference signals as a difference between the respective microphone signal and the estimated direct echo signal. In other words, in an implementation form the processing circuitry is configured to adjust, e.g. optimize the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter using residual reference signals.
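As a stand-in for the adaptation of the residual MIMO FIR filter (no particular adaptation rule is fixed at this point of the disclosure), the following sketch runs a single-channel NLMS update against a residual reference signal, i.e. the microphone signal minus the estimated direct echo; the signal lengths, step size and the use of a single selected channel are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
L, N = 16, 4000
s = rng.standard_normal(N)            # one selected loudspeaker signal
h_res = 0.3 * rng.standard_normal(L)  # unknown residual echo path (to identify)
r = np.convolve(s, h_res)[:N]         # residual reference: y(n) - y_dir(n) here

# NLMS adaptation of the residual FIR filter; mu and eps are tuning
# constants of this sketch, not values taken from the disclosure.
h_hat = np.zeros(L)
mu, eps = 0.5, 1e-6
for n in range(L, N):
    s_blk = s[n - L + 1:n + 1][::-1]       # most recent L input samples
    err = r[n] - h_hat @ s_blk             # a-priori estimation error
    h_hat += mu * err * s_blk / (s_blk @ s_blk + eps)
```

In this noiseless single-channel setting the filter estimate converges to the true residual path; a MIMO realization would run one such filter between each selected input and each microphone.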

In a further possible implementation form of the first aspect, the processing circuitry is configured to determine for each loudspeaker signal a signal level measure and/or a correlation measure and to determine the one or more selected loudspeaker signals of the plurality of loudspeaker signals on the basis of the plurality of signal level measures and/or correlation measures.
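One possible realization of such level and correlation measures; the specific measures (per-channel power, pairwise correlation coefficients) and the thresholds are illustrative choices, as the disclosure does not prescribe them:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 2000
base = rng.standard_normal(N)
x = np.stack([
    base,                                          # channel 0
    0.98 * base + 0.02 * rng.standard_normal(N),   # near-duplicate of channel 0
    rng.standard_normal(N),                        # independent channel
    0.05 * rng.standard_normal(N),                 # very low-level channel
])

power = np.mean(x ** 2, axis=1)      # signal level measure per channel
C = np.abs(np.corrcoef(x))           # pairwise correlation measure
np.fill_diagonal(C, 0.0)

# Keep channels with sufficient level that are not highly correlated
# with an already selected channel; thresholds are illustrative.
selected = []
for p in range(x.shape[0]):
    if power[p] < 0.1:
        continue
    if any(C[p, q] > 0.9 for q in selected):
        continue
    selected.append(p)
```

Here the near-duplicate and the low-level channel are dropped, leaving a reduced set of strong, weakly correlated signals for the residual branch.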

According to a second aspect an acoustic device, such as a smart speaker device is disclosed. The acoustic device comprises an acoustic processing device for MIMO AEC according to the first aspect. Moreover, the acoustic device comprises a plurality of loudspeakers, wherein each loudspeaker is configured to be driven by one of the plurality of loudspeaker signals. The acoustic device further comprises a plurality of microphones, wherein each microphone is configured to detect one of the plurality of microphone signals.

In a further possible implementation form of the second aspect, the acoustic device further comprises a mixing unit configured to generate the plurality of loudspeaker signals on the basis of an input signal, in particular a stereo input signal.

According to a third aspect an acoustic processing method for MIMO AEC is disclosed. The acoustic processing method comprises the steps of: receiving a plurality of loudspeaker signals; receiving a plurality of microphone signals; determining an estimated echo signal, wherein the estimated echo signal comprises an estimated direct echo signal and an estimated residual echo signal; and determining an echo reduced microphone signal based on the microphone signal and the estimated echo signal. The acoustic processing method according to the third aspect of the present disclosure can be performed by the acoustic processing device according to the first aspect of the present disclosure. Thus, further features of the method according to the third aspect of the present disclosure result directly from the functionality of the acoustic processing device according to the first aspect of the present disclosure as well as its different implementation forms described above and below.

According to a fourth aspect, a computer program product comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the acoustic processing method according to the third aspect, when the program code is executed by the computer or the processor, is provided.

Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:

Fig. 1 is a schematic diagram illustrating a smart speaker device with a plurality of loudspeakers and a plurality of microphones;

Fig. 2 is a schematic diagram illustrating a time-varying mixing unit for an acoustic device for MIMO AEC according to an embodiment;

Fig. 3a is a schematic diagram illustrating an acoustic processing device for MIMO AEC according to an embodiment;

Fig. 3b is schematic diagram illustrating further aspects of an acoustic processing device for MIMO AEC according to an embodiment;

Fig. 4 is a flow diagram illustrating an acoustic processing method for MIMO AEC according to an embodiment;

Figs. 5a and 5b are schematic diagrams illustrating a top view and a bottom view of a multichannel smart speaker device according to an embodiment; and

Fig. 6 shows graphs illustrating the AEC performance of the MIMO smart speaker device of figures 5a and 5b.

In the following, identical reference signs refer to identical or at least functionally equivalent features.

DETAILED DESCRIPTION OF THE EMBODIMENTS

In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.

For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise. Figure 3a is a schematic diagram illustrating an acoustic processing device 300 for MIMO AEC according to an embodiment. As will be described in more detail below, the acoustic processing device comprises a first signal reception unit 303 adapted to receive a plurality of loudspeaker signals x(n) and a second signal reception unit 305 adapted to receive a plurality of microphone signals y(n). Moreover, the acoustic processing device 300 comprises a processing circuitry 301 adapted to enable echo reduction, in particular acoustic echo cancellation (AEC). 
The processing circuitry 301 of the acoustic processing device 300 may be implemented in hardware and/or software. The hardware may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors. The acoustic processing device 300 may further comprise a non-transitory memory configured to store data and executable program code which, when executed by the processing circuitry 301, causes the acoustic processing device 300 to perform the functions, operations and methods described herein.

As will be described in more detail below in the context of figure 3b, the processing circuitry 301 of the acoustic processing device 300 is configured to perform echo reduction, in particular acoustic echo cancellation (AEC), by determining for each of the plurality of microphone signals y(n) an estimated echo signal ŷ(n), wherein the estimated echo signal ŷ(n) comprises, i.e. is a combination of, an estimated direct echo signal ŷ_dir(n) and an estimated residual echo signal ŷ_res(n), and by determining a respective echo reduced microphone signal e(n) based on the respective microphone signal of the plurality of microphone signals y(n) and the respective estimated echo signal ŷ(n). As illustrated in figure 3b, the plurality of microphone signals y(n) depend on the plurality of loudspeaker signals x(n) and the LEM (Loudspeaker-Enclosure-Microphone) system 320, including the acoustic propagation as well as the loudspeaker and microphone transfer functions.

In an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to determine for each of the plurality of microphone signals y(n) the respective echo reduced microphone signal e(n) as a difference between the respective microphone signal y(n) and the respective estimated echo signal ŷ(n), i.e. e(n) = y(n) - ŷ(n). As illustrated in figure 3b, the processing circuitry 301 may be configured to implement in hardware and/or software a processing block 301d for determining the difference between the respective microphone signal y(n) and the respective estimated echo signal ŷ(n), i.e. e(n) = y(n) - ŷ(n).
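A minimal sketch of this subtraction with synthetic signals: if the echo estimate ŷ(n) were exact, subtracting it from the microphone signal would recover the near-end target signal exactly (both signals below are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
speech = rng.standard_normal(256)        # near-end target signal (synthetic)
echo = 0.5 * rng.standard_normal(256)    # echo component (synthetic)
y = speech + echo                        # microphone signal

y_hat = echo.copy()                      # ideal echo estimate for illustration
e = y - y_hat                            # echo reduced microphone signal
```
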

The acoustic processing device 300 may be a component of an acoustic device 500, such as a smart speaker device 500, illustrated in figures 5a and 5b. In addition to the acoustic processing device 300, the acoustic device 500 further comprises a plurality of loudspeakers 501a-h, wherein each loudspeaker 501a-h is configured to be driven by one of the plurality of loudspeaker signals x(n), and a plurality of microphones 503a-g, wherein each microphone 503a-g is configured to detect one of the plurality of microphone signals y(n). In an embodiment, the acoustic device 500 may further comprise a mixing unit, such as the mixing unit 100 shown in figure 2, configured to generate the plurality of loudspeaker signals x(n) on the basis of an input signal, in particular a stereo input signal g(n).

As will be described in more detail below, in an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured, for the direct path, to determine the respective estimated direct echo signal ŷ_dir(n) on the basis of the plurality of loudspeaker signals x(n) and one or more pre-defined MIMO FIR filter templates, wherein each MIMO FIR filter template comprises a plurality of pre-defined filter coefficients. In an embodiment, the one or more pre-defined MIMO FIR filter templates may comprise one or more room impulse responses (RIRs). In an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to determine the respective estimated direct echo signal ŷ_dir(n) on the basis of the plurality of loudspeaker signals x(n) and a linear combination of the plurality of pre-defined MIMO FIR filter templates, wherein each of the plurality of pre-defined MIMO FIR filter templates is weighted by an adjustable weighting coefficient, i.e. weight.

As will be described in more detail below, in an embodiment, the processing circuitry 301 is configured to determine for the residual path the respective estimated residual echo signal on the basis of one or more selected loudspeaker signals of the plurality of loudspeaker signals and an adaptive MIMO FIR filter, wherein the adaptive MIMO FIR filter has a plurality of adaptive filter coefficients.

Embodiments of the acoustic processing device 300 disclosed herein allow tackling the problem of high cross-correlation by decomposing the Multiple-Input-Multiple-Output (MIMO) echo path into a direct path and a residual path. While the direct path may be modelled by the adaptive linear combination of a priori measured Room Impulse Response (RIR) templates, the residual path may be represented by the adaptive MIMO FIR filter. While the direct-path echo cancellation processing branch may employ the loudspeaker signals x(n) as input, the residual-path processing branch may use the output signals s(n) of a signal selection processing block 301a (illustrated in figure 3b) as input. By means of using two different AEC models, one can deal with the problems of non-unique system identification due to highly correlated loudspeaker signals and a time-varying rendering algorithm.

As will be described in more detail below, in an embodiment, the linear direct-path AEC model H_dir(n) 301b consists of a linear combination of broadband MIMO Finite Impulse Response (FIR) filter templates of length L_dir which are measured in advance in typical acoustic environments. This allows transferring the high-dimensional AEC problem of estimating a set of FIRs to a low-dimensional optimization problem, i.e. the estimation of the adjustable weighting coefficients. Due to the broadband MIMO coupling of the filter templates, both the non-uniqueness problem and the non-uniform signal level problem are significantly reduced.

The remaining echoes which cannot be described by the direct-path model are modelled by the residual AEC system H_res(n) 301c, which consists of the adaptive MIMO filter, which may contain FIR filters of length L between each input and each output channel. To limit the effect of cross-correlated loudspeaker signals, in an embodiment, only a reduced number of less strongly cross-correlated input signals s(n) with high signal level are used. This reduced number of signals is selected by the signal selection processing block 301a depending on the rendering algorithm implemented in the mixing unit 100. A variety of such rendering algorithms are known and, therefore, not described in greater detail herein.
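A hypothetical sketch of such a signal selection (processing block 301a), combining a signal level measure with a pairwise correlation measure, is given below; the greedy rule, the lag-0 correlation and the threshold 0.9 are illustrative assumptions, as the actual selection depends on the rendering algorithm:

```python
import numpy as np

def select_signals(x, num_select):
    """Illustrative signal selection: greedily pick loudspeaker signals
    with high level that are weakly correlated with those already selected.

    x : (P, N) loudspeaker signals; returns indices of selected channels.
    """
    P, _ = x.shape
    levels = np.sum(x ** 2, axis=1)          # per-channel signal energy
    order = np.argsort(levels)[::-1]         # loudest channels first
    selected = []
    for p in order:
        # normalized cross-correlation at lag 0 against already-selected signals
        ok = True
        for s in selected:
            denom = np.sqrt(levels[p] * levels[s]) + 1e-12
            if abs(np.dot(x[p], x[s])) / denom > 0.9:  # assumed threshold
                ok = False
                break
        if ok:
            selected.append(int(p))
        if len(selected) == num_select:
            break
    return selected
```

In this sketch a near-duplicate of an already selected channel is skipped, which mirrors the goal of feeding only weakly cross-correlated, high-level signals to the residual-path filter.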

In the following an embodiment of the acoustic processing device 300 for MIMO AEC will be described in more detail in the context of figure 3b, using the following notation:

• Q: Number of microphone signals

• P: Number of loudspeaker signals

• L: Residual and direct path FIR filter length

• L_dir: Number of non-zero template coefficients

• n: Sample index

• m: Block index

As illustrated in figure 3b, the processing circuitry 301 of the acoustic processing device 300 may be configured to implement in hardware and/or software a processing block 301e for determining the estimated echo signal in the time domain as the sum of the estimated direct echo signal, i.e. the direct-path component ŷ_dir(n), and the estimated residual echo signal, i.e. the residual-path component ŷ_res(n). In an embodiment, the processing circuitry 301 is configured to implement in hardware and/or software a processing block 301b for computing the estimated direct echo signal, i.e. the direct-path component, as

ŷ_dir(n) = H_dir(n) x(n) ∈ ℝ^Q (5)

with H_dir(n) denoting a MIMO transmission matrix modelling P · Q FIR filters of length L between each of the P loudspeaker signals (i.e. input signals) summarized in the vector x(n) and the Q estimated direct echo signals (i.e. output signals) captured in the vector ŷ_dir(n). As can be taken from figure 3b, the processing circuitry 301 of the acoustic processing device 300 may be further configured to implement in hardware and/or software a processing block 301c for computing the estimated residual echo signal as

ŷ_res(n) = H_res(n) s(n) ∈ ℝ^Q (6)

wherein s(n) ∈ ℝ^W contains the W output signals of a signal selection block 301a implemented by the processing circuitry 301 of the acoustic processing device 300 in hardware and/or software. In an embodiment, the output signals s(n) of the signal selection block 301a contain one or more selected loudspeaker signals of the plurality of loudspeaker signals x(n). In an embodiment, the signal selection block 301a may be configured to determine for each loudspeaker signal x(n) a signal level measure and/or a correlation measure and to determine the one or more selected loudspeaker signals s(n) of the plurality of loudspeaker signals x(n) on the basis of the plurality of signal level measures and/or correlation measures.
In a further embodiment, all loudspeaker signals x(n) may be the input of the processing block 301c.
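The residual-path filtering of the selected signals s(n) by the adaptive MIMO FIR filter (processing block 301c) may be sketched as follows; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def estimate_residual_echo(s, H_res):
    """Residual-path echo estimate: a MIMO FIR filter of length L between
    each of the W selected input channels and each of the Q output channels
    (illustrative sketch).

    s     : (W, N)     selected loudspeaker signals from the selection block
    H_res : (Q, W, L)  adaptive FIR filter coefficients
    returns (Q, N)     estimated residual echo signals
    """
    Q, W, L = H_res.shape
    _, N = s.shape
    y_res = np.zeros((Q, N))
    for q in range(Q):
        for w in range(W):
            # linear convolution of each selected input with its FIR channel
            y_res[q] += np.convolve(s[w], H_res[q, w])[:N]
    return y_res
```

The total echo estimate is then the sum of the two branch outputs, from which the echo reduced microphone signal follows by subtraction from the observed microphone signal.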

As already described above, in an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to implement the direct-path AEC processing block 301b on the basis of a weighted sum of K (with K ≥ 1) broadband MIMO FIR filter templates which, in an embodiment, may be extracted from a set of a priori measured room impulse responses (RIRs). The one or more MIMO FIR filter templates H_k can be considered as dictionary elements of a dictionary-based model for the direct path. As will be appreciated, because the templates H_k model the direct path, usually only the first L_dir < L taps of the templates are non-zero. By employing the vectorization operator

h_dir(n) = vec(H_dir^T(n)) ∈ ℝ^R (8)

with R = P · L · Q, the MIMO direct-path transmission matrix in Eq. (7) can be written as a matrix-vector multiplication

h_dir(n) = H_dir w(n) ∈ ℝ^R (9)

with the time-invariant dictionary matrix H_dir and the time-varying weighting vector w(n) (defining the plurality of adjustable weighting coefficients, i.e. weights).

The columns of the dictionary matrix H_dir contain the vectorized templates H_k. Thus, the estimated direct echo signal, i.e. the direct-path component, may be written as

ŷ_dir(n) = X^T(n) H_dir w(n) = D^T(n) w(n) ∈ ℝ^Q (12)

with the input signal matrix X(n), wherein ⊗ denotes the Kronecker product and I_Q ∈ ℝ^{Q×Q} is the identity matrix of dimension Q. Note that D(n) describes the linear convolution of the input signals with the filter templates. In an embodiment, the processing circuitry 301 of the acoustic processing device 300 may be configured to compute these quantities by processing blocks of input signals and employing an overlap-save structure. The Discrete Fourier Transform (DFT) of the templates, i.e. dictionary elements, for the overlap-save processing may be computed by the processing circuitry 301 of the acoustic processing device 300 in advance and stored in a memory of the acoustic processing device 300. In an embodiment, the processing circuitry 301 of the acoustic processing device 300 is further configured to implement the acoustic residual-path AEC processing block 301c on the basis of a MIMO transmission matrix H_res(n) which models the system from the input signal s(n) to the reference signal y_res(n). As already described above, in an embodiment, the transmission matrix H_res(n) defines an adaptive MIMO FIR filter with a plurality of adjustable filter coefficients. As already described above, the input signal, i.e. the one or more selected loudspeaker signals s(n), is provided by the signal selection processing block 301a and can be configured according to the specific kind of rendering algorithm implemented by the mixing unit 100. In an embodiment, the rendering algorithm implemented by the mixing unit 100 is the rendering algorithm illustrated in figure 2. For such an embodiment, the processing circuitry 301 of the acoustic processing device 300 may use the signals s(n) defined in Eq. (16) as input to the residual-path AEC processing block 301c.
Advantageously, this allows reducing the complexity in comparison to using the loudspeaker signals x(n) directly as the input to the residual path AEC processing block 301c. Furthermore, the signals s(n) defined in Eq. (16) exhibit usually a reduced cross-correlation in comparison to the loudspeaker signals x(n) which may lead to a faster convergence of the adaptive filter coefficients.
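The overlap-save structure referred to above can be sketched for a single channel as follows, with the filter spectrum computed once in advance as the text suggests for the template DFTs; the block size B and the helper name are illustrative assumptions:

```python
import numpy as np

def overlap_save(x, h, B):
    """Overlap-save block convolution (illustrative sketch): each block of B
    output samples is computed in the DFT domain with DFT length B + L - 1,
    reusing the precomputed filter spectrum.

    x : (N,) input signal, h : (L,) FIR filter, B : block/hop size
    """
    L = len(h)
    nfft = B + L - 1
    H = np.fft.rfft(h, nfft)              # precomputed once and stored
    x_pad = np.concatenate([np.zeros(L - 1), x])
    out = []
    for start in range(0, len(x), B):
        block = x_pad[start:start + nfft]
        if len(block) < nfft:
            block = np.pad(block, (0, nfft - len(block)))
        y = np.fft.irfft(np.fft.rfft(block, nfft) * H, nfft)
        out.append(y[L - 1:L - 1 + B])    # discard the first L-1 aliased samples
    return np.concatenate(out)[:len(x)]
```

The per-block cost is dominated by three FFTs of length B + L − 1 instead of a length-L time-domain convolution per output sample.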

As Eq. (12) denotes a linear regression model with the template outputs D(n) as covariates, i.e. the input of the estimator, and the estimated direct echo signal ŷ_dir(n) as response variable, i.e. the desired signal, in an embodiment the processing circuitry 301 of the acoustic processing device 300 may be configured to determine or adjust, i.e. optimize, the weighting coefficients, i.e. the weights w(n), on the basis of a block-recursive least-squares algorithm. In such an embodiment, the processing circuitry 301 of the acoustic processing device 300 may be configured to use a block estimated direct echo signal ŷ_dir(m) which captures L samples in one block indexed by the block index m. In an embodiment, the processing circuitry 301 of the acoustic processing device 300 may be configured to compute the block ŷ_dir(m) on the basis of an efficient overlap-save structure, as already described above. The respective linear regression problem is then to determine the weight vector w(m) in Eq. (18) with the linear convolution estimates of the templates

D(m) = (D(mL), ..., D(mL − L + 1)) ∈ ℝ^{K×QL}. (19)

In an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to iteratively estimate the weighting coefficients based on the error signal between the estimated direct echo signal of Eq. (18) and the observed one, with ȳ_dir(m) ∈ ℝ^Q denoting the mL-th sample of the reference signal. In an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to determine the weighting coefficients, i.e. weights, by minimizing a block-recursive least-squares cost function, such as the following block-recursive least-squares cost function

In an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to determine the solution of Eq. (22) as follows, with the sample auto-correlation matrix R_DD(m) and the sample cross-correlation matrix

To avoid numerical instability during the inversion, the processing circuitry 301 of the acoustic processing device 300 may determine the weighting coefficients by a Tikhonov-regularized sample auto-correlation matrix with a regularization factor δ. Due to the full-rank update of the sample correlation matrices, the processing circuitry 301 of the acoustic processing device 300 may refrain from using the matrix inversion lemma for directly updating the inverse. If the number of templates K is low in comparison to the filter length L (which may often be the case), the complexity of inverting the sample auto-correlation matrix R_DD(m) is negligible in comparison to the computation of the linear convolution by overlap-save processing.
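A minimal sketch of one block of the Tikhonov-regularized block-recursive least-squares weight update follows; the forgetting factor lam, the default δ and the helper name are illustrative assumptions:

```python
import numpy as np

def update_weights(R_DD, r_Dy, D_m, y_bar_m, lam=0.9, delta=1e-3):
    """One block-recursive least-squares update of the K template weights,
    with Tikhonov regularization of the sample auto-correlation matrix
    (illustrative sketch).

    R_DD    : (K, K)  recursive sample auto-correlation matrix
    r_Dy    : (K,)    recursive sample cross-correlation vector
    D_m     : (K, M)  template convolution outputs for the current block
    y_bar_m : (M,)    direct-path reference signal for the current block
    """
    R_DD = lam * R_DD + D_m @ D_m.T           # full-rank correlation update
    r_Dy = lam * r_Dy + D_m @ y_bar_m
    K = R_DD.shape[0]
    # regularized direct solve instead of the matrix inversion lemma;
    # cheap because K is small compared to the filter length L
    w = np.linalg.solve(R_DD + delta * np.eye(K), r_Dy)
    return w, R_DD, r_Dy
```

Because only a K × K system is solved per block, the cost of this step is negligible next to the overlap-save convolutions that produce D_m.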

In an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to determine the error signal of the residual path as

e_res(n) = y_res(n) − H_res(n) s(n) ∈ ℝ^Q. (27)

In an embodiment, the processing circuitry 301 of the acoustic processing device 300 may be configured to optimize, i.e. minimize, the error signal of the residual path (for determining the plurality of adaptive filter coefficients) using one of a plurality of known optimization algorithms. For instance, in an embodiment, due to its superior convergence properties and efficient block processing, the processing circuitry 301 of the acoustic processing device 300 may be configured to optimize, i.e. minimize, the error signal of the residual path on the basis of the GFDAF optimization algorithm disclosed in H. Buchner et al., "Generalized multichannel Frequency-Domain Adaptive Filtering: efficient realization and application to hands-free speech communication", Signal Processing, 2005, and M. Schneider and W. Kellermann, "The Generalized Frequency-Domain Adaptive Filtering algorithm as an approximation of the block recursive least-squares algorithm", EURASIP Journal on Advances in Signal Processing, 2016.
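Since GFDAF itself operates block-wise in the DFT domain, the following sketch uses a plain sample-wise NLMS update as a deliberately simplified stand-in to illustrate minimizing the residual-path error of Eq. (27); it is not the GFDAF algorithm of the cited references, and all names are illustrative:

```python
import numpy as np

def nlms_step(h, s_buf, y_ref, mu=0.5, eps=1e-6):
    """Single-channel NLMS update as a simplified stand-in for GFDAF
    (illustrative sketch).

    h     : (L,) adaptive residual filter coefficients
    s_buf : (L,) most recent selected-signal samples, newest first
    y_ref : scalar residual-path reference sample
    """
    e = y_ref - h @ s_buf                 # residual-path error sample
    # normalized gradient step toward reducing the instantaneous error
    h = h + mu * e * s_buf / (s_buf @ s_buf + eps)
    return h, e
```

Running this update over a noiseless echo path drives the coefficients to the true filter, which is the behaviour the block-based GFDAF achieves with faster convergence.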

As described above, in an embodiment the processing circuitry 301 of the acoustic processing device 300 may be configured to adjust, i.e. optimize for the direct path the plurality of adjustable weighting coefficients, i.e. weights and for the residual path the plurality of adaptive filter coefficients on the basis of one or more of the optimization algorithms described above, such as a block recursive least-squares regression algorithm or a GFDAF algorithm. In the following further embodiments of the acoustic processing device 300 will be described, where the processing circuitry 301 is further configured to determine one or more reference signals for the optimization algorithms described above.

In an embodiment, the processing circuitry 301 is configured to adjust the plurality of adjustable weighting coefficients, i.e. the weights of the weighted linear combination of the predefined MIMO FIR filter templates, using the microphone signals as reference signals, i.e. on the basis of the plurality of microphone signals y(n) and the plurality of estimated echo signals ŷ(n). In some scenarios, however, this choice of reference signal might lead to a biased solution of the model parameters, i.e. the weights w(m) and the plurality of adaptive filter coefficients of H_res(m), as both the direct and the residual path aim at explaining the same data.

To address this issue, in an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to employ as the direct-path reference signal the signal defined in Eq. (28), involving a block-selection matrix P. Note that the block-selection matrix P sets the first L_dir samples of the residual path for each loudspeaker to zero. In other words, the processing circuitry 301 may be configured to employ the error of the windowed residual path as reference for the direct-path adaptation to reduce the effect of competing adaptive filters.

For the residual path, the processing circuitry 301 may be configured to use the error between the respective microphone signal y(n) and the respective estimated direct echo signal, i.e. the direct-path estimate ŷ_dir(n), as reference signal

In this case, no windowing may be necessary, as the templates may include only L_dir non-zero samples. Using the direct-path error as reference is again motivated by the desire to model non-competing filters.

In the embodiments described above, only the sample-wise reference signals have been described. As will be appreciated, however, the corresponding block quantities are defined by summarizing L samples of the respective reference signals (as already described above).

In an embodiment, the processing circuitry 301 of the acoustic processing device 300 is configured to implement the following staged procedure for adjusting, i.e. optimizing the plurality of adjustable weighting coefficients, i.e. weights for the direct path and the plurality of adaptive filter coefficients for the residual path. In a first stage, the processing circuitry 301 of the acoustic processing device 300 is configured to determine the direct path reference signal defined by Eq. (28) above. As already described above, this operation may include a linear convolution, which can be efficiently implemented by an overlap-save structure in the DFT domain. In a second stage, the processing circuitry 301 of the acoustic processing device 300 is configured to update the direct path, i.e. adjust the plurality of adjustable weighting coefficients using Eq. (26) above. Thereafter, in a third stage, the processing circuitry 301 of the acoustic processing device 300 is configured to update the residual path, i.e. adjust the plurality of adaptive filter coefficients using as reference signal the error signal defined in Eq. (27) above. Thus, in an embodiment, the processing circuitry 301 is configured to first adjust, i.e. optimize the plurality of adjustable weighting coefficients of the weighted linear combination of the predefined MIMO FIR filter templates and, thereafter, adjust, i.e. optimize the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter. In an embodiment, the processing circuitry 301 of the acoustic processing device 300 may be configured to perform this optimization procedure for each new block of observations.
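The staged procedure above can be illustrated by the following deliberately simplified single-channel sketch with one template, where a recursive least-squares fit and an LMS update stand in for the block-recursive least-squares and GFDAF optimizers, and the residual-path windowing of Eq. (28) is omitted; all names and parameters are illustrative assumptions:

```python
import numpy as np

def staged_aec(x, y, template, L_res=5, B=64, mu=0.5, delta=1e-3):
    """Per-block three-stage adaptation (illustrative single-channel sketch):
    (1) form the direct-path reference by removing the current residual-path
        estimate from the microphone block,
    (2) refit the template weight by regularized recursive least squares,
    (3) adapt the residual filter against the direct-path error."""
    N = len(x)
    d = np.convolve(x, template)[:N]      # template output (covariate)
    w = 0.0                               # single adjustable template weight
    h = np.zeros(L_res)                   # adaptive residual FIR filter
    R, r = 0.0, 0.0                       # recursive correlation statistics
    for start in range(L_res, N - B, B):
        blk = slice(start, start + B)
        # stage 1: direct-path reference = microphone minus residual estimate
        y_res = np.array([h @ x[n:n - L_res:-1] for n in range(start, start + B)])
        ref_dir = y[blk] - y_res
        # stage 2: regularized recursive least-squares update of the weight
        R += d[blk] @ d[blk]
        r += d[blk] @ ref_dir
        w = r / (R + delta)
        # stage 3: LMS on the residual filter against the direct-path error
        for n in range(start, start + B):
            s_buf = x[n:n - L_res:-1]
            e = (y[n] - w * d[n]) - h @ s_buf
            h += mu * e * s_buf / (s_buf @ s_buf + 1e-6)
    return w, h
```

With an echo path consisting of a scaled template plus one extra late tap, the weight converges to the scale and the residual filter picks up the late tap, mirroring the intended division of labour between the two branches.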

Figure 4 is a flow diagram illustrating an acoustic processing method 400 for MIMO AEC according to an embodiment. The acoustic processing method 400 comprises the steps of: receiving 401 the plurality of loudspeaker signals x(n); receiving 403 the plurality of microphone signals y(n); determining 405 the estimated echo signal ŷ(n), wherein the estimated echo signal ŷ(n) comprises the estimated direct echo signal ŷ_dir(n) and the estimated residual echo signal ŷ_res(n); and determining 407 the echo reduced microphone signal e(n) based on the microphone signal y(n) and the estimated echo signal ŷ(n).

In an embodiment, the acoustic processing method 400 may be performed by the acoustic processing device 300 described above. Thus, further features of the acoustic processing method 400 result directly from the functionality of the different embodiments of the acoustic processing device 300 described above.

As already described above, figures 5a and 5b are schematic diagrams illustrating a top view and a bottom view of an exemplary multichannel smart speaker device 500 according to an embodiment. The exemplary smart speaker device 500 shown in figures 5a and 5b comprises eight loudspeakers 501a-h, including seven tweeters 501a-g and one subwoofer 501h, as well as seven microphones 503a-g. As can be taken from figures 5a and 5b, the seven tweeters 501a-g are mounted equidistantly from the center of the smart speaker device 500, and each of the seven microphones 503a-g is mounted below one of the tweeters 501a-g. The subwoofer 501h is mounted at the center of the device 500. For testing the AEC performance of the exemplary smart speaker device 500 shown in figures 5a and 5b, recordings were conducted in a typical office environment with a reverberation time of T_60 = 650 ms, and the device 500 was placed on a table. The loudspeaker signals were computed by a time-varying virtual stereo rendering software from a stereo pop music signal. The average Sound Pressure Level (SPL) during the recordings was measured with an external microphone at a distance of 10 cm and was approximately 77 dB. The AEC performance of the acoustic processing device 300 of the smart speaker device 500 in this scenario was estimated on the basis of the average ERLE (Echo Return Loss Enhancement) performance measure describing the echo attenuation as follows:

The expectation E[·] is hereby estimated by recursive averaging over time. The AEC performance of the acoustic processing device 300 of the smart speaker device 500 in this scenario was compared against two GFDAF-based benchmark algorithms, as illustrated in figure 6. GFDAF (Residual) denotes the results of only the residual path computations of the acoustic processing device 300, i.e. without the direct path model. Additionally, the classical AEC approach, i.e. employing the loudspeaker signals directly as input to an adaptive filter, is abbreviated by GFDAF (Loudspeaker). The proposed direct path model consisted of two templates which model the direct acoustic propagation and early reflections from the surrounding surface. Both RIRs were measured in an anechoic environment. The filter adaptation was halted after 12 s to evaluate the performance of the identified system. As can be appreciated from figure 6, the AEC performance of the acoustic processing device 300 of the smart speaker device 500 is significantly better than that of both benchmark approaches with respect to convergence speed as well as steady-state performance.
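The ERLE measure with recursively averaged expectations may be sketched as follows; the smoothing constant alpha and the function name are illustrative assumptions:

```python
import numpy as np

def erle_db(y, e, alpha=0.999):
    """Average ERLE in dB: ratio of microphone signal power to echo-reduced
    error signal power, with both expectations estimated by recursive
    (exponential) averaging over time (illustrative sketch)."""
    p_y, p_e = 1e-12, 1e-12               # smoothed power estimates
    out = np.empty(len(y))
    for n in range(len(y)):
        p_y = alpha * p_y + (1 - alpha) * y[n] ** 2
        p_e = alpha * p_e + (1 - alpha) * e[n] ** 2
        out[n] = 10 * np.log10(p_y / p_e)  # instantaneous ERLE in dB
    return out
```

For example, an error signal attenuated by a constant factor of 10 relative to the microphone signal yields a steady-state ERLE of 20 dB.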

The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.