Title:
ACOUSTIC PROCESSING DEVICE FOR MULTICHANNEL NONLINEAR ACOUSTIC ECHO CANCELLATION
Document Type and Number:
WIPO Patent Application WO/2022/048737
Kind Code:
A1
Abstract:
An acoustic processing device (110) for multichannel nonlinear acoustic echo cancellation is disclosed. The acoustic processing device (110) comprises a processing circuitry (120) configured to apply to each of a plurality of loudspeaker signals a respective pre-processing filter for filtering each loudspeaker signal in order to obtain a respective pre-processed loudspeaker signal, wherein each pre-processing filter is based on a linear combination of a plurality of pre-defined basis functions, wherein each of the pre-defined basis functions is weighted by an adjustable pre-processing weighting coefficient. The processing circuitry (120) is further adapted to enable echo reduction by determining for each of a plurality of microphone signals an estimated echo signal on the basis of the plurality of pre-processed loudspeaker signals and determining a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

Inventors:
HALIMEH MODAR (DE)
HAUBNER THOMAS (DE)
KELLERMANN WALTER (DE)
TAGHIZADEH MOHAMMAD (DE)
Application Number:
PCT/EP2020/074428
Publication Date:
March 10, 2022
Filing Date:
September 02, 2020
Assignee:
HUAWEI TECH CO LTD (CN)
UNIV FRIEDRICH ALEXANDER ER (DE)
International Classes:
G10L21/02; H04M9/08; H04R1/32; H04R1/40; H04R3/00
Foreign References:
US6246760B1 2001-06-12
US20130230184A1 2013-09-05
Other References:
G. ENZNER ET AL.: "Acoustic Echo Control", 2014, ACADEMIC PRESS LIBRARY IN SIGNAL PROCESSING
H. BUCHNER ET AL.: "Generalized multichannel Frequency-Domain Adaptive Filtering: efficient realization and application to hands-free speech communication", SIGNAL PROCESSING, 2005
M. SCHNEIDER, W. KELLERMANN: "The generalized frequency-domain adaptive filtering algorithm as an approximation of the block recursive least-squares algorithm", EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, vol. 2016, no. 1, January 2016 (2016-01-01), XP021231669, DOI: 10.1186/s13634-015-0302-2
T. HAUBNER, M. HALIMEH, A. SCHMIDT, W. KELLERMANN: "Technical Report", 2019, FRIEDRICH-ALEXANDER UNIVERSITY ERLANGEN-NURNBERG, article "Robust nonlinear MIMO AEC WP 3"
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. An acoustic processing device (110), comprising: a first signal reception unit (130) adapted to receive a plurality of loudspeaker signals; a second signal reception unit (140) adapted to receive a plurality of microphone signals; and a processing circuitry (120) configured to apply to each loudspeaker signal a respective pre-processing filter (122a-g) for filtering each loudspeaker signal in order to obtain a respective pre-processed loudspeaker signal, wherein each pre-processing filter (122a-g) is based on a linear combination of a plurality of pre-defined basis functions, each of the pre-defined basis functions is weighted by an adjustable pre-processing weighting coefficient; wherein the processing circuitry (120) is adapted to enable echo reduction, the processing circuitry (120) being configured to determine an estimated echo signal on the basis of the plurality of pre-processed loudspeaker signals, wherein the processing circuitry (120) is further configured to determine a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

2. The acoustic processing device (110) of claim 1, wherein the plurality of pre-defined basis functions comprises a plurality of Legendre polynomials, Power filters, Fourier series, diagonal Volterra kernels or neural networks.

3. The acoustic processing device (110) of claim 1 or 2, wherein the processing circuitry (120) is configured to determine the p-th pre-processing filter (122a-g) $f_p$ on the basis of the following equation: $f_p(x_p(n)) = \sum_{j=1}^{L_a} a_{p,j}\, g_j(x_p(n))$, wherein $a_{p,j}$ denotes the j-th adjustable pre-processing weighting coefficient of the p-th pre-processing filter (122a-g), $g_j$ denotes the j-th pre-defined basis function, $x_p$ denotes the p-th loudspeaker signal, $n$ denotes a sample index and $L_a$ denotes an adjustable parameter defining the number of the plurality of basis functions.

4. The acoustic processing device (110) of any one of the preceding claims, wherein for each pre-processing filter (122a-g) the processing circuitry (120) is configured to adjust the plurality of adjustable pre-processing weighting coefficients of the pre-processing filter (122a-g) on the basis of a respective selected subset of the plurality of microphone signals.

5. The acoustic processing device (110) of claim 4, wherein for each pre-processing filter (122a-g) the processing circuitry (120) is configured to adjust the plurality of adjustable pre-processing weighting coefficients of the pre-processing filter (122a-g) on the basis of the respective selected subset of the plurality of microphone signals, a corresponding subset of the plurality of estimated direct echo signals and a corresponding subset of the plurality of estimated residual echo signals.

6. The acoustic processing device (110) of claim 4 or 5, wherein for each pre-processing filter (122a-g) the processing circuitry (120) is configured to determine the selected subset of the plurality of microphone signals on the basis of a plurality of geometrical configurations between a plurality of microphones (103a-g) and a respective loudspeaker of a plurality of loudspeakers (101a-h) associated with the respective pre-processing filter (122a-g).

7. The acoustic processing device (110) of any one of claims 4 to 6, wherein for each pre-processing filter (122a-g) the processing circuitry (120) is configured to adjust the plurality of adjustable pre-processing weighting coefficients of the pre-processing filter (122a-g) on the basis of an iterative gradient-based adjustment scheme.

8. The acoustic processing device (110) of claim 7, wherein for each pre-processing filter (122a-g) the iterative gradient-based adjustment scheme is based on a cost function depending on the respective selected subset of the plurality of microphone signals, the corresponding subset of the plurality of estimated direct echo signals and the corresponding subset of the plurality of estimated residual echo signals.

9. The acoustic processing device (110) of any one of the preceding claims, wherein the processing circuitry (120) is configured to determine the respective echo compensated microphone signal as a difference between the respective microphone signal and the estimated echo signal.

10. The acoustic processing device (110) of any one of the preceding claims, wherein each estimated echo signal comprises an estimated direct echo signal and an estimated residual echo signal, wherein the processing circuitry (120) is configured to determine the respective estimated direct echo signal on the basis of the plurality of pre-processed loudspeaker signals and one or more pre-defined MIMO FIR filter templates, each MIMO FIR filter template having a plurality of pre-defined filter coefficients, and wherein the processing circuitry (120) is configured to determine the respective estimated residual echo signal on the basis of one or more selected loudspeaker signals of the plurality of loudspeaker signals and an adaptive MIMO FIR filter, the adaptive MIMO FIR filter having a plurality of adaptive filter coefficients.

11. The acoustic processing device (110) of claim 10, wherein the processing circuitry (120) is configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of microphone signals and the plurality of estimated echo signals.

12. The acoustic processing device (110) of claim 10, wherein the processing circuitry (120) is configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of estimated echo signals and a plurality of residual reference signals, wherein the processing circuitry (120) is configured to determine each of the plurality of residual reference signals as a difference between the respective microphone signal and the estimated direct echo signal.

13. The acoustic processing device (110) of any one of claims 10 to 12, wherein the one or more pre-defined MIMO FIR filter templates comprise a plurality of pre-defined MIMO FIR filter templates and wherein the processing circuitry (120) is configured to determine the respective estimated direct echo signal on the basis of the plurality of pre-processed loudspeaker signals and a linear combination of the plurality of pre-defined MIMO FIR filter templates, each of the plurality of pre-defined MIMO FIR filter templates weighted by an adjustable template weighting coefficient.

14. The acoustic processing device (110) of claim 13, wherein the processing circuitry (120) is configured to adjust the plurality of adjustable template weighting coefficients on the basis of the plurality of microphone signals and the plurality of estimated echo signals.

15. The acoustic processing device (110) of claim 14, wherein the processing circuitry (120) is configured to adjust the plurality of adjustable template weighting coefficients on the basis of the plurality of estimated direct echo signals and a plurality of direct path reference signals, wherein the processing circuitry (120) is configured to determine the plurality of direct path reference signals on the basis of the plurality of microphone signals, the one or more selected loudspeaker signals of the plurality of loudspeaker signals and the adaptive MIMO FIR filter.

16. The acoustic processing device (110) of claim 15, wherein the processing circuitry (120) is further configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of estimated echo signals and a plurality of residual reference signals, wherein the processing circuitry (120) is configured to determine each of the plurality of residual reference signals as a difference between the respective microphone signal and the estimated direct echo signal.

17. An acoustic device (100), comprising: an acoustic processing device (110) according to any one of the preceding claims; a plurality of loudspeakers (101a-h), wherein each loudspeaker (101a-h) is configured to be driven by one of the plurality of loudspeaker signals; and/or a plurality of microphones (103a-g), wherein each microphone (103a-g) is configured to detect one of the plurality of microphone signals.

18. The acoustic device (100) of claim 17, wherein the acoustic device (100) further comprises a mixing unit (150) configured to generate the plurality of loudspeaker signals on the basis of an input signal, in particular a stereo input signal.

19. An acoustic processing method (500), comprising: receiving (501) a plurality of loudspeaker signals and a plurality of microphone signals; applying (503) to each loudspeaker signal a respective pre-processing filter for obtaining a respective pre-processed loudspeaker signal, wherein each pre-processing filter is based on a linear combination of a plurality of pre-defined basis functions, each of the predefined basis functions weighted by an adjustable pre-processing weighting coefficient; determining (505) an estimated echo signal on the basis of the plurality of pre-processed loudspeaker signals; and determining (507) a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

20. A computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method (500) of claim 19, when the program code is executed by the computer or the processor.

Description:
ACOUSTIC PROCESSING DEVICE FOR MULTICHANNEL NONLINEAR ACOUSTIC ECHO CANCELLATION

TECHNICAL FIELD

The present disclosure relates to acoustic sound processing in general. More specifically, the disclosure relates to an acoustic processing device for multichannel nonlinear acoustic echo cancellation (NLAEC).

BACKGROUND

Acoustic Echo Cancellation (AEC) addresses the problem of suppressing undesired acoustic coupling between sound reproduction and sound acquisition in an acoustic device (see G. Enzner et al., “Acoustic Echo Control”, Academic Press Library in Signal Processing, 2014). AEC is often used, for instance, for full-duplex hands-free acoustic human-machine interfaces, such as smart speaker devices with multiple loudspeakers and multiple microphones. AEC with multiple loudspeakers and multiple microphones is referred to as MIMO AEC.

In smart speaker devices as well as other acoustic devices the so-called problem of multichannel or MIMO Nonlinear Acoustic Echo Cancellation (NLAEC) can occur. An illustration of this problem is depicted in Figure 1, where the microphones 103a-g of the smart speaker device 100 record the voice of a user 10 in addition to the signals played back by the loudspeakers 101a-h of the smart speaker device itself. With, e.g., miniaturized loudspeakers 101a-h or nonlinear power amplifiers driving the loudspeakers 101a-h, each loudspeaker 101a-h may nonlinearly distort its input signal. The purpose of multichannel NLAEC is to model the nonlinear echo paths between loudspeakers and microphones in order to cancel the echo signals.

Challenges for multichannel NLAEC include the estimation of simultaneously active nonlinearities, where each loudspeaker 101a-h acts as an interferer when estimating the others. Additionally, since the estimation needs to be carried out for multiple microphones 103a-g, this would typically result in computationally demanding algorithms, rendering such systems unsuitable for miniaturized and portable devices, such as the smart speaker device 100 shown in figure 1.

SUMMARY

It is an object of the invention to provide an improved acoustic processing device for multichannel nonlinear acoustic echo cancellation.

The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect an acoustic processing device for multichannel nonlinear acoustic echo cancellation (NLAEC) is provided. The acoustic processing device comprises a first signal reception unit adapted to receive a plurality of loudspeaker signals and a second signal reception unit adapted to receive a plurality of microphone signals. Furthermore, the acoustic processing device comprises a processing circuitry configured to apply to each loudspeaker signal a respective pre-processing filter, in particular a memoryless pre-processing filter for filtering each loudspeaker signal in order to obtain a respective pre-processed loudspeaker signal, wherein each pre-processing filter is based on a linear combination of a plurality of pre-defined, i.e. fixed basis functions (i.e. basis filters), wherein each of the pre-defined basis functions is weighted by an adjustable preprocessing weighting coefficient. The processing circuitry is further adapted to enable echo reduction by determining for each microphone signal an estimated echo signal on the basis of the plurality of pre-processed loudspeaker signals and determining a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

The acoustic processing device according to the first aspect provides a beneficial approach for canceling nonlinear distortions caused by nonlinear components in miniaturized devices, e.g., power amplifiers or loudspeakers. Unlike current state-of-the-art systems, nonlinear distortions in a MIMO setup may be canceled without impairing the quality of the near-end speaker’s speech. Therefore, the acoustic processing device according to the first aspect does not degrade the performance of subsequent speech or audio processing systems, e.g., speech recognition systems, and allows for higher quality of full-duplex communication by means of, for instance, a smart speaker device comprising the acoustic processing device according to the first aspect.

The acoustic processing device according to the first aspect models the different nonlinear direct echo paths by a set of nonlinear preprocessors followed by a linear mixing system. Since the different microphone signals are estimated by the same nonlinear preprocessors, the proposed acoustic processing device is less redundant in comparison to an assignment of a set of entirely unique echo paths per microphone. Each of the preprocessors can be adapted using a selected number of microphones providing a scalable approach through the controllable trade-off between the number of microphones used to adapt each preprocessor and the computational cost associated with using more microphones.

This effectively identifies the MIMO nonlinear system, which has all loudspeaker signals as inputs and all microphone signals as outputs, for instance, by means of gradients obtained for a group of coupled MISO systems, each with all loudspeaker signals as inputs and one microphone signal as output. More specifically, the adaptation of one nonlinear preprocessor using one microphone signal may be carried out with respect to a cost function that is based on an error signal minimized by a MISO system as an estimator. This error signal describes the difference between an observed microphone signal and an estimated microphone signal that is the output of a MISO system with all loudspeaker signals as inputs. The adaptation may be performed by minimizing the cost function for the MISO system with respect to the nonlinear preprocessor parameters. By minimizing the cost function for multiple MISO systems, with all loudspeaker signals as inputs, a given nonlinear preprocessor may be adapted using error signals from multiple microphone signals. The coupling between the different MISO systems is highlighted by recalling that the different MISO systems share the same loudspeakers and, thus, the same nonlinear preprocessors.

In a further possible implementation form of the first aspect, the plurality of pre-defined basis functions comprises one or more Legendre polynomials, Power filters, Fourier series, diagonal Volterra kernels or neural networks.

In a further possible implementation form of the first aspect, the processing circuitry is configured to determine the p-th pre-processing filter on the basis of the following equation: $f_p(x_p(n)) = \sum_{j=1}^{L_a} a_{p,j}\, g_j(x_p(n))$, wherein $a_{p,j}$ denotes the j-th adjustable pre-processing weighting coefficient of the p-th pre-processing filter, $g_j$ denotes the j-th pre-defined basis function, $x_p$ denotes the p-th loudspeaker signal, $n$ denotes a sample index and $L_a$ denotes an adjustable parameter defining the number of the plurality of basis functions for modelling the respective pre-processing filter.
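The weighted sum of pre-defined basis functions can be illustrated with a minimal Python sketch. It assumes Legendre polynomials of order 1 to the number of weights as the basis (one of the options named above), a loudspeaker signal scaled to the range [-1, 1], and hypothetical function names `legendre_basis` and `preprocess`; none of these choices is prescribed by the disclosure.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_basis(x, num_basis):
    """Evaluate Legendre polynomials of order 1..num_basis at every sample of x.

    Assumption: the basis starts at order 1 (no constant term).
    Returns an array of shape (num_basis, len(x)).
    """
    x = np.asarray(x, dtype=float)
    rows = []
    for order in range(1, num_basis + 1):
        coeffs = np.zeros(order + 1)
        coeffs[order] = 1.0  # select the single polynomial of this order
        rows.append(legendre.legval(x, coeffs))
    return np.stack(rows)


def preprocess(x, weights):
    """Memoryless pre-processing filter: weighted sum of fixed basis functions."""
    weights = np.asarray(weights, dtype=float)
    basis = legendre_basis(x, len(weights))
    # Only the weights are adjustable; the basis functions stay fixed.
    return weights @ basis
```

With the weights `[1.0, 0.0, 0.0]` the sketch reduces to the identity mapping, which would correspond to a loudspeaker that does not distort its input.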

In a further possible implementation form of the first aspect, for each pre-processing filter the processing circuitry is configured to adjust the plurality of adjustable pre-processing weighting coefficients of the pre-processing filter on the basis of a respective selected subset of the plurality of microphone signals. In an implementation form the sizes of the selected subsets of microphone signals may differ for different pre-processing filters.

In a further possible implementation form of the first aspect, for each pre-processing filter the processing circuitry is configured to adjust the plurality of adjustable pre-processing weighting coefficients of the pre-processing filter on the basis of the respective selected subset of the plurality of microphone signals, a corresponding subset of the plurality of estimated direct echo signals and a corresponding subset of the plurality of estimated residual echo signals.

In a further possible implementation form of the first aspect, for each pre-processing filter the processing circuitry is configured to determine the selected subset of the plurality of microphone signals on the basis of a plurality of geometrical configurations, in particular spatial distances between the plurality of microphones and the respective loudspeaker associated with the respective pre-processing filter. Thus, in an implementation form for a given pre-processing filter the subset of microphone signals may be selected based on the distances of the microphones to the corresponding loudspeaker associated with the respective pre-processing filter.
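A distance-based selection of this kind might look like the following Python sketch. The text states only that the subset may be selected based on the distances; picking the nearest microphones, the Euclidean metric and the name `select_microphones` are assumptions made for illustration.

```python
import numpy as np


def select_microphones(loudspeaker_pos, microphone_pos, subset_size):
    """Return the indices of the subset_size microphones closest to one loudspeaker.

    Assumption: "closest" is the selection rule; the source only says the
    subset may depend on the loudspeaker-to-microphone distances.
    """
    loudspeaker_pos = np.asarray(loudspeaker_pos, dtype=float)
    microphone_pos = np.asarray(microphone_pos, dtype=float)
    distances = np.linalg.norm(microphone_pos - loudspeaker_pos, axis=1)
    # A stable sort keeps the original order among equidistant microphones.
    order = np.argsort(distances, kind="stable")
    return order[:subset_size]
```

Because the subset size is a per-loudspeaker argument, different pre-processing filters can use subsets of different sizes, which is where the trade-off between adaptation quality and computational cost would be set.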

In a further possible implementation form of the first aspect, for each pre-processing filter the processing circuitry is configured to adjust the plurality of adjustable pre-processing weighting coefficients of the pre-processing filter on the basis of an iterative gradient-based adjustment scheme.

In a further possible implementation form of the first aspect, for each pre-processing filter the iterative gradient-based adjustment scheme is based on a cost function depending on the respective selected subset of the plurality of microphone signals, the corresponding subset of the plurality of estimated direct echo signals and the corresponding subset of the plurality of estimated residual echo signals.
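One possible reading of such a scheme is plain gradient descent on a squared-error cost, sketched below in Python for a single loudspeaker. The squared-error cost, the fixed step size, the known FIR paths to the selected microphones and the names `filtered_basis` and `gradient_step` are all assumptions; the disclosure does not fix the cost function or the update rule in this passage.

```python
import numpy as np


def filtered_basis(basis, fir_filters):
    """Convolve each basis signal with each FIR filter.

    basis:       (J, N) basis functions evaluated on one loudspeaker signal.
    fir_filters: (M, L) impulse responses from that loudspeaker to the
                 M selected microphones (assumed known here).
    Returns an (M, J, N) array, truncated to N samples.
    """
    num_samples = basis.shape[1]
    return np.array([[np.convolve(g, h)[:num_samples] for g in basis]
                     for h in fir_filters])


def gradient_step(weights, basis, fir_filters, microphones, residual_echo, step_size):
    """One gradient-descent step on the summed squared error of the selected microphones."""
    phi = filtered_basis(basis, fir_filters)
    # The direct echo estimate is linear in the pre-processing weights.
    direct_echo = np.einsum("j,mjn->mn", weights, phi)
    # Error per selected microphone: observed minus direct minus residual estimate.
    errors = microphones - direct_echo - residual_echo
    cost = float(np.sum(errors ** 2))
    gradient = -2.0 * np.einsum("mn,mjn->j", errors, phi)
    return weights - step_size * gradient, cost
```

The error signals of all selected microphones contribute to one gradient for the shared weights, which mirrors the coupling of several MISO systems through a common preprocessor.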

In a further possible implementation form of the first aspect, the processing circuitry is configured to determine for each microphone signal the respective echo compensated, e.g. echo free microphone signal, i.e. the target or desired signal as a difference between the respective microphone signal and the estimated echo signal.

In a further possible implementation form of the first aspect, each estimated echo signal comprises an estimated direct echo signal and an estimated residual echo signal. The processing circuitry is configured to determine for each microphone signal the respective estimated direct echo signal on the basis of the plurality of pre-processed loudspeaker signals and one or more pre-defined, i.e. fixed MIMO FIR filter templates, wherein each MIMO FIR filter template has a plurality of pre-defined filter coefficients. Moreover, the processing circuitry is configured to determine for each microphone signal the respective estimated residual echo signal on the basis of one or more selected loudspeaker signals of the plurality of loudspeaker signals and an adaptive MIMO FIR filter, wherein the adaptive MIMO FIR filter has a plurality of adaptive filter coefficients.
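The two-branch echo estimate can be sketched in Python as follows. The time-domain convolution, the array layout, the names `mimo_fir` and `estimate_echo`, and in particular the assumption that the direct and residual estimates are combined by addition are illustrative choices, not statements of the disclosed implementation.

```python
import numpy as np


def mimo_fir(filters, inputs):
    """Apply a MIMO FIR filter.

    filters: (M, P, L) impulse responses from input p to output m.
    inputs:  (P, N) input signals.
    Returns (M, N) output signals, truncated to the input length.
    """
    num_out, num_in, _ = filters.shape
    num_samples = inputs.shape[1]
    outputs = np.zeros((num_out, num_samples))
    for m in range(num_out):
        for p in range(num_in):
            outputs[m] += np.convolve(inputs[p], filters[m, p])[:num_samples]
    return outputs


def estimate_echo(preprocessed, template, selected_inputs, adaptive_filter):
    """Estimate the echo at every microphone from a fixed and an adaptive branch."""
    # Direct branch: pre-processed loudspeaker signals through a fixed template.
    direct = mimo_fir(template, preprocessed)
    # Residual branch: selected loudspeaker signals through the adaptive filter.
    residual = mimo_fir(adaptive_filter, selected_inputs)
    # Assumption: the two estimates are combined by summation.
    return direct + residual, direct, residual
```

Only `adaptive_filter` would change during operation in this sketch; `template` holds pre-defined coefficients and stays fixed.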

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of microphone signals and the plurality of estimated echo signals. In other words, in an implementation form the processing circuitry is configured to adjust, e.g. optimize the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter using the microphone signals as reference signals.

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of estimated echo signals and a plurality of residual reference signals, wherein the processing circuitry is configured to determine each of the plurality of residual reference signals as a difference between the respective microphone signal and the estimated direct echo signal.

In a further possible implementation form of the first aspect, the one or more pre-defined, i.e. fixed MIMO FIR filter templates comprise a plurality of, i.e. at least two pre-defined MIMO FIR filter templates, wherein the processing circuitry is configured to determine for each microphone signal the respective estimated direct echo signal on the basis of the plurality of pre-processed loudspeaker signals and a linear combination of the plurality of pre-defined MIMO FIR filter templates, wherein each of the plurality of pre-defined MIMO FIR filter templates is weighted by an adjustable template weighting coefficient.
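A weighted combination of fixed templates could be formed as in the short Python sketch below, where the array layout `(K, M, P, L)` for K templates, M microphones, P loudspeakers and L taps, and the name `combine_templates`, are assumptions for illustration.

```python
import numpy as np


def combine_templates(templates, template_weights):
    """Form one MIMO FIR filter as a weighted sum of fixed templates.

    templates:        (K, M, P, L) pre-defined filter coefficients.
    template_weights: (K,) adjustable template weighting coefficients.
    Returns the combined filter of shape (M, P, L).
    """
    templates = np.asarray(templates, dtype=float)
    template_weights = np.asarray(template_weights, dtype=float)
    # Contract the template axis; the filter coefficients themselves stay fixed.
    return np.tensordot(template_weights, templates, axes=1)
```

In this form only K scalar weights need to be adjusted instead of all M x P x L filter coefficients, which is consistent with the stated aim of keeping the computational load low.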

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust the plurality of adjustable template weighting coefficients on the basis of the plurality of microphone signals and the plurality of estimated echo signals. In other words, in an implementation form the processing circuitry is configured to adjust, e.g. optimize the plurality of adjustable weighting coefficients using the microphone signals as reference signals.

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust the plurality of adjustable template weighting coefficients on the basis of the plurality of estimated direct echo signals and a plurality of direct path reference signals, wherein the processing circuitry is configured to determine the plurality of direct path reference signals on the basis of the plurality of microphone signals, the one or more selected loudspeaker signals of the plurality of loudspeaker signals and the adaptive MIMO FIR filter.

In a further possible implementation form of the first aspect, the processing circuitry is further configured to adjust the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter on the basis of the plurality of estimated echo signals and a plurality of residual reference signals, wherein the processing circuitry is configured to determine each of the plurality of residual reference signals as a difference between the respective microphone signal and the estimated direct echo signal.
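The two kinds of reference signals can be contrasted in a small Python sketch. The residual reference follows the stated definition (microphone signal minus estimated direct echo). The direct path reference is computed here as the microphone signal minus the residual echo estimate produced by the adaptive MIMO FIR filter, which is an assumption, since the text names only the quantities it depends on. The function name `reference_signals` is hypothetical.

```python
import numpy as np


def reference_signals(microphones, direct_echo, residual_echo):
    """Split the microphone signals into per-branch adaptation targets.

    All arguments are (M, N) arrays for M microphones and N samples.
    """
    microphones = np.asarray(microphones, dtype=float)
    # Stated in the text: microphone signal minus estimated direct echo.
    residual_reference = microphones - np.asarray(direct_echo, dtype=float)
    # Assumption: microphone signal minus the adaptive filter's residual echo estimate.
    direct_path_reference = microphones - np.asarray(residual_echo, dtype=float)
    return residual_reference, direct_path_reference
```

Under this reading, each branch would be adapted against what the other branch leaves unexplained, rather than against the full microphone signal.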

According to a second aspect an acoustic device, such as a smart speaker device, is disclosed. The acoustic device comprises an acoustic processing device for multichannel nonlinear acoustic echo cancellation according to the first aspect. Moreover, the acoustic device comprises a plurality of loudspeakers, wherein each loudspeaker is configured to be driven by one of the plurality of loudspeaker signals. The acoustic device further comprises a plurality of microphones, wherein each microphone is configured to detect one of the plurality of microphone signals.

In a further possible implementation form of the second aspect, the acoustic device further comprises a mixing unit configured to generate the plurality of loudspeaker signals on the basis of an input signal, in particular a stereo input signal.

According to a third aspect an acoustic processing method for multichannel nonlinear acoustic echo cancellation is provided. The acoustic processing method comprises the steps of: receiving a plurality of loudspeaker signals and a plurality of microphone signals; applying to each loudspeaker signal a respective pre-processing filter, in particular a memoryless pre-processing filter for obtaining a respective pre-processed loudspeaker signal, wherein each pre-processing filter is based on a linear combination of a plurality of pre-defined basis functions, wherein each of the pre-defined basis functions is weighted by an adjustable pre-processing weighting coefficient; determining for each microphone signal an estimated echo signal on the basis of the plurality of pre-processed loudspeaker signals; and determining for each microphone signal a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

The acoustic processing method according to the third aspect of the present disclosure can be performed by the acoustic processing device according to the first aspect of the present disclosure. Thus, further features of the method according to the third aspect of the present disclosure result directly from the functionality of the acoustic processing device according to the first aspect of the present disclosure as well as its different implementation forms and embodiments described above and below.

According to a fourth aspect, a computer program product comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the acoustic processing method according to the third aspect, when the program code is executed by the computer or the processor, is provided.

Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:

Fig. 1 is a schematic diagram illustrating a smart speaker device with a plurality of loudspeakers and a plurality of microphones;

Fig. 2a is a schematic diagram illustrating an acoustic processing device for multichannel nonlinear acoustic echo cancellation according to an embodiment;

Fig. 2b is a schematic diagram illustrating further aspects of an acoustic processing device for multichannel nonlinear acoustic echo cancellation according to an embodiment;

Fig. 2c is a schematic diagram illustrating further aspects of an acoustic processing device for multichannel nonlinear acoustic echo cancellation according to an embodiment;

Figs. 3a and 3b are schematic diagrams illustrating a multichannel smart speaker device according to an embodiment;

Figs. 4a and 4b are schematic diagrams illustrating a multichannel smart speaker device according to an embodiment in different processing stages;

Fig. 5 is a flow diagram illustrating an acoustic processing method for multichannel nonlinear acoustic echo cancellation according to an embodiment;

Figs. 6a and 6b are schematic diagrams illustrating a top view and a bottom view of a multichannel smart speaker device according to an embodiment including an acoustic processing device according to an embodiment;

Fig. 7 shows graphs illustrating the nonlinear AEC performance of the multichannel smart speaker device of figures 6a and 6b; and

Fig. 8 shows a table illustrating the nonlinear AEC performance of the multichannel smart speaker device of figures 6a and 6b.

In the following, identical reference signs refer to identical or at least functionally equivalent features.

DETAILED DESCRIPTION OF THE EMBODIMENTS

In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.

For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.

Figure 2a is a schematic diagram illustrating an acoustic processing device 110 for multichannel, i.e. MIMO, nonlinear AEC according to an embodiment. As will be described in more detail below, the acoustic processing device 110 comprises a first signal reception unit 130 adapted to receive a plurality of loudspeaker signals x(n) and a second signal reception unit 140 adapted to receive a plurality of microphone signals y(n). Moreover, the acoustic processing device 110 comprises a processing circuitry 120 adapted to enable nonlinear echo reduction, in particular nonlinear acoustic echo cancellation (NLAEC). The processing circuitry 120 of the acoustic processing device 110 may be implemented in hardware and/or software. The hardware may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors. The acoustic processing device 110 may further comprise a non-transitory memory configured to store data and executable program code which, when executed by the processing circuitry 120, causes the acoustic processing device 110 to perform the functions, operations and methods described herein.

As will be described in more detail below in the context of figures 2b and 2c, the processing circuitry 120 of the acoustic processing device 110 is configured to perform nonlinear echo reduction, in particular nonlinear acoustic echo cancellation (NLAEC), by determining for each of the plurality of microphone signals y(n) an estimated echo signal $\hat{y}(n)$ and by determining a respective echo reduced microphone signal e(n) based on the respective microphone signal of the plurality of microphone signals y(n) and the respective estimated echo signal $\hat{y}(n)$. In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to determine for each of the plurality of microphone signals y(n) the respective echo reduced microphone signal e(n) as a difference between the respective microphone signal y(n) and the respective estimated echo signal $\hat{y}(n)$. As illustrated in figures 2b and 2c, in an embodiment, the estimated echo signal $\hat{y}(n)$ comprises, i.e. is a combination of, an estimated direct echo signal $\hat{y}_{dir}(n)$ and an estimated residual or complementary echo signal $\hat{y}_{com}(n)$. Moreover, the processing circuitry 120 may be configured to implement in hardware and/or software a processing block 128 for determining the difference between the respective microphone signal y(n) and the respective estimated echo signal, i.e. $e(n) = y(n) - \hat{y}(n)$.

The acoustic processing device 110 may be a component of an acoustic device, such as the smart speaker device 100 shown in figure 1, further comprising the plurality of loudspeakers 101a-h, wherein each loudspeaker 101a-h is configured to be driven by one of the plurality of loudspeaker signals x(n), and the plurality of microphones 103a-g, wherein each microphone 103a-g is configured to detect one of the plurality of microphone signals y(n). In an embodiment, the acoustic device 100 may further comprise a mixing unit, such as the mixing unit 150 shown in figure 2b, configured to generate the plurality of loudspeaker signals x(n) on the basis of an input signal, in particular a stereo input signal s(n).

As will be described in more detail below and as illustrated in figure 2c, in an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to apply to each loudspeaker signal x(n) a respective pre-processing filter 122a-g for filtering each loudspeaker signal x(n) in order to obtain a respective pre-processed loudspeaker signal d(n), wherein each pre-processing filter 122a-g is based on a linear combination of a plurality of pre-defined basis functions, wherein each of the pre-defined basis functions is weighted by an adjustable pre-processing weighting coefficient.

In the following, an embodiment of the acoustic processing device 110 for multichannel nonlinear AEC as a component of the acoustic device 100 will be described in more detail, wherein the acoustic device 100 comprises Q microphones 103a-g and P loudspeakers 101a-h. As already described above, the processing circuitry 120 of the acoustic processing device 110 is configured to adjust the set of nonlinear preprocessors (or preprocessing filters) $f_1(\cdot), \ldots, f_P(\cdot)$ 122a-g shown in figure 2c to approximate the nonlinearities in the direct echoes recorded by the plurality of microphone signals y(n) via the acoustic path 160. In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to determine, in a first stage, the individual microphone signals y(n), and, in a second stage, to adjust the memoryless preprocessors $f_1(\cdot), \ldots, f_P(\cdot)$ 122a-g.

As illustrated in figure 2c, the nonlinearly distorted output of the p-th preprocessor, i.e. preprocessing filter 122a-g, may be described as

$$d_p(n) = f_p\big(x_p(n), \mathbf{a}_p\big), \tag{1}$$

wherein $x_p(n)$ is the p-th loudspeaker signal, $f_p(\cdot)$ is the nonlinear function realized by the memoryless preprocessor 122a-g, and $\mathbf{a}_p = [a_{p,1}, \ldots, a_{p,L_a}]^T$ denotes a parameter vector that completely characterizes the p-th preprocessor 122a-g. In an embodiment, the processing circuitry 120 implements the nonlinear function realized by the memoryless preprocessor 122a-g as a weighted sum of $L_a$ pre-defined basis functions (or basis filters) in the following way:

$$f_p\big(x_p(n), \mathbf{a}_p\big) = \sum_{j=1}^{L_a} a_{p,j}\, g_j\big(x_p(n)\big), \tag{2}$$

wherein $g_j(\cdot)$ is the j-th basis function, $a_{p,j}$ denotes the adjustable pre-processing weighting coefficient of the j-th basis function and $L_a$ denotes an adjustable parameter defining the total number of basis functions (or basis filters) used for describing the nonlinear function realized by the memoryless preprocessor 122a-g. In an embodiment, the plurality of pre-defined basis functions may comprise, for instance, a plurality of Legendre polynomials, power filters, Fourier series, diagonal Volterra kernels or neural networks. The plurality of pre-defined basis functions should be chosen to allow a sufficiently good approximation of any nonlinearity that might be encountered by simply adapting the weights. As will be appreciated, by using more basis functions (i.e. by increasing the value of the adjustable parameter $L_a$), the respective memoryless preprocessor 122a-g can better model any nonlinearities. However, increasing the number of basis functions too much may lead to an undesired increase of the computational costs. In an embodiment, the adjustable parameter $L_a$ defining the total number of basis functions used for approximating the nonlinearity may have a value in the range of 2 to 5, in particular 3. In an embodiment, the processing circuitry 120 may be configured to use different values of the adjustable parameter $L_a$ for different preprocessors 122a-g.
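The weighted-sum structure of the memoryless preprocessor can be illustrated with a minimal NumPy sketch. It assumes odd-order Legendre polynomials as basis functions and input samples normalized to the range [-1, 1]; the function names `legendre_basis` and `preprocess` and the example weights are hypothetical and chosen for illustration only.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_basis(x, orders=(1, 3, 5)):
    """Evaluate Legendre polynomials of the given orders at x (|x| <= 1)."""
    out = []
    for k in orders:
        c = np.zeros(k + 1)
        c[k] = 1.0                      # coefficient vector selecting P_k only
        out.append(legendre.legval(x, c))
    return np.stack(out)                # shape (L_a, len(x))

def preprocess(x, a, orders=(1, 3, 5)):
    """d(n) = sum_j a[j] * g_j(x(n)), evaluated sample by sample (memoryless)."""
    return a @ legendre_basis(x, orders)

x = np.linspace(-1.0, 1.0, 5)           # toy loudspeaker samples
a_linear = np.array([1.0, 0.0, 0.0])    # only P_1(x) = x is active
a_soft = np.array([1.0, -0.2, 0.0])     # hypothetical weights: mild compression
d_linear = preprocess(x, a_linear)
d_soft = preprocess(x, a_soft)
```

With `a_linear` the preprocessor passes the signal unchanged, which corresponds to a purely linear loudspeaker; any nonzero higher-order weight bends the characteristic without introducing memory.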

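In an embodiment, the processing circuitry 120 of the acoustic processing device 110 may arrange the nonlinearly distorted signal in an $L_{dir}$-samples-long frame

$$\mathbf{d}_p(m) = [d_p(mL_{dir} - L_{dir} + 1), \ldots, d_p(mL_{dir})]^T, \tag{3}$$

with m denoting the frame index. For the following description of the nonlinear filtering, a vector is introduced which stacks the current and the previous frame of the quantity defined by Eq. (3) as follows:

$$\underline{\mathbf{d}}_p(m) = [\mathbf{d}_p^T(m-1), \mathbf{d}_p^T(m)]^T. \tag{4}$$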

In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to obtain a frequency-domain representation of the stacked vector $\underline{\mathbf{d}}_p(m)$ of the previous and current frames using the $L_{block} \times L_{block}$ Fourier transform matrix $\mathbf{F}_{L_{block}}$,

$$\underline{\mathbf{D}}_p(m) = \mathbf{F}_{L_{block}}\, \underline{\mathbf{d}}_p(m), \tag{5}$$

wherein a block may have a block length of $L_{block} = 2L_{dir}$. For each of the Q x P direct acoustic echo paths $\mathbf{h}_{q,p}$, the processing circuitry 120 of the acoustic processing device 110 may be configured to obtain a frequency-domain representation by

$$\mathbf{H}_{q,p} = \mathbf{F}_{L_{block}}\, [\mathbf{h}_{q,p}^T, \mathbf{0}_{L_{dir}}^T]^T, \tag{6}$$

with $\mathbf{0}_{L_{dir}}$ denoting an $L_{dir}$-length zero vector. Consequently, for each microphone signal the processing circuitry 120 of the acoustic processing device 110 may be configured to determine the estimated signal of the q-th microphone for frame m as

$$\hat{\mathbf{Y}}_{dir,q}(m) = \sum_{p=1}^{P} \underline{\mathbf{D}}_p(m) \odot \mathbf{H}_{q,p}, \tag{7}$$

where $\odot$ denotes the element-wise multiplication operation. From Eq. (7), it can be seen that each estimated microphone signal utilizes the outputs of all nonlinear preprocessors which, in turn, are obtained using all loudspeaker signals x(n). In other words, each estimated microphone signal may be considered as the output of a MISO system which has all loudspeaker signals x(n) as inputs.

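Finally, as illustrated in figure 3a, the processing circuitry 120 of the acoustic processing device 110 may be configured to determine the respective time-domain microphone signal estimate from the frequency-domain estimate $\hat{\mathbf{Y}}_{dir,q}(m)$ of Eq. (7) by

$$\hat{\mathbf{y}}_{dir,q}(m) = \mathbf{W}_{01}\, \mathbf{F}_{L_{block}}^{-1}\, \hat{\mathbf{Y}}_{dir,q}(m), \tag{8}$$

wherein $\mathbf{F}_{L_{block}}^{-1}$ denotes the inverse Fourier transform matrix and $\mathbf{W}_{01}$ denotes an $L_{dir} \times L_{block}$ windowing matrix that extracts the last $L_{dir}$ samples from the multiplied vector.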
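The block-wise frequency-domain filtering described above follows the overlap-save principle and may be sketched as follows for a single microphone. The sketch assumes real-valued signals, direct paths of exactly L_dir taps and a signal length that is a multiple of L_dir; the function name `miso_overlap_save` and the random test data are hypothetical.

```python
import numpy as np

def miso_overlap_save(d, h, L_dir):
    """Filter P pre-processed signals d[p] with L_dir-tap paths h[p] and sum them.

    d: (P, N) with N a multiple of L_dir; h: (P, L_dir). Returns (N,) samples.
    """
    P, N = d.shape
    # Zero-padded echo paths in the DFT domain (block length 2 * L_dir).
    H = np.fft.fft(np.concatenate([h, np.zeros((P, L_dir))], axis=1), axis=1)
    prev = np.zeros((P, L_dir))                  # frame before the first one
    out = np.empty(N)
    for m in range(N // L_dir):
        cur = d[:, m * L_dir:(m + 1) * L_dir]
        # Stack previous and current frame, then transform.
        D = np.fft.fft(np.concatenate([prev, cur], axis=1), axis=1)
        Y = np.sum(D * H, axis=0)                # element-wise product, sum over p
        y_block = np.fft.ifft(Y).real            # back to the time domain
        out[m * L_dir:(m + 1) * L_dir] = y_block[L_dir:]   # keep last L_dir samples
        prev = cur
    return out

rng = np.random.default_rng(0)
P, L_dir, N = 3, 4, 16
d = rng.standard_normal((P, N))
h = rng.standard_normal((P, L_dir))
y_fast = miso_overlap_save(d, h, L_dir)
# Reference: plain time-domain convolution, summed over all loudspeakers.
y_ref = sum(np.convolve(d[p], h[p])[:N] for p in range(P))
```

Discarding the first half of each inverse transform removes the samples affected by circular wrap-around, so `y_fast` agrees with the time-domain reference `y_ref`.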

In order for the p-th preprocessor 122a-g to match the nonlinearity of the p-th loudspeaker 101a-h (and/or an amplifier thereof), the processing circuitry 120 of the acoustic processing device 110 may be configured to adjust, in particular optimize, the parameter vector $\mathbf{a}_p$ with respect to the direct echo path error signals for a subset M of the plurality of microphones 103a-g. As shown in figure 3b by way of example for the microphone 103c, for each microphone 103a-g from the subset, i.e. $q \in M$, the direct echo path error signal may be defined as

$$e_{dir,q}(n) = y_{dir,q}(n) - \hat{y}_{dir,q}(n), \tag{9}$$

where $y_{dir,q}(n)$ denotes the reference direct echo path signal, obtained by

$$y_{dir,q}(n) = y_q(n) - \hat{y}_{com,q}(n), \tag{10}$$

with $y_q(n)$ denoting the q-th microphone signal and $\hat{y}_{com,q}(n)$ denoting the q-th microphone's complementary echo path signal. In an embodiment, the processing circuitry 120 of the acoustic processing device 110 may be configured to implement the "Adaptive MIMO AEC" processing block 127 illustrated in figures 2b and 2c for determining the q-th microphone's complementary echo path signal.

In an embodiment, the processing circuitry 120 of the acoustic processing device 110 may be configured to implement an iterative gradient-based adjustment, in particular optimization, procedure to update the parameter vector $\mathbf{a}_p$ according to Eq. (11), using update terms calculated with respect to each microphone 103a-g in the subset, i.e. $q \in M$, where $\boldsymbol{\mu}_{NL} = [\mu_{NL,1}, \ldots, \mu_{NL,L_a}]^T$ denotes a step size vector with different step sizes per basis function. In an embodiment, for each microphone q 103a-g, the update term is derived to minimize the squared error cost function

$$J(\mathbf{e}_{dir,q}(m)) = \mathbf{e}_{dir,q}^T(m)\, \mathbf{e}_{dir,q}(m), \tag{12}$$

where $\mathbf{e}_{dir,q}(m) = [e_{dir,q}(mL_{dir} - L_{dir} + 1), \ldots, e_{dir,q}(mL_{dir})]^T$. It is important to note here that even when adapting one nonlinear preprocessor 122a-g using one microphone signal, the adaptation is based on a cost function, such as the cost function defined in Eq. (12), that relies on an error signal of a MISO system, as will be appreciated from Eqs. (9) and (7). Taking the gradient of Eq. (12) yields the NLMS (Normalized Least Mean Squares)-inspired update of Eq. (13), which comprises an all-ones column vector $\mathbf{1}$ and a regularization term to avoid division by zero. The update term for the j-th basis function of the p-th preprocessor 122a-g, obtained using the error signal of the q-th microphone 103a-g, is given by Eq. (14).

Here, the diag(·) operation extracts the diagonal elements of a matrix, and the update term is formed from a stacked vector of the j-th basis function of the p-th preprocessor 122a-g.

As will be appreciated from Eq. (14), in order to be able to adapt one nonlinear preprocessor $f_p(\cdot)$ 122a-g using one microphone signal $y_{dir,q}(n)$, an error signal $e_{dir,q}(n)$ of a MISO system, which uses all loudspeaker signals $x_1(n), \ldots, x_P(n)$ as inputs and one microphone signal $y_{dir,q}(n)$ as output, is needed. Hence, the nonlinearities in the MIMO system are identified using gradients that are based on the error signals defined in Eq. (9) describing the difference between the observed and the approximated MISO systems in Eq. (7).
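The dependence of each preprocessor update on a MISO error signal can be illustrated with a simplified sketch. Note that this is a time-domain, single-microphone stand-in and not the frequency-domain update of Eqs. (13) and (14): it assumes known direct paths, two Legendre basis functions per preprocessor, a common step size `mu`, and repeated updates on one fixed block of data. All function names and numerical values are hypothetical.

```python
import numpy as np

def basis(x):
    """Two odd-order Legendre basis functions: P_1(x) and P_3(x)."""
    return np.stack([x, 0.5 * (5 * x**3 - 3 * x)])

def filtered_basis(x, h):
    """u[p, j] = h[p] * g_j(x[p]): basis outputs passed through the direct paths."""
    N = x.shape[1]
    return np.array([[np.convolve(g, h[p])[:N] for g in basis(x[p])]
                     for p in range(x.shape[0])])       # shape (P, L_a, N)

def update(a, u, y_dir, mu=0.3, delta=1e-8):
    """One normalized gradient step on J = e^T e for all preprocessors."""
    e = y_dir - np.einsum('pj,pjn->n', a, u)            # MISO error: all p contribute
    grad = u @ e                                         # correlation of u[p, j] with e
    power = np.sum(u * u, axis=2) + delta                # regularized normalization
    return a + mu * grad / power, float(e @ e)

rng = np.random.default_rng(1)
P, N = 2, 400
x = rng.uniform(-1.0, 1.0, size=(P, N))                  # loudspeaker signals
h = rng.standard_normal((P, 8))                          # assumed known direct paths
a_true = np.array([[1.0, -0.3], [1.0, 0.2]])             # hypothetical nonlinearities
u = filtered_basis(x, h)
y_dir = np.einsum('pj,pjn->n', a_true, u)                # reference direct echo
a = np.tile([1.0, 0.0], (P, 1))                          # start from the linear model
errors = []
for _ in range(50):
    a, err = update(a, u, y_dir)
    errors.append(err)                                   # error energy before the step
```

Even when only one row of `a` is of interest, the error `e` is computed from the contributions of all loudspeakers, which mirrors the MISO dependence noted above. The regularization `delta` plays the role of the term that avoids division by zero.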

In an embodiment, the set M of microphones used to adapt the parameter vector $\mathbf{a}_p$ as defined in Eq. (11) may be based on the microphones' proximity to the loudspeaker modeled by the p-th preprocessor 122a-g. An exemplary selection of the microphones is shown in figures 4a and 4b, where in figure 4b the nearest three microphones (by way of example, the microphones 103b, 103c and 103d) are used by the processing circuitry 120, if the adapted preprocessor 122a-g models the tweeter loudspeaker 101c. When the adapted preprocessor 122a-g models the subwoofer loudspeaker 101h, for the example shown in figures 4a and 4b, all microphones 103a-g are equally distant from the subwoofer loudspeaker 101h, and therefore, in this case the selection is arbitrary, i.e. the processing circuitry 120 of the acoustic processing device 110 may select any subset M of the plurality of microphones 103a-g.
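A proximity-based choice of the subset M may be sketched as follows. The circular two-dimensional geometry, the position of the tweeter and the function name `nearest_microphones` are assumptions made for illustration and do not reproduce the geometry of figures 4a and 4b.

```python
import numpy as np

def nearest_microphones(loudspeaker_pos, mic_pos, size):
    """Return the indices of the `size` microphones closest to the loudspeaker."""
    dist = np.linalg.norm(mic_pos - loudspeaker_pos, axis=1)
    return np.argsort(dist, kind="stable")[:size]

# Hypothetical geometry: 7 microphones on a unit circle, tweeter just outside
# the circle at the angle of microphone index 2.
angles = 2 * np.pi * np.arange(7) / 7
mic_pos = np.stack([np.cos(angles), np.sin(angles)], axis=1)
tweeter_pos = 1.1 * mic_pos[2]
subset = nearest_microphones(tweeter_pos, mic_pos, size=3)
```

For a loudspeaker at the center of the circle all distances coincide, so the returned subset carries no geometric preference, which matches the arbitrary selection described for the subwoofer.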

As already described above, in an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to determine the estimated echo signal as the sum of the estimated direct echo signal associated with a direct path and the estimated residual or complementary echo signal associated with a residual or complementary path (see processing block 129 shown in figure 2c).

As will be described in more detail below, in an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured for the direct path to determine the respective estimated direct echo signal on the basis of the plurality of pre-processed loudspeaker signals and one or more pre-defined MIMO FIR filter templates, wherein each MIMO FIR filter template comprises a plurality of pre-defined filter coefficients. In an embodiment, the one or more pre-defined MIMO FIR filter templates may comprise one or more room impulse responses (RIRs). In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to determine the respective estimated direct echo signal on the basis of the plurality of pre-processed loudspeaker signals and a linear combination of the plurality of pre-defined MIMO FIR filter templates, wherein each of the plurality of pre-defined MIMO FIR filter templates is weighted by an adjustable weighting coefficient, i.e. weight. As will be described in more detail below, in an embodiment, the processing circuitry 120 is configured to determine for the residual path the respective estimated residual or complementary echo signal on the basis of one or more selected loudspeaker signals of the plurality of loudspeaker signals and an adaptive MIMO FIR filter, wherein the adaptive MIMO FIR filter has a plurality of adaptive filter coefficients (see processing block 127 shown in figure 2c).

More specifically, as illustrated in figure 2c, the processing circuitry 120 of the acoustic processing device 110 may be configured to implement in hardware and/or software the processing block 129 for determining the estimated echo signal in the time domain as the sum of the estimated direct echo signal, i.e. the direct-path component $\hat{\mathbf{y}}_{dir}(n)$, and the estimated residual or complementary echo signal, i.e. the residual-path component $\hat{\mathbf{y}}_{com}(n)$. In an embodiment, the processing circuitry 120 is configured to implement in hardware and/or software a processing block 123 for computing the estimated direct echo signal, i.e. the direct-path component, by means of a MIMO transmission matrix modelling P · Q FIR filters of length L between each of the P pre-processed loudspeaker signals (i.e. input signals) and the Q estimated direct echo signals (i.e. output signals) captured in the vector $\hat{\mathbf{y}}_{dir}(n)$.

As already described above, in an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to implement the direct path AEC processing block 121 on the basis of a weighted sum of K (with K > 1) broadband MIMO FIR filter templates which, in an embodiment, may be extracted from a set of a priori measured room impulse responses (RIRs). The one or more MIMO FIR filter templates $\mathbf{H}_k$ can be considered as dictionary elements of a dictionary-based model for the direct path. As will be appreciated, because the templates $\mathbf{H}_k$ model the direct path, usually only the first $L_{dir} < L$ taps of the templates are non-zero. By employing the vectorization operator, which maps each template to a vector of dimension R = P · L · Q, the MIMO direct path transmission matrix in Eq. (22) can be written as a matrix-vector multiplication of the time-invariant dictionary matrix $\mathbf{H}_{dir} \in \mathbb{R}^{R \times K}$ and the time-varying weighting vector $\mathbf{w} \in \mathbb{R}^{K}$ (defining the plurality of adjustable weighting coefficients, i.e. weights).

The columns of the dictionary matrix $\mathbf{H}_{dir}$ contain the vectorized templates $\mathbf{H}_k$. Thus, the estimated direct echo signal, i.e. the direct path component, may be written in terms of an input signal matrix, wherein $\otimes$ denotes the Kronecker product and $\mathbf{I}_Q \in \mathbb{R}^{Q \times Q}$ denotes the identity matrix of dimension Q.

Note that

$$\mathbf{D}^T(n) = \big(\mathbf{d}_1(n) \;\; \ldots \;\; \mathbf{d}_K(n)\big) \in \mathbb{R}^{Q \times K} \tag{29}$$

describes the linear convolution of the input signals with the filter templates. In an embodiment, the processing circuitry 120 of the acoustic processing device 110 may be configured to compute these quantities by processing blocks of input signals and employing an overlap-save structure. The Discrete Fourier Transform (DFT) of the templates, i.e. dictionary elements, for the overlap-save processing may be computed by the processing circuitry 120 of the acoustic processing device 110 in advance and stored in a memory of the acoustic processing device 110.
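Because the direct-path model is linear in the weights, filtering with the weighted sum of templates gives the same result as weighting the outputs of the individual templates. The following sketch checks this equivalence numerically; the random templates stand in for measured ones, and the array layout (K, Q, P, L) as well as the function name `mimo_filter` are assumptions for illustration.

```python
import numpy as np

def mimo_filter(H, d):
    """Apply a (Q, P, L) MIMO FIR filter to P input signals d of shape (P, N)."""
    Q, P, _ = H.shape
    N = d.shape[1]
    return np.array([sum(np.convolve(d[p], H[q, p])[:N] for p in range(P))
                     for q in range(Q)])                   # shape (Q, N)

rng = np.random.default_rng(2)
K, Q, P, L, N = 3, 2, 2, 6, 32
templates = rng.standard_normal((K, Q, P, L))   # stand-ins for measured templates
w = np.array([0.5, 0.3, 0.2])                   # hypothetical weights
d = rng.standard_normal((P, N))                 # pre-processed loudspeaker signals

# Route 1: combine the templates first, then filter once.
H_dir = np.tensordot(w, templates, axes=1)      # weighted sum, shape (Q, P, L)
y_combined = mimo_filter(H_dir, d)

# Route 2: filter with every template, then weight the K template outputs.
D = np.array([mimo_filter(templates[k], d) for k in range(K)])   # (K, Q, N)
y_weighted = np.tensordot(w, D, axes=1)
```

The second route is the one that turns the weight estimation into a linear regression: the template outputs `D` depend only on the input signals, so only the K weights remain to be determined.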

In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is further configured to implement the acoustic residual, i.e. complementary, path AEC processing block 127 on the basis of a MIMO transmission matrix which models the system from the input signal s(n) to the reference signal $y_{com}(n)$. As already described above, in an embodiment, the transmission matrix defines an adaptive MIMO FIR filter with a plurality of adjustable filter coefficients. The input signal, i.e. the one or more selected loudspeaker signals s(n), may be provided by a signal selection processing block and can be configured according to the specific kind of rendering algorithm implemented by a mixing unit, such as the mixing unit 150.

As Eq. (27) denotes a linear regression model with the template outputs D(n) as covariates, i.e. input of the estimator, and the estimated direct echo signal as response variable, i.e. desired signal, in an embodiment the processing circuitry 120 of the acoustic processing device 110 may be configured to determine or adjust, i.e. optimize, the weighting coefficients, i.e. the weights of the linear combination of MIMO FIR filter templates, on the basis of a block-recursive least squares algorithm. In such an embodiment, the processing circuitry 120 of the acoustic processing device 110 may be configured to use a block estimated direct echo signal which captures L samples in one block indexed by m. In an embodiment, the processing circuitry 120 of the acoustic processing device 110 may be configured to compute the block on the basis of an efficient overlap-save structure, as already described above. The respective linear regression problem is then to determine the weight vector from the linear convolution estimates of the templates.

In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to iteratively estimate the weighting coefficients of the linear combination of MIMO FIR filter templates based on the error signal between the estimated direct echo signal of Eq. (18) and the observed one, with $\mathbf{y}_{dir}(mL) \in \mathbb{R}^{Q}$ denoting the mL-th sample of the reference signal. In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to determine the weighting coefficients, i.e. weights of the linear combination of MIMO FIR filter templates, by minimizing a block-recursive least-squares cost function.

In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to determine the solution of Eq. (36) using the sample auto-correlation matrix $\mathbf{R}_{DD}(m)$ and a sample cross-correlation matrix.

To avoid numerical instability during the inversion, the processing circuitry 120 of the acoustic processing device 110 may determine the weighting coefficients of the linear combination of MIMO FIR filter templates by a Tikhonov-regularized sample auto-correlation matrix with a regularization factor $\delta$. Due to the full rank update of the sample correlation matrices, the processing circuitry 120 of the acoustic processing device 110 may refrain from using the matrix inversion lemma for directly updating the inverse. If the number of templates K is low in comparison to the filter length L (which may often be the case), the complexity of inverting the sample auto-correlation matrix $\mathbf{R}_{DD}(m)$ is negligible in comparison to the computation of the linear convolution by overlap-save processing.
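A regularized block-recursive least-squares update of the template weights may be sketched as follows. The exponential forgetting factor `lam`, the value of the regularization factor `delta`, the stacking of all observations of a block into the rows of `D_blk`, and the function name `block_rls_step` are assumptions for illustration; the noise-free toy data only serve to show that the weights are recovered.

```python
import numpy as np

def block_rls_step(R, r, D_blk, y_blk, lam=0.99, delta=1e-6):
    """Update the correlation statistics with one block and solve for the weights.

    D_blk: (n_obs, K) template outputs, y_blk: (n_obs,) reference samples.
    """
    R = lam * R + D_blk.T @ D_blk                  # sample auto-correlation matrix
    r = lam * r + D_blk.T @ y_blk                  # sample cross-correlation vector
    # Tikhonov regularization keeps the K x K system well conditioned.
    w = np.linalg.solve(R + delta * np.eye(R.shape[0]), r)
    return R, r, w

rng = np.random.default_rng(3)
K, n_obs = 3, 64
w_true = np.array([0.7, 0.2, 0.1])                 # hypothetical true weights
R, r = np.zeros((K, K)), np.zeros(K)
for m in range(5):                                 # five blocks of observations
    D_blk = rng.standard_normal((n_obs, K))
    y_blk = D_blk @ w_true
    R, r, w = block_rls_step(R, r, D_blk, y_blk)
```

Since the system to be solved is only K x K, a direct solve per block is inexpensive when K is small, in line with the remark on the negligible cost of the inversion.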

In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to determine the error signal of the residual, i.e. complementary, path. In an embodiment, the processing circuitry 120 of the acoustic processing device 110 may be configured to optimize, i.e. minimize, the error signal of the residual, i.e. complementary, path (for determining the plurality of adaptive filter coefficients) using one of a plurality of known optimization algorithms. For instance, in an embodiment, due to the superior convergence properties and the efficient block-processing, the processing circuitry 120 of the acoustic processing device 110 may be configured to optimize, i.e. minimize, the error signal of the residual path on the basis of the GFDAF optimization algorithm disclosed in H. Buchner et al., “Generalized multichannel Frequency-Domain Adaptive Filtering: efficient realization and application to hands-free speech communication”, Signal Processing, 2005 and M. Schneider and W. Kellermann, “The Generalized Frequency-Domain Adaptive Filtering algorithm as an approximation of the block recursive least-squares algorithm”, EURASIP Journal on Advances in Signal Processing, 2016.

As described above, in an embodiment the processing circuitry 120 of the acoustic processing device 110 may be configured to adjust, i.e. optimize, for the direct path the plurality of adjustable weighting coefficients, i.e. weights of the linear combination of MIMO FIR filter templates, and for the residual path the plurality of adaptive filter coefficients on the basis of one or more of the optimization algorithms described above, such as a block recursive least-squares regression algorithm or the GFDAF algorithm. In the following, further embodiments of the acoustic processing device 110 will be described, where the processing circuitry 120 is further configured to determine one or more reference signals for the optimization algorithms described above.

In an embodiment, the processing circuitry 120 is configured to adjust the plurality of adjustable weighting coefficients, i.e. the weights of the weighted linear combination of the predefined MIMO FIR filter templates, using the microphone signals as reference signals, i.e. on the basis of the plurality of microphone signals y(n) and the plurality of estimated echo signals $\hat{y}(n)$. This choice for the reference signal, however, in some scenarios might lead to a biased solution of the model parameters, i.e. the weights w(m) and the plurality of adaptive filter coefficients of $\mathbf{H}_{res}(m)$, as both the direct and the residual path aim at explaining the same data.

To address this issue, in an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to employ as the direct-path reference signal the error of the residual path windowed by a block-selection matrix P.

Note that the block-selection matrix P sets the first L dir samples of the residual, i.e. complementary path for each loudspeaker to zero. In other words, the processing circuitry 120 may be configured to employ the error of the windowed residual path as reference for the direct-path adaptation to reduce the effect of competing adaptive filters.
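The effect of zeroing the first L_dir taps of the residual path can be illustrated for a single microphone with a short sketch. It assumes that the residual-path filter estimate is available as an array of taps per input signal; the function names `window_residual` and `direct_path_reference` and the toy data are hypothetical.

```python
import numpy as np

def window_residual(h_res, L_dir):
    """Zero the first L_dir taps of every residual-path filter (last axis = taps)."""
    h_win = h_res.copy()
    h_win[..., :L_dir] = 0.0
    return h_win

def direct_path_reference(y, s, h_res, L_dir):
    """Microphone signal minus the output of the windowed residual path."""
    N = y.shape[0]
    h_win = window_residual(h_res, L_dir)
    y_late = sum(np.convolve(s[i], h_win[i])[:N] for i in range(s.shape[0]))
    return y - y_late

rng = np.random.default_rng(4)
L_dir, L, N = 4, 12, 64
s = rng.standard_normal((2, N))                    # two input signals
h_true = rng.standard_normal((2, L))               # full echo paths (toy data)
y = sum(np.convolve(s[i], h_true[i])[:N] for i in range(2))

# With a perfect residual-path estimate, only the early part remains.
y_dir_ref = direct_path_reference(y, s, h_true, L_dir)
h_early = h_true.copy()
h_early[:, L_dir:] = 0.0
y_early = sum(np.convolve(s[i], h_early[i])[:N] for i in range(2))
```

In this idealized case the reference `y_dir_ref` contains exactly the contribution of the first L_dir taps, so the direct-path model is not asked to explain the late part of the echo.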

For the residual, i.e. complementary, path the processing circuitry 120 may be configured to use as reference signal the error between the respective microphone signal y(n) and the respective estimated direct echo signal, i.e. the direct-path estimate $\hat{y}_{dir}(n)$.

In this case no windowing may be necessary, as the templates may include only $L_{dir}$ non-zero samples. Using the direct-path error as reference is motivated again by the desire to model non-competing filters.

In the embodiments described above, only the sample-wise reference signals have been described. As will be appreciated, however, the corresponding block quantities are defined by collecting L samples of the respective reference signals (as already described above).

In an embodiment, the processing circuitry 120 of the acoustic processing device 110 is configured to implement the following staged procedure for adjusting, i.e. optimizing the plurality of adjustable weighting coefficients, i.e. weights of the linear combination of MIMO FIR filter templates for the direct path and the plurality of adaptive filter coefficients for the residual path. In a first stage, the processing circuitry 120 of the acoustic processing device 110 is configured to determine the direct path reference signal defined by Eq. (42) above. As already described above, this operation may include a linear convolution, which can be efficiently implemented by an overlap-save structure in the DFT domain. In a second stage, the processing circuitry 120 of the acoustic processing device 110 is configured to update the direct path, i.e. adjust the plurality of adjustable weighting coefficients using Eq. (40) above. Thereafter, in a third stage, the processing circuitry 120 of the acoustic processing device 110 is configured to update the residual path, i.e. adjust the plurality of adaptive filter coefficients using as reference signal the error signal defined in Eq. (41) above. Thus, in an embodiment, the processing circuitry 120 is configured to first adjust, i.e. optimize the plurality of adjustable weighting coefficients of the weighted linear combination of the predefined MIMO FIR filter templates and, thereafter, adjust, i.e. optimize the plurality of adaptive filter coefficients of the adaptive MIMO FIR filter. In an embodiment, the processing circuitry 120 of the acoustic processing device 110 may be configured to perform this optimization procedure for each new block of observations.

Figure 5 is a flow diagram illustrating an acoustic processing method 500 for multichannel nonlinear acoustic echo cancellation according to an embodiment. The acoustic processing method 500 comprises the steps of: receiving 501 a plurality of loudspeaker signals and a plurality of microphone signals; applying 503 to each loudspeaker signal a respective pre-processing filter 122a-g, in particular a memoryless pre-processing filter 122a-g for obtaining a respective pre-processed loudspeaker signal, wherein each preprocessing filter 122a-g is a linear combination of a plurality of pre-defined basis functions, wherein each of the pre-defined basis functions is weighted by an adjustable preprocessing weighting coefficient; determining 505 for each microphone signal an estimated echo signal on the basis of the plurality of pre-processed loudspeaker signals; and determining 507 for each microphone signal a respective echo reduced microphone signal based on the respective microphone signal and the estimated echo signal.

In an embodiment, the acoustic processing method 500 may be performed by the acoustic processing device 110 described above. Thus, further features of the acoustic processing method 500 result directly from the functionality of the different embodiments of the acoustic processing device 110 described above.

Figures 6a and 6b are schematic diagrams illustrating a top view and a bottom view of an exemplary multichannel smart speaker device 100 according to an embodiment. The exemplary smart speaker device 100 shown in figures 6a and 6b comprises eight loudspeakers 101a-h, including seven tweeters 101a-g and one subwoofer 101h, as well as seven microphones 103a-g. As can be taken from figures 6a and 6b, the seven tweeters 101a-g are mounted equidistantly from the center of the smart speaker device 100, and each of the seven microphones 103a-g is mounted below one of the tweeters 101a-g. The subwoofer 101h is mounted at the center of the device. For testing the AEC performance of the exemplary smart speaker device 100 shown in figures 6a and 6b, the loudspeakers 101a-h have been chosen to exhibit only a small nonlinear distortion with an average total harmonic distortion of below 1%. In this case, the nonlinearities are strongly excited on rare occasions only, namely when the far-end signal amplitude is extremely high. This results in a very challenging nonlinear system identification task since the nonlinearities are not consistently excited. This constitutes a test of robustness for the processing scheme implemented by the acoustic processing device 110 for ensuring that the overall performance is not degraded in the absence of nonlinearities.

For testing the AEC performance of the exemplary smart speaker device 100 shown in figures 6a and 6b, three different source signals have been used, namely a female speech signal, a male speech signal, and a music signal of length 30 s each. The source signals are emitted by the different loudspeakers 101a-h in a typical living room environment with a reverberation time of $T_{60}$ = 550 ms at a Sound Pressure Level (SPL) of 96 dB, measured at a 10 cm distance from the closest tweeter 101a-g. For handling the nonlinearities of the seven tweeters 101a-g, seven preprocessors 122a-g are used in this example. Each of the preprocessors 122a-g uses the first three odd-order Legendre polynomials of the first kind as basis functions (i.e. in this example a value of 3 has been chosen for the adjustable parameter $L_a$ defining the number of the plurality of basis functions). The direct acoustic echo paths between the loudspeakers 101a-h and the microphones 103a-g were modeled by 60-samples-long direct acoustic echo paths measured in a low-reverberation chamber. For modelling the remaining echo paths, the "Adaptive MIMO AEC" processing block 127 of the processing circuitry 120 (shown in figures 2b and 2c) was implemented using adaptive Finite Impulse Response (FIR) filters with 1024 taps adapted by the Generalized Frequency Domain Adaptive Filtering (GFDAF) algorithm disclosed in M. Schneider and W. Kellermann, "The generalized frequency-domain adaptive filtering algorithm as an approximation of the block recursive least-squares algorithm", EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, Jan. 2016 or T. Haubner, M. Halimeh, A. Schmidt and W. Kellermann, "Robust nonlinear MIMO AEC WP 3", Technical Report, Friedrich-Alexander University Erlangen-Nurnberg, 2019.

To illustrate the benefits provided by embodiments disclosed herein the performance of the acoustic processing device 110 of the acoustic device 100 shown in figures 6a and 6b is compared with a purely linear variant of the processing architecture shown in figures 2b and 2c. Moreover, the scalability performance of embodiments disclosed herein is evaluated by varying the number of microphones 103a-g used for adapting each nonlinear preprocessor 122a-g. Finally, the Echo Return Loss Enhancement (ERLE) is used to measure the cancellation performance of embodiments disclosed herein. To measure the computational cost of each approach, the Real Time Factor (RTF) is used, as described in T. Haubner, M. Halimeh, A. Schmidt and W. Kellermann "Robust nonlinear MIMO AEC WP 3", Technical Report, Friedrich-Alexander University Erlangen-Nurnberg, 2019.
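The two performance measures may be computed along the following lines. The sketch uses the common definitions of ERLE as the ratio of microphone signal power to residual echo power in dB and of the RTF as processing time divided by signal duration; whether the cited technical report uses exactly these definitions is an assumption, and the function names and toy signals are hypothetical.

```python
import numpy as np

def erle_db(y, e, eps=1e-12):
    """ERLE in dB: microphone signal power over residual echo power."""
    return 10.0 * np.log10((np.mean(y**2) + eps) / (np.mean(e**2) + eps))

def real_time_factor(processing_seconds, signal_seconds):
    """RTF below 1 means the processing runs faster than real time."""
    return processing_seconds / signal_seconds

y = np.sin(2 * np.pi * 0.01 * np.arange(1000))   # toy microphone (echo) signal
e = 0.1 * y                                      # residual at one tenth the amplitude
erle = erle_db(y, e)
rtf = real_time_factor(15.0, 30.0)
```

A residual with one tenth of the echo amplitude corresponds to an ERLE of about 20 dB. In practice the ERLE is usually evaluated over short time windows and, as in the evaluation described here, averaged over the microphone channels.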

Figure 7 shows corresponding graphs illustrating the nonlinear AEC performance of the multichannel smart speaker device 100 of figures 6a and 6b. More specifically, figure 7 depicts the ERLE, averaged over microphone channels, in dB provided by the multichannel smart speaker device 100 of figures 6a and 6b with a varying number of microphones 103a-g (denoted GFDAF + DirPath + NL(# Mics)) in comparison with the purely linear variant of the processing architecture illustrated in figures 2b and 2c (denoted GFDAF + DirPath) for the female speech signal as a far-end signal. As can be taken from figure 7, the modeling of the nonlinearities in the echo paths as implemented by the multichannel smart speaker device 100 of figures 6a and 6b does indeed provide a benefit when compared to a purely linear system, with a consistent gain being observed. Furthermore, the multichannel smart speaker device 100 of figures 6a and 6b is particularly beneficial at instances where the distortions are strongly excited, e.g. at t ≈ 20 s, where a gain of up to approximately 6 dB is achieved. In addition, better cancellation is observed when increasing the number of microphone signals used for adapting each of the preprocessors 122a-g.

The numerical performance measures for the different far-end signals are summarized in the table shown in figure 8, namely, the average ERLE, the maximum ERLE difference compared to the purely linear variant, and the RTF. As can be taken from these results, while speech signals, both male and female, benefit from the modeling of the loudspeakers' nonlinearities, music does not. This may be due to its Gaussian nature, which results in a less peaky signal. Consequently, the music signal did not require peak levels as high as those of speech to achieve the targeted average SPL, and therefore did not excite the loudspeakers' nonlinearities. Nevertheless, even for the music signal, the multichannel nonlinear AEC implemented by the multichannel smart speaker device 110 of figures 6a and 6b did not result in any performance degradation despite the absence of the nonlinearities. In addition, it can be seen that the multichannel nonlinear AEC implemented by the multichannel smart speaker device 110 of figures 6a and 6b does indeed benefit from using more of the microphones 103a-g to estimate the nonlinearities. On the other hand, using more microphones results in a higher computational burden, as seen from the increasing RTFs, highlighting the controllable trade-off between performance and computational efficiency.
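
The relation between peakiness and the peak level needed for a given average level can be illustrated with the crest factor (peak-to-RMS ratio). The following is a minimal sketch using synthetic stand-in signals, which are assumptions and not the far-end signals of the evaluation; the function names are hypothetical.

```python
import math

def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def crest_factor_db(signal):
    """Peak-to-RMS ratio in dB; higher values mean a peakier signal."""
    return 20.0 * math.log10(max(abs(x) for x in signal) / rms(signal))

def peak_for_target_rms(signal, target_rms):
    """Peak amplitude after scaling the signal to a target RMS level."""
    return max(abs(x) for x in signal) * target_rms / rms(signal)

# Synthetic stand-ins (assumptions): a smooth tone and a sparse, peaky signal.
smooth = [math.sin(2 * math.pi * n / 100) for n in range(1000)]
peaky = [1.0 if n % 100 == 0 else 0.05 for n in range(1000)]

# At the same target RMS level, the peakier signal needs much higher peaks.
print(peak_for_target_rms(smooth, 0.1) < peak_for_target_rms(peaky, 0.1))
```

At an equal average level, the signal with the higher crest factor drives the loudspeaker to larger instantaneous amplitudes, which is where nonlinear distortion would typically be excited.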

Embodiments of the acoustic processing device 110 and the acoustic device 100 disclosed herein implement an efficient model which recognizes that the different nonlinearities observed by multiple microphones 103a-g may originate from the same loudspeakers 101a-h. The efficiency is achieved by limiting the number of adaptive parameters that characterize the nonlinearities, namely by sharing the same nonlinear preprocessors 122a-g across the different microphones 103a-g. By deriving gradient-based adjustment rules, i.e. optimization rules, in the frequency domain, embodiments disclosed herein are computationally highly efficient. Furthermore, the frequency-domain adjustment is not achieved at the expense of transforming the problem into a linear high-dimensional one, as is the case when extending Hammerstein Group Models (HGMs) to model a MIMO system.
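
The parameter sharing can be sketched as follows. This is a simplified time-domain forward model only, not the frequency-domain adaptation of the embodiments: each loudspeaker signal passes once through a memoryless preprocessor built as a weighted sum of basis functions, and every microphone reuses the result. The odd-power basis, the FIR echo paths and all names and values are assumptions for illustration.

```python
def preprocess(x, weights, basis):
    """Memoryless preprocessor: weighted sum of pre-defined basis functions."""
    return [sum(w * phi(s) for w, phi in zip(weights, basis)) for s in x]

def fir(x, h):
    """Causal FIR filtering of x with impulse response h."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def estimate_echoes(speaker_signals, weights, basis, echo_paths):
    """Echo estimate per microphone; echo_paths[m][l] is the assumed FIR
    path from loudspeaker l to microphone m."""
    # Each loudspeaker is pre-processed once; all microphones share the result.
    pre = [preprocess(x, w, basis) for x, w in zip(speaker_signals, weights)]
    estimates = []
    for paths_m in echo_paths:
        per_speaker = [fir(p, h) for p, h in zip(pre, paths_m)]
        estimates.append([sum(vals) for vals in zip(*per_speaker)])
    return estimates

# Hypothetical basis: odd powers x and x^3 (an assumption).
basis = [lambda s: s, lambda s: s ** 3]
speakers = [[0.5, -0.5, 1.0], [1.0, 0.0, -1.0]]   # 2 loudspeakers
weights = [[1.0, -0.2], [1.0, 0.0]]               # one weight set per loudspeaker
echo_paths = [[[1.0, 0.5], [0.0, 1.0]],           # microphone 0
              [[0.5], [0.5]],                     # microphone 1
              [[0.0, 1.0], [1.0]]]                # microphone 2
echoes = estimate_echoes(speakers, weights, basis, echo_paths)

shared = len(weights) * len(basis)                # nonlinear weights when shared
per_microphone = shared * len(echo_paths)         # count if each microphone had its own
print(shared, per_microphone)
```

In this toy setup the shared model has 4 nonlinear weights, whereas giving every microphone its own preprocessors would require 12, and the gap grows with the number of microphones.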

As a result of the adjustable selection of microphone signals used for adapting the nonlinear approximation of one or more selected loudspeakers 101a-h, embodiments of the acoustic processing device 110 and the acoustic device 100 disclosed herein allow incorporating prior knowledge on the loudspeaker/microphone spatial distribution for the multiple microphones 103a-g. Moreover, by controlling the number of microphones 103a-g used in the adaptation of the nonlinear preprocessors 122a-g, embodiments of the acoustic processing device 110 and the acoustic device 100 disclosed herein allow an efficient adaptation that matches the available computational resources.
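
One conceivable way to use such prior spatial knowledge is to adapt each preprocessor only from the microphones closest to its loudspeaker. The following sketch assumes this nearest-neighbor rule, which is not stated in the source; the positions, the function names and the `count` parameter are hypothetical.

```python
import math

def nearest_microphones(speaker_pos, mic_positions, count):
    """Indices of the `count` microphones closest to one loudspeaker."""
    ranked = sorted(range(len(mic_positions)),
                    key=lambda m: math.dist(speaker_pos, mic_positions[m]))
    return sorted(ranked[:count])

def selection_table(speaker_positions, mic_positions, count):
    """For each loudspeaker, the microphones used to adapt its preprocessor."""
    return [nearest_microphones(p, mic_positions, count)
            for p in speaker_positions]

# Hypothetical 2-D layout in metres (not the geometry of figures 6a and 6b).
mics = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]
speakers = [(0.0, 0.05), (0.3, 0.05)]
print(selection_table(speakers, mics, count=2))  # prints [[0, 1], [2, 3]]
```

Raising `count` uses more microphone signals per preprocessor, which corresponds to the trade-off between cancellation performance and computational load described above.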

Embodiments of the acoustic processing device 110 and the acoustic device 100 disclosed herein provide, amongst others, the following advantages: the capability of canceling nonlinear distortions caused by multiple simultaneously active nonlinearities originating from, e.g., loudspeakers or amplifiers of loudspeakers; efficient multichannel nonlinear distortion modeling by coupling several microphones of the plurality of microphones 103a-g; exploitation of knowledge of microphone and loudspeaker positions for estimating nonlinearities in the echo path; and a scalable approach, as the number of microphone signals used for estimating the nonlinearities is adjustable. The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit.