

Title:
ADAPTIVE INTER-CHANNEL TIME DIFFERENCE ESTIMATION
Document Type and Number:
WIPO Patent Application WO/2024/056702
Kind Code:
A1
Abstract:
A method to estimate an inter-channel time difference (ITD) in an encoder using discontinuous transmission (DTX) is disclosed. One example method includes receiving (1601) a time domain audio input comprising audio input signals and processing (1603) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters. The method further includes encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1605) active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or the mono mixdown signal; estimating (1606) ITD parameters during the encoding of active content based on a low-pass filtering of cross-spectra or an averaging of the cross-spectra; switching (1607) the encoding from active content encoding to inactive encoding to encode background noise at a second bit rate during the pause period; and estimating (1609) ITD parameters during the pause period based on a low-pass filtering of cross-spectra or an averaging of the cross-spectra, wherein the estimating is configured to adapt to the audio input signals faster than when estimating the ITD parameters during the encoding of active content. The method further includes encoding (1611) the estimated ITD parameters and other stereo parameters periodically during the pause period.

Inventors:
JANSSON TOFTGÅRD TOMAS (SE)
SEHLSTEDT MARTIN (SE)
JANSSON FREDRIK (SE)
Application Number:
PCT/EP2023/075087
Publication Date:
March 21, 2024
Filing Date:
September 13, 2023
Assignee:
ERICSSON TELEFON AB L M (SE)
International Classes:
G10L19/012; G10L19/008; G10L25/06; G10L25/18
Foreign References:
EP3511934B12021-04-21
Other References:
[NONE]: "ITU-T Wideband embedded extension for ITU-T G.711 pulse code modulation; Recommendation ITU-T G.711.1 (09/2012)", 13 September 2012 (2012-09-13), XP055407180, Retrieved from the Internet [retrieved on 20170915]
Attorney, Agent or Firm:
ERICSSON (SE)
Claims:
CLAIMS

What is claimed is:

1. A method to estimate an inter-channel time difference, ITD, in an encoder (1300, 1508A, 1508B) using a discontinuous transmission, DTX, the method comprising: receiving (1601) a time domain audio input comprising audio input signals; processing (1603) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1605) active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or the mono mixdown signal; estimating (1606) ITD parameters during the encoding of active content based on a low-pass filtering of cross-spectra of the audio input signals or an averaging of the cross-spectra; switching (1607) the encoding from active content encoding to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1609) ITD parameters during the pause period based on a low-pass filtering of the cross-spectra of the audio input signals or an averaging of the cross-spectra, wherein the estimating is configured to adapt faster to the audio input signals than when estimating the ITD parameters during the encoding of active content; and encoding (1611) the estimated ITD parameters and other stereo parameters periodically during the pause period.

2. The method of claim 1, wherein the estimating being configured to adapt faster to the audio input signals than when estimating the ITD parameters during the encoding of active content comprises: speeding up a smoothing of the cross-spectra by increasing the low-pass filtering coefficient during a DTX hangover period and/or during a start of the pause period compared to prior to the DTX hangover period and/or the start of the pause period.

3. The method of claim 1, wherein the estimating being configured to adapt faster to the audio input signals than when estimating the ITD parameters during the encoding of active content comprises: in a first encoding frame after active coding, replacing (1701) a state of a first cross-spectra low-pass filter CS_LP_filt1 with a state of a second low-pass filter CS_LP_filt2 which filters the cross-spectrum but is only updated during hangover and pause periods.

4. The method of claim 3, further comprising: starting (1703) an update of the second low-pass filter CS_LP_filt2 during a DTX hangover period.

5. The method of any of claims 2-4, further comprising: speeding (1705) up the update of the state of the second low-pass filter CS_LP_filt2 responsive to the filtering being slow due to a low spectral flatness measure, sfm.

6. The method of any of claims 3-5, wherein CS_LP_filt2 is determined in accordance with

CS_LP_filt2(k, m) = (1 − α_hangover(m)) · CS_LP_filt2(k, m − 1) + α_hangover(m) · CS_xy(k)

CS_LP_filt1(k, m) = (1 − α_SID(m)) · CS_LP_filt1(k, m − 1) + α_SID(m) · CS_xy(k)

where k is a frequency bin, m is a frame index, CS_xy(k) is the cross-spectrum, and α_hangover(m) and α_SID(m) are adaptive low-pass filter coefficients.

7. The method of claim 6, wherein α_hangover(m) and α_SID(m) are determined in accordance with

α_hangover(m) = max(α_default, min(…, …))

and

α_SID(m) = max(α_default, min(…, …))

where α_default is a default filter coefficient and R_hangover and R_SID are rate parameters.
8. The method of claim 6, wherein α_hangover(m) and α_SID(m) are determined in accordance with

α_hangover(m) = max(α_default, min(…, …))

and

α_SID(m) = max(α_default, min(…, …))

where R_hangover and R_SID are rate parameters, N_hangover corresponds to the number of hangover frames, and the remaining term is a variable.

9. The method of any of claims 1-8, wherein the estimating being configured to adapt faster to the audio input signals than when estimating the ITD parameters during the encoding of active content comprises: adjusting (1801) a low-pass filter coefficient during the DTX hangover period and/or during the start of the pause period.

10. The method of claim 9, wherein adjusting the low-pass filter coefficient comprises adjusting the low-pass filter coefficient in accordance with

CS_LP_filt(k, m) = (1 − α_1) · CS_LP_filt(k, m − 1) + α_1 · CS_xy(k), if cnt_updated < cng_itd_lim

CS_LP_filt(k, m) = (1 − sfm) · CS_LP_filt(k, m − 1) + sfm · CS_xy(k), otherwise

α_1 = min(A, sfm + … · (cng_itd_lim − cnt_updated) / cng_itd_lim)

cnt_updated = 0, if Speech frame

cnt_updated = cnt_updated + 1, if CNG frame

where α_1 is the low-pass filter coefficient, k is a frequency bin, m is a frame number, CS_xy(k) is a cross-spectrum, CS_LP_filt(k, m) is a low-pass filtering of the cross-spectrum, a CNG frame is an inactive coding frame, a Speech frame is an active encoding frame, sfm is a spectral flatness measure, and A is an upper threshold.

11. The method of any of claims 1, 2 or 5, wherein estimating the ITD parameters further comprises speeding up smoothing of the cross-spectra by the low-pass filtering during a start of the pause period, which comprises triggering (1901) the speed up of the filtering of the cross-spectra after a number of consecutive active frames of active encoding has been reached.

12. The method of any of claims 1-11, further comprising: executing (2001) a dedicated cross-correlation estimate for the cross-spectra that is only updated during the pause periods and/or during DTX hangover frames, and using the dedicated cross-correlation estimate for the ITD estimation in the pause period.

13. The method of any of claims 1-12, further comprising: resetting (2101) the cross-spectrum low-pass filter state at one of: prior to any updates in a DTX hangover period, and prior to any updates in the pause period.

14. The method of any of claims 1-13, further comprising: replacing (2201) a low-pass filter state at the start of a hangover period or at the start of the pause period.

15. The method of claim 14, wherein replacing the low-pass filter state at the start of the pause period comprises averaging (2301) the cross-spectra CS_xy(k) over a number of cng_itd_lim frames and replacing the filter state CS_LP_filt with an average of the cross-spectra CS_xy(k) over the number of cng_itd_lim frames.

16. The method of any of claims 1-15, further comprising: transmitting (1613) the encoded active content, the encoded background noise, and the encoded ITD parameters and other stereo parameters towards a decoder (500).
17. An encoder (1300, 1508A, 1508B) adapted to perform operations comprising: receiving (1601) a time domain audio input comprising audio input signals; processing (1603) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1605) active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or the mono mixdown signal; estimating (1606) ITD parameters during the encoding of active content based on a low-pass filtering of cross-spectra of the audio input signals or an averaging of the cross-spectra; switching (1607) the encoding from active content encoding to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1609) ITD parameters during the pause period based on a low-pass filtering of the cross-spectra of the audio input signals or an averaging of the cross-spectra, wherein the estimating is configured to adapt to the audio input signals faster than when estimating the ITD parameters during the encoding of active content; and encoding (1611) the estimated ITD parameters and other stereo parameters periodically during the pause period.

18. The encoder (1300, 1508A, 1508B) of claim 17, wherein the encoder (400, 1508A, 1508B) performs according to any of claims 2-15.

19. An encoder (1300, 1508A, 1508B) comprising: processing circuitry (1301); and memory (1303) coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry cause the encoder (400, 1508A, 1508B) to perform operations comprising: receiving (1601) a time domain audio input comprising audio input signals; processing (1603) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1605) active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or the mono mixdown signal; estimating (1606) ITD parameters during the encoding of active content based on a low-pass filtering of cross-spectra of the audio input signals or an averaging of the cross-spectra; switching (1607) the encoding from active content encoding to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1609) ITD parameters during the pause period based on a low-pass filtering of the cross-spectra of the audio input signals or an averaging of the cross-spectra, wherein the estimating is configured to adapt to the audio input signals faster than when estimating the ITD parameters during the encoding of active content; and encoding (1611) the estimated ITD parameters and other stereo parameters periodically during the pause period.

20. The encoder (1300, 1508A, 1508B) of claim 19, wherein the memory includes further instructions that when executed by the processing circuitry cause the encoder (1300, 1508A, 1508B) to perform operations according to any of claims 2-15.
21. A computer program comprising program code to be executed by processing circuitry (1301) of an encoder (1300, 1508A, 1508B), whereby execution of the program code causes the encoder (400, 1508A, 1508B) to perform operations comprising: receiving (1601) a time domain audio input comprising audio input signals; processing (1603) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1605) active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or the mono mixdown signal; estimating (1606) ITD parameters during the encoding of active content based on a low-pass filtering of cross-spectra of the audio input signals or an averaging of the cross-spectra; switching (1607) the encoding from active content encoding to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1609) ITD parameters during the pause period based on a low-pass filtering of the cross-spectra of the audio input signals or an averaging of the cross-spectra, wherein the estimating is configured to adapt to the audio input signals faster than when estimating the ITD parameters during the encoding of active content; and encoding (1611) the estimated ITD parameters and other stereo parameters periodically during the pause period.

22. The computer program of claim 21, comprising further program code whereby execution of the program code causes the encoder (1300, 1508A, 1508B) to perform operations according to any of claims 2-15.

23. A computer program product comprising a non-transitory computer readable storage medium having program code to be executed by processing circuitry (1301) of an encoder (1300, 1508A, 1508B), whereby execution of the program code causes the encoder (1300, 1508A, 1508B) to perform operations comprising: receiving (1601) a time domain audio input comprising audio input signals; processing (1603) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1605) active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or the mono mixdown signal; estimating (1606) ITD parameters during the encoding of active content based on a low-pass filtering of cross-spectra of the audio input signals or an averaging of the cross-spectra; switching (1607) the encoding from active content encoding to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1609) ITD parameters during the pause period based on a low-pass filtering of the cross-spectra of the audio input signals or an averaging of the cross-spectra, wherein the estimating is configured to adapt to the audio input signals faster than when estimating the ITD parameters during the encoding of active content; and encoding (1611) the estimated ITD parameters and other stereo parameters periodically during the pause period.

24. The computer program product of claim 23, wherein the non-transitory computer readable storage medium has further program code to be executed by processing circuitry (1301) of the encoder (1300, 1508A, 1508B), whereby execution of the program code causes the encoder (1300, 1508A, 1508B) to perform operations according to any of claims 2-15.
Description:
ADAPTIVE INTER-CHANNEL TIME DIFFERENCE ESTIMATION

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the priority benefit of U.S. Provisional Patent Application Serial No. 63/406,127, filed September 13, 2022, the disclosure of which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

[0002] The present disclosure relates generally to communications, and more particularly to communication methods and related devices and nodes supporting encoding and decoding.

BACKGROUND

[0003] In communications networks, there may be a challenge to obtain good performance and capacity for a given communications protocol, its parameters, and the physical environment in which the communications network is deployed.

[0004] For example, although the capacity in telecommunication networks is continuously increasing, it is still of interest to limit the required resource usage per user. In mobile telecommunication networks, less required resource usage per call means that the mobile telecommunication network can service a larger number of users in parallel. Lowering the resource usage also yields lower power consumption in both devices at the user-side (e.g., terminal devices) and devices at the network-side (e.g., network nodes). This translates to energy and cost savings for the network operator, while enabling prolonged battery life and increased talk-time for the terminal devices.

[0005] One mechanism for reducing the required resource usage for speech communication applications in mobile telecommunication networks is to exploit natural pauses in the speech. For example, in most conversations only one party is active at a time, and thus pauses in speech occurring in one communication direction will typically occupy more than half of the signal. One way to utilize this property to decrease the required resource usage is to employ a Discontinuous Transmission (DTX) system, where the active signal encoding is discontinued during speech pauses.

[0006] Typically, the encoding process is performed on audio signal segments (e.g., referred to as frames), where input audio samples during a time interval, typically 10-20 milliseconds (ms), are buffered and used by an encoder to extract the parameters to be transmitted to a decoder.

[0007] During speech pauses, it is common to transmit 'silence insertion descriptor' (SID) frames at a very low bit rate encoding of the background noise to allow a Comfort Noise Generator (CNG) system at the receiving end to fill the above-mentioned pauses with a background noise that has similar characteristics as the original noise. Notably, the CNG makes the pauses sound more natural (e.g., as compared to having completely silent speech pauses) since the background noise is maintained and not switched on and off together with the speech sounds. Complete silence in the speech pauses is commonly perceived as an annoyance and often leads to the misconception that the call has been disconnected.

[0008] A DTX system may rely on a Voice Activity Detector (VAD), which indicates to the transmitting device whether to use i) active signal encoding or ii) low rate background noise encoding. In this respect, the transmitting device might be configured to differentiate between other source types by using a (Generic) Sound Activity Detector (GSAD or SAD), which not only distinguishes speech from background noise but can also be configured to detect music or other signal types deemed to be relevant.
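As a rough illustration of the DTX scheme described above, the following Python sketch classifies frames into active encoding, hangover, SID (CN encoding) and untransmitted frames. It is a minimal sketch under stated assumptions only: the energy-based activity decision, the hangover length of 9 frames and the SID interval of 8 frames are illustrative values chosen for the example and are not values mandated by this disclosure.

import numpy as np

def classify_frames(frames, vad_threshold=1e-4, hangover_len=9, sid_interval=8):
    """Return one label per frame: 'ACTIVE', 'HANGOVER', 'SID' or 'NO_DATA'."""
    labels = []
    hangover = 0                  # remaining hangover frames
    since_sid = sid_interval      # forces a SID frame at the first pause frame
    for frame in frames:
        active = np.mean(frame ** 2) > vad_threshold  # stand-in for a real VAD/SAD
        if active:
            labels.append('ACTIVE')
            hangover = hangover_len
            since_sid = sid_interval
        elif hangover > 0:
            labels.append('HANGOVER')   # still coded as active content
            hangover -= 1
        elif since_sid >= sid_interval:
            labels.append('SID')        # low-rate comfort-noise parameter update
            since_sid = 1
        else:
            labels.append('NO_DATA')    # nothing transmitted between SID frames
            since_sid += 1
    return labels

# Example: 20 frames of speech followed by 30 frames of low-level background noise
rng = np.random.default_rng(0)
frames = [rng.standard_normal(160) * (0.1 if i < 20 else 0.001) for i in range(50)]
print(classify_frames(frames))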
A block diagram of a DTX system 100 is illustrated in Figure 1.

[0009] In Figure 1, input audio is received by the VAD 102, the speech/audio coder 104, and the CNG coder 106. The VAD 102 indicates whether to transmit the "high" bitrate from the speech/audio coder 104 or the "low" bitrate from the CNG coder 106.

[0010] Communication services may be further enhanced by supporting stereo or multichannel audio transmissions. In these cases, the DTX/CNG system might also account for the spatial characteristics of the signal in order to provide a pleasant-sounding comfort noise.

[0011] A common mechanism used to generate comfort noise is to transmit information about the energy and spectral shape of the background noise in the speech pauses. This can be accomplished using a significantly lower number of bits than the regular coding of speech segments. Normally, this information is sent less frequently than in the active segments, as depicted in Figure 2, where the active segments are illustrated as active encoding (e.g., see active encoding signal 202) and the information about the energy and spectral shape of the background noise in the speech pauses is illustrated as CN encoding signaling 204.

[0012] A common feature in DTX systems is to add a "hangover period" 301 to the VAD decision, as illustrated in Figure 3. During this period, active encoding is still used even though the VAD decision (see signal 302) is that there should not be active encoding (e.g., see active encoding signal 304). This is to avoid short segments of CNG in the middle of longer active segments, e.g., in breathing pauses in a speech utterance (e.g., see signal 306). Parameters used for CNG generation can be estimated during this period.

[0013] At the receiving side, the comfort noise is generated by creating a pseudo random signal and then shaping the spectrum of the signal with a filter based on information received from the transmitting device. The signal generation and spectral shaping can be performed in the time domain or the frequency domain.

[0014] For stereo operation, additional parameters are transmitted to the receiving side. In a typical stereo signal, the channel pair shows a high degree of similarity, or correlation. State-of-the-art stereo coding schemes exploit this correlation by employing parametric coding, where a single channel is encoded with high quality and complemented with a parametric description that enables reconstruction of the full stereo image. The process of reducing the channel pair into a single channel is called a down-mix. Similarly, the resulting channel may be referred to as the down-mix channel or mixdown channel. The down-mix procedure typically tries to maintain the energy by aligning inter-channel time differences (ITD) and inter-channel phase differences (IPD) before mixing the channels. To maintain the energy balance of the input signal, the inter-channel level difference (ILD) is also measured. The ITD, IPD and ILD are then encoded and may be used in a reversed up-mix procedure when reconstructing the stereo channel pair at a decoder. As discussed below, Figure 4 and Figure 5 depict block diagrams of a parametric stereo encoder 400 and decoder 500.

[0015] In Figure 4, time domain stereo input is received by a stereo processing and mixdown module 402. The stereo processing and mixdown module 402 processes the time domain stereo input signals and produces a mono mixdown signal and stereo parameters (e.g., ITD, IPD, and/or ILD).
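The following Python sketch illustrates, in simplified time-domain form, the kind of quantities such a stereo processing and mixdown stage computes: an ITD estimate, an ILD estimate and a mono down-mix obtained after time-aligning the channels. It is an illustrative sketch only; the function names, the plain cross-correlation search and the passive mixing rule are assumptions for the example and do not reproduce the actual encoder 400 of Figure 4.

import numpy as np

def estimate_itd(left, right, max_lag=40):
    """Lag (in samples) that maximizes the cross-correlation; positive = right lags left."""
    lags = range(-max_lag, max_lag + 1)
    corr = [np.dot(left[max_lag:-max_lag], np.roll(right, -lag)[max_lag:-max_lag]) for lag in lags]
    return list(lags)[int(np.argmax(corr))]

def downmix(left, right):
    itd = estimate_itd(left, right)
    right_aligned = np.roll(right, -itd)            # time-align the channels before mixing
    mono = 0.5 * (left + right_aligned)             # simple passive down-mix of the aligned channels
    ild_db = 10 * np.log10((np.sum(left**2) + 1e-12) / (np.sum(right**2) + 1e-12))
    return mono, {"itd": itd, "ild_db": ild_db}

# Example: the right channel is a 5-sample delayed copy of the left channel
rng = np.random.default_rng(1)
left = rng.standard_normal(960)
right = np.roll(left, 5)
mono, params = downmix(left, right)
print(params)   # the itd value should be close to 5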
The mono mixdown signal is received by a mono speech/audio encoder 404, which processes the mono mixdown signal and produces an encoded mono signal. The encoded mono signal and the stereo parameters are transmitted towards a decoder such as the parametric stereo decoder 500 (depicted in Figure 5).

[0016] In Figure 5, the encoded mono signal is received by a mono speech/audio decoder 502, which decodes the encoded mono signal and produces a mono mixdown signal. The mono mixdown signal and the stereo parameters are received by a stereo processing and upmix decoder 504, which processes the mono mixdown signal and stereo parameters and produces time domain stereo output. The time domain stereo output can be stored or sent to an audio player for playback.

[0017] Figure 6 is an illustration of a practical example of the occurrence of ITD. As depicted in Figure 6, if a stereo signal is captured by two microphones 602-604, the distance (L1) from the source (e.g., speaker source 601) to the left microphone 602 may be different from the distance (L2) to the right microphone 604. The difference in distance will lead to a time delay between the channels, i.e., the ITD. If there are several audio sources, these sources may have different ITDs. The background noise (e.g., sources 600) will often be a sum of many sources and may not have one apparent ITD.

[0018] The conventional parametric approach to estimate the ITD relies on the cross-correlation function (CCF) r_xy, which is a measure of similarity between two waveforms x(t) and y(t) and is generally defined in the time domain as

r_xy(τ) = E[x(t) y(t + τ)],

where τ is the time-lag parameter and E[·] is the expectation operator. For a signal frame of length N, the cross-correlation is typically estimated as

r̂_xy(τ) = Σ_{t=0}^{N−1} x(t) y(t + τ).

[0019] The Inter-channel Cross-correlation Coefficient (ICC) is conventionally obtained as the maximum of the CCF, which is normalized by the signal energies as follows:

ICC = max_τ r̂_xy(τ) / sqrt(r̂_xx(0) r̂_yy(0)).

[0020] The time lag τ that maximizes the cross-correlation function is taken as the ITD between the channels x and y. By assuming x(t) and y(t) are zero outside the signal frame, the cross-correlation function can equivalently be expressed as a function of the cross-spectrum of the frequency spectra X(k) and Y(k) (with discrete frequency index k) as

r̂_xy(τ) = DFT⁻¹{ X*(k) Y(k) },

where X(k) is the discrete Fourier transform (DFT) of the time domain signal x(t), i.e.,

X(k) = Σ_{t=0}^{N−1} x(t) e^(−j2πkt/N), k = 0, …, N − 1,

and DFT⁻¹{·} is the inverse DFT.

[0021] For the case when y(t) is purely a delayed version of x(t), delayed by τ_0, the cross-correlation function is given by

r̂_xy(τ) = DFT⁻¹{ X(k) X*(k) e^(−j2πkτ_0/N) } = r̂_xx(τ) ∗ δ(τ − τ_0),

where ∗ denotes convolution and δ(τ − τ_0) is the Kronecker delta function, which is equal to one at τ_0 and zero otherwise. This means the cross-correlation function between x and y is the delta function spread by the convolution with the autocorrelation function for x(t). This will broaden the delta peak. For signal frames with several delay components, e.g., several speakers/talkers, there will be peaks at each delay present between the signals, and the cross-correlation becomes

r̂_xy(τ) = r̂_xx(τ) ∗ Σ_i δ(τ − τ_i).

[0022] The broadened delta peaks may then overlap each other and make it difficult to identify the several delays within the signal frame. There are, however, generalized cross-correlation (GCC) functions that do not have this spreading. The GCC is generally defined as

R_GCC(τ) = DFT⁻¹{ ψ(k) X*(k) Y(k) },

where ψ(k) is a frequency weighting. Especially for spatial audio, the phase transform (PHAT) has been utilized due to its robustness for reverberation in low noise environments. The phase transform weighting is basically the inverse of the absolute value of each frequency coefficient of the cross-spectrum, i.e.,

ψ_PHAT(k) = 1 / |X*(k) Y(k)|.

[0023] This frequency weighting normalizes the cross-spectrum such that the power of each component becomes equal. With pure delay and uncorrelated noise in the signals x(t) and y(t), the phase transformed GCC (GCC-PHAT) becomes the Kronecker delta function δ(τ − τ_0), i.e.,

R_PHAT(τ) = DFT⁻¹{ X*(k) Y(k) / |X*(k) Y(k)| } = DFT⁻¹{ e^(−j2πkτ_0/N) } = δ(τ − τ_0).

[0024] As noted above, the encoding is performed on signal segments (frames); the common lengths of these segments are 10 or 20 ms. The coding parameters, like the ITD, are estimated at the encoding side on a per-frame basis and are transmitted to the decoder. It is also common to not transmit a parameter if there is no clear gain in the encoding process from using the parameter. In the ITD case, this will be when the left and right signals are more or less uncorrelated.

SUMMARY

[0025] There currently exist certain challenge(s). The CNG that is generated during speech pauses when DTX is enabled is encoded at a very low bit rate. There is no other part of the CNG encoding that can counteract the effect of an incorrect ITD. In speech pauses, it is likely that the ITD will be different as compared to the speech segments. For example, Figure 7 is a signaling diagram illustrating ITD delay according to some embodiments. In such cases, the low-pass filtering of the cross-spectrum afforded by the current solution will lead to a delay in the change from the "speech ITD", i.e., signal portion 702, to the "background noise ITD", i.e., signal portion 703. If this delay 704 in the active encoding signal 701 is long enough, e.g., 1 second or more, the listener will initially hear the background noise (e.g., ITD signal portion 702) generated with the speech ITD and then hear a sudden change of the ITD to the correct one (e.g., signal portion 703). This will be easily perceived as a significant change in the spatial characteristics of the background noise and may be an annoyance to the listener.

[0026] If the smoothing of the cross-spectrum is based on the spectral flatness, the issue will be stronger for background noises that have a strong spectral tilt. The spectral flatness measure is typically used to indicate a tonal or periodic signal structure. However, some noise signals will also yield a low spectral flatness measure due to a strongly tilted spectrum. This is often the case for car noise, which typically has a strong low frequency component. If the smoothing of the cross-spectrum is based on the spectral flatness, the smoothing will be strong for such background noises. This may lead to a delayed shift in ITD as mentioned above.

[0027] Certain aspects of the disclosure and their embodiments may provide solutions to these or other challenges. The various embodiments described herein are directed to speeding up the low-pass filtering of the cross-correlation to allow a faster adaptation of the ITD in the beginning of each CNG segment. This may be achieved in several ways, including but not limited to modifying the low-pass filter coefficient.

[0028] In some embodiments, the disclosed subject matter includes a method to estimate an inter-channel time difference (ITD) in an encoder using a discontinuous transmission (DTX).
One example method includes receiving (1601) a time domain audio input comprising audio input signals and processing (1603) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters. The method further includes encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1605) active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or the mono mixdown signal; estimating ITD parameters during the encoding of active content based on a low-pass filtering of the cross-spectra of the audio input signals or an averaging of the cross-spectra; switching (1607) the encoding from active content encoding to inactive encoding to encode background noise at a second bit rate during the pause period; and estimating (1609) ITD parameters during the pause period (or inactive encoding) based on a low-pass filtering of cross-spectra of the audio input signals or an averaging of the cross-spectra, wherein the estimating is configured to adapt to the audio input signals faster (e.g., by speeding up the ITD estimation) than when estimating the ITD parameters during the encoding of active content. The method further includes encoding (1611) the estimated ITD parameters and other stereo parameters periodically during the pause period.

[0029] According to at least one embodiment of the disclosed subject matter, the method further includes speeding up a smoothing of the cross-spectra by increasing the low-pass filtering coefficient during a DTX hangover period and/or during a start of the pause period compared to prior to the DTX hangover period and/or the start of the pause period.

[0030] According to at least one embodiment of the disclosed subject matter, the method further includes, in a first encoding frame after active coding, replacing a state of a first cross-spectra low-pass filter CS_LP_filt1 with a state of a second low-pass filter CS_LP_filt2 which filters the cross-spectrum but is only updated during hangover and pause periods.

[0031] According to at least one embodiment of the disclosed subject matter, the method further includes starting an update of the second low-pass filter CS_LP_filt2 during a DTX hangover period.

[0032] According to at least one embodiment of the disclosed subject matter, the method further includes speeding up the update of the state of the second low-pass filter CS_LP_filt2 responsive to the filtering being slow due to a low spectral flatness measure, sfm.

[0033] According to at least one embodiment of the disclosed subject matter, CS_LP_filt2 is determined in accordance with

CS_LP_filt2(k, m) = (1 − α_hangover(m)) · CS_LP_filt2(k, m − 1) + α_hangover(m) · CS_xy(k)

CS_LP_filt1(k, m) = (1 − α_SID(m)) · CS_LP_filt1(k, m − 1) + α_SID(m) · CS_xy(k)

where k is a frequency bin, m is a frame index, CS_xy(k) is the cross-spectrum, and α_hangover(m) and α_SID(m) are adaptive low-pass filter coefficients.

[0034] According to at least one embodiment of the disclosed subject matter, α_hangover(m) and α_SID(m) are determined in accordance with

α_hangover(m) = max(α_default, min(…, …))

and

α_SID(m) = max(α_default, min(…, …))

where α_default is a default filter coefficient and R_hangover and R_SID are rate parameters.
[0035] According to at least one embodiment of the disclosed subject matter, α_hangover(m) and α_SID(m) are determined in accordance with

α_hangover(m) = max(α_default, min(…, …))

and

α_SID(m) = max(α_default, min(…, …))

where R_hangover and R_SID are rate parameters, N_hangover corresponds to the number of hangover frames, and the remaining term is a variable.

[0036] According to at least one embodiment of the disclosed subject matter, the method further includes adjusting a low-pass filter coefficient during the DTX hangover period and/or during the start of the pause period.

[0037] According to at least one embodiment of the disclosed subject matter, adjusting the low-pass filter coefficient comprises adjusting the low-pass filter coefficient in accordance with

CS_LP_filt(k, m) = (1 − α_1) · CS_LP_filt(k, m − 1) + α_1 · CS_xy(k), if cnt_updated < cng_itd_lim

CS_LP_filt(k, m) = (1 − sfm) · CS_LP_filt(k, m − 1) + sfm · CS_xy(k), otherwise

α_1 = min(A, sfm + … · (cng_itd_lim − cnt_updated) / cng_itd_lim)

cnt_updated = 0, if Speech frame

cnt_updated = cnt_updated + 1, if CNG frame

where α_1 is the low-pass filter coefficient, k is a frequency bin, m is a frame number, CS_xy(k) is a cross-spectrum, CS_LP_filt(k, m) is a low-pass filtering of the cross-spectrum, a CNG frame is an inactive coding frame, a Speech frame is an active encoding frame, sfm is a spectral flatness measure, and A is an upper threshold.

[0038] According to at least one embodiment of the disclosed subject matter, speeding up smoothing of the cross-spectra by the low-pass filtering during a start of the pause period comprises triggering the speed up of the filtering of the cross-spectra after a number of consecutive active frames of active encoding has been reached.

[0039] According to at least one embodiment of the disclosed subject matter, the method further includes executing a dedicated cross-correlation estimate for the cross-spectra that is only updated during the pause periods and/or during DTX hangover frames, and using the dedicated cross-correlation estimate for the ITD estimation in the pause period.

[0040] According to at least one embodiment of the disclosed subject matter, the method further includes resetting the cross-spectrum low-pass filter state at one of: prior to any updates in a DTX hangover period, and prior to any updates in the pause period.

[0041] According to at least one embodiment of the disclosed subject matter, the method further includes replacing a low-pass filter state at the start of a hangover period or at the start of the pause period.

[0042] According to at least one embodiment of the disclosed subject matter, replacing the low-pass filter state at the start of the pause period comprises averaging the cross-spectra CS_xy(k) over a number of cng_itd_lim frames and replacing the filter state CS_LP_filt with an average of the cross-spectra CS_xy(k) over the number of cng_itd_lim frames.
[0043] According to at least one embodiment of the disclosed subject matter, the method further includes transmitting the encoded active content, the encoded background noise, and the encoded ITD parameters and other stereo parameters towards a decoder.

[0044] Certain embodiments may provide one or more of the following technical advantage(s). The various embodiments permit the comfort noise to sound more natural and avoid annoying effects associated with a sudden change in the spatial characteristics during CNG after changing from active coding. In particular, it is avoided that the DTX starts with a segment of comfort noise colored by the active content and then, after some time, suddenly changes to a comfort noise that more closely resembles the original input noise.

[0045] A faster adaptation of the comfort noise to the background noise may also improve the ITD estimation in speech onsets, since the influence in the ITD estimation from the previous speech segment is decreased.

BRIEF DESCRIPTION OF THE DRAWINGS

[0046] The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate certain non-limiting embodiments of inventive concepts. In the drawings:

[0047] Figure 1 is a block diagram of a discontinuous transmission (DTX) system;

[0048] Figure 2 is a flow diagram illustrating comfort noise generator (CNG) parameter encoding and transmission;

[0049] Figure 3 is a flow diagram illustrating a voice activity detector (VAD) (or DTX) hangover period;

[0050] Figure 4 is a block diagram of a parametric stereo encoder according to some embodiments;

[0051] Figure 5 is a block diagram of a parametric stereo decoder according to some embodiments;

[0052] Figure 6 is an illustration of inter-channel time difference (ITD) according to some embodiments;

[0053] Figure 7 is a flow diagram illustrating ITD delay according to some embodiments;

[0054] Figure 8 is a flow diagram illustrating a filter speed up period according to some embodiments;

[0055] Figure 9 is a flow diagram illustrating VAD speech/CNG toggling according to some embodiments;

[0056] Figure 10 is a flow diagram illustrating no filter speed up after a short active period according to some embodiments;

[0057] Figure 11 is a flow diagram illustrating replacing a cross-spectra filter state according to some embodiments;

[0058] Figure 12 is a flow diagram illustrating a potential reset of the filter state for the second cross-spectra filter state, speed up of the filter state update based on a gradually decreasing lower threshold for filter state updates, copying of the filter state to the first cross-spectra filter state, and a continued gradually decreasing lower threshold for filter state updates, according to some embodiments;

[0059] Figure 13 is a block diagram of an encoder in accordance with some embodiments;

[0060] Figure 14 is a block diagram of a decoder in accordance with some embodiments;

[0061] Figure 15 is a block diagram of a virtualization environment in accordance with some embodiments; and

[0062] Figures 16-23 are flow charts illustrating operations of an encoder according to some embodiments.

DETAILED DESCRIPTION

[0063] Some of the embodiments contemplated herein will now be described more fully with reference to the accompanying drawings.
Embodiments are provided by way of example to convey the scope of the subject matter to those skilled in the art, in which examples of embodiments of inventive concepts are shown. Inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of present inventive concepts to those skilled in the art. It should also be noted that these embodiments are not mutually exclusive. Components from one embodiment may be tacitly assumed to be present or used in another embodiment.

[0064] Embodiments of the disclosed subject matter pertain to methods and techniques for implementing an adaptive ITD estimation. Notably, speeding up the low-pass filtering of the cross-correlation estimate in the beginning of each CNG segment permits a faster adaptation of the ITD estimate. This may be achieved in several ways, such as by modifying the low-pass filter coefficient. In some embodiments, the ITD estimation processes and/or techniques disclosed herein may be executed via an encoder element and/or its ITD estimation engine (IEE) as described below.

[0065] For the ITD, it is desirable to have an ITD estimate that does not have a small random variation on a frame-by-frame basis. One way to stabilize the estimate is to apply a low-pass filter to the cross-spectrum using a simple first order filter, such as

CS_LP_filt(k, m) = (1 − α) · CS_LP_filt(k, m − 1) + α · CS_xy(k),

wherein k = frequency bin and m = frame number.

[0066] The filter coefficient α can be fixed, but it may also be adaptive. One example is to use a spectral flatness measure (sfm) calculated on the left or right input signal as the filter coefficient,

sfm = exp( (1/N) Σ_{k=0}^{N−1} ln|X(k)| ) / ( (1/N) Σ_{k=0}^{N−1} |X(k)| ) = ( ∏_{k=0}^{N−1} |X(k)| )^(1/N) / ( (1/N) Σ_{k=0}^{N−1} |X(k)| ),

giving

CS_LP_filt(k, m) = (1 − sfm) · CS_LP_filt(k, m − 1) + sfm · CS_xy(k).

[0067] This measure will have the range 0.0 – 1.0, where a higher value indicates a flatter spectrum. Using this coefficient may improve the robustness and accuracy of the ITD estimation.

[0068] Within a stereo or multichannel audio encoding system, ITD parameters are generated based on channel pairs, where the ITD estimation is based on a low-pass filtering or averaging of a cross-spectrum, and the low-pass filtering of the cross-spectrum is controlled based on the DTX and Voice/Sound Activity Detector decisions.

[0069] Various embodiments enable the ITD calculation to be adaptive and controlled by the DTX system. Transitions from active content to CNG occur when the coding goes from active content to background content, which may have significantly different spatial properties (e.g., inter-channel time difference or coherence). For such changes occurring in the signal characteristics of the encoded spatial audio, it can be beneficial to make the adaptation to the change in content quicker.

[0070] The reason for the time difference between the signals in the left and right channels is the positioning of the sound source in relation to the capture microphones. In a conversational speech scenario with one or several speakers and environmental noise in the background, this means that there may be a sudden change in ITD when the speakers stop talking, i.e., when the DTX system makes the coding process switch over to CNG.
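Before describing the specific speed-up techniques, the following Python sketch ties together the GCC-PHAT estimation of the background section with the sfm-based cross-spectrum smoothing of paragraphs [0065]-[0067]: the cross-spectrum is low-pass filtered with the spectral flatness measure as coefficient, and the ITD is taken from the phase-transformed, smoothed cross-spectrum. The cross-spectrum convention CS_xy(k) = X*(k)·Y(k) follows the background formulas; the FFT length, search range and the use of the left channel only for the sfm are illustrative choices, and the windowing and band limiting of a real codec are omitted.

import numpy as np

def spectral_flatness(mag):
    """Geometric mean over arithmetic mean of a magnitude spectrum (range 0.0 - 1.0)."""
    mag = np.maximum(mag, 1e-12)
    return np.exp(np.mean(np.log(mag))) / np.mean(mag)

class SmoothedItdEstimator:
    def __init__(self, fft_len=512, max_itd=64):
        self.fft_len = fft_len
        self.max_itd = max_itd
        self.cs_lp = np.zeros(fft_len // 2 + 1, dtype=complex)   # CS_LP_filt state

    def process(self, left, right):
        X = np.fft.rfft(left, self.fft_len)
        Y = np.fft.rfft(right, self.fft_len)
        cs = np.conj(X) * Y                       # cross-spectrum CS_xy(k) = X*(k) Y(k)
        alpha = spectral_flatness(np.abs(X))      # sfm used as the low-pass filter coefficient
        self.cs_lp = (1.0 - alpha) * self.cs_lp + alpha * cs
        phat = self.cs_lp / np.maximum(np.abs(self.cs_lp), 1e-12)   # PHAT weighting
        gcc = np.fft.irfft(phat, self.fft_len)
        lags = np.r_[np.arange(0, self.max_itd + 1), np.arange(-self.max_itd, 0)]
        idx = int(np.argmax(np.r_[gcc[:self.max_itd + 1], gcc[-self.max_itd:]]))
        return int(lags[idx]), alpha

# Example: the right channel is delayed by 12 samples relative to the left channel
rng = np.random.default_rng(2)
est = SmoothedItdEstimator()
left = rng.standard_normal(2048)
right = np.roll(left, 12)
for m in range(4):                                # a few 512-sample frames
    itd, alpha = est.process(left[m*512:(m+1)*512], right[m*512:(m+1)*512])
print("estimated ITD:", itd, "sfm:", round(float(alpha), 2))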
[0071] In many cases, the background noise will provide fairly uncorrelated signals in the left and right channel. This means that there is no ITD detected and the encoder may not transmit an ITD parameter, i.e., basically assuming the ITD to be zero. In the case where the background noise is dominated by a single source (e.g., a fan or some machinery), the ITD present in the background noise may differ from the ITD of the speech. One can assume that in a reasonable scenario the speech level will be significantly higher than the background noise level and that the estimated ITD during speech will be based on the speech signal.

[0072] It is not desirable to have an ITD estimate that varies from frame to frame, but it should still follow any change in the input signal, e.g., if the speaker is moving or if there are several speakers that take turns speaking.

[0073] Low-pass filtering the cross-spectra is one way to smooth the ITD estimation and avoid frequent changes of the estimated ITD. If there is a sudden change in the ITD, the smoothing will introduce a delay in the ITD estimation, thereby allowing a period of time before the ITD estimate has adapted to the new ITD. There will be a tradeoff between having a stable ITD estimate and the speed with which the ITD estimation can follow a change.

[0074] In the case where DTX is used, a decision is made as to whether active encoding or CNG encoding is to be used for the current frame. It is likely that the ITD will differ between active encoding and CNG encoding, and as such, the focus of the embodiments of the disclosed subject matter is to speed up the ITD estimate by an adaptive filtering and update of the cross-spectra (or time domain cross-correlation) estimate for the beginning of a CNG encoding segment. This may be achieved by several techniques as described below.

[0075] Adjusting the low-pass filter coefficient

[0076] In order to speed up the ITD estimation in a transition from active speech encoding to CNG encoding, the low-pass filter coefficient is adjusted at the start of the CNG period. In the example below, the filter coefficient is changed during the cng_itd_lim first frames. The processing depends both on the current frame, m, and the previous frame, m − 1. To clarify this, the notations are complemented with a frame index:

CS_LP_filt(k, m) = (1 − α_1) · CS_LP_filt(k, m − 1) + α_1 · CS_xy(k), if cnt_updated < cng_itd_lim

CS_LP_filt(k, m) = (1 − sfm) · CS_LP_filt(k, m − 1) + sfm · CS_xy(k), otherwise.

[0077] Normally in CNG encoding, the encoded frames are not sent as frequently as for the speech encoding. This is illustrated in Figure 8, which depicts active encoding signal 801 and CNG encoding signal 802. Typically, CNG encoded frames are sent every 8th frame (e.g., SID frames 811 and 812 in CNG encoding signal 802) with nothing transmitted for the 7 frames (e.g., 'speed up interval' 803 in Figure 8) in between the CNG frames.

[0078] If the ITD estimation is run with the same time interval as for active coding and cng_itd_lim is set to '8', it means that only one ITD estimation will be sent to the decoder during the time interval under which the filter coefficients are changed and where one could expect the estimates to be more unstable. The upper threshold A may, for example, be set to '0.8' to ensure that the smoothing over frames is not too weak.
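A coefficient adjustment of this kind during the first CNG frames can be sketched as follows. Because the exact expression for α_1 is implementation specific, the sketch simply ramps the coefficient linearly from the upper threshold A down towards the regular sfm-based coefficient over the cng_itd_lim first CNG frames; this particular ramp and all numeric values are illustrative assumptions, not the formula of the embodiments.

def cng_start_coefficient(sfm, cnt_updated, cng_itd_lim=8, upper_threshold=0.8):
    """Low-pass filter coefficient for cross-spectrum smoothing.

    cnt_updated is the number of CNG frames processed since active encoding ended.
    The coefficient is close to upper_threshold right after the switch to CNG and
    falls back to the regular sfm-based coefficient after cng_itd_lim frames.
    """
    if cnt_updated < cng_itd_lim:
        ramp = (cng_itd_lim - cnt_updated) / cng_itd_lim          # 1.0 -> 0.0
        return min(upper_threshold, sfm + (upper_threshold - sfm) * ramp)
    return sfm                                                    # regular smoothing

def smooth_cross_spectrum(cs_lp, cs_xy, sfm, cnt_updated):
    alpha = cng_start_coefficient(sfm, cnt_updated)
    return (1.0 - alpha) * cs_lp + alpha * cs_xy

# Example: with sfm = 0.2 the coefficient decays 0.8, 0.725, ... and settles at 0.2
for cnt in range(10):
    print(cnt, round(cng_start_coefficient(0.2, cnt), 3))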
However, α_1 may also be set to allow a higher filter coefficient when sfm is exceeding A, i.e.,

α_1 = max(sfm, min(A, sfm + … · (cng_itd_lim − cnt_updated) / cng_itd_lim)).

[0079] Other alternatives for changing the filter coefficient may be to set the coefficient to a constant high value (e.g., 0.8) during the cng_itd_lim first CNG frames or to use another function that would increase the filter coefficient value over this limited time period. The number of frames during which the modified filter coefficient is used, i.e., cng_itd_lim, can also be made adaptive, e.g., allowing a longer period if the sfm values are low.

[0080] In order to avoid triggering the speed up of the cross-spectrum filtering for short bursts of active encoding (see active encoding signal 1001 in Figure 10), a certain length of the active segment may be required to trigger the speed up (e.g., see speed up interval 1003). One example embodiment is to wait with the reset of cnt_updated until a certain number of consecutive active frames has been reached, i.e.,

cnt_updated = 0, if Speech frame and the number of consecutive active frames exceeds a limit,

where this limit may be '8', for example. This procedure is also illustrated in Figure 10, where the short speech burst at the second occurrence 1005 of ACTIVE ENCODING 1001 before the CNG ENCODING 1002 is too short to reset cnt_updated and activate the speed up logic at the interval 1004, which is shown as a 'no speed up here' interval. The benefit of not applying a speed up in this case is that a more stable and long-term ITD estimate is obtained.

[0081] Instead of specifying a time interval cng_itd_lim for which an adapted filter coefficient is applied, there could be a default coefficient, α_default (e.g., being based on sfm), and an adaptive lower threshold, α_threshold. Preferably, the filter coefficient is adapted based on how many frames the cross-correlation estimation has been active and/or how many updates of the estimate can be expected until the estimate is to be used, e.g., used to estimate the ITD (as described in more detail below).

[0082] In this case, the filter coefficient may be determined as follows:

CS_LP_filt(k, m) = (1 − α(m)) · CS_LP_filt(k, m − 1) + α(m) · CS_xy(k),

[0083] where a coefficient down to the adaptive lower threshold α_threshold is allowed. The lower the smoothing coefficient, the more long-term an estimate can be obtained. To ensure the cross-correlation estimate is relevant in the transition between active and inactive coding (e.g., where the estimate should switch from tracking the speech to tracking the background spatial characteristics), further techniques for updating the cross-spectra may be utilized, as described in the following sections.

[0084] Separate the cross spectra filtering between active and CNG encoding

[0085] In some embodiments, improved tracking of the spatial characteristics may be obtained by executing a dedicated cross-correlation estimate that is only updated (e.g., low-pass filtered) during the CNG periods and using this estimate for the ITD estimation in the CNG period. This filter could have a fixed or adaptive filter coefficient. This means that in the beginning of each CNG period, the filter starts with the state from the end of the last CNG period.
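A dedicated background cross-spectrum filter of this kind can be sketched as follows. The class keeps one smoothed cross-spectrum that is updated only in hangover and CNG frames and is left untouched during active coding, so each CNG period starts from the state left by the previous one; the fixed coefficient, the frame-type labels and the inclusion of hangover frames (discussed further below) are illustrative assumptions.

import numpy as np

class BackgroundCrossSpectrum:
    """Cross-spectrum estimate updated only during hangover and CNG frames."""

    def __init__(self, n_bins, alpha=0.5):
        self.alpha = alpha
        self.state = np.zeros(n_bins, dtype=complex)

    def update(self, cs_xy, frame_type):
        # frame_type: 'ACTIVE', 'HANGOVER' or 'CNG' (illustrative labels)
        if frame_type in ('HANGOVER', 'CNG'):
            self.state = (1.0 - self.alpha) * self.state + self.alpha * cs_xy
        return self.state        # the state is returned unchanged in active frames

# Usage: feed the per-frame cross-spectrum together with the DTX frame type
bg = BackgroundCrossSpectrum(n_bins=257)
cs = np.ones(257, dtype=complex)
for ftype in ['ACTIVE', 'ACTIVE', 'HANGOVER', 'CNG', 'CNG']:
    state = bg.update(cs, ftype)
print(abs(state[0]))   # 0.875: three updates with alpha = 0.5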
In many cases the background noise has not changed significantly during an active segment. Even if the background noise has changed, this starting point will not necessarily be worse than starting from a filter state acquired during the active speech segment.

[0086] Reset of cross spectra filtering

[0087] In some embodiments, there may be benefits to resetting the filter state rather than reusing the state of the previous CNG period, especially if some of the active signal spatial characteristics have got into the filter state at the end of the CNG period, where the VAD might not yet have triggered active coding. In other embodiments, it may be beneficial to reset the filter state after a longer segment of active coding (e.g., 20 frames), where it is more likely that the signal characteristics have changed, as opposed to only after a few frames.

[0088] Therefore, in some embodiments, performing such a reset may be conditioned on there being a certain number of active frames between the CNG periods, as it otherwise is more likely that the previous filter state is an appropriate starting point. In any case, it is important that the update of the cross-correlation estimate (e.g., low-pass filtering of the cross-spectra) is not too slow.

[0089] Replace the cross spectra low-pass filter state in the beginning of a CNG period

[0090] In some embodiments, the disclosed subject matter pertains to replacing the state of the cross-spectra low-pass filter with a state that better reflects the background noise at the start of the CNG period. As shown in Figure 11, one way to accomplish this is to take an average of the cross-spectra CS_xy(k) over cng_itd_lim frames and subsequently replace the filter state CS_LP_filt with the average. For example, Figure 11 depicts an active encoding signal 1101 and a CNG encoding signal 1102. Figure 11 further shows an averaging period 1104 that is calculated by the encoder and utilized to determine an average filter state value 1103. As shown in Figure 11, the average filter state value 1103 is used to replace a "regular filtering" portion of the cross-spectra during one or more certain periods (e.g., period 1111) in the CNG encoding signal 1102. This means that for, e.g., the period 1111, the average filter state value 1103 will be used for ITD estimation. In the frame after the replacement, the filter is updated in the regular way. This technique may be represented mathematically by the accumulation

CS_xy_avg_sum(k, m) = CS_xy_avg_sum(k, m − 1) + CS_xy(k), if cnt_updated < cng_itd_lim.

[0091] The low-pass filtering during the active segment will not be affected. This means that it will likely reflect the ITD from the preceding active encoding. One alternative to using an average as described above would be to replace the state of the cross-spectrum low-pass filter with the state of a cross-spectrum filter that is updated only during CNG periods.

[0092] Updates during DTX hangover

[0093] In the VAD used for the DTX system there are measures taken to avoid frequent toggling between active speech and CNG, e.g., the 'hangover' added at the end of a speech segment, where active coding modes are selected although the VAD has indicated no activity (i.e., background noise). It will, however, be impossible to have a perfect detection.
There will be short spurious bursts of active coding during certain types of background noise, as illustrated by the active encoding segments 905-907 in active encoding signal line 901 and the corresponding signaling segments 902-904 in graph 900 of Figure 9. Further, a new speech segment may also start with some toggling of the VAD decision.

[0094] In order to prepare for ITD estimation for the first SID frame (e.g., CN encoding), it is beneficial to initiate the update of the cross-correlation estimate during the hangover period before a CNG period (or potentially another active period) is entered. This is especially beneficial if there has been a reset of the cross-spectra.

[0095] In some embodiments, ITD estimation involves, in the first CN encoding (SID) frame after active coding, replacing the state of a first cross-spectra low-pass filter CS_LP_filt1 with the state of a second low-pass filter CS_LP_filt2, which also filters the cross-spectrum but is only updated during hangover periods and CNG periods. Notably, the first cross-spectra low-pass filter is used for the ITD estimation (e.g., for regular active frames, hangover active frames, and inactive frames).

[0096] A first adaptive low-pass filter coefficient α_hangover(m), which is used for updating the second cross-spectra low-pass filter during hangover periods, may be determined based on how many frames the cross-correlation estimation has been active and how many updates of the estimate can be expected until the estimate is to be used. Similarly, a second adaptive low-pass filter coefficient α_SID(m), which is used for updating the first cross-spectra low-pass filter during CN encoding (SID frame) periods, may also be determined based on how many frames the cross-correlation estimation has been active and how many updates of the estimate can be expected until the estimate is to be used. The update performed in the first SID frame, extending from the hangover period to the CN encoding period, may be accomplished using either α_hangover(m) or α_SID(m).

[0097] In some embodiments, an accumulated expected number of frames, N_exp (until the estimate is to be used), may, for example, be set to '9' in the first hangover region if it is expected that nine (9) hangover frames will be added prior to entering the CN encoding stage. If a reset of the second cross-spectra low-pass filter is conducted in accordance with the condition below, N_exp would be reset to the expected number of hangover frames for the following update period, N_hangover_exp. Otherwise, if a reset of the second cross-spectra low-pass filter is not conducted, N_exp will be increased by the expected number of frames N_hangover_exp after the corresponding segment of active coding.

[0098] Similarly, the accumulated expected number of frames N_exp may be increased by the expected number of frames for the following update period, N_SID_exp, e.g., being '8' if the SID frames are transmitted every 8th frame, but N_SID_exp may also vary over time if there is a variable SID rate.

[0099] It should also be noted that N_hangover_exp and N_SID_exp may not always correspond to the actual number of frames for the update period but instead denote an expected length of those update periods.
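The bookkeeping of the accumulated expected number of frames N_exp described in the two preceding paragraphs can be sketched as follows; the helper names and the example values (nine expected hangover frames, SID transmitted every 8th frame) are illustrative assumptions.

class ExpectedFrameCounter:
    """Tracks N_exp, the expected number of frames until the ITD estimate is used."""

    def __init__(self, n_hangover_exp=9, n_sid_exp=8):
        self.n_hangover_exp = n_hangover_exp
        self.n_sid_exp = n_sid_exp
        self.n_exp = 0

    def on_hangover_start(self, reset_done):
        if reset_done:
            # reset of the second cross-spectrum filter: restart from the hangover expectation
            self.n_exp = self.n_hangover_exp
        else:
            # no reset: accumulate the expected hangover length on top of the history
            self.n_exp += self.n_hangover_exp

    def on_sid_frame(self):
        # a new update period starts: N_exp = N_exp(prior) + N_SID_exp
        self.n_exp += self.n_sid_exp

# Example: reset in the hangover period, then two SID frames
c = ExpectedFrameCounter()
c.on_hangover_start(reset_done=True)
c.on_sid_frame(); c.on_sid_frame()
print(c.n_exp)   # 9 + 8 + 8 = 25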
While a low expected number of frames N_exp typically results in a faster update of the cross-spectra low-pass filters, a larger expected number of frames N_exp should result in a slower update of the cross-spectra low-pass filters, thereby giving a more stable estimate of the cross-correlation and the ITD.

[0100] In some embodiments, another frame counter, N_updated, denotes how many frames have previously been used to update the background cross-correlation estimate. This counter should be reset to '0' when the second cross-spectra low-pass filter is reset, which may be done in the hangover period. The reset may only be conducted when a certain number of active frames (e.g., non-hangover frames), e.g., 20 frames, have passed, in accordance with the section labeled "Reset of cross spectra filtering", i.e., the state of CS_LP_filt2 is reset if cnt_act exceeds a reset limit, where cnt_act is a counter of active (non-hangover) frames and the reset limit may be '20', for example. Further, cnt_act is reset to '0' when CS_LP_filt2 has been reset and during CN encoding. As the VAD is run for each channel individually, there may be hangover for only one of the audio channels of a stereo pair (e.g., only for the left channel of the stereo pair). To trigger an update or trigger a reset of the second cross-spectra low-pass filter, hangover for both channels might be required. This means that the counter cnt_act may still be increased by one when there is hangover for one of the channels but not for the other. However, in other embodiments, an update or reset of the second cross-spectra low-pass filter may be triggered as long as there is hangover for any of the channels.

[0101] During the CN encoding period, when a SID frame is transmitted and a new update period is entered, the expected number of frames is increased by the number of frames expected for the coming update period, i.e.,

N_exp = N_exp(prior) + N_SID_exp,

where N_exp(prior) denotes the accumulated expected number of frames prior to the update, and N_SID_exp denotes the expected number of frames for the upcoming update period. If the hangover or CN encoding period is interrupted by active coding, the accumulated expected number of frames N_exp may be reset to N_updated.

[0102] In some embodiments, the low-pass filter coefficient α_hangover(m) is determined as

α_hangover(m) = max(α_default, min(…, …))

[0103] and

α_SID(m) = max(α_default, min(…, …)),

where the upper rate parameters R_hangover and R_SID may be set to '8', for example. The thresholds and rate parameters may also differ from each other.

[0104] In other embodiments, the rate is dependent on the current number of hangover frames within the hangover period, according to

α_hangover(m) = max(α_default, min(…, …)).

In some embodiments, the number of hangover frames N_hangover may be determined from the number of frames where the VAD for both channels is in a hangover mode, or determined as the average of the number of hangover frames within the hangover period of the channels. The default filter coefficient may be α_default = sfm.
[0102] In some embodiments, the low-pass filter coefficient $\alpha_{hangover}(k)$ is determined by limiting an adaptive term, which gradually decreases as the accumulated expected number of frames $N_{exp}$ grows, between the default filter coefficient $\alpha_{default}$ and a bound given by an upper rate parameter $r_{hangover}$. [0103] The coefficient $\alpha_{SID}(k)$ may be determined in the same way using an upper rate parameter $r_{SID}$. The upper rate parameters $r_{hangover}$ and $r_{SID}$ may be set to '8', for example, although the thresholds and rate parameters may also differ from each other. [0104] In other embodiments, the rate is additionally dependent on the current number of hangover frames within the hangover period. In some embodiments, the number of hangover frames $N_{hangover}$ may be determined from the number of frames where the VAD for both channels is in a hangover mode, or determined as the average of the number of hangover frames within the hangover period of the channels. The default filter coefficient may be $\alpha_{default} = \mathrm{sfm}$, where $\mathrm{sfm}$ is the spectral flatness measure. This filter coefficient may typically be used to update the first cross spectra low-pass filter during active frames (i.e., including the hangover period): $$\hat{C}_{filt1}(k,n) = (1 - \mathrm{sfm}) \cdot \hat{C}_{filt1}(k,n-1) + \mathrm{sfm} \cdot C(k),$$ while for the first CN encoding frame, the first cross spectra low-pass filter state may be set to the state of the second cross spectra low-pass filter: $$\hat{C}_{filt1}(k,n) = \hat{C}_{filt2}(k,n-1).$$ [0105] In some embodiments, the update of the first cross spectra low-pass filter during the CN encoding (SID) frames may be represented as: $$\hat{C}_{filt1}(k,n) = \big(1 - \alpha_{SID}(k)\big) \cdot \hat{C}_{filt1}(k,n-1) + \alpha_{SID}(k) \cdot C(k).$$ [0106] The accumulated expected number of frames $N_{exp}$ is increased by the expected number of frames in the following update period when $\alpha_{SID}(k)$ is determined for the SID frames, but is updated only after the cross-correlation estimate has been used in the SID frames. In some embodiments, the first cross spectra low-pass filter $\hat{C}_{filt1}(k,n)$ may be used for estimating the ITD (e.g., for regular active frames, hangover frames, and inactive frames). [0107] During the hangover, when the first cross spectra low-pass filter is updated using the default filter coefficient $\alpha_{default}$, the second cross spectra low-pass filter is adaptively updated based on $\alpha_{hangover}(k)$ as: $$\hat{C}_{filt2}(k,n) = \big(1 - \alpha_{hangover}(k)\big) \cdot \hat{C}_{filt2}(k,n-1) + \alpha_{hangover}(k) \cdot C(k).$$ [0108] In some embodiments, the second cross spectra low-pass filter may instead be updated using another filter coefficient, for example $\alpha_{default}$, as follows: $$\hat{C}_{filt2}(k,n) = \big(1 - \alpha_{default}\big) \cdot \hat{C}_{filt2}(k,n-1) + \alpha_{default} \cdot C(k).$$ [0109] Figure 12 illustrates an example of one solution for ITD estimation, utilizing two cross-spectra filter states: a first filter state 1204 and a second filter state 1205. Figure 12 further illustrates a potential reset 1206 of the second filter state 1205 and an adaptive, gradually decreasing filter update coefficient 1207 based on how many frames the cross-correlation estimation has been active and how many updates of the estimate can be expected until the estimate is to be used. An important effect here is that already during the hangover period (if present), here seen as the period where the VAD 1201 does not indicate an active signal (being lowered) while there is still active encoding 1202, the second filter state 1205 may be reset 1206 and/or updated to capture recent signal characteristics for the ITD estimate, by being copied 1209 to the first filter state 1204 at the start of the CN encoding period 1203. Also, as indicated both for the filtering using the second filter state 1205 and for the updated filtering using the first filter state 1204, an adaptive, gradually decreasing lower threshold 1207 for the filtering coefficient may be used, based on how many frames the cross-correlation estimation has been active and how many updates of the estimate can be expected until the estimate is to be used. This allows the estimate to better adapt to the recent signal characteristics, while still obtaining a more stable ITD estimate at the point it is to be used. Since the first filter state is used to estimate the ITD during active encoding, it cannot be replaced by the second filter state until the active encoding stops and there is inactive encoding in the CN encoding mode. When the VAD 1201 once again indicates an active signal and the active encoding 1202 is re-enabled, regular filtering is applied.
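A compact way to see how the two filter states interact over the active, hangover and CN encoding phases of Figure 12 is the following sketch. It is a simplified, hypothetical rendering: the coefficients are passed in from the outside, any per-bin dependence of the coefficients is ignored, and updating the second state with the default coefficient during CN encoding is only one of the possibilities mentioned above.

```python
import numpy as np

class TwoStateCrossSpectrum:
    """Hypothetical sketch of the first/second cross-spectra filter states."""

    def __init__(self, num_bins: int):
        self.cs_filt1 = np.zeros(num_bins, dtype=complex)  # used for ITD estimation
        self.cs_filt2 = np.zeros(num_bins, dtype=complex)  # hangover/CNG-only state
        self.first_cng_frame_done = False

    def update(self, cross_spec: np.ndarray, mode: str,
               alpha_default: float, alpha_hangover: float, alpha_sid: float) -> None:
        if mode == "active":
            # Regular active frame: only the first filter is smoothed.
            self.cs_filt1 = (1 - alpha_default) * self.cs_filt1 + alpha_default * cross_spec
            self.first_cng_frame_done = False
        elif mode == "hangover":
            # Hangover: first filter keeps its default update, second filter
            # adapts with the (faster) hangover coefficient.
            self.cs_filt1 = (1 - alpha_default) * self.cs_filt1 + alpha_default * cross_spec
            self.cs_filt2 = (1 - alpha_hangover) * self.cs_filt2 + alpha_hangover * cross_spec
            self.first_cng_frame_done = False
        elif mode == "cng":
            if not self.first_cng_frame_done:
                # First CN encoding (SID) frame: take over the background state.
                self.cs_filt1 = self.cs_filt2.copy()
                self.first_cng_frame_done = True
            # SID-rate update of the first filter during the pause period.
            self.cs_filt1 = (1 - alpha_sid) * self.cs_filt1 + alpha_sid * cross_spec
            # One possible choice for the second state during CN encoding.
            self.cs_filt2 = (1 - alpha_default) * self.cs_filt2 + alpha_default * cross_spec
```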
[0110] Prior to describing operations from the perspective of the encoder, Figure 13 is a block diagram illustrating elements of the encoder 1300 configured to encode audio frames according to the various embodiments herein. Notably, encoder 1300 is capable of performing at least the same functionalities and/or capabilities of encoder 400 in Figure 4. As shown, encoder 1300 may include network interface circuitry 1305 (also referred to as a network interface) configured to provide communications with other devices, entities, functions, and the like. The encoder 1300 may also include processing circuitry 1301 (also referred to as a processor or processor circuits) coupled to the network interface circuitry 1305, and memory circuitry 1303 (also referred to as memory) coupled to the processing circuitry. The memory circuitry 1303 may include computer readable program code that when executed by the processing circuitry 1301 causes the processing circuitry to perform operations according to embodiments disclosed herein (e.g., processes 1600-2300 as depicted in Figures 16-23). [0111] According to other embodiments, processing circuitry 1301 may be defined to include memory so that a separate memory circuit is not required. As discussed herein, operations of the encoder 1300 may be performed by processing circuitry 1301 and/or network interface 1305. For example, processing circuitry 1301 may control network interface 1305 to transmit communications to decoder 500 and/or to receive communications through network interface 1305 from one or more other network nodes/entities/servers such as other encoder nodes, depository servers, etc. Moreover, modules may be stored in memory 1303, and these modules may provide instructions so that when instructions of a module are executed by processing circuitry 1301, processing circuitry 1301 performs respective operations. In some embodiments, ITD estimation engine (IEE) 1320 is a software program and/or module that is stored in memory 1303 and is configured to perform the functionalities described herein. For example, ITD estimation engine 1320 may be utilized to perform the steps described in Figures 16-23 below when executed by processing circuitry 1301. In some embodiments, ITD estimation engine 1320 may also be configured to perform the stereo processing and mixdown and mono/speech audio encoder functions executed by modules 402 and 404 in Figure 4. [0112] Figure 14 is a block diagram illustrating elements of decoder 1400 configured to decode audio frames according to some embodiments of inventive concepts. As shown, decoder 1400 may include network interface circuitry 1405 (also referred to as a network interface) configured to provide communications with other devices/entities/functions/etc. Notably, decoder 1400 is capable of performing at least the same functionalities and/or capabilities of decoder 500 in Figure 5. The decoder 1400 may also include processing circuitry 1401 (also referred to as a processor or processor circuitry) coupled to the network interface circuitry 1405, and memory circuitry 1403 (also referred to as memory) coupled to the processing circuitry. The memory circuitry 1403 may include computer readable program code that when executed by the processing circuitry 1401 causes the processing circuitry to perform operations according to embodiments disclosed herein. [0113] According to other embodiments, processing circuitry 1401 may be defined to include memory so that a separate memory circuit is not required.
As discussed herein, operations of the decoder 1400 may be performed by processing circuitry 1401 and/or network interface 1405. For example, processing circuitry 1401 may control network interface circuitry 1405 to receive communications from encoder 1300. Moreover, modules may be stored in memory 1403, and these modules may provide instructions so that when instructions of a module are executed by processing circuitry 1401, processing circuitry 1401 performs respective operations. [0114] The encoder 1300 and decoder 1400 may be virtualized in some embodiments by distributing the encoder 1300 and/or decoder 1400 across various components. Figure 15 is a block diagram illustrating an example of a virtualization environment 1500 in which functions implemented by some embodiments may be virtualized. In the present context, virtualizing means creating virtual versions of apparatuses or devices, which may include virtualizing hardware platforms, storage devices and networking resources. As used herein, virtualization can be applied to any device described herein, or components thereof, and relates to an implementation in which at least a portion of the functionality is implemented as one or more virtual components. Some or all of the functions described herein may be implemented as virtual components executed by one or more virtual machines (VMs) implemented in one or more virtual environments 1500 hosted by one or more hardware nodes, such as a hardware computing device that operates as a network node, UE, core network node, or host. Further, in embodiments in which the virtual node does not require radio connectivity (e.g., a core network node or host), the node may be entirely virtualized. [0115] Applications 1502 (which may alternatively be called software instances, virtual appliances, network functions, virtual nodes, virtual network functions, etc.) are run in the virtualization environment 1500 to implement some of the features, functions, and/or benefits of some of the embodiments disclosed herein. [0116] Hardware 1504 includes processing circuitry, memory that stores software and/or instructions executable by hardware processing circuitry, and/or other hardware devices as described herein, such as a network interface, input/output interface, and so forth. Software may be executed by the processing circuitry to instantiate one or more virtualization layers 1506 (also referred to as hypervisors or virtual machine monitors (VMMs)), provide VMs 1508A and 1508B (one or more of which may be generally referred to as VMs 1508), and/or perform any of the functions, features and/or benefits described in relation with some embodiments described herein. The virtualization layer 1506 may present a virtual operating platform that appears like networking hardware to the VMs 1508. [0117] The VMs 1508 comprise virtual processing, virtual memory, virtual networking or interfaces, and virtual storage, and may be run by a corresponding virtualization layer 1506. Different embodiments of the instance of a virtual appliance 1502 may be implemented on one or more of VMs 1508, and the implementations may be made in different ways. Virtualization of the hardware is in some contexts referred to as network function virtualization (NFV). NFV may be used to consolidate many network equipment types onto industry standard high volume server hardware, physical switches, and physical storage, which can be located in data centers and customer premises equipment.
[0118] In the context of NFV, a VM 1508 may be a software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine. Each of the VMs 1508, and that part of hardware 1504 that executes that VM, be it hardware dedicated to that VM and/or hardware shared by that VM with others of the VMs, forms a separate virtual network element. Still in the context of NFV, a virtual network function is responsible for handling specific network functions that run in one or more VMs 1508 on top of the hardware 1504 and corresponds to the application 1502. [0119] Hardware 1504 may be implemented in a standalone network node with generic or specific components. Hardware 1504 may implement some functions via virtualization. Alternatively, hardware 1504 may be part of a larger cluster of hardware (e.g., such as in a data center or CPE) where many hardware nodes work together and are managed via management and orchestration 1510, which, among other things, oversees lifecycle management of applications 1502. In some embodiments, hardware 1504 is coupled to one or more radio units that each include one or more transmitters and one or more receivers that may be coupled to one or more antennas. Radio units may communicate directly with other hardware nodes via one or more appropriate network interfaces and may be used in combination with the virtual components to provide a virtual node with radio capabilities, such as a radio access node or a base station. In some embodiments, some signaling can be provided with the use of a control system 1512, which may alternatively be used for communication between hardware nodes and radio units. [0120] Operations of the encoder 1300 (implemented using the structure of the block diagram of Figures 4 and 13) will now be discussed with reference to the flow chart of Figure 16 according to some embodiments of inventive concepts. For example, modules may be stored in memory 1303 of Figure 13, and these modules may provide instructions so that when the instructions of a module are executed by respective communication device processing circuitry 1301, the encoder 1300 performs respective operations of the flow chart. [0121] Figure 16 illustrates operations that an encoder 1300 performs in various embodiments. Referring to Figure 16, in block 1601, the encoder 1300 receives a time domain audio input comprising audio input signals. The audio input signals could be speech, music, or combinations thereof. [0122] In block 1603, the encoder 1300 processes the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters. Various techniques can be used to produce the mono mixdown signal and one or more stereo parameters. For example, the encoder 1300 can perform the processing in the time domain or in the frequency domain. [0123] In blocks 1605-1611, the encoder 1300 encodes the mono mixdown signal (and the one or more stereo parameters). Specifically, in block 1605, the encoder 1300 encodes active content of the mono mixdown signal at a first bit rate until a pause period (e.g., an inactive period) is detected in the audio input signals or the mono mixdown signal. A VAD (e.g., VAD 102) can be used to detect the pause period as described above. [0124] In block 1606, the encoder 1300 is configured to estimate ITD parameters during the encoding of active content based on a low-pass filtering of cross-spectra of the audio input signals or averaging of the cross-spectra.
[0125] In block 1607, the encoder 1300 switches the encoding from the active encoding content to inactive encoding to encode background noise at a second bit rate during the pause period. The second bit rate is typically lower than the first bit rate, as described above. [0126] In block 1609, the encoder 1300 adapts the ITD estimation to the audio input signals faster as compared to when estimating the ITD parameters during the encoding of active content. In some embodiments, adapting the ITD estimation faster comprises speeding up a smoothing of cross-spectra by increasing the low-pass filtering coefficient during a DTX hangover period and/or during a start of the pause period compared to prior to the DTX hangover period and/or the start of the pause period. [0127] In block 1611, the encoder 1300 may be configured to encode the ITD parameters and other stereo parameters periodically during the pause period. [0128] In optional block 1613, the encoder 1300 may be configured to transmit the encoded active content, the encoded background noise, and the encoded ITD parameters towards a decoder. [0129] Figure 17 illustrates an alternative and/or additional embodiment for estimating ITD parameters. In some embodiments as illustrated in Figure 17, in adapting the ITD estimation to the audio input signals faster as compared to when estimating the ITD parameters during the encoding of active content, the encoder 1300 in block 1701, in a first encoding frame after active coding, replaces a state of a first cross spectra low-pass filter $\hat{C}_{filt1}$ with a state of a second low-pass filter $\hat{C}_{filt2}$, which filters the cross spectrum but is only updated during hangover and pause periods. In other embodiments, block 1701 includes speeding up the smoothing of the cross spectra of the audio input signals. [0130] In block 1703, the encoder 1300 starts an update of the second low-pass filter $\hat{C}_{filt2}$ during a DTX hangover period. In some of these embodiments, the encoder 1300 in block 1705 speeds up the update of the state of the second low-pass filter $\hat{C}_{filt2}$ in response to the filtering being slow due to a low spectral flatness measure ($\mathrm{sfm}$). In some embodiments, the encoder 1300 is configured to determine $\hat{C}_{filt2}$ as follows: $$\hat{C}_{filt2}(k,n) = \big(1 - \alpha_{hangover}(k)\big) \cdot \hat{C}_{filt2}(k,n-1) + \alpha_{hangover}(k) \cdot C(k),$$ while $\hat{C}_{filt1}$ is determined as follows: $$\hat{C}_{filt1}(k,n) = \big(1 - \alpha_{SID}(k)\big) \cdot \hat{C}_{filt1}(k,n-1) + \alpha_{SID}(k) \cdot C(k),$$ where $\alpha_{hangover}(k)$ and $\alpha_{SID}(k)$ are low-pass filter coefficients and $C(k)$ is the cross spectrum. [0131] In some embodiments, the encoder 1300 determines $\alpha_{hangover}(k)$ and $\alpha_{SID}(k)$ by limiting adaptive terms, which gradually decrease as the accumulated expected number of frames $N_{exp}$ grows, between the default filter coefficient $\alpha_{default}$ and bounds given by upper rate parameters $r_{hangover}$ and $r_{SID}$, respectively. [0132] In other embodiments, the encoder 1300 determines $\alpha_{hangover}(k)$ and $\alpha_{SID}(k)$ such that the rate additionally depends on the number of hangover frames $N_{hangover}$ within the current hangover period and a further variable.
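Putting the blocks of Figures 16 and 17 together, a skeleton of the per-frame encoder loop might look as follows. All callables (vad, downmix, encode_active, encode_sid, itd_estimator) and the SID interval of 8 frames are placeholders standing in for the corresponding blocks; the sketch only shows where the slow (active) and fast (pause) ITD smoothing and the periodic encoding of the stereo parameters fit into the DTX flow.

```python
from typing import Iterable, Tuple
import numpy as np

SID_INTERVAL = 8  # SID frames transmitted every 8th frame (example value)

def encode_stereo_dtx(frames: Iterable[Tuple[np.ndarray, np.ndarray]],
                      vad, downmix, encode_active, encode_sid,
                      itd_estimator) -> list:
    """Skeleton of the DTX encoder loop; all callables are placeholders."""
    payloads = []
    cng_frame_count = 0
    for left, right in frames:
        mono, stereo_params = downmix(left, right)                     # block 1603
        # Cross spectrum of the current frame (simplified, no windowing).
        cross_spec = np.fft.rfft(left) * np.conj(np.fft.rfft(right))
        if vad(left, right):                                           # active content
            itd = itd_estimator.update(cross_spec, mode="active")      # block 1606
            payloads.append(encode_active(mono, stereo_params, itd))   # block 1605
            cng_frame_count = 0
        else:                                                          # pause period
            itd = itd_estimator.update(cross_spec, mode="cng")         # block 1609
            if cng_frame_count % SID_INTERVAL == 0:
                # Periodic SID frame carrying CN and stereo/ITD parameters.
                payloads.append(encode_sid(mono, stereo_params, itd))  # blocks 1607/1611
            cng_frame_count += 1
    return payloads
```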
[0133] Figure 18 illustrates an embodiment of speeding up the smoothing of cross-spectra using low-pass filtering. Turning to Figure 18, in block 1801, the encoder 1300 may be configured to adjust a low-pass filter coefficient during the DTX hangover period and/or during the start of the pause period. [0134] In some embodiments, the encoder 1300 is configured to adjust the low-pass filter coefficient such that it is set to '1' for the first CNG frames of a pause period, as long as a frame counter $N_{counter}$ does not exceed an upper threshold $N_{thr}$, and to the spectral flatness measure $\mathrm{sfm}$ for active encoding frames, i.e., $$\hat{C}_{filt1}(k,n) = \begin{cases} C(k), & \text{CNG frame and } N_{counter} \le N_{thr},\\ (1-\mathrm{sfm})\cdot\hat{C}_{filt1}(k,n-1) + \mathrm{sfm}\cdot C(k), & \text{Speech frame,}\end{cases}$$ where $C(k)$ is a cross spectrum, $\hat{C}_{filt1}(k,n)$ is a low-pass filtering of the cross-spectrum, a CNG frame is an inactive coding frame, a Speech frame is an active encoding frame, $\mathrm{sfm}$ is a spectral flatness measure, and $N_{thr}$ is an upper threshold. [0135] In some other embodiments, as illustrated in block 1901 of Figure 19, the encoder 1300 speeds up the smoothing of cross-spectra by the low-pass filtering during a start of the pause period by triggering the speed-up of the filtering of the cross-spectra after a number of consecutive active encoding frames has been reached. [0136] In other embodiments, the speeding up can be aided by a dedicated cross-correlation estimate. Turning to Figure 20, in block 2001, the encoder 1300 executes a dedicated cross-correlation estimate that is only updated during the pause periods and/or during DTX hangover frames for the cross spectra, and uses the dedicated cross-correlation estimate for the ITD estimation in the pause period. [0137] In further embodiments, as illustrated in block 2101 of Figure 21, the encoder 1300 speeds up the smoothing of cross-spectra by the low-pass filtering by resetting the cross-spectrum low-pass filter state either prior to any updates in a DTX hangover period or prior to any updates in the pause period. [0138] In yet other embodiments, as illustrated in block 2201 of Figure 22, the encoder 1300 speeds up the smoothing of cross-spectra by the low-pass filtering by replacing a low-pass filter state at the start of a hangover period or at the start of the pause period. [0139] In still further embodiments, as illustrated in block 2301 of Figure 23, the encoder 1300 replaces the low-pass filtering at the start of the pause period by averaging the cross spectra $C(k)$ over a certain number of frames and replacing the filter state $\hat{C}_{filt1}$ with the average of the cross spectra $C(k)$ over that number of frames.
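For the averaging alternative of block 2301 (Figure 23), a minimal sketch is given below: the most recent cross spectra are buffered, and at the start of the pause period the low-pass filter state is replaced by their average. The buffer length of 8 frames is an assumption; the description above only requires the average to be taken over a certain number of frames.

```python
from collections import deque
import numpy as np

class CrossSpectrumAverager:
    """Sketch of replacing the filter state by an average of recent cross spectra."""

    def __init__(self, num_bins: int, avg_frames: int = 8):
        self.buffer = deque(maxlen=avg_frames)   # most recent cross spectra
        self.cs_filt1 = np.zeros(num_bins, dtype=complex)

    def push(self, cross_spec: np.ndarray) -> None:
        self.buffer.append(cross_spec)

    def on_pause_start(self) -> None:
        # Replace the low-pass filter state with the average of the buffered
        # cross spectra instead of relying on the slowly adapting filter.
        if self.buffer:
            self.cs_filt1 = np.mean(np.stack(list(self.buffer)), axis=0)
```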
[0140] Although the computing devices described herein (e.g., encoders, decoders, UEs, network nodes) may include the illustrated combination of hardware components, other embodiments may comprise computing devices with different combinations of components. It is to be understood that these computing devices may comprise any suitable combination of hardware and/or software needed to perform the tasks, features, functions and methods disclosed herein. Determining, calculating, obtaining or similar operations described herein may be performed by processing circuitry, which may process information by, for example, converting the obtained information into other information, comparing the obtained information or converted information to information stored in the network node, and/or performing one or more operations based on the obtained information or converted information, and as a result of said processing making a determination. Moreover, while components are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, computing devices may comprise multiple different physical components that make up a single illustrated component, and functionality may be partitioned between separate components. For example, a communication interface may be configured to include any of the components described herein, and/or the functionality of the components may be partitioned between the processing circuitry and the communication interface. In another example, non-computationally intensive functions of any of such components may be implemented in software or firmware and computationally intensive functions may be implemented in hardware. [0141] In certain embodiments, some or all of the functionality described herein may be provided by processing circuitry executing instructions stored in memory, which in certain embodiments may be a computer program product in the form of a non-transitory computer-readable storage medium. In alternative embodiments, some or all of the functionality may be provided by the processing circuitry without executing instructions stored on a separate or discrete device-readable storage medium, such as in a hard-wired manner. In any of those particular embodiments, whether executing instructions stored on a non-transitory computer-readable storage medium or not, the processing circuitry can be configured to perform the described functionality. The benefits provided by such functionality are not limited to the processing circuitry alone or to other components of the computing device but are enjoyed by the computing device as a whole, and/or by end users and a wireless network generally. EMBODIMENTS 1. A method to adjust an inter-channel time difference, ITD, in an encoder (400, 1608A, 1608B) using a discontinuous transmission, DTX, the method comprising: receiving (1701) a time domain audio input comprising audio input signals; processing (1703) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1705) of active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or mono mixdown signal; switching (1707) the encoding from the active encoding content to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1709) ITD parameters during the pause period based on a low-pass filtering of cross-spectra or averaging of the cross-spectra, wherein estimating the ITD parameters comprises speeding up smoothing of cross-spectra by the low-pass filtering during a DTX hangover period and/or during a start of the pause period; and encoding (1711) the ITD parameters estimated and other stereo parameters periodically during the pause period; and transmitting (1713) the active content encoded, the background noise encoded, and the ITD parameters and other stereo parameters encoded towards a decoder (500, 1508A, 1508B). 2. The method of Embodiment 1, wherein estimating the ITD parameters comprises: in a first encoding frame after active coding, replacing (1801) a state of a first cross spectra low-pass filter $\hat{C}_{filt1}$ with a state of a second low-pass filter $\hat{C}_{filt2}$ which filters the cross spectrum but is only updated during hangover and pause periods.
3. The method of Embodiment 2, further comprising: starting (1803) an update of the second low-pass filter $\hat{C}_{filt2}$ during a DTX hangover period. 4. The method of Embodiment 2, further comprising speeding (1805) up the update of the state of the second low-pass filter $\hat{C}_{filt2}$ responsive to the filtering being slow due to a low spectral flatness measure, $\mathrm{sfm}$. 4. The method of any of Embodiments 2-3, wherein $\hat{C}_{filt2}$ is determined in accordance with $$\hat{C}_{filt2}(k,n) = \big(1 - \alpha_{hangover}(k)\big)\cdot\hat{C}_{filt2}(k,n-1) + \alpha_{hangover}(k)\cdot C(k)$$ and $$\hat{C}_{filt1}(k,n) = \big(1 - \alpha_{SID}(k)\big)\cdot\hat{C}_{filt1}(k,n-1) + \alpha_{SID}(k)\cdot C(k),$$ where $\alpha_{hangover}(k)$ and $\alpha_{SID}(k)$ are adaptive low-pass filter coefficients and $C(k)$ is the cross spectrum. 5. The method of Embodiment 4, wherein $\alpha_{hangover}(k)$ and $\alpha_{SID}(k)$ are determined by limiting adaptive terms, which gradually decrease with the accumulated expected number of frames until the estimate is to be used, between a default filter coefficient $\alpha_{default}$ and bounds given by upper rate parameters $r_{hangover}$ and $r_{SID}$, respectively. 6. The method of Embodiment 4, wherein $\alpha_{hangover}(k)$ and $\alpha_{SID}(k)$ are determined such that the rate additionally depends on the number of hangover frames within the current hangover period and a further variable, where $r_{hangover}$ and $r_{SID}$ are rate parameters and $N_{hangover}$ corresponds to the number of hangover frames. 7. The method of any of Embodiments 1-6, wherein speeding (1901) up smoothing of cross-spectra by the low-pass filtering comprises: adjusting (1901) a low-pass filter coefficient during the DTX hangover period and/or during the start of the pause period. 8. The method of Embodiment 7, where adjusting the low-pass filter coefficient comprises setting the coefficient to '1' for the first CNG frames of a pause period, as long as a frame counter $N_{counter}$ does not exceed an upper threshold $N_{thr}$, and to the spectral flatness measure $\mathrm{sfm}$ for active encoding frames, i.e., $$\hat{C}_{filt1}(k,n) = \begin{cases} C(k), & \text{CNG frame and } N_{counter} \le N_{thr},\\ (1-\mathrm{sfm})\cdot\hat{C}_{filt1}(k,n-1) + \mathrm{sfm}\cdot C(k), & \text{Speech frame,}\end{cases}$$ where $C(k)$ is a cross spectrum, $\hat{C}_{filt1}(k,n)$ is a low-pass filtering of the cross-spectrum, a CNG frame is an inactive coding frame, a Speech frame is an active encoding frame, $\mathrm{sfm}$ is a spectral flatness measure, and $N_{thr}$ is an upper threshold. 9. The method of any of Embodiments 1-8, wherein speeding up smoothing of cross-spectra by the low-pass filtering during a start of the pause period comprises triggering (2001) the speed-up of the filtering of the cross-spectra after a number of consecutive active encoding frames has been reached. 10. The method of any of Embodiments 1-9, further comprising: executing (2101) a dedicated cross-correlation estimate that is only updated during the pause periods and/or during DTX hangover frames for the cross spectra and using the dedicated cross-correlation estimate for the ITD estimation in the pause period. 11. The method of any of Embodiments 1-10, further comprising: resetting (2201) the cross-spectrum low-pass filter state either prior to any updates in a DTX hangover period or prior to any updates in the pause period. 12. The method of any of Embodiments 1-11, further comprising: replacing (2301) a low-pass filter state at the start of a hangover period or at the start of the pause period. 13. The method of Embodiment 12, wherein replacing the low-pass filtering at the start of the pause period comprises averaging (2401) the cross spectra $C(k)$ over a certain number of frames and replacing the filter state $\hat{C}_{filt1}$ with an average of the cross spectra $C(k)$ over that number of frames.
14. A method to adjust at least one stereo parameter in a decoder (500, 1608A, 1608B), the method comprising: receiving (2501) and decoding an encoded mono downmix signal and at least one stereo parameter; determining (2503) the at least one stereo parameter based on an indicator indicating whether or not foreground and background signals are efficiently separated; and synthesizing (2505) stereo signals based on the at least one stereo parameter determined and the mono downmix signal on a frame-by-frame basis. 15. The method of Embodiment 14, wherein the at least one stereo parameter comprises an inter-channel time difference, ITD, and estimating the ITD comprises: responsive to the indicator indicating that foreground and background signals are efficiently separated, obtaining (2601) an ITD used for stereo upmix, $ITD_{syn}$, directly from a target ITD obtained from the encoder in accordance with $ITD_{syn} = ITD_{target}$. 16. The method of Embodiment 15, wherein estimating the ITD further comprises: responsive to the indicator indicating that foreground and background signals are not efficiently separated, gradually fading an ITD used for stereo upmix, $ITD_{syn}$, from the previous ITD towards $ITD_{target}$ in accordance with $$ITD_{syn} = ITD_{prev} + ITD_{step}, \qquad \mathrm{itd\_fade\_counter} \le N_{fade},$$ where $\mathrm{itd\_fade\_counter}$ is a frame counter increased by one for every frame during a pause period in the mono downmix signal, $N_{fade}$ corresponds to a total fade length, $ITD_{prev}$ keeps track of the latest ITD value of the gradual fade towards $ITD_{target}$, and $ITD_{step}$ is set at the beginning of a pause period when the fade starts and is updated whenever a new target ITD is received, in accordance with: $$ITD_{step} = \frac{ITD_{target} - ITD_{prev}}{N_{fade} - \mathrm{itd\_fade\_counter}}.$$ 18. An encoder (400, 1608A, 1608B) adapted to perform operations comprising: receiving (1701) a time domain audio input comprising audio input signals; processing (1703) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1705) of active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or mono mixdown signal; switching (1707) the encoding from the active encoding content to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1709) ITD parameters during the pause period based on a low-pass filtering of cross-spectra or averaging of the cross-spectra, wherein estimating the ITD parameters comprises speeding up smoothing of cross-spectra by the low-pass filtering during a DTX hangover period and/or during a start of the pause period; and encoding (1711) the ITD parameters estimated and other stereo parameters periodically during the pause period; and transmitting (1713) the active content encoded, the background noise encoded, and the ITD parameters and other stereo parameters encoded towards a decoder (500, 1508A, 1508B). 19. The encoder (400, 1608A, 1608B) of Embodiment 18, wherein the encoder (400, 1608A, 1608B) performs according to any of embodiments 2-13.
20. An encoder (400, 1608A, 1608B) comprising: processing circuitry (1401); and memory (1403) coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry cause the encoder (400, 1608A, 1608B) to perform operations comprising: receiving (1701) a time domain audio input comprising audio input signals; processing (1703) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1705) of active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or mono mixdown signal; switching (1707) the encoding from the active encoding content to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1709) ITD parameters during the pause period based on a low-pass filtering of cross-spectra or averaging of the cross-spectra, wherein estimating the ITD parameters comprises speeding up smoothing of cross-spectra by the low-pass filtering during a DTX hangover period and/or during a start of the pause period; and encoding (1711) the ITD parameters estimated and other stereo parameters periodically during the pause period; and transmitting (1713) the active content encoded, the background noise encoded, and the ITD parameters and other stereo parameters encoded towards a decoder (500, 1508A, 1508B). 21. The encoder (400, 1608A, 1608B) of Embodiment 20, wherein the memory includes further instructions that when executed by the processing circuitry cause the encoder (400, 1608A, 1608B) to perform operations according to any of Embodiments 2-13. 22. A computer program comprising program code to be executed by processing circuitry (803) of an encoder (400, 1608A, 1608B), whereby execution of the program code causes the encoder (400, 1608A, 1608B) to perform operations comprising: receiving (1701) a time domain audio input comprising audio input signals; processing (1703) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1705) of active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or mono mixdown signal; switching (1707) the encoding from the active encoding content to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1709) ITD parameters during the pause period based on a low-pass filtering of cross-spectra or averaging of the cross-spectra, wherein estimating the ITD parameters comprises speeding up smoothing of cross-spectra by the low-pass filtering during a DTX hangover period and/or during a start of the pause period; and encoding (1711) the ITD parameters estimated and other stereo parameters periodically during the pause period; and transmitting (1713) the active content encoded, the background noise encoded, and the ITD parameters and other stereo parameters encoded towards a decoder (500, 1508A, 1508B). 23. The computer program of Embodiment 22, comprising further program code whereby execution of the program code causes the encoder (400, 1608A, 1608B) to perform operations according to any of Embodiments 2-13.
24. A computer program product comprising a non-transitory computer readable storage medium having program code, to be executed by processing circuitry (1403) of an encoder (400, 1608A, 1608B), whereby execution of the program code causes the encoder (400, 1608A, 1608B) to perform operations comprising: receiving (1701) a time domain audio input comprising audio input signals; processing (1703) the audio input signals in frames to produce a mono mixdown signal and one or more stereo parameters; encoding the mono mixdown signal on a frame-by-frame basis by: encoding (1705) of active content of the mono mixdown signal at a first bit rate until a pause period is detected in the audio input signals or mono mixdown signal; switching (1707) the encoding from the active encoding content to inactive encoding to encode background noise at a second bit rate during the pause period; estimating (1709) ITD parameters during the pause period based on a low-pass filtering of cross-spectra or averaging of the cross-spectra, wherein estimating the ITD parameters comprises speeding up smoothing of cross-spectra by the low-pass filtering during a DTX hangover period and/or during a start of the pause period; and encoding (1711) the ITD parameters estimated and other stereo parameters periodically during the pause period; and transmitting (1713) the active content encoded, the background noise encoded, and the ITD parameters and other stereo parameters encoded towards a decoder (500, 1508A, 1508B). 25. The computer program product of Embodiment 24, wherein the non-transitory computer readable storage medium has further program code, to be executed by processing circuitry (1403) of an encoder (400, 1608A, 1608B), whereby execution of the program code causes the encoder (400, 1608A, 1608B) to perform operations according to any of Embodiments 2-13. 26. A decoder (500, 1608A, 1608B) adapted to perform operations comprising: receiving (2501) and decoding an encoded mono downmix signal and at least one stereo parameter; determining (2503) the at least one stereo parameter based on an indicator indicating whether or not foreground and background signals are efficiently separated; and synthesizing (2505) stereo signals based on the at least one stereo parameter determined and the mono downmix signal on a frame-by-frame basis. 27. The decoder (500, 1608A, 1608B) of Embodiment 26, wherein the decoder (500, 1608A, 1608B) performs according to any of embodiments 15-16. 28. A decoder (500, 1608A, 1608B) comprising: processing circuitry (1501); and memory (1503) coupled with the processing circuitry, wherein the memory includes instructions that when executed by the processing circuitry cause the decoder (500, 1608A, 1608B) to perform operations comprising: receiving (2501) and decoding an encoded mono downmix signal and at least one stereo parameter; determining (2503) the at least one stereo parameter based on an indicator indicating whether or not foreground and background signals are efficiently separated; and synthesizing (2505) stereo signals based on the at least one stereo parameter determined and the mono downmix signal on a frame-by-frame basis. 29. The decoder (500, 1608A, 1608B) of Embodiment 28, wherein the memory includes further instructions that when executed by the processing circuitry cause the decoder (500, 1608A, 1608B) to perform operations according to any of Embodiments 15-16.
30. A computer program comprising program code to be executed by processing circuitry (1503) of a decoder (500, 1608A, 1608B), whereby execution of the program code causes the decoder (500, 1608A, 1608B) to perform operations comprising: receiving (2501) and decoding an encoded mono downmix signal and at least one stereo parameter; determining (2503) the at least one stereo parameter based on an indicator indicating whether or not foreground and background signals are efficiently separated; and synthesizing (2505) stereo signals based on the at least one stereo parameter determined and the mono downmix signal on a frame-by-frame basis. 31. The computer program of Embodiment 30, comprising further program code whereby execution of the program code causes the decoder (500, 1608A, 1608B) to perform operations according to any of Embodiments 15-16. 32. A computer program product comprising a non-transitory computer readable storage medium having program code, to be executed by processing circuitry (1503) of a decoder (500, 1608A, 1608B), whereby execution of the program code causes the decoder (500, 1608A, 1608B) to perform operations comprising: receiving (2501) and decoding an encoded mono downmix signal and at least one stereo parameter; determining (2503) the at least one stereo parameter based on an indicator indicating whether or not foreground and background signals are efficiently separated; and synthesizing (2505) stereo signals based on the at least one stereo parameter determined and the mono downmix signal on a frame-by-frame basis. 33. The computer program product of Embodiment 32, wherein the non-transitory computer readable storage medium has further program code, to be executed by processing circuitry (1503) of the decoder (500, 1608A, 1608B), whereby execution of the program code causes the decoder (500, 1608A, 1608B) to perform operations according to any of Embodiments 15-16.
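As a closing illustration of the decoder-side behaviour of Embodiments 15 and 16 above, the sketch below fades the synthesis ITD from its previous value towards the target over an assumed total fade length, recomputing the step over the remaining fade length whenever a new target ITD is received. The names, the fade length of 8 frames, and the linear step are assumptions for illustration only.

```python
class ItdFader:
    """Sketch of the decoder-side ITD fade during a pause period (Embodiments 15-16)."""

    def __init__(self, fade_length: int = 8):
        self.fade_length = fade_length   # total fade length in frames (assumed value)
        self.itd_prev = 0.0              # latest ITD value of the ongoing fade
        self.itd_step = 0.0
        self.counter = 0                 # frames elapsed in the current pause period

    def start_pause(self, itd_target: float) -> None:
        self.counter = 0
        self.itd_step = (itd_target - self.itd_prev) / self.fade_length

    def new_target(self, itd_target: float) -> None:
        # Recompute the step over the remaining fade length when a new target arrives.
        remaining = max(self.fade_length - self.counter, 1)
        self.itd_step = (itd_target - self.itd_prev) / remaining

    def next_frame(self, separated: bool, itd_target: float) -> float:
        if separated:
            # Foreground/background efficiently separated: use the target directly.
            self.itd_prev = itd_target
            return itd_target
        if self.counter < self.fade_length:
            # Otherwise fade gradually towards the target.
            self.itd_prev += self.itd_step
            self.counter += 1
        return self.itd_prev
```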