

Title:
PARAMETRIC SPATIAL AUDIO ENCODING
Document Type and Number:
WIPO Patent Application WO/2024/083520
Kind Code:
A1
Abstract:
An apparatus for encoding a parametric spatial audio stream, the apparatus comprising means for performing: obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters.

Inventors:
LAITINEN MIKKO-VILLE (FI)
LAAKSONEN LASSE JUHANI (FI)
PIHLAJAKUJA TAPANI (FI)
RÄMÖ ANSSI SAKARI (FI)
Application Number:
PCT/EP2023/077665
Publication Date:
April 25, 2024
Filing Date:
October 06, 2023
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
G10L19/008; G10L19/18
Domestic Patent References:
WO2022008454A1 (2022-01-13)
WO2017005978A1 (2017-01-12)
Foreign References:
US20060190247A1 (2006-08-24)
GB201619573A (2016-11-18)
FI2017050778W (2017-11-10)
EP1919130A1 (2008-05-07)
EP1919131A1 (2008-05-07)
FI2019050675W (2019-09-20)
GB201811071A (2018-07-05)
GB201913274A (2019-09-13)
Attorney, Agent or Firm:
NOKIA EPO REPRESENTATIVES (FI)
Claims:
CLAIMS:

1. An apparatus for encoding a parametric spatial audio stream, the apparatus comprising means for performing: obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters.

2. The apparatus as claimed in claim 1, wherein the encoding configuration is determined based on at least one of: an allowed number of bits for encoding the first version of parameters of the parametric spatial audio stream; a bitrate for encoding the first version of parameters of the parametric spatial audio stream; and a bitrate for encoding the parametric spatial audio stream.

3. The apparatus as claimed in any of claims 1 or 2, wherein the parameters comprise: at least one directional value; and at least one energy ratio value, for at least one temporal-interval of each frequency band of a frame of the at least one audio signal.

4.
The apparatus as claimed in any of claims 1 to 3, wherein the at least one further version of the parameters represents at least one of: a different mapping of frequency bands compared to the first version of the parameters; a different temporal-interval compared to the first version of the parameters; and a different quantization of values compared to the first version of the parameters.

5. The apparatus as claimed in claim 4, wherein the at least one further version of the parameters comprises parameters for a sub-set of the frequency bands of the first version of the parameters.

6. The apparatus as claimed in any of claims 4 or 5, wherein the at least one further version of the parameters comprises parameters for a sub-time-interval of the time-interval defined by the first version of the parameters.

7. The apparatus as claimed in any of claims 1 to 6, wherein the means is further for performing obtaining an encoded first version of the at least one audio signal based on a further encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the at least one audio signal as part of the encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the at least one audio signal based on the difference between a decoded version of the encoded first version of the at least one audio signal and the first version of the at least one audio signal; obtaining an encoded at least one further version of the at least one audio signal; and controlling the output of the encoded first version of the at least one audio signal and the at least one further encoded version of the at least one audio signal.

8.
The apparatus as claimed in any of claims 1 to 7, wherein the means for performing determining at least one further version of the parameters based on the difference between the first version of the parameters and the parameters associated with the at least one audio signal is further for performing: determining a difference between the first version of the parameters and the parameters associated with the at least one audio signal; and mapping the difference within a defined range.

9. The apparatus as claimed in any of claims 1 to 8, wherein the means is further for performing: generating an indicator configured to indicate a time instant and the length of at least one further version of the parameter; and controlling the output of the indicator.

10. An apparatus for decoding a parametric spatial audio stream, the apparatus comprising means for performing: obtaining an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decoding a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and in a first mode of operation, controlling the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the parameters; decoding the encoded at least one further version of the parameters; and combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and controlling the output of the reconstructed version of the parameters.

11.
The apparatus as claimed in claim 10, wherein the encoding configuration is determined based on at least one of: an allowed number of bits for encoding the first version of parameters of the parametric spatial audio stream; a bitrate for encoding the parametric spatial audio stream; and a bitrate for encoding the first version of parameters of the parametric spatial audio stream.

12. The apparatus as claimed in any of claims 10 or 11, wherein the parameters comprise: at least one directional value; and at least one energy ratio value, for at least one temporal-interval of each frequency band of a frame of the at least one audio signal.

13. The apparatus as claimed in any of claims 10 to 12, wherein the at least one further version of the parameters represents at least one of: a different mapping of frequency bands compared to the first version of the parameters; a different temporal-interval compared to the first version of the parameters; and a different quantization of values compared to the first version of the parameters.

14. The apparatus as claimed in claim 13, wherein the at least one further version of the parameters comprises parameters for a sub-set of the frequency bands of the first version of the parameters.

15. The apparatus as claimed in any of claims 13 or 14, wherein the at least one further version of the parameters comprises parameters for a sub-time-interval of the time-interval defined by the first version of the parameters.

16. The apparatus as claimed in any of claims 10 to 15, wherein the means is further for performing obtaining an encoded first version of the at least one audio signal; and in a first mode of operation: decoding a first version of the audio signal from the encoded first version of the at least one audio signal; and controlling the output of the decoded first version of the at least one audio signal as part of the encoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the at least one audio signal; decoding the encoded at least one further version of the at least one audio signal; combining the decoded first version of the at least one audio signal and the decoded at least one further version of the at least one audio signal to generate a reconstructed version of the at least one audio signal; and controlling the output of the reconstructed version of the at least one audio signal.

17. The apparatus as claimed in any of claims 10 to 16, wherein the means for performing combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters is further for performing: mapping the decoded at least one further version of the parameters to a defined original range; and combining the mapped decoded at least one further version of the parameters and the decoded first version of the parameters.

18. The apparatus as claimed in any of claims 10 to 17, wherein the means is further for: obtaining an indicator configured to indicate a time instant and the length of the encoded at least one further version of the parameter; and aligning the decoded at least one further version of the parameter to the decoded first version of the parameter based on the indicator.

19. A method for encoding a parametric spatial audio stream, the method comprising: obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters.

20.
A method for decoding a parametric spatial audio stream, the method comprising: obtaining an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decoding a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and in a first mode of operation, controlling the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the parameters; decoding the encoded at least one further version of the parameters; and combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and controlling the output of the reconstructed version of the parameters.

Description:
PARAMETRIC SPATIAL AUDIO ENCODING

Field

The present application relates to apparatus and methods for spatial audio representation and encoding, but not exclusively to audio representation for an audio encoder.

Background

Immersive audio codecs are being implemented to support a multitude of operating points, ranging from low-bitrate operation to transparency. An example of such a codec is the Immersive Voice and Audio Services (IVAS) codec, which is being designed to be suitable for use over a communications network such as a 3GPP 4G/5G network, including use in immersive services such as immersive voice and audio for virtual reality (VR). This audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is furthermore expected to support channel-based and scene-based audio inputs, including spatial information about the sound field and sound sources. The codec is also expected to operate with low latency to enable conversational services, and to support high error robustness under various transmission conditions.

Metadata-assisted spatial audio (MASA) is one input format proposed for IVAS. It uses one or more audio signals together with corresponding spatial metadata. The spatial metadata comprises parameters which define the spatial aspects of the audio signals and which may contain, for example, directions and direct-to-total energy ratios in frequency bands. The MASA stream can, for example, be obtained by capturing spatial audio with the microphones of a suitable capture device: a mobile device comprising multiple microphones may be configured to capture microphone signals from which the set of spatial metadata can be estimated. The MASA stream can also be obtained from other sources, such as dedicated spatial audio microphones (for example, Ambisonics microphones), studio mixes (for example, a 5.1 audio channel mix), or other content by means of a suitable format conversion.
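The MASA-style metadata described above can be pictured as a simple per-frame structure. The following Python sketch is purely illustrative: the field names and the 4-subframe-by-24-band grid are assumptions for this example, not the normative MASA format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MasaFrameMetadata:
    """Illustrative MASA-style spatial metadata for one audio frame.

    One direction (azimuth/elevation) and one direct-to-total energy
    ratio is stored per time-frequency tile, i.e. per (subframe, band).
    """
    azimuth_deg: np.ndarray       # shape (subframes, bands), degrees
    elevation_deg: np.ndarray     # shape (subframes, bands), degrees
    direct_to_total: np.ndarray   # shape (subframes, bands), in [0, 1]


# Example grid: 4 temporal subframes x 24 frequency bands
# (an assumed configuration, chosen only for illustration).
meta = MasaFrameMetadata(
    azimuth_deg=np.zeros((4, 24)),
    elevation_deg=np.zeros((4, 24)),
    direct_to_total=np.ones((4, 24)),
)
```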
Summary

According to a first aspect there is provided a method for encoding a parametric spatial audio stream, the method comprising: obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters. The encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of parameters of the parametric spatial audio stream; a bitrate for encoding the first version of parameters of the parametric spatial audio stream; and a bitrate for encoding the parametric spatial audio stream. The parameters may comprise: at least one directional value; and at least one energy ratio value, for at least one temporal-interval of each frequency band of a frame of the at least one audio signal. The at least one further version of the parameters may represent at least one of: a different mapping of frequency bands compared to the first version of the parameters; a different temporal-interval compared to the first version of the parameters; and a different quantization of values compared to the first version of the parameters.
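The two-mode operation of the first aspect can be sketched in a few lines. This is a minimal Python sketch assuming simple uniform quantization; the function names and step sizes are hypothetical and stand in for whatever parameter quantizers the codec actually uses.

```python
import numpy as np


def quantize(values, step):
    # Uniform quantization; a stand-in for the codec's actual
    # parameter quantizers, which the application does not fix.
    return np.round(values / step) * step


def encode_parameters(params, base_step, refine_step, second_mode):
    """Encode the first (base) version of the parameters; in the second
    mode, also encode a further version derived from the difference
    between the original parameters and the decoded base version."""
    base = quantize(params, base_step)        # encoded first version
    if not second_mode:
        return base, None                     # first mode: base layer only
    residual = params - base                  # difference vs decoded base
    refinement = quantize(residual, refine_step)
    return base, refinement
```

A decoder reconstructs the parameters as `base + refinement`, so the further version only has to cover the (small) quantization error of the base layer rather than the full parameter range.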
The at least one further version of the parameters may comprise parameters for a sub-set of the frequency bands of the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-time-interval of the time-interval defined by the first version of the parameters. The method may further comprise obtaining an encoded first version of the at least one audio signal based on a further encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the at least one audio signal as part of the encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the at least one audio signal based on the difference between a decoded version of the encoded first version of the at least one audio signal and the first version of the at least one audio signal; obtaining an encoded at least one further version of the at least one audio signal; and controlling the output of the encoded first version of the at least one audio signal and the at least one further encoded version of the at least one audio signal. The further encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of the at least one audio signal; a bitrate for encoding the parametric spatial audio stream; and a bitrate for encoding the first version of the at least one audio signal. Determining at least one further version of the at least one audio signal based on the difference between the encoded first version of the at least one audio signal and the at least one audio signal may comprise subtracting a substantially time-aligned first version of the at least one audio signal from the at least one audio signal to generate the at least one further version of the at least one audio signal.
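The time-aligned subtraction used to form the further version of the audio signal can be sketched as follows. This assumes the decoded base signal lags the original by a known number of samples; the `delay` parameter is an assumption used here to model that alignment.

```python
import numpy as np


def audio_residual(original, decoded_base, delay):
    """Subtract a substantially time-aligned decoded first version of the
    audio signal from the original to obtain the residual signal that the
    second mode of operation encodes."""
    # Compensate the codec delay by discarding the first `delay`
    # samples of the decoded base signal before subtracting.
    aligned = decoded_base[delay:delay + len(original)]
    n = min(len(original), len(aligned))
    return original[:n] - aligned[:n]
```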
Determining at least one further version of the parameters based on the difference between the first version of the parameters and the parameters associated with the at least one audio signal may comprise: determining a difference between the first version of the parameters and the parameters associated with the at least one audio signal; and mapping the difference within a defined range. The method may further comprise: generating an indicator configured to indicate a time instant and the length of at least one further version of the parameter; and controlling the output of the indicator. According to a second aspect there is provided a method for decoding a parametric spatial audio stream, the method comprising: obtaining an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decoding a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and in a first mode of operation, controlling the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the parameters; decoding the encoded at least one further version of the parameters; and combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and controlling the output of the reconstructed version of the parameters. 
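The second-aspect decoding mirrors the encoder: output the decoded base parameters directly in the first mode, or add the decoded further version back in the second mode. A minimal sketch (hypothetical names, matching the encoder sketch's assumptions):

```python
import numpy as np


def decode_parameters(base, refinement):
    """Reconstruct the parameters from the decoded first version and,
    when present (second mode), the decoded further version."""
    if refinement is None:       # first mode: base layer only
        return base
    return base + refinement     # second mode: combine base + refinement
```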
The encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of parameters of the parametric spatial audio stream; a bitrate for encoding the parametric spatial audio stream; and a bitrate for encoding the first version of parameters of the parametric spatial audio stream. The parameters may comprise: at least one directional value; and at least one energy ratio value, for at least one temporal-interval of each frequency band of a frame of the at least one audio signal. The at least one further version of the parameters may represent at least one of: a different mapping of frequency bands compared to the first version of the parameters; a different temporal-interval compared to the first version of the parameters; and a different quantization of values compared to the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-set of the frequency bands of the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-time-interval of the time-interval defined by the first version of the parameters. 
The method may further comprise obtaining an encoded first version of the at least one audio signal; and in a first mode of operation: decoding a first version of the audio signal from the encoded first version of the at least one audio signal; and controlling the output of the decoded first version of the at least one audio signal as part of the encoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the at least one audio signal; decoding the encoded at least one further version of the at least one audio signal; combining the decoded first version of the at least one audio signal and the decoded at least one further version of the at least one audio signal to generate a reconstructed version of the at least one audio signal; and controlling the output of the reconstructed version of the at least one audio signal. Combining the decoded first version of the at least one audio signal and the decoded at least one further version of the at least one audio signal to generate a reconstructed version of the at least one audio signal may comprise combining a substantially time-aligned decoded first version of the at least one audio signal with the decoded at least one further version of the at least one audio signal to generate the reconstructed version of the at least one audio signal. Combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters may comprise: mapping the decoded at least one further version of the parameters to a defined original range; and combining the mapped decoded at least one further version of the parameters and the decoded first version of the parameters.
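For directional parameters, "mapping to a defined original range" can be illustrated with azimuth wrap-around. The range [-180, 180) degrees used here is an assumption for the example; the application does not fix a particular range.

```python
def wrap_azimuth(angle_deg):
    # Map any angle into the assumed original range [-180, 180) degrees.
    return (angle_deg + 180.0) % 360.0 - 180.0


def combine_azimuth(base_deg, residual_deg):
    """Combine a decoded base direction with a decoded residual direction
    and map the result back into the defined original range."""
    return wrap_azimuth(base_deg + residual_deg)
```

Wrapping matters because adding a residual near the range boundary (e.g. 170 degrees + 20 degrees) must land back inside the valid range rather than at 190 degrees.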
The method may further comprise: obtaining an indicator configured to indicate a time instant and the length of the encoded at least one further version of the parameter; and aligning the decoded at least one further version of the parameter to the decoded first version of the parameter based on the indicator. According to a third aspect there is provided an apparatus for encoding a parametric spatial audio stream, the apparatus comprising means configured to: obtain a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtain an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, control the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determine at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtain an encoded at least one further version of the parameters; and control the output of the encoded first version of the parameters and the at least one further encoded version of the parameters. The encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of parameters of the parametric spatial audio stream; a bitrate for encoding the first version of parameters of the parametric spatial audio stream; and a bitrate for encoding the parametric spatial audio stream. The parameters may comprise: at least one directional value; and at least one energy ratio value, for at least one temporal-interval of each frequency band of a frame of the at least one audio signal. 
The at least one further version of the parameters may represent at least one of: a different mapping of frequency bands compared to the first version of the parameters; a different temporal-interval compared to the first version of the parameters; and a different quantization of values compared to the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-set of the frequency bands of the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-time-interval of the time-interval defined by the first version of the parameters. The means may further be configured to obtain an encoded first version of the at least one audio signal based on a further encoding configuration; and in a first mode of operation, control the output of the encoded first version of the at least one audio signal as part of the encoded parametric spatial audio stream, and in a second mode of operation: determine at least one further version of the at least one audio signal based on the difference between a decoded version of the encoded first version of the at least one audio signal and the first version of the at least one audio signal; obtain an encoded at least one further version of the at least one audio signal; and control the output of the encoded first version of the at least one audio signal and the at least one further encoded version of the at least one audio signal. The further encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of the at least one audio signal; a bitrate for encoding the parametric spatial audio stream; and a bitrate for encoding the first version of the at least one audio signal. 
The means configured to determine at least one further version of the at least one audio signal based on the difference between the encoded first version of the at least one audio signal and the at least one audio signal may be configured to subtract a substantially time-aligned first version of the at least one audio signal from the at least one audio signal to generate the at least one further version of the at least one audio signal. The means configured to determine at least one further version of the parameters based on the difference between the first version of the parameters and the parameters associated with the at least one audio signal may be configured to: determine a difference between the first version of the parameters and the parameters associated with the at least one audio signal; and map the difference within a defined range. The means may be further configured to: generate an indicator configured to indicate a time instant and the length of at least one further version of the parameter; and control the output of the indicator.
According to a fourth aspect there is provided an apparatus for decoding a parametric spatial audio stream, the apparatus comprising means configured to: obtain an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decode a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and in a first mode of operation, control the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtain an encoded at least one further version of the parameters; decode the encoded at least one further version of the parameters; and combine the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and control the output of the reconstructed version of the parameters. The encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of parameters of the parametric spatial audio stream; a bitrate for encoding the parametric spatial audio stream; and a bitrate for encoding the first version of parameters of the parametric spatial audio stream. The parameters may comprise: at least one directional value; and at least one energy ratio value, for at least one temporal-interval of each frequency band of a frame of the at least one audio signal. 
The at least one further version of the parameters may represent at least one of: a different mapping of frequency bands compared to the first version of the parameters; a different temporal-interval compared to the first version of the parameters; and a different quantization of values compared to the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-set of the frequency bands of the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-time-interval of the time-interval defined by the first version of the parameters. The means may further be configured to obtain an encoded first version of the at least one audio signal; and in a first mode of operation: decode a first version of the audio signal from the encoded first version of the at least one audio signal; and control the output of the decoded first version of the at least one audio signal as part of the encoded parametric spatial audio stream, and in a second mode of operation: obtain an encoded at least one further version of the at least one audio signal; decode the encoded at least one further version of the at least one audio signal; combine the decoded first version of the at least one audio signal and the decoded at least one further version of the at least one audio signal to generate a reconstructed version of the at least one audio signal; and control the output of the reconstructed version of the at least one audio signal. 
The means configured to combine the decoded first version of the at least one audio signal and the decoded at least one further version of the at least one audio signal to generate a reconstructed version of the at least one audio signal may be configured to combine a substantially time-aligned decoded first version of the at least one audio signal with the decoded at least one further version of the at least one audio signal to generate the reconstructed version of the at least one audio signal. The means configured to combine the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters may be configured to: map the decoded at least one further version of the parameters to a defined original range; and combine the mapped decoded at least one further version of the parameters and the decoded first version of the parameters. The means may further be configured to: obtain an indicator configured to indicate a time instant and the length of the encoded at least one further version of the parameter; and align the decoded at least one further version of the parameter to the decoded first version of the parameter based on the indicator.
According to a fifth aspect there is provided an apparatus for encoding a parametric spatial audio stream, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters. The encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of parameters of the parametric spatial audio stream; a bitrate for encoding the first version of parameters of the parametric spatial audio stream; and a bitrate for encoding the parametric spatial audio stream. The parameters may comprise: at least one directional value; and at least one energy ratio value, for at least one temporal-interval of each frequency band of a frame of the at least one audio signal. 
The at least one further version of the parameters may represent at least one of: a different mapping of frequency bands compared to the first version of the parameters; a different temporal-interval compared to the first version of the parameters; and a different quantization of values compared to the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-set of the frequency bands of the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-time-interval of the time-interval defined by the first version of the parameters. The apparatus may further be caused to perform: obtaining an encoded first version of the at least one audio signal based on a further encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the at least one audio signal as part of the encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the at least one audio signal based on the difference between a decoded version of the encoded first version of the at least one audio signal and the first version of the at least one audio signal; obtaining an encoded at least one further version of the at least one audio signal; and controlling the output of the encoded first version of the at least one audio signal and the at least one further encoded version of the at least one audio signal. The further encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of the at least one audio signal; a bitrate for encoding the parametric spatial audio stream; and a bitrate for encoding the first version of the at least one audio signal. 
The apparatus caused to perform determining at least one further version of the at least one audio signal based on the difference between the encoded first version of the at least one audio signal and the at least one audio signal may be caused to perform subtracting a substantially time-aligned first version of the at least one audio signal from the at least one audio signal to generate the at least one further version of the at least one audio signal. The apparatus caused to perform determining at least one further version of the parameters based on the difference between the first version of the parameters and the parameters associated with the at least one audio signal may be caused to perform: determining a difference between the first version of the parameters and the parameters associated with the at least one audio signal; and mapping the difference within a defined range. The apparatus may be further caused to perform: generating an indicator configured to indicate a time instant and the length of at least one further version of the parameter; and controlling the output of the indicator. 
According to a sixth aspect there is provided an apparatus for decoding a parametric spatial audio stream, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: obtaining an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decoding a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and in a first mode of operation, controlling the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the parameters; decoding the encoded at least one further version of the parameters; and combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and controlling the output of the reconstructed version of the parameters. The encoding configuration may be determined based on at least one of: an allowed number of bits for encoding the first version of parameters of the parametric spatial audio stream; a bitrate for encoding the parametric spatial audio stream; and a bitrate for encoding the first version of parameters of the parametric spatial audio stream. The parameters may comprise: at least one directional value; and at least one energy ratio value, for at least one temporal-interval of each frequency band of a frame of the at least one audio signal. 
The at least one further version of the parameters may represent at least one of: a different mapping of frequency bands compared to the first version of the parameters; a different temporal-interval compared to the first version of the parameters; and a different quantization of values compared to the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-set of the frequency bands of the first version of the parameters. The at least one further version of the parameters may comprise parameters for a sub-time-interval of the time-interval defined by the first version of the parameters. The apparatus may be further caused to perform obtaining an encoded first version of the at least one audio signal; and in a first mode of operation: decoding a first version of the audio signal from the encoded first version of the at least one audio signal; and controlling the output of the decoded first version of the at least one audio signal as part of the encoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the at least one audio signal; decoding the encoded at least one further version of the at least one audio signal; combining the decoded first version of the at least one audio signal and the decoded at least one further version of the at least one audio signal to generate a reconstructed version of the at least one audio signal; and controlling the output of the reconstructed version of the at least one audio signal. 
The apparatus caused to perform combining the decoded first version of the at least one audio signal and the decoded at least one further version of the at least one audio signal to generate a reconstructed version of the at least one audio signal may be caused to perform combining a substantially time-aligned decoded first version of the at least one audio signal with the decoded at least one further version of the at least one audio signal to generate the reconstructed version of the at least one audio signal. The apparatus caused to perform combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters may be caused to perform: mapping the decoded at least one further version of the parameters to a defined original range; and combining the mapped decoded at least one further version of the parameters and the decoded first version of the parameters. The apparatus may be further caused to perform: obtaining an indicator configured to indicate a time instant and the length of the encoded at least one further version of the parameter; and aligning the decoded at least one further version of the parameter to the decoded first version of the parameter based on the indicator. 
According to a seventh aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus to perform at least the following: obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters. 
According to an eighth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus to perform at least the following: obtaining an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decoding a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and in a first mode of operation, controlling the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the parameters; decoding the encoded at least one further version of the parameters; and combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and controlling the output of the reconstructed version of the parameters. 
According to a ninth aspect there is provided an apparatus for encoding a parametric spatial audio stream, the apparatus comprising: obtaining circuitry configured to obtain a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining circuitry configured to obtain an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and circuitry configured in a first mode of operation, to control the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: to determine at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; to obtain an encoded at least one further version of the parameters; and to control the output of the encoded first version of the parameters and the at least one further encoded version of the parameters. 
According to a tenth aspect there is provided an apparatus for decoding a parametric spatial audio stream, the apparatus comprising: obtaining circuitry configured to obtain an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decoding circuitry configured to decode a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and circuitry configured in a first mode of operation, to control the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: to obtain an encoded at least one further version of the parameters; to decode the encoded at least one further version of the parameters; and to combine the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and to control the output of the reconstructed version of the parameters. 
According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus, for encoding a parametric spatial audio stream, to perform at least the following: obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters. 
According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus, for decoding a parametric spatial audio stream, to perform at least the following: obtaining an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decoding a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and in a first mode of operation, controlling the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the parameters; decoding the encoded at least one further version of the parameters; and combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and controlling the output of the reconstructed version of the parameters. 
According to a thirteenth aspect there is provided a computer readable medium comprising instructions for causing an apparatus, for encoding a parametric spatial audio stream, to perform at least the following: obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters. 
According to a fourteenth aspect there is provided a computer readable medium comprising instructions for causing an apparatus, for decoding a parametric spatial audio stream, to perform at least the following: obtaining an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; decoding a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and in a first mode of operation, controlling the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the parameters; decoding the encoded at least one further version of the parameters; and combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and controlling the output of the reconstructed version of the parameters. 
According to a fifteenth aspect there is provided an apparatus comprising: means for obtaining a parametric spatial audio stream comprising at least one audio signal and a first version of parameters associated with the at least one audio signal; means for obtaining an encoded first version of the parameters associated with the audio signal based on a determined encoding configuration; and means for in a first mode of operation, controlling the output of the encoded first version of the parameters as part of an encoded parametric spatial audio stream, and in a second mode of operation: determining at least one further version of the parameters based on the difference between a decoded version of the encoded first version of the parameters and the first version of parameters associated with the at least one audio signal; obtaining an encoded at least one further version of the parameters; and controlling the output of the encoded first version of the parameters and the at least one further encoded version of the parameters. 
According to a sixteenth aspect there is provided an apparatus comprising: means for obtaining an encoded parametric spatial audio stream comprising at least one encoded audio signal, and an encoded first version of parameters associated with the encoded at least one audio signal; means for decoding a first version of the parameters associated with the audio signal from the encoded first version of parameters based on an encoding configuration for encoding the first version of parameters of the parametric spatial audio stream; and means for in a first mode of operation, controlling the output of the decoded first version of the parameters as part of a decoded parametric spatial audio stream, and in a second mode of operation: obtaining an encoded at least one further version of the parameters; decoding the encoded at least one further version of the parameters; and combining the decoded first version of the parameters and the decoded at least one further version of the parameters to generate a reconstructed version of the parameters; and controlling the output of the reconstructed version of the parameters. An apparatus comprising means for performing the actions of the method as described above. An apparatus configured to perform the actions of the method as described above. A computer program comprising program instructions for causing a computer to perform the method as described above. A computer program product stored on a medium may cause an apparatus to perform the method as described herein. An electronic device may comprise apparatus as described herein. A chipset may comprise apparatus as described herein. Embodiments of the present application aim to address problems associated with the state of the art. 
Summary of the Figures

For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows a flow diagram of the operation of the example system shown in Figure 1 according to some embodiments;
Figure 3 shows schematically a residual determiner as shown in the system of apparatus as shown in Figure 1 according to some embodiments;
Figure 4 shows a flow diagram of the operation of the example residual determiner shown in Figure 3 according to some embodiments;
Figure 5 shows schematically a metadata residual determiner as shown in the system of apparatus as shown in Figure 3 according to some embodiments;
Figure 6 shows a flow diagram of the operation of the example metadata residual determiner shown in Figure 5 according to some embodiments;
Figure 7 shows schematically a visualization of time-frequency mapping into successive passes according to some embodiments;
Figure 8 shows schematically an audio residual determiner as shown in the system of apparatus as shown in Figure 3 according to some embodiments;
Figure 9 shows a flow diagram of the operation of the example audio residual determiner shown in Figure 8 according to some embodiments;
Figure 10 shows schematically an audio/metadata reconstructor as shown in the system of apparatus as shown in Figure 1 according to some embodiments;
Figure 11 shows a flow diagram of the operation of the example audio/metadata reconstructor shown in Figure 10 according to some embodiments;
Figure 12 shows schematically a metadata reconstructor as shown in the system of apparatus as shown in Figure 10 according to some embodiments;
Figure 13 shows a flow diagram of the operation of the example metadata reconstructor shown in Figure 12 according to some embodiments;
Figure 14 shows schematically an audio reconstructor as shown in the system of
apparatus as shown in Figure 10 according to some embodiments;
Figure 15 shows a flow diagram of the operation of the example audio reconstructor shown in Figure 14 according to some embodiments; and
Figure 16 shows an example device suitable for implementing the apparatus shown in previous figures.

Embodiments of the Application

The following describes in further detail suitable apparatus and possible mechanisms for the encoding of parametric spatial audio streams comprising transport audio signals and spatial metadata. As discussed above, Metadata-Assisted Spatial Audio (MASA) is an example of a parametric spatial audio format and representation suitable as an input format for IVAS. It can be considered an audio representation consisting of ‘N channels + spatial metadata’ (e.g., N = 1 or 2). It is a scene-based audio format particularly suited for spatial audio capture on practical devices, such as smartphones. The idea is to describe the sound scene in terms of time- and frequency-varying sound source directions and, e.g., energy ratios. Sound energy that is not defined (described) by the directions is described as diffuse (coming from all directions). As discussed above, spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and, associated with each direction (or directional value), a direct-to-total ratio, spread coherence, distance, etc.) per time-frequency tile. The spatial metadata may also comprise other parameters, or may be associated with other parameters, which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio and remainder-to-total energy ratio) but which, when combined with the directional parameters, are able to be used to define the characteristics of the audio scene. 
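The per-tile parameter set described above can be illustrated as a simple data structure. The following Python sketch is editorial: the class and field names are assumptions for illustration (distance is omitted for brevity) and do not reflect the normative MASA bit layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DirectionParams:
    """Direction-dependent parameters for one time-frequency tile."""
    azimuth_deg: float
    elevation_deg: float
    direct_to_total: float   # energy ratio in [0, 1]
    spread_coherence: float  # in [0, 1]

@dataclass
class TileMetadata:
    """Per-tile spatial metadata: 1 or 2 directions plus non-directional parameters."""
    directions: List[DirectionParams]
    diffuse_to_total: float
    surround_coherence: float
    remainder_to_total: float

    def ratio_sum(self) -> float:
        """By convention the directional, diffuse and remainder ratios sum to 1."""
        return (sum(d.direct_to_total for d in self.directions)
                + self.diffuse_to_total + self.remainder_to_total)
```

As a usage sketch, a tile with one direction carrying 60% of the energy, 30% diffuse energy and 10% remainder satisfies the sum-to-one convention.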
For example, a reasonable design choice which is able to produce a good quality output is one where the spatial metadata comprises one or more directions for each time-frequency subframe (with direct-to-total ratios, spread coherence, distance values, etc. associated with each direction). As described above, a parametric spatial metadata representation can use multiple concurrent spatial directions. With MASA, the proposed maximum number of concurrent directions is two. For each concurrent direction, there may be associated parameters such as: Direction index; Direct-to-total ratio; Spread coherence; and Distance. In some embodiments other parameters such as Diffuse-to-total energy ratio; Surround coherence; and Remainder-to-total energy ratio are defined. The parametric spatial metadata values are available for each time-frequency tile (the MASA format defines that there are 24 frequency bands and 4 temporal sub-frames in each frame). The frame size in IVAS is 20 ms. Furthermore, MASA currently supports 1 or 2 directions for each time-frequency tile. Example metadata parameters can be: Format descriptor, which defines the MASA format for IVAS; Channel audio format, which defines the following combined fields stored in two bytes; Number of directions, which defines the number of directions described by the spatial metadata (each direction is associated with a set of direction-dependent spatial metadata as described afterwards); Number of channels, which defines the number of transport channels in the format; and Source format, which describes the original format from which MASA was created. Examples of the MASA format spatial metadata parameters which are dependent on the number of directions can be: Direction index, which defines a direction of arrival of the sound at a time-frequency parameter interval. 
(typically this is a spherical representation at about 1-degree accuracy); Direct-to-total energy ratio, which defines an energy ratio for the direction index (i.e., time-frequency subframe); and Spread coherence, which defines a spread of energy for the direction index (i.e., time-frequency subframe). Examples of MASA format spatial metadata parameters which are independent of the number of directions can be: Diffuse-to-total energy ratio, which defines an energy ratio of non-directional sound over surrounding directions; Surround coherence, which defines a coherence of the non-directional sound over the surrounding directions; and Remainder-to-total energy ratio, which defines an energy ratio of the remainder (such as microphone noise) sound energy, to fulfil the requirement that the sum of energy ratios is 1. Furthermore, example spatial metadata frequency bands can be:

Band  LF (Hz)  HF (Hz)  BW (Hz)
 1        0      400      400
 2      400      800      400
 3      800     1200      400
 4     1200     1600      400
 5     1600     2000      400
 6     2000     2400      400
 7     2400     2800      400
 8     2800     3200      400
 9     3200     3600      400
10     3600     4000      400
11     4000     4400      400
12     4400     4800      400
13     4800     5200      400
14     5200     5600      400
15     5600     6000      400
16     6000     6400      400
17     6400     6800      400
18     6800     7200      400
19     7200     7600      400
20     7600     8000      400
21     8000    10000     2000
22    10000    12000     2000
23    12000    16000     4000
24    16000    24000     8000

The MASA stream can be rendered to various outputs, such as multichannel loudspeaker signals (e.g., 5.1) or binaural signals. The IVAS codec is expected to operate at various bit rates ranging from very low bit rates (~13 kbps) to relatively high bit rates (> 500 kbps). Thus, it can be used in various network conditions, both in high-quality networks offering a reliable high bit rate and in less reliable conditions offering significantly lower bit rates (found, e.g., in rural areas). High-quality spatial voice and audio are foreseen as important enablers for many new use cases. 
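The band edges above can be captured as a simple lookup table. The following Python sketch is illustrative (the names `MASA_BANDS` and `band_index` are editorial, not from any IVAS API); it maps a frequency in Hz to its 1-based MASA band index.

```python
# MASA spatial metadata frequency bands as (low_hz, high_hz) pairs,
# transcribed from the table above; the bandwidth is high_hz - low_hz.
MASA_BANDS = [
    (0, 400), (400, 800), (800, 1200), (1200, 1600),
    (1600, 2000), (2000, 2400), (2400, 2800), (2800, 3200),
    (3200, 3600), (3600, 4000), (4000, 4400), (4400, 4800),
    (4800, 5200), (5200, 5600), (5600, 6000), (6000, 6400),
    (6400, 6800), (6800, 7200), (7200, 7600), (7600, 8000),
    (8000, 10000), (10000, 12000), (12000, 16000), (16000, 24000),
]

def band_index(freq_hz: float) -> int:
    """Return the 1-based MASA band containing freq_hz (0 <= freq_hz < 24000)."""
    for i, (lo, hi) in enumerate(MASA_BANDS, start=1):
        if lo <= freq_hz < hi:
            return i
    raise ValueError(f"{freq_hz} Hz is outside the 0-24000 Hz range")
```

Note that the bands are 400 Hz wide up to 8 kHz and then widen, reflecting the lower perceptual frequency resolution at high frequencies.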
One area is industrial applications, where communications and various equipment surveillance and control use cases will utilize spatial audio. For example, one person may oversee multiple pieces of largely autonomous equipment and receive various data, including recordings of the surrounding audio, which may alert the expert to issues and allow further analysis. In dedicated environments network congestion may not be an issue; e.g., a factory may have a private network that has sufficient bandwidth. However, even this can become limiting when, e.g., high-resolution video is needed. On the other hand, there are industrial cases where network capacity is a problem. For example, various machinery may be operated on-site at construction sites, mines, forests, etc., where a dedicated network is not available and operation may be at the cell edge. Thus, audio solutions that tackle difficult network scenarios are relevant also in this context. As discussed above, IVAS may be used also in network conditions where high bit rates cannot be used. As a result, the audio quality is degraded and clear artefacts can be perceived (both spatial and audio signal compression artefacts). While these artefacts may be acceptable for casual usage such as (spatial) calls, they may not be acceptable for professional usage. Thus, for example, as mentioned above, the IVAS codec may be used for various industrial applications and other professional use cases where high-quality audio is required. For example, the audio signals may be used for monitoring equipment, such as determining abnormalities in machine operations, where the quality of the audio determines how accurately any abnormalities can be detected and diagnosed. This may include the audio signal itself (which should not have artefacts caused by low bit rates, use of speech-model based encoding tools, etc.) and spatial information (e.g., from which direction or part of a large piece of equipment certain sounds are emitted). 
In such circumstances there is a need for high-quality audio with reduced or no artefacts, even though the available bit rate is low. However, the quality is impaired by the effects of both the audio signal encoding and the metadata encoding. For example, the metadata encoding may employ methods that reduce the temporal and/or frequency resolution of the metadata, as well as the resolution of the parameter values themselves. These affect the spatial impression of the rendered audio. Furthermore, the need for high-quality audio transmission may not be continuous. In some circumstances it may be sufficient to have non-transparent spatial audio transmission, which provides acceptable quality for a significant time, but at least for a short time it should be possible to obtain significantly higher audio quality. For this reason, there is interest in achieving an improved audio quality for an ‘audio clip’ even in a constrained transmission channel. Furthermore, it would be beneficial if high-quality audio transmission and playback were available in a single service or application (e.g., a private industrial network service) rather than relying on downloading ‘audio clips’, for example from a server. This can make operations faster-responding and simpler, leading to improved efficiency. As such, the embodiments described herein aim to generate and transmit high-quality parametric spatial audio excerpts using a low-bit-rate transmission channel. The concept, as discussed herein in further detail by the following embodiments and examples, relates to encoding a parametric spatial audio stream (i.e., audio signal(s) and spatial metadata), where a method is proposed that enables the transmission of high-quality spatial audio excerpts (or audio clips) in a low-bit-rate transmission channel. 
In some embodiments this is achieved by first encoding and transmitting a base version of the parametric spatial audio stream, and then computing residual spatial metadata versions between the original metadata and the encoded base metadata, wherein the residual metadata versions contain different mappings of the frequency bands and/or temporal intervals and/or value quantization based on the metadata encoding applied at the used bit rate. The embodiments further propose encoding and transmitting the residual metadata versions. The embodiments then propose receiving the base and residual versions and reconstructing spatial metadata at the receiving end using the decoded base metadata and the decoded residual metadata versions, where the reconstructed metadata has a higher frequency and/or temporal and/or value resolution than the base metadata. In some embodiments residual encoding is also performed on the audio signal(s) of the spatial audio stream. This residual encoding can contain at least one residual encoding layer. Residual audio encoding can use the same codec for each layer or, alternatively, use at least one separate codec for at least one of the residual encoding layers. In one example, an IVAS encoder is used for each residual layer. The residual is created by subtracting the time-aligned decoded audio signal(s) from the input audio signal(s) at each layer. The embodiments described herein require knowledge of the start time instant and the length of the residual pass. This information can either be signalled, or it can be embedded within the spatial metadata. In such embodiments a standardized codec (such as IVAS) can be employed for the coding of the spatial audio stream. In other words, the embodiments described herein can be implemented without the requirement to make significant codec modifications. Thus, the presented embodiments can be simply deployed in normal networks.
With respect to Figure 1 a schematic view of a system suitable for employing example embodiments is shown. The input to the system shown in Figure 1 is microphone array signals 100. The microphone array signals can be from any suitable configuration or type of microphones. For example, in some embodiments the microphone array signals are generated by a mobile device. The system furthermore can comprise a microphone array frontend 101. The microphone array frontend 101 is configured to receive the microphone array signals 100 and from these microphone array signals generate audio signal(s) 102 and spatial metadata 104. In some embodiments the audio signal(s) 102 and spatial metadata 104 are in the format of a MASA stream. The methods employed by the microphone array frontend 101 can be any suitable method or methods. The microphone array frontend 101 in some embodiments is configured to implement an analysis processor functionality configured to generate or determine suitable (spatial) metadata 104 associated with the audio signals, and to implement a suitable transport signal generator functionality to generate the audio signals 102. The analysis processor functionality is thus configured to perform spatial analysis on the input audio signals, yielding suitable spatial metadata 104 in frequency bands. For all of the aforementioned input types, there exist known methods to generate suitable spatial metadata, for example directions and direct-to-total energy ratios (or similar parameters such as diffuseness, i.e., ambient-to-total energy ratios) in frequency bands.
These methods are not detailed herein; however, some examples may comprise performing a suitable time-frequency transform for the input signals, and then in frequency bands, when the input is a mobile phone microphone array, estimating delay values between microphone pairs that maximize the inter-microphone correlation, formulating the corresponding direction value from that delay (as described in GB Patent Application Number 1619573.7 and PCT Patent Application Number PCT/FI2017/050778), and formulating a ratio parameter based on the correlation value. The direct-to-total energy ratio parameter for multi-channel captured microphone array signals can be estimated based on the normalized cross-correlation parameter c(k, n) between a microphone pair at band k, where the value of the cross-correlation parameter lies between -1 and 1. A direct-to-total energy ratio parameter r(k, n) can be determined by comparing the normalized cross-correlation parameter to a diffuse field normalized cross-correlation parameter c_D(k, n). The direct-to-total energy ratio is explained further in PCT publication WO2017/005978, which is incorporated herein by reference. The metadata can be of various forms and in some embodiments comprises spatial metadata and other metadata. A typical parameterization for the spatial metadata is one direction parameter in each frequency band, characterized as an azimuth value θ(k, n) and an elevation value φ(k, n), and an associated direct-to-total energy ratio r(k, n) in each frequency band, where k is the frequency band index and n is the temporal frame index. In some embodiments the parameters generated may differ from frequency band to frequency band. Thus, for example, in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted.
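The comparison of the measured cross-correlation to the diffuse-field value can be sketched as below. The linear normalisation used here is an assumed, simplified form of the comparison described in the cited publications (full correlation maps to ratio 1, diffuse-field correlation or less maps to ratio 0); the function name is hypothetical.

```python
def direct_to_total_ratio(c, c_diff):
    """Map a normalised inter-microphone cross-correlation c (in [-1, 1])
    to a direct-to-total energy ratio in [0, 1].

    c_diff is the cross-correlation expected in a purely diffuse field.
    The linear normalisation below is an assumption: it interpolates
    between the diffuse-field correlation (ratio 0) and full
    correlation (ratio 1), clamping outside that range.
    """
    if c_diff >= 1.0:
        return 0.0
    ratio = (c - c_diff) / (1.0 - c_diff)
    return min(max(ratio, 0.0), 1.0)
```

A fully correlated pair thus yields a ratio of 1 (all direct sound), while correlation at or below the diffuse-field value yields 0 (all ambience).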
A practical example of this may be that for some frequency bands, such as the highest band, some of the parameters are not required for perceptual reasons. In some embodiments, when the audio input is an FOA signal or a B-format microphone signal, the analysis processor functionality can be configured to determine parameters such as an intensity vector, based on which the direction parameter is obtained, and to compare the intensity vector length to the overall sound field energy estimate to determine the ratio parameter. This method is known in the literature as Directional Audio Coding (DirAC). In some embodiments, when the input is an HOA signal, the analysis processor functionality may either take the FOA subset of the signals and use the method above, or divide the HOA signal into multiple sectors, in each of which the method above is utilized. This sector-based method is known in the literature as higher order DirAC (HO-DirAC). In this case, there is more than one simultaneous direction parameter per frequency band. As such, the output of the analysis processor functionality is (spatial) metadata 104 determined in frequency bands. The (spatial) metadata 104 may involve directions and energy ratios in frequency bands but may also have any of the metadata types listed previously. The (spatial) metadata 104 can vary over time and over frequency. The microphone array frontend 101, as described above, is further configured to implement transport signal generator functionality in order to generate suitable audio signals 102 (also known as transport audio signals). The transport signal generator functionality is configured to receive the microphone array signals 100 and generate the audio signals 102. The audio signals 102 may be multi-channel, stereo, binaural or mono audio signals. The generation of audio signals 102 can be implemented using any suitable method, such as summarised below.
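The DirAC-style intensity-vector analysis mentioned above can be sketched for a single frequency band of an FOA signal. This is a minimal sketch under assumed normalisations (SN3D-like channel scaling, time-averaged broadband intensity); real implementations work per time-frequency tile with smoothing.

```python
import math

def dirac_analysis(w, x, y, z):
    """Estimate a direction and a direct-to-total-style ratio from FOA
    channel sample lists (w, x, y, z) for one band, in the spirit of
    DirAC. Normalisations here are simplifying assumptions.
    """
    n = len(w)
    # Time-averaged intensity vector components (w times each dipole).
    ix = sum(wi * xi for wi, xi in zip(w, x)) / n
    iy = sum(wi * yi for wi, yi in zip(w, y)) / n
    iz = sum(wi * zi for wi, zi in zip(w, z)) / n
    # Overall sound field energy estimate.
    energy = sum(wi * wi + xi * xi + yi * yi + zi * zi
                 for wi, xi, yi, zi in zip(w, x, y, z)) / (2.0 * n)
    # Direction from the intensity vector.
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    # Ratio: intensity vector length versus energy, clamped to [0, 1].
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    ratio = min(norm / energy, 1.0) if energy > 0 else 0.0
    return azimuth, elevation, ratio

# A plane wave arriving from the front (+x): w and x coincide, y = z = 0.
s = [1.0, -1.0, 1.0, -1.0]
az, el, ratio = dirac_analysis(s, s, [0.0] * 4, [0.0] * 4)
```

For the plane-wave example the analysis returns azimuth 0, elevation 0 and a ratio of 1, i.e., fully directional sound.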
When the input is microphone array audio signals, the transport signal generator functionality may select a left-right microphone pair and apply suitable processing to the signal pair, such as automatic gain control, microphone noise removal, wind noise removal, and equalization. When the input is an FOA/HOA signal or a B-format microphone signal, the audio signals 102 may be directional beam signals towards left and right directions, such as two opposing cardioid signals. In some embodiments the audio signals 102 are the input microphone array signals, for example in situations where the analysis and synthesis occur at the same device in a single processing step, without intermediate encoding. The number of transport channels can also be any suitable number (rather than one or two channels as discussed in the examples). In some embodiments the system comprises a residual determiner 103. The residual determiner 103 is configured to receive the audio signals 102 and the spatial metadata 104. The residual determiner is configured to operate, in a "normal" mode (or first mode), by outputting or passing the audio signals and the spatial metadata to the encoder 105. In other words, the residual determiner 103 is configured to output as residual audio signals 106 the audio signals 102 and to output as residual spatial metadata 108 the spatial metadata 104. In such embodiments the encoder is configured to generate a continuous spatial audio output based on an encoding configuration, which in some embodiments can be defined by the bit rate at which the codec is operating, e.g., 48 kbps. In some embodiments the encoding configuration is defined by an allowed number of bits for encoding the audio signals and spatial metadata.
However, the residual determiner 103 is configured to (in some embodiments based on a request for high-quality audio for an audio excerpt) determine residual audio signals 106 and residual spatial metadata 108 in a "high-quality" or "enhanced" mode (or second mode). The determination of the residuals in the "high-quality" or "enhanced" mode can in some embodiments be based on the bit rate used by the encoder 105 (and the bitstream output from the encoder can be obtained as a further input). In some embodiments the system can comprise an encoder 105 configured to receive the residual audio signals 106 and residual spatial metadata 108 from the residual determiner 103. The encoder 105 can further be configured to obtain a bitrate input 110, which defines a bitrate limit or target within which the encoder attempts to encode the residual audio signals and the residual spatial metadata. In some embodiments the encoder 105 is an IVAS encoder. The encoder 105 is configured to apply, when operating in the normal (or first) mode, encoding on the residual audio signals 106 (which are the audio signals 102) and the residual spatial metadata 108 (which are the spatial metadata 104) based on the desired bitrate 110. In other words, the encoder operates as a conventional encoder. The encoded audio signals and the metadata (and the bitrate) can then be multiplexed to form a bitstream 112, which is output (and can be passed to the residual determiner 103). The operation of the system with respect to the residual determiner 103, and the effect on the output provided by the encoder 105 when the system is operating in the high-quality or second mode, is described in further detail later on. It would be understood that in some embodiments the encoder 105 is configured to operate without knowledge of which mode it is operating in.
In other words, the encoder 105 is configured to perform the same encoding operations in both modes, where the mode determination is implemented by the residual determiner 103 controlling what the encoder 105 is encoding. The system in some embodiments further comprises a decoder 111 configured to obtain or receive the bitstream 112. The decoder can in some embodiments be an IVAS decoder. The decoder 111 is configured to apply demultiplexing and decoding on the bitstream 112 to generate residual audio signals 116, residual spatial metadata 118 and furthermore in some embodiments the bitrate 114. In some embodiments the decoder 111 is configured to operate "normally", i.e., it does not require any specific modifications. In a similar manner to that described above, the decoder 111 need not specifically be designed or implemented in a way that it determines which mode the system is in; it is the operation of the audio/metadata reconstructor 113 which determines which mode of operation the system is in and performs reconstruction based on this determination. The generated residual audio signals 116, residual spatial metadata 118, and bitrate 114 can in some embodiments be forwarded to the audio/metadata reconstructor 113. In some embodiments the system further comprises an audio/metadata reconstructor 113, which is configured to receive or obtain the generated residual audio signals 116, residual spatial metadata 118, and bitrate 114 and from these determine audio signals 120 and spatial metadata 122. The details of the reconstruction are presented further below. The system furthermore comprises a renderer 115 configured to obtain the audio signals 120 and spatial metadata 122 and from these render a spatial audio output 124. The spatial audio output 124 can be any suitable output format, for example binaural audio signals. The renderer 115 can be any suitable renderer, for example a renderer for a MASA stream.
With respect to Figure 2 is shown an example flow diagram of the system shown in Figure 1 according to some embodiments. As shown by 201, the flow diagram shows obtaining microphone array signals. Then, as shown by 203, audio signals and spatial metadata are determined from the microphone array signals. As shown by 205, the next operation is (in high quality mode) obtaining residual audio signals and residual spatial metadata or (in normal mode) passing the audio signals and spatial metadata. A next operation, as shown by 207, is encoding based on bit rate: (in high quality mode) the residual audio signals and residual spatial metadata or (in normal mode) the audio signals and spatial metadata. As indicated above, the encoding in the first (normal) and second (high-quality) modes of operation can be an effect of the operation of the system as a whole rather than a change in operation of the encoder by itself. The encoded signals can then be transmitted and/or stored as shown by 209. Furthermore the encoded signals can then be received and/or retrieved as shown by 211. The encoded signals are then decoded as shown by 213. The decoded signals are then reconstructed (where needed) based on the bit rate as shown by 215. Finally the output audio signals are rendered as shown by 217. With respect to Figure 3 is shown an example residual determiner 103 according to some embodiments. As described earlier, the residual determiner 103 is configured to receive the audio signals 102 and the spatial metadata 104 and also the bitstream 112. In some embodiments the residual determiner 103 comprises a decoder 301. The decoder 301 is configured to demultiplex and decode the bitstream and generate decoded audio signals 302, decoded spatial metadata 304 and bitrate 310. The residual determiner 103 in some embodiments further comprises an audio residual determiner 303.
The audio residual determiner 303 is configured to receive the decoded audio signals 302, the audio signals 102 and the bitrate 310, and is configured to determine the residual audio signals 106. Furthermore, the residual determiner 103 comprises a metadata residual determiner 305 configured to receive the spatial metadata 104, the decoded spatial metadata 304 and the bitrate 310, and from these generate the residual spatial metadata 108. Figure 4 shows a flow diagram summarising the operations of the residual determiner. As shown by 401, the audio signals, spatial metadata and bitstream are obtained. Then, as shown by 403, the bitstream is decoded to obtain the bit rate, decoded audio signals and decoded spatial metadata. The residual audio signals are then determined based on the decoded audio signals, audio signals and bit rate as shown by 405. Furthermore, the residual spatial metadata is determined based on the decoded spatial metadata, spatial metadata and bit rate as shown by 407. Then the residual audio signals and residual spatial metadata are output as shown by 409. With respect to Figure 5 is shown a schematic view of an example metadata residual determiner 305 according to some embodiments. In the examples described herein the metadata residual determiner 305 operates in the second mode or "high quality mode" by performing "passes" on the audio excerpt or audio clip. In other words, it is configured to process the audio excerpt multiple times in order to obtain the quality improvement. For example, the audio excerpt or clip may be X seconds long (e.g., 5 seconds), resulting in N (sub)frames (e.g., 1000 (sub)frames with 5 ms long (sub)frames). Thus, if there are, e.g., 6 passes, the N (sub)frames are processed 6 times in order to obtain the final high-quality spatial metadata at the audio/metadata reconstructor as shown in Figure 1.
Thus in some embodiments for the first pass, the metadata residual determiner 305 is configured to output the spatial metadata as it was received at the metadata residual determiner 305. The metadata residual determiner 305 can comprise a selector 507. The selector 507 is configured to receive as a first input the spatial metadata 104. The selector 507 is configured, when operating in the normal or first mode, to select the first input, the spatial metadata 104, to be output as the residual spatial metadata 108. Furthermore, as indicated above, the selector 507 can be configured such that when operating in the "high-quality mode" or second mode, in a first pass it selects the first input, the spatial metadata 104, to be output as the residual spatial metadata 108. The selector 507 furthermore comprises a second input and is configured such that when operating in the "high-quality mode" or second mode, in any succeeding pass it selects the second input to be output as the residual spatial metadata 108. In other words, operating in the second mode and for the first pass (or in the first or normal mode), the residual spatial metadata comprises the original spatial metadata, which can be encoded by the encoder normally. This "normally" encoded spatial metadata furthermore serves as a baseline for the following passes. During the first pass, the resulting bitstream is fed back to the residual determiner 103, which decodes it, and the resulting decoded spatial metadata 304 is used for computing the residual spatial metadata for the succeeding pass. In the following examples, the computation is presented for a few example parameters, starting from the direction parameter. In this example embodiment, the direction is handled as azimuth and elevation angles (θ(k, n) and φ(k, n)), where k is the frequency band index, and n the temporal (sub)frame index.
In situations where the direction is handled as a spherical index, it can be converted to azimuth and elevation angles. This can be implemented before the metadata residual determiner 305, and the direction can be converted back to a spherical index after the metadata residual determiner 305. In some embodiments the metadata residual determiner 305 comprises a difference computer 501 or determiner configured to receive the original direction θ_orig(k, n), φ_orig(k, n) and the decoded "baseline" direction θ_base(k, n), φ_base(k, n), and to compute a difference between the two, for example by θ_diff(k, n) = degmodulo(θ_base(k, n) − θ_orig(k, n)), and correspondingly for the elevation, where degmodulo() stands for computing a modulus of the angle (i.e., mapping it in between -180 and 180 degrees). The difference values 500 can then be output from the difference computer 501. The metadata residual determiner 305 can furthermore comprise a value mapper 503. The value mapper 503 is configured to receive or obtain the difference values θ_diff(k, n), φ_diff(k, n) and map these difference values to a suitable range so that the values can be optimally used in the audio/metadata reconstructor 113 to refine the accuracy of the baseline metadata. As the azimuth difference is already between -180 and 180 degrees, it may be directly used as the residual signal in some embodiments, i.e., θ_map(k, n) = θ_diff(k, n). The elevation difference is also between -180 and 180 degrees, so it is limited to between -90 and 90 degrees (it is assumed in this example embodiment that the elevation angle lies between those angles). Thus, the mapped difference values for the direction may, e.g., be obtained by φ_map(k, n) = clamp(φ_diff(k, n), [−90, 90]), where the clamp(⋅,⋅) function means restricting the value into the given range. The mapped difference values 502 θ_map(k, n), φ_map(k, n) can be forwarded to a time/frequency mapper.
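The angle-wrapping and clamping steps for the direction residual can be sketched as below. The sign convention (baseline minus original, so that reconstruction subtracts the residual from the baseline) is an assumption made for illustration; the helper names are hypothetical.

```python
def degmodulo(angle):
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def clamp(value, lo, hi):
    """Restrict value into the range [lo, hi]."""
    return min(max(value, lo), hi)

def direction_residual(az_orig, el_orig, az_base, el_base):
    """Residual between the decoded baseline direction and the original.

    Sign convention (baseline minus original) is assumed here so that
    the receiver can subtract the residual from the baseline. The
    azimuth residual is wrapped; the elevation residual is clamped to
    [-90, 90] degrees, as elevation itself lies in that range.
    """
    az_diff = degmodulo(az_base - az_orig)
    el_diff = clamp(el_base - el_orig, -90.0, 90.0)
    return az_diff, el_diff
```

For example, an original azimuth of 170 degrees and a baseline of -175 degrees differ by only 15 degrees once wrapped, not 345.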
As discussed previously, typically the number of frequency bands and/or subframes in a frame is reduced in the encoder 105 at the lower bit rates. For example, methods presented in UKIPO patent applications 1919130.3 and 1919131.1 may be used by the encoder 105. Regardless of how the merging of the bands or subframes is done, the time/frequency mapper 505 is configured to modify the metadata to be encoded in such a way that a finer time and/or frequency resolution is obtained than the encoder 105 actually provides at a certain bit rate. For example, 5 frequency bands may be used by the encoder 105 (in other words, the original 24 frequency bands of the MASA metadata are merged to 5 coding bands). Thus, in the baseline metadata θ_base(k, n), φ_base(k, n) the values are identical inside each of these 5 coding bands. If the difference values were directly forwarded to the encoder 105, the frequency resolution would never be increased from the 5 coding bands, no matter how many times the difference were sent as a residual. Thus, the quality improvement would quickly saturate. The time/frequency mapper is therefore configured to map different frequency bands to the 5 coding bands in passes. For example, in the first pass of the residual coding (i.e., the second pass altogether), the mapper is configured to set the first 5 frequency bands to the frequency bands corresponding to the coding bands. Then, in the second pass of the residual coding, it sets the next 5 frequency bands to the frequency bands corresponding to the coding bands. In such a manner the metadata residual determiner is able to generate the individual residual data for individual frequency bands, which can then be reconstructed to the original frequency resolution (i.e., 24 bands instead of 5 bands). An example of this is shown in Figure 7. In this example is shown the first pass 701 where baseline coding 711 is implemented on all 24 bands 700.
The second pass 703 then generates and encodes the residual values for bands 1-5 713, the third pass 705 for bands 6-10 715, the fourth pass 707 for bands 11-15 717, the fifth pass 709 for bands 16-20 719, and the sixth pass 710 for bands 21-24 720. Thus, in this example embodiment, coding bands are denoted by b, and each coding band contains frequency bands from k_low(b) to k_high(b). The TF-mapped mapped difference values 504 can be generated by θ_TF(k, n) = θ_map(k_b, n) and φ_TF(k, n) = φ_map(k_b, n), for k_low(b) ≤ k ≤ k_high(b), where k_b is the frequency band whose residual is carried by coding band b during the current pass. As in this example there are 5 coding bands and 24 frequency bands, 5 residual processing passes are needed to obtain individual residual signals for each frequency band (since 5 × 5 = 25, during the last pass the last coding band can, e.g., be set to zero). The TF-mapped mapped difference values 504 θ_TF(k, n), φ_TF(k, n) can be forwarded to the selector 507 as the second input, which during the residual coding passes forwards them to the output as the residual spatial metadata 108. Similar processing may be applied for other potential parameters. However, the value mapper is configured to perform the mapping based on the encoding performed by the encoder and thus may perform different mapping functions for different parameters. As an example, the difference and mapping for a direct-to-total energy ratio r(k, n) is presented herein. The difference is computed first by r_diff(k, n) = r_base(k, n) − r_orig(k, n). Then, the values have to be mapped to suitable values.
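The per-pass placement of 24 metadata bands into 5 coding bands can be sketched as below. The coding-band edges and the simple "pass p carries bands 5p..5p+4" schedule are assumptions chosen to match the example; a real encoder's band merging would dictate both.

```python
# Assumed coding-band edges: 24 metadata bands merged into 5 coding bands.
CODING_BANDS = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 23)]

def tf_map(residual_per_band, residual_pass):
    """Sender side: place the residuals of 5 selected metadata bands into
    the 5 coding bands for one residual pass. All frequency bands inside
    a coding band carry the same value, since the encoder would merge
    them anyway.
    """
    out = [0.0] * 24
    for b, (k_low, k_high) in enumerate(CODING_BANDS):
        src = residual_pass * 5 + b          # band carried by coding band b
        value = residual_per_band[src] if src < 24 else 0.0  # 25th slot unused
        for k in range(k_low, k_high + 1):
            out[k] = value
    return out

def inverse_tf_map(decoded, residual_pass):
    """Receiver side: read one value per coding band and assign it back
    to the metadata band that coding band carried in this pass."""
    out = [0.0] * 24
    for b, (k_low, _) in enumerate(CODING_BANDS):
        dst = residual_pass * 5 + b
        if dst < 24:
            out[dst] = decoded[k_low]
    return out
```

Over five residual passes every one of the 24 bands is carried exactly once, which is how the receiver can rebuild full 24-band resolution from a 5-band channel.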
Then, as a first step, they can be limited to a suitable range, e.g., from -0.2 to 0.2: r_lim(k, n) = clamp(r_diff(k, n), [−0.2, 0.2]). As a simple option, these values could just be uniformly mapped as normal energy ratio values, i.e., from 0 to 1. However, it is also possible to utilize more accurate information from the energy ratio encoding performed by the encoder. For example, the energy ratio quantization steps used by the encoder may not be uniform, so it is possible to directly map the values r_lim(k, n) to the known quantization steps. This can be performed, e.g., by using a look-up table where values of r_lim(k, n) inside some range produce a certain mapped value r_map(k, n). As a result, optimal accuracy may be obtained for the transmission of the residual. Moreover, in some embodiments, the values of the energy ratio may influence the coding of the other parameters. Thus, for example, more accuracy may be used for the coding of the direction the larger the value of the energy ratio (described, e.g., in PCT/FI2019/050675, GB1811071.8, and GB1913274.5). In those embodiments, the mapping may be performed in such a way that, e.g., the smallest and the largest values of the energy ratio are avoided in order not to disturb the coding of the direction residual too much. The mapped energy ratio values r_map(k, n) are output as a part of the mapped difference values and processed by the time/frequency mapper 505 and selector 507 similarly as presented above for the direction parameter. With respect to Figure 6 is shown a flow diagram showing the operations of the example metadata residual determiner 305 according to some embodiments. Thus, as shown by 601, the spatial metadata, decoded spatial metadata and bit rate are obtained. Then difference values between the spatial metadata and decoded spatial metadata are computed or determined as shown by 603.
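The simple uniform option for the energy-ratio residual can be sketched as below. The sign convention (baseline minus original) and the [-0.2, 0.2] range are taken from the example; the uniform linear mapping is the "simple option" from the text, not the look-up-table variant, and the function names are hypothetical.

```python
def ratio_residual_map(r_orig, r_base):
    """Map the energy-ratio residual into [0, 1] for encoding.

    The residual (baseline minus original, a sign convention assumed
    here) is first limited to [-0.2, 0.2] and then uniformly mapped to
    the [0, 1] range of a normal energy ratio. A real implementation
    could instead use a look-up table matched to the encoder's
    (possibly non-uniform) quantisation steps.
    """
    diff = min(max(r_base - r_orig, -0.2), 0.2)
    return (diff + 0.2) / 0.4

def ratio_residual_unmap(mapped):
    """Inverse of the uniform mapping, applied at the receiver."""
    return mapped * 0.4 - 0.2
```

The map and unmap functions are exact inverses within the clamped range, so small ratio errors survive the round trip intact.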
Difference values are then mapped based on bit rate to generate mapped difference values as shown by 605. Then, as shown by 607, the mapped difference values are time/frequency mapped. Finally, the selection of the time/frequency mapped mapped difference values or the original spatial metadata is performed as shown by 609. With respect to Figure 8 is shown the audio residual determiner 303 in further detail according to some embodiments. The audio residual determiner 303 is configured to receive the audio signals 102 and the decoded audio signals 302. The audio residual determiner 303 in some embodiments comprises a time aligner and inverter 801 configured to obtain the audio signals 102 and the decoded audio signals 302, which time-aligns the decoded audio signals 302 with the audio signals 102 in case there is any mismatch. Additionally, the time aligner and inverter 801 is configured to invert the decoded audio signals to generate inverted aligned decoded audio signals 802. The audio residual determiner 303 further comprises a combiner 803 configured to combine the inverted aligned decoded audio signals 802 and the audio signals to generate the residual audio signals. In other words, the decoded audio signals 302 are subtracted from the audio signals 102. It is generally important to maintain the correct alignment between these signals to minimize the energy of the residual audio signals. In this example the subtraction or difference operation is implemented by an inversion and a combination operation. However, in some embodiments the signals are aligned and then subtracted from each other (or the difference between the aligned signals is determined). In some embodiments, as an alternative (or in addition), frequency-dependent phase correction may be applied to the signal in the time aligner and inverter 801. The encoding of the residual audio signals 106 can be implemented in any suitable manner. The encoder 105 can for example be the same system for each input instance.
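The align-and-subtract step for the audio residual can be sketched as below. The integer-sample delay is a simplifying assumption; as the text notes, real codecs may additionally require fractional-delay or frequency-dependent phase correction.

```python
def audio_residual(original, decoded, delay):
    """Subtract the time-aligned decoded signal from the original.

    The decoded signal is assumed to lag the original by an integer
    number of samples (`delay`); the missing tail is zero-padded. Good
    alignment keeps the residual energy low, which is what makes the
    residual cheap to encode.
    """
    aligned = decoded[delay:] + [0.0] * delay
    return [o - a for o, a in zip(original, aligned)]

original = [1.0, 2.0, 3.0, 0.0]
decoded = [0.0, 0.9, 2.1, 3.0]  # hypothetical codec output, one sample late
residual = audio_residual(original, decoded, delay=1)
```

Here the residual is small ([0.1, -0.1, 0.0, 0.0] up to rounding); with a wrong delay of 0 the residual would carry nearly the full signal energy, illustrating why alignment matters.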
In some embodiments various constraints can be implemented, for example a specific bit rate being applied and/or a specific encoding mode being used. In some embodiments a separate encoder implementation can be used for the residual audio signals after the first pass (or lowest layer). For example, the purpose of at least the final encoder instance can be to encode a lossless enhancement. This is typically a variable bit rate layer. With respect to Figure 9 is shown a flow diagram of the example audio residual determiner 303 shown in Figure 8 according to some embodiments. Thus, as shown by 901, audio signals and decoded audio signals are obtained. Then, as shown by 903, the decoded audio signals are aligned to the audio signals and the aligned decoded audio signals are inverted. Then the inverted aligned decoded audio signals are combined with the audio signals to generate the residual audio signals as shown by 905. Finally the residual audio signals are output as shown by 907. With respect to Figure 10 is shown a schematic view of an example audio/metadata reconstructor 113 according to some embodiments. The audio/metadata reconstructor 113 is configured to receive the decoded residual audio signals 116, the bit rate 114 and the decoded residual spatial metadata 118. In some embodiments the audio/metadata reconstructor 113 comprises an audio reconstructor 1001 which is configured to receive the bit rate 114 and the decoded residual audio signals 116. The audio reconstructor 1001 is configured to reconstruct the audio signals 120, which are output. Additionally, the audio/metadata reconstructor 113 comprises a metadata reconstructor 1003 which is configured to receive the bit rate 114 and the decoded residual spatial metadata 118. The metadata reconstructor 1003 is configured to reconstruct the spatial metadata 122, which is also output.
With respect to Figure 11 is shown a flow diagram of operations of the example audio/metadata reconstructor 113 according to some embodiments. For example, in some embodiments, as shown by 1101, the decoded residual audio signals, bit rate and decoded residual spatial metadata are obtained. Then, as shown by 1103, audio signals are generated by reconstruction based on the bit rate and the decoded residual audio signals. Furthermore, as shown by 1105, spatial metadata is generated by reconstruction based on the bit rate and the decoded residual spatial metadata. Then the audio signals and spatial metadata are output as shown by 1107. With respect to Figure 12 is shown a schematic view of an example metadata reconstructor 1003. The metadata reconstructor 1003 is configured to implement an inverse of the metadata residual determiner. Thus, for example, the metadata reconstructor 1003 comprises an inverse selector 1201 which is configured to determine which kind of metadata it is. For the first pass of the second mode (or high quality mode) or for the first mode (normal mode), it is the "baseline" version of the metadata (in other words the actual decoded spatial metadata at the defined bit rate, e.g., 48 kbps), which is output as the baseline spatial metadata 1204 and passed to the reconstructor 1207. For succeeding passes in the second mode, the metadata is the actual residual metadata. The inverse selector 1201 is therefore configured to output the metadata as TF-mapped mapped difference values 1202 to the inverse time/frequency mapper 1203. The metadata reconstructor 1003 can furthermore comprise an inverse time/frequency mapper 1203 configured to receive the TF-mapped mapped difference values 1202 and map the spatial metadata back to the original frequency bands from the frequency bands of the coding.
For example this can be by θ_map(k, n) = θ_TF(k_map(k), n). It should be noted that the mapping k_map(k) is different for different passes (e.g., for the first residual pass the values are set to the first 5 frequency bands, and to the next 5 frequency bands in the next pass, etc.). Moreover, it should be noted that this mapping is dependent on the bitrate 114; in other words, the mapping follows the frequency band merging performed by the encoder. These mapped difference values 1206 are forwarded to an inverse value mapper 1205. The metadata reconstructor 1003 can furthermore comprise an inverse value mapper 1205 which is configured to receive the mapped difference values 1206 and compute the difference values from the mapped values. This is the inverse of what was performed in the residual metadata determiner. For the direction, however, the values were simply set, so the inverse processing can be just setting the values. The metadata reconstructor 1003 can furthermore comprise a reconstructor 1207 which is configured to receive the difference values 1208 and the baseline spatial metadata 1204 values. The reconstructed values are obtained based on these values, for example by subtracting the difference values from the baseline values, as given by θ_rec(k, n) = degmodulo(θ_base(k, n) − θ_diff(k, n)) and φ_rec(k, n) = φ_base(k, n) − φ_diff(k, n). A similar inverse processing is applied for the other parameters, such as the direct-to-total energy ratio. The reconstructed parameters θ_rec(k, n), φ_rec(k, n), r_rec(k, n) are output from the block as the spatial metadata 122, which can then be used by the renderer 115 to render spatial audio. With respect to Figure 13 is shown a flow diagram of operations of the example metadata reconstructor 1003 according to some embodiments. Thus, as shown by 1301, the decoded residual spatial metadata and bit rate are obtained. Then inverse selection is performed (based on mode and pass iteration) as shown by 1303.
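The subtract-from-baseline reconstruction can be sketched as below, repeating a degmodulo helper for self-containment. The sign convention mirrors the assumed residual computation (baseline minus original), so subtracting the residual from the baseline recovers the original direction; the function name is hypothetical.

```python
def degmodulo(angle):
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def reconstruct_direction(az_base, el_base, az_diff, el_diff):
    """Recover the refined direction from decoded baseline and residual.

    The residual is assumed to be baseline minus original, so the
    original is recovered by subtracting the residual from the baseline;
    the azimuth result is wrapped back into the valid angle range.
    """
    az = degmodulo(az_base - az_diff)
    el = el_base - el_diff
    return az, el
```

Using the earlier example values (baseline azimuth -175, residual 15), the wrapped subtraction recovers the original azimuth of 170 degrees.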
Then inverse time frequency mapping is performed to generate mapped difference values as shown by 1305. The mapped difference values are then inverse mapped to generate difference values as shown by 1307. Then the difference values and baseline spatial metadata are used to reconstruct the spatial metadata as shown by 1309. The reconstructed metadata is then output as shown by 1311.

With respect to Figure 14 is shown a schematic view of the audio reconstructor 1001 according to some embodiments. In such embodiments the decoded residual audio signals 116 are obtained from the decoder. At the lowest layer, this corresponds to the initial decoded audio. The audio reconstructor 1001 in some embodiments comprises a time aligner 1401 which is configured to align the decoded residual audio signals. The aligned decoded residual audio signals are then passed to a combiner 1403. The audio reconstructor 1001 furthermore comprises a combiner 1403 configured to combine the aligned decoded residual audio signals (from the current pass or iteration or layer) with the previous (pass/iteration/layer) decoded residual audio signals to generate the audio signals 120. In some embodiments where the same encoder/decoder with the same delay is used for all layers (and if the encoder/decoder does not cause any temporal or phase shifts), the time aligner 1401 does nothing, since the signals are automatically aligned.

With respect to Figure 15 is shown a flow diagram of operations of the example audio reconstructor 1001 according to some embodiments. As shown by 1501 the decoded residual audio signals and the previous layers' audio signals are obtained. Then as shown by 1503 the decoded audio signals are aligned to the previous layer audio signals. Furthermore the aligned decoded residual audio signals are combined with the previous layer audio signals to generate the reconstructed audio signals as shown by 1505. Finally the reconstructed audio signals are output as shown by 1507.
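As an illustration of the reconstruction described with respect to Figures 12 to 15, the following Python sketch combines baseline azimuth values with decoded difference values, and sums a residual audio layer onto the previously accumulated layers. The helper names are hypothetical, and both the degmodulo convention of wrapping to (−180°, 180°] and the plain sample-shift alignment are assumptions rather than details fixed by the text:

```python
def degmodulo(angle_deg):
    """Wrap an angle in degrees to (-180, 180] (assumed convention)."""
    return -((-angle_deg + 180.0) % 360.0 - 180.0)


def reconstruct_azimuth(theta_base, theta_diff, band_map, subframes=4):
    """Combine baseline azimuths with residual difference values.

    theta_base: dict (band, subframe) -> baseline azimuth in degrees
    theta_diff: dict (coding_band, subframe) -> decoded difference value
    band_map:   coding band index -> original band index for this pass
                (e.g. bands 0-4 on the first residual pass, 5-9 on the
                next, following the frequency band merging of the encoder)
    """
    theta_rec = dict(theta_base)  # bands without a residual keep the baseline
    for kc, k in enumerate(band_map):
        for n in range(subframes):
            theta_rec[(k, n)] = degmodulo(
                theta_base[(k, n)] - theta_diff[(kc, n)])
    return theta_rec


def combine_layers(prev_audio, residual_audio, codec_delay=0):
    """Add a time-aligned decoded residual audio layer onto the audio
    accumulated from the previous passes/layers.

    codec_delay is the number of samples to drop from the residual for
    alignment; it is 0 when the same encoder/decoder with the same delay
    is used for all layers (the signals are then already aligned).
    """
    aligned = residual_audio[codec_delay:]
    n = min(len(prev_audio), len(aligned))
    return [prev_audio[i] + aligned[i] for i in range(n)]
```

A similar subtraction (without the angle wrapping) would apply to the elevation and direct-to-total energy ratio parameters.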
The example embodiments above did not consider how the audio/metadata reconstructor knows which frames belong to the ‘base’ pass, and which to ‘residual’ passes. There are many ways to signal this, and the optimal solution depends on the use case. Nevertheless, a few examples are discussed in the following. In the case where the residual determiner and the audio/metadata reconstructor are inside the codec, the information can be transmitted using signalling bits embedded in the bitstream. In the simplest form, it is possible to write a running counter into the bitstream where the ‘base’ frame corresponds to index 0 and ‘residual’ frames run from 1 to P-1, where P is the total number of passes required to code all ‘base’ and ‘residual’ frames for one input frame. This approach is relatively robust as the increments need to be continuous and increasing until P-1. If a wrong index is encountered, then a residual is not constructed for the corresponding bands and the ‘base’ metadata can be used. If the ‘base’ metadata is lost, then the existing frame loss recovery operation should be used. In general, this approach should use at most 5 bits per frame for the MASA format and in the above example, it would be 3 bits. In a very restricted bitrate situation, a one-bit signalling approach may be used. In this case, only a ‘base’ frame is differentiated from ‘residual’ frames and the number of received frames is counted at the reconstructor. For this to work, the order of received residual frames needs to be assumed to be correct and verification can be done only on the pattern of ‘base’ frame appearances. If a one frame delay is allowed on output, then the next assumed ‘base’ frame should first be received and verified to be a ‘base’ frame before the previous reconstructed frame is output.
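The running-counter check described above can be sketched as follows (hypothetical function; P is the total number of passes, so the counter fits in ceil(log2(P)) bits, e.g. 3 bits for the example above and at most 5 bits for the MASA format):

```python
def check_pass_index(counter, expected, total_passes):
    """Validate the running pass counter read from the bitstream.

    Returns a tuple ('base' | 'residual' | 'error', next_expected).
    Index 0 is the 'base' frame; 'residual' frames run 1..P-1 and must
    arrive in increasing order. On a wrong index the residual is not
    constructed for the corresponding bands and the 'base' metadata is
    used instead.
    """
    if counter == 0:
        return 'base', 1 % total_passes
    if counter == expected:
        # wrap back to 0 after the last residual pass (P-1)
        return 'residual', (counter + 1) % total_passes
    return 'error', 0
```

For example, with P = 3 the valid counter sequence per input frame is 0, 1, 2, after which the next ‘base’ frame restarts at 0.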
Alternatively in some embodiments perfect one-bit operation can be implemented by repurposing the bits reserved for transmitting descriptive metadata, as it can be assumed that the descriptive metadata does not differ between ‘base’ and ‘residual’ frames, and thus the bits can be used to send a frame counter. However, this approach requires modifying the bitstream writing and reading such that the correct descriptive metadata is returned for ‘residual’ frames before further metadata decoding happens. Moreover, if no bits are used to transmit descriptive metadata (due to the bitrate), then this approach cannot be used. In the case where the residual determiner and the audio/metadata reconstructor are not inside the codec, as was presented above, embedding the signalling bits in the bitstream is typically not possible. In this case, other approaches are needed. For example in some embodiments, the signalling information is transmitted as side information. This side information can, e.g., be transmitted by means of the RTP (Real-time Transport Protocol) payload header extension mechanism. The used mechanism is applicable to RTP/AVP (the Audio/Visual Profile) and its extensions. The header can have either a one-byte or a two-byte formulation. The one-byte header form allows for data lengths between 1 and 16 bytes (with a maximum rate of 6.4 kbps), while the two-byte header form allows data lengths between 0 and 255 bytes (with a maximum rate of 102 kbps). The RTP packet with an RTP header extension will indicate whether it uses one-byte or two-byte header extensions. It is possible to mix the use of one-byte and two-byte RTP header extensions in one RTP stream; however, each RTP packet shall use either the one-byte or the two-byte formulation. It can be noted that the identifier values (IDs) being used must also remain unique in each media section of the SDP (Session Description Protocol) or unique in the session if a session-level SDP declaration is preferred.
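The maximum rates quoted above follow directly from the data lengths, assuming one header extension per 20 ms audio frame (the frame length is an assumption here, consistent with common speech/audio codec framing):

```python
FRAME_LEN_S = 0.020  # 20 ms audio frame (assumed)

def max_rate_kbps(max_data_bytes):
    """Peak side-information rate if every frame carries a
    full-size header extension payload."""
    return max_data_bytes * 8 / FRAME_LEN_S / 1000.0

# one-byte form: up to 16 data bytes  -> 6.4 kbps
# two-byte form: up to 255 data bytes -> 102 kbps
```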
Furthermore the use of the residual transmission can be determined by obtaining two pieces of information. The information can be provided, e.g., by using two separate parameters. A first parameter, (which can be called) ‘ResidualLayer’, determines whether a frame is a regular frame (i.e., not relating to residual transmission), a (new) base frame (that is part of a residual transmission), or a ‘residual’ pass frame. A second parameter, (which can be called) ‘ResidualFrame’, determines which is the base layer frame on top of which a ‘residual’ pass is performed. This can be, e.g., a local timestamp, such as a frame counter looking into the past. For example, a decoder receives frames, where the corresponding RTP packets have a timestamp. This allows the correct decoding order and the identifying of late and/or dropped frames (packets). However, it is not sufficient to determine the order of frames in the case of ‘residual’ passes. The system also needs to know whether a specific frame's data improves a previous frame (this is given, e.g., by ‘ResidualLayer’) instead of providing data for the next time instance, and which previous frame exactly it improves over (this is given by ‘ResidualFrame’). For example, the system can transmit these parameters in an embedded way as follows (see the use of the length L below). This embedded approach is possible, since ‘ResidualFrame’ is not always needed based on the value of the parameter ‘ResidualLayer’.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  ID   |  L=1  | ResidualLayer | ResidualFrame |    0 (pad)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

In this example, 0xBE 0xDE is a fixed pattern for the one-byte format according to the RTP specification, labelled in the specification as “defined by profile”. This is followed by a length defining the size of the extension in 32-bit chunks (not accounting for the extension header). Finally, each extension element starts with a byte containing an ID and a length (L).
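The embedded element above can be packed as in the following sketch (a hypothetical helper; the ID/length nibbles follow the one-byte RTP header extension form, and for simplicity the zero padding to a 32-bit boundary is assumed to apply to the element shown, whereas in a full implementation it applies to the complete extension):

```python
def pack_residual_element(ext_id, residual_layer, residual_frame=None):
    """Pack ResidualLayer (and optionally ResidualFrame) as a
    one-byte-form RTP header extension element.

    ext_id is the local identifier (1-14). The L nibble is the data
    length minus one, so L=0 carries only ResidualLayer and L=1
    carries both parameters, matching the embedded layout above.
    """
    data = bytes([residual_layer])
    if residual_frame is not None:
        data += bytes([residual_frame])
    element = bytes([(ext_id << 4) | (len(data) - 1)]) + data
    padding = (-len(element)) % 4  # zero-pad to the next 32-bit chunk
    return element + b"\x00" * padding
```

For example, ID=1 with ResidualLayer=2 and ResidualFrame=5 yields the four bytes 0x11 0x02 0x05 0x00, i.e. the ID/L byte, two data bytes, and one pad byte.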
The ID is defined by 4 bits, and it is the local identifier of the element in the range of 1-14. Local identifier value 15 is reserved. The length is the number, minus one, of data bytes of the header extension element. Thus, L=0 indicates one byte of data. The maximum amount of data is accordingly 16 bytes. If necessary, padding is used to fill the chunk. In our example, L=1 gives a size of two bytes. The description for a two-byte header is omitted here for brevity. The maximum rate accommodated by the one-byte header format is entirely sufficient for this use case. Nevertheless, it is possible to use a two-byte header if so desired. For example, SDP Offer/Answer negotiation is used. An example of the signalling in SDP when using IVAS could be:

a=extmap:1 http://3gpp.org/ivas/rtp_hdr_ext.htm#residual_transmit

This example SDP line uses the unique URI with ID=1 to uniquely describe the header extension. In some embodiments the RTP header extensions can be considered individually. For example, suitable implementations can be defined for each parameter. For example, for ‘ResidualLayer’ there can be employed the example URI: http://3gpp.org/ivas/rtp_hdr_ext.htm#residual_layer

 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  ID   |  L=0  | ResidualLayer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

For example, the length shall be 0, which indicates the field length (1 byte). The value of ‘ResidualLayer’ can be any value between 0 and 255. For example, value 0 can be defined as skippable information (it can indicate no residual transmission, or it can be used, e.g., to tell when a residual transmission has ended), value 1 as a ‘base layer’ pass indicating the beginning of a residual transmission for a frame, and any other value as a ‘residual’ pass of said frame. For example, this information can indicate whether a ‘residual’ pass is the 2nd, 3rd, 4th, etc. (There can be a practical limitation on how many ‘residual’ passes are supported, so a practical limit significantly smaller than the value 255 is in place in typical implementations.)
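The value ranges above can be summarized in a small helper (hypothetical; the cap on the number of supported passes is an illustrative choice, not a value fixed by the text):

```python
def interpret_residual_layer(value, max_passes=8):
    """Map a ResidualLayer byte (0-255) to its meaning.

    0  -> skippable / no residual transmission (or it has ended)
    1  -> 'base' layer pass, beginning a residual transmission
    2+ -> 2nd, 3rd, ... 'residual' pass of the same frame, up to a
          practical implementation limit far below 255
    """
    if value == 0:
        return "none"
    if value == 1:
        return "base"
    if value <= max_passes:
        return f"residual-pass-{value}"
    return "invalid"
```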
For ‘ResidualFrame’, we can have:

 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  ID   |  L=0  | ResidualFrame |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

For a ‘residual’ pass frame, the value of ‘ResidualFrame’ indicates a frame count to the reference ‘base’ layer. In some embodiments, the ‘base’ layer does not use ‘ResidualFrame’, and its value is ignored if provided (by error). In some embodiments, the ‘base’ layer can indicate the length of the ‘residual’ pass in frames by providing the corresponding frame count using the ‘ResidualFrame’ extension element. The SDP lines can be, e.g.:

a=extmap:1 http://3gpp.org/ivas/rtp_hdr_ext.htm#residual_layer
a=extmap:2 http://3gpp.org/ivas/rtp_hdr_ext.htm#residual_frame

If at least one frame is lost during the ‘base’ and ‘residual’ passes, the system detects this based on the use of the extension elements (and a frame loss can also be indicated by the RTP depacketizer). The behaviour during a frame loss can be specific to the implementation. If each previous frame is required for correct decoding (i.e., additional layers do not increase error after frame loss) during residual transmission, the additional ‘residual’ passes are not used, and frame error concealment is applied. If the residual transmission is designed such that each additional ‘residual’ pass improves the accuracy, the lost frame is ignored. In some embodiments, this behaviour can be different between the audio and the spatial metadata. If the residual transmission approach can differ per frame, this needs to be indicated by an additional extension element (or a different use of the parameter values provided above). In another embodiment, the signalling information is transmitted using the encoded MASA metadata. As was presented above, in one embodiment, there are 24 frequency bands, which are transmitted using 5 passes of 5 coding bands. In this example, one coding band in one pass is unused (see Figure 4). For residual coding, this coding band is used to store a sync pattern that allows detecting the residual coding pattern.
In addition, for the sync pattern to work efficiently, it should be part of the first frame of the residual coding pattern. This can be achieved, e.g., by sending the last ‘residual’ frame (which contains the sync band) as the first frame of the pattern, followed by the ‘base’ frame and the rest of the ‘residual’ frames. There are various ways in which sync patterns can be created. The following procedure is one example for this use case. Here ‘sync band’ refers to the unused band of the last ‘residual’ pass.

At the encoder side, forming the sync pattern:
- Restrict that ‘residual’ frames may not use energy ratio index value 0 (this is already a desired property for ‘residual’ frames to avoid inaccurate encoding of residual directions)
- Set the energy ratio value on the ‘sync band’ such that it quantizes to index value 0 (in practice, set the energy ratio value as 0)
- Set spread coherence and surround coherence on the ‘sync band’ as value 0
- Set elevation on the ‘sync band’ as 0°
- When the ‘sync band’ is band index 4, obtain the energy ratio index values for bands 0-3 (four bands in total)
- Calculate an even parity bit for each of these four index values separately (index modulo 2)
- Encode these parity bits as direction data of the ‘sync band’, with parity bit value 1 setting the direction as +90° and parity bit value 0 setting the direction as -90°. Pair the parity bit values of bands 0-3 as the direction values of the subframes of the ‘sync band’ respectively
- Encode the metadata of the ‘residual’ frame with the standardized procedure

At the decoder side, inspecting the sync patterns:
- Check for each frame if the energy ratio index on band index 4 is 0
- If it is, assume a ‘sync band’ and test further; otherwise continue with normal operation
- Obtain the energy ratio indices for bands 0-3 and calculate even parity bits for each of them
- Obtain the direction data for the assumed ‘sync band’ and check the sign of each direction on subframes 0-3. A positive azimuth angle indicates bit value 1, a negative azimuth angle indicates bit value 0.
- Compare the bit values calculated from the energy ratio indices with those decoded from the direction values of the ‘sync band’; if they match, then this band is a ‘sync band’ and residual coding operation can be assumed.

Although this approach is somewhat robust against accidental detection, the data is still such that it could, in rare cases, occur accidentally. Residual bands can be sacrificed for the sync information, or even a full frame can be used to signal a start of the residual coding mode or each residual coding pattern.

With respect to Figure 16 is shown an example electronic device which may be used as any of the apparatus parts of the system as described above. The device may be any suitable electronics device or apparatus. For example, in some embodiments the device 1600 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. The device may for example be configured to implement the encoder/analyser part and/or the decoder part as shown in Figure 1 or any functional block as described above. In some embodiments the device 1600 comprises at least one processor or central processing unit 1607. The processor 1607 can be configured to execute various program codes such as the methods such as described herein. In some embodiments the device 1600 comprises at least one memory 1611. In some embodiments the at least one processor 1607 is coupled to the memory 1611. The memory 1611 can be any suitable storage means. In some embodiments the memory 1611 comprises a program code section for storing program codes implementable upon the processor 1607. Furthermore, in some embodiments the memory 1611 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein.
The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1607 whenever needed via the memory-processor coupling. In some embodiments the device 1600 comprises a user interface 1605. The user interface 1605 can be coupled in some embodiments to the processor 1607. In some embodiments the processor 1607 can control the operation of the user interface 1605 and receive inputs from the user interface 1605. In some embodiments the user interface 1605 can enable a user to input commands to the device 1600, for example via a keypad. In some embodiments the user interface 1605 can enable the user to obtain information from the device 1600. For example the user interface 1605 may comprise a display configured to display information from the device 1600 to the user. The user interface 1605 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1600 and further displaying information to the user of the device 1600. In some embodiments the user interface 1605 may be the user interface for communicating. In some embodiments the device 1600 comprises an input/output port 1609. The input/output port 1609 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1607 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling. The transceiver can communicate with further apparatus by any suitable known communications protocol. 
For example in some embodiments the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth®, personal communications services (PCS), ZigBee®, wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof. The transceiver input/output port 1609 may be configured to receive the signals. In some embodiments the device 1600 may be employed as at least part of the synthesis device. The input/output port 1609 may be coupled to headphones (which may be headtracked or non-tracked headphones) or similar and loudspeakers. In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples. Embodiments of the inventions may be practiced in various components such as integrated circuit modules. 
The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication. As used in this application, the term “circuitry” may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation. This definition of circuitry applies to all uses of this term in this application, including in any claims.
As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or a portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device. The term “non-transitory,” as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM). As used herein, “at least one of the following: <a list of two or more elements>” and “at least one of <a list of two or more elements>” and similar wording, where the list of two or more elements are joined by “and” or “or”, mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements. The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.