

Title:
KNOWLEDGE DISTILLATION WITH DOMAIN MISMATCH FOR SPEECH RECOGNITION
Document Type and Number:
WIPO Patent Application WO/2024/086164
Kind Code:
A1
Abstract:
A method (300) includes receiving distillation data (220) including a plurality of out-of-domain training utterances (222). For each particular out-of-domain training utterance of the distillation data, the method includes generating a corresponding augmented out-of-domain training utterance (232), and generating, using a teacher ASR model (170) trained on training data corresponding to a target domain, a pseudo-label (240) corresponding to the corresponding augmented out-of-domain training utterance. The method also includes distilling a student ASR model (210) from the teacher ASR model by training the student ASR model using the corresponding augmented out-of-domain training utterances paired with the corresponding pseudo-labels generated by the teacher ASR model.

Inventors:
YANG TIEN-JU (US)
CHENG YOU-CHI (US)
KUMAR SHANKAR (US)
LICHTARGE JARED (US)
AMID EHSAN (US)
DING YUXIN (US)
MATHEWS RAJIV (US)
CHEN MINGQING (US)
Application Number:
PCT/US2023/035318
Publication Date:
April 25, 2024
Filing Date:
October 17, 2023
Assignee:
GOOGLE LLC (US)
International Classes:
G10L15/06
Other References:
TZINIS EFTHYMIOS ET AL: "RemixIT: Continual Self-Training of Speech Enhancement Models via Bootstrapped Remixing", IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, IEEE, US, vol. 16, no. 6, 20 August 2022 (2022-08-20), pages 1329 - 1341, XP011923766, ISSN: 1932-4553, [retrieved on 20220822], DOI: 10.1109/JSTSP.2022.3200911
DONGSEONG HWANG ET AL: "Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 October 2022 (2022-10-11), XP091341824
Attorney, Agent or Firm:
KRUEGER, Brett A. (US)
Claims:
WHAT IS CLAIMED IS:

1. A computer-implemented method (300) executed by data processing hardware (510) that causes the data processing hardware (510) to perform operations comprising:
receiving distillation data (220) comprising a plurality of out-of-domain training utterances (222);
for each particular out-of-domain training utterance (222) of the distillation data (220):
generating a corresponding augmented out-of-domain training utterance (232); and
generating, using a teacher automated speech recognition (ASR) model (170) trained on training data corresponding to a target domain, a pseudo-label (240) corresponding to the corresponding augmented out-of-domain training utterance (232); and
distilling a student ASR model (210) from the teacher ASR model (170) by training the student ASR model (210) using the corresponding augmented out-of-domain training utterances (232) paired with the corresponding pseudo-labels (240) generated by the teacher ASR model (170).

2. The computer-implemented method (300) of claim 1, wherein distilling the student ASR model (210) from the teacher ASR model (170) comprises training the student ASR model (210) to recognize utterances corresponding to the target domain.

3. The computer-implemented method (300) of claim 1 or 2, wherein: before receiving the distillation data (220), the teacher ASR model (170) is trained on the training data comprising a plurality of target domain training utterances; and after receiving the distillation data (220), the training data is unavailable to the teacher ASR model (170) and the student ASR model (210).

4. The computer-implemented method (300) of any of claims 1-3, wherein: the training data comprises a plurality of target domain training utterances; and one or more of the target domain training utterances of the plurality of target domain training utterances are not included in the distillation data (220).

5. The computer-implemented method (300) of any of claims 1-4, wherein augmenting the respective out-of-domain training utterance (222) comprises at least one of adding noise, adding reverberation, or manipulating timing.

6. The computer-implemented method (300) of any of claims 1-5, wherein: the student ASR model (210) executes in a cloud computing environment (70); and the teacher ASR model (170) executes on one or more user devices (10) in communication with the cloud computing environment (70).

7. The computer-implemented method (300) of claim 6, wherein the training data comprises a plurality of target domain training utterances, each target domain training utterance received at a respective one of the one or more user devices (10).

8. The computer-implemented method (300) of any of claims 1-7, wherein generating the pseudo-label (240) corresponding to the corresponding augmented out-of-domain training utterance (232) comprises: generating a probability distribution over possible speech recognition hypotheses, wherein the pseudo-label (240) comprises N-best speech recognition hypotheses having the highest probabilities.

9. The computer-implemented method (300) of any of claims 1-8, wherein distilling the student ASR model (210) from the teacher ASR model (170) comprises, for each particular augmented out-of-domain training utterance (232):
generating, using the student ASR model (210), a transcription (212) corresponding to the particular augmented out-of-domain training utterance (232);
generating a loss (252) based on the pseudo-label (240) corresponding to the particular augmented out-of-domain training utterance (232) and the transcription (212) corresponding to the particular augmented out-of-domain training utterance (232); and
updating parameters of the student ASR model (210) using the loss (252).

10. The computer-implemented method (300) of claim 9, wherein the loss (252) comprises at least one of a cross-entropy loss, a Kullback-Leibler (KL) divergence loss, or an L2 loss.

11. A system (400) comprising:
data processing hardware (410); and
memory hardware (420) in communication with the data processing hardware (410), the memory hardware (420) storing instructions that, when executed on the data processing hardware (410), cause the data processing hardware (410) to perform operations comprising:
receiving distillation data (220) comprising a plurality of out-of-domain training utterances (222);
for each particular out-of-domain training utterance (222) of the distillation data (220):
generating a corresponding augmented out-of-domain training utterance (232); and
generating, using a teacher automated speech recognition (ASR) model (170) trained on training data corresponding to a target domain, a pseudo-label (240) corresponding to the corresponding augmented out-of-domain training utterance (232); and
distilling a student ASR model (210) from the teacher ASR model (170) by training the student ASR model (210) using the corresponding augmented out-of-domain training utterances (232) paired with the corresponding pseudo-labels (240) generated by the teacher ASR model (170).

12. The system (400) of claim 11, wherein distilling the student ASR model (210) from the teacher ASR model (170) comprises training the student ASR model (210) to recognize utterances corresponding to the target domain.

13. The system (400) of claim 11 or 12, wherein: before receiving the distillation data (220), the teacher ASR model (170) is trained on the training data comprising a plurality of target domain training utterances; and after receiving the distillation data (220), the training data is unavailable to the teacher ASR model (170) and the student ASR model (210).

14. The system (400) of any of claims 11-13, wherein: the training data comprises a plurality of target domain training utterances; and one or more of the target domain training utterances of the plurality of target domain training utterances are not included in the distillation data (220).

15. The system (400) of any of claims 11-14, wherein augmenting the respective out-of-domain training utterance (222) comprises at least one of adding noise, adding reverberation, or manipulating timing.

16. The system (400) of any of claims 11-15, wherein: the student ASR model (210) executes in a cloud computing environment (70); and the teacher ASR model (170) executes on one or more user devices (10) in communication with the cloud computing environment (70).

17. The system (400) of claim 16, wherein the training data comprises a plurality of target domain training utterances, each target domain training utterance received at a respective one of the one or more user devices (10).

18. The system (400) of any of claims 11-17, wherein generating the pseudo-label (240) corresponding to the corresponding augmented out-of-domain training utterance (232) comprises: generating a probability distribution over possible speech recognition hypotheses, wherein the pseudo-label (240) comprises N-best speech recognition hypotheses having the highest probabilities.

19. The system (400) of any of claims 11-18, wherein distilling the student ASR model (210) from the teacher ASR model (170) comprises, for each particular augmented out-of-domain training utterance (232):
generating, using the student ASR model (210), a transcription (212) corresponding to the particular augmented out-of-domain training utterance (232);
generating a loss (252) based on the pseudo-label (240) corresponding to the particular augmented out-of-domain training utterance (232) and the transcription (212) corresponding to the particular augmented out-of-domain training utterance (232); and
updating parameters of the student ASR model (210) using the loss (252).

20. The system (400) of claim 19, wherein the loss (252) comprises at least one of a cross-entropy loss, a Kullback-Leibler (KL) divergence loss, or an L2 loss.

Description:
Knowledge Distillation With Domain Mismatch For Speech Recognition

TECHNICAL FIELD

[0001] This disclosure relates to training a speech recognition model.

BACKGROUND

[0002] Speech recognition systems are increasingly used to transcribe speech to text in many daily applications. These speech recognition systems may be embedded on user devices such as smart home devices or smartphones, or used in cloud-related services.

SUMMARY

[0003] One aspect of the disclosure provides a computer-implemented method for training a speech recognition model. The method, when executed by data processing hardware, causes the data processing hardware to perform operations. The operations include receiving distillation data including a plurality of out-of-domain training utterances. For each particular out-of-domain training utterance of the distillation data, the operations include generating a corresponding augmented out-of-domain training utterance; and generating, using a teacher automated speech recognition (ASR) model trained on training data corresponding to a target domain, a pseudo-label corresponding to the corresponding augmented out-of-domain training utterance. The operations also include distilling a student ASR model from the teacher ASR model by training the student ASR model using the corresponding augmented out-of-domain training utterances paired with the corresponding pseudo-labels generated by the teacher ASR model.

[0004] Implementations of the computer-implemented method or the system of the disclosure may include one or more of the following optional features. In some implementations, distilling the student ASR model from the teacher ASR model includes training the student ASR model to recognize utterances corresponding to the target domain. In some examples, before receiving the distillation data, the teacher ASR model is trained on the training data including a plurality of target domain training utterances; and, after receiving the distillation data, the training data is unavailable to the teacher ASR model and the student ASR model. In some implementations, the training data includes a plurality of target domain training utterances, and one or more of the target domain training utterances of the plurality of target domain training utterances are not included in the distillation data. In some examples, augmenting the respective out-of-domain training utterance includes at least one of adding noise, adding reverberation, or manipulating timing.

[0005] In some implementations, the student ASR model executes in a cloud computing environment, and the teacher ASR model executes on one or more user devices in communication with the cloud computing environment. The training data may include a plurality of target domain training utterances, each target domain training utterance received at a respective one of the one or more user devices. In some examples, generating the pseudo-label corresponding to the corresponding augmented out-of-domain training utterance includes generating a probability distribution over possible speech recognition hypotheses, wherein the pseudo-label includes N-best speech recognition hypotheses having the highest probabilities. In some implementations, distilling the student ASR model from the teacher ASR model includes, for each particular augmented out-of-domain training utterance: generating, using the student ASR model, a transcription corresponding to the particular augmented out-of-domain training utterance; generating a loss based on the pseudo-label corresponding to the particular augmented out-of-domain training utterance and the transcription corresponding to the particular augmented out-of-domain training utterance; and updating parameters of the student ASR model using the loss. The loss may be at least one of a cross-entropy loss, a Kullback-Leibler (KL) divergence loss, or an L2 loss.

[0006] Another aspect of the disclosure provides a system including data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations. The operations include receiving distillation data including a plurality of out-of-domain training utterances. For each particular out-of-domain training utterance of the distillation data, the operations include generating a corresponding augmented out-of-domain training utterance; and generating, using a teacher automated speech recognition (ASR) model trained on training data corresponding to a target domain, a pseudo-label corresponding to the corresponding augmented out-of-domain training utterance. The operations also include distilling a student ASR model from the teacher ASR model by training the student ASR model using the corresponding augmented out-of-domain training utterances paired with the corresponding pseudo-labels generated by the teacher ASR model.

[0007] Implementations of the computer-implemented method or the system of the disclosure may include one or more of the following optional features. In some implementations, distilling the student ASR model from the teacher ASR model includes training the student ASR model to recognize utterances corresponding to the target domain. In some examples, before receiving the distillation data, the teacher ASR model is trained on the training data including a plurality of target domain training utterances; and, after receiving the distillation data, the training data is unavailable to the teacher ASR model and the student ASR model. In some implementations, the training data includes a plurality of target domain training utterances, and one or more of the target domain training utterances of the plurality of target domain training utterances are not included in the distillation data. In some examples, augmenting the respective out-of-domain training utterance includes at least one of adding noise, adding reverberation, or manipulating timing.

[0008] In some implementations, the student ASR model executes in a cloud computing environment, and the teacher ASR model executes on one or more user devices in communication with the cloud computing environment. The training data may include a plurality of target domain training utterances, each target domain training utterance received at a respective one of the one or more user devices. In some examples, generating the pseudo-label corresponding to the corresponding augmented out-of-domain training utterance includes generating a probability distribution over possible speech recognition hypotheses, wherein the pseudo-label includes N-best speech recognition hypotheses having the highest probabilities. In some implementations, distilling the student ASR model from the teacher ASR model includes, for each particular augmented out-of-domain training utterance: generating, using the student ASR model, a transcription corresponding to the particular augmented out-of-domain training utterance; generating a loss based on the pseudo-label corresponding to the particular augmented out-of-domain training utterance and the transcription corresponding to the particular augmented out-of-domain training utterance; and updating parameters of the student ASR model using the loss. The loss may be at least one of a cross-entropy loss, a Kullback-Leibler (KL) divergence loss, or an L2 loss.

[0009] The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

[0010] FIG. 1 is a schematic view of an example speech environment using a speech recognition model for transcribing utterances.

[0011] FIG. 2A is a schematic view of an example training process for training a speech recognition model using knowledge distillation with domain mismatch.

[0012] FIG. 2B is a schematic view of another example training process for training a speech recognition model using knowledge distillation with domain mismatch.

[0013] FIG. 3 is a flow chart of an example arrangement of operations for a method of training a speech recognition model using knowledge distillation with domain mismatch.

[0014] FIG. 4 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.

[0015] Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0016] Knowledge distillation may be used to transfer knowledge from one model (i.e., a teacher model) to another model (i.e., a student model). Conventional knowledge distillation assumes that data in a target domain that is used to train the teacher model is available for training the student model. Alternatively, sophisticated methods are needed to generate suitable data for training the student model. However, in many instances, training data matching the target domain is not available for training the student model. For example, for automatic speech recognition (ASR) for long-tail languages (e.g., Sub-Saharan African languages), there may be an inadequate amount of training data available for training a server-side model. While federated training may be used to train the server-side model, on-device data used for training on-device speech recognition models typically should not leave user devices for privacy reasons. Thus, the server-side model (i.e., a student model) cannot be trained using conventional knowledge distillation based on on-device models (i.e., teacher models), as no training data is available that matches the target domain on which the on-device models were trained. Therefore, there is a need for methods and systems for performing knowledge distillation to train a student model when the domain of training data available to train the student model does not match the domain of training data used to train the teacher model. While examples disclosed herein relate to training a student ASR model using knowledge distillation with a domain mismatch, disclosed examples may be used to train other types of models using knowledge distillation with a domain mismatch. That is, when the training data for training a teacher model is from a domain that does not match (i.e., is different from) the domain of training data that is available to train a student model.

[0017] Referring to FIG. 1, in some implementations, a speech environment 100 includes a user 104 using spoken utterances 106 to interact with a voice-enabled device 10 (also referred to as a user device 10). Here, a spoken utterance 106 corresponds to, for example, a dictation for transcription, or a query to solicit a response from the user device 10 or to have the user device 10 execute a task specified by the query. In this sense, the user 104 may have conversational-like interactions with the user device 10 to perform computing activities or find answers to questions. In the example shown, a system 102 includes an ASR model 170 of a speech recognition system 150 for generating a transcription 172 of the utterance 106. The transcription 172 may then be processed by a digital assistant 20 to generate a response to the utterance 106 or execute a task specified by the utterance 106. In some implementations, the digital assistant 20 includes a natural language processing/understanding (NLP/NLU) module executing on the user device 10 or a remote computing system 70 for processing the transcription 172 to understand the utterance 106. The digital assistant 20 may provide a response as text in a user interface 22 on a display 16c of the user device 10 or as audio signals 108 output by a speaker 16b of the user device 10. In some examples, the digital assistant 20 generates text representing a response, and a text-to-speech (TTS) system converts the text to audio signals 108 as synthetic speech. In the example shown, the user 104 speaks an utterance 106 asking “Who taught Alexander the Great,” and the digital assistant 20 responds with audio data 108 representing a response of “Aristotle.”

[0018] The user device 10 may correspond to any computing device associated with a user 104 and capable of capturing audio data 162 and providing textual or audible outputs. Some examples of user devices 10 include, but are not limited to, mobile devices (e.g., mobile phones, tablets, laptops, etc.), computers, wearable devices (e.g., a smart watch, smart glasses, smart goggles, an AR headset, a VR headset, etc.), smart appliances, Internet of Things (IoT) devices, vehicle infotainment systems, smart displays, smart speakers, etc. The user device 10 includes data processing hardware 12 and memory hardware 14 in communication with the data processing hardware 12. The memory hardware 14 stores instructions that, when executed by the data processing hardware 12, cause the data processing hardware 12 to perform one or more operations. The user device 10 further includes one or more input/output devices 16, 16a-c, such as an audio capture device 16, 16a (e.g., a microphone) for capturing and converting spoken utterances 106 into electrical signals, an audio output device 16, 16b (e.g., a speaker) for communicating an audible audio signal (e.g., as output audio data from the user device 10), and the display 16, 16c for displaying visual content. Of course, any number and/or type(s) of other input/output devices 16 may be used. The input/output devices 16 may reside on or be in communication with the user device 10.

[0019] The speech recognition system 150 executes on the user device 10 of the user 104 and/or on a remote computing system 70 (e.g., one or more remote servers of a distributed system executing in a cloud-computing environment) in communication with the user device 10 via a network 40. The speech recognition system 150 includes an input subsystem 160 configured to receive the utterances 106 spoken by the user 104 and captured by the audio capture device 16a, and convert each utterance 106 into a corresponding digital format associated with input acoustic frames 162 (also generally referred to as audio data 162) capable of being processed by the speech recognition system 150. Thereafter, the ASR model 170 of the speech recognition system 150 receives, as input, the audio data 162 corresponding to a current utterance 106, and generates/predicts, as output, a corresponding transcription 172 (e.g., recognition result/hypothesis) of the utterance 106.

[0020] The remote computing system 70 includes data processing hardware 72, and memory hardware 74 in communication with the data processing hardware 72. The memory hardware 74 stores instructions that, when executed by the data processing hardware 72, cause the data processing hardware 72 to perform one or more operations, such as those disclosed herein.

[0021] The ASR model 170 may have different types of neural network architectures. In some examples, the ASR model 170 includes a conformer-based encoder and a listen attend spell (LAS) decoder. In some implementations, the conformer-based encoder may include 600 million trainable parameters, and the LAS decoder may include 200 million trainable parameters. In other examples, the ASR model 170 includes an RNN-T model with an encoder-decoder architecture. Here, the decoder architecture of the RNN-T model includes a joint/prediction network. An audio encoder of the ASR model 170 may include a cascaded encoder architecture.

[0022] FIG. 2A depicts an example training process 200a for training a student ASR model 210 from the ASR model 170 (i.e., a teacher ASR model 170) using knowledge distillation with domain mismatch. That is, domain mismatch occurs when the training data used to train the teacher ASR model 170 was from a target domain that does not match (i.e., is different from) the domain of training data that is available to train the student ASR model 210. The training process 200a includes a knowledge distillation training process where the teacher ASR model 170 distills its knowledge to the student ASR model 210. Here, the teacher ASR model 170 distills its knowledge to the student ASR model 210 by training the student ASR model 210 with distillation data 220 that includes a plurality of out-of-domain training utterances 222, 222a-n. Notably, the out-of-domain training utterances 222 are from a domain that is different from the target domain of the training data (i.e., target domain training utterances) that was used to train the teacher ASR model 170 prior to distilling knowledge to the student ASR model 210. Here, the student ASR model 210 learns to predict transcriptions 212, 212a-n matching pseudo-labels 240, 240a-n produced by the teacher ASR model 170 in order to train the student ASR model 210 to recognize utterances corresponding to the target domain. In a federated learning example, the student ASR model 210 executes in a cloud computing environment (e.g., a server-side model executing on the remote computing system 70), and the teacher ASR model 170 executes on the user device 10 in communication with the cloud computing environment. Here, the target domain training utterances used to train the teacher ASR model 170 are received at the user device 10 from the user 104 of the user device 10 and are not available for training the student ASR model 210.
Hence, the target domain training utterances are only processed by the teacher ASR model 170 for the purpose of training the teacher ASR model 170, whereby neither the target domain training utterances nor sensitive information derived from target domain training utterances are sent/communicated to the cloud computing environment.
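The distillation flow described above can be sketched end to end. In this minimal sketch, the callables augment, teacher_label, and student_step are hypothetical placeholders standing in for the augmenter 230, pseudo-labeling by the teacher ASR model 170, and one student training update returning the loss 252; they are not an API defined by this disclosure.

```python
from typing import Callable, Iterable, List

def distill_pass(distillation_data: Iterable[str],
                 augment: Callable[[str], str],
                 teacher_label: Callable[[str], str],
                 student_step: Callable[[str, str], float]) -> List[float]:
    """One pass over the distillation data (220): augment each out-of-domain
    training utterance (222), pseudo-label the augmented utterance (232) with
    the teacher, and train the student on the (utterance, pseudo-label) pair."""
    losses = []
    for utterance in distillation_data:
        augmented = augment(utterance)             # augmented utterance (232)
        pseudo = teacher_label(augmented)          # pseudo-label (240)
        losses.append(student_step(augmented, pseudo))  # loss (252) from the update
    return losses
```

Because the teacher only ever receives augmented out-of-domain utterances here, no target domain training utterance needs to leave the device that trained the teacher.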

[0023] FIG. 2A depicts the student ASR model 210 being trained using a single teacher ASR model 170 executing on the user device 10. However, the student ASR model 210 may be trained using one or more teacher ASR models 170 executing on one or more user devices 10, where the target domain training utterances used to train each teacher ASR model 170 are received at respective ones of the one or more user devices 10 from corresponding users. Notably, each teacher ASR model 170 may be associated with its own target domain training utterances, which are not made available for training other teacher ASR models 170 or the student ASR model 210. The architecture of the teacher ASR model 170 may be different from that of the student ASR model 210. Additionally or alternatively, the teacher ASR model 170 and the student ASR model 210 may have different sizes. Notably, it has been found that, for ASR for long-tail languages, the teacher ASR model 170 can be 32 times smaller than the student ASR model 210 and still achieve good knowledge distillation.

[0024] During the training process 200a, an augmenter 230 receives the out-of-domain training utterances 222 and, for each particular out-of-domain training utterance 222, generates a corresponding augmented out-of-domain training utterance 232, 232a-n. The augmenter 230 generates an augmented out-of-domain training utterance 232 by, for example, adding noise to, adding reverberation to, or manipulating the timing of, the corresponding out-of-domain training utterance 222. In some implementations, the augmenter 230 generates an augmented out-of-domain training utterance 232 using Gaussian noise injection (GNI). For example, for out-of-domain training utterances 222 represented by Log-Mel input features, the augmenter 230 may augment an out-of-domain training utterance 222 by randomly varying the values of one or more input features of the out-of-domain training utterance 222 by adding a corresponding random value drawn from a Gaussian distribution to the value of each of the one or more varied input features. Here, the augmenter 230 may not vary all of the input features. The mean and the variance of the Gaussian distribution may be computed from the corresponding frequency channel in the input features. Alternatively, one or more input features may be replaced by a corresponding random value drawn from a Gaussian distribution. Here, GNI may be tuned using a single hyper-parameter representing a probability ranging from zero to one hundred percent that indicates whether a particular input feature is varied/replaced, such that GNI-based augmentation may be easily tuned. Notably, target domain training utterances used to train the teacher ASR model 170 are not available after receiving the distillation data 220, and the distillation data 220 does not need to include any such target domain training utterances. In the example shown, the augmenter 230 generates a single augmented training utterance 232 for each training utterance 222.
However, the augmenter 230 may generate multiple augmented training utterances 232 for each training utterance 222.
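The GNI-based augmentation described above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the function name and the `replace` flag are assumptions, and the single tuning hyper-parameter appears as the probability `p` that a given feature value is varied or replaced.

```python
import numpy as np

def gni_augment(log_mel: np.ndarray, p: float = 0.1, replace: bool = False,
                seed: int = 0) -> np.ndarray:
    """Gaussian noise injection over Log-Mel features of shape (frames, channels).

    Each feature value is varied with probability ``p``: a Gaussian draw is
    added to it or, when ``replace`` is set, the value is replaced by the draw.
    The Gaussian's mean and variance are computed from the corresponding
    frequency channel of the input features.
    """
    rng = np.random.default_rng(seed)
    mean = log_mel.mean(axis=0)                       # per-channel mean
    std = log_mel.std(axis=0)                         # per-channel std
    draws = rng.normal(mean, std, size=log_mel.shape) # channel-wise Gaussian draws
    mask = rng.random(log_mel.shape) < p              # which features to vary
    if replace:
        return np.where(mask, draws, log_mel)
    return log_mel + np.where(mask, draws, 0.0)
```

Calling the function repeatedly with different seeds yields multiple distinct augmented versions of the same utterance at essentially no cost.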

[0025] For each particular augmented out-of-domain training utterance 232, the teacher ASR model 170 generates a pseudo-label 240, 240a-n corresponding to the particular augmented out-of-domain training utterance 232. In some implementations, the teacher ASR model 170 generates the corresponding pseudo-label 240 by generating a probability distribution over possible speech recognition hypotheses for the particular augmented out-of-domain training utterance 232, and the pseudo-label 240 is the speech recognition hypothesis having the highest probability. Notably, the teacher ASR model 170 may generate a pseudo-label 240 for the particular augmented out-of-domain training utterance 232 that differs from a pseudo-label generated by the teacher ASR model 170 for the corresponding out-of-domain training utterance 222. Thus, the augmenter 230 effectively and dynamically creates a large unlabeled dataset with almost zero cost to increase the abundance and the coverage of the distillation data 220, which provides more opportunities for the teacher ASR model 170 to distill knowledge in the target domain to the student ASR model 210.
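The pseudo-labeling step above (a probability distribution over hypotheses, with the top hypothesis taken as the pseudo-label) can be illustrated with a small helper. The helper is a hypothetical sketch: it assumes the teacher's beam search exposes per-hypothesis log-scores, which are normalized into a posterior via a numerically stable softmax.

```python
import math
from typing import Dict, List, Tuple

def pseudo_labels(beam_log_scores: Dict[str, float],
                  n_best: int = 1) -> List[Tuple[str, float]]:
    """Turn beam-search log-scores into a probability distribution over
    hypotheses and return the N-best (hypothesis, probability) pairs.
    With n_best=1, the single returned hypothesis is the pseudo-label."""
    peak = max(beam_log_scores.values())
    unnorm = {h: math.exp(s - peak) for h, s in beam_log_scores.items()}  # stable softmax
    total = sum(unnorm.values())
    posterior = {h: v / total for h, v in unnorm.items()}
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n_best]
```

With n_best greater than one, the same helper yields the N-best pseudo-labels used by the beam-search variant of the loss discussed later.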

[0026] For each particular augmented out-of-domain training utterance 232, the student ASR model 210 generates a corresponding predicted transcription 212 corresponding to the particular augmented out-of-domain training utterance 232. In some implementations, the student ASR model 210 generates the corresponding predicted transcription 212 by generating a probability distribution over possible speech recognition hypotheses for the particular augmented out-of-domain training utterance 232, and the corresponding predicted transcription 212 is the speech recognition hypothesis having the highest probability.

[0027] Thereafter, the training process 200a distills knowledge to the student ASR model 210 from the teacher ASR model 170 by training the student ASR model 210 using the corresponding augmented out-of-domain training utterances 232 paired with the corresponding pseudo-labels 240 generated by the teacher ASR model 170. In particular, for each particular augmented out-of-domain training utterance 232, a loss term module 250 receives the corresponding pseudo-label 240 generated by the teacher ASR model 170 and the corresponding transcription 212 predicted by the student ASR model 210. Thereafter, the loss term module 250 generates a corresponding loss 252, 252a-n based on the corresponding pseudo-label 240 and the corresponding predicted transcription 212, and updates parameters of the student ASR model 210 using the loss 252. In some implementations, the corresponding loss 252 is expressed as:

    L = L_dist(P_s(x, y^s1), P_t(x, y^s1))    EQN (1)

where x is the corresponding augmented training utterance 232 for a particular training utterance 222, and y^s1 is the pseudo-label 240 generated by the teacher ASR model 170 that is inferred from x. The functions P_s and P_t are the output logits from the student ASR model 210 and the teacher ASR model 170, respectively, and have the same shape. The loss function L_dist may be any distillation loss, e.g., a cross-entropy loss, a Kullback-Leibler (KL) divergence loss, or an L2 loss.
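A one-best distillation loss of this kind can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the logit shapes and the epsilon smoothing are assumptions, and `kind` simply selects among the losses named above.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, kind="kl"):
    """EQN (1)-style loss between same-shaped student/teacher logits
    (P_s and P_t) for the teacher's 1-best pseudo-label hypothesis."""
    s = np.asarray(student_logits, dtype=float)
    t = np.asarray(teacher_logits, dtype=float)
    if kind == "l2":   # L2 directly on the logits
        return float(((s - t) ** 2).mean())
    p_s, p_t = softmax(s), softmax(t)
    if kind == "kl":   # KL(P_t || P_s)
        return float((p_t * (np.log(p_t + 1e-12)
                             - np.log(p_s + 1e-12))).sum(-1).mean())
    if kind == "ce":   # cross-entropy against the teacher distribution
        return float(-(p_t * np.log(p_s + 1e-12)).sum(-1).mean())
    raise ValueError(f"unknown loss kind: {kind}")
```

All three variants are zero-gradient-compatible drop-ins for one another, which is consistent with the text's observation that L_dist may be any distillation loss.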

[0028] Alternatively, the corresponding loss 252 may be expressed as:

    L = sum_{k=1}^{N} p(y^k | x) * L_dist(P_s(x, y^k), P_t(x, y^k))    EQN (2)

which utilizes, for each particular augmented out-of-domain training utterance x 232, multiple pseudo-labels 240 generated by the teacher ASR model 170 from the augmented out-of-domain training utterance 232 in a beam search. In some implementations, the teacher ASR model 170 generates the corresponding pseudo-labels 240 by generating a probability distribution over possible speech recognition hypotheses for each particular augmented out-of-domain training utterance x 232 in a beam search, where the multiple pseudo-labels 240 represent the N-best speech recognition hypotheses having the highest probabilities. Similarly, the student ASR model 210 may generate corresponding predicted transcriptions 212 by generating a probability distribution over possible speech recognition hypotheses for each particular augmented out-of-domain training utterance x 232 in a beam search, where the multiple predicted transcriptions 212 represent the N-best speech recognition hypotheses having the highest probabilities. Here, the loss of EQN (1) for each particular hypothesis y^k is multiplied by the posterior probability of the particular hypothesis y^k. This weighting operation may be implemented as either a weighted sum or a random sampling. In general, the loss of EQN (2) is capable of providing better word coverage for a particular augmented out-of-domain training utterance 232 compared to the loss of EQN (1).
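The weighted-sum form of this N-best loss might be sketched as follows; the triple layout and the toy base loss in the test are illustrative assumptions rather than the patent's interfaces.

```python
def nbest_distillation_loss(hypotheses, base_loss):
    """EQN (2)-style loss: a per-hypothesis EQN (1)-style loss for each
    N-best beam-search hypothesis y_k, weighted by the teacher's
    posterior probability for y_k, combined here as a weighted sum
    (the text also allows a random-sampling variant).

    hypotheses: iterable of (posterior, student_out, teacher_out)
    triples, one per hypothesis.
    base_loss: any per-hypothesis distillation loss taking
    (student_out, teacher_out).
    """
    return sum(w * base_loss(s, t) for w, s, t in hypotheses)
```

With a degenerate one-hypothesis beam this reduces exactly to the one-best loss, which is why the N-best form can only widen, never shrink, the word coverage per utterance.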

[0029] FIG. 2B depicts another example training process 200b for training a student ASR model 210 from the ASR model 170 (i.e., a teacher ASR model 170) using knowledge distillation with domain mismatch. Compared to training process 200a of FIG. 2A, the training process 200b of FIG. 2B includes an additional augmenter 260 for altering the augmented out-of-domain training utterances 232 to form student training utterances 262, 262a-n for input to the student ASR model 210. Here, the teacher ASR model 170 still processes the augmented out-of-domain training utterances 232 and, thus, the pseudo-labels 240 are unaffected by the augmenter 260. In some examples, the augmenter 260 adds noise (e.g., using GNI) or performs some other type of spectral augmentation. Thus, the training process 200b follows a noisy student learning framework in which the student ASR model 210 learns that slight differences in input utterances may result in the same pseudo-labels 240, providing an additional degree of robustness. Here, the augmenter 260 may create more than one student training utterance 262 per augmented out-of-domain training utterance 232.

[0030] FIG. 3 is a flowchart of an exemplary arrangement of operations for a computer-implemented method 300 of training an ASR model using knowledge distillation with domain mismatch. The operations may be performed by data processing hardware 410 (FIG. 4) (e.g., the data processing hardware 12 of the user device 10 or the data processing hardware 72 of the remote computing system 70) based on executing instructions stored on memory hardware 420 (e.g., the memory hardware 14 of the user device 10 or the memory hardware 74 of the remote computing system 70).

[0031] At operation 302, the method 300 includes receiving distillation data 220 including a plurality of out-of-domain training utterances 222. For each particular out-of- domain training utterance 222 of the distillation data 220, the method 300 includes, at operation 304, generating a corresponding augmented out-of-domain training utterance 232 and, at operation 306, generating, using the teacher ASR model 170 trained on training data corresponding to a target domain, a pseudo-label 240 corresponding to the corresponding augmented out-of-domain training utterance 232. At operation 308, the method 300 includes distilling the student ASR model 210 from the teacher ASR model 170 by training the student ASR model 210 using the corresponding augmented out-of- domain training utterances 232 paired with the corresponding pseudo-labels 240 generated by the teacher ASR model 170.
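Operations 302-308 can be sketched as a simple pipeline. The four callables below are hypothetical stand-ins for the distillation data 220, the augmenter 230, the teacher ASR model 170, and the student training step; none of them are the patent's actual components.

```python
def distill(utterances, augment, teacher_label, student_update):
    """Sketch of method 300: augment each out-of-domain utterance
    (operation 304), pseudo-label the augmented copy with the teacher
    (operation 306), and train the student on each (augmented
    utterance, pseudo-label) pair (operation 308)."""
    pairs = []
    for x in utterances:               # the distillation data 220
        x_aug = augment(x)             # operation 304
        y = teacher_label(x_aug)       # operation 306
        student_update(x_aug, y)       # operation 308
        pairs.append((x_aug, y))
    return pairs
```

Keeping the augmenter and teacher as separate stages mirrors the key property above: the student only ever sees (augmented utterance, pseudo-label) pairs, never the target domain training data itself.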

[0032] FIG. 4 is a schematic view of an example computing device 400 that may be used to implement the systems and methods described in this document. The computing device 400 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

[0033] The computing device 400 includes a processor 410 (i.e., data processing hardware) that can be used to implement the data processing hardware 12 and/or 72, memory 420 (i.e., memory hardware) that can be used to implement the memory hardware 14 and/or 74, a storage device 430 (i.e., memory hardware) that can be used to implement the memory hardware 14 and/or 74, a high-speed interface/controller 440 connecting to the memory 420 and high-speed expansion ports 450, and a low-speed interface/controller 460 connecting to a low-speed bus 470 and a storage device 430 that can be used to store the distillation data 220. Each of the components 410, 420, 430, 440, 450, and 460 is interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 410 can process instructions for execution within the computing device 400, including instructions stored in the memory 420 or on the storage device 430 to display graphical information for a graphical user interface (GUI) on an external input/output device, such as display 480 coupled to the high-speed interface 440. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 400 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

[0034] The memory 420 stores information non-transitorily within the computing device 400. The memory 420 may be a computer-readable medium, a volatile memory unit(s), or non-volatile memory unit(s). The non-transitory memory 420 may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 400. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM) as well as disks or tapes.

[0035] The storage device 430 is capable of providing mass storage for the computing device 400. In some implementations, the storage device 430 is a computer- readable medium. In various different implementations, the storage device 430 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 420, the storage device 430, or memory on processor 410.

[0036] The high-speed controller 440 manages bandwidth-intensive operations for the computing device 400, while the low-speed controller 460 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 440 is coupled to the memory 420, the display 480 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 450, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 460 is coupled to the storage device 430 and a low-speed expansion port 490. The low-speed expansion port 490, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

[0037] The computing device 400 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 400a or multiple times in a group of such servers 400a, as a laptop computer 400b, or as part of a rack server system 400c.

[0038] Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

[0039] A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.

[0040] These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine- readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

[0041] The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[0042] To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

[0043] Unless expressly stated to the contrary, the phrase "at least one of A, B, or C" is intended to refer to any combination or subset of A, B, C such as: (1) at least one A alone; (2) at least one B alone; (3) at least one C alone; (4) at least one A with at least one B; (5) at least one A with at least one C; (6) at least one B with at least one C; and (7) at least one A with at least one B and at least one C. Moreover, unless expressly stated to the contrary, the phrase "at least one of A, B, and C" is intended to refer to any combination or subset of A, B, C such as: (1) at least one A alone; (2) at least one B alone; (3) at least one C alone; (4) at least one A with at least one B; (5) at least one A with at least one C; (6) at least one B with at least one C; and (7) at least one A with at least one B and at least one C. Furthermore, unless expressly stated to the contrary, "A or B" is intended to refer to any combination of A and B, such as: (1) A alone; (2) B alone; and (3) A and B.
[0044] A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.