
Title:
AUGMENTING TRAINING OF SEQUENCE TRANSDUCTION MODELS USING TOKEN-LEVEL LOSSES
Document Type and Number:
WIPO Patent Application WO/2024/081031
Kind Code:
A1
Abstract:
A method (500) includes, for each training sample (410) of a plurality of training samples: processing, using a sequence transduction model (200), corresponding training input features (415) to obtain one or more output token sequence hypotheses (432) each including one or more predicted common tokens (204); and determining a token-level loss (462) based on, for each hypothesis: a number of special token insertions each associated with a corresponding predicted special token that appears in the hypothesis but does not appear in a corresponding sequence of ground-truth output tokens; and a number of special token deletions each associated with a corresponding ground-truth special token in the set of ground-truth special tokens that does not appear in the hypothesis. The method also includes training the sequence transduction model to minimize additive error rate based on the token-level losses determined for the plurality of training samples.

Inventors:
ZHAO GUANLONG (US)
WANG QUAN (US)
SERRANO BELTRÁN LABRADOR (US)
LU HAN (US)
MORENO IGNACIO LOPEZ (US)
HUANG YILING (US)
Application Number:
PCT/US2022/079703
Publication Date:
April 18, 2024
Filing Date:
November 11, 2022
Assignee:
GOOGLE LLC (US)
International Classes:
G10L17/04; G06N3/0442; G06N3/045; G06N3/09; G10L15/06; G10L15/08; G10L15/16; G10L17/18
Other References:
WEI XIA ET AL: "Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 January 2022 (2022-01-25), XP091128195
HUANRU HENRY MAO ET AL: "Speech Recognition and Multi-Speaker Diarization of Long Conversations", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 May 2020 (2020-05-16), XP081674821
LAURENT EL SHAFEY ET AL: "Joint Speech Recognition and Speaker Diarization via Sequence Transduction", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 July 2019 (2019-07-09), XP081440969
TAE JIN PARK ET AL: "Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 May 2018 (2018-05-28), XP080882969
Attorney, Agent or Firm:
KRUEGER, Brett A. (US)
Claims:
WHAT IS CLAIMED IS:

1. A computer-implemented method (500) that, when executed on data processing hardware (610), causes the data processing hardware (610) to perform operations comprising: receiving a plurality of training samples (410) each comprising a corresponding sequence of training input features (415) paired with a corresponding sequence of ground-truth output tokens (420), the sequence of ground-truth output tokens (420) comprising a set of ground-truth common tokens and a set of ground-truth special tokens; for each training sample (410) in the plurality of training samples (410): processing, using a sequence transduction model (200), the corresponding sequence of training input features (415) to obtain one or more output token sequence hypotheses (432), each output token sequence hypothesis (432) comprising one or more predicted common tokens (204); and determining a per sample token-level loss (462) based on, for each corresponding output token sequence hypothesis (432) obtained for the training sample (410): a number of special token insertions each associated with a corresponding predicted special token (206) that appears in the corresponding output token sequence hypothesis (432) but does not appear in the corresponding sequence of ground-truth output tokens (420); and a number of special token deletions each associated with a corresponding ground-truth special token in the set of ground-truth special tokens that does not appear in the corresponding output token sequence hypothesis (432); and training the sequence transduction model (200) to minimize additive error rate based on the per sample token-level losses (462) determined for the plurality of training samples (410).

2. The computer-implemented method (500) of claim 1, wherein: determining the per sample token-level loss (462) is further based on a total number of ground-truth output tokens (420) in the corresponding sequence of ground-truth output tokens (420) and a respective number of predicted common token errors relative to the set of ground-truth common tokens in the corresponding sequence of ground-truth tokens (420); and when determining the per sample token-level loss (462), the number of special token insertions and the number of special token deletions are each weighted higher than the respective number of predicted common token errors to force the sequence transduction model (200) to reduce special token insertion and deletion rates during training.

3. The computer-implemented method (500) of claim 1 or 2, wherein the operations further comprise, for each training sample (410) in the plurality of training samples (410): processing, using the sequence transduction model (200), the corresponding sequence of training input features (415) to predict probability distributions (222) over possible output tokens; and training the sequence transduction model (200) based on a negative log of the probability distributions (442) for the corresponding sequence of ground-truth output tokens (420) conditioned on the corresponding sequence of training input features (415).

4. The computer-implemented method (500) of claim 3, wherein: training the sequence transduction model (200) based on the negative log of the probability distributions (442) comprises initially training the sequence transduction model (200) based on the negative log of the probability distributions (442) to initialize the sequence transduction model (200); and training the sequence transduction model (200) to minimize additive error rate based on the per sample token-level losses (462) comprises fine tuning the initialized sequence transduction model (200) based on minimizing the additive error rate based on the per sample token-level losses (462).

5. The computer-implemented method (500) of any of claims 1-4, wherein: processing the corresponding sequence of training input features (415) to obtain one or more output token sequence hypotheses (432) comprises processing the corresponding sequence of training input features (415) to obtain an N-best list of output token sequence hypotheses (432), each corresponding output token sequence hypothesis (432) in the N-best list having a respective probability score assigned by the sequence transduction model (200); and determining the per sample token-level loss (462) is further based on the respective probability score of the corresponding output token sequence hypotheses (432).

6. The computer-implemented method (500) of any of claims 1-5, wherein: the sequence of training input features (415) comprises a sequence of input audio frames characterizing an utterance (120) that includes a particular key phrase; the set of ground-truth common tokens in each corresponding sequence of ground-truth output tokens (420) comprises a set of word or sub-word unit tokens that form a ground-truth transcription of the utterance (120) characterized by the sequence of input audio frames; the set of ground-truth special tokens in each corresponding sequence of ground-truth output tokens (420) comprises at least one ground-truth key word token indicating a respective location in the ground-truth transcription immediately after the particular key phrase appears in the ground-truth transcription; the one or more predicted common tokens (204) of each corresponding output token sequence hypothesis (432) of the one or more output token sequence hypotheses (432) comprises a sequence of predicted word or sub-word tokens that form a respective candidate transcription of the utterance; and each corresponding predicted special token that appears in the corresponding output token sequence hypothesis (432) but does not appear in the corresponding sequence of ground-truth output tokens (420) comprises a predicted key phrase token indicating a respective location in the respective candidate transcription immediately after the sequence transduction model (200) predicts that the particular key phrase is detected.

7. The computer-implemented method (500) of any of claims 1-6, wherein: the sequence of training input features (415) comprises a sequence of input audio frames characterizing multiple utterances spoken by at least two different speakers; the set of ground-truth common tokens in each corresponding sequence of ground-truth output tokens (420) comprises a set of word or sub-word unit tokens that form a ground-truth transcription of the multiple spoken utterances (120) characterized by the sequence of input audio frames; the set of ground-truth special tokens in each corresponding sequence of ground-truth output tokens (420) comprises a set of one or more ground-truth speaker change tokens each indicating a respective location where a speaker change occurs in the ground-truth transcription of the multiple utterances; the one or more predicted common tokens (204) of each corresponding output token sequence hypothesis (432) of the one or more output token sequence hypotheses (432) comprises a sequence of predicted word or sub-word unit tokens that form a respective candidate transcription of the multiple utterances; and each corresponding predicted special token that appears in the corresponding output token sequence hypothesis (432) but does not appear in the corresponding sequence of ground-truth output tokens (420) comprises a predicted speaker change token indicating a respective location in the respective candidate transcription where a respective speaker change event is detected by the sequence transduction model (200).

8. The computer-implemented method (500) of any of claims 1-7, wherein the operations further comprise, for each training sample (410) in the plurality of training samples (410): determining a customized Levenshtein distance between each corresponding output token sequence hypothesis (432) of the one or more output token sequence hypotheses (432) obtained for the corresponding sequence of training input features (415) and the corresponding sequence of ground-truth output tokens (420); and based on the customized Levenshtein distance: identifying the number of special token insertions for each corresponding output token sequence hypothesis (432); and identifying the number of special token deletions for each corresponding output token sequence hypothesis (432).

9. The computer-implemented method (500) of claim 8, wherein the customized Levenshtein distance determined between each corresponding output token sequence hypothesis (432) and the corresponding sequence of ground-truth output tokens (420) prevents the sequence transduction model (200) from allowing substitutions between special tokens and common tokens during training of the sequence transduction model (200).

10. The computer-implemented method (500) of any of claims 1-9, wherein the sequence transduction model (200) comprises a recurrent neural network-transducer (RNN-T) model architecture.

11. The computer-implemented method (500) of any of claims 1-10, wherein the sequence transduction model (200) comprises at least one of a character recognition model, a speech recognition model, an endpointing model, a speaker turn detection model, or a machine translation model.

12. A system (100) comprising: data processing hardware (610); and memory hardware (620) in communication with the data processing hardware (610) and storing instructions that, when executed by the data processing hardware (610), cause the data processing hardware (610) to perform operations comprising: receiving a plurality of training samples (410) each comprising a corresponding sequence of training input features (415) paired with a corresponding sequence of ground-truth output tokens (420), the sequence of ground-truth output tokens (420) comprising a set of ground-truth common tokens and a set of ground-truth special tokens; for each training sample (410) in the plurality of training samples (410): processing, using a sequence transduction model (200), the corresponding sequence of training input features (415) to obtain one or more output token sequence hypotheses (432), each output token sequence hypothesis (432) comprising one or more predicted common tokens (204); and determining a per sample token-level loss (462) based on, for each corresponding output token sequence hypothesis (432) obtained for the training sample (410): a number of special token insertions each associated with a corresponding predicted special token that appears in the corresponding output token sequence hypothesis (432) but does not appear in the corresponding sequence of ground-truth output tokens (420); and a number of special token deletions each associated with a corresponding ground-truth special token in the set of ground-truth special tokens that does not appear in the corresponding output token sequence hypothesis (432); and training the sequence transduction model (200) to minimize additive error rate based on the per sample token-level losses (462) determined for the plurality of training samples (410).

13. The system (100) of claim 12, wherein: determining the per sample token-level loss (462) is further based on a total number of ground-truth output tokens (420) in the corresponding sequence of ground-truth output tokens (420) and a respective number of predicted common token errors relative to the set of ground-truth common tokens in the corresponding sequence of ground-truth tokens (420); and when determining the per sample token-level loss (462), the number of special token insertions and the number of special token deletions are each weighted higher than the respective number of predicted common token errors to force the sequence transduction model (200) to reduce special token insertion and deletion rates during training.

14. The system (100) of claim 12 or 13, wherein the operations further comprise, for each training sample (410) in the plurality of training samples (410): processing, using the sequence transduction model (200), the corresponding sequence of training input features (415) to predict probability distributions (222) over possible output tokens; and training the sequence transduction model (200) based on a negative log of the probability distributions (442) for the corresponding sequence of ground-truth output tokens (420) conditioned on the corresponding sequence of training input features (415).

15. The system (100) of claim 14, wherein: training the sequence transduction model (200) based on the negative log of the probability distributions (442) comprises initially training the sequence transduction model (200) based on the negative log of the probability distributions (442) to initialize the sequence transduction model (200); and training the sequence transduction model (200) to minimize additive error rate based on the per sample token-level losses (462) comprises fine tuning the initialized sequence transduction model (200) based on minimizing the additive error rate based on the per sample token-level losses (462).

16. The system (100) of any of claims 12-15, wherein: processing the corresponding sequence of training input features (415) to obtain one or more output token sequence hypotheses (432) comprises processing the corresponding sequence of training input features (415) to obtain an N-best list of output token sequence hypotheses (432), each corresponding output token sequence hypothesis (432) in the N-best list having a respective probability score assigned by the sequence transduction model (200); and determining the per sample token-level loss (462) is further based on the respective probability score of the corresponding output token sequence hypotheses (432).

17. The system (100) of any of claims 12-16, wherein: the sequence of training input features (415) comprises a sequence of input audio frames characterizing an utterance (120) that includes a particular key phrase; the set of ground-truth common tokens in each corresponding sequence of ground-truth output tokens (420) comprises a set of word or sub-word unit tokens that form a ground-truth transcription of the utterance (120) characterized by the sequence of input audio frames; the set of ground-truth special tokens in each corresponding sequence of ground-truth output tokens (420) comprises at least one ground-truth keyword token indicating a respective location in the ground-truth transcription immediately after the particular key phrase appears in the ground-truth transcription; the one or more predicted common tokens (204) of each corresponding output token sequence hypothesis (432) of the one or more output token sequence hypotheses (432) comprises a sequence of predicted word or sub-word tokens that form a respective candidate transcription of the utterance; and each corresponding predicted special token that appears in the corresponding output token sequence hypothesis (432) but does not appear in the corresponding sequence of ground-truth output tokens (420) comprises a predicted key phrase token indicating a respective location in the respective candidate transcription immediately after the sequence transduction model (200) predicts that the particular key phrase is detected.

18. The system (100) of any of claims 12-17, wherein: the sequence of training input features (415) comprises a sequence of input audio frames characterizing multiple utterances spoken by at least two different speakers; the set of ground-truth common tokens in each corresponding sequence of ground-truth output tokens (420) comprises a set of word or sub-word unit tokens that form a ground-truth transcription of the multiple spoken utterances (120) characterized by the sequence of input audio frames; the set of ground-truth special tokens in each corresponding sequence of ground-truth output tokens (420) comprises a set of one or more ground-truth speaker change tokens each indicating a respective location where a speaker change occurs in the ground-truth transcription of the multiple utterances; the one or more predicted common tokens (204) of each corresponding output token sequence hypothesis (432) of the one or more output token sequence hypotheses (432) comprises a sequence of predicted word or sub-word unit tokens that form a respective candidate transcription of the multiple utterances; and each corresponding predicted special token that appears in the corresponding output token sequence hypothesis (432) but does not appear in the corresponding sequence of ground-truth output tokens (420) comprises a predicted speaker change token indicating a respective location in the respective candidate transcription where a respective speaker change event is detected by the sequence transduction model (200).

19. The system (100) of any of claims 12-18, wherein the operations further comprise, for each training sample (410) in the plurality of training samples (410): determining a customized Levenshtein distance between each corresponding output token sequence hypothesis (432) of the one or more output token sequence hypotheses (432) obtained for the corresponding sequence of training input features (415) and the corresponding sequence of ground-truth output tokens (420); and based on the customized Levenshtein distance: identifying the number of special token insertions for each corresponding output token sequence hypothesis (432); and identifying the number of special token deletions for each corresponding output token sequence hypothesis (432).

20. The system (100) of claim 19, wherein the customized Levenshtein distance determined between each corresponding output token sequence hypothesis (432) and the corresponding sequence of ground-truth output tokens (420) prevents the sequence transduction model (200) from allowing substitutions between special tokens and common tokens during training of the sequence transduction model (200).

21. The system (100) of any of claims 12-20, wherein the sequence transduction model (200) comprises a recurrent neural network-transducer (RNN-T) model architecture.

22. The system (100) of any of claims 12-21, wherein the sequence transduction model (200) comprises at least one of a speech recognition model, a character recognition model, an endpointing model, a speaker turn detection model, or a machine translation model.

Description:
AUGMENTING TRAINING OF SEQUENCE TRANSDUCTION MODELS USING TOKEN-LEVEL LOSSES

TECHNICAL FIELD

[0001] This disclosure relates to the training of sequence transduction models.

BACKGROUND

[0002] Sequence transduction models are constructed and trained for transforming input sequences into output sequences. Example sequence transduction models include, but are not limited to, speech recognition models for transforming a sequence of input audio features into a transcription including a sequence of words or sub-word units, character recognition models for transforming a sequence of handwritten characters into a sequence of word or sub-word text pieces, and machine translation models for transforming a first sequence of words in a first language into a second sequence of words in a second language.

SUMMARY

[0003] One aspect of the disclosure provides a computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations for augmenting training of sequence transduction models using token-level losses. The operations include receiving a plurality of training samples each including a corresponding sequence of training input features paired with a corresponding sequence of ground-truth output tokens, the sequence of ground-truth output tokens including a set of ground-truth common tokens and a set of ground-truth special tokens. For each training sample in the plurality of training samples, the operations include processing, using a sequence transduction model, the corresponding sequence of training input features to obtain one or more output token sequence hypotheses, each output token sequence hypothesis including one or more predicted common tokens; and determining a per sample token-level loss based on, for each corresponding output token sequence hypothesis obtained for the training sample: a number of special token insertions each associated with a corresponding predicted special token that appears in the corresponding output token sequence hypothesis but does not appear in the corresponding sequence of ground-truth output tokens; and a number of special token deletions each associated with a corresponding ground-truth special token in the set of ground-truth special tokens that does not appear in the corresponding output token sequence hypothesis. The operations further include training the sequence transduction model to minimize additive error rate based on the per sample token-level losses determined for the plurality of training samples.
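The following is a minimal, illustrative Python sketch (not taken from the application) of how the special token insertion and deletion counts described above could be obtained for a single hypothesis; the token inventory and the simple count-based matching are assumptions, and the application instead derives these counts from an alignment against the ground-truth sequence (see the customized Levenshtein distance described below).

# Illustrative sketch only; the special-token inventory and count-based
# matching are assumptions, not the application's reference implementation.
SPECIAL_TOKENS = {"<st>", "<hw>"}  # e.g., speaker-change and hotword tokens

def count_special_errors(hypothesis, reference, special_tokens=SPECIAL_TOKENS):
    """Count special token insertions and deletions for one hypothesis.

    An insertion is a special token predicted in the hypothesis that does not
    appear in the ground-truth reference; a deletion is a ground-truth special
    token that does not appear in the hypothesis.
    """
    insertions, deletions = 0, 0
    for tok in special_tokens:
        n_hyp = hypothesis.count(tok)
        n_ref = reference.count(tok)
        insertions += max(0, n_hyp - n_ref)  # predicted but not in the reference
        deletions += max(0, n_ref - n_hyp)   # in the reference but not predicted
    return insertions, deletions

# Example: one spurious speaker-change token and one missed hotword token.
hyp = ["word1", "word2", "<st>", "word3"]
ref = ["word1", "word2", "word3", "<hw>"]
print(count_special_errors(hyp, ref))  # (1, 1)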

[0004] Implementations of the disclosure may include one or more of the following optional features. In some implementations, determining the per sample token-level loss is further based on a total number of ground-truth output tokens in the corresponding sequence of ground-truth output tokens and a respective number of predicted common token errors relative to the set of ground-truth common tokens in the corresponding sequence of ground-truth tokens; and when determining the per sample token-level loss, the number of special token insertions and the number of special token deletions are each weighted higher than the respective number of predicted common token errors to force the sequence transduction model to reduce special token insertion and deletion rates during training. In some examples, the operations further include, for each training sample in the plurality of training samples: processing, using the sequence transduction model, the corresponding sequence of training input features to predict probability distributions over possible output tokens; and training the sequence transduction model based on a negative log of the probability distributions for the corresponding sequence of ground-truth output tokens conditioned on the corresponding sequence of training input features. Here, training the sequence transduction model based on the negative log of the probability distributions may include initially training the sequence transduction model based on the negative log of the probability distributions to initialize the sequence transduction model; and training the sequence transduction model to minimize additive error rate based on the per sample token-level losses may include fine tuning the initialized sequence transduction model based on minimizing the additive error rate based on the per sample token-level losses.

[0005] In some examples, processing the corresponding sequence of training input features to obtain one or more output token sequence hypotheses includes processing the corresponding sequence of training input features to obtain an N-best list of output token sequence hypotheses, each corresponding output token sequence hypothesis in the N-best list having a respective probability score assigned by the sequence transduction model; and determining the per sample token-level loss is further based on the respective probability score of the corresponding output token sequence hypotheses.
In some implementations, the sequence of training input features includes a sequence of input audio frames characterizing an utterance that includes a particular key phrase; the set of ground-truth common tokens in each corresponding sequence of ground-truth output tokens includes a set of word or sub-word unit tokens that form a ground-truth transcription of the utterance characterized by the sequence of input audio frames; the set of ground-truth special tokens in each corresponding sequence of ground-truth output tokens includes at least one ground-truth keyword token indicating a respective location in the ground-truth transcription immediately after the particular key phrase appears in the ground-truth transcription; the one or more predicted common tokens of each corresponding output token sequence hypothesis of the one or more output token sequence hypotheses includes a sequence of predicted word or sub-word tokens that form a respective candidate transcription of the utterance; and each corresponding predicted special token that appears in the corresponding output token sequence hypothesis but does not appear in the corresponding sequence of ground-truth output tokens includes a predicted key phrase token indicating a respective location in the respective candidate transcription immediately after the sequence transduction model predicts that the particular key phrase is detected.
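Under assumptions about the exact functional form (the weights and normalization below are illustrative and not specified by the application), the weighting of special token insertions and deletions described in the optional features above can be written as a per-sample weighted additive error rate, for example:

\ell_{\text{token}} = \frac{w_{\text{ins}}\, I_{\text{sp}} + w_{\text{del}}\, D_{\text{sp}} + E_{\text{com}}}{N_{\text{ref}}}, \qquad w_{\text{ins}}, w_{\text{del}} > 1,

where I_{\text{sp}} is the number of special token insertions, D_{\text{sp}} is the number of special token deletions, E_{\text{com}} is the respective number of predicted common token errors, and N_{\text{ref}} is the total number of ground-truth output tokens in the corresponding sequence. Setting w_{\text{ins}} and w_{\text{del}} greater than one is what forces the model to reduce special token insertion and deletion rates relative to common token errors.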

[0006] In some implementations, the sequence of training input features includes a sequence of input audio frames characterizing multiple utterances spoken by at least two different speakers; the set of ground-truth common tokens in each corresponding sequence of ground-truth output tokens includes a set of word or sub-word unit tokens that form a ground-truth transcription of the multiple spoken utterances characterized by the sequence of input audio frames; the set of ground-truth special tokens in each corresponding sequence of ground-truth output tokens includes a set of one or more ground-truth speaker change tokens each indicating a respective location where a speaker change occurs in the ground-truth transcription of the multiple utterances; the one or more predicted common tokens of each corresponding output token sequence hypothesis of the one or more output token sequence hypotheses includes a sequence of predicted word or sub-word unit tokens that form a respective candidate transcription of the multiple utterances; and each corresponding predicted special token that appears in the corresponding output token sequence hypothesis but does not appear in the corresponding sequence of ground-truth output tokens includes a predicted speaker change token indicating a respective location in the respective candidate transcription where a respective speaker change event is detected by the sequence transduction model.

[0007] In some examples, the operations further include, for each training sample in the plurality of training samples: determining a customized Levenshtein distance between each corresponding output token sequence hypothesis of the one or more output token sequence hypotheses obtained for the corresponding sequence of training input features and the corresponding sequence of ground-truth output tokens; and based on the customized Levenshtein distance: identifying the number of special token insertions for each corresponding output token sequence hypothesis; and identifying the number of special token deletions for each corresponding output token sequence hypothesis. Here, the customized Levenshtein distance determined between each corresponding output token sequence hypothesis and the corresponding sequence of ground-truth output tokens may prevent the sequence transduction model from allowing substitutions between special tokens and common tokens during training of the sequence transduction model.

[0008] In some implementations, the sequence transduction model includes a recurrent neural network-transducer (RNN-T) model architecture. Additionally, the sequence transduction model may include at least one of a speech recognition model, a character recognition model, an endpointing model, a speaker turn detection model, or a machine translation model.
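As a hedged illustration of the customized Levenshtein distance described above, the following Python sketch assigns an effectively infinite cost to substitutions between a special token and a common token, so that such mismatches can only be explained as an insertion plus a deletion; the cost values and function names are assumptions rather than the application's implementation.

# Illustrative sketch: a Levenshtein distance in which substitutions between
# special tokens and common tokens are disallowed. Not the reference code.
INF = float("inf")

def customized_levenshtein(hyp, ref, special_tokens):
    """Edit distance where special<->common substitutions are forbidden."""
    n, m = len(hyp), len(ref)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)   # unmatched hypothesis tokens count as insertions
    for j in range(m + 1):
        d[0][j] = float(j)   # unmatched reference tokens count as deletions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if hyp[i - 1] == ref[j - 1]:
                sub = 0.0
            elif (hyp[i - 1] in special_tokens) != (ref[j - 1] in special_tokens):
                sub = INF    # never substitute a special token for a common one
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1.0,      # insertion
                          d[i][j - 1] + 1.0,      # deletion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[n][m]

In practice, back-tracing the alignment (rather than only returning the distance) yields which special tokens were inserted and which were deleted, which is how the counts used in the per sample token-level loss would be identified.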

[0009] Another aspect of the disclosure provides a system that includes data processing hardware and memory hardware storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations. The operations include receiving a plurality of training samples each including a corresponding sequence of training input features paired with a corresponding sequence of ground-truth output tokens, the sequence of ground-truth output tokens including a set of ground-truth common tokens and a set of ground-truth special tokens. For each training sample in the plurality of training samples, the operations include processing, using a sequence transduction model, the corresponding sequence of training input features to obtain one or more output token sequence hypotheses, each output token sequence hypothesis including one or more predicted common tokens; and determining a per sample token-level loss based on, for each corresponding output token sequence hypothesis obtained for the training sample: a number of special token insertions each associated with a corresponding predicted special token that appears in the corresponding output token sequence hypothesis but does not appear in the corresponding sequence of ground-truth output tokens; and a number of special token deletions each associated with a corresponding ground-truth special token in the set of ground-truth special tokens that does not appear in the corresponding output token sequence hypothesis. The operations further include training the sequence transduction model to minimize additive error rate based on the per sample token-level losses determined for the one or more output token sequence hypotheses obtained for each training sample in the plurality of training samples.

[0010] Implementations of the disclosure may include one or more of the following optional features. In some implementations, determining the per sample token-level loss is further based on a total number of ground-truth output tokens in the corresponding sequence of ground-truth output tokens and a respective number of predicted common token errors relative to the set of ground-truth common tokens in the corresponding sequence of ground-truth tokens; and when determining the per sample token-level loss, the number of special token insertions and the number of special token deletions are each weighted higher than the respective number of predicted common token errors to force the sequence transduction model to reduce special token insertion and deletion rates during training. In some examples, the operations further include, for each training sample in the plurality of training samples: processing, using the sequence transduction model, the corresponding sequence of training input features to predict probability distributions over possible output tokens; and training the sequence transduction model based on a negative log of the probability distributions for the corresponding sequence of ground-truth output tokens conditioned on the corresponding sequence of training input features. Here, training the sequence transduction model based on the negative log of the probability distributions may include initially training the sequence transduction model based on the negative log of the probability distributions to initialize the sequence transduction model; and training the sequence transduction model to minimize additive error rate based on the per sample token-level losses may include fine tuning the initialized sequence transduction model based on minimizing the additive error rate based on the per sample token-level losses.

[0011] In some examples, processing the corresponding sequence of training input features to obtain one or more output token sequence hypotheses includes processing the corresponding sequence of training input features to obtain an N-best list of output token sequence hypotheses, each corresponding output token sequence hypothesis in the N-best list having a respective probability score assigned by the sequence transduction model; and determining the per sample token-level loss for each corresponding output token sequence hypothesis in the N-best list is further based on the respective probability score of the corresponding output token sequence hypotheses. In some implementations, the sequence of training input features includes a sequence of input audio frames characterizing an utterance that includes a particular key phrase; the set of ground-truth common tokens in each corresponding sequence of ground-truth output tokens includes a set of word or sub-word unit tokens that form a ground-truth transcription of the utterance characterized by the sequence of input audio frames; the set of ground-truth special tokens in each corresponding sequence of ground-truth output tokens includes at least one ground-truth keyword token indicating a respective location in the ground-truth transcription immediately after the particular key phrase appears in the ground-truth transcription; the one or more predicted common tokens of each corresponding output token sequence hypothesis of the one or more output token sequence hypotheses includes a sequence of predicted word or sub-word tokens that form a respective candidate transcription of the utterance; and each corresponding predicted special token that appears in the corresponding output token sequence hypothesis but does not appear in the corresponding sequence of ground-truth output tokens includes a predicted key phrase token indicating a respective location in the respective candidate transcription immediately after the sequence transduction model predicts that the particular key phrase is detected.

[0012] In some implementations, the sequence of training input features includes a sequence of input audio frames characterizing multiple utterances spoken by at least two different speakers; the set of ground-truth common tokens in each corresponding sequence of ground-truth output tokens includes a set of word or sub-word unit tokens that form a ground-truth transcription of the multiple spoken utterances characterized by the sequence of input audio frames; the set of ground-truth special tokens in each corresponding sequence of ground-truth output tokens includes a set of one or more ground-truth speaker change tokens each indicating a respective location where a speaker change occurs in the ground-truth transcription of the multiple utterances; the one or more predicted common tokens of each corresponding output token sequence hypothesis of the one or more output token sequence hypotheses includes a sequence of predicted word or sub-word unit tokens that form a respective candidate transcription of the multiple utterances; and each corresponding predicted special token that appears in the corresponding output token sequence hypothesis but does not appear in the corresponding sequence of ground-truth output tokens includes a predicted speaker change token indicating a respective location in the respective candidate transcription where a respective speaker change event is detected by the sequence transduction model.

[0013] In some examples, the operations further include, for each training sample in the plurality of training samples: determining a customized Levenshtein distance between each corresponding output token sequence hypothesis of the one or more output token sequence hypotheses obtained for the corresponding sequence of training input features and the corresponding sequence of ground-truth output tokens; and based on the customized Levenshtein distance: identifying the number of special token insertions for each corresponding output token sequence hypothesis; and identifying the number of special token deletions for each corresponding output token sequence hypothesis. Here, the customized Levenshtein distance determined between each corresponding output token sequence hypothesis and the corresponding sequence of ground-truth output tokens may prevent the sequence transduction model from allowing substitutions between special tokens and common tokens during training of the sequence transduction model.

[0014] In some implementations, the sequence transduction model includes a recurrent neural network-transducer (RNN-T) model architecture. Additionally, the sequence transduction model may include at least one of a speech recognition model, a character recognition model, an endpointing model, a speaker turn detection model, or a machine translation model.

[0015] The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

[0016] FIG. 1 is a schematic view of an example system including a sequence transduction model for performing speech recognition.

[0017] FIG. 2 is a schematic view of an example recurrent neural network-transducer (RNN-T) model of the sequence transduction model of FIG. 1.

[0018] FIG. 3 is a schematic view of an example tied and reduced prediction network of the RNN-T model of FIG. 2.

[0019] FIG. 4 is a schematic view of an example two-stage process for augmenting training of a sequence transduction model with token-level losses.

[0020] FIG. 5 is a flowchart of an example arrangement of operations for a computer-implemented method for augmenting training of a sequence transduction model with token-level losses.

[0021] FIG. 6 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.

[0022] Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0023] In addition to transforming between input sequences and output sequences, sequence transduction models have also been constructed for detecting special input conditions and generating special outputs (e.g., special output tokens or other types of indications) when special input conditions are detected. That is, sequence transduction models may be constructed and trained to process a sequence of input data to generate a predicted sequence of outputs that includes, in addition to other normal/common predicted outputs, special outputs (e.g., special output tokens or other types of indications) when the sequence transduction model detects corresponding special input conditions in the input data. Here, predicted common outputs are associated with inputs that are not associated with a special input condition. For example, an automatic speech recognition (ASR) model may be constructed and trained to detect special input conditions in a sequence of input audio features (i.e., captured input audio data) and to include or embed special output tokens in output transcriptions (i.e., an output sequence of word or sub-word units) that represent the detected special input conditions. Here, special input conditions include, but are not limited to, a speaker change (i.e., a change in who is speaking) and a spoken key phrase (e.g., a spoken hotword, such as “Hey Assistant”, or a spoken warm word, such as “Volume Up”). In an example, the ASR model generates a predicted transcription “Hey Assistant <hw>, what time is it” from captured input audio data representing a spoken utterance of “Hey Assistant, what time is it.” Here, the transcription includes the special output token “<hw>” immediately following “Hey Assistant” to indicate that the ASR model identified “Hey Assistant” as a spoken hotword, and common tokens representing the other spoken words. Other example sequence transduction models include, but are not limited to, a character recognition model and a machine translation model. Here, example special input conditions include, but are not limited to, a potentially offensive word, a special character, an equation, and a proper noun.
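For illustration only, the special output tokens described above can be represented simply as additional entries in the model's output vocabulary alongside the common word or sub-word tokens; the specific token strings below are assumptions used throughout these examples, not a prescribed inventory.

# Hypothetical output vocabulary: common word-piece tokens plus special tokens.
common_tokens = ["hey", "assistant", "what", "time", "is", "it"]
special_tokens = ["<hw>", "<st>"]   # assumed hotword and speaker-change markers
output_vocab = common_tokens + special_tokens

# A decoded sequence embedding the special hotword token at the detected location.
decoded = ["hey", "assistant", "<hw>", "what", "time", "is", "it"]
hotword_detected = "<hw>" in decoded   # True when the key phrase is detected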

[0024] However, traditional methods of training sequence transduction models for detecting special input conditions have been problematic, at least because special input conditions in training data are sparse relative to common inputs. For example, the occurrence rate of speaker changes, key phrases, etc. in spoken utterance training samples may be orders of magnitude less than that of non-special spoken words. Moreover, in some examples, there are significantly fewer possible unique special tokens (e.g., just “<st>” and “<hw>”) than there are possible common tokens (e.g., all words in the English language). Traditionally, sequence transduction models have been trained using negative log likelihood losses computed for entire output sequences (i.e., including common outputs and infrequent special outputs), and the sparsity of special input conditions in training data causes the special outputs to be de-emphasized during training, which degrades detection accuracy for special input conditions.

[0025] Implementations herein are directed toward augmenting the training of sequence transduction models for detecting special input conditions using token-level losses. In particular implementations disclosed herein, a training process initially trains a sequence transduction model in a first training pass using training data that includes a set/plurality of training samples that each include a sequence of input data paired with a corresponding sequence of ground-truth outputs. Here, the ground-truth outputs include, in addition to common outputs, special ground-truth outputs for corresponding special input conditions in the input data. An example training sample includes captured input audio data representing a spoken utterance of “Hey Assistant, what is today’s weather” and a corresponding ground-truth transcription of “Hey Assistant <hw>, what is today’s weather.” Here, the ground-truth transcription includes the special output token “<hw>” immediately following “Hey Assistant” to indicate that the speech recognition model identified “Hey Assistant” as a spoken hotword/key phrase, and common tokens representing the other spoken words or sub-word units of the utterance. The initial training in the first training pass may be performed using a conventional loss function (e.g., minimizing a negative log likelihood) based on differences between predicted outputs of the sequence transduction model and corresponding ground-truth outputs, both of which include special outputs for special input conditions. Notably, during the first training pass, the losses are computed for entire predicted output sequences (i.e., including common outputs and infrequent special outputs) for each particular training sample.

[0026] The training process then re-trains or fine tunes the sequence transduction model in a second training pass using, for example, the same training data. For each training sample during the second training pass, the training process performs a beam search to identify the N-best predicted output sequences (i.e., output token sequence hypotheses) based on the input data for the training sample. The training process performs the re-training in the second training pass using a loss function that is a weighted sum of a token-level loss function and a conventional loss function. The conventional loss function may be the same as that used during the initial training in the first training pass. In some implementations, the token-level loss function includes a minimum additive error rate, such as a minimum word error rate (MWER) or a word-level edit-based minimum Bayes risk (EMBR), and losses associated with special tokens are given greater loss values than the losses associated with regular tokens.
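The second-pass objective described above can be sketched as an expected token-level error over the N-best hypotheses, interpolated with the conventional loss. The renormalization of hypothesis scores, the interpolation weight, and the function names below are assumptions for illustration; in a real trainer both terms would be differentiable tensors rather than Python floats.

import math

def second_pass_loss(nbest, reference, token_level_error, conventional_loss, mix=0.1):
    """Weighted sum of an expected (MWER-style) token-level loss over an
    N-best list and a conventional loss. Illustrative sketch only.

    nbest: list of (hypothesis_tokens, log_prob) pairs from beam search.
    reference: the ground-truth token sequence for this training sample.
    token_level_error: callable(hyp, ref) -> float, e.g. a weighted count of
        special token insertions/deletions plus common token errors.
    conventional_loss: e.g. the negative log likelihood for this sample.
    mix: interpolation weight for the conventional term (an assumption).
    """
    # Renormalize the model's hypothesis scores over the N-best list.
    log_probs = [lp for _, lp in nbest]
    max_lp = max(log_probs)
    weights = [math.exp(lp - max_lp) for lp in log_probs]
    z = sum(weights)

    # Expected per-sample error under the renormalized hypothesis distribution.
    expected = sum((w / z) * token_level_error(hyp, reference)
                   for (hyp, _), w in zip(nbest, weights))
    return expected + mix * conventional_loss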

[0027] Referring to FIG. 1, an example system 100 includes a user device 110 for capturing input audio data 122 (i.e., a sequence of input audio features 122) representing one or more utterances 120, 120a-n spoken by one or more speakers (e.g., users or persons) 10, 10a-n and communicating with a cloud-computing environment 140 via a network 130. In some implementations, the user device 110 and/or the cloud-computing environment 140 executes a sequence transduction model 200 that is configured to receive input data/features and generate a sequence of predicted outputs (e.g., a sequence of output tokens). In the example shown, the sequence transduction model 200 includes an automated speech recognition (ASR) model 200 configured to generate one or more predicted transcriptions 202, 202a-n for the spoken utterances 120. Each transcription 202 includes a sequence of output tokens including common tokens 204, 204a-n corresponding to common spoken words of the spoken utterances 120. More specifically, the sequence of common tokens 204 includes word or sub-word tokens that form the transcription 202 of the utterance 120. The ASR model 200 may include any transducer-based architecture including, but not limited to, a transformer-transducer (T-T), a recurrent neural network transducer (RNN-T), and/or a conformer-transducer (C-T). The ASR model 200 is also constructed and trained to detect special input conditions in the spoken utterances 120 and to include or embed special tokens 206, 206a-n in a predicted sequence of output tokens (i.e., the transcriptions 202), the special tokens 206 representing the detected special input conditions. Example special input conditions include, but are not limited to, a speaker change (i.e., a change in who is speaking) and a spoken key phrase (e.g., a spoken hotword, such as “Hey Alexa”, or a spoken warm word, such as “Volume Up”). In an example, for a spoken utterance 120 of a single user 10 of “Hey Assistant, what time is it,” the ASR model 200 generates a predicted transcription 202 “Hey Assistant <hw>, what time is it.” Here, the transcription 202 includes a special token 206 “<hw>” to indicate that the ASR model 200 identified “Hey Assistant” as a spoken hotword, and common tokens 204 representing the other common words of the transcription 202 (e.g., “what”, “time”, “is”, and “it”). In another example, for a multi-speaker spoken utterance 120 including “word1 word2” spoken by a first user 10a, “word3 word4” spoken by a second user 10b, and “word5 word6” spoken by the first user 10a, the ASR model 200 generates a predicted transcription 202 “word1 word2 <st> word3 word4 <st> word5 word6.” Here, the transcription 202 includes special tokens 206 “<st>” to indicate that the ASR model identified speaker changes, and common tokens 204 representing the recognized words in the transcription 202 (e.g., “word1”, “word2”, “word3”, “word4”, “word5”, and “word6”) that were spoken in the multi-speaker spoken utterance 120. For clarity of explanation, the system 100 and the sequence transduction model 200 are described with reference to the ASR model 200 for transcribing spoken utterances 120 and detecting spoken special input conditions, such as speaker changes, and special words (e.g., hotwords and warm words). Moreover, for clarity of explanation, disclosed example methods for augmented training of a sequence transduction model using token-level losses are described using the example of the ASR model 200.
However, other type(s) of sequence transduction models for predicting other type(s) of output sequences from other type(s) of input sequences, and for detecting other type(s) of special input conditions may be used. For example, a sequence transduction model may include a character recognition model for transcribing a sequence of written characters into a text sequence, where special input conditions include, but are not limited to, special characters, equations, offensive words, and proper nouns. Another example sequence transduction model includes a machine translation model for translating an input sequence of words in a first language into an output sequence of words in a second language different from the first language, wherein special input conditions include, but are not limited to, proper nouns that don’t get translated. Moreover, persons of ordinary skill in the art will recognize that disclosed example methods of augmented training of a sequence transduction model using token-level losses may be used to augment the training of other types of sequence transduction models.
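As a small post-processing illustration of the speaker change example above (the helper name and token string are assumptions, not part of the application), a decoded token sequence containing “<st>” markers can be split into per-speaker turns as follows:

def split_speaker_turns(tokens, turn_token="<st>"):
    """Group decoded tokens into speaker turns at each speaker-change token."""
    turns, current = [], []
    for tok in tokens:
        if tok == turn_token:
            if current:
                turns.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        turns.append(current)
    return turns

decoded = ["word1", "word2", "<st>", "word3", "word4", "<st>", "word5", "word6"]
print(split_speaker_turns(decoded))
# [['word1', 'word2'], ['word3', 'word4'], ['word5', 'word6']]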

[0028] In the example shown, the ASR model 200 includes a recurrent neural network-transducer (RNN-T) model (see FIG. 2) and resides on the user device 110 and/or on the cloud-computing environment 140. The user device 110 and/or the cloud-computing environment 140 also includes an audio subsystem 150 configured to receive utterances 120 spoken by the user(s) 10 and captured by the audio capture device 116a, and convert the captured utterances 120 into a corresponding digital format associated with input audio data 122 (i.e., acoustic frames) capable of being processed by the ASR model 200. Thereafter, the ASR model 200 receives, as input, the audio data 122 corresponding to the utterance 120, and generates/predicts, as output, corresponding predicted transcriptions 202 (e.g., recognition result/hypothesis) of the utterances 120.

[0029] The user device 110 may correspond to any computing device associated with a user 10 and capable of capturing audio data. Some examples of user devices 110 include, but are not limited to, mobile devices (e.g., mobile phones, tablets, laptops, etc.), computers, wearable devices (e.g., smart watches), smart appliances, internet of things (IoT) devices, vehicle infotainment systems, smart displays, smart speakers, etc. The user device 110 includes data processing hardware 112 and memory hardware 114 in communication with the data processing hardware 112 and storing instructions that, when executed by the data processing hardware 112, cause the data processing hardware 112 to perform one or more operations. The user device 110 further includes an audio system 116 with an audio capture device (e.g., microphone) 116, 116a for capturing and converting spoken utterances 120 into electrical signals and a speech output device (e.g., a speaker) 116, 116b for communicating an audible audio signal (e.g., as output audio data from the device 110). While the user device 110 implements a single audio capture device 116a in the example shown, the user device 110 may implement an array of audio capture devices 116a without departing from the scope of the present disclosure, whereby one or more capture devices 116a in the array may not physically reside on the user device 110, but be in communication with the audio system 116.

[0030] An output 160 may receive the transcription 202 output from the sequence transduction model 200 that includes a sequence of common tokens 204 associated with words or sub-word units (i.e., graphemes, phonemes, wordpieces) representing words recognized in a spoken utterance 120 and any special tokens 206 the model 200 is trained to detect and embed at a corresponding location in the transcription. The output 160 may include a program (e.g., digital assistant 50) or other component/software that the special token 206 is configured to trigger to perform an operation. For instance, when the special token 206 includes the hotword <hw>, the output may include a natural language understanding/processing (NLU/NLP) module executing on the user device 110 and/or cloud-computing environment 140 that performs query interpretation on the common tokens 204 in the transcription 202 to identify a user command/query and then instructs a downstream component/application to perform an action/operation specified by the command.

[0031] The output 160 may also include a user interface generator executing on the user device 110 and/or the cloud-computing environment 140 that is configured to present a representation of the transcriptions 202 to the user 10 of the user device 110. In some examples, the special tokens 206 may include a speaker change token indicating a respective location in the transcription 202 where a respective speaker change event is detected by the sequence transduction model 200. In these examples, the output 160 corresponding to the user interface generator may annotate the transcription 202 with speaker labels based on the speaker change tokens. The user interface generator may display the initial speech recognition results 202a in a streaming fashion and subsequently display a final speech recognition result 202b.

[0032] In the example shown, the user 10 may interact with a program or application 50 (e.g., a digital assistant application 50) of the user device 110 that uses the ASR model 200. For example, the digital assistant application 50 may display a digital assistant interface on a screen of the user device 110 to depict an interaction between the user 10 and the digital assistant application 50. For instance, the digital assistant application 50 may respond to a question posed by the user 10 using NLP/NLU to determine whether the written language prompts any action.

[0033] The cloud-computing environment 140 may be a distributed or virtualized system having scalable/elastic resources 142. The resources 142 include computing resources 144 (e.g., data processing hardware) and/or storage resources 146 (e.g., memory hardware) in communication with the computing resources 144 and storing instructions that, when executed by the computing resources 144, cause the computing resources 144 to perform one or more operations. Alternatively, a central or remote server having data processing hardware and memory hardware may be used to implement the operations of the cloud-computing environment 140.

[0034] FIG. 2 is a schematic view of an example recurrent neural network-transducer model 200 (i.e., RNN-T model 200) that is trained using long-form training utterances to improve, during inference, speech recognition for long-form utterances. The RNN-T model 200 provides a small computational footprint and has lower memory requirements than conventional ASR architectures, making the RNN-T model 200 suitable for performing speech recognition entirely on the user device 110 (e.g., no communication with a remote server is required).

[0035] As shown, the RNN-T model 200 includes an encoder network 210, a prediction network 300, a joint network 220, and a final softmax layer 230. The prediction and joint networks 300, 220 may collectively provide an RNN-T decoder. The encoder network 210, which is roughly analogous to an acoustic model (AM) in a traditional ASR system, includes a recurrent network of stacked Long Short-Term Memory (LSTM) layers. For instance, the encoder reads a sequence of d-dimensional feature vectors (e.g., acoustic frames 122 (FIG. 1)) $x = (x_1, x_2, \ldots, x_T)$, where $x_t \in \mathbb{R}^d$, and produces at each time step a higher-order feature representation 212. This higher-order feature representation 212 is denoted as $h_t^{enc}$.

[0036] Similarly, the prediction network 300 is also an LSTM network, which, like a language model (LM), processes the sequence of non-blank symbols 232 output by the final softmax layer 230 so far into a dense or hidden representation 350. As described in greater detail below, the representation 350 includes a single embedding vector. Notably, the sequence of non-blank symbols 232 received at the prediction network 300 captures linguistic dependencies between non-blank symbols 232 predicted during previous time steps to assist the joint network 220 in predicting the probability of a next output symbol or blank symbol during the current time step. As described in greater detail below, to contribute to techniques for reducing the size of the prediction network 300 without sacrificing accuracy/performance of the RNN-T model 200, the prediction network 300 may receive a limited-history sequence of non-blank symbols 232 that is limited to the N previous non-blank symbols 232 output by the final softmax layer 230.

[0037] The joint network 220 combines the higher-order feature representation 212 produced by the encoder network 210 and the representation 350 (i.e., the single embedding vector 350) produced by the prediction network 300. The joint network 220 predicts a distribution 222 over the next output symbol.

Stated differently, the joint network 220 generates, at each time step, a probability distribution 222 over possible speech recognition hypotheses. Here, the “possible speech recognition hypotheses” correspond to a set of output labels each representing a symbol/character in a specified natural language. For example, when the natural language is English, the set of output labels may include twenty-seven (27) symbols, e.g., one label for each of the 26 letters in the English alphabet and one label designating a space. Accordingly, the joint network 220 may output a set of values indicative of the likelihood of occurrence of each of a predetermined set of output labels. This set of values can be a vector and can indicate a probability distribution over the set of output labels. In some cases, the output labels are graphemes (e.g., individual characters, and potentially punctuation and other symbols), but the set of output labels is not so limited. For example, the set of output labels can include wordpieces and/or entire words, in addition to or instead of graphemes. The output distribution of the joint network 220 can include a posterior probability value for each of the different output labels. Thus, if there are 100 different output labels representing different graphemes or other symbols, the output 232 of the joint network 220 can include 100 different probability values, one for each output label. The probability distribution can then be used to select and assign scores to candidate orthographic elements (e.g., graphemes, wordpieces, and/or words) in a beam search process (e.g., by the softmax layer 230) for determining the transcription 202.
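By way of illustration only, the following sketch shows one way a joint combination of an encoder feature and a prediction-network embedding could yield a log-probability distribution over output labels plus a blank label. The additive feed-forward form, the function name, and the dimensions are assumptions for illustration and are not asserted to be the specific architecture of the joint network 220.

```python
import numpy as np

def joint_network(enc_t, pred_u, w_enc, w_pred, w_out, b_out):
    """Combine one encoder frame and one prediction-network embedding into a
    log-probability distribution over output labels plus a blank label."""
    hidden = np.tanh(enc_t @ w_enc + pred_u @ w_pred)   # additive combination of the two inputs
    logits = hidden @ w_out + b_out                     # project to |labels| + 1 (extra blank)
    logits -= logits.max()                              # numerical stability for the softmax
    return logits - np.log(np.exp(logits).sum())        # log-softmax over the output labels

# Example with 27 grapheme labels plus blank; sizes are illustrative only.
rng = np.random.default_rng(0)
d = 740
dist = joint_network(rng.normal(size=d), rng.normal(size=d),
                     rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                     rng.normal(size=(d, 28)), np.zeros(28))
```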

[0038] The final softmax layer 230 receives the probability distribution 232 for the final speech recognition result 202b and selects the output label/symbol with the highest probability to produce the transcription. The final softmax layer 230 may employ any technique to select the output label/symbol with the highest probability in the distribution 232. In this manner, the RNN-T model 200 does not make a conditional independence assumption; rather, the prediction of each symbol 232 is conditioned not only on the acoustics but also on the sequence of labels 232 output so far. The RNN-T model 200 does assume an output symbol 232 is independent of future acoustic frames 122, which allows the RNN-T model 200 to be employed in a streaming fashion.

[0039] In some examples, the encoder network 210 of the RNN-T model 200 includes eight 2,048-dimensional LSTM layers, each followed by a 740-dimensional projection layer. In other implementations, the encoder network 210 includes a plurality of multi-headed attention layers. For instance, the plurality of multi-headed attention layers may include a network of conformer or transformer layers. The prediction network 300 may have two 2,048-dimensional LSTM layers, each of which is also followed by a 740-dimensional projection layer, as well as an embedding layer of 128 units. Finally, the joint network 220 may also have 740 hidden units. The softmax layer 230 may be composed of a unified wordpiece or grapheme set that is generated using all unique wordpieces or graphemes in the training data. When the output symbols/labels include wordpieces, the set of output symbols/labels may include 4,096 different wordpieces. When the output symbols/labels include graphemes, the set of output symbols/labels may include less than 100 different graphemes.

[0040] FIG. 3 is a schematic view of an example prediction network 300 for the RNN-T model 200. The prediction network 300 receives, as input, a sequence of non-blank symbols that is limited to the N previous non-blank symbols 232a-n output by the final softmax layer 230. In some examples, N is equal to two. In other examples, N is equal to five; however, the disclosure is non-limiting and N may equal any integer. The sequence of non-blank symbols 232a-n indicates an initial speech recognition result 202 (FIG. 1). In some implementations, the prediction network 300 includes a multi-headed attention mechanism 302 that shares a shared embedding matrix 304 across each head 302A-302H of the multi-headed attention mechanism. In one example, the multi-headed attention mechanism 302 includes four heads. However, any number of heads may be employed by the multi-headed attention mechanism 302. Notably, the multi-headed attention mechanism improves performance significantly with minimal increase to model size. As described in greater detail below, each head 302A-H includes its own row of position vectors 308, and rather than incurring an increase in model size by concatenating outputs 318A-H from all the heads, the outputs 318A-H are instead averaged by a head average module 322.

[0041] Referring to the first head 302A of the multi-headed attention mechanism 302, the head 302A generates, using the shared embedding matrix 304, a corresponding embedding 306, 306a-n for each non-blank symbol among the sequence of non-blank symbols 232a-n received as input at the corresponding time step from the plurality of time steps. Notably, since the shared embedding matrix 304 is shared across all heads of the multi-headed attention mechanism 302, the other heads 302B-H all generate the same corresponding embeddings 306 for each non-blank symbol. The head 302A also assigns a respective position vector PV 308, 308Aa-An to each corresponding non-blank symbol in the sequence of non-blank symbols 232a-n. The respective position vector PV 308 assigned to each non-blank symbol indicates a position in the history of the sequence of non-blank symbols (e.g., the N previous non-blank symbols 232a-n output by the final softmax layer 230). For instance, the first position vector PV 308Aa is assigned to a most recent position in the history, while the last position vector PV 308An is assigned to a last position in the history of the N previous non-blank symbols output by the final softmax layer 230. Notably, each of the embeddings 306 may include a same dimensionality (i.e., dimension size) as each of the position vectors PV 308.

[0042] While the corresponding embedding generated by the shared embedding matrix 304 for each non-blank symbol among the sequence of non-blank symbols is the same at all of the heads 302A-H of the multi-headed attention mechanism 302, each head 302A-H defines a different set/row of position vectors 308. For instance, the first head 302A defines a first row of position vectors 308Aa-An, the second head 302B defines a different row of position vectors 308Ba-Bn, and the H-th head 302H defines another different row of position vectors 308Ha-Hn.

[0043] For each non-blank symbol in the sequence of non-blank symbols 232a-n received, the first head 302A also weights, via a weight layer 310, the corresponding embedding 306 proportional to a similarity between the corresponding embedding and the respective position vector PV 308 assigned thereto. In some examples, the similarity may include a cosine similarity (e.g., cosine distance). In the example shown, the weight layer 310 outputs a sequence of weighted embeddings 312, 312Aa-An, each associated with the corresponding embedding 306 weighted proportional to the respective position vector PV 308 assigned thereto. Stated differently, the weighted embedding 312 output by the weight layer 310 for each embedding 306 may correspond to a dot product between the embedding 306 and the respective position vector PV 308. The weighted embeddings 312 may be interpreted as attending over the embeddings in proportion to how similar they are to the positions associated with their respective position vectors PV 308. To increase computational speed, the prediction network 300 includes non-recurrent layers, and therefore, the sequence of weighted embeddings 312Aa-An are not concatenated, but are instead averaged by a weighted average module 316 to generate, as output from the first head 302A, a weighted average 318A of the weighted embeddings 312Aa-An, as represented by EQN (1). In EQN (1), h represents the index of the heads 302, n represents the position in context, and e represents the embedding dimension. Additionally, in EQN (1), H, N, and de denote the sizes of the corresponding dimensions. The position vector PV 308 does not have to be trainable and may include random values. Notably, even though the weighted embeddings 312 are averaged, the position vectors PV 308 can potentially save position history information, alleviating the need to provide recurrent connections at each layer of the prediction network 300.

[0044] The operations described above with respect to the first head 302A are similarly performed by each other head 302B-H of the multi-headed attention mechanism 302. Due to the different set of position vectors PV 308 defined by each head 302, the weight layer 310 outputs a sequence of weighted embeddings 312Ba-Bn, 312Ha-Hn at each other head 302B-H that is different than the sequence of weighted embeddings 312Aa-An at the first head 302A. Thereafter, the weighted average module 316 generates, as output from each other corresponding head 302B-H, a respective weighted average 318B-H of the corresponding weighted embeddings 312 of the sequence of non-blank symbols.

[0045] In the example shown, the prediction network 300 includes a head average module 322 that averages the weighted averages 318A-H output from the corresponding heads 302A-H. A projection layer 326 with a SWISH activation may receive, as input, an output 324 from the head average module 322 that corresponds to the average of the weighted averages 318A-H, and generate, as output, a projected output 328. A final layer normalization 330 may normalize the projected output 328 to provide the single embedding vector 350 at the corresponding time step from the plurality of time steps. The prediction network 300 generates only a single embedding vector 350 at each of the plurality of time steps subsequent to an initial time step.
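For illustration, the following sketch implements the computation described in paragraphs [0041]-[0045] under stated assumptions: a shared embedding matrix, one row of position vectors per head, dot-product weighting, per-head averaging, head averaging, a SWISH projection, and layer normalization. The function name, shapes, and projection weights are illustrative and do not reproduce EQN (1) verbatim.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

def prediction_embedding(symbols, shared_embedding, position_vectors, proj_w, proj_b):
    """symbols: (N,) ids of the N previous non-blank symbols;
    shared_embedding: (V, d) shared across all heads;
    position_vectors: (H, N, d), one row of position vectors per head;
    proj_w, proj_b: weights of the projection layer."""
    X = shared_embedding[symbols]                      # (N, d) same embeddings at every head
    head_outputs = []
    for pv in position_vectors:                        # iterate over heads
        weights = np.einsum("nd,nd->n", X, pv)         # dot-product similarity per position
        weighted = weights[:, None] * X                # weight each embedding by its similarity
        head_outputs.append(weighted.mean(axis=0))     # per-head weighted average (no concat)
    avg = np.mean(head_outputs, axis=0)                # head average
    projected = swish(avg @ proj_w + proj_b)           # projection layer with SWISH
    mu, sigma = projected.mean(), projected.std() + 1e-6
    return (projected - mu) / sigma                    # layer-normalized single embedding vector

# Example with N=2 previous symbols, H=4 heads, d=8 embedding dim, V=16 vocabulary.
rng = np.random.default_rng(0)
emb = prediction_embedding(np.array([3, 7]), rng.normal(size=(16, 8)),
                           rng.normal(size=(4, 2, 8)), rng.normal(size=(8, 8)), np.zeros(8))
```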

[0046] In some configurations, the prediction network 300 does not implement the multi-headed attention mechanism 302 and only performs the operations described above with respect to the first head 302A. In these configurations, the weighted average 318A of the weighted embeddings 312Aa-An is simply passed through the projection layer 326 and the layer normalization 330 to provide the single embedding vector 350.

[0047] In some implementations, to further reduce the size of the RNN-T decoder, i.e., the prediction network 300 and the joint network 220, parameter tying between the prediction network 300 and the joint network 220 is applied. Specifically, for a vocabulary size |V| and an embedding dimension de, the shared embedding matrix 304 at the prediction network 300 has dimensions |V| × de. Meanwhile, the last hidden layer of the joint network 220 includes a dimension size dh, and the feed-forward projection weights from the hidden layer to the output logits have dimensions dh × (|V| + 1), with an extra blank token in the vocabulary. Accordingly, the feed-forward layer corresponding to the last layer of the joint network 220 includes a weight matrix of dimensions dh × (|V| + 1). By having the prediction network 300 tie the size of the embedding dimension de to the dimensionality dh of the last hidden layer of the joint network 220, the feed-forward projection weights of the joint network 220 and the shared embedding matrix 304 of the prediction network 300 can share their weights for all non-blank symbols via a simple transpose transformation.

Since the two matrices share all their values, the RNN-T decoder only needs to store the values once in memory, instead of storing two individual matrices. By setting the size of the embedding dimension de equal to the size of the hidden layer dimension dh, the RNN-T decoder reduces a number of parameters equal to the product of the embedding dimension de and the vocabulary size |V|. This weight tying corresponds to a regularization technique.
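The following short sketch illustrates the weight-tying idea: the shared embedding matrix is stored once and reused, via a transpose, as the joint network's final projection for all non-blank labels. The sizes and the handling of the extra blank label are assumptions for illustration.

```python
import numpy as np

vocab_size, d_e = 4096, 640                                  # illustrative sizes; d_e is tied to d_h
shared_embedding = np.random.randn(vocab_size, d_e)          # E: (|V|, d_e), stored once in memory
blank_row = np.random.randn(1, d_e)                          # extra row for the blank output label

# The joint network's output projection reuses E (transposed) for all non-blank labels.
output_projection = np.concatenate([shared_embedding, blank_row], axis=0).T  # (d_h, |V| + 1)

hidden = np.random.randn(d_e)                                # last hidden layer activation (d_h,)
logits = hidden @ output_projection                          # (|V| + 1,) output logits
```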

[0048] Referring back to FIG. 1, the system 100 includes a two-stage training process 400 that resides on the user device 110 and/or on the cloud-computing environment 140. In some examples, the sequence transduction model 200 resides on the user device 110, and the training process 400 resides on the cloud-computing environment 140. However, the sequence transduction model 200 and the training process 400 can both reside on the user device 110 and/or the cloud-computing environment 140.

[0049] The two-stage training process 400 (see FIG. 4) trains the sequence transduction model 200 (e.g., the ASR model 200) on training data 405 that includes a plurality of training samples 410, 410a-n. Each training sample 410 includes a spoken training utterance 415 (i.e., a sequence of input audio features) and corresponding ground-truth outputs 420 (e.g., a sequence of ground-truth output tokens representing a transcription 420 of the utterance 415). Here, the ground-truth outputs 420 include, in addition to common tokens for spoken words, special tokens for corresponding special input conditions in the input data. That is, the ground-truth outputs 420 are transcriptions that have been augmented with special tokens. An example training sample 410 includes a spoken utterance 415 of “Hey Assistant, what is today’s weather” and a corresponding ground-truth transcription 420 of “Hey Assistant <hw>, what is today’s weather.” Here, the ground-truth transcription 420 includes the special token “<hw>” immediately after “Hey Assistant” to indicate that “Hey Assistant” was spoken as a hotword, and common tokens representing the other spoken words of the utterance (e.g., “what”, “is”, “today’s”, and “weather”). Notably, in the example, the common tokens may include word or sub-word unit tokens that form a transcription of the utterance 415.
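Purely as an illustration of the augmented ground-truth format, a training sample could be represented as follows; the field names and token strings are assumptions and not a required data structure.

```python
# Hypothetical representation of one training sample whose ground-truth transcription
# has been augmented with a special token; for illustration only.
training_sample = {
    "audio_features": "sequence of acoustic frames for the spoken utterance",
    "ground_truth_tokens": ["Hey", "Assistant", "<hw>", "what", "is", "today's", "weather"],
}
special_tokens = {"<hw>", "<st>"}  # e.g., hotword and speaker-change tokens
common_tokens = [t for t in training_sample["ground_truth_tokens"] if t not in special_tokens]
```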

[0050] The training process 400 trains the sequence transduction model 200 (e.g., the ASR model 200) on a sequence transduction task (e.g., a speech recognition task). A first stage of the training process 400 initially trains the sequence transduction model 200 to minimize a conventional loss function (e.g., a negative log likelihood loss) that is based on differences between the ground-truth output tokens 420 and the output tokens 204, 206 generated by the sequence transduction model 200 from the input spoken training utterance 415. Here, the negative log likelihood function is determined based on all output tokens (i.e., both common and special tokens) of the ground-truth output tokens 420 and the output tokens 204, 206 predicted/generated by the sequence transduction model 200 from the input spoken training utterance 415. Notably, the sparsity of special input conditions in the training data 405 causes the training process 400 to de-emphasize special tokens during the first stage of the training process, which may degrade detection accuracy for special input conditions.

[0051] To prevent de-emphasis of learning the special tokens due to their relative sparsity compared to the common tokens, a second stage of the training process 400 then re-trains or fine-tunes the sequence transduction model 200 in a second training pass using the same training data 405. For each training sample 410 during the second training pass, the training process 400 performs a beam search to identify the N-best predicted output sequences (i.e., the N-best output token sequence hypotheses) based on the input spoken training utterance 415. The second stage of the training process 400 performs the re-training using a loss function that is a weighted sum of a token-level loss function and a conventional loss function. The conventional loss function may be the same as that used during the initial training in the first training pass (e.g., a negative log likelihood loss). In some implementations, the token-level loss function includes a minimum additive error rate, such as MWER or word-level EMBR, and differences associated with special tokens are given/assigned greater loss values than differences associated with common tokens. Thus, during the second training pass, the training process 400 prioritizes differences associated with special tokens over differences associated with common tokens. In this way, the training process 400 prioritizes learning associated with predicting special tokens, which compensates for the sparsity of special input conditions in the training data 405.
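A minimal sketch of this two-stage recipe follows, assuming helper callables for the losses, the beam search, and the parameter update are supplied by the caller; the function names, signatures, and the scalar weight are illustrative assumptions rather than the specific training implementation.

```python
def two_stage_training(model, training_data, apply_update, nll_loss,
                       token_level_loss, beam_search, n_best=4, nll_weight=0.03):
    """Illustrative two-stage training loop; all helpers are injected assumptions."""
    # Stage 1: initialize the model with a conventional negative log likelihood loss
    # computed over all tokens (common and special).
    for features, ground_truth in training_data:
        apply_update(model, nll_loss(model, features, ground_truth))

    # Stage 2: fine-tune on the same data with a weighted sum of a token-level loss
    # over the N-best hypotheses and the conventional loss.
    for features, ground_truth in training_data:
        hypotheses = beam_search(model, features, n_best=n_best)
        combined = (token_level_loss(hypotheses, ground_truth)
                    + nll_weight * nll_loss(model, features, ground_truth))
        apply_update(model, combined)
```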

[0052] FIG. 4 is a schematic view of an example two-pass training process 400 for augmenting the training of a sequence transduction model 200 using token-level losses. The sequence transduction model 200 may include the RNN-T model 200 of FIG. 2 that includes the encoder 210 and a decoder 430, wherein the decoder 430 collectively includes the prediction network 300 and the joint network 220. The training process 400 may execute on the cloud-computing environment 140 (i.e., on the computing resources 144) and/or on the user device 110 (i.e., on the data processing hardware 112).

[0053] For each training sample 410 in the set 405 of training samples 410, the first stage of the training process 400 processes, using the RNN-T model 200, the corresponding spoken training utterance 415 to determine the probability 222 that the most likely speech recognition hypothesis for the training utterance 415 is correct (i.e., the likelihood 222 that the output y 232 equals the corresponding ground truth 420). Thereafter, a log likelihood loss function module 440 determines a negative log likelihood loss term 442 based on the probability 222. The log likelihood loss term 442 may be expressed as shown in EQN (2). The first stage of the two-stage training process 400 applies updates 444 to the speech recognition model 200 based on the negative log likelihood loss term 442 for each training sample 410 to initialize the speech recognition model 200 in the first training pass.
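For example, a minimal sketch of the stage-one loss term, assuming the probability 222 is available as a scalar; the function name is illustrative.

```python
import math

def negative_log_likelihood_loss(prob_of_ground_truth: float) -> float:
    """Stage-one loss term: the negative log of the model's probability that the
    most likely hypothesis equals the ground-truth token sequence."""
    return -math.log(prob_of_ground_truth)

loss_nll = negative_log_likelihood_loss(0.01)  # roughly 4.6 for a 1% probability
```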

[0054] For each training sample 410 in the set of training samples 410, during a re-training of the RNN-T model 200 in a second training pass, the second stage of the training process 400 processes, using the RNN-T model 200, the corresponding spoken training utterance 415 to determine the probability 222 that the most likely speech recognition hypothesis for the training utterance 415 is correct (described above), and to obtain one or more speech recognition hypotheses 432, 432a-n (i.e., output token sequence hypotheses 432) for the training utterance 415. Thereafter, for each training sample 410 and each speech recognition hypothesis 432 output by the RNN-T model 200 for the corresponding training utterance 415, a sequence alignment module 450 determines a respective token-level cost for aligning the speech recognition hypothesis 432 and the corresponding ground-truth transcription 420 for the training utterance 415. In some implementations, a token-level cost for aligning the sequences is determined using a customized Levenshtein distance. Here, given the j-th speech recognition hypothesis 432 of the N-best speech recognition hypotheses 432 obtained for a given training sample 410 (where M denotes the number of training samples 410), the probability 222 associated with that hypothesis 432, and the ground-truth transcription 420 for the training sample 410 (i.e., a sequence of ground-truth output tokens), an example customized Levenshtein distance between the hypothesis 432 and the ground-truth transcription 420 can be determined using the cost expressions of EQNs (3)-(5), wherein A represents the tokens of the hypothesis 432, B represents the tokens of the ground-truth transcription 420, and <st> represents a special token. Notably, the cost expressions do not permit substitutions between special tokens and common tokens. Notably, K is greater than one (e.g., slightly greater than one, such as 1.1) such that special token insertions and deletions have more influence than common word errors during re-training. Here, the customized Levenshtein distance only permits substitutions between common tokens, and special tokens can only be correct, be deleted, or be inserted. The sequence alignment module 450 uses the costs of EQNs (3)-(5) to determine an optimal alignment of each hypothesis 432 with the corresponding ground-truth transcription 420 for each training sample 410 that minimizes the customized Levenshtein distance.
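The following sketch shows one way such a customized Levenshtein distance could be computed, in which special tokens can only match, be inserted, or be deleted, and special token insertions and deletions are weighted by K. The exact cost values and the function name are assumptions consistent with the description above and do not reproduce EQNs (3)-(5) verbatim.

```python
def special_token_edit_distance(hyp, ref, special_tokens, k=1.1):
    """Customized Levenshtein distance: special tokens can only be correct, inserted,
    or deleted (no substitutions with common tokens), and their insertions/deletions
    are weighted by k > 1 so they influence the alignment more than common errors."""
    inf = float("inf")
    n, m = len(hyp), len(ref)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                                 # hypothesis tokens with no match: insertions
        d[i][0] = d[i - 1][0] + (k if hyp[i - 1] in special_tokens else 1.0)
    for j in range(1, m + 1):                                 # reference tokens with no match: deletions
        d[0][j] = d[0][j - 1] + (k if ref[j - 1] in special_tokens else 1.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = hyp[i - 1], ref[j - 1]
            ins = d[i - 1][j] + (k if a in special_tokens else 1.0)
            dele = d[i][j - 1] + (k if b in special_tokens else 1.0)
            if a == b:
                sub = d[i - 1][j - 1]                         # correct token, no cost
            elif a in special_tokens or b in special_tokens:
                sub = inf                                     # no special/common substitutions
            else:
                sub = d[i - 1][j - 1] + 1.0                   # common-token substitution
            d[i][j] = min(ins, dele, sub)
    return d[n][m]

# Example: the hypothesis misses the hotword token, so its deletion costs K (1.1).
cost = special_token_edit_distance(
    ["hey", "assistant", "what"], ["hey", "assistant", "<hw>", "what"], {"<hw>", "<st>"})
```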

[0055] The second stage of the training process 400 continues with a token-level loss function module 460 that determines, for each speech recognition hypothesis 432, a token-level cost based on a minimum additive error rate, such as an MWER or a word-level EMBR. That is, the second stage of the training process 400 determines, for each speech recognition hypothesis 432, the number of special token insertions, the number of special token deletions, and the number of common token changes needed to match each aligned speech recognition hypothesis 432 with the ground-truth transcription 420. For each speech recognition hypothesis 432, the token-level loss function module 460 then determines a corresponding token-level loss 462, 462a-n. The token-level loss 462 represents an additive error rate, such as an MWER or a word-level EMBR. In some examples, such as for speaker change detection, the token-level loss 462 is given by EQN (6), where the values of the parameters weighting the special token insertions, the special token deletions, and the common token changes control the relative contribution of each subcomponent, and the loss is normalized by the total number of output tokens in the ground-truth sequence 420. In some implementations, the values of the parameters weighting the special token insertions and deletions are set to be much larger than the value of the parameter weighting the common token changes to force a reduction in special token insertion and deletion rates. In alternative examples, such as for hotword or warm word detection, the token-level loss 462 is given by EQN (7).

In some implementations, EQN (6) is used first for speaker change detection, and then the token-level loss 462 of EQN (7) is used for special keyword detection (e.g., hotword or warm word detection). In other implementations, the token-level loss 462 is determined using a combination of EQN (6) and EQN (7).
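By way of example, a per-hypothesis token-level loss in the spirit of EQN (6) could weight the counted special token insertions, special token deletions, and common token changes and normalize by the ground-truth length; the parameter names, default weights, and normalization below are assumptions rather than the patent's exact formulation.

```python
def token_level_loss(n_special_insertions, n_special_deletions, n_common_changes,
                     n_ground_truth_tokens, w_ins=10.0, w_del=10.0, w_common=1.0):
    """Weighted additive error rate: special-token insertion/deletion weights are set
    much larger than the common-token weight to emphasize special-token accuracy."""
    weighted = (w_ins * n_special_insertions
                + w_del * n_special_deletions
                + w_common * n_common_changes)
    return weighted / max(n_ground_truth_tokens, 1)   # normalized per ground-truth token

# Example: one spurious speaker-change token and two common-word errors over a
# ten-token ground-truth sequence: (10*1 + 1*2) / 10 = 1.2.
loss = token_level_loss(1, 0, 2, 10)
```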

[0056] Thereafter, a loss combining module 470 of the second stage of the training process 400 determines, for the set of training data 405, an overall token-level loss and a combined loss 472, where the value of a weighting parameter controls the relative contributions of the overall token-level loss and the negative log likelihood loss term 442 determined by the log likelihood loss function module 440 (i.e., see EQN (2)). The second stage of the two-stage training process 400 applies updates 474 to the speech recognition model 200 based on the combined loss 472 for each training sample 410 to re-train the speech recognition model 200 in the second training pass.
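A minimal sketch of combining the overall token-level loss with the negative log likelihood term, assuming a single scalar weight governs their relative contributions; which term carries the weight and its default value are assumptions.

```python
def combined_loss(per_sample_token_losses, per_sample_nll_losses, nll_weight=0.03):
    """Weighted sum of the averaged token-level losses and the averaged negative
    log likelihood losses over a batch of training samples."""
    token_term = sum(per_sample_token_losses) / len(per_sample_token_losses)
    nll_term = sum(per_sample_nll_losses) / len(per_sample_nll_losses)
    return token_term + nll_weight * nll_term

# Example over a small batch of training samples.
total = combined_loss([0.4, 1.2, 0.0], [2.1, 3.0, 1.7])
```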

[0057] FIG. 5 is a flowchart of an exemplary arrangement of operations for a computer-implemented method 500 for augmenting training of sequence transduction models using token-level losses. At operation 502, the method 500 includes receiving a plurality of training samples 410 each including a corresponding sequence of training input features 415 paired with a corresponding sequence of ground-truth output tokens 420, the sequence of ground-truth output tokens 420 including a set of ground-truth common tokens and a set of ground-truth special tokens.

[0058] For each training sample 410 in the plurality of training samples 410, the method 500 includes at operation 504 processing, using a sequence transduction model 200, the corresponding sequence of training input features 415 to obtain one or more output token sequence hypotheses 432, each output token sequence hypothesis 432 including one or more predicted common tokens 204.

[0059] For each training sample 410 in the plurality of training samples 410, the method 500 also includes, at operation 506, determining a per sample token-level loss 462 based on, for each corresponding output token sequence hypothesis 432 obtained for the training sample 410: a number of special token insertions each associated with a corresponding predicted special token that appears in the corresponding output token sequence hypothesis 432 but does not appear in the corresponding sequence of ground-truth output tokens 420; and a number of special token deletions each associated with a corresponding ground-truth special token in the set of ground-truth special tokens 420 that does not appear in the corresponding output token sequence hypothesis 432.

[0060] At operation 508, the method 500 includes training the sequence transduction model 200 to minimize additive error rate based on the per sample token-level losses 462 determined for the plurality of training samples 410.

[0061] FIG. 6 is a schematic view of an example computing device 600 that may be used to implement the systems and methods described in this document. The computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

[0062] The computing device 600 includes a processor 610 (i.e., data processing hardware) that can be used to implement the data processing hardware 112 and/or 144, memory 620 (i.e., memory hardware) that can be used to implement the memory hardware 114 and/or 146, a storage device 630 (i.e., memory hardware) that can be used to implement the memory hardware 114 and/or 146, a high-speed interface/controller 640 connecting to the memory 620 and high-speed expansion ports 650, and a low-speed interface/controller 660 connecting to a low-speed bus 670 and the storage device 630.

Each of the components 610, 620, 630, 640, 650, and 660 is interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 610 can process instructions for execution within the computing device 600, including instructions stored in the memory 620 or on the storage device 630 to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display 680 coupled to the high-speed interface 640. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

[0063] The memory 620 stores information non-transitorily within the computing device 600. The memory 620 may be a computer-readable medium, a volatile memory unit(s), or a non-volatile memory unit(s). The non-transitory memory 620 may be a physical device used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 600. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), and phase change memory (PCM), as well as disks or tapes.

[0064] The storage device 630 is capable of providing mass storage for the computing device 600. In some implementations, the storage device 630 is a computer-readable medium. In various different implementations, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 620, the storage device 630, or memory on the processor 610.

[0065] The high-speed controller 640 manages bandwidth-intensive operations for the computing device 600, while the low-speed controller 660 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 640 is coupled to the memory 620, the display 680 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 650, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 660 is coupled to the storage device 630 and a low-speed expansion port 690. The low-speed expansion port 690, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

[0066] The computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 600a or multiple times in a group of such servers 600a, as a laptop computer 600b, or as part of a rack server system 600c. [0067] Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

[0068] A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.

[0069] These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

[0070] The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[0071] To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s client device in response to requests received from the web browser.

[0072] Unless expressly stated to the contrary, the phrase “at least one of A, B, or C” is intended to refer to any combination or subset of A, B, C such as: (1) at least one A alone; (2) at least one B alone; (3) at least one C alone; (4) at least one A with at least one B; (5) at least one A with at least one C; (6) at least one B with at least one C; and (7) at least one A with at least one B and at least one C. Moreover, unless expressly stated to the contrary, the phrase “at least one of A, B, and C” is intended to refer to any combination or subset of A, B, C such as: (1) at least one A alone; (2) at least one B alone; (3) at least one C alone; (4) at least one A with at least one B; (5) at least one A with at least one C; (6) at least one B with at least one C; and (7) at least one A with at least one B and at least one C.

[0073] A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.