Title:
STREAMING LONG-FORM SPEECH RECOGNITION
Document Type and Number:
WIPO Patent Application WO/2024/082167
Kind Code:
A1
Abstract:
Systems and methods are provided for accessing a factorized neural transducer comprising a first set of layers for predicting blank tokens and a second set of layers for predicting vocabulary tokens. The first set of layers comprises a blank predictor, an encoder, and a joint network, and the second set of layers comprises a vocabulary predictor that is separate from the blank predictor. A context encoder, which encodes long-form transcription history to generate a long-form context embedding, is added to the factorized neural transducer, such that the factorized neural transducer is further configured to perform long-form automatic speech recognition, at least in part, by using the long-form context embedding to augment the prediction of vocabulary tokens.
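The abstract describes the architecture only at the component level. Below is a minimal pure-Python sketch of one output step under loudly stated assumptions. It uses toy dimensions and random weights. Mean-pooled token embeddings stand in for real recurrent or Transformer predictors. The context embedding is simply added to the vocabulary predictor's hidden state. That combination rule and every name here (`context_encoder`, `fnt_step`, the `W_*` matrices) are hypothetical and not taken from the patent. The vocabulary branch adds an encoder projection to the log-softmax output of the vocabulary predictor, roughly following the factorized transducer of Chen et al. listed under Other References.

```python
import math
import random

# Hypothetical toy sizes; real models use far larger dimensions.
D, V = 4, 5  # hidden size, vocabulary size (blank is handled separately)
rng = random.Random(0)

def mat(rows, cols):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def add(*vs):
    return [sum(xs) for xs in zip(*vs)]

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

# Hypothetical parameters for each component named in the abstract.
EMB = mat(V, D)                                   # token embeddings (assumed shared)
W_blank_pred = mat(D, D)                          # blank predictor
W_joint_enc, W_joint_pred = mat(D, D), mat(D, D)  # joint network inputs
w_blank_out = mat(1, D)                           # joint network -> blank logit
W_vocab_pred = mat(D, D)                          # vocabulary predictor (LM-like)
W_vocab_out = mat(V, D)                           # vocabulary predictor -> vocab logits
W_enc_vocab = mat(V, D)                           # encoder -> vocab logits
W_ctx = mat(D, D)                                 # context encoder projection

def embed_history(tokens):
    """Mean of token embeddings (a toy stand-in for a real predictor state)."""
    if not tokens:
        return [0.0] * D
    return [sum(EMB[t][i] for t in tokens) / len(tokens) for i in range(D)]

def context_encoder(long_form_history):
    """Toy context encoder: projects the mean embedding of earlier-segment tokens."""
    return [math.tanh(v) for v in matvec(W_ctx, embed_history(long_form_history))]

def fnt_step(enc_t, prefix, ctx_emb):
    """Log-probabilities over [blank] + vocabulary for one encoder frame."""
    h = embed_history(prefix)
    # Blank branch: blank predictor and encoder combined by the joint network.
    pb = [math.tanh(v) for v in matvec(W_blank_pred, h)]
    joint = [math.tanh(v) for v in add(matvec(W_joint_enc, enc_t), matvec(W_joint_pred, pb))]
    blank_logit = matvec(w_blank_out, joint)[0]
    # Vocabulary branch: separate predictor, augmented by the long-form context.
    pv = [math.tanh(v) for v in add(matvec(W_vocab_pred, h), ctx_emb)]
    lm_logprobs = log_softmax(matvec(W_vocab_out, pv))
    vocab_logits = add(matvec(W_enc_vocab, enc_t), lm_logprobs)
    return log_softmax([blank_logit] + vocab_logits)

# Usage: tokens 1, 3, 2 come from earlier segments; token 4 is the current prefix.
ctx = context_encoder([1, 3, 2])
out = fnt_step([0.1, -0.2, 0.3, 0.0], [4], ctx)
print(len(out))  # 6 entries: blank + 5 vocabulary tokens
```

In this sketch only the vocabulary branch sees `ctx_emb`, so changing the long-form history shifts the vocabulary scores while leaving the raw blank logit untouched; the final log-softmax over blank plus vocabulary entries still renormalizes all of them together.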

Inventors:
WU YU (US)
LI JINYU (US)
LIU SHUJIE (US)
GONG XUN (US)
Application Number:
PCT/CN2022/126111
Publication Date:
April 25, 2024
Filing Date:
October 19, 2022
Assignee:
MICROSOFT TECHNOLOGY LICENSING LLC (US)
WU YU (CN)
LI JINYU (US)
LIU SHUJIE (CN)
GONG XUN (CN)
International Classes:
G10L15/16
Foreign References:
US20220122586A1 2022-04-21
Other References:
CHEN XIE ET AL: "Factorized Neural Transducer for Efficient Language Model Adaptation", ICASSP 2022 - 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 23 May 2022 (2022-05-23), pages 8132 - 8136, XP034156754, DOI: 10.1109/ICASSP43922.2022.9746908
Attorney, Agent or Firm:
SHANGHAI PATENT & TRADEMARK LAW OFFICE, LLC (CN)