Exploring Transducer-Based Architectures for Automatic Speech Recognition
FOSCHI, LEONARDO
2024/2025
Abstract
Automatic Speech Recognition (ASR) is a field of computer science that focuses on developing systems capable of converting spoken language into written text. In recent years, with the advent of deep learning techniques, ASR systems have achieved remarkable improvements in accuracy and robustness, making them increasingly suitable for real-world applications. This thesis explores the main techniques and architectures employed in modern ASR systems, with particular attention to deep learning-based approaches. We first analyze the neural networks most relevant to ASR, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. The typical ASR pipeline is then described, from audio preprocessing to output text decoding. A specific focus is dedicated to the Transducer architecture, an end-to-end model that combines acoustic and language modeling within a single neural network. We also examine a variant of Transducer models, the Conformer Transducer, in which the Conformer architecture is adopted as the encoder. Finally, using a public Italian dataset, experimental results are presented for an LSTM-Transducer model trained from scratch under different hyperparameter configurations, and for a pre-trained Conformer-Transducer model evaluated directly on the test set. The results show the importance of hyperparameter configuration in training deep learning models and highlight the key differences between the LSTM-Transducer and Conformer-Transducer architectures.
File: Foschi.Leonardo (1).pdf (open access), 2.78 MB, Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/4296