Exploring Transducer-Based Architectures for Automatic Speech Recognition

FOSCHI, LEONARDO
2024/2025

Abstract

Automatic Speech Recognition (ASR) is a field of computer science that focuses on developing systems capable of converting spoken language into written text. In recent years, with the advent of deep learning techniques, ASR systems have achieved remarkable improvements in accuracy and robustness, making them increasingly suitable for real-world applications. This thesis explores the main techniques and architectures employed in modern ASR systems, with particular attention to deep learning–based approaches. We first analyze the neural networks most relevant to ASR, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. The typical ASR pipeline is then described, from audio preprocessing to output text decoding. A specific focus is dedicated to the Transducer architecture, an end-to-end model that combines acoustic and language modeling within a single neural network. We also examine a variant of Transducer models, the Conformer Transducer, in which the Conformer architecture is adopted as the encoder. Finally, using a public Italian dataset, experimental results are presented for an LSTM-Transducer model trained from scratch under different hyperparameter configurations, and for a pre-trained Conformer-Transducer model evaluated directly on the test set. The results show the importance of hyperparameter choices when training deep learning models and highlight the key differences between LSTM-Transducer and Conformer-Transducer architectures.
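As background for the pipeline mentioned in the abstract (from audio preprocessing to text decoding), the sketch below illustrates its first stage: extracting log-mel spectrogram features, the typical input to Transducer encoders. This is a minimal NumPy implementation under assumed parameter values (16 kHz sampling, 25 ms windows with 10 ms hop, 80 mel bands) that are common ASR defaults, not the configuration used in the thesis experiments.

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Compute log-mel features from a mono waveform (a common ASR front end)."""
    # Slice the waveform into overlapping frames and apply a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Build a triangular mel filterbank spanning 0 Hz to the Nyquist frequency
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Apply the filterbank and take the log (small epsilon avoids log(0))
    return np.log(power @ fb.T + 1e-10)  # shape: (n_frames, n_mels)

# Example: one second of a 440 Hz tone as dummy audio
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 80): ~100 feature frames per second, 80 mel bands
```

In a Transducer model, a matrix like `feats` would be fed to the encoder (LSTM or Conformer), while the prediction network consumes previously emitted tokens; the joint network combines both to score the next output symbol.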
2024
RNN-T
LSTM
Conformer
E2E ASR Training
Speech to text
Files in this item:
Foschi.Leonardo (1).pdf (open access)
Size: 2.78 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/4296