Exploring Transducer-Based Architectures for Automatic Speech Recognition

FOSCHI, LEONARDO
2024/2025

Abstract

Automatic Speech Recognition (ASR) is a field of computer science that focuses on developing systems capable of converting spoken language into written text. In recent years, with the advent of deep learning techniques, ASR systems have achieved remarkable improvements in accuracy and robustness, making them increasingly suitable for real-world applications. This thesis explores the main techniques and architectures employed in modern ASR systems, with particular attention to deep learning–based approaches. We first analyze the neural networks most relevant to ASR, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. The typical ASR pipeline is then described, from audio preprocessing to output text decoding. A specific focus is dedicated to the Transducer architecture, an end-to-end model that combines acoustic and language modeling within a single neural network. We also examine a variant of Transducer models, the Conformer Transducer, in which the Conformer architecture is adopted as the encoder. Finally, using a public Italian dataset, experimental results are presented for an LSTM-Transducer model trained from scratch under different hyperparameter configurations, and for a pre-trained Conformer-Transducer model evaluated directly on the test set. The results show the importance of hyperparameter choices when training deep learning models and highlight the key differences between LSTM-Transducer and Conformer-Transducer architectures.
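As background for the pipeline mentioned in the abstract (from audio preprocessing to text decoding), the sketch below illustrates its first stage: extracting log-mel spectrogram features, the typical input to Transducer encoders. This is a minimal NumPy implementation under assumed parameter values (16 kHz sampling, 25 ms windows with 10 ms hop, 80 mel bands) that are common ASR defaults, not the configuration used in the thesis experiments.

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Compute log-mel features from a mono waveform (a common ASR front end)."""
    # Slice the waveform into overlapping frames and apply a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Build a triangular mel filterbank spanning 0 Hz to the Nyquist frequency
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Apply the filterbank and take the log (small epsilon avoids log(0))
    return np.log(power @ fb.T + 1e-10)  # shape: (n_frames, n_mels)

# Example: one second of a 440 Hz tone as dummy audio
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 80): ~100 feature frames per second, 80 mel bands
```

In a Transducer model, a matrix like `feats` would be fed to the encoder (LSTM or Conformer), while the prediction network consumes previously emitted tokens; the joint network combines both to score the next output symbol.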
2024
RNN-T
LSTM
Conformer
E2E ASR Training
Speech to text
Files in this item:
Foschi.Leonardo (1).pdf (open access)
Size: 2.78 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/4296