Language Models and Graphs: A Dual Approach to Understanding Fusion Oncoproteins

MELOTTI, VIRGINIA

2024/2025

Abstract

Fusion oncoproteins play a crucial role in cancer biology, yet their structural and functional characterization remains challenging. While methods for the identification and annotation of gene fusions are well established, approaches for deeper analysis of their protein products are still under development. In this thesis, I investigate the potential of Large Language Models (LLMs) to support the study of fusion oncoproteins, with a focus on both their sequences and three-dimensional structures. To this end, datasets were assembled from FusionPDB, UniProtKB, and collections of non-oncogenic proteins. Importantly, because fusion proteins are not directly annotated in standard repositories, their sequences had to be reconstructed by extracting genomic information (including gene coordinates, exon boundaries, and breakpoint positions) and subsequently translating these into protein products. LLM-based embeddings of protein sequences were used for classification tasks: wild-type versus fusion proteins (mean bootstrap accuracy: 89.37%, 95% CI: 88.62–90.01%) and oncogenic versus non-oncogenic proteins (mean bootstrap accuracy: 89.36%, 95% CI: 88.66–90.03%). In addition, protein 3D structures were converted into graphs to explore structural features; remarkably, the average node degree achieved a perfect separation between oncogenic and non-oncogenic proteins (100%). These results suggest that LLMs and graph-based representations provide promising tools for the study of fusion oncoproteins, highlighting their potential in advancing the structural and functional characterization of these critical biomolecules.
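The two quantitative measures mentioned above, the average node degree of a structure graph and the bootstrap accuracy with its 95% interval, can be illustrated with a minimal sketch. The abstract does not state how the graphs were built or how the bootstrap was run, so everything here is an assumption made for illustration: residues are taken as nodes, an edge links two residues whose coordinates lie within a hypothetical 8 Å cutoff, and the interval is a simple percentile bootstrap over resampled predictions. The function names `average_node_degree` and `bootstrap_accuracy` are hypothetical and are not taken from the thesis.

```python
import math
import random
from itertools import combinations


def average_node_degree(coords, cutoff=8.0):
    """Average degree of a residue contact graph.

    coords: list of (x, y, z) tuples, one per residue (assumed: C-alpha atoms).
    cutoff: assumed distance threshold in angstroms for drawing an edge.
    """
    n = len(coords)
    if n == 0:
        return 0.0
    edges = sum(
        1 for a, b in combinations(coords, 2) if math.dist(a, b) <= cutoff
    )
    # Each undirected edge adds one to the degree of two nodes.
    return 2.0 * edges / n


def bootstrap_accuracy(y_true, y_pred, n_boot=1000, seed=0):
    """Mean accuracy and 95% percentile interval over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample prediction indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    lower = scores[int(0.025 * n_boot)]
    upper = scores[int(0.975 * n_boot) - 1]
    return sum(scores) / n_boot, lower, upper
```

In this sketch a per-protein average degree would be computed once per structure and then compared between the oncogenic and non-oncogenic groups, while `bootstrap_accuracy` would be applied to the labels and predictions of a classifier trained on the sequence embeddings.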

Faculty/Department: Dipartimento di Ingegneria "Enzo Ferrari"
Degree programme: Artificial intelligence engineering
Academic year: 2024
Keywords: Cancer; Gene fusions; LLM; Oncoproteins; 3D structure graphs
Supervisor: LOVINO, MARTA
Appears in types: Lauree Magistrali (Master's degree theses)

Files in this item:

File	Size	Format
Melotti.Virginia.pdf (embargoed until 15/10/2028)	1.99 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/3922