Language Models and Graphs: A Dual Approach to Understanding Fusion Oncoproteins

MELOTTI, VIRGINIA
2024/2025

Abstract

Fusion oncoproteins play a crucial role in cancer biology, yet their structural and functional characterization remains challenging. While methods for identifying and annotating gene fusions are well established, approaches for deeper analysis of their protein products are still under development. In this thesis, I investigate the potential of Large Language Models (LLMs) to support the study of fusion oncoproteins, focusing on both their sequences and their three-dimensional structures. To this end, datasets were assembled from FusionPDB, UniProtKB, and collections of non-oncogenic proteins. Because fusion proteins are not directly annotated in standard repositories, their sequences had to be reconstructed from genomic information (gene coordinates, exon boundaries, and breakpoint positions), which was then translated into protein products. LLM-based embeddings of protein sequences were used for two classification tasks: wild-type versus fusion proteins (mean bootstrap accuracy: 89.37%, 95% CI: 88.62–90.01%) and oncogenic versus non-oncogenic proteins (mean bootstrap accuracy: 89.36%, 95% CI: 88.66–90.03%). In addition, protein 3D structures were converted into graphs to explore structural features; remarkably, the average node degree separated oncogenic from non-oncogenic proteins perfectly (100%). These results suggest that LLMs and graph-based representations are promising tools for studying fusion oncoproteins and for advancing the structural and functional characterization of these critical biomolecules.
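
As a rough illustration of the graph feature mentioned in the abstract, the sketch below computes the average node degree of a residue contact graph. It is not the thesis's implementation: the abstract does not say how the graphs were built, so the distance-cutoff construction, the 8.0 Å cutoff, the use of one 3D point per residue, and the names `contact_graph`, `average_node_degree`, and `toy_coords` are all assumptions made for this example.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff=8.0):
    """Build an undirected contact graph as an adjacency dict.

    Assumption: each residue is one 3D point, and two residues are
    connected when their distance is at most `cutoff` (hypothetical
    value; the abstract does not state how edges were defined).
    """
    adjacency = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def average_node_degree(adjacency):
    """Mean number of neighbours per node (equals 2 * edges / nodes)."""
    if not adjacency:
        return 0.0
    return sum(len(neighbours) for neighbours in adjacency.values()) / len(adjacency)


# Toy input: four points on a line, 3.8 units apart (made-up coordinates).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
graph = contact_graph(toy_coords)
print(average_node_degree(graph))
```

In this toy case every pair except the two end points falls within the cutoff, giving five edges over four nodes. A real analysis would read coordinates from a structure file and would need to justify the chosen cutoff.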
2024
Cancer
Gene fusions
LLM
Oncoproteins
3D structure graphs
Files in this item:

File: Melotti.Virginia.pdf (embargoed until 15/10/2028)
Size: 1.99 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/3922