Language Models and Graphs: A Dual Approach to Understanding Fusion Oncoproteins
MELOTTI, VIRGINIA
2024/2025
Abstract
Fusion oncoproteins play a crucial role in cancer biology, yet their structural and functional characterization remains challenging. While methods for the identification and annotation of gene fusions are well established, approaches for deeper analysis of their protein products are still under development. In this thesis, I investigate the potential of Large Language Models (LLMs) to support the study of fusion oncoproteins, with a focus on both their sequences and three-dimensional structures. To this end, datasets were assembled from FusionPDB, UniProtKB, and collections of non-oncogenic proteins. Importantly, because fusion proteins are not directly annotated in standard repositories, their sequences had to be reconstructed by extracting genomic information (including gene coordinates, exon boundaries, and breakpoint positions) and subsequently translating these into protein products. LLM-based embeddings of protein sequences were used for classification tasks: wild-type versus fusion proteins (mean bootstrap accuracy: 89.37%, 95% CI: 88.62–90.01%) and oncogenic versus non-oncogenic proteins (mean bootstrap accuracy: 89.36%, 95% CI: 88.66–90.03%). In addition, protein 3D structures were converted into graphs to explore structural features; remarkably, the average node degree achieved a perfect separation between oncogenic and non-oncogenic proteins (100%). These results suggest that LLMs and graph-based representations provide promising tools for the study of fusion oncoproteins, highlighting their potential in advancing the structural and functional characterization of these critical biomolecules.

| File | Size | Format |
|---|---|---|
| Melotti.Virginia.pdf (embargoed until 15/10/2028) | 1.99 MB | Adobe PDF |
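
The abstract reports that the average node degree of structure-derived graphs separated oncogenic from non-oncogenic proteins. As a minimal illustration of that quantity only, the sketch below builds a residue contact graph from 3D coordinates and computes its average degree as 2|E|/|V|. The 8 Å distance cutoff, the use of one point per residue, and the function names `contact_graph_edges` and `average_node_degree` are assumptions made for this example; the record does not state how the thesis constructed its graphs.

```python
import math
from itertools import combinations


def contact_graph_edges(coords, cutoff=8.0):
    """Return undirected edges (i, j) for point pairs closer than `cutoff`.

    `cutoff` (in angstroms) is an assumed value for illustration,
    not a parameter taken from the thesis.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) < cutoff
    ]


def average_node_degree(n_nodes, edges):
    """Average degree of an undirected graph: 2 * |E| / |V|."""
    if n_nodes == 0:
        return 0.0
    return 2 * len(edges) / n_nodes


# Toy input: four pseudo-residues on a line, spaced 5 angstroms apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
edges = contact_graph_edges(coords)
avg_degree = average_node_degree(len(coords), edges)
print(edges, avg_degree)
```

With this toy input only consecutive points fall within the cutoff, so the graph has three edges over four nodes and an average degree of 1.5.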
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/3922