Mutational signatures are groups of frequency patterns of single nucleotide variants that occur along the genome. Each group has been associated with a common etiology of cancer in specific tissues. These groups account for endogenous causes (internal cellular processes) or exogenous ones (exposure to radiation, smoking).
In this notebook, the goal is to check whether a deep learning model can reproduce the mutational signature classes predicted by probabilistic models such as SigProfiler. These methods take a long time to execute and generate results, and the runtime may grow with the number of samples and combinations of signatures in the experiment. A machine learning approach could accelerate the execution and also make the process scalable. It could also allow calibrating the model according to sample covariates, since some of the signatures are correlated with age, for example.
Mutational signature prediction can be derived from two main types of feature tables: dinucleotide (AC>CA) or single nucleotide (A[C>A]A) bases. These tables are mutational matrices, in which the rows correspond to the variants found in the VCF or MAF files of the samples, and each column accounts for an individual/sample. Here, I am using the Single Base Substitution (SBS) classification, which generates 96 combinations of base variation. All the possible classes that these features may take according to the type of cancer and tissue are found in COSMIC, but usually there are up to 10 main signatures in an experiment.
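For reference, the 96 SBS classes come from the six pyrimidine-based substitution types combined with the four possible bases flanking the mutated site on each side (6 × 4 × 4 = 96). A minimal sketch that enumerates them (the label format follows COSMIC, e.g. A[C>A]A; the iteration order here is illustrative, not necessarily the order used in published matrices):

```python
# Enumerate the 96 SBS trinucleotide contexts used as features
substitutions = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]  # pyrimidine-based
bases = ["A", "C", "G", "T"]
contexts = [f"{left}[{sub}]{right}"
            for sub in substitutions
            for left in bases
            for right in bases]
print(len(contexts))   # 96
print(contexts[0])     # A[C>A]A
```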
In this notebook, I used a ready-to-use matrix file provided in the SigProfiler repository. In case you need to generate one from variant calling results, you may use VCF or MAF files. There are open VCF and MAF files in TCGA for projects covering diverse types of cancer. The code below is an example of matrix generation from MAF files downloaded for the cases in the TCGA-OV project using the TCGA API.
# In case you need to install another genome build version:
from SigProfilerMatrixGenerator import install as genInstall
genInstall.install('GRCh38', rsync=False, bash=True)
# To generate the matrix
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
project="TCGA-OV"
buildGen="GRCh38"
folderMaf="mafs_tcga-ov"
matrices = matGen.SigProfilerMatrixGeneratorFunc(project, buildGen, folderMaf, exome=False, bed_file=None, chrom_based=False, plot=False, tsb_stat=False, seqInfo=False)
In order to obtain the control labels to train our model, I used SigProfiler, which is part of the tools package recommended by COSMIC. Their tools package contains a version for Python and one for R; since the preprocessing of the matrix was done in Python, I used the Python version. There are other mutational signature extraction methods as well, mainly in R (MutationalPatterns, MutSignatures) but also in Python (Mix-MMM). To create the labels with this method, I used the mutational matrix in the original format given as an example in the SigProfiler repository. The method is very simple to use: it takes two commands (it requires entering the Python interpreter console mode, just type python3 in a terminal):
from SigProfilerExtractor import sigpro as sig
sig.sigProfilerExtractor("matrix", "results_sample", "samples_matrix.tsv", reference_genome="GRCh37", minimum_signatures=1, maximum_signatures=10, nmf_replicates=100, cpu=-1)
These matrices, in the configuration described above, are used by the published profilers to obtain the most probable signatures for each sample; they may use De Novo extraction or the COSMIC precomputed frequencies for each SBS according to the specific genome build. Since in this machine learning experiment we are interested in attributing a class to each sample, the first step in processing the mutational matrix is to transpose it, so that each row becomes a sample with 96 numerical features as columns (see code below).
# Loading pandas library to deal with dataFrames and perform matrix operations
import pandas as pd
# Loading the mutational matrix
df=pd.read_csv('sample_matrix.tsv', sep='\t')
# Extracting the values as a 2D-array, removing the index and the column names
X=df.values
# Removing the variant names column since it will not be used
X=X[:, 1:]
# Casting to numeric dtype (the variant names column made the array dtype object)
X=X.astype(float)
# Transposing the data matrix (rows now are samples, and columns are the 96 SBS features)
X=X.T
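To make the reshaping concrete, here is a toy version of the same steps with a made-up 3-context, 2-sample matrix (the column and sample names are illustrative):

```python
import pandas as pd

# Toy mutational matrix: 3 SBS contexts (rows) x 2 samples (columns)
toy = pd.DataFrame({
    "MutationType": ["A[C>A]A", "A[C>A]C", "A[C>A]G"],
    "Sample1": [10, 0, 3],
    "Sample2": [2, 5, 1],
})
X = toy.values[:, 1:].astype(float)  # drop the context-name column, force numeric dtype
X = X.T                              # rows are now samples, columns are features
print(X.shape)  # (2, 3): 2 samples, 3 features
```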
# Loading pandas library to deal with dataFrames and perform matrix operations
import pandas as pd
# Loading the labels table
df=pd.read_csv('sample_cosmic_labels.tsv', sep='\t')
# Getting the signatures column
Y=df['best_signature']
# Getting the labels without repetition
cls=[]
for y in Y:
    if y not in cls:
        cls.append(y)
# loading the label encoder from scikit-learn
from sklearn.preprocessing import LabelEncoder
# inititalizing LabelEncoder
encoder = LabelEncoder()
# Fitting the labels
encoder.fit(Y)
# Transforming each category into integer values
encoded_Y = encoder.transform(Y)
# Loading train_test_split
from sklearn.model_selection import train_test_split
# Slice original data into train/test parts, following the rule 2/3 for train and 1/3 for test
X_train, X_test, y_train, y_test = train_test_split(X, encoded_Y, test_size=0.33, random_state=42)
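Since the classes in this dataset are imbalanced (see the discussion further below), it may be worth passing stratify to train_test_split so that class proportions stay roughly equal across both splits; a sketch with synthetic data mimicking the 12/7/2 imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: 12 / 6 / 2 samples per class
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 6 + [2] * 2)

# stratify=y keeps the class proportions roughly equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
print(len(y_tr), len(y_te))  # 15 5
```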
# Representing data as dictionary
sample_data={'x_train': X_train.tolist(), 'y_train': y_train.tolist(), 'x_test': X_test.tolist(), 'y_test': y_test.tolist(), 'classNames': cls }
# Loading json library
import json
# Saving the data into json
with open('sample_model/data.json', 'w') as f:
json.dump(sample_data, f)
I developed a JavaScript module (MutaSig) that handles the data loading, the model creation and compilation, the training, and the rendering of the evaluation plots. The full script is here. I will only include the snippets that call it, to keep the text clean. The first step is loading the data from the JSON we created before; the snippet fills the x_train, x_test, y_train and y_test attributes of the module object, transforming the data into tensors, and also fills classNames, which will be used later to label the confusion matrix:
// initialization of the mutasig object
mutasig = { classNames: [], x_train: [], x_test: [], y_train: [], y_test: [], model: {}, epochLogs: [] }
// Loading the data
await mutasig.loadData('http://127.0.0.1/portfolio-data/mutation_signature/sample_model/data.json')
The sample data used in this experiment had several dataset issues for a model: it contains only 21 samples, so 14 of them were used for training and 7 for testing. The second issue is that the classes were not balanced: two samples were classified as SBS13, seven samples as SBS3, and SBS40 was assigned to 12 samples. These two issues are severe in any ML application context. But even under these conditions, some rounds of training and prediction returned an accuracy of 80%. Experimenting with all the samples of real TCGA projects may help to better calibrate and enhance this model.
References:
Mix-MMM - https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00988-7
MutationalPatterns - https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-018-0539-0
MutSignatures - https://www.nature.com/articles/s41598-020-75062-0