Gene-Level Modeling

Variants in the coding and regulatory regions of genes are likely to be the most relevant for disease association. Although many bioinformatics tools have been developed to predict the effect of missense mutations, their accuracy is still far from ideal, reaching only about 80% for the best-performing tools. Here we aim to improve the performance of existing computational methods when they are applied to specific genes, including the ACMG genes.

Incorrect protein folding and decreased stability are the major consequences of pathogenic missense mutations that lead to disease. Our initial approach is therefore to incorporate predictors of protein stability into machine learning algorithms and to assess their performance in combination with other major predictors of protein function. In a pilot study, we are training our algorithm on the CFTR gene, using a locus-specific database (CFTR2) and other publicly available data (1000G, ClinVar, etc.) to collect known pathogenic and non-pathogenic variants. Our collaboration with The Cystic Fibrosis Center at Stanford also provides access to phenotype-rich patient-level data, which we are using to further improve the performance of our algorithm.
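
As a rough illustration of the idea, the sketch below combines a stability feature with a functional-impact feature in a single gene-specific classifier. It is not our actual pipeline: the choice of logistic regression, the two features (a predicted stability change and a generic functional-impact score), the function names, and every number in the toy data set are assumptions made up for this example. Real training data would come from sources such as CFTR2 and ClinVar.

```python
import math

# Made-up toy variants, for illustration only (not real CFTR data).
# Each entry: ((predicted stability change, functional-impact score), label)
# Label 1 = pathogenic, 0 = non-pathogenic.
TOY_VARIANTS = [
    ((2.5, 0.9), 1),
    ((1.8, 0.7), 1),
    ((3.1, 0.8), 1),
    ((2.2, 0.6), 1),
    ((0.1, 0.2), 0),
    ((0.4, 0.1), 0),
    ((-0.2, 0.3), 0),
    ((0.3, 0.4), 0),
]


def sigmoid(z):
    """Logistic function, with the input clamped to avoid overflow."""
    z = max(-50.0, min(50.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, epochs=2000, learning_rate=0.1):
    """Fit a logistic regression by plain stochastic gradient descent."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict_pathogenic_probability(model, features):
    """Return the model's estimated probability that a variant is pathogenic."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


model = train_logistic(TOY_VARIANTS)
destabilizing = predict_pathogenic_probability(model, (2.4, 0.75))
neutral = predict_pathogenic_probability(model, (0.15, 0.25))
print(f"destabilizing variant: {destabilizing:.2f}, neutral variant: {neutral:.2f}")
```

In this toy setting, a variant with a large predicted stability change and a high functional-impact score receives a higher pathogenicity probability than one with values near zero. Whether adding stability features improves accuracy in practice depends on the gene and on the quality of the training variants, and would need to be assessed on held-out data.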