Dr. Alioune Ngom, School of Computer Science, University of Windsor
Detecting driver genes and network biomarkers of breast cancer subtypes
Co-Investigators and Collaborators:
EVIDENCE OF PROGRESS
Task 1 – Data collection and preprocessing: We have obtained the full METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) data set, which covers 10 subtypes and comprises single-nucleotide polymorphism (SNP), copy-number variation (CNV), copy-number aberration (CNA), and gene expression profiles obtained from 2000 primary breast tumors with long-term clinical outcome. An additional full gene expression data set, covering 5 breast cancer subtypes and 158 samples, has also been downloaded. Human protein-protein interaction (PPI) network data and cellular pathway data have been collected from the HPRD and KEGG databases and other repositories. All of these data have been preprocessed so that the driver genes and the diagnostic biomarkers of the breast cancer subtypes can be obtained from them.
Task 2 – Data Integration: The preprocessed breast cancer data above have been integrated with the human PPI network so that the driver genes and the diagnostic biomarkers of the breast cancer subtypes can be found. We first devised a set of algorithms that identify, from the gene expression data and the variation data, the candidate driver genes, that is, the genes that show both significant alteration and significant differential expression in each subtype. The final step of the integration process was to map all the genes of the breast cancer data onto the human PPI network; this will allow us to determine the functional relationships between the genes during the biomarker identification phase.
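The selection and mapping steps described in Task 2 can be illustrated with a minimal Python sketch. It is not the project's actual set of algorithms: it assumes that each gene already has one p-value for alteration and one for differential expression in a given subtype, that a single hypothetical threshold `alpha` decides significance, and that the PPI network is a plain list of gene pairs. The function names, the gene labels and the numeric values are all placeholders chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def candidate_drivers(alteration_p: Dict[str, float],
                      expression_p: Dict[str, float],
                      alpha: float = 0.05) -> Set[str]:
    """Return genes significant for BOTH alteration and differential expression.

    `alpha` is an assumed significance threshold, not a value from the report.
    """
    return {
        gene
        for gene, p_alt in alteration_p.items()
        # A gene without an expression test result is treated as not significant.
        if p_alt < alpha and expression_p.get(gene, 1.0) < alpha
    }


def map_to_ppi(genes: Set[str],
               ppi_edges: List[Edge]) -> Tuple[List[Edge], Set[str]]:
    """Map a gene set onto a PPI edge list.

    Returns the edges whose two endpoints are both in `genes`, and the
    genes that do not appear anywhere in the network.
    """
    nodes = {gene for edge in ppi_edges for gene in edge}
    induced = [(a, b) for a, b in ppi_edges if a in genes and b in genes]
    unmapped = genes - nodes
    return induced, unmapped


# Toy inputs for one subtype (placeholder labels and made-up p-values).
alteration_p = {"GENE_A": 0.001, "GENE_B": 0.20, "GENE_C": 0.01, "GENE_D": 0.03}
expression_p = {"GENE_A": 0.004, "GENE_B": 0.001, "GENE_C": 0.02}
ppi_edges = [("GENE_A", "GENE_C"), ("GENE_A", "GENE_B"), ("GENE_C", "GENE_E")]

drivers = candidate_drivers(alteration_p, expression_p)
subnetwork, unmapped = map_to_ppi(drivers, ppi_edges)
```

In this toy example only GENE_A and GENE_C pass both tests, and the single interaction between them is what remains of the network. A real analysis would also need multiple-testing correction and a mapping between gene identifiers and protein identifiers, both of which this sketch leaves out.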
Task 3 – Identification of the Diagnostic Network-Biomarkers (NBs) of Breast Cancer Subtypes: We have proposed two prediction methods that find the informative diagnostic biomarkers of the breast cancer subtypes. We have obtained at least 9 sets of highly predictive diagnostic biomarkers, with accuracies ranging from 90% to 99%. Although their results are excellent, the two methods yield different results on the same data as well as on different data. We have identified possible solutions to this lack of reproducibility, and each is being implemented and tested. Task 3 is the most important (and also the most time-consuming and difficult) phase of this Seeds4Hope research, as it requires studying, implementing and testing different prediction models on the large METABRIC data set.
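One simple way to put a number on the disagreement between the two methods is to compare the biomarker sets they return. The sketch below assumes that each method outputs a set of gene identifiers and uses the Jaccard index as the overlap measure. Both are assumptions made for illustration, not a description of the solutions under test in the project, and the gene labels are placeholders.

```python
from typing import Iterable


def jaccard(set_a: Iterable[str], set_b: Iterable[str]) -> float:
    """Jaccard index of two biomarker sets: |A and B| / |A or B|.

    Returns 1.0 when both sets are empty, since they are then identical.
    """
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder biomarker sets for one subtype, one per prediction method.
method_1 = {"GENE_A", "GENE_B", "GENE_C"}
method_2 = {"GENE_B", "GENE_C", "GENE_D"}

overlap = jaccard(method_1, method_2)  # 2 shared genes out of 4 distinct genes
```

A value close to 1 means the two methods select nearly the same genes, while a value close to 0 means they reach their accuracy through largely different genes. Comparing protein-level neighbourhoods in the PPI network rather than exact gene identity would be a natural refinement, but it is beyond this sketch.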