Semi-supervised learning assumptions for gene ontology terms prediction outperform traditional supervised methods

The B2SLab, in collaboration with the Instituto Tecnológico Metropolitano (Medellín, Colombia), presented an analysis of how well the assumptions behind semi-supervised learning apply to the specific task of Gene Ontology (GO) term prediction [PDF]. The work provides criteria for choosing the most suitable tools for specific GO terms. It was published in June 2016 in the Scimago journal of the Universidad de Antioquia.

The Gene Ontology project aims to provide a unified framework for the biological annotation of genes and proteins across all species. It is one of the most important resources in bioinformatics. To cover the whole universe of protein functions, the GO project constructs controlled and structured vocabularies known as ontologies, which are applied in the annotation of gene products in biological databases [1]. GO comprises three ontologies: molecular function (biochemical activities at the molecular level), cellular component (the specific sub-cellular location where a gene product is active) and biological process (events at the phenotypic level to which the protein contributes). Recent methods for predicting GO terms employ machine learning techniques trained on physical-chemical and statistical attributes to predict functional labels that can later be subjected to experimental verification [2].

Predicting GO terms is an essential task in bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods offer a powerful solution: they exploit the information contained in unlabelled data to improve on the estimates of traditional supervised approaches. However, semi-supervised learning methods have to make strong assumptions about the nature of the training data, and the performance of the predictor therefore depends heavily on whether these assumptions hold.

In this paper, the suitability of semi-supervised methods for predicting protein functions in Embryophyta plants was analysed. The paper reviews the state of the art in semi-supervised classifiers, highlighting the different assumptions that each method makes about the underlying distribution of the data. Two semi-supervised methods were chosen for the tests, each representing one of the main semi-supervised assumptions: the cluster assumption and the manifold assumption.
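To make the two assumptions concrete, here is a minimal sketch in plain Python. It is not the pair of methods evaluated in the paper, which this post does not name. As stand-ins, the cluster assumption is illustrated with a seeded k-means ("cluster, then label") and the manifold assumption with label propagation over a k-nearest-neighbour graph. The function names, the toy 2-D points and the parameter values are all assumptions made for illustration.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(v) / len(members) for v in zip(*members))


def cluster_then_label(points, labelled, n_iter=20):
    """Cluster assumption: points in the same cluster share a label.

    Seeded k-means with one cluster per class; every point takes the
    class of the cluster it ends up in.
    """
    classes = sorted(set(labelled.values()))
    # Seed each centroid with the mean of that class's labelled points.
    centroids = {
        c: mean([points[i] for i, y in labelled.items() if y == c])
        for c in classes
    }
    assign = []
    for _ in range(n_iter):
        assign = [min(classes, key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in classes:
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    for i, y in labelled.items():  # known labels always win
        assign[i] = y
    return assign


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as a list of neighbour sets."""
    n = len(points)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist2(points[i], points[j]))
        for j in order[:k]:
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours


def label_propagation(points, labelled, k=2, n_iter=200):
    """Manifold assumption: labels vary smoothly along the data graph.

    Each unlabelled point repeatedly averages the class scores of its
    graph neighbours; labelled points stay clamped to their class.
    """
    classes = sorted(set(labelled.values()))
    neighbours = knn_graph(points, k)
    scores = [{c: 0.0 for c in classes} for _ in points]
    for i, y in labelled.items():
        scores[i][y] = 1.0
    for _ in range(n_iter):
        scores = [
            scores[i] if i in labelled else {
                c: sum(scores[j][c] for j in neighbours[i]) / len(neighbours[i])
                for c in classes
            }
            for i in range(len(points))
        ]
    return [max(classes, key=lambda c: scores[i][c])
            for i in range(len(points))]


# Toy data (made up): two well-separated groups, one labelled point each.
points = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (0.7, 0.7),
          (3.0, 3.0), (3.4, 3.2), (3.1, 3.7), (3.8, 3.6)]
labelled = {0: 0, 4: 1}  # index of the point -> class

cluster_pred = cluster_then_label(points, labelled)
graph_pred = label_propagation(points, labelled)
print(cluster_pred)
print(graph_pred)
```

On this easy toy set both sketches recover the two groups from a single labelled point per class. They can disagree on data whose classes are not compact clusters, which is why the choice of assumption matters.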

The results show that semi-supervised learning applied to the prediction of GO terms in Embryophyta organisms significantly outperforms the supervised learning approach, and in most cases also outperforms the commonly used sequence alignment strategy. In general, the highest performance was reached when applying the cluster assumption. However, several GO terms that were not significantly improved under the cluster assumption achieved higher performance with the manifold-based semi-supervised method. This shows that a single assumption is not enough to improve the learning process across all terms by exploiting additional unlabelled data. As future work, it would be desirable to implement a unified strategy that exploits both assumptions at the same time, in order to achieve high performance in most applications. In addition, classifiers devoted to hierarchical classification, such as decision trees, could be used to improve classification performance.

References:

[1] M. Harris et al., “The gene ontology (GO) database and informatics resource”, Nucleic Acids Res., vol. 32, pp. 258-261, 2004.

[2] J. Jaramillo, J. Gallardo, C. Castellanos and A. Perera, “Predictability of gene ontology slim-terms from primary structure information in Embryophyta plant proteins”, BMC Bioinformatics, vol. 14, no. 68, pp. 1-11, 2013.