Sortase enzymes are cysteine transpeptidases that embellish the surface of Gram-positive bacteria with various proteins, thereby allowing these microorganisms to interact with their surrounding environment. Several of their substrates are known to have pathological implications, so researchers have focused on developing sortase inhibitors. Currently, six classes of sortases (A-F) are recognized. However, with the widespread use of bacterial genome sequencing, the number of potential sortases in public databases has exploded, which makes annotating these sequences considerably more challenging. Characterizing sortase classes experimentally is laborious and time-consuming.
Therefore, SortPred, the first machine-learning-based two-layer predictor, was developed: the first layer predicts whether a given sequence is a sortase, and the second layer predicts the class of each predicted sortase. SortPred is an effective tool for identifying bacterial sortases, which in turn may aid in designing sortase inhibitors and exploring their functions.
Positive dataset
The keyword "sortase" was used to search NCBI’s protein database to construct the positive samples. Only bacterial sequences 100 to 500 amino acids long were retained; all other sequences, including those containing non-standard amino acids (B, J, O, U, X, Z), were excluded. To annotate sortase sequences, position-specific scoring matrix (PSSM) searches were carried out against the pre-formatted “little_endian” Conserved Domain Database (CDD; downloaded November 2020) using standalone RPS-BLAST v2.10.0+ with an e-value threshold of 1e-5. For each input sequence, RPS-BLAST lists the conserved domain models that score above a certain cut-off, and reports the PSSMID of each conserved domain, its scores (e.g., e-value and bit score), and the actual alignment between the input sequence and the conserved domain. The RPS-BLAST output was further processed with another command-line utility, “rpsbproc”, available from the CDD website (https://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/). The rpsbproc utility converts the raw alignments into domain or site annotations on the input sequence and presents the annotation data as tab-delimited files. From the rpsbproc output, sequences assigned to one of the six sortase classes (Classes A, B, C, D, E and F) were selected. From these sortase sequences, a redundancy-reduced dataset was generated by applying CD-HIT v4.8.1 with a 40% sequence identity cut-off. Sequences annotated as sortases but not assigned to a particular class, as well as the small number of marine sortases (from proteobacteria) identified in the preceding steps, were excluded from the positive dataset. Redundancy reduction was also applied to these excluded sortase sequences so that they could be used for additional validation later.
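The initial length and alphabet filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the helper names `parse_fasta` and `passes_filters` are hypothetical, and treating the 100 to 500 range as inclusive is an assumption.

```python
# Hypothetical helpers illustrating the length / alphabet filter.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def passes_filters(seq, min_len=100, max_len=500):
    """Keep sequences of 100-500 residues (assumed inclusive) that contain
    only standard amino acids, so B, J, O, U, X and Z are rejected."""
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


# Toy input: one acceptable record, one too short, one with a non-standard residue.
example = ">ok\n" + "A" * 120 + "\n>short\nMKV\n>nonstd\n" + "A" * 119 + "X\n"
kept = [header for header, seq in parse_fasta(example) if passes_filters(seq)]
print(kept)
```

Only the record named `ok` survives; the other two fail the length and alphabet checks respectively. The domain annotation with RPS-BLAST and rpsbproc and the CD-HIT clustering are external tools and are not reproduced here.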
Negative dataset
The negative dataset was constructed as follows: (i) All reviewed bacterial sequences with a length of 100-500 amino acids were retrieved from the UniProt database, and sequences containing non-standard amino acids were discarded. (ii) RPS-BLAST and the rpsbproc utility (described above) were used to identify potential sortase sequences, which were excluded from the negative dataset. (iii) The negative dataset was further filtered by removing any sequence that showed greater than 30% sequence identity to a sequence from the positive dataset. As with the positive dataset, redundancy in the negative dataset was reduced with a CD-HIT cut-off of 40% sequence identity. A prediction model developed using a balanced dataset is generally more reliable and robust than one developed using an imbalanced dataset, because a model trained on imbalanced data tends to be biased toward the larger class. Therefore, we randomly selected negative samples equal in number to the positive samples. The combined positive and negative datasets were divided into training and independent validation sets using the createDataPartition function of the CARET (short for Classification And REgression Training) package available in R (https://www.r-project.org/). In layer 1, 1663 sortases and 1660 non-sortases were used to develop the model, and 412 sortases and 415 non-sortases were used for independent validation. For layer 2, classes A, B, C, D, E, and F contain 140, 462, 186, 242, 213, and 420 samples, respectively, for multi-class training; the corresponding independent validation sets contain 34, 115, 46, 59, 53, and 105 samples.
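The balancing and partitioning steps can be illustrated with a short Python sketch. It is not a port of caret's createDataPartition, only a simple stand-in for the same idea of a class-stratified split. The function names are hypothetical, and the 0.8 training fraction is an assumption inferred from the reported counts (roughly 80% training, 20% validation); the source does not state the fraction explicitly.

```python
import random


def undersample(negatives, n_positive, seed=0):
    """Randomly pick as many negatives as there are positives (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(negatives, n_positive)


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into train/validation sets, class by class,
    so that each class keeps roughly the same proportion in both sets.
    train_fraction=0.8 is an assumed value, not taken from the source."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, valid = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        valid.extend(indices[cut:])
    return sorted(train), sorted(valid)


# Toy data: 10 positives and 50 candidate negatives, balanced to 10 vs 10.
positives = [f"pos{i}" for i in range(10)]
negatives = undersample([f"neg{i}" for i in range(50)], len(positives))
labels = ["sortase"] * len(positives) + ["non-sortase"] * len(negatives)
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

With the toy data, each class contributes 8 samples to training and 2 to validation. The reported layer 2 counts are consistent with the layer 1 totals: the per-class training sizes sum to 1663 and the validation sizes to 412.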
Used Features
- Amino Acid Composition (AAC)
- Composition (C), Transition (T), and Distribution (D) (CTD)
- Conjoint Triad (CTriad)
- Dipeptide Composition (DPC)
- Quasi-Sequence-Order (QSO)

Reference: Malik, A.; Subramaniyam, S.; Kim, C.-B.; Manavalan, B. SortPred: The first machine learning based predictor to identify bacterial sortases and their classes using sequence-derived information. Computational and Structural Biotechnology Journal 2022, 20, 165-174, https://doi.org/10.1016/j.csbj.2021.12.014.
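
As an illustration of the two simplest descriptors in the feature list above, AAC and DPC can be sketched in Python using their conventional definitions: the fraction of each of the 20 amino acids, and the fraction of each of the 400 possible dipeptides. The function names are hypothetical and the source does not specify which software computed the features, so this is a sketch of the general technique rather than the authors' implementation. It assumes the input contains only standard amino acids and has at least two residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino acid composition: a 20-dimensional vector of residue frequencies."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: a 400-dimensional vector of frequencies of
    overlapping adjacent residue pairs (a sequence of length N has N-1 pairs)."""
    total = len(seq) - 1
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[seq[i:i + 2]] += 1
    return [counts[dp] / total for dp in DIPEPTIDES]


features = aac("ACAC") + dpc("ACAC")  # concatenated 420-dimensional vector
print(len(features))
```

Concatenating the two vectors gives a fixed-length numeric representation regardless of sequence length, which is what a downstream classifier needs. CTD, CTriad, and QSO follow the same pattern but rely on physicochemical groupings and distance matrices that are not reproduced here.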