Background: Lysine crotonylation (Kcr) is a crucial protein post-translational modification found on histone and non-histone proteins. It plays a pivotal role in regulating diverse biological processes in both animals and plants, including gene transcription and replication, cell metabolism and differentiation, and photosynthesis. Despite the significance of Kcr, detecting Kcr sites through biological experiments is time-consuming and expensive, and only a fraction of crotonylated peptides can be identified. This reality highlights the need for efficient and rapid prediction of Kcr sites through computational methods. Several machine learning models currently exist for predicting Kcr sites in humans, yet models tailored for plants are rare, and no downloadable Kcr site predictors or datasets have been developed specifically for plants. To address this gap, it is imperative to integrate existing Kcr sites detected in plant experiments and establish a dedicated computational model for plants.

Results: Most plant Kcr sites are located on non-histones. In this study, we collected non-histone Kcr sites from five plants: wheat, tabacum, rice, peanut, and papaya. We then conducted a comprehensive analysis of the amino acid distribution surrounding these sites. To develop a predictive model for plant non-histone Kcr sites, we combined a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and an attention mechanism to build a deep learning model called PlantNh-Kcr. On both five-fold cross-validation and independent tests, PlantNh-Kcr outperformed multiple conventional machine learning models and other deep learning models. Furthermore, we analyzed species-specific effects on the PlantNh-Kcr model and found that a general model trained on data from multiple species outperforms species-specific models.
Conclusion: PlantNh-Kcr represents a valuable tool for predicting plant non-histone Kcr sites. We expect that this model will aid in addressing key challenges and tasks in the study of plant crotonylation sites.
Keywords: Crotonylation; Convolutional neural network; Bidirectional long short-term memory; Attention mechanism; Focal loss
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1186/s13007-024-01157-8.
Post-translational modifications (PTMs) […]
In plants, global identification and functional analysis of lysine crotonylation have been conducted in species such as tabacum […]
The current experimental methods for detecting Kcr sites include high-performance liquid chromatography fractionation, stable isotope labelling of amino acids in cell culture, immunological affinity enrichment, and high-resolution liquid chromatography-tandem mass spectrometry […]
Early models for Kcr site prediction were limited by small training datasets containing fewer than 200 Kcr sites, all of which were located on histones. These models employed conventional machine learning methods such as support vector machines […]
With the advancement of mass spectrometry technology, global Kcr sites in the proteomes of several species have been detected, enabling the use of significantly larger training datasets. At the same time, deep learning frameworks […]
Among the four models for predicting Kcr sites on non-histones, nhKcr, iKcr_CNN, and CapsNh-Kcr are limited to predicting human Kcr sites. In contrast, DeepKcrot predicts Kcr sites in four species: humans, rice, tabacum, and papaya. However, because DeepKcrot's web server is currently unavailable, accessing its datasets and models for further research is challenging. In addition, recent experimental studies have detected Kcr sites in several other plants, emphasizing the need for a computational model specifically tailored for plants. To address this gap, it is essential to integrate existing Kcr site data detected from plants and establish a specialized computational model dedicated to plants.
In this study, we compiled a comprehensive dataset of non-histone Kcr sites from five plant species, including rice, tabacum, papaya, peanut, and wheat, and built reliable training and test datasets. We then used binary encoding (BE) as input features and employed a combination of a convolutional neural network (CNN) […]
To train and test the model, we carefully curated a training dataset and a test dataset. This process was meticulously designed and is depicted in Fig. 1. We first collected non-histone Kcr sites from five plant species. These sites numbered 5692 from wheat […]
Fig. 1 The flowchart of dataset preparation
Table 1 The numbers of positive and negative samples for each species in the training and test datasets
Species | Training set (positive / negative) | Test set (positive / negative)
Wheat | 2484 / 7585 | 1066 / 3252
Tabacum | 820 / 2524 | 352 / 1082
Rice | 662 / 3486 | 284 / 1495
Peanut | 2452 / 10,675 | 1051 / 4575
Papaya | 2226 / 8200 | 955 / 3515
Total | 8644 / 32,470 | 3708 / 13,919
To build a predictive model for non-histone Kcr sites, it is necessary to transform peptide samples into numerical vectors as input features for the model. PlantNh-Kcr employs binary encoding as its input features. We conducted a comparative analysis of PlantNh-Kcr's predictive performance against conventional machine learning models and other deep learning models. The conventional machine learning models utilize various input features, including amino acid composition (AAC), enhanced group amino acid composition (EGAAC), BE, AAindex encoding, and BLOSUM62 encoding. Other deep learning models employ features including BE, word embedding (WE) encoding, AAindex encoding, and BLOSUM62 encoding. The following provides a detailed description of these encoding methods:
AAC: In bioinformatics, AAC is a commonly used encoding method that calculates the frequency of each amino acid in a peptide. In this study, X is also treated as an amino acid, so a peptide is encoded as a 21-dimensional vector in which each dimension corresponds to the frequency of one of the 21 amino acids.
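As a concrete illustration, AAC can be computed in a few lines. This is a minimal sketch; the residue ordering below is an assumption, not necessarily that of the original implementation:

```python
# 20 standard amino acids plus X for unknown/padding residues (21 symbols).
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"

def aac(peptide):
    """Return the 21-dimensional amino-acid-composition vector of a peptide."""
    return [peptide.count(a) / len(peptide) for a in AA_ALPHABET]
```

Because the components are frequencies over the same 21-letter alphabet, the vector always sums to 1.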
EGAAC: The EGAAC encoding divides amino acids into five groups based on their physicochemical properties: the aliphatic group (GAVLMI), aromatic group (FYW), positively charged group (KRH), negatively charged group (DE), and uncharged group (STCPNQ). A peptide is encoded as a five-dimensional vector, where each dimension represents the proportion of one of the five groups of amino acids within the peptide.
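Following the description above, the per-peptide group proportions can be sketched as below (note that some feature toolkits compute EGAAC over sliding windows instead; this sketch follows the whole-peptide description given here):

```python
# Five physicochemical groups as listed above.
GROUPS = [
    set("GAVLMI"),  # aliphatic
    set("FYW"),     # aromatic
    set("KRH"),     # positively charged
    set("DE"),      # negatively charged
    set("STCPNQ"),  # uncharged
]

def egaac(peptide):
    """Return the five-dimensional vector of group proportions for a peptide."""
    n = len(peptide)
    return [sum(aa in group for aa in peptide) / n for group in GROUPS]
```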
BE: BE is a common technique used to convert amino acid sequences into numerical representations suitable for model training. Each amino acid is encoded as a 21-dimensional binary vector with one component set to 1 to indicate the amino acid type and all other components set to 0. A peptide of length 29 is thus encoded as a 29 × 21 matrix, or equivalently a 609-dimensional vector.
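A minimal sketch of binary (one-hot) encoding; as above, the residue ordering is an assumption:

```python
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus X

def binary_encode(peptide):
    """Encode a peptide as a list of 21-dimensional one-hot vectors."""
    return [[1 if aa == a else 0 for a in AA_ALPHABET] for aa in peptide]
```

A 29-residue window around a candidate lysine therefore yields a 29 × 21 matrix (609 values in total).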
WE encoding: WE is a technique that has gained popularity in natural language processing. It maps words to vectors in a high-dimensional space, ensuring that semantically similar words are positioned close to each other. This technique has also been effectively applied to sequence encoding in bioinformatics […]
AAindex encoding: AAindex […] is a database of numerical indices representing physicochemical and biochemical properties of amino acids; in this encoding, each residue is represented by a vector of selected index values.
BLOSUM62 encoding: BLOSUM62 (BLOcks Substitution Matrix 62) […] is a substitution matrix widely used in protein sequence alignment; in this encoding, each residue is represented by its corresponding row of the matrix.
The structure of PlantNh-Kcr, shown in Fig. 2, was determined after evaluating various encoding methods and model architectures. The model accepts a 29 × 21 matrix derived from binary encoding as input. This matrix feeds into two parallel branches. The first is a convolutional layer followed by two additional convolutional layers. The second is a BiLSTM layer followed by a multi-head self-attention (MHSA) layer. The outputs of the third convolutional layer and the MHSA layer are concatenated and flattened into a vector. The flatten layer is followed by a linear layer and an output layer. All layers are described in detail below.
Fig. 2 The architecture of the PlantNh-Kcr model
Input layer: The layer receives the 29 × 21 matrix obtained by binary encoding of a 29-residue peptide.
Convolutional layers: The first convolutional layer has 21 input channels and 32 output channels, with a kernel size of 5 and a stride of 1. The second convolutional layer has 32 input channels and 32 output channels, and the third has 32 input channels and 29 output channels. Both of the latter two layers have a kernel size of 5 and a stride of 2. The output of each layer is activated using the ReLU function […]
BiLSTM layer and MHSA layer: The input size of the BiLSTM layer is 21, and the output size is 128. The MHSA layer has an input size of 128 and eight attention heads. To prevent overfitting, dropout operations with ratios of 0.9 and 0.5 are applied after the BiLSTM layer and MHSA layer, respectively.
Flatten layer: The flatten layer flattens the concatenated outputs of the third convolutional layer and MHSA layer, resulting in a 3944-dimensional vector.
Linear layer: The input size of the linear layer is 3944 and the output size is 128. The output is activated using the ReLU function.
Output layer: The output layer has an input size of 128 and output size of 2. The two-dimensional output vector represents the probabilities of a sample being positive and negative, respectively.
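The layer dimensions above can be assembled into a runnable PyTorch sketch. The convolution padding of 2 used here is an assumption inferred from the stated 3944-dimensional flatten size (29 × 8 from the CNN branch plus 29 × 128 from the attention branch); the original implementation may differ in such details:

```python
import torch
import torch.nn as nn

class PlantNhKcr(nn.Module):
    """Sketch of the PlantNh-Kcr architecture described in the text."""

    def __init__(self, seq_len=29, n_aa=21):
        super().__init__()
        # CNN branch: 21 -> 32 -> 32 -> 29 channels (padding=2 is an assumption).
        self.conv1 = nn.Conv1d(n_aa, 32, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv1d(32, 32, kernel_size=5, stride=2, padding=2)
        self.conv3 = nn.Conv1d(32, 29, kernel_size=5, stride=2, padding=2)
        # BiLSTM branch: input size 21, bidirectional hidden 64 -> output size 128.
        self.bilstm = nn.LSTM(n_aa, 64, batch_first=True, bidirectional=True)
        self.drop1 = nn.Dropout(0.9)
        self.attn = nn.MultiheadAttention(embed_dim=128, num_heads=8,
                                          batch_first=True)
        self.drop2 = nn.Dropout(0.5)
        self.relu = nn.ReLU()
        # Flatten: 29 * 8 (CNN) + 29 * 128 (attention) = 3944.
        self.fc = nn.Linear(3944, 128)
        self.out = nn.Linear(128, 2)

    def forward(self, x):            # x: (batch, 29, 21) one-hot matrix
        c = x.transpose(1, 2)        # (batch, 21, 29) for Conv1d
        c = self.relu(self.conv1(c))
        c = self.relu(self.conv2(c))
        c = self.relu(self.conv3(c))  # (batch, 29, 8)
        h, _ = self.bilstm(x)         # (batch, 29, 128)
        h = self.drop1(h)
        a, _ = self.attn(h, h, h)     # (batch, 29, 128)
        a = self.drop2(a)
        v = torch.cat([c.flatten(1), a.flatten(1)], dim=1)  # (batch, 3944)
        return self.out(self.relu(self.fc(v)))
```

With these (assumed) paddings, the sequence lengths along the CNN branch are 29 → 29 → 15 → 8, which reproduces the 3944-dimensional concatenated vector stated for the flatten layer.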
In this study, the training dataset has significantly more negative samples than positive samples, which would bias the model towards the negative samples during training. To address this issue, we employed focal loss […] as the loss function:
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)    (2)

p_t = p if y = 1; p_t = 1 - p otherwise    (3)

where p is the predicted probability of a sample being positive, y is its true label, α_t is a weighting factor that balances positive and negative samples, and γ is a focusing parameter that down-weights easy samples so that training concentrates on hard ones.
The PlantNh-Kcr model was constructed and trained in a Python 3.9 and PyTorch 1.13.1 environment. Focal loss […]
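For reference, the focal loss of a single prediction can be written directly from Eqs. (2)-(3). The default α and γ values below are the common ones from the original focal-loss paper, not necessarily those used to train PlantNh-Kcr:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one sample.

    p: predicted probability of the positive class; y: true label (0 or 1).
    alpha weights the rarer positive class; gamma down-weights easy samples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 the loss reduces to a class-weighted cross-entropy; increasing γ shrinks the contribution of well-classified samples.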
In bioinformatics, the evaluation of classification models often involves cross-validation and independent tests to assess their generalization capability. Similar to previous studies […], we adopted five-fold cross-validation and independent tests.
In bioinformatics, commonly used evaluation metrics for binary classification models include sensitivity (Sn), specificity (Sp), accuracy (ACC), F1-score, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic (ROC) curve (AUC) […]
Sn = TP / (TP + FN)    (4)

Sp = TN / (TN + FP)    (5)

ACC = (TP + TN) / (TP + TN + FP + FN)    (6)

F1-score = 2TP / (2TP + FP + FN)    (7)

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (8)
In the above equations, TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. Sn indicates the ability of the model to identify positive samples, and Sp the ability to identify negative samples; higher values indicate more accurate predictions for the respective class. The F1-score provides a comprehensive measure of the model's performance in identifying positive samples by balancing the counts of true positives, false positives, and false negatives; a higher F1-score indicates better performance. MCC considers both Sn and Sp and ranges from -1 to 1, with higher values indicating better performance.

The ROC curve offers a graphical representation of the relationship between the true positive rate (TPR) and the false positive rate (FPR) at different thresholds; TPR corresponds to Sn, while FPR equals one minus Sp. AUC represents the probability that a model ranks a positive sample above a negative sample and is regarded as the most important metric in the evaluation of many bioinformatics models. The closer the ROC curve approaches the upper left corner, the closer the AUC value approaches 1, indicating better classification performance.

In this study, samples with a predicted probability of being positive greater than 0.5 are classified as positive, and Sn, Sp, ACC, F1-score, and MCC are computed at this fixed threshold. We primarily use the ROC curve and its AUC value to compare the performance of different models, since the ROC curve visualizes the trade-off between Sn and Sp across thresholds and thus allows models to be compared at the same specificity, providing a more comprehensive evaluation.
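The threshold-dependent metrics above follow directly from the confusion-matrix counts; a small illustrative helper:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, ACC, F1-score and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # true positive rate (Eq. 4)
    sp = tn / (tn + fp)                      # true negative rate (Eq. 5)
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy (Eq. 6)
    f1 = 2 * tp / (2 * tp + fp + fn)         # F1-score (Eq. 7)
    mcc = (tp * tn - fp * fn) / math.sqrt(   # Matthews correlation (Eq. 8)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "ACC": acc, "F1": f1, "MCC": mcc}
```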
To ensure the robustness of our model's performance, we conducted rigorous tests. For five-fold cross-validation, we calculated the mean and standard deviation of the metric values obtained from each fold. For the independent test, we conducted 10 runs with different random seeds and calculated the mean and standard deviation of the evaluated metrics.
Kcr is a post-translational modification that plays a crucial role in various cellular processes. It has been observed that the evolution of Kcr sites exhibits conservation, which suggests that these sites have functional significance […]
Fig. 3 Sequence logo of Kcr sites on non-histone proteins. A Sequence logo for plants; B Sequence logo for humans
To evaluate the performance of PlantNh-Kcr, we conducted five-fold cross-validation and 10 independent tests. For cross-validation, the ROC curves for each fold were tightly clustered in the top left corner of the plot (Fig. 4A), indicating that the model has strong discriminatory power. The average AUC value is 0.891, which is significantly better than random prediction. The average values of Sn, Sp, ACC, F1-score, and MCC are 0.821, 0.810, 0.812, 0.648, and 0.551, respectively (Table 2). On independent tests, PlantNh-Kcr also demonstrates strong performance, with an average AUC value of 0.899 (Fig. 4B) and average values of Sn, Sp, ACC, F1-score, and MCC of 0.811, 0.833, 0.828, 0.665, and 0.572, respectively (Table 2).
Fig. 4 ROC curves of the PlantNh-Kcr model on five-fold cross-validation and independent tests. A The ROC curves on five-fold cross-validation; B The ROC curve on independent tests
Table 2 Metric values of the PlantNh-Kcr model on five-fold cross-validation and independent tests
Evaluation method | Sn (%) | Sp (%) | ACC (%) | F1-score (%) | MCC (%) | AUC (%)
Cross-validation | 82.1 ± 2.36 | 81.0 ± 1.91 | 81.2 ± 1.04 | 64.8 ± 0.33 | 55.1 ± 0.54 | 89.1 ± 0.54
Independent test | 81.1 ± 3.23 | 83.3 ± 2.09 | 82.8 ± 0.99 | 66.5 ± 0.50 | 57.2 ± 0.50 | 89.9 ± 0.19
To visualize the discriminatory power of PlantNh-Kcr, we used the training dataset to train a model and subsequently fed the samples of the test dataset to it. Then we used t-SNE […] to visualize the sample representations at different layers of the model.
Fig. 5 t-SNE visualization of test samples in PlantNh-Kcr layers. A The input layer; B The flatten layer; C The linear layer
To further demonstrate the superior performance and robustness of PlantNh-Kcr, we conducted a comparative analysis with several well-established conventional machine learning models and deep learning models. The detailed information about these models is provided in Additional file 1.
In this study, we utilized three conventional machine learning methods, RF […], AdaBoost, and LightGBM, for comparison with PlantNh-Kcr.
Table 3 Metric values of different models on five-fold cross-validation
Classifier | Encoding | Sn (%) | Sp (%) | ACC (%) | F1-score (%) | MCC (%) | AUC (%)
RF | AAC | 70.5 | 64.8 | 66.0 | 44.6 | 29.1 | 74.5
RF | EGAAC | 70.9 | 59.3 | 61.8 | 43.8 | 24.7 | 71.1
RF | AAindex | 74.1 | 60.9 | 63.7 | 46.2 | 28.6 | 75.2
RF | BLOSUM62 | 70.1 | 66.8 | 67.5 | 47.6 | 30.6 | 75.5
AdaBoost | BE | 24.3 | 95.4 | 80.5 | 34.4 | 28.6 | 79.0
AdaBoost | AAC | 17.6 | 95.3 | 79.0 | 26.0 | 20.1 | 75.8
AdaBoost | EGAAC | 2.40 | 99.4 | 79.0 | 4.70 | 7.60 | 71.5
AdaBoost | BLOSUM62 | 24.2 | 95.6 | 80.6 | 34.4 | 28.8 | 78.9
LightGBM | BE | 68.6 | 85.0 | 81.5 | 60.9 | 49.5 | 85.6
LightGBM | AAC | 67.3 | 73.5 | 72.2 | 50.4 | 34.8 | 78.0
LightGBM | EGAAC | 70.3 | 59.4 | 61.7 | 43.6 | 24.3 | 70.5
LightGBM | AAindex | 65.1 | 88.0 | 83.2 | 61.9 | 51.3 | 86.6
LSTM | WE | 74.8 | 81.8 | 80.4 | 61.6 | 50.3 | 86.5
LSTM | AAindex | 76.6 | 83.1 | 81.8 | 63.8 | 53.5 | 88.0
LSTM | BLOSUM62 | 72.3 | 84.7 | 82.1 | 63.0 | 52.3 | 87.6
BiLSTM | WE | 74.4 | 81.9 | 80.3 | 61.4 | 50.1 | 86.6
BiLSTM | AAindex | 77.6 | 81.7 | 80.9 | 63.0 | 52.4 | 88.1
BiLSTM | BLOSUM62 | 81.1 | 78.4 | 79.0 | 62.0 | 51.3 | 87.7
CNN | WE | 80.3 | 81.6 | 81.4 | 64.4 | 54.4 | 88.6
CNN | AAindex | 78.3 | 82.6 | 81.7 | 64.3 | 54.2 | 88.5
CNN | BLOSUM62 | 77.3 | 83.0 | 81.9 | 64.2 | 54.0 | 88.6
PlantNh-Kcr | BE | 82.1 | 81.0 | 81.2 | 64.8 | 55.1 | 89.1
Table 4 Metric values of different models on independent test
Classifier | Encoding | Sn (%) | Sp (%) | ACC (%) | F1-score (%) | MCC (%) | AUC (%)
RF | AAC | 70.5 | 65.1 | 66.2 | 46.8 | 29.4 | 74.6
RF | EGAAC | 70.5 | 60.9 | 62.9 | 44.4 | 25.6 | 71.1
RF | AAindex | 74.5 | 60.0 | 63.0 | 45.9 | 28.2 | 75.2
RF | BLOSUM62 | 70.0 | 66.4 | 67.1 | 47.2 | 30.1 | 75.4
AdaBoost | BE | 23.5 | 95.4 | 80.3 | 33.8 | 27.8 | 78.9
AdaBoost | AAC | 18.0 | 95.6 | 79.3 | 26.8 | 23.1 | 76.2
AdaBoost | EGAAC | 2.60 | 99.4 | 79.0 | 5.00 | 7.90 | 71.5
AdaBoost | BLOSUM62 | 23.1 | 95.3 | 80.1 | 32.9 | 26.9 | 79.0
LightGBM | BE | 71.9 | 84.0 | 81.4 | 62.0 | 50.8 | 86.6
LightGBM | AAC | 69.0 | 72.2 | 71.5 | 50.5 | 34.9 | 78.4
LightGBM | EGAAC | 72.4 | 59.0 | 61.9 | 44.4 | 25.7 | 71.1
LightGBM | AAindex | 69.4 | 86.9 | 83.2 | 63.5 | 53.0 | 87.6
LSTM | WE | 75.2 | 83.4 | 81.7 | 63.3 | 52.7 | 87.5
LSTM | AAindex | 79.0 | 82.5 | 81.7 | 64.6 | 54.6 | 88.7
LSTM | BLOSUM62 | 75.5 | 83.8 | 82.0 | 63.9 | 53.6 | 88.5
BiLSTM | WE | 77.5 | 81.0 | 80.3 | 62.3 | 51.5 | 87.4
BiLSTM | AAindex | 79.3 | 82.2 | 81.6 | 64.5 | 54.5 | 88.9
BiLSTM | BLOSUM62 | 75.5 | 83.5 | 81.8 | 63.7 | 53.5 | 88.7
CNN | WE | 79.0 | 83.4 | 82.4 | 65.4 | 55.6 | 89.1
CNN | AAindex | 82.1 | 81.2 | 81.4 | 65.0 | 55.3 | 89.1
CNN | BLOSUM62 | 79.2 | 83.1 | 82.3 | 65.3 | 55.6 | 89.4
PlantNh-Kcr | BE | 81.1 | 83.3 | 82.8 | 66.5 | 57.2 | 89.9
Fig. 6 ROC curves of different models on independent tests
Three networks, LSTM, BiLSTM, and CNN, were compared with PlantNh-Kcr. The inputs for these networks encompassed the BE, WE, AAindex, and BLOSUM62 encodings. The specific metric values for each network are detailed in Tables 3 and 4. Interestingly, all three networks perform best when using BE as input. On five-fold cross-validation, the maximum average AUC values achieved by the LSTM, BiLSTM, and CNN networks are 0.882, 0.880, and 0.888, respectively. Similarly, on independent tests, the maximum average AUC values of these networks are 0.890, 0.889, and 0.896, respectively. However, the performance of all three networks remains inferior to that of PlantNh-Kcr.
To further demonstrate the performance of our model, we conducted a comparative analysis with four other models designed to predict Kcr sites on non-histones: nhKcr, iKcr_CNN, CapsNh-Kcr, and DeepKcrot. nhKcr, iKcr_CNN, and CapsNh-Kcr predict Kcr sites in humans. The nhKcr model integrated BE, AAindex encoding, and BLOSUM62 encoding as input features and employed a CNN architecture. The iKcr_CNN model employed a CNN architecture and utilized a focal loss function for optimization. CapsNh-Kcr employed a CNN-based capsule network strategy. DeepKcrot predicted Kcr sites in four species, including humans, rice, papaya, and tabacum, using a CNN with WE encoding as input features.
For nhKcr, iKcr_CNN, and CapsNh-Kcr, we downloaded their source codes. For DeepKcrot, we rewrote its code due to the unavailability of its web server. We applied focal loss to nhKcr and DeepKcrot because their original source codes did not address the data imbalance issue. We then trained the four models on the training dataset and evaluated their performance on the test set. The prediction performance of the four models is shown in Fig. 7 and Table 5. Their average AUC values are 0.876, 0.876, 0.890, and 0.892, respectively, all lower than that of PlantNh-Kcr. This again underscores the superior performance of PlantNh-Kcr.
Fig. 7 ROC curves of PlantNh-Kcr and the other four models
Table 5 Metric values of PlantNh-Kcr and the other four models
Model | Sn (%) | Sp (%) | ACC (%) | F1-score (%) | MCC (%) | AUC (%)
iKcr_CNN | 77.2 | 82.0 | 81.0 | 63.1 | 52.5 | 87.6
DeepKcrot | 82.8 | 77.9 | 78.9 | 62.3 | 51.9 | 87.6
nhKcr | 87.6 | 77.3 | 79.4 | 64.3 | 55.1 | 89.0
CapsNh-Kcr | 76.4 | 84.3 | 83.1 | 65.5 | 55.6 | 89.2
PlantNh-Kcr | 81.1 | 83.3 | 82.8 | 66.5 | 57.2 | 89.9
To assess the effect of each component in the PlantNh-Kcr model on prediction performance, we conducted an ablation study. In this study, we removed the linear layer, CNN, MHSA, and BiLSTM + MHSA individually from the model and evaluated the prediction performance on independent tests. The results are summarized in Table 6. Removing the linear layer and CNN individually resulted in a decrease of 1.1% and 1.2% in AUC values, respectively. This suggests that these two components have a certain impact on the overall performance of the model. On the other hand, removing MHSA and BiLSTM + MHSA individually resulted in a decrease of 0.5% and 0.3% in AUC values, respectively, indicating that these components have a smaller impact on performance compared to the linear layer and CNN. Overall, our results demonstrate that each component in the PlantNh-Kcr model contributes to its prediction performance. Removing any module from the model will result in a decrease in performance, indicating that each module is essential for achieving optimal performance.
Table 6 Prediction performance of models in the ablation study
Model | Sn (%) | Sp (%) | ACC (%) | F1-score (%) | MCC (%) | AUC (%)
Removing linear layer | 90.0 | 70.7 | 74.8 | 60.1 | 50.2 | 88.8
Removing CNN | 78.0 | 82.6 | 81.6 | 64.1 | 53.9 | 88.7
Removing MHSA | 79.2 | 84.5 | 83.4 | 66.7 | 57.3 | 89.4
Removing BiLSTM + MHSA | 82.1 | 82.2 | 82.1 | 66.0 | 56.5 | 89.6
PlantNh-Kcr | 81.1 | 83.3 | 82.8 | 66.5 | 57.2 | 89.9
In this study, we collected non-histone Kcr sites from different types of plants. Given the potential species-specific impact on these sites, it's necessary to assess the generalizability of our predictive model across diverse plant species. Therefore, we studied the performance of our model for each species on independent tests. Table 7 details the evaluation metrics for each species, which are further visualized in Fig. 8 as a bar chart.
Table 7 Performance of PlantNh-Kcr for different plants
Species | Sn (%) | Sp (%) | ACC (%) | F1-score (%) | MCC (%) | AUC (%)
Wheat | 73.4 ± 3.94 | 81.9 ± 2.19 | 79.8 ± 0.76 | 64.2 ± 0.80 | 51.3 ± 0.93 | 85.8 ± 0.32
Tabacum | 79.0 ± 4.16 | 84.5 ± 1.98 | 83.1 ± 0.81 | 69.7 ± 1.22 | 59.1 ± 1.67 | 89.6 ± 0.46
Rice | 87.2 ± 2.18 | 78.2 ± 2.32 | 79.6 ± 1.68 | 57.8 ± 1.65 | 51.3 ± 1.69 | 89.0 ± 0.49
Peanut | 87.9 ± 2.54 | 84.0 ± 2.17 | 84.7 ± 1.31 | 68.3 ± 1.29 | 61.5 ± 1.21 | 93.0 ± 0.21
Papaya | 81.1 ± 3.58 | 85.5 ± 1.94 | 84.6 ± 0.81 | 69.2 ± 0.57 | 60.4 ± 0.75 | 91.4 ± 0.30
Fig. 8 Metric values on independent tests for different plants
The results indicate that the prediction performance of the model varies slightly across species. Notably, peanut and papaya exhibit particularly strong performance, with average AUC values of 0.930 and 0.914, respectively. The model also performs well for tabacum and rice, with average AUC values of 0.896 and 0.890, respectively. However, wheat exhibits slightly lower performance than the other species, with an average AUC value of 0.858, which may be attributed to species-specific characteristics.
To study the performance of species-specific models, we developed an individual model for each plant using the samples of the corresponding species in the training dataset and evaluated it on the samples of that species in the test set. The performance of these models is shown in Table 8. Notably, the peanut-specific and papaya-specific models exhibit the best performance, with average AUC values of 0.920 and 0.902, respectively. In contrast, the species-specific models for rice, tabacum, and wheat perform relatively poorly, which can be attributed to the smaller training sets for rice and tabacum and to potential species-specific characteristics affecting crotonylation patterns in wheat. Compared with the general model's performance in Table 7, the species-specific models underperform. This finding underscores the advantage of integrating data from diverse species to train a general predictive model for plant non-histone Kcr sites.
Table 8 Performance of species-specific models on independent tests
Species | Sn (%) | Sp (%) | ACC (%) | F1-score (%) | MCC (%) | AUC (%)
Wheat | 73.3 ± 5.12 | 77.6 ± 4.26 | 76.5 ± 1.95 | 60.7 ± 0.53 | 46.2 ± 0.86 | 83.1 ± 0.36
Tabacum | 72.4 ± 2.28 | 77.1 ± 1.06 | 76.0 ± 0.43 | 59.7 ± 0.77 | 44.7 ± 1.06 | 82.7 ± 0.77
Rice | 66.2 ± 5.53 | 83.2 ± 3.46 | 80.4 ± 2.07 | 52.8 ± 1.11 | 43.2 ± 1.31 | 83.6 ± 0.80
Peanut | 83.7 ± 2.55 | 84.8 ± 1.58 | 84.6 ± 0.84 | 67.1 ± 0.65 | 59.6 ± 0.62 | 92.0 ± 0.17
Papaya | 84.4 ± 2.90 | 80.5 ± 2.26 | 81.4 ± 1.20 | 66.0 ± 0.85 | 56.6 ± 0.93 | 90.2 ± 0.19
PlantNh-Kcr exhibits superior performance. However, several issues still need to be considered.
First, our model PlantNh-Kcr contains three convolutional layers, which can effectively capture local patterns in protein sequences. Careful consideration must be given to the kernel size and the step size, as well as the number of convolution kernels. Too few or too many convolution kernels can lead to information loss or overfitting, respectively, which can impact model performance. Furthermore, when utilizing the convolutional layer to process long protein sequences, there is a risk of losing global contextual information. This can be a limiting factor in the predictive accuracy of the model. To address this issue, stacking multiple convolutional layers and effectively integrating their outputs can compensate for the loss of global context. By doing so, the model can achieve a more comprehensive understanding of the protein sequences, ultimately leading to improved predictive performance.
Second, multiple encodings were described in this paper, including BE, WE encoding, AAindex encoding, and BLOSUM62 encoding, but the PlantNh-Kcr model utilizes only BE as input features. We attempted to integrate multiple encodings as input features, but this failed to improve performance, possibly because these features complement each other poorly.
Third, there are far more negative samples than positive samples in our training set. This imbalance can significantly influence model training, biasing it towards the negative samples. To address this issue, three methods were employed: up-sampling the positive samples, down-sampling the negative samples, and utilizing the focal loss function. Among these methods, the focal loss function presented the best prediction performance, and improved the ability of the model to correctly predict positive samples. We believe that dataset imbalance remains a potential problem that needs to be addressed in bioinformatics.
In this study, we compiled a large dataset of non-histone Kcr sites from five plant species. Using this dataset, we developed a deep learning model called PlantNh-Kcr to predict non-histone Kcr sites in plants. The model's architecture integrates a CNN, a BiLSTM, and an attention mechanism, utilizing BE as its input features. Notably, the model exhibits satisfactory performance on both five-fold cross-validation and independent tests, outperforming several other models. In addition, although there are minor variations in prediction performance across plant species, a general predictive model demonstrates superior performance compared to species-specific models. We believe that the PlantNh-Kcr model offers a valuable contribution to addressing challenges and advancing the study of plant Kcr sites. We also believe that, as more Kcr sites are experimentally determined and deep learning techniques continue to develop, more high-performance models for predicting Kcr sites will emerge.
We are very grateful to the experimental scientists for their publicly available data of Kcr sites.
J.Y. devised the method, drafted and revised the manuscript. Y.R. analyzed the data and revised the manuscript. W.X. supervised the study and revised the manuscript.
This work was supported by the Start-up fund of Shanxi Normal University (83358).
The source code and datasets are publicly available at https://github.com/jiangyanming-individual/PlantNh-Kcr.
Not applicable.
Not applicable.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional file 1. Detailed information about the conventional machine learning models.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
By Yanming Jiang; Renxiang Yan and Xiaofeng Wang