Fig. 6

Feature importance ranking of bacterial taxa in heifers with superior or inferior reproductive performance, obtained with the random forest (RF) and XGBoost (XGB) algorithms. a Treatment of the training and test datasets. The real-number datasets of the AA-selected factors for 150d and 300d (Fig. S1) were prepared for RF and XGB. Because the AA-selected factors differ slightly between the 150d and 300d datasets, factors detected in at least one of the two datasets were included. Confusion matrices were computed using the 150d data as the training dataset and the 300d data as the test dataset. Prediction Tables I and II show the confusion matrices obtained via RF and XGB, respectively, and class. error in each table is the error rate for each class of the test dataset. Because the prediction accuracy was not high, the panels below show an attempt to detect features using the whole data. The abbreviations are as follows: training dataset, the dataset used to train the machine learning (ML) model; test dataset, the dataset on which the trained ML model makes predictions; 150d, 150 days of age after birth; 300d, 300 days of age after birth; Inf, the inferior group; Sup, the superior group; 150d_Inf, data of the inferior group at 150d; 150d_Sup, data of the superior group at 150d; 300d_Inf, data of the inferior group at 300d; 300d_Sup, data of the superior group at 300d; class. error, the error rate for each class in the table. The relative RF feature importance is shown for (b) 150d and (c) 300d, and the relative XGB feature importance for (d) 150d and (e) 300d. The X-axis indicates the feature values (MeanDecreaseGini, an RF feature unit, in (b) and (c); Importance, an XGB feature unit, in (d) and (e)). The Y-axis indicates the bacterial taxa screened by LDA and AA at 150 and 300 days of age. The bacterial taxa classified using LDA (Fig. 5) and AA (Fig. S1) are marked as follows: red bars, bacterial taxa classified into the superior group; green bars, bacterial taxa classified into the inferior group.
The factors selected by LDA are underlined. The factors enclosed by dotted lines in (d) and (e) indicate the features obtained when the 150d and 300d datasets are used as the training and test datasets, respectively.
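
To make the class. error column of the prediction tables concrete, the following minimal sketch computes a confusion matrix and a per-class error rate. It assumes that class. error is the fraction of samples of each true class that are misclassified. The labels below are invented for illustration only and are not the study data, and the helper names `confusion_matrix` and `class_error` are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def class_error(matrix):
    """Per-class error rate: misclassified samples / all samples of that true class."""
    errors = []
    for i, row in enumerate(matrix):
        total = sum(row)
        errors.append((total - row[i]) / total if total else float("nan"))
    return errors


# Illustrative labels only (NOT the study data): true vs. predicted group
classes = ["Inf", "Sup"]
actual = ["Inf", "Inf", "Inf", "Inf", "Sup", "Sup", "Sup", "Sup"]
predicted = ["Inf", "Inf", "Inf", "Sup", "Sup", "Sup", "Inf", "Inf"]

matrix = confusion_matrix(actual, predicted, classes)
errors = class_error(matrix)

for name, row, err in zip(classes, matrix, errors):
    print(name, row, f"class. error = {err:.2f}")
```

In this toy case one of four Inf samples and two of four Sup samples are misclassified, so the error rates are 0.25 and 0.50, respectively.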