============== Usage Tutorial ============== The following tutorial illustrates training, evaluation, inference and model introspection with phenotrex using the phenotrex command line interface. For further information on flags and parameters used in this tutorial, please consult the relevant CLI documentation available via ``phenotrex --help``. To illustrate phenotrex's capabilities, a small dataset of genome assemblies will be used. To download all required data, run: .. code-block:: console $ curl -OL http://fileshare.csb.univie.ac.at/phenotrex/tutorial_data.tar $ tar -xf tutorial_data.tar $ cd tutorial_data To then install phenotrex, including its capability for extracting features from FASTA files, in a new virtual environment: .. code-block:: console $ python3 -m venv phenotrex-env $ source ./phenotrex-env/bin/activate $ pip install phenotrex[fasta] Creation of Phenotrex Input Features ------------------------------------ Phenotrex operates on presence/absence patterns of `eggNOG`_ cluster IDs in the passed genome. If a DNA FASTA file is passed to phenotrex, `Prodigal`_ is first used to find protein sequences - this step is skipped if a protein FASTA file is passed instead. To then find eggNOG cluster IDs from protein sequences, `deepnog`_ is used. Input files to feature creation may thus be DNA or protein multi-FASTA files, which may optionally be gzipped. Feature creation is computationally demanding. For this reason, direct input of individual FASTA files to phenotrex is only implemented for prediction. For model training, evaluation (and for batch prediction), tabular files must be created from input FASTA files representing the genotype of all input files, encoded as eggNOG cluster IDs. This allows reuse of the created features for all training and evaluation purposes. To create a tabular genotype file suitable for use in phenotrex training: .. code-block:: console $ phenotrex compute-genotype \ --out T3SS.train_eval.genotype \ --threads 4 \ train_eval/genomes/*.fna.gz After some time, this will create a new tab-separated values (TSV) file ``T3SS.train_eval.genotype`` in the current directory, of the following shape: .. code-block:: #feature_type:eggNOG5-tax-2 GCA_003096415.1.fna.gz COG0656 COG0661 COG3161 COG0358 ... GCF_000006765.1.fna.gz COG0593 COG3255 COG1195 COG0187 ... GCF_000006905.1.fna.gz COG2202 COG0169 COG0237 COG0847 ... ... Here, all lines starting with ``#`` denote a metadata field read by phenotrex - for example the type of features, in this case eggNOG clusters for the NCBI taxon ID 2 (Bacteria) from eggNOG version 5. In each following line, the first field denotes the identifier of the input genome (the file name), followed by all features found in the genome, each separated by tabs. .. note:: Feature creation by phenotrex scales reasonably well with the number of threads supplied to the ``compute-genotype`` command (``--threads``). However, for large sets of genomes and when compute cluster resources are available to the user, it may be more expedient to compute subsets of genomes in parallel on different machines, and concatenate them afterwards. Training of Phenotrex Classifiers --------------------------------- To train a phenotrex classifier, two tabular input files are required: The ``genotype`` file (created from FASTA files in the last section), containing representations of the input genomes; and the ``phenotype`` file, containing true phenotypic trait values for each input genome on which to train and evaluate the model. For this tutorial, we provide a phenotype file containing information on Type 3 secretion system (T3SS) presence in each of the input genomes. The tabular phenotype file required for training and model evaluation has the following shape: .. code-block:: console Identifier T3SS GCF_000012905.2.fna.gz NO GCF_000195735.1.fna.gz NO GCF_000060345.1.fna.gz NO GCF_000959505.1.fna.gz YES GCF_000220235.1.fna.gz NO GCF_000190695.1.fna.gz NO GCF_000007605.1.fna.gz YES GCF_000195995.1.fna.gz YES GCF_000015365.1.fna.gz NO GCF_000173115.1.fna.gz NO GCF_000173095.1.fna.gz NO GCA_003096415.1.fna.gz NO ... The first column of the file contains identifiers (file names) mapping to those in the genotype file, and the second column contains true phenotypic trait values. During training, the model will store the header of column 2 as the name of the trait. Phenotrex implements model training using two different machine learning algorithms: `XGBoost`_ (XGB) and `Support Vector Machine`_ (SVM). For each algorithm, a number of hyperparameters are settable for training and evaluation. Please consult the output of ``phenotrex train xgb --help`` and ``phenotrex train svm --help``, as well as the relevant documentation of the underlying implementations. When no hyperparameters are selected, reasonable (but possibly suboptimal) defaults are used. To train an XGB classifier with the previously created genotype and the given phenotype file: .. code-block:: console $ phenotrex train xgb \ --genotype T3SS.train_eval.genotype \ --phenotype train_eval/T3SS.train_eval.phenotype \ --weights \ --out T3SS.pkl This will create a new model artifact ``T3SS.pkl`` in the current directory, and a tabular file ``T3SS.pkl.rank`` representing the relative impact of input features on prediction output as learned by the model. Performance Estimation of Phenotrex Classifiers ----------------------------------------------- The default way for phenotrex to estimate model performance (other than applying the trained model to a held back test set) is `nested cross-validation`_ (CV). This allows the estimation of predictive performance for a given set of training data and hyperparameters. To estimate performance of the model trained in the previous section, we perform a 10x/5x nested cross-validation like so: .. code-block:: console $ phenotrex cv xgb \ --genotype T3SS.train_eval.genotype \ --phenotype train_eval/T3SS.train_eval.phenotype \ --out T3SS.misclassifications.tsv \ --folds 5 \ --replicates 10 \ --threads 4 After training, predictive performance metrics averaged over outer CV folds will be printed to stderr, and a new tabular file ``T3SS.misclassifications.tsv`` will be created. This file contains the identifiers, phenotypic trait labels and fractions of misclassifications of the sample over outer CV folds. .. note:: The above command does not accept a trained model artifact. Since cross-validation is performed by training several models on subsets of the given data, a final model is not warranted here. In general, training of the final classifier with ``phenotrex train {xgb,svm}`` should be performed only when satisified with performance of the selected hyperparameters as given by cross-validation. Performance Estimation for Metagenomic Phenotrex Classifiers ------------------------------------------------------------ For phenotrex models intended to be applied to metagenome assembled genomes, it is useful to estimate the impact of missing and/or contaminating genomic features on the model output. In phenotrex, this is achieved by randomly resampling the features of validation genomes to simulate incompleteness and contamination (see `Feldbauer et al. 2015`_). For example, to estimate performance of a model on 80% complete and 5% contaminated genomic bins, nested cross-validation is performed where from each validation sample 20% of eggNOG cluster features are randomly removed. To simulate 5% contamination conservatively, a requisite number of eggNOG clusters are added to the genome drawn randomly only from genomes of the opposite label. This is performed at regular intervals of completeness and contamination, resulting in a JSON file detailing the estimated predictive performance at each step. By default, a grid of 5% increments of completeness and contamination is evaluated. To perform cross-validation under consideration of contamination and completeness (CCCV), perform: .. code-block:: console $ phenotrex cccv xgb \ --genotype T3SS.train_eval.genotype \ --phenotype train_eval/T3SS.train_eval.phenotype \ --out T3SS.cccv.json \ --folds 5 \ --replicates 10 \ --threads 4 \ --verb The above command results in a file ``T3SS.cccv.json`` being created, performance metrics at each step of the completeness/contamination grid. .. note:: The default binary classification performance metric used by phenotrex is Balanced Accuracy (bACC), which is the arithmetic mean of Sensitivity and Specificity of prediction: .. math:: bACC = \frac{1}{2} (\frac{TP}{TP + FN} + \frac{TN}{TN + FP}) This metric avoids inflating performance estimates on imbalanced datasets, and ranges from 0.5 (performance is indistinguishable from random) to 1.0 (perfect performance). Users are encouraged to determine the contamination and completeness levels of input metagenomic bins (e.g. using `CheckM`_), and critically examine the validity of predictions made by the classifier using the estimated performance at the closest point in the completeness/contamination grid. Predicting Phenotypes with Phenotrex ------------------------------------ For prediction, the threshold confidence of the classifier can be specified - all predictions with confidence below this threshold are then masked with 'N/A'. .. note:: The reported ``Confidence`` of the classifier is the model's internal confidence in its prediction given its input data. If the input genome is significantly incomplete or contaminated, this measure may be misleading, as the genome could be missing vital information required for correct classification by the model. For such cases, the external confidence measure for the given completeness/contamination level as computed by ``phenotrex cccv {xgb,svm}`` should be considered as well. Prediction of phenotypic traits with a pre-computed genotype file derived from genomes in the ``test/genomes`` directory (see section `Creation of Phenotrex Input Features`_): .. code-block:: console $ phenotrex predict \ --genotype test/T3SS.test.genotype \ --classifier T3SS.pkl \ --min_proba 0.6 \ --verb > T3SS.test_predictions.tsv The ``predict`` command outputs prediction results directly to stdout. When redirecting stdout to a file, this results in a 3-column TSV file of the following shape: .. code-block:: # Trait: T3SS Identifier Trait present Confidence GCF_000006645.1.fna.gz YES 0.8604 GCF_000006665.1.fna.gz YES 0.8675 GCF_000006825.1.fna.gz NO 0.6617 GCF_000007165.1.fna.gz YES 0.6771 GCF_000007205.1.fna.gz YES 0.8261 GCF_000007445.1.fna.gz YES 0.8183 ... Lines starting with ``#`` represent metadata, in this case the trait name saved in the used model artifact. Explanation of Phenotrex Predictions ------------------------------------ In addition to providing predicted trait labels and confidence measures, phenotrex can provide additional explanations of its decision process. This can help debug faulty hyperparameter configurations and help identify errors in the training data. Model explanation is done by gauging the importance of input features identified in genomes at training and prediction time. Feature Importance at Training Time ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The relative impact of features learned by phenotrex models is output at training time when the flag ``--weights`` is added to the ``phenotrex train {xgb,svm}`` command. The meaning of the importance differs depending on the selected ML algorithm: when using XGB, the measure represents the overall importance of that feature in the decision process of the model (irrespective of the final prediction), when using SVM, the measure correlates with the probability of calling YES (positive values) or NO (negative values) for the trait in question. Feature Importance at Prediction Time ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A second, and arguably more useful type of explanation can be computed at prediction time. For each predicted genome, a list of features is created which, either by presence or absence, contributed most to the prediction output for that genome. Feature importance is represented by `SHAP`_ (SHapley Additive exPlanations) values. The sum of SHAP values of all features considered by the model is directly related to the probability of calling YES for the trait and genome in question. .. note:: Feature explanation at prediction time is implemented by the `shap package`_, which efficiently computes the required explanations for XGB models with de facto zero overhead. For SVM models however, this calculation can be extremely costly. We thus suggest that for use cases where model explainability is important, XGB should be preferred over SVM. To create feature explanations at prediction time: .. code-block:: console $ phenotrex predict \ --genotype test/T3SS.test.genotype \ --classifier T3SS.pkl \ --min_proba 0.6 \ --out_explain_summary T3SS.expl_summary.tsv \ --out_explain_per_sample T3SS.expl_per_sample.tsv \ --n_max_explained_features 10 \ --verb > T3SS.test_predictions.tsv In addition to the original output file containing predictions, two additional files have been created: - ``T3SS.expl_per_sample.tsv`` This file contains for each predicted genome, the features which had the highest impact on the model output, as well as the sign of that impact. .. code-block:: console rank Sample Feature Feature Presence SHAP Value (class=YES) Feature Annotation ... 0 GCF_000006825.1.fna.gz COG4789 0.0 -0.46379 Type iii secretion 1 GCF_000006825.1.fna.gz COG1025 0.0 -0.19678 Belongs to the peptidase M16 family 2 GCF_000006825.1.fna.gz COG0814 1.0 0.16128 amino acid 3 GCF_000006825.1.fna.gz COG1330 1.0 0.15993 A helicase nuclease that prepares dsDNA breaks (DSB)... 4 GCF_000006825.1.fna.gz COG1459 1.0 0.14634 type II secretion system 5 GCF_000006825.1.fna.gz COG1450 0.0 -0.14371 Type ii and iii secretion system protein ... For example, for the genome ``GCF_000006825.1.fna.gz``, we see that the absence of ``COG4789`` is the single most impactful contribution to the prediction output towards the (correct) prediction ``NO``. We can immediately identify another secretory system component absent from the genome (``COG1450``) which contributes to this prediction output. However, as the used model was trained on a small toy dataset, the presence of ``COG0814`` with the somewhat unhelpful annotation "amino acid" and other features make significant contributions towards flipping the prediction to ``YES``, leading ultimately to a correct output but with a low confidence of 0.66. - ``T3SS.expl_summary.tsv`` This file contains the overall highest impact features, averaged over all SHAP contributions in all predicted genomes. For each feature, the average SHAP value change upon presence or absence of the feature is given, as well as the number of samples in which the feature was present and absent. .. code-block:: console Feature Mean SHAP If Present Mean SHAP If Absent N(present) N(absent) Feature Annotation COG4789 0.69559 -0.48636 29 162 Type iii secretion COG1025 0.26914 -0.17944 46 145 Belongs to the peptidase M16 family COG1330 0.10883 -0.12163 72 119 A helicase nuclease that prepares dsDNA breaks (DSB)... COG1929 0.22469 -0.08981 37 154 Belongs to the glycerate kinase type-1 family COG0833 0.20413 -0.08887 38 153 amino acid COG0814 0.13396 -0.07835 60 131 amino acid COG3835 0.18331 -0.05811 38 153 regulator COG1459 0.11474 -0.05503 73 118 type II secretion system COG1450 0.03356 -0.10312 107 84 Type ii and iii secretion system protein .. _eggNOG: http://eggnog5.embl.de/ .. _Prodigal: https://github.com/hyattpd/Prodigal .. _deepnog: https://github.com/univieCUBE/deepnog .. _XGBoost: https://xgboost.readthedocs.io/en/release_1.3.0/ .. _Support Vector Machine: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html .. _nested cross-validation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html .. _Feldbauer et al. 2015: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-16-S14-S1 .. _CheckM: https://github.com/Ecogenomics/CheckM/ .. _SHAP: https://doi.org/10.1038/s42256-019-0138-9 .. _shap package: https://github.com/slundberg/shap