pica.ml package¶
Subpackages¶
Submodules¶
pica.ml.cccv module¶
class pica.ml.cccv.CompleContaCV(pipeline: sklearn.pipeline.Pipeline, scoring_function: Callable = <function balanced_accuracy_score>, cv: int = 5, comple_steps: int = 20, conta_steps: int = 20, n_jobs: int = -1, n_replicates: int = 10, random_state: numpy.random.mtrand.RandomState = None, verb: bool = False, reduce_features: bool = False, n_features: int = 10000)[source]¶
Bases: object
A class containing all custom completeness/contamination cross-validation functionality.
- Parameters
pipeline – target pipeline which describes the vectorization and estimator/classifier used
scoring_function – Sklearn-like scoring function for cross-validation. Default: Balanced Accuracy.
cv – Number of folds in crossvalidation. Default: 5
comple_steps – number of steps between 0 and 1 (relative completeness) to be simulated
conta_steps – number of steps between 0 and 1 (relative contamination level) to be simulated
n_jobs – Number of parallel jobs. Default: -1 (All processors used)
n_replicates – Number of times the crossvalidation is repeated
reduce_features – toggles feature reduction using recursive feature elimination
n_features – minimal number of features to retain (if feature reduction is used)
random_state – An integer random seed or instance of np.random.RandomState
run(records: List[pica.structure.records.TrainingRecord])[source]¶
Perform completeness/contamination cross-validation.
- Parameters
records – List[TrainingRecord] to perform compleconta cross-validation on.
- Returns
A dictionary with mean balanced accuracies for each combination: dict[comple][conta]=mba
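The idea behind the completeness/contamination grid can be illustrated with a small NumPy sketch (this is not PICA's actual implementation, and the function and feature names are invented for illustration): completeness is simulated by dropping a fraction of a genome's features, contamination by mixing in features drawn from a foreign pool.

```python
import numpy as np

def simulate_comple_conta(features, conta_pool, comple, conta, rng):
    """Illustrative sketch only: keep a fraction `comple` of a genome's
    features, then add a fraction `conta` (relative to the original
    feature count) of foreign features from `conta_pool`."""
    features = list(features)
    n_keep = int(round(comple * len(features)))
    kept = list(rng.choice(features, size=n_keep, replace=False))
    n_conta = int(round(conta * len(features)))
    contaminants = list(rng.choice(conta_pool, size=n_conta, replace=False))
    return kept + contaminants

rng = np.random.RandomState(42)
genome = [f"F{i}" for i in range(100)]      # synthetic genome features
pool = [f"G{i}" for i in range(100)]        # synthetic contaminant pool
sample = simulate_comple_conta(genome, pool, comple=0.8, conta=0.1, rng=rng)
print(len(sample))  # 80 kept + 10 contaminant features = 90
```

In CompleContaCV, one such simulation is evaluated for every grid point of `comple_steps` × `conta_steps`, over `n_replicates` repeats of the cross-validation.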
pica.ml.feature_select module¶
pica.ml.feature_select.compress_vocabulary(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline)[source]¶
Method to group features that store redundant information, to avoid overfitting and to speed up the process (in some cases). Might be replaced or complemented by a feature selection method in future versions.
Compressing the vocabulary is optional; for the test dataset it took 30 seconds, while the time saved later on was not significant.
- Parameters
records – a list of TrainingRecord objects.
pipeline – the targeted pipeline where the vocabulary should be modified
- Returns
nothing, sets the vocabulary for CountVectorizer step
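A minimal sketch of what a compressed vocabulary looks like (not PICA's actual code — the feature names and helper function are invented): redundant features share a single column index, so their counts are pooled into one compressed feature.

```python
# CountVectorizer-style vocabulary in which redundant features share
# one column index, pooling their counts into a single compressed column.
compressed_vocab = {
    "K00001": 0,   # K00001 and K00002 carry redundant information,
    "K00002": 0,   # so they share column index 0
    "K00003": 1,
}

def count_features(sample, vocabulary):
    """Count feature occurrences per compressed column index."""
    n_cols = max(vocabulary.values()) + 1
    counts = [0] * n_cols
    for feature in sample:
        if feature in vocabulary:
            counts[vocabulary[feature]] += 1
    return counts

print(count_features(["K00001", "K00002", "K00003"], compressed_vocab))  # [2, 1]
```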
pica.ml.feature_select.multiple_step_rfecv(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline, n_features: int, step=(0.01, 0.01, 0.01), random_state: numpy.random.mtrand.RandomState = None)[source]¶
Function to apply multiple step sizes of RFECV in series; currently not used. The strategy might be problematic and shows no clear benefit. #TODO rethink or remove
- Parameters
records – Data used
pipeline – The base estimator used
n_features – Goal number of features
step – List of steps that should be applied
random_state – random state for deterministic results
- Returns
pica.ml.feature_select.recursive_feature_elimination(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline, step: float = 0.0025, n_features: int = None, random_state: numpy.random.mtrand.RandomState = None)[source]¶
Function to apply RFE to limit the vocabulary used by the CustomVectorizer; the step size is optional.
- Parameters
records – list of TrainingRecords, entire training set.
pipeline – the pipeline which vocabulary should be modified
step – Fraction of features to eliminate at each step; the lower the number, the more steps.
n_features – number of features to select (if None: half of the provided features)
random_state – random state for deterministic results
- Returns
number of features used
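The underlying technique is plain scikit-learn recursive feature elimination; the following sketch shows it in isolation, with a synthetic dataset standing in for vectorized genotype records (nothing here comes from PICA itself).

```python
# Plain scikit-learn RFE: repeatedly drop the lowest-weighted fraction
# (`step`) of features until `n_features_to_select` remain.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(40, 20))   # 40 samples, 20 binary features
y = X[:, 0] ^ X[:, 1]                  # labels depend only on features 0 and 1

selector = RFE(LinearSVC(random_state=0, max_iter=10000),
               n_features_to_select=5, step=0.25)
selector.fit(X, y)
print(selector.support_.sum())  # 5 features retained
```

PICA uses a much smaller `step` (0.0025 by default), trading runtime for a finer-grained elimination schedule.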
pica.ml.trex_classifier module¶
class pica.ml.trex_classifier.TrexClassifier(random_state: int = None, verb: bool = False)[source]¶
Bases: abc.ABC
Abstract base class of Trex classifier.
crossvalidate(records: List[pica.structure.records.TrainingRecord], cv: int = 5, scoring: Union[str, Callable] = 'balanced_accuracy', n_jobs=-1, n_replicates: int = 10, groups: bool = False, reduce_features: bool = False, n_features: int = 10000, demote=False, **kwargs) → Tuple[float, float, numpy.ndarray][source]¶
Perform cv-fold cross-validation, or leave-one(-group)-out validation if groups == True.
- Parameters
records – List[TrainingRecord] to perform cross-validation on.
scoring – String identifying scoring function of crossvalidation, or Callable. If a callable is passed, it must take two parameters y_true and y_pred (iterables of true and predicted class labels, respectively) and return a (numeric) score.
cv – Number of folds in crossvalidation. Default: 5
n_jobs – Number of parallel jobs. Default: -1 (All processors used)
n_replicates – Number of replicates of the crossvalidation
groups – If True, use group information stored in records for splitting. Otherwise, stratify split according to labels in records. This also resets n_replicates to 1.
reduce_features – toggles feature reduction using recursive feature elimination
n_features – minimum number of features to retain when reducing features
demote – toggles the logger used; if True, messages are written at debug level, otherwise at info level
kwargs – Unused
- Returns
A tuple of the mean score, the score SD, and the percentage of misclassifications per sample
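The `scoring` parameter accepts any callable with the documented contract: two iterables `y_true` and `y_pred`, returning a numeric score. A minimal hand-rolled example (plain accuracy, independent of scikit-learn):

```python
def my_accuracy(y_true, y_pred):
    """Fraction of matching labels; any callable with the
    (y_true, y_pred) -> float signature can be passed as `scoring`."""
    pairs = list(zip(y_true, y_pred))
    return sum(t == p for t, p in pairs) / len(pairs)

print(my_accuracy(["YES", "NO", "YES", "NO"], ["YES", "NO", "NO", "NO"]))  # 0.75
```

A hypothetical call would then look like `clf.crossvalidate(records, scoring=my_accuracy)` (variable names invented for illustration).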
crossvalidate_cc(records: List[pica.structure.records.TrainingRecord], cv: int = 5, comple_steps: int = 20, conta_steps: int = 20, n_jobs: int = -1, n_replicates: int = 10, reduce_features: bool = False, n_features: int = 10000)[source]¶
Instantiates a CompleContaCV object and calls its run_cccv method with records. Returns its result.
- Parameters
records – List[TrainingRecord] on which completeness_contamination_CV is to be performed
cv – number of folds in StratifiedKFold split
comple_steps – number of equidistant completeness levels
conta_steps – number of equidistant contamination levels
n_jobs – number of parallel jobs (-1 for n_cpus)
n_replicates – Number of times the crossvalidation is repeated
reduce_features – toggles feature reduction using recursive feature elimination
n_features – selects the minimum number of features to retain (if feature reduction is used)
- Returns
A dictionary with mean balanced accuracies for each combination: dict[comple][conta]=mba
abstract get_feature_weights() → Dict[source]¶
Extract the feature weights from the pipeline.
- Returns
sorted Dict of feature name: weight
parameter_search(records: List[pica.structure.records.TrainingRecord], search_params: Dict[str, List] = None, cv: int = 5, scoring: str = 'balanced_accuracy', n_jobs: int = -1, n_iter: int = 10, return_optimized: bool = False)[source]¶
Perform stratified, randomized parameter search. If desired, return a new class instance with optimized training parameters.
- Parameters
records – List[TrainingRecord] to perform cross-validation on.
search_params – A dictionary of iterables of possible model training parameters. If None, use default search parameters for the given classifier.
scoring – Scoring function of crossvalidation. Default: Balanced Accuracy.
cv – Number of folds in crossvalidation. Default: 5
n_jobs – Number of parallel jobs. Default: -1 (All processors used)
n_iter – Number of grid points to evaluate. Default: 10
return_optimized – Whether to return a ready-made classifier with the optimized params instead of a dictionary of params.
- Returns
A dictionary containing best found parameters or an optimized class instance.
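A stratified, randomized parameter search of this kind can be sketched with plain scikit-learn's RandomizedSearchCV (the parameter grid and data below are synthetic, not PICA defaults):

```python
# Randomized search over a small parameter grid: `n_iter` parameter
# combinations are sampled and each is scored by stratified k-fold CV.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(60, 10)
y = rng.randint(0, 2, size=60)

search = RandomizedSearchCV(
    LinearSVC(max_iter=10000),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=4, cv=5, scoring="balanced_accuracy", random_state=0)
search.fit(X, y)
print(search.best_params_)
```

With `return_optimized=True`, the method returns a classifier instance already configured with the best parameters found, instead of the parameter dictionary.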
predict(X: List[pica.structure.records.GenotypeRecord]) → Tuple[List[str], numpy.ndarray][source]¶
Predict trait sign and probability of each class for each supplied GenotypeRecord.
- Parameters
X – A List of GenotypeRecord for each of which to predict the trait sign
- Returns
a Tuple of predictions and probabilities of each class for each GenotypeRecord in X.
scoring_function_mapping = {'accuracy': <function accuracy_score>, 'balanced_accuracy': <function balanced_accuracy_score>, 'f1': <function f1_score>}¶
train(records: List[pica.structure.records.TrainingRecord], reduce_features: bool = False, n_features: int = 10000, **kwargs)[source]¶
Fit CountVectorizer and train LinearSVC on a list of TrainingRecord.
- Parameters
records – a List[TrainingRecord] for fitting of CountVectorizer and training of LinearSVC.
reduce_features – toggles feature reduction using recursive feature elimination
n_features – minimum number of features to retain when reducing features
kwargs – additional named arguments are passed to the fit() method of Pipeline.
- Returns
Whether the Pipeline has been fitted on the records.
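The CountVectorizer-plus-LinearSVC combination named above can be sketched with a bare scikit-learn Pipeline; the toy genotype strings and labels below stand in for PICA's TrainingRecord objects and are not real data.

```python
# A pipeline of the shape `train` fits: a CountVectorizer turning
# whitespace-separated feature strings into count vectors, feeding a LinearSVC.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

genotypes = ["K00001 K00002", "K00001 K00003",   # toy trait-positive genomes
             "K00002 K00004", "K00003 K00004"]   # toy trait-negative genomes
labels = ["YES", "YES", "NO", "NO"]

pipe = Pipeline([("vec", CountVectorizer()),
                 ("clf", LinearSVC(max_iter=10000))])
pipe.fit(genotypes, labels)
print(pipe.predict(["K00001 K00002"])[0])
```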
pica.ml.vectorizer module¶
class pica.ml.vectorizer.CustomVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)[source]¶
Bases: sklearn.feature_extraction.text.CountVectorizer
Modified from CountVectorizer to override the _validate_vocabulary function, which raised an error because multiple keys of the vocabulary dictionary mapped to the same feature index. However, this is exactly what we intend with the compress_vocabulary function. Other functions also had to be adapted: get_feature_names (to allow decompression) and _count_vocab (to reduce the matrix size).
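The following sketch shows why the override is needed, using plain scikit-learn (the gene names are invented): a stock CountVectorizer rejects a vocabulary in which several terms map to the same column index, which is precisely the shape a compressed vocabulary has.

```python
# Stock CountVectorizer refuses a vocabulary with repeated column
# indices, so a subclass must relax _validate_vocabulary to accept it.
from sklearn.feature_extraction.text import CountVectorizer

shared = {"genea": 0, "geneb": 0, "genec": 1}  # genea/geneb share index 0
try:
    CountVectorizer(vocabulary=shared).fit(["genea geneb genec"])
    print("accepted")
except ValueError as exc:
    print("rejected:", exc)
```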