pica.ml package

Submodules

pica.ml.cccv module

class pica.ml.cccv.CompleContaCV(pipeline: sklearn.pipeline.Pipeline, scoring_function: Callable = <function balanced_accuracy_score>, cv: int = 5, comple_steps: int = 20, conta_steps: int = 20, n_jobs: int = -1, n_replicates: int = 10, random_state: numpy.random.mtrand.RandomState = None, verb: bool = False, reduce_features: bool = False, n_features: int = 10000)[source]

Bases: object

A class containing all custom completeness/contamination cross-validation functionality.

Parameters
  • pipeline – target pipeline which describes the vectorization and estimator/classifier used

  • scoring_function – Sklearn-like scoring function used for cross-validation. Default: Balanced Accuracy.

  • cv – Number of folds in crossvalidation. Default: 5

  • comple_steps – number of steps between 0 and 1 (relative completeness) to be simulated

  • conta_steps – number of steps between 0 and 1 (relative contamination level) to be simulated

  • n_jobs – Number of parallel jobs. Default: -1 (All processors used)

  • n_replicates – Number of times the crossvalidation is repeated

  • reduce_features – toggles feature reduction using recursive feature elimination

  • n_features – minimal number of features to retain (if feature reduction is used)

  • random_state – An integer random seed or instance of np.random.RandomState

run(records: List[pica.structure.records.TrainingRecord])[source]

Perform completeness/contamination cross-validation.

Parameters

records – List[TrainingRecord] to perform compleconta cross-validation on.

Returns

A dictionary with mean balanced accuracies for each combination: dict[comple][conta]=mba
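The simulation underlying this cross-validation can be sketched as a toy in plain Python. Everything below is an illustrative assumption (the helper simulate_comple_conta, the feature names, and the per-feature sampling scheme are not the PICA implementation): relative completeness drops features of a genome, relative contamination mixes in foreign features.

```python
import random

def simulate_comple_conta(features, foreign_features, comple, conta, rng):
    # Keep each of the genome's own features with probability `comple`
    # (simulated incompleteness) and mix in each foreign feature with
    # probability `conta` (simulated contamination).
    kept = [f for f in features if rng.random() < comple]
    added = [f for f in foreign_features if rng.random() < conta]
    return kept + added

rng = random.Random(42)
genome = [f"OG{i}" for i in range(100)]   # hypothetical own features
foreign = [f"FG{i}" for i in range(100)]  # hypothetical contaminant features
sample = simulate_comple_conta(genome, foreign, comple=0.8, conta=0.1, rng=rng)
```

At each (comple, conta) grid point, validation samples are degraded along these lines and the mean balanced accuracy over replicates is recorded.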

pica.ml.feature_select module

pica.ml.feature_select.compress_vocabulary(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline)[source]

Groups features that store redundant information, to avoid overfitting and to speed up the process (in some cases). Might be replaced or complemented by a feature selection method in future versions.

Compressing the vocabulary is optional; for the test dataset it took 30 seconds, while the time saved later on is not significant.

Parameters
  • records – a list of TrainingRecord objects.

  • pipeline – the target pipeline whose vocabulary should be modified

Returns

nothing, sets the vocabulary for CountVectorizer step
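A minimal sketch of the grouping idea, assuming features are redundant when their presence/absence columns are identical (the detection criterion and names here are assumptions; only the resulting vocabulary shape, with several tokens sharing one compressed index, mirrors what compress_vocabulary produces):

```python
import numpy as np

# Toy presence/absence matrix: columns 0 and 2 are identical, i.e. the
# corresponding features carry redundant information.
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

groups = {}       # column fingerprint -> compressed index
vocabulary = {}   # feature name -> compressed index
for j in range(X.shape[1]):
    key = X[:, j].tobytes()
    vocabulary[f"feature_{j}"] = groups.setdefault(key, len(groups))

# feature_0 and feature_2 now share compressed index 0
```

Such a vocabulary, with repeated indices, is what the CustomVectorizer below is modified to accept.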

pica.ml.feature_select.multiple_step_rfecv(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline, n_features: int, step=(0.01, 0.01, 0.01), random_state: numpy.random.mtrand.RandomState = None)[source]

Applies multiple step sizes of RFECV in series; currently not used. The strategy might be problematic, with no clear benefit. #TODO rethink or remove

Parameters
  • records – Data used

  • pipeline – The base estimator used

  • n_features – Goal number of features

  • step – List of steps that should be applied

  • random_state – random state for deterministic results

Returns

pica.ml.feature_select.recursive_feature_elimination(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline, step: float = 0.0025, n_features: int = None, random_state: numpy.random.mtrand.RandomState = None)[source]

Applies recursive feature elimination (RFE) to limit the vocabulary used by the CustomVectorizer. This step is optional.

Parameters
  • records – list of TrainingRecords, entire training set.

  • pipeline – the pipeline whose vocabulary should be modified

  • step – fraction of features to eliminate at each step; the lower the number, the more steps

  • n_features – number of features to select (if None: half of the provided features)

  • random_state – random state for deterministic results

Returns

number of features used
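The same elimination scheme can be reproduced with scikit-learn's stock RFE; this sketch uses a synthetic dataset and LinearSVC as placeholders, with a fractional step as in the signature above:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorized training set.
X, y = make_classification(n_samples=100, n_features=50, random_state=0)

# A float step in (0, 1) eliminates that fraction of features per iteration.
selector = RFE(LinearSVC(dual=False, random_state=0),
               n_features_to_select=10, step=0.1)
selector.fit(X, y)
n_retained = selector.n_features_
```

selector.support_ then marks the retained features, from which a reduced vocabulary can be built.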

pica.ml.trex_classifier module

class pica.ml.trex_classifier.TrexClassifier(random_state: int = None, verb: bool = False)[source]

Bases: abc.ABC

Abstract base class of Trex classifier.

crossvalidate(records: List[pica.structure.records.TrainingRecord], cv: int = 5, scoring: Union[str, Callable] = 'balanced_accuracy', n_jobs=-1, n_replicates: int = 10, groups: bool = False, reduce_features: bool = False, n_features: int = 10000, demote=False, **kwargs) → Tuple[float, float, numpy.ndarray][source]

Perform cv-fold cross-validation, or leave-one-group-out validation if groups is True.

Parameters
  • records – List[TrainingRecord] to perform crossvalidation on.

  • scoring – String identifying scoring function of crossvalidation, or Callable. If a callable is passed, it must take two parameters y_true and y_pred (iterables of true and predicted class labels, respectively) and return a (numeric) score.

  • cv – Number of folds in crossvalidation. Default: 5

  • n_jobs – Number of parallel jobs. Default: -1 (All processors used)

  • n_replicates – Number of replicates of the crossvalidation

  • groups – If True, use group information stored in records for splitting. Otherwise, stratify split according to labels in records. This also resets n_replicates to 1.

  • reduce_features – toggles feature reduction using recursive feature elimination

  • n_features – minimum number of features to retain when reducing features

  • demote – toggles which logger is used; if True, messages are written to debug, otherwise to info

  • kwargs – Unused

Returns

A tuple of the mean score, the score SD, and the percentage of misclassifications per sample
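The replicated, stratified scheme can be approximated with stock scikit-learn; a sketch with cv=5 folds, n_replicates=10, and balanced accuracy (the synthetic data and LinearSVC estimator are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=120, n_features=20, random_state=3)

# 5 folds x 10 replicates, stratified by class label.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=3)
scores = cross_val_score(LinearSVC(dual=False), X, y,
                         scoring="balanced_accuracy", cv=cv)
mean_score, score_sd = scores.mean(), scores.std()
```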

crossvalidate_cc(records: List[pica.structure.records.TrainingRecord], cv: int = 5, comple_steps: int = 20, conta_steps: int = 20, n_jobs: int = -1, n_replicates: int = 10, reduce_features: bool = False, n_features: int = 10000)[source]

Instantiates a CompleContaCV object and calls its run() method with records. Returns its result.

Parameters
  • records – List[TrainingRecord] on which completeness_contamination_CV is to be performed

  • cv – number of folds in StratifiedKFold split

  • comple_steps – number of equidistant completeness levels

  • conta_steps – number of equidistant contamination levels

  • n_jobs – number of parallel jobs (-1 for n_cpus)

  • n_replicates – Number of times the crossvalidation is repeated

  • reduce_features – toggles feature reduction using recursive feature elimination

  • n_features – selects the minimum number of features to retain (if feature reduction is used)

Returns

A dictionary with mean balanced accuracies for each combination: dict[comple][conta]=mba
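The returned dict[comple][conta] structure can be illustrated with a toy (the grid values are assumptions derived from equidistant steps; the mba placeholder stands in for a computed mean balanced accuracy):

```python
comple_steps, conta_steps = 3, 3
mba = 0.0  # placeholder for a computed mean balanced accuracy
cccv_result = {
    round(c / (comple_steps - 1), 2): {
        round(k / (conta_steps - 1), 2): mba for k in range(conta_steps)
    }
    for c in range(comple_steps)
}
# cccv_result[1.0][0.0] -> MBA at full completeness, zero contamination
```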

abstract get_feature_weights() → Dict[source]

Extract the weights for features from pipeline.

Returns

sorted Dict of feature name: weight

classmethod get_instance(*args, **kwargs)[source]

Perform stratified, randomized parameter search. If desired, return a new class instance with optimized training parameters.

Parameters
  • records – List[TrainingRecord] to perform crossvalidation on.

  • search_params – A dictionary of iterables of possible model training parameters. If None, use default search parameters for the given classifier.

  • scoring – Scoring function of crossvalidation. Default: Balanced Accuracy.

  • cv – Number of folds in crossvalidation. Default: 5

  • n_jobs – Number of parallel jobs. Default: -1 (All processors used)

  • n_iter – Number of grid points to evaluate. Default: 10

  • return_optimized – Whether to return a ready-made classifier with the optimized params instead of a dictionary of params.

Returns

A dictionary containing best found parameters or an optimized class instance.
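The described search maps onto scikit-learn's RandomizedSearchCV; a sketch with a hypothetical search space over C, balanced accuracy scoring, and the defaults listed above (cv=5, n_iter=10):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

search = RandomizedSearchCV(
    LinearSVC(dual=False),
    param_distributions={"C": loguniform(1e-3, 1e3)},  # hypothetical space
    n_iter=10,
    cv=StratifiedKFold(n_splits=5),
    scoring="balanced_accuracy",
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

With return_optimized, best_params would be fed into a fresh classifier instance rather than returned directly.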

predict(X: List[pica.structure.records.GenotypeRecord]) → Tuple[List[str], numpy.ndarray][source]

Predict trait sign and probability of each class for each supplied GenotypeRecord.

Parameters

X – A List of GenotypeRecord for each of which to predict the trait sign

Returns

a Tuple of predictions and probabilities of each class for each GenotypeRecord in X.
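The tuple of class labels and per-class probabilities can be illustrated with a stock scikit-learn pipeline (LogisticRegression stands in here because LinearSVC has no predict_proba; the toy genotype strings are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy genotype "documents": whitespace-separated feature identifiers.
train_docs = ["OG1 OG2 OG3", "OG1 OG2", "OG4 OG5", "OG4 OG5 OG6"]
labels = ["YES", "YES", "NO", "NO"]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])
pipe.fit(train_docs, labels)

preds = pipe.predict(["OG1 OG3", "OG5 OG6"])
probas = pipe.predict_proba(["OG1 OG3", "OG5 OG6"])  # one row per record
```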

scoring_function_mapping = {'accuracy': <function accuracy_score>, 'balanced_accuracy': <function balanced_accuracy_score>, 'f1': <function f1_score>}

train(records: List[pica.structure.records.TrainingRecord], reduce_features: bool = False, n_features: int = 10000, **kwargs)[source]

Fit CountVectorizer and train LinearSVC on a list of TrainingRecord.

Parameters
  • records – a List[TrainingRecord] for fitting of CountVectorizer and training of LinearSVC.

  • reduce_features – toggles feature reduction using recursive feature elimination

  • n_features – minimum number of features to retain when reducing features

  • kwargs – additional named arguments are passed to the fit() method of Pipeline.

Returns

Whether the Pipeline has been fitted on the records.
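A sketch of the CountVectorizer + LinearSVC pipeline fit that this method performs, on toy genotype strings (the data and step names are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = ["OG1 OG2 OG3", "OG1 OG2", "OG4 OG5", "OG4 OG5 OG6"]
labels = ["YES", "YES", "NO", "NO"]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", LinearSVC(dual=False))])
pipe.fit(docs, labels)

# After fitting, the vectorizer's vocabulary and the SVC's weights exist.
vocab = pipe.named_steps["vec"].vocabulary_
```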

pica.ml.vectorizer module

class pica.ml.vectorizer.CustomVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)[source]

Bases: sklearn.feature_extraction.text.CountVectorizer

Modified from CountVectorizer to override the _validate_vocabulary function, which raised an error because multiple keys of the vocabulary dictionary mapped to the same feature index. However, this is exactly what we intend with the compress_vocabulary function. Other functions also had to be adapted: get_feature_names (to allow decompression) and _count_vocab (to reduce the matrix size).

get_feature_names()[source]

Array mapping from feature integer indices to feature name
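Why the override is needed can be shown with the stock class: a compressed vocabulary in which several tokens share one index is exactly what plain CountVectorizer refuses to accept (the toy vocabulary is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A compressed vocabulary where two tokens share one index -- the situation
# CustomVectorizer is designed to accept.
vocab = {"og1": 0, "og2": 0, "og3": 1}

vec = CountVectorizer(vocabulary=vocab)
try:
    vec.fit_transform(["og1 og2 og3"])
    rejected = False
except ValueError:  # stock validation: "Vocabulary contains repeated indices"
    rejected = True
```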

Module contents