pica.ml package

Submodules

pica.ml.cccv module

class pica.ml.cccv.CompleContaCV(pipeline: sklearn.pipeline.Pipeline, scoring_function: Callable = <function balanced_accuracy_score>, cv: int = 5, comple_steps: int = 20, conta_steps: int = 20, n_jobs: int = -1, n_replicates: int = 10, random_state: numpy.random.mtrand.RandomState = None, verb: bool = False, reduce_features: bool = False, n_features: int = 10000)[source]

Bases: object

A class containing all custom completeness/contamination cross-validation functionality.

Parameters
  • pipeline – target pipeline which describes the vectorization and estimator/classifier used

  • scoring_function – Sklearn-like scoring function used for cross-validation. Default: Balanced Accuracy.

  • cv – Number of folds in crossvalidation. Default: 5

  • comple_steps – number of steps between 0 and 1 (relative completeness) to be simulated

  • conta_steps – number of steps between 0 and 1 (relative contamination level) to be simulated

  • n_jobs – Number of parallel jobs. Default: -1 (All processors used)

  • n_replicates – Number of times the crossvalidation is repeated

  • reduce_features – toggles feature reduction using recursive feature elimination

  • n_features – minimal number of features to retain (if feature reduction is used)

  • random_state – An integer random seed or instance of np.random.RandomState

run(records: List[pica.structure.records.TrainingRecord])[source]

Perform completeness/contamination cross-validation.

Parameters

records – List[TrainingRecord] to perform compleconta cross-validation on.

Returns

A dictionary with mean balanced accuracies for each combination: dict[comple][conta]=mba
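The simulation underlying this cross-validation can be sketched as a toy in plain Python. Everything below is an illustrative assumption (the helper simulate_comple_conta, the feature names, and the per-feature sampling scheme are not the PICA implementation): relative completeness drops features of a genome, relative contamination mixes in foreign features.

```python
import random

def simulate_comple_conta(features, foreign_features, comple, conta, rng):
    # Keep each of the genome's own features with probability `comple`
    # (simulated incompleteness) and mix in each foreign feature with
    # probability `conta` (simulated contamination).
    kept = [f for f in features if rng.random() < comple]
    added = [f for f in foreign_features if rng.random() < conta]
    return kept + added

rng = random.Random(42)
genome = [f"OG{i}" for i in range(100)]   # hypothetical own features
foreign = [f"FG{i}" for i in range(100)]  # hypothetical contaminant features
sample = simulate_comple_conta(genome, foreign, comple=0.8, conta=0.1, rng=rng)
```

At each (comple, conta) grid point, validation samples are degraded along these lines and the mean balanced accuracy over replicates is recorded.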

pica.ml.feature_select module

pica.ml.feature_select.compress_vocabulary(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline)[source]

Groups features that store redundant information, to avoid overfitting and to speed up the process (in some cases). Might be replaced or complemented by a feature selection method in future versions.

Compressing the vocabulary is optional; for the test dataset it took 30 seconds, while the time saved later on is not significant.

Parameters
  • records – a list of TrainingRecord objects.

  • pipeline – the target pipeline whose vocabulary should be modified

Returns

nothing, sets the vocabulary for CountVectorizer step
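A minimal sketch of the grouping idea, assuming features are redundant when their presence/absence columns are identical (the detection criterion and names here are assumptions; only the resulting vocabulary shape, with several tokens sharing one compressed index, mirrors what compress_vocabulary produces):

```python
import numpy as np

# Toy presence/absence matrix: columns 0 and 2 are identical, i.e. the
# corresponding features carry redundant information.
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

groups = {}       # column fingerprint -> compressed index
vocabulary = {}   # feature name -> compressed index
for j in range(X.shape[1]):
    key = X[:, j].tobytes()
    vocabulary[f"feature_{j}"] = groups.setdefault(key, len(groups))

# feature_0 and feature_2 now share compressed index 0
```

Such a vocabulary, with repeated indices, is what the CustomVectorizer below is modified to accept.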

pica.ml.feature_select.multiple_step_rfecv(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline, n_features: int, step=(0.01, 0.01, 0.01), random_state: numpy.random.mtrand.RandomState = None)[source]

Applies multiple step sizes of RFECV in series; currently not used. The strategy might be problematic, with no clear benefit. #TODO rethink or remove

Parameters
  • records – Data used

  • pipeline – The base estimator used

  • n_features – Goal number of features

  • step – List of steps that should be applied

  • random_state – random state for deterministic results

Returns

pica.ml.feature_select.recursive_feature_elimination(records: List[pica.structure.records.TrainingRecord], pipeline: sklearn.pipeline.Pipeline, step: float = 0.0025, n_features: int = None, random_state: numpy.random.mtrand.RandomState = None)[source]

Applies recursive feature elimination (RFE) to limit the vocabulary used by the CustomVectorizer. This step is optional.

Parameters
  • records – list of TrainingRecords, entire training set.

  • pipeline – the pipeline whose vocabulary should be modified

  • step – fraction of features to eliminate at each step; the lower the number, the more steps

  • n_features – number of features to select (if None: half of the provided features)

  • random_state – random state for deterministic results

Returns

number of features used
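The same elimination scheme can be reproduced with scikit-learn's stock RFE; this sketch uses a synthetic dataset and LinearSVC as placeholders, with a fractional step as in the signature above:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorized training set.
X, y = make_classification(n_samples=100, n_features=50, random_state=0)

# A float step in (0, 1) eliminates that fraction of features per iteration.
selector = RFE(LinearSVC(dual=False, random_state=0),
               n_features_to_select=10, step=0.1)
selector.fit(X, y)
n_retained = selector.n_features_
```

selector.support_ then marks the retained features, from which a reduced vocabulary can be built.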

pica.ml.trex_classifier module

class pica.ml.trex_classifier.TrexClassifier(random_state: int = None, verb: bool = False)[source]

Bases: abc.ABC

Abstract base class of Trex classifier.

crossvalidate(records: List[pica.structure.records.TrainingRecord], cv: int = 5, scoring: Union[str, Callable] = 'balanced_accuracy', n_jobs=-1, n_replicates: int = 10, groups: bool = False, reduce_features: bool = False, n_features: int = 10000, demote=False, **kwargs) → Tuple[float, float, numpy.ndarray][source]

Perform cv-fold cross-validation, or leave-one-group-out validation if groups is True.

Parameters
  • records – List[TrainingRecord] to perform crossvalidation on.

  • scoring – String identifying scoring function of crossvalidation, or Callable. If a callable is passed, it must take two parameters y_true and y_pred (iterables of true and predicted class labels, respectively) and return a (numeric) score.

  • cv – Number of folds in crossvalidation. Default: 5

  • n_jobs – Number of parallel jobs. Default: -1 (All processors used)

  • n_replicates – Number of replicates of the crossvalidation

  • groups – If True, use group information stored in records for splitting. Otherwise, stratify split according to labels in records. This also resets n_replicates to 1.

  • reduce_features – toggles feature reduction using recursive feature elimination

  • n_features – minimum number of features to retain when reducing features

  • demote – toggles which logger is used; if True, messages are written to debug, otherwise to info

  • kwargs – Unused

Returns

A tuple of the mean score, the score SD, and the percentage of misclassifications per sample
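The replicated, stratified scheme can be approximated with stock scikit-learn; a sketch with cv=5 folds, n_replicates=10, and balanced accuracy (the synthetic data and LinearSVC estimator are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=120, n_features=20, random_state=3)

# 5 folds x 10 replicates, stratified by class label.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=3)
scores = cross_val_score(LinearSVC(dual=False), X, y,
                         scoring="balanced_accuracy", cv=cv)
mean_score, score_sd = scores.mean(), scores.std()
```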

crossvalidate_cc(records: List[pica.structure.records.TrainingRecord], cv: int = 5, comple_steps: int = 20, conta_steps: int = 20, n_jobs: int = -1, n_replicates: int = 10, reduce_features: bool = False, n_features: int = 10000)[source]

Instantiates a CompleContaCV object and calls its run() method with records. Returns its result.

Parameters
  • records – List[TrainingRecord] on which completeness_contamination_CV is to be performed

  • cv – number of folds in StratifiedKFold split

  • comple_steps – number of equidistant completeness levels

  • conta_steps – number of equidistant contamination levels

  • n_jobs – number of parallel jobs (-1 for n_cpus)

  • n_replicates – Number of times the crossvalidation is repeated

  • reduce_features – toggles feature reduction using recursive feature elimination

  • n_features – selects the minimum number of features to retain (if feature reduction is used)

Returns

A dictionary with mean balanced accuracies for each combination: dict[comple][conta]=mba
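The returned dict[comple][conta] structure can be illustrated with a toy (the grid values are assumptions derived from equidistant steps; the mba placeholder stands in for a computed mean balanced accuracy):

```python
comple_steps, conta_steps = 3, 3
mba = 0.0  # placeholder for a computed mean balanced accuracy
cccv_result = {
    round(c / (comple_steps - 1), 2): {
        round(k / (conta_steps - 1), 2): mba for k in range(conta_steps)
    }
    for c in range(comple_steps)
}
# cccv_result[1.0][0.0] -> MBA at full completeness, zero contamination
```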

abstract get_feature_weights() → Dict[source]

Extract the weights for features from pipeline.

Returns

sorted Dict of feature name: weight

classmethod get_instance(*args, **kwargs)[source]

Perform stratified, randomized parameter search. If desired, return a new class instance with optimized training parameters.

Parameters
  • records – List[TrainingRecord] to perform crossvalidation on.

  • search_params – A dictionary of iterables of possible model training parameters. If None, use default search parameters for the given classifier.

  • scoring – Scoring function of crossvalidation. Default: Balanced Accuracy.

  • cv – Number of folds in crossvalidation. Default: 5

  • n_jobs – Number of parallel jobs. Default: -1 (All processors used)

  • n_iter – Number of grid points to evaluate. Default: 10

  • return_optimized – Whether to return a ready-made classifier with the optimized params instead of a dictionary of params.

Returns

A dictionary containing best found parameters or an optimized class instance.
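The described search maps onto scikit-learn's RandomizedSearchCV; a sketch with a hypothetical search space over C, balanced accuracy scoring, and the defaults listed above (cv=5, n_iter=10):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

search = RandomizedSearchCV(
    LinearSVC(dual=False),
    param_distributions={"C": loguniform(1e-3, 1e3)},  # hypothetical space
    n_iter=10,
    cv=StratifiedKFold(n_splits=5),
    scoring="balanced_accuracy",
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

With return_optimized, best_params would be fed into a fresh classifier instance rather than returned directly.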

predict(X: List[pica.structure.records.GenotypeRecord]) → Tuple[List[str], numpy.ndarray][source]

Predict trait sign and probability of each class for each supplied GenotypeRecord.

Parameters

X – A List of GenotypeRecord for each of which to predict the trait sign

Returns

a Tuple of predictions and probabilities of each class for each GenotypeRecord in X.
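The tuple of class labels and per-class probabilities can be illustrated with a stock scikit-learn pipeline (LogisticRegression stands in here because LinearSVC has no predict_proba; the toy genotype strings are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy genotype "documents": whitespace-separated feature identifiers.
train_docs = ["OG1 OG2 OG3", "OG1 OG2", "OG4 OG5", "OG4 OG5 OG6"]
labels = ["YES", "YES", "NO", "NO"]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])
pipe.fit(train_docs, labels)

preds = pipe.predict(["OG1 OG3", "OG5 OG6"])
probas = pipe.predict_proba(["OG1 OG3", "OG5 OG6"])  # one row per record
```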

scoring_function_mapping = {'accuracy': <function accuracy_score>, 'balanced_accuracy': <function balanced_accuracy_score>, 'f1': <function f1_score>}

train(records: List[pica.structure.records.TrainingRecord], reduce_features: bool = False, n_features: int = 10000, **kwargs)[source]

Fit CountVectorizer and train LinearSVC on a list of TrainingRecord.

Parameters
  • records – a List[TrainingRecord] for fitting of CountVectorizer and training of LinearSVC.

  • reduce_features – toggles feature reduction using recursive feature elimination

  • n_features – minimum number of features to retain when reducing features

  • kwargs – additional named arguments are passed to the fit() method of Pipeline.

Returns

Whether the Pipeline has been fitted on the records.
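A sketch of the CountVectorizer + LinearSVC pipeline fit that this method performs, on toy genotype strings (the data and step names are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = ["OG1 OG2 OG3", "OG1 OG2", "OG4 OG5", "OG4 OG5 OG6"]
labels = ["YES", "YES", "NO", "NO"]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", LinearSVC(dual=False))])
pipe.fit(docs, labels)

# After fitting, the vectorizer's vocabulary and the SVC's weights exist.
vocab = pipe.named_steps["vec"].vocabulary_
```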

pica.ml.vectorizer module

class pica.ml.vectorizer.CustomVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)[source]

Bases: sklearn.feature_extraction.text.CountVectorizer

Modified from CountVectorizer to override the _validate_vocabulary function, which raised an error because multiple keys of the vocabulary dictionary mapped to the same feature index. However, this is exactly what we intend with the compress_vocabulary function. Other functions also had to be adapted: get_feature_names (to allow decompression) and _count_vocab (to reduce the matrix size).

get_feature_names()[source]

Array mapping from feature integer indices to feature name
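Why the override is needed can be shown with the stock class: a compressed vocabulary in which several tokens share one index is exactly what plain CountVectorizer refuses to accept (the toy vocabulary is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A compressed vocabulary where two tokens share one index -- the situation
# CustomVectorizer is designed to accept.
vocab = {"og1": 0, "og2": 0, "og3": 1}

vec = CountVectorizer(vocabulary=vocab)
try:
    vec.fit_transform(["og1 og2 og3"])
    rejected = False
except ValueError:  # stock validation: "Vocabulary contains repeated indices"
    rejected = True
```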

Module contents