pica.io package

Submodules

pica.io.io module

pica.io.io.collate_training_data(genotype_records: List[pica.structure.records.GenotypeRecord], phenotype_records: List[pica.structure.records.PhenotypeRecord], group_records: List[pica.structure.records.GroupRecord], universal_genotype: bool = False, verb: bool = False) → List[pica.structure.records.TrainingRecord]

Returns a list of TrainingRecord built from a list of GenotypeRecord and a list of PhenotypeRecord. Intended for training and cross-validation (CV) of TrexClassifier. Checks that a 1:1 mapping between phenotypes and genotypes exists and that all PhenotypeRecords pertain to the same trait.

Parameters
  • genotype_records – List[GenotypeRecord]

  • phenotype_records – List[PhenotypeRecord]

  • group_records – List[GroupRecord]; optional, needed if leave-one-group-out is the split strategy

  • universal_genotype – Whether to use a universal genotype file.

  • verb – toggle verbosity.

Returns

List[TrainingRecord]
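
The two checks described above (a 1:1 mapping between genotypes and phenotypes, and a single shared trait) can be illustrated with a minimal, self-contained sketch. It does not use pica itself: the `Genotype` and `Phenotype` classes, their field names, and `check_and_pair` are hypothetical stand-ins introduced for illustration, and the real function may behave differently, for example when `universal_genotype` is set.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical stand-ins for GenotypeRecord and PhenotypeRecord; the field
# names are assumptions made for this sketch only.
@dataclass
class Genotype:
    identifier: str
    features: List[str]


@dataclass
class Phenotype:
    identifier: str
    trait_name: str
    trait_sign: int


def check_and_pair(genotypes: List[Genotype],
                   phenotypes: List[Phenotype]) -> List[Tuple[Genotype, Phenotype]]:
    """Pair records by identifier after checking the two stated conditions."""
    # Condition 1: all phenotype records pertain to the same trait.
    traits = {p.trait_name for p in phenotypes}
    if len(traits) > 1:
        raise ValueError(f"More than one trait found: {sorted(traits)}")

    geno_by_id = {g.identifier: g for g in genotypes}
    pheno_by_id = {p.identifier: p for p in phenotypes}
    # Condition 2: 1:1 mapping. Duplicated identifiers collapse in the dicts,
    # so compare the sizes before comparing the key sets.
    if len(geno_by_id) != len(genotypes) or len(pheno_by_id) != len(phenotypes):
        raise ValueError("Duplicate identifiers break the 1:1 mapping")
    if geno_by_id.keys() != pheno_by_id.keys():
        unmatched = sorted(geno_by_id.keys() ^ pheno_by_id.keys())
        raise ValueError(f"Identifiers without a partner: {unmatched}")

    return [(geno_by_id[i], pheno_by_id[i]) for i in sorted(geno_by_id)]


pairs = check_and_pair(
    [Genotype("s1", ["COG0001"]), Genotype("s2", ["COG0002"])],
    [Phenotype("s2", "T3SS", 0), Phenotype("s1", "T3SS", 1)],
)
```

Raising an error instead of silently dropping unmatched samples is a design choice of this sketch, not a documented property of collate_training_data.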

pica.io.io.load_genotype_file(input_file: str) → List[pica.structure.records.GenotypeRecord]

Loads a genotype .tsv file and returns a list of GenotypeRecord for each entry.

Parameters

input_file – The path to the input genotype file.

Returns

List[GenotypeRecord] of records in the genotype file

pica.io.io.load_groups_file(input_file: str, selected_rank: str = None) → List[pica.structure.records.GroupRecord]

Loads a .tsv file that contains a group or taxon ID for each sample in the other training files. Group IDs may be NCBI taxon IDs or arbitrary group names. Taxon IDs are used only if a standard rank is selected; otherwise, user-specified group IDs are assumed. Automatically classifies the [TODO missing text?]

Parameters
  • input_file – path to the file that is processed

  • selected_rank – the standard rank that is selected (optional). If not set, the input file is assumed to contain groups, i.e., each unique entry of the ID will be a new group.

Returns

a list of GroupRecords
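
The behaviour described for an unset selected_rank, where each unique entry becomes a new group, can be sketched as follows. This is an illustration only and does not use pica: `assign_groups`, the integer group indices, and the (sample, group name) input pairs are assumptions of this sketch, not the library's actual data model.

```python
from typing import Dict, List, Tuple


def assign_groups(entries: List[Tuple[str, str]]) -> Dict[str, int]:
    """Map each sample identifier to an integer group index.

    `entries` holds (sample identifier, group name) pairs; every group name
    seen for the first time opens a new group.
    """
    group_index: Dict[str, int] = {}
    sample_to_group: Dict[str, int] = {}
    for sample_id, group_name in entries:
        # setdefault returns the existing index or stores the next free one.
        sample_to_group[sample_id] = group_index.setdefault(
            group_name, len(group_index)
        )
    return sample_to_group


groups = assign_groups([("s1", "soil"), ("s2", "marine"), ("s3", "soil")])
```

Here the samples `s1` and `s3` share a group because they carry the same group name, while `s2` opens a second group.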

pica.io.io.load_phenotype_file(input_file: str, sign_mapping: Dict[str, int] = None) → List[pica.structure.records.PhenotypeRecord]

Loads a phenotype .tsv file and returns a list of PhenotypeRecord for each entry.

Parameters
  • input_file – The path to the input phenotype file.

  • sign_mapping – an optional Dict to change the mapping of trait signs. Default: {"YES": 1, "NO": 0}

Returns

List[PhenotypeRecord] of records in the phenotype file
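
To show how a sign_mapping of this kind could translate textual trait labels into numeric signs, here is a minimal sketch that parses tab-separated text from a string. It does not call pica; `parse_phenotypes` is a hypothetical helper, and the file layout (a header row whose second column names the trait, then one identifier and label per row) is an assumption rather than the documented format.

```python
import csv
import io
from typing import Dict, List, Optional, Tuple

DEFAULT_SIGN_MAPPING = {"YES": 1, "NO": 0}  # default stated in the docs above


def parse_phenotypes(text: str,
                     sign_mapping: Optional[Dict[str, int]] = None
                     ) -> Tuple[str, List[Tuple[str, int]]]:
    """Return (trait name, [(identifier, numeric sign), ...]).

    Assumed layout for this sketch: a header row whose second column is the
    trait name, followed by one `identifier<TAB>label` row per sample.
    """
    mapping = DEFAULT_SIGN_MAPPING if sign_mapping is None else sign_mapping
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # A label missing from the mapping raises KeyError instead of being guessed.
    records = [(identifier, mapping[label]) for identifier, label in body]
    return header[1], records


tsv = "Identifier\tT3SS\ns1\tYES\ns2\tNO\n"
trait, records = parse_phenotypes(tsv)

# A custom mapping for files that encode the trait sign as "+" and "-".
custom = parse_phenotypes("Identifier\tT3SS\ns1\t+\n",
                          sign_mapping={"+": 1, "-": 0})
```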

pica.io.io.load_training_files(genotype_file: str, phenotype_file: str, groups_file: str = None, selected_rank: str = None, universal_genotype: bool = False, verb=False) → Tuple[List[pica.structure.records.TrainingRecord], List[pica.structure.records.GenotypeRecord], List[pica.structure.records.PhenotypeRecord], List[pica.structure.records.GroupRecord]]

Convenience function to load the genotype and phenotype files (and, optionally, a groups file) together and collate them into a list of TrainingRecord, which is returned along with the loaded records.

Parameters
  • genotype_file – The path to the input genotype file.

  • phenotype_file – The path to the input phenotype file.

  • groups_file – The path to the input groups file.

  • selected_rank – The selected standard rank to use for taxonomic grouping

  • universal_genotype – Whether to use a universal genotype file.

  • verb – toggle verbosity.

Returns

Tuple[List[TrainingRecord], List[GenotypeRecord], List[PhenotypeRecord], List[GroupRecord]]

pica.io.io.write_cccv_accuracy_file(output_file: str, cccv_results)

Function to write the cccv accuracies in the exact format that phendb uses as input.

Parameters
  • output_file – path to the output file

  • cccv_results

Returns

nothing

pica.io.io.write_misclassifications_file(output_file: str, records: List[pica.structure.records.TrainingRecord], misclassifications, use_groups: bool = False)

Function to write the misclassifications file.

Parameters
  • output_file – name of the output file

  • records – List of TrainingRecord objects

  • misclassifications – List of percentages of misclassifications

  • use_groups – toggles average over groups and groups output

Returns

nothing

pica.io.io.write_weights_file(weights_file: str, weights: Dict)

Function to write the weights to the specified file as tab-separated values with a header row.

Parameters
  • weights_file – The path to the file to which the output will be written

  • weights – sorted dictionary storing weights, keyed by feature name

Returns

nothing
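
A tab-separated weights table with a header could be written along the following lines. This sketch uses only the standard library and writes to an in-memory buffer; `write_weights` is a hypothetical helper, and the column names `Feature` and `Weight` are assumptions, since the actual header written by write_weights_file is not specified here.

```python
import csv
import io
from typing import Dict, TextIO


def write_weights(handle: TextIO, weights: Dict[str, float]) -> None:
    """Write a header row, then one feature/weight pair per row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Feature", "Weight"])  # assumed column names
    # Dicts preserve insertion order, so a pre-sorted dict stays sorted.
    for feature, weight in weights.items():
        writer.writerow([feature, weight])


buffer = io.StringIO()
write_weights(buffer, {"COG0001": 0.75, "COG0002": -0.5})
output = buffer.getvalue()
```

Passing an open file handle instead of the `StringIO` buffer would write the same rows to disk.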

Module contents