Input/Output Files

Usage help and the complete list of I/O arguments of each command are available from the command-line help:

vtam COMMAND --help
e.g.
vtam filter --help

The content of each I/O file is detailed below.

params

Input of most commands. YAML file with numerical parameters. It is a simple text file in “parameter name: parameter value” format, with one parameter per line, and can be omitted if all parameters keep their default values. For example:

lfn_variant_cutoff: 0.001
lfn_sample_replicate_cutoff: 0.003
lfn_read_count_cutoff: 70
pcr_error_var_prop: 0.05
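
As a rough illustration of this format, the Python sketch below reads such a file into a dictionary using only the standard library. The function name `read_params` is hypothetical and not part of VTAM. It assumes one flat `name: value` pair per line with numeric values; a full YAML parser would be the safer choice for anything more complex.

```python
def read_params(text):
    """Parse flat 'name: value' lines into a dict of numbers (illustrative only)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        # Keep whole numbers as int, everything else as float
        try:
            params[name] = int(value)
        except ValueError:
            params[name] = float(value)
    return params


example = """\
lfn_variant_cutoff: 0.001
lfn_sample_replicate_cutoff: 0.003
lfn_read_count_cutoff: 70
pcr_error_var_prop: 0.05
"""
params = read_params(example)
```

Here `params["lfn_read_count_cutoff"]` is the integer 70 and the three other values are floats.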

fastqinfo

Input of merge. TSV file with the following columns:

  • TagFwd: Sequence of the tag on the forward primer (5’=>3’)
  • PrimerFwd: Sequence of the forward primer (5’=>3’)
  • TagRev: Sequence of the tag on the reverse primer (5’=>3’)
  • PrimerRev: Sequence of the reverse primer (5’=>3’)
  • Marker: Name of the marker (e.g. MFZR)
  • Sample: Name of the sample
  • Replicate: ID of the replicate
  • Run: Name of the sequencing run
  • FastqFwd: Name of the forward fastq file
  • FastqRev: Name of the reverse fastq file
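
To make the layout concrete, the following Python sketch writes a one-row fastqinfo table with the standard `csv` module. Only the column names come from the list above; the tag and primer sequences, sample, run and file names are made-up placeholders, not values recommended by VTAM.

```python
import csv
import io

FASTQINFO_COLUMNS = [
    "TagFwd", "PrimerFwd", "TagRev", "PrimerRev", "Marker",
    "Sample", "Replicate", "Run", "FastqFwd", "FastqRev",
]

# Hypothetical row: every value below is a placeholder for illustration.
rows = [
    {
        "TagFwd": "acgtacgt",
        "PrimerFwd": "ACGTACGTACGTACGT",
        "TagRev": "tgcatgca",
        "PrimerRev": "TGCATGCATGCATGCA",
        "Marker": "MFZR",
        "Sample": "sample1",
        "Replicate": "1",
        "Run": "run1",
        "FastqFwd": "run1_fw.fastq",
        "FastqRev": "run1_rv.fastq",
    },
]

# Write to an in-memory buffer; open a real file instead to save the table.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=FASTQINFO_COLUMNS, delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
fastqinfo_text = buffer.getvalue()
```

Using `DictWriter` with an explicit `fieldnames` list keeps the column order fixed regardless of how each row dictionary was built.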

fastainfo

Output of merge, input of sortreads. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • sample: Name of the sample
  • replicate: ID of the replicate
  • tagfwd: Sequence of the tag on the forward primer (5’=>3’)
  • primerfwd: Sequence of the forward primer (5’=>3’)
  • tagrev: Sequence of the tag on the reverse primer (5’=>3’)
  • primerrev: Sequence of the reverse primer (5’=>3’)
  • mergedfasta: Name of the FASTA file with merged sequences

sortedinfo

Output of sortreads, input of filter and optimize. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • sample: Name of the sample
  • replicate: ID of the replicate
  • sortedfasta: Name of the FASTA file containing merged, demultiplexed, trimmed sequences

db

I/O of filter and taxassign. Input of optimize and pool. SQLite database containing variants, samples, replicates, read counts, information on filtering steps and taxonomic assignments.
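
Since the database is a regular SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below lists the tables of a database by querying SQLite's own `sqlite_master` catalogue. The helper name `list_tables` is hypothetical, and no VTAM table names are assumed because they are not described here.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in an SQLite database, sorted by name."""
    # closing() makes sure the connection is released after the query
    with closing(sqlite3.connect(db_path)) as connection:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# An empty in-memory database has no tables; pass the path of the db file instead.
tables = list_tables(":memory:")
```

Opening the file read-only or working on a copy is a reasonable precaution, since the same database is also written by filter and taxassign.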

asvtable

Output of filter or pool, input of taxassign. TSV file with the variants that passed all filtering steps in rows, samples in columns, presence-absence (output of pool) or read counts (output of filter) in cells, and the following additional columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • variant: Variant ID
  • pooled_variants (only in output of pool): IDs of variants pooled since identical in their overlapping regions
  • sequence_length: Length of the variant
  • read_count: Total number of reads of the variants in the samples listed in the table
  • [one column per sample] : presence-absence (output of pool) or read counts (output of filter)
  • clusterid: ID of the centroid of the cluster (0.97 clustering of all variants of the ASV table)
  • clustersize: Number of variants in the cluster
  • chimera_borderline (only in output of filter): Potential chimeras (very similar to one of the parental sequences)
  • [keep_mockXX; One column per mock sample, if known_occurrences option is used]: 1 if variant is expected in the mock sample, 0 otherwise
  • pooled_sequences (only in output of pool): Sequences of pooled_variants
  • sequence: Sequence of the variant
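
Because sample columns are named after the samples, one way to find them is to exclude the fixed columns listed above. The Python sketch below does this on a tiny made-up table. Treating every column whose name starts with `keep_` as a mock column is an assumption based on the `keep_mockXX` pattern, and the cell values (including the `chimera_borderline` value) are placeholders.

```python
import csv
import io

# Columns described above that are not samples
FIXED_COLUMNS = {
    "run", "marker", "variant", "pooled_variants", "sequence_length",
    "read_count", "clusterid", "clustersize", "chimera_borderline",
    "pooled_sequences", "sequence",
}


def sample_columns(header):
    """Return the header names that are neither fixed nor keep_* columns."""
    return [
        name for name in header
        if name not in FIXED_COLUMNS and not name.startswith("keep_")
    ]


# Made-up ASV table in the style of the filter output (cells hold read counts)
ASV_TSV = (
    "run\tmarker\tvariant\tsequence_length\tread_count\tsampleA\tsampleB\t"
    "clusterid\tclustersize\tchimera_borderline\tsequence\n"
    "run1\tMFZR\t7\t8\t150\t100\t50\t7\t1\tFalse\tACGTACGT\n"
)

reader = csv.DictReader(io.StringIO(ASV_TSV), delimiter="\t")
samples = sample_columns(reader.fieldnames)
# One dictionary of per-sample read counts for each variant row
counts = [{s: int(row[s]) for s in samples} for row in reader]
```

A table completed by taxassign carries further columns, which would need to be added to `FIXED_COLUMNS` before reusing this approach.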

known_occurrences

Input of filter and optimize. Output of make_known_occurrences. TSV file with expected occurrences (keep) and known false positives (delete).

  • marker: Name of the marker (e.g. MFZR)
  • run: Name of the sequencing run
  • sample: Name of the sample
  • mock: 1 if sample is a mock, 0 otherwise
  • variant: Variant ID (can be empty)
  • action: keep (occurrences that should be kept after filtering) or delete (clear false positives)
  • sequence: Sequence of the variant
  • tax_name: optional, not used by optimize

mock_composition

Input of filter. TSV file with expected sequences (keep) in mock samples.

  • marker: Name of the marker (e.g. MFZR)
  • run: Name of the sequencing run
  • sample: Name of the sample
  • mock: 1 if sample is a mock, 0 otherwise
  • variant: Variant ID (can be empty)
  • action: keep (occurrences that should be kept after filtering) or delete (clear false positives) or tolerate (variant present in a mock sample but amplifies badly)
  • sequence: Sequence of the variant
  • tax_name: optional, not used by optimize

sample_types

Input of make_known_occurrences. TSV file.

  • run: Name of the sequencing run
  • sample: Name of the sample
  • sample_type: real, negative (negative control) or mock
  • habitat: Habitat type (e.g. freshwater, marine), NA for negative control samples. It is used to detect occurrences that do not correspond to the habitat type.

missing_occurrences

Output of make_known_occurrences. TSV file with keep occurrences that are missing from the input ASV table.

  • marker: Name of the marker (e.g. MFZR)
  • run: Name of the sequencing run
  • sample: Name of the sample
  • mock: 1 if sample is a mock, 0 otherwise
  • variant: Variant ID (can be empty)
  • action: keep (occurrences that should be kept after filtering) or delete (clear false positives)
  • sequence: Sequence of the variant
  • tax_name: optional, not used by optimize

optimize_lfn_sample_replicate.tsv

Output of optimize. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • sample: Name of the sample
  • replicate: ID of the replicate
  • variant: Variant ID
  • N_ijk: Number of reads of variant i, in sample j and replicate k
  • N_jk: Number of reads in sample j and replicate k (all variants)
  • N_ijk/N_jk: Proportion of the reads of sample j and replicate k that belong to variant i
  • round_down: Value of N_ijk/N_jk rounded down
  • sequence: Variant sequence
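
The relation between N_ijk, N_jk and their ratio can be sketched in Python as follows. The read counts are invented, and rounding down to 4 decimal places is an assumption made for this example, since the precision used by optimize is not stated here.

```python
from collections import defaultdict

# Hypothetical read counts: (sample, replicate, variant) -> N_ijk
read_counts = {
    ("sample1", 1, 101): 2,
    ("sample1", 1, 102): 1,
    ("sample1", 2, 101): 970,
    ("sample1", 2, 102): 30,
}


def proportions(read_counts, digits=4):
    """Return {(sample, replicate, variant): (N_ijk, N_jk, ratio, round_down)}."""
    # N_jk: total reads per sample-replicate, summed over all variants
    n_jk = defaultdict(int)
    for (sample, replicate, _variant), n_ijk in read_counts.items():
        n_jk[(sample, replicate)] += n_ijk

    scale = 10 ** digits
    result = {}
    for (sample, replicate, variant), n_ijk in read_counts.items():
        total = n_jk[(sample, replicate)]
        ratio = n_ijk / total
        # Integer floor division avoids floating-point surprises when rounding down
        round_down = (n_ijk * scale // total) / scale
        result[(sample, replicate, variant)] = (n_ijk, total, ratio, round_down)
    return result


table = proportions(read_counts)
```

For variant 101 in replicate 1, N_ijk/N_jk is 2/3 and the rounded-down value is 0.6666, whereas ordinary rounding would give 0.6667.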

optimize_lfn_read_count_and_lfn_variant.tsv OR optimize_lfn_read_count_and_lfn_variant_replicate.tsv

Output of optimize. TSV file with the following columns:

  • occurrence_nb_keep: Number of keep occurrences left after filtering with lfn_nijk_cutoff and lfn_variant_cutoff values
  • occurrence_nb_delete: Number of delete occurrences left after filtering with lfn_nijk_cutoff and lfn_variant_cutoff values
  • lfn_nijk_cutoff: Value of lfn_read_count_cutoff
  • lfn_variant_cutoff or lfn_variant_replicate_cutoff: Value of the corresponding cutoff
  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)

optimize_lfn_variant_specific.tsv OR optimize_lfn_variant_replicate_specific.tsv

Output of optimize. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • variant: Variant ID
  • replicate: (if optimize_lfn_variant_replicate_specific.tsv) ID of the replicate
  • action: Type of occurrence (delete/keep)
  • read_count_max: Max of N_ijk for a given i
  • N_i (optimize_lfn_variant_specific.tsv) : Number of reads of variant i
  • N_ik (optimize_lfn_variant_replicate_specific.tsv): Number of reads of variant i in replicate k
  • lfn_variant_cutoff: read_count_max/N_i or read_count_max/N_ik
  • sequence: Variant sequence
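
As a worked example of the last columns, the Python sketch below computes read_count_max, N_i and their ratio for each variant. It follows the column descriptions literally and uses invented read counts; any additional selection of occurrences performed by optimize is not reproduced. The function name `variant_specific_cutoff` is hypothetical.

```python
# Hypothetical read counts: (sample, replicate, variant) -> N_ijk
read_counts = {
    ("sample1", 1, 101): 5,
    ("sample1", 2, 101): 10,
    ("sample2", 1, 101): 85,
    ("sample2", 1, 102): 40,
    ("sample2", 2, 102): 10,
}


def variant_specific_cutoff(read_counts):
    """Return {variant: (read_count_max, N_i, read_count_max / N_i)}."""
    per_variant = {}
    for (_sample, _replicate, variant), n_ijk in read_counts.items():
        max_so_far, total = per_variant.get(variant, (0, 0))
        # Track the largest N_ijk and the running total N_i for this variant
        per_variant[variant] = (max(max_so_far, n_ijk), total + n_ijk)
    return {
        variant: (read_count_max, n_i, read_count_max / n_i)
        for variant, (read_count_max, n_i) in per_variant.items()
    }


cutoffs = variant_specific_cutoff(read_counts)
```

For variant 101 the largest N_ijk is 85 out of N_i = 100 reads, giving a ratio of 0.85. The per-replicate version would group the counts by (variant, replicate) and divide by N_ik instead.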

optimize_pcr_error.tsv

Output of optimize. TSV file with the following columns:

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)
  • sample: Name of the sample
  • variant_expected: ID of a keep variant
  • N_ij_expected: Number of reads of the expected variant in the sample (all replicates)
  • variant_unexpected: ID of an unexpected variant with one mismatch to the keep variant
  • N_ij_unexpected: Number of reads of the unexpected variant in the sample (all replicates)
  • N_ij_unexpected_to_expected_ratio: N_ij_unexpected/N_ij_expected
  • sequence_expected: Sequence of the expected variant
  • sequence_unexpected: Sequence of the unexpected variant
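
The ratio column can be illustrated with a short Python sketch. It assumes that "one mismatch" means a single substitution between sequences of equal length, which may be narrower than what optimize actually accepts. All sequences and read counts are invented.

```python
def has_one_mismatch(seq_a, seq_b):
    """True if two equal-length sequences differ at exactly one position."""
    if len(seq_a) != len(seq_b):
        return False
    return sum(a != b for a, b in zip(seq_a, seq_b)) == 1


# Hypothetical expected (keep) variant and other variants of the same sample
expected = {"variant": 1, "sequence": "ACGTACGT", "reads": 2000}
candidates = [
    {"variant": 2, "sequence": "ACGTACGA", "reads": 50},   # one mismatch
    {"variant": 3, "sequence": "TCGTACGA", "reads": 400},  # two mismatches
]

# N_ij_unexpected / N_ij_expected for each one-mismatch variant
ratios = {
    c["variant"]: c["reads"] / expected["reads"]
    for c in candidates
    if has_one_mismatch(expected["sequence"], c["sequence"])
}
```

Only variant 2 qualifies here, with a ratio of 50/2000 = 0.025.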

output (taxassign)

Output of taxassign. The input asvtable completed with the following columns:

  • ltg_tax_id: TaxID of the LTG (Lowest Taxonomic Group)
  • ltg_tax_name: Name of the LTG
  • ltg_rank: Taxonomic rank of the LTG
  • identity: Percentage of identity used to determine the LTG
  • blast_db: Name of the taxonomic BLAST database files (without extensions)
  • phylum: Phylum of the LTG
  • class: Class of the LTG
  • order: Order of the LTG
  • family: Family of the LTG
  • genus: Genus of the LTG
  • species: Species of the LTG

taxonomy

Output of taxonomy, input of taxassign. TSV file with information on all taxa in the reference (BLAST) database.

  • tax_id: Taxonomic identifier of the taxon
  • parent_tax_id: Taxonomic identifier of the direct parent of the taxon
  • rank: Taxonomic rank of the taxon (e.g. class, species, no rank)
  • name_txt: Name of the taxon
  • old_tax_id: TaxID of taxa merged into the taxon (no longer valid)
  • taxlevel: Index of the taxonomic level (optional; 0 = root, 1 = superkingdom, 2 = kingdom, 3 = phylum, 4 = class, 5 = order, 6 = family, 7 = genus, 8 = species, x.5 for intermediate levels)
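
The tax_id and parent_tax_id columns describe a tree, so the lineage of a taxon can be recovered by following parents up to the root. The Python sketch below shows this on a toy table with made-up identifiers and names. It assumes that the root is its own parent and stops as soon as an identifier repeats or is missing.

```python
import csv
import io

# Toy taxonomy with made-up tax_ids; only the columns needed here are included.
TAXONOMY_TSV = (
    "tax_id\tparent_tax_id\trank\tname_txt\n"
    "1\t1\tno rank\troot\n"
    "10\t1\tphylum\tPhylumA\n"
    "20\t10\tclass\tClassA\n"
    "30\t20\tspecies\tSpecies a\n"
)


def load_taxonomy(text):
    """Index the taxonomy rows by integer tax_id."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {int(row["tax_id"]): row for row in reader}


def lineage(taxonomy, tax_id):
    """Return (rank, name) pairs from the root down to the given taxon."""
    names = []
    seen = set()
    # Stop at an unknown id or when an id repeats (root pointing to itself)
    while tax_id in taxonomy and tax_id not in seen:
        seen.add(tax_id)
        row = taxonomy[tax_id]
        names.append((row["rank"], row["name_txt"]))
        tax_id = int(row["parent_tax_id"])
    return names[::-1]


taxonomy = load_taxonomy(TAXONOMY_TSV)
path = lineage(taxonomy, 30)
```

The `seen` set also protects against accidental cycles in a hand-edited taxonomy file.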

runmarker

Input of pool. TSV file with the list of all run-marker combinations to be pooled.

  • run: Name of the sequencing run
  • marker: Name of the marker (e.g. MFZR)