Skip to content

KaryoScope Preprocessing: Telomeric Read Filtering

Overview

This document describes the preprocessing steps that prepare long-read telomeric sequencing reads for KaryoScope clustering analysis. The primary preprocessing step is telomeric read filtering using Telogator2, which enriches for reads containing telomeric and subtelomeric sequences. While not strictly required, this filtering dramatically improves computational efficiency and signal-to-noise ratio for telomeric structural variant detection.


Table of Contents

1. Telomeric Read Filtering (Telogator2) - 1.1 Purpose - 1.2 Algorithm - 1.3 Output Files - 1.4 Read Type Annotations - 1.5 Integration with KaryoScope Workflow - 1.6 Filtering Impact on Dataset Size

2. Alternative: Direct Input Without Telogator

3. Quality Control Recommendations

4. Citation


1. Telomeric Read Filtering (Telogator2)

1.1 Purpose

Telogator2 identifies and extracts reads containing telomeric sequences from whole-genome long-read sequencing data. This enrichment step:

  • Reduces dataset size by 10-100x depending on sequencing depth
  • Focuses downstream analysis on structurally informative telomeric reads
  • Removes reads lacking telomeric content
  • Enables efficient telomere-focused structural variant detection

1.2 Algorithm

Telogator2 uses k-mer-based classification to identify telomeric reads:

Primary filtering criterion:

Reads are classified as telomeric if they contain ≥ MINIMUM_CANON_HITS occurrences of tandem canonical telomeric 12-mers.

Default parameters: - MINIMUM_CANON_HITS: 8 (minimum 12-mer hits) - MINIMUM_READ_LEN: 4000 bp (minimum read length) - Base k-mer size: 6 bp (doubled to 12 bp for tandem search)

Canonical telomeric 12-mers (tandem 6-mers):

  • Primary (all platforms): CCCTAACCCTAA (forward), TTAGGGTTAGGG (reverse complement)
  • Constructed by repeating the canonical 6-mer CCCTAA / TTAGGG twice
  • Additional (Nanopore only): Tandem repeats of CCTGG and TTTTAA (improve sensitivity for Nanopore error profiles)

Search algorithm:

The algorithm searches for exact matches to these 12-mer patterns (6-mer repeated twice) using simple string counting. Reads with ≥8 occurrences of the canonical 12-mer pattern are retained.

Example:

Read sequence:  ...TTAGGGTTAGGGTTAGGGTTAGGG...CCCTAACCCTAA...
12-mer matches:   [TTAGGGTTAGGG]
                        [TTAGGGTTAGGG]
                              [TTAGGGTTAGGG]
                                    [TTAGGGTTAGGG] [CCCTAACCCTAA]

Total 12-mer hits: 5 (below threshold of 8, read excluded)

Read sequence:  ...TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG...
12-mer matches:   [TTAGGGTTAGGG]
                        [TTAGGGTTAGGG]
                              [TTAGGGTTAGGG]
                                    [TTAGGGTTAGGG]
                                          [TTAGGGTTAGGG]
                                                [TTAGGGTTAGGG]
                                                      [TTAGGGTTAGGG]
                                                            [TTAGGGTTAGGG]

Total 12-mer hits: 8 (meets threshold, read retained)

Note: The algorithm uses Python's .count() method to search for the 12-mer string CCCTAACCCTAA or its reverse complement TTAGGGTTAGGG, counting overlapping occurrences.

1.3 Output Files

Telogator2 generates multiple output files. The key file for KaryoScope input is:

tel_reads.fa.gz - Filtered telomeric reads in FASTA format - Contains all reads meeting the tandem canonical k-mer threshold - Ready for input to KaryoScope feature annotation pipeline - Typical size reduction: 10-100x smaller than input

Additional outputs (optional, used for telomere length analysis): - tlens_by_allele.tsv - Allele-specific telomere lengths - all_final_alleles.png - Plots of all telomeric alleles - violin_atl.png - Violin plots of allelic telomere lengths

1.4 Read Type Annotations

KaryoScope workflows support organizing reads by sequencing technology and library preparation using user-defined type annotations:

Common PacBio type annotations: - hifi_fiber: High-fidelity reads from Fiber-seq libraries (chromatin accessibility profiling) - hifi_notfiber: Standard PacBio HiFi reads

Common Oxford Nanopore type annotations: - ONT_UL: Ultra-long reads (typically >50 kb) - ONT_nonfrag: Non-fragmented standard-length reads - ONT_frag: Fragmented or sheared reads

Implementation:

Type annotations are specified in two locations:

  1. Workflow configuration (config.yaml):
  2. Users manually assign type labels when defining input files
  3. The {type} wildcard in workflow files (e.g., {sample}.{type}.{i}) refers to these user-defined labels
  4. Example:

    data:
      sample1:
        ONT_UL:
          1: /path/to/ultralong_reads.fastq.gz
        hifi_fiber:
          1: /path/to/fiberseq_reads.bam
    

  5. Sequencing approach annotation (readnames.txt files):

  6. For post-clustering annotation, users create {sample}.readnames.txt files
  7. Tab-separated format: read_name\tsequencing_approach
  8. Values: hifi, ont, or other user-defined approaches
  9. Location: {readnames_dir}/{sample}/telogator/{sample}.readnames.txt
  10. Used by KaryoScope_annotate_sequences.py to add sequencing approach metadata to clustering results

Note: Type classifications are independent of Telogator2 filtering and based entirely on user knowledge of library preparation and sequencing strategy.

1.5 Integration with KaryoScope Workflow

The telogator-filtered tel_reads.fa.gz file serves as input to the KaryoScope feature annotation pipeline:

Workflow structure:

tel_reads.fa.gz (from Telogator2)
KaryoScope feature annotation (get_features_reads script)
results/{sample}/telogator/{i}/features/{database}/
    {sample}.telogator.{i}.{database}.combined.featureIDs.clustered.bed.gz
Feature smoothing (smooth_features.py)
results/{sample}/telogator/{i}/KaryoScope/{database}/
    {sample}.telogator.{i}.{database}.{featureset}.smoothed.features.bed
Feature merging (KaryoScope_merge_beds.py)
Clustering analysis (KaryoScope_cluster_analysis.py)

Directory structure:

results/{sample}/telogator/{i}/
├── features/{database}/           # Raw feature annotations
│   └── {sample}.telogator.{i}.{database}.combined.featureIDs.clustered.bed.gz
└── KaryoScope/{database}/         # Processed BED files for clustering
    ├── {sample}.telogator.{i}.{database}.chromosome.smoothed.features.bed
    ├── {sample}.telogator.{i}.{database}.region.smoothed.features.bed
    ├── {sample}.telogator.{i}.{database}.repeat.smoothed.features.bed
    ├── {sample}.telogator.{i}.{database}.subtelomeric.smoothed.features.bed
    └── {sample}.telogator.{i}.{database}.{merged_name}.smoothed.merged.bed

Iteration number {i}:

The {i} parameter is an index for multiple sequencing runs on the same sample. For example: - telogator/1/: First telogator run or first input file - telogator/2/: Second telogator run or second input file (e.g., different sequencing batch)

This allows processing multiple datasets or parameter sets in parallel within the same workflow.

1.6 Filtering Impact on Dataset Size

Example: Core4 Dataset

Post-telogator filtering results: - BJ: 2,681 telomeric reads - HeLa: 3,474 telomeric reads - IMR90: 2,197 telomeric reads - U2OS: 4,005 telomeric reads - Total: 12,357 telomeric reads

Technology distribution: - ~50% PacBio HiFi, ~50% ONT ultra-long - Classification breakdown: hifi_fiber (25%), hifi_notfiber (28%), ONT_UL (47%)

Read counts vary by sample based on: - Biological telomere content (e.g., ALT-positive cells may have more telomeric reads) - Sequencing depth - Telomere length distribution


2. Alternative: Direct Input Without Telogator

KaryoScope clustering can be performed on any set of annotated long reads without telogator preprocessing. However, this approach requires significantly more computational resources (10-100x more reads to process).

To use non-filtered reads, provide the reads directly to the KaryoScope feature annotation pipeline (get_features_reads script) as described in the KaryoScope Comprehensive Guide.


3. Quality Control Recommendations

Before telogator filtering: - Ensure reads are basecalled with recent algorithms (Guppy 5+, Dorado SUP, or SMRTLink 13+) - Older Nanopore basecallers (Guppy <5) have high error rates in telomeric regions and may fail

After telogator filtering: - Verify expected telomeric read counts match sample type and sequencing depth - Check for platform biases in read count distributions - Inspect tlens_by_allele.tsv for reasonable telomere length distributions (if generated)

Warning signs: - Very low telomeric read counts (<100 reads) may indicate: - Insufficient sequencing depth - Sample quality issues - Incorrect telogator parameters - Very high telomeric read counts (>50,000 reads) may indicate: - Telomere-enriched sequencing - ALT-positive sample


4. Citation

If using Telogator2 for preprocessing, please cite:

Stephens, Z., & Kocher, J. P. (2024). Characterization of telomere variant repeats using long reads enables allele-specific telomere length estimation. BMC Bioinformatics, 25(1), 194.