Output Files
CheckRef generates organized output files in four main directories. This guide explains each output type and how to interpret the results.
Quick Example with Test Data
Running CheckRef on test data produces these results:
# Run with test data
nextflow run AfriGen-D/checkref \
--targetVcfs "test_data/chr22/*.vcf.gz" \
--referenceDir "test_data/reference/" \
--outdir test_results \
-profile docker
# Check outputs
ls test_results/
Expected output files:
- chr22_allele_switch_results.tsv - Detected switches
- chr22_allele_switch_summary.txt - Statistics
- chr22.noswitch.vcf.gz - Cleaned VCF
- all_chromosomes_summary.txt - Overall summary
Output Directory Structure
results/
├── allele_switch_results/    # Detected allele switches (TSV files)
├── summary_files/            # Summary reports and statistics
├── fixed_vcfs/               # Corrected or cleaned VCF files
└── logs/                     # Validation and verification logs
    ├── validation/           # VCF validation reports
    └── verification/         # Post-correction verification
Example Results from Test Data
When you run CheckRef on the test data, here's what you'll see:
Test Data Summary
File: test_results/all_chromosomes_summary.txt
==================================================
ALLELE SWITCH SUMMARY
==================================================
Chromosome: chr22
Total variants in target VCF: 952
Total variants in reference: 987
Total variants at common positions: 845
Results:
MATCH: 798 variants (94.44%)
SWITCH: 12 variants (1.42%)
COMPLEMENT: 20 variants (2.37%)
OTHER: 15 variants (1.78%)
Overlap Statistics:
Overlap with target VCF: 88.76%
Overlap with reference: 85.61%
==================================================
Test Data Allele Switches
File: test_results/chr22_allele_switch_results.tsv
CHR POS ALLELE_INFO
chr22 16050115 A>G|G>A
chr22 16050298 C>T|T>C
These are the actual switched alleles detected in the test data.
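A quick way to count the rows in a results file, for comparison with the "Switched alleles (written to file)" figure in the per-chromosome summary, is sketched below. The function name `count_switches` is hypothetical, and the sketch assumes the first line of the file is the CHR/POS/ALLELE_INFO header shown above.

```shell
# Count data rows (detected switches) in an allele switch results TSV.
# Assumption: line 1 is the CHR/POS/ALLELE_INFO header; blank lines are ignored.
count_switches() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_switches test_results/chr22_allele_switch_results.tsv` prints the number of listed switches for chr22.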
Allele Switch Results
Individual Chromosome Files
Location: results/allele_switch_results/
Filename format: chr{N}_{sample}_allele_switch_results.tsv
Content: Tab-separated file listing all detected allele switches
Columns:
| Column | Description | Example |
|---|---|---|
| CHR | Chromosome | chr22 |
| POS | Position | 16050036 |
| ALLELE_INFO | VCF→Legend allele comparison | A>G\|G>A |
Generic example:
CHR POS ALLELE_INFO
chr22 16050036 A>G|G>A
chr22 16050115 C>T|T>C
chr22 16050213 G>A|A>G
Interpretation:
A>G|G>A means:
- VCF has REF=A, ALT=G
- Legend has REF=G, ALT=A
- → Alleles are switched
Allele Info Format
The ALLELE_INFO column uses the format: VCF_REF>VCF_ALT|LEGEND_REF>LEGEND_ALT
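The format can be split on `>` and `|` to recover the four alleles. The following is a minimal sketch of such a classifier, not CheckRef's own logic: the function name `classify_allele_info` and the output labels are hypothetical, and anything that is neither an exact match nor a REF/ALT swap is simply reported as OTHER.

```shell
# Classify one ALLELE_INFO string: VCF_REF>VCF_ALT|LEGEND_REF>LEGEND_ALT.
# Prints MATCH, SWITCH, OTHER, or INVALID.
# Assumption: illustrative only; CheckRef may distinguish more cases
# (for example strand complements) than this sketch does.
classify_allele_info() {
    printf '%s\n' "$1" | awk -F'[>|]' '
        NF != 4              { print "INVALID"; next }  # not four alleles
        $1 == $3 && $2 == $4 { print "MATCH";   next }  # identical REF and ALT
        $1 == $4 && $2 == $3 { print "SWITCH";  next }  # REF and ALT swapped
                             { print "OTHER" }'
}
```

With this sketch, `classify_allele_info 'A>G|G>A'` reports a switch, while `classify_allele_info 'A>G|T>C'` falls into OTHER.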
Allele switch:
A>G|G>A # REF and ALT are swapped
C>T|T>C # REF and ALT are swapped
Build mismatch (pipeline will exit):
A>G|T>C # Different alleles entirely - indicates build mismatch
Summary Files
Per-Chromosome Summaries
Location: results/summary_files/
Filename format: chr{N}_{sample}_allele_switch_summary.txt
Content: Text summary of allele switch detection for one chromosome
Example:
====================================
ALLELE SWITCH SUMMARY
====================================
Chromosome: chr22
Target VCF: sample_chr22.vcf.gz
Reference: ref_chr22.legend.gz
Total variants in target VCF: 10000
Total variants in reference: 50000
Total variants at common positions: 9500
Allele Comparison Results:
- Matched variants: 9400 (98.95%)
- Switched alleles (written to file): 100 (1.05%)
- Complementary strand issues: 0
- Complement + switch issues: 0
- Other inconsistencies: 0
Overlap Statistics:
- Target VCF coverage: 9500/10000 (95.00%)
- Reference coverage: 9500/50000 (19.00%)
Aggregated Summary
Filename: all_chromosomes_summary.txt
Content: Combined statistics across all chromosomes
Example:
====================================
ALLELE SWITCH CHECKER SUMMARY
====================================
Processed files: 22
Individual Chromosome Results:
------------------------------------
Chromosome: chr1
- Target variants: 15000
- Common variants: 14500
- Matched: 14350
- Switched: 150
Chromosome: chr2
- Target variants: 14000
- Common variants: 13700
- Matched: 13600
- Switched: 100
...
====================================
AGGREGATED RESULTS (ALL CHROMOSOMES)
====================================
Total variants in all target VCFs: 300000
Total variants in reference: 1500000
Total variants at common positions: 285000
Overlap Statistics:
- Target VCF coverage: 285000/300000 (95.00%)
- Reference coverage: 285000/1500000 (19.00%)
Allele Comparison Results:
- Matched variants: 282000 (98.95%)
- Switched alleles: 3000 (1.05%)
- Complementary strand issues: 0
- Complement + switch issues: 0
- Other inconsistencies: 0
Extracted Reference Legends
Filename format: chr{N}_{sample}_extracted.legend.gz
Content: Reference legend file filtered to common positions with target VCF
Use: Can be used as input for downstream imputation pipelines
Fixed VCF Files
Location: results/fixed_vcfs/
The output depends on the --fixMethod parameter:
Remove Mode (Default)
Filename format: chr{N}_{sample}.noswitch.vcf.gz
Content: VCF file with switched sites removed
Features:
- All sites with allele switches are excluded
- File size smaller than original
- Includes .tbi index file
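The remove-mode output can be sanity-checked by counting records: the cleaned VCF should contain the original record count minus the number of switched sites. The sketch below assumes one TSV row per removed site and a single header line in the TSV. The function names `count_vcf_records` and `check_remove_mode` are hypothetical helpers, not part of CheckRef.

```shell
# Count non-header records in a gzip-compressed VCF.
count_vcf_records() {
    gzip -dc "$1" | awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }'
}

# Check that: original records - switched rows == records in the .noswitch VCF.
# Arguments: original VCF, allele switch results TSV, .noswitch VCF.
# Assumption: one TSV row per removed site, with one header line.
check_remove_mode() {
    orig=$(count_vcf_records "$1")
    switched=$(awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$2")
    cleaned=$(count_vcf_records "$3")
    if [ $((orig - switched)) -eq "$cleaned" ]; then
        echo "OK: $orig - $switched = $cleaned"
    else
        echo "MISMATCH: $orig - $switched != $cleaned"
    fi
}
```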
Example:
# Original VCF: 10,000 variants
# Switched sites: 100
# Output VCF: 9,900 variants
Correct Mode
Filename format: chr{N}_{sample}.corrected.vcf.gz
Content: VCF file with alleles corrected (REF↔ALT swapped)
Features:
- Same number of variants as original
- REF and ALT alleles swapped for problematic sites
- Sites marked with SWITCHED=1 in INFO field
- Includes .tbi index file
Example VCF entry:
##INFO=<ID=SWITCHED,Number=0,Type=Flag,Description="Alleles were switched to match reference">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr22 16050036 rs1 G A . PASS SWITCHED=1 GT 1/0
Note: Genotypes are automatically updated when alleles are switched.
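To see why the genotype changes, consider a biallelic site: when REF and ALT trade places, allele index 0 becomes 1 and vice versa, so a call of 0/1 becomes 1/0. The sketch below illustrates that mapping only. It is not CheckRef's implementation, the function name `flip_gt` is hypothetical, and it assumes biallelic sites (allele indices 0 and 1 only).

```shell
# Swap allele indices 0 and 1 in a biallelic GT string.
# Separators ("/" or "|") and missing calls (".") pass through unchanged.
# Assumption: biallelic sites only; illustrative, not CheckRef's own code.
flip_gt() {
    printf '%s\n' "$1" | tr '01' '10'
}
```

Under this mapping a homozygous reference call (0/0) becomes homozygous alternate (1/1), which keeps the sample's actual alleles the same after the swap.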
Log Files
Validation Reports
Location: results/logs/validation/
Filename format: chr{N}_validation_report.txt
Content: Validation results for each input VCF file
Example - Passed:
====================================
VCF VALIDATION REPORT FOR CHR chr22
====================================
File: sample_chr22.vcf.gz
Validation Date: 2025-10-21
File size: 1234567 bytes
Data lines found: 10000
✅ VALIDATION PASSED: File appears to be valid
File format: Valid VCF
Compression: gzip compressed data
Status: Ready for processing
Example - Failed:
====================================
VCF VALIDATION REPORT FOR CHR chr22
====================================
File: sample_chr22.vcf.gz
Validation Date: 2025-10-21
❌ VALIDATION FAILED: File is too small (likely empty or corrupted)
Minimum expected size: 100 bytes
Actual size: 45 bytes
This file appears to be empty or severely corrupted.
Please check the file integrity and regenerate if necessary.
Verification Reports
Location: results/logs/verification/
Filename format: chr{N}_verification_results.txt
Content: Verification that corrections were successful
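Each report carries a PASSED or FAILED line, as in the examples in this section. A minimal sketch for listing the reports that failed is given here. The function name `list_failed_reports` is hypothetical, and the sketch assumes the marker text "VALIDATION FAILED" or "VERIFICATION FAILED" appears verbatim in a failed report.

```shell
# Print the names of report files that contain a FAILED marker.
# Assumption: failed reports contain "VALIDATION FAILED" or
# "VERIFICATION FAILED" somewhere in the file.
list_failed_reports() {
    grep -l -E '(VALIDATION|VERIFICATION) FAILED' "$@" 2>/dev/null
}
```

For example, `list_failed_reports results/logs/verification/*.txt` prints one path per failed chromosome and nothing when every report passed.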
Example - Successful:
====================================
VERIFICATION RESULTS FOR CHR chr22
====================================
✅ VERIFICATION PASSED: No allele switches detected in corrected VCF
Total switches found: 0
Example - Failed:
====================================
VERIFICATION RESULTS FOR CHR chr22
====================================
❌ VERIFICATION FAILED: 5 allele switches still present
Remaining switches:
CHR POS ALLELE_INFO
chr22 16050036 A>G|T>C
Total switches found: 5
Correction Statistics
When using --fixMethod correct, additional statistics are generated:
Location: results/logs/correction_stats.txt
Content: Number of sites corrected vs failed per chromosome
Example:
Chr chr1: Corrected=150, Failed=0
Chr chr2: Corrected=100, Failed=0
Chr chr22: Corrected=50, Failed=2
Note: Failed corrections typically indicate build mismatches where REF alleles differ.
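To pick out only the chromosomes with failed corrections, the statistics file can be filtered on the Failed field. This is a sketch under the assumption that every line follows the "Chr <name>: Corrected=<n>, Failed=<n>" layout shown in the example. The function name `list_failed_corrections` is hypothetical.

```shell
# Print "<chromosome> <failed count>" for lines reporting Failed > 0.
# Assumption: lines look like "Chr chr22: Corrected=50, Failed=2".
list_failed_corrections() {
    awk '/Failed=/ {
        n = $0; sub(/.*Failed=/, "", n)   # keep the text after "Failed="
        chr = $2; sub(/:$/, "", chr)      # strip the trailing colon
        if (n + 0 > 0) print chr, n + 0
    }' "$1"
}
```

Run against the example above, `list_failed_corrections results/logs/correction_stats.txt` would list only chr22.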
Interpreting Results
Healthy Results
Indicators of good data quality:
- ✅ High percentage of matched variants (>95%)
- ✅ Low percentage of switches (<5%)
- ✅ Zero complementary strand issues
- ✅ All VCF files pass validation
- ✅ All corrections verify successfully
Warning Signs
Indicators of potential issues:
⚠️ High percentage of switches (>10%)
- May indicate systematic strand issues
- Check if VCF and reference use different strands
⚠️ Low target VCF coverage (<80%)
- Many variants in VCF not in reference
- May need different reference panel
⚠️ Complementary strand issues
- Data may be on opposite DNA strands
- Consider strand flipping
⚠️ Build mismatch detected
- VCF and reference use different genome builds
- Use liftOver to convert to matching build
Critical Issues
Requires immediate attention:
❌ VCF validation failures
- Fix or regenerate VCF files
❌ Verification failures after correction
- Indicates bug or build mismatch
- Report to CheckRef developers
❌ Genome build mismatch
- Cannot proceed until files use same build
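The numeric thresholds above can be applied to figures copied from a summary file. The sketch below covers only three of the indicators (matched percentage, switch percentage, target VCF coverage). The function name `assess_summary` and the labels HEALTHY, BORDERLINE, and REVIEW are hypothetical and are not produced by CheckRef.

```shell
# Apply the thresholds from this section to three percentages.
# Arguments: matched %, switched %, target VCF coverage % (numbers without "%").
# Prints any warnings, then a final label on the last line.
# Assumption: labels are illustrative; other indicators are not checked.
assess_summary() {
    awk -v m="$1" -v s="$2" -v c="$3" 'BEGIN {
        status = "HEALTHY"
        if (s > 10) { print "WARNING: high switch rate (" s "%)"; status = "REVIEW" }
        if (c < 80) { print "WARNING: low target VCF coverage (" c "%)"; status = "REVIEW" }
        # Neither a warning nor within the healthy ranges (>95% matched, <5% switched)
        if (status == "HEALTHY" && !(m > 95 && s < 5)) status = "BORDERLINE"
        print status
    }'
}
```

For example, `assess_summary 98.95 1.05 95.00` uses the figures from the per-chromosome summary example and falls within the healthy ranges.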
Using Results Downstream
For Imputation
Use the extracted legend files:
# Legend files are filtered to common positions
results/summary_files/chr{N}_extracted.legend.gz
For Quality Control
Review the summary statistics:
# Check overall switch rate
grep "Switched alleles" results/summary_files/all_chromosomes_summary.txt
# Check per-chromosome rates
grep "Switched:" results/summary_files/*_summary.txt
For Further Analysis
Use the corrected/cleaned VCF files:
# Corrected VCFs ready for downstream analysis
results/fixed_vcfs/*.corrected.vcf.gz
results/fixed_vcfs/*.noswitch.vcf.gz
Next Steps
- Troubleshooting - Resolve common issues
- Examples - Example use cases
- Parameters - Customize output options
