Input Files
CheckRef requires two types of input files: target VCF files and reference legend files. This guide explains the requirements and formats for each.
Quick Start with Test Data
CheckRef includes test data you can use immediately. See test data examples below.
Get test data:
# Option 1: Clone repository (includes test data)
git clone https://github.com/AfriGen-D/checkref.git
# Option 2: Download test data only
mkdir -p test_data/chr22 test_data/reference
wget https://raw.githubusercontent.com/AfriGen-D/checkref/main/test_data/chr22/chr22_sample.vcf.gz -P test_data/chr22/
wget https://raw.githubusercontent.com/AfriGen-D/checkref/main/test_data/reference/chr22_sample.legend.gz -P test_data/reference/
Target VCF Files
Format Requirements
- File format: VCF (Variant Call Format) version 4.0 or later
- Compression: gzipped (.vcf.gz) recommended
- Index: Not required (but a .tbi index is preserved if present)
- Chromosome naming: Must be consistent across all files
Supported Chromosome Naming
CheckRef automatically detects chromosomes from filenames using these patterns:
✅ Supported formats:
- chr1, chr2, ..., chr22, chrX, chrY, chrMT
- 1, 2, ..., 22, X, Y, MT
- sample_chr1.vcf.gz
- data.chr22.vcf.gz
- chr1_filtered.vcf.gz
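The filename patterns above can be approximated with two regular expressions. The following is a minimal sketch, not CheckRef's actual detection code: the function name `detect_chrom` and both patterns are assumptions made for illustration, and they treat `.`, `_`, and `-` as token delimiters.

```shell
# Hypothetical helper: print the chromosome token found in a file name.
# The patterns are illustrative guesses, not the ones CheckRef uses.
detect_chrom() {
    name=$(basename "$1")
    # Try the "chr"-prefixed form first, then the bare form.
    for pattern in '(^|[._-])chr([0-9]+|X|Y|MT)([._-]|$)' \
                   '(^|[._-])([0-9]+|X|Y|MT)([._-]|$)'; do
        # Keep the first match and drop the surrounding delimiters.
        token=$(printf '%s\n' "$name" | grep -oE "$pattern" | head -n 1 | tr -d '._-')
        if [ -n "$token" ]; then
            printf '%s\n' "$token"
            return 0
        fi
    done
    # No chromosome token in the name.
    return 1
}
```

Under these assumptions, `detect_chrom sample_chr22.vcf.gz` prints `chr22`, while `detect_chrom sample.vcf.gz` prints nothing and returns a non-zero status.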
Example with Test Data
CheckRef's test data shows a real working example:
File: test_data/chr22/chr22_sample.vcf.gz
- ~1000 variants from chromosome 22
- Positions: 16050036 to 16053438
- Size: ~22KB compressed
Peek at the file:
zcat test_data/chr22/chr22_sample.vcf.gz | grep -v "^##" | head -5
Output:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096
chr22 16050036 rs587697622 A G . PASS . GT 0|0
chr22 16050115 rs9605903 C T . PASS . GT 0|0
chr22 16050213 rs5747620 G A . PASS . GT 0|0
VCF File Structure
Minimal required columns:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr1 100000 rs123 A G . PASS . GT 0/1
chr1 150000 rs456 C T . PASS . GT 1/1
Input Methods
Method 1: Glob pattern (recommended for multiple files):
--targetVcfs "/path/to/vcfs/*.vcf.gz"
Method 2: Comma-separated list:
--targetVcfs "chr1.vcf.gz,chr2.vcf.gz,chr3.vcf.gz"
Method 3: Single file:
--targetVcfs "chromosome_22.vcf.gz"
Reference Legend Files
Format Requirements
- File format: Space or tab-delimited text file
- Compression: gzipped (.legend.gz) recommended
- Header: Required (column names)
- Chromosome naming: Must match VCF files
Example with Test Data
CheckRef's test data includes a matching legend file:
File: test_data/reference/chr22_sample.legend.gz
- Reference alleles for chr22 positions
- Covers test VCF positions
- Size: ~7KB compressed
Peek at the file:
zcat test_data/reference/chr22_sample.legend.gz | head -5
Output:
id position a0 a1
22:16050036:A:G 16050036 A G
22:16050115:C:T 16050115 C T
22:16050213:G:A 16050213 G A
22:16050319:C:T 16050319 C T
Legend File Structure
Standard format (1000 Genomes style):
id position a0 a1
1:100000:A:G 100000 A G
1:150000:C:T 150000 C T
Alternative format (with CHROM column):
CHROM POS ID REF ALT
chr1 100000 1:100000:A:G A G
chr1 150000 1:150000:C:T C T
Supported Column Names
CheckRef recognizes various column naming schemes:
| Data Type | Recognized Names |
|---|---|
| Chromosome | CHROM, CHR (optional) |
| Position | POS, POSITION |
| Reference | REF, A0 |
| Alternate | ALT, A1 |
| ID | ID, SNP |
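To see how a given legend header maps onto the names in this table, a small awk filter can label each column. This is a sketch under stated assumptions: the function name `legend_columns` is hypothetical, and case-insensitive matching is inferred from the lowercase headers in the examples rather than confirmed by CheckRef.

```shell
# Hypothetical helper: print "column_index<TAB>role" for each recognized header field.
# Works on gzipped or plain files (gzip -cdf passes plain text through unchanged).
legend_columns() {
    gzip -cdf "$1" | head -n 1 | awk '{
        for (i = 1; i <= NF; i++) {
            name = toupper($i)
            if (name == "CHROM" || name == "CHR")         role = "chrom"
            else if (name == "POS" || name == "POSITION") role = "pos"
            else if (name == "REF" || name == "A0")       role = "ref"
            else if (name == "ALT" || name == "A1")       role = "alt"
            else if (name == "ID"  || name == "SNP")      role = "id"
            else continue
            printf "%d\t%s\n", i, role
        }
    }'
}
```

A header with no `pos`, `ref`, or `alt` line in this output is missing a column that the legend requirements call for.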
Legend File Requirements
- ✅ Must have header line
- ✅ Must contain position, reference, and alternate alleles
- ✅ Chromosome can be in ID field or separate column
- ✅ Can be gzipped or uncompressed
File Organization
Recommended Directory Structure
project/
├── target_vcfs/
│   ├── sample_chr1.vcf.gz
│   ├── sample_chr2.vcf.gz
│   ├── sample_chr3.vcf.gz
│   └── ...
└── reference_panels/
    ├── ref_panel_chr1.legend.gz
    ├── ref_panel_chr2.legend.gz
    ├── ref_panel_chr3.legend.gz
    └── ...
Chromosome Matching
CheckRef automatically matches VCF and legend files by chromosome:
✅ Correct matching:
- VCF: sample_chr1.vcf.gz ↔ Legend: ref_chr1.legend.gz
- VCF: data_1.vcf.gz ↔ Legend: panel_1.legend.gz
- VCF: chr22.vcf.gz ↔ Legend: chr22.legend.gz
❌ Mismatched formats (won't pair):
- VCF: sample_chr1.vcf.gz ↔ Legend: panel_1.legend.gz (one uses 'chr' prefix, other doesn't)
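The prefix-sensitive pairing described above can be illustrated with a strict token lookup. This is only a sketch of the idea, assuming the match is made on a delimiter-bounded token in the legend filename; the function name `find_legend` is hypothetical and CheckRef's own matching logic may differ.

```shell
# Hypothetical helper: find the legend file in a directory whose name
# carries exactly the given chromosome token (e.g. "chr1" or "1").
find_legend() {
    token=$1
    legend_dir=$2
    for legend in "$legend_dir"/*.legend.gz; do
        [ -e "$legend" ] || continue
        # The token must start the name or follow a delimiter, and be
        # followed by a delimiter, so "1" does not match "chr1" or "11".
        if basename "$legend" | grep -qE "(^|[._-])${token}[._-]"; then
            printf '%s\n' "$legend"
            return 0
        fi
    done
    return 1
}
```

With this rule, the token `chr1` finds `ref_chr1.legend.gz` but not `panel_1.legend.gz`, which mirrors the mismatch case listed above.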
Legend Pattern
By default, CheckRef looks for *.legend.gz files. Customize with:
--legendPattern "*.legend.txt.gz"
--legendPattern "reference_*.legend.gz"
Data Quality Checks
CheckRef performs automatic validation:
VCF Validation
- File integrity: Checks gzip compression
- Format compliance: Validates VCF structure with bcftools
- Emptiness check: Ensures file contains variant data
- Size check: Detects suspiciously small files
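Most of these checks can be reproduced by hand before launching the pipeline. The sketch below covers only the parts that need nothing beyond gzip and awk, so it does not replace the bcftools validation. The function name `check_vcf` and the `MIN_BYTES` threshold are assumptions chosen for illustration, not values taken from CheckRef.

```shell
# Hypothetical pre-flight check; MIN_BYTES is an arbitrary illustrative threshold.
MIN_BYTES=${MIN_BYTES:-100}

check_vcf() {
    vcf=$1
    # File integrity: gzip must be able to read the whole archive.
    gzip -t "$vcf" 2>/dev/null || { echo "corrupt: $vcf"; return 1; }
    # Size check: flag files smaller than the threshold.
    size=$(($(wc -c < "$vcf")))
    [ "$size" -ge "$MIN_BYTES" ] || { echo "too small: $vcf"; return 1; }
    # Structure and emptiness: need a #CHROM header and at least one record.
    gzip -cd "$vcf" | awk '
        /^#CHROM/ { header = 1; next }
        !/^#/ && NF { records++ }
        END { exit !(header && records > 0) }
    ' || { echo "no header or no variants: $vcf"; return 1; }
    echo "ok: $vcf"
}
```

Running `check_vcf test_data/chr22/chr22_sample.vcf.gz` on the bundled test file is a quick way to try it.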
Build Compatibility
CheckRef checks for genome build mismatches:
- ✅ Compatible: Both files use same reference positions
- ❌ Incompatible: Reference alleles at same position differ
  - Example: Position 100000 is A in VCF but G in legend
  - Likely indicates hg19 vs hg38 mismatch
If a build mismatch is detected, the pipeline gracefully exits with a helpful message.
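The same comparison can be approximated manually by counting how often the reference alleles agree at shared positions. This is a rough sketch, not CheckRef's implementation: the function name `compare_ref` is hypothetical, and it assumes a single-chromosome VCF and a legend in the standard `id position a0 a1` layout.

```shell
# Hypothetical helper: count shared positions whose reference alleles agree or differ.
# Assumes legend columns "id position a0 a1" and one chromosome per file.
compare_ref() {
    vcf=$1
    legend=$2
    legend_txt=$(mktemp)
    gzip -cdf "$legend" > "$legend_txt"
    gzip -cdf "$vcf" | awk '
        NR == FNR { if (FNR > 1) ref[$2] = $3; next }   # legend: position -> a0
        /^#/ { next }                                   # skip VCF header lines
        ($2 in ref) { if (ref[$2] == $4) same++; else diff++ }
        END { printf "match=%d mismatch=%d\n", same, diff }
    ' "$legend_txt" -
    rm -f "$legend_txt"
}
```

A large `mismatch` count relative to `match` is a hint that the two files may be on different genome builds.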
Example Input Files
Example VCF (target_chr22.vcf.gz)
##fileformat=VCFv4.2
##contig=<ID=chr22>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr22 16050036 rs587697622 A G . PASS AF=0.1 GT 0/1
chr22 16050115 rs9605903 C T . PASS AF=0.25 GT 1/1
chr22 16050213 rs5747620 G A . PASS AF=0.05 GT 0/0
Example Legend (ref_chr22.legend.gz)
id position a0 a1
22:16050036:A:G 16050036 A G
22:16050115:C:T 16050115 C T
22:16050213:G:A 16050213 G A
22:16050400:T:C 16050400 T C
Preparing Your Data
From PLINK Binary Files
Convert PLINK files to VCF:
plink --bfile mydata \
  --recode vcf bgz \
  --out mydata_chr22
Splitting Multi-Chromosome VCF
Split a whole-genome VCF by chromosome:
# bcftools view -r needs an indexed input (.tbi or .csi next to whole_genome.vcf.gz)
for chr in {1..22} X Y; do
  bcftools view -r chr${chr} \
    whole_genome.vcf.gz \
    -Oz -o chr${chr}.vcf.gz
  bcftools index -t chr${chr}.vcf.gz
done
Extracting Legend from VCF
Create a legend file from a reference VCF:
bcftools query -f '%CHROM:%POS:%REF:%ALT\t%POS\t%REF\t%ALT\n' \
  reference.vcf.gz | \
  gzip > reference.legend.gz
# Add header (decompress first so the header and the data end up in one valid gzip stream)
{ printf 'id\tposition\ta0\ta1\n'; zcat reference.legend.gz; } | \
  gzip > reference_with_header.legend.gz
Troubleshooting Input Files
VCF validation failures
Check file integrity:
# Test gzip compression
gunzip -t myfile.vcf.gz
# Validate VCF format
bcftools view -h myfile.vcf.gz
# Count variants
bcftools view -H myfile.vcf.gz | wc -l
Chromosome not detected
Ensure chromosome is in filename:
# Good filenames
sample_chr22.vcf.gz
data.chr1.vcf.gz
22.vcf.gz
# Bad filenames (won't auto-detect)
sample.vcf.gz
mydata.vcf.gz
Legend file format issues
Validate legend structure:
# Check header
zcat reference.legend.gz | head -1
# Check columns
zcat reference.legend.gz | head -5 | column -t
# Ensure space/tab delimited
zcat reference.legend.gz | head | cat -A
Next Steps
- Running the Pipeline - Execute CheckRef
- Output Files - Understanding results
- Examples - Example datasets and commands
