This document provides brief examples of how jackalope can be used to generate sequencing data that can inform some common sampling decisions for HTS studies.

For an example reference assembly, I simply simulated one of size 10 Mb using the following code:

This resulted in the following ref_genome object:

#> < Set of 10 sequences >
#> # Total size: 10,000,000 bp
#>   name                          sequence                             length
#> seq0       TGTTCTCACGGAAAACGGGAAGTTT...TACGGTGCGGGGCTCGCTAGGCGGAT   1000000
#> seq1       AGAATACCTTGGACGACAGGCAACA...TTGGTACTACTGAACTCAAATCTCGA   1000000
#> seq2       TTTGAGGATCAGCGTTAGTGTCAAC...AGATATCGCCTAGCGTACCGCCCGGC   1000000
#> seq3       CTCTACAATCATACCGACTCATCCG...TTTTGGTTGAGCCCAATCTTTCAAGC   1000000
#> seq4       GAACGATTAGAACATGACCTTGCCG...AACTAACCCGCTTAGGCGCGATGATA   1000000
#> seq5       TGAACACTGTCTTTGTCCTTAACGC...CTTTCGGGATTTGCGAAACCAATCGA   1000000
#> seq6       GGGTGCCAACCCCCCTAATTACTAA...CCTAGCAGACGTCGTCTCTATTACAG   1000000
#> seq7       CGAATAGGGGTGCCTCTTACCTTCA...TCCTAGTCTCAGGACGGATAACCAGT   1000000
#> seq8       AGAAATATGCTTACGCATTAGCAGT...CGCCGAAAGACACCTCCCGTCAGTGA   1000000
#> seq9       GCAGTTACATTTACGCATCCTTACC...GGGGGCCTCGTGAGCTAGAGCTAGGT   1000000

For molecular-evolution information, I used the TN93 model, an insertion rate of 2e-5 for lengths from 1 to 10, and a deletion rate of 1e-5 for lengths from 1 to 40.

sub <- sub_TN93(pi_tcag = c(0.1, 0.2, 0.3, 0.4),
                alpha_1 = 0.0001, alpha_2 = 0.0002,
                beta = 0.00015)
ins <- indels(rate = 2e-5, max_length = 10)
del <- indels(rate = 1e-5, max_length = 40)

Assembling a genome

The example here produces FASTQ files from the known reference assembly that could test strategies for how to assemble a similar genome using HTS data.

The first strategy is to use only PacBio sequencing. The PacBio Sequel system produces up to 500,000 reads per Single Molecule, Real-Time (SMRT) cell, so you could run the following for two cells:

pacbio(ref, out_prefix = "pacbio", n_reads = 2 * 500e3)

An alternative, hybrid strategy uses 1 SMRT cell of PacBio sequencing and 1 lane (\(\sim 500\) million reads) of \(2 \times 100\)bp Illumina sequencing on the HiSeq 2500 system (the default Illumina system in jackalope):

pacbio(ref, out_prefix = "pacbio", n_reads = 500e3)
illumina(ref, out_prefix = "illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100)

The last strategy combines 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing with 1 flow cell of \(2 \times 250\)bp mate-pair sequencing on an Illumina MiSeq v3. The mate-pair sequencing uses longer fragments (defaults are mean of 400 and standard deviation of 100) to better cover highly repetitive regions.

illumina(ref, out_prefix = "ill_pe", n_reads = 500e6, paired = TRUE,
         read_length = 100)
illumina(ref, out_prefix = "ill_mp", seq_sys = "MSv3",
         read_length = 250, n_reads = 50e6, matepair = TRUE, 
         frag_mean = 3000, frag_sd = 500)

Estimating divergence between populations

Here, I will demonstrate how to generate population-genomic data of a type that might be used to estimate the divergence between two populations. I first use the scrm package to conduct coalescent simulations that will generate segregating sites for 40 haploid variants from the reference genome. Twenty of the variants are from one population, twenty from another. The symmetrical migration rate is 100 individuals per generation. To generate many mutations, I set the population-scaled mutation rate (\(\theta = 4 N_0 \mu\)) to 100 for this example.

ssites <- scrm(paste("40", ref$n_seqs(), "-t 100 -I 2 20 20 100"))

Using the previously created objects for molecular evolution information (sub, ins, and del), I create variants from the reference genome:

vars <- create_variants(ref, vars_ssites(ssites), sub, ins, del)

This results in the following set of variants:

#>                            << Variants object >>
#> # Variants: 40
#> # Mutations: 76,851
#> 
#>                         << Reference genome info: >>
#> < Set of 10 sequences >
#> # Total size: 10,000,000 bp
#>   name                          sequence                             length
#> seq0       TGTTCTCACGGAAAACGGGAAGTTT...TACGGTGCGGGGCTCGCTAGGCGGAT   1000000
#> seq1       AGAATACCTTGGACGACAGGCAACA...TTGGTACTACTGAACTCAAATCTCGA   1000000
#> seq2       TTTGAGGATCAGCGTTAGTGTCAAC...AGATATCGCCTAGCGTACCGCCCGGC   1000000
#> seq3       CTCTACAATCATACCGACTCATCCG...TTTTGGTTGAGCCCAATCTTTCAAGC   1000000
#> seq4       GAACGATTAGAACATGACCTTGCCG...AACTAACCCGCTTAGGCGCGATGATA   1000000
#> seq5       TGAACACTGTCTTTGTCCTTAACGC...CTTTCGGGATTTGCGAAACCAATCGA   1000000
#> seq6       GGGTGCCAACCCCCCTAATTACTAA...CCTAGCAGACGTCGTCTCTATTACAG   1000000
#> seq7       CGAATAGGGGTGCCTCTTACCTTCA...TCCTAGTCTCAGGACGGATAACCAGT   1000000
#> seq8       AGAAATATGCTTACGCATTAGCAGT...CGCCGAAAGACACCTCCCGTCAGTGA   1000000
#> seq9       GCAGTTACATTTACGCATCCTTACC...GGGGGCCTCGTGAGCTAGAGCTAGGT   1000000

For a file of true divergences from the reference genome, the write_vcf function writes the variants object to a VCF file:

write_vcf(vars, "variants")

Lastly, I simulate 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing. In this case, individuals within a population are pooled, and the population sequences are derived from are identified by barcodes.

illumina(vars, out_prefix = "vars_illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100, barcodes = c(rep("AACCGCGG", 20), 
                                         rep("GGTTATAA", 20)))

The below example instead has each individual variant’s reads output to separate FASTQ files:

illumina(vars, out_prefix = "vars_illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100, sep_files = TRUE)

Constructing a phylogeny

From one phylogenetic tree

This section shows how jackalope can generate variants from a phylogeny, then simulate sequencing data from those variants to test phylogeny reconstruction methods. First, I generated a random coalescent tree of 10 species using the ape package.

Then I input that to the create_variants function alongside the molecular evolution information.

vars <- create_variants(ref, vars_phylo(tree), sub, ins, del)

This results in the following variants object:

#>                            << Variants object >>
#> # Variants: 10
#> # Mutations: 36,395
#> 
#>                         << Reference genome info: >>
#> < Set of 10 sequences >
#> # Total size: 10,000,000 bp
#>   name                          sequence                             length
#> seq0       TGTTCTCACGGAAAACGGGAAGTTT...TACGGTGCGGGGCTCGCTAGGCGGAT   1000000
#> seq1       AGAATACCTTGGACGACAGGCAACA...TTGGTACTACTGAACTCAAATCTCGA   1000000
#> seq2       TTTGAGGATCAGCGTTAGTGTCAAC...AGATATCGCCTAGCGTACCGCCCGGC   1000000
#> seq3       CTCTACAATCATACCGACTCATCCG...TTTTGGTTGAGCCCAATCTTTCAAGC   1000000
#> seq4       GAACGATTAGAACATGACCTTGCCG...AACTAACCCGCTTAGGCGCGATGATA   1000000
#> seq5       TGAACACTGTCTTTGTCCTTAACGC...CTTTCGGGATTTGCGAAACCAATCGA   1000000
#> seq6       GGGTGCCAACCCCCCTAATTACTAA...CCTAGCAGACGTCGTCTCTATTACAG   1000000
#> seq7       CGAATAGGGGTGCCTCTTACCTTCA...TCCTAGTCTCAGGACGGATAACCAGT   1000000
#> seq8       AGAAATATGCTTACGCATTAGCAGT...CGCCGAAAGACACCTCCCGTCAGTGA   1000000
#> seq9       GCAGTTACATTTACGCATCCTTACC...GGGGGCCTCGTGAGCTAGAGCTAGGT   1000000

Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3.

illumina(vars, out_prefix = "phylo_tree", seq_sys = "MSv3",
         read_length = 250, n_reads = 50e6)

From gene trees

Similar to the section above, the ultimate goal here is to test phylogeny reconstruction methods. The difference in this section is that instead of using a single, straightforward phylogeny, I use multiple gene trees per sequence. In the species used in these simulations, species 3 diverged from 1 and 2 at \(t = 0.5\), where \(t\) indicates time into the past and is in units of \(4 N_0\) generations. Species 1 and 2 diverged at \(t = 0.25\). I assume a recombination rate of \(1 / (4 N_0)\) recombination events per sequence per generation. Because each sequence is a different length and the length is required for including a recombination rate, I had to run scrm separately for each sequence. There are 4 diploid individuals sampled per species.

This results in the following variants object:

#>                            << Variants object >>
#> # Variants: 24
#> # Mutations: 38,130
#> 
#>                         << Reference genome info: >>
#> < Set of 10 sequences >
#> # Total size: 10,000,000 bp
#>   name                          sequence                             length
#> seq0       TGTTCTCACGGAAAACGGGAAGTTT...TACGGTGCGGGGCTCGCTAGGCGGAT   1000000
#> seq1       AGAATACCTTGGACGACAGGCAACA...TTGGTACTACTGAACTCAAATCTCGA   1000000
#> seq2       TTTGAGGATCAGCGTTAGTGTCAAC...AGATATCGCCTAGCGTACCGCCCGGC   1000000
#> seq3       CTCTACAATCATACCGACTCATCCG...TTTTGGTTGAGCCCAATCTTTCAAGC   1000000
#> seq4       GAACGATTAGAACATGACCTTGCCG...AACTAACCCGCTTAGGCGCGATGATA   1000000
#> seq5       TGAACACTGTCTTTGTCCTTAACGC...CTTTCGGGATTTGCGAAACCAATCGA   1000000
#> seq6       GGGTGCCAACCCCCCTAATTACTAA...CCTAGCAGACGTCGTCTCTATTACAG   1000000
#> seq7       CGAATAGGGGTGCCTCTTACCTTCA...TCCTAGTCTCAGGACGGATAACCAGT   1000000
#> seq8       AGAAATATGCTTACGCATTAGCAGT...CGCCGAAAGACACCTCCCGTCAGTGA   1000000
#> seq9       GCAGTTACATTTACGCATCCTTACC...GGGGGCCTCGTGAGCTAGAGCTAGGT   1000000

To store mutation information by diploid sample, the write_vcf function writes the variants object to a VCF file. It assigns haplotypes to diploid samples using a matrix for the sample_matrix argument:

write_vcf(vars, out_prefix = "var_gtrees",
          sample_matrix = matrix(1:vars$n_vars(), ncol = 2, byrow = TRUE))

Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3.