Create and write PacBio reads to FASTQ file(s).

From either a reference genome or set of variant haplotypes, create PacBio reads and write them to FASTQ output file(s). I encourage you to cite the reference below in addition to jackalope if you use this function.

pacbio(obj,
       out_prefix,
       n_reads,
       chi2_params_s = c(0.01214, -5.12, 675, 48303.0732881,
                         1.4691051212330266),
       chi2_params_n = c(0.00189237136, 2.53944970, 5500),
       max_passes = 40,
       sqrt_params = c(0.5, 0.2247),
       norm_params = c(0, 0.2),
       prob_thresh = 0.2,
       ins_prob = 0.11,
       del_prob = 0.04,
       sub_prob = 0.01,
       min_read_length = 50,
       lognorm_read_length = c(0.200110276521, -10075.4363813,
                               17922.611306),
       custom_read_lengths = NULL,
       prob_dup = 0.0,
       haplotype_probs = NULL,
       sep_files = FALSE,
       compress = FALSE,
       comp_method = "bgzip",
       n_threads = 1L,
       read_pool_size = 100L,
       show_progress = FALSE,
       overwrite = FALSE)

Arguments

obj: Sequencing object of class ref_genome or haplotypes.
out_prefix: Prefix for the output file(s), including entire path except for the file extension.
n_reads: Number of reads you want to create.
chi2_params_s: Vector containing the 5 parameters for the curve determining the scale parameter for the chi^2 distribution. Defaults to c(0.01214, -5.12, 675, 48303.0732881, 1.4691051212330266).
chi2_params_n: Vector containing the 3 parameters for the function determining the n parameter for the chi^2 distribution. Defaults to c(0.00189237136, 2.53944970, 5500).
max_passes: Maximal number of passes for one molecule. Defaults to 40.
sqrt_params: Vector containing the 2 parameters for the square root function for the quality increase. Defaults to c(0.5, 0.2247).
norm_params: Vector containing the 2 parameters for normal distributed noise added to quality increase square root function Defaults to c(0, 0.2).
prob_thresh: Upper bound for the modified total error probability. Defaults to 0.2.
ins_prob: Probability for insertions for reads with one pass. Defaults to 0.11.
del_prob: Probability for deletions for reads with one pass. Defaults to 0.04.
sub_prob: Probability for substitutions for reads with one pass. Defaults to 0.01.
min_read_length: Minium read length for lognormal distribution. Defaults to 50.
lognorm_read_length: Vector containing the 3 parameters for lognormal read length distribution. Defaults to c(0.200110276521, -10075.4363813, 17922.611306).
custom_read_lengths: Sample read lengths from a vector or column in a matrix; if a matrix, the second column specifies the sampling weights. If NULL, it samples read lengths from the lognormal distribution using parameters in lognorm_read_length. Defaults to NULL.
prob_dup: A single number indicating the probability of duplicates. Defaults to 0.0.
haplotype_probs: Relative probability of sampling each haplotype. This is ignored if sequencing a reference genome. NULL results in all having the same probability. Defaults to NULL.
sep_files: Logical indicating whether to make separate files for each haplotype. This argument is coerced to FALSE if the obj argument is not a haplotypes object. Defaults to FALSE.
compress: Logical specifying whether or not to compress output file, or an integer specifying the level of compression, from 1 to 9. If TRUE, a compression level of 6 is used. Defaults to FALSE.
comp_method: Character specifying which type of compression to use if any is desired. Options include "gzip" and "bgzip". This is ignored if compress is FALSE, and it throws an error if it's set to "gzip" when n_threads > 1 (since I don't have a method to do gzip compression in parallel). Defaults to "bgzip".
n_threads: The number of threads to use in processing. If compress is TRUE or > 0 (indicating compressed output), setting n_threads to 2 or more makes this function first create an uncompressed file/files using n_threads threads, then compress that/those file/files also using n_threads threads. There is no speed increase if you try to use multiple threads to create compressed output on the fly, so that option is not included. If you want to be conservative with disk space (by not having an uncompressed file present even temporarily), set n_threads to 1. Threads are NOT spread across chromosomes or haplotypes, so you don't need to think about these when choosing this argument's value. However, all threads write to the same file/files, so there are diminishing returns for providing many threads. This argument is ignored if the package was not compiled with OpenMP. Defaults to 1.
read_pool_size: The number of reads to store before writing to disk. Increasing this number should improve speed but take up more memory. Defaults to 100.
show_progress: Logical for whether to show a progress bar. Defaults to FALSE.
overwrite: Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist.

Value

Nothing is returned.

ID lines

The ID lines for FASTQ files are formatted as such:

@<genome name>-<chromosome name>-<starting position>-<strand>

where genome name is always REF for reference genomes (as opposed to haplotypes).

References

Stöcker, B. K., J. Köster, and S. Rahmann. 2016. SimLoRD: simulation of long read data. Bioinformatics 32:2704–2706.

Examples

# \donttest{
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
pacbio(rg, paste0(dir, "/pacbio_reads"), n_reads = 100)
# }