From either a reference genome or set of variant haplotypes, create PacBio reads and write them to FASTQ output file(s). I encourage you to cite the reference below in addition to jackalope if you use this function.

pacbio(obj,
       out_prefix,
       n_reads,
       chi2_params_s = c(0.01214, -5.12, 675, 48303.0732881,
                         1.4691051212330266),
       chi2_params_n = c(0.00189237136, 2.53944970, 5500),
       max_passes = 40,
       sqrt_params = c(0.5, 0.2247),
       norm_params = c(0, 0.2),
       prob_thresh = 0.2,
       ins_prob = 0.11,
       del_prob = 0.04,
       sub_prob = 0.01,
       min_read_length = 50,
       lognorm_read_length = c(0.200110276521, -10075.4363813,
                               17922.611306),
       custom_read_lengths = NULL,
       prob_dup = 0.0,
       haplotype_probs = NULL,
       sep_files = FALSE,
       compress = FALSE,
       comp_method = "bgzip",
       n_threads = 1L,
       read_pool_size = 100L,
       show_progress = FALSE,
       overwrite = FALSE)

Arguments

obj

Sequencing object of class ref_genome or haplotypes.

out_prefix

Prefix for the output file(s), including entire path except for the file extension.

n_reads

Number of reads you want to create.

chi2_params_s

Vector containing the 5 parameters for the curve determining the scale parameter for the chi^2 distribution. Defaults to c(0.01214, -5.12, 675, 48303.0732881, 1.4691051212330266).

chi2_params_n

Vector containing the 3 parameters for the function determining the n parameter for the chi^2 distribution. Defaults to c(0.00189237136, 2.53944970, 5500).

max_passes

Maximal number of passes for one molecule. Defaults to 40.

sqrt_params

Vector containing the 2 parameters for the square root function for the quality increase. Defaults to c(0.5, 0.2247).

norm_params

Vector containing the 2 parameters for normal distributed noise added to quality increase square root function Defaults to c(0, 0.2).

prob_thresh

Upper bound for the modified total error probability. Defaults to 0.2.

ins_prob

Probability for insertions for reads with one pass. Defaults to 0.11.

del_prob

Probability for deletions for reads with one pass. Defaults to 0.04.

sub_prob

Probability for substitutions for reads with one pass. Defaults to 0.01.

min_read_length

Minium read length for lognormal distribution. Defaults to 50.

lognorm_read_length

Vector containing the 3 parameters for lognormal read length distribution. Defaults to c(0.200110276521, -10075.4363813, 17922.611306).

custom_read_lengths

Sample read lengths from a vector or column in a matrix; if a matrix, the second column specifies the sampling weights. If NULL, it samples read lengths from the lognormal distribution using parameters in lognorm_read_length. Defaults to NULL.

prob_dup

A single number indicating the probability of duplicates. Defaults to 0.0.

haplotype_probs

Relative probability of sampling each haplotype. This is ignored if sequencing a reference genome. NULL results in all having the same probability. Defaults to NULL.

sep_files

Logical indicating whether to make separate files for each haplotype. This argument is coerced to FALSE if the obj argument is not a haplotypes object. Defaults to FALSE.

compress

Logical specifying whether or not to compress output file, or an integer specifying the level of compression, from 1 to 9. If TRUE, a compression level of 6 is used. Defaults to FALSE.

comp_method

Character specifying which type of compression to use if any is desired. Options include "gzip" and "bgzip". This is ignored if compress is FALSE, and it throws an error if it's set to "gzip" when n_threads > 1 (since I don't have a method to do gzip compression in parallel). Defaults to "bgzip".

n_threads

The number of threads to use in processing. If compress is TRUE or > 0 (indicating compressed output), setting n_threads to 2 or more makes this function first create an uncompressed file/files using n_threads threads, then compress that/those file/files also using n_threads threads. There is no speed increase if you try to use multiple threads to create compressed output on the fly, so that option is not included. If you want to be conservative with disk space (by not having an uncompressed file present even temporarily), set n_threads to 1. Threads are NOT spread across chromosomes or haplotypes, so you don't need to think about these when choosing this argument's value. However, all threads write to the same file/files, so there are diminishing returns for providing many threads. This argument is ignored if the package was not compiled with OpenMP. Defaults to 1.

read_pool_size

The number of reads to store before writing to disk. Increasing this number should improve speed but take up more memory. Defaults to 100.

show_progress

Logical for whether to show a progress bar. Defaults to FALSE.

overwrite

Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist.

Value

Nothing is returned.

ID lines

The ID lines for FASTQ files are formatted as such:

@<genome name>-<chromosome name>-<starting position>-<strand>

where genome name is always REF for reference genomes (as opposed to haplotypes).

References

Stöcker, B. K., J. Köster, and S. Rahmann. 2016. SimLoRD: simulation of long read data. Bioinformatics 32:2704–2706.

Examples

# \donttest{
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
pacbio(rg, paste0(dir, "/pacbio_reads"), n_reads = 100)
# }