From either a reference genome or set of variant haplotypes, create PacBio reads
and write them to FASTQ output file(s).
I encourage you to cite the reference below in addition to jackalope if you use this function.
pacbio(obj,
out_prefix,
n_reads,
chi2_params_s = c(0.01214, -5.12, 675, 48303.0732881,
1.4691051212330266),
chi2_params_n = c(0.00189237136, 2.53944970, 5500),
max_passes = 40,
sqrt_params = c(0.5, 0.2247),
norm_params = c(0, 0.2),
prob_thresh = 0.2,
ins_prob = 0.11,
del_prob = 0.04,
sub_prob = 0.01,
min_read_length = 50,
lognorm_read_length = c(0.200110276521, -10075.4363813,
17922.611306),
custom_read_lengths = NULL,
prob_dup = 0.0,
haplotype_probs = NULL,
sep_files = FALSE,
compress = FALSE,
comp_method = "bgzip",
n_threads = 1L,
read_pool_size = 100L,
show_progress = FALSE,
overwrite = FALSE)
Sequencing object of class ref_genome or haplotypes.
Prefix for the output file(s), including entire path except for the file extension.
Number of reads you want to create.
Vector containing the 5 parameters for the curve determining the scale parameter for the chi^2 distribution.
Defaults to c(0.01214, -5.12, 675, 48303.0732881, 1.4691051212330266).
Vector containing the 3 parameters for the function determining the n parameter for the chi^2 distribution.
Defaults to c(0.00189237136, 2.53944970, 5500).
Maximum number of passes for one molecule.
Defaults to 40.
Vector containing the 2 parameters for the square root function for the quality increase.
Defaults to c(0.5, 0.2247).
Vector containing the 2 parameters for the normally distributed noise added to the quality-increase square root function.
Defaults to c(0, 0.2).
Upper bound for the modified total error probability.
Defaults to 0.2.
Probability for insertions for reads with one pass.
Defaults to 0.11.
Probability for deletions for reads with one pass.
Defaults to 0.04.
Probability for substitutions for reads with one pass.
Defaults to 0.01.
Minimum read length for lognormal distribution.
Defaults to 50.
Vector containing the 3 parameters for lognormal read length distribution.
Defaults to c(0.200110276521, -10075.4363813, 17922.611306).
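To give a feel for what the default lognormal parameters imply, here is a minimal base-R sketch. It assumes the three values are ordered as shape (sigma), location, and scale, which is the convention used by SimLoRD (the tool in the reference below); jackalope's internal sampler is not shown in this page, so both that ordering and the redraw-below-minimum behaviour are assumptions. The function name sample_read_lengths is hypothetical.

```r
# Sketch only: draw read lengths from a shifted lognormal distribution.
# ASSUMPTION: params = c(sigma, location, scale), as in SimLoRD.
# ASSUMPTION: draws below min_len are simply redrawn.
sample_read_lengths <- function(n,
                                params = c(0.200110276521, -10075.4363813,
                                           17922.611306),
                                min_len = 50) {
  sigma <- params[1]
  loc   <- params[2]
  scale <- params[3]
  # location + scale * exp(sigma * Z), with Z standard normal
  draw <- function(k) loc + scale * exp(sigma * rnorm(k))
  out <- draw(n)
  too_short <- out < min_len
  while (any(too_short)) {
    out[too_short] <- draw(sum(too_short))
    too_short <- out < min_len
  }
  round(out)
}

set.seed(42)
lens <- sample_read_lengths(1000)
summary(lens)
```

Under these assumptions the median read length sits near location + scale, i.e. roughly 7,850 bp.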
Sample read lengths from a vector or column in a matrix; if a matrix, the second column specifies the sampling weights.
If NULL, it samples read lengths from the lognormal distribution using parameters in lognorm_read_length.
Defaults to NULL.
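As an illustration of the matrix form of custom_read_lengths, the sketch below builds a two-column matrix (lengths in the first column, sampling weights in the second) and passes it to pacbio. The lengths and weights are made-up values, and the weights are chosen to sum to 1 as a precaution, since this page does not say whether they are normalized internally. The call is only run when jackalope is installed.

```r
# Hypothetical read lengths (bp) and their sampling weights.
read_len_mat <- cbind(
  c(5000, 10000, 15000, 20000),   # column 1: read lengths
  c(0.1, 0.4, 0.4, 0.1)           # column 2: sampling weights
)

if (requireNamespace("jackalope", quietly = TRUE)) {
  library(jackalope)
  rg <- create_genome(10, 100e3, 100)
  out_prefix <- file.path(tempdir(TRUE), "pacbio_custom_lengths")
  # With custom_read_lengths set, lognorm_read_length is not used.
  pacbio(rg, out_prefix, n_reads = 100,
         custom_read_lengths = read_len_mat,
         overwrite = TRUE)
}
```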
A single number indicating the probability of duplicates.
Defaults to 0.0.
Relative probability of sampling each haplotype.
This is ignored if sequencing a reference genome.
NULL results in all having the same probability.
Defaults to NULL.
Logical indicating whether to make separate files for each haplotype.
This argument is coerced to FALSE if the obj argument is not a haplotypes object.
Defaults to FALSE.
Logical specifying whether or not to compress output file, or an integer specifying the level of compression, from 1 to 9.
If TRUE, a compression level of 6 is used.
Defaults to FALSE.
Character specifying which type of compression to use if any is desired.
Options include "gzip" and "bgzip".
This is ignored if compress is FALSE, and it throws an error if it's set to "gzip" when n_threads > 1 (since I don't have a method to do gzip compression in parallel).
Defaults to "bgzip".
The number of threads to use in processing.
If compress is TRUE or > 0 (indicating compressed output), setting n_threads to 2 or more makes this function first create an uncompressed file/files using n_threads threads, then compress that/those file/files also using n_threads threads.
There is no speed increase if you try to use multiple threads to create compressed output on the fly, so that option is not included.
If you want to be conservative with disk space (by not having an uncompressed file present even temporarily), set n_threads to 1.
Threads are NOT spread across chromosomes or haplotypes, so you don't need to think about these when choosing this argument's value.
However, all threads write to the same file/files, so there are diminishing returns for providing many threads.
This argument is ignored if the package was not compiled with OpenMP.
Defaults to 1.
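The interaction between compress, comp_method, and n_threads described above can be summarized in a small pre-flight check. The helper below, resolve_compression, is hypothetical and not part of jackalope; it only mirrors the documented rules (TRUE means level 6, comp_method is ignored without compression, and "gzip" with more than one thread is an error) so that a bad combination can be caught before a long run.

```r
# Hypothetical helper mirroring the documented argument rules.
# Returns the effective compression level (0 means no compression).
resolve_compression <- function(compress = FALSE,
                                comp_method = "bgzip",
                                n_threads = 1L) {
  level <- if (is.logical(compress)) {
    if (isTRUE(compress)) 6L else 0L   # TRUE is documented as level 6
  } else {
    as.integer(compress)
  }
  stopifnot(level >= 0L, level <= 9L)
  if (level > 0L) {
    # comp_method only matters when output is compressed
    comp_method <- match.arg(comp_method, c("gzip", "bgzip"))
    if (comp_method == "gzip" && n_threads > 1) {
      stop("comp_method = \"gzip\" cannot be combined with n_threads > 1")
    }
  }
  level
}

resolve_compression(TRUE, "bgzip", n_threads = 4L)

if (requireNamespace("jackalope", quietly = TRUE)) {
  library(jackalope)
  rg <- create_genome(10, 100e3, 100)
  out_prefix <- file.path(tempdir(TRUE), "pacbio_compressed")
  # Single thread: no uncompressed file is written, even temporarily.
  pacbio(rg, out_prefix, n_reads = 100,
         compress = TRUE, comp_method = "bgzip",
         n_threads = 1L, overwrite = TRUE)
}
```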
The number of reads to store before writing to disk.
Increasing this number should improve speed but take up more memory.
Defaults to 100.
Logical for whether to show a progress bar.
Defaults to FALSE.
Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist.
Defaults to FALSE.
Nothing is returned.
The ID lines for FASTQ files are formatted as such:
@<genome name>-<chromosome name>-<starting position>-<strand>
where genome name is always REF for reference genomes (as opposed to haplotypes).
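The ID format above can be split back into its fields when analysing simulated reads. The base-R sketch below uses a hypothetical helper, parse_read_ids, and made-up example IDs. It assumes that genome names contain no hyphen and that the strand field is a short symbol such as + or -; neither is stated on this page. Chromosome names may contain hyphens.

```r
# Hypothetical helper: split FASTQ ID lines of the form
# @<genome name>-<chromosome name>-<starting position>-<strand>
parse_read_ids <- function(ids) {
  # genome: no hyphens (assumption); chromosome: greedy, may hold hyphens
  pattern <- "^@([^-]+)-(.+)-([0-9]+)-(.+)$"
  m <- regmatches(ids, regexec(pattern, ids, perl = TRUE))
  if (!all(lengths(m) == 5L)) {
    stop("some ID lines do not match the expected format")
  }
  parts <- do.call(rbind, m)
  data.frame(genome = parts[, 2],
             chrom  = parts[, 3],
             start  = as.numeric(parts[, 4]),
             strand = parts[, 5],
             stringsAsFactors = FALSE)
}

# Made-up IDs for illustration only
ids <- c("@REF-chrom0-1234-+", "@hap1-chr-2-77--")
parse_read_ids(ids)
```

Matching from both ends of the line, rather than splitting on every hyphen, keeps the parse stable when a chromosome name or the strand symbol itself contains a hyphen.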
Stöcker, B. K., J. Köster, and S. Rahmann. 2016. SimLoRD: simulation of long read data. Bioinformatics 32:2704–2706.
# \donttest{
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
pacbio(rg, paste0(dir, "/pacbio_reads"), n_reads = 100)
# }