From either a reference genome or set of variant haplotypes, create PacBio reads
and write them to FASTQ output file(s).
I encourage you to cite the reference below in addition to jackalope if you use this function.
pacbio(obj, out_prefix, n_reads,
       chi2_params_s = c(0.01214, -5.12, 675, 48303.0732881, 1.4691051212330266),
       chi2_params_n = c(0.00189237136, 2.53944970, 5500),
       max_passes = 40,
       sqrt_params = c(0.5, 0.2247),
       norm_params = c(0, 0.2),
       prob_thresh = 0.2,
       ins_prob = 0.11,
       del_prob = 0.04,
       sub_prob = 0.01,
       min_read_length = 50,
       lognorm_read_length = c(0.200110276521, -10075.4363813, 17922.611306),
       custom_read_lengths = NULL,
       prob_dup = 0.0,
       haplotype_probs = NULL,
       sep_files = FALSE,
       compress = FALSE,
       comp_method = "bgzip",
       n_threads = 1L,
       read_pool_size = 100L,
       show_progress = FALSE,
       overwrite = FALSE)
obj | Sequencing object of class |
---|---|
out_prefix | Prefix for the output file(s), including entire path except for the file extension. |
n_reads | Number of reads you want to create. |
chi2_params_s | Vector containing the 5 parameters for the curve determining
the scale parameter for the chi^2 distribution.
Defaults to c(0.01214, -5.12, 675, 48303.0732881, 1.4691051212330266). |
chi2_params_n | Vector containing the 3 parameters for the function
determining the n parameter for the chi^2 distribution.
Defaults to c(0.00189237136, 2.53944970, 5500). |
max_passes | Maximum number of passes for one molecule.
Defaults to 40. |
sqrt_params | Vector containing the 2 parameters for the square root
function for the quality increase.
Defaults to c(0.5, 0.2247). |
norm_params | Vector containing the 2 parameters for the normally distributed
noise added to the quality-increase square root function.
Defaults to c(0, 0.2). |
prob_thresh | Upper bound for the modified total error probability.
Defaults to 0.2. |
ins_prob | Probability of insertions for reads with one pass.
Defaults to 0.11. |
del_prob | Probability of deletions for reads with one pass.
Defaults to 0.04. |
sub_prob | Probability of substitutions for reads with one pass.
Defaults to 0.01. |
min_read_length | Minimum read length for the lognormal distribution.
Defaults to 50. |
lognorm_read_length | Vector containing the 3 parameters for the lognormal
read length distribution.
Defaults to c(0.200110276521, -10075.4363813, 17922.611306). |
custom_read_lengths | Sample read lengths from a vector or column in a
matrix; if a matrix, the second column specifies the sampling weights.
If |
prob_dup | A single number indicating the probability of duplicates.
Defaults to 0.0. |
haplotype_probs | Relative probability of sampling each haplotype.
This is ignored if sequencing a reference genome.
|
sep_files | Logical indicating whether to make separate files for each haplotype.
This argument is coerced to |
compress | Logical specifying whether or not to compress output file, or
an integer specifying the level of compression, from 1 to 9.
If |
comp_method | Character specifying which type of compression to use if any
is desired. Options include |
n_threads | The number of threads to use in processing.
If |
read_pool_size | The number of reads to store before writing to disk.
Increasing this number should improve speed but take up more memory.
Defaults to 100L. |
show_progress | Logical for whether to show a progress bar.
Defaults to FALSE. |
overwrite | Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist. |
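To give a feel for what the default lognorm_read_length and min_read_length values imply, here is a minimal base-R sketch that draws read lengths from a shifted lognormal distribution. It assumes the three parameters are (sigma, location, scale), as in the SimLoRD paper cited below, and that draws shorter than the minimum are simply redrawn. Both are assumptions made for illustration; the helper sample_read_lengths is hypothetical and is not part of jackalope.

```r
# Default values copied from the usage line above.
lognorm_read_length <- c(0.200110276521, -10075.4363813, 17922.611306)
min_read_length <- 50

# Hypothetical helper (NOT a jackalope function).
# ASSUMPTION: pars = c(sigma, loc, scale) of a shifted lognormal, and
# lengths below min_len are rejected and redrawn.
sample_read_lengths <- function(n, pars, min_len) {
  sigma <- pars[1]
  loc   <- pars[2]
  scale <- pars[3]
  out <- numeric(0)
  while (length(out) < n) {
    x <- loc + scale * rlnorm(n, meanlog = 0, sdlog = sigma)
    out <- c(out, x[x >= min_len])   # keep only lengths at or above the minimum
  }
  as.integer(round(out[seq_len(n)]))
}

set.seed(1)
lens <- sample_read_lengths(1000, lognorm_read_length, min_read_length)
summary(lens)
```

Under these assumptions the median read length is roughly the location plus the scale, that is, a little under 8,000 bp.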
Nothing is returned.
The ID lines for FASTQ files are formatted as such:
@<genome name>-<chromosome name>-<starting position>-<strand>
where genome name is always REF for reference genomes (as opposed to haplotypes).
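If you need to recover the origin of each read from the FASTQ output, the ID line can be split back into its four fields. The sketch below is one way to do this in base R. It assumes the genome name contains no hyphen and that the strand is written as a single "+" or "-" character; neither detail is stated above, and the example header and the helper parse_read_id are hypothetical.

```r
# Hypothetical helper (NOT a jackalope function) that parses ID lines of the form
#   @<genome name>-<chromosome name>-<starting position>-<strand>
# ASSUMPTIONS: genome name has no "-"; strand is a single "+" or "-".
parse_read_id <- function(id) {
  pattern <- "^@([^-]+)-(.+)-([0-9]+)-([+-])$"
  m <- regmatches(id, regexec(pattern, id, perl = TRUE))[[1]]
  if (length(m) == 0) stop("ID line does not match the expected format")
  list(genome = m[2],
       chrom  = m[3],
       start  = as.integer(m[4]),
       strand = m[5])
}

# Made-up header for illustration only
parsed <- parse_read_id("@REF-chrom1-1234-+")
str(parsed)
```

Anchoring the position and strand at the end of the line lets chromosome names that themselves contain hyphens still parse correctly.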
Stöcker, B. K., J. Köster, and S. Rahmann. 2016. SimLoRD: simulation of long read data. Bioinformatics 32:2704–2706.
# \donttest{
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
pacbio(rg, paste0(dir, "/pacbio_reads"), n_reads = 100)
# }
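As a further sketch, the custom_read_lengths argument can take a two-column matrix, with read lengths in the first column and sampling weights in the second, as described in the argument table. The lengths and weights below are made up for illustration, and the call is guarded so that the snippet still runs when jackalope is not installed.

```r
# Hypothetical read lengths (column 1) and sampling weights (column 2)
read_lens <- cbind(c(1000, 5000, 10000),
                   c(0.2, 0.5, 0.3))

if (requireNamespace("jackalope", quietly = TRUE)) {
  library(jackalope)
  rg <- create_genome(10, 100e3, 100)
  dir <- tempdir(TRUE)
  # overwrite = TRUE so that re-running the example does not fail on existing files
  pacbio(rg, paste0(dir, "/pacbio_custom"), n_reads = 100,
         custom_read_lengths = read_lens, overwrite = TRUE)
}
```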