From either a reference genome or set of variant haplotypes, create Illumina reads
from error profiles and write them to FASTQ output file(s).
I encourage you to cite the reference below in addition to jackalope
if you use
this function.
illumina(obj,
out_prefix,
n_reads,
read_length,
paired,
frag_mean = 400,
frag_sd = 100,
matepair = FALSE,
seq_sys = NULL,
profile1 = NULL,
profile2 = NULL,
ins_prob1 = 0.00009,
del_prob1 = 0.00011,
ins_prob2 = 0.00015,
del_prob2 = 0.00023,
frag_len_min = NULL,
frag_len_max = NULL,
haplotype_probs = NULL,
barcodes = NULL,
prob_dup = 0.02,
sep_files = FALSE,
compress = FALSE,
comp_method = "bgzip",
n_threads = 1L,
read_pool_size = 1000L,
show_progress = FALSE,
overwrite = FALSE)
Sequencing object of class ref_genome
or haplotypes
.
Prefix for the output file(s), including entire path except for the file extension.
Number of reads you want to create.
Length of reads.
Logical for whether to use paired-end reads.
This argument is changed to TRUE
if matepair
is TRUE
.
Mean of the Gamma distribution that generates fragment sizes.
Defaults to 400
.
Standard deviation of the Gamma distribution that generates
fragment sizes.
Defaults to 100
.
Logical for whether to simulate mate-pair reads.
Defaults to FALSE
.
Full or abbreviated name of sequencing system to use.
See "Sequencing systems" section for options.
See "Sequencing profiles" section for more information on how this argument,
profile1
, and profile2
are used to specify profiles.
Defaults to NULL
.
Custom profile file for read 1.
See "Sequencing profiles" section for more information on how this argument,
profile2
, and seq_sys
are used to specify profiles.
Defaults to NULL
.
Custom profile file for read 2.
See "Sequencing profiles" section for more information on how this argument,
profile1
, and seq_sys
are used to specify profiles.
Defaults to NULL
.
Insertion probability for read 1. Defaults to 0.00009
.
Deletion probability for read 1. Defaults to 0.00011
.
Insertion probability for read 2. Defaults to 0.00015
.
Deletion probability for read 2. Defaults to 0.00023
.
Minimum fragment size. A NULL
value results in the read length.
Defaults to NULL
.
Maximum fragment size.
A NULL
value results in 2^32-1
, the maximum allowed value.
Defaults to NULL
Relative probability of sampling each haplotype.
This is ignored if sequencing a reference genome.
NULL
results in all having the same probability.
Defaults to NULL
.
Character vector of barcodes for each haplotype, or a single barcode
if sequencing a reference genome. NULL
results in no barcodes.
Defaults to NULL
.
A single number indicating the probability of duplicates.
Defaults to 0.02
.
Logical indicating whether to make separate files for each haplotype.
This argument is coerced to FALSE
if the obj
argument is not
a haplotypes
object.
Defaults to FALSE
.
Logical specifying whether or not to compress output file, or
an integer specifying the level of compression, from 1 to 9.
If TRUE
, a compression level of 6
is used.
Defaults to FALSE
.
Character specifying which type of compression to use if any
is desired. Options include "gzip"
and "bgzip"
.
This is ignored if compress
is FALSE
, and it throws an error if
it's set to "gzip"
when n_threads > 1
(since I don't have a method to
do gzip compression in parallel).
Defaults to "bgzip"
.
The number of threads to use in processing.
If compress
is TRUE
or > 0
(indicating compressed output),
setting n_threads
to 2
or more makes this function first create an
uncompressed file/files using n_threads
threads, then compress that/those
file/files also using n_threads
threads.
There is no speed increase if you try to use multiple threads to create
compressed output on the fly, so that option is not included.
If you want to be conservative with disk space (by not having an uncompressed
file present even temporarily), set n_threads
to 1
.
Threads are NOT spread across chromosomes or haplotypes, so you don't need to
think about these when choosing this argument's value.
However, all threads write to the same file/files, so there are diminishing
returns for providing many threads.
This argument is ignored if the package was not compiled with OpenMP.
Defaults to 1
.
The number of reads to store before writing to disk.
Increasing this number should improve speed but take up more memory.
Defaults to 1000
.
Logical for whether to show a progress bar.
Defaults to FALSE
.
Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist.
Nothing is returned.
This section outlines how to use the seq_sys
, profile1
,
and profile2
arguments.
If all arguments are NULL
(their defaults), a sequencing system is chosen
based on the read length.
If, however, one or more arguments has been provided, then how they're provided
should depend on whether you want single- or paired-end reads.
For single-end reads
profile2
should be NULL
.
Only seq_sys
or profile1
should be provided, not both.
For paired-end reads
If providing seq_sys
, don't provide either profile1
or profile2
.
If providing profile1
, you must also provide profile2
(they can be the
same if you want) and you cannot provide seq_sys
.
Sequencing system options are the following, where, for each system, "name" is the full name, "abbrev" is the abbreviated name, "max_len" indicates the maximum allowed read length, and "paired" indicates whether paired-end sequencing is allowed.
name | abbrev | max_len | paired |
Genome Analyzer I | GA1 | 44 | Yes |
Genome Analyzer II | GA2 | 75 | Yes |
HiSeq 1000 | HS10 | 100 | Yes |
HiSeq 2000 | HS20 | 100 | Yes |
HiSeq 2500 | HS25 | 150 | Yes |
HiSeqX v2.5 PCR free | HSXn | 150 | Yes |
HiSeqX v2.5 TruSeq | HSXt | 150 | Yes |
MiniSeq TruSeq | MinS | 50 | No |
MiSeq v1 | MSv1 | 250 | Yes |
MiSeq v3 | MSv3 | 250 | Yes |
NextSeq 500 v2 | NS50 | 75 | Yes |
The ID lines for FASTQ files are formatted as such:
@<genome name>-<chromosome name>-<starting position>-<strand>[/<read#>]
where the part in []
is only for paired-end Illumina reads, and where genome name
is always REF
for reference genomes (as opposed to haplotypes).
Huang, W., L. Li, J. R. Myers, and G. T. Marth. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594.
# \donttest{
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
illumina(rg, paste0(dir, "/illumina_reads"), n_reads = 100,
read_length = 100, paired = FALSE)
# }