Create and write Illumina reads to FASTQ file(s).

From either a reference genome or set of variant haplotypes, create Illumina reads from error profiles and write them to FASTQ output file(s). I encourage you to cite the reference below in addition to jackalope if you use this function.

illumina(obj,
         out_prefix,
         n_reads,
         read_length,
         paired,
         frag_mean = 400,
         frag_sd = 100,
         matepair = FALSE,
         seq_sys = NULL,
         profile1 = NULL,
         profile2 = NULL,
         ins_prob1 = 0.00009,
         del_prob1 = 0.00011,
         ins_prob2 = 0.00015,
         del_prob2 = 0.00023,
         frag_len_min = NULL,
         frag_len_max = NULL,
         haplotype_probs = NULL,
         barcodes = NULL,
         prob_dup = 0.02,
         sep_files = FALSE,
         compress = FALSE,
         comp_method = "bgzip",
         n_threads = 1L,
         read_pool_size = 1000L,
         show_progress = FALSE,
         overwrite = FALSE)

Arguments

obj: Sequencing object of class ref_genome or haplotypes.
out_prefix: Prefix for the output file(s), including entire path except for the file extension.
n_reads: Number of reads you want to create.
read_length: Length of reads.
paired: Logical for whether to use paired-end reads. This argument is changed to TRUE if matepair is TRUE.
frag_mean: Mean of the Gamma distribution that generates fragment sizes. Defaults to 400.
frag_sd: Standard deviation of the Gamma distribution that generates fragment sizes. Defaults to 100.
matepair: Logical for whether to simulate mate-pair reads. Defaults to FALSE.
seq_sys: Full or abbreviated name of sequencing system to use. See "Sequencing systems" section for options. See "Sequencing profiles" section for more information on how this argument, profile1, and profile2 are used to specify profiles. Defaults to NULL.
profile1: Custom profile file for read 1. See "Sequencing profiles" section for more information on how this argument, profile2, and seq_sys are used to specify profiles. Defaults to NULL.
profile2: Custom profile file for read 2. See "Sequencing profiles" section for more information on how this argument, profile1, and seq_sys are used to specify profiles. Defaults to NULL.
ins_prob1: Insertion probability for read 1. Defaults to 0.00009.
del_prob1: Deletion probability for read 1. Defaults to 0.00011.
ins_prob2: Insertion probability for read 2. Defaults to 0.00015.
del_prob2: Deletion probability for read 2. Defaults to 0.00023.
frag_len_min: Minimum fragment size. A NULL value results in the read length. Defaults to NULL.
frag_len_max: Maximum fragment size. A NULL value results in 2^32-1, the maximum allowed value. Defaults to NULL
haplotype_probs: Relative probability of sampling each haplotype. This is ignored if sequencing a reference genome. NULL results in all having the same probability. Defaults to NULL.
barcodes: Character vector of barcodes for each haplotype, or a single barcode if sequencing a reference genome. NULL results in no barcodes. Defaults to NULL.
prob_dup: A single number indicating the probability of duplicates. Defaults to 0.02.
sep_files: Logical indicating whether to make separate files for each haplotype. This argument is coerced to FALSE if the obj argument is not a haplotypes object. Defaults to FALSE.
compress: Logical specifying whether or not to compress output file, or an integer specifying the level of compression, from 1 to 9. If TRUE, a compression level of 6 is used. Defaults to FALSE.
comp_method: Character specifying which type of compression to use if any is desired. Options include "gzip" and "bgzip". This is ignored if compress is FALSE, and it throws an error if it's set to "gzip" when n_threads > 1 (since I don't have a method to do gzip compression in parallel). Defaults to "bgzip".
n_threads: The number of threads to use in processing. If compress is TRUE or > 0 (indicating compressed output), setting n_threads to 2 or more makes this function first create an uncompressed file/files using n_threads threads, then compress that/those file/files also using n_threads threads. There is no speed increase if you try to use multiple threads to create compressed output on the fly, so that option is not included. If you want to be conservative with disk space (by not having an uncompressed file present even temporarily), set n_threads to 1. Threads are NOT spread across chromosomes or haplotypes, so you don't need to think about these when choosing this argument's value. However, all threads write to the same file/files, so there are diminishing returns for providing many threads. This argument is ignored if the package was not compiled with OpenMP. Defaults to 1.
read_pool_size: The number of reads to store before writing to disk. Increasing this number should improve speed but take up more memory. Defaults to 1000.
show_progress: Logical for whether to show a progress bar. Defaults to FALSE.
overwrite: Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist.

Value

Nothing is returned.

Sequencing profiles

This section outlines how to use the seq_sys, profile1, and profile2 arguments. If all arguments are NULL (their defaults), a sequencing system is chosen based on the read length. If, however, one or more arguments has been provided, then how they're provided should depend on whether you want single- or paired-end reads.

For single-end reads

profile2 should be NULL.
Only seq_sys or profile1 should be provided, not both.

For paired-end reads

If providing seq_sys, don't provide either profile1 or profile2.
If providing profile1, you must also provide profile2 (they can be the same if you want) and you cannot provide seq_sys.

Sequencing systems

Sequencing system options are the following, where, for each system, "name" is the full name, "abbrev" is the abbreviated name, "max_len" indicates the maximum allowed read length, and "paired" indicates whether paired-end sequencing is allowed.

name	abbrev	max_len	paired
Genome Analyzer I	GA1	44	Yes
Genome Analyzer II	GA2	75	Yes
HiSeq 1000	HS10	100	Yes
HiSeq 2000	HS20	100	Yes
HiSeq 2500	HS25	150	Yes
HiSeqX v2.5 PCR free	HSXn	150	Yes
HiSeqX v2.5 TruSeq	HSXt	150	Yes
MiniSeq TruSeq	MinS	50	No
MiSeq v1	MSv1	250	Yes
MiSeq v3	MSv3	250	Yes
NextSeq 500 v2	NS50	75	Yes

ID lines

The ID lines for FASTQ files are formatted as such:

@<genome name>-<chromosome name>-<starting position>-<strand>[/<read#>]

where the part in [] is only for paired-end Illumina reads, and where genome name is always REF for reference genomes (as opposed to haplotypes).

References

Huang, W., L. Li, J. R. Myers, and G. T. Marth. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594.

Examples

# \donttest{
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
illumina(rg, paste0(dir, "/illumina_reads"), n_reads = 100,
         read_length = 100, paired = FALSE)
# }