From either a reference genome or set of haploid variants, create Illumina reads from error profiles and write them to FASTQ output file(s). I encourage you to cite the reference below in addition to jackalope if you use this function.

illumina(seq_object, out_prefix, n_reads, read_length, paired,
frag_mean = 400, frag_sd = 100, matepair = FALSE, seq_sys = NULL,
profile1 = NULL, profile2 = NULL, ins_prob1 = 9e-05,
del_prob1 = 0.00011, ins_prob2 = 0.00015, del_prob2 = 0.00023,
frag_len_min = NULL, frag_len_max = NULL, variant_probs = NULL,
barcodes = NULL, prob_dup = 0.02, sep_files = FALSE,
compress = FALSE, comp_method = "bgzip", n_threads = 1L,
read_pool_size = 1000L, show_progress = FALSE, overwrite = FALSE)

## Arguments

seq_object Sequencing object of class ref_genome or variants. Prefix for the output file(s), including entire path except for the file extension. Number of reads you want to create. Length of reads. Logical for whether to use paired-end reads. This argument is changed to TRUE if matepair is TRUE. Mean of the Gamma distribution that generates fragment sizes. Defaults to 400. Standard deviation of the Gamma distribution that generates fragment sizes. Defaults to 100. Logical for whether to simulate mate-pair reads. Defaults to FALSE. Full or abbreviated name of sequencing system to use. See "Sequencing systems" section for options. See "Sequencing profiles" section for more information on how this argument, profile1, and profile2 are used to specify profiles. Defaults to NULL. Custom profile file for read 1. See "Sequencing profiles" section for more information on how this argument, profile2, and seq_sys are used to specify profiles. Defaults to NULL. Custom profile file for read 2. See "Sequencing profiles" section for more information on how this argument, profile1, and seq_sys are used to specify profiles. Defaults to NULL. Insertion probability for read 1. Defaults to 0.00009. Deletion probability for read 1. Defaults to 0.00011. Insertion probability for read 2. Defaults to 0.00015. Deletion probability for read 2. Defaults to 0.00023. Minimum fragment size. A NULL value results in the read length. Defaults to NULL. Maximum fragment size. A NULL value results in 2^32-1, the maximum allowed value. Defaults to NULL Relative probability of sampling each variant. This is ignored if sequencing a reference genome. NULL results in all having the same probability. Defaults to NULL. Character vector of barcodes for each variant, or a single barcode if sequencing a reference genome. NULL results in no barcodes. Defaults to NULL. A single number indicating the probability of duplicates. Defaults to 0.02. Logical indicating whether to make separate files for each variant. This argument is coerced to FALSE if the seq_object argument is not a variants object. Defaults to FALSE. Logical specifying whether or not to compress output file, or an integer specifying the level of compression, from 1 to 9. If TRUE, a compression level of 6 is used. Defaults to FALSE. Character specifying which type of compression to use if any is desired. Options include "gzip" and "bgzip". This is ignored if compress is FALSE, and it throws an error if it's set to "gzip" when n_threads > 1 (since I don't have a method to do gzip compression in parallel). Defaults to "bgzip". The number of threads to use in processing. If compress is TRUE or > 0 (indicating compressed output), setting n_threads to 2 or more makes this function first create an uncompressed file/files using n_threads threads, then compress that/those file/files also using n_threads threads. There is no speed increase if you try to use multiple threads to create compressed output on the fly, so that option is not included. If you want to be conservative with disk space (by not having an uncompressed file present even temporarily), set n_threads to 1. This argument is ignored if the package was not compiled with OpenMP. Defaults to 1. The number of reads to store before writing to disk. Increasing this number should improve speed but take up more memory. Defaults to 1000. Logical for whether to show a progress bar. Defaults to FALSE. Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist.

## Value

Nothing is returned.

## Sequencing profiles

This section outlines how to use the seq_sys, profile1, and profile2 arguments. If all arguments are NULL (their defaults), a sequencing system is chosen based on the read length. If, however, one or more arguments has been provided, then how they're provided should depend on whether you want single- or paired-end reads.

• profile2 should be NULL.

• Only seq_sys or profile1 should be provided, not both.

• If providing seq_sys, don't provide either profile1 or profile2.

• If providing profile1, you must also provide profile2 (they can be the same if you want) and you cannot provide seq_sys.

## Sequencing systems

Sequencing system options are the following, where, for each system, "name" is the full name, "abbrev" is the abbreviated name, "max_len" indicates the maximum allowed read length, and "paired" indicates whether paired-end sequencing is allowed.

 name abbrev max_len paired Genome Analyzer I GA1 44 Yes Genome Analyzer II GA2 75 Yes HiSeq 1000 HS10 100 Yes HiSeq 2000 HS20 100 Yes HiSeq 2500 HS25 150 Yes HiSeqX v2.5 PCR free HSXn 150 Yes HiSeqX v2.5 TruSeq HSXt 150 Yes MiniSeq TruSeq MinS 50 No MiSeq v1 MSv1 250 Yes MiSeq v3 MSv3 250 Yes NextSeq 500 v2 NS50 75 Yes name abbrev

## ID lines

The ID lines for FASTQ files are formatted as such:

@<genome name>-<sequence name>-<starting position>-<strand>[/<read#>]

where the part in [] is only for paired-end Illumina reads, and where genome name is always REF for reference genomes (as opposed to variants).

## References

Huang, W., L. Li, J. R. Myers, and G. T. Marth. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594.

## Examples

rg <- create_genome(10, 100e3, 100)