From either a reference genome or set of variant haplotypes, create Illumina reads from error profiles and write them to FASTQ output file(s). I encourage you to cite the reference below in addition to jackalope if you use this function.

illumina(obj,
         out_prefix,
         n_reads,
         read_length,
         paired,
         frag_mean = 400,
         frag_sd = 100,
         matepair = FALSE,
         seq_sys = NULL,
         profile1 = NULL,
         profile2 = NULL,
         ins_prob1 = 0.00009,
         del_prob1 = 0.00011,
         ins_prob2 = 0.00015,
         del_prob2 = 0.00023,
         frag_len_min = NULL,
         frag_len_max = NULL,
         haplotype_probs = NULL,
         barcodes = NULL,
         prob_dup = 0.02,
         sep_files = FALSE,
         compress = FALSE,
         comp_method = "bgzip",
         n_threads = 1L,
         read_pool_size = 1000L,
         show_progress = FALSE,
         overwrite = FALSE)

Arguments

obj

Sequencing object of class ref_genome or haplotypes.

out_prefix

Prefix for the output file(s), including entire path except for the file extension.

n_reads

Number of reads you want to create.

read_length

Length of reads.

paired

Logical for whether to use paired-end reads. This argument is changed to TRUE if matepair is TRUE.

frag_mean

Mean of the Gamma distribution that generates fragment sizes. Defaults to 400.

frag_sd

Standard deviation of the Gamma distribution that generates fragment sizes. Defaults to 100.

matepair

Logical for whether to simulate mate-pair reads. Defaults to FALSE.

seq_sys

Full or abbreviated name of sequencing system to use. See "Sequencing systems" section for options. See "Sequencing profiles" section for more information on how this argument, profile1, and profile2 are used to specify profiles. Defaults to NULL.

profile1

Custom profile file for read 1. See "Sequencing profiles" section for more information on how this argument, profile2, and seq_sys are used to specify profiles. Defaults to NULL.

profile2

Custom profile file for read 2. See "Sequencing profiles" section for more information on how this argument, profile1, and seq_sys are used to specify profiles. Defaults to NULL.

ins_prob1

Insertion probability for read 1. Defaults to 0.00009.

del_prob1

Deletion probability for read 1. Defaults to 0.00011.

ins_prob2

Insertion probability for read 2. Defaults to 0.00015.

del_prob2

Deletion probability for read 2. Defaults to 0.00023.

frag_len_min

Minimum fragment size. A NULL value results in the read length. Defaults to NULL.

frag_len_max

Maximum fragment size. A NULL value results in 2^32-1, the maximum allowed value. Defaults to NULL

haplotype_probs

Relative probability of sampling each haplotype. This is ignored if sequencing a reference genome. NULL results in all having the same probability. Defaults to NULL.

barcodes

Character vector of barcodes for each haplotype, or a single barcode if sequencing a reference genome. NULL results in no barcodes. Defaults to NULL.

prob_dup

A single number indicating the probability of duplicates. Defaults to 0.02.

sep_files

Logical indicating whether to make separate files for each haplotype. This argument is coerced to FALSE if the obj argument is not a haplotypes object. Defaults to FALSE.

compress

Logical specifying whether or not to compress output file, or an integer specifying the level of compression, from 1 to 9. If TRUE, a compression level of 6 is used. Defaults to FALSE.

comp_method

Character specifying which type of compression to use if any is desired. Options include "gzip" and "bgzip". This is ignored if compress is FALSE, and it throws an error if it's set to "gzip" when n_threads > 1 (since I don't have a method to do gzip compression in parallel). Defaults to "bgzip".

n_threads

The number of threads to use in processing. If compress is TRUE or > 0 (indicating compressed output), setting n_threads to 2 or more makes this function first create an uncompressed file/files using n_threads threads, then compress that/those file/files also using n_threads threads. There is no speed increase if you try to use multiple threads to create compressed output on the fly, so that option is not included. If you want to be conservative with disk space (by not having an uncompressed file present even temporarily), set n_threads to 1. Threads are NOT spread across chromosomes or haplotypes, so you don't need to think about these when choosing this argument's value. However, all threads write to the same file/files, so there are diminishing returns for providing many threads. This argument is ignored if the package was not compiled with OpenMP. Defaults to 1.

read_pool_size

The number of reads to store before writing to disk. Increasing this number should improve speed but take up more memory. Defaults to 1000.

show_progress

Logical for whether to show a progress bar. Defaults to FALSE.

overwrite

Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist.

Value

Nothing is returned.

Sequencing profiles

This section outlines how to use the seq_sys, profile1, and profile2 arguments. If all arguments are NULL (their defaults), a sequencing system is chosen based on the read length. If, however, one or more arguments has been provided, then how they're provided should depend on whether you want single- or paired-end reads.

For single-end reads

  • profile2 should be NULL.

  • Only seq_sys or profile1 should be provided, not both.

For paired-end reads

  • If providing seq_sys, don't provide either profile1 or profile2.

  • If providing profile1, you must also provide profile2 (they can be the same if you want) and you cannot provide seq_sys.

Sequencing systems

Sequencing system options are the following, where, for each system, "name" is the full name, "abbrev" is the abbreviated name, "max_len" indicates the maximum allowed read length, and "paired" indicates whether paired-end sequencing is allowed.

nameabbrevmax_lenpaired
Genome Analyzer IGA144Yes
Genome Analyzer IIGA275Yes
HiSeq 1000HS10100Yes
HiSeq 2000HS20100Yes
HiSeq 2500HS25150Yes
HiSeqX v2.5 PCR freeHSXn150Yes
HiSeqX v2.5 TruSeqHSXt150Yes
MiniSeq TruSeqMinS50No
MiSeq v1MSv1250Yes
MiSeq v3MSv3250Yes
NextSeq 500 v2NS5075Yes

ID lines

The ID lines for FASTQ files are formatted as such:

@<genome name>-<chromosome name>-<starting position>-<strand>[/<read#>]

where the part in [] is only for paired-end Illumina reads, and where genome name is always REF for reference genomes (as opposed to haplotypes).

References

Huang, W., L. Li, J. R. Myers, and G. T. Marth. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594.

Examples

# \donttest{
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
illumina(rg, paste0(dir, "/illumina_reads"), n_reads = 100,
         read_length = 100, paired = FALSE)
# }