From either a reference genome or set of variant haplotypes, create Illumina reads
from error profiles and write them to FASTQ output file(s).
I encourage you to cite the reference below in addition to jackalope
if you use
this function.
illumina(obj, out_prefix, n_reads, read_length, paired, frag_mean = 400, frag_sd = 100, matepair = FALSE, seq_sys = NULL, profile1 = NULL, profile2 = NULL, ins_prob1 = 0.00009, del_prob1 = 0.00011, ins_prob2 = 0.00015, del_prob2 = 0.00023, frag_len_min = NULL, frag_len_max = NULL, haplotype_probs = NULL, barcodes = NULL, prob_dup = 0.02, sep_files = FALSE, compress = FALSE, comp_method = "bgzip", n_threads = 1L, read_pool_size = 1000L, show_progress = FALSE, overwrite = FALSE)
obj | Sequencing object of class |
---|---|
out_prefix | Prefix for the output file(s), including entire path except for the file extension. |
n_reads | Number of reads you want to create. |
read_length | Length of reads. |
paired | Logical for whether to use paired-end reads.
This argument is changed to |
frag_mean | Mean of the Gamma distribution that generates fragment sizes.
Defaults to |
frag_sd | Standard deviation of the Gamma distribution that generates
fragment sizes.
Defaults to |
matepair | Logical for whether to simulate mate-pair reads.
Defaults to |
seq_sys | Full or abbreviated name of sequencing system to use.
See "Sequencing systems" section for options.
See "Sequencing profiles" section for more information on how this argument,
|
profile1 | Custom profile file for read 1.
See "Sequencing profiles" section for more information on how this argument,
|
profile2 | Custom profile file for read 2.
See "Sequencing profiles" section for more information on how this argument,
|
ins_prob1 | Insertion probability for read 1. Defaults to |
del_prob1 | Deletion probability for read 1. Defaults to |
ins_prob2 | Insertion probability for read 2. Defaults to |
del_prob2 | Deletion probability for read 2. Defaults to |
frag_len_min | Minimum fragment size. A |
frag_len_max | Maximum fragment size.
A |
haplotype_probs | Relative probability of sampling each haplotype.
This is ignored if sequencing a reference genome.
|
barcodes | Character vector of barcodes for each haplotype, or a single barcode
if sequencing a reference genome. |
prob_dup | A single number indicating the probability of duplicates.
Defaults to |
sep_files | Logical indicating whether to make separate files for each haplotype.
This argument is coerced to |
compress | Logical specifying whether or not to compress output file, or
an integer specifying the level of compression, from 1 to 9.
If |
comp_method | Character specifying which type of compression to use if any
is desired. Options include |
n_threads | The number of threads to use in processing.
If |
read_pool_size | The number of reads to store before writing to disk.
Increasing this number should improve speed but take up more memory.
Defaults to |
show_progress | Logical for whether to show a progress bar.
Defaults to |
overwrite | Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist. |
Nothing is returned.
This section outlines how to use the seq_sys
, profile1
,
and profile2
arguments.
If all arguments are NULL
(their defaults), a sequencing system is chosen
based on the read length.
If, however, one or more arguments has been provided, then how they're provided
should depend on whether you want single- or paired-end reads.
For single-end reads
profile2
should be NULL
.
Only seq_sys
or profile1
should be provided, not both.
For paired-end reads
If providing seq_sys
, don't provide either profile1
or profile2
.
If providing profile1
, you must also provide profile2
(they can be the
same if you want) and you cannot provide seq_sys
.
Sequencing system options are the following, where, for each system, "name" is the full name, "abbrev" is the abbreviated name, "max_len" indicates the maximum allowed read length, and "paired" indicates whether paired-end sequencing is allowed.
name | abbrev | max_len | paired |
Genome Analyzer I | GA1 | 44 | Yes |
Genome Analyzer II | GA2 | 75 | Yes |
HiSeq 1000 | HS10 | 100 | Yes |
HiSeq 2000 | HS20 | 100 | Yes |
HiSeq 2500 | HS25 | 150 | Yes |
HiSeqX v2.5 PCR free | HSXn | 150 | Yes |
HiSeqX v2.5 TruSeq | HSXt | 150 | Yes |
MiniSeq TruSeq | MinS | 50 | No |
MiSeq v1 | MSv1 | 250 | Yes |
MiSeq v3 | MSv3 | 250 | Yes |
NextSeq 500 v2 | NS50 | 75 | Yes |
The ID lines for FASTQ files are formatted as such:
@<genome name>-<chromosome name>-<starting position>-<strand>[/<read#>]
where the part in []
is only for paired-end Illumina reads, and where genome name
is always REF
for reference genomes (as opposed to haplotypes).
Huang, W., L. Li, J. R. Myers, and G. T. Marth. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594.
# \donttest{ rg <- create_genome(10, 100e3, 100) dir <- tempdir(TRUE) illumina(rg, paste0(dir, "/illumina_reads"), n_reads = 100, read_length = 100, paired = FALSE) # }