


drFAST is a read mapper designed to map short reads to a reference genome, with a special emphasis on the discovery of structural variation and segmental duplications. drFAST maps short reads with respect to a user-defined error threshold. This manual describes how to choose the parameters and tune drFAST with respect to the library settings. drFAST is designed to find all mappings for a given set of reads; however, it can return one "best" map location if the relevant parameter is invoked.




General

Please download the latest version from our download page and then unzip the downloaded file. Run 'make' to build drFAST.

Mapping: drFAST
  1.     translates the genome from letters (A, C, T, G) to colors (0 = BLUE, 1 = GREEN, 2 = YELLOW, 3 = RED),
  2.     generates an index of the reference genome(s), and
  3.     maps the reads to the reference genome.
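
To illustrate step 1, the following is a minimal sketch of the standard SOLiD di-base encoding, in which each color is derived from two adjacent bases. It is not drFAST's own code: the function names `base_code` and `to_colors` are made up for this example, and the handling of unknown bases (printed as ".") is an assumption.

```shell
# Map one base to a 2-bit code (A=0, C=1, G=2, T=3); anything else gives -1.
base_code() {
  case "$1" in
    A|a) echo 0 ;;
    C|c) echo 1 ;;
    G|g) echo 2 ;;
    T|t) echo 3 ;;
    *)   echo -1 ;;
  esac
}

# Translate a base sequence into colors. Each color is the XOR of the codes
# of two adjacent bases, so a sequence of n bases yields n-1 colors.
to_colors() {
  seq_left=$1
  out=""
  prev=""
  while [ -n "$seq_left" ]; do
    rest=${seq_left#?}          # everything after the first character
    base=${seq_left%"$rest"}    # the first character
    seq_left=$rest
    cur=$(base_code "$base")
    if [ -n "$prev" ]; then
      if [ "$prev" -ge 0 ] && [ "$cur" -ge 0 ]; then
        out="$out$(( prev ^ cur ))"
      else
        out="$out."             # unknown base: placeholder (assumption)
      fi
    fi
    prev=$cur
  done
  echo "$out"
}

to_colors ACGTTA   # prints 13103
```

Here AC gives 1 (GREEN), CG gives 3 (RED), GT gives 1, TT gives 0 (BLUE) and TA gives 3.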


Parallelism: The best way to optimize drFAST is to split the reads into chunks that fit into the memory of the cluster nodes. The number of reads per chunk is approximately (M-600)/(4*L) million, where M is the memory of a cluster node in MB and L is the read length. If you have more nodes, you can make the chunks smaller to use the nodes efficiently. For example, if the read length is 50bp and each node has 2GB of memory, each chunk should contain (2000-600)/(4*50) = 7 million reads.
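
The chunk-size estimate can be scripted. The following is a minimal sketch: the function name `reads_per_chunk_mil` and the file names are illustrative and not part of drFAST, and it assumes a FASTQ file with exactly four lines per read. The `split` command is only printed, not executed.

```shell
# Estimate how many million reads fit in one chunk:
# (M - 600) / (4 * L), with M in MB and L the read length (integer division).
reads_per_chunk_mil() {
  mem_mb=$1
  read_len=$2
  echo $(( (mem_mb - 600) / (4 * read_len) ))
}

# Example from the text: nodes with 2000 MB of memory, 50bp reads.
mil=$(reads_per_chunk_mil 2000 50)
echo "$mil million reads per chunk"

# Four FASTQ lines per read (assumption: unwrapped records), so a chunk of
# $mil million reads spans this many lines.
lines=$(( mil * 1000000 * 4 ))
echo "split -l $lines reads.fastq chunk_"
```

Each resulting chunk file can then be mapped on a separate node.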

To see the list of options, use "-h" or "--help".
To see the current version of drFAST, use "-v" or "--version".

Indexing

drFAST's indices can be generated in two modes (single, batch). In single mode, drFAST indexes a fasta file (which may contain one or more reference genomes) while in batch mode it indexes a set of fasta files.

By default, drFAST uses a window size of 12 characters to generate its index. Please be advised that if you do not choose the window size carefully, you will lose sensitivity.

How to choose the right window size: For a given read length (l) and error threshold (e), the window size is floor(l/(e+1)). For example, if the read length is 36 and the maximum number of mismatches allowed is 2, the window size is 12. If your calculated window size is greater than the default, you can use the default window size without losing sensitivity. For example, for a read length of 64 and an error threshold of 2, the calculated window size is 21, so you can use the default window size of 12. However, you cannot use 12 as the window size for a read length of 30 and an error threshold of 2, because the calculated window size is only 10.
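
The rule can be expressed as a small helper. This is a sketch only: the function names are hypothetical, and it assumes that when the calculated value is below the default you index with the calculated value via "--ws".

```shell
DEFAULT_WS=12

# Largest window size that keeps full sensitivity: floor(l / (e + 1)).
max_window_size() {
  echo $(( $1 / ($2 + 1) ))
}

# Print the window size to index with: the default if the calculated value
# is at least as large, otherwise the smaller calculated value.
choose_window_size() {
  ws=$(max_window_size "$1" "$2")
  if [ "$ws" -ge "$DEFAULT_WS" ]; then
    echo "$DEFAULT_WS"
  else
    echo "$ws"
  fi
}

choose_window_size 36 2   # prints 12
choose_window_size 64 2   # prints 12 (calculated 21, default is fine)
choose_window_size 30 2   # prints 10 (default 12 would lose sensitivity)
```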

Single Mode

To index a reference genome like "refgen.fasta", run the following command:
$>./drfast --index refgen.fasta
Upon completion of the indexing phase, you can find "refgen.fasta.index" in the same directory as "refgen.fasta". drFAST uses a window size of 12 (default) to build the index of the genome; this window size can be changed with "--ws". The maximum window size is restricted because the window size directly affects memory usage.
$>./drfast --index refgen.fasta --ws 13

Batch Mode

In batch mode, drFAST takes a list of reference files and generates an index for each of them. As in single mode, you can specify a different window size for indexing.
$>./drfast -b --index fasts.list --ws 13
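
The manual does not spell out the layout of the list file. The sketch below assumes one FASTA path per line, which is a guess that should be checked against your drFAST version; the function name `make_fasta_list` and the demo files are hypothetical.

```shell
# Write the paths of all .fasta files in a directory to a list file,
# one path per line (assumed list format).
make_fasta_list() {
  ref_dir=$1
  list_file=$2
  : > "$list_file"                      # start with an empty list
  for f in "$ref_dir"/*.fasta; do
    [ -e "$f" ] && printf '%s\n' "$f" >> "$list_file"
  done
  return 0
}

# Demonstration on empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/chr1.fasta" "$demo_dir/chr2.fasta"
make_fasta_list "$demo_dir" "$demo_dir/fasts.list"
cat "$demo_dir/fasts.list"
```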

Mapping

drFAST can map single-end reads and paired-end reads to a reference genome. drFAST can map in either single or batch mode. In single mode, it only maps to one index. In batch mode, it maps to a list of indices. drFAST supports both fasta and fastq formats.

Single-end Reads - Single Mode

To map single-end reads to a reference genome in single mode, run the following command. Use "--seq" to specify the input file. refgen.fasta and refgen.fasta.index should be in the same folder.
$>./drfast --search refgen.fasta --seq reads.fastq
The reported locations are saved to a file named "output" by default. To save them elsewhere, use "-o" to specify another file. drFAST can report the unmapped reads in fasta/fastq format.
$>./drfast --search refgen.fasta --seq reads.fastq -o my.map
By default, drFAST reports all the locations per read and allows up to 2 mismatches. You can change the number of mismatches with "-e".
$>./drfast --search refgen.fasta --seq reads.fastq -e 3

Single-end Reads - Batch Mode

In batch mode, drFAST uses a list of indices to find the mappings of the reads. "index.list" should contain the list of fasta files.
$>./drfast -b --search index.list --seq reads.fastq

Paired-end Reads

To map paired-end reads, use the "--pe" option. The mapping can be done in single or batch mode. If the reads are in two different files, use "--seq1" and "--seq2" to indicate the files. If the reads are interleaved, use "--seq" to indicate the file. The distance allowed between the paired-end reads should be specified with "--min" and "--max", which give the minimum and maximum of the inferred size (the distance between the outer edges of the mapped mates).
$>./drfast --search refgen.fasta --pe --seq reads.fastq --min 150 --max 250
$>./drfast -b --search index.list --pe --seq1 reads1.fastq --seq2 reads2.fastq --min 50 --max 75
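
To make the inferred size concrete, here is a small sketch that computes it from the leftmost positions of two mates and tests it against "--min" and "--max". The function names are hypothetical. The sketch assumes both mates have the same length and lie on the same reference sequence, and it looks only at the size criterion, ignoring orientation.

```shell
# Inferred size of a pair: distance between the outer edges of the mates.
# Arguments: leftmost position of mate 1, leftmost position of mate 2,
# read length.
inferred_size() {
  pos1=$1
  pos2=$2
  read_len=$3
  if [ "$pos1" -le "$pos2" ]; then
    echo $(( pos2 + read_len - pos1 ))
  else
    echo $(( pos1 + read_len - pos2 ))
  fi
}

# Classify a pair by size only. Arguments: pos1 pos2 read_len min max.
classify_pair() {
  size=$(inferred_size "$1" "$2" "$3")
  if [ "$size" -ge "$4" ] && [ "$size" -le "$5" ]; then
    echo concordant
  else
    echo discordant
  fi
}

inferred_size 1000 1150 50            # prints 200
classify_pair 1000 1150 50 150 250    # prints concordant
classify_pair 1000 1400 50 150 250    # prints discordant (size 450)
```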

Discordant Mapping

drFAST can report discordant mappings for use with VariationHunter. The --min and --max options define the minimum and maximum inferred size for a concordant mapping. Paired-end reads are handled as described in the ABSolid Document.
$>./drfast --search refgen.fasta --pe --discordant-vh --seq reads.fastq --min 50 --max 75

Output Format

drFAST writes its output in SAM format. For details about the definition of the fields, please refer to the SAM Manual.
We have not implemented the "MQUAL" field yet.
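
Because drFAST reports all locations per read by default, a quick way to inspect the output is to count records per read name. The sketch below relies only on standard SAM conventions (tab-separated fields, read name in column 1, header lines starting with "@"). The function name `count_locations` is made up, and the sample records are hand-written for the demonstration, not real drFAST output.

```shell
# Count SAM records per read name, skipping header lines.
count_locations() {
  awk -F '\t' '!/^@/ { n[$1]++ } END { for (r in n) print r "\t" n[r] }' "$1" | sort
}

# Tiny hand-made SAM fragment for demonstration.
sam=$(mktemp)
printf '@HD\tVN:1.0\n' > "$sam"
printf 'read1\t0\tchr1\t100\t255\t35M\t*\t0\t0\t*\t*\n' >> "$sam"
printf 'read1\t16\tchr1\t5000\t255\t35M\t*\t0\t0\t*\t*\n' >> "$sam"
printf 'read2\t0\tchr2\t42\t255\t35M\t*\t0\t0\t*\t*\n' >> "$sam"

count_locations "$sam"
```

On a real run, pass the output file (named "output" by default) instead of the temporary file.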




v0.0.0.2 and earlier

Translation

In this step, we translate the genome from letter space to color space.
$>java Driver refgen.fasta
Executing this command writes the color-space genome to a file called refgen.fasta.cs.

Indexing

drFAST's indices can be generated in two modes (single, batch). In single mode, drFAST indexes a fasta file (which may contain one or more reference genomes) while in batch mode it indexes a set of fasta files.

By default, drFAST uses a window size of 12 characters to generate its index. Please be advised that if you do not choose the window size carefully, you will lose sensitivity.

How to choose the right window size: For a given read length (l) and error threshold (e), the window size is floor(l/(e+1)). For example, if the read length is 36 and the maximum number of mismatches allowed is 2, the window size is 12. If your calculated window size is greater than the default, you can use the default window size without losing sensitivity. For example, for a read length of 64 and an error threshold of 2, the calculated window size is 21, so you can use the default window size of 12. However, you cannot use 12 as the window size for a read length of 30 and an error threshold of 2, because the calculated window size is only 10.

Single Mode

To index a reference genome like "refgen.fasta", run the following command:
$>./drfast --index refgen.fasta
Upon completion of the indexing phase, you can find "refgen.fasta.index" in the same directory as "refgen.fasta". drFAST uses a window size of 12 (default) to build the index of the genome; this window size can be changed with "--ws". The maximum window size is restricted because the window size directly affects memory usage.
$>./drfast --index refgen.fasta --ws 13
NOTE: Although the Translation step produces a color-space genome, for indexing you give the name of the letter-space genome, "refgen.fasta".

Batch Mode

In batch mode, drFAST takes a list of reference files and generates an index for each of them. As in single mode, you can specify a different window size for indexing.
$>./drfast -b --index fasts.list --ws 13

Mapping

drFAST can map single-end reads and paired-end reads to a reference genome. drFAST can map in either single or batch mode. In single mode, it only maps to one index. In batch mode, it maps to a list of indices. drFAST supports both fasta and fastq formats.

Single-end Reads - Single Mode

To map single-end reads to a reference genome in single mode, run the following command. Use "--seq" to specify the input file. refgen.fasta and refgen.fasta.index should be in the same folder.
$>./drfast --search refgen.fasta --seq reads.fastq
The reported locations are saved to a file named "output" by default. To save them elsewhere, use "-o" to specify another file. drFAST can report the unmapped reads in fasta/fastq format.
$>./drfast --search refgen.fasta --seq reads.fastq -o my.map
By default, drFAST reports all the locations per read and allows up to 2 mismatches. You can change the number of mismatches with "-e".
$>./drfast --search refgen.fasta --seq reads.fastq -e 3

Single-end Reads - Batch Mode

In batch mode, drFAST uses a list of indices to find the mappings of the reads. "index.list" should contain the list of fasta files.
$>./drfast -b --search index.list --seq reads.fastq

Paired-end Reads

To map paired-end reads, use the "--pe" option. The mapping can be done in single or batch mode. If the reads are in two different files, use "--seq1" and "--seq2" to indicate the files. If the reads are interleaved, use "--seq" to indicate the file. The distance allowed between the paired-end reads should be specified with "--min" and "--max", which give the minimum and maximum of the inferred size (the distance between the outer edges of the mapped mates).
$>./drfast --search refgen.fasta --pe --seq reads.fastq --min 150 --max 250
$>./drfast -b --search index.list --pe --seq1 reads1.fastq --seq2 reads2.fastq --min 50 --max 75

Discordant Mapping

drFAST can report discordant mappings for use with VariationHunter. The --min and --max options define the minimum and maximum inferred size for a concordant mapping. Paired-end reads are handled as described in the ABSolid Document.
$>./drfast --search refgen.fasta --pe --discordant-vh --seq reads.fastq --min 50 --max 75

Output Format

drFAST writes its output in SAM format. For details about the definition of the fields, please refer to the SAM Manual.
We have not implemented the "MQUAL" field yet.


Related Projects:

mrsFAST and mrFAST : Illumina read mappers

mrCaNaVaR : Read depth analysis method to characterize segmental duplications and predict absolute copy numbers.

VariationHunter : Structural variation calling algorithm using read pair mapping information including suboptimal alignments.

NovelSeq : Novel sequence insertion discovery framework.

SPLITREAD : Detection of structural variants and indels from genome and exome sequencing data.

SCALCE : Tool for compression of FASTQ files.
