Genome size, sequencing coverage, and data volume are interdependent factors that each contribute to the complexity of de novo genome assembly. Thanks to next-generation sequencing, generating large quantities of sequence data for a new genome is now feasible even for small research labs, but maintaining the cluster-type computational facilities needed to assemble that data often is not. Researchers have therefore focused on how to reduce the input data so that assembly fits on small servers while also improving assembly quality. This field of research is called "data normalization for de novo assembly".
During NGS sample preparation, PCR amplification over-samples some genomic molecules relative to others, so the resulting reads have irregular coverage: some regions are sequenced far more deeply than the rest. This uneven coverage can cause mis-assembly, consume more computational resources, and in some cases make the assembly terminate prematurely.
To overcome this issue, the simplest options are random subsampling or filtering reads based on the Phred quality score (e.g. Phred >= 30). However, these naive methods can lead to over-fragmented assemblies or more missing regions.
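To make the quality-filtering option concrete, here is a minimal sketch in plain Python (illustrative only, not part of any tool mentioned here; the tuple-based read representation and the mean-score cutoff are my assumptions) that keeps only reads whose mean Phred score reaches a threshold:

```python
# Sketch of Phred-based read filtering (illustrative only; real pipelines
# use dedicated tools rather than code like this).

def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding) into scores."""
    return [ord(c) - offset for c in quality_string]

def passes_quality(quality_string, min_mean_phred=30):
    """Keep a read only if its mean Phred score reaches the cutoff."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) >= min_mean_phred

def filter_reads(records, min_mean_phred=30):
    """records: iterable of (name, sequence, quality) tuples."""
    return [r for r in records if passes_quality(r[2], min_mean_phred)]

reads = [
    ("read1", "ACGT", "IIII"),   # 'I' = Phred 40: high quality, kept
    ("read2", "ACGT", "!!!!"),   # '!' = Phred 0: low quality, discarded
]
kept = filter_reads(reads)
```

The drawback described above follows directly: any genuine read from a poorly sequenced region is discarded wholesale, leaving gaps in the assembly.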
To address these shortcomings, data-driven normalization methods were developed, such as BBNorm (http://sourceforge.net/projects/bbmap/), diginorm (http://ged.msu.edu/papers/2012-diginorm/), and NeatFreq (http://www.biomedcentral.com/1471-2105/15/357/), alongside subsampling tools (reformat in BBTools). Similar normalization steps are built into de novo assemblers such as DISCOVAR de novo (http://www.broadinstitute.org/software/allpaths-lg/blog/?p=716), Trinity (http://trinityrnaseq.github.io/), and Omega (http://omega.omicsbio.org/).
Here I will explain BBNorm, based on the author's notes.
To normalize to 40x coverage with BBNorm, and discard reads with an apparent depth under 2x (which typically indicates the reads have errors):
bbnorm.sh in=reads.fq out=normalized.fq target=40 mindepth=2
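The core idea behind this kind of coverage normalization can be sketched as follows. This is a simplified, diginorm-style Python illustration, not BBNorm's actual implementation (BBNorm uses memory-efficient probabilistic k-mer counting, while this toy version counts exactly, with toy values of k and target): a read's depth is estimated from the median count of its k-mers among reads kept so far, and the read is discarded once that estimate reaches the target.

```python
from collections import defaultdict
from statistics import median

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalize(reads, k=4, target=3):
    """Streaming normalization sketch: keep a read only while the median
    count of its k-mers (over reads already kept) is below the target
    depth, then add the kept read's k-mers to the counts."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if median(counts[km] for km in ks) < target:
            kept.append(read)
            for km in ks:
                counts[km] += 1
    return kept

# Ten identical reads simulate an over-amplified region: only the first
# few are kept, capping the region near the target depth.
kept = digital_normalize(["ACGTACGT"] * 10, k=4, target=3)
```

Using the median rather than the mean makes the depth estimate robust to the handful of erroneous, and therefore rare, k-mers a read may contain, which is also why reads whose median depth stays below `mindepth` can be flagged as likely error-ridden.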
To error-correct reads (Illumina data only):
ecc.sh in=reads.fq out=corrected.fq
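The same k-mer counts enable error correction. The sketch below is not ecc.sh's actual algorithm (which is considerably more sophisticated); it only illustrates the general k-mer-spectrum idea under my own simplifying assumptions: a read containing rare ("untrusted") k-mers is suspected of having an error, and single-base substitutions are tried until all of the read's k-mers become trusted.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def correct_read(read, counts, k=4, min_count=2):
    """Try to fix one erroneous base: if the read contains any rare k-mer,
    test each single-base substitution and accept the first variant whose
    k-mers are all trusted (count >= min_count)."""
    if all(counts[km] >= min_count for km in kmers(read, k)):
        return read  # already fully trusted
    for i in range(len(read)):
        for base in "ACGT":
            if base == read[i]:
                continue
            candidate = read[:i] + base + read[i + 1:]
            if all(counts[km] >= min_count for km in kmers(candidate, k)):
                return candidate
    return read  # no confident fix found

# Toy data: five correct copies of a sequence plus one read with an error.
good = "ACGTACGTAC"
bad = "ACGTTCGTAC"  # 'A' -> 'T' sequencing error at position 4
counts = Counter(km for r in [good] * 5 + [bad] for km in kmers(r, 4))
fixed = correct_read(bad, counts, k=4, min_count=2)
```

Because a single substitution error corrupts up to k consecutive k-mers, errors stand out sharply against the trusted k-mer spectrum even at modest coverage.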
To generate a k-mer frequency histogram:
khist.sh in=reads.fq hist=histogram.txt
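The computation behind such a histogram can be sketched as follows (a toy exact-counting version in Python; khist.sh itself relies on BBNorm's memory-efficient counters, and the k value here is illustrative): count every k-mer across all reads, then tally how many distinct k-mers occur at each depth.

```python
from collections import Counter

def kmer_histogram(reads, k=4):
    """histogram[d] = number of distinct k-mers seen exactly d times."""
    kmer_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    return Counter(kmer_counts.values())

# Two copies of one sequence (its k-mers appear twice) plus a homopolymer
# read whose single distinct k-mer TTTT appears three times.
hist = kmer_histogram(["ACGTAC", "ACGTAC", "TTTTTT"], k=4)
```

In practice, the resulting histogram is what guides normalization: its main peak indicates the genome's average coverage, while the spike at depth 1 largely consists of erroneous k-mers.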