Genome Characteristics Estimation #
Interpreting blind statistical summaries of high-throughput DNA sequence data is like walking a random path through a deep forest. In genome projects ranging from small (bacterial) to large (higher plant, such as pine) genomes, we need to know the composition of the input sequences. Until recently, most scientists relied on blind summaries for this. A few research groups have now shed light on the issue through their tools, but only a few aspects are covered by statistical estimation so far, and more work is needed.
K-Mer Frequency Graph: #
High-throughput sequencers produce massive numbers of DNA fragments (reads). These are converted into a k-mer frequency graph by counting how often each individual k-mer occurs for a given k-mer size (Figure 1), and the counts are plotted as a histogram (k-mer frequency vs. number of k-mers with that frequency) for easy interpretation. Based on completed genome projects, the regions of the graph are interpreted as shown in Figure 1: errors, repeat content, and heterozygous regions. The values derived from the graph are used in all downstream analyses.
Figure 1: (A) K-mer graph interpretation. (B) Colors of the fits: red is the fit of the complete statistical model of the histogram (erroneous k-mers + genomic k-mers). When using the diploid model, green shows only the heterozygous k-mers and blue shows only the homozygous k-mers.
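The counting step described above can be sketched in a few lines of Python. This is only an illustration on toy reads, not how any of the tools discussed here implement it: the function names `count_kmers` and `kmer_spectrum` are hypothetical, and the sketch ignores reverse complements and memory efficiency, which real k-mer counters have to handle.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each frequency to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Toy input (assumption: real input would be millions of sequencing reads).
reads = ["ACGTACGT", "CGTACGTA", "ACGTTTTT"]
counts = count_kmers(reads, k=4)
spectrum = kmer_spectrum(counts)

# Print the histogram as "frequency <tab> number of distinct k-mers".
for frequency in sorted(spectrum):
    print(frequency, spectrum[frequency], sep="\t")
```

Plotting the second column against the first gives the k-mer frequency graph: low-frequency k-mers are mostly sequencing errors, the main peak reflects the genome, and k-mers far to the right of the peak come from repeats.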
Genome Size Estimation #
The first step in a genome project is to estimate the genome size. The estimate guides the design of later steps, i.e., how much raw sequence is required and which sequencing technology would be cost-effective for the project. During the panda genome project, researchers proposed estimating genome size from high-throughput sequencing data, and they later optimized the method with floating-point precision estimation. Another group developed a method for automated k-mer size selection (KmerGenie) for de novo assembly. Its results are given as individual histograms for each k-mer size (Figure 2) and a best k-mer graph with the estimated genome size (Figure 3), and the complete results are organized into interactive web pages for easy interpretation.
Figure 2: Individual k-mer histograms for k-mer sizes from 21 to 101
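As a rough illustration of how a genome size can be read off a k-mer histogram, the sketch below divides the total number of k-mer occurrences by the frequency of the main peak, after discarding the low-frequency error k-mers. This is a simplified version of the general idea, not the exact model used by GCE, KmerGenie, or SGA. The function name `estimate_genome_size`, the valley-finding rule, and the histogram values are all assumptions made for the example.

```python
def estimate_genome_size(spectrum):
    """Estimate genome size from a k-mer spectrum.

    spectrum maps a frequency to the number of distinct k-mers with that frequency.
    Returns (estimated_size, peak_frequency).
    """
    freqs = sorted(spectrum)

    # Assumption: the first point where the curve starts rising again marks the
    # valley between the error peak and the main genomic peak.
    valley = freqs[0]
    for prev, cur in zip(freqs, freqs[1:]):
        if spectrum[cur] > spectrum[prev]:
            valley = prev
            break

    # Keep only k-mers at or above the valley (treated as genomic).
    genomic = {f: n for f, n in spectrum.items() if f >= valley}

    # The main peak approximates the k-mer sequencing depth.
    peak = max(genomic, key=genomic.get)

    # Total k-mer occurrences divided by depth approximates the genome size.
    total_kmers = sum(f * n for f, n in genomic.items())
    return total_kmers / peak, peak


# Made-up histogram: error k-mers at low frequency, main peak at 20, repeats at 40.
toy_spectrum = {1: 5000, 2: 800, 3: 100,
                18: 200, 19: 600, 20: 1000, 21: 600, 22: 200,
                40: 50}
size, depth = estimate_genome_size(toy_spectrum)
print(f"peak depth: {depth}, estimated genome size: {size:.0f} bp")
```

Real histograms are noisier, and heterozygous genomes show a second peak at about half the main depth, which is why the tools fit a statistical model rather than taking a single peak.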
To reduce user input further, assemblers such as Platanus, SGA, and ALLPATHS-LG select the best k-mer size automatically by default and estimate the genome size. In particular, the SGA group developed a statistical method that estimates genome characteristics from NGS data for de novo genome assembly projects and presents them as easily interpretable graphs (Figure 4).
Figure 3: Best K-Mer selection graph with genome size estimation
Repeat Content and base errors estimation #
Following k-mer selection and genome size estimation, SGA uses the k-mer curve to estimate other factors, such as repeat content and per-base-position error rates. Intensive studies with GCE and Platanus also suggest that longer k-mers reduce the assembly problems caused by repeat content and reduce fragmentation. On the other hand, computational cost increases with both k-mer size and genome size. SGA estimated these characteristics for the Assemblathon 2 benchmark datasets (fish, bird, and snake) and for other genomes. SGA also guides the choice of assembler through estimates of de Bruijn graph complexity and the NG50 of the contig assembly, which helps scientists select the best assembly.
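To make the idea of reading error and repeat content off the k-mer curve more concrete, here is a minimal sketch that splits a k-mer histogram into three classes by frequency. It is not SGA's method, which fits a statistical model. The function name `classify_spectrum`, the cut-off of 1.5 times the peak frequency for repeats, and the example histogram are assumptions chosen only for illustration.

```python
def classify_spectrum(spectrum, valley, peak, repeat_factor=1.5):
    """Split k-mer occurrences into likely error, unique, and repeat fractions.

    spectrum maps a frequency to the number of distinct k-mers with that frequency.
    valley is the frequency separating error k-mers from genomic k-mers.
    peak is the frequency of the main genomic peak.
    repeat_factor is an arbitrary cut-off (assumption): k-mers seen more than
    repeat_factor * peak times are treated as repeat-derived.
    """
    total = sum(f * n for f, n in spectrum.items())
    error = sum(f * n for f, n in spectrum.items() if f < valley)
    repeat = sum(f * n for f, n in spectrum.items() if f > repeat_factor * peak)
    return {
        "error_fraction": error / total,
        "repeat_fraction": repeat / total,
        "unique_fraction": (total - error - repeat) / total,
    }


# Made-up histogram: 100 singleton k-mers, 80 k-mers at the peak, 5 at twice the peak.
example = {1: 100, 10: 80, 20: 5}
fractions = classify_spectrum(example, valley=3, peak=10)
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")
```

Fixed cut-offs like these misclassify heterozygous k-mers and low-copy repeats, so they are useful only as a first look before running a model-based tool.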
KmerGenie compilation #
Here, reads_file is either a FASTA, FASTQ, FASTA.gz, or FASTQ.gz file, or a text file listing such file names, one per line.
SGA Compilation and Genome Plot Creation #
sga preprocess --pe-mode 1 reads_R1.fastq reads_R2.fastq > mygenome.fastq
sga index -a ropebwt --no-reverse -t 8 mygenome.fastq
sga preqc -t 8 mygenome.fastq > mygenome.preqc
sga-preqc-report.py mygenome.preqc sga/src/examples/*.preqc
GCE Compilation #
./gce -f kmer_depth_file >gce.table 2>gce.log ##basic standard discrete model.
./gce -f kmer_depth_file -c unique_depth -H 1 >gce.table 2>gce.log ##for heterozygous mode.
./gce -f kmer_depth_file -c unique_depth -m 1 -D 8 >gce.table 2>gce.log ##for continuous model
Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W: Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv:1308.2012 2013.
Chikhi R, Medvedev P: Informed and automated k-mer size selection for genome assembly. Bioinformatics 2014, 30(1):31-37.
Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, Yabana M, Harada M, Nagayasu E, Maruyama H et al: Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Research 2014, 24(8):1384-1395.
Simpson JT: Exploring genome characteristics and sequence quality without a reference. Bioinformatics 2014, 30(9):1228-1235.
Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S et al: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences 2011, 108(4):1513-1518.