SOM for Optimization and application of the survey-sequence/RH map approach

The theoretical coverage of a genome is 39%, 63% and 78% after sequencing to 0.5x, 1.0x and 1.5x respectively (Lander and Waterman, 1988). For an average sequence read of 800 bases, the probability of including a specific 100-base segment of the genome within a sequence read is 0.36, 0.59 and 0.74 after sequencing to 0.5x, 1.0x and 1.5x respectively. We assume that identification of a gene fragment requires alignment of a read to at least 100 bases of at least one exon of the corresponding gene in the reference genome. Of the collection of 23,269 human RefSeqs, 23,233 contain at least one exon of 100 bases or longer (number of such exons per gene: mean, 7.4; median, 6; http://genome.ucsc.edu/cgi-bin/hgTables). The mean probability of sequencing at least 100 bases from at least one exon for each of 23,269 homologous genes is 0.84, 0.94 and 0.97 after 0.5x, 1.0x and 1.5x coverage respectively. However, approximately 25% of reference human genes are expected to lack 1:1 orthology with genes in the surveyed mammalian genome (and will therefore fail to provide mutual best blastn matches). Consequently, our estimate for the number of orthologues that could be detected after 0.5x, 1.0x and 1.5x coverage is 14,700 (i.e. 23,269 x 0.75 x 0.84), 16,400 and 16,900 respectively. These values do not consider the homology between intronic regions of orthologous genes, which can often be used to further increase the number of identifiable gene fragments after survey-sequencing (Kirkness et al., 2003).
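The arithmetic behind these estimates can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used: the function names are hypothetical, genome coverage is taken as the standard Lander-Waterman expectation 1 - e^(-c), and the segment calculation assumes that a read captures a segment only if it starts within (read length - segment length) bases upstream of it. That simplification gives values within about 0.01 of the probabilities quoted above, not identical ones. The per-gene detection probabilities (0.84, 0.94, 0.97) are taken directly from the text rather than recomputed, because they depend on the exon counts of each RefSeq.

```python
import math

def genome_coverage(c):
    """Expected fraction of the genome covered at sequence redundancy c
    (Lander-Waterman expectation, 1 - e^-c)."""
    return 1.0 - math.exp(-c)

def segment_probability(c, read_len=800, seg_len=100):
    """Approximate probability that a specific segment of seg_len bases lies
    entirely within at least one read (assumed simplification: only reads
    starting in a window of read_len - seg_len bases contain the segment)."""
    effective = c * (read_len - seg_len) / read_len
    return 1.0 - math.exp(-effective)

def expected_orthologues(n_genes, orthologue_fraction, p_detect):
    """Expected number of detectable orthologues: number of reference genes
    x fraction with 1:1 orthology x mean per-gene detection probability."""
    return n_genes * orthologue_fraction * p_detect

if __name__ == "__main__":
    n_refseq = 23269          # human RefSeqs (from the text)
    orthologue_fraction = 0.75  # about 25% assumed to lack 1:1 orthology
    # Mean per-gene detection probabilities quoted in the text.
    p_detect = {0.5: 0.84, 1.0: 0.94, 1.5: 0.97}

    for c in (0.5, 1.0, 1.5):
        n = expected_orthologues(n_refseq, orthologue_fraction, p_detect[c])
        print(
            f"{c:.1f}x: coverage={genome_coverage(c):.2f}, "
            f"P(100-base segment)~{segment_probability(c):.2f}, "
            f"orthologues~{round(n, -2):.0f}"
        )
```

Rounding the orthologue estimates to the nearest hundred gives the 14,700, 16,400 and 16,900 figures stated above.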

The 1.5x survey-sequence described more than 89,000 SNPs (and di-, tri-, and tetranucleotide polymorphisms) mapped near or within the coding sequences of 14,679 distinct human genes. Nine percent (1,299) of the gene-based marker sequences contained simple tandem repeat (STR) sequences. Of those, 25% show size polymorphisms when the 7.5x boxer and 1.5x poodle sequences are compared, making them an optimal resource for genetic linkage studies, and further demonstrating the utility of the survey-sequence/dense RH map approach for providing genomic resources in species of interest.