Low coverage genome sequence and ovine 50k SNP chip creation
The consortium investigated various methods by which low sequence coverage of the ovine genome could be produced. In addition to simulations using bovine and ovine sequence, this work included pilot studies using Sanger resequencing from existing sheep sequence and Roche 454 GS20 sequencing of previously sequenced sheep BACs to provide baseline information.
An important aspect of the process is to identify SNPs and their genomic locations, estimate their minor allele frequency (MAF), and provide sufficient known, unique flanking sequence to design probes for their detection.
This was considered to be a challenge using existing technology, with the following issues identified:
- The short contigs would almost certainly need to be ordered and orientated against a reference genome such as the bovine.
- The sequencing needs to make use of the virtual sheep genome to provide a framework for genome assembly.
- The new sequencing technologies produce very short reads. This means sequencing needs to use a divide and conquer approach to assembly, and even then assembly through even modest lengths of repetitive sequence is a challenge.
- The best technology for SNP detection and estimation of MAF provides insufficient genomic sequence for probe design and genome positioning.
The International Sheep Genomics Consortium's immediate objective has been to skim sequence the ovine genome to identify SNPs for a 50k SNP chip. Roche 454 FLX sequencing is a new technology based on pyrosequencing. Simulations based on the limited ovine genomic sequence available, together with results from pilot ovine resequencing projects, identified the following strategy: use Roche 454 FLX technology to produce 3x whole ovine genome coverage, consisting of 0.5x shotgun sequence coverage from each of 6 ewes. Each animal represents a different breed. The resulting sequence was initially assembled using the bovine genome as a framework and then reorganized using the virtual sheep genome to create a sheep genome sequence.
The 454 approach has been used to produce more than 9 Gb of ovine sequence and to provide assembled and ordered sequence for approximately 76% of the unique portion of the ovine genome. It also allowed the detection of more than 590,000 probable SNPs with defined genomic locations, of which more than 270,000 were classified as "class A" SNPs (both alleles seen in two sheep). This is a sufficient pool from which to select SNPs for the goal of a 50k ovine SNP chip comprising equally spaced SNPs. Based on available information, a 50k ovine SNP chip would be a resource in which the mean linkage disequilibrium (r²) between adjacent SNPs is in excess of 0.25, which is suitable for whole genome association studies.
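The coverage and chip density figures quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers stated on this page; the genome size is not given directly and is inferred here from 1.5 Gbp per animal corresponding to 0.5x coverage, so treat it and the derived SNP spacing as rough estimates. All variable names are illustrative.

```python
# Figures taken from this page.
ANIMALS = 6
COVERAGE_PER_ANIMAL = 0.5      # fold coverage per ewe
BASES_PER_ANIMAL_GBP = 1.5     # sequence per ewe, from the method list
CHIP_SNPS = 50_000             # target size of the SNP chip
CLASS_A_SNPS = 270_000         # lower bound on class A SNPs detected

# Assumption: genome size implied by 1.5 Gbp being 0.5x coverage.
genome_size_gbp = BASES_PER_ANIMAL_GBP / COVERAGE_PER_ANIMAL

# Pooled coverage and total sequence across the six animals.
total_coverage = ANIMALS * COVERAGE_PER_ANIMAL
total_sequence_gbp = ANIMALS * BASES_PER_ANIMAL_GBP

# Mean spacing if 50k SNPs were spread evenly (1 Gbp = 1e6 kb).
mean_spacing_kb = genome_size_gbp * 1e6 / CHIP_SNPS

# How many class A candidates exist, on average, per chip position.
candidates_per_slot = CLASS_A_SNPS / CHIP_SNPS

print(f"inferred genome size: {genome_size_gbp:.1f} Gbp")
print(f"pooled coverage: {total_coverage:.1f}x, {total_sequence_gbp:.1f} Gbp")
print(f"mean SNP spacing: {mean_spacing_kb:.0f} kb")
print(f"class A candidates per chip SNP: {candidates_per_slot:.1f}")
```

Under these assumptions the pooled 3x coverage matches the roughly 9 Gb of sequence reported, and an evenly spaced 50k chip would place one SNP about every 60 kb, with several class A candidates available per position.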
The 454 sequencing phase of the project was completed by AgResearch in New Zealand, which sequenced 3 sheep representing the Romney, Texel and Scottish Blackface breeds, and by Baylor HGSC in Houston, Texas, which sequenced 3 sheep representing the Merino, Poll Dorset and Awassi breeds.
The project had two additional components added to identify more SNPs as well as to estimate their minor allele frequency more accurately and improve the genome assembly. The first extension was to include ~4 Gbp of reduced representational sequencing (RRS) with an Illumina Genome Analyser (GA) to identify numerous additional SNPs and estimate their minor allele frequency using a technique outlined by Smith et al. (2008). The second extension has been to improve assembly by creation of paired end reads of various insert sizes and sequencing lengths using a combination of next generation and Sanger sequencing.
Roche FLX skim sequencing method
- Six animals (females), each of a different breed, were selected (Fig. 1)
- different breeds help identify SNPs with higher minor allele frequencies (MAF)
- females chosen to equalise representation of the X chromosome
- DNA isolated from white blood cells using standard proteinase K digestion and salt ethanol precipitation
- Each animal sequenced to 0.5 x genome coverage (1.5 Gbp) via Roche 454 FLX
- Two 454 FLX libraries made per animal with each library titrated and the best used
- 454 reads repeat masked with an in-house repeats database consisting of Repbase bovine repeats coupled with CAP3-assembled ovine FLX sequence segments found to have >1000 hits in the bovine genome
- Unique hits matched to location on bovine genome
- MEGABLAST used with options -D 3 -t 21 -W 11 -q -3 -r 2 -G 5 -E 2 -s 56 -N 2 -F "m D" -U T
- Unique is defined as either a single hit with an E-value of less than 1e-5, or multiple hits where the ratio of the top hit's E-value to the second hit's E-value is less than 1e-20
- Retrieved raw reads matching bovine scaffold segments (typically < 2 Mbp) and assembled using Newbler
- Positioned and orientated Newbler ovine contigs on to bovine scaffold
- Summarised as a virtual ovine sequence (MELD, see Fig 2)
- Reorder MELDed ovine segments using ovine BES information and VSG into ovine genome order
- align sequence reads to 454 MELD sequence (Fig 3)
- filter high quality SNPs based on:
- unique genomic match
- high quality sequence, no flanking SNPs
- not within or adjacent to a homopolymeric run
- at least 2 reads of minor allele preferably from different animals
- at least 50 bp of flanking sequence on both sides
- The consortium also used reduced representational sequencing with Illumina Genome Analyser
- 60 animals representing 11 breeds (primarily female) and ~5% of the genome sequenced to 20x depth/run
- 3 Illumina GA sequencing runs of 3.7 Gb in total with 33 base reads generated more than 76,000 high quality SNPs
- Illumina GA sequences positioned on Roche 454 MELDed sequence to provide genome location of SNP and sufficient flanking sequence for probe design
- A small number of runs of 454 GS20 and FLX paired end reads were generated by Baylor HGSC for quality control of the assembly
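Two rules in the method above lend themselves to a short illustration: the definition of a unique hit on the bovine genome and the filter for high quality SNPs. The following is a minimal sketch, not the consortium's pipeline. The function names, the `SnpCandidate` record and its fields are hypothetical, and it assumes the E-values for each read and the per-SNP properties have already been extracted from the alignment output.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-5    # single-hit threshold from the method list
E_RATIO_CUTOFF = 1e-20   # top/second E-value ratio from the method list


def is_unique_hit(evalues):
    """Apply the uniqueness rule to the E-values of one read's hits."""
    if not evalues:
        return False
    ranked = sorted(evalues)  # smallest E-value (best hit) first
    if len(ranked) == 1:
        return ranked[0] < E_VALUE_CUTOFF
    top, second = ranked[0], ranked[1]
    if second == 0.0:
        # Both E-values are reported as zero, so the hits
        # cannot be told apart by this rule.
        return False
    # Assumption: with multiple hits only the ratio test is applied,
    # as the rule is worded.
    return top / second < E_RATIO_CUTOFF


@dataclass
class SnpCandidate:
    """Hypothetical record holding the properties the filter checks."""
    unique_match: bool        # unique genomic match
    high_quality: bool        # high quality sequence at the SNP
    has_flanking_snp: bool    # another SNP in the flanking sequence
    near_homopolymer: bool    # within or adjacent to a homopolymer run
    minor_allele_reads: int   # reads supporting the minor allele
    left_flank_bp: int        # usable flanking sequence, 5' side
    right_flank_bp: int       # usable flanking sequence, 3' side


def passes_snp_filter(c, min_minor_reads=2, min_flank_bp=50):
    """Return True if the candidate meets every listed criterion."""
    return (
        c.unique_match
        and c.high_quality
        and not c.has_flanking_snp
        and not c.near_homopolymer
        and c.minor_allele_reads >= min_minor_reads
        and c.left_flank_bp >= min_flank_bp
        and c.right_flank_bp >= min_flank_bp
    )
```

The preference for minor allele reads coming from different animals is not enforced here, since the method states it as a preference rather than a requirement; it could be added as a ranking criterion when choosing among passing candidates.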
- Funding: Ovita (New Zealand), ISL Grant (Sydney University, Australia), Genesis Faraday (United Kingdom)
- Roche 454 FLX sequencing: AgResearch, University of Otago and Baylor HGSC
- Illumina GA reduced representational sequencing: CSIRO, Illumina, USDA
- Assembly: AgResearch, CSIRO
- SNP detection: AgResearch, CSIRO, (USDA)
|Sequence 3x coverage using 454 FLX||completed January 2008|
|Assemble and create MELD 454 sequence||completed February 2008|
|Illumina GA RRS sequencing||completed May 2008|
|BAC and RRS SNP detection||completed July 2008|
|pilot testing of 454 SNPs||completed July 2008|
|pilot testing of Solexa SNPs||completed August 2008|
|SNP Chip design||completed August 2008|
|SNP Chip synthesis and initial testing||completed December 2008|