1、RNA-seq Data Handling and Analysis Kevin Childs Statistical genetics/genomics journal club 中国测序论坛Overview Fasta/Fastq file formats NCBI SRA Data preparation Bowtie/Tophat/Cufflinks Velvet/Oases Trinity 中国测序论坛Fasta File Format#FASTA gi|1800214|gb|U56729.1|SBU56729 Sorghum bicolor phytochrome A CGCATC
2、CTTCCGCGCCGGGCATGGGCACCGCGTCGGCGCGCGCCCCTACCCAGTCGTCGACTTGATGCTG CTCACTCGCACTCGTCGCAGCGCCCCACGCCCCGCTATTTATGCGTACTTGCTTGCCGGGAGAGTCGCTG GAGGTGGGCGTCCTCCTCCCGCTCCAGAGCTCGCTGCTTCGCTCCACCCACCCTTAAGCAGGAGTGATAT CTGGTGGTTTTTCAAAAGAAGACAAAAATGTCTTCCTCGAGGCCTGCCCACTCTTCCAGTTCATCCAGTA GGACTCGCCAGAGCTCCCAGGC
3、AAGGATATTAGCACAAACAACCCTTGATGCTGAACTCAATGCAGAGTA TGAAGAATCTGGTGATTCCTTTGATTACTCCAAGTTGGTTGAAGCACAGCGGAGCACTCCATCTGAGCAG CAAGGGCGATCAGGAAAGGTCATAGCCTACTTGCAGCATATTCAAAGAGGAAAGCTAATCCAACCATTTG GTTGCTTGTTGGCCCTTGACGAGAAGAGCTTCAGGGTCATTGCATTCAGTGAGAATGCACCTGAAATGCT CACAACGGTCAGCCATGCTGTGCCAAACGTTGATGATC
4、CCCCAAAGCTAGGAATTGGTACCAATGTGCGC TCCCTTTTCACTGACCCTGGTGCTACAGCACTGCAGAAGGCACTAGGATTTGCTGATGTTTCTTTGCTGA ATCCTATCCTAGTTCAATGCAAGACCTCAGGCAAGCCATTCTATGCCATTGTTCATAGGGCAACTGGTTG TCTGGTGGTTGATTTTGAGCCTGTGAAGCCTACAGAATTTCCTGCCACTGCTGCTGGGGCTTTGCAGTCT 中国测序论坛Fastq File Format Read Name Sequence Quality Qua
5、lity scores are in ASCII characters representing coded Phred scores.ASCII codes start at ASCII 33 or ASCII 64.All SRA codes converted to ASCII 33 These scores provide a likelihood that the base was called incorrectly.10 1 in 10 chance the base call is incorrect 20 1 in 100 chance the base call is in
6、correct 30 1 in 1000 chance the base call is incorrect 中国测序论坛High Throughput Sequencing Platforms Illumina HiSeq 1000 and HiSeq 2000 Illumina Genome Analyzer IIx*Life Sciences/Roche 454 pyrosequencing ABI Solid Sequencing System*Pacific Biosciences*Ion Torrent Cambridge Nannopore(late 2012?)中国测序论坛Hi
7、gh Throughput Sequencing HiSeq 2000 Highly parallel sequencing by synthesis Single and paired-end reads between 50 bp and 100 bp 187 million single end or 374 million paired-end reads per lane High error rate in the 3 end 中国测序论坛NCBI SRA SRA toolkit fastq-dump/opt/sratoolkit/fastq-dump SRR373821.lite
8、.sra/opt/sratoolkit/fastq-dump-split-files SRR329070.lite.sra 中国测序论坛Read Quality with the FASTX-Toolkit http:/hannonlab.cshl.edu/fastx_toolkit/中国测序论坛Read Quality with the FASTX-Toolkit Bad Sequence Good Sequence 中国测序论坛FASTX Toolkit fastx_quality_stats-Q 33 i initial_fastq_file.fastq o stats.txt fast
9、x_quality_boxplot_graph.sh-Q 33 i stats.txt t Title o quality.png fastx_clipper-Q 33 -v -a AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -i initial_fastq_file.fastq -o fastq_file_clipped.fastq fastx_artifacts_filter -Q 33 -v -i fastq_file_clipped.fastq -o fastq_file_artifact_filtered.fa
10、stq fastq_quality_trimmer -Q 33-v -t 20 -l 30 -i fastq_file_artifact_filtered.fastq-o fastq_file_cleaned.fastq -Q is an undocumented parameter to indicate that quality values use ASCII 33 encoding.中国测序论坛FastQC http:/www.bioinformatics.bbsrc.ac.uk/projects/fastqc/A quality control tool for high throu
11、ghput sequence data.中国测序论坛Samtools Package of programs for manipulating sam and bam files sam sequence alignment map bam binary alignment map compressed form of sam file http:/ 中国测序论坛Tuxedo Suite Bowtie fast and quality aware short read aligner for aligning DNA and RNA sequence reads TopHat fast,spl
12、ice junction mapper for RNA-Seq reads built on the Bowtie aligner Cufflinks assembles transcripts,estimates their abundances,and test for differential expression and regulation using the alignments from Bowtie and TopHat 中国测序论坛Bowtie Aligns short reads to large genomes Forms the basis for TopHat,Cuf
13、flinks,Crossbow,and Myrna Unless you are working with genomic DNA derived short reads,you will not directly use Bowtie With the exception of using bowtie-build to create an genomic sequence index file 中国测序论坛TopHat Built on Bowtie and uses the same genome index Used for alignment of RNA-Seq reads to
14、a genome Optimized for paired-end,Illumina sequence reads 70bp 中国测序论坛TopHat 中国测序论坛 Quantification of gene expression using RNA-seq reads Tests for differential expression Uses output from bowtie/tophat Assembles read alignments into transcripts Uses cufflinks-predicted transcripts or user-supplied g
15、ene models for quantification Estimates transcript abundance balanced across transcript isoforms Cufflinks 中国测序论坛Cufflinks 中国测序论坛Bowtie/Tophat/Cufflinks bowtie-build pseudomolecule.fa pseudomolecule.index tophat-p 6-solexa1.3-quals-i 5 -I 1000 -r 100 -no-novel-juncs -GTF pseudomolecule.gtf -o/output
16、/directory pseudomolecule.index purified_reads.fastq samtools sort tophat_output_pairs.bam tophat_output_pairs_sorted samtools view -o tophat_output_pairs_sorted.sam tophat_output_pairs_sorted.bam cufflinks -q -o/output/directory/-p 4 -G pseudomolecule_corrected.gtf tophat_output_pairs_sorted.sam 中国
17、测序论坛Velvet/Oases Genome/transcriptome assembly package Velveth/velvetg work well for genomes but produce fragmented transcriptomes assemblies.Its modules explicitly assume linearity and uniform coverage distribution.Velvet was designed to assemble genomic reads Oases was designed to assemble transcr
18、iptomes of new species.Oases takes the preliminary assembly produced by velvetg,and clusters the contigs into small groups,called loci.It then exploits the read sequence and pairing information,when available,to produce transcript isoforms.中国测序论坛Velvet/Oases velveth /working/directory/31-fastq short
19、Paired switchgrass_purified_1.fastq switchgrass_purified_2.fastq velvetg /working/directory/-read_trkg yes-ins_length 190 -ins_length_sd 44 oases /working/directory/-ins_length 190 -ins_length_sd 44 -min_trans_lgth 250 中国测序论坛Trinity Inchworm,Chrysalis,Butterfly Transcript assembly package 中国测序论坛Trin
20、ity.pl -seqType fa left ABA_RC_1.fa right ABA_RC_2.fa single se_ABA_RC.fa -paired_fragment_length 400 -run_butterfly -output/output/directory -CPU 2 Trinity 中国测序论坛num_of_transcripts=25069 max_len_trans=2707 min_len_trans=300 N50=6492168.5 Avrg size of contigs=517.943954685069 N50_contig=544 Length(b
21、p)distribution 5000 0 num_of_transcripts=86458 max_len_trans=3597 min_len_trans=251 N50=39127035.5 Avrg size of contigs=905.110816812788 N50_contig=1136 Length(bp)distribution 5000 0 Assembly:used 41385139/46322456 reads Assembled reads=89.3414179075479%pe_ABA_RC,pe_ABA_RB,se_ABA_RC,se_ABA_RC.pe_ABA
22、_RC,se_ABA_RC 5,985,516 sequences TRINITY VELVET 中国测序论坛Memory requested and technical notes qsub_time Thu Jun 23 15:07:59 2011 start_time Thu Jun 23 15:08:08 2011 end_time Fri Jun 24 05:54:35 2011 5,985,516 sequences 10G 55G A basic recommendation is to have 1G of RAM per 1M pairs of Illumina reads.Our experience is that the entire process can require about 1 hour per million pairs of reads in the current implementation.*Memory requirements have improved in more recent updates.中国测序论坛