Dear Trinity developers,
I have been using the program to assemble the transcriptome of an organism
that lacks a good-quality genome. This transcriptome will be used for
differential expression analysis.
Since I wanted the transcriptome to be as complete as possible, I relied on
high coverage. For my first assembly I used ~1 billion single-end reads;
these reads come from different libraries and range from 25 to 75 bp in
length. After running Trinity I got 320276 contigs longer than 100 bp. As
some of the contigs were redundant, I decided to use minimus2 to merge very
similar sequences, and I ended up with more than 100000 contigs. When I
mapped the libraries that I want to analyze for differential expression,
more than 89% of the reads per library were mappable and around 1-2% of the
reads aligned to more than one place, so I was quite happy with these
results.
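(For reference, fractions of this kind can be computed along the lines of
the following sketch, using pysam; "library.bam" is a placeholder, and the
NH tag that marks multi-mapped reads is only set by some aligners.)

    import pysam

    # Count mapped and multi-mapped reads; each read is counted once,
    # through its primary alignment. "library.bam" is a placeholder.
    total = mapped = multi = 0
    with pysam.AlignmentFile("library.bam", "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary:
                continue
            total += 1
            if read.is_unmapped:
                continue
            mapped += 1
            # NH > 1 marks a multi-mapped read (aligner-dependent tag)
            if read.has_tag("NH") and read.get_tag("NH") > 1:
                multi += 1
    print("mapped: %.1f%%, multi-mapped: %.1f%%"
          % (100.0 * mapped / total, 100.0 * multi / total))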
I understand that with very high coverage I could be assembling transcripts
that originate from pervasive transcription, and I think this could be one
of the reasons why I am getting over 100000 contigs, besides fragmentation.
But this might not be a major problem, since I could apply filters before
proceeding with the differential expression analysis; for example, I could
keep only the contigs supported by more than x reads.
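As a rough illustration of that kind of filter, something along these lines
would do (a sketch only; "counts.txt" is a hypothetical two-column table of
contig IDs and mapped-read counts, e.g. from RSEM, and MIN_READS is an
arbitrary threshold):

    # Keep only contigs supported by more than MIN_READS mapped reads.
    # "counts.txt" (hypothetical): tab-separated contig_id, read_count.
    MIN_READS = 10  # arbitrary threshold, for illustration only

    keep = set()
    with open("counts.txt") as fh:
        for line in fh:
            contig, count = line.rstrip("\n").split("\t")[:2]
            if float(count) > MIN_READS:
                keep.add(contig)

    # Write out only the FASTA records whose IDs passed the filter.
    write = False
    with open("assembly.fasta") as fin, open("filtered.fasta", "w") as fout:
        for line in fin:
            if line.startswith(">"):
                write = line[1:].split()[0] in keep
            if write:
                fout.write(line)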
However, now I have access to paired-end libraries and to other single-end
libraries with longer reads (75-100 bp), so I assembled these new libraries
to check whether paired-end information could resolve the fragmentation
problems. Altogether I had 2682537009 reads, and I used in silico
normalization to reduce them to 107846820; for the normalization I did not
use the --PARALLEL_STATS parameter (memory limitations).
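(As an aside, my understanding is that the normalization amounts to
coverage-based read subsampling, roughly in the spirit of the
diginorm-style sketch below; this is not Trinity's actual implementation,
and k = 25 and the coverage cutoff C = 30 are arbitrary choices.)

    # Diginorm-style sketch: keep a read only if its estimated coverage,
    # taken as the median count of its k-mers seen so far, is below C.
    # Not Trinity's insilico_read_normalization.pl, just the general idea.
    K, C = 25, 30
    counts = {}

    def median_kmer_count(seq):
        mids = sorted(counts.get(seq[i:i + K], 0)
                      for i in range(len(seq) - K + 1))
        return mids[len(mids) // 2] if mids else 0

    def keep_read(seq):
        if median_kmer_count(seq) >= C:
            return False  # this region is already well covered
        for i in range(len(seq) - K + 1):
            kmer = seq[i:i + K]
            counts[kmer] = counts.get(kmer, 0) + 1
        return True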
I followed the same pipeline as with my first assembly, and after Trinity
and minimus2 I have 200655 contigs.
Nevertheless, I now have more redundant contigs: for instance, some
components contain more than 100 sequences that share at least 50 bp.
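(To quantify the redundancy, sequences per component can be counted along
these lines; the sketch assumes the ">compN_cM_seqK" style of Trinity
headers, so the regex would need adjusting for other naming schemes.)

    import re
    from collections import Counter

    # Count sequences per component from the Trinity FASTA headers,
    # assuming the ">compN_cM_seqK" naming convention.
    comp_re = re.compile(r">(comp\d+)_")
    sizes = Counter()
    with open("Trinity.fasta") as fh:
        for line in fh:
            m = comp_re.match(line)
            if m:
                sizes[m.group(1)] += 1

    big = sum(1 for n in sizes.values() if n > 100)
    print("%d components with more than 100 sequences" % big)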
I also do not see an evident gain in completeness when I compare my two
transcriptomes, at least with the assays I carried out, such as checking
the number and coverage of orthologous sequences included in each assembly,
using the transcriptome of the phylogenetically closest organism as the
reference.
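For the coverage part of that check, counting recovered reference
transcripts can be done along these lines (a sketch;
"ref_vs_assembly.outfmt6" would be a tabular BLAST result of the reference
transcriptome against the assembly, the longest HSP is used as a crude
coverage proxy, and the 70% cutoff is arbitrary):

    # Fraction of reference transcripts recovered by the assembly.
    # Reference lengths come from the FASTA; hits from BLAST -outfmt 6.
    lengths, seq_id = {}, None
    with open("reference.fasta") as fh:
        for line in fh:
            if line.startswith(">"):
                seq_id = line[1:].split()[0]
                lengths[seq_id] = 0
            else:
                lengths[seq_id] += len(line.strip())

    best = {}  # longest HSP per reference transcript (crude proxy)
    with open("ref_vs_assembly.outfmt6") as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            query, aln_len = cols[0], int(cols[3])  # col 4 = aln length
            if aln_len > best.get(query, 0):
                best[query] = aln_len

    recovered = sum(1 for q, ln in best.items()
                    if ln >= 0.7 * lengths.get(q, float("inf")))
    print("%d of %d reference transcripts covered at >= 70%%"
          % (recovered, len(lengths)))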
At this point I do not know whether it would be better to use my first
assembly for further analyses, or to try assembling my new dataset again
using options like --REDUCE (as previously suggested:
http://sourceforge.net/mailarchive/forum.php?thread_name=CAJCu8qPrOE1BjFPk%…)
or --max_number_of_paths_per_node to solve the redundancy problems, and
--min_kmer_cov to exclude very lowly expressed transcripts. I would
appreciate any comments or suggestions in this regard.
Regards,
Jose Trejo Uribe