Welcome to the blog of ScaMPI, a suite of tools for genome scaffolding based on Mate Paired information.
This is a project of the Genomics and Bioinformatics unit of the CRIBI Center, University of Padua.
The missing script “trap_words.pl” is being included in the (soon to be released) next update of the ScaMPI Package.
In the meantime you can download it from here.
To assess the accuracy of the ScaMPI scaffolding algorithm we generated a simulated set of contigs from the A. thaliana genome (~12,000 in total). Each contig has been numbered according to its position in the genome: “seq.3.400” is the 400th contig of chromosome 3.
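Because the position of each simulated contig is encoded in its name, the joins proposed by a scaffolder can be checked automatically. The snippet below is only a minimal sketch of that idea, not the evaluation code we actually used: the helper names are hypothetical, the scaffold is assumed to be a plain list of contig names, and a join is counted as correct only when the two contigs are immediate neighbours on the same chromosome (a stricter criterion than a real evaluation might apply).

```python
import re

# The "seq.<chromosome>.<index>" pattern comes from the naming example in the
# text; the helper names and the adjacency criterion are assumptions.
_NAME = re.compile(r"^seq\.(\d+)\.(\d+)$")


def parse_contig_name(name):
    """Return (chromosome, index) for a name such as 'seq.3.400'."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected contig name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def count_correct_joins(scaffold):
    """Count consecutive contig pairs that are neighbours on one chromosome."""
    correct = 0
    for left, right in zip(scaffold, scaffold[1:]):
        chrom_l, idx_l = parse_contig_name(left)
        chrom_r, idx_r = parse_contig_name(right)
        # Accept either orientation of the scaffold: indices must differ by 1.
        if chrom_l == chrom_r and abs(idx_l - idx_r) == 1:
            correct += 1
    return correct
```

For example, a scaffold listed as seq.3.400, seq.3.401, seq.4.1 contains two joins, of which only the first would be counted as correct.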
To assess the accuracy of the correctness.pl script (which evaluates the integrity of contigs) we selected a set of 600 contigs from the 454 shotgun assembly of N. gaditana. The first 200 were fused to form 100 chimeric contigs. You can download the dataset from here.
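A test set of this kind can be rebuilt with a few lines of code. The following is a rough sketch under stated assumptions: contigs are held in memory as (name, sequence) tuples, consecutive contigs are fused in pairs (the post does not say how the pairs were chosen), and the function name and the "+" naming of the chimeras are purely illustrative.

```python
def make_chimeras(contigs, n_fused=200):
    """Fuse the first n_fused contigs pairwise; keep the rest unchanged.

    contigs is a list of (name, sequence) tuples. Pairing consecutive
    contigs is an assumption, not necessarily how the dataset was built.
    """
    if n_fused % 2 != 0 or n_fused > len(contigs):
        raise ValueError("n_fused must be even and not exceed the contig count")
    chimeras = []
    for i in range(0, n_fused, 2):
        (name_a, seq_a), (name_b, seq_b) = contigs[i], contigs[i + 1]
        # Concatenate the two sequences and record both source names.
        chimeras.append((f"{name_a}+{name_b}", seq_a + seq_b))
    return chimeras + contigs[n_fused:]
```

With 600 input contigs and the default of 200 fused contigs, the result holds 100 chimeric and 400 intact contigs.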
We then ran all the steps already described, briefly summarized here:
pass -csfastq 2_F3.csfastq -d testset.fa -cpu 12 -uniq -fid 90 -gff > 2_F3.gff
pass -csfastq 2_R3.csfastq -d testset.fa -cpu 12 -uniq -fid 90 -gff > 2_R3.gff
pass_pair -gff1 2_F3.gff -gff2 2_R3.gff -range 900 3000 3001 -ref testset.fa -o ./
This tutorial explains how to use the TRAP pipeline (Telomeric Repeat Analysis Program) to spot telomeres from mate paired reads in a genome that is still being assembled.
The insert size of the mate paired reads should be as long as possible, to maximize the chance of anchoring telomeres to existing scaffolds.
A pivotal step in preparing data for ScaMPI is the pairing of alignments, using the pass_pair program from the PASS suite. Once we have aligned both the first and the second mate against the reference contigs, we obtain two alignment files that, when paired, will be categorized as:
All other pairs will be discarded. The first category can be used for integrity checking of the contigs, while the latter for scaffolding.
The ScaMPI suite comes with its own aligner (PASS), chosen for its support for color space reads and its ability to run without building an index (which turns out to be handy when testing alignments against many different contig sets).
Because older versions of PASS produced GFF output instead of the now widespread SAM, GFF is the alignment format used by ScaMPI to infer arcs between contigs. We now include a script for SAM to GFF conversion (that keeps only scaffolding-needed information!).
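To give an idea of what such a conversion involves, here is a minimal sketch, not the script shipped with ScaMPI. It relies only on the standard SAM columns (read name, flag, reference name, 1-based position, mapping quality, CIGAR) and writes the nine generic GFF columns. The source label, the "match" feature type and the Name attribute are assumptions: the exact attributes that ScaMPI expects are not described here.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    # Only M, D, N, = and X consume reference positions in a CIGAR string.
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "MDN=X")


def sam_to_gff(sam_line, source="sam2gff"):
    """Convert one SAM record to a generic GFF line; return None if skipped."""
    if sam_line.startswith("@"):
        return None  # header line
    fields = sam_line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = fields[:6]
    flag = int(flag)
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped read
    start = int(pos)  # SAM positions are 1-based, like GFF
    end = start + reference_span(cigar) - 1
    strand = "-" if flag & 0x10 else "+"
    # "match" and "Name=" are illustrative choices, not the ScaMPI layout.
    return "\t".join(
        [rname, source, "match", str(start), str(end), mapq, strand, ".",
         f"Name={qname}"]
    )
```

Header lines and unmapped reads are dropped, since they carry no information that could link two contigs.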
Checking the correctness of contigs is a pivotal step to perform prior to scaffolding, as chimeric sequences can lead to big mistakes. This is especially true when the contigs are generated by a shotgun approach with long reads, while a paired library of short reads is available.
The “correctness.pl” script will break contigs apart, basing its analysis on the physical coverage of the mates aligned against the reference contigs.
Note that steps 1 and 2 are identical to the alignment steps used for scaffolding: you can skip to step 3 if you have already performed them.
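The physical coverage idea can be illustrated with a short sketch. This is not the code of correctness.pl, just the general principle under simple assumptions: every mate pair aligned on a contig is reduced to one (start, end) span covering both mates and the insert between them (0-based, end excluded), and any internal stretch whose coverage falls below a threshold is reported as a candidate breakpoint. Function names, the default threshold and the margin parameter are hypothetical.

```python
def physical_coverage(contig_length, pair_spans):
    """Per-base count of mate-pair spans covering each contig position."""
    diff = [0] * (contig_length + 1)
    for start, end in pair_spans:
        start = max(0, start)
        end = min(contig_length, end)
        if start < end:
            diff[start] += 1  # span opens here
            diff[end] -= 1    # span closes just before this position
    coverage = []
    running = 0
    for delta in diff[:contig_length]:
        running += delta
        coverage.append(running)
    return coverage


def low_coverage_regions(coverage, min_cov=1, margin=0):
    """Half-open (start, end) intervals where coverage is below min_cov.

    Positions closer than margin to either contig end are ignored, since
    coverage naturally drops there.
    """
    regions = []
    region_start = None
    last = len(coverage) - margin
    for pos in range(margin, last):
        if coverage[pos] < min_cov:
            if region_start is None:
                region_start = pos
        elif region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, last))
    return regions
```

A chimeric junction is expected to show up as an internal region that no mate pair spans, because mates coming from the two fused sequences do not map at a consistent distance from each other.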
The tool “trap_words” is designed to identify overrepresented tandem repeats in shotgun reads. To test it we prepared a simulated shotgun dataset from the human genome (100X).
We ran the program:
trap_words.pl -i shotgun.fastq -k 7 > motifs.7.txt
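The core idea of looking for overrepresented words can be sketched as a plain k-mer count over the reads. The code below is an illustration only, not a port of trap_words.pl: it assumes that -k sets the word length, reads standard 4-line FASTQ records, and simply tallies every word of length k. The function names are hypothetical.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def count_kmers(sequences, k):
    """Count every word of length k, skipping words with ambiguous bases."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts
```

A possible usage is to open shotgun.fastq, pass the file object to read_fastq_sequences, call count_kmers with k=7 and inspect counts.most_common(20). Words that belong to a tandem repeat appear together with all their rotations, so a repeat shows up as a group of related high-count words rather than as a single one.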