How to use Scipio (version 1.0)

Requirements

What the Scipio script does

The intention of the script is to locate protein query sequences in DNA targets.

BLAT does most of the computational work, however, there is some important postprocessing that is performed by the script. This includes the following tasks:

Usage

scipio.pl [〈options〉] 〈target〉 〈query〉
〈target〉 is a DNA file
〈query〉 is a protein file
both in FASTA format
Options: --blat_output=〈filename〉 name of BLAT output file (.psl); if not specified, defaults to the newest psl file in the working directory, or to Scipio〈nnnn〉_blat.psl if none exists
--force_new run BLAT even if output file already exists (the default is never to run BLAT if the specified or default psl file would be overwritten)
--verbose show verbose information (including progress)
--min_score=〈value〉 minimal score of the best partial hit for a query
--min_identity=〈value〉 minimal identity in any hit (default is 0.9)
--max_mismatch=〈value〉 maximal number of mismatches in any hit (0 means any number, the default)
--split_on_fs do not join matchings separated by frameshifts (the default is to consider them a single exon)
--region_size=〈value〉 size of shown upstream/downstream sequence parts (defaults to 1000, maximum is 75000)
--force_score=〈value〉 specify a score (between 0 and 1) that forces a hit to be shown even if it is contradicting a better one
(these hits are not assembled together and shown as 〈queryname〉_(1), etc.)
--partial_target accept BLAT output files containing hits referring to nonexistent target sequences
--show=〈list〉 which keys are to be shown in YAML output (details see below)
--show_intron=〈list〉
--show_exon=〈list〉
--show_gap=〈list〉
--hide_undef hide keys with undefined or empty list values
--hide_defaults hide some keys if they have a default value
Options passed to BLAT (ignored when BLAT is not run):
--blat_bin=〈name〉 name of BLAT executable, defaults to "blat"
--blat_params=〈params〉 parameters passed to BLAT
--blat_tilesize=〈..〉 (see BLAT usage information)
--blat_score=〈..〉
--blat_identity=〈..〉

Output overview

The script produces an output in YAML format. Additional scripts are supplied that transform the YAML output into other formats (e.g. GFF3). The output is organized as follows:

For each query, a list of the matched parts is given:

〈query 1〉:
- 〈1st BLAT hit〉
- 〈2nd BLAT hit〉
...

〈query 2〉:
- 〈1st BLAT hit〉

etc.

Usually, there is only one matched part (= BLAT hit) for each query, except in cases where queries are found on multiple non-overlapping targets (contigs).

At the end of the output, a list of the unmatched sequences is shown.


Description of keys

Keys for BLAT hits
Every BLAT hit is described by the following keys:
IDThe id of the BLAT hit (equals the line number in the psl file).
statusOne of "complete", "partial", "incomplete" or "manual". "complete" means that Scipio had no problems locating the query, "partial" means that the hit is on one of multiple targets each matching a part of the query, "incomplete" means there might be the need to edit it manually. "manual" can be entered if the output was modified by hand.
reasonIf status is "incomplete", the reason why.
prot_lenThe length of the query (in amino acid coordinates).
prot_startStart of the matched part of the query if larger than zero.
prot_endEnd of the matched part of the query if less than prot_len.
prot_seqThe query sequence (in a partial hit: only the matched part).
targetThe name of the target sequence.
target_lenThe total length of the target sequence.
strand"+" (= forward) or "-" (= reverse).
dna_startThe location of the hit.
dna_end
matchesThe number of matches if less than prot_len.
mismatchesThe number of mismatches.
undeterminedThe number of undetermined residues.
unmatchedThe number of unmatched query residues.
additionalThe number of additional residues in the translated target.
scoreThe score of the hit.
upstreamDNA Sequence upstream of hit, ending before start codon.
upstream_gap(unaligned) DNA Sequence preceding a partial hit.
matchingsThe locations of exons and introns, see next section.
stopcodonStop codon if present.
downstreamDNA Sequence downstream of hit, starting with stop codon if present.
downstream_gap (unaligned) DNA Sequence following a partial hit.

The commandline parameter --show=〈comma-separated list〉 can be used to choose a user-defined collection of keys to be shown. By default, all of the above are shown.


Keys for matchings
Every matching (intron/exon) is given with the following keys:
type"intron","exon" or "gap".
nucl_startLocation in the query (in nucleotide coordinates; nucl_end not for introns).
[nucl_end]
dna_startLocation in the target.
dna_end
seqDNA sequence of the feature.

Keys that appear only in exons:
seqshiftsLocation of sequence shifts.
mismatchlistPositions of mismatches (in query coordinates).
undeterminedlistPositions of undetermined residues (in query coordinates).
inframe_stopcodonsPositions of in-frame stopcodons (in query coordinates).
translationTranslation of the aligned part of the DNA sequence.
overlapNumber of nucleotides at the end of the target that are identical with the beginning of the following target, and not considered part of this location.

The commandline parameters --show_exon=〈...〉, --show_intron=〈...〉, --show_gap=〈...〉 can be used to choose user-defined output for matchings. The following additional keys can be activated by this:
nucl_pos In introns, a synonym for nucl_start(=nucl_end)
prot_start The location transformed into residue coordinates rather than nucleotides. A remainder of 1 is rounded down, and 2 is rounded up.
prot_end
prot_pos(this one for introns)
prot_seqPart of the query matching an exon, or unmatched part in a gap.

Sequence coordinate conventions

All nucleotide and residue coordinates specify the number of preceding letters, that is, unlike in GFF, a location specified as starting at 5 and ending at 9 has length four, starting with the sixth and ending with the ninth character. (GFF would show this as start=6 end=9).
Locations on reverse strand are specified by negative numbers that represent the distance from the end of the reverse strand. Hence, a location specified as starting at -9 and ending at -5, refers to the reverse complement of the previous example (GFF would show this also as start=6 end=9).
Amino acid coordinates are rounded the following way: If a location contains a split codon, our convention is to count it if its second nucleotide belongs to the location. Consequently, in reading frame 2 (meaning the last two nucleotides of the split codon are part of the next location) the start location is rounded down and the codon is considered part of the next location; conversely, a reading frame of 1 means that the codon is considered part of the previous location, and the starting point for the next location is rounded up when given in amino acid / codon coordinates.