The intention of the script is to locate protein query sequences in DNA targets.
BLAT does most of the computational work, however, there is some important postprocessing that is performed by the script. This includes the following tasks:
scipio.pl [〈options〉] 〈target〉 〈query〉 | ||
〈target〉 is a DNA file | ||
〈query〉 is a protein file | ||
both in FASTA format | ||
Options: | --blat_output=〈filename〉 | name of BLAT output file (.psl); if not specified, defaults to the newest psl file in the working directory, or to Scipio〈nnnn〉_blat.psl if none exists |
--force_new | run BLAT even if output file already exists (the default is never to run BLAT if the specified or default psl file would be overwritten) | |
--verbose | show verbose information (including progress) | |
--min_score=〈value〉 | minimal score of the best partial hit for a query | |
--min_identity=〈value〉 | minimal identity in any hit (default is 0.9) | |
--max_mismatch=〈value〉 | maximal number of mismatches in any hit (0 means any number, the default) | |
--split_on_fs | do not join matchings separated by frameshifts (the default is to consider them a single exon) | |
--region_size=〈value〉 | size of shown upstream/downstream sequence parts (defaults to 1000, maximum is 75000) | |
--force_score=〈value〉 | specify a score (between 0 and 1) that forces a hit to be shown even if it is contradicting a better one (these hits are not assembled together and shown as 〈queryname〉_(1), etc.) | |
--partial_target | accept BLAT output files containing hits referring to nonexistent target sequences | |
--show=〈list〉 | which keys are to be shown in YAML output (details see below) | |
--show_intron=〈list〉 | ||
--show_exon=〈list〉 | ||
--show_gap=〈list〉 | ||
--hide_undef | hide keys with undefined or empty list values | |
--hide_defaults | hide some keys if they have a default value | |
Options passed to BLAT (ignored when BLAT is not run): | ||
--blat_bin=〈name〉 | name of BLAT executable, defaults to "blat" | |
--blat_params=〈params〉 | parameters passed to BLAT | |
--blat_tilesize=〈..〉 | (see BLAT usage information) | |
--blat_score=〈..〉 | ||
--blat_identity=〈..〉 |
The script produces an output in YAML format. Additional scripts are supplied that transform the YAML output into other formats (e.g. GFF3). The output is organized as follows:
For each query, a list of the matched parts is given:
etc.
Usually, there is only one matched part (= BLAT hit) for each query, except in cases where queries are found on multiple non-overlapping targets (contigs).
At the end of the output, a list of the unmatched sequences is shown.
ID | The id of the BLAT hit (equals the line number in the psl file). |
status | One of "complete", "partial", "incomplete" or "manual". "complete" means that Scipio had no problems locating the query, "partial" means that the hit is on one of multiple targets each matching a part of the query, "incomplete" means there might be the need to edit it manually. "manual" can be entered if the output was modified by hand. |
reason | If status is "incomplete", the reason why. |
prot_len | The length of the query (in amino acid coordinates). |
prot_start | Start of the matched part of the query if larger than zero. |
prot_end | End of the matched part of the query if less than prot_len. |
prot_seq | The query sequence (in a partial hit: only the matched part). |
target | The name of the target sequence. |
target_len | The total length of the target sequence. |
strand | "+" (= forward) or "-" (= reverse). |
dna_start | The location of the hit. |
dna_end | |
matches | The number of matches if less than prot_len. |
mismatches | The number of mismatches. |
undetermined | The number of undetermined residues. |
unmatched | The number of unmatched query residues. |
additional | The number of additional residues in the translated target. |
score | The score of the hit. |
upstream | DNA Sequence upstream of hit, ending before start codon. |
upstream_gap | (unaligned) DNA Sequence preceding a partial hit. |
matchings | The locations of exons and introns, see next section. |
stopcodon | Stop codon if present. |
downstream | DNA Sequence downstream of hit, starting with stop codon if present. |
downstream_gap | (unaligned) DNA Sequence following a partial hit. |
The commandline parameter --show=〈comma-separated list〉 can be used to choose a user-defined collection of keys to be shown. By default, all of the above are shown.
type | "intron","exon" or "gap". |
nucl_start | Location in the query (in nucleotide coordinates; nucl_end not for introns). |
[nucl_end] | |
dna_start | Location in the target. |
dna_end | |
seq | DNA sequence of the feature. |
seqshifts | Location of sequence shifts. |
mismatchlist | Positions of mismatches (in query coordinates). |
undeterminedlist | Positions of undetermined residues (in query coordinates). |
inframe_stopcodons | Positions of in-frame stopcodons (in query coordinates). |
translation | Translation of the aligned part of the DNA sequence. |
overlap | Number of nucleotides at the end of the target that are identical with the beginning of the following target, and not considered part of this location. |
The commandline parameters --show_exon=〈...〉, --show_intron=〈...〉, --show_gap=〈...〉 can be used to choose user-defined output for matchings. The following additional keys can be activated by this:
nucl_pos | In introns, a synonym for nucl_start(=nucl_end) |
prot_start | The location transformed into residue coordinates rather than nucleotides. A remainder of 1 is rounded down, and 2 is rounded up. |
prot_end | |
prot_pos | (this one for introns) |
prot_seq | Part of the query matching an exon, or unmatched part in a gap. |
All nucleotide and residue coordinates specify the number of preceding letters, that
is, unlike in GFF, a location specified as starting at 5 and ending at 9 has length four,
starting with the sixth and ending with the ninth character. (GFF would show this as
start=6 end=9).
Locations on reverse strand are specified by negative numbers that represent the distance
from the end of the reverse strand. Hence, a location specified as starting at -9 and
ending at -5, refers to the reverse complement of the previous example (GFF would show
this also as start=6 end=9).
Amino acid coordinates are rounded the following way:
If a location contains a split codon, our convention is to count it
if its second nucleotide belongs to the location. Consequently, in reading frame 2
(meaning the last two nucleotides of the split codon are part of the next location) the start
location is rounded down and the codon is considered part of the next location; conversely,
a reading frame of 1 means that the codon is considered part of the previous
location, and the starting point for the next location is rounded up when given in amino acid / codon coordinates.