What is M-SALSA?
Multiple sequence alignment (MSA) is an important problem in bioinformatics. Its purpose is to find relationships among residues of biological sequences connected by an evolutionary, structural or functional relationship.
M-SALSA (Sequence Alignment by Local Search Algorithm) is a stochastic local search algorithm that improves alignments generated by other MSA programs.
Two versions of M-SALSA are available: one implemented in C++ and the other in Java. A web application is also available and offers an easier user experience: it automatically calls an external program that calculates the initial alignment to be enhanced by M-SALSA.
How to install
For installation instructions, visit the M-SALSA wiki.
Contribution
To report a bug or suggest an enhancement, visit the M-SALSA issue tracker.
Command line usage
C++ version
To use the M-SALSA command line version, use this syntax:
M-SALSA [PARAMETERS]
Each parameter is specified by writing the option name followed by a space and the option value:
-optionName option_value
An example of M-SALSA usage is:
M-SALSA -inputFile BB11001.tfa -outputFile BB11001_final.tfa -phTreeFile BB11001.dnd -scoringMatrix BLOSUM80
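When M-SALSA is driven from a script, the `-optionName option_value` convention can be assembled programmatically. The following is a minimal sketch in Python; the helper name `build_command` is hypothetical and not part of M-SALSA, and only options listed in this README are used.

```python
def build_command(executable, options):
    """Build an argument list of the form
    [executable..., "-optionName", "option_value", ...].

    `executable` is a list so that multi-word launchers also work.
    `build_command` is a hypothetical helper, not part of M-SALSA.
    """
    command = list(executable)
    for name, value in options.items():
        # Each option is its name prefixed by "-" followed by its value.
        command += ["-" + name, str(value)]
    return command


cmd = build_command(
    ["M-SALSA"],
    {
        "inputFile": "BB11001.tfa",
        "outputFile": "BB11001_final.tfa",
        "phTreeFile": "BB11001.dnd",
        "scoringMatrix": "BLOSUM80",
    },
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, assuming the M-SALSA executable is reachable on the `PATH`. For the Java version the same helper can be called with `["java", "-jar", "m-salsa-cli.jar"]` as the executable.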
Mandatory parameters:
- -inputFile
- path of a file containing the initial alignment. The file must be in FASTA format
- -outputFile
- path of the output file. This will be in FASTA format as well
- -phTreeFile
- file containing the guide tree, used by M-SALSA to generate correct weights for the WSP-Score. The file must be in Newick format
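To illustrate what the weights are used for, here is a rough sketch of a weighted sum-of-pairs score, assuming WSP-Score refers to the usual weighted sum-of-pairs objective. The pair weights, the toy match/mismatch values and the linear gap cost are all assumptions made for the example; M-SALSA derives its weights from the guide tree and uses a substitution matrix with GOP/GEP penalties instead.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Toy score of two aligned rows (hypothetical values, linear gap cost)."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both rows contributes nothing
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def wsp_score(rows, weights):
    """Sum of pairwise scores, each multiplied by the weight of its pair."""
    return sum(
        weights[(i, j)] * pair_score(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    )


rows = ["AC-T", "A-GT", "ACGT"]
# Hypothetical weights; in M-SALSA they would come from the guide tree.
weights = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 2.0}
print(wsp_score(rows, weights))
```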
Optional parameters:
- -GOP
- GAP Opening Penalty (default 8)
- -GEP
- GAP Extension Penalty (default 5)
- -gamma
- size of the range of positions within which a GAP can move during an iteration (default 30)
- -type
- type of sequences. Possible options are DNA, RNA and PROTEINS (default PROTEINS)
- -scoringMatrix
- scoring matrix (default BLOSUM62). Available matrices are: BLOSUM30, BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, IUB, PAM20, PAM60, PAM120, PAM350.
- -scoringMatrixPath
- scoring matrix file. For more information on the scoring matrix file format, visit the M-SALSA wiki
- -matrixSerie
- matrix series. Possible options: BLOSUM or PAM (default BLOSUM)
- -distanceMatrix
- distance matrix file
- -minIt
- minimum number of iterations (default 1000)
- -pSplit
- probability of split (default 0.1)
- -terminal
- the strategy to be used to manage terminal GAPs. Possible values:
- ONLY_GEP: GOP=0 only for terminal GAPs (default value)
- BOTH: both opening and extension penalty for terminal GAPs
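The effect of the two `-terminal` strategies can be sketched as follows. This is only an illustration: it assumes an affine gap cost of the form GOP + GEP × length, which this README does not state explicitly, and the function name `gap_cost` is hypothetical.

```python
def gap_cost(length, terminal=False, gop=8, gep=5, strategy="ONLY_GEP"):
    """Cost of a single gap of `length` positions.

    Assumption: cost = GOP + GEP * length (the exact formula used by
    M-SALSA is not given here). Defaults mirror -GOP 8 and -GEP 5.
    """
    if strategy not in ("ONLY_GEP", "BOTH"):
        raise ValueError("strategy must be ONLY_GEP or BOTH")
    if length <= 0:
        return 0
    # ONLY_GEP: the opening penalty is dropped for terminal gaps only.
    opening = 0 if (terminal and strategy == "ONLY_GEP") else gop
    return opening + gep * length


print(gap_cost(3))                                   # internal gap
print(gap_cost(3, terminal=True))                    # terminal gap, ONLY_GEP
print(gap_cost(3, terminal=True, strategy="BOTH"))   # terminal gap, BOTH
```

Under these assumptions, a terminal gap costs less than an internal gap of the same length with ONLY_GEP, and exactly the same with BOTH.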
Substitution matrix choice:
If no optional parameters are provided, M-SALSA uses BLOSUM62 as the substitution matrix. If the user specifies the -type parameter, the default matrix is chosen accordingly: BLOSUM62 for PROTEINS and IUB for DNA, while no default matrix exists for RNA. Therefore, if the input consists of RNA sequences, the user must provide a substitution matrix.
For the PROTEINS type it is also possible to specify a distance matrix, that is, a matrix containing the distance between each pair of sequences in the alignment. M-SALSA uses this matrix to automatically infer the appropriate substitution matrix. The chosen matrix also depends on the -matrixSerie parameter, which can be BLOSUM or PAM (BLOSUM is used if it is not specified).
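The default selection rules described above can be summarised in a few lines of Python. This is a sketch of the documented behaviour, not M-SALSA's actual code: `choose_matrix` is a hypothetical name, and the inference from a distance matrix is deliberately left out because its rules are not described here.

```python
def choose_matrix(seq_type="PROTEINS", scoring_matrix=None):
    """Return the substitution matrix name implied by the options.

    Hypothetical helper mirroring the rules in this README; the
    -distanceMatrix / -matrixSerie inference is not modelled.
    """
    if scoring_matrix is not None:
        return scoring_matrix  # an explicit -scoringMatrix always wins
    defaults = {"PROTEINS": "BLOSUM62", "DNA": "IUB"}
    if seq_type == "RNA":
        raise ValueError("RNA has no default matrix: provide one explicitly")
    if seq_type not in defaults:
        raise ValueError("type must be DNA, RNA or PROTEINS")
    return defaults[seq_type]


print(choose_matrix())                      # no options given
print(choose_matrix("DNA"))                 # -type DNA
print(choose_matrix("PROTEINS", "PAM120"))  # explicit -scoringMatrix
```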
Java version
The M-SALSA Java implementation requires Java version > 1.8, which can be downloaded from Java download.
The command line call of the Java version of M-SALSA is very similar to that of the C++ version; see the C++ section for more information. The Java version accepts the same parameters as the C++ version, plus some additional parameters described in the next section.
An example of M-SALSA usage is:
java -jar m-salsa-cli.jar -inputFile BB11001.tfa -outputFile BB11001_final.fasta -phTreeFile BB11001.dnd
More examples are available on the wiki page Examples-Java.
Java parameters:
The Java version can call Clustal to perform the pre-alignment required by M-SALSA. This makes it unnecessary to provide -phTreeFile, because this data is generated using Clustal.
The pre-alignment can be performed with either ClustalOmega or ClustalW2.
With this approach, the -inputFile can be in any format accepted by Clustal, not only FASTA: NBRF/PIR, EMBL/UniProt, Pearson (FASTA), GDE, ALN/Clustal, GCG/MSF, RSF (see the Clustal help pages for details about formats).
- -clustalOPath
- path where ClustalOmega is installed. Used to perform the pre-alignment
- -clustalWPath
- path where ClustalW2 is installed. Used to perform the pre-alignment only if -clustalOPath is not set. Required to generate the tree file
- -generatePhTree
- specifies whether the phylogenetic neighbour-joining tree file must be generated. Requires the ClustalW2 path to be defined, either via -clustalWPath or by using ClustalW2 for the alignment
- -help
- prints the documentation of all commands
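The way the Clustal options interact can be sketched as a small decision function. This reflects one reading of the option descriptions above, not the actual Java implementation; `plan_prealignment` is a hypothetical name.

```python
def plan_prealignment(clustal_o_path=None, clustal_w_path=None,
                      generate_ph_tree=False):
    """Return which Clustal program would perform the pre-alignment.

    Assumed reading of the options: ClustalOmega takes precedence when
    -clustalOPath is set, and tree generation needs the ClustalW2 path.
    Returns None when no Clustal path is given (input already aligned).
    """
    if generate_ph_tree and clustal_w_path is None:
        raise ValueError("-generatePhTree requires the ClustalW2 path")
    if clustal_o_path is not None:
        return "ClustalOmega"
    if clustal_w_path is not None:
        return "ClustalW2"
    return None


# The paths below are placeholders, not real installation locations.
print(plan_prealignment(clustal_o_path="/opt/clustalo"))
print(plan_prealignment(clustal_w_path="/opt/clustalw2", generate_ph_tree=True))
```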
Source Code Documentation
The source code documentation is available at http://salsa-w.github.io/M-SALSA/apidocs
More information about documentation generation is available at https://github.com/SALSA-W/M-SALSA/wiki/API-Docs
Tests on BAliBASE
Tests were performed on the BAliBASE v3 dataset.
Instructions on how to perform the tests can be found on the M-SALSA wiki.
License
Copyright 2015 Alessandro Daniele, Fabio Cesarato, Andrea Giraldin
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.