What is M-SALSA?

Multiple sequence alignment (MSA) is an important problem in bioinformatics. Its purpose is to find relationships among residues of biological sequences connected by an evolutionary, structural or functional relationship.

M-SALSA (Sequence Alignment by Local Search Algorithm) is a stochastic local search algorithm that improves alignments generated by other MSA programs.

Two versions of M-SALSA are available: one implemented in C++ and the other in Java. A web application is also available; it offers an easier user experience and automatically calls an external program that calculates the initial alignment to be enhanced by M-SALSA.

How to install

For installation instructions, visit the M-SALSA wiki.

Contribution

To report a bug or suggest an enhancement, visit the M-SALSA issue tracker.

Command line usage

C++ version

To use the M-SALSA command line version, use the following syntax:
M-SALSA [PARAMETERS]
Parameters are specified by writing the option name followed by a space and the option value:
-optionName option_value
An example of M-SALSA usage is:
M-SALSA -inputFile BB11001.tfa -outputFile BB11001_final.tfa -phTreeFile BB11001.dnd -scoringMatrix BLOSUM80
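
When M-SALSA is driven from a script, the "-optionName option_value" convention maps naturally onto an argument list. The following is a minimal Python sketch, assuming an executable named M-SALSA is reachable on the PATH. The helper names build_command and run_msalsa are hypothetical and are not part of M-SALSA.

```python
import subprocess


def build_command(input_file, output_file, ph_tree_file,
                  executable="M-SALSA", **options):
    """Return the argument list for one M-SALSA run.

    Extra keyword arguments become "-name value" pairs,
    for example scoringMatrix="BLOSUM80".
    """
    cmd = [
        executable,
        "-inputFile", str(input_file),
        "-outputFile", str(output_file),
        "-phTreeFile", str(ph_tree_file),
    ]
    for name, value in options.items():
        cmd += ["-" + name, str(value)]
    return cmd


def run_msalsa(*args, **kwargs):
    """Run M-SALSA; raises CalledProcessError on a non-zero exit status."""
    # Assumption: M-SALSA reports failure through its exit status.
    subprocess.run(build_command(*args, **kwargs), check=True)
```

Calling build_command("BB11001.tfa", "BB11001_final.tfa", "BB11001.dnd", scoringMatrix="BLOSUM80") reproduces the example invocation above as a list of arguments, which avoids shell quoting problems with paths that contain spaces.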

Mandatory parameters:

-inputFile: path of the file containing the initial alignment. The file must be in FASTA format
-outputFile: path of the output file. It is written in FASTA format as well
-phTreeFile: path of the file containing the guide tree, which M-SALSA uses to generate correct weights for the WSP-Score. The file must be in Newick format
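
Because -inputFile must already contain an alignment, it can be useful to check the file before starting a run. The sketch below is an illustrative Python check, not M-SALSA's own parser: it reads a FASTA file and verifies that every row has the same length, which holds for any aligned FASTA file. The function names read_fasta and is_alignment are hypothetical, and requiring at least two sequences is an assumption of this sketch.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_alignment(records):
    """True if there are at least two rows and all rows have equal length."""
    lengths = {len(sequence) for _, sequence in records}
    return len(records) >= 2 and len(lengths) == 1
```

A file that fails this check is most likely a set of unaligned sequences and needs to be aligned by another MSA program first.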

Optional parameters:

-GOP

GAP Opening Penalty (default 8)

-GEP

GAP Extension Penalty (default 5)

-gamma

size of the range of positions within which a GAP can move during an iteration (default 30)

-type

type of sequences. Possible options are DNA, RNA and PROTEINS (default PROTEINS)

-scoringMatrix

scoring matrix (default BLOSUM62). Available matrices are: BLOSUM30, BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, IUB, PAM20, PAM60, PAM120, PAM350.

-scoringMatrixPath

scoring matrix file. For more information on the scoring matrix file format, visit the M-SALSA wiki

-matrixSerie

matrix series. Possible options: BLOSUM or PAM (default BLOSUM)

-distanceMatrix

distance matrix file

-minIt

minimum number of iterations (default 1000)

-pSplit

probability of split (default 0.1)

-terminal

the strategy to be used to manage terminal GAPs. Possible values:

ONLY_GEP: GOP=0 only for terminal GAPs (default value)
BOTH: both opening and extension penalty for terminal GAPs
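
The effect of the two -terminal strategies can be illustrated with a small Python sketch. It assumes the common affine gap model in which a gap of length n costs GOP + n * GEP; this formula is an assumption and may differ from the one M-SALSA actually implements. The functions gap_penalty and row_penalty are hypothetical names used only for illustration.

```python
import re


def gap_penalty(length, terminal, strategy="ONLY_GEP", gop=8, gep=5):
    """Penalty of one gap run under an assumed affine model (GOP + n * GEP)."""
    if strategy not in ("ONLY_GEP", "BOTH"):
        raise ValueError("unknown terminal strategy: " + strategy)
    if length <= 0:
        return 0
    # ONLY_GEP: the opening penalty is dropped for terminal gaps only.
    opening = 0 if (terminal and strategy == "ONLY_GEP") else gop
    return opening + gep * length


def row_penalty(row, strategy="ONLY_GEP", gop=8, gep=5):
    """Sum the penalties of all gap runs ('-') in one aligned row."""
    total = 0
    for match in re.finditer(r"-+", row):
        # A run is terminal if it touches either end of the row.
        terminal = match.start() == 0 or match.end() == len(row)
        total += gap_penalty(match.end() - match.start(), terminal,
                             strategy, gop, gep)
    return total
```

Under these assumptions and the default penalties, the row "--AC-G-" scores 28 with ONLY_GEP (10 + 13 + 5) and 44 with BOTH (18 + 13 + 13), so ONLY_GEP makes gaps at the ends of a sequence cheaper than internal ones.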

Substitution matrix choice:

If no optional parameters are provided, M-SALSA uses BLOSUM62 as the default substitution matrix. If the user specifies the type parameter, the default matrix is chosen accordingly: for PROTEINS it is still BLOSUM62, for DNA it is IUB, and for RNA no default matrix exists. Therefore, if the input consists of RNA sequences, the user must provide a substitution matrix.
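
The default selection rules in the paragraph above can be summarised as a short Python sketch. It mirrors the documented behaviour only and is not taken from the M-SALSA source; resolve_matrix is a hypothetical name, and its scoring_matrix argument stands for any matrix the user supplies explicitly.

```python
# Defaults documented for each sequence type; RNA deliberately has none.
DEFAULT_MATRIX = {"PROTEINS": "BLOSUM62", "DNA": "IUB"}
SEQUENCE_TYPES = ("DNA", "RNA", "PROTEINS")


def resolve_matrix(seq_type="PROTEINS", scoring_matrix=None):
    """Return the substitution matrix to use for a run."""
    if seq_type not in SEQUENCE_TYPES:
        raise ValueError("unknown sequence type: " + seq_type)
    if scoring_matrix is not None:
        # An explicit user choice always wins over the defaults.
        return scoring_matrix
    if seq_type not in DEFAULT_MATRIX:
        raise ValueError(seq_type + " has no default matrix; provide one")
    return DEFAULT_MATRIX[seq_type]
```

With no arguments the function returns BLOSUM62, with seq_type="DNA" it returns IUB, and with seq_type="RNA" it raises an error unless a matrix is given.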

For the PROTEINS type it is also possible to specify a distance matrix, that is, a matrix containing the distance between each pair of sequences in the alignment. M-SALSA uses this matrix to automatically infer the appropriate substitution matrix. The chosen matrix also depends on the matrixSerie parameter, which can be BLOSUM or PAM (if not specified, a BLOSUM matrix is used).

Java version

The M-SALSA Java implementation requires Java version > 1.8, which can be downloaded from Java download

The command line call of the Java version of M-SALSA is very similar to the C++ one; see the C++ section for more information. The Java version accepts the same parameters as the C++ version, plus additional parameters described in the Java parameters section.

An example of M-SALSA usage is:
java -jar m-salsa-cli.jar -inputFile BB11001.tfa -outputFile BB11001_final.fasta -phTreeFile BB11001.dnd

More examples are available on the wiki page Examples-Java

Java parameters:

The Java version can call Clustal to perform the pre-alignment required by M-SALSA. This makes -phTreeFile unnecessary, because the guide tree is generated by Clustal.

The pre-alignment can be performed with either ClustalOmega or ClustalW2.

With this approach, the -inputFile can be in a format other than FASTA, as long as Clustal accepts it: NBRF/PIR, EMBL/UniProt, Pearson (FASTA), GDE, ALN/Clustal, GCG/MSF, RSF (see the Clustal help pages for details about formats).

-clustalOPath: path where ClustalOmega is installed. Used to perform the pre-alignment
-clustalWPath: path where ClustalW2 is installed. Used to perform the pre-alignment only if -clustalOPath is not set. Required to generate the tree file
-generatePhTree: whether the phylogenetic neighbour-joining tree file must be generated. Requires the ClustalW2 path to be defined, either through -clustalWPath or by using ClustalW2 for the alignment
-help: print the documentation of all commands
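
The way the Clustal options interact can be sketched in Python as follows. This is one reading of the parameter descriptions above, not code from the Java implementation: ClustalOmega is preferred when its path is set, ClustalW2 is the fallback, and tree generation needs the ClustalW2 path. The name choose_prealigner is hypothetical.

```python
def choose_prealigner(clustal_o_path=None, clustal_w_path=None,
                      generate_ph_tree=False):
    """Return "ClustalOmega", "ClustalW2" or None (no pre-alignment)."""
    # Assumption: tree generation is only possible with a ClustalW2 path.
    if generate_ph_tree and clustal_w_path is None:
        raise ValueError("-generatePhTree requires the ClustalW2 path")
    if clustal_o_path is not None:
        # ClustalOmega takes precedence when both paths are given.
        return "ClustalOmega"
    if clustal_w_path is not None:
        return "ClustalW2"
    return None
```

When the function returns None, no pre-alignment is performed, so the input must already be a FASTA alignment and -phTreeFile must be supplied as in the C++ version.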

Source Code Documentation

The source code documentation is available at http://salsa-w.github.io/M-SALSA/apidocs

More information about documentation generation is available at https://github.com/SALSA-W/M-SALSA/wiki/API-Docs

Tests on BAliBASE

Tests were performed on BAliBASE v3 dataset.

Instructions on how to perform the tests can be found on the M-SALSA wiki.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
