This tutorial allows you to execute the proposed methodology step by step. Section 1 provides all necessary data (flat files and a MySQL database dump). Section 2 presents instructions for the descriptive multiclass experiment, and Sections 3 and 4 detail the steps of the predictive multiclass experiment and the common source experiment, respectively. MySQL 5.5.24, R 2.10.1 and Weka 3.6.2 were used, so they must be installed. All data files and programs are available.

1) Retrieve UniProt/SwissProt data

a) Restore database

tar xvfj backup_sabrina.tar.bz2
# uncompress dump file

mysql sabrina < backup_sabrina.sql
# restore the 'sabrina' database from the dump (the empty database must exist first: run CREATE DATABASE sabrina; in the MySQL shell)

CREATE USER 'sabrina' IDENTIFIED BY 'bioinfo12';
# execute in the MySQL shell to create the user 'sabrina' with password 'bioinfo12'

b) Generate text file repository

tar xvfj uniprot.tar.bz2
# uncompress 44 Swiss-Prot .dat files into 'uniprot' directory

ls /.../uniprot/* > file0
# file0 = list of paths for 44 Swiss-Prot .dat files in 'uniprot' directory

java -jar divideBasesUniprot.jar file0 dir1
# split each Swiss-Prot .dat file into one flat file per entry
# dir1 = directory that will contain one subdirectory per Swiss-Prot release, holding that release's flat files

java -jar fazListaIdsTodasVersoesUniprot.jar dir1 dir2
# make a list of Swiss-Prot ids for all releases.
# dir2 = directory that will contain the list of ids for each Swiss-Prot release.

java -jar matchIdsIntersecaoUniprot.jar dir2 dir1 dir3 44
# match Swiss-Prot entries in the intersection of release pairs.
# dir3 = directory with ids in the intersection of releases.
# 44 = index of last release (44 in this case).

java -jar geraDadosMudancaEC.jar dir1 dir3 dir4
# generate EC change data for each release pair.
# dir4 = directory that will contain EC change data for each release pair.

java -jar geraListaMudancasAcima10exemplos.jar file1 file2 file3
# generate list of changes with at least 10 examples.
# file1 = file that will contain list of changes in which EC numbers are different
# file2 = file that will contain list with types of changes and releases in which they occur
# file3 = file that will contain list of EC changes with at least 10 examples in the whole data set
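The filtering step above keeps only change types that are frequent enough to learn from. The idea can be sketched in Python; the (old EC, new EC) pair layout is an assumption for illustration, not the jar's actual file format:

```python
from collections import Counter

def filter_frequent_changes(changes, min_examples=10):
    """Keep only EC change types observed at least `min_examples` times.

    `changes` is a list of (old_ec, new_ec) pairs, one per changed entry;
    this layout is a hypothetical stand-in for the jar's input format.
    """
    counts = Counter(changes)
    return {change: n for change, n in counts.items() if n >= min_examples}

# Example: one change type with 12 examples, another with only 3.
sample = [("1.1.1.1", "1.1.1.2")] * 12 + [("2.7.1.1", "2.7.1.2")] * 3
frequent = filter_frequent_changes(sample)
```

Only the first change type survives the 10-example threshold.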

2) Execute descriptive multiclass experiment

a) Generate data matrix that will be reduced via SVD

java -jar geraMatrizes_11_n2.jar file3 dir5 dir6
# dir5 = directory that will contain data matrices for each EC change type and releases in which they occur

java -jar juntaTreinoTeste.jar file3 dir5 dir7
# dir7 = directory with data matrices for each EC change (containing all releases)

ls /.../dir7/* > file4
# file4 = list of data matrices

java -jar tiraColuna1.jar file4 , dir8
# dir8 = data matrices without first column

ls /.../dir8/* > file5
# file5 = list of data matrices without first column

java -jar mergeAtributos.jar file5 file6
# file6 = one data matrix comprising the data from all studied EC change types

java -jar tiraCabecalho.jar file6 file7
# file7 = file6 without header
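The three matrix-preparation steps above (drop the first, ID column; merge the per-change matrices; strip the header) can be sketched in Python. The comma-separated layout is an assumption based on the `,` argument passed to tiraColuna1.jar:

```python
def drop_first_column(rows, sep=","):
    """Remove the first (ID) column from each separated row, as tiraColuna1.jar does."""
    return [sep.join(row.split(sep)[1:]) for row in rows]

def merge_matrices(matrices):
    """Stack several matrices (lists of rows) into one, keeping a single header,
    as mergeAtributos.jar presumably does."""
    merged = [matrices[0][0]]
    for m in matrices:
        merged.extend(m[1:])  # skip each matrix's own header row
    return merged

def strip_header(rows):
    """Drop the header row, as tiraCabecalho.jar presumably does."""
    return rows[1:]

# Two toy per-change matrices with an ID column and a shared header.
m1 = ["id,f1,f2", "a,1,2", "b,3,4"]
m2 = ["id,f1,f2", "c,5,6"]
merged = merge_matrices([drop_first_column(m1), drop_first_column(m2)])
data = strip_header(merged)
```

The result is one headerless matrix ready for the SVD step.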

b) Reduce data matrices via SVD

mkdir dir9
# create dir9
# create infile inside dir9
# infile = text file containing the path of file7
# change to dir9 (the directory that contains infile) and start R

source("svd.R")
# execute SVD with number of singular values varying from 1 to 100 for file list in infile
# svd.R must be executed in the directory containing infile
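svd.R itself is not shown here, but the reduction it performs is a truncated SVD: keep only the first k singular values of the data matrix, for k = 1..100 (R presumably uses its built-in svd()). The idea behind the leading (k = 1) singular triplet can be sketched in pure Python with power iteration; this is an illustration of the technique, not the script's code:

```python
import math

def dominant_singular_triplet(A, iters=200):
    """Power iteration on A^T A: returns (sigma, u, v), the leading
    singular value and singular vectors of matrix A (list of rows)."""
    n = len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        # w = A^T (A v), then normalize: v converges to the top right singular vector
        Av = [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

# Rank-1 reduction of a small matrix (the k = 1 case of the 1..100 sweep).
A = [[3.0, 0.0], [0.0, 1.0]]
sigma, u, v = dominant_singular_triplet(A)
```

For this diagonal matrix the leading singular value is 3, with singular vectors along the first axis.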

c) Classification task

ls /.../dir9/*D.csv > file8
# file8 = list of SVD output files

perl coloca_cabecalho_csv.pl file8
# add header to SVD output files

perl rodaWekaAllClassifiers_crossval.pl file8
# execute the descriptive multiclass experiment with 10-fold cross-validation
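The 10-fold cross-validation that the Perl script asks Weka to run partitions the instances into 10 disjoint folds, each serving once as the test set while the rest train the model. A standard fold assignment looks like this sketch (illustrative only, not the script's code):

```python
def k_fold_indices(n_instances, k=10):
    """Partition instance indices 0..n-1 into k disjoint folds.

    Each fold is used once as the test set; the remaining k-1 folds
    form the training set for that round.
    """
    folds = [[] for _ in range(k)]
    for i in range(n_instances):
        folds[i % k].append(i)
    return folds

folds = k_fold_indices(25, k=10)
```

Every instance appears in exactly one fold, so each is tested exactly once over the 10 rounds.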

java -jar bestResult_3t_corrigido.jar 3 dir9 file9 file10
# choose the best result for each classification algorithm
# file9 = results for technique KNN_K1
# file10 = best result for technique KNN_K1

java -jar bestResult_3t_corrigido.jar 4 dir9 file11 file12
# file11 = results for technique KNN_K3
# file12 = best result for technique KNN_K3

java -jar bestResult_3t_corrigido.jar 5 dir9 file13 file14
# file13 = results for technique KNN_K5
# file14 = best result for technique KNN_K5

java -jar bestResult_3t_corrigido.jar 6 dir9 file15 file16
# file15 = results for technique KNN_K7
# file16 = best result for technique KNN_K7

java -jar bestResult_3t_corrigido.jar 7 dir9 file17 file18
# file17 = results for technique KNN_K10
# file18 = best result for technique KNN_K10

java -jar bestResult_3t_corrigido.jar 2 dir9 file19 file20
# file19 = results for technique Naive Bayes
# file20 = best result for technique Naive Bayes

java -jar bestResult_3t_corrigido.jar 9 dir9 file21 file22
# file21 = results for technique J48
# file22 = best result for technique J48

3) Execute predictive multiclass experiment

java -jar separaListaMudTreinoTeste.jar file3 file23 file24
# split the list of changes into train and test sets (the last release in which a change occurs is held out for test)
# file23 = file with the list of changes, and the releases in which they occur, for the train set
# file24 = file with the list of changes, and the releases in which they occur, for the test set
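The split above holds out, for every change type, its last occurrence for testing. A minimal sketch of this temporal split (the (change, release) pair layout is our assumption, not the jar's format):

```python
def split_by_last_release(occurrences):
    """For each change type, send the occurrence with the highest release
    number to the test set and all earlier occurrences to the train set.

    `occurrences` is a list of (change_type, release_number) pairs.
    """
    last = {}
    for change, release in occurrences:
        if change not in last or release > last[change]:
            last[change] = release
    train, test = [], []
    for change, release in occurrences:
        (test if release == last[change] else train).append((change, release))
    return train, test

occ = [("A", 10), ("A", 20), ("A", 30), ("B", 15), ("B", 25)]
train, test = split_by_last_release(occ)
```

Holding out the latest release simulates predicting future changes from past ones.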

java -jar obtemClassesModeladas.jar file23 file25 file26
java -jar obtemClassesModeladas.jar file24 file27 file28
# keep only modeled changes (F1 > 0.5) for use in the predictive experiments
# file25 = modeled train changes
# file26 = file with list of modeled changes and releases in which they occur for train set
# file27 = modeled test changes
# file28 = file with list of modeled changes and releases in which they occur for test set

a) Generate data matrix that will be reduced via SVD

a.1) Generate train data matrices

java -jar geraMatrizes_11_n2.jar file26 dir10 dir11
# dir10 = directory that will contain train data matrices for each modeled EC change type

java -jar juntaTreinoTeste.jar file26 dir10 dir12
# dir12 = directory with data matrices for each EC change

ls /.../dir12/* > file29
# file29 = list of data matrices

java -jar tiraColuna1.jar file29 dir13
# dir13 = data matrices without first column

ls /.../dir13/* > file30
# file30 = list of data matrices without first column

java -jar mergeAtributos.jar file30 file31
# file31 = one data matrix comprising the data from all modeled EC change types

a.2) Generate test data matrices

java -jar geraMatrizes_11_n2.jar file28 dir14 dir15
# dir14 = directory that will contain test data matrices for each modeled EC change type

java -jar juntaTreinoTeste.jar file28 dir14 dir16
# dir16 = directory with data matrices for each EC change

ls /.../dir16/* > file32
# file32 = list of data matrices

java -jar tiraColuna1.jar file32 dir17
# dir17 = data matrices without first column

ls /.../dir17/* > file33
# file33 = list of data matrices without first column

java -jar mergeAtributos.jar file33 file34
# file34 = one data matrix comprising the data from all modeled EC change types

a.3) Concatenate train and test data matrices

# generate file35
# file35 = file that must contain the path of train (file31) and test (file34) data matrices (1 per line)

java -jar mergeAtributosTT.jar 1 file35 file36
# file36 = merged data matrix from train and test

java -jar tiraCabecalho.jar file36 file37
# file37 = merged data matrix from train and test, without header

b) Reduce data matrices via SVD

mkdir dir18
# create dir18
# create infile inside dir18
# infile = text file containing the path of file37
# change to dir18 (the directory that contains infile) and start R

source("svd.R")
# execute SVD with number of singular values varying from 1 to 100 for file list in infile
# svd.R must be executed in the directory containing infile

c) Classification task

java -jar separaTreinoTeste.jar 45665 3996 dir18 dir19
# separate the train and test data matrices
# 45665 = total number of instances
# 3996 = number of test instances
# dir19 = directory that will contain the separated train and test data matrices
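Given the counts passed above, the split presumably keeps the first 45665 - 3996 = 41669 rows for training and the last 3996 for test. A minimal sketch of such a row split (the row-order assumption is ours, not documented by the jar):

```python
def split_train_test(rows, total, n_test):
    """Split `rows` so that the last `n_test` instances form the test set
    and the remaining leading rows form the train set."""
    assert len(rows) == total, "row count must match the declared total"
    n_train = total - n_test
    return rows[:n_train], rows[n_train:]

# Toy example: 10 instances, last 3 held out for test.
rows = list(range(10))
train, test = split_train_test(rows, 10, 3)
```

With the real counts this yields 41669 train rows and 3996 test rows.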

ls /.../dir19/* > file38
# file38 = list of train and test matrices

perl coloca_cabecalho_csv.pl file38
# add a header to the train and test matrices

java -jar fazListaTTgenerico.jar dir19 file39
# file39 = list of train and test files

java -jar compatibilizaWekaTT_ord.jar file39 dir20
# dir20 = train and test files with ordered classes

ls /.../dir20/* > file40
# file40 = contains list of all train and test matrices

perl csvToArff.pl file40
# make .arff files (Weka format)
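An ARFF file is essentially the CSV data preceded by @relation and @attribute declarations. A minimal sketch of the conversion (the attribute names, the numeric features and the final nominal class column are assumptions about the matrices' layout, not csvToArff.pl's code):

```python
def csv_to_arff(csv_lines, relation="ec_changes"):
    """Build minimal ARFF text from CSV lines whose first row is a header
    and whose last column is the (nominal) class label."""
    header = csv_lines[0].split(",")
    data_rows = [line.split(",") for line in csv_lines[1:]]
    classes = sorted({row[-1] for row in data_rows})
    out = ["@relation " + relation]
    for name in header[:-1]:
        out.append("@attribute %s numeric" % name)  # SVD features are numeric
    out.append("@attribute %s {%s}" % (header[-1], ",".join(classes)))
    out.append("@data")
    out.extend(csv_lines[1:])
    return "\n".join(out)

arff = csv_to_arff(["f1,f2,class", "0.1,0.2,up", "0.3,0.4,down"])
```

The nominal class declaration must list every label that appears in the data, which is why the later compatibiliza steps align train and test headers.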

mkdir dir21
cp /.../dir20/*arff /.../dir21
# create dir21 and copy arff files from dir20 to dir21

java -jar fazListaTTgenerico.jar dir21 file41
# file41 = list of train and test files in .arff format

java -jar compatibilizaTTArff.jar file41 dir22
# dir22 = directory that will contain test matrices with compatible headers

cp /.../dir21/*.treino.* /.../dir22
# copy train data to dir22

java -jar fazListaTTgenerico.jar dir22 file42
# file42 = list of train and test files in .arff format

perl rodaWekaAllClassifiers_ttmulticlass.pl file42
# execute predictive multiclass experiment with train and test data

java -jar bestResult_3t_corrigido.jar <technique number> dir22 file43 file44
# choose the best result for each classification algorithm
# dir22 must be given for all techniques; the last two parameters are technique-specific
# file43 = results for the selected technique
# file44 = best result for the selected technique

# Possible technique numbers:
# Naive Bayes = 2
# KNN_K1 = 3
# KNN_K3 = 4
# KNN_K5 = 5
# KNN_K7 = 6
# KNN_K10 = 7
# J48 = 9
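The best-result selection scans one result per number of singular values and keeps the run with the highest score, i.e. an argmax. A sketch with a hypothetical (k, accuracy) result layout (not the jar's actual output format):

```python
def best_result(results):
    """Return the (k, score) pair with the highest score, where k is the
    number of singular values used for that run."""
    return max(results, key=lambda pair: pair[1])

# Hypothetical scores from the k = 1..100 sweep.
runs = [(1, 0.62), (10, 0.81), (50, 0.79), (100, 0.75)]
best = best_result(runs)
```

Here the run with 10 singular values would be reported as the technique's best result.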


4) Execute predictive common source experiment

a) Generate data matrices that will be reduced via SVD

java -jar preparaExpOrigemComum.jar file28 file37 41669 dir23 file45
# 41669 = number of train instances
# dir23 = directory that will contain changes separated by common source
# file45 = list of ids and their classes

b) Reduce data matrices via SVD

ls /.../dir23/bases_svd/* > infile
# infile = contains list of common source data matrices
# change to dir23/bases_svd/ (the directory that contains infile) and start R

source("svd.R")
# execute SVD with number of singular values varying from 1 to 100 for file list in infile
# svd.R must be executed in the directory containing infile

c) Classification task

ls /.../dir23/bases_svd/*D.csv > file46
# file46 = contains list of SVD output files

java -jar separaTTOrigComum.jar file46 dir24
# dir24 = directory that will contain train and test files separated

ls /.../dir24/* > file47
# file47 = list of the separated train and test files

perl coloca_cabecalho_csv.pl file47
# add a header to the SVD output files

java -jar fazListaTTOrigComum.jar dir24 file48
# file48 = list of train and test files

java -jar compatibilizaWekaTT_ord.jar file48 dir25
# dir25 = train and test files with ordered classes

ls /.../dir25/* > file49
# file49 = contains list of all train and test matrices

perl csvToArff.pl file49
# make .arff files (Weka format)

mkdir dir26
cp /.../dir25/*arff /.../dir26
# create dir26 and copy arff files from dir25 to dir26

java -jar fazListaTTOrigComum_2.jar dir26 file50
# file50 = list of train and test files in .arff format

java -jar compatibilizaTTArff.jar file50 dir27
# dir27 = directory that will contain test matrices with compatible headers

cp /.../dir26/tr* /.../dir27
# copy train data to dir27

java -jar fazListaTTOrigComum_2.jar dir27 file51
# file51 = list of train and test files in .arff format

perl rodaWekaAllClassifiers_ttcommonsource.pl file51
# execute the predictive common source experiment with train and test data

ls /.../dir27/*out* > file52
# file52 = list of classification task output files

java -jar obtemMelhorResultOrigComum.jar file52 dir28
# choose best result for each common source
# dir28 = directory that will contain best result for each common source
