
Tuesday, July 9, 2019

Classifying DNA Sequences

In this post we will explore the world of bioinformatics by using Markov models, the K-nearest neighbors (KNN) algorithm, support vector machines, and other common classifiers to analyze short E. coli DNA sequences. The project uses a dataset from the UCI Machine Learning Repository containing 106 DNA sequences, each made up of 57 sequential nucleotides ("base pairs").

You will learn how to:

1. Import data from the UCI repository
2. Convert text inputs into numerical data
3. Build and train classification algorithms
4. Compare and contrast classification algorithms
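
A quick side note on the Markov models mentioned above: they do not reappear in the code below, which relies on one-hot encoded nucleotides instead. Purely as an illustration of the idea, the following sketch builds first-order Markov features (counts of transitions between consecutive nucleotides) for a single sequence. The helper transition_counts is hypothetical and is not part of the original project.

from collections import Counter

def transition_counts(seq, alphabet='acgt'):
    # count how often each nucleotide is followed by each other nucleotide
    pairs = Counter(zip(seq, seq[1:]))
    return {a + b: pairs.get((a, b), 0) for a in alphabet for b in alphabet}

# hypothetical example sequence (not taken from the dataset)
print(transition_counts('tactagcaatacgcttgcgttcggtggt'))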

Step 1: Importing the Dataset
The following code cell imports the required libraries and loads the dataset from the UCI repository as a Pandas DataFrame.

# To make sure all of the correct libraries are installed, import each module and print the version number

import sys
import numpy
import sklearn
import pandas

print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(numpy.__version__))
print('Sklearn: {}'.format(sklearn.__version__))
print('Pandas: {}'.format(pandas.__version__))


# Import, change module names
import numpy as np
import pandas as pd

# import the uci Molecular Biology (Promoter Gene Sequences) Data Set
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'
names = ['Class', 'id', 'Sequence']
data = pd.read_csv(url, names = names)

print(data.iloc[0])


Step 2: Preprocessing the Dataset
The data is not yet in a usable form; we need to process it before using it to train our algorithms.

# Building our Dataset by creating a custom Pandas DataFrame
# Each column in a DataFrame is called a Series. Let's start by making a series for each column.

classes = data.loc[:, 'Class']
print(classes[:5])


# generate list of DNA sequences
sequences = list(data.loc[:, 'Sequence'])
dataset = {}

# loop through sequences and split into individual nucleotides
for i, seq in enumerate(sequences):
    
    # split into nucleotides, remove tab characters
    nucleotides = list(seq)
    nucleotides = [x for x in nucleotides if x != '\t']
    
    # append class assignment
    nucleotides.append(classes[i])
    
    # add to dataset
    dataset[i] = nucleotides
    
print(dataset[0])


# turn dataset into pandas DataFrame
dframe = pd.DataFrame(dataset)
print(dframe)


# transpose the DataFrame
df = dframe.transpose()
print(df.iloc[:5])


# for clarity, let's rename the last dataframe column to 'Class'
df.rename(columns = {57: 'Class'}, inplace = True) 
print(df.iloc[:5])


# looks good! Let's start to familiarize ourselves with the dataset so we can pick the most suitable 
# algorithms for this data

df.describe()


# describe does not tell us enough information since the attributes are text. Let's record value counts for each column
series = []
for name in df.columns:
    series.append(df[name].value_counts())
    
info = pd.DataFrame(series)
details = info.transpose()
print(details)


# Unfortunately, we can't run machine learning algorithms on the data in 'String' formats. As a result, we need to switch
# it to numerical data. This can easily be accomplished using the pd.get_dummies() function
numerical_df = pd.get_dummies(df)
numerical_df.iloc[:5]


# We don't need both class columns.  Let's drop one, then rename the other to simply 'Class'.
# (the class column was renamed to 'Class' above, so its dummies are 'Class_+' and 'Class_-')
df = numerical_df.drop(columns=['Class_-'])

df.rename(columns = {'Class_+': 'Class'}, inplace = True)
print(df.iloc[:5])


# Use the model_selection module to separate training and testing datasets
from sklearn import model_selection

# Create X and Y datasets for training
X = np.array(df.drop(columns=['Class']))
y = np.array(df['Class'])

# define seed for reproducibility
seed = 1

# split data into training and testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)

print(X.shape)
print(type(X))
print(type(X[0]))
print(type(X[0][0]))

(106, 228)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.uint8'>

Step 3: Training and Testing the Classification Algorithms
Now that we have preprocessed the data and built our training and testing datasets, we can start deploying different classification algorithms. Testing multiple models is relatively straightforward, so we will compare and contrast the performance of ten different algorithms.

# Now that we have our dataset, we can start building algorithms! We'll need to import each algorithm we plan on using
# from sklearn.  We also need to import some performance metrics, such as accuracy_score and classification_report.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# define scoring method
scoring = 'accuracy'

# Define models to train
names = ["Nearest Neighbors", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]

classifiers = [
    KNeighborsClassifier(n_neighbors = 3),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1e-5),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel = 'linear', gamma='auto'), 
    SVC(kernel = 'rbf', gamma='auto'),
    SVC(kernel = 'sigmoid', gamma='auto')
]

models = zip(names, classifiers)

# evaluate each model in turn
results = []
names = []

for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print('Test-- ',name,': ',accuracy_score(y_test, predictions))
    print()
    print(classification_report(y_test, predictions))
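
To make the comparison of the ten models easier to read, the cross-validation scores collected in results can be plotted side by side. This is a minimal sketch that assumes matplotlib is installed; it was not part of the original post.

# compare the algorithms with a boxplot of the 10-fold CV accuracies
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))
ax.boxplot(results)
ax.set_xticklabels(names, rotation=45, ha='right')
ax.set_ylabel('Cross-validation accuracy')
ax.set_title('Algorithm Comparison')
plt.tight_layout()
plt.show()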




Reference
https://www.kaggle.com/bulentsiyah/dna-classification-code
