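You will learn how to:
1. Import data from the UCI repository
2. Convert text input into numerical data
3. Build and train classification algorithms
4. Compare and contrast classification algorithms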
Step 1: Importing the Dataset
The following code cells import the essential libraries and load the dataset from the UCI repository as a Pandas DataFrame.
# To make sure all of the correct libraries are installed,
# import each module and print the version number
import sys
import numpy
import sklearn
import pandas

print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(numpy.__version__))
print('Sklearn: {}'.format(sklearn.__version__))
print('Pandas: {}'.format(pandas.__version__))
# Import, change module names
import numpy as np
import pandas as pd

# import the UCI Molecular Biology (Promoter Gene Sequences) Data Set
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'
names = ['Class', 'id', 'Sequence']
data = pd.read_csv(url, names = names)

print(data.iloc[0])
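To see what `pd.read_csv` does with a header-less file like this one, here is a minimal sketch that parses two rows from an in-memory string instead of the URL. The rows, ids, and sequences are invented for illustration and are not taken from the UCI file; the only assumption carried over is the three-column layout with tab characters in front of each sequence.

```python
import io
import pandas as pd

# Two hypothetical rows in the assumed layout: class, id, tab-prefixed sequence.
sample_text = "+,ID1,\t\ttactag\n-,ID2,\t\tatatga\n"

# names= supplies the column labels because the file has no header row.
names = ['Class', 'id', 'Sequence']
sample = pd.read_csv(io.StringIO(sample_text), names=names)

print(sample.shape)
# repr() makes the leading tab characters visible.
print(repr(sample.loc[0, 'Sequence']))
```

The tab characters survive parsing, which is why the preprocessing step has to filter them out.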
Step 2: Preprocessing the Dataset
The data is not yet in a usable form, so we need to process it before using it to train our algorithms.
# Building our Dataset by creating a custom Pandas DataFrame
# Each column in a DataFrame is called a Series. Let's start by making a series for each column.
classes = data.loc[:, 'Class']
print(classes[:5])

# generate list of DNA sequences
sequences = list(data.loc[:, 'Sequence'])
dataset = {}

# loop through sequences and split into individual nucleotides
for i, seq in enumerate(sequences):

    # split into nucleotides, remove tab characters
    nucleotides = list(seq)
    nucleotides = [x for x in nucleotides if x != '\t']

    # append class assignment
    nucleotides.append(classes[i])

    # add to dataset
    dataset[i] = nucleotides

print(dataset[0])
# turn dataset into pandas DataFrame
dframe = pd.DataFrame(dataset)
print(dframe)

# transpose the DataFrame
df = dframe.transpose()
print(df.iloc[:5])

# for clarity, let's rename the last dataframe column to class
df.rename(columns = {57: 'Class'}, inplace = True)
print(df.iloc[:5])
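The reason for the transpose can be seen on a toy input. The sketch below repeats the same steps (split, append class, build DataFrame, transpose, rename) on two made-up three-letter sequences; the sequences and labels are hypothetical and far shorter than the real 57-nucleotide rows.

```python
import pandas as pd

# Hypothetical miniature inputs: tab-prefixed sequences and their class labels.
toy_sequences = ['\t\tacg', '\t\ttga']
toy_classes = ['+', '-']

toy_dataset = {}
for i, seq in enumerate(toy_sequences):
    # Split into single characters and drop the tab characters.
    nucleotides = [x for x in seq if x != '\t']
    # The class label becomes the last element of each record.
    nucleotides.append(toy_classes[i])
    toy_dataset[i] = nucleotides

# Dictionary keys become columns, so each sample is a column at this point.
toy_frame = pd.DataFrame(toy_dataset)
print(toy_frame.shape)

# Transposing puts one sample per row, one nucleotide position per column.
toy_df = toy_frame.transpose()
toy_df = toy_df.rename(columns={3: 'Class'})
print(toy_df)
```

With three nucleotides per sequence the class lands in column 3; in the real data it lands in column 57.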
# looks good! Let's start to familiarize ourselves with the dataset so we can pick the most suitable
# algorithms for this data
df.describe()

# describe() does not tell us enough since the attributes are text.
# Let's record value counts for each sequence position
series = []
for name in df.columns:
    series.append(df[name].value_counts())

info = pd.DataFrame(series)
details = info.transpose()
print(details)
# Unfortunately, we can't run machine learning algorithms on the data in 'String' formats.
# As a result, we need to switch it to numerical data.
# This can easily be accomplished using the pd.get_dummies() function
numerical_df = pd.get_dummies(df)
numerical_df.iloc[:5]
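# We don't need both class columns. Let's drop one then rename the other to simply 'Class'.
# Column 57 was renamed to 'Class' earlier, so get_dummies() names the two
# class columns 'Class_+' and 'Class_-'.
df = numerical_df.drop(columns=['Class_-'])
df.rename(columns = {'Class_+': 'Class'}, inplace = True)
print(df.iloc[:5])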
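As a small illustration of what `pd.get_dummies()` produces, the sketch below encodes a made-up frame with two nucleotide positions and a class column. The toy values are hypothetical, and `dtype=int` is passed only so the printed values are 0/1 regardless of the pandas version; the walkthrough itself relies on the default dtype.

```python
import pandas as pd

# Hypothetical miniature frame: two nucleotide positions plus the class label.
toy = pd.DataFrame({
    0: ['a', 'c', 'g'],
    1: ['t', 't', 'a'],
    'Class': ['+', '-', '+'],
})

# One indicator column per (column, value) pair, named '<column>_<value>'.
encoded = pd.get_dummies(toy, dtype=int)
print(list(encoded.columns))

# 'Class_+' and 'Class_-' carry the same information, so keep only one.
encoded = encoded.drop(columns=['Class_-']).rename(columns={'Class_+': 'Class'})
print(encoded)
```

Each position expands into as many columns as it has distinct values, which is how 57 positions with 4 nucleotides each become 228 feature columns.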
# Use the model_selection module to separate training and testing datasets
from sklearn import model_selection

# Create X and Y datasets for training
# (columns= replaces the positional axis argument, which recent pandas versions no longer accept)
X = np.array(df.drop(columns=['Class']))
y = np.array(df['Class'])

# define seed for reproducibility
seed = 1

# split data into training and testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)
print(X.shape)
print(type(X))
print(type(X[0]))
print(type(X[0][0]))
(106, 228)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.uint8'>
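With only 106 samples, a 25% test split is small, so the class balance in the test set can drift. One optional variation, not used in the original walkthrough, is to pass `stratify=y` to `train_test_split`. The sketch below uses placeholder arrays with the same shape as printed above and assumes evenly balanced labels; both are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): same shape as X above, balanced 0/1 labels.
X_demo = np.zeros((106, 228), dtype=np.uint8)
y_demo = np.array([1] * 53 + [0] * 53)

# stratify=y_demo keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
# Number of positive and negative samples in the test split.
print(int(y_te.sum()), int(len(y_te) - y_te.sum()))
```

Whether stratification changes the reported scores on the real data is not something the original post measures.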
Step 3: Training and Testing the Classification Algorithms
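Now that we have preprocessed the data and built our training and testing datasets, we can start deploying different classification algorithms. Testing multiple models is relatively easy, so we will compare and contrast the performance of ten different algorithms.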
# Now that we have our dataset, we can start building algorithms! We'll need to import each algorithm
# we plan on using from sklearn. We also need to import some performance metrics, such as
# accuracy_score and classification_report.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# define scoring method
scoring = 'accuracy'

# Define models to train
names = ["Nearest Neighbors", "Gaussian Process", "Decision Tree", "Random Forest",
         "Neural Net", "AdaBoost", "Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]

classifiers = [
    KNeighborsClassifier(n_neighbors = 3),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1e-5),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel = 'linear', gamma='auto'),
    SVC(kernel = 'rbf', gamma='auto'),
    SVC(kernel = 'sigmoid', gamma='auto')
]
models = zip(names, classifiers)

# evaluate each model in turn
results = []
names = []

for name, model in models:
    # shuffle=True is needed for random_state to apply; recent scikit-learn
    # versions raise an error if random_state is set without shuffling
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state = seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

    # fit on the full training set and score on the held-out test set
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print('Test-- ',name,': ',accuracy_score(y_test, predictions))
    print()
    print(classification_report(y_test, predictions))
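To compare the models side by side rather than reading the printed lines one at a time, the cross-validation scores can be collected into a table and sorted. The following is a minimal sketch under stated assumptions: it runs on synthetic data from `make_classification` with three of the classifiers, and the names `demo_models` and `summary` are hypothetical, not part of the original code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 100 samples, 20 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=1)

demo_models = {
    'Nearest Neighbors': KNeighborsClassifier(n_neighbors=3),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=1),
    'Naive Bayes': GaussianNB(),
}

# Use the same folds for every model so the scores are comparable.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

rows = []
for name, model in demo_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring='accuracy')
    rows.append({'model': name, 'cv_mean': scores.mean(), 'cv_std': scores.std()})

# Highest mean cross-validation accuracy first.
summary = pd.DataFrame(rows).sort_values('cv_mean', ascending=False).reset_index(drop=True)
print(summary)
```

The same pattern should carry over to the `names` and `results` lists built in the loop above, with each entry of `results` supplying the mean and standard deviation for its model.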
Reference
https://www.kaggle.com/bulentsiyah/dna-classification-code