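In this article, we will implement several machine learning algorithms in Python using scikit-learn, the most popular machine learning tool for Python. We will use a simple dataset to train classifiers that distinguish between different types of fruit.
The goal of this article is to identify the machine learning algorithm best suited to the problem at hand. To do that, we will compare different algorithms and select the best-performing one. Let's get started!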
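Data
The fruit dataset was created by Dr. Iain Murray of the University of Edinburgh. He bought a few dozen oranges, lemons, and apples of different varieties and recorded their measurements in a table. Professors at the University of Michigan then lightly reformatted the fruit data, which can be downloaded here.
Download: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/fruit_data_with_colors.txt
Let's take a look at the first few rows of the data.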
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

# read_table expects tab-separated text by default
fruits = pd.read_table('fruit_data_with_colors.txt')
fruits.head()
Figure 1
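Each row of the dataset represents one piece of fruit, described by several features in the table.
The dataset contains 59 pieces of fruit and 7 columns:
print(fruits.shape)
(59, 7)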
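There are four types of fruit in the dataset:
print(fruits['fruit_name'].unique())
['apple' 'mandarin' 'orange' 'lemon']
The data is fairly balanced except for mandarin. We will just have to go with it.
print(fruits.groupby('fruit_name').size())
Figure 2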
import seaborn as sns

# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='fruit_name', data=fruits)
plt.show()
Figure 3
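To make "fairly balanced except for mandarin" concrete, the class shares can be computed directly from the labels. The sketch below is only an illustration: the label list is hypothetical and does not reproduce the real class counts, and `imbalance_ratio` is a name introduced here.

```python
from collections import Counter

# Hypothetical labels for illustration only; not the real dataset.
labels = ["apple"] * 4 + ["orange"] * 4 + ["lemon"] * 3 + ["mandarin"] * 1

counts = Counter(labels)
total = sum(counts.values())
shares = {name: count / total for name, count in counts.items()}

# Ratio of the rarest class to the most common one; values near 1 mean balanced.
imbalance_ratio = min(counts.values()) / max(counts.values())

for name, share in sorted(shares.items()):
    print(f"{name}: {counts[name]} samples ({share:.0%})")
print(f"rarest / most common = {imbalance_ratio:.2f}")
```

A low ratio flags a class, like mandarin here, that the classifiers will see only a handful of times during training.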
Visualization
A box plot for each numeric variable gives us a clearer idea of the distribution of the input variables:
fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False, figsize=(9, 9),
                                        title='Box Plot for each input variable')
plt.savefig('fruits_box')
plt.show()
Figure 4
It looks like the color score has an approximately Gaussian distribution.
fruits.drop('fruit_label', axis=1).hist(bins=30, figsize=(9, 9))
plt.suptitle("Histogram for each numeric input variable")
plt.savefig('fruits_hist')
plt.show()
Figure 5
Some pairs of attributes are correlated (mass and width, for example). This suggests a high correlation and a predictable relationship.
from pandas.plotting import scatter_matrix

feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']
# Color each point by its fruit label using the 'gnuplot' colormap
scatter = scatter_matrix(X, c=y, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(9, 9), cmap='gnuplot')
plt.suptitle('Scatter-matrix for each input variable')
plt.savefig('fruits_scatter_matrix')
Figure 6
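The scatter matrix shows correlation visually; a Pearson coefficient puts a number on it. The following is a minimal sketch with made-up mass and width values (they are not taken from the fruit dataset), using NumPy's `corrcoef`.

```python
import numpy as np

# Hypothetical measurements (mass in g, width in cm); not taken from the dataset.
mass = np.array([80.0, 120.0, 150.0, 180.0, 200.0])
width = np.array([5.8, 6.5, 7.1, 7.4, 7.9])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(mass, width)[0, 1]
print(f"Pearson r between mass and width: {r:.3f}")
```

On the real data, `X.corr()` on the feature DataFrame would give the same kind of coefficient for every pair of columns at once.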
Statistical summary
Figure 7
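The code behind Figure 7 is not shown in this article. A summary of this kind is commonly produced with pandas' `DataFrame.describe()`; the sketch below assumes that approach and uses a few made-up rows, with column names borrowed from the article, rather than the real data.

```python
import pandas as pd

# Hypothetical rows for illustration; column names follow the article.
sample = pd.DataFrame({
    "mass": [100, 150, 200, 250],
    "width": [6.0, 7.0, 7.5, 8.0],
    "height": [6.5, 7.0, 8.0, 9.5],
    "color_score": [0.55, 0.70, 0.80, 0.75],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = sample.describe()
print(summary.loc[["min", "max"]])
```

Even in this toy table, mass runs into the hundreds while color_score stays below 1, which is the kind of scale mismatch a summary table makes easy to spot.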
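We can see that the numerical values are not on the same scale. We need to scale the features: the scaler is fitted on the training set, and the same transformation is then applied to the test set.
Create training and test sets and apply scaling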
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
# Fit the scaler on the training data only, then reuse it for the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
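To show what fitting on the training set and transforming the test set means, here is a hand-written sketch of min-max scaling in NumPy. The numbers are hypothetical and the variable names are introduced for this example; with its default range of 0 to 1, `MinMaxScaler` applies the same formula.

```python
import numpy as np

# Hypothetical single-feature data (for example, mass in grams).
train = np.array([[100.0], [150.0], [200.0]])
test = np.array([[125.0], [220.0]])

# Learn the minimum and the range from the training data only.
train_min = train.min(axis=0)
train_range = train.max(axis=0) - train_min

train_scaled = (train - train_min) / train_range
test_scaled = (test - train_min) / train_range  # reuse the training statistics

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test data reuses the training minimum and range, a test value larger than anything seen in training (220 here) lands above 1. That is expected and keeps information about the test set from leaking into preprocessing.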
Build models
Logistic regression
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(logreg.score(X_test, y_test)))
Accuracy of Logistic regression classifier on training set: 0.70
Accuracy of Logistic regression classifier on test set: 0.40
Decision tree
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 0.73
K-Nearest Neighbors (K-NN)
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
      .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
      .format(knn.score(X_test, y_test)))
Accuracy of K-NN classifier on training set: 0.95
Accuracy of K-NN classifier on test set: 1.00
Linear discriminant analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'
      .format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
      .format(lda.score(X_test, y_test)))
Accuracy of LDA classifier on training set: 0.86
Accuracy of LDA classifier on test set: 0.67
Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
      .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
      .format(gnb.score(X_test, y_test)))
Accuracy of GNB classifier on training set: 0.86
Accuracy of GNB classifier on test set: 0.67
Support vector machine
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
      .format(svm.score(X_train, y_train)))
print('Accuracy of SVM classifier on test set: {:.2f}'
      .format(svm.score(X_test, y_test)))
Accuracy of SVM classifier on training set: 0.61
Accuracy of SVM classifier on test set: 0.33
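The six model sections above repeat the same fit-and-score pattern. One way to run the comparison in a single loop is sketched below. It uses the iris dataset bundled with scikit-learn as a stand-in (an assumption, made so the example runs without the fruit file), so the accuracies it prints are not the ones reported in this article. The names `models` and `results`, and the `max_iter=1000` and `random_state=0` settings, are choices made for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: iris ships with scikit-learn, so no download is needed.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Same preprocessing as in the article: fit on train, transform both.
scaler_demo = MinMaxScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_te = scaler_demo.transform(X_te)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "K-NN": KNeighborsClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "Gaussian NB": GaussianNB(),
    "SVM": SVC(),
}

# Map each model name to (training accuracy, test accuracy).
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for name, (train_acc, test_acc) in results.items():
    print(f"{name:20s} train={train_acc:.2f} test={test_acc:.2f}")
```

Keeping the models in a dictionary makes it easy to add or remove a candidate without touching the evaluation code.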
K-NN was the most accurate model we tried. The confusion matrix shows no errors on the test set. However, the test set is very small.
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
Figure 8
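For readers unfamiliar with confusion matrices, the sketch below builds one by hand. The labels are hypothetical (they are not the article's predictions, which contained no test errors), and the numeric coding follows the legend used in the plotting code: 1 = apple, 2 = mandarin, 3 = orange, 4 = lemon.

```python
# Hypothetical true and predicted labels (1=apple, 2=mandarin, 3=orange, 4=lemon).
y_true = [1, 1, 2, 3, 3, 4, 4, 4]
y_pred = [1, 3, 2, 3, 3, 4, 4, 1]

classes = sorted(set(y_true) | set(y_pred))
index = {label: i for i, label in enumerate(classes)}

# matrix[i][j] counts samples whose true class is i and predicted class is j.
matrix = [[0] * len(classes) for _ in classes]
for t, p in zip(y_true, y_pred):
    matrix[index[t]][index[p]] += 1

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / len(y_true)

for row in matrix:
    print(row)
print(f"accuracy = {accuracy:.2f}")
```

Rows are true classes and columns are predicted classes, the same layout scikit-learn's `confusion_matrix` uses. A matrix with no errors has zeros everywhere off the diagonal.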
Plotting the decision boundary of the k-NN classifier
import numpy as np
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier

X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']
# Re-split the raw (unscaled) data so the plot axes stay in centimeters
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def plot_fruit_knn(X, y, n_neighbors, weights):
    # Use only two features so the boundary can be drawn in 2D
    X_mat = X[['height', 'width']].to_numpy()
    y_mat = y.to_numpy()

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#AFAFAF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#AFAFAF'])

    clf = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)

    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.
    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50

    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                         np.arange(y_min, y_max, mesh_step_size))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y, cmap=cmap_bold, edgecolor='black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    patch0 = mpatches.Patch(color='#FF0000', label='apple')
    patch1 = mpatches.Patch(color='#00FF00', label='mandarin')
    patch2 = mpatches.Patch(color='#0000FF', label='orange')
    patch3 = mpatches.Patch(color='#AFAFAF', label='lemon')
    plt.legend(handles=[patch0, patch1, patch2, patch3])

    plt.xlabel('height (cm)')
    plt.ylabel('width (cm)')
    plt.title("4-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.show()

plot_fruit_knn(X_train, y_train, 5, 'uniform')
Figure 9
k_range = range(1, 20)
scores = []

# Fit one K-NN model per value of k and record its test accuracy
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0, 5, 10, 15, 20])
Figure 10
For this particular dataset, we obtain the highest accuracy when k = 5.
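Reading the best k off a scatter plot can be done programmatically as well. The sketch below uses hypothetical accuracy values (not the ones behind Figure 10) and reports every k that ties for the top score; `k_values`, `accuracies` and `best_ks` are names introduced here.

```python
# Hypothetical accuracies for illustration; not results from the article.
k_values = list(range(1, 8))
accuracies = [0.60, 0.73, 0.73, 0.80, 0.87, 0.87, 0.80]

best_accuracy = max(accuracies)
# Collect every k that reaches the best accuracy, since ties are possible.
best_ks = [k for k, acc in zip(k_values, accuracies) if acc == best_accuracy]

print(f"best accuracy = {best_accuracy:.2f} at k = {best_ks}")
```

With a test set this small, several values of k often tie, so listing all of them is more informative than returning only the first.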
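Conclusion
In this article, we focused on prediction accuracy. Our goal was to learn a model with good generalization performance, that is, one that maximizes prediction accuracy on unseen data. By comparing different algorithms, we identified the machine learning algorithm best suited to the problem at hand (fruit type classification).
The source code used to create this post can be found here.
Source code: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Solving%20A%20Simple%20Classification%20Problem%20with%20Python.ipynb
Source: ATYUN subscription account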