江義華的部落格 (cyh.etlab's blog): Understanding the Support Vector Machine (SVM) Machine Learning Algorithm through the Handwritten Digit Recognition Example (plot_digits_classification.py)

Many open-source (OSS) machine learning tools are available today, and they make machine learning algorithms much easier to pick up for people outside computer science. Scikit-learn is one such open-source machine learning framework. Its core functionality is grouped into supervised learning, unsupervised learning, model selection and evaluation, dataset transformations, and so on.

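The handwritten digit recognition program is built on a Support Vector Machine (SVM): it trains an SVM model and then recognizes handwritten digits by classification and prediction. When run, it displays four training images with their labels and four test images with their predicted digits.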

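What is a Support Vector Machine (SVM)? Put simply, SVMs are supervised learning methods used for classification and regression. Given a set of training examples, each labeled as belonging to one of two classes, an SVM looks for the boundary that separates the classes with the widest possible margin. If the two classes cannot be separated cleanly, the algorithm finds the best boundary it can. In the original figure, a hyperplane splits the green dots and the red dots into two classes.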
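
The idea can be tried on a handful of points. The following is only a minimal sketch: the six 2-D points and their labels are made up for illustration, and a linear kernel is chosen so that the boundary is a straight line.

```python
import numpy as np
from sklearn import svm

# Made-up training examples: three points per class, clearly separated.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel yields a straight-line (hyperplane) boundary.
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

# The support vectors are the training points that define the boundary.
print("support vectors:\n", clf.support_vectors_)

# Classify two new points, one near each group.
pred = clf.predict([[1.5, 1.5], [4.5, 4.5]])
print("predictions:", pred)
```

Only the points closest to the other class end up as support vectors; the remaining points could be moved slightly without changing the boundary.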

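A few simple scenarios show how to find the right hyperplane; the idea is easiest to grasp from pictures:
Scenario 1:

There are three hyperplanes, A, B, and C. We need to find the right hyperplane to separate the blue stars from the red circles. In this scenario, hyperplane "B" does the job.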
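Scenario 2:

There are again three hyperplanes, A, B, and C. The distance between a hyperplane and the nearest data point of either class (the margin) helps us decide which hyperplane is right: the one with the largest margin.

In this scenario, hyperplane "C" does the job.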
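
The margin comparison can be done by hand. This is a rough sketch rather than what SVM does internally: the points and the three candidate lines A, B, and C are hypothetical values picked so that all three lines separate the classes and only the margins differ.

```python
import numpy as np

# Made-up 2-D points for the two classes.
stars = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
circles = np.array([[4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
points = np.vstack([stars, circles])

# Hypothetical candidate lines w[0]*x + w[1]*y + b = 0.
candidates = {
    "A": (np.array([1.0, 1.0]), -3.5),
    "B": (np.array([1.0, 1.0]), -7.5),
    "C": (np.array([1.0, 1.0]), -5.5),
}

def margin(w, b, pts):
    """Distance from the line to the closest point of either class."""
    return np.min(np.abs(pts @ w + b) / np.linalg.norm(w))

margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)

for name, value in margins.items():
    print("hyperplane %s: margin %.3f" % (name, value))
print("largest margin:", best)
```

Lines A and B each pass very close to one of the classes, while C sits midway between them, so C has the largest margin.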

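Scenario 3:

Some readers may pick hyperplane B because its margin is larger than A's. That is the wrong choice: SVM requires the classes to be classified correctly before it maximizes the margin. In this scenario, hyperplane "A" does the job.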

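Scenario 4:

Here no straight line separates the two classes perfectly, because one of the blue stars lies in the territory of the other (circle) class as an outlier.

SVM is able to ignore such outliers and still find the hyperplane with the largest margin.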
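
A small experiment illustrates this tolerance. The sketch below assumes made-up points, with one extra star placed among the circles, and relies on the default C=1.0 of `svm.SVC`, which allows some training points to end up on the wrong side of the boundary.

```python
import numpy as np
from sklearn import svm

# Made-up data: label 0 = stars, label 1 = circles.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],   # stars
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0],   # circles
              [4.5, 4.5]])                           # a star sitting among the circles
y = np.array([0, 0, 0, 1, 1, 1, 0])

# Soft-margin linear SVM; C=1.0 is the library default.
clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The six regular points are still classified correctly...
regular_ok = (clf.predict(X[:6]) == y[:6]).all()
# ...while the outlier is treated as a circle, i.e. effectively ignored.
outlier_pred = clf.predict(X[6:])[0]

print("regular points classified correctly:", regular_ok)
print("outlier (labeled 0) predicted as:", outlier_pred)
```

The boundary stays between the two main groups instead of bending toward the single stray point.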

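Scenario 5:

In this scenario no linear hyperplane can separate the two classes, so how do we classify them? We add a new feature, z = x^2 + y^2, and plot the data points again on the x and z axes.

In that plot a linear hyperplane separates the two classes easily. Viewed back in the original x-y plane, the same hyperplane looks like a circle.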
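
The effect of the extra feature can be checked numerically. This is a sketch under assumptions: the data comes from scikit-learn's `make_circles` generator (one class forms a ring around the other), which stands in for the points in the figure.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_circles

# Synthetic data: an inner cluster surrounded by an outer ring.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM on the raw (x, y) coordinates cannot separate the classes.
linear_xy = svm.SVC(kernel="linear").fit(X, y)
acc_xy = linear_xy.score(X, y)

# Add the new feature z = x^2 + y^2 and train on (x, z) instead.
z = (X ** 2).sum(axis=1)
X_xz = np.column_stack([X[:, 0], z])
linear_xz = svm.SVC(kernel="linear").fit(X_xz, y)
acc_xz = linear_xz.score(X_xz, y)

print("training accuracy on (x, y): %.2f" % acc_xy)
print("training accuracy on (x, z): %.2f" % acc_xz)
```

In the (x, z) space the inner cluster has small z and the ring has z close to 1, so a straight line is enough. Kernels such as 'rbf' achieve a similar effect without the feature being added by hand.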

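The figures above are taken from Analytics Vidhya. In Python, scikit-learn is a widely used library for machine learning algorithms. SVM is available in scikit-learn and follows the same structure as its other estimators: import the library, create the object, fit the model, and predict. Let's look at the code of the handwritten digit recognition example (plot_digits_classification.py):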

print(__doc__)

# Author: Gael Varoquaux <gael dot varoquaux at normalesup dot org>
# License: BSD 3 clause

# Standard scientific Python imports
import matplotlib.pyplot as plt

# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

# The digits dataset
digits = datasets.load_digits()

# The data that we are interested in is made of 8x8 images of digits, let's
# have a look at the first 4 images, stored in the `images` attribute of the
# dataset. If we were working from image files, we could load them using
# matplotlib.pyplot.imread. Note that each image must have the same size. For these
# images, we know which digit they represent: it is given in the 'target' of
# the dataset.
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.0001)

# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

plt.show()

The SVM-related part of the code above is the heart of the program. Here it is again, with comments:

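# Create a Support Vector Machine classifier
classifier = svm.SVC(gamma=0.0001)

# Train the SVC model on the first half of the data
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

# The true labels of the second half, kept for comparison
expected = digits.target[n_samples // 2:]
# Predict the digit values of the test samples
predicted = classifier.predict(data[n_samples // 2:])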
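
To put a single number on how well this split works, the steps can be rerun on their own and scored. This is a sketch, not part of the original example: it reuses the same data preparation and adds `metrics.accuracy_score`, which the example itself does not call.

```python
from sklearn import datasets, metrics, svm

digits = datasets.load_digits()
n_samples = len(digits.images)

# Flatten each 8x8 image into a 64-element feature vector.
data = digits.images.reshape((n_samples, -1))
print("images:", digits.images.shape, "-> data:", data.shape)

# Same classifier and same first-half / second-half split as above.
classifier = svm.SVC(gamma=0.0001)
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

# Fraction of test digits whose predicted value matches the true label.
accuracy = metrics.accuracy_score(expected, predicted)
print("accuracy on the second half: %.3f" % accuracy)
```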

The SVC in svm.SVC is a class whose source code lives in classes.py. The full declaration of class SVC, with the comments removed, is:
class SVC(BaseSVC):
    def __init__(self, C=1.0, kernel='rbf', degree=3, gamma='auto',
                 coef0=0.0, shrinking=True, probability=False,
                 tol=1e-3, cache_size=200, class_weight=None,
                 verbose=False, max_iter=-1, decision_function_shape=None,
                 random_state=None):

        super(SVC, self).__init__(
            impl='c_svc', kernel=kernel, degree=degree, gamma=gamma,
            coef0=coef0, tol=tol, C=C, nu=0., shrinking=shrinking,
            probability=probability, cache_size=cache_size,
            class_weight=class_weight, verbose=verbose, max_iter=max_iter,
            decision_function_shape=decision_function_shape,
            random_state=random_state)
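Its __init__ constructor takes many parameters. The most important are C=1.0, kernel='rbf', degree=3, and gamma='auto' (the defaults in the scikit-learn release quoted here; newer releases default to gamma='scale'). Tuning these parameters can noticeably improve the model's performance. The key ones, "kernel", "gamma", and "C", in brief:
kernel: the kernel type. Available options include "linear", "rbf", "poly", and others (the default is "rbf"). "rbf" and "poly" are useful for non-linear hyperplanes.
gamma: the kernel coefficient for 'rbf', 'poly', and 'sigmoid'. The higher the gamma value, the harder the model tries to fit the training data exactly, which can lead to overfitting.
C: the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly.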
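
One common way to tune these parameters is a grid search with cross-validation. The sketch below uses scikit-learn's `GridSearchCV`; the candidate values for gamma and C are arbitrary picks for illustration, not recommendations.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
half = n_samples // 2

# Candidate settings to compare (illustrative values only).
param_grid = {
    "kernel": ["rbf"],
    "gamma": [0.0001, 0.001],
    "C": [1, 10],
}

# Try every combination with 3-fold cross-validation on the training half.
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(data[:half], digits.target[:half])

# Score the best model found on the held-out second half.
test_score = search.score(data[half:], digits.target[half:])

print("best parameters:", search.best_params_)
print("cross-validation score: %.3f" % search.best_score_)
print("score on the second half: %.3f" % test_score)
```

Searching on the first half only keeps the second half untouched, so the final score is measured on digits the search never saw.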

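That concludes this brief look at the Support Vector Machine algorithm through the handwritten digit recognition example (plot_digits_classification.py). Like any algorithm, SVM has advantages and disadvantages.
Advantages:
1. It works very well when there is a clear margin of separation between the classes.
2. It is effective in high-dimensional spaces.
3. It is effective when the number of dimensions is greater than the number of samples.
4. It uses only a subset of the training points (the support vectors) in the decision function, so it is also memory efficient.
Disadvantages:
1. It does not perform well on large datasets, because the required training time is long.
2. It also performs poorly when the dataset is noisy, that is, when the target classes overlap.
3. SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.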
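
The last point can be seen in the API. In this sketch, the `probability=True` flag of `svm.SVC` switches on the extra cross-validation step during `fit`, after which `predict_proba` becomes available; the gamma value is simply the one used earlier in this post.

```python
import numpy as np
from sklearn import datasets, svm

digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
half = n_samples // 2

# probability=True makes fit() run an additional internal cross-validation,
# which slows training down but enables predict_proba().
clf = svm.SVC(gamma=0.0001, probability=True, random_state=0)
clf.fit(data[:half], digits.target[:half])

# One row per test sample, one column per digit class.
proba = clf.predict_proba(data[half:half + 5])
print(np.round(proba, 2))
print("most likely digit per row:", clf.classes_[proba.argmax(axis=1)])

# Advantage 4 above: only the support vectors are kept for the decision function.
print("support vectors:", clf.support_vectors_.shape[0], "of", half, "training samples")
```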

Monday, June 11, 2018
