ML Primer

Max Kleiner
6 min read · Apr 6, 2021

MNIST Single Prediction

In this tutorial we'll explore one of the classic machine learning datasets and tasks: handwritten digit classification.

https://github.com/maxkleiner/maXbox4/blob/master/MNISTSinglePredict.ipynb

We set up a very simple SVC (Support Vector Classifier) to classify the MNIST digits and make a single prediction. First we load the libraries and the dataset:

#sign:max: MAXBOX8: 13/03/2021 07:46:37
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

The dataset is available either for download from the UCI ML repository or directly through the scikit-learn datasets module.

# [8*8 pixel images of the digits 0..9]
dimages = datasets.load_digits()
print(type(dimages), len(dimages.data))
>>> <class 'sklearn.utils.Bunch'> 1797 samples

The dataset consists of a table: columns are attributes, rows are instances (individual observations). In order to do computations easily and efficiently, and not to reinvent the wheel, we use a suitable tool: pandas. So the first step is to obtain the dataset, and the second step is to load it into a DataFrame for the train/test split.

Then we set up the Support Vector Classifier with the training data X and the target y:

#Support Vector Classifier 
sclf = SVC(gamma=0.001, C=100, kernel='linear')
X= dimages.data[:-10]
y= dimages.target[:-10]
print(len(X))
>>> train set samples: 1787

Gamma is the kernel coefficient, not a learning rate: the higher the value of gamma, the more closely the decision boundary follows the individual training samples. (Gamma only affects the RBF, polynomial and sigmoid kernels; the linear kernel used here ignores it.) C is the regularization parameter, the penalty of the fault tolerance.

Having a larger C will lead to smaller values for the slack variables. This means that the number of support vectors will decrease. When you run a prediction, the model has to evaluate the kernel function for each support vector, so fewer support vectors also means faster predictions.
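To make the effect of C tangible, here is a minimal sketch (the two C values are assumptions, not from the original script) that counts the support vectors for a small and a large C on the same digits data:

# illustrative only: compare support vector counts for two values of C
from sklearn import datasets
from sklearn.svm import SVC

d = datasets.load_digits()
for C in (0.01, 100):
    m = SVC(kernel='linear', C=C).fit(d.data, d.target)
    print('C =', C, 'support vectors:', m.n_support_.sum())

Now we train (fit) the image samples: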

sclf.fit(X,y) 
print('ONLY train score ',sclf.score(X,y))
>>> ONLY train score 1.0
>>> SVC(C=100, break_ties=False, cache_size=200, class_weight=None,
        coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.001,
        kernel='linear', max_iter=-1, probability=False, random_state=None,
        shrinking=True, tol=0.001, verbose=False)

The train-only score is 100% because the model was scored on the very data it was trained on, with no unseen test data. In the last step we predict a specific digit; only the last 10 samples are unseen, and testimage = -6 points into that unseen tail. We pass an actual image and the SVC predicts which digit the image shows:

testimage = -6
s_prediction = sclf.predict([dimages.data[testimage]])
print('the image maybe belongs to ', s_prediction)
plt.imshow(dimages.images[testimage], cmap=plt.cm.gray_r,
           interpolation="nearest")
plt.show()
>>> the image maybe belongs to [4]

[Figure: the predicted digit rendered as an 8*8 grayscale image by plt.imshow]

We try the same fit with a Random Forest Classifier to finish the first step of this lesson:

# RandomForestClassifier to make a white box model
predictimage_test = dimages.data[-2]
rfc_clf = RandomForestClassifier()
rfc_clf.fit(X,y)
rfc_prediction = rfc_clf.predict([predictimage_test])
print ('predict with RFC ',rfc_prediction)
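One white-box aspect of a random forest is that it can report which input features influenced the trees most. A small sketch (illustrative, not part of the original listing) that inspects the fitted rfc_clf:

# illustrative: the five pixels the forest relied on most
import numpy as np
top5 = np.argsort(rfc_clf.feature_importances_)[::-1][:5]
print('most important pixel indices:', top5)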

A 5 Detector

Imagine a 5-detector: this is its confusion matrix with precision and recall:

[Figure: confusion matrix with precision and recall, from Hands-On Machine Learning with Scikit-Learn and TensorFlow]
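As a hands-on illustration (a minimal sketch, not from the original article; the binary target and split parameters are assumptions), we can build such a 5-detector on the same digits data and compute its confusion matrix, precision and recall:

# sketch: binary "is it a 5?" detector on the digits data
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score

digits = datasets.load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, train_size=0.8, random_state=2)
y_tr5, y_te5 = (y_tr == 5), (y_te == 5)   # True only for fives

det = SVC(gamma=0.001, C=100, kernel='linear').fit(X_tr, y_tr5)
pred5 = det.predict(X_te)
print(confusion_matrix(y_te5, pred5))                # rows: actual, columns: predicted
print('precision', precision_score(y_te5, pred5))    # TP / (TP + FP)
print('recall   ', recall_score(y_te5, pred5))       # TP / (TP + FN)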

There are many ways to improve this prediction, including replacing the vector classifier with a neural classifier, but here's a simple one to start with: simplify our images by making them true black and white and stack them into an array, as sketched below.
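A minimal sketch of that idea (the threshold value 8 is an assumption; the digits grayscale runs from 0 to 16):

# binarize: every pixel becomes 0 (light) or 1 (dark)
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
bw = (digits.data > 8).astype(np.uint8)   # threshold at mid-gray
print(bw.shape)                           # (1797, 64) stacked array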

MNIST Multi Prediction

Now we split the data explicitly into a train set and a test set. We split the given images in an 80:20 ratio, so that 80% of the images are available for training and 20% for testing. We treat the data as pixels and the target as labels.

We convert the dataset and create the DataFrame from it. We use a support vector machine for the classification: the fit method trains the model, and score tests it against the given test set.

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on one side.
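A tiny two-dimensional illustration (the six points below are made up for this sketch): a linear SVC finds the weights w and intercept b of the hyperplane w·x + b = 0:

# toy example: recover the separating line from a fitted linear SVC
import numpy as np
from sklearn.svm import SVC

X2 = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y2 = np.array([0, 0, 0, 1, 1, 1])
line = SVC(kernel='linear', C=100).fit(X2, y2)
w, b = line.coef_[0], line.intercept_[0]
print('hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0' % (w[0], w[1], b))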

df = pd.DataFrame(data=dimages.data, columns=dimages.feature_names)
print(df.head(5))
df['target'] = pd.Series(dimages.target)
print(df['target'])
print(df.shape)
df.info()

# note: df still contains the 'target' column, so the label leaks into the
# features here; use df.drop('target', axis=1) for a clean feature matrix
pixels = df
labels = df.target
print('pixels ', pixels)
# careful: the return values are assigned in the wrong order here;
# see the correction at the end of the article
test_images, train_images, test_labels, train_labels = \
    train_test_split(pixels, labels, train_size=0.8, random_state=2)
print('train size: ', len(train_images), len(train_labels))
sclf.fit(train_images, train_labels)
print('test score ',sclf.score(test_images,test_labels))

This gives us a score of about 97 percent (0.97633959638135), which is a good score overall. We could try to increase the accuracy further; that is left as a challenge.
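One straightforward way to chase a higher score is a small grid search over the SVC hyperparameters (a sketch with assumed parameter ranges, not from the original article):

# sketch: tune kernel, C and gamma with 3-fold cross-validation
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'kernel': ['linear', 'rbf'],
              'C': [1, 10, 100],
              'gamma': [0.001, 0.01]}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(train_images, train_labels)
print(grid.best_params_, grid.best_score_)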

It would be nice to get the confusion matrix of the digits dataset to get an impression of the score. In the matrix below, rows are the actual digits 0 to 9 and columns are the predicted digits; the trailing digit annotates each row's class.

from sklearn.metrics import confusion_matrix

test_predictions = sclf.predict(test_images)

print(confusion_matrix(test_labels, test_predictions))

test score 0.97633959638135

[[146 0 0 0 0 0 0 0 0 0] 0
[ 0 138 0 0 0 0 0 0 0 0] 1
[ 1 7 137 0 0 0 1 0 0 0] 2
[ 0 0 0 146 0 0 0 0 1 0] 3
[ 0 1 0 0 145 0 0 0 0 0] 4
[ 0 0 0 0 1 135 1 0 0 2] 5
[ 2 0 0 0 0 0 144 0 0 0] 6
[ 0 0 0 0 0 0 0 138 0 1] 7
[ 0 1 0 0 1 3 1 4 126 2] 8
[ 0 0 0 0 0 1 0 0 3 148]] 9

The dataset description of our primer says: each image is 8 pixels in height and 8 pixels in width, for a total of 64 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. In the scikit-learn digits dataset this pixel-value is an integer between 0 and 16, inclusive (the full-size 28*28 MNIST images use values from 0 to 255).
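A quick check of that value range on the loaded data (a sanity check, not part of the original listing):

# verify the grayscale range of the digits data
print(dimages.data.min(), dimages.data.max())
>>> 0.0 16.0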

pixel_0_0 pixel_0_1 pixel_0_2 … pixel_7_5 pixel_7_6 pixel_7_7
0 0.0 0.0 5.0 … 0.0 0.0 0.0
1 0.0 0.0 0.0 … 10.0 0.0 0.0
2 0.0 0.0 0.0 … 16.0 9.0 0.0
3 0.0 0.0 7.0 … 9.0 0.0 0.0
4 0.0 0.0 0.0 … 4.0 0.0 0.0

0 0.0 0.0 5.0 … 0.0 0.0 0
1 0.0 0.0 0.0 … 0.0 0.0 1
2 0.0 0.0 0.0 … 9.0 0.0 2
3 0.0 0.0 7.0 … 0.0 0.0 3
4 0.0 0.0 0.0 … 0.0 0.0 4
… … … … … … … …
1792 0.0 0.0 4.0 … 0.0 0.0 9
1793 0.0 0.0 6.0 … 0.0 0.0 0
1794 0.0 0.0 1.0 … 0.0 0.0 8
1795 0.0 0.0 2.0 … 0.0 0.0 9
1796 0.0 0.0 10.0 … 1.0 0.0 8

[1797 rows x 65 columns]
train size: 360 360
test score 0.97633959638135
predict with RFC [9]


Logistic Regression Schema

import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression

# features per machine (as interpreted below): [hours, items, age]
X = np.array([[10000, 80000, 35], [7000, 120000, 57],
              [100, 23000, 22], [223, 18000, 26]])
y = np.array([1, 1, 0, 0])   # 1 = needed maintenance, 0 = did not
cls = LogisticRegression(random_state=12)
cls.fit(X, y)
print(cls.predict([[6500, 50000, 26]]))
>>> [1]

This could be a predictive maintenance task that predicts the next failure of a machine: the numbers could stand for 6500 operating hours, 50000 produced items, and an age of 26 years. The prediction [1] means the machine could fail or need maintenance next time.
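If the binary answer is too coarse, the fitted model can also report class probabilities (a small usage sketch, not part of the original listing):

# probabilities for [no maintenance, maintenance] on the same query point
print(cls.predict_proba([[6500, 50000, 26]]))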

Task notebook with matrix solution:

The train and test split above was mistaken (the return values were muddled up); this is the right one:
train_images, test_images, train_labels, test_labels = \
train_test_split(pixels,labels,train_size=0.8,random_state=2);

train size: 1437 1437
test size: 360 360
test score 0.9777777777777777

Originally published at http://maxbox4.wordpress.com on April 6, 2021.



Written by Max Kleiner

Max Kleiner's professional environment is in the areas of OOP, UML and coding - among other things as a trainer, developer and consultant.
