Recognizing Handwritten Digits with Scikit-Learn

Neha Kumari
9 min read · Jun 17, 2021

Recognizing handwritten text is a problem that dates back to the first automatic machines that needed to identify individual characters in handwritten documents. Consider, for example, the ZIP codes on letters at the post office and the automation needed to read their five digits. Accurate recognition of these codes is necessary to sort mail automatically and efficiently. Another application that comes to mind is OCR (Optical Character Recognition) software, which must read handwritten text or pages of printed books and convert them into electronic documents in which each character is well defined. The problem of handwriting recognition goes further back in time, to the early 20th century (the 1920s), when Emanuel Goldberg (1881–1970) began his studies on this issue and suggested that a statistical approach would be an optimal choice.

To explore this problem in Python, the scikit-learn library provides a good example that helps to understand the technique, the issues involved, and the possibility of making predictions.

Objective:

The primary objective of this project is to read and interpret an image of a handwritten digit and predict the numeric value it represents.

We will have an estimator that learns through a fit() function and, once it has reached a sufficient degree of predictive capability (a sufficiently valid model), produces a prediction with the predict() function. We will then discuss the training set and validation set, created this time from a series of images.
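The fit()/predict() workflow described above can be illustrated with a toy estimator. The sketch below is a hypothetical nearest-centroid classifier written in plain Python; the class name NearestCentroidSketch and the tiny two-pixel samples are assumptions made for illustration and are not part of scikit-learn or of this project.

```python
class NearestCentroidSketch:
    """Toy estimator with a scikit-learn-like fit()/predict() interface (hypothetical)."""

    def fit(self, X, y):
        # Learn one centroid (mean feature vector) per class label.
        sums, counts = {}, {}
        for features, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(features)
                counts[label] = 0
            sums[label] = [s + f for s, f in zip(sums[label], features)]
            counts[label] += 1
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, X):
        # Assign each sample to the class whose centroid is closest.
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(
                self.centroids_,
                key=lambda label: squared_distance(x, self.centroids_[label]),
            )
            for x in X
        ]


# Made-up training data: two "pixels" per sample, labels 0 and 1.
X_train = [[0, 1], [1, 0], [15, 16], [16, 14]]
y_train = [0, 0, 1, 1]

model = NearestCentroidSketch().fit(X_train, y_train)
predictions = model.predict([[2, 2], [14, 15]])
print(predictions)
```

The real scikit-learn estimators used later in this article follow the same two-step pattern: fit() learns from labelled samples, and predict() returns one label per new sample.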

Hypothesis:

The Digits dataset of the scikit-learn library is useful for testing data analysis and prediction problems. Some scientists claim that a model trained on it predicts the digit correctly 95% of the time. Perform data analysis to accept or reject this hypothesis.

The Digits Dataset:

The scikit-learn library provides many datasets that are useful for testing data analysis and prediction problems. One of them is a dataset of images called Digits. It comprises 1,797 images that are 8x8 pixels in size. Each image is a handwritten digit in grayscale.

Data Analysis:

Code part:

So, let's get started.

1. Importing datasets
from sklearn import datasets
digits = datasets.load_digits()

2. Full description of datasets

print(digits.DESCR)

OUTPUT:

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 5620
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998
This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.
Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.
For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.
.. topic:: References
- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.
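The dataset description above says that each 32x32 bitmap is divided into non-overlapping 4x4 blocks and the "on" pixels in each block are counted, giving an 8x8 matrix with values from 0 to 16. A minimal sketch of that step in plain Python follows; the function name downsample_bitmap and the example bitmap are made up for illustration and are not the NIST preprocessing code.

```python
def downsample_bitmap(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block tile."""
    size = len(bitmap)
    return [
        [
            sum(
                bitmap[r][c]
                for r in range(top, top + block)
                for c in range(left, left + block)
            )
            for left in range(0, size, block)
        ]
        for top in range(0, size, block)
    ]


# Made-up 32x32 binary bitmap: a filled 8x8 square in the top-left corner
# plus one stray pixel in the bottom-right corner.
bitmap = [[1 if r < 8 and c < 8 else 0 for c in range(32)] for r in range(32)]
bitmap[31][31] = 1

features = downsample_bitmap(bitmap)
print(len(features), len(features[0]))  # size of the reduced matrix
print(features[0][:3])                  # first three blocks of the top row
print(features[7][7])                   # block containing the stray pixel
```

A fully filled 4x4 block counts 16 pixels, which is why the attribute range is 0..16 rather than 0..15.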

3. Targets

digits.target

OUTPUT:

array([0, 1, 2, ..., 8, 9, 8])

4. Shape of the dataset

digits.data.shape

OUTPUT:

(1797, 64)

5. Images stored in the form of an array

The images of the handwritten digits are contained in a digits.images array. Each element of this array is an image that is represented by an 8x8 matrix of numerical values that correspond to a grayscale from white, with a value of 0, to black, with a value of 16.

digits.images[0]

OUTPUT:

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])


6. Visualizing an array

  • Import the pyplot module from matplotlib as plt.
  • The imshow() function displays data as an image, i.e. on a 2D regular raster.
  • cmap=plt.cm.gray_r displays the image in reversed grayscale, so 0 is drawn as white and higher values as darker shades.
  • interpolation='nearest' displays the image without interpolating between pixels when the display resolution differs from the image resolution.
  • The title() function sets the title of the plot.
import matplotlib.pyplot as plt
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('Visualizing array')
#save the figure
plt.savefig('plot2.png', dpi=100, bbox_inches='tight')

OUTPUT:

7. Visualization of digits

  • The figure() function in the pyplot module creates a new figure with the specified size of (15, 4).
  • subplots_adjust(hspace=0.8) adjusts the space between the rows of subplots.
  • The zip() function pairs each image with its label for easier handling inside the plotting loop.
  • enumerate() adds a counter to an iterable and returns an enumerate object.
  • The subplot() function adds a subplot to the current figure at the specified grid position.
import numpy as np
plt.figure(figsize=(15,4))
plt.subplots_adjust(hspace=0.8)
for index, (image, label) in enumerate(zip(digits.data[0:10], digits.target[0:10])):
    # one subplot per digit, arranged in a 2x5 grid
    plt.subplot(2, 5, index+1)
    plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
    plt.title('Training: %i\n' % label, fontsize =12)
#save the figure
plt.savefig('plot1.png', dpi=300, bbox_inches='tight')

8. Split the dataset

Size of the training set

The dataset was reported to consist of 1,797 images. We can check whether that is true.

digits.target.size

OUTPUT:

1797

9. Flatten the input images

# flatten the images
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
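To see what this flattening does, here is a small sketch using a made-up NumPy array as a stand-in for digits.images (three 8x8 "images" filled with consecutive numbers). It only illustrates how reshape((n_samples, -1)) lays out the pixels; the array contents are an assumption.

```python
import numpy as np

# Made-up stand-in for digits.images: 3 "images" of 8x8 pixels.
images = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)

n_samples = len(images)
flat = images.reshape((n_samples, -1))  # -1 lets NumPy infer 8 * 8 = 64

print(flat.shape)

# Row-major order: pixel (row, col) of an image lands at index row * 8 + col.
row, col = 2, 5
print(flat[0, row * 8 + col] == images[0, row, col])
```

Each row of the flattened array is one sample with 64 features, which is the two-dimensional layout that scikit-learn estimators expect as input.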

10. Training and Prediction

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.2, random_state=0)
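As a quick check of what this split produces, the following self-contained sketch repeats the same call and prints the resulting shapes. It uses only the calls already shown in this article; nothing beyond the printed shapes is assumed.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

# Same settings as above: 20% of the samples are held out for testing.
x_train, x_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

With test_size=0.2, 360 of the 1,797 images are held out for testing and the remaining 1,437 are used for training. Fixing random_state makes the split repeatable between runs.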

11. Support Vector Classifier

An estimator that is useful in this case is sklearn.svm.SVC, which uses the technique of Support Vector Classification (SVC).

A Support Vector Machine (SVM) is a supervised machine learning algorithm that is mostly used in classification problems.

Import the svm module of the scikit-learn library, create an estimator of type SVC, and choose an initial setting by assigning generic values to C and gamma.

from sklearn import svm
svc = svm.SVC(gamma=0.001, C=100.)
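The values of C and gamma above are generic starting points. One possible way to get a feel for their effect, sketched here as an assumption rather than as part of the original project, is to compare a few gamma values with 3-fold cross-validation. The candidate gamma values and the choice of cv=3 are illustrative only.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Hypothetical candidate values; only 0.001 is used in this article.
results = {}
for gamma in (0.0001, 0.001, 0.01):
    model = svm.SVC(gamma=gamma, C=100.0)
    scores = cross_val_score(model, X, y, cv=3)  # accuracy on each fold
    results[gamma] = scores.mean()
    print(f"gamma={gamma}: mean accuracy {results[gamma]:.3f}")
```

The printed accuracies depend on the scikit-learn version and on the folds, so they are best read as a rough comparison between settings rather than as final scores.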

Train the model

svc.fit(x_train, y_train)

OUTPUT:

SVC(C=100.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

Test the model

y_pred = svc.predict(x_test)

12. Four test samples and their predicted digit values

_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, prediction in zip(axes, x_test, y_pred):
    ax.set_axis_off()
    # restore the 8x8 shape of the flattened sample before plotting
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title(f'Prediction: {prediction}')
# save the figure
plt.savefig('plot7.png', dpi=300, bbox_inches='tight')

13. Accuracy of the model

score = svc.score(x_test, y_test)
print('Accuracy Score: {0}'.format(score))

OUTPUT:

Accuracy Score: 0.9916666666666667
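For a classifier, score() reports accuracy: the fraction of test samples whose predicted label matches the true label. The plain-Python sketch below spells out that calculation; the helper name accuracy and the five demo labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 5 samples, one wrong prediction.
y_true_demo = [3, 8, 1, 0, 7]
y_pred_demo = [3, 8, 1, 0, 1]
print(accuracy(y_true_demo, y_pred_demo))

# If the test set holds 360 images (20% of 1,797, rounded up), the score
# reported above corresponds to 357 correct predictions.
print(357 / 360)
```

Read this way, the reported score means that the SVC misclassified only 3 of the test images.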

14. Confusion matrix and Classification report of the model

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

A classification report measures the quality of predictions from a classification algorithm, listing precision, recall, and F1-score for each class.
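To make these two ideas concrete, here is a minimal plain-Python sketch that builds a confusion matrix by hand and derives precision and recall for one class from it. The helper names and the three-class demo labels are made up for illustration; the project itself uses the scikit-learn functions.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def precision_recall(matrix, label):
    true_positive = matrix[label][label]
    predicted = sum(row[label] for row in matrix)  # column total
    actual = sum(matrix[label])                    # row total
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


# Made-up labels for a 3-class problem.
y_true_demo = [0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2, 0]

cm_demo = confusion_counts(y_true_demo, y_pred_demo, n_classes=3)
for row in cm_demo:
    print(row)
print(precision_recall(cm_demo, label=1))
```

The diagonal holds the correct predictions, and every off-diagonal cell shows which true digit was mistaken for which predicted digit.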

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
import seaborn as sns
import pandas as pd
labels=['0','1','2', '3','4','5','6','7','8','9']
f, ax = plt.subplots(figsize=(10,10))
cm=confusion_matrix(y_test,y_pred)
sns.heatmap(cm, annot=True,ax=ax,cmap="Dark2_r")
#labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Accuracy Score: {0} \n Confusion Matrix'.format(np.round(score,2)))
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.savefig('plot3.png', dpi=300, bbox_inches='tight')
plt.show()
f, ax = plt.subplots(figsize=(6,6))
class_report=classification_report(y_test,y_pred,target_names=labels, output_dict=True)
sns.heatmap(pd.DataFrame(class_report).iloc[:-1, :].T, annot=True,ax=ax,cmap="Dark2_r")
ax.set_title('Classification Report')
plt.savefig('plot4.png', dpi=300, bbox_inches='tight')
plt.show()

14. Logistic Regression

from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
import warnings
warnings.filterwarnings("ignore")
logisticRegr.fit(x_train, y_train)

OUTPUT:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
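The warnings filter above most likely hides a convergence warning from the default lbfgs solver, which stops after max_iter=100 iterations. As an alternative to silencing it, the sketch below scales the pixel values and allows more iterations. The pipeline, the StandardScaler step, and max_iter=1000 are assumptions added here, not part of the original project.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# Assumption: scaling the pixels and allowing more iterations helps lbfgs converge.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(x_train, y_train)

scaled_score = scaled_logreg.score(x_test, y_test)
print(f"Accuracy Score: {scaled_score:.4f}")
```

The exact score may differ slightly from the one reported in this article, since the model is no longer trained on the raw pixel values.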

15. Testing the model

y_pred=logisticRegr.predict(x_test)

16. Accuracy of the model

# Use score method to get accuracy of model
score = logisticRegr.score(x_test, y_test)
print('Accuracy Score: {0}'.format(score))

OUTPUT:

Accuracy Score: 0.9666666666666667

17. Confusion matrix and Classification report of the model

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
import seaborn as sns
import pandas as pd
labels=['0','1','2', '3','4','5','6','7','8','9']
f, ax = plt.subplots(figsize=(10,10))
cm=confusion_matrix(y_test,y_pred)
sns.heatmap(cm, annot=True,ax=ax,cmap="Dark2_r")
#labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Accuracy Score: {0} \n Confusion Matrix'.format(np.round(score,2)))
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.savefig('plot5.png', dpi=300, bbox_inches='tight')
plt.show()
f, ax = plt.subplots(figsize=(6,6))
class_report=classification_report(y_test,y_pred,target_names=labels, output_dict=True)
sns.heatmap(pd.DataFrame(class_report).iloc[:-1, :].T, annot=True,ax=ax,cmap="Dark2_r")
ax.set_title('Classification Report')
plt.savefig('plot6.png', dpi=300, bbox_inches='tight')
plt.show()

18. Using Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion = 'gini')

Training the data

dt.fit(x_train, y_train)

OUTPUT:

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')

Testing the data

y_pred=dt.predict(x_test)

Checking accuracy of data

# Use score method to get accuracy of model
score = dt.score(x_test, y_test)
print('Accuracy Score: {0}'.format(score))

OUTPUT:

Accuracy Score: 0.8583333333333333

Confusion matrix and Classification report of the model

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
import seaborn as sns
import pandas as pd
labels=['0','1','2', '3','4','5','6','7','8','9']
f, ax = plt.subplots(figsize=(10,10))
cm=confusion_matrix(y_test,y_pred)
sns.heatmap(cm, annot=True,ax=ax,cmap="Dark2_r")
#labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Accuracy Score: {0} \n Confusion Matrix'.format(np.round(score,2)))
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.savefig('plot7.png', dpi=300, bbox_inches='tight')
plt.show()
f, ax = plt.subplots(figsize=(6,6))
class_report=classification_report(y_test,y_pred,target_names=labels, output_dict=True)
sns.heatmap(pd.DataFrame(class_report).iloc[:-1, :].T, annot=True,ax=ax,cmap="Dark2_r")
ax.set_title('Classification Report')
plt.savefig('plot8.png', dpi=300, bbox_inches='tight')
plt.show()

19. Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
rc = RandomForestClassifier(n_estimators = 150)

Train the model

rc.fit(x_train, y_train)

OUTPUT:

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=150,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)

Test the model

y_pred=rc.predict(x_test)

Checking accuracy of model

# Use score method to get accuracy of model
score = rc.score(x_test, y_test)
print('Accuracy Score: {0}'.format(score))

OUTPUT:

Accuracy Score: 0.975

Confusion matrix and Classification report of the model

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
import seaborn as sns
import pandas as pd
labels=['0','1','2', '3','4','5','6','7','8','9']
f, ax = plt.subplots(figsize=(10,10))
cm=confusion_matrix(y_test,y_pred)
sns.heatmap(cm, annot=True,ax=ax,cmap="Dark2_r")
#labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Accuracy Score: {0} \n Confusion Matrix'.format(np.round(score,2)))
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.savefig('plot9.png', dpi=300, bbox_inches='tight')
plt.show()
f, ax = plt.subplots(figsize=(6,6))
class_report=classification_report(y_test,y_pred,target_names=labels, output_dict=True)
sns.heatmap(pd.DataFrame(class_report).iloc[:-1, :].T, annot=True,ax=ax,cmap="Dark2_r")
ax.set_title('Classification Report')
plt.savefig('plot10.png', dpi=300, bbox_inches='tight')
plt.show()
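The four models above go through the same train, test, and score steps, so they can also be compared in a single loop. The sketch below does that without the plots. The random_state values for the tree-based models and max_iter=5000 for logistic regression are assumptions added for repeatability and convergence, so the scores may differ slightly from those reported above.

```python
from sklearn import datasets, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# random_state and max_iter are assumptions, not settings from this article.
models = {
    "SVC": svm.SVC(gamma=0.001, C=100.0),
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Decision Tree": DecisionTreeClassifier(criterion="gini", random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=150, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)                 # train on the training set
    scores[name] = model.score(x_test, y_test)  # accuracy on the test set
    print(f"{name}: {scores[name]:.4f}")
```

Keeping the models in a dictionary makes it easy to add or remove a classifier without repeating the evaluation code.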

Conclusion:

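With the SVC (99.2%), Random Forest (97.5%), and Logistic Regression (96.7%) models, the digit is predicted correctly more than 95% of the time on the test set, so the hypothesis is accepted for these models. The Decision Tree classifier, at 85.8%, falls short of the 95% mark.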
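The comparison behind this conclusion can be written out explicitly. The sketch below checks each accuracy score reported in the sections above against the 95% claim from the hypothesis; the variable names are made up for illustration.

```python
# Accuracy scores as reported in the sections above.
reported = {
    "SVC": 0.9916666666666667,
    "Logistic Regression": 0.9666666666666667,
    "Decision Tree": 0.8583333333333333,
    "Random Forest": 0.975,
}

THRESHOLD = 0.95  # the accuracy claimed in the hypothesis

for name, score in reported.items():
    verdict = "meets" if score >= THRESHOLD else "falls short of"
    print(f"{name}: {score:.2%} {verdict} the 95% claim")

meets_claim = [name for name, score in reported.items() if score >= THRESHOLD]
print(meets_claim)
```

These scores come from a single train/test split, so they indicate how each model performed on this particular test set rather than giving a guarantee for new handwriting.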

I am thankful to mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com

Source Code: Github

Connect with me:

LinkedIn: https://www.linkedin.com/in/neha-kumari-09415a16b/

GitHub: https://github.com/neha07kumari
