# Optical Character Recognition using Machine Learning

In this example we'll explore recognizing handwritten digits from the MNIST dataset using Logistic Regression, a simple Artificial Neural Network, and finally a Deep Convolutional Network. The objective is to highlight the differences in performance between them.

In each case, the model will also be tested against my own handwriting to see how well it predicts completely unseen data.

The MNIST dataset consists of thousands of handwritten digits (0-9). Let's have a look at the first four in the dataset:

```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from PIL import Image, ImageOps
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.utils import np_utils
from keras import backend as K
K.set_image_dim_ordering('th')
# Load the MNIST data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Display the first four digits
plt.figure(figsize=(5, 5))
for i in range(4):
    plt.subplot(221 + i)
    plt.imshow(X_train[i], cmap=plt.get_cmap('gray'))
    plt.axis('off')
plt.show()
```

## Logistic Regression

Logistic Regression is fundamentally a binary classifier: it implements a logistic model in which the log-odds of an event are a linear combination of the independent variables. It can be extended to multiclass classification as well. In this case a multiclass logistic regression (one binary classifier per digit, one-vs-rest) is fitted on the pixel data from the images.
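Concretely, the model scores each class with a linear combination of the input features and maps the scores to probabilities. A minimal sketch of that computation with made-up weights (`W`, `b`, and `x` here are illustrative toy values, not the fitted coefficients):

```python
import numpy as np

# Toy multiclass logistic model: 4 input features, 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # one weight vector per class
b = np.zeros(3)               # one intercept per class
x = rng.normal(size=4)        # a single feature vector

scores = W @ x + b            # linear combination per class (the log-odds)
probs = np.exp(scores) / np.exp(scores).sum()  # normalize to probabilities
pred = probs.argmax()         # predicted class = most probable one
```

For the MNIST images `W` has one 784-dimensional row per digit, which is what we will visualize as heatmaps below.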

To convert an image to a feature vector, the images are reshaped so that each pixel in the 28x28 image becomes a feature (784 in total). We would expect each digit's classifier to have low coefficients on the pixels that the digit doesn't occupy and high coefficients on the pixels it tends to occupy.

The logistic regression is fitted with *scikit-learn*. The accuracy on the training and test sets is reported after fitting and prediction.

```
# Fix random seed for reproducibility
np.random.seed(13)
# Flatten the 28*28 images to a 784 item vector for each image
num_pixels = X_train.shape[1] * X_train.shape[2]
X_train = X_train.reshape(X_train.shape[0], num_pixels).astype('float32')
X_test = X_test.reshape(X_test.shape[0], num_pixels).astype('float32')
# Normalize inputs from 0-255 to 0-1
X_train = X_train / 255
X_test = X_test / 255
# One hot encode outputs
y_train_1h = np_utils.to_categorical(y_train)
y_test_1h = np_utils.to_categorical(y_test)
num_classes = y_test_1h.shape[1]
# Define the logistic regression model
log_reg = LogisticRegression(C=1.0, fit_intercept=True, intercept_scaling=1,
                             max_iter=100, multi_class='ovr', n_jobs=1,
                             penalty='l2', random_state=None, solver='lbfgs',
                             tol=0.0001)
# Fit the logistic regression to the data
log_reg.fit(X_train, y_train)
# Predict on the training and test sets
train_pred = log_reg.predict(X_train)
train_pred_1h = np_utils.to_categorical(train_pred)
test_pred = log_reg.predict(X_test)
test_pred_1h = np_utils.to_categorical(test_pred)
print("Training set accuracy: " + str(accuracy_score(y_train, train_pred)))
print("Test set accuracy: " + str(accuracy_score(y_test, test_pred)))
```

As mentioned before, we would expect the coefficients to be large for the pixels most commonly occupied by each digit. To test this hypothesis, a heatmap is created for each coefficient vector by reshaping it to the original image dimensions and plotting the values. As can be seen, each heatmap shows the "prototype" digit that the logistic regression found most representative.

```
# Plot the heatmap for the coefficients for all numbers
plt.figure(figsize=(10, 10))
for i in range(9):
    plt.subplot(4, 3, i + 1)
    plt.imshow(log_reg.coef_[i+1, :].reshape(28, 28), cmap=plt.get_cmap('gray'))
    plt.axis('off')
plt.subplot(4, 3, 11)
plt.imshow(log_reg.coef_[0, :].reshape(28, 28), cmap=plt.get_cmap('gray'))
plt.axis('off')
plt.show()
```

The logistic regression can also be tested on my own handwriting; it fails to predict most of the digits correctly:

```
# Load the digits in my own handwriting
X_self = np.zeros((10, 28, 28))
for i in range(10):
    X_self[i, :, :] = np.array(ImageOps.invert(Image.open(str(i) + ".jpg").convert('L')))
X_self = X_self.reshape(X_self.shape[0], num_pixels).astype('float32')
X_self = X_self / 255
y_self = np.arange(0, 10)
self_pred = log_reg.predict(X_self)
plt.figure(figsize=(7, 7))
for i in range(9):
    plt.subplot(4, 3, i + 1)
    plt.imshow(X_self[i + 1, :].reshape(28, 28), cmap=plt.get_cmap('gray'))
    plt.title("Predicted: " + str(self_pred[i + 1]))
    plt.axis('off')
plt.subplot(4, 3, 11)
plt.imshow(X_self[0, :].reshape(28, 28), cmap=plt.get_cmap('gray'))
plt.title("Predicted: " + str(self_pred[0]))
plt.axis('off')
plt.show()
```

The overall accuracy reported by the code above (~75%) shows that the chosen logistic regression models the data rather poorly, but it doesn't tell us which digits cause the accuracy to drop. For that, the ROC curve for each digit is plotted, extracting each digit's predictions via one-hot encoding:

```
# Plot the ROC curves for all digits
colors = ['#a6cee3', '#1f78b4', '#b2df8a',
          '#33a02c', '#fb9a99', '#e31a1c',
          '#cab2d6', '#6a3d9a', '#fdbf6f',
          '#ff7f00']
plt.figure(figsize=(5, 5))
for i in range(10):
    fpr, tpr, _ = roc_curve(y_train_1h[:, i], train_pred_1h[:, i])
    plt.plot(fpr, tpr, label=str(i), color=colors[i])
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.legend()
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')
plt.show()
```
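Each point on these curves is a (false positive rate, true positive rate) pair computed from the binarized one-vs-rest labels. A minimal sketch of how one such pair is computed, using made-up toy labels rather than the MNIST predictions:

```python
import numpy as np

# Toy binarized labels for one digit, e.g. "is it a 5?" (illustrative values)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # correctly flagged as the digit
fp = np.sum((y_pred == 1) & (y_true == 0))  # wrongly flagged as the digit
fn = np.sum((y_pred == 0) & (y_true == 1))  # missed occurrences of the digit
tn = np.sum((y_pred == 0) & (y_true == 0))  # correctly rejected

tpr = tp / (tp + fn)  # true positive rate (y-axis of the ROC plot)
fpr = fp / (fp + tn)  # false positive rate (x-axis of the ROC plot)
```

With hard 0/1 predictions, as here, each curve has a single interior point; `roc_curve` draws the segments through it.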

We can see that the digit with the lowest area under the curve is 5. Intuitively this is not very surprising, as it bears a strong resemblance to 6. The AUC for each digit helps us identify which ones are easier to detect (the ones with the higher AUC):

```
for i in range(10):
    auc = roc_auc_score(y_train_1h[:, i], train_pred_1h[:, i])
    print("Digit: " + str(i) + ", AUC: " + str(auc))
```

## Simple Artificial Neural Network

The next Machine Learning algorithm we'll test is the Artificial Neural Network which, put simply, is a computational system loosely inspired by biological systems (brains) that can learn arbitrary functions without the need to program task-specific rules.
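Stripped of the training machinery, such a network is just matrix multiplications followed by nonlinearities. A minimal forward-pass sketch in NumPy, where the layer sizes mirror the Keras model defined below but the random weights and the input are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(13)

# Toy network: 784 inputs -> 784 hidden units (ReLU) -> 10 outputs (softmax)
W1 = rng.normal(scale=0.01, size=(784, 784))
b1 = np.zeros(784)
W2 = rng.normal(scale=0.01, size=(784, 10))
b2 = np.zeros(10)

x = rng.random(784)                    # one flattened 28x28 image

h = np.maximum(0, x @ W1 + b1)         # hidden layer with ReLU activation
scores = h @ W2 + b2                   # raw class scores
probs = np.exp(scores - scores.max())  # softmax (shifted for stability)
probs /= probs.sum()
```

Training consists of adjusting `W1`, `b1`, `W2`, `b2` by backpropagation so that `probs` puts high mass on the correct digit, which is exactly what `model.fit` does for us below.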

We define an ANN with Keras, with a single hidden layer containing the same number of neurons as the input (28x28 = 784 pixels). The output layer is configured to convert the outputs into class probabilities, allowing the network to classify an image as a digit. The model is trained over 10 epochs, updating the weights every 200 images (the batch size). The output of the code shows the accuracy on the training and test sets.

```
# Define the model for a simple neural network
def baseline_model():
    # Create model
    model = Sequential()
    model.add(Dense(num_pixels,
                    input_dim=num_pixels,
                    kernel_initializer='normal',
                    activation='relu'))
    model.add(Dense(num_classes,
                    kernel_initializer='normal',
                    activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
model = baseline_model()
# Fit the model
model.fit(X_train, y_train_1h,
          validation_data=(X_test, y_test_1h),
          epochs=10,
          batch_size=200,
          verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_train, y_train_1h, verbose=0)
print("Baseline Error on Training Set: %.2f%%" % (100-scores[1]*100))
scores = model.evaluate(X_test, y_test_1h, verbose=0)
print("Baseline Error on Test Set: %.2f%%" % (100-scores[1]*100))
```

We can see a nice boost in accuracy with respect to logistic regression, even though we are just using perceptrons to map the images to digit classes. We can now test the model on my handwriting:

```
self_pred = model.predict(X_self)
plt.figure(figsize=(7, 7))
for i in range(9):
    plt.subplot(4, 3, i + 1)
    plt.imshow(X_self[i + 1, :].reshape(28, 28), cmap=plt.get_cmap('gray'))
    plt.title("Predicted: " + str(np.argmax(self_pred[i + 1])))
    plt.axis('off')
plt.subplot(4, 3, 11)
plt.imshow(X_self[0, :].reshape(28, 28), cmap=plt.get_cmap('gray'))
plt.title("Predicted: " + str(np.argmax(self_pred[0])))
plt.axis('off')
plt.show()
```

Even though accuracy increased by ~20% with respect to logistic regression, the simple neural network still makes mistakes when trying to recognize my handwriting.

We can then look at the ROC curves to see the improvement in classification ability on the training data: the AUC approaches 1 for every digit, showing a much more competent classifier than logistic regression.

```
# Calculate prediction for training data
train_pred = model.predict(X_train)
# Plot the ROC curves for all digits
colors = ['#a6cee3', '#1f78b4', '#b2df8a',
          '#33a02c', '#fb9a99', '#e31a1c',
          '#cab2d6', '#6a3d9a', '#fdbf6f',
          '#ff7f00']
plt.figure(figsize=(5, 5))
for i in range(10):
    fpr, tpr, _ = roc_curve(y_train_1h[:, i], train_pred[:, i])
    plt.plot(fpr, tpr, label=str(i), color=colors[i])
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.legend()
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')
plt.show()
```

## Convolutional Neural Network

Finally, we will implement a deeper convolutional neural network with Keras. We have to reshape the data differently in this case, as a convolutional network expects pixel values with a channel dimension in addition to width and height. For the MNIST data, the channel dimension is 1 (grayscale), and the width and height are 28.

The network architecture implemented in this example has the following components:

- The first hidden layer is a convolutional layer that performs a 2D convolution with 32 5x5 filters.
- The second hidden layer is a pooling layer that takes the maximum over 2x2 patches.
- The third hidden layer is a regularization (dropout) layer that randomly excludes 20% of the neurons to reduce overfitting.
- The fourth hidden layer flattens the 2D feature maps into a vector.
- The fifth hidden layer is a fully connected layer with 128 neurons and a rectifier activation function.
- The output layer has one neuron per class (10 neurons) and a softmax activation function to output full class probabilities.

This network is trained over the same number of epochs and with the same batch size as the perceptron network.
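To make the first two layers concrete, here is a hand-rolled sketch of what a single 2D convolution and a 2x2 max pool compute on a small array. The `img` and `kernel` values are toy examples, and this loop version is only conceptual, not how Keras implements these layers:

```python
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                 # toy 2x2 filter

# "Valid" 2D convolution: slide the filter over every 2x2 patch and
# take the element-wise product sum (cross-correlation, as in most
# deep learning libraries)
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(img[i:i+2, j:j+2] * kernel)

# 2x2 max pooling with stride 2: keep the maximum of each 2x2 block
pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))
```

During training, the network learns the filter values themselves (32 of them in the first layer, each 5x5), while pooling simply downsamples the resulting feature maps.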

```
# Fix random seed for reproducibility
np.random.seed(13)
# Load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Reshape to be [samples][pixels][width][height]
X_train = X_train.reshape(X_train.shape[0], 1, 28, 28).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 1, 28, 28).astype('float32')
# Normalize inputs
X_train = X_train / 255
X_test = X_test / 255
# One hot encode outputs
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]
def baseline_model():
    # Create model
    model = Sequential()
    model.add(Conv2D(32, (5, 5), input_shape=(1, 28, 28), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
# Build the model
model = baseline_model()
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_train, y_train, verbose=0)
print("Baseline Error on Training Set: %.2f%%" % (100-scores[1]*100))
scores = model.evaluate(X_test, y_test, verbose=0)
print("Baseline Error on Test Set: %.2f%%" % (100-scores[1]*100))
```

We can see that each epoch takes significantly longer than with the perceptron network, but the results are evidently much better. The convolutional network is better able to capture the structure of the images, and in this case within reasonable time and on modest hardware. We can compare the results on my handwriting to reinforce this:

```
# Load the digits in my own handwriting
X_self = np.zeros((10, 28, 28))
for i in range(10):
    X_self[i, :, :] = np.array(ImageOps.invert(Image.open(str(i) + ".jpg").convert('L')))
X_self = X_self.reshape(X_self.shape[0], 1, 28, 28).astype('float32')
X_self = X_self / 255
y_self = np.arange(0, 10)
self_pred = model.predict(X_self)
plt.figure(figsize=(7, 7))
for i in range(9):
    plt.subplot(4, 3, i + 1)
    plt.imshow(X_self[i + 1, :].reshape(28, 28), cmap=plt.get_cmap('gray'))
    plt.title("Predicted: " + str(np.argmax(self_pred[i + 1])))
    plt.axis('off')
plt.subplot(4, 3, 11)
plt.imshow(X_self[0, :].reshape(28, 28), cmap=plt.get_cmap('gray'))
plt.title("Predicted: " + str(np.argmax(self_pred[0])))
plt.axis('off')
plt.show()
```

The convolutional network makes almost no mistakes, even though its overall accuracy on the test set increased only slightly. This shows that the power of deep learning is real: even if it is more computationally costly, it generalizes better to unseen data than standard neural networks.

In this case character recognition was only applied to digits, but it could be extended to letters as well. For OCR on documents, where full words or sentences must be recognized, the principle would be the same but other concepts would have to be implemented, like pagination and segmentation. Nonetheless, for the purpose of this example, convolutional networks show the muscle of deep learning in character recognition.