Resnet-152 finetuning in Keras
July 2017 [author: Adam Brzeski – CTA.ai]
In this short experiment we will implement simple transfer learning with the Keras framework in order to get a fairly reliable classifier with little effort and very little data. For this purpose we will use the Resnet-152 model that was trained on ImageNet and finetune it for the task of classifying interiors of 67 different kinds of places. The data used in the experiment comes from the MIT Indoor-67 dataset.

The full code of this experiment is available on GitHub. It uses Keras 2.0 (ver. 2.0.4) with the TensorFlow backend (ver. 1.2.0-rc2), running on Python 3.6.

To reproduce this experiment, you should download the data package, which is available here. Unpack it in your home directory so that the paths match the code.

The data package consists of several elements:

  • Indoor-67 images, split into train, val and test subsets
  • ImageNet-trained Resnet-152 weights, acquired from here
  • Text file mapping ImageNet ids to class names, acquired from here
  • Cached features from Resnet-152 model and labels for images in train, val and test subsets

Resnet-152 for Keras

The first step in this experiment was to get a Resnet-152 model for Keras. Unfortunately, Resnet models are not officially available for Keras, but thanks to this great contribution we can have both the network implementation and the model weights, converted directly from the official Caffe models. The code is for Keras 1.0, but you can easily get a version adapted for Keras 2.0 here, which is the one included in our code.

We will begin by checking out the model to see whether it correctly classifies objects from the ImageNet dataset classes. We created a simple resnet_demo.py script that lets you run the network on chosen images from Google just by copying them to the clipboard. Here are some sample results, with correct classes in bold:
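Under the hood, the demo boils down to running the full model on a preprocessed image and mapping the top predictions to ImageNet class names. Here is a minimal sketch of that flow; the mapping file name and its one-name-per-line format are assumptions and may differ from the file in the data package:

import os
import numpy as np
import skimage.io
from resnet import resnet152
import helper  # provides the preprocess() function shown later in this post

WEIGHTS_RESNET = os.path.expanduser("~/ml/models/keras/resnet152/resnet152_weights_tf.h5")

# Load the ImageNet id-to-name mapping (hypothetical filename and format)
with open("imagenet_classes.txt") as f:
    class_names = [line.strip() for line in f]

# Load the model and classify a single image
resnet_model = resnet152.resnet152_model(WEIGHTS_RESNET)
im = helper.preprocess(skimage.io.imread("sample.jpg"))
probs = resnet_model.predict(im).flatten()

# Print the top-5 predicted ImageNet classes
for idx in np.argsort(probs)[::-1][:5]:
    print("{:.2f} {}".format(probs[idx], class_names[idx]))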

Indoor classification and reference accuracy

Now let’s get back to our task. First, let’s see some of the images from the Indoor-67 dataset. As you can see below, many of the images and classes are highly characteristic and clear, but some of them are not easy to distinguish, like fastfood restaurant and restaurant, or library and bookstore, which makes the classification task challenging.

Before we implement our classifier, let’s also check some reference results that were reported for this dataset in research papers, so that we know what level of accuracy will be satisfactory. The authors of the dataset [1] proposed their own method for classifying the images; a decade ago, however, it was quite a difficult task, and their method achieved a mean per-class accuracy of only 26%. The results improved over the years, and the current best method reached 79% in 2016. As we will see in a moment, our simple approach will achieve an accuracy of 73.2%, which is not far from the current state-of-the-art and actually better than the top result from 2015.

Our attempt

We will use the output of the last-but-one layer of the Resnet-152 model, which is a pooling layer with 2048 outputs, and use it as a feature vector for the classification task. In the original model this vector is directly used by the last layer to perform the classification (into the 1000 ImageNet classes), so it is expected to contain a kind of summary of the input image. When treated as a feature vector of an image, the vector of outputs of a given layer is also often called a code or bottleneck features. Note that you can use other layers for your feature vector as well (though you may have to compress them with pooling or in some other way, as sketched below).
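For example, here is a hedged sketch of building a feature extractor over an earlier layer and compressing its spatial dimensions with global average pooling. The layer name below is hypothetical and depends on the naming in the resnet152 implementation (check resnet_model.summary() for the real names):

import os
from keras.layers import GlobalAveragePooling2D
from keras.models import Model
from resnet import resnet152

WEIGHTS_RESNET = os.path.expanduser("~/ml/models/keras/resnet152/resnet152_weights_tf.h5")

# Build a feature sub-model over an earlier layer; 'res4b35_relu'
# is a hypothetical layer name
resnet_model = resnet152.resnet152_model(WEIGHTS_RESNET)
earlier_output = resnet_model.get_layer('res4b35_relu').output
pooled = GlobalAveragePooling2D()(earlier_output)
alt_features_model = Model(inputs=resnet_model.input, outputs=pooled)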

Build features

In order to avoid computing the feature vector of a single image multiple times during training, we will first cache the vectors by computing the model outputs for each image and storing them in numpy arrays on disk. Note that the Indoor-67 dataset comes with a predefined train/test split, but we additionally extracted some of the train images to create a validation subset. The generated features for the images in each subset are already contained in the data package provided with this tutorial, so you can also skip this step if you're short on time.

Here are some basic steps to calculate and cache Resnet-152 features:

  1. First, load the downloaded Resnet-152 model along with the provided ImageNet weights (for convenience included in our data package) and create a features sub-model over it, which will output the average pooling layer instead of the final classification layer:
import os
from keras.models import Model
from resnet import resnet152

WEIGHTS_RESNET = os.path.expanduser("~/ml/models/keras/resnet152/resnet152_weights_tf.h5")

# Load Resnet 152 model and construct feature extraction submodel
resnet_model = resnet152.resnet152_model(WEIGHTS_RESNET)
feature_layer = 'avg_pool'
features_model = Model(inputs=resnet_model.input,
                       outputs=resnet_model.get_layer(feature_layer).output)
  2. Before we can pass our images into the model, we must remember to properly preprocess them. In particular, we need to convert the color space from RGB, which is the default format loaded by skimage, to BGR, which is the format assumed by the Resnet-152 model. We also need to resize the image to comply with the model, as well as scale the pixel values from <0, 1> to <0, 255>. It is also very important to subtract the training-set mean from the input image before passing it to the model. In this case, the training-set mean has the form of 3 BGR values, representing the mean value of each channel over the training dataset. So let’s prepare a dedicated function for that:
import numpy as np
import skimage.transform

def preprocess(im):

    """
    Preprocesses image array for classifying using ImageNet trained Resnet-152 model
    :param im: RGB, RGBA float-type image or grayscale image
    :return: ready image for passing to a Resnet model
    """

    # Some special cases handling
    # …

    # RGB to BGR
    im = im[:, :, ::-1]

    # Resize and scale values to <0, 255>
    im = skimage.transform.resize(im, (224, 224), mode='constant').astype(np.float32)
    im *= 255

    # Subtract ImageNet mean
    im[:, :, 0] -= 103.939
    im[:, :, 1] -= 116.779
    im[:, :, 2] -= 123.68

    # Add a dimension
    im = np.expand_dims(im, axis=0)

    return im
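The special-case handling is elided in the function above. A minimal sketch of what it might cover (an assumption on our part, not the original code) is converting grayscale and RGBA inputs to plain 3-channel RGB before the rest of the pipeline runs:

# Hypothetical special-case handling (not the original code):
# ensure the input is a plain 3-channel RGB array
if im.ndim == 2:
    # Grayscale: replicate the single channel three times
    im = np.stack([im] * 3, axis=-1)
elif im.shape[2] == 4:
    # RGBA: drop the alpha channel
    im = im[:, :, :3]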
  3. Now we can load our images and run the model to get the feature vectors, which is fairly simple:
# Load image
im = skimage.io.imread(path)
im = helper.preprocess(im)

# Run model to get features
code = features_model.predict(im).flatten()

This way we’ll compute the feature vectors for the entire dataset (contained in our data package) and store them in an array along with integer class ids, which we infer from the paths, as they include the class names. Finally, we can save the cached features and labels to disk. Note that we actually do this separately for the train, val and test subsets. The entire process can take up to 2 hours on a CPU, but it is a matter of minutes on a GPU.

import glob
import json
import os
import numpy as np
import skimage.io
import helper

DATA_SUBSETS = [
    os.path.expanduser("~/ml/data/indoor/train"),
    os.path.expanduser("~/ml/data/indoor/val"),
    os.path.expanduser("~/ml/data/indoor/test"),
]
FEATURES_FILENAME = "features-resnet152.npy"
LABELS_FILENAME = "labels-resnet152.npy"
PATHS_FILENAME = "paths-resnet152.json"
NAMES_TO_IDS = json.load(open("names_to_ids.json"))

# For each data subset
for datadir in DATA_SUBSETS:

    features = []
    labels = []
    images_list = glob.glob(datadir + "/*/*.jpg")

    # Process images
    for path in images_list:

        # Load image
        im = skimage.io.imread(path)
        im = helper.preprocess(im)

        # Run model to get features
        code = features_model.predict(im).flatten()

        # Cache result
        label = NAMES_TO_IDS[os.path.basename(os.path.dirname(path))]
        labels.append(label)
        features.append(code)

    # Save features, labels and source paths to disk
    np.save(os.path.join(datadir, FEATURES_FILENAME), features)
    np.save(os.path.join(datadir, LABELS_FILENAME), np.uint8(labels))
    with open(os.path.join(datadir, PATHS_FILENAME), "w") as f:
        json.dump(images_list, f)

Train

After caching feature vectors for the dataset we can move on to training our classifier.

  1. First of all, we will load the cached features and labels from the train and val subsets, which will be the inputs for our classifier. Note that we convert our integer labels to one-hot vectors using the to_categorical() function available in Keras.
import os
import numpy as np
import keras

TRAIN_DIR = os.path.expanduser("~/ml/data/indoor/train")
VAL_DIR = os.path.expanduser("~/ml/data/indoor/val")
FEATURES_FILENAME = "features-resnet152.npy"
LABELS_FILENAME = "labels-resnet152.npy"

# Load train data
train_features = np.load(os.path.join(TRAIN_DIR, FEATURES_FILENAME))
train_labels = np.load(os.path.join(TRAIN_DIR, LABELS_FILENAME))
train_labels = keras.utils.np_utils.to_categorical(train_labels)

# Load val data
val_features = np.load(os.path.join(VAL_DIR, FEATURES_FILENAME))
val_labels = np.load(os.path.join(VAL_DIR, LABELS_FILENAME))
val_labels = keras.utils.np_utils.to_categorical(val_labels)
  2. Next, we build a fresh classification layer that we will train for our place-recognition task. We create just a single fully-connected layer (called Dense in Keras). We initialize the weights with a truncated normal distribution and the biases with zeros. As our problem is one-of-many classification, we use softmax activation. We will use SGD (Stochastic Gradient Descent) as the optimizer, with a base learning rate of 0.1, which we experimentally found to work well in this case. Finally, we choose categorical_crossentropy as our loss function and accuracy as the metric to log during training:
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD

# Build softmax model
classifier_model = Sequential()
classifier_model.add(Dense(67, activation='softmax',
                           kernel_initializer='TruncatedNormal',
                           bias_initializer='zeros',
                           input_shape=train_features.shape[1:]))

# Define optimizer and compile
opt = SGD(lr=0.1)
classifier_model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
  3. We will also define some callbacks. Firstly, we will slowly reduce the learning rate during training, which improves the optimization process. Secondly, we want to save the best model that shows up during training. By default the Keras checkpointer will pick the model with the lowest validation loss, which is what we need to prevent overfitting:
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

WEIGHTS_CLASSIFIER = "classifier_weights.h5"

# Prepare callbacks
lr_decay = ReduceLROnPlateau(factor=0.9, patience=1, verbose=1)
checkpointer = ModelCheckpoint(filepath=WEIGHTS_CLASSIFIER,
                               save_best_only=True,
                               verbose=1)
  4. Now we can finally run the training:
# Train
classifier_model.fit(train_features, train_labels,
                     epochs=50,
                     batch_size=256,
                     validation_data=(val_features, val_labels),
                     callbacks=[lr_decay, checkpointer])

Training should take just around a minute, and your output should look like this:

Epoch 27/50
 256/4690 [>.............................] - ETA: 0s - loss: 0.1456 - acc: 1.0000
1792/4690 [==========>...................] - ETA: 0s - loss: 0.1585 - acc: 0.9877
3584/4690 [=====================>........] - ETA: 0s - loss: 0.1556 - acc: 0.9877
4690/4690 [==============================] - 0s - loss: 0.1547 - acc: 0.9883 - val_loss: 1.0582 - val_acc: 0.6910
Epoch 00026: val_loss improved from 1.06202 to 1.05825, saving model to classifier_weights.h5

Test

Now that training is complete, it’s time to test our classifier.

  1. First, let’s load cached feature vectors and labels from the test data:
import os
import numpy as np

TEST_DIR = os.path.expanduser("~/ml/data/indoor/test")
FEATURES_FILENAME = "features-resnet152.npy"
LABELS_FILENAME = "labels-resnet152.npy"

# Load test data
test_features = np.load(os.path.join(TEST_DIR, FEATURES_FILENAME))
test_labels = np.load(os.path.join(TEST_DIR, LABELS_FILENAME))
  2. Next, build the classifier model and load the weights that we trained:
import os
from keras.layers import Dense
from keras.models import Sequential

WEIGHTS_CLASSIFIER = "classifier_weights.h5"

# Load top layer classifier model
classifier_model = Sequential()
classifier_model.add(Dense(67, activation='softmax', input_shape=test_features.shape[1:]))
classifier_model.load_weights(WEIGHTS_CLASSIFIER)
  3. Now we will run the classifier over the test samples and count the correct answers for each class:
from collections import defaultdict
import numpy as np

# Classify the test set, count correct answers
all_count = defaultdict(int)
correct_count = defaultdict(int)
for code, gt in zip(test_features, test_labels):

    code = np.expand_dims(code, axis=0)
    prediction = classifier_model.predict(code)
    result = np.argmax(prediction)

    all_count[gt] += 1
    if gt == result:
        correct_count[gt] += 1

# Calculate accuracies
print("Average per class acc:",
      np.mean([correct_count[classid] / all_count[classid]
               for classid in all_count.keys()]))

And now we see our test result:

Average per class acc: 0.73175781553
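For comparison, the overall (non-per-class) accuracy can be computed from the same counters with one extra line, a trivial addition that is not part of the original code:

# Overall accuracy over all test samples (hypothetical addition)
print("Overall acc:", sum(correct_count.values()) / sum(all_count.values()))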

Test demo

And finally, the best part: let’s see for ourselves how the classifier performs on the test data. We will run a demo that performs the full classification pipeline (including feature extraction) to see how it works on particular images.

  1. First, we need to load both of our models: the feature extraction model and the classifier model:
import os
from keras.layers import Dense
from keras.models import Model, Sequential
from resnet import resnet152

WEIGHTS_RESNET = os.path.expanduser("~/ml/models/keras/resnet152/resnet152_weights_tf.h5")
WEIGHTS_CLASSIFIER = "classifier_weights.h5"

# Load Resnet 152 model and construct feature extraction submodel
resnet_model = resnet152.resnet152_model(WEIGHTS_RESNET)
feature_layer = 'avg_pool'
feature_vector_size = int(resnet_model.get_layer(feature_layer).output.shape[-1])
features_model = Model(inputs=resnet_model.input,
                       outputs=resnet_model.get_layer(feature_layer).output)

# Load classifier model
classifier_model = Sequential()
classifier_model.add(Dense(67, activation='softmax', input_shape=[feature_vector_size]))
classifier_model.load_weights(WEIGHTS_CLASSIFIER)
  2. Now iterate through the test images, classify them and show the results:
import glob
import json
import os
import random
import numpy as np
import skimage.io
from matplotlib import pyplot as plt
import helper

IDS_TO_NAMES = json.load(open("ids_to_names.json"))

# Load test images
paths = glob.glob(os.path.expanduser("~/ml/data/indoor/test/*/*.jpg"))
random.shuffle(paths)

# Classify images
for path in paths:

    print("Classifying image: ", path)

    # Load and preprocess image
    im = skimage.io.imread(path)
    transformed = helper.preprocess(im)
    if transformed is None: continue

    # Classify
    code = features_model.predict(transformed).reshape(1, -1)
    prediction = classifier_model.predict(code)

    # Print result
    prediction = prediction.flatten()
    top_idx = np.argsort(prediction)[::-1][:5]
    for i, idx in enumerate(top_idx):
        print("{}. {:.2f} {}".format(i + 1, prediction[idx], IDS_TO_NAMES[str(idx)]))

    # Show image
    skimage.io.imshow(im)
    plt.show()

And here are some randomly picked results:

Conclusions

In this short experiment with indoor place classification we eventually achieved a pretty nice result of 73% accuracy, which is in fact a little better than the methods from previous years, and only 6% lower than the current state-of-the-art. The presented approach is fairly general and can be used in many problems other than indoor photo classification. Notably, the overall training takes only about 2 hours even without a GPU, while the classification layer itself trains in just a minute.

There are still several things that could be done better, though. First of all, you can consider slightly retraining the entire network with a lower learning rate, instead of freezing all of the layers up to the final pooling layer, as sketched below; this will obviously take longer to train. You can also try earlier layers for the feature vectors and compress them to handle their larger dimensions. Furthermore, you could run a hyperparameter search to find better training parameters, especially the learning rate and its decay policy, or the batch size. In this tutorial the training parameters were simply adjusted experimentally, but in general you would prefer to run an automated random search to check a larger number of parameter sets. Finally, you can also pay more attention to the data itself and apply more or less fancy data augmentation techniques.
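As an illustration of the first idea, here is a minimal sketch of end-to-end fine-tuning; it is not part of the original code, and the learning rate is a guess that would need tuning. We attach a fresh 67-way softmax head directly on top of the pooled features and train the whole network on preprocessed images:

import os
from keras.layers import Dense, Flatten
from keras.models import Model
from keras.optimizers import SGD
from resnet import resnet152

WEIGHTS_RESNET = os.path.expanduser("~/ml/models/keras/resnet152/resnet152_weights_tf.h5")

# Attach a fresh softmax head on top of the average pooling layer
resnet_model = resnet152.resnet152_model(WEIGHTS_RESNET)
x = Flatten()(resnet_model.get_layer('avg_pool').output)
predictions = Dense(67, activation='softmax')(x)
full_model = Model(inputs=resnet_model.input, outputs=predictions)

# Use a much smaller learning rate than for the classifier alone, so the
# pretrained weights are only gently adjusted (the value is a guess)
full_model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])

# full_model.fit(...) would now consume preprocessed images
# rather than cached feature vectors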

References

[1] Quattoni, Ariadna, and Antonio Torralba. "Recognizing indoor scenes." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

[2] Zhou, Bolei, et al. "Learning deep features for scene recognition using places database." Advances in neural information processing systems. 2014.

[3] Khan, Salman H., et al. "A discriminative representation of convolutional features for indoor scene recognition." IEEE Transactions on Image Processing 25.7 (2016): 3372-3383.

[4] Herranz, Luis, Shuqiang Jiang, and Xiangyang Li. "Scene recognition with CNNs: objects, scales and dataset bias." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
