Classification of Food and Restaurant Images using CNNs

3830 words (15 pages) Essay in Computer Science

23/09/19 Computer Science

Disclaimer: This work has been submitted by a student.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UK Essays.

Abstract

Over the last decade, rapid advances in deep learning and convolutional neural networks (CNNs) have proved beneficial for image classification tasks. In this report, we present our experiments on food and restaurant classification using various CNN architectures. The experiments for food recognition were conducted on a public food dataset consisting of eleven different food categories. For restaurant classification, the Yelp Kaggle dataset was used. Experimental results show an F1 score of 0.781 on food category recognition and an F1 score of 0.822 on restaurant classification.

1.    Introduction

In this age of photo-centric social storytelling, food photography is on the rise. The success of applications like Instagram and Snapchat has given businesses an opportunity to leverage this trend by incentivizing guests to take photos that show them engaging with the brand and to share these on social media. As a result, countless food photos have flooded social media. This abundance of food photos has given rise to a new challenge: automatically classifying food images and tagging restaurants with attributes.

Over the past several years, progress in image processing, machine learning, deep learning and CNNs has improved image recognition considerably. However, the search for a highly accurate solution for food classification is still ongoing, given the huge variety of food items and the highly mixed dishes that appear in many images. Because different foods can be similar in color or shape, and are sometimes indistinguishable even to the human eye, correctly identifying every food item is a daunting task.

In the field of dietary assessment, food image analysis has led to many developments, including accurate automatic image-based estimation of daily nutritional intake, which offers a way to monitor health and prevent disease. If a generic food item can be recognized reliably, its dietary value can then be estimated.

Every day, millions of users upload numerous photos of different businesses to Yelp. Though most photos are tagged with simple information such as the business name, deeper information typically is not revealed until a human looks at them. Such information includes whether the restaurant is good for lunch or dinner, whether it has outdoor seating, etc. It can be crucial in helping potential customers decide whether a restaurant is the right place for them to dine. Hand labeling these restaurants is not only costly but practically impossible, since photos are uploaded to Yelp almost every second. Hence deep learning techniques, specifically image classification methods, are used to identify the context and extract this information from the photos on Yelp.

This report presents the results of several CNN architectures on food category recognition and restaurant classification and compares them using two different performance criteria. We evaluate the performance of ResNet50, VGG16, VGG19, InceptionV3, InceptionV4 and Xception using accuracy and F1 score. The comparison covers common variations and different parameter spaces such as regularization, data augmentation, different optimizers, learning rates, dropout layers and approaches based on transfer learning.

2.    Related Work

Most researchers who work on food recognition assume that only one food item is present in each image. Thus, food recognition can be solved as a multi-class, single-instance, single-label classification problem. The task of predicting attributes for restaurants using visual cues from guest-submitted photos, by contrast, is a multi-instance multi-label (MIML) learning problem.

CNNs have been widely used in food recognition and provide better performance than conventional computer vision methods like combining SIFT features with classifiers like SVM. Bossard et al. [1] trained a deep CNN from scratch on the Food-101 dataset using AlexNet proposed by Krizhevsky et al. [2] and achieved 56.4% top-1 accuracy. In [3], Kawano et al. used CNN as a feature extractor and achieved the best accuracy of 72.3% on the UEC-Food-100 [4] dataset.

The VGG model [7] has shown state of the art performances on many classification tasks and usually outperforms the AlexNet architecture.

Yu et al. [8] proposed a CNN-based food recognition method using transfer learning and fine-tuning of the InceptionV3 model. Bolanos et al. [9] fine-tuned InceptionV3 on the Recipes5K dataset to achieve an F1 score of 0.475.

3.    Methodology

3.1.    Learning Algorithms

We explore the hyperparameter space and the conventional variations of each CNN architecture in as much detail as our computational resources allow.

Transfer Learning: CNNs can be computationally expensive and slow to train, especially on regular hardware. Hence, it is easier to take a CNN pretrained on a very large dataset like ImageNet and then either use it as a feature extractor or fine-tune parts of it for the specific task. Relying on transfer learning, we use existing CNNs trained on ImageNet and add fully connected layers with 1024 and 11 units on top of the network to suit our specific classification requirement.
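
As a rough worked example of how large the added head is, the sketch below counts the trainable parameters of two fully connected layers with 1024 and 11 units. The 2048-dimensional input is an assumption: it is the size of the globally pooled feature vector of InceptionV3, ResNet50 and Xception, and the report does not state how the base network's output is pooled. The function names are illustrative only.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of one fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


def head_params(feature_dim, hidden_units=1024, num_classes=11):
    """Parameters of the added head: feature_dim -> 1024 -> 11."""
    return dense_params(feature_dim, hidden_units) + dense_params(hidden_units, num_classes)


# Assumed 2048-d pooled features (InceptionV3, ResNet50, Xception).
total_2048 = head_params(2048)
# Assumed 512-d pooled features (VGG16 with global pooling).
total_512 = head_params(512)
print(total_2048, total_512)
```

Under these assumptions the head alone has roughly two million parameters, which is small next to the base networks but still large for a dataset of about 16,000 images.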

ResNet50: The last fully connected layer with 1000 classes is excluded. Trials are done by varying the number of layers, adding dropout (0.4, 0.5, 0.6), changing regularization (L2 – {0.01, 0.1}) and learning rate.

VGG: We use the original VGG structure with default weights to train the model. After removing the last layer, two dropout layers (probability = 0.5) were added and some of the earlier layers were frozen to reduce overfitting.

InceptionV3: The last thousand class fully connected layer is removed, and various experiments are conducted by changing the number of layers (Fully connected layer, Pooling layer), adding dropout (0.4, 0.5, 0.6), changing regularization (L2- {0.01, 0.1}) and learning rate.

Xception: Since Xception is an extension of the InceptionV3 architecture which replaces the standard Inception modules with depth-wise separable convolutions, the experiments with similar variations as InceptionV3 are conducted, namely adding dropout and regularization.

Pretrained CNN as a feature extractor: One other method to use Deep Learning when we don’t have enough data is to use the pretrained network as a feature extractor without using any fine-tuning. The output of the last fully connected layer of the network (prior to the softmax layer) was used as a feature vector for the images. As pretrained networks are trained in other domains (very different image categories), they cannot be used as a classifier. We used deep features to train traditional classifiers like SVM, Random Forest, Naïve Bayes as they usually perform better with a smaller amount of data.

Naive Bayes (NB): We use scikit-learn and try the Gaussian Naïve Bayes model to classify the features extracted from different CNNs.

SVMs: For SVM, we utilize the different kernels available in scikit-learn: linear, polynomial, radial basis function (RBF) and sigmoid. The gamma parameter, or kernel coefficient, for the polynomial, RBF and sigmoid kernels is varied from 10^-1 to 10^-5, and the penalty term is varied from 0.001 to 1 by a factor of 10 on each run.
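
The search space described above can be enumerated as in the following sketch. It only builds the list of configurations; the dictionary keys follow the parameter names of scikit-learn's SVC (kernel, C, gamma), while the function name and the choice to skip gamma for the linear kernel are assumptions made for illustration.

```python
from itertools import product

KERNELS = ["linear", "poly", "rbf", "sigmoid"]   # scikit-learn kernel names
GAMMAS = [10.0 ** -e for e in range(1, 6)]       # 1e-1 down to 1e-5
PENALTIES = [10.0 ** e for e in range(-3, 1)]    # 0.001 up to 1, factor of 10


def svm_grid():
    """Return one parameter dictionary per SVM run."""
    configs = []
    for kernel, c in product(KERNELS, PENALTIES):
        if kernel == "linear":
            # gamma has no effect on the linear kernel, so it is not varied
            configs.append({"kernel": kernel, "C": c})
        else:
            for gamma in GAMMAS:
                configs.append({"kernel": kernel, "C": c, "gamma": gamma})
    return configs


grid = svm_grid()
print(len(grid))
```

Each dictionary could then be passed to the classifier as keyword arguments, for example SVC(**config), and scored on the dev set.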

Random Forests (RF): We try different criteria to measure the quality of split including Gini index, entropy. We also varied the number of trees in the forest from 500 to 1000 trees. The size of the maximum feature set used was varied from max to log2.

Gradient Boosting (GB): We use both types of loss, deviance and exponential, to train the gradient boosting classifier. The number of estimators used for training was varied from 250 to 500, and the maximum number of features was also varied from max to log2.

Association Rules: We mine association rules between labels to further increase the accuracy of the proposed system. Both positive and negative association rules are considered. We take positive association rules to be those which predict the presence of a label given the presence or absence of certain other labels. Conversely, negative association rules are those which predict the absence of a label given the presence or absence of other associated labels. We use the PrefixSpan algorithm to mine patterns in the training data.
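
To make the idea of positive and negative rules concrete, here is a minimal sketch. It is not PrefixSpan: it simply counts pairwise co-occurrence of labels and keeps rules whose confidence exceeds a threshold. The thresholds, function names and label names are hypothetical, and only rules triggered by the presence of a label are covered.

```python
from itertools import permutations


def mine_pair_rules(label_sets, all_labels, min_conf=0.9, min_support=2):
    """Return (positive, negative) rules as lists of (antecedent, consequent)."""
    positive, negative = [], []
    for a, b in permutations(all_labels, 2):
        with_a = [s for s in label_sets if a in s]
        if len(with_a) < min_support:
            continue  # too few examples to trust a rule about label a
        conf = sum(1 for s in with_a if b in s) / len(with_a)
        if conf >= min_conf:
            positive.append((a, b))      # a present -> b present
        elif 1.0 - conf >= min_conf:
            negative.append((a, b))      # a present -> b absent
    return positive, negative


def apply_rules(predicted, positive, negative):
    """Post-process one predicted label set; negative rules are applied last."""
    given = set(predicted)   # rules fire on the classifier's prediction only
    result = set(predicted)
    for a, b in positive:
        if a in given:
            result.add(b)
    for a, b in negative:
        if a in given:
            result.discard(b)
    return result


train = [
    {"alcohol", "dinner"}, {"alcohol", "dinner"}, {"alcohol", "dinner", "classy"},
    {"kids", "lunch"}, {"kids", "lunch"},
]
labels = ["alcohol", "dinner", "classy", "kids", "lunch"]
pos, neg = mine_pair_rules(train, labels)
print(apply_rules({"alcohol"}, pos, neg))
```

In this toy data the rule "alcohol implies dinner" is mined as a positive rule, so a prediction containing only alcohol is extended with dinner.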

4.    Dataset

4.1.             Food Dataset

The food dataset is provided by École Polytechnique Fédérale de Lausanne (EPFL) and comprises 16643 images across 11 categories covering most food types consumed by people in daily life. The eleven categories are Bread, Dairy Products, Dessert, Fried Food, Egg, Meat, Pasta/Noodles, Rice, Sea Food, Soup and Vegetable/Fruit.

The size of the images varies from 330 x 220 to 2448 x 3264, and a significant portion of them are bigger than the input size required by the CNN architectures. For pre-processing, we resize the images to a fixed size based on the CNN used and perform per-channel mean subtraction for our experiments.
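
The mean-subtraction step can be sketched as follows. This is a pure-Python illustration in which an image is a nested list of rows of [R, G, B] pixels; it assumes the channel means are computed over the training images passed in (the report does not say whether dataset means or ImageNet means were used), and resizing is left out.

```python
def channel_means(images):
    """Mean of each colour channel over every pixel of every image."""
    sums = [0.0, 0.0, 0.0]
    count = 0
    for image in images:
        for row in image:
            for pixel in row:
                for c in range(3):
                    sums[c] += pixel[c]
                count += 1
    return [s / count for s in sums]


def subtract_means(image, means):
    """Return a copy of the image with the channel means removed."""
    return [[[pixel[c] - means[c] for c in range(3)] for pixel in row]
            for row in image]


# Two tiny 1 x 2 images standing in for resized training images.
images = [
    [[[10, 20, 30], [30, 40, 50]]],
    [[[20, 30, 40], [20, 30, 40]]],
]
means = channel_means(images)
centered = subtract_means(images[0], means)
print(means, centered)
```

In practice the same operation would be done on arrays rather than nested lists; the nested-list form is used here only to keep the example dependency-free.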

The distribution of the classes is shown in Figure 1. Figure 2 shows examples of food images from the dataset.

4.2.             Restaurant Dataset

The restaurant dataset is provided by Yelp on Kaggle.com. In the training set, 234842 arbitrary-sized images submitted by Yelp users are each associated with one of 2000 Yelp businesses. The business labels are spread across 9 categories, each of which describes a different attribute: good for lunch, good for dinner, takes reservations, outdoor seating, restaurant is expensive, has alcohol, has table service, ambience is classy and good for kids.

As the images are user-uploaded photos, they lack consistency. Some images are in portrait mode, some in landscape mode, some are in square shape, etc. To make the images consistent and compatible with the CNNs, pre-processing is done as described above.

Fig 1. – Class Distribution

Fig 2. – Sample Images of Food Dataset

Fig 3. – Sample Images of Restaurant Dataset

| CNN Architecture | Trained last two layers | Trained last two inception blocks | Trained complete network | Added Dropout (0.5) | Added 2 Dropout (0.5) | Add L2 regularization | Test Set |
| InceptionV3 | 40.1% | 61.2% | 71.2% | 72.9% | 74.5% | 77.8% | 0.771 |
| Xception | 38.2% | 59.4% | 70.5% | 72.1% | 75.0% | 78.3% | 0.781 |

Table 1. – InceptionV3 and Xception Results

| CNN Architecture | Trained from scratch | ImageNet Weights | Added Dropout (0.5) | Added 2 Dropout (0.5) | Test Set |
| VGG16 | 47.5% | 55.6% | 70.2% | 71.9% | 0.718 |

Table 2. – VGG16 Results

| CNN Architecture | Trained from scratch | ImageNet Weights | Added Dropout (0.5) | Added 2 Dropout (0.5) + L2 regularization | Test Set |
| ResNet50 | 58.5% | 62.2% | 70.1% | 72.9% | 0.727 |

Table 3. – ResNet50 Results

| CNN Architecture | Naïve Bayes | Random Forest | SVM | Gradient Boosting | AdaBoost |
| InceptionV3 | 0.40 | 0.51 | 0.48 | 0.52 | 0.51 |
| VGG16 | 0.30 | 0.44 | 0.48 | 0.46 | 0.45 |
| ResNet50 | 0.15 | 0.29 | 0.24 | 0.31 | 0.30 |

Table 4. – CNN + Classifiers. All values are F1 scores.

| CNN Architecture | Naïve Bayes | Logistic Regression | Random Forest | Gradient Boosting | SVM | Best Classifier + Association Rules | Test Set |
| VGG16 | | | | | | | |
| ResNet50 | | | | | | | |
| InceptionV3 | 40.1% | 61.2% | 71.2% | 72.9% | 74.5% | 77.8% | 0.771 |
| InceptionV4 | | | | | | | |
| Xception | 38.2% | 59.4% | 70.5% | 72.1% | 75.0% | 78.3% | 0.781 |

Table 5. – Restaurant Results – Max of features

| CNN Architecture | Naïve Bayes | Logistic Regression | Random Forest | Gradient Boosting | SVM | Best Classifier + Association Rules | Test Set |
| VGG16 | | | | | | | |
| ResNet50 | | | | | | | |
| InceptionV3 | 40.1% | 61.2% | 71.2% | 72.9% | 74.5% | 77.8% | 0.771 |
| InceptionV4 | | | | | | | |
| Xception | 38.2% | 59.4% | 70.5% | 72.1% | 75.0% | 78.3% | 0.781 |

Table 6. – Restaurant Results – Average of features

5.    Experimental Results

In our experiments, we implemented InceptionV3, ResNet50 and VGG16 in Keras with TensorFlow as the backend. This section details the different configurations and tests performed. In particular, we describe how each architecture was refined.

5.1.    Performance Metric

For the food dataset, we randomly hold out 20% of the data for the dev and test datasets used to evaluate each configuration. We use the dev set to calibrate the architecture and select the best hyperparameters, and then report the performance on the final test dataset. The overall performance of our models was evaluated based on accuracy and F1 score. Table 4 details the F1 scores obtained on the dev set by training different classifiers on the features extracted from the CNNs. These classifiers were not evaluated on the test set because their F1 scores were low compared to the values achieved with transfer learning. The columns of Tables 1, 2 and 3 show the validation accuracies achieved on the dev set, except for the last column, which presents the F1 score achieved by evaluating the best model on the test set.
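
For reference, the two metrics can be computed as in the sketch below. The report does not say how the per-class F1 scores are averaged over the eleven categories; macro averaging (an unweighted mean over classes) is assumed here, and the function names and toy labels are illustrative.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 = 2PR / (P + R) for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = ["rice", "rice", "soup", "soup", "meat", "meat"]
y_pred = ["rice", "soup", "soup", "soup", "meat", "rice"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

In this toy example four of six predictions are correct, and the per-class F1 scores are 0.5 (rice), 0.8 (soup) and 2/3 (meat).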

For the Yelp dataset, we randomly select 20% of the restaurants as our dev set. We use this dev set to tune the hyperparameters of the classifiers and also to evaluate which CNN gives the best result. Since this was a Kaggle dataset, the final performance on the test set was evaluated by submitting the results online. The performance of the models was evaluated on the basis of F1 score. Tables 5, 6 and 7 show the F1 scores achieved on the dev set, and the last column shows the score of the best classifier on the test set.
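
Because the split is made over restaurants rather than photos, all photos of one business end up on the same side. A minimal sketch of such a split is shown below; the function names, the fixed seed and the example identifiers are assumptions for illustration.

```python
import random


def split_businesses(business_ids, dev_fraction=0.2, seed=0):
    """Randomly assign a fraction of the businesses to the dev set."""
    ids = sorted(set(business_ids))
    rng = random.Random(seed)          # fixed seed for a repeatable split
    rng.shuffle(ids)
    n_dev = int(round(len(ids) * dev_fraction))
    return set(ids[n_dev:]), set(ids[:n_dev])   # (train_ids, dev_ids)


def split_photos(photo_to_business, dev_ids):
    """Send each photo to the side its business was assigned to."""
    train_photos, dev_photos = [], []
    for photo_id, business_id in photo_to_business.items():
        if business_id in dev_ids:
            dev_photos.append(photo_id)
        else:
            train_photos.append(photo_id)
    return train_photos, dev_photos


train_ids, dev_ids = split_businesses(range(2000))
print(len(train_ids), len(dev_ids))
```

With 2000 businesses this gives 1600 for training and 400 for the dev set.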

5.2.    Results

5.2.1.        Food dataset

InceptionV3: Using the pretrained ImageNet weights, we fine-tune the InceptionV3 model and train the last two layers while freezing the previous layers. Because this method achieved limited accuracy, we go on to train the last two Inception blocks. When the accuracy started to plateau, all the layers were unfrozen and trained. Since the entire network was then being trained, overfitting became an issue, which was resolved by using different dropout layers and regularization parameters (as detailed in Table 1).

VGG16: The original VGG structure with pretrained ImageNet weights is used to train the complete model. However, after 80 epochs, overfitting starts to occur (the difference between the training set and the dev set exceeds 0.1). To solve this problem, two dropout layers (0.5) were added and some layers (6, 7, 8, 9, 10) were frozen (as detailed in Table 2).
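
The overfitting criterion mentioned above (a gap of more than 0.1 between training and dev performance) can be checked per epoch with a few lines of code. The sketch assumes the gap is measured on accuracy, which the report does not state explicitly, and the history values are invented for illustration.

```python
def first_overfit_epoch(train_acc, dev_acc, max_gap=0.1):
    """Return the first 1-based epoch where train - dev exceeds max_gap."""
    for epoch, (train, dev) in enumerate(zip(train_acc, dev_acc), start=1):
        if train - dev > max_gap:
            return epoch
    return None  # the gap never exceeded the threshold


# Hypothetical per-epoch accuracies, not values from the experiments.
train_history = [0.50, 0.65, 0.80, 0.90]
dev_history = [0.48, 0.60, 0.68, 0.70]
print(first_overfit_epoch(train_history, dev_history))
```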

ResNet50: Like InceptionV3, we fine-tuned ResNet50 by using pretrained ImageNet weights and choosing the layers for which the parameters were to be updated. Since training only the last layer resulted in low accuracy, we kept the setup of fine-tuning the previous layers in the remaining experiments. Overfitting was dealt with in the same way, using dropout and regularization (as detailed in Table 3).

5.2.2.        Restaurant dataset

The task of predicting attributes for restaurants from multiple images is a multi-instance multi-label learning problem. In this dataset each image corresponds to a restaurant, and the labels are attached to restaurants rather than to images. To train a CNN as discussed in the previous section, the labels of a restaurant would have to be assigned to all of its images. However, in the training set not every photo of a restaurant contributes to every label of that restaurant. For example, a restaurant with the label alcohol may have only one image that shows a glass of beer, while the rest of its images show no sign of alcohol. If we simply propagate the label alcohol to all of this restaurant's photos, we force the network to learn features that do not exist in most of them. We therefore use the pretrained CNNs only as feature extractors, without further training; for each restaurant we group the features of all the images that belong to it, and then train different classifiers directly on the restaurant labels.
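
The grouping step can be sketched as follows, using the two aggregation methods named in the captions of Tables 5 and 6 (element-wise maximum and element-wise average of the photo features). The function name, the dictionary-based data layout and the two-dimensional toy vectors are assumptions for illustration; real CNN feature vectors would have hundreds or thousands of dimensions.

```python
from collections import defaultdict


def aggregate_features(photo_features, photo_to_business, method="mean"):
    """Combine the feature vectors of each business's photos into one vector."""
    grouped = defaultdict(list)
    for photo_id, vector in photo_features.items():
        grouped[photo_to_business[photo_id]].append(vector)

    aggregated = {}
    for business_id, vectors in grouped.items():
        columns = list(zip(*vectors))  # one tuple per feature dimension
        if method == "max":
            aggregated[business_id] = [max(col) for col in columns]
        elif method == "mean":
            aggregated[business_id] = [sum(col) / len(col) for col in columns]
        else:
            raise ValueError("method must be 'max' or 'mean'")
    return aggregated


features = {"p1": [1.0, 4.0], "p2": [3.0, 2.0], "p3": [5.0, 5.0]}
owner = {"p1": "b1", "p2": "b1", "p3": "b2"}
print(aggregate_features(features, owner, "mean"))
print(aggregate_features(features, owner, "max"))
```

Each business then has a single fixed-length vector, which can be paired with its label set and passed to any of the classifiers listed in Section 3.1.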

6.    Conclusion & Future Work

In this report, we have applied various conventional architectures such as InceptionV3, ResNet50 and VGG16 to the task of food image classification and food category recognition. We worked on a publicly available dataset and fine-tuned the CNN architectures starting from pretrained networks. The experimental results show an overall F1 score of 0.771 for InceptionV3, 0.781 for Xception, 0.718 for VGG16 and 0.727 for ResNet50. The main challenges to performance are the complexity of mixtures of food items and the visual similarity across food types.

We also present an algorithm for MIML classification of attributes in user-submitted restaurant images. The proposed algorithm works by extracting features using traditional CNN architectures and then aggregating them for each restaurant using two different methods. Various classifiers, along with association rules, are then used to predict labels for each restaurant. We have experimentally validated our approach on the Yelp restaurant photo classification dataset, on which we achieved an F1 score of 0.8224.

7.    References

[1] Lukas Bossard, Matthieu Guillaumin and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.

[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012, pp. 1097– 1105.

[3] Yoshiyuki Kawano and Keiji Yanai. Food image recognition with deep convolutional features. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, pages 589–593. ACM, 2014.

[4] Y. Matsuda, H. Hoashi, and K. Yanai. Recognition of multiple-food images by detecting candidate regions. In Proc. of IEEE International Conference on Multimedia and Expo (ICME), 2012.

[7] Christian Szegedy “Inception-v4, Inception-ResNet and the impact of residual connections on learning.” In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). 2017.

[8] Qian Yu, Dongyuan Mao and Jingfan Wang, “Deep Learning Based Food Recognition”, Stanford University, 2016

[9] M. Bolanos and P. Radeva, Simultaneous food localization and recognition, in: Pattern Recognition (ICPR), 2016 23rd International Conference on, IEEE, 2016, pp. 3140–3145.
