Informatics Research Review: A Review of Neural Network Based Image Captioning


8th Feb 2020 Computer Science




This paper provides a survey of image captioning techniques, with a specific focus on encoder-decoder frameworks. We show promising directions of research, such as the use of semantic information and the attention mechanism. Ideas on future directions are given as well.

1 Introduction

Processing and interpreting visual information is a fundamental human ability. Given that a long-standing goal of Artificial Intelligence (AI) is to build agents capable of interacting with real-world surroundings, describing images is a relevant task for such agents to master. Although describing images comes naturally to humans, it is a complex undertaking for machines. To provide an adequate description of an image, a model must not only identify all relevant features in the image, but also produce a syntactically and semantically accurate natural language description. Doing so requires combining two fields of AI: Computer Vision (CV) and Natural Language Processing (NLP).

CV is the overarching field for tasks relating to processing, analyzing, and understanding visual data. In caption generation, CV plays the role of identifying features in pixel data [1]. This is a particularly hard task given that, in principle, any aspect of an image could be described. Adding further complexity, a description must sometimes refer to an object that is not in the image (e.g. "A woman is waiting to board a plane"). In short, a comprehensive understanding of the image is required for a good description of it.

Image description is studied as part of a sub-field of NLP called Natural Language Generation (NLG). NLG is the task of generating natural language from non-linguistic input [2]. In image captioning, it is responsible for turning an image representation into descriptive natural language sentences. This process goes through multiple stages. Initially, the content to be conveyed is determined and the structure of the output is chosen (document planning) [3]. Subsequently, decisions are made about which words and which syntactic structures to use to convey the information (microplanning) [3]. Lastly, the representations developed by the microplanner are converted into actual text, in a process called surface realization [3].

Automatic image captioning, like other fields in AI, is evolving at a rapid pace. Before the rise of neural network based approaches, template and retrieval based methods were the norm. Retrieval based methods assign a caption to an image by retrieving the most relevant one from a database of sentences, called a caption pool [4]. The benefit of this type of method is that it produces syntactically correct sentences. On the other hand, it limits expressiveness and cannot guarantee semantically correct or image-specific descriptions [4]. In template-based methods, descriptions of images are generated through predefined templates [5]. When predefined visual concepts are detected in an image, the corresponding grammar and sentence structures are applied to fill in the template [6]. Template based methods are useful when consistent descriptions are the goal. However, their output is generally too simple [7] and they lack the flexibility to adapt to novel scenes [8].

As in other fields of AI, the emergence of deep neural networks has had a significant impact on the direction of new state-of-the-art architectures. Retrieval and template-based methods have been largely superseded by various forms of neural architectures. Given this trend, this review focuses on neural network approaches to automatic image captioning. The paper is structured as follows: Section 2 gives background on the data sets and evaluation metrics used in the reviewed papers. Section 3 reviews encoder-decoder based methods, divided into subcategories. Section 4 details possible future directions, and Section 5 provides concluding remarks.

2 Background Knowledge for Image Captioning Research

Image captioning models are evaluated on a variety of data sets using a variety of evaluation metrics. This section gives an overview of the data sets and evaluation metrics used in the research papers reviewed below.

2.1 Data Sets

2.1.1 Flickr

Flickr [9] is a photo-sharing community whose images and corresponding metadata have been gathered into a data set. Each photo comes with a title, location, and description, among various other pieces of information.

2.1.2 MS COCO

MS COCO [10] contains images of "complex everyday scenes" [10]. An effort was made to ensure that these images contain common objects that would be recognizable by a 4-year-old. The data set contains 328k photos with 2.5 million labeled object instances.

2.2 Evaluation

2.2.1 Bilingual Evaluation Understudy

Bilingual Evaluation Understudy (BLEU) [11] is a score for comparing a generated string to one or more target strings. It was originally developed for evaluating the output of machine translation (MT) models, but has since been used to evaluate various NLP tasks, image captioning among them. The score lies in the range [0, 1], where 1.0 represents a perfect match and 0.0 no match at all. Higher scores can generally be interpreted as outputs that are more similar to professional human outputs. BLEU is based on n-grams, i.e. sequences of n consecutive words: it measures what fraction of the n-grams in the generated string also occur in the target strings. 1-grams are single words, 2-grams are word pairs, and so on. Accordingly, when researchers refer to a version of BLEU such as BLEU-1 or BLEU-2 (in practice most often BLEU-1 to BLEU-4), the number refers to the n value of the n-grams used.
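The n-gram matching behind BLEU can be illustrated with a minimal Python sketch. It assumes pre-tokenized captions and implements clipped (modified) n-gram precision with a brevity penalty. The function names are chosen here for illustration and do not come from any particular toolkit, and published BLEU implementations add details such as smoothing that are left out.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the maximum number of times it occurs in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU-max_n: geometric mean of precisions times brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Calling `bleu(caption, references, max_n=1)` corresponds to BLEU-1 in this sketch, and `max_n=4` to BLEU-4.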

2.2.2 Metric for Evaluation of Translation with Explicit Ordering

Metric for Evaluation of Translation with Explicit Ordering (METEOR) [12] is another common metric for evaluating various NLP tasks, including image captioning. It contains several features which are not found in BLEU, such as stem and synonym matching. It also differs from BLEU in that it was designed to correlate well with human judgments at the sentence or segment level. In contrast, BLEU only compares the exact words of the model output with the words of the human-written captions.

2.2.3 Consensus-based Image Description Evaluation

Consensus-based Image Description Evaluation (CIDEr) [13], unlike the previous metrics, was designed specifically for image caption evaluation. It was motivated by research showing that BLEU and ROUGE correlate only weakly with human judgments [14]. CIDEr measures "the similarity of a generated sentence against a set of ground truth sentences written by humans" [14]. The similarity is obtained by computing the cosine similarity between TF-IDF weighted n-gram vectors of the model output and of the target sentences.
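To make the cosine-similarity step concrete, the following Python sketch scores a candidate caption against reference captions for a single n-gram order. It is a simplified illustration, not the reference implementation: it assumes TF-IDF weighting of n-grams with document frequencies counted over the reference sets of all images, the helper names are hypothetical, and it omits the averaging over several n-gram orders and the other refinements used in practice.

```python
from collections import Counter
import math


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf(counts, doc_freq, num_images):
    """Weight each n-gram by term frequency times inverse document frequency."""
    total = sum(counts.values())
    return {gram: (count / total) * math.log(num_images / max(1, doc_freq.get(gram, 0)))
            for gram, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cider_n(candidate, references, all_reference_sets, n=1):
    """Average TF-IDF cosine similarity of a candidate to its references (one n)."""
    # Document frequency: in how many images' reference sets an n-gram appears.
    doc_freq = Counter()
    for ref_set in all_reference_sets:
        seen = set()
        for ref in ref_set:
            seen.update(ngram_counts(ref, n))
        doc_freq.update(seen)
    num_images = len(all_reference_sets)
    cand_vec = tfidf(ngram_counts(candidate, n), doc_freq, num_images)
    scores = [cosine(cand_vec, tfidf(ngram_counts(ref, n), doc_freq, num_images))
              for ref in references]
    return sum(scores) / len(scores)
```

The inverse document frequency term down-weights n-grams such as "a" that occur in the captions of most images, so the score is driven by the words that are specific to the image being described.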

2.2.4 Recall-Oriented Understudy for Gisting Evaluation

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [15] is a package for automatically evaluating summaries. One of its metrics is also commonly used for evaluating image captioning. Like the previously explained metrics, it compares the model output to captions written by humans. In the context of image captioning, ROUGE-L is commonly used. It evaluates the two strings based on matches of in-sequence words, capturing their similarity by measuring the longest common subsequence of the two. The longer this subsequence, the more similar the two captions.
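The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines of Python. The sketch assumes tokenized captions and a single reference, and combines LCS-based precision and recall into an F-score. The default `beta=1.2` is an assumption made here for illustration; the function names are likewise illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """F-score built from LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

For the captions "a man is riding a horse" and "a man rides a brown horse", the longest common subsequence is "a man a horse", so the in-sequence match covers four of the six words in each caption.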

3 Reviewing Encoder-Decoder based Image Captioning Techniques

3.1 Introduction to Encoder-Decoder Networks

The original encoder-decoder framework [16, 17] was introduced as a new way of performing sequence-to-sequence learning with Recurrent Neural Network (RNN) based models [18]. In this framework, an input sequence (normally natural language) is encoded into a fixed-length vector by one RNN, and subsequently decoded into a target sequence by another RNN [16]. RNNs were chosen instead of regular deep neural networks (DNNs) because DNNs can only map fixed-length input vectors to fixed-length output vectors [16]. This is a significant hindrance for tasks where sequence lengths are not known in advance, such as Machine Translation or Speech Recognition [16].

Given that image captioning requires the processing of visual data, the encoder of the framework has to be adjusted. Instead of an RNN encoder, a Convolutional Neural Network (CNN) [19] is preferred. CNNs assume that their inputs are images, which makes it possible to build certain properties into the architecture so that images are handled more efficiently [19]. Such properties include neurons arranged in 3 dimensions (height, width, depth) and CNN-specific layers such as pooling [20] and convolutional layers [19]. In the subsequent sections, various adaptations of this encoder-decoder framework are investigated.

3.2 Vanilla Encoder-Decoder Networks

Vinyals et al. [21] were among the first to apply the encoder-decoder framework to image captioning. At the time, researchers in Machine Translation were achieving state-of-the-art results by maximizing the probability of an appropriate translation using sequence models based on the original encoder-decoder framework [16, 17]. Vinyals et al. hypothesized that this structure could work for image captioning as well. However, for the encoder-decoder framework to be sensible for image captioning, they had to make some adjustments: they replaced the RNN encoder with a CNN, and used a Long Short-Term Memory (LSTM) network [18] for decoding. They argued that using a CNN as encoder was "natural", as CNNs produce appropriate representations of input images. The process of encoding and decoding was therefore modified as follows: the CNN is pre-trained on an image data set, and its last hidden layer is used as input to the decoder, which then generates a natural language output. Using this architecture, they formulated the task as "maximizing the likelihood of the target description sentence given the training image" [21]. With this approach, they achieved state-of-the-art results on SBU and, measured by BLEU, on the Pascal data set, which supports their hypothesis.
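Written out, the likelihood objective quoted above takes the following form. The notation is introduced here for illustration: I is an image, S = (S_0, ..., S_N) its target caption as a word sequence, and θ the model parameters. The second expression is the standard chain-rule factorization of a sequence probability, which is what allows an LSTM decoder to model the caption one word at a time.

```latex
\[
\theta^{\star} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}; \theta)
\]
```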

Similarly, Venugopalan et al. [22] set out to design an architecture that could generate captions for videos. According to Venugopalan et al., it is important for video descriptions to take temporal structure into account, i.e. to permit varying lengths of input and output sequences [22]. Following the progress in Machine Translation, they also decided on an encoder-decoder framework, consisting of a CNN as encoder and an LSTM as decoder. What differs from [21] is that the encoding and decoding of the video frames were learned jointly from a parallel corpus. This architecture achieved state-of-the-art results on video data sets.

While these two papers significantly improved upon previous results, some generated descriptions noticeably lose track of the original image content, i.e. the captions only remotely resemble the target images. Jia et al. [23] hypothesized that this is due to the decoder attempting to balance describing the image features against building a sentence that fits the language model. If one side starts to dominate, the caption can become too general. To overcome this problem, Jia et al. [23] propose an extension of the LSTM model, called gLSTM. In their network, semantic information is extracted from the training images and subsequently used to "guide" the decoder [23]. They maintain the focus of the decoder by adding a positive bias to words that are "semantically linked to the image content" [23]. This bias is introduced by feeding the semantic information as an additional input to each gate of the memory cell. Using this approach, they achieved state-of-the-art results on Flickr8k and Flickr30k as measured by BLEU.
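As a rough sketch of what feeding semantic information into each gate means, the gate equations of a standard LSTM can be extended with one extra term per gate. The notation below is an assumption made for illustration and is not taken from [23]: x_t is the word input at step t, m_{t-1} the previous LSTM output, g the semantic guidance vector, σ the sigmoid function, and the W matrices are learned weights.

```latex
\begin{align}
i_t &= \sigma\left(W_{ix} x_t + W_{im} m_{t-1} + W_{ig}\, g\right) \\
f_t &= \sigma\left(W_{fx} x_t + W_{fm} m_{t-1} + W_{fg}\, g\right) \\
o_t &= \sigma\left(W_{ox} x_t + W_{om} m_{t-1} + W_{og}\, g\right)
\end{align}
```

Because g is present at every time step, the image-related semantics keep influencing the input, forget, and output gates throughout generation instead of only at the first step.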

Further work on adding semantic information was done by Wu et al. [24]. They hypothesized that the use of "high-level" semantic information could have a positive impact on image captioning. Their argument was that encoder-decoder networks do not explicitly encode high-level semantic concepts, so adding such concepts could lead to improved results. To encode this information, they added a preliminary step to the framework, in which semantic attributes are mined from the training sentences. Then, for each attribute, a CNN classifier is trained. With these attribute classifiers, the probability of any mined attribute occurring in a given image can be computed. This probability vector is then fed into an LSTM decoder, which generates a caption based on this image representation. With this attribute module, they improved upon state-of-the-art results on the Toronto COCO-QA data set using BLEU as the evaluation metric.

3.3 Encoder-Decoder Networks with Attention Mechanism

An inherent limitation of encoder-decoder networks is that the decoder can only use the information from the last hidden layer of the encoder network. As a result, information learned at the very beginning of an input can get lost. To alleviate this issue, researchers built an "attention mechanism" [25], which is loosely inspired by human visual attention [26]. With an attention mechanism, the encoder no longer passes on only the last hidden layer to the decoder. Instead, at each step of caption generation, the decoder can pay attention to different parts of the input. A visualization of how an attention mechanism focuses on specific areas of an image is given in Figure 1. As the figure shows, the attention mechanism learns to pick out specific features of the image, which helps the decoder attend to specific aspects of the image at any given time step.

Figure 1: Focus of the attention mechanism at each step of generating the description "A person is standing on the beach with a surfboard." [27]
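One decoding step of a soft attention mechanism can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each image region is represented by a feature vector, relevance is scored with a plain dot product against the decoder state (the reviewed papers use learned scoring functions), and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, decoder_state):
    """Return attention weights over regions and the resulting context vector."""
    # Assumed scoring function: dot product of each region with the decoder state.
    scores = [sum(f * h for f, h in zip(region, decoder_state))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # The context vector is the attention-weighted average of the region features.
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context
```

At every time step the decoder state changes, so the weights are recomputed and a different mixture of image regions is passed to the decoder.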

An early adoption of the attention mechanism in an image captioning architecture came from You et al. [28]. The authors categorized contemporary image captioning architectures as either top-down or bottom-up. A top-down model starts with the central idea of an image and converts it into words, while a bottom-up model first comes up with words describing various features of an image, which are then combined into longer strings [28]. Top-down approaches are limited by their inability to describe fine details of an image. Bottom-up approaches do not suffer from this issue, but they lack an end-to-end sentence generation procedure. Given that each approach has strengths and weaknesses, the authors hypothesized that combining the two could produce better results. To accomplish this, a CNN is used to extract the main ideas of an image, while visual attribute detectors are used to extract image attributes. All of these features are subsequently fed into an RNN decoder for caption generation. Throughout this process, attention models are used to attend to particular image attributes based on the current state of the model.

While the attention mechanism helped to improve upon state-of-the-art results, it has a flaw: the attention functionality lacks "global modeling abilities" [29] because it runs sequentially. To mitigate this issue, Yang et al. [29] proposed a "review" network. The idea behind the review network is to review all of the information that has been encoded. Through this reviewing phase, a vector called the thought vector captures the global features of the encoded image representation. This thought vector is then used by the decoder with the help of the attention mechanism. The architecture therefore consists of three components instead of the standard two: an encoder that produces a vector representation of an image, a review network that builds a thought vector by extracting global features of the encoded representation, and a decoder that outputs an image caption. They evaluated their model on the MS COCO benchmark data set [10] and used BLEU-4, METEOR, and CIDEr as evaluation metrics. With this network they were able to beat the then state-of-the-art results for METEOR and CIDEr.

While RNNs have become the go-to network for modelling sequential data, they also have significant drawbacks. Being dependent on the previous time step's computation rules out parallel computation across time steps [30]. Furthermore, the significant storage required by backpropagation through time poses a challenge [30]. For these reasons, researchers started exploring the use of CNNs for sequence modelling [30, 31]. Inspired by successes in applying CNNs to sequence modelling tasks, Aneja et al. [32] implemented a convolutional decoder for image captioning that is competitive with LSTM-based decoders. The CNN decoder has a similar structure to an RNN decoder in that it is made up of three components. The first and last components are the input and output word embeddings, respectively. The main difference lies in the middle section, which is made up of masked convolutional layers instead of LSTM units. Masked convolutional layers [31] were used because they operate solely on past data, which prevents the use of information from future output words [30]. With a CNN as caption generator, the captioning process can be described as follows: the words of the target caption are passed through an input embedding layer. The resulting embeddings are then concatenated with the image embedding, which is taken from the last hidden layer of the CNN encoder. Next, the concatenated data is processed by the CNN decoder, followed by the output embedding layer, which produces the output word probabilities. Finally, they also added an attention mechanism to the CNN decoder. While the architecture without attention performed on par with LSTM-based decoders, the use of attention increased the performance to state-of-the-art levels. They analyzed their architecture on the MS COCO data set using ROUGE, CIDEr, SPICE, and various forms of BLEU. Key insights include that their CNN decoder produces more diverse captions than an LSTM decoder, while also achieving higher accuracy scores. In addition, it does not suffer from the vanishing gradient problem, which is common in RNNs.
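The effect of masking can be shown with a one-dimensional toy example in Python. The sketch assumes a sequence of scalar values rather than word embeddings, and realizes the mask by left-padding the input so that the output at position t depends only on positions up to t. The function name is chosen for illustration and the code is not the layer used in [32].

```python
def causal_conv1d(sequence, kernel):
    """1-D convolution whose output at step t only sees inputs at steps <= t."""
    k = len(kernel)
    # Left-pad with zeros so no output position can look at future inputs.
    padded = [0.0] * (k - 1) + list(sequence)
    # kernel[-1] multiplies the current input, kernel[0] the oldest one in view.
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(sequence))]


# Changing a later input leaves all earlier outputs untouched.
original = causal_conv1d([1, 2, 3, 4], [0.5, 0.25, 0.25])
modified = causal_conv1d([1, 2, 99, -7], [0.5, 0.25, 0.25])
assert original[:2] == modified[:2]
```

Because every output position can be computed independently of the others, all positions of a training caption can be processed in parallel, which is the advantage over step-by-step RNN decoding described above.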

A further CNN-only encoder-decoder framework was developed by Wang et al. [33]. While their architecture is almost identical to that of [32], they experimented with a modified form of attention, called "Hierarchical Attention". In [32], the attention mechanism takes visual features from the CNN encoder and concepts from the target output, where the concepts come from the last layer of the CNN decoder. From these two sets of information a score is computed, which indicates how well concepts match features of the image. Hierarchical attention differs from this in that concept vectors are computed in each layer of the CNN decoder and added into the next layer. Their proposal was analyzed on the Flickr30k and MS COCO data sets and evaluated using METEOR, ROUGE-L, CIDEr, and various forms of BLEU.

4 Future Directions

As image captioning brings together research from both CV and NLP, some progress will come indirectly through advances in either field. Given the success of the attention mechanism, further research is likely to produce new attention models or new methods for analyzing visual concepts. Another direction that should be researched is the description of images based on regions of an image (e.g., "A man is holding hands with a woman, while a kid is playing football in the background"). Such captions are not yet produced by the approaches reviewed here. Lastly, there is only a limited number of data sets containing images with descriptions. It is therefore important for unsupervised image captioning to become more prevalent.

5 Conclusion

In this paper, a survey of current research in image captioning is given. While there are plenty of avenues to elaborate upon, the focus of this survey has been on various forms of encoder-decoder frameworks, because they have been responsible for recent state-of-the-art results. We detail some of the first models to use the encoder-decoder framework and go over some cutting-edge frameworks utilizing semantic information. We then elaborate upon promising architectures using an attention mechanism. Lastly, ideas for future directions of research in image captioning are given.


[1] Sven Bambach. A survey on recent advances of computer vision algorithms for egocentric video. CoRR, abs/1501.02825, 2015.

[2] Dimitra Gkatzia. Content Selection in Data-to-Text Systems: A Survey. Technical report.

[3] Ehud Reiter and Robert Dale. Building applied natural language generation systems. Natural Language Engineering, 3, 03 2002.

[4] Zakir Hossain. A Comprehensive Survey of Deep Learning for Image Captioning. Technical report, 2018.

[5] Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé. Midge: Generating Image Descriptions From Computer Vision Detections. Technical report, 2012.

[6] Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures (Extended Abstract). Technical report, 2017.

[7] Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Baby Talk: Understanding and Generating Image Descriptions. Technical report.

[8] Shuang Bai and Shan An. A survey on automatic image caption generation. Neurocomputing, 311:291–304, oct 2018.

[9] Julian McAuley and Jure Leskovec. Image Labeling on a Network: Using Social-Network Metadata for Image Classification. Technical report.

[10] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. Technical report.

[11] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. Technical report.

[12] Michael Denkowski and Alon Lavie. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. Technical report.

[13] Jose Camacho-Collados and Mohammad Taher Pilehvar. From Word to Sense Embeddings: A Survey on Vector Representations of Meaning. Technical report.

[14] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the Role of BLEU in Machine Translation Research. Technical report.

[15] Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. Technical report.

[16] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to Sequence Learning with Neural Networks. Technical report.

[17] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Technical report.

[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[19] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based Learning Applied to Document Recognition. Technical report.

[20] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition. Technical report.

[21] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. Technical report.

[22] Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to Sequence-Video to Text. Technical report.

[23] Xu Jia, Efstratios Gavves, Basura Fernando, and Tinne Tuytelaars. Guiding the long-short term memory model for image caption generation. Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2407–2415, 2015.

[24] Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. What value do explicit high level concepts have in vision to language problems? 2015.

[25] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective Approaches to Attention-based Neural Machine Translation. Technical report, 2015.

[26] Ronald A Rensink. The Dynamic Representation of Scenes. Visual Cognition, 7:17–42,


[27] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. feb 2015.

[28] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image Captioning with Semantic Attention. Technical report.

[29] Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, and William W Cohen. Review Networks for Caption Generation. may 2016.

[30] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional Sequence to Sequence Learning. Technical report.

[31] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional Image Generation with PixelCNN Decoders. Technical report.

[32] Jyoti Aneja, Aditya Deshpande, and Alexander G Schwing. Convolutional Image Captioning. Technical report.

[33] Qingzhong Wang and Antoni B Chan. CNN+CNN: Convolutional Decoders for Image Captioning. may 2018.
