
International Journal of Research In Science & Engineering
Volume: 1 Issue: 2
e-ISSN: 2394-8299
p-ISSN: 2394-8280
A REVIEW ON AUTOMATIC CAPTION GENERATION FOR NEWS IMAGES
Suchita G. Dahule1, Prof. Chaitali S. Suratkar2
1 BE Final Year Student, I.T., JDIET, Yavatmal, Maharashtra, India, dahulesuchita@gmail.com
2 Assistant Professor, I.T., JDIET, Yavatmal, Maharashtra, India, chaitali.suratkar61@gmail.com
ABSTRACT
The task of automatically generating captions for news images is an important one. Captions are a
necessary part of an image: they allow search engines to respond easily to user queries. Producing a correct
caption for an image is a difficult task, because the caption must convey the essential information of both the
article and the image. Accurately generated captions help users find the relevant images for news articles. Most
images on the web are accompanied only by user-annotated tags, captions, and the text surrounding the images.
This paper is concerned with the task of automatic caption generation for news images in association with the
related news article. In this method, an image and a news article are given as input to the system.
Automatic caption generation is important for many image-related applications; automatic description
generation for video frames, for example, would help security authorities manage and utilize large volumes of
monitoring data more efficiently. The system generates the most important keywords associated with the image
and the article. To generate captions, the image and article are first given as input to the image annotation and
summarization modules, which extract the important keywords related to the image. After applying grammatical
rules to these keywords, an appropriate caption is generated. In this way the textual modality is combined with the
visual one. In existing methods, captions are not generated efficiently and there is no relation between the image
and the associated article. The method introduced here generates captions for news images without costly manual
involvement, reducing human effort and saving time.
Keywords: caption generation, pictorial information, summarization, image annotation.
1. INTRODUCTION
Here, the word caption means a short description of a news item and its related image; one can think of
it as a headline for the news image. The caption is an important element of a news article. Many people read only
the headlines of news items, so a good caption is important for making news attractive: it represents the complete
article and image in short form, letting the reader grasp the whole matter at a glance without losing time. Caption
generation is not an easy task; even an experienced journalist can struggle to write a caption that makes the news
effective. Here we introduce an efficient method for automatically generating captions for news images. The
method supports text summarization with the help of the visual modality; summarization based on text alone
bears no relation to the associated image. Producing correct captions for images is difficult, but appropriate
captions help users search for images with long queries. Most images are accompanied only by user-annotated
tags, captions, and the text surrounding the images.
This paper is concerned with the task of automatic caption generation for news images in association
with the related news article. In this method, an image and a news article are given as input to the system, and the
system generates the most important keywords associated with both. To find the image-related keywords, the
features of the input image are first extracted using the SIFT (Scale-Invariant Feature Transform) method. Using
these features, the image is compared with the images stored in the
database. After the best-matched image is found, the keywords associated with that image are extracted, and
applying grammatical rules to these keywords yields an appropriate caption. A good caption should be
informative: it must clearly identify the objects in the picture, follow a good summarization method for the
textual content, provide a relation between the textual and visual content, and lead the reader into the article. A
good caption avoids worthless words; it should be short yet informative. Following correct grammatical rules
produces a good caption.
In existing methods, captions are not generated efficiently and there is no mapping between the
image and its associated text. We introduce an effective method for news image caption generation without
costly manual involvement, and without using any dictionaries to establish the text-to-image relation. A good
caption is produced through extractive and abstractive caption generation processes. The number of images
available on the Internet today is huge, so caption generation is a very important task for better image retrieval,
and making the process automatic removes the costly manual involvement. Image captioning faces several
problems: content-based image retrieval uses the visual similarity of images to find a matched image, but it
suffers from the loss of semantic information. Manually annotated keywords solve this problem, but producing
them is time-consuming and costly. In our system this problem is avoided by using the image-to-text
correspondence: a good caption for the image is generated in association with the accompanying article.
2. RELATED WORK
Image understanding is a popular topic within computer vision, but relatively little work has
focused on caption generation. All previous methods attempt to learn the correlation between image features and
words from examples of images manually annotated with keywords. They are typically developed and evaluated
on the Corel database, a collection of stock photographs divided into themes (e.g., tigers, sunsets), each of which
is associated with keywords (e.g., sun, sea) that are in turn considered appropriate descriptors for all images
belonging to the same theme.
Image caption generation is an important task because it enables users to easily retrieve appropriate
images with long queries. Much work has addressed image understanding by various methods, but less has
focused on the process of caption generation for images. The process consists of two stages [1]. First, the model
creates a dataset containing a large number of images and their descriptions. In the text annotation process, the
important keywords and phrases are identified using part-of-speech tagging, after which stop-word removal and
stemming are applied.
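As a concrete illustration of this text annotation step, the following minimal sketch assumes the NLTK toolkit (the paper does not prescribe a library); it keeps nouns and verbs, removes stop words, and stems the survivors:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# one-time setup: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger")

def extract_keywords(article_text):
    """Keep content words (nouns/verbs), drop stop words, stem the rest."""
    tokens = nltk.word_tokenize(article_text.lower())
    tagged = nltk.pos_tag(tokens)                  # part-of-speech tagging
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(word)                     # stemming
            for word, tag in tagged
            if tag.startswith(("NN", "VB"))        # nouns and verbs only
            and word.isalpha()
            and word not in stops]                 # stop-word removal

print(extract_keywords("US arms flights carrying bombs to Israel refuelled at UK airports."))
# yields stems such as 'flight', 'bomb', 'israel', 'airport'
```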
In the image annotation process, the SIFT algorithm [2] is applied to find the local features of the image.
The input image is compared with the images stored in the database using the Euclidean
distance between their feature vectors, and the keywords associated with the best-matching image are then
extracted. All keywords from the image and the document are classified as pronouns, nouns, verbs, and so on,
and a caption is generated automatically by applying the extractive and abstractive caption generation
processes [1].
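A sketch of this matching step is given below, assuming OpenCV for SIFT extraction; the ratio test on the L2 (Euclidean) distances is a common practical refinement, not something the paper specifies, and the function names are illustrative:

```python
import cv2  # OpenCV >= 4.4 ships SIFT in the main module

sift = cv2.SIFT_create()

def descriptors(path):
    """Extract local SIFT descriptors for a grayscale image."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(image, None)
    return desc

def match_score(query_desc, db_desc, ratio=0.75):
    """Count good matches under L2 (Euclidean) distance with a ratio test."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(query_desc, db_desc, k=2)
    return sum(1 for m, n in pairs if m.distance < ratio * n.distance)

def best_match(query_path, database_paths):
    """Return the database image most similar to the query image."""
    query = descriptors(query_path)
    return max(database_paths, key=lambda p: match_score(query, descriptors(p)))
```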
In addition to the technique described above, many other techniques have been used for image
description generation. For example, Feng and Lapata [3] describe how image annotation systems are used in
image-base management to represent images of objects in natural language or another human-readable form.
Some systems generate natural language text from images by consulting a manually built
database of image attributes such as colour and texture. Work on graphics generally avoids complex image
processing, assuming that the data used to draw the graphics are already at hand, and hence places more emphasis
on how to verbally convey the information inherent in the graphics, especially information that is easy to
visualize but usually omitted [2]. As discussed above, solving this cross-disciplinary task requires dealing with
two main problems, automatic image annotation and description generation, which, although closely related, have
previously been studied in isolation. In reviewing previous efforts in automatic image annotation, we consider the
training paradigm employed and the capability of each approach to deal with real-world data.
2.1.1. Automatic Image Description Generation
This application follows a two-stage architecture. The image is first analyzed with image
processing techniques into an abstract representation, which is then rendered into a natural language description
by a text generation engine. Common themes across different models are domain specificity, the use of
hand-labelled data indexed by image signature (e.g., colour and texture), and reliance on background ontological
information.
2.1.2. Automatic Description of Human Activities in Video
The idea is to extract features of human motion from video key frames and interleave them with a
concept hierarchy of actions to create a case frame, from which a natural language sentence is generated.
3. CAPTION GENERATION TECHNIQUE
Instead of relying on manual annotation or background ontological information, this technique exploits
a multimodal database of news articles, images, and their captions. An image annotation model is used first to
describe the picture with keywords, which are subsequently realized into a human-readable sentence. The
experiments used news articles accompanied by captioned images.
3.1 Implementation
3.1.1 Input Image and Article
The image and its accompanying news article are the inputs to the system.
3.1.2 Content Extraction
In this phase, only the content related to the image and article is extracted; from this, the relevant content is
retrieved in the retrieve-content phase.
3.1.3 Summarization
This phase focuses only on the textual information, ignoring pictures, graphical figures, or tables
embedded in the documents, to produce more comprehensive summaries. An abstractive summarization module
is used because it produces more human-like summaries. The source text naturally supplies grammatical
sentences or phrases that can be used to produce grammatical summaries. Abstractive summarization first
identifies the key content of the documents in the form of constituents, e.g., words or phrases, which are then
organized into a grammatical sentence.
3.1.4 Image Annotation
In this phase, the image is annotated from the associated news document. A probabilistic image
annotation model is used that learns to label images automatically under the assumption that images and their
surrounding text are generated by a shared set of latent variables, or topics. Using Latent Dirichlet Allocation
(LDA) [4], a probabilistic model of text generation, visual and textual meaning are represented jointly as a
probability distribution over a set of topics. The annotation model takes these topic distributions into account
when finding the most likely keywords for an image and its associated document. The Mix LDA model was
compared with three other models: word overlap, the standard vector space model, and Txt LDA. The results
show that Mix LDA is significantly (p < 0.01) better than these models by a wide margin, with an accuracy of
57.3% compared to 31.0% for Txt LDA, 38.7% for the vector space model, and 31.3% for word overlap. The
Scale-Invariant Feature Transform (SIFT) [2] algorithm is also used here; its local descriptors supply the visual
terms over which the LDA model is defined.
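A minimal sketch of this joint topic representation is given below, assuming the gensim library and assuming that quantized SIFT descriptors have already been mapped to visual-term identifiers such as vis_17; neither assumption comes from the paper itself:

```python
from gensim import corpora, models

# each training "document" mixes article words with the visual terms of its
# image (quantized SIFT descriptors, represented here as IDs like "vis_17")
mixed_docs = [
    ["bomb", "israel", "flight", "airport", "vis_17", "vis_204", "vis_31"],
    ["election", "vote", "minister", "vis_88", "vis_12", "vis_204"],
]
dictionary = corpora.Dictionary(mixed_docs)
corpus = [dictionary.doc2bow(doc) for doc in mixed_docs]

# a single LDA model learns topics shared by both modalities
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)

# annotation time: infer a topic distribution for a new image+article pair
# and use it to rank candidate keywords
new_doc = dictionary.doc2bow(["government", "airport", "vis_17"])
topic_distribution = lda[new_doc]
print(topic_distribution)
```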
3.1.5 Caption Generation
Caption generation is performed automatically, without costly manual involvement. There are two
types of caption generation, extractive and abstractive; both procedures are explained below [5].
Extractive Caption Generation
Extractive caption generation extracts the single sentence that contains the maximum number of
predicted keywords. It is a text summarization technique in which a sentence from the article itself is retrieved,
based on the number of extracted keywords that occur in that sentence.
a. Vector Space-Based Sentence Selection
The keywords extracted from the image and the document, together with the sentences of the
document, are represented in a vector space in order to count the occurrences of each keyword in each sentence.
A word-sentence co-occurrence matrix records the frequency of each word in each sentence; words with higher
frequency are considered more important. Vector space-based sentence selection then proceeds as follows: for
each keyword, the number of
occurrences of that word in every sentence is calculated, the sentence with the largest total number of keyword
occurrences is chosen, and that sentence is retrieved as the caption.
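This selection rule reduces to a few lines of code; the sketch below (pure Python, with illustrative data) counts keyword occurrences per sentence and retrieves the highest-scoring sentence:

```python
import string

def select_caption_sentence(sentences, keywords):
    """Retrieve the sentence containing the most keyword occurrences."""
    def score(sentence):
        # strip punctuation so "airports." still counts as "airports"
        cleaned = sentence.lower().translate(
            str.maketrans("", "", string.punctuation))
        words = cleaned.split()
        return sum(words.count(k) for k in keywords)
    return max(sentences, key=score)

sentences = [
    "The IHRC said it received complaints from Britons with families in Lebanon.",
    "A number of US planes said to be carrying bombs to Israel refuelled in the UK.",
]
print(select_caption_sentence(sentences, ["bombs", "israel", "planes"]))
# -> the second sentence, which covers all three keywords
```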
Abstractive Caption Generation
The extractive method generates a caption that is naturally grammatical, but a single extracted
sentence cannot always express the subject of an image, and the selected sentence may be long, making the
caption inefficient. For these reasons, an abstractive caption generation procedure, which is more efficient than
the extractive one, is used.
a. Word-Based Caption Generation
This method focuses on the probability of each word appearing in the final caption given the document,
while keeping the caption short. The most important words from the document and the image are extracted, and
an effective caption is generated from these words.
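As a rough illustration only (not the paper's exact probabilistic model), the following sketch ranks words by a frequency-based score, boosts words that also appear among the image keywords, and emits the top few in their original article order:

```python
from collections import Counter

def word_based_caption(article_words, image_keywords, max_words=6):
    """Rank words by frequency, boost image keywords, keep article order."""
    freq = Counter(article_words)
    total = sum(freq.values())

    def relevance(word):
        boost = 2.0 if word in image_keywords else 1.0  # favour image words
        return boost * freq[word] / total

    top = set(sorted(set(article_words), key=relevance, reverse=True)[:max_words])
    caption, seen = [], set()
    for word in article_words:      # original order reads more naturally
        if word in top and word not in seen:
            caption.append(word)
            seen.add(word)
    return " ".join(caption)
```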
b. Phrase-Based Caption Generation
This method focuses on the combinations of words (phrases) that may appear in the caption generated
from the given document. Phrases are naturally associated with function words and can potentially capture
long-range dependencies.
c. Search
Reducing the size of the document makes the search more efficient. Each sentence is ranked in terms
of its relevance to the image content drawn from the document, and only the best one is considered. Importance
is also given to the sentences neighbouring the best sentence, under the assumption that neighbouring sentences
are about the same or similar topics.
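A hedged sketch of this ranking step follows; the keyword-overlap score and the neighbour weight of 0.5 are illustrative choices, not values from the paper:

```python
def rank_sentences(sentences, keywords, neighbour_weight=0.5):
    """Score sentences by keyword overlap, boosted by neighbouring scores."""
    base = [sum(1 for k in keywords if k in s.lower()) for s in sentences]
    boosted = []
    for i, own in enumerate(base):
        left = base[i - 1] if i > 0 else 0
        right = base[i + 1] if i + 1 < len(base) else 0
        # neighbouring sentences are assumed to share the same topic
        boosted.append(own + neighbour_weight * (left + right))
    order = sorted(range(len(sentences)), key=lambda i: boosted[i], reverse=True)
    return [sentences[i] for i in order]   # best sentence first
```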
Fig: Steps for Caption Generation
4. EXAMPLE
CAPTION: US arms flights carrying bombs to Israel refuelled in UK airports.
ARTICLE:
The British government could face claims that it violated international humanitarian law by allowing
US arms flights to Israel to use UK airports. The Islamic Human Rights Commission (IHRC) is seeking
permission to contest government bodies over what it says are crimes against the Geneva Convention. A number
of US planes said to be carrying bombs to Israel refuelled in the UK during the Lebanon conflict. The IHRC said
it received complaints from Britons with families in Lebanon. The commission is accusing the government of
"grave and serious violations" of international humanitarian law.
5. CONCLUSION
This paper has reviewed the task of automatic caption generation for news images, a task that
fuses insights from computer vision and natural language processing. Unlike previous work, the task is
approached in a knowledge-lean fashion, using the vast resource of images available on the Internet and
exploiting the fact that many of them co-occur with textual information (i.e., captions and associated documents).
The studies reviewed show that it is possible to learn a caption generation model from weakly labeled data
without costly manual involvement. Image captions are treated as labels for the image. Although caption words
are admittedly noisy compared to traditional human-created keywords, they can be used to learn the
correspondences between the visual and textual modalities, and they can also serve as a gold standard for the
caption generation task. Moreover, the news dataset contains a unique component, the news document, which
provides both information about the image's content and the rich linguistic information required for the
generation procedure.
REFERENCES
[1] Y. Feng and M. Lapata, "Automatic Caption Generation for News Images," IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 35, no. 4, April 2013.
[2] A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every Picture
Tells a Story: Generating Sentences from Images," Proc. 11th European Conf. Computer Vision, pp. 15-29, 2010.
[3] Y. Feng and M. Lapata, "Automatic Image Annotation Using Auxiliary Text Information," Proc. 46th Ann.
Meeting Assoc. of Computational Linguistics: Human Language Technologies, pp. 272-280, 2008.
[4] B. Yao, X. Yang, L. Lin, M.W. Lee, and S.-C. Zhu, "I2T: Image Parsing to Text Description," Proc. IEEE,
vol. 98, no. 8, pp. 1485-1508, 2009.
[5] M. Banko, V. Mittal, and M. Witbrock, “Headline Generation Based on Statistical Translation,” Proc. 38th Ann.
Meeting Assoc. for Computational Linguistics, pp. 318-325, 2000.