Ariyo Sanmi, PhD


Fully Convolutional CaptionNet: Siamese Difference Captioning Attention Model

Generating textual descriptions of the differences between images is a relatively new task
that requires fusing computer vision and natural language processing techniques. In this paper, we present a
novel Fully Convolutional CaptionNet (FCC) that employs an encoder-decoder framework to extract visual
features, compute the distance between them, and generate sentences describing the measured
differences. After extracting the features of the two images, a contrastive function computes their weighted
L1 distance, which is learned and selectively attended to identify salient regions of the feature map at every time
step. The attended feature region is matched to corresponding words iteratively until a sentence
is complete. We also propose applying an upsampling network to enlarge the features' field of view, which
provides a robust pixel-level discrepancy computation. Our extensive experiments indicate that the FCC
model outperforms other learning models on the benchmark Spot-the-Diff dataset by generating succinct
and meaningful descriptions of the differences between images.
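The core idea of the abstract (a weighted L1 distance between Siamese feature maps, followed by spatial attention over the difference) can be illustrated with a minimal NumPy sketch. All shapes, the per-channel weighting, and the channel-sum attention scoring below are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from a shared (Siamese) CNN encoder:
# shape (C, H, W) for the "before" and "after" images.
C, H, W = 8, 4, 4
feat_a = rng.standard_normal((C, H, W))
feat_b = rng.standard_normal((C, H, W))

# Weighted L1 distance: element-wise |a - b| scaled by a learned
# per-channel weight (randomly initialised here as a stand-in).
w = rng.random(C)[:, None, None]
diff = w * np.abs(feat_a - feat_b)        # (C, H, W)

# Spatial attention over the difference map: pool across channels,
# then softmax over the H*W locations to highlight salient regions.
scores = diff.sum(axis=0).reshape(-1)     # (H*W,)
attn = np.exp(scores - scores.max())
attn /= attn.sum()                        # attention weights, sum to 1

# Attended difference feature that a decoder could consume at one
# time step when emitting the next word of the caption.
context = (diff.reshape(C, -1) * attn).sum(axis=1)   # (C,)
```

In the full model the decoder would recompute the attention weights at every time step, so different image regions drive different words of the generated sentence.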
