An Introduction to Content-based Image Retrieval

Access to appropriate information is a fundamental necessity in modern society, and information retrieval techniques are applied across many areas. For example, commercial search services such as Google have become indispensable tools in people’s work and daily lives. The exponential growth in the number of digital images has, in turn, motivated research into image retrieval.

Conventional image retrieval methods attach metadata such as captions, keywords or descriptions to images so that retrieval can be performed over the annotation words. However, metadata-based retrieval is often inadequate: many images lack appropriate metadata, and keywords can express only a limited part of an image’s visual content.

One solution is to analyse the content of the image rather than its metadata. This approach is known as content-based image retrieval (CBIR). The term “content” in this context refers to the colours, shapes, textures, or any other information that can be derived from the image itself. CBIR is preferable because it does not rely on the completeness and quality of annotation.

In other words, a CBIR system uses the visual properties of an image as the search query, rather than metadata associated with the images such as captions, tags and annotations.

CBIR still attracts a lot of attention from the multimedia community, thanks in part to the challenge of scalability and to the emergence of new machine learning models.

Over the decades, progress in CBIR has been extensively surveyed in the literature [1]. Techniques developed for image representation in CBIR include global feature representations, for example colour features [2], edge features [2], texture features [3], GIST [4] and CENTRIST [5], and local feature representations such as bag-of-words (BoW) models [6] built on invariant visual features (e.g. SIFT [7] and SURF [8]).

Image representation and image similarity measurement form the crux of the CBIR problem. In image representation, the goal is to transform an image into a feature space while preserving the essential information in its visual content. The representation should make it possible to tell similar images from dissimilar ones.
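To make the idea of a global colour feature concrete, here is a minimal sketch of a joint RGB colour histogram. It is an illustration rather than the method of any cited paper: the function name `colour_histogram`, the choice of 4 bins per channel, and the toy pixel list are all assumptions made for this example.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Return an L1-normalised joint RGB histogram.

    `pixels` is a list of (r, g, b) tuples with integer values in 0..255.
    The descriptor has bins_per_channel ** 3 entries that sum to 1.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        # Quantise each 0..255 channel value into one of n equal-width bins.
        ri, gi, bi = r * n // 256, g * n // 256, b * n // 256
        hist[(ri * n + gi) * n + bi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


# Toy "image" (hypothetical data): three pure-red pixels and one pure-blue pixel.
toy_image = [(255, 0, 0), (255, 0, 0), (255, 0, 0), (0, 0, 255)]
descriptor = colour_histogram(toy_image)
```

The histogram discards all spatial layout, which is exactly why it counts as a global, low-level feature: two images with the same colour proportions get the same descriptor regardless of what they depict.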

Typical CBIR approaches apply rigid similarity/distance functions, for example Euclidean distance or cosine similarity, to low-level features extracted from the images. Ideally, the similarity between images should reflect the high-level concepts perceived by humans. This is difficult because of the semantic gap: the mismatch between low-level pixel features and the high-level concepts people see in an image.
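The two fixed measures mentioned above can be sketched in a few lines. The feature vectors and image names in the toy database are made up for illustration, and `rank_by_distance` is a hypothetical helper, not part of any library.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_distance(query, database):
    """Return database keys ordered from nearest to farthest (Euclidean)."""
    return sorted(database, key=lambda name: euclidean_distance(query, database[name]))


# Hypothetical 3-dimensional descriptors, e.g. coarse colour proportions.
database = {
    "sunset": [0.8, 0.1, 0.1],
    "forest": [0.1, 0.8, 0.1],
    "ocean": [0.1, 0.1, 0.8],
}
query = [0.7, 0.2, 0.1]
ranking = rank_by_distance(query, database)
```

Both functions treat every feature dimension identically and never change, which is what "rigid" means here: nothing in them is adapted to what people consider similar.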

Machine learning offers promise in addressing the semantic gap because of its recent successes in high-level perception tasks. A range of similarity/distance functions based on machine learning techniques have been proposed [9] [10]. For example, Norouzi et al. [9] adopted a mapping learning scheme for large-scale multimedia applications that preserves semantic similarity while transforming high-dimensional data into binary codes. Jégou et al. [11] used the Fisher kernel to aggregate local descriptors and jointly optimised dimensionality reduction and indexing, condensing an image to a few dozen bytes while maintaining high accuracy.
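The appeal of binary codes is that comparing two images reduces to an XOR and a bit count. The sketch below shows that mechanism only. It deliberately swaps in random hyperplane projections, a simple unlearned hashing scheme, and is not the learned mapping of [9]; the function names, the 16-bit code length and the fixed seed are assumptions for the example.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (NOT learned from data)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binary_code(vector, hyperplanes):
    """Pack the sign bit of each projection into one Python int."""
    code = 0
    for plane in hyperplanes:
        projection = sum(w * x for w, x in zip(plane, vector))
        code = (code << 1) | (1 if projection >= 0.0 else 0)
    return code


def hamming_distance(code_a, code_b):
    """Number of bit positions in which the two codes differ."""
    return bin(code_a ^ code_b).count("1")


planes = make_hyperplanes(dim=4, n_bits=16)
a = binary_code([1.0, 0.5, -0.2, 0.3], planes)
b = binary_code([2.0, 1.0, -0.4, 0.6], planes)     # same direction as a
c = binary_code([-1.0, -0.5, 0.2, -0.3], planes)   # opposite direction to a
```

Vectors pointing the same way share a code and opposite vectors differ in every bit. A learned scheme such as [9] instead trains the mapping so that Hamming distance tracks semantic similarity rather than just the angle between raw features.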

Another technique used to enhance feature representation is distance metric learning (DML). DML methods generally work by learning a metric that minimises the distance between similar images and maximises the distance between dissimilar images. To handle large-scale data, online DML algorithms were developed [12] [13]. For example, Chechik et al. [10] proposed OASIS, an online algorithm for scalable image similarity, to improve image retrieval performance.
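As a rough illustration of online similarity learning, the sketch below performs passive-aggressive updates on triplets (query, similar image, dissimilar image) using a bilinear similarity p^T W q, in the spirit of OASIS [10]. It is a simplified sketch rather than the authors' implementation: the function names, the aggressiveness parameter `C=0.1`, the identity initialisation and the toy vectors are all assumptions.

```python
def bilinear_similarity(W, p, q):
    """S_W(p, q) = p^T W q for a square matrix W given as nested lists."""
    return sum(p[i] * W[i][j] * q[j] for i in range(len(p)) for j in range(len(q)))


def triplet_update(W, p, p_pos, p_neg, C=0.1):
    """One passive-aggressive step on W in place; returns the loss before the step.

    The hinge loss asks that p be at least 1 more similar to p_pos than to p_neg.
    """
    loss = max(0.0, 1.0 - bilinear_similarity(W, p, p_pos)
               + bilinear_similarity(W, p, p_neg))
    if loss == 0.0:
        return loss  # triplet already satisfied: leave W unchanged
    diff = [a - b for a, b in zip(p_pos, p_neg)]
    # Update direction V = p (p_pos - p_neg)^T; its squared Frobenius norm
    # factorises as ||p||^2 * ||p_pos - p_neg||^2.
    v_norm_sq = sum(x * x for x in p) * sum(d * d for d in diff)
    if v_norm_sq == 0.0:
        return loss
    tau = min(C, loss / v_norm_sq)  # step size, capped by C
    for i in range(len(p)):
        for j in range(len(diff)):
            W[i][j] += tau * p[i] * diff[j]
    return loss


def margin(W, p, p_pos, p_neg):
    return bilinear_similarity(W, p, p_pos) - bilinear_similarity(W, p, p_neg)


# Hypothetical 2-dimensional features; W starts as the identity (plain dot product).
W = [[1.0, 0.0], [0.0, 1.0]]
query, similar, dissimilar = [1.0, 0.0], [0.9, 0.1], [0.8, 0.6]
margin_before = margin(W, query, similar, dissimilar)
for _ in range(20):
    triplet_update(W, query, similar, dissimilar)
margin_after = margin(W, query, similar, dissimilar)
```

Each update touches W only when a triplet violates the margin and costs one outer product, which is what makes this style of learning practical on a stream of examples.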


Unlike conventional machine learning methods, which often use “shallow” architectures, deep learning is loosely inspired by the human brain, which is organised in a deep architecture and processes information through multiple stages of transformation and representation. By using deep architectures to learn features at multiple levels of abstraction automatically from data, deep learning methods allow a system to learn complex functions that map raw sensory input directly to the output, without relying on features hand-crafted from domain knowledge. Many recent studies have reported encouraging results from applying deep learning to a variety of applications, including speech recognition, object recognition, and natural language processing, among others.


[1] A. W. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, “Content-based image retrieval at the end of the early years,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1349–1380, 2000.
[2] A. K. Jain and A. Vailaya, “Image retrieval using color and shape,” Pattern Recognition, vol. 29, no. 8, pp. 1233–1244, 1996.
[3] B. S. Manjunath and W. Y. Ma, “Texture features for browsing and retrieval of image data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 8, pp. 837–842, 1996.
[4] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[5] J. Wu and J. M. Rehg, “CENTRIST: A visual descriptor for scene categorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1489–1501, 2011.
[6] L. Wu, S. C. H. Hoi and N. Yu, “Semantics-preserving bag-of-words models and applications,” IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1908–1920, 2010.
[7] D. G. Lowe, “Object recognition from local scale-invariant features,” ICCV, pp. 1150–1157, 1999.
[8] H. Bay, T. Tuytelaars and L. Van Gool, “SURF: Speeded up robust features,” ECCV (1), pp. 404–417, 2006.
[9] M. Norouzi, D. J. Fleet and R. Salakhutdinov, “Hamming distance metric learning,” NIPS, pp. 1070–1078, 2012.
[10] G. Chechik, V. Sharma, U. Shalit and S. Bengio, “Large scale online learning of image similarity through ranking,” Journal of Machine Learning Research, vol. 11, pp. 1109–1135, 2010.
[11] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 9, pp. 1704–1716, 2012.
[12] P. Jain, B. Kulis, I. S. Dhillon and K. Grauman, “Online metric learning and fast similarity search,” NIPS, pp. 761–768, 2008.
[13] R. Jin, S. Wang and Y. Zhou, “Regularized distance metric learning: Theory and algorithm,” NIPS, pp. 862–870, 2009.
