Automatic Discrimination of Text and Non-Text Natural Images

Chengquan Zhang, Cong Yao, Baoguang Shi, Xiang Bai
School of Electronic Information and Communications
Huazhong University of Science and Technology, Wuhan, 430074, P.R. China
Email: {zchengquan,yaocong2010,shibaoguang}@gmail.com, xbai@hust.edu.cn

Abstract—With the rapid growth of image and video data, an interesting yet challenging problem arises: how to organize and utilize such a large volume of data? Textual content in images and videos is an important source of information, which can be of great usefulness and assistance. Therefore, we investigate in this paper the problem of text image discrimination, which aims at distinguishing natural images with text from those without text. To tackle this problem, we propose a method that combines three mature techniques in this area, namely MSER, CNN and BoW. To better evaluate the proposed algorithm, we also construct a large benchmark for text image discrimination, which includes natural images from a variety of scenarios. The algorithm has proven to be both effective and efficient, so it can serve as a tool for mining valuable textual information from huge amounts of image and video data.

I. INTRODUCTION

Text present in images and videos, whether naturally occurring in the scene or superimposed, carries rich and precise information, and can therefore serve as an important cue for a wide range of applications, such as image search [1], human-computer interaction [2] and assistive systems for the blind [3]. Different from conventional approaches to textual information understanding, which are usually concerned with text detection or recognition [4], we tackle in this paper the problem of text image discrimination. Given a collection of natural images, the goal of text image discrimination is to discriminate the images that contain text from those that do not contain any text, without considering the location or content of the text, as shown in Fig. 1. This problem is obviously of great importance and practicality, but it has been neglected by the community for a long time.

Fig. 1: Text image discrimination. Given a collection of images, the goal of text image discrimination is to separate images containing text from those not containing text. Text image discrimination can eliminate the majority of irrelevant images at an early stage, so it can benefit subsequent processing, such as text detection, text recognition, and image/video indexing.

The difficulties in text image discrimination mainly stem from three aspects: (1) Diversity of texts. In stark contrast to characters in document images, which usually have regular fonts, similar colors and horizontal orientations, texts in natural images are far more diverse, i.e., they may exhibit significant variations in font, color, orientation, or even language. Moreover, the portion of an image occupied by text can vary tremendously, for instance, a single character in a corner or multiple lines of characters covering the whole image. (2) Complexity of backgrounds. The backgrounds may contain many non-text elements (such as branches, grass, fences and signs) that closely resemble true text, which might cause confusion. (3) Interference factors. Various factors, such as noise, blur and non-uniform illumination, may give rise to failures.

To address these challenges, we propose an effective and robust algorithm, which takes advantage of three widely used techniques in this field: MSER, CNN and BoW. Character candidates are extracted using MSER [5].
The CNN acts as a discriminative codebook, computing a bank of responses for each candidate. For each image, a feature vector is generated with BoW, by aggregating and pooling the responses of all candidates in the image. The final prediction (text or non-text image) is given by a trained SVM classifier.

Moreover, we generate and release a large image database for text image discrimination, since there is no public benchmark in this field. This dataset contains a collection of natural images with and without text, from various scenarios. The texts in the images are in different languages, colors, scales, fonts, orientations, and layouts. Due to its diversity and representativeness, we believe this dataset can serve as a standard benchmark for text image discrimination.

We have evaluated the effectiveness of the proposed algorithm on the proposed dataset. The experiments demonstrate that the proposed method achieves superior performance compared with conventional approaches. With a GPU, the algorithm runs reasonably fast, so it can be used for large-scale non-text image filtration.

In summary, the contributions of this paper are two-fold:
• We emphasize the significance of text image discrimination and propose an effective algorithm for it.
• We establish a large benchmark for algorithm development, assessment and comparison (available at http://zchengquan.github.io/textDis/ for academic use only). It includes 15302 images, exhibiting variations in scene, style, illumination, text appearance and layout, etc.

II. RELATED WORK

Text has been the core research object of the document analysis and recognition community. However, the majority of existing works have focused on document images [6], [7], while those concerned with natural images mainly tackle tasks such as text detection and recognition [8]–[13], rather than the task of text/non-text distinction.

A few works do address text/non-text distinction. Indermuhle et al. [14] and Vidya et al. [15] each proposed systems that distinguish between text and non-text regions in handwritten documents. However, these methods classify text/non-text at the region level, rather than the image level. Most related to our work, the algorithm of Alessi et al. [16] is able to detect whether a digital image contains text, but it is only applicable to document images or simple video frames. In contrast, the proposed approach is capable of distinguishing between natural images with text and those without text. In this sense, this work is the first that formally tackles the problem of text image discrimination for real-world natural images.

III. TEXT IMAGE DISCRIMINATION METHOD

In this section, we describe in detail the proposed method for text image discrimination, which is an ingenious combination of Maximally Stable Extremal Regions (MSER) [5], Convolutional Neural Networks (CNN) [17] and Bag of Words (BoW) [18], [19]. Recently, a rich body of work [20]–[22] has demonstrated that CNNs are effective for scene text detection and recognition. The proposed algorithm also exploits the ability of CNNs to learn discriminative features for accurate text image discrimination. In the training phase, K-means [23] is applied to the text regions extracted by MSER before training a CNN model. This converts the binary classification problem (text vs. non-text) into a multi-class classification problem. According to the responses of all regions, a histogram-like descriptor, which represents the characteristics of a natural image, is formed using Bag of Words (BoW).
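As a concrete illustration of the region extraction and clustering step just outlined (and detailed in Sec. III-A below), the following Python sketch extracts MSER candidate regions and clusters the labelled text regions with HOG and K-means, so that the CNN can later be trained as a (K+1)-class classifier. This is a minimal sketch under our own assumptions (OpenCV and scikit-learn as stand-in libraries; all function names and parameter values are illustrative), not the authors' exact implementation.

```python
# Sketch only: extract MSER candidate regions and cluster the ground-truth text
# regions into K pseudo-classes with HOG + K-means, so that the CNN can later be
# trained as a (K+1)-class classifier (label 0 reserved for non-text/background).
# Assumes OpenCV (cv2) and scikit-learn; names and parameter values are illustrative.
import cv2
import numpy as np
from sklearn.cluster import KMeans

K = 100  # number of text clusters; Table II suggests K+1 = 101 works best

def extract_mser_patches(image, patch_size=32):
    """Extract candidate regions with MSER on multiple channels (R, G, B, gray)."""
    patches = []
    channels = list(cv2.split(image)) + [cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)]
    mser = cv2.MSER_create()  # loose default settings
    for ch in channels:
        _, bboxes = mser.detectRegions(ch)
        for (x, y, w, h) in bboxes:
            patch = cv2.resize(ch[y:y + h, x:x + w], (patch_size, patch_size))
            patches.append(patch)
    return patches

def hog_descriptor(patch):
    """32x32 HOG descriptor, used only for clustering the labelled text regions."""
    hog = cv2.HOGDescriptor((32, 32), (16, 16), (8, 8), (8, 8), 9)
    return hog.compute(patch).ravel()

def cluster_text_regions(text_patches):
    """Assign each ground-truth text region a pseudo-label in {1, ..., K}."""
    feats = np.stack([hog_descriptor(p) for p in text_patches])
    labels = KMeans(n_clusters=K, random_state=0).fit(feats).labels_
    return labels + 1  # shift so that label 0 stays reserved for background regions
```

Reserving label 0 for background regions is our own convention; the paper only states that clustering is applied to foreground regions and that the CNN outputs (K+1) classes.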
Fig. 2: Pipeline for text image discrimination.

As shown in Fig. 2, the proposed method is composed of the following major processing steps: region extraction and clustering, multi-class CNN model training, feature vector generation and linear SVM training.

A. Region Extraction and Clustering

Maximally Stable Extremal Regions (MSER) [5] is widely used in the fields of text detection and recognition, since it is able to discover candidate characters effectively and efficiently. Neumann and Matas [24] have also shown that MSER can capture over 95% of true text regions when multiple channels are used. In our method, we employ MSER to extract regions that support the discrimination of text and non-text images. However, different from text detection or recognition, text image discrimination does not require localizing or recognizing characters. Moreover, the proposed method tolerates text regions that contain a few connected characters. Multiple channels (R, G, B and gray) and loose MSER settings are used in our method.

Before training the CNN model, we compute a HOG [25] feature for each text region from the foreground (text component), and K-means is performed to cluster the text regions into K categories. The softmax layer of the CNN model then produces a (K+1)-dimensional response for each region. Note that clustering is not applied to regions from the background.

B. Multi-Class CNN Model Training

The whole CNN structure is composed of 4 convolutional layers and 2 fully connected layers. Each convolutional layer is followed by max-pooling and rectification (ReLU) units. The setting of the CNN architecture used in this paper is shown in TABLE I.

TABLE I: The setting of the CNN architecture used in this work. ks denotes the kernel size; ps the padding size; ss the stride size; nMap the number of feature maps; and nNode the number of linear layer nodes.

Layer   Settings
Conv-1  ks=5, ps=2, ss=1, nMap=64
Conv-2  ks=5, ps=2, ss=1, nMap=128
Conv-3  ks=3, ps=1, ss=1, nMap=384
Conv-4  ks=3, ps=0, ss=1, nMap=512
Full-1  nNode=1024
Full-2  nNode=(K+1)

The input image region is rescaled to 32 × 32, which by design yields 1 × 1 feature maps after the 4 convolutional layers. The kernel numbers we use are 64, 128, 384 and 512, respectively. After the 4 convolutional stages, 2 fully connected layers with 1024 and (K+1) units follow. Since a softmax is used in the last layer, we obtain a vector of (K+1) dimensions, where each dimension denotes the probability of one category.

In the training phase, we use Stochastic Gradient Descent and back-propagation to jointly optimize all parameters, minimizing the classification error as much as possible. Two important measures are taken in the training process: one is the penalty factor, the other is boosting. Since clustering is unsupervised and cannot divide the samples into perfectly accurate classes, we use different penalty factors for regions from the foreground and the background, with the Negative Log Likelihood criterion as the loss function. Specifically, we assign low penalty factors to the K text region clusters, which allows the CNN model to compensate for imperfect clustering results, while a high penalty factor is assigned to regions from the background, because the model's ability to filter non-text regions is critical for the feature vector generation step.
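To make the architecture of TABLE I and the penalty-factor idea concrete, here is a minimal PyTorch sketch of our reading of the model (the original was implemented in Torch7, as noted in Sec. V). The input channel count, the placement of 2×2 max-pooling after every convolution, and the concrete penalty values are assumptions made for illustration.

```python
# Minimal PyTorch sketch of the (K+1)-way CNN in TABLE I, trained with a
# class-weighted negative log-likelihood so that background (non-text) errors are
# penalized more heavily than errors among the K text clusters.
# Assumptions: single-channel 32x32 input patches, 2x2 max-pooling after every
# convolution, and illustrative penalty values (0.5 for text clusters, 2.0 for background).
import torch
import torch.nn as nn

K = 100

class TextRegionCNN(nn.Module):
    def __init__(self, num_classes=K + 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, stride=1, padding=2),    # 32 -> 32
            nn.MaxPool2d(2), nn.ReLU(inplace=True),                  # 32 -> 16
            nn.Conv2d(64, 128, kernel_size=5, stride=1, padding=2),  # 16 -> 16
            nn.MaxPool2d(2), nn.ReLU(inplace=True),                  # 16 -> 8
            nn.Conv2d(128, 384, kernel_size=3, stride=1, padding=1), # 8 -> 8
            nn.MaxPool2d(2), nn.ReLU(inplace=True),                  # 8 -> 4
            nn.Conv2d(384, 512, kernel_size=3, stride=1, padding=0), # 4 -> 2
            nn.MaxPool2d(2), nn.ReLU(inplace=True),                  # 2 -> 1 (1x1 maps)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
            nn.LogSoftmax(dim=1),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = TextRegionCNN()
# Low penalty for the K (noisily clustered) text classes, high penalty for background,
# pushing the model to become a reliable non-text filter.
class_weights = torch.cat([torch.full((1,), 2.0),   # class 0: background
                           torch.full((K,), 0.5)])  # classes 1..K: text clusters
criterion = nn.NLLLoss(weight=class_weights)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```

With this layer configuration, a 32 × 32 input indeed shrinks to a 1 × 1 map after the fourth convolution and pooling stage, which matches the dimensions stated above.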
To make the CNN model good at filtering non-text regions, we use boosting [26] to optimize the model. First, we divide the regions into two groups: regions from the foreground (text regions) and regions from the background (non-text regions). K-means is adopted to divide the text regions into multiple clusters, while a single class is kept for non-text regions. Because of the complexity and diversity of the background, we randomly select some background samples for the initial round. Once the CNN model first reaches an optimum state, we use it to mine hard non-text examples and add them to the next round of training; this procedure is repeated 3∼5 times.

C. Feature Vector Generation and Linear SVM Training

Given an image, N regions are extracted with MSER on multiple channels, followed by color channel conversion and scale normalization. Each region generates a (K+1)-dimensional response, so a feature matrix of size N × (K+1) can be used to represent the whole image. Then, a feature vector of (K+1) dimensions is formed with BoW, by aggregating and pooling the feature matrix. We call this operation CNN Coding, since the softmax output of the CNN model acts as the codebook, similar to the sparse coding used in word retrieval [27]. The coding output can be expressed as follows:

\Phi(I) = \sum_{i=1}^{K+1} \alpha_i \phi_i,   (1)

where I denotes an image, \phi_i is the response of the i-th class, and \alpha_i denotes the weight of the i-th class. Thus, \Phi(I) stands for the CNN Coding result for a whole image.

A feasible way to improve upon CNN Coding is to perform region filtering before coding, since regions from the background are complex and diverse. Moreover, given a text image, the number of non-text regions extracted by MSER is usually several orders of magnitude larger than that of the text regions. In order to generate a reliable and effective descriptor for an image, some measure should be adopted to control the ratio between text and non-text regions. The trained CNN model has proven to be excellent at filtering non-text regions, so we not only use the CNN model to generate the feature vector of an image, but also use it as a filter. As shown in Fig. 3, the CNN model is able to effectively eliminate non-text regions, which makes the final feature more discriminative. We will demonstrate the advantage of this strategy in Sec. V.

Fig. 3: Regions filtered by the CNN model. (a) Text image, (b) Non-text image. Input images are shown in the first row. The red bounding boxes in the second row are the regions extracted by MSER, while the green bounding boxes in the third row are the regions that survive filtering.

Finally, the coding results are fed to an SVM classifier. In our method, we use LIBLINEAR [28] as the final classifier, which leads to excellent performance.
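The following sketch illustrates CNN Coding (Eq. 1) together with the region-filtering step, under the simplest reading of the description above: uniform weights α_i, filtering by the background probability, and a linear SVM on the pooled (K+1)-dimensional vector. scikit-learn's LinearSVC (which wraps liblinear) stands in for LIBLINEAR; all names and the reserved ratio are illustrative, not the authors' exact implementation.

```python
# Sketch of CNN Coding (Eq. 1) with region filtering: keep only the top-ranked
# regions (least likely to be background), average their (K+1)-dimensional softmax
# responses into a single descriptor Phi(I), and classify the image with a linear SVM.
# Assumptions: uniform weights alpha_i, class 0 = background, reserve_ratio of 2%
# following TABLE III; scikit-learn's LinearSVC stands in for LIBLINEAR.
import numpy as np
import torch
from sklearn.svm import LinearSVC

def cnn_coding(model, region_batch, reserve_ratio=0.02):
    """region_batch: float tensor of shape (N, 1, 32, 32) holding the MSER regions."""
    with torch.no_grad():
        responses = model(region_batch).exp().numpy()  # N x (K+1) softmax probabilities
    text_score = 1.0 - responses[:, 0]                 # how unlikely each region is background
    n_keep = max(1, int(reserve_ratio * len(responses)))
    kept = responses[np.argsort(-text_score)[:n_keep]]
    return kept.mean(axis=0)                           # (K+1)-dimensional image descriptor

def train_image_classifier(coded_images, labels):
    """coded_images: M x (K+1) array of Phi(I) vectors; labels: 1 = text image, 0 = non-text."""
    svm = LinearSVC(C=1.0)
    svm.fit(coded_images, labels)
    return svm
```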
IV. DATASET AND EVALUATION PROTOCOL

In this section, we introduce the proposed dataset for text image discrimination and the corresponding evaluation method.

A. Dataset

In order to evaluate the proposed algorithm and compare it with other methods, we have collected a large dataset from the Internet, which includes 7302 text images and 8000 non-text images. The majority of the images are natural images, while a small number are born-digital images and scanned document images. If an image contains any piece of a text line, it is considered a text image, regardless of the proportion of text areas. The dataset is challenging and diverse, due to variations in scene type, style, illumination, language and text layout, as well as the complexity of the backgrounds. As shown in Fig. 4, the text images are manually labelled with bounding boxes for visible text regions. We randomly select 2000 samples from each class to build the testing set; all the remaining images are used as the training set.

Fig. 4: Examples from the proposed dataset. (a) Text images along with manually labelled bounding boxes. (b) Non-text images.

B. Evaluation Protocol

Similar to the evaluation of general image classification, precision (P), recall (R) and F-Measure (F) are used as the metrics to measure the performance of text image discrimination algorithms. These three metrics are defined as follows:

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2 \times P \times R}{P + R},   (2)

where TP is the number of text images that are correctly classified, FP is the number of non-text images that are falsely classified, and FN is the number of text images that are falsely classified. In our experiments, precision thus means the ratio of the number of correctly classified text images to the number of images predicted as positive, and recall means the ratio of the number of correctly classified text images to the number of all text images in the test set.

V. EXPERIMENTS AND DISCUSSIONS

In this section, we evaluate our algorithm on the proposed benchmark dataset, using the standard evaluation protocol. In the following experiments, we discuss the effects of different numbers of clusters on text regions and of different percentages of regions reserved after filtering. Moreover, we compare our method with other traditional algorithms for image classification, such as Locality-constrained Linear Coding (LLC) [29], CNN, and an improved LLC that uses MSER.

Effect of cluster number: As shown in TABLE II, we compare 7 different numbers of clusters used in training the CNN model. The precision and recall are measured when the F-Measure achieves its optimum value. Comparing different values of K, we find that the best performance is achieved at K = 100 (i.e., K+1 = 101), beyond which there is no obvious improvement.

TABLE II: Effects of different values of K.

(K+1)  Precision  Recall  F-Measure
2      0.889      0.878   0.883
51     0.906      0.874   0.890
101    0.898      0.903   0.901
201    0.892      0.892   0.892
301    0.881      0.902   0.891
401    0.894      0.884   0.888
501    0.879      0.908   0.892

Effect of reserved region ratio: As mentioned above, too many non-text regions have a negative effect on the final performance. We therefore experimented with different percentages of reserved regions. As can be observed from TABLE III, the best performance is obtained when the top-ranked 2% of regions are reserved, which confirms that too many background regions hurt the classification performance. In this experiment, the precision and recall are also measured when the F-Measure achieves its optimum value.

TABLE III: Effects of different percentages of regions reserved.

Percentage  Precision  Recall  F-Measure
1%          0.900      0.884   0.892
2%          0.898      0.903   0.901
5%          0.906      0.866   0.885
10%         0.897      0.853   0.874
20%         0.916      0.819   0.864
50%         0.903      0.816   0.857
100%        0.901      0.812   0.854
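Before moving on to the method comparison, a small self-contained sketch of the evaluation protocol of Sec. IV-B (Eq. 2) is given below for reference; it simply counts TP, FP and FN over image-level predictions, treating text images as the positive class. Function and variable names are illustrative.

```python
# Precision, recall and F-measure as defined in Eq. (2); text images are the positive class.
def evaluate(predictions, labels):
    tp = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure
```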
Comparison with other methods: (1) Locality-constrained Linear Coding [29] is our first baseline method, which extracts dense SIFT features at 3 different scales; the codebook size in our experiment is 2048. Besides, we replace SPM [30] with global max-pooling, since we find that SPM hardly brings any improvement while spending more time on coding. (2) CNN is the second baseline method, whose structure is similar to the model used in our proposed method, except that the input scale is 224 × 224 and a global max-pooling is performed before the last 2 fully connected layers. (3) Taking the bounding box information into account, we also design an improved coding scheme for the traditional LLC: for each image, HOG, LBP, gradient histogram, and gradient orientation histogram features are extracted from the stable regions after filtering and integrated with the LLC method. We call this method MSER+Adaboost+LLC.

The P-R curves in Fig. 5 show that, with the bounding box information, the proposed method achieves significantly better performance than the baseline methods. Note that the comparison between the proposed method (CNN Coding) and MSER+Adaboost+LLC is fair, since both methods use bounding box labels in the training phase.

Fig. 5: The precision/recall curves of different methods on our dataset.

Time cost: We also measured the time cost of our algorithm on a regular PC (CPU: Intel(R) Xeon(R) E3-1230 V2 @3.30 GHz; GPU: Tesla K40c; RAM: 8GB). As shown in TABLE IV, our method takes around 0.43∼0.49s to accomplish the classification task on a single CPU and a single GPU. We analyse the average time taken by each stage of our pipeline on our dataset, which has an average image size of 720 × 620. The MSER step is implemented in C++ with OpenCV 2.4.8. The CNN Coding and linear SVM are executed on the Torch7 platform under Linux. The proposed system achieves high classification accuracy and runs reasonably fast, thus it can be used as a powerful tool in large-scale textual information mining tasks.

TABLE IV: Time cost for each stage.

Stage               Time
MSER Extraction     0.18∼0.23s
CNN Coding          0.25∼0.26s
SVM Classification  0.24ms
Total               0.43∼0.49s

VI. CONCLUSION

We have presented an effective and efficient algorithm for text image discrimination, an important problem in textual information understanding. Moreover, a large image dataset, which includes diverse natural images with and without text, has been generated and released. This dataset can serve as a standard benchmark in this field. The experiments on the proposed benchmark show the advantages of the proposed algorithm over conventional methods.

In this paper, we used the bounding boxes as well as the image-level labels for training the model for text image discrimination. In the future, we will investigate methods that only utilize image-level labels [31]. Besides, this method can also narrow the search range for text regions, which would be helpful for text localization and deserves further study.

ACKNOWLEDGEMENTS

This work was primarily supported by National Natural Science Foundation of China (NSFC) (No. 61222308), and in part by Program for New Century Excellent Talents in University under Grant NCET-12-0217, and by CCF-Tencent RAGR20140116.

REFERENCES

[1] S. S. Tsai, H. Chen, D. Chen, G. Schroth, R. Grzeszczuk, and B. Girod, "Mobile visual search on printed documents using text and low bit-rate features," in Proc. of ICIP, 2011, pp. 2601–2604.
[2] B. Kisacanin, V. Pavlovic, and T. S. Huang, Real-time Vision for Human-Computer Interaction. Springer Science & Business Media, 2005.
[3] B. T. Nassu, R. Minetto, and L. E. S. d. Oliveira, "Text line detection in document images: Towards a support system for the blind," in Proc. of ICDAR, 2013, pp. 638–642.
[4] C. Yao, X. Bai, B. Shi, and W. Liu, "Strokelets: A learned multi-scale representation for scene text recognition," in Proc. of CVPR, 2014.
[5] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, no. 10, pp. 761–767, 2004.
[6] N. Chen and D. Blostein, "A survey of document image classification: problem statement, classifier architecture and performance evaluation," IJDAR, vol. 10, no. 1, pp. 1–16, 2007.
[7] D. Doermann, "The indexing and retrieval of document images: A survey," Computer Vision and Image Understanding, vol. 70, no. 3, pp. 287–298, 1998.
[8] Y. Navon, V. Kluzner, and B. Ophir, "Enhanced active contour method for locating text," in Proc. of ICDAR, 2011, pp. 222–226.
[9] J. Liang, D. Doermann, and H. Li, "Camera-based analysis of text and documents: a survey," IJDAR.
[10] C. Yi, X. Yang, and Y. Tian, "Feature representations for scene text character recognition: A comparative study," in Proc. of ICDAR, 2013.
[11] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proc. of CVPR, 2012.
[12] C. Yao, X. Bai, and W. Liu, "A unified framework for multi-oriented text detection and recognition," IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4737–4749, 2014.
[13] Y. Zhu, C. Yao, and X. Bai, "Scene text detection and recognition: Recent advances and future trends," Frontiers of Computer Science, 2015.
[14] E. Indermuhle, H. Bunke, F. Shafait, and T. Breuel, "Text versus non-text distinction in online handwritten documents," in Proc. of SAC, 2010.
[15] V. Vidya, T. R. Indhu, and V. K. Bhadran, "Classification of handwritten document image into text and non-text regions," in Proc. of ICSIP, 2012.
[16] N. G. Alessi, S. Battiato, G. Gallo, M. Mancuso, and F. Stanco, "Automatic discrimination of text images," in Proc. of SPIE, 2003.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[18] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. of CVPR, 2005.
[19] S. Bai, X. Wang, C. Yao, and X. Bai, "Multiple stage residual model for accurate image classification," in Proc. of ACCV, 2014, pp. 430–445.
[20] T. Bluche, H. Ney, and C. Kermorvant, "Feature extraction with convolutional neural networks for handwritten word recognition," in Proc. of ICDAR, 2013, pp. 285–289.
[21] W. Huang, Y. Qiao, and X. Tang, "Robust scene text detection with convolutional neural network induced MSER trees," in Proc. of ECCV, 2014, pp. 497–511.
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in Proc. of ECCV, 2014, pp. 512–528.
[23] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proc. Fifth Berkeley Symp. on Math. Statist. and Prob.
[24] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in Proc. of CVPR, 2012, pp. 3538–3545.
[25] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. of CVPR, vol. 1, 2005, pp. 886–893.
[26] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
[27] R. Shekhar and C. Jawahar, "Document specific sparse coding for word retrieval," in Proc. of ICDAR, 2013, pp. 643–647.
[28] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[29] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proc. of CVPR, 2010, pp. 3360–3367.
[30] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. of CVPR, 2009, pp. 1794–1801.
[31] X. Wang, X. Bai, W. Liu, and L. J. Latecki, "Feature context for image classification and object detection," in Proc. of CVPR, 2011.