Analysis of a Convolutional Neural Network-Based Algorithm for Annotation Cost Estimation in the Field of Supervised Machine Learning
Active learning is a machine learning process that selectively queries users to label or annotate examples, with the goal of reducing the overall annotation cost. Most existing convolutional neural network (CNN) work assumes that the annotation cost of every labeling query is the same or fixed, but this assumption may not be realistic: in practice, the cost of annotation can vary between data instances. In this work, I study and present several annotation cost-sensitive active learning algorithms, which need to estimate the utility and the cost of each query simultaneously. The goal is to build or merge different machine learning models and reduce the total labeling cost of training the model. Hence, I propose a technique that combines the Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) methods into an efficient architecture that can work across different datasets, thus reducing the overall labeling/annotation cost in the field of supervised machine learning, and validate that the proposed method is generally superior to other annotation cost-sensitive algorithms.
Keywords: active learning, annotation, CNN, labelling, supervised learning
Traditional machine learning algorithms induce a model from whatever labeled data they are given. By contrast, an active learning algorithm can select which instances are labeled and added to the training set. A learner typically starts with a small set of labeled instances, selects a few informative instances from a pool of unlabeled data, and queries an oracle (e.g., a human annotator) for their labels. The objective is to reduce the overall annotation cost of training a model. To genuinely reduce the labeling costs required to build an accurate model, the notion of annotation cost must be better understood and incorporated into the active learning process. Hence, I propose a technique that combines the Latent Semantic Indexing (LSI) and Word Mover’s Distance (WMD) methods into an efficient architecture that can work across different datasets, thus reducing the overall labeling/annotation cost in the field of supervised machine learning.
Active learning is a machine learning setup that enables machines to strategically “ask questions” of a labeling oracle (Settles, 2010) in order to reduce the cost of labeling. Annotation costs have traditionally been measured by the number of examples labeled, but it has been widely recognized that different examples may require different annotation efforts (Settles et al., 2008).
Vast quantities of unlabeled instances can be acquired easily in many machine learning scenarios, yet high-quality labels are expensive to obtain. For example, in fields such as medicine (Liu, 2004) or biology (King et al., 2004), a massive number of experiments and analyses are needed to label a single instance, while collecting samples is a relatively easy task. There are several variations of the cost-sensitive active learning setup. In (Margineantu, 2005), the labeling cost of every data instance is assumed to be known before querying, while in (Settles et al., 2008), the cost of a data instance can only be known after its label has been queried. In this work, I concentrate on the latter setup, which closely matches the real-world human annotation scenario. In this setup, existing approaches (Haertel et al., 2008) must therefore estimate the utility and the cost of each instance simultaneously and select instances with high utility and low cost.
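As a rough illustration of selecting instances with high utility and low cost, the following minimal sketch ranks candidates by estimated utility per unit of estimated cost and picks them greedily within a budget. The ratio criterion, the function name `select_queries`, and the numbers are assumptions made for illustration; they are not taken from the cited works.

```python
def select_queries(candidates, budget):
    """Greedily pick instances by utility-per-cost until the budget is spent.

    candidates: list of (instance_id, estimated_utility, estimated_cost),
    where every estimated cost is assumed to be strictly positive.
    """
    # Highest utility per unit of annotation cost first.
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, spent = [], 0.0
    for instance_id, utility, cost in ranked:
        if spent + cost <= budget:
            chosen.append(instance_id)
            spent += cost
    return chosen, spent


# Hypothetical estimates: (id, utility, annotation cost in seconds).
candidates = [
    ("doc_a", 0.90, 30.0),
    ("doc_b", 0.60, 10.0),
    ("doc_c", 0.80, 40.0),
    ("doc_d", 0.20, 5.0),
]
chosen, spent = select_queries(candidates, budget=45.0)
print(chosen, spent)
```

In a real system both the utility and the cost would be predictions from learned models, so the ranking is only as good as those estimates.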
The idea of uncertainty sampling (Lewis and Gale, 1994) is to query the label of the data instance about which the classifier is most uncertain. For example, for a support vector machine (SVM), (Tong and Koller, 2001) propose querying the data instance closest to the decision boundary, while (Holub et al., 2008) select the data instances to be queried based on the entropy of the label probabilities produced by a probabilistic classifier.
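The entropy criterion can be sketched in a few lines. This is a minimal illustration, assuming a probabilistic classifier has already produced class probabilities for each unlabeled instance; the pool contents and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(pool):
    """Return the id of the instance whose predicted label distribution
    has the highest entropy.

    pool: dict mapping instance id -> list of class probabilities.
    """
    return max(pool, key=lambda instance_id: entropy(pool[instance_id]))


# Hypothetical classifier outputs for three unlabeled instances.
pool = {
    "x1": [0.98, 0.01, 0.01],  # confident prediction
    "x2": [0.40, 0.35, 0.25],  # close to uniform, so highly uncertain
    "x3": [0.70, 0.20, 0.10],
}
query = most_uncertain(pool)
print(query)
```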
In (Kang et al., 2004), the data instances closest to each cluster’s centroid are queried before any other selection criterion is applied; (Huang et al., 2010) measures the representativeness of each data instance from both the cluster structure of the unlabeled data instances and the class assignments of the labeled data; and (Xu et al., 2003) clusters the data instances close to the SVM decision boundary and queries the labels of the instances closest to the center of each cluster. In (Nguyen and Smeulders, 2004), clustering is used to estimate the label probabilities of unlabeled data instances, which is the key component in measuring the utility of a data instance.
Various works target annotation cost-sensitive active learning under different problem settings, such as the querying target (Greiner et al., 2002), the number of labelers (Donmez and Carbonell, 2008), the target classification problem (Yan and Huang, 2018) and the applied data domain (Vijayanarasimhan and Grauman, 2011).
In order to discuss cost-sensitive active learning with unknown costs, the first question to be answered is whether the cost of human annotation can be estimated accurately. In (Arora et al., 2009), various unsupervised models are proposed to estimate the cost of annotation for corpus datasets, while (Settles et al., 2008) shows that the cost of annotation can be estimated accurately using a supervised learning model.
Active learning is a widely used framework that can automatically select the most informative unlabeled examples for annotation. The motivation here is to find the unlabeled examples closest to the labeled data set (their nearest neighbors) and use those neighbors to assign labels. To achieve this, I build a CNN-based document classifier for any input article with an unknown target label and use cosine similarity to find the most similar documents, which serve as neighbors of the unlabeled document in the training set. It is then fairly reasonable to assume that the closest similar document carries the same label, which lets the oracle do the labeling with a smaller set of inputs.
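The nearest-neighbor step can be illustrated with a small sketch. It assumes simple bag-of-words count vectors as a stand-in for the learned document representation, so it shows only the cosine-similarity lookup, not the CNN; the example texts and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def suggest_label(text, labeled):
    """Return (label, similarity) of the labeled document most similar to text.

    labeled: list of (document_text, label) pairs.
    """
    query = Counter(text.lower().split())
    best_label, best_score = None, -1.0
    for doc_text, label in labeled:
        score = cosine(query, Counter(doc_text.lower().split()))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical labeled documents using two 20 Newsgroups category names.
labeled = [
    ("the rocket launch reached orbit", "sci.space"),
    ("the team won the hockey game", "rec.sport.hockey"),
]
label, score = suggest_label("a new rocket reached orbit today", labeled)
print(label, round(score, 3))
```

A suggested label with a low similarity score is a natural candidate to send to the human oracle instead of being accepted automatically.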
The architecture is a combination of two major components. The first collects and preprocesses the documents, defines the similarity measures, and develops the related models. The second takes unlabeled data and uses the different models to perform similarity checks. The output of the system is the set of neighboring documents/articles identified by the most effective models. I evaluate multiple models in this work to improve document similarity and thereby reduce the overall labeling effort. For the similarity score, I use Word2Vec. Based on the Vector Space Model, two similarity measures built on word2vec (“Centroids” and “Word Mover’s Distance (WMD)”) will be studied and compared with the commonly used Latent Semantic Indexing (LSI). The 20 Newsgroups dataset will also be used to compare the document similarity measures.
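To make the two word2vec-based measures concrete, here is a minimal sketch under strong simplifying assumptions: the embeddings are hypothetical 2-dimensional toy vectors rather than trained word2vec vectors, every word in a document carries equal weight, and the distance shown is the relaxed Word Mover's Distance (a cheap lower bound in which each word moves to its nearest counterpart), not the exact WMD, which requires solving an optimal transport problem.

```python
import math

# Hypothetical toy embeddings; real word2vec vectors have hundreds of dimensions.
EMB = {
    "president": (0.9, 0.1),
    "leader": (0.8, 0.2),
    "speaks": (0.1, 0.9),
    "talks": (0.2, 0.8),
    "banana": (-0.7, -0.6),
    "ripens": (-0.5, -0.8),
}


def centroid(doc):
    """Average of the word vectors of a document (the Centroid representation)."""
    vecs = [EMB[w] for w in doc]
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD: each word travels to its nearest word in the other document.

    The larger of the two one-directional costs is returned; it is a lower
    bound on the exact Word Mover's Distance.
    """
    def one_way(src, dst):
        return sum(
            min(math.dist(EMB[s], EMB[d]) for d in dst) for s in src
        ) / len(src)

    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))


d1 = ["president", "speaks"]
d2 = ["leader", "talks"]
d3 = ["banana", "ripens"]

print(cosine(centroid(d1), centroid(d2)), cosine(centroid(d1), centroid(d3)))
print(relaxed_wmd(d1, d2), relaxed_wmd(d1, d3))
```

The documents `d1` and `d2` share no words, yet both measures rate them as close because their words lie near each other in the embedding space; this is the property that distinguishes these measures from pure term-overlap similarity.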
The following figure gives an overview of the methodology:
Figure 1: Overall Architecture
To implement this design modularly, I have divided the project into four independent tasks:
- Data Understanding
- Data Preparation – Prepare the data for machine learning algorithm
- Modelling – Select model and train models
- Evaluation of Results
In order to conduct the testing, I first have to assess the data situation and obtain access to the data; once the data is available, it needs to be explored. I used the ETL data pipeline tool Informatica PowerCenter to build the data warehouse. It was deployed in a virtual machine with the following specifications:
- Operating System: Windows Server 2012 R2 Standard
- RAM: 32 GB
- CPU Cores: 8 cores, 2.40 GHz processor
- Kernel Version: 9.3.9600.18821
Data preparation is the process of gathering, cleaning and consolidating data into a single file or data table, primarily for analysis purposes. I used Datawatch Monarch, a self-service data preparation solution. The recommended specifications for running Datawatch Monarch are as follows:
- Windows 10 – 8 GB memory
- 5 GB disk space
- 2GHz or faster processor
- Google Chrome
- .NET Framework 4.5.2
- Microsoft Access Database Engine 2010 version
- Microsoft SQL Server
I chose scikit-learn, a machine learning library for the Python programming language, to implement models quickly during this project. To get the data ready for machine learning, I have to take some basic steps: missing value imputation, encoding of categorical variables, and, optionally, feature selection if the input dimension is too large. The scikit-learn library requires the following dependencies:
- Python (>= 2.7 or >= 3.4)
- NumPy (>= 1.8.2)
- SciPy (>= 0.13.3)
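The preparation steps mentioned above (missing value imputation and encoding of categorical variables) can be sketched with scikit-learn as follows. This is only an illustration on made-up data, and it assumes a recent scikit-learn release in which `SimpleImputer` is available, which is newer than the minimum versions listed above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two numeric features with gaps, one categorical feature.
numeric = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
categories = np.array([["sci"], ["rec"], ["sci"]])

# Replace each missing numeric value with the mean of its column.
imputed = SimpleImputer(strategy="mean").fit_transform(numeric)

# One-hot encode the categorical column; the encoder returns a sparse matrix
# by default, so convert it to a dense array before stacking.
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(categories)
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# Final feature matrix: imputed numeric columns followed by one-hot columns.
features = np.hstack([imputed, encoded])
print(features)
```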
As part of testing, I compared the three methods (LSI, Centroid and WMD). First, a local analysis on a single example is done to get a sense of how well the methods work. Then a global analysis is done with a clustering task. A lemmatization step is applied, and duplicates are removed to make the table readable. The quality of the clustering produced by each method is given by the Normalized Mutual Information (NMI) values in Table 1.
Table 1: Normalized Mutual Information
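For reference, NMI compares a predicted clustering with the true classes and ranges from 0 (independent) to 1 (identical up to a renaming of clusters). The sketch below is a minimal pure-Python version that normalizes the mutual information by the arithmetic mean of the two entropies; other normalizations exist, and the source does not state which one was used for Table 1.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same items."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (c / n) * math.log((c / n) / ((count_true[t] / n) * (count_pred[p] / n)))
        for (t, p), c in joint.items()
    )
    denom = (entropy(count_true) + entropy(count_pred)) / 2
    # Both labelings put everything in one cluster: treat them as identical.
    return mi / denom if denom > 0 else 1.0


# Same partition with swapped cluster names, then an unrelated partition.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```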
Finally, I compared the overall performance of the methods considered against common clustering methods such as K-medoids, K-means, complete linkage, Ward and DBSCAN.
Figure 2: Top Score Overall Comparison
Developing the architecture explained in the sections above requires six steps, as set out in the timeline below:
- The first step involves reading and analyzing relevant research papers and documents. This initial part will take around two weeks.
- For the next three steps, I have selected several existing algorithms and will test and record results for each of them. Testing and recording the results of the LSI algorithm will take a week.
- In this step, I will test and record the results of the Centroid algorithm. This part will take a week.
- In this step, I will test and record the results of the WMD algorithm. This part will also take a week.
- In this step, the results of the algorithms from the steps above will be compared across various metrics and the bottlenecks identified. This step will take one week.
- The final step involves combining the LSI and WMD algorithms and applying optimization steps to address the issues identified, so that the final algorithm reduces the total labeling/annotation cost in the field. This step will take two weeks.
The following chart shows the detailed steps and timeline for accomplishing the proposed study:
Figure 3: Gantt Chart for the steps and timeline
I am a lead member of technical staff at Salesforce with over 15 years of experience in the software industry. I am responsible for the design, development and execution of test frameworks and harnesses for unit testing of Java-based cloud services. I have also developed Java-based load and performance tools for applications running on large-scale Linux clusters. I have professional-level experience in the following technologies: Java, Python, big data technologies, functional testing, automation and performance engineering. I lead the Sales Cloud prediction quality team at Salesforce. Prior to Salesforce, I worked at Intuit Inc. as a Staff Engineer.
The existing literature provides well-defined explanations and comparisons of various algorithms for calculating annotation/labeling costs in the field of supervised machine learning. However, it does not include improvements such as combining several models into an architecture that can work uniformly across different datasets while taking the behavior and volume of the data into account. In this work, I demonstrated that LSI is the best method for the long texts of the 20 Newsgroups dataset, while WMD and the Centroid method both produce better clusterings than LSI on the Web snippets dataset. The main focus of future work will be to investigate cost-sensitive active learning strategies that are more robust when given approximate, predicted annotation costs.
Figure 4: Steps involved for Analysis
Figure 5: Histogram for Annotation Time
- Settles B (2010). Active learning literature survey. University of Wisconsin, Madison 52(55-66):11
- Settles B, Craven M, Friedland L (2008). Active learning with real annotation costs. In: Proceedings of the NIPS workshop on cost-sensitive learning, pp 1–10
- Liu Y (2004). Active learning with support vector machine applied to gene expression data for cancer classification. Journal of Chemical Information and Computer Sciences
- King RD, Whelan KE, Jones FM, Reiser PG, Bryant CH, Muggleton SH, Kell DB, Oliver SG (2004). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427(6971):247–252
- Margineantu DD (2005). Active cost-sensitive learning. In: Proceedings of International Joint Conference on Artificial Intelligence, pp 1622–1623
- Lewis DD, Gale WA (1994). A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Springer-Verlag New York, Inc., pp 3–12
- Tong S, Koller D (2001). Support vector machine active learning with applications to text classification. Journal of machine learning research 2(Nov):45–66
- Holub A, Perona P, Burl MC (2008). Entropy-based active learning for object recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE, pp 1–8
- Kang J, Ryu KR, Kwon HC (2004). Using cluster-based sampling to select initial training set for active learning in text classification. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp 384–388
- Huang SJ, Jin R, Zhou ZH (2010). Active learning by querying informative and representative examples. In: Advances in neural information processing systems, pp 892–900
- Xu Z, Yu K, Tresp V, Xu X, Wang J (2003). Representative sampling for text classification using support vector machines. In: European Conference on Information Retrieval, Springer, pp 393–407
- Nguyen HT, Smeulders A (2004). Active learning using pre-clustering. In: Proceedings of the 21st international conference on Machine learning, ACM, p 79
- Donmez P, Carbonell JG (2008). Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In: Proceedings of the 17th ACM conference on Information and knowledge management, ACM, pp 619–628
- Guillory A, Bilmes J (2009). Average-case active learning with costs. In: International Conference on Algorithmic Learning Theory, Springer, pp 141–155
- Cuong N, Xu H (2016). Adaptive maximization of pointwise submodular functions with budget constraint. In: Advances in Neural Information Processing Systems, pp 1244–1252
- Vijayanarasimhan S, Grauman K (2011). Cost-sensitive active visual category learning. International Journal of Computer Vision 91(1):24–44
- Arora S, Nyberg E, Rosé CP (2009). Estimating annotation cost for active learning in a multi-annotator environment. In: Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, Association for Computational Linguistics, pp 18–26