LDA-based Approach for Interactive Webmining



Many real-world Web mining tasks require discovering topics interactively: users are likely to intervene in the topic discovery and selection processes by expressing their preferences. In this paper, a new algorithm based on Latent Dirichlet Allocation (LDA) is proposed for interactive topic evolution pattern detection. To suppress uninteresting or irrelevant topics, it allows users to add supervised information by adjusting the posterior topic-word distributions at the end of each iteration, which in turn influences the inference process of the next iteration. A framework is designed to incorporate different kinds of supervised information. Experiments are conducted on both English and Chinese corpora, and the results show that the extracted topics capture meaningful themes, and that the interactive information helps to find better topics more efficiently.

Keywords- Web mining; probabilistic topic models; topic evolutionary patterns

I. Introduction

The Web has become a very convenient and effective communication channel for people to share their knowledge, express their opinions, promote their products, or even educate each other, by publishing textual data through a Web browser interface. However, Web users also face the frustration of information overload. For example, an active forum may produce more than one hundred posts/articles covering tens of topics in a single day. Some of the hot posts/articles may interest hundreds of users, who join the discussion by replying to the original documents and/or raising further related topics. These topics then evolve over time, repeating the above procedure and producing more subsequent textual data.

Mining meaningful and useful information (such as interpretable topics and topic evolution patterns) from textual data is believed to help overcome information overload. One such mining task is to find interesting temporal patterns in the text data, to see what themes they contain and how these themes vary over time. Typical applications include document modeling, opinion mining, topic segmentation, information extraction, and summarization [1,2,3].

Probabilistic topic modeling has gained much research interest since the introduction of probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA), and other variants. Basically, all these models are based on the assumptions that documents are mixtures of topics, and that each topic is a probability distribution over words. This family of topic models is called generative models because they specify probabilistic procedures by which documents can be generated. To generate a new document, a distribution over topics is chosen first. Then, to obtain each word in that document, one chooses a topic at random according to this distribution and draws a word from that topic.
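As a concrete illustration, the generative story above can be sketched as follows (a minimal toy sketch, not the authors' implementation; the mixture theta and the topic-word rows phi are made-up toy values):

```python
import numpy as np

def generate_document(doc_topic_dist, topic_word_dists, n_words, rng):
    """Generate one document under the LDA generative story: for each word
    position, draw a topic from the document's topic mixture, then draw a
    word from that topic's word distribution."""
    n_topics = len(topic_word_dists)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topic_dist)                 # pick a topic
        w = rng.choice(len(topic_word_dists[z]), p=topic_word_dists[z])  # pick a word
        words.append(int(w))
    return words

rng = np.random.default_rng(0)
# toy example: 2 topics over a 4-word vocabulary
theta = np.array([0.7, 0.3])                 # document-topic mixture
phi = np.array([[0.5, 0.5, 0.0, 0.0],        # topic 0 favors words 0 and 1
                [0.0, 0.0, 0.5, 0.5]])       # topic 1 favors words 2 and 3
doc = generate_document(theta, phi, 20, rng)
```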

It is not difficult to see that in many real-world topic detection tasks, the process is often interactive: users are likely to intervene in the inference process by expressing their preferences. For example, a user may want to perform multiple iterations on a forum data set to see what was most frequently discussed during the past two months, or to narrow his/her interest in subsequent mining to several specific topics found in previous runs.

If we directly use these models (e.g., naïve LDA) to mine topic models and evolutionary patterns, we often face the following difficulties.

(1) An algorithm such as naïve LDA is often run in a fire-and-forget mode. However, users often need the ability to track a certain topic over time, to see how it evolves.

(2) The outcome of the LDA model is often not easy to interpret; further effort is required to infer the hidden interpretable meaning behind the word distributions of the topics, and new mining operations may be issued thereafter.

In this paper, we propose an algorithm and software framework for interactive topic evolution pattern detection based on LDA. The proposed algorithm, iOLDA, is similar to LDA in its unsupervised topic clustering process; but unlike LDA, it runs in an interactive mode that allows users to add supervised information by modifying the posterior distribution of each iteration through a uniform framework, so as to guide the topic detection process of the next iteration.

The rest of the paper is organized as follows. Section II provides an overview of existing probabilistic topic modeling and evolution detection methods. Section III introduces an online version of the LDA model. In Section IV we describe our interactive Latent Dirichlet Allocation approach in detail: a simple extension to the LDA model that enables users to adjust the topic-word distributions interactively, which may influence the inference process of the next iteration by suppressing uninteresting or irrelevant topics. Section V describes the experiments conducted on two different datasets to validate the effectiveness of the proposed methods. Section VI concludes the paper and discusses future work.

II. Related Works

Probabilistic topic modeling is a relatively new approach that has been successfully applied to discover and predict the underlying structure of textual data. Many probabilistic topic models have been proposed in recent years, such as pLSI, LDA, and CTM [2]. They are statistical generative models that relate documents and words through latent variables representing the topics. In this paper, we follow the LDA algorithm as the basic probabilistic model. For a comprehensive survey of these models, please refer to [2].

Topic evolution is the process of change in all forms of a topic over time, and topic evolution mining is the study of discovering evolutionary topics. Similar to the concept of biological evolution, topics can also exhibit topic traits and topic mutations. Informally, in the context of a probabilistic model, by topic traits we mean the stability or similarity of the topic-word distributions between consecutive time slices; by topic mutations we mean the divergence or difference between them.

Recently, there have been several research works on modeling topic evolutionary patterns. Reference [4] is probably the most similar to the work introduced in this paper. It presents a solution based on pre-discretized time analysis: documents are divided into discretized time slices, and a separate topic model is developed for each slice. Our method is distinguished by its ability to integrate the user's guidance into an iterative process, enabling the user to discover and track topic evolutionary patterns interactively. Other recent work has addressed temporal documents with a collection of incremental and efficient algorithms [5, 6].

III. Modeling Evolution of Topics

In this paper, we concentrate on how to integrate personal interactive activities into the process of probabilistic topic detection and tracking, in order to help Internet users select the topics they find interesting. This breaks down into two questions: (1) how to capture the evolutionary patterns of the topics; and (2) how to use the user's supervised information to guide the search or recommendation services toward a more focused search. In the next sub-section, we discuss an online version of LDA to capture the temporal characteristics of the topics. In Section IV, we turn to the interactive LDA model, which incorporates the user's feedback as a bias on the distribution priors.

A. iOLDA Topic Model

To capture the evolutionary patterns of the topics, we developed an interactive online version of the Latent Dirichlet Allocation algorithm (iOLDA). In this paper, we follow the notation of [4]. Figure 1(a) illustrates the main idea of iOLDA. The extension of LDA to interactive mining is discussed later in Section IV.

Compared to static LDA, iOLDA takes a text stream as its input, rather than a whole static text corpus, and finds the hidden topics within that stream. For example, in the context of an Internet forum, a text stream is formed by the posts or articles published in each channel. Each stream S is partitioned into T = {0, 1, 2, ..., t, ...} ascending time slices, called stream segments St, following a sliding window model. The window size is fixed, but its granularity can vary from a day to a week to a month, depending on user choice and/or the average volume of the channel. For simplicity, we do not consider the intersection between different channels (i.e., different streams) in this paper.
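Such a sliding-window segmentation might be sketched as follows (an illustrative sketch only; the function name `segment_stream` and the weekly default window are our own choices, not from the paper):

```python
from collections import defaultdict
from datetime import date, timedelta

def segment_stream(docs, window=timedelta(days=7)):
    """Partition a document stream into fixed-width time slices St.
    `docs` is a list of (timestamp, text) pairs; the window size is fixed,
    but its granularity (day/week/month) is a user choice.
    Windows that contain no documents are simply omitted."""
    if not docs:
        return []
    docs = sorted(docs, key=lambda d: d[0])
    start = docs[0][0]
    segments = defaultdict(list)
    for ts, text in docs:
        idx = (ts - start) // window      # which slice this document falls into
        segments[idx].append(text)
    return [segments[i] for i in sorted(segments)]
```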

Within each stream segment St, the documents are exchangeable, i.e., we do not consider the publication time order of the documents within each stream segment. This simplifies the process of discovering the topics within the range of the stream segment. In this paper, we apply the LDA algorithm to each stream segment with a simple extension to the generation of the topic-word distribution prior, except for the first iteration, i.e., LDA0 in Figure 1(a), which is a naïve LDA process [2].

The output of iOLDA is a sequence of LDA topic models in ascending order of the T stream segments. The topic model at time t consists of two parts, similar to the output of a static LDA: (1) the estimate of the topic-word distribution, φt; and (2) the estimate of the document-topic distribution, θt.

In order to estimate φt and θt, we use Gibbs sampling, a form of Markov chain Monte Carlo proposed by Griffiths and Steyvers in [1], because it is easy to implement and is also an efficient method for extracting topics from a large corpus. The Gibbs sampling procedure considers each word token in the corpus in turn and estimates the probability of assigning the current word token to each topic, conditioned on the topic assignments of all other word tokens. Because of limited space, only the estimates are given below; for more details about the Gibbs sampling procedure, please refer to [2]:

P(zi = j | z-i, w) ∝ (C^WT_{wi,j} + β) / (Σw C^WT_{w,j} + Wβ) × (C^DT_{di,j} + α) / (Σk C^DT_{di,k} + Tα)    (1)

φ'_j(w) = (C^WT_{w,j} + β) / (Σw' C^WT_{w',j} + Wβ),    θ'_d(j) = (C^DT_{d,j} + α) / (Σk C^DT_{d,k} + Tα)    (2)

where C^WT and C^DT are matrices of counts with dimensions W × T and D × T, respectively; C^WT_{w,j} contains the number of times word w is assigned to topic j, not including the current instance i, and C^DT_{d,j} contains the number of times topic j is assigned to some word token in document d, not including the current instance i. W is the number of distinct words and T is the number of topics.
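The per-token sampling step can be sketched as follows (a sketch of the Griffiths–Steyvers update built from the count matrices above; the function name is ours, and the counts are assumed to already exclude the current token):

```python
import numpy as np

def gibbs_topic_probs(w, d, CWT, CDT, alpha, beta):
    """Probability of assigning word token w (in document d) to each topic j,
    conditioned on all other assignments:
        P(z=j) ∝ (CWT[w,j]+beta)/(sum_w CWT[w,j]+W*beta)
               * (CDT[d,j]+alpha)/(sum_j CDT[d,j]+T*alpha)
    CWT is W x T (word-topic counts), CDT is D x T (document-topic counts)."""
    W, T = CWT.shape
    word_term = (CWT[w, :] + beta) / (CWT.sum(axis=0) + W * beta)
    doc_term = (CDT[d, :] + alpha) / (CDT[d, :].sum() + T * alpha)
    p = word_term * doc_term
    return p / p.sum()         # normalize so the probabilities sum to one
```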

Besides the two distributions, the discovered topics are also correlated by semantic distances between one another along the time line, which can be used to represent the topic evolution patterns of the current text stream (e.g., a forum channel). Here we use the Kullback-Leibler divergence (also known as K-L distance, information gain, or relative entropy) [7] to measure the topic traits and mutations quantitatively.

For probability distributions P and Q of a discrete random variable, the K-L divergence of Q from P is defined to be

DKL(P||Q) = Σi P(i) log( P(i) / Q(i) )

Here P and Q are the consecutive topic-word distributions obtained from LDAi, where i = 0, 1, 2, .... Because the Kullback-Leibler divergence is not symmetric, we compute both DKL(P||Q) and DKL(Q||P) and use their average as a measure of the semantic distance between two consecutive topics.

With DKL, we can discover novel topics by checking whether the distance between a newly found topic and its predecessor is larger than a threshold, in which case a topic mutation occurs. In this way, the topics found can be partitioned along the time line into meaningful themes.
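The symmetrized distance and the mutation test can be sketched as follows (the smoothing constant `eps` is our own addition, used to keep the divergence finite when one topic drops a word the other still uses):

```python
import numpy as np

def kl(p, q):
    """K-L divergence D_KL(P||Q) for discrete distributions, with a small
    smoothing constant so q has no zero entries where p is positive."""
    eps = 1e-12
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def topic_distance(p, q):
    """Symmetrized K-L distance: the average of both directions."""
    return 0.5 * (kl(p, q) + kl(q, p))

def is_mutation(p, q, threshold):
    """Flag a topic mutation when consecutive topic-word distributions
    drift further apart than the threshold."""
    return topic_distance(p, q) > threshold
```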

IV. Mining the Patterns Interactively with iOLDA

In this section, we discuss how to use the user's supervised information to guide the search or recommendation services. An interactive model, iOLDA, is developed to incorporate the user's preferences and/or feedback as a bias on the distribution priors of the next iteration.

A. Different Interaction Policies

After the first iteration of iOLDA, the topics generated by LDA0 represent the main topical contents in S0. This result can contribute to the prior of the next iteration, LDA1, because the documents in the same text stream share a similar semantic background. From the statistical point of view, this prior serves as a bias on the next stream segment and carries the topic traits to the next iteration. However, the next stream segment may introduce new tokens, the sizes of the two segments may be too different, or the users may specify their preferences. Therefore, a function f is introduced for smoothing, normalization, and other purposes. We have:

prior-φt = f(posterior-φt-1)    (3)

Different ways of determining the function f can be used to fulfill different interaction policies, which may include:

1) Corpus level smoothing: to take corpus level influences into consideration. For example, to smooth the difference in token set size between consecutive corpora, we have:

prior-φt = posterior-φt-1 · wsmoothing    (4)

where wsmoothing can be a smoothing weight vector, such as a single exponential smoothing weight as defined in (5).

2) Word level filtering: to leverage word level prior knowledge to filter the influence of words, e.g., common sense words. For example, we can use a TFIDF filter to eliminate the noise of common sense words by decreasing their prior probabilities or setting them to zero.

prior-φt = posterior-φt-1 · wfiltering    (6)

where wfiltering is a filtering weight vector that sets a token's probability to zero if its TFIDF weight is less than a predefined threshold.

3) User level preference: to encode the user's preferences as a bias on the prior distribution. For example, the user can explicitly specify a set of uninteresting tokens and set the corresponding prior probabilities to zero.

All the different policies can be implemented under a uniform framework, which treats the modifications of the priors as different services sharing the same interface.
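One way such a shared interface might look (a sketch under our own naming, not the paper's implementation: each policy is a function from a topic-word distribution to a modified one, and `compose` chains several policies into the single interaction function f of Eq. (3)):

```python
import numpy as np

def normalize(v):
    """Renormalize a nonnegative vector to a probability distribution."""
    s = v.sum()
    return v / s if s > 0 else v

def smoothing_policy(weight):
    """Corpus-level smoothing: elementwise reweighting of the posterior."""
    return lambda phi: normalize(phi * weight)

def filtering_policy(tfidf, threshold):
    """Word-level filtering: zero out tokens whose TFIDF weight falls
    below the threshold."""
    return lambda phi: normalize(np.where(tfidf < threshold, 0.0, phi))

def preference_policy(blocked):
    """User-level preference: zero out explicitly uninteresting tokens."""
    def f(phi):
        out = phi.copy()
        out[list(blocked)] = 0.0
        return normalize(out)
    return f

def compose(*policies):
    """All policies share one interface (distribution in, distribution out),
    so any combination can be chained into one interaction function f."""
    def f(phi):
        for p in policies:
            phi = p(phi)
        return phi
    return f
```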

B. iOLDA Algorithm

All the interaction policies discussed in the previous sub-section have been realized under a uniform framework of functions, as illustrated in Figure 1(b). The plus symbol ('+') indicates a reward function applied to a specific token's prior in the topic-word distribution. The minus symbol ('-') indicates a penalty function applied either to a specific token's prior or to a topic as a whole. The reward/penalty functions can be designed to increase/decrease the corresponding token's occurrence count in the corpus directly, or the token's probability in the topic model indirectly. The multiplication symbol ('×') indicates other functions on the posterior distribution of the previous iteration.

Through this uniform framework, modifications to the prior can be implemented easily; most importantly, it also makes the combination of different interaction policies very flexible.

We can now summarize the proposed interactive online LDA algorithm as Algorithm 1.

The algorithm takes as input the current text stream segment St, the topic number K, the time stamp t, the Dirichlet hyperparameters α and β, and the user interaction function f. The output of the algorithm contains the estimates and the distances of the topic distributions along each topic theme.
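Putting the pieces together, the overall loop of Algorithm 1 might be sketched as follows (a structural sketch only; `run_lda` stands in for any LDA implementation that accepts an optional topic-word prior, and `kl_distance` for the symmetrized K-L measure of Section III):

```python
def iolda(stream_segments, K, alpha, beta, f, run_lda, kl_distance):
    """Sketch of the iOLDA loop: run LDA on each stream segment, let the
    interaction function f turn each posterior into the next segment's
    prior, and record distances between consecutive topic distributions.
    `run_lda(St, K, alpha, beta, prior)` must return (phi, theta)."""
    phis, thetas, distances = [], [], []
    prior = None                      # LDA0 uses the plain symmetric prior
    for St in stream_segments:
        phi, theta = run_lda(St, K, alpha, beta, prior)
        if phis:                      # distance of each topic to its predecessor
            distances.append([kl_distance(p, q) for p, q in zip(phis[-1], phi)])
        prior = f(phi)                # user interaction biases the next run
        phis.append(phi)
        thetas.append(theta)
    return phis, thetas, distances
```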

V. Experiments

For the experiments, we prepared two datasets: an English dataset from the NIPS proceedings [8], and a Chinese dataset from the TIANYA forum [9]. The NIPS dataset consists of the full text of 14 years (1988-2001) of proceedings of the Neural Information Processing Systems (NIPS) conference, stored in 1,958 text files in 14 folders (one per year) and occupying about 36MB. Because this dataset shares a common background of machine learning and computational neuroscience, and is also used by many related works for experimental or evaluation purposes, it was chosen to show iOLDA's ability to capture the evolutionary patterns of the topics.

The TIANYA dataset consists of the full text of Chinese articles over 5 months (from August to December 2008), collected by a Web crawler. The original data amounts to more than 1.4GB. After preprocessing (de-tagging and stop word removal), the test data set is about 111MB, in 43,731 text files. Because the forum is very active, the topics of this dataset cover many domains, including politics, economy, education, entertainment, literature, social events, and others. All the experiments were run on a 2.4GHz Duo CPU laptop with 2GB of memory. We used the Mallet Java library [10] for the implementation of naïve LDA.

We also tested the effectiveness of the different interaction policies by interactive mining on the TIANYA dataset. Table III gives some results of testing the word level filtering mechanism. The left column (in bright gray) and the right column of each topic contain the topic words before and after word level filtering, respectively. Notice that words with lower TFIDF weights (in dark gray) are no longer in the top list.

iOLDA discovered many interesting topic evolutionary patterns. From Table II, it is easy to see that topic 5 reveals the 2008 Chinese milk scandal and topic 7 reveals the Beijing Olympic Games. We noticed that the word set of topic 5 changed very little, which is coherent with the fact that the milk scandal was a hot topic throughout that period. Another interesting observation is that from October, the word (which means "number one" in Chinese) started to appear on the list. Further investigation of the corpus shows that the corresponding documents are all about China's gold medal count being number one at the 2008 Olympic Games.

Similarly, Table IV shows some results of testing the user level preference functions. The left column (in bright gray) and the right column of each topic contain the topic words of two consecutive iterations, and the middle column contains the preference keywords or functions. The symbols + and - denote the reward keyword set and the penalty keyword set, respectively. All the affected words are in dark gray (for brevity, we only display the top 13 words and omit the rest). The rewarded keywords are in white and the penalized words are in bold. By comparing the left and right columns of the same topic, we found that by encoding user preferences as a bias on the inference distribution prior, the topics discovered from the corpus can be brought closer to the contents that interest the users.

VI. Conclusions

In this paper, we presented an algorithm and software framework for interactive topic evolution pattern detection based on LDA. The proposed algorithm, iOLDA, is basically a simple extension of LDA, but it runs in an interactive mode that allows users to add supervised information by modifying the posterior distribution of each iteration. We developed three categories of interaction policies, namely corpus level smoothing, word level filtering, and user level preference, to leverage heuristic and supervised information to guide the topic detection process. We also designed and realized a uniform framework for this semi-supervised inference process. Experiments using both English text and Chinese Internet forum text were conducted to validate the effectiveness of iOLDA.

The proposed method is still preliminary work toward incorporating domain knowledge and user-supervised information into probabilistic topic modeling, and it can be extended in many directions. Firstly, how to automatically learn and adaptively adjust the parameters of the different interaction functions will be very important for applying this method to larger corpora. We are also interested in incorporating a conceptual hierarchy into the interactive mining process, so that users can apply domain knowledge at a larger granularity. Finally, although performance is not a critical concern of this paper, we will have to develop advanced indexing mechanisms and parallel algorithms if we are to use this method on larger corpora.


Acknowledgment

The authors would like to thank Dafu Shan, Yi Han and Zheng Liang for helpful discussions and data preparation. We would also like to thank the Mallet project for an excellent implementation of the LDA algorithm. The work in this paper is supported by the National Natural Science Foundation of China under grants 60873204 and 60933005. It is also partially supported by the National 242 program under grant 2009A90.


References

[1] T. Griffiths and M. Steyvers, "Finding scientific topics," in Proceedings of the National Academy of Sciences of the United States of America, 2004.

[2] M. Steyvers and T. Griffiths, "Probabilistic topic models," in T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2005.

[3] J. Boyd-Graber, J. Chang, S. Gerrish, C. Wang, and D. Blei, "Reading tea leaves: How humans interpret topic models," in Neural Information Processing Systems (NIPS 2009), 2009.

[4] Qiaozhu Mei and ChengXiang Zhai, "Discovering evolutionary theme patterns from text: an exploration of temporal text mining," in SIGKDD, pages 198-207, New York, NY, USA, 2005. ACM.

[5] L. AlSumait, D. Barbara, and C. Domeniconi, "OnlineLDA: adaptive topic models for mining text streams with applications to topic detection and tracking," in ICDM, 2008.

[6] D. Blei and J. Lafferty, "Dynamic topic models," in ICML, 2006.

[7] T. Cover and J. Thomas, Elements of Information Theory, Wiley & Sons, New York, 1991. Second edition, 2006.

[8] The NIPS Online Repository, http://nips.djvuzone.org/txt.html

[9] The TIANYA forum Website, http://www.tianya.cn/new/publicforum/ArticlesList.asp?strItem=free.

[10] Andrew Kachites McCallum, "MALLET: A Machine Learning for Language Toolkit," http://mallet.cs.umass.edu, 2002.