A document can be represented as a vector of word term weights (i.e. features) drawn from a set of terms (i.e. a dictionary), and the topic of a document emerges from the joint occurrence pattern of its terms [7]. Early document clustering techniques employed the vector space model, which computes the similarity between two documents [8]. This technique fails to deal with the issues caused by synonymy (i.e. different words with similar or identical meanings) and polysemy (i.e. words with different meanings in different contexts). Later, Latent Semantic Analysis (LSA) was developed in an effort to improve classification performance in document retrieval [9]. Like most topic modeling techniques, LSA starts from a pre-processing step, which cleans the corpus of text documents and builds a document-term matrix for subsequent modeling. The cleaning procedures include tokenization (i.e. partitioning a text document into a list of tokens), stop-word removal (i.e. removing words that are extremely common but of little value in classifying documents, such as "this", "it", "is"), stemming and lemmatization (i.e. stripping the endings of conjugated verbs or plural nouns while keeping the lemma, base or root form), and compound-word handling (i.e. concatenating hyphenated words that describe a single concept). The remaining words are used to construct a document-term matrix (DTM): a matrix in which each row represents a document, each column represents a unique word, and each cell records the number of times a given word appears in a given document. LSA then reduces the DTM to a lower-rank approximation through singular value decomposition (SVD). Finally, LSA computes the similarity between text documents in this reduced space to pick out the most closely related words. While computationally efficient, LSA fails to identify and distinguish between different contexts of word usage without recourse to a dictionary or thesaurus [10].
Backed by Bayesian statistics, Latent Dirichlet Allocation (LDA) was developed to apply a probabilistic model to word distributions in text documents and uncover topics in an automated fashion [7,11]. This generative modeling technique does not require prior categorization, labelling or annotation of the texts; it reveals the invisible, latent topic structure through statistical procedures alone [12]. It follows the "bag-of-words" assumption, treating a document as a vector containing the count of each word type regardless of the order in which the words appear. In a nutshell, LDA assumes that each document can be modelled as a mixture of topics, and that each topic is a discrete probability distribution defining how likely each word is to appear in that topic. A document is then represented by a distribution of topic probabilities. LDA estimates the parameters of the word and topic distributions with Markov chain Monte Carlo (MCMC) simulations [7], and assigns topics to each document through a Dirichlet distribution over topics. Given a specified number of topics in a collection of text documents, the extent to which each topic (and its associated words) is represented in a specific document can be modelled by a latent variable model, where the latent variables represent the topics and how each document in the collection manifests them [7,13]. In short, LDA discovers patterns of word use and connects documents that exhibit similar patterns to estimate the posterior distribution of the hidden variables, which represents the topic structure of the collection [12,13].
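To make the LSA pipeline described above concrete, the following is a minimal sketch using scikit-learn; this is an assumed toolchain (the paper does not specify an implementation), and the documents, component count and variable names are illustrative placeholders rather than the authors' actual setup.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for a real corpus (hypothetical examples).
docs = [
    "Social class shapes educational attainment.",
    "Educational inequality reflects class structure.",
    "Topic models uncover latent themes in text.",
]

# Pre-processing: tokenization and stop-word removal, yielding the DTM
# (rows = documents, columns = unique terms, cells = term counts).
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# LSA: reduce the DTM to a low-rank approximation via truncated SVD.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(dtm)

# Document similarity computed in the reduced latent space.
print(cosine_similarity(doc_vectors))

The SVD step plays the role the text describes: it projects the sparse DTM onto a small number of latent dimensions, so documents that use related vocabulary land close together even when they share few exact terms.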
Recently, several LDA-based extensions have been proposed. For example, the Correlated Topic Model (CTM) uses a logistic normal distribution to model relations among topics [13]. Supervised LDA [14] introduces known label information into the topic discovery process. Labeled LDA (LLDA) [15] allows documents to carry multiple labels and constrains the relation of labels to topics to a one-to-one mapping. Partially labeled LDA (PLLDA) [16] further extends LLDA by allowing latent topics that are missing from the given document labels. LDA has been widely used to process otherwise unmanageably large volumes of text, identify the most salient topics in a single document, investigate similarities between documents, and uncover topic prevalence over time [11,13,17]. We summarize some recent applications of LDA in scientific topic discovery in Table 1. [Table 1 (10.1371/journal.pone.0199510.t001): A non-exhaustive list of LDA applications in scientific topic discovery; table data omitted from the full text.]
We extracted article abstracts from the core collection of the Web of Science (WoS) database using the following criteria: articles published in English in SSCI-indexed journals between 1956 and December 2017 whose topic terms (i.e. titles, abstracts and keywords) included "social stratification(s)", "social class(es)" or "social inequality(ies)". The search returned 15,057 articles. We deleted those without keywords and abstracts, leaving 14,038 articles in the collection. Among these articles, 67.11% match "social class(es)" alone, 23.60% "social inequality(ies)" alone and 6.71% "social stratification(s)" alone. A further 1.74% match both "social class(es)" and "social inequality(ies)", 0.52% "social class(es)" and "social stratification(s)", and 0.26% "social inequality(ies)" and "social stratification(s)". Only 0.04% of the articles contain all three topic terms. In addition, we built three time series of annual article counts, one for each term. The correlation coefficient between the "social class(es)" and "social inequality(ies)" series is 0.87, between the "social class(es)" and "social stratification(s)" series 0.86, and between the "social inequality(ies)" and "social stratification(s)" series 0.97. These statistics confirm that the three topic themes are highly similar: they all reflect the types of social divisions envisaged by Marx and refer to groups defined by their relationship to ownership and control over the means of production, of labor and of distribution [18]. We did not include the term "social status" because it emphasizes social distinctions caused not only by economic factors but also by cultural ones, which include denotative (what is), normative (what should be), and stylistic (how done) beliefs shared by a group of individuals who have undergone a common historical experience and participate in an interrelated set of social structures [19].
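As an illustration of how such a corpus of abstracts could be modelled, below is a minimal LDA sketch with gensim. This is an assumed library choice (the paper does not state its software), gensim infers parameters with online variational Bayes rather than the MCMC sampling cited above, and the `abstracts` token lists are hypothetical stand-ins for the pre-processed WoS abstracts.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical pre-processed abstracts (already tokenized and cleaned).
abstracts = [
    ["social", "class", "education", "attainment"],
    ["income", "inequality", "class", "mobility"],
    ["health", "inequality", "social", "status"],
]

# Build the dictionary (term set) and the bag-of-words corpus.
dictionary = Dictionary(abstracts)
corpus = [dictionary.doc2bow(doc) for doc in abstracts]

# Fit LDA: each document is a mixture of topics, each topic a
# probability distribution over words.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

# Inspect the top words per topic and one document's topic mixture.
for topic_id, words in lda.show_topics(num_topics=2, num_words=4,
                                       formatted=False):
    print(topic_id, [w for w, _ in words])
print(lda.get_document_topics(corpus[0]))

The number of topics is the key modelling choice here; as the text notes, it must be specified in advance for a given collection.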