PropertyValue
is nif:broaderContext of
nif:broaderContext
is schema:hasPart of
schema:isPartOf
nif:isString
  • Nouns, adjectives, and noun phrases from papers as search terms: We test whether ranked noun phrases can expose and discover potentially related literature on a sufficiently representative and recent sample of the medical literature. The medical literature is vast and described by over 26,000 Medical Subject Heading (MeSH) terms ([33] and also see “Fact Sheet, Medical Subject Headings” ). PubMed Central makes the text of open-access papers and the Pubmed links to the author-supplied citations available online (http://www.ncbi.nlm.nih.gov/pmc/). PubMed Central contains nearly 2 million open-access articles from several hundred journals, most of which are cross-listed on Pubmed (See “What is the connection between PubMed Central and PubMed?” http://www.ncbi.nlm.nih.gov/pmc/about/faq.html#q8). PubMed Central IDs are not sequential and therefore not amenable to random sampling of recent literature, and a list of most recent papers was not otherwise available. To select a representative sample from the recent research literature on PubMed Central, during the week of April 18, 2011 we searched for “research,” then ordered the results by publication date, and took the top 1000 search results. In order to perform the citation validation test, we filtered out the papers with less than 5 citations which left us with 883 papers. For each of these papers on PubMed Central, we retrieved its citations from PubMed Central, PubMed's “related citation” list, and selected search terms from each paper to run queries on Pubmed abstracts and titles using the “Entrez Utilities Entrez Programming Utilities” ). For example the paper, Differential expression of anterior gradient gene AGR2 in prostate cancer has a PubMed Central identifier of PMC3009682. There were 46 author-supplied citations on PubMed, and the top 20 PubMed related citations included 4 overlapping citations by different authors. Searches on all search terms are run in quotes for phrase matching such as “anterior gradient” versus simply anterior gradient. To extract noun phrases to use as search terms for each of the papers on PubMed Central, we run a software program written in Python based NLTK (http://www.nltk.org/) to automatically extract nouns, adjectives, and noun phrases from each sentence in the paper. NLTK is described in [11]. The theory of noun phrase extraction has been described in [22] and [20], and noun phrase extraction is available in commercial software packages such as Attivio (See ) and Inxight (See ). Starting with the text of a paper, the steps to extract noun phrases comprise separate software modules for, extracting the author's written text of the paper from the native format (PDF, HTML, etc)splitting the text into well-formed sentences (sentence tokenization) and words (word tokenization) which involves correctly recognizing punctuation and word boundaries (See “Package tokenize” )identifying the part of speech of each word in each sentence using a part of speech tagger such as the Brill tagger. (See “Module brill” ). Note that part-of-speech tagging may sometimes involve classification errors such as mistagging a noun for a verb, etc. The tagging accuracy is typically in the range of 90–97% (see “Tagging Accuracy” on pages 371–373 of [20]) but depends heavily on the corpus and tagger.selecting single word nouns (N), adjectives (A), and multi-word noun phrases using patterns such as AN, NN, etc. For example “blue sky” and “house boat.” Note the tagging errors introduced in the previous step may carry over to induce mis-identifcation of noun phrases in this step, which is why we may occasionally observe phrases that are not exactly noun phrases appearing below.For a paper, there are typically hundreds if not thousands of noun phrases depending on its length, and all combinations of these phrases are not possible to search for. To select the most representative terms, we rank them based on the number of occurrences of the term in the paper itself (document count) and inversely to the number of occurrences of that same term on the web (web count) obtained using the Yahoo-BOSS API (http://developer.yahoo.com/search/boss/). The greater this ratio the more significant the phrase is likely to be. An alternative to web counts would be to use the same counts obtained from the PubMed corpus, although the rate of search queries is limited by PubMed making it harder to utilize these counts compared to Yahoo-BOSS which permitted several searches per second. In contrast, PubMed “related citations” incorporates the count of the term within PubMed instead of the web. To illustrate the concept of document and web counts, Figure 1 shows a plot of these counts for an example paper on detection of highly enriched uranium [34] written by one of the authors – with each dot representing one phrase. Each data point on the plot shows the document count and web count on the x-axis and y-axis respectively on a log-scale. Using the log-scale, phrases that represent the document's subject tend to stand out based on a large document count and smaller web count, such as “in-vehicle detectors,” “nuclear material,” “u-232” and “u-283 signal.” The vast majority of phrases tend to be at the bottom with document count equal to 1. As the web count gets larger, it takes a larger document count for a phrase to be more representative of the paper's subject matter. Figure data removed from full text. Figure identifier and caption: 10.1371/journal.pone.0024920.g001 Document and web counts for phrases appearing in an example paper about nuclear detection technologies.The axes are on a log-scale; a quantile regression curve runs through the scatterplot. Phrases that most representative of the document's contents are along the upper boundary; they have high document count for their web count. Less relevant phrases fall below. After computing the ordered pairs of the document count and web count for each extracted phrase, the phrases need to be ranked in order to find the ones that best reflect the subject matter of the paper. We use the procedure described below based on regression. The document count and web count are converted to logarithms and a curve is fitted to the ordered pairs using quantile regression [35] as illustrated in Figure 1 with the blue line. Alternate variations of this curve fit are feasible such as a linear fit or quadratic fit. For each phrase, the numeric difference between the document count of the phrase and the value of the regression function (fitted curve) evaluated at the web count for that phrase is used to rank order the phrases. For example if the regression is y = mx+c and the document count is y', then the difference is y'-mx-c. We empirically observe that the more positive the difference between the document count and the value of the regression, the more of an outlier the phrase is relative to other phrases with similar web counts, and therefore the greater its relevance to the subject matter expressed in the document. In practice, we have found that the ranking algorithm described produces results comparable to the well-known TF-IDF algorithm [10] which computes a score using each ordered pair without the need for regression. The TF-IDF score is proportional to the document count and inversely proportional to the logarithm of the web count. Figure 2 shows the ranking of phrases we used for the paper, “Programmed cell death-1 (PD-1) at the heart of heterologous prime-boost vaccines and regulation of CD8+ T cell immunity.” To visually aid in phrase selection the ranked phrases are further split into three columns while maintaining the ranking in each column: those which occur on the web more than 10 million times (broad), those which occur between 100 and 10 million times (specific), and those which occur less than 100 times (rare). In the figure, representing our test interface, we interactively select phrases (shown as a tick) to invoke a PubMed search with those phrases and ‘more’ simply opens up more ranked phrases further down in the ranking. Figure data removed from full text. Figure identifier and caption: 10.1371/journal.pone.0024920.g002 The ranking of noun phrases for the paper, “Programmed cell death-1 (PD-1) at the heart of heterologous prime-boost vaccines and regulation of CD8+ T cell immunity.”We rank them based on the number of occurrences of the term in the paper itself (document count) and inversely to the number of occurrences of that same term on the web (web count). To visually aid in phrase selection the ranked phrases are further split into three columns while maintaining the ranking in each column: those which occur on the web more than 10 million times (broad), those which occur between 100 and 10 million times (specific), and those which occur less than 100 times (rare). Each combination of phrases results in a valid search that reveals different information about what is present in the Pubmed corpus. Many combinations of the ranked phrases can be readily produced for each paper. For the search terms we derive from each paper, we compared the top 20 search results to the author-supplied citations for that paper. We defined an “overlap” if a search result was from Pubmed and its respective ID matched or reproduced any of the author-supplied citations obtained in the paper itself. The author supplied citations are obtained directly from the HTML of the paper itself, and automatically compared to the search results for each search run to measure overlap. To select search terms automatically, we can select combinations of phrases from the set of all possible combinations of the top ranked phrases. In general for N top ranked terms, the number of combinations of k terms (N choose k) grows approximately as N∧k. Even for a computer, to execute this many searches would be prohibitive as N grows beyond several dozen with fixed k = 2 or 3. We try the top 20 ranked “specific” and “rare” phrases (k = 1) to generate search terms and discover citation validated search terms. For example, we show the automatic search terms for the paper mentioned above whose ID is PMC3009682 and title is “Differential expression of anterior gradient gene AGR2 in prostate cancer.” The citation validated search terms that were automatically generated are listed below, followed by the number of results on PubMed (ranging from 3 to 127 results) and a list of PubMed IDs in square brackets that were the author supplied citations appearing in the search results. “D” indicates they are by different authors, and “S” indicates if they are by authors of the paper itself. “laevis cement gland” (4 results) D [‘10095068’, ‘9790916’]“Xenopus laevis cement gland” (4 results) D [‘10095068’, ‘9790916’]“AGR2 promotes cell” (7 results) D [‘20048076’, ‘18199544’]“AGR2” (70 results) D [‘20945500’]“XAG-2” (9 results) D [‘15834940’, ‘14967811’, ‘10095068’, ‘9790916’, ‘9533957’]“AGR2 expression” (15 results) D [‘20048076’, ‘18681322’, ‘18199544’, ‘17457305’, ‘17455144’, ‘16551856’]“Xenopus laevis cement” (5 results) D [‘10095068’, ‘9790916’]“gene XAG-2” (3 results) D [‘9790916’, ‘9533957’]“hAG-2” (6 results) D [‘12592373’, ‘9790916’]“cement gland gene XAG-2” (4 results) D [‘10095068’, ‘9790916’, ‘9533957’]“gland gene XAG-2” (5 results) D [‘15834940’, ‘10095068’, ‘9790916’, ‘9533957’]“PIN lesions” (127 results) D [‘20945500’]“laevis cement gland gene” (70 results) D [‘15867376’]“levels of AGR2” (17 results) D [‘20945500’, ‘20048076’, ‘18973922’, ‘17694278’, ‘17457305’],“laevis cement” (118 results) D [‘15867376’]“lower levels of AGR2” (3 results) S [‘21144054’]To choose one example from this list, the search term “AGR2 expression” shows 15 results on PubMed with six of the results being author supplied citations. Is there potentially related literature among any of the remaining nine search results (copied below) that are not cited, and if so what is the relation? In each case only an interested researcher can determine their relevance to the paper as related literature. The human adenocarcinoma-associated gene, AGR2, induces expression of amphiregulin through hippo pathway co-activator YAP1 activation. Dong A, Gupta A, Pai RK, Tun M, Lowe AW. J Biol Chem. 2011 Mar 26; http://www.ncbi.nlm.nih.gov/pubmed/21454516/Differential expression of the anterior gradient protein-2 is a conserved feature during morphogenesis and carcinogenesis of the biliary tree. Lepreux S, Bioulac-Sage P, Chevet E. Liver Int. 2011 Mar;31(3):322–8 http://www.ncbi.nlm.nih.gov/pubmed/21281432/The pro-metastatic protein anterior gradient-2 predicts poor prognosis in tamoxifen-treated breast cancers. Hrstka R, Nenutil R, Fourtouna A, Maslon MM, Naughton C, Langdon S, Murray E, Larionov A, Petrakova K, Muller P, Dixon MJ, Hupp TR, Vojtesek B. Oncogene. 2010 Aug 26;29(34):4838–47 http://www.ncbi.nlm.nih.gov/pubmed/20531310/Anterior gradient-2 plays a critical role in breast cancer cell growth and survival by modulating cyclin D1, estrogen receptor-alpha and survivin. Vanderlaag KE, Hudak S, Bald L, Fayadat-Dilman L, Sathe M, Grein J, Janatpour MJ. Breast Cancer Res. 2010;12(3):R32 http://www.ncbi.nlm.nih.gov/pubmed/20525379/Disruption of Paneth and goblet cell homeostasis and increased endoplasmic reticulum stress in Agr2−/− mice. Zhao F, Edwards R, Dizon D, Afrasiabi K, Mastroianni JR, Geyfman M, Ouellette AJ, Andersen B, Lipkin SM. Dev Biol. 2010 Feb 15;338(2):270–9 http://www.ncbi.nlm.nih.gov/pubmed/20025862/Identification of candidate biomarkers of therapeutic response to docetaxel by proteomic profiling. Zhao L, Lee BY, Brown DA, Molloy MP, Marx GM, Pavlakis N, Boyer MJ, Stockler MR, Kaplan W, Breit SN, Sutherland RL, Henshall SM, Horvath LG. Cancer Res. 2009 Oct 1;69(19):7696–703 http://www.ncbi.nlm.nih.gov/pubmed/19773444/Anterior gradient 2 is expressed and secreted during the development of pancreatic cancer and promotes cancer cell survival. Ramachandran V, Arumugam T, Wang H, Logsdon CD. Cancer Res. 2008 Oct 1;68(19):7811–8 http://www.ncbi.nlm.nih.gov/pubmed/18829536/Sequence and expression of Drosophila Antigen 5-related 2, a new member of the CAP gene family. Megraw T, Kaufman TC, Kovalick GE. Gene. 1998 Nov 19;222(2):297–304 http://www.ncbi.nlm.nih.gov/pubmed/9831665/
rdf:type