Search | BVS (Virtual Health Library) Regional Portal

Visualizing the topical structure of the medical sciences: a self-organizing map approach.

Skupin, André; Biberstine, Joseph R; Börner, Katy.

PLoS One ; 8(3): e58779, 2013.

Article in English | MEDLINE | ID: mdl-23554924

ABSTRACT

BACKGROUND: We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues. METHODOLOGY: Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains. CONCLUSIONS: Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.
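The methodology above (keep only the most frequent terms, then train a SOM on the resulting document vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the binary term weighting, the learning-rate and radius schedules, and the toy corpus are all assumptions chosen for illustration. The study itself used over 75,000 neurons and supercomputing resources, whereas this sketch uses a 3 x 3 grid.

```python
import math
import random
from collections import Counter


def top_terms(docs, fraction=0.10):
    """Return the `fraction` most frequent terms, ranked by document frequency."""
    df = Counter(term for doc in docs for term in set(doc))
    keep = max(1, math.ceil(len(df) * fraction))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:keep]


def vectorize(doc, vocab):
    """Binary document vector over the reduced vocabulary (an assumed weighting)."""
    return [1.0 if term in doc else 0.0 for term in vocab]


def best_matching_unit(weights, vec):
    """Grid position of the neuron whose weight vector is closest to `vec`."""
    return min(
        weights,
        key=lambda rc: sum((w - v) ** 2 for w, v in zip(weights[rc], vec)),
    )


def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood function."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        for vec in rng.sample(vectors, len(vectors)):  # shuffled pass over the data
            progress = step / total_steps
            learning_rate = 0.5 * (1.0 - progress)                  # decays toward 0
            radius = max(rows, cols) / 2 * (1.0 - progress) + 0.5   # shrinks to 0.5
            b_row, b_col = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                grid_dist_sq = (r - b_row) ** 2 + (c - b_col) ** 2
                influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += learning_rate * influence * (vec[i] - w[i])
            step += 1
    return weights


# Hypothetical toy corpus: each document is a set of indexer-assigned terms.
corpus = [
    {"Humans", "Neoplasms", "Mutation"},
    {"Humans", "Neoplasms", "Antineoplastic Agents"},
    {"Humans", "Heart Failure", "Hypertension"},
    {"Humans", "Hypertension", "Risk Factors"},
]
vocab = top_terms(corpus, fraction=0.5)  # the abstract reports a 10% cutoff
vectors = [vectorize(doc, vocab) for doc in corpus]
som = train_som(vectors, rows=3, cols=3)
positions = [best_matching_unit(som, v) for v in vectors]
```

After training, `positions` holds one grid coordinate per document; documents with similar term sets tend to land on the same or neighboring neurons, which is the property the map-making step relies on.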

Subjects

Biomedical Research/statistics & numerical data; Models, Theoretical; Artificial Intelligence; Medical Informatics/methods; Publications/statistics & numerical data

Design and update of a classification system: the UCSD map of science.

Börner, Katy; Klavans, Richard; Patek, Michael; Zoss, Angela M; Biberstine, Joseph R; Light, Robert P; Larivière, Vincent; Boyack, Kevin W.

PLoS One ; 7(7): e39464, 2012.

Article in English | MEDLINE | ID: mdl-22808037

ABSTRACT

Global maps of science can be used as a reference system to chart career trajectories, the location of emerging research frontiers, or the expertise profiles of institutes or nations. This paper details the data preparation, analysis, and layout performed when designing and subsequently updating the UCSD map of science and classification system. The original classification and map use 7.2 million papers and their references from Elsevier's Scopus (about 15,000 source titles, 2001-2005) and Thomson Reuters' Web of Science (WoS) Science, Social Science, and Arts & Humanities Citation Indexes (about 9,000 source titles, 2001-2004), for a total of about 16,000 unique source titles. The updated map and classification add six years (2005-2010) of WoS data and three years (2006-2008) of Scopus data to the existing category structure, increasing the number of source titles to about 25,000. To our knowledge, this is the first time that a widely used map of science has been updated. A comparison of the original 5-year and the new 10-year maps and classification systems shows (i) an increase of 9,409 in the total number of journals that can be mapped (an 80% increase for the social sciences, 119% for the humanities, 32% for the medical sciences, and 74% for the natural sciences), (ii) a simplification of the map by assigning all but five highly interdisciplinary journals to exactly one discipline, (iii) a more even distribution of journals over the 554 subdisciplines and 13 disciplines, as measured by the coefficient of variation, and (iv) a better reflection of journal clusters when compared with paper-level citation data. When evaluated against a list of desirable features for maps of science, the updated map is shown to have higher mapping accuracy, to be easier to understand because fewer journals are multiply classified, and to be more usable for generating data overlays, among other advantages.
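Point (iii) rests on the coefficient of variation, the standard deviation divided by the mean, where a lower value indicates a more even distribution. The sketch below shows the calculation; the journal counts are invented for illustration and are not taken from the paper, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean of `counts`."""
    return pstdev(counts) / mean(counts)


# Hypothetical journals-per-subdiscipline counts (not data from the paper).
uneven_counts = [2, 4, 4, 4, 5, 5, 7, 9]
even_counts = [5, 5, 5, 5, 5, 5, 5, 5]

cv_uneven = coefficient_of_variation(uneven_counts)
cv_even = coefficient_of_variation(even_counts)
```

Because the measure is normalized by the mean, it allows maps with different total numbers of journals to be compared on the same scale.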

Subjects

Bibliometrics; Databases, Bibliographic/statistics & numerical data; Humanities/classification; Natural Science Disciplines/classification; Social Sciences/classification; Humanities/trends; Humans; Internet; Natural Science Disciplines/trends; Research Design; Social Sciences/trends

Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches.

Boyack, Kevin W; Newman, David; Duhon, Russell J; Klavans, Richard; Patek, Michael; Biberstine, Joseph R; Schijvenaars, Bob; Skupin, André; Ma, Nianli; Börner, Katy.

PLoS One ; 6(3): e18029, 2011 Mar 17.

Article in English | MEDLINE | ID: mdl-21437291

ABSTRACT

BACKGROUND: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. METHODOLOGY: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques applied to two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
CONCLUSIONS: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
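Two of the building blocks named in the methodology, tf-idf cosine similarity with top-n filtering and the Jensen-Shannon divergence, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's pipeline: the function names, the raw-count tf and log(N/df) idf weighting, and the toy documents are hypothetical, and the divergence function only compares two term distributions rather than reproducing the paper's full within-cluster coherence measure.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse tf-idf vectors (term -> weight) for tokenized documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(vectors, n):
    """For each document, keep its n most similar other documents."""
    result = []
    for i, u in enumerate(vectors):
        sims = [(j, cosine(u, v)) for j, v in enumerate(vectors) if j != i]
        sims = [(j, s) for j, s in sims if s > 0]    # drop unrelated documents
        sims.sort(key=lambda js: (-js[1], js[0]))    # highest similarity first
        result.append(sims[:n])
    return result


def jensen_shannon(p, q):
    """Base-2 Jensen-Shannon divergence of two term-probability dicts.

    Returns 0 for identical distributions and 1 for disjoint ones.
    """
    terms = set(p) | set(q)
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mix(dist):
        return sum(pr * math.log2(pr / mix[t]) for t, pr in dist.items() if pr > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Hypothetical tokenized titles.
documents = [
    ["heart", "attack"],
    ["heart", "failure"],
    ["gene", "expression"],
]
neighbors = top_n_similar(tfidf_vectors(documents), n=1)
```

Keeping only the top-n similarities per document makes the similarity matrix sparse, which is what makes clustering feasible at the scale of millions of records.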

Subjects

Biomedical Research; Cluster Analysis; Documentation; Information Storage and Retrieval/methods; Periodicals as Topic
