Results 1 - 20 of 44
2.
J Pathol Inform ; 2: 18, 2011 Mar 31.
Article in English | MEDLINE | ID: mdl-21572506
3.
J Pathol Inform ; 2: 5, 2011 Jan 24.
Article in English | MEDLINE | ID: mdl-21383929

ABSTRACT

The day has not arrived when pathology departments freely distribute their collected anatomic and clinical data for research purposes. Nonetheless, several valuable public domain data sets are currently available from the U.S. Government. Two public data sets of special interest to pathologists are the SEER (the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results program) public use data files and the CDC (Centers for Disease Control and Prevention) mortality files. The SEER files contain about 4 million de-identified cancer records, dating from 1973. The CDC mortality files contain approximately 85 million de-identified death records, dating from 1968. This editorial briefly describes both data sources, how they can be obtained, and how they may be used for pathology research.

4.
J Pathol Inform ; 1, 2010 Jul 13.
Article in English | MEDLINE | ID: mdl-20805954

ABSTRACT

BACKGROUND: Tissue microarrays (TMAs) are enormously useful tools for translational research, but incompatibilities in database systems between various researchers and institutions prevent the efficient sharing of data that could help realize their full potential. Resource Description Framework (RDF) provides a flexible method to represent knowledge in triples, which take the form Subject-Predicate-Object. All data resources are described using Uniform Resource Identifiers (URIs), which are global in scope. We present an OWL (Web Ontology Language) schema that expands upon the TMA data exchange specification to address this issue and assist in data sharing and integration. METHODS: A minimal OWL schema was designed containing only concepts specific to TMA experiments. More general data elements were incorporated from predefined ontologies such as the NCI thesaurus. URIs were assigned using the Linked Data format. RESULTS: We present examples of files utilizing the schema and conversion of XML data (similar to the TMA DES) to OWL. CONCLUSION: By utilizing predefined ontologies and global unique identifiers, this OWL schema provides a solution to the limitations of XML, which represents concepts defined in a localized setting. This will help increase the utilization of tissue resources, facilitating collaborative translational research efforts.
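The triple structure this abstract describes can be made concrete with a short sketch. This is a toy Python illustration of the RDF idea, not the published OWL schema; both namespace URIs and all property names are hypothetical stand-ins.

```python
# Toy illustration of RDF-style triples: every fact is a
# (subject, predicate, object) triple, and every resource is a global URI.
# The namespaces below are invented, not the actual TMA/NCI vocabularies.
NCI = "http://ncithesaurus.example.org/"
TMA = "http://tma.example.org/schema#"

triples = [
    (TMA + "core_17", TMA + "partOfBlock", TMA + "block_3"),
    (TMA + "core_17", TMA + "diagnosis", NCI + "Adenocarcinoma"),
    (TMA + "block_3", TMA + "donorOrgan", NCI + "Prostate"),
]

def objects(subject, predicate):
    """All objects asserted for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(TMA + "core_17", TMA + "diagnosis"))
```

Because the identifiers are global URIs rather than locally scoped XML names, triples from different institutions can simply be concatenated into one queryable graph.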

5.
Nat Biotechnol ; 26(3): 305-12, 2008 Mar.
Article in English | MEDLINE | ID: mdl-18327244

ABSTRACT

One purpose of the biomedical literature is to report results in sufficient detail that the methods of data collection and analysis can be independently replicated and verified. Here we present reporting guidelines for gene expression localization experiments: the minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE). MISFISHIE is modeled after the Minimum Information About a Microarray Experiment (MIAME) specification for microarray experiments. Both guidelines define what information should be reported without dictating a format for encoding that information. MISFISHIE describes six types of information to be provided for each experiment: experimental design, biomaterials and treatments, reporters, staining, imaging data and image characterizations. This specification has benefited the consortium within which it was developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.


Subject(s)
Immunohistochemistry/standards , In Situ Hybridization/standards , Computational Biology/methods , Computational Biology/standards , Gene Expression Profiling/methods , Gene Expression Profiling/standards , Immunohistochemistry/methods , In Situ Hybridization/methods
6.
Hum Pathol ; 38(8): 1212-25, 2007 Aug.
Article in English | MEDLINE | ID: mdl-17490722

ABSTRACT

This report presents an overview for pathologists of the development and potential applications of a novel Web-enabled system allowing indexing and retrieval of pathology specimens across multiple institutions. The system was developed through the National Cancer Institute's Shared Pathology Informatics Network program with the goal of creating a prototype system to find existing pathology specimens derived from routine surgical and autopsy procedures ("paraffin blocks") that may be relevant to cancer research. To reach this goal, a number of challenges needed to be met. A central aspect was the development of an informatics system that supported Web-based searching while retaining local control of data. Additional aspects included the development of an eXtensible Markup Language schema, representation of tissue specimen annotation, methods for de-identifying pathology reports, tools for autocoding critical data from these reports using the Unified Medical Language System, and hierarchies of confidentiality and consent that met or exceeded federal requirements. The prototype system supported Web-based querying of millions of pathology reports from 6 participating institutions across the country in a matter of seconds to minutes, and allowed bona fide researchers to identify and potentially request specific paraffin blocks from the participating institutions. With the addition of associated clinical and outcome information, this system could vastly expand the pool of annotated tissues available for research into cancer as well as other diseases.


Subject(s)
Medical Informatics/organization & administration , Pathology, Surgical/organization & administration , Specimen Handling/methods , Tissue Banks , Humans , United States
7.
BMC Cancer ; 7: 37, 2007 Feb 28.
Article in English | MEDLINE | ID: mdl-17386082

ABSTRACT

BACKGROUND: The Shared Pathology Informatics Network (SPIN) is a tissue resource initiative that utilizes clinical reports of the vast amount of paraffin-embedded tissue routinely stored by medical centers. SPIN has an informatics component (sending tissue-related queries to multiple institutions via the internet) and a service component (providing histopathologically annotated tissue specimens for medical research). This paper examines whether tissue blocks, identified by localized computer searches at participating institutions, can be retrieved in adequate quantity and quality to support medical researchers. METHODS: Four centers evaluated pathology reports (1990-2005) for common and rare tumors to determine the percentage of cases where suitable tissue blocks with tumor were available. Each site generated a list of 100 common tumor cases (25 cases each of breast adenocarcinoma, colonic adenocarcinoma, lung squamous carcinoma, and prostate adenocarcinoma) and 100 rare tumor cases (25 cases each of adrenal cortical carcinoma, gastro-intestinal stromal tumor [GIST], adenoid cystic carcinoma, and mycosis fungoides) using a combination of Tumor Registry, laboratory information system (LIS) and/or SPIN-related tools. Pathologists identified the slides/blocks with tumor and noted the first 3 slides with the largest tumor and the availability of the corresponding blocks. RESULTS: For common tumor cases (n = 400), the institutional retrieval rates (all blocks) were 83% (A), 95% (B), 80% (C), and 98% (D). The retrieval rate (tumor blocks) from all centers for common tumors was 73%, with a mean largest tumor size of 1.49 cm; retrieval (tumor blocks) was highest for lung (84%) and lowest for prostate (54%). For rare tumor cases (n = 400), each institution's retrieval rates (all blocks) were 78% (A), 73% (B), 67% (C), and 84% (D).
The retrieval rate (tumor blocks) from all centers for rare tumors was 66%, with a mean largest tumor size of 1.56 cm; retrieval (tumor blocks) was highest for GIST (72%) and lowest for adenoid cystic carcinoma (58%). CONCLUSION: This assessment demonstrates the availability and quality of retrievable archival tissue blocks, and of their associated electronic data, which can be of value for researchers. This study serves to complement the data against which uniform use of the SPIN query tools by all four centers will be measured, to assure and highlight the usefulness of archival material for obtaining tumor tissues for research.


Subject(s)
Paraffin Embedding/statistics & numerical data , Pathology, Clinical/organization & administration , Tissue Banks/statistics & numerical data , Humans , Medical Informatics/organization & administration , Neoplasms/pathology , United States
8.
Cancer Detect Prev ; 30(5): 387-94, 2006.
Article in English | MEDLINE | ID: mdl-17079091

ABSTRACT

BACKGROUND: Precancers are lesions that precede the appearance of invasive cancers. The successful prevention or treatment of precancers has the potential to eliminate deaths due to cancer. METHODS: A National Cancer Institute-sponsored Conference on Precancer was convened on November 8-9, 2004, at The George Washington University Medical Center, Washington, DC. A definition of precancers was developed over 2 days of Conference discussions. RESULTS: The following five criteria define a precancer: (1) evidence must exist that the precancer is associated with an increased risk of cancer; (2) when a precancer progresses to cancer, the resulting cancer arises from cells within the precancer; (3) a precancer differs from the normal tissue from which it arises; (4) a precancer differs from the cancer into which it develops, although it has some, but not all, of the molecular and phenotypic properties that characterize the cancer; (5) there is a method by which the precancer can be diagnosed. CONCLUSIONS: The Conference participants developed a general definition for precancers that would provide a consistent and clinically useful way of distinguishing precancers from all other types of lesions. It was recognized that many precancerous lesions may not meet this strict definition, but the group felt it was necessary to define criteria that will help standardize clinical and biological studies. Furthermore, a set of defining criteria for putative precancer lesions will permit pathologists to build a diagnostically useful taxonomy of precancers based on specified clinical and biological properties. Precancers thus characterized can be classified into clinically relevant subgroups based on shared properties (e.g., biomarkers, oncogenes, common metabolic pathways, and responses to therapy). Publications that introduce newly described precancer entities should describe how each of the five defining criteria applies.
This manuscript reviews the proposed definition of precancers and suggests how pathologists, oncologists and cancer researchers may determine when these criteria are satisfied.


Subject(s)
Neoplasms/pathology , Precancerous Conditions/pathology , Humans , National Institutes of Health (U.S.) , United States
9.
BMC Cancer ; 6: 120, 2006 May 05.
Article in English | MEDLINE | ID: mdl-16677389

ABSTRACT

BACKGROUND: Advances in molecular biology and growing requirements from biomarker validation studies have generated a need for tissue banks to provide quality-controlled tissue samples with standardized clinical annotation. The NCI Cooperative Prostate Cancer Tissue Resource (CPCTR) is a distributed tissue bank that comprises four academic centers and provides thousands of clinically annotated prostate cancer specimens to researchers. Here we describe the CPCTR information management system architecture, common data element (CDE) development, query interfaces, data curation, and quality control. METHODS: Data managers review the medical records to collect and continuously update information for the 145 clinical, pathological and inventorial CDEs that the Resource maintains for each case. An Access-based data entry tool provides de-identification and a standard communication mechanism between each group and a central CPCTR database. Standardized automated quality control audits have been implemented. Centrally, an Oracle database has web interfaces allowing multiple user-types, including the general public, to mine de-identified information from all of the sites with three levels of specificity and granularity as well as to request tissues through a formal letter of intent. RESULTS: Since July 2003, CPCTR has offered over 6,000 cases (38,000 blocks) of highly characterized prostate cancer biospecimens, including several tissue microarrays (TMA). The Resource developed a website with interfaces for the general public as well as researchers and internal members. These user groups have utilized the web-tools for public query of summary data on the cases that were available, to prepare requests, and to receive tissues. As of December 2005, the Resource received over 130 tissue requests, of which 45 have been reviewed, approved and filled. 
Additionally, the Resource implemented the TMA Data Exchange Specification in its TMA program and created a computer program for calculating PSA recurrence. CONCLUSION: Building a biorepository infrastructure that meets today's research needs involves the time and input of many individuals from diverse disciplines. The CPCTR can provide large volumes of carefully annotated prostate tissue for research initiatives such as Specialized Programs of Research Excellence (SPOREs) and for biomarker validation studies, and its experience can help guide the development of collaborative, large-scale, virtual tissue banks in other organ systems.


Subject(s)
Information Management , Medical Informatics Applications , Prostatic Neoplasms/pathology , Tissue Banks , Databases as Topic , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Humans , Information Management/standards , Internet , Male , Marketing , Medical Records , Prostatic Neoplasms/genetics , Prostatic Neoplasms/metabolism , Quality Control , Tissue Banks/standards
10.
BMC Med Inform Decis Mak ; 5: 35, 2005 Oct 18.
Article in English | MEDLINE | ID: mdl-16232314

ABSTRACT

BACKGROUND: New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The scholarly curator adds new terms as they are encountered. Present-day scholars are severely challenged by the enormous volume of biomedical literature. Curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate. The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. Candidate new terms can be reviewed by a curator to determine if they should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and using "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature. 
RESULTS: A 31+ Megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. This corpus consisted of 4,289 records, each containing an abstract title. The total number of words included in the abstract titles was 50,547. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with a CPU speed of 2.79 GHz was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded a list of 285 terms approved by a curator. A final automatic removal of duplicate terms yielded a list of 222 new terms (71% of the original 313 extracted candidate terms) that could be added to the reference nomenclature. CONCLUSION: The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms from vast amounts of text. The method can be immediately adapted for virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
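The extraction step described in this abstract can be sketched as follows. This is a toy Python rendering under simplifying assumptions (lowercased whitespace-split words, no punctuation handling); the three nomenclature terms are invented examples, not the actual neoplasm taxonomy, and the authors' implementation was in Perl.

```python
# Toy sketch of doublet-based candidate term extraction: build an index of
# word doublets from the reference nomenclature, scan text for contiguous
# runs of known doublets, and report runs not already present as whole terms.
nomenclature = {"ciliary body", "body melanoma", "squamous cell carcinoma"}

def doublets(words):
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}

known = set()
for term in nomenclature:
    known |= doublets(term.split())

def candidate_terms(text):
    words = text.lower().split()
    runs, run = [], []
    for i in range(len(words) - 1):
        if (words[i], words[i + 1]) in known:
            if not run:
                run = [words[i]]
            run.append(words[i + 1])
        else:
            if len(run) >= 2:
                runs.append(" ".join(run))
            run = []
    if len(run) >= 2:
        runs.append(" ".join(run))
    # runs built entirely from known doublets but absent from the
    # nomenclature are candidate new terms for curator review
    return [r for r in runs if r not in nomenclature]

print(candidate_terms("a recurrent ciliary body melanoma was seen"))
```

Here "ciliary body melanoma" is reported as a candidate because each of its doublets occurs in the nomenclature even though the whole phrase does not.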


Subject(s)
Electronic Data Processing/methods , Information Storage and Retrieval , Medical Informatics Computing , Terminology as Topic , Abstracting and Indexing , Algorithms , Humans , Medical Subject Headings , National Library of Medicine (U.S.) , Neoplasms/classification , PubMed , Semantics , Systems Integration , United States
11.
BMC Cancer ; 5: 108, 2005 Aug 21.
Article in English | MEDLINE | ID: mdl-16111498

ABSTRACT

BACKGROUND: The Cooperative Prostate Cancer Tissue Resource (CPCTR) is a consortium of four geographically dispersed institutions that are funded by the U.S. National Cancer Institute (NCI) to provide clinically annotated prostate cancer tissue samples to researchers. To facilitate this effort, it was critical to arrive at agreed upon common data elements (CDEs) that could be used to collect demographic, pathologic, treatment and clinical outcome data. METHODS: The CPCTR investigators convened a CDE curation subcommittee to develop and implement CDEs for the annotation of collected prostate tissues. The draft CDEs were refined and progressively annotated to make them ISO 11179 compliant. The CDEs were implemented in the CPCTR database and tested using software query tools developed by the investigators. RESULTS: By collaborative consensus the CPCTR CDE subcommittee developed 145 data elements to annotate the tissue samples collected. These included for each case: 1) demographic data, 2) clinical history, 3) pathology specimen level elements to describe the staging, grading and other characteristics of individual surgical pathology cases, 4) tissue block level annotation critical to managing a virtual inventory of cases and facilitating case selection, and 5) clinical outcome data including treatment, recurrence and vital status. These elements have been used successfully to respond to over 60 requests by end-users for tissue, including paraffin blocks from cases with 5 to 10 years of follow up, tissue microarrays (TMAs), as well as frozen tissue collected prospectively for genomic profiling and genetic studies. The CPCTR CDEs have been fully implemented in two major tissue banks and have been shared with dozens of other tissue banking efforts. CONCLUSION: The freely available CDEs developed by the CPCTR are robust, based on "best practices" for tissue resources, and are ISO 11179 compliant. 
The process for CDE development described in this manuscript provides a framework model for other organ sites and has been used as a model for breast and melanoma tissue banking efforts.


Subject(s)
Computational Biology/methods , Databases as Topic , Prostatic Neoplasms/pathology , Tissue Banks , Computers , Humans , Male , Prostatic Neoplasms/metabolism , Recurrence , Software , Treatment Outcome
12.
BMC Cancer ; 5: 100, 2005 Aug 10.
Article in English | MEDLINE | ID: mdl-16092965

ABSTRACT

BACKGROUND: For over 150 years, pathologists have relied on histomorphology to classify and diagnose neoplasms. Their success has been stunning, permitting the accurate diagnosis of thousands of different types of neoplasms using only a microscope and a trained eye. In the past two decades, cancer genomics has challenged the supremacy of histomorphology by identifying genetic alterations shared by morphologically diverse tumors and by finding genetic features that distinguish subgroups of morphologically homogeneous tumors. DISCUSSION: The Developmental Lineage Classification and Taxonomy of Neoplasms groups neoplasms by their embryologic origin. The putative value of this classification is based on the expectation that tumors of a common developmental lineage will share common metabolic pathways and common responses to drugs that target these pathways. The purpose of this manuscript is to show that grouping tumors according to their developmental lineage can reconcile certain fundamental discrepancies resulting from morphologic and molecular approaches to neoplasm classification. In this study, six issues in tumor classification are described that exemplify the growing rift between morphologic and molecular approaches to tumor classification: 1) the morphologic separation between epithelial and non-epithelial tumors; 2) the grouping of tumors based on shared cellular functions; 3) the distinction between germ cell tumors and pluripotent tumors of non-germ cell origin; 4) the distinction between tumors that have lost their differentiation and tumors that arise from uncommitted stem cells; 5) the molecular properties shared by morphologically disparate tumors that have a common developmental lineage, and 6) the problem of re-classifying morphologically identical but clinically distinct subsets of tumors. 
The discussion of these issues in the context of describing different methods of tumor classification is intended to underscore the clinical value of a robust tumor classification. SUMMARY: A classification of neoplasms should guide the rational design and selection of a new generation of cancer medications targeted to metabolic pathways. Without a scientifically sound neoplasm classification, biological measurements on individual tumor samples cannot be generalized to class-related tumors, and constitutive properties common to a class of tumors cannot be distinguished from uninformative data in complex and chaotic biological systems. This paper discusses the importance of biological classification and examines several different approaches to the specific problem of tumor classification.


Subject(s)
Medical Oncology/methods , Neoplasms/classification , Neoplasms/diagnosis , Cell Differentiation , Cell Lineage , Embryo, Mammalian/cytology , Humans , Neoplasms/genetics , Neoplasms, Germ Cell and Embryonal/diagnosis , Neoplasms, Germ Cell and Embryonal/pathology , Neoplasms, Glandular and Epithelial/diagnosis , Neoplasms, Glandular and Epithelial/pathology , Stem Cells/cytology
13.
In Silico Biol ; 5(3): 313-22, 2005.
Article in English | MEDLINE | ID: mdl-15984939

ABSTRACT

Assigning nomenclature codes to biomedical data is an arduous, expensive and error-prone task. Data records are coded to provide a common representation of contained concepts, allowing facile retrieval of records via a standard terminology. In the medical field, cancer registrars, nurses, pathologists, and private clinicians all understand the importance of annotating medical records with vocabularies that codify the names of diseases, procedures, billing categories, etc. Molecular biologists need codified medical records so that they can discover or validate relationships between experimental data and clinical data. This paper introduces a new approach to retrieving data records without prior coding. The approach achieves the same result as a search over pre-coded records. It retrieves all records that contain any terms that are synonymous with a user's query-term. A recently described fast algorithm (the doublet method) permits quick iterative searches over every synonym for any term from any nomenclature occurring in a dataset of any size. As a demonstration, a 105+ Megabyte corpus of Pubmed abstracts was searched for medical terms. Query terms were matched against either of two vocabularies and expanded as an array of equivalent search items. A single search term may have over one hundred nomenclature synonyms, all of which were searched against the full database. Iterative searches of a list of concept-equivalent terms involve many more operations than a single search over pre-annotated concept codes. Nonetheless, the doublet method achieved fast query response times (0.05 seconds using SNOMED and 5 seconds using the Developmental Lineage Classification of Neoplasms, on a computer with a 2.89 GHz processor). Pre-annotated datasets lose their value when the chosen vocabulary is replaced by a different vocabulary or by a different version of the same vocabulary. The doublet method can employ any version of any vocabulary with no pre-annotation.
In many instances, the enormous effort and expense associated with data annotation can be eliminated by on-the-fly doublet matching. The algorithm for nomenclature-based database searches using the doublet method is described. Perl scripts for implementing the algorithm and testing execution speed are provided as open source documents available from the Association for Pathology Informatics (www.pathologyinformatics.org/informatics_r.htm).


Subject(s)
Information Storage and Retrieval , Medical Informatics , Terminology as Topic , Abstracting and Indexing , Algorithms , Systems Integration
14.
Expert Rev Mol Diagn ; 5(3): 329-36, 2005 May.
Article in English | MEDLINE | ID: mdl-15934811

ABSTRACT

Data integration occurs when a query proceeds through multiple data sets, thereby relating diverse data extracted from different data sources. Data integration is particularly important to biomedical researchers since data obtained from experiments on human tissue specimens have little applied value unless they can be combined with medical data (i.e., pathologic and clinical information). In the past, research data were correlated with medical data by manually retrieving, reading, assembling and abstracting patient charts, pathology reports, radiology reports and the results of special tests and procedures. Manual annotation of research data is impractical when experiments involve hundreds or thousands of tissue specimens resulting in large, complex data collections. The purpose of this paper is to review how XML (eXtensible Markup Language) provides the fundamental tools that support biomedical data integration. The article also discusses some of the most important challenges that block the widespread availability of annotated biomedical data sets.


Subject(s)
Data Collection/standards , Humans , Internet , Logic , Medical Informatics , Medical Records , Research
15.
J Urol ; 173(5): 1546-51, 2005 May.
Article in English | MEDLINE | ID: mdl-15821483

ABSTRACT

PURPOSE: Prostate cancer can occur in patients with low screening serum prostate specific antigen (PSA) values (less than 4.0 ng/ml). It is currently unclear whether these tumors are different from prostate cancer in patients with high PSA levels (greater than 4.0 ng/ml). MATERIALS AND METHODS: From the Cooperative Prostate Cancer Tissue Resource database through March 2004, 3,416 patients with screening PSA less than 16.0 ng/ml diagnosed with prostate cancer between 1993 and 2004 were stratified into groups based on screening serum PSA. These subsets were compared for race, age at diagnosis, clinical and pathological stage, Gleason score, positive surgical margins, posttreatment recurrent disease, and vital status. RESULTS: We identified 468 (14%) patients with screening PSA less than 4.0 ng/ml, 142 (4.2%) of whom had a PSA of less than 2.0 ng/ml. This group included 40 black and 376 white patients. Men with low screening PSA treated with radical prostatectomy had smaller cancers, lower Gleason scores, lower pathological tumor (T) stage and lower PSA recurrence rates than men with high PSA levels (4 ng/ml or greater). These differences held true for men who were younger than 62 years or were white, whereas older or black men had tumor characteristics and outcomes similar to those with higher PSA levels. CONCLUSIONS: Young (younger than 62 years) or white patients with screening serum PSA less than 4.0 ng/ml had smaller, lower grade tumors and lower recurrence rates than patients with PSA 4.0 ng/ml or greater. This was not true for those older than 62 years and for black men.


Subject(s)
Prostate-Specific Antigen/blood , Prostatic Neoplasms/blood , Humans , Male , Middle Aged , Prostatic Neoplasms/pathology
16.
Hum Pathol ; 36(2): 139-45, 2005 Feb.
Article in English | MEDLINE | ID: mdl-15754290

ABSTRACT

It is impossible to overstate the importance of XML (eXtensible Markup Language) as a data organization tool. With XML, pathologists can annotate all of their data (clinical and anatomic) in a format that can transform every pathology report into a database, without compromising narrative structure. The purpose of this manuscript is to provide an overview of XML for pathologists. Examples will demonstrate how pathologists can use XML to annotate individual data elements and to structure reports in a common format that can be merged with other XML files or queried using standard XML tools. This manuscript gives pathologists a glimpse into how XML allows pathology data to be linked to other types of biomedical data and reduces our dependence on centralized proprietary databases.
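The idea of a report that is simultaneously narrative and database can be shown with Python's standard xml.etree module. The element and attribute names below are invented for illustration, not a published pathology schema.

```python
# Hypothetical XML-annotated pathology report fragment: the diagnosis still
# reads as narrative text, but tagged elements make individual data items
# directly retrievable. All names here are made up for illustration.
import xml.etree.ElementTree as ET

report = ET.fromstring("""
<report accession="S-04-12345">
  <diagnosis site="prostate">
    The prostate shows <tumor>adenocarcinoma</tumor>,
    Gleason score <gleason>7</gleason>.
  </diagnosis>
</report>
""")

# Structured queries against what remains a readable report:
print(report.get("accession"))
print(report.find(".//gleason").text)
```

Because the annotation lives inside the narrative rather than in a separate proprietary database, such files can be merged or queried with generic XML tools.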


Subject(s)
Database Management Systems , Databases as Topic/organization & administration , Medical Informatics/methods , Pathology/methods , Programming Languages , Terminology as Topic , Databases as Topic/standards , Humans
17.
AMIA Annu Symp Proc ; : 515-9, 2005.
Article in English | MEDLINE | ID: mdl-16779093

ABSTRACT

The Shared Pathology Informatics Network (SPIN), a research initiative of the National Cancer Institute, will allow for the retrieval of more than 4 million pathology reports and specimens. In this paper, we describe the special query tool as developed for the Indianapolis/Regenstrief SPIN node, integrated into the ever-expanding Indiana Network for Patient Care (INPC). This query tool allows for the retrieval of de-identified data sets using complex logic and auto-coded final diagnoses, and intrinsically supports multiple types of statistical analyses. The new SPIN/INPC database represents a new generation of the Regenstrief Medical Record system: a centralized but federated system of repositories.


Subject(s)
Confidentiality , Database Management Systems , Databases as Topic , Information Storage and Retrieval/methods , Pathology , Hospital Information Systems , Humans , Logical Observation Identifiers Names and Codes , Medical Records Systems, Computerized , User-Computer Interface
18.
BMC Cancer ; 4: 88, 2004 Nov 30.
Article in English | MEDLINE | ID: mdl-15571625

ABSTRACT

BACKGROUND: The new "Developmental lineage classification of neoplasms" was described in a prior publication. The classification is simple (the entire hierarchy is described with just 39 classifiers), comprehensive (providing a place for every tumor of man), and consistent with recent attempts to characterize tumors by cytogenetic and molecular features. A taxonomy is a list of the instances that populate a classification. The taxonomy of neoplasia attempts to list every known term for every known tumor of man. METHODS: The taxonomy provides each concept with a unique code and groups synonymous terms under the same concept. A Perl script validated successive drafts of the taxonomy, ensuring that: 1) each term occurs only once in the taxonomy; 2) each term occurs in only one tumor class; 3) each concept code occurs in one and only one hierarchical position in the classification; and 4) the file containing the classification and taxonomy is a well-formed XML (eXtensible Markup Language) document. RESULTS: The taxonomy currently contains 122,632 different terms encompassing 5,376 neoplasm concepts. Each concept has, on average, 23 synonyms. The taxonomy populates "The developmental lineage classification of neoplasms," and is available as an XML file, currently 9+ Megabytes in length. A representation of the classification/taxonomy listing each term followed by its code, followed by its full ancestry, is available as a flat-file, 19+ Megabytes in length. The taxonomy is the largest nomenclature of neoplasms, with more than twice the number of neoplasm names found in other medical nomenclatures, including the 2004 version of the Unified Medical Language System, the Systematized Nomenclature of Medicine Clinical Terminology, the National Cancer Institute's Thesaurus, and the International Classification of Diseases for Oncology.
CONCLUSIONS: This manuscript describes a comprehensive taxonomy of neoplasia that collects synonymous terms under a unique code number and assigns each tumor to a single class within the tumor hierarchy. The entire classification and taxonomy are available as open access files (in XML and flat-file formats) with this article.
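The four validation checks described in METHODS can be sketched briefly. This is a minimal Python illustration against a hypothetical XML layout (the published taxonomy's actual element and attribute names may differ; the paper's validator is a Perl script). Note that simply parsing the file verifies check 4, well-formedness.

```python
# Sketch of the four taxonomy validation checks, applied to a hypothetical
# XML layout (class > concept > term); the actual file structure of the
# published taxonomy may differ.
import xml.etree.ElementTree as ET
from collections import Counter

xml_text = """<taxonomy>
  <class name="endoderm">
    <concept code="C0001"><term>hepatoma</term><term>hepatocellular carcinoma</term></concept>
  </class>
  <class name="mesoderm">
    <concept code="C0002"><term>leiomyoma</term></concept>
  </class>
</taxonomy>"""

root = ET.fromstring(xml_text)  # check 4: the document is well-formed XML

term_counts = Counter()    # how many times each term appears anywhere
term_classes = {}          # which tumor classes each term appears under
code_positions = Counter() # how many hierarchical positions each code holds

for cls in root.iter("class"):
    for concept in cls.iter("concept"):
        code_positions[concept.get("code")] += 1
        for term in concept.iter("term"):
            term_counts[term.text] += 1
            term_classes.setdefault(term.text, set()).add(cls.get("name"))

assert all(n == 1 for n in term_counts.values())        # check 1: each term occurs once
assert all(len(s) == 1 for s in term_classes.values())  # check 2: each term in one class
assert all(n == 1 for n in code_positions.values())     # check 3: one position per code
```

Running the checks over the full 9+ Megabyte XML file would use the same traversal; a violation of any constraint raises an AssertionError identifying the draft as invalid.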


Subject(s)
Cell Lineage , Neoplasms/classification , Databases, Factual , Female , Humans , Male , Neoplastic Stem Cells/classification
20.
BMC Med Inform Decis Mak ; 4: 16, 2004 Sep 15.
Article in English | MEDLINE | ID: mdl-15369595

ABSTRACT

BACKGROUND: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding. METHODS: An autocoder was written that transforms plain-text into intercalated word doublets (e.g. "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Matching doublets are assigned a numeric code specific for each doublet found in the nomenclature. Text doublets that do not match the index of doublets extracted from the nomenclature are not part of valid nomenclature terms. Runs of matching doublets from text are concatenated and matched against nomenclature terms (also represented as runs of doublets). RESULTS: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both autocoders used an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing over 102,271 unique names of neoplasms). In side-by-side comparison on the same computer, the doublet method autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. 
In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder. CONCLUSIONS: The doublet method of autocoding is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain-text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself, are all open source materials.


Subject(s)
Algorithms , Electronic Data Processing/methods , Natural Language Processing , Neoplasms/classification , Terminology as Topic , Abstracting and Indexing , Computers , Humans , Software , Software Design , Unified Medical Language System