Pesquisa | Portal Regional da BVS

1.

Computational peptide discovery with a genetic programming approach.

Scalzitti, Nicolas; Miralavy, Iliya; Korenchan, David E; Farrar, Christian T; Gilad, Assaf A; Banzhaf, Wolfgang.

J Comput Aided Mol Des ; 38(1): 17, 2024 Apr 03.

Artigo em Inglês | MEDLINE | ID: mdl-38570405

RESUMO

The development of peptides for therapeutic targets or biomarkers for disease diagnosis is a challenging task in protein engineering. Current approaches are tedious, often time-consuming and require complex laboratory data due to the vast search spaces that need to be considered. In silico methods can accelerate research and substantially reduce costs. Evolutionary algorithms are a promising approach for exploring large search spaces and can facilitate the discovery of new peptides. This study presents the development and use of a new variant of the genetic-programming-based POET algorithm, called POET Regex , where individuals are represented by a list of regular expressions. This algorithm was trained on a small curated dataset and employed to generate new peptides improving the sensitivity of peptides in magnetic resonance imaging with chemical exchange saturation transfer (CEST). The resulting model achieves a performance gain of 20% over the initial POET models and is able to predict a candidate peptide with a 58% performance increase compared to the gold-standard peptide. By combining the power of genetic programming with the flexibility of regular expressions, new peptide targets were identified that improve the sensitivity of detection by CEST. This approach provides a promising research direction for the efficient identification of peptides with therapeutic or diagnostic potential.

Assuntos

Algoritmos , Imageamento por Ressonância Magnética , Humanos , Imagens de Fantasmas , Imageamento por Ressonância Magnética/métodos , Peptídeos

2.

Computational Peptide Discovery with a Genetic Programming Approach.

Scalzitti, Nicolas; Miralavy, Iliya; Korenchan, David E; Farrar, Christian T; Gilad, Assaf A; Banzhaf, Wolfgang.

Res Sq ; 2023 Sep 01.

Artigo em Inglês | MEDLINE | ID: mdl-37693481

RESUMO

Background: The development of peptides for therapeutic targets or biomarkers for disease diagnosis is a challenging task in protein engineering. Current approaches are tedious, often time-consuming and require complex laboratory data due to the vast search space. In silico methods can accelerate research and substantially reduce costs. Evolutionary algorithms are a promising approach for exploring large search spaces and facilitating the discovery of new peptides. Results: This study presents the development and use of a variant of the initial POET algorithm, called POETRegex, which is based on genetic programming, where individuals are represented by a list of regular expressions. The program was trained on a small curated dataset and employed to predict new peptides that can improve the problem of sensitivity in detecting peptides through magnetic resonance imaging using chemical exchange saturation transfer (CEST). The resulting model achieves a performance gain of 20% over the initial POET variant and is able to predict a candidate peptide with a 58% performance increase compared to the gold-standard peptide. Conclusions: By combining the power of genetic programming with the flexibility of regular expressions, new potential peptide targets were identified to improve the sensitivity of detection by CEST. This approach provides a promising research direction for the efficient identification of peptides with therapeutic or diagnostic potential.

3.

Automated extraction of information from free text of Spanish oncology pathology reports.

Mendoza-Urbano, Diana Marcela; Garcia, Johan Felipe; Moreno, Juan Sebastian; Bravo-Ocaña, Juan Carlos; Riascos, Alvaro José; Zambrano Harvey, Angela; Prada, Sergio I.

Colomb Med (Cali) ; 54(1): e2035300, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37614525

RESUMO

Background: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, tumor information extraction requires a significant human effort. Recording data in an efficient and high-quality format is essential in implementing and establishing a hospital-based-cancer registry. Objective: This study aimed to describe implementing a natural language processing algorithm for oncology pathology reports. Methods: An algorithm was developed to process oncology pathology reports in Spanish to extract 20 medical descriptors. The approach is based on the successive coincidence of regular expressions. Results: The validation was performed with 140 pathological reports. The topography identification was performed manually by humans and the algorithm in all reports. The human identified morphology in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for Topography and 89.5 for Morphology. Conclusions: A preliminary algorithm validation against human extraction was performed over a small set of reports with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale which may be helpful for future research on the subject.

Introducción: Los reportes de patología están almacenados como texto libre sin estructura, gramática, fragmentados o abreviados, con variabilidad lingüística entre patólogos. Por esta razón, la extracción de información de tumores requiere un esfuerzo humano significativo. Almacenar información en un formato eficiente y de alta calidad es esencial para implementar y establecer un registro hospitalario de cáncer. Objetivo: Este estudio busca describir la implementación de un algoritmo de Procesamiento de Lenguaje Natural para reportes de patología oncológica. Métodos: Desarrollamos un algoritmo para procesar reportes de patología oncológica en Español, con el objetivo de extraer 20 descriptores médicos. El abordaje se basa en la coincidencia sucesiva de expresiones regulares. Resultados: La validación se hizo con 140 reportes de patología. La identificación topográfica se realizó por humanos y por el algoritmo en todos los reportes. La morfología fue identificada por humanos en 138 reportes y por el algoritmo en 137. El valor de coincidencias parciales (fuzzy matches) promedio fue de 68.3 para Topografía y 89.5 para Morfología. Conclusiones: Se hizo una validación preliminar del algoritmo contra extracción humana sobre un pequeño grupo de reportes, con resultados satisfactorios. Esto muestra que múltiples atributos del espécimen pueden ser extraídos de manera precisa de texto libre de reportes de patología en Español, usando un abordaje de expresiones regulares. Adicionalmente, desarrollamos una página web para facilitar la validación colaborativa a gran escala, lo que puede ser beneficioso para futuras investigaciones en el tema.

Assuntos

Algoritmos , Humanos , Sistema de Registros

4.

Automated extraction of information from free text of Spanish oncology pathology reports

Mendoza-Urbano, Diana Marcela; Garcia, Johan Felipe; Moreno, Juan Sebastian; Bravo-Ocaña, Juan Carlos; Riascos, Alvaro José; Zambrano Harvey, Angela; Prada, Sergio I.

Colomb. med ; 54(1)mar. 2023.

Artigo em Inglês | LILACS-Express | LILACS | ID: biblio-1534279

RESUMO

Background: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text with linguistic variability among pathologists. For this reason, tumor information extraction requires a significant human effort. Recording data in an efficient and high-quality format is essential in implementing and establishing a hospital-based-cancer registry Objective: This study aimed to describe implementing a natural language processing algorithm for oncology pathology reports. Methods: An algorithm was developed to process oncology pathology reports in Spanish to extract 20 medical descriptors. The approach is based on the successive coincidence of regular expressions. Results: The validation was performed with 140 pathological reports. The topography identification was performed manually by humans and the algorithm in all reports. The human identified morphology in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for Topography and 89.5 for Morphology. Conclusions: A preliminary algorithm validation against human extraction was performed over a small set of reports with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale which may be helpful for future research on the subject.

Introducción: Los reportes de patología están almacenados como texto libre sin estructura, gramática, fragmentados o abreviados, con variabilidad lingüística entre patólogos. Por esta razón, la extracción de información de tumores requiere un esfuerzo humano significativo. Almacenar información en un formato eficiente y de alta calidad es esencial para implementar y establecer un registro hospitalario de cáncer. Objetivo: Este estudio busca describir la implementación de un algoritmo de Procesamiento de Lenguaje Natural para reportes de patología oncológica. Métodos: Desarrollamos un algoritmo para procesar reportes de patología oncológica en Español, con el objetivo de extraer 20 descriptores médicos. El abordaje se basa en la coincidencia sucesiva de expresiones regulares. Resultados: La validación se hizo con 140 reportes de patología. La identificación topográfica se realizó por humanos y por el algoritmo en todos los reportes. La morfología fue identificada por humanos en 138 reportes y por el algoritmo en 137. El valor de coincidencias parciales (fuzzy matches) promedio fue de 68.3 para Topografía y 89.5 para Morfología. Conclusiones: Se hizo una validación preliminar del algoritmo contra extracción humana sobre un pequeño grupo de reportes, con resultados satisfactorios. Esto muestra que múltiples atributos del espécimen pueden ser extraídos de manera precisa de texto libre de reportes de patología en Español, usando un abordaje de expresiones regulares. Adicionalmente, desarrollamos una página web para facilitar la validación colaborativa a gran escala, lo que puede ser beneficioso para futuras investigaciones en el tema.

5.

Inglämnlagare: a tool for restructuring Swedish HER record site data for statistical analysis.

Löwenborg, Daniel; Antomonov, Filipp.

F1000Res ; 11: 1370, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-37990690

RESUMO

Background: This paper presents a new software tool, Inglämnlagare, developed to be open-source, that restructures information about ancient remains in Sweden for analysis. The background is a new version of the ancient sites database, the Historic Environment Record, curated by the Swedish National Heritage Board, that was launched in 2018 with a new database model that structures the information differently compared to previous versions. Methods: The program, written in Python programming language, has multicore support in order to improve performance for large files and uses regular expressions to extract information about individual features of composite sites. Such features, together with their summed amount, are written as new individual fields to a comma-separated value file. The program is delivered as a source script file that can be executed in any Python environment. Use cases: As an example of use, a case study of exploring graves of rectangular shape found within Sweden is provided. The use case also describes the different steps involved in preparing the data in QGIS to run the program, as well as some methods to efficiently analyse and visualize the output. Conclusions: Inglämnlagare will make more information from the Swedish record of ancient sites accessible for research and can be used to explore different content of the record more efficiently than previously possible. While the tool is written specifically for this dataset it also provides an example of how open-source tools can be used for data wrangling making information designed for a specific purpose, such as online dissemination, appropriate for analysis.

Assuntos

Linguagens de Programação , Software , Suécia , Projetos de Pesquisa , Bases de Dados Factuais

6.

A generalizable data assembly algorithm for infectious disease outbreaks.

Majumder, Maimuna S; Rose, Sherri.

JAMIA Open ; 4(3): ooab058, 2021 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-34350393

RESUMO

During infectious disease outbreaks, health agencies often share text-based information about cases and deaths. This information is rarely machine-readable, thus creating challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across 3 outbreaks. After developing an algorithm with regular expressions, we automatically curated data from health agencies via 3 information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak, and an implementation process was presented for application to future outbreaks. When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all 3 outbreaks. Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases.

7.

A Service Discovery Solution for Edge Choreography-Based Distributed Embedded Systems.

Blanc, Sara; Bayo-Montón, José-Luis; Palanca-Barrio, Senén; Arreaga-Alvarado, Néstor X.

Sensors (Basel) ; 21(2)2021 Jan 19.

Artigo em Inglês | MEDLINE | ID: mdl-33478175

RESUMO

This paper presents a solution to support service discovery for edge choreography based distributed embedded systems. The Internet of Things (IoT) edge architectural layer is composed of Raspberry Pi machines. Each machine hosts different services organized based on the choreography collaborative paradigm. The solution adds to the choreography middleware three messages passing models to be coherent and compatible with current IoT messaging protocols. It is aimed to support blind hot plugging of new machines and help with service load balance. The discovery mechanism is implemented as a broker service and supports regular expressions (Regex) in message scope to discern both publishing patterns offered by data providers and client services necessities. Results compare Control Process Unit (CPU) usage in a request-response and datacentric configuration and analyze both regex interpreter latency times compared with a traditional message structure as well as its impact on CPU and memory consumption.

8.

Text Processing for Detection of Fungal Ocular Involvement in Critical Care Patients: Cross-Sectional Study.

Baxter, Sally L; Klie, Adam R; Radha Saseendrakumar, Bharanidharan; Ye, Gordon Y; Hogarth, Michael.

J Med Internet Res ; 22(8): e18855, 2020 08 14.

Artigo em Inglês | MEDLINE | ID: mdl-32795984

RESUMO

BACKGROUND: Fungal ocular involvement can develop in patients with fungal bloodstream infections and can be vision-threatening. Ocular involvement has become less common in the current era of improved antifungal therapies. Retrospectively determining the prevalence of fungal ocular involvement is important for informing clinical guidelines, such as the need for routine ophthalmologic consultations. However, manual retrospective record review to detect cases is time-consuming. OBJECTIVE: This study aimed to determine the prevalence of fungal ocular involvement in a critical care database using both structured and unstructured electronic health record (EHR) data. METHODS: We queried microbiology data from 46,467 critical care patients over 12 years (2000-2012) from the Medical Information Mart for Intensive Care III (MIMIC-III) to identify 265 patients with culture-proven fungemia. For each fungemic patient, demographic data, fungal species present in blood culture, and risk factors for fungemia (eg, presence of indwelling catheters, recent major surgery, diabetes, immunosuppressed status) were ascertained. All structured diagnosis codes and free-text narrative notes associated with each patient's hospitalization were also extracted. Screening for fungal endophthalmitis was performed using two approaches: (1) by querying a wide array of eye- and vision-related diagnosis codes, and (2) by utilizing a custom regular expression pipeline to identify and collate relevant text matches pertaining to fungal ocular involvement. Both approaches were validated using manual record review. The main outcome measure was the documentation of any fungal ocular involvement. RESULTS: In total, 265 patients had culture-proven fungemia, with Candida albicans (n=114, 43%) and Candida glabrata (n=74, 28%) being the most common fungal species in blood culture. The in-hospital mortality rate was 121 (46%). In total, 7 patients were identified as having eye- or vision-related diagnosis codes, none of whom had fungal endophthalmitis based on record review. There were 26,830 free-text narrative notes associated with these 265 patients. A regular expression pipeline based on relevant terms yielded possible matches in 683 notes from 108 patients. Subsequent manual record review again demonstrated that no patients had fungal ocular involvement. Therefore, the prevalence of fungal ocular involvement in this cohort was 0%. CONCLUSIONS: MIMIC-III contained no cases of ocular involvement among fungemic patients, consistent with prior studies reporting low rates of ocular involvement in fungemia. This study demonstrates an application of natural language processing to expedite the review of narrative notes. This approach is highly relevant for ophthalmology, where diagnoses are often based on physical examination findings that are documented within clinical notes.

Assuntos

Cuidados Críticos/métodos , Endoftalmite/diagnóstico , Olho/patologia , Micoses/diagnóstico por imagem , Processamento de Linguagem Natural , Estudos Transversais , Feminino , Humanos , Masculino , Pessoa de Meia-Idade , Estudos Retrospectivos , Fatores de Risco

9.

Extracting the characteristics of life cycle assessments via data mining.

Diaz-Elsayed, Nancy; Zhang, Qiong.

MethodsX ; 7: 101004, 2020.

Artigo em Inglês | MEDLINE | ID: mdl-32775227

RESUMO

Life cycle assessments (LCAs) follow the ISO 14040 standard and consist of the following steps: 1) goal and scope definition, 2) life cycle inventory analysis, 3) life cycle impact assessment, and 4) interpretation. Prior literature reviews of wastewater treatment and water reuse LCAs have evaluated the methods implemented within these assessments. In lieu of manually tabulating the characteristic features of LCAs, Data Mining LCAs provides a method to facilitate the extraction of key characteristics. The process consists of the following:â¢Each journal article is converted to a text file and read in Python.â¢Search terms are defined for each characteristic of the LCA to be extracted.â¢By employing Python's regular expressions operations and the natural language toolkit (NLTK), the functional unit, life cycle impact characterization method, and the location of each case study are identified.

10.

Construction of a semi-automatic ICD-10 coding system.

Zhou, Lingling; Cheng, Cheng; Ou, Dong; Huang, Hao.

BMC Med Inform Decis Mak ; 20(1): 67, 2020 04 15.

Artigo em Inglês | MEDLINE | ID: mdl-32293423

RESUMO

BACKGROUND: The International Classification of Diseases, 10th Revision (ICD-10) has been widely used to describe the diagnosis information of patients. Automatic ICD-10 coding is important because manually assigning codes is expensive, time consuming and error prone. Although numerous approaches have been developed to explore automatic coding, few of them have been applied in practice. Our aim is to construct a practical, automatic ICD-10 coding machine to improve coding efficiency and quality in daily work. METHODS: In this study, we propose the use of regular expressions (regexps) to establish a correspondence between diagnosis codes and diagnosis descriptions in outpatient settings and at admission and discharge. The description models of the regexps were embedded in our upgraded coding system, which queries a diagnosis description and assigns a unique diagnosis code. Like most studies, the precision (P), recall (R), F-measure (F) and overall accuracy (A) were used to evaluate the system performance. Our study had two stages. The datasets were obtained from the diagnosis information on the homepage of the discharge medical record. The testing sets were from October 1, 2017 to April 30, 2018 and from July 1, 2018 to January 31, 2019. RESULTS: The values of P were 89.27 and 88.38% in the first testing phase and the second testing phase, respectively, which demonstrate high precision. The automatic ICD-10 coding system completed more than 160,000 codes in 16 months, which reduced the workload of the coders. In addition, a comparison between the amount of time needed for manual coding and automatic coding indicated the effectiveness of the system-the time needed for automatic coding takes nearly 100 times less than manual coding. CONCLUSIONS: Our automatic coding system is well suited for the coding task. Further studies are warranted to perfect the description models of the regexps and to develop synthetic approaches to improve system performance.

Assuntos

Classificação Internacional de Doenças , Alta do Paciente , Automação , Codificação Clínica

11.

Text Processing.

Couto, Francisco M.

Adv Exp Med Biol ; 1137: 45-60, 2019.

Artigo em Inglês | MEDLINE | ID: mdl-31183819

RESUMO

In the previous chapter we were able to automatically process structured data to retrieve biomedical text about any chemical compound, such as caffeine. This chapter will provide a step-by-step introduction to how we can process that text using shell script commands, specifically extract information about diseases related to caffeine. The goal is to equip the reader with an essential set of skills to extract meaningful information from any text.

Assuntos

Mineração de Dados/métodos , Processamento Eletrônico de Dados , Cafeína , Software

12.

Detection of nucleotide sequences capable of forming non-canonical DNA structures: Application of automata theory.

Yurushkin, M V; Gervich, L R; Bachurin, S S; Kletskii, M E.

Comput Biol Chem ; 80: 278-283, 2019 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-31054540

RESUMO

In this study, we develop a program that allows us to reveal DNA receptors, i.e. nucleotide sequences that may form more than one non-canonical structure. The data obtained may be analysed either experimentally or using DNA banks, and refers to the coding, non-coding or promotor region of the gene. These results provide a better understanding of the role that non-canonical structures play in pathological modifications of the genetic apparatus, resulting in tumour formation or inherited disease. They also reveal the effect of single nucleotide polymorphisms on gene expression, indicate so-called "risk regions" in which the substitution of a single nucleotide may lead to increased formation of non-canonical structures, and elucidate the epigenetic mechanisms of microorganism adaptation.

Assuntos

Sequência de Bases , DNA/química , Biologia Molecular/métodos , Conformação de Ácido Nucleico , Software , Algoritmos , Pareamento de Bases , DNA/genética

13.

Auto recognizing critical values from medical image examination reports by using regular expressions / 实用放射学杂志

Rongjie CAI; Dapeng LI; Fangyuan QIN; Jingjing HUANG; Xiaoyan YANG; Lipeng MA.

Journal of Practical Radiology ; (12): 444-446, 2018.

Artigo em Chinês | WPRIM (Pacífico Ocidental) | ID: wpr-696838

RESUMO

Objective To introduce a method of automatically identifying critical values from medical image examination reports and prompt the physician to report it,to prevent the omission of the critical value reporting and improve the medical quality.Methods According to the requirement of critical value reporting system,regular expressions were made for each emergency situation of medical image examination,in order to form a critical value feature library.And an algorithm was designed to find critical value and prompt doctors automatically.Results According to this method,the critical value auto recognize software was developed and had been tested in Nanfang Hospital for 6 months.The software ran well.Conclusion Using regular expressions to define a criteria value feature library and design an algorithm of identifying criteria values,can recognize critical values and prompt physician automatically.

14.

StreamQRE: Modular Specification and Efficient Evaluation of Quantitative Queries over Streaming Data.

Mamouras, Konstantinos; Raghothaman, Mukund; Alur, Rajeev; Ives, Zachary G; Khanna, Sanjeev.

Proc ACM SIGPLAN Conf Program Lang Des Implement ; 52(6): 693-708, 2017 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-29151821

RESUMO

Real-time decision making in emerging IoT applications typically relies on computing quantitative summaries of large data streams in an efficient and incremental manner. To simplify the task of programming the desired logic, we propose StreamQRE, which provides natural and high-level constructs for processing streaming data. Our language has a novel integration of linguistic constructs from two distinct programming paradigms: streaming extensions of relational query languages and quantitative extensions of regular expressions. The former allows the programmer to employ relational constructs to partition the input data by keys and to integrate data streams from different sources, while the latter can be used to exploit the logical hierarchy in the input stream for modular specifications. We first present the core language with a small set of combinators, formal semantics, and a decidable type system. We then show how to express a number of common patterns with illustrative examples. Our compilation algorithm translates the high-level query into a streaming algorithm with precise complexity bounds on per-item processing time and total memory footprint. We also show how to integrate approximation algorithms into our framework. We report on an implementation in Java, and evaluate it with respect to existing high-performance engines for processing streaming data. Our experimental evaluation shows that (1) StreamQRE allows more natural and succinct specification of queries compared to existing frameworks, (2) the throughput of our implementation is higher than comparable systems (for example, two-to-four times greater than RxJava), and (3) the approximation algorithms supported by our implementation can lead to substantial memory savings.

15.

Regular Expressions: The Build Study Vignette.

Wesley, David.

J Insur Med ; 46(1): 20-6, 2016.

Artigo em Inglês | MEDLINE | ID: mdl-27562109

RESUMO

The author creates a fictional vignette to illustrate how a medical director can use data analysis to help review cases comparing attending physician statement (APS) values for build vs those reported by the paramedical examiners for the same lives. In this first of two articles, a method for extracting suitable data is explored.

Assuntos

Seguro Saúde , Estatística como Assunto

16.

Regular expressions for decoding of neural network outputs.

Strauß, Tobias; Leifert, Gundram; Grüning, Tobias; Labahn, Roger.

Neural Netw ; 79: 1-11, 2016 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-27078574

RESUMO

This article proposes a convenient tool for decoding the output of neural networks trained by Connectionist Temporal Classification (CTC) for handwritten text recognition. We use regular expressions to describe the complex structures expected in the writing. The corresponding finite automata are employed to build a decoder. We analyze theoretically which calculations are relevant and which can be avoided. A great speed-up results from an approximation. We conclude that the approximation most likely fails if the regular expression does not match the ground truth which is not harmful for many applications since the low probability will be even underestimated. The proposed decoder is very efficient compared to other decoding methods. The variety of applications reaches from information retrieval to full text recognition. We refer to applications where we integrated the proposed decoder successfully.

Assuntos

Escrita Manual , Armazenamento e Recuperação da Informação/métodos , Redes Neurais de Computação , Humanos , Probabilidade

17.

Regular expression order-sorted unification and matching.

Kutsia, Temur; Marin, Mircea.

J Symb Comput ; 67: 42-67, 2015 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-26523088

RESUMO

We extend order-sorted unification by permitting regular expression sorts for variables and in the domains of function symbols. The obtained signature corresponds to a finite bottom-up unranked tree automaton. We prove that regular expression order-sorted (REOS) unification is of type infinitary and decidable. The unification problem presented by us generalizes some known problems, such as, e.g., order-sorted unification for ranked terms, sequence unification, and word unification with regular constraints. Decidability of REOS unification implies that sequence unification with regular hedge language constraints is decidable, generalizing the decidability result of word unification with regular constraints to terms. A sort weakening algorithm helps to construct a minimal complete set of REOS unifiers from the solutions of sequence unification problems. Moreover, we design a complete algorithm for REOS matching, and show that this problem is NP-complete and the corresponding counting problem is #P-complete.

18.

Learning regular expressions for clinical text classification.

Bui, Duy Duc An; Zeng-Treitler, Qing.

J Am Med Inform Assoc ; 21(5): 850-7, 2014.

Artigo em Inglês | MEDLINE | ID: mdl-24578357

RESUMO

OBJECTIVES: Natural language processing (NLP) applications typically use regular expressions that have been developed manually by human experts. Our goal is to automate both the creation and utilization of regular expressions in text classification. METHODS: We designed a novel regular expression discovery (RED) algorithm and implemented two text classifiers based on RED. The RED+ALIGN classifier combines RED with an alignment algorithm, and RED+SVM combines RED with a support vector machine (SVM) classifier. Two clinical datasets were used for testing and evaluation: the SMOKE dataset, containing 1091 text snippets describing smoking status; and the PAIN dataset, containing 702 snippets describing pain status. We performed 10-fold cross-validation to calculate accuracy, precision, recall, and F-measure metrics. In the evaluation, an SVM classifier was trained as the control. RESULTS: The two RED classifiers achieved 80.9-83.0% in overall accuracy on the two datasets, which is 1.3-3% higher than SVM's accuracy (p<0.001). Similarly, small but consistent improvements have been observed in precision, recall, and F-measure when RED classifiers are compared with SVM alone. More significantly, RED+ALIGN correctly classified many instances that were misclassified by the SVM classifier (8.1-10.3% of the total instances and 43.8-53.0% of SVM's misclassifications). CONCLUSIONS: Machine-generated regular expressions can be effectively used in clinical text classification. The regular expression-based classifier can be combined with other classifiers, like SVM, to improve classification performance.

Assuntos

Algoritmos , Sistemas Computadorizados de Registros Médicos/classificação , Processamento de Linguagem Natural , Inteligência Artificial , Processamento Eletrônico de Dados , Humanos , Dor/classificação , Fumar , Máquina de Vetores de Suporte

19.

An intuitive graphical webserver for multiple-choice protein sequence search.

Banky, Daniel; Szalkai, Balazs; Grolmusz, Vince.

Gene ; 539(1): 152-3, 2014 Apr 10.

Artigo em Inglês | MEDLINE | ID: mdl-24525401

RESUMO

Every day tens of thousands of sequence searches and sequence alignment queries are submitted to webservers. The capitalized word "BLAST" becomes a verb, describing the act of performing sequence search and alignment. However, if one needs to search for sequences that contain, for example, two hydrophobic and three polar residues at five given positions, the query formation on the most frequently used webservers will be difficult. Some servers support the formation of queries with regular expressions, but most of the users are unfamiliar with their syntax. Here we present an intuitive, easily applicable webserver, the Protein Sequence Analysis server, that allows the formation of multiple choice queries by simply drawing the residues to their positions; if more than one residue are drawn to the same position, then they will be nicely stacked on the user interface, indicating the multiple choice at the given position. This computer-game-like interface is natural and intuitive, and the coloring of the residues makes possible to form queries requiring not just certain amino acids in the given positions, but also small nonpolar, negatively charged, hydrophobic, positively charged, or polar ones. The webserver is available at http://psa.pitgroup.org.

Assuntos

Bases de Dados de Proteínas , Internet , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos , Software , Sequência de Aminoácidos , Interface Usuário-Computador , Jogos de Vídeo

20.

SPLOOCE: a new portal for the analysis of human splicing variants.

Kroll, José Eduardo; Galante, Pedro A F; Ohara, Daniel T; Navarro, Fábio C P; Ohno-Machado, Lucila; de Souza, Sandro J.

RNA Biol ; 9(11): 1339-43, 2012 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-23064119

RESUMO

Understanding alternative splicing is crucial to elucidate the mechanisms behind several biological phenomena, including diseases. The huge amount of expressed sequences available nowadays represents an opportunity and a challenge to catalog and display alternative splicing events (ASEs). Although several groups have faced this challenge with relative success, we still lack a computational tool that uses a simple and straightforward method to retrieve, name and present ASEs. Here we present SPLOOCE, a portal for the analysis of human splicing variants. SPLOOCE uses a method based on regular expressions for retrieval of ASEs. We propose a simple syntax that is able to capture the complexity of ASEs.

Assuntos

Processamento Alternativo , Biologia Computacional , Bases de Dados de Ácidos Nucleicos , Sítios de Splice de RNA , Humanos , Internet , Análise de Sequência com Séries de Oligonucleotídeos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA