Search | BVS Regional Portal (test)

DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text.

Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P.

Lang Resour Eval ; 56(3): 765-806, 2022.

Article in English | MEDLINE | ID: mdl-35996566

ABSTRACT

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has a high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
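The abstract reports agreement in Krippendorff's alpha but does not give the computation. As a rough illustration only, the sketch below computes alpha for nominal labels from a coincidence matrix. The function name, input layout, and toy labels are assumptions made for this example and are not taken from the paper or its released data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list with one entry per annotated item; each entry is the
    list of labels that item received (missing ratings are simply omitted).
    """
    # Coincidence matrix: each ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in the unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating cannot be paired with anything
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so there is no disagreement
    return 1.0 - observed / expected


# Toy example with two annotators and hypothetical sentiment labels.
toy_units = [["pos", "pos"], ["pos", "neg"], ["neg", "neg"], ["neg", "neg"]]
alpha = krippendorff_alpha_nominal(toy_units)
```

Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.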

Toward an Integrative Approach for Making Sense Distinctions.

McCrae, John P; Fransen, Theodorus; Ahmadi, Sina; Buitelaar, Paul; Goswami, Koustava.

Front Artif Intell ; 5: 745626, 2022.

Article in English | MEDLINE | ID: mdl-35198970

ABSTRACT

Word senses are the fundamental unit of description in lexicography, yet it is rarely the case that different dictionaries reach any agreement on the number and definition of senses in a language. With the recent rise in natural language processing and other computational approaches there is an increasing demand for quantitatively validated sense catalogues of words, yet no consensus methodology exists. In this paper, we look at four main approaches to making sense distinctions (formal, cognitive, distributional, and intercultural) and examine the strengths and weaknesses of each approach. We then consider how these may be combined into a single sound methodology. We illustrate this by examining two English words, "wing" and "fish," using existing resources for each of these four approaches, and show the weaknesses of each. We then look at the impact of such an integrated method and provide some future perspectives on the research that is necessary to reach a principled method for making sense distinctions.

A Survey of Orthographic Information in Machine Translation.

Chakravarthi, Bharathi Raja; Rani, Priya; Arcan, Mihael; McCrae, John P.

SN Comput Sci ; 2(4): 330, 2021.

Article in English | MEDLINE | ID: mdl-34723204

ABSTRACT

Machine translation is one of the applications of natural language processing which has been explored in different languages. Recently, researchers have started paying attention to machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine translation systems is the linguistic difference and variation in orthographic conventions, which cause many issues for traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve the machine translation system. This article offers a survey of research regarding orthography's influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and how orthographic information can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made, and showing how orthographic knowledge improves the performance of machine translation of under-resourced languages. We discuss different types of machine translation and demonstrate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts using cognate information at different levels of machine translation and the lessons that can be drawn from this. Additionally, multilingual neural machine translation of closely related languages is given a particular focus in this survey. This article ends with a discussion of the way forward in machine translation with orthographic information, focusing on multilingual settings and bilingual lexicon induction.

Putting patients in control of data from electronic health records.

New, John P; Leather, David; Bakerly, Nawar Diar; McCrae, John; Gibson, J Martin.

BMJ ; 360: j5554, 2018 01 02.

Article in English | MEDLINE | ID: mdl-29295813

Subjects

Access to Information/legislation & jurisprudence; Confidentiality/standards; Electronic Health Records/standards; Documentation; Electronic Health Records/statistics & numerical data; Humans

Monitoring safety in a phase III real-world effectiveness trial: use of novel methodology in the Salford Lung Study.

Collier, Sue; Harvey, Catherine; Brewster, Jill; Bakerly, Nawar Diar; Elkhenini, Hanaa F; Stanciu, Roxana; Williams, Claire; Brereton, Jacqui; New, John P; McCrae, John; McCorkindale, Sheila; Leather, David.

Pharmacoepidemiol Drug Saf ; 26(3): 344-352, 2017 Mar.

Article in English | MEDLINE | ID: mdl-27804174

ABSTRACT

BACKGROUND: The Salford Lung Study (SLS) programme, encompassing two phase III pragmatic randomised controlled trials, was designed to generate evidence on the effectiveness of a once-daily treatment for asthma and chronic obstructive pulmonary disease in routine primary care using electronic health records. OBJECTIVE: The objective of this study was to describe and discuss the safety monitoring methodology and the challenges associated with ensuring patient safety in the SLS. Refinements to safety monitoring processes and infrastructure are also discussed. The study results are outside the remit of this paper. The results of the COPD study were published recently and a more in-depth exploration of the safety results will be the subject of future publications. ACHIEVEMENTS: The SLS used a linked database system to capture relevant data from primary care practices in Salford and South Manchester, two university hospitals and other national databases. Patient data were collated and analysed to create daily summaries that were used to alert a specialist safety team to potential safety events. Clinical research teams at participating general practitioner sites and pharmacies also captured safety events during routine consultations. Confidence in the safety monitoring processes over time allowed the methodology to be refined and streamlined without compromising patient safety or the timely collection of data. The information technology infrastructure also allowed additional details of safety information to be collected. CONCLUSION: Integration of multiple data sources in the SLS may provide more comprehensive safety information than usually collected in standard randomised controlled trials. Application of the principles of safety monitoring methodology from the SLS could facilitate safety monitoring processes for future pragmatic randomised controlled trials and yield important complementary safety and effectiveness data. 
© 2016 The Authors. Pharmacoepidemiology and Drug Safety published by John Wiley & Sons Ltd.

Subjects

Asthma/drug therapy; Electronic Health Records/statistics & numerical data; Pulmonary Disease, Chronic Obstructive/drug therapy; Research Design; Androstadienes/administration & dosage; Androstadienes/adverse effects; Benzyl Alcohols/administration & dosage; Benzyl Alcohols/adverse effects; Chlorobenzenes/administration & dosage; Chlorobenzenes/adverse effects; Databases, Factual; Drug Combinations; Humans; Medical Record Linkage; Primary Health Care

Synonym set extraction from the biomedical literature by lexical pattern discovery.

McCrae, John; Collier, Nigel.

BMC Bioinformatics ; 9: 159, 2008 Mar 24.

Article in English | MEDLINE | ID: mdl-18366721

ABSTRACT

BACKGROUND: Although there are a large number of thesauri for the biomedical domain, many of them lack coverage in terms and their variant forms. Automatic thesaurus construction based on patterns was first suggested by Hearst [1], but it is still not clear how to automatically construct such patterns for different semantic relations and domains. In particular, it is not certain which patterns are useful for capturing synonymy. The assumption of extant resources such as parsers is also a limiting factor for many languages, so it is desirable to find patterns that do not use syntactical analysis. Finally, to give a more consistent and applicable result, it is desirable to use these patterns to form synonym sets in a sound way. RESULTS: We present a method that automatically generates regular expression patterns by expanding seed patterns in a heuristic search and then develops a feature vector based on the occurrence of term pairs in each developed pattern. This allows for a binary classification of term pairs as synonymous or non-synonymous. We then model this result as a probability graph to find synonym sets, which is equivalent to the well-studied problem of finding an optimal set cover. We achieved 73.2% precision and 29.7% recall by our method, out-performing hand-made resources such as MeSH and Wikipedia. CONCLUSION: We conclude that automatic methods can play a practical role in developing new thesauri or expanding on existing ones, and this can be done with only a small amount of training data and no need for resources such as parsers. We also conclude that the accuracy can be improved by grouping into synonym sets.
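To make the feature-vector idea in the abstract concrete, here is a minimal sketch under stated assumptions. It is not the authors' method: the three patterns are hand-written placeholders rather than the output of a heuristic search, and the function names and example sentences are hypothetical. It shows only how occurrences of a term pair in lexical patterns can be turned into a feature vector, and how precision and recall over predicted synonym pairs could be computed.

```python
import re

# Hypothetical lexical patterns; {x} and {y} are slots for the two terms.
# In the paper such patterns are generated automatically; these are fixed
# placeholders for illustration only.
PATTERNS = [
    r"{x},? also known as {y}",
    r"{x} \(also called {y}\)",
    r"{x},? or {y},",
]


def pattern_features(x, y, sentences, patterns=PATTERNS):
    """Count, for each pattern, how many sentences contain it with x and y filled in."""
    features = []
    for template in patterns:
        regex = re.compile(
            template.format(x=re.escape(x), y=re.escape(y)), re.IGNORECASE
        )
        features.append(sum(1 for s in sentences if regex.search(s)))
    return features


def precision_recall(predicted, gold):
    """Precision and recall of a set of predicted pairs against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy corpus (invented sentences, not data from the paper).
sentences = [
    "Myocardial infarction, also known as heart attack, is common.",
    "Aspirin (also called acetylsalicylic acid) reduces risk.",
]
vec_a = pattern_features("myocardial infarction", "heart attack", sentences)
vec_b = pattern_features("aspirin", "acetylsalicylic acid", sentences)
```

Each vector could then be passed to any binary classifier. The abstract does not name the classifier used, so none is assumed here.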

Subjects

Artificial Intelligence; Dictionaries as Topic; Information Storage and Retrieval/methods; Natural Language Processing; Periodicals as Topic; Semantics; Vocabulary, Controlled; Computational Biology/methods; Database Management Systems; Pattern Recognition, Automated/methods
