Search | BVS Regional Portal (test)

DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text.

Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P.

Lang Resour Eval ; 56(3): 765-806, 2022.

Article in English | MEDLINE | ID: mdl-35996566

ABSTRACT

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.

Toward an Integrative Approach for Making Sense Distinctions.

McCrae, John P; Fransen, Theodorus; Ahmadi, Sina; Buitelaar, Paul; Goswami, Koustava.

Front Artif Intell ; 5: 745626, 2022.

Article in English | MEDLINE | ID: mdl-35198970

ABSTRACT

Word senses are the fundamental unit of description in lexicography, yet it is rarely the case that different dictionaries reach any agreement on the number and definition of senses in a language. With the recent rise in natural language processing and other computational approaches, there is an increasing demand for quantitatively validated sense catalogues of words, yet no consensus methodology exists. In this paper, we look at four main approaches to making sense distinctions (formal, cognitive, distributional, and intercultural) and examine the strengths and weaknesses of each approach. We then consider how these may be combined into a single sound methodology. We illustrate this by examining two English words, "wing" and "fish," using existing resources for each of these four approaches, and illustrate the weaknesses of each. We then look at the impact of such an integrated method and provide some future perspectives on the research that is necessary to reach a principled method for making sense distinctions.

A Survey of Orthographic Information in Machine Translation.

Chakravarthi, Bharathi Raja; Rani, Priya; Arcan, Mihael; McCrae, John P.

SN Comput Sci ; 2(4): 330, 2021.

Article in English | MEDLINE | ID: mdl-34723204

ABSTRACT

Machine translation is one of the applications of natural language processing that has been explored in different languages. Recently, researchers have started paying attention to machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine translation systems is the linguistic difference and variation in orthographic conventions, which cause many issues for traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve the machine translation system. This article offers a survey of research regarding orthography's influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and how orthographic information can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made, and showing how orthographic knowledge improves the performance of machine translation of under-resourced languages. We discuss different types of machine translation and demonstrate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts using cognate information at different levels of machine translation and the lessons that can be drawn from this. Additionally, multilingual neural machine translation of closely related languages is given a particular focus in this survey. This article ends with a discussion of the way forward in machine translation with orthographic information, focusing on multilingual settings and bilingual lexicon induction.
