Dbias: detecting biases and ensuring fairness in news articles.

Raza, Shaina; Reji, Deepak John; Ding, Chen.

Int J Data Sci Anal ; : 1-21, 2022 Sep 01.

Article in English | MEDLINE | ID: mdl-36065448

ABSTRACT

Because of the increasing use of data-centric systems and algorithms in machine learning, the topic of fairness is receiving a lot of attention in the academic and broader literature. This paper introduces Dbias (https://pypi.org/project/Dbias/), an open-source Python package for ensuring fairness in news articles. Dbias takes any text and determines whether it is biased. It then detects the biased words in the text, masks them, and suggests a set of sentences with new words that are bias-free or at least less biased. We conduct extensive experiments to assess the performance of Dbias. To see how well our approach works, we compare it with existing fairness models. We also test the individual components of Dbias to see how effective they are. The experimental results show that Dbias outperforms all the baselines in terms of accuracy and fairness. We make this package (Dbias) publicly available for developers and practitioners to mitigate biases in textual data (such as news articles), as well as to encourage extension of this work.
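
The detect, mask, and suggest steps described in this abstract can be illustrated with a toy sketch. It is not the Dbias API and does not use the package's models: the word list, the function names, and the `[MASK]` token are all assumptions made for illustration, with a small hand-written lexicon standing in for the trained components.

```python
import re

# Hypothetical lexicon mapping a biased word to more neutral alternatives.
# Illustrative only: this is an assumption, not how Dbias finds biased words.
NEUTRAL_ALTERNATIVES = {
    "radical": ["new", "far-reaching"],
    "disastrous": ["unsuccessful", "costly"],
}

MASK = "[MASK]"  # assumed mask token


def find_biased_words(text):
    """Return lexicon words that occur in the text, in order of first appearance."""
    found = []
    for token in re.findall(r"[A-Za-z'-]+", text.lower()):
        if token in NEUTRAL_ALTERNATIVES and token not in found:
            found.append(token)
    return found


def mask_words(text, words):
    """Replace every whole-word occurrence of the given words with the mask token."""
    for word in words:
        text = re.sub(rf"\b{re.escape(word)}\b", MASK, text, flags=re.IGNORECASE)
    return text


def suggest_rewrites(text):
    """Build candidate sentences by swapping in the i-th alternative for each biased word."""
    words = find_biased_words(text)
    if not words:
        return []
    n_candidates = min(len(NEUTRAL_ALTERNATIVES[w]) for w in words)
    rewrites = []
    for i in range(n_candidates):
        candidate = text
        for word in words:
            candidate = re.sub(
                rf"\b{re.escape(word)}\b",
                NEUTRAL_ALTERNATIVES[word][i],
                candidate,
                flags=re.IGNORECASE,
            )
        rewrites.append(candidate)
    return rewrites


sentence = "The radical plan was disastrous."
biased = find_biased_words(sentence)
masked = mask_words(sentence, biased)
rewrites = suggest_rewrites(sentence)
```

A real system would replace the lexicon lookup with a classifier and a masked language model; the sketch only shows how the stages hand data to one another.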

Environmental due diligence data: A novel corpus for training environmental domain NLP models.

Aman, Afreen; Reji, Deepak John.

Data Brief ; 45: 108579, 2022 Dec.

Article in English | MEDLINE | ID: mdl-36148216

ABSTRACT

This article takes a step toward adapting existing Natural Language Processing (NLP) models to the diverse and heterogeneous settings of Environmental Due Diligence (EDD). Our approach was to enrich the vocabulary of deep learning models with more data from the environmental domain, collected from open-source regulatory documents provided by the Environmental Protection Agency (EPA) [1]. We used active learning and data augmentation methods to resolve the imbalanced classes, and fine-tuned DistilBERT on EDD data to develop an environmental due diligence model, which is hosted as an inference Application Programming Interface (API) on the Hugging Face Hub. This model was packaged to predict EDD classes, determine relevancy and ranking, and allow users to fine-tune the model on more EDD classes. This package, EnvBert, is hosted on the Python Package Index (PyPI) repository [2]. We anticipate that the rich EDD dataset that we used to train the model and create the package will help users contribute to a variety of NLP tasks on EDD textual data, especially text classification. We present the data in raw format; it has been open-sourced and is publicly available at https://data.mendeley.com/datasets/tx6vmd4g9p/4.
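
The abstract does not say which active learning or augmentation methods were used to balance the classes. As a rough illustration of the underlying problem only, the sketch below uses plain random oversampling, a simpler stand-in technique that is not claimed to be the authors' method. The function name, the example texts, and the class labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample(examples, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class.

    examples: list of (text, label) pairs.
    """
    if not examples:
        return []
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Top up smaller classes with randomly chosen duplicates.
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced


# Hypothetical labelled snippets; the labels are not the dataset's real classes.
data = [
    ("Soil samples exceeded screening levels.", "contamination"),
    ("Groundwater monitoring wells were installed.", "remediation"),
    ("Excavated soil was removed from the site.", "remediation"),
    ("A cap was placed over the landfill.", "remediation"),
]
balanced = oversample(data)
label_counts = Counter(label for _, label in balanced)
```

Duplicating examples adds no new information, which is one reason augmentation or active learning is often preferred when labelling budget allows.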

Large-scale application of named entity recognition to biomedicine and epidemiology.

Raza, Shaina; Reji, Deepak John; Shajan, Femi; Bashir, Syed Raza.

PLOS Digit Health ; 1(12): e0000152, 2022 Dec.

Article in English | MEDLINE | ID: mdl-36812589

ABSTRACT

BACKGROUND: Despite significant advancements in biomedical named entity recognition methods, the clinical application of these systems continues to face many challenges: (1) most of the methods are trained on a limited set of clinical entities; (2) these methods are heavily reliant on a large amount of data for both pre-training and prediction, making their use in production impractical; (3) they do not consider non-clinical entities that are also related to a patient's health, such as social, economic, or demographic factors. METHODS: In this paper, we develop Bio-Epidemiology-NER (https://pypi.org/project/Bio-Epidemiology-NER/), an open-source Python package for detecting biomedical named entities in text. The approach is built on a Transformer-based system and trained on a dataset that is annotated with many named entities (medical, clinical, biomedical, and epidemiological). It improves on previous efforts in three ways: (1) it recognizes many clinical entity types, such as medical risk factors, vital signs, drugs, and biological functions; (2) it is easily configurable, reusable, and can scale up for training and inference; (3) it also considers non-clinical factors (such as age, gender, race, and social history) that influence health outcomes. At a high level, it consists of four phases: pre-processing, data parsing, named entity recognition, and named entity enhancement. RESULTS: Experimental results show that our pipeline outperforms other methods on three benchmark datasets, with macro- and micro-average F1 scores of around 90 percent and above. CONCLUSION: This package is made publicly available for researchers, doctors, clinicians, and anyone else to extract biomedical named entities from unstructured biomedical texts.
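
The results above are reported as macro- and micro-average F1 scores. The sketch below shows how the two averages differ when computed from per-entity-type counts of true positives, false positives, and false negatives. The entity types and counts are made up for illustration and are not taken from the paper.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts maps an entity type to a (tp, fp, fn) tuple.

    Macro-average: mean of the per-type F1 scores (each type weighs the same).
    Micro-average: F1 of the pooled counts (each prediction weighs the same).
    """
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    total_tp = sum(c[0] for c in counts.values())
    total_fp = sum(c[1] for c in counts.values())
    total_fn = sum(c[2] for c in counts.values())
    micro = f1(total_tp, total_fp, total_fn)
    return macro, micro


# Hypothetical counts for two entity types.
counts = {
    "Drug": (8, 2, 2),  # frequent type, recognized well
    "Age": (1, 1, 3),   # rare type, recognized poorly
}
macro, micro = macro_micro_f1(counts)
```

With these counts the macro-average is lower than the micro-average, because the poorly recognized rare type counts as much as the frequent one. Reporting both, as the abstract does, shows whether performance holds up on rare entity types.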
