1.
Front Big Data ; 7: 1296552, 2024.
Article in English | MEDLINE | ID: mdl-38495849

ABSTRACT

Traditional data curation processes typically depend on human intervention. As data volume and variety grow exponentially, organizations are striving to increase the efficiency of their data processes by automating manual steps and making them as unsupervised as possible. An additional challenge is to make these unsupervised processes scalable to meet the demands of increased data volume. This paper describes the parallelization of an unsupervised entity resolution (ER) process. ER is a component of many different data curation processes because it clusters records from multiple data sources that refer to the same real-world entity, such as the same customer, patient, or product. The ability to scale ER processes is particularly important because the computational effort of ER increases quadratically with data volume. The Data Washing Machine (DWM) is a previously proposed unsupervised ER system that clusters references from diverse data sources. This work addresses the single-threaded nature of the DWM by adopting the parallelism of Hadoop MapReduce. The proposed parallelization method can, however, be applied both to supervised systems, where matching rules are created by experts, and to unsupervised systems, where expert intervention is not required. The DWM uses an entropy measure to self-evaluate the quality of record clustering. The current single-threaded implementations of the DWM in Python and Java do not scale beyond a few thousand records and rely on large, shared memory. The objective of this research is to solve the two major shortcomings of the current DWM design, namely the creation and use of shared memory and the lack of scalability, by leveraging Hadoop MapReduce. We propose the Hadoop Data Washing Machine (HDWM), a MapReduce implementation of the legacy DWM. The scalability of the proposed system is demonstrated using publicly available ER datasets. Based on the results of our experiment, we conclude that HDWM can cluster from thousands to millions of equivalent references using multiple computational nodes with independent RAM and CPU cores.
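
The abstract does not give implementation details, so the following is only a minimal single-process sketch of how a map/shuffle/reduce decomposition can limit the quadratic pairwise comparisons in ER. It is not the HDWM code. The token-based blocking key, the Jaccard similarity, the 0.5 threshold, the sample records, and all function names are assumptions chosen for illustration, and the DWM's entropy-based self-evaluation is not modeled here.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records):
    """Emit (blocking_key, record_id) pairs; each token acts as a blocking key (assumption)."""
    for rec_id, text in records.items():
        for token in set(text.lower().split()):
            yield token, rec_id


def shuffle(pairs):
    """Group record ids by blocking key, as a MapReduce shuffle step would."""
    blocks = defaultdict(set)
    for key, rec_id in pairs:
        blocks[key].add(rec_id)
    return blocks


def jaccard(a, b):
    """Token-set Jaccard similarity (an illustrative stand-in for a real matcher)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def reduce_phase(blocks, records, threshold=0.5):
    """Compare only records that share a block and return the matching pairs."""
    matches = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            if jaccard(records[a], records[b]) >= threshold:
                matches.add((a, b))
    return matches


def cluster(records, matches):
    """Merge matched pairs into clusters with a small union-find."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for r in records:
        groups[find(r)].add(r)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample references, not taken from the paper's datasets.
records = {
    "r1": "John Smith 12 Oak Street",
    "r2": "Jon Smith 12 Oak Street",
    "r3": "Mary Jones 4 Elm Road",
    "r4": "Mary Jones 4 Elm Rd",
}
blocks = shuffle(map_phase(records))
matches = reduce_phase(blocks, records)
clusters = cluster(records, matches)
print(clusters)
```

In this sketch, records that share no token never land in the same block, so they are never compared. Each block can in principle be handled by a separate reducer without shared memory, which is the general property the abstract attributes to the MapReduce design.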

2.
Plant Biotechnol J ; 1(6): 479-90, 2003 Nov.
Article in English | MEDLINE | ID: mdl-17134405

ABSTRACT

A transgenic line of subterranean clover (Trifolium subterraneum) containing a gene for a sulphur-rich sunflower seed albumin (ssa gene) and a gene conferring tolerance to the herbicide phosphinothricin (bar gene) was previously shown to stably express these genes as far as the T3 generation. In subsequent generations there was a progressive decline in the level of expression of both of these genes such that, by the T7 generation, the plants were almost completely susceptible to the herbicide and the mean level of sunflower seed albumin was reduced to 10-30% of the level in the T2 and T3 generations. The decline in SSA protein correlated closely with a decline in the level of ssa RNA. In vitro transcription experiments with nuclei isolated from plants of the early and late generations showed that the reduced mRNA level was associated with a reduced level of transcription of the ssa transgene. Transcription of the bar transgene was also reduced in the late generations. Bisulphite sequencing analysis showed that the decline in expression of the ssa gene between T3 and subsequent generations correlated closely with increased CpG methylation in the promoter, but not in the coding region. Analysis of the bar gene promoter showed that high levels of CpG methylation preceded the first detectable decline in expression of the bar gene by one generation, suggesting that methylation was not the direct cause of transgene silencing in these plants.

3.
Curr Opin Plant Biol ; 5(3): 212-7, 2002 Jun.
Article in English | MEDLINE | ID: mdl-11960738

ABSTRACT

Seed composition is genetically programmed, but the implementation of that program is affected by many factors including the nutrition of the parent plant. In particular, seeds demonstrate a remarkable capacity to maintain nitrogen homeostasis in conditions of varying sulfur supply. They do this by altering the expression of individual genes encoding abundant storage proteins. The signal transduction pathways that modulate gene expression in seeds in response to N and S availability involve both transcriptional and post-transcriptional mechanisms.


Subject(s)
Nitrogen/pharmacology, Plant Proteins/metabolism, Seeds/metabolism, Sulfur/pharmacology, Albumins/metabolism, Globulins/metabolism, Glutens/metabolism, Plant Proteins/drug effects, Plant Proteins/genetics, Prolamins, Protein Processing, Post-Translational, RNA Processing, Post-Transcriptional, Seeds/chemistry, Seeds/drug effects, Signal Transduction, Zea mays/metabolism