File-level Deduplication by using text files - Hive integration
2021 International Conference on Computer Communication and Informatics; 2021. Article in English | Web of Science | ID: covidwho-1361866
ABSTRACT
With the enormous increase in data size, detecting duplicate data has become a significant challenge. Eliminating duplicate data is an essential step in data cleaning, as redundant data can degrade a system's performance during data processing. Deduplication addresses this by removing duplicated data at the file or content level, so that only one copy of each file is stored in the database. This paper proposes a technique that tackles both storage and deduplication: the Hadoop Distributed File System (HDFS) is used to store the vast amount of data, and the SHA-256 cryptographic hash algorithm is used to identify duplicates. Finally, HBase, a non-relational distributed database, is used together with Hive integration for data retrieval. A dataset containing counts of COVID-19 tests and results, taken from Data.gov, is used for experimentation. The experimental results show an increase in the deduplication ratio, less time consumed, and a gain in the storage space used.
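The core idea described in the abstract — hashing each file with SHA-256 and storing only one copy per digest — can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the function names (`sha256_of_file`, `deduplicate`) are hypothetical, and the paper's actual system operates over HDFS/HBase rather than a local filesystem.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading in chunks
    so that large files are not loaded into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(paths):
    """File-level deduplication: keep one path per unique digest.

    Returns (unique, duplicates) where `unique` maps each digest to
    the first path seen with that content, and `duplicates` pairs
    each redundant path with the retained copy.
    """
    unique = {}
    duplicates = []
    for path in paths:
        digest = sha256_of_file(path)
        if digest in unique:
            duplicates.append((path, unique[digest]))
        else:
            unique[digest] = path
    return unique, duplicates
```

In a system like the one the paper describes, the digest would serve as the row key in a store such as HBase, so an incoming file whose digest already exists is discarded rather than written again.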
Full text: Available
Collection: Databases of international organizations
Database: Web of Science
Language: English
Journal: 2021 International Conference on Computer Communication and Informatics
Year: 2021
Document Type: Article