File-level Deduplication by using text files - Hive integration
2021 International Conference on Computer Communication and Informatics; 2021. Article in English | Web of Science | ID: covidwho-1361866
ABSTRACT
With the enormous increase in data size, detecting duplicate data has become a significant challenge. Eliminating duplicate data is an essential step in data cleaning, as redundant data can degrade a system's performance during data processing. Deduplication addresses this by removing duplicated data at the file or content level, so that only one copy of each file is stored in the database. This paper proposes a technique that tackles both storage and deduplication: the Hadoop Distributed File System (HDFS) is used to store the vast amount of data, and the SHA-256 cryptographic hash algorithm is used to identify duplicates. Finally, HBase, a non-relational distributed database, is used together with Hive integration for data retrieval. A dataset containing counts of COVID-19 tests and results, taken from Data.gov, is used for experimentation. The experimental results show an increase in the deduplication ratio, less time consumed, and a gain in the storage space used.
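The core idea described in the abstract — hashing each file with SHA-256 and storing only one copy per digest — can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the function names (`sha256_of_file`, `deduplicate`) are hypothetical, and the paper's actual system operates over HDFS/HBase rather than a local filesystem.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading in chunks
    so that large files are not loaded into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(paths):
    """File-level deduplication: keep one path per unique digest.

    Returns (unique, duplicates) where `unique` maps each digest to
    the first path seen with that content, and `duplicates` pairs
    each redundant path with the retained copy.
    """
    unique = {}
    duplicates = []
    for path in paths:
        digest = sha256_of_file(path)
        if digest in unique:
            duplicates.append((path, unique[digest]))
        else:
            unique[digest] = path
    return unique, duplicates
```

In a system like the one the paper describes, the digest would serve as the row key in a store such as HBase, so an incoming file whose digest already exists is discarded rather than written again.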
Full text: Available
Collection: Databases of international organizations
Database: Web of Science
Language: English
Journal: 2021 International Conference on Computer Communication and Informatics
Year: 2021
Document Type: Article