
Bedshift: perturbation of genomic interval sets.

Gu, Aaron; Cho, Hyun Jae; Sheffield, Nathan C.

Genome Biol ; 22(1): 238, 2021 08 20.

Article in English | MEDLINE | ID: mdl-34416909

ABSTRACT

Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics, such as that the Jaccard score is most sensitive to added or dropped regions, while coverage score is most sensitive to shifted regions.
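The perturb-then-compare idea in this abstract can be illustrated with a small sketch. This is not Bedshift's actual interface: the function names `perturb`, `covered`, and `jaccard`, all parameters, and the toy chromosome length are hypothetical, and the Jaccard score here is computed over covered base pairs, which is only one of several possible definitions. Regions are `(chrom, start, end)` tuples using the 0-based, half-open coordinates of the BED format.

```python
import random

def perturb(regions, shift_rate=0.0, drop_rate=0.0, add_count=0,
            max_shift=100, add_width=50, chrom_len=10_000, seed=0):
    """Return a perturbed copy of `regions` (hypothetical helper, not Bedshift's API)."""
    rng = random.Random(seed)
    out = []
    for chrom, start, end in regions:
        # Drop the region with probability `drop_rate`.
        if rng.random() < drop_rate:
            continue
        # Shift the region with probability `shift_rate`, keeping start >= 0.
        if rng.random() < shift_rate:
            delta = max(rng.randint(-max_shift, max_shift), -start)
            start, end = start + delta, end + delta
        out.append((chrom, start, end))
    # Add `add_count` random regions on chromosomes seen in the input.
    chroms = sorted({c for c, _, _ in regions})
    for _ in range(add_count if chroms else 0):
        start = rng.randrange(0, chrom_len - add_width)
        out.append((rng.choice(chroms), start, start + add_width))
    return sorted(out)

def covered(regions):
    """Set of (chrom, position) pairs covered by the regions (toy-scale only)."""
    return {(c, p) for c, s, e in regions for p in range(s, e)}

def jaccard(a, b):
    """Base-pair Jaccard score: |A intersect B| / |A union B|."""
    cov_a, cov_b = covered(a), covered(b)
    union = cov_a | cov_b
    return len(cov_a & cov_b) / len(union) if union else 1.0

reference = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 40, 90)]
dropped = perturb(reference, drop_rate=0.5, seed=1)
shifted = perturb(reference, shift_rate=1.0, max_shift=30, seed=1)
print(jaccard(reference, dropped), jaccard(reference, shifted))
```

Enumerating every covered base pair is only practical for toy inputs; a genome-scale comparison would merge and sweep sorted intervals instead.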

Subject(s)

Genome , Genomics/methods , Software , Chromatin Immunoprecipitation Sequencing , HCT116 Cells , Humans

Embeddings of genomic region sets capture rich biological associations in lower dimensions.

Gharavi, Erfaneh; Gu, Aaron; Zheng, Guangtao; Smith, Jason P; Cho, Hyun Jae; Zhang, Aidong; Brown, Donald E; Sheffield, Nathan C.

Bioinformatics ; 37(23): 4299-4306, 2021 12 07.

Article in English | MEDLINE | ID: mdl-34156475

ABSTRACT

MOTIVATION: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis.

RESULTS: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: First, by classifying the cell line, antibody or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data.

AVAILABILITY AND IMPLEMENTATION: https://github.com/databio/regionset-embedding.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
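As a rough illustration of the simpler term frequency-inverse document frequency baseline mentioned in this abstract, the sketch below treats each region set as a "document" and each region identifier in a shared universe as a "term". The abstract does not give the authors' exact formulation, so the binary term frequency, the smoothed IDF, and the names `tfidf_vectors` and `cosine` are all assumptions made for this example; it does not reproduce the word2vec-based method or the code in the linked repository.

```python
import math

def tfidf_vectors(region_sets):
    """Map each region set (a set of region ids) to a TF-IDF vector.

    Assumptions for this sketch: term frequency is binary (a region is
    either present or absent), and IDF is log(N / df) + 1.
    """
    universe = sorted(set().union(*region_sets))
    n_sets = len(region_sets)
    # Document frequency: number of region sets that contain each region.
    df = {r: sum(r in s for s in region_sets) for r in universe}
    idf = {r: math.log(n_sets / df[r]) + 1.0 for r in universe}
    vectors = [[idf[r] if r in s else 0.0 for r in universe]
               for s in region_sets]
    return universe, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical region ids; in practice these would be intervals from a
# shared universe of genomic regions.
sets = [{"r1", "r2", "r3"}, {"r1", "r2"}, {"r4", "r5"}]
universe, vecs = tfidf_vectors(sets)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Each vector has one dimension per region in the universe, which is the high-dimensional representation that the abstract's 100-dimensional embeddings are meant to compress.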

Subject(s)

Genomics , Protein Binding
