Be aware of duplicated (partially overlapping) regions in a BED file when using bedtools intersect


When comparing two sets of genomic features, a common task is to extract the regions they share. bedtools intersect is the tool most often used to find the common regions between two BED files. However, unexpected duplicate regions appear in the output if either input file contains regions that overlap one another. Let’s look at an example:

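# A.bed: the first two records overlap each other (chr1:10-20 and chr1:15-20)
with open('A.bed','w') as f:
    f.write('\n'.join([
        '\t'.join(['chr1','10','20']),
        '\t'.join(['chr1','15','20']),
        '\t'.join(['chr1','30','40']),
    ]))

# B.bed: a single region that overlaps both of those records
with open('B.bed','w') as f:
    f.write('\n'.join([
        '\t'.join(['chr1','15','20']),
    ]))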
!bedtools intersect -a A.bed -b B.bed
chr1	15	20
chr1	15	20

The overlapping region chr1:15-20 is reported twice, once for each of the two records in A.bed that overlap the region in B.bed!
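Before running an intersection, it can help to check whether a file overlaps itself. The snippet below is a minimal sketch in plain Python, not a bedtools feature: the function name `overlapping_records` is hypothetical, and it assumes the records have already been parsed into (chrom, start, end) tuples with half-open BED coordinates.

```python
def overlapping_records(records):
    """Return records that overlap an earlier record on the same chromosome.

    Assumes half-open BED coordinates, so book-ended records
    (one ends exactly where the next starts) do not count as overlapping.
    """
    flagged = []
    current_chrom, current_end = None, -1
    # Sorting by (chrom, start, end) lets us compare each record
    # against the furthest end seen so far on the same chromosome.
    for chrom, start, end in sorted(records):
        if chrom == current_chrom and start < current_end:
            flagged.append((chrom, start, end))
            current_end = max(current_end, end)
        else:
            current_chrom, current_end = chrom, end
    return flagged


# Same records as A.bed and B.bed in the example above
A_records = [("chr1", 10, 20), ("chr1", 15, 20), ("chr1", 30, 40)]
B_records = [("chr1", 15, 20)]

print(overlapping_records(A_records))
print(overlapping_records(B_records))
```

A non-empty result for either input suggests that the intersect output may contain repeated regions.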

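To avoid such redundant regions, we can run bedtools merge before the intersection. Note that bedtools merge expects its input to be sorted by chromosome and then by start position, which A.bed already is. For example: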

!bedtools merge -i A.bed > A.merge.bed
!cat A.merge.bed
chr1	10	20
chr1	30	40
!bedtools intersect -a A.merge.bed -b B.bed
chr1	15	20

Problem solved!
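To see why merging first removes the duplicate, here is a rough pure-Python sketch of the two steps. It is only an illustration of the logic, not how bedtools is implemented: `merge_intervals` and `intersect` are hypothetical helper names, and the merge mimics the default behaviour of combining overlapping and book-ended records.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


def intersect(a, b):
    """Report the shared span for every overlapping pair of records."""
    hits = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if chrom_a == chrom_b and start < end:
                hits.append((chrom_a, start, end))
    return hits


# Same records as A.bed and B.bed in the example above
A = [("chr1", 10, 20), ("chr1", 15, 20), ("chr1", 30, 40)]
B = [("chr1", 15, 20)]

raw_hits = intersect(A, B)         # one hit per overlapping pair
merged_A = merge_intervals(A)      # chr1:10-20 and chr1:15-20 collapse
merged_hits = intersect(merged_A, B)

print(raw_hits)
print(merged_A)
print(merged_hits)
```

Because the intersection reports one hit per overlapping pair of records, collapsing the self-overlapping records in A leaves only one pair, and so only one hit.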

Jingxin Fu, Ph.D.

Research Fellow interested in data mining on cancer genomics