So I have two files , one VCF that looks like
88 Chr1 25 C - 3 2 1 1
88 Chr1 88 A T 7 2 1 1
88 Chr1 92 A C 16 4 1 1
and another with genes that looks like
GENEID Start END
GENE_ID 11 155
GENE_ID 165 999
I want a script that looks if there is a gene position (3rd column of VCF file) within the range of second and third position of the second file and then to print it out.
What I did so far was to join the files and do
awk '{if (3>$12 && $3< $13) print }' > out
What I did only compares current rows of joined files (it only prints if the value is in the same row), how can I make it compare all rows of column 3 to all rows of column 12 and 13?
Best,
Serg
I hope to help (EDIT i change the code for more efficient algorithm)
gawk '#read input.genes and create list of limits (min, max)NR == FNR {#without header in inputif(NR>1) {for(i=$2; i<=$3; i++){limits[i]=limits[i]","$2"-"$3;}};next}#read input.vcf, if column 3 is range of limits then print{if($3 in limits){print $0, "between("limits[$3]")"}}' input.genes input.vcf
you get:
88 Chr1 25 C - 3 2 1 1 between(,11-155)
88 Chr1 88 A T 7 2 1 1 between(,11-155)
88 Chr1 92 A C 16 4 1 1 between(,11-155)
This algorithm in python is optimized for very large file using dictionaries
limits = [line.strip().split() for line in open("input.genes")]
limits.pop(0) #remove the header
limits = [map(int,v[1:]) for v in limits]dict_limits = {}
for start, finish in limits:for i in xrange(start, finish+1):if i not in dict_limits:dict_limits[i] = []dict_limits[i].append((start,finish))OUTPUT = open("my_output.txt", "w")
for reg in open("input.vcf"):v_reg = reg.strip().split()if int(v_reg[2]) in dict_limits:OUTPUT.write(reg.strip() + "\tbetween({})\n".format(str(dict_limits[int(v_reg[2])])))OUTPUT.close()
you get:
88 Chr1 25 C - 3 2 1 1 between([(11, 155)])
88 Chr1 88 A T 7 2 1 1 between([(11, 155)])
88 Chr1 92 A C 16 4 1 1 between([(11, 155)])