Question 1

So I have two files , one VCF that looks like

88  Chr1    25  C   -   3   2   1   1
88  Chr1    88  A   T   7   2   1   1
88  Chr1    92  A   C   16  4   1   1

and another with genes that looks like

GENEID  Start END
GENE_ID 11 155
GENE_ID 165 999

I want a script that looks if there is a gene position (3rd column of VCF file) within the range of second and third position of the second file and then to print it out.

What I did so far was to join the files and do

awk '{if (3>$12 && $3< $13) print }' > out

What I did only compares current rows of joined files (it only prints if the value is in the same row), how can I make it compare all rows of column 3 to all rows of column 12 and 13?

Best, Serg

Question 2

I hope to help (EDIT i change the code for more efficient algorithm)

gawk '#read input.genes and create list of limits (min, max)NR == FNR {#without header in inputif(NR>1) {for(i=$2; i<=$3; i++){limits[i]=limits[i]","$2"-"$3;}};next}#read input.vcf, if column 3 is range of limits then print{if($3 in limits){print $0, "between("limits[$3]")"}}' input.genes input.vcf

you get:

88  Chr1    25  C   -   3   2   1   1 between(,11-155)
88  Chr1    88  A   T   7   2   1   1 between(,11-155)
88  Chr1    92  A   C   16  4   1   1 between(,11-155)

This algorithm in python is optimized for very large file using dictionaries

limits = [line.strip().split() for line in open("input.genes")]
limits.pop(0) #remove the header
limits = [map(int,v[1:]) for v in limits]dict_limits = {}
for start, finish in limits:for i in xrange(start, finish+1):if i not in dict_limits:dict_limits[i] = []dict_limits[i].append((start,finish))OUTPUT = open("my_output.txt", "w")
for reg in open("input.vcf"):v_reg = reg.strip().split()if int(v_reg[2]) in dict_limits:OUTPUT.write(reg.strip() + "\tbetween({})\n".format(str(dict_limits[int(v_reg[2])])))OUTPUT.close()

you get:

88  Chr1    25  C   -   3   2   1   1   between([(11, 155)])
88  Chr1    88  A   T   7   2   1   1   between([(11, 155)])
88  Chr1    92  A   C   16  4   1   1   between([(11, 155)])

Comparing one column value to all columns in linux enviroment

Related Q&A

Can be saved into a variable one condition?

Pycharm, can not find Python version 2.7.11

Python Convert Binary String to IP Address in Dotted Notation

Scrapy: Extracting data from source and its links

Rename file on upload to admin using Django

Ignore newline character in binary file with Python?

RegEx Parse Error by Parsley Python

Accessing Bangla (UTF-8) string by index in Python

Computing KL divergence for many distributions

Expanding mean over multiple series in pandas