Comparing one column value to all columns in a Linux environment

2024/10/12 18:20:26

I have two files: a VCF that looks like

88  Chr1    25  C   -   3   2   1   1
88  Chr1    88  A   T   7   2   1   1
88  Chr1    92  A   C   16  4   1   1

and another with genes that looks like

GENEID  Start END
GENE_ID 11 155
GENE_ID 165 999

I want a script that checks whether a position (3rd column of the VCF file) falls within the range given by the second and third columns of the second file, and prints the line if it does.

What I did so far was to join the files and do

awk '{if ($3 > $12 && $3 < $13) print }' > out

This only compares values within the same row of the joined file (it prints only if the matching range is in the same row). How can I compare every row of column 3 against every row of columns 12 and 13?

Best, Serg

Answer

I hope this helps. (EDIT: I changed the code to use a more efficient algorithm.)

gawk '
# Read input.genes and build a list of limits (min, max) for each position
NR == FNR {
    # Skip the header line of input.genes
    if (NR > 1) {
        for (i = $2; i <= $3; i++) {
            limits[i] = limits[i] "," $2 "-" $3
        }
    }
    next
}
# Read input.vcf; if column 3 is inside any range of limits, print the line
{
    if ($3 in limits) {
        print $0, "between(" limits[$3] ")"
    }
}' input.genes input.vcf

You get:

88  Chr1    25  C   -   3   2   1   1 between(,11-155)
88  Chr1    88  A   T   7   2   1   1 between(,11-155)
88  Chr1    92  A   C   16  4   1   1 between(,11-155)

The same algorithm in Python, using a dictionary so that each position lookup stays fast on very large files:

# Read the gene ranges, skipping the header line
with open("input.genes") as genes:
    next(genes)
    limits = [tuple(map(int, line.split()[1:3])) for line in genes if line.strip()]

# Map every position covered by a range to the list of ranges containing it
dict_limits = {}
for start, finish in limits:
    for i in range(start, finish + 1):
        dict_limits.setdefault(i, []).append((start, finish))

# Write each VCF line whose position (column 3) falls inside some range
with open("input.vcf") as vcf, open("my_output.txt", "w") as output:
    for reg in vcf:
        v_reg = reg.split()
        if v_reg and int(v_reg[2]) in dict_limits:
            output.write("{}\tbetween({})\n".format(reg.strip(), dict_limits[int(v_reg[2])]))

You get:

88  Chr1    25  C   -   3   2   1   1   between([(11, 155)])
88  Chr1    88  A   T   7   2   1   1   between([(11, 155)])
88  Chr1    92  A   C   16  4   1   1   between([(11, 155)])
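
Both scripts above store one entry per position inside every gene range, which can use a lot of memory when ranges are long. As a rough alternative sketch (not part of the original answer), each VCF position can instead be compared directly against every (start, end) pair, which is the "all rows against all rows" comparison the question asks for. The function name find_overlaps is hypothetical, the bounds are assumed to be inclusive as in the scripts above, and the fourth VCF row (position 160) is a made-up example added only to show a non-match.

```python
def find_overlaps(vcf_lines, gene_lines):
    """Return (vcf_line, hits) for each VCF row whose position (3rd column)
    lies inside at least one gene range; hits is a list of (id, start, end)."""
    genes = []
    for line in gene_lines[1:]:  # assumes the first line is a header
        fields = line.split()
        if len(fields) >= 3:
            genes.append((fields[0], int(fields[1]), int(fields[2])))

    results = []
    for line in vcf_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        pos = int(fields[2])
        # Compare this position against every gene range (inclusive bounds)
        hits = [g for g in genes if g[1] <= pos <= g[2]]
        if hits:
            results.append((line.strip(), hits))
    return results


vcf = [
    "88  Chr1    25  C   -   3   2   1   1",
    "88  Chr1    88  A   T   7   2   1   1",
    "88  Chr1    92  A   C   16  4   1   1",
    "88  Chr1    160 A   G   5   2   1   1",  # hypothetical row, falls between the two genes
]
genes = ["GENEID  Start END", "GENE_ID 11 155", "GENE_ID 165 999"]

for row, hits in find_overlaps(vcf, genes):
    print(row, hits)
```

This does work proportional to (number of VCF rows) x (number of genes), so it trades speed for memory. For a large gene list, sorting the ranges and using a binary search or an interval tree would likely be a better fit.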