Finding common IDs (intersection) in two dictionaries

2024/10/14 12:26:46

I wrote a piece of code that is supposed to find the common IDs (the intersection) in line[1] of two different files. It works fine on my small sample files, but not on my bigger files. I cannot figure out why; can you suggest what is wrong? The exact problem: when my input is, e.g., 200, it gives me 90 intersections, but if I reduce it to 150, it gives me 110 intersections. Logically, it cannot be higher.

fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in set(dictA).intersection(dictB):
    output.write(dictB[key][0]+'\t'+dictA[key][1]+'\t'+dictA[key][4]+'\t'+dictA[key][5]+'\t'+dictA[key][9]+'\t'+dictA[key][10]+'\n')

My file1 is sorted by line[0] and has columns line[0] to line[15]. To keep it simple, the example below shows real values only for line[0] and line[1]:

contig17    GRMZM2G052619_P03  x x x x x x x x x x x x x x
contig33    AT2G41790.1    x x x x x x x x x x x x x x
contig98    GRMZM5G888620_P01  x x x x x x x x x x x x x x  
contig102   GRMZM5G886789_P02  x x x x x x x x x x x x x x  
contig123   AT3G57470.1    x x x x x x x x x x x x x x

My file2 is not sorted and has columns line[0] to line[10]. I show a real value only for line[1]:

y GRMZM2G052619_P03 y y y y y y y y         
y GRMZM5G888620_P01 y y y y y y y y     
y GRMZM5G886789_P02 y y y y y y y y     

My desired output:

contig17    GRMZM2G052619_P03  y y y y
contig98    GRMZM5G888620_P01  y y y y  
contig102   GRMZM5G886789_P02  y y y y  

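Pay attention to this:

output.write(dictB[key][0]+'\t'+dictA[key][1]+'\t'+dictA[key][4]+'\t'+dictA[key][5]+'\t'+dictA[key][9]+'\t'+dictA[key][10]+'\n')

It means you print the first column of file2, then the second column of file1. That does not match your examples and desired output.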

As for the intersection routine, it looks correct, so the problem is probably in your files. Are you sure all keys are unique? And what do you mean by "reduce to 150": do you mean just deleting some lines from this very file?
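One way to check whether the keys really are unique is to count how often each ID occurs in the key column. The following is only a minimal sketch: the helper name duplicate_keys and its key_index parameter are made up for illustration, and it assumes tab-separated lines as in your code.

```python
from collections import Counter

def duplicate_keys(lines, key_index=1):
    """Return {key: count} for every key that occurs more than once.

    `lines` is any iterable of tab-separated strings (an open file works).
    `key_index` is the column used as the dictionary key (hypothetical
    parameter; 1 matches listA[1] / listB[1] in the question).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # Skip lines that are too short to contain the key column.
        if len(fields) > key_index:
            counts[fields[key_index]] += 1
    return {key: n for key, n in counts.items() if n > 1}

# Possible usage with the files from the question:
#     with open("file1.txt") as f:
#         print(duplicate_keys(f))
```

If this returns a non-empty dict for either file, then later lines overwrite earlier ones in dictA or dictB, and the number of dictionary entries is smaller than the number of lines in the file.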

Also, it is better to replace

for key in set(dictA).intersection(dictB):

with

for key in dictA:
    if key in dictB:

It finds the same keys, but it should be faster and use less memory.
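Put together, the lookup could be sketched like this. The function names load_rows and common_rows are hypothetical, the input is assumed to be tab-separated, and the sample lines are shortened versions of the ones in the question.

```python
def load_rows(lines, key_index=1):
    """Map the field at key_index to the full list of fields of that line."""
    rows = {}
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > key_index:
            # A repeated key silently overwrites the earlier row.
            rows[fields[key_index]] = fields
    return rows

def common_rows(rows_a, rows_b):
    """Yield (key, row_a, row_b) for every key present in both dicts."""
    for key in rows_a:
        if key in rows_b:
            yield key, rows_a[key], rows_b[key]

# Shortened sample data based on the question (only two or three columns).
sample_a = [
    "contig17\tGRMZM2G052619_P03\tx\n",
    "contig33\tAT2G41790.1\tx\n",
    "contig98\tGRMZM5G888620_P01\tx\n",
]
sample_b = [
    "y\tGRMZM5G888620_P01\ty\n",
    "y\tGRMZM2G052619_P03\ty\n",
]

dictA = load_rows(sample_a)
dictB = load_rows(sample_b)
for key, row_a, row_b in common_rows(dictA, dictB):
    # Column 0 of file1, then the shared ID.
    print(row_a[0] + '\t' + key)
```

One difference from the set version: iterating over dictA visits the keys in the order they were read from file1 (dicts keep insertion order in Python 3.7+), while the order of a set is arbitrary.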

