I need to parse a file which has contents that look like this:
20 31022550 G 1396 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:2:60.00:33.00:37.00:2:0:0.02:0.02:40.00:2:0.98:126.00:0.98 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:1391:60.00:36.08:36.97:719:672:0.51:0.01:7.59:719:0.49:126.00:0.50 T:1:60.00:33.00:37.00:0:1:0.37:0.02:47.00:0:0.00:126.00:0.18 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 +A:2:60.00:0.00:37.00:2:0:0.67:0.01:0.00:2:0.65:126.00:0.65
20 31022551 A 1271 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:960:60.00:35.23:36.99:496:464:0.50:0.00:6.38:496:0.49:126.00:0.52 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:13:60.00:35.00:35.92:4:9:0.13:0.02:44.92:4:0.98:126.00:0.37 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 +G:288:60.00:0.00:37.00:171:117:0.57:0.01:8.17:171:0.54:126.00:0.53 +GG:9:60.00:0.00:37.00:5:4:0.71:0.03:23.67:5:0.50:126.00:0.57 +GGG:1:60.00:0.00:37.00:1:0:0.51:0.03:14.00:1:0.24:126.00:0.24
After parsing, I want the output to look like this:
20 31022550 G 1396 = 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 A 2 60 33 37 2 0 0.02 0.02 40 2 0.98 126
20 31022550 G 1396 C 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 G 1391 60 36.08 36.97 719 672 0.51 0.01 7.59 719 0.49 126
20 31022550 G 1396 T 1 60 33 37 0 1 0.37 0.02 47 0 0 126
20 31022550 G 1396 N 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 +A 2 60 0 37 2 0 0.67 0.01 0 2 0.65 126
20 31022551 A 1271 = 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 A 960 60 35.23 36.99 496 464 0.5 0 6.38 496 0.49 126
20 31022551 A 1271 C 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 G 13 60 35 35.92 4 9 0.13 0.02 44.92 4 0.98 126
20 31022551 A 1271 T 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 N 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 +G 288 60 0 37 171 117 0.57 0.01 8.17 171 0.54 126
20 31022551 A 1271 +GG 9 60 0 37 5 4 0.71 0.03 23.67 5 0.5 126
20 31022551 A 1271 +GGG 1 60 0 37 1 0 0.51 0.03 14 1 0.24 126
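The transformation from the input above to this layout could be sketched roughly as follows. This is only a sketch under assumptions read off the example: the first four columns are repeated on every row, only the first 12 values of each colon-separated field are kept (the example output omits the last one), trailing zeros are trimmed, and the output is space-separated. The names `flatten_line`, `trim_zeros` and `sample_line` are hypothetical, and the sketch uses Python 3.

```python
def trim_zeros(value):
    """Trim trailing zeros from a decimal string: '60.00' -> '60', '0.50' -> '0.5'."""
    if '.' in value:
        value = value.rstrip('0').rstrip('.')
    return value


def flatten_line(line, n_values=12):
    """Turn one tab-separated input line into one output row per base field."""
    cols = line.strip().split('\t')
    prefix = cols[:4]                      # chromosome, position, reference base, depth
    rows = []
    for field in cols[4:]:
        parts = field.split(':')
        base = parts[0]
        # Assumption: keep only the first n_values numbers, as in the example output.
        values = [trim_zeros(v) for v in parts[1:1 + n_values]]
        rows.append(' '.join(prefix + [base] + values))
    return rows


# Two fields taken from the first example line (the remaining fields are omitted here).
sample_line = (
    "20\t31022550\tG\t1396\t"
    "A:2:60.00:33.00:37.00:2:0:0.02:0.02:40.00:2:0.98:126.00:0.98\t"
    "T:1:60.00:33.00:37.00:0:1:0.37:0.02:47.00:0:0.00:126.00:0.18"
)

for row in flatten_line(sample_line):
    print(row)
```

Trimming is done on the strings rather than by converting to `float`, so integer counts are left exactly as they appear in the file.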
The file has more lines, where the position in column[1] increments:
31022550...31022NNN
Code
What I am trying to do here is print only certain parts of the file with this pseudocode, keeping column[1] as the key:
from collections import defaultdict
ids = defaultdict(list)
with open('~/file.tsv', 'r') as f:
    for line in f:
        lines = line.strip().split('\t')
        pos = (lines[0:3])
        for ele in lines[4:]:
            # print pos
            p = pos[1].strip()
            base = ele.split(':')[0]
            ids[p] = {'pos': pos[0].strip(),
                      'base': base,
                      'count': ele.split(':')[1],
                      '_pos': ele.split(':')[5],
                      '_neg': ele.split(':')[6]}

for k, v in ids.iteritems():
    print k, v
Output
31022550 {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551 {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}
I am not sure why I do not see all the fields that 31022550 holds as key-value pairs; only the last field of each line (+A, +GGG) shows up.
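A likely explanation, offered as a sketch rather than a confirmed diagnosis: `ids[p] = {...}` assigns a new dict to the key on every pass through the inner loop, so each base replaces the previous one and only the last field of the line survives. Since `ids` is a `defaultdict(list)`, appending to `ids[p]` instead would keep one record per base. The function name `collect_fields` and the shortened sample line are hypothetical, and the sketch uses Python 3.

```python
from collections import defaultdict


def collect_fields(lines):
    """Group per-base records by position (column[1]), keeping every record."""
    ids = defaultdict(list)
    for line in lines:
        cols = line.strip().split('\t')
        chrom, pos = cols[0].strip(), cols[1].strip()
        for ele in cols[4:]:
            parts = ele.split(':')
            # append() adds to the list for this position;
            # plain assignment (ids[pos] = {...}) would overwrite it each time.
            ids[pos].append({
                'pos': chrom,
                'base': parts[0],
                'count': parts[1],
                '_pos': parts[5],
                '_neg': parts[6],
            })
    return ids


# Shortened sample: two fields, each cut to its first seven values for brevity.
sample_lines = [
    "20\t31022550\tG\t1396\t"
    "A:2:60.00:33.00:37.00:2:0:0.02\t"
    "G:1391:60.00:36.08:36.97:719:672:0.51",
]

for key, records in collect_fields(sample_lines).items():
    for record in records:
        print(key, record)
```

With this change each position maps to a list of dicts, one per base, instead of a single dict.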