Using defaultdict to parse a multi-delimiter file

2024/10/12 9:28:27

I need to parse a file which has contents that look like this:

20  31022550    G   1396    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:2:60.00:33.00:37.00:2:0:0.02:0.02:40.00:2:0.98:126.00:0.98    C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:1391:60.00:36.08:36.97:719:672:0.51:0.01:7.59:719:0.49:126.00:0.50    T:1:60.00:33.00:37.00:0:1:0.37:0.02:47.00:0:0.00:126.00:0.18    N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +A:2:60.00:0.00:37.00:2:0:0.67:0.01:0.00:2:0.65:126.00:0.65
20  31022551    A   1271    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:960:60.00:35.23:36.99:496:464:0.50:0.00:6.38:496:0.49:126.00:0.52 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:13:60.00:35.00:35.92:4:9:0.13:0.02:44.92:4:0.98:126.00:0.37   T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +G:288:60.00:0.00:37.00:171:117:0.57:0.01:8.17:171:0.54:126.00:0.53 +GG:9:60.00:0.00:37.00:5:4:0.71:0.03:23.67:5:0.50:126.00:0.57   +GGG:1:60.00:0.00:37.00:1:0:0.51:0.03:14.00:1:0.24:126.00:0.24

After parsing, I would like it to look like this:

20  31022550    G   1396    =   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    A   2   60  33  37  2   0   0.02    0.02    40  2   0.98    126
20  31022550    G   1396    C   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    G   1391    60  36.08   36.97   719 672 0.51    0.01    7.59    719 0.49    126
20  31022550    G   1396    T   1   60  33  37  0   1   0.37    0.02    47  0   0   126
20  31022550    G   1396    N   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    +A  2   60  0   37  2   0   0.67    0.01    0   2   0.65    126
20  31022551    A   1271    =   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    A   960 60  35.23   36.99   496 464 0.5 0   6.38    496 0.49    126
20  31022551    A   1271    C   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    G   13  60  35  35.92   4   9   0.13    0.02    44.92   4   0.98    126
20  31022551    A   1271    T   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    N   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    +G  288 60  0   37  171 117 0.57    0.01    8.17    171 0.54    126
20  31022551    A   1271    +GG 9   60  0   37  5   4   0.71    0.03    23.67   5   0.5 126
20  31022551    A   1271    +GGG    1   60  0   37  1   0   0.51    0.03    14  1   0.24    126

The file has many more lines, with column[1] incrementing from 31022550 up to 31022NNN.

Code

What I am trying to do here is print only certain parts of the file with this pseudocode, keeping column[1] as the key:

from collections import defaultdict
ids = defaultdict(list)
with open('~/file.tsv', 'r') as f:
    for line in f:
        lines = line.strip().split('\t')
        pos = (lines[0:3])
        for ele in lines[4:]:
            # print pos
            p = pos[1].strip()
            base = ele.split(':')[0]
            ids[p] = {
                'pos': pos[0].strip(),
                'base': base,
                'count': ele.split(':')[1],
                '_pos': ele.split(':')[5],
                '_neg': ele.split(':')[6]
            }

for k,v in ids.iteritems():
    print k,v

Output

31022550 {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551 {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}

I am not sure why I do not see all the fields that 31022550 holds as key-value pairs.

Answer

Each pass through the inner loop assigns a new dictionary to ids[p], replacing the previous one, so only the last dictionary for each key remains:

ids[p] = {
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
}

This bypasses the factory for new keys altogether; you are just assigning a dictionary value instead. If you wanted to build a list of dictionaries per key, you'd need to use list.append():

ids[p].append({
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
})

This looks up the ids[p] value (which then is created as an empty list if the key does not yet exist), and you then append your dictionary to the end of that list.
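The difference between assigning and appending can be seen in isolation with a small sketch. It is written for Python 3, and the names `overwritten` and `appended` are made up for illustration; the key and base values are borrowed from the sample data above.

```python
from collections import defaultdict

overwritten = defaultdict(list)  # hypothetical name: mimics the question's code
appended = defaultdict(list)     # hypothetical name: mimics the fix

for base in ['=', 'A', '+A']:
    # Plain assignment replaces whatever was stored under the key,
    # so the list factory is never used.
    overwritten['31022550'] = {'base': base}
    # Looking up the key first creates an empty list on first access,
    # and append() then grows that list.
    appended['31022550'].append({'base': base})

print(overwritten['31022550'])    # only the last dictionary is left
print(len(appended['31022550']))  # one entry per base
```

With assignment, the value under the key is a single dictionary; with append(), it is a list holding every dictionary in input order.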

I'd simplify the code somewhat using the csv module to handle splitting of the lines:

import csv
from collections import defaultdict
ids = defaultdict(list)
with open('~/file.tsv', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        pos, key = row[:2]
        for elems in row[4:]:
            elems = elems.split(':')
            ids[key].append({
                'pos': pos,
                'base': elems[0],
                'count': elems[1],
                '_pos': elems[5],
                '_neg': elems[6]
            })

for key, rows in ids.iteritems():
    for row in rows:
        print '{}\t{}'.format(key, row)

This produces:

31022550    {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '2', 'base': 'A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022550    {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '1391', 'base': 'G', 'pos': '20', '_neg': '672', '_pos': '719'}
31022550    {'count': '1', 'base': 'T', 'pos': '20', '_neg': '1', '_pos': '0'}
31022550    {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551    {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '960', 'base': 'A', 'pos': '20', '_neg': '464', '_pos': '496'}
31022551    {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '13', 'base': 'G', 'pos': '20', '_neg': '9', '_pos': '4'}
31022551    {'count': '0', 'base': 'T', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '288', 'base': '+G', 'pos': '20', '_neg': '117', '_pos': '171'}
31022551    {'count': '9', 'base': '+GG', 'pos': '20', '_neg': '4', '_pos': '5'}
31022551    {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}
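If the goal is the flat, one-row-per-base layout shown in the question, a dictionary may not be needed at all. The following is a rough Python 3 sketch under a few assumptions: the function name `explode_rows` is hypothetical, the sample input is a shortened stand-in for the real lines (only the first few colon-separated fields are kept), and values are passed through as strings, so it does not reproduce the number formatting of the desired output (60.00 stays 60.00).

```python
import csv
import io

def explode_rows(handle):
    """Yield one flat row per colon-delimited block.

    Each yielded row is the four leading columns followed by
    the block's fields split on ':'.
    """
    reader = csv.reader(handle, delimiter='\t')
    for row in reader:
        if not row:  # skip blank lines
            continue
        prefix = row[:4]
        for block in row[4:]:
            yield prefix + block.split(':')

# Shortened, illustrative input; the real file has more fields per block.
sample = (
    "20\t31022550\tG\t1396\t=:0:0.00\tA:2:60.00\n"
    "20\t31022551\tA\t1271\t+GG:9:60.00\n"
)

flat = list(explode_rows(io.StringIO(sample)))
for fields in flat:
    print('\t'.join(fields))
```

For a real file, the `io.StringIO(sample)` argument would be replaced by a file object opened in text mode with `newline=''`, which is what the Python 3 csv module expects.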