Counting line frequencies and producing output files

2024/10/11 0:27:25

With a textfile like this:

a;b
b;a
c;d
d;c
e;a
f;g
h;b
b;f
b;f
c;g
a;b
d;f

How can one read it, and produce two output text files: one keeping only the lines representing the most often occurring couple for each letter; and one keeping all the couples that include any of the top 25% of most commonly occurring letters.

Sorry for not sharing any code. Been trying lots of stuff with list comprehensions, counts, and pandas, but not fluent enough.

Answer

Here is an answer without frozen set.

df1 = df.apply(sorted, 1)
df_count =df1.groupby(['A', 'B']).size().reset_index().sort_values(0, ascending=False)
df_count.columns = ['A', 'B', 'Count']df_all = pd.concat([df_count.assign(letter=lambda x: x['A']), df_count.assign(letter=lambda x: x['B'])]).sort_values(['letter', 'Count'], ascending =[True, False])df_first = df_all.groupby(['letter']).first().reset_index()top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]

------------older answer --------

Since order matters you can use a frozen set as the key to a groupby

import pandas as pd
df = pd.read_csv('text.csv', header=None, names=['A','B'], sep=';')
s = df.apply(frozenset, 1)
df_count = s.value_counts().reset_index()
df_count.columns = ['Combos', 'Count']

Which will give you this

   Combos  Count
0  (a, b)      3
1  (b, f)      2
2  (d, c)      2
3  (g, f)      1
4  (b, h)      1
5  (c, g)      1
6  (d, f)      1
7  (e, a)      1

To get the highest combo for each letter we will concatenate this dataframe on top of itself and make another column that will hold either the first or second letter.

df_a = df_count.copy()
df_b = df_count.copy()df_a['letter'] = df_a['Combos'].apply(lambda x: list(x)[0])
df_b['letter'] = df_b['Combos'].apply(lambda x: list(x)[1])df_all = pd.concat([df_a, df_b]).sort_values(['letter', 'Count'], ascending =[True, False])

And since this is sorted by letter and count (descending) just get the first row of each group.

df_first = df_all.groupby('letter').first()

And to get the top 25%, just use

top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]

And then use .to_csv to output to file.

https://en.xdnf.cn/q/118389.html

Related Q&A

Check if parent dict is not empty and retrieve the value of the nested dict

Lets suppose that I have a nested dictionary which looks like that:parent_dict = { parent_key: {child_key: child_value}How can I write the following code:if parent_dict.get(parent_key) is not None and …

List combinations in defined range

I am writing parallel rainbow tables generator using parallel python and multiple machines. So far, I have it working on a single machine. It creates all possible passwords, hashes them, saves to file.…

Python turtle drawing a symbol

import turtlewin=turtle.Screen()t = turtle.Turtle() t.width(5)#The vertical and horizontal lines t.left(90) t.forward(70) t.left(90) t.forward(20)t.left(90) t.forward(60) t.left(120) t.forward(35) t.b…

Display a countdown for the python sleep function in discord embed in python

hi all I am doing one discord bot I need to send one countdown its like a cooldown embed after every request I did this code but I dont know how to add this in my embedfor i in range(60,0,-1):print(f&q…

Bypass rate limit for requests.get

I want to constantly scrape a website - once every 3-5 seconds withrequests.get(http://www.example.com, headers=headers2, timeout=35).json()But the example website has a rate limit and I want to bypass…

ValueError when using if commands in function

Im creating some functions that I can call use keywords to call out specific functions,import scipy.integrate as integrate import numpy as npdef HubbleParam(a, model = "None"):if model == &qu…

Python consecutive subprocess calls with adb

I am trying to make a python script to check the contents of a database via adb. The thing is that in my code,only the first subprocess.call() is executed and the rest are ignored. Since i am fairly ne…

Django Page not found(404) error (Library not found)

This is my music\urls.py code:-#/music/ url(r^/$, views.index, name=index),#/music/712/ url(r^(?P<album_id>[0-9]+)/$, views.detail, name=detail),And this is my views.py code:-def index(request):…

Django. Create object ManyToManyField error

I am trying to write tests for my models.I try to create object like this:GiftEn.objects.create(gift_id=1,name="GiftEn",description="GiftEn description",short_description="Gift…

No module named discord

Im creating a discord bot, but when I try to import discord, I am getting this error: Traceback (most recent call last):File "C:\Users\Someone\Desktop\Discord bot\bot.py", line 2, in <modu…