Question 1

Given a text file like this:

a;b
b;a
c;d
d;c
e;a
f;g
h;b
b;f
b;f
c;g
a;b
d;f

How can one read it and produce two output text files: one keeping only the lines that represent the most frequently occurring couple for each letter, and one keeping all the couples that include any of the top 25% most commonly occurring letters?

Sorry for not sharing any code. I have been trying lots of things with list comprehensions, counts, and pandas, but I am not fluent enough.

Question 2

Here is an answer without frozenset.

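import pandas as pd
df = pd.read_csv('text.csv', header=None, names=['A', 'B'], sep=';')
# Sort each row so that a;b and b;a become the same couple.
# result_type='broadcast' keeps the A/B columns (plain apply returns a Series of lists in pandas 0.23+).
df1 = df.apply(sorted, axis=1, result_type='broadcast')
df_count = df1.groupby(['A', 'B']).size().reset_index().sort_values(0, ascending=False)
df_count.columns = ['A', 'B', 'Count']
# One row per (couple, letter), with the most frequent couple first within each letter
df_all = pd.concat([df_count.assign(letter=lambda x: x['A']), df_count.assign(letter=lambda x: x['B'])]).sort_values(['letter', 'Count'], ascending=[True, False])
df_first = df_all.groupby(['letter']).first().reset_index()
# Top quarter of the couples by count
top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]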

------------older answer --------

Since order does not matter (a;b and b;a are the same couple), you can use a frozenset as the key to count on.

import pandas as pd
df = pd.read_csv('text.csv', header=None, names=['A','B'], sep=';')
s = df.apply(frozenset, 1)
df_count = s.value_counts().reset_index()
df_count.columns = ['Combos', 'Count']

Which will give you this

   Combos  Count
0  (a, b)      3
1  (b, f)      2
2  (d, c)      2
3  (g, f)      1
4  (b, h)      1
5  (c, g)      1
6  (d, f)      1
7  (e, a)      1
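
For comparison, here is a minimal plain-Python sketch of the same counting step, using collections.Counter instead of pandas. The `lines` list is just the sample data from the question typed in by hand (an assumption for illustration; in practice it would be read from the file).

```python
from collections import Counter

# Sample data from the question, hard-coded for illustration
lines = ["a;b", "b;a", "c;d", "d;c", "e;a", "f;g",
         "h;b", "b;f", "b;f", "c;g", "a;b", "d;f"]

# A frozenset ignores order, so "a;b" and "b;a" map to the same key
pair_counts = Counter(frozenset(line.split(";")) for line in lines)

# most_common() lists the couples from most to least frequent
for pair, count in pair_counts.most_common():
    print(sorted(pair), count)
```

The counts match the table above: the a/b couple occurs 3 times, b/f and c/d twice, and the rest once.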

To get the highest combo for each letter we will concatenate this dataframe on top of itself and make another column that will hold either the first or second letter.

df_a = df_count.copy()
df_b = df_count.copy()
df_a['letter'] = df_a['Combos'].apply(lambda x: list(x)[0])
df_b['letter'] = df_b['Combos'].apply(lambda x: list(x)[1])
df_all = pd.concat([df_a, df_b]).sort_values(['letter', 'Count'], ascending=[True, False])

And since this is sorted by letter and count (descending) just get the first row of each group.

df_first = df_all.groupby('letter').first()
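
The "sort, then take the first row per letter" idea can also be sketched without pandas. This is only an illustration under a few assumptions: the sample data is hard-coded, couples are stored as sorted tuples, and ties are broken by whichever couple appears first in the input. The names `pair_counts` and `best` are made up for this sketch.

```python
from collections import Counter

# Sample data from the question, hard-coded for illustration
lines = ["a;b", "b;a", "c;d", "d;c", "e;a", "f;g",
         "h;b", "b;f", "b;f", "c;g", "a;b", "d;f"]

# Sorted tuples make "a;b" and "b;a" the same couple
pair_counts = Counter(tuple(sorted(line.split(";"))) for line in lines)

# Walk the couples from most to least frequent; the first couple seen
# for a letter is therefore that letter's most frequent couple.
best = {}
for pair, count in pair_counts.most_common():
    for letter in pair:
        best.setdefault(letter, (pair, count))

for letter in sorted(best):
    print(letter, best[letter])
```

Here `best['f']` ends up as the b/f couple with a count of 2, while letters such as g, whose couples all occur once, get whichever of them came first.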

To get the top 25% of couples by count (df_count is already sorted in descending order), use

top = int(len(df_count) / 4)
df_top_25 = df_count.iloc[:top]
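
Note that the slice above takes the top quarter of the couples. If the goal is read literally as "couples that include any of the top 25% most common letters", a rough plain-Python sketch could look like the following. It makes several assumptions not stated in the question: letters are counted once per line they appear in, the quarter is rounded down (with a minimum of one letter), and ties are broken by first appearance.

```python
from collections import Counter

# Sample data from the question, hard-coded for illustration
lines = ["a;b", "b;a", "c;d", "d;c", "e;a", "f;g",
         "h;b", "b;f", "b;f", "c;g", "a;b", "d;f"]

# Count how often each letter occurs across all lines
letter_counts = Counter(letter for line in lines for letter in line.split(";"))

# Assumed cutoff: a quarter of the distinct letters, at least one
n_top = max(1, len(letter_counts) // 4)
top_letters = {letter for letter, _ in letter_counts.most_common()[:n_top]}

# Keep every line that contains at least one of the top letters
kept = [line for line in lines if top_letters & set(line.split(";"))]
print(sorted(top_letters), kept)
```

With the sample data there are 8 distinct letters, so two are kept: b (6 occurrences) and a (4 occurrences, tied with f but seen first). A different tie rule would change which lines survive.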

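Then use .to_csv to write each result to a file, for example df_first.to_csv('first.csv', sep=';') (the file name here is only an example).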

Counting line frequencies and producing output files
