Unique strings in a pandas dataframe

2024/9/21 8:26:09

I have the following sample DataFrame df, built from the dict d, consisting of two columns 'col1' and 'col2'. I would like to find the list of unique names for the whole DataFrame.

    import pandas as pd

    d = {'col1': ['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin', 'Roxanne, Padilla',
                  'Julie, Davis', 'Muriel, Howell', 'Salvador, Reese', 'Kristopher, Mckenzie',
                  'Lucille, Thornton', 'Brenda, Wilkerson'],
         'col2': ['Kristopher, Mckenzie', 'Lucille, Thornton',
                  'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis', 'Muriel, Howell',
                  'Harriet, Phillips', 'Belinda, Drake;David, Ford',
                  'Jared, Cummings;Joanna, Burns;Bob, Cunningham',
                  'Keith, Hernandez;Pat, Joseph', 'Kristopher, Mckenzie', 'Lucille, Thornton']}
    df = pd.DataFrame(data=d)

For column col1 I can get it done by using the function unique().

df.col1.unique()
array(['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin', 'Roxanne, Padilla',
       'Julie, Davis', 'Muriel, Howell', 'Salvador, Reese', 'Kristopher, Mckenzie',
       'Lucille, Thornton', 'Brenda, Wilkerson'], dtype=object)
len(df.col1)           # 10 -> total number of rows
len(df.col1.unique())  # 10 -> total number of unique names in col1
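
As an aside (not used in the original post), pandas also has a nunique() method that returns this count directly, without materializing the array of unique values first:

df.col1.nunique()  # number of distinct values in col1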

For col2, some of the rows have multiple names separated by semicolons, e.g. 'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis'.

How can I get the unique names from col2 using a vectorized operation? I am trying to avoid a for loop since the actual data set is large.

Answer

First split by the regex ;\s* (a semicolon followed by zero or more whitespace characters) with expand=True to get a DataFrame, then reshape it with stack into a Series, and finally call unique:

print (df['col2'].str.split(r';\s*', expand=True).stack().unique())
['Kristopher, Mckenzie' 'Lucille, Thornton' 'Pete, Fitzgerald'
 'Cecelia, Bass' 'Julie, Davis' 'Muriel, Howell' 'Harriet, Phillips'
 'Belinda, Drake' 'David, Ford' 'Jared, Cummings' 'Joanna, Burns'
 'Bob, Cunningham' 'Keith, Hernandez' 'Pat, Joseph']

Detail:

print (df['col2'].str.split(r';\s*', expand=True))
                      0               1                2
0  Kristopher, Mckenzie            None             None
1     Lucille, Thornton            None             None
2      Pete, Fitzgerald   Cecelia, Bass     Julie, Davis
3        Muriel, Howell            None             None
4     Harriet, Phillips            None             None
5        Belinda, Drake     David, Ford             None
6       Jared, Cummings   Joanna, Burns  Bob, Cunningham
7      Keith, Hernandez     Pat, Joseph             None
8  Kristopher, Mckenzie            None             None
9     Lucille, Thornton            None             None

print (df['col2'].str.split(r';\s*', expand=True).stack())
0  0    Kristopher, Mckenzie
1  0       Lucille, Thornton
2  0        Pete, Fitzgerald
   1           Cecelia, Bass
   2            Julie, Davis
3  0          Muriel, Howell
4  0       Harriet, Phillips
5  0          Belinda, Drake
   1             David, Ford
6  0         Jared, Cummings
   1           Joanna, Burns
   2         Bob, Cunningham
7  0        Keith, Hernandez
   1             Pat, Joseph
8  0    Kristopher, Mckenzie
9  0       Lucille, Thornton
dtype: object

Alternative solution:

import numpy as np

print (np.unique(np.concatenate(df['col2'].str.split(r';\s*').values)))
['Belinda, Drake' 'Bob, Cunningham' 'Cecelia, Bass' 'David, Ford'
 'Harriet, Phillips' 'Jared, Cummings' 'Joanna, Burns' 'Julie, Davis'
 'Keith, Hernandez' 'Kristopher, Mckenzie' 'Lucille, Thornton'
 'Muriel, Howell' 'Pat, Joseph' 'Pete, Fitzgerald']
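
For completeness, a sketch of another variant (not part of the original answer, and assuming pandas 0.25 or newer): Series.explode gives the same result without building the wide intermediate DataFrame that expand=True creates.

# split each cell into a list of names, turn every list element into its
# own row with explode, then take the unique values (pandas >= 0.25)
print (df['col2'].str.split(r';\s*').explode().unique())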

EDIT:

For all unique names across the whole DataFrame, add a stack first to get a Series from all columns:

print (df.stack().str.split(r';\s*', expand=True).stack().unique())
['Pat, Joseph' 'Kristopher, Mckenzie' 'Tony, Hoffman' 'Lucille, Thornton'
 'Miriam, Goodwin' 'Pete, Fitzgerald' 'Cecelia, Bass' 'Julie, Davis'
 'Roxanne, Padilla' 'Muriel, Howell' 'Harriet, Phillips' 'Belinda, Drake'
 'David, Ford' 'Salvador, Reese' 'Jared, Cummings' 'Joanna, Burns'
 'Bob, Cunningham' 'Keith, Hernandez' 'Brenda, Wilkerson']
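
If this is needed repeatedly, the whole-DataFrame version can be wrapped in a small helper. This is only a sketch, not part of the original answer; the name unique_names and the sep parameter are made up for illustration. Note that stack() drops missing cells, so NaN values in the real data are handled automatically.

import pandas as pd

def unique_names(frame, sep=r';\s*'):
    """Return the unique names found anywhere in a DataFrame of strings.

    Cells may contain several names separated by sep (a regex);
    missing cells are dropped by stack().
    """
    return (frame.stack()                      # all columns -> one Series
                 .str.split(sep, expand=True)  # one name per column
                 .stack()                      # back to a single Series
                 .str.strip()                  # trim stray whitespace
                 .unique())

print (unique_names(df))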