Unique strings in a pandas dataframe

2024/9/21 8:26:09

I have the following sample DataFrame df, built from the dict d, consisting of two columns 'col1' and 'col2'. I would like to find the list of unique names for the whole DataFrame.

    import pandas as pd

    d = {'col1': ['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin', 'Roxanne, Padilla',
                  'Julie, Davis', 'Muriel, Howell', 'Salvador, Reese', 'Kristopher, Mckenzie',
                  'Lucille, Thornton', 'Brenda, Wilkerson'],
         'col2': ['Kristopher, Mckenzie', 'Lucille, Thornton',
                  'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis', 'Muriel, Howell',
                  'Harriet, Phillips', 'Belinda, Drake;David, Ford',
                  'Jared, Cummings;Joanna, Burns;Bob, Cunningham',
                  'Keith, Hernandez;Pat, Joseph', 'Kristopher, Mckenzie', 'Lucille, Thornton']}
    df = pd.DataFrame(data=d)

For column col1 I can get it done by using the function unique().

df.col1.unique()
array(['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin', 'Roxanne, Padilla',
       'Julie, Davis', 'Muriel, Howell', 'Salvador, Reese', 'Kristopher, Mckenzie',
       'Lucille, Thornton', 'Brenda, Wilkerson'], dtype=object)
len(df.col1)           # 10 -> total number of rows
len(df.col1.unique())  # 10 -> total number of unique names in col1
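
As an aside (not used in the original post), pandas also has a nunique() method that returns this count directly, without materializing the array of unique values first:

df.col1.nunique()  # number of distinct values in col1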

For col2, some of the rows have multiple names separated by semicolons, e.g. 'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis'.

How can I get the unique names from col2 using a vectorized operation? I am trying to avoid a for loop since the actual data set is large.

Answer

First split by the regex ;\s* (a semicolon followed by zero or more whitespace characters) with expand=True to get a DataFrame, then reshape it with stack into a Series, and finally call unique:

print (df['col2'].str.split(r';\s*', expand=True).stack().unique())
['Kristopher, Mckenzie' 'Lucille, Thornton' 'Pete, Fitzgerald'
 'Cecelia, Bass' 'Julie, Davis' 'Muriel, Howell' 'Harriet, Phillips'
 'Belinda, Drake' 'David, Ford' 'Jared, Cummings' 'Joanna, Burns'
 'Bob, Cunningham' 'Keith, Hernandez' 'Pat, Joseph']

Detail:

print (df['col2'].str.split(r';\s*', expand=True))
                      0               1                2
0  Kristopher, Mckenzie            None             None
1     Lucille, Thornton            None             None
2      Pete, Fitzgerald   Cecelia, Bass     Julie, Davis
3        Muriel, Howell            None             None
4     Harriet, Phillips            None             None
5        Belinda, Drake     David, Ford             None
6       Jared, Cummings   Joanna, Burns  Bob, Cunningham
7      Keith, Hernandez     Pat, Joseph             None
8  Kristopher, Mckenzie            None             None
9     Lucille, Thornton            None             None

print (df['col2'].str.split(r';\s*', expand=True).stack())
0  0    Kristopher, Mckenzie
1  0       Lucille, Thornton
2  0        Pete, Fitzgerald
   1           Cecelia, Bass
   2            Julie, Davis
3  0          Muriel, Howell
4  0       Harriet, Phillips
5  0          Belinda, Drake
   1             David, Ford
6  0         Jared, Cummings
   1           Joanna, Burns
   2         Bob, Cunningham
7  0        Keith, Hernandez
   1             Pat, Joseph
8  0    Kristopher, Mckenzie
9  0       Lucille, Thornton
dtype: object

Alternative solution:

import numpy as np

print (np.unique(np.concatenate(df['col2'].str.split(r';\s*').values)))
['Belinda, Drake' 'Bob, Cunningham' 'Cecelia, Bass' 'David, Ford'
 'Harriet, Phillips' 'Jared, Cummings' 'Joanna, Burns' 'Julie, Davis'
 'Keith, Hernandez' 'Kristopher, Mckenzie' 'Lucille, Thornton'
 'Muriel, Howell' 'Pat, Joseph' 'Pete, Fitzgerald']
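
For completeness, a sketch of another variant (not part of the original answer, and assuming pandas 0.25 or newer): Series.explode gives the same result without building the wide intermediate DataFrame that expand=True creates.

# split each cell into a list of names, turn every list element into its
# own row with explode, then take the unique values (pandas >= 0.25)
print (df['col2'].str.split(r';\s*').explode().unique())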

EDIT:

For all unique names across the whole DataFrame, add a stack first to get a Series from all columns:

print (df.stack().str.split(r';\s*', expand=True).stack().unique())
['Pat, Joseph' 'Kristopher, Mckenzie' 'Tony, Hoffman' 'Lucille, Thornton'
 'Miriam, Goodwin' 'Pete, Fitzgerald' 'Cecelia, Bass' 'Julie, Davis'
 'Roxanne, Padilla' 'Muriel, Howell' 'Harriet, Phillips' 'Belinda, Drake'
 'David, Ford' 'Salvador, Reese' 'Jared, Cummings' 'Joanna, Burns'
 'Bob, Cunningham' 'Keith, Hernandez' 'Brenda, Wilkerson']
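
If this is needed repeatedly, the whole-DataFrame version can be wrapped in a small helper. This is only a sketch, not part of the original answer; the name unique_names and the sep parameter are made up for illustration. Note that stack() drops missing cells, so NaN values in the real data are handled automatically.

import pandas as pd

def unique_names(frame, sep=r';\s*'):
    """Return the unique names found anywhere in a DataFrame of strings.

    Cells may contain several names separated by sep (a regex);
    missing cells are dropped by stack().
    """
    return (frame.stack()                      # all columns -> one Series
                 .str.split(sep, expand=True)  # one name per column
                 .stack()                      # back to a single Series
                 .str.strip()                  # trim stray whitespace
                 .unique())

print (unique_names(df))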