Use Pandas string method contains on a Series containing lists of strings

2024/10/3 10:46:15

Given a simple Pandas Series that contains some strings which can consist of more than one sentence:

In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])

Out:
0    This is a long text. It has multiple sentences.
1                Do you see? More than one sentence!
2             This one has only one sentence though.
dtype: object

I use the pandas string method split with a regex pattern to split each row into its individual sentences. This produces unnecessary empty list elements; any suggestions on how to improve the regex are welcome.

In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')

Out:
0    [, This is a long text.,  , It has multiple se...
1        [, Do you see?,  , More than one sentence!, ]
2         [, This one has only one sentence though., ]
dtype: object

This converts each row into a list of strings, with each element holding one sentence.
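On the side question about the empty list elements: they appear because split returns the text between the matches (empty strings and single spaces here) in addition to the captured sentences. One possible way around this, as a sketch rather than the only option, is to collect the matches directly with str.findall and drop the capture group. The variable name sentences is just illustrative.

```python
import pandas as pd

s = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])

# findall keeps only the regex matches, so no empty strings or
# whitespace-only separators end up in the resulting lists.
# Inside a character class, '.', '!' and '?' need no escaping.
sentences = s.str.findall(r'[A-Z][^.!?]*[.!?]')
print(sentences.tolist())
```

Note that this pattern still assumes every sentence starts with an uppercase ASCII letter and ends with '.', '!' or '?', exactly like the original one.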

Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern, and to create a new Series of boolean values, each indicating whether the regex matched at least one of the list elements in that row.

I would expect something like:

In:
s.str.contains('you')

Out:
0   False
1   True
2   False

<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.

However, the call above actually returns:

0   NaN
1   NaN
2   NaN
dtype: float64

I also tried a list comprehension, which does not work either:

result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'

Any suggestions on how this can be achieved?

Answer

You can use Python's str.find() method:

>>> s.apply(lambda x: any(i.find('you') >= 0 for i in x))
0    False
1     True
2    False
dtype: bool
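Keep in mind that find() only looks for a literal substring, while the question asks about regex patterns. A minimal sketch of the same idea with the standard re module could look like the following; the word-boundary pattern \byou\b is only an example and not something from the question.

```python
import re
import pandas as pd

s = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
]).str.split(r'([A-Z][^\.!?]*[\.!?])')

# Compile once, then test every list element of every row.
# The pattern is an illustrative assumption: 'you' as a whole word.
pattern = re.compile(r'\byou\b')
result = s.apply(lambda parts: any(pattern.search(p) for p in parts))
print(result)
```

Each row is True as soon as one of its list elements matches, which is the behaviour described in the question.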

I guess s.str.contains('you') does not work because the elements of your Series are lists, not strings, and the .str methods return NaN for elements that are not strings. But you can also do something like this:

>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0    False
1     True
2    False
dtype: bool
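Building a new Series for every row can get slow on larger data. A possible alternative, sketched here under the assumption that your pandas version provides Series.explode (0.25 or later), is to flatten the lists first, run contains once, and then combine the results per original row. The name matches is just illustrative.

```python
import pandas as pd

s = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
]).str.split(r'([A-Z][^\.!?]*[\.!?])')

matches = (
    s.explode()                        # one row per list element, original index repeated
     .str.contains('you', na=False)    # elements are plain strings again; missing values count as False
     .groupby(level=0)                 # regroup by the original row label
     .any()                            # True if any element of the row matched
)
print(matches)
```

Since contains treats its argument as a regex by default, this variant also works with a real pattern instead of the literal 'you'.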