How to speed up pandas string function?

2024/9/28 11:16:31

I am using the pandas vectorized str.split() method to extract the first element returned from a split on "~". I have also tried using df.apply() with a lambda and str.split() to produce equivalent results. When timing both with %timeit, I'm finding that df.apply() performs faster than the vectorized version.

Everything that I have read about vectorization seems to indicate that the first version should have better performance. Can someone please explain why I am getting these results? Example:

     id  facility
0  3466  abc~24353
1  4853  facility1~3.4.5.6
2  4582  53434_Facility~34432~cde
3  9972  facility2~FACILITY2~343
4  2356  Test~23 ~FAC1

The above dataframe has about 500,000 rows and I have also tested at around 1 million with similar results. Here is some example input and output:

Vectorization

In [1]: %timeit df['facility'] = df['facility'].str.split('~').str[0]
1.1 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Lambda Apply

In [2]: %timeit df['facility'] = df['facility'].astype(str).apply(lambda facility: facility.split('~')[0])
650 ms ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Does anyone know why I am getting this behavior?
Thanks!

Answer

Pandas string methods are only "vectorized" in the sense that you don't have to write the loop yourself. There isn't actually any parallelization going on, because string operations (especially regex problems) are inherently difficult (impossible?) to parallelize. If you really want speed, you should actually fall back to plain Python here.

%timeit df['facility'].str.split('~', n=1).str[0]
%timeit [x.split('~', 1)[0] for x in df['facility'].tolist()]

411 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
132 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
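
To make the comparison concrete, here is a minimal sketch that runs both approaches on the five sample rows from the question and writes the result back to the column. It assumes every value in the column is a string (no missing values); the variable names via_accessor and via_loop are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "id": [3466, 4853, 4582, 9972, 2356],
    "facility": [
        "abc~24353",
        "facility1~3.4.5.6",
        "53434_Facility~34432~cde",
        "facility2~FACILITY2~343",
        "Test~23 ~FAC1",
    ],
})

# pandas string accessor: split at most once, keep the first piece.
via_accessor = df["facility"].str.split("~", n=1).str[0]

# Plain Python loop over the column values; assumes every value is a str.
via_loop = [x.split("~", 1)[0] for x in df["facility"].tolist()]

# Both approaches produce the same values.
assert via_accessor.tolist() == via_loop

# Assigning a list of the same length writes the values back positionally.
df["facility"] = via_loop
print(df)
```

Passing 1 as the split limit stops the split after the first "~", so rows with several separators do no extra work.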

For more information on when loops are faster than pandas functions, take a look at "For loops with pandas - When should I care?".

As for why apply is faster, I believe the function that apply is calling (i.e., Python's built-in str.split) is a lot more lightweight than the string splitting that happens in the bowels of Series.str.split.
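
One thing the .str accessor does that a bare x.split() call does not is cope with missing values: the accessor propagates them, while the plain loop raises an error on anything that is not a string. Whether that accounts for the timing gap is an assumption, not something measured here. If the column may contain missing entries, a small guard keeps the pure-Python approach safe. The helper name first_field below is hypothetical, not a pandas function.

```python
def first_field(value, sep="~"):
    """Return the text before the first `sep`; pass non-strings through unchanged."""
    if isinstance(value, str):
        return value.split(sep, 1)[0]
    return value  # e.g. None or float("nan") for a missing entry


# Mixed input: strings, missing values, and a string without the separator.
values = ["abc~24353", None, "Test~23 ~FAC1", float("nan"), "no_separator"]
result = [first_field(v) for v in values]
print(result)
```

With a DataFrame, the same helper could be used as [first_field(x) for x in df['facility'].tolist()].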

https://en.xdnf.cn/q/71342.html
