I am using the pandas vectorized str.split() method to extract the first element returned from a split on "~". I have also tried Series.apply() with a lambda and str.split() to produce equivalent results. When I time both with %timeit, apply() is faster than the vectorized version.
Everything I have read about vectorization suggests that the vectorized version should perform better. Can someone please explain why I am getting these results? Example:
     id  facility
0  3466  abc~24353
1  4853  facility1~3.4.5.6
2  4582  53434_Facility~34432~cde
3  9972  facility2~FACILITY2~343
4  2356  Test~23 ~FAC1
The actual DataFrame has about 500,000 rows; I have also tested with around 1 million rows, with similar results. Here are the timings for each approach:
Vectorization
In [1]: %timeit df['facility'] = df['facility'].str.split('~').str[0]
1.1 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Lambda Apply
In [2]: %timeit df['facility'] = df['facility'].astype(str).apply(lambda facility: facility.split('~')[0])
650 ms ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
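For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the comparison. It is not my real data: the frame is built by repeating the five sample rows above, and the names SAMPLE, N_ROWS, split_vectorized and split_apply are made up for illustration. Unlike the %timeit lines above, it does not assign the result back to df['facility'], so every timed run sees the same unsplit input.

```python
import timeit

import pandas as pd

# Synthetic stand-in for the real data (an assumption): the five sample
# rows from the question, repeated to reach a benchmark-sized frame.
SAMPLE = [
    "abc~24353",
    "facility1~3.4.5.6",
    "53434_Facility~34432~cde",
    "facility2~FACILITY2~343",
    "Test~23 ~FAC1",
]
N_ROWS = 100_000  # hypothetical size; raise toward 500_000 to match the question

df = pd.DataFrame({"facility": SAMPLE * (N_ROWS // len(SAMPLE))})


def split_vectorized(s: pd.Series) -> pd.Series:
    """Take the first '~'-separated field via the pandas .str accessor."""
    return s.str.split("~").str[0]


def split_apply(s: pd.Series) -> pd.Series:
    """Take the first '~'-separated field via a per-element Python lambda."""
    return s.astype(str).apply(lambda facility: facility.split("~")[0])


# Sanity check: both approaches agree on this all-string column.
assert split_vectorized(df["facility"]).tolist() == split_apply(df["facility"]).tolist()

# Time each approach on the same, unmodified column.
for name, func in [("str.split", split_vectorized), ("apply", split_apply)]:
    seconds = min(timeit.repeat(lambda: func(df["facility"]), number=1, repeat=5))
    print(f"{name:>10}: {seconds * 1000:.1f} ms (best of 5)")
```

The absolute numbers will depend on the machine and the pandas version, so this is only meant to make the comparison easy to rerun.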
Does anyone know why I am getting this behavior?
Thanks!