I was thinking about using polars in place of numpy in a parsing problem where I turn a structured text file into a character table and operate on different columns. However, it seems that polars is about 5 times slower than numpy in most of the operations I'm performing. I was wondering why that is, and whether I'm doing something wrong, given that polars is supposed to be faster.
Example:
import requests
import numpy as np
import polars as pl

# Download the text file
text = requests.get("https://files.rcsb.org/download/3w32.pdb").text

# Turn it into a 2D array of characters
char_tab_np = np.array(text.splitlines()).view(dtype=(str, 1)).reshape(-1, 80)

# Create a polars DataFrame from the numpy array
char_tab_pl = pl.DataFrame(char_tab_np)

# Sort by first column with numpy
char_tab_np[np.argsort(char_tab_np[:, 0])]

# Sort by first column with polars
char_tab_pl.sort(by="column_0")
Using %%timeit in Jupyter, the numpy sort takes about 320 microseconds, whereas the polars sort takes about 1.3 milliseconds, i.e. roughly four times slower.
I also tried char_tab_pl.lazy().sort(by="column_0").collect(), but it had no effect on the duration.
Another example (take all rows where the first column is equal to 'A'):
%%timeit
# with numpy
char_tab_np[char_tab_np[:, 0] == "A"]

%%timeit
# with polars
char_tab_pl.filter(pl.col("column_0") == "A")
Again, numpy takes 226 microseconds, whereas polars takes 673 microseconds, i.e. about three times slower.
Update
Based on the comments I tried two other things:
1. Making the file 1000 times larger to see whether polars performs better on larger data.
Results: numpy was still faster, by a factor of about 1.6 (1.3 ms vs. 2.1 ms). In addition, creating the character array took numpy about 2 seconds, whereas polars needed about 2 minutes to create the dataframe, i.e. roughly 60 times slower.
To reproduce, just add text *= 1000 before creating the numpy array in the code above (the full setup is in the first snippet at the end of this post).
2. Casting to integer.
For the original (smaller) file, casting the characters to integers sped things up for both numpy and polars. The filtering in numpy was still about 4 times faster than in polars (30 microseconds vs. 120), whereas the sorting times became more similar (150 microseconds for numpy vs. 200 for polars). A sketch of the cast is in the second snippet at the end of this post.
However, for the large file, polars was marginally faster than numpy, but the huge instantiation time makes it worthwhile only if the dataframe is going to be queried thousands of times.