polars slower than numpy?

2024/11/15 13:02:07

I was thinking about using polars in place of numpy in a parsing problem where I turn a structured text file into a character table and operate on different columns. However, it seems that polars is about 5 times slower than numpy in most operations I'm performing. I was wondering why that's the case and whether I'm doing something wrong given that polars is supposed to be faster.

Example:

import requests
import numpy as np
import polars as pl

# Download the text file
text = requests.get("https://files.rcsb.org/download/3w32.pdb").text

# Turn it into a 2D array of characters (one row per 80-character line)
char_tab_np = np.array(text.splitlines()).view(dtype=(str, 1)).reshape(-1, 80)

# Create a polars DataFrame from the numpy array
char_tab_pl = pl.DataFrame(char_tab_np)

# Sort by first column with numpy
char_tab_np[np.argsort(char_tab_np[:, 0])]

# Sort by first column with polars
char_tab_pl.sort(by="column_0")

Using %%timeit in Jupyter, the numpy sorting takes about 320 microseconds, whereas the polars sort takes about 1.3 milliseconds, i.e. about five times slower.

I also tried char_tab_pl.lazy().sort(by="column_0").collect(), but it had no effect on the duration.

Another example (Take all rows where the first column is equal to 'A'):

%%timeit
# with numpy
char_tab_np[char_tab_np[:, 0] == "A"]

%%timeit
# with polars
char_tab_pl.filter(pl.col("column_0") == "A")

Again, numpy takes 226 microseconds, whereas polars takes 673 microseconds, about three times slower.
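For anyone who wants to repeat the numpy side of these measurements outside Jupyter, here is a minimal sketch using the standard-library timeit module. It uses a small synthetic table instead of the downloaded file (an assumption, so the numbers will not match the ones quoted above), and the helper names sort_np and filter_np are made up for illustration.

```python
import timeit

import numpy as np

# Synthetic stand-in for the PDB file: 1000 fixed-width lines of 80 characters.
# (Assumption: the real file is also exactly 80 characters per line.)
lines = [("ATOM" if i % 2 else "HETATM").ljust(80) for i in range(1000)]
char_tab_np = np.array(lines).view(dtype=(str, 1)).reshape(-1, 80)


def sort_np():
    # Sort rows by the character in the first column
    return char_tab_np[np.argsort(char_tab_np[:, 0])]


def filter_np():
    # Keep rows whose first character is "A"
    return char_tab_np[char_tab_np[:, 0] == "A"]


# Take the best of several repeats to reduce noise, then report time per call.
n_calls = 100
for fn in (sort_np, filter_np):
    best = min(timeit.repeat(fn, number=n_calls, repeat=5)) / n_calls
    print(f"{fn.__name__}: {best * 1e6:.1f} microseconds per call")
```

The same pattern should work for the polars calls by wrapping them in functions of their own.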

Update

Based on the comments I tried two other things:

1. Making the file 1000 times larger to see whether polars performs better on larger data.

Results: numpy was still about 2 times faster (1.3 ms vs. 2.1 ms). In addition, creating the character array took numpy about 2 seconds, whereas polars needed about 2 minutes to create the dataframe, i.e. 60 times slower.

To reproduce, just add text *= 1000 before creating the numpy array in the code above.

2. Casting to integer.

For the original (smaller) file, casting to int sped up the process for both numpy and polars. The filtering in numpy was still about 4 times faster than polars (30 microseconds vs. 120 microseconds), whereas the sorting times became more similar (150 microseconds for numpy vs. 200 microseconds for polars).

However, for the large file, polars was marginally faster than numpy, but the huge instantiation time makes it worthwhile only if the dataframe is to be queried thousands of times.
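The post does not show how the cast to integer was done, so the following is only one possible way to do it on the numpy side, assuming a character table built as in the example above. NumPy stores each character of a unicode array as a 4-byte code point, so the table can be reinterpreted as uint32 without copying. The three sample lines are invented for illustration.

```python
import numpy as np

# Tiny synthetic table (assumption: fixed-width 80-character lines, as in the question)
lines = ["ATOM".ljust(80), "HETATM".ljust(80), "ANISOU".ljust(80)]
char_tab_np = np.array(lines).view(dtype=(str, 1)).reshape(-1, 80)

# Reinterpret the 4-byte unicode characters as unsigned integers (no copy)
codes_np = char_tab_np.view(np.uint32)

# Filter: rows whose first character is "A", compared as integers
rows_a = codes_np[codes_np[:, 0] == ord("A")]

# Sort by the first column; a stable sort keeps the original order within ties
order = np.argsort(codes_np[:, 0], kind="stable")
sorted_codes = codes_np[order]

# Convert a result back to characters when text is needed again
rows_a_chars = rows_a.view(dtype=(str, 1))
print("".join(rows_a_chars[0]).rstrip())
```

Integer comparisons avoid string handling entirely, which fits the speed-up reported for both libraries.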

Answer

Polars does extra work when filtering string data, and that work is not worth it in this case. Polars stores its string data in Arrow large-utf8 buffers. This makes filtering more expensive than filtering Python strings or single characters (e.g. pointers or u8 bytes).

Sometimes it is worth it, sometimes not. If you have homogeneous data, numpy is a better fit than polars. If you have heterogeneous data, polars will likely be faster, especially if you consider your whole query instead of these micro-benchmarks.
