python-polars split string column into many columns by delimiter

2024/10/1 21:29:13

In pandas, the following code will split the string from col1 into many columns. Is there a way to do this in polars?

import pandas as pd

d = {'col1': ["a/b/c/d", "a/b/c/d"]}
df = pd.DataFrame(data=d)
df[["a", "b", "c", "d"]] = df["col1"].str.split('/', expand=True)
Answer

Here's an algorithm that will automatically adjust for the required number of columns -- and should be quite performant.

Let's start with this data. Notice that I've purposely added the empty string "" and a null value to show how the algorithm handles these values. Also, the number of split strings varies widely.

import polars as pl

df = pl.DataFrame(
    {"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"]}
)
df
shape: (5, 1)
┌─────────────────────────────┐
│ my_str                      │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
│ cat                         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog                     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null                        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│                             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog/aardvark/mouse/frog │
└─────────────────────────────┘

The Algorithm

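The algorithm below may be a bit more than you need, but you can edit, delete, or add steps as you need. Note that it uses the Polars API as it was when this answer was written; newer releases rename several of these methods (for example, with_column became with_columns).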

(
    df.with_row_count('id')  # row number, used later as the pivot index
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")  # one row per split string
    .with_column(
        # per-row counter, zero-padded, used as the new column name
        ("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
        .over("id")
        .alias("col_nm")
    )
    .pivot(
        index=['id', 'my_str'],
        values='split_str',
        columns='col_nm',
    )
    .with_column(pl.col('^string_.*$').fill_null(""))
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ ---                         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆           ┆           ┆           ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ cat       ┆ dog       ┆           ┆           ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆           ┆           ┆           ┆           ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆           ┆           ┆           ┆           ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ dog       ┆ aardvark  ┆ mouse     ┆ frog      │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

How it works

We first assign a row number id (which we'll need later), and use split to separate the strings. Note that the split strings form a list.

(
    df.with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
)
shape: (5, 3)
┌─────┬─────────────────────────────┬────────────────────────────┐
│ id  ┆ my_str                      ┆ split_str                  │
│ --- ┆ ---                         ┆ ---                        │
│ u32 ┆ str                         ┆ list[str]                  │
╞═════╪═════════════════════════════╪════════════════════════════╡
│ 0   ┆ cat                         ┆ ["cat"]                    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ ["cat", "dog"]             │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆ null                       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆ [""]                       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", ... "frog"] │
└─────┴─────────────────────────────┴────────────────────────────┘
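
For readers who want to check the edge cases without a DataFrame at hand, here is a minimal plain-Python sketch of the same split step. The helper name split_or_none is hypothetical, and this is not how Polars implements str.split; it only mirrors the behavior shown in the table above, where a null stays null and an empty string becomes a one-element list.

```python
def split_or_none(value, sep="/"):
    """Split a string on sep, passing None through unchanged (hypothetical helper)."""
    if value is None:
        return None
    return value.split(sep)


values = ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"]
split_lists = [split_or_none(v) for v in values]

# Print each original value next to its list of pieces.
for original, pieces in zip(values, split_lists):
    print(repr(original), "->", pieces)
```

The point to take away is that "" and null are different inputs and stay different after the split, which is why they are treated separately in the later steps.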

Next, we'll use explode to put each string on its own row. (Notice how the id column tracks the original row that each string came from.)

(
    df.with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
)
shape: (10, 3)
┌─────┬─────────────────────────────┬───────────┐
│ id  ┆ my_str                      ┆ split_str │
│ --- ┆ ---                         ┆ ---       │
│ u32 ┆ str                         ┆ str       │
╞═════╪═════════════════════════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ cat       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ dog       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ dog       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ aardvark  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ mouse     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ frog      │
└─────┴─────────────────────────────┴───────────┘
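
As a rough illustration of what the explode step does, the following plain-Python sketch performs the same reshaping on (id, list) pairs. The function name explode and the tuple layout are assumptions made for this example only; the behavior for a missing list (one row with a None item) is chosen to match the table above.

```python
def explode(rows):
    """Turn (row_id, list_or_None) pairs into one (row_id, item) pair per item."""
    exploded = []
    for row_id, pieces in rows:
        if not pieces:
            # A missing (or empty) list still yields one row, with a None item.
            exploded.append((row_id, None))
        else:
            exploded.extend((row_id, piece) for piece in pieces)
    return exploded


split_rows = [
    (0, ["cat"]),
    (1, ["cat", "dog"]),
    (2, None),
    (3, [""]),
    (4, ["cat", "dog", "aardvark", "mouse", "frog"]),
]
exploded = explode(split_rows)
print(len(exploded))  # 10 rows, matching the table above
```

Because every output pair carries its row_id, the original row can always be recovered, which is what makes the later pivot possible.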

In the next step, we're going to generate our column names. I chose to call each column string_XX, where XX is the position of the piece within the original string, counting from 0.

I've used the handy zfill expression so that 1 becomes 01. (This makes sure that string_02 comes before string_10 if you decide to sort your columns later.)

You can substitute your own naming in this step as you need.

(
    df.with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
    .with_column(
        ("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
        .over("id")
        .alias("col_nm")
    )
)
shape: (10, 4)
┌─────┬─────────────────────────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ split_str ┆ col_nm    │
│ --- ┆ ---                         ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ cat       ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ dog       ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆ null      ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆           ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ dog       ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ aardvark  ┆ string_02 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ mouse     ┆ string_03 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ frog      ┆ string_04 │
└─────┴─────────────────────────────┴───────────┴───────────┘
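
The reason for zero-padding can be checked with plain Python strings, since column names sort lexicographically. This is only a small sketch using Python's built-in str.zfill, not the Polars expression, and the list names are made up for the example.

```python
# Column names built without and with zero-padding.
unpadded = ["string_" + str(i) for i in range(12)]
padded = ["string_" + str(i).zfill(2) for i in range(12)]

# Lexicographic sorting puts string_10 before string_2 when names are unpadded.
print(sorted(unpadded)[:4])

# With padding, sorted order is the same as numeric order.
print(sorted(padded)[:4])
```

Two digits are enough for up to 100 pieces per string; a wider zfill would be needed beyond that.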

In the next step, we'll use the pivot function to place each string in its own column.

(
    df.with_row_count('id')
    .with_column(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
    .with_column(
        ("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
        .over("id")
        .alias("col_nm")
    )
    .pivot(
        index=['id', 'my_str'],
        values='split_str',
        columns='col_nm',
    )
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ ---                         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆ null      ┆ null      ┆ null      ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ cat/dog                     ┆ cat       ┆ dog       ┆ null      ┆ null      ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ null                        ┆ null      ┆ null      ┆ null      ┆ null      ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆                             ┆           ┆ null      ┆ null      ┆ null      ┆ null      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ dog       ┆ aardvark  ┆ mouse     ┆ frog      │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

All that remains is to use fill_null to replace the null values with an empty string "". Notice that I've used a regex in the col expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with "" everywhere in your data.)
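
To make the last two steps concrete, here is a plain-Python sketch of pivoting long (id, column name, value) triples into one wide row per id and then filling the gaps. The function name pivot_and_fill and the sample data are hypothetical; this mirrors the logic of pivot followed by fill_null, not the Polars implementation.

```python
def pivot_and_fill(exploded, fill=""):
    """Pivot (row_id, col_nm, value) triples into one dict per row_id, filling gaps."""
    col_names = sorted({col_nm for _, col_nm, _ in exploded})
    rows = {}
    for row_id, col_nm, value in exploded:
        # Every row starts with all columns set to None, like the pivot output.
        row = rows.setdefault(row_id, {name: None for name in col_names})
        row[col_nm] = value
    # Counterpart of fill_null(""): replace the remaining None values.
    return {
        row_id: {name: (fill if v is None else v) for name, v in row.items()}
        for row_id, row in rows.items()
    }


exploded = [
    (0, "string_00", "cat"),
    (1, "string_00", "cat"),
    (1, "string_01", "dog"),
    (2, "string_00", None),
    (3, "string_00", ""),
]
wide = pivot_and_fill(exploded)
print(wide[1])
```

Note that after filling, a row that was originally null and a row that was originally "" look the same in the string_XX columns; the untouched my_str column is what still tells them apart in the DataFrame version.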
