I am trying to bucketize every column whose name contains the word "road" in a 5k-row dataset, and create a new dataframe. I am not sure how to do that; here is what I have tried so far:

```python
from pyspark.ml.feature import Bucketizer

spike_cols = [col for col in df.columns if "road" in col]

for x in spike_cols:
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                            inputCol=x, outputCol=x + "bucket")
    bucketedData = bucketizer.transform(df)
```
Your loop overwrites `bucketedData` on every iteration and always transforms the original `df`, so only the last column's bucket survives. Either modify `df` in the loop, so each transform builds on the previous one:

```python
from pyspark.ml.feature import Bucketizer

for x in spike_cols:
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                            inputCol=x, outputCol=x + "bucket")
    df = bucketizer.transform(df)
```
or use a `Pipeline`, building one `Bucketizer` stage per matching column:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Bucketizer

model = Pipeline(stages=[
    Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
               inputCol=x, outputCol=x + "bucket")
    for x in spike_cols
]).fit(df)

model.transform(df)
```
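For intuition, `Bucketizer` maps each value to the index of its bucket: splits `[-inf, 10, 100, inf]` define the intervals `[-inf, 10)`, `[10, 100)`, and `[100, inf)`. A plain-Python sketch of that mapping using the standard-library `bisect` module (an illustration of the split semantics, not Spark code; the `bucket` helper is hypothetical):

```python
from bisect import bisect_right

# Same splits as the Bucketizer above
splits = [-float("inf"), 10, 100, float("inf")]

def bucket(value, splits):
    """Bucket index for value: splits define half-open intervals [s[i], s[i+1])."""
    return bisect_right(splits, value) - 1

print([bucket(v, splits) for v in [-3, 0, 9.9, 10, 99, 100, 1e6]])
# → [0, 0, 0, 1, 1, 2, 2]
```

Note that values below the lowest finite split land in bucket 0 because the first split is `-inf`; without the infinite endpoints, Spark's `Bucketizer` would instead raise an error on out-of-range values (or handle them according to `handleInvalid`).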