Question 1

I have a DataFrame:

+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+---------+
|    0|      11|    I0938|
|  916|      19|    I0009|
|  916|      51|    I1097|
|  916|      19|    C0723|
|  916|      19|    I0010|
|  916|      19|    I0010|
|12331|      19|    C0117|
|12331|      19|    C0117|
|12331|      19|    I0009|
|12331|      19|    I0009|
|12331|      19|    I0010|
|12838|      19|    I1067|
|12838|      19|    I1067|
|12838|      19|    C1083|
|12838|      11|    B0250|
|12838|      19|    C1346|
+-----+--------+---------+

I want the distinct item_code values, with an integer index assigned to each one, like this:

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0938|    0|
|    I0009|    1|
|    I1097|    2|
|    C0723|    3|
|    I0010|    4|
|    C0117|    5|
|    I1067|    6|
|    C1083|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+

I don't want to use monotonically_increasing_id because it returns a bigint.

Answer

monotonically_increasing_id only guarantees that the numbers are unique and increasing; neither the starting number nor consecutive numbering is guaranteed. If you want to be sure to get 0, 1, 2, 3, ..., you can use the RDD function zipWithIndex().

Since I'm not too familiar with Spark in Python, the example below uses Scala, but it should be easy to convert.

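import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Sample data: the item_code column from the question
val df = Seq("I0938","I0009","I1097","C0723","I0010","I0010","C0117","C0117","I0009","I0009","I0010","I1067","I1067","C1083","B0250","C1346").toDF("item_code")

// Keep the distinct codes, then pair each one with a consecutive index starting at 0
val df2 = df.distinct.rdd
  .map { case Row(item: String) => item }
  .zipWithIndex()
  .toDF("item_code", "numId")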

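This gives consecutive ids starting from 0. Note that distinct does not preserve the original row order, so the ids are assigned in a different order than in the question: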

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0010|    0|
|    I1067|    1|
|    C0117|    2|
|    I0009|    3|
|    I1097|    4|
|    C1083|    5|
|    I0938|    6|
|    C0723|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+
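
One caveat: zipWithIndex() pairs each element with a Long, so numId in the result above is still a bigint column. Below is a minimal sketch of one way to get an integer column with reproducible ids. It sorts the distinct codes before indexing and then casts the index. The helper name withIntIndex, the object name, and the local[*] master are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col

object IntIndexSketch {
  // Hypothetical helper: expects a DataFrame with a single string column.
  def withIntIndex(spark: SparkSession, df: DataFrame): DataFrame = {
    import spark.implicits._
    df.distinct.rdd
      .map { case Row(item: String) => item }
      .sortBy(item => item)                          // fix the order so ids are reproducible
      .zipWithIndex()                                // RDD[(String, Long)]
      .toDF("item_code", "numId")
      .withColumn("numId", col("numId").cast("int")) // bigint -> integer
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("int-index-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("I0938", "I0009", "I1097", "I0009", "C0723").toDF("item_code")
    val indexed = withIntIndex(spark, df)

    indexed.printSchema() // numId should now be reported as integer
    indexed.show()

    spark.stop()
  }
}
```

The cast is only safe while the number of distinct items fits in a 32-bit integer. The sort is optional; without it the ids are still consecutive, but which code gets which id can vary between runs.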

How to make an integer index row?
