How to make an integer index row?

2024/10/12 7:15:46

I have a DataFrame:

+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+---------+
|    0|      11|    I0938|
|  916|      19|    I0009|
|  916|      51|    I1097|
|  916|      19|    C0723|
|  916|      19|    I0010|
|  916|      19|    I0010|
|12331|      19|    C0117|
|12331|      19|    C0117|
|12331|      19|    I0009|
|12331|      19|    I0009|
|12331|      19|    I0010|
|12838|      19|    I1067|
|12838|      19|    I1067|
|12838|      19|    C1083|
|12838|      11|    B0250|
|12838|      19|    C1346|
+-----+--------+---------+

I want to take the distinct item_code values and assign an index to each one, like this:

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0938|    0|
|    I0009|    1|
|    I1097|    2|
|    C0723|    3|
|    I0010|    4|
|    C0117|    5|
|    I1067|    6|
|    C1083|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+

I don't want to use monotonically_increasing_id because it returns a bigint.

Answer

monotonically_increasing_id only guarantees that the numbers are unique and increasing; neither the starting number nor consecutive numbering is guaranteed. If you want to be sure to get 0, 1, 2, 3, ..., you can use the RDD function zipWithIndex().
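To make the difference concrete, here is a small pure-Python sketch (no Spark needed) that imitates the two numbering schemes. It assumes the layout described in the Spark documentation for monotonically_increasing_id, with the partition ID in the upper bits and the row offset within the partition in the lower 33 bits. The function names and the two-partition split are invented for illustration only.

```python
def monotonic_ids(partitions):
    """Imitate monotonically_increasing_id: partition index shifted
    left by 33 bits, plus the row's offset inside its partition."""
    return [
        [(item, (p << 33) + offset) for offset, item in enumerate(part)]
        for p, part in enumerate(partitions)
    ]


def zip_with_index(partitions):
    """Imitate RDD.zipWithIndex: each partition starts where the
    previous one ended, so the indices are consecutive from 0."""
    result = []
    start = 0
    for part in partitions:
        result.append([(item, start + offset) for offset, item in enumerate(part)])
        start += len(part)
    return result


# Hypothetical split of five item codes across two partitions.
partitions = [["I0938", "I0009", "I1097"], ["C0723", "I0010"]]

mono = monotonic_ids(partitions)
zipped = zip_with_index(partitions)

print(mono[1][0])    # ('C0723', 8589934592), i.e. 1 << 33
print(zipped[1][0])  # ('C0723', 3)
```

The first row of the second partition jumps to 8589934592 under the first scheme, while zipWithIndex simply continues with 3.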

Since I'm not very familiar with Spark in Python, the example below uses Scala, but it should be easy to convert.

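import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Sample data: the item_code column from the question, duplicates included
val df = Seq(
  "I0938", "I0009", "I1097", "C0723", "I0010", "I0010", "C0117", "C0117",
  "I0009", "I0009", "I0010", "I1067", "I1067", "C1083", "B0250", "C1346"
).toDF("item_code")

// distinct drops the duplicates; zipWithIndex pairs each value with a
// consecutive index starting at 0 (the index is a Long, so numId is a bigint)
val df2 = df.distinct.rdd
  .map { case Row(item: String) => item }
  .zipWithIndex()
  .toDF("item_code", "numId")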

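This gives the requested result. Note that distinct does not preserve the original row order, so the index assigned to each item_code may differ from the example in the question: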

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0010|    0|
|    I1067|    1|
|    C0117|    2|
|    I0009|    3|
|    I1097|    4|
|    C1083|    5|
|    I0938|    6|
|    C0723|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+
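
Since the question is about PySpark, a rough translation of the Scala snippet might look like the sketch below. The helper names to_indexed_row, build_item_index and demo are made up for illustration. The final cast to int is an addition that is not in the Scala code above: zipWithIndex produces 64-bit indices, and the question asks for a column that is not a bigint. This has not been run against any particular Spark version, so treat it as a starting point.

```python
def to_indexed_row(pair):
    """Turn one (Row, index) pair from zipWithIndex into a flat tuple."""
    row, idx = pair
    return (row[0], idx)


def build_item_index(df):
    """Return a DataFrame with the distinct item_code values and a
    consecutive integer numId starting at 0."""
    distinct_items = df.select("item_code").distinct()
    indexed = (
        distinct_items.rdd
        .zipWithIndex()        # RDD of (Row, index) pairs
        .map(to_indexed_row)   # RDD of (item_code, index) tuples
        .toDF(["item_code", "numId"])
    )
    # Assumption: an int column is wanted, so narrow the inferred bigint.
    return indexed.withColumn("numId", indexed["numId"].cast("int"))


def demo():
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    codes = ["I0938", "I0009", "I1097", "C0723", "I0010", "I0010",
             "C0117", "C0117", "I0009", "I0009", "I0010", "I1067",
             "I1067", "C1083", "B0250", "C1346"]
    df = spark.createDataFrame([(c,) for c in codes], ["item_code"])
    build_item_index(df).show()
```

Calling demo() in an environment with PySpark installed should print a table like the one above, though the order of the rows can vary from run to run.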