RuntimeError during one-hot encoding

2024/9/21 22:41:30

I have a dataset whose class values run from -2 to 2 in steps of 1 (i.e., -2, -1, 0, 1, 2), and where 9 marks the unlabelled data. When I call the one-hot encoding method

self._one_hot_encode(labels)

I get the following error: RuntimeError: index 1 is out of bounds for dimension 1 with size 1

due to

self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)

I think the error is raised by labels such as [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1], where the 9 seems to make the mapping set index 9 to 1. I cannot work out how to fix it, even after going through past questions and answers on similar problems (e.g., index 1 is out of bounds for dimension 0 with size 1). The part of the code involved in the error is the following:

def _one_hot_encode(self, labels):
    # Get the number of classes
    classes = torch.unique(labels)
    classes = classes[classes != 9]  # unlabelled
    self.n_classes = classes.size(0)

    # One-hot encode labeled data instances and zero rows corresponding to unlabeled instances
    unlabeled_mask = (labels == 9)
    labels = labels.clone()  # defensive copying
    labels[unlabeled_mask] = 0
    self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
    self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
    self.one_hot_labels[unlabeled_mask, 0] = 0
    self.labeled_mask = ~unlabeled_mask

def fit(self, labels, max_iter, tol):
    self._one_hot_encode(labels)

    self.predictions = self.one_hot_labels.clone()
    prev_predictions = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)

    for i in range(max_iter):
        # Stop iterations if the system is considered at a steady state
        variation = torch.abs(self.predictions - prev_predictions).sum().item()
        prev_predictions = self.predictions
        self._propagate()

Example of dataset:

   ID  Target  Weight  Label  Score  Scale_Cat  Scale_num
0  A   D       65.1    1      87     Up         1
1  A   X       35.8    1      87     Up         1
2  B   C       34.7    1      37.5   Down       -2
3  B   P       33.4    1      37.5   Down       -2
4  C   B       33.1    1      37.5   Down       -2
5  S   X       21.4    0      12.5   NA         9

The source code I am using as reference is here: https://mybinder.org/v2/gh/thibaudmartinez/label-propagation/master?filepath=notebook.ipynb

Full track of the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-126-792a234f63dd> in <module>
      4 label_propagation = LabelPropagation(adj_matrix_t)
----> 6 label_propagation.fit(labels_t) # causing error
      7 label_propagation_output_labels = label_propagation.predict_classes()
      8 

<ipython-input-115-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
    100 
    101     def fit(self, labels, max_iter=1000, tol=1e-3):
--> 102         super().fit(labels, max_iter, tol)
    103 
    104 ## Label spreading

<ipython-input-115-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
     58             Convergence tolerance: threshold to consider the system at steady state.
     59         """
---> 60         self._one_hot_encode(labels)
     61 
     62         self.predictions = self.one_hot_labels.clone()

<ipython-input-115-54a7dbc30bd1> in _one_hot_encode(self, labels)
     42         labels[unlabeled_mask] = 0
     43         self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
---> 44         self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
     45         self.one_hot_labels[unlabeled_mask, 0] = 0
     46 

RuntimeError: index 1 is out of bounds for dimension 1 with size 1

Answer

I ran through your notebook (I think you changed the 9 to -1 to get things to run) and found that this part of the code:

# Learn with Label Propagation
label_propagation = LabelPropagation(adj_matrix_t)
print("Label Propagation: ", end="")
label_propagation.fit(labels_t)
label_propagation_output_labels = label_propagation.predict_classes()

eventually calls:

self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)

which is where things go wrong.

Take a brief moment to read the PyTorch manual on scatter (torch Scatter): to use scatter, it is important to understand dim, index, src, and the self matrix. For one-hot encoding with one row per observation, dim is 1 (dim=0 would only fit the transposed layout), and our src is simply the scalar 1 (we'll look a little more into this later). You are calling scatter on dimension 1 with an index matrix of shape [40, 1] and a result (self) matrix of shape [40, 5].

I see two issues here:

  1. You are using the literal category values (-2,-1,0,1,2) as the entries of your index matrix. scatter uses each entry as a column position in the self matrix, so every entry must lie between 0 and the number of columns minus 1. This is where the index out of bounds error is coming from.
  2. You mention that there are 6 classes (-2,-1,0,1,2, plus 9 for unlabelled), but you are one-hot encoding on 5 classes. (Yes, I know you want the unlabelled class to be all zeros, but that's a little difficult to achieve with scatter. I'll explain later.)
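The error message from the question can be reproduced with a short sketch. The label values below are hypothetical and only mirror the question's situation, where 1 is the only labelled class present and 9 marks unlabelled rows; the steps copy the question's `_one_hot_encode` outside of the class.

```python
import torch

# Hypothetical labels mirroring the question: only class 1 occurs, 9 = unlabelled.
labels = torch.tensor([1, 1, 9, 1])

classes = torch.unique(labels)
classes = classes[classes != 9]
n_classes = classes.size(0)           # 1, because only one labelled class occurs

idx = labels.clone()
idx[labels == 9] = 0                  # same placeholder index the question uses
one_hot = torch.zeros((labels.size(0), n_classes), dtype=torch.float)

try:
    # The label value 1 is used as a column position, but column 0 is the only one.
    one_hot.scatter(1, idx.unsqueeze(1), 1)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(n_classes)
print(error_message)
```

The self matrix ends up with a single column because `torch.unique` only sees one labelled class, while the raw label value 1 points at a second column that does not exist.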

So how do we fix this?

Issue 1: Let's start with a small example:

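import torch

# src is not shown in the original post; this value is an assumption chosen
# because it reproduces the output below (the last row receives a 0)
src = torch.tensor([[1], [1], [1], [1], [1], [0]])
index = torch.tensor([[5], [0], [3], [5], [1], [4]])
print(index.shape); print(index)
result = torch.zeros(6, 6, dtype=src.dtype).scatter_(1, index, src)
print(result.shape); print(result)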

This will give us

torch.Size([6, 1])
tensor([[5],
        [0],
        [3],
        [5],
        [1],
        [4]])
torch.Size([6, 6])
tensor([[0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]])

The index matrix holds 6 observations with 1 observed value (category) each. The self matrix holds 6 observations, each with a 6-category one-hot encoding vector. With scatter(dim=1), torch works row by row (observation by observation): in row i of self, it writes the value stored in src at that row into the column given by the value stored in index at that row. For a 3-D tensor and dim=1, the manual states the rule as:

self[i][index[i][j][k]][k] = src[i][j][k]

In the 2-D case used here, this reduces to self[i][index[i][j]] = src[i][j].

So in your case, scatter was trying to write the value 1 into a row of a self matrix of shape [40, 1] at the column given by index[0] (which is equal to 1). The only valid column there is 0, which gives you the error in the question. When I ran your notebook, the error was instead index -1 is out of bounds for dimension 1 with size 5, but both have the same root cause.
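To make the rule concrete, here is a plain-Python sketch of the 2-D, dim=1 case. `scatter_dim1` is a hypothetical helper written only for illustration, not PyTorch's implementation; it rejects any index outside the valid column range, including negative values, and raises an `IndexError` with similar wording.

```python
def scatter_dim1(target, index, src):
    """Plain-Python sketch of the 2-D rule target[i][index[i][j]] = src[i][j]."""
    out = [row[:] for row in target]            # work on a copy of the target rows
    for i, index_row in enumerate(index):
        for j, col in enumerate(index_row):
            # Every index must be a valid column position of the target row.
            if not 0 <= col < len(out[i]):
                raise IndexError(
                    f"index {col} is out of bounds for dimension 1 with size {len(out[i])}"
                )
            out[i][col] = src[i][j]
    return out


zeros = [[0] * 3 for _ in range(2)]
encoded = scatter_dim1(zeros, [[2], [0]], [[1], [1]])
print(encoded)  # [[0, 0, 1], [1, 0, 0]]
```

Calling `scatter_dim1([[0]], [[1]], [[1]])` fails in the same way as the question's code: the target has one column, and the index asks for column 1.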

Issue 2: One-hot-encoding

It is just easier to do a complete one-hot encoding instead of one-hot with cold (all-zero) encodings in this case. The reason is that for one-hot with cold encodings, you need to put a 0 in your src matrix for every unlabelled observation, which is more painful than just using 1 for the src. Also, after reading this link: Is it valid to have full zeros for OHE?, I think it makes more sense to use one-hot for every category.
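For comparison, the cold-encoding variant described above could be sketched roughly as follows. The sample labels are hypothetical, the shift by 2 assumes the labelled classes are exactly -2..2, and src is built from a mask so that unlabelled rows receive a 0.

```python
import torch

labels = torch.tensor([-2, 9, 0, 2, 9])        # hypothetical sample, 9 = unlabelled
labeled_mask = labels != 9

# Shift -2..2 to columns 0..4; unlabelled rows get a placeholder column 0.
index = torch.where(labeled_mask, labels + 2, torch.zeros_like(labels)).unsqueeze(1)
# src holds 1 for labelled rows and 0 for unlabelled rows, so those rows stay all zero.
src = labeled_mask.long().unsqueeze(1)

one_hot = torch.zeros(labels.size(0), 5, dtype=torch.long).scatter_(1, index, src)
print(one_hot)
```

This keeps 5 columns and leaves unlabelled rows empty, at the cost of having to build a separate src tensor.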

So, for the second issue, we simply need to map the categories to column indexes of the result/self matrix. Since we have 6 categories, we map them to 0,1,2,3,4,5. A simple lambda function does the trick. I used a random sampler to draw my data labels from a class list, as shown below (I randomly created 40 observations from 6 classes):

import random
import torch

classes = [-2, -1, 0, 1, 2, 9]
labels = []
for i in range(40):
    # Shift -2..2 to 0..4 and send 9 (unlabelled) to column 5
    labels.append([(lambda x: x + 2 if x != 9 else 5)(random.sample(classes, 1)[0])])
index_aka_labels = torch.tensor(labels)
print(index_aka_labels)
print(index_aka_labels.shape)
torch.zeros(40, 6, dtype=torch.long).scatter_(1, index_aka_labels, 1)

Finally, we have achieved our desired result of OHE:

tensor([[0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1, 0],
        ... (40 observations)
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
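Applied to label tensors like the ones in the question, the same mapping could look like the sketch below. `CLASS_VALUES` and the standalone `one_hot_encode` function are hypothetical names for illustration, not part of the referenced notebook, and the sketch assumes every label appears in `CLASS_VALUES`.

```python
import torch

CLASS_VALUES = [-2, -1, 0, 1, 2, 9]             # 9 = unlabelled, mapped to column 5


def one_hot_encode(labels):
    """Map raw label values to column positions 0..5, then scatter a 1 into each row."""
    class_values = torch.tensor(CLASS_VALUES)
    # Compare each label against every class value; the matching position is the column.
    matches = (labels.unsqueeze(1) == class_values).float()
    index = matches.argmax(dim=1, keepdim=True)
    one_hot = torch.zeros(labels.size(0), len(CLASS_VALUES), dtype=torch.float)
    return one_hot.scatter(1, index, 1)


labels = torch.tensor([1, 1, -2, 9, 0])
encoded = one_hot_encode(labels)
print(encoded)
```

Because the number of columns comes from a fixed class list rather than from `torch.unique(labels)`, the result keeps 6 columns even when only some classes occur in the data.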