PyTorch loss is NaN

2024/11/16 1:57:10

I'm trying to write my first neural network with PyTorch. Unfortunately, I run into a problem when computing the loss. I get the following error message:

RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.

So I tried debugging and found something strange. The input contains no NaNs, as I verified with the following check (which tests only for NaN, not for inf):

print(torch.any(torch.isnan(inputs)))

But when I print x after each individual step in the model, I see that inf appears at some point.
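The isnan check above does not catch inf, and it says nothing about magnitude. A slightly broader sanity check might look like the sketch below. It assumes the inputs are a floating-point tensor; the helper name describe_tensor and the example values are hypothetical and only illustrate the idea. Very large finite values (the raw tensor printed in the output section starts with 9.0616e+227) can overflow to inf once they are multiplied together or by large weights, even though isnan reports False.

```python
import torch

def describe_tensor(name, t):
    """Print and return basic health statistics for a tensor (hypothetical helper)."""
    stats = {
        "has_nan": bool(torch.isnan(t).any()),
        "has_inf": bool(torch.isinf(t).any()),
        "all_finite": bool(torch.isfinite(t).all()),
        "max_abs": float(t.abs().max()),
    }
    print(name, stats)
    return stats

# Illustrative tensor, not the asker's data: finite but extremely large values.
example = torch.tensor([[9.0616e227, 2.4353e-152, 0.0]], dtype=torch.float64)
stats = describe_tensor("inputs", example)

# Multiplying two huge finite float64 values exceeds the float64 range
# (about 1.8e308), so the result overflows to inf.
overflowed = example * example
describe_tensor("inputs squared", overflowed)
```

If max_abs comes out many orders of magnitude above 1, it may be worth checking how the data was loaded and whether it should be scaled or normalized before training.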

training

inputs, labels = data
print(torch.any(torch.isnan(inputs)))
optimizer.zero_grad()
outputs = model(inputs)
print(outputs)
loss = criterion(outputs, labels)
print(f"epoch: {epoch + 1} loss: {loss.item()}")
loss.backward()
optimizer.step()

model

import torch
from torch.nn import Conv1d, Linear, MaxPool1d, Module, ReLU

class Net(Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = Conv1d(in_channels=1, out_channels=5, kernel_size=5, stride=2, dtype=torch.float64)
        self.act1 = ReLU()
        self.pool1 = MaxPool1d(2)
        self.layer2 = Conv1d(in_channels=5, out_channels=1, kernel_size=2, dtype=torch.float64)
        self.fcl1 = Linear(1350, 16, dtype=torch.float64)

    def forward(self, x):
        print("raw", x)
        x = self.layer1(x)
        print("conv1d 1", x)
        x = self.act1(x)
        print("relu", x)
        x = self.layer2(x)
        print("conv1d 2", x)
        # The same pooling layer is applied seven times in a row.
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        print("pools", x)
        x = self.fcl1(x)
        print("linear", x)
        return x

output

tensor(False)
raw tensor([[9.0616e+227, 2.4353e-152,  1.0294e-71,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00]], dtype=torch.float64)
conv1d 1 tensor([[   -inf,    -inf,    -inf,  ..., -0.2516, -0.2516, -0.2516],
        [    inf,     inf,     inf,  ...,  0.3377,  0.3377,  0.3377],
        [   -inf,    -inf,    -inf,  ...,  0.4285,  0.4285,  0.4285],
        [   -inf,    -inf,    -inf,  ..., -0.1230, -0.1230, -0.1230],
        [    inf,     inf,     inf,  ...,  0.3793,  0.3793,  0.3793]],
       dtype=torch.float64, grad_fn=<SqueezeBackward1>)
relu tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [   inf,    inf,    inf,  ..., 0.3377, 0.3377, 0.3377],
        [0.0000, 0.0000, 0.0000,  ..., 0.4285, 0.4285, 0.4285],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [   inf,    inf,    inf,  ..., 0.3793, 0.3793, 0.3793]],
       dtype=torch.float64, grad_fn=<ReluBackward0>)
conv1d 2 tensor([[        -inf,         -inf,         -inf,  ..., -5.4167e+265,
         -5.4167e+265, -5.4167e+265]], dtype=torch.float64,
       grad_fn=<SqueezeBackward1>)
pools tensor([[        -inf, -5.4167e+265, -5.4167e+265,  ..., -5.4167e+265,
         -5.4167e+265, -5.4167e+265]], dtype=torch.float64,
       grad_fn=<SqueezeBackward1>)
linear tensor([[inf, inf, -inf, -inf, -inf, inf, inf, inf, inf, inf, inf, -inf, inf, inf, -inf, -inf]],
       dtype=torch.float64, grad_fn=<AddmmBackward0>)
tensor([[inf, inf, -inf, -inf, -inf, inf, inf, inf, inf, inf, inf, -inf, inf, inf, -inf, -inf]],
       dtype=torch.float64, grad_fn=<AddmmBackward0>)
epoch: 1 loss: nan
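The step from inf logits to a NaN loss can be reproduced without PyTorch. The sketch below assumes the criterion applies a log-softmax to the model output (which the LogSoftmaxBackward0 in the error message suggests) and uses the common max-subtraction formulation. It is a plain-Python illustration, not PyTorch's actual implementation, and the function name log_softmax here is just a local helper.

```python
import math

def log_softmax(logits):
    """Log-softmax over a list of floats using the usual max-subtraction trick."""
    m = max(logits)
    # If m is inf, then inf - inf evaluates to nan for the largest entries.
    shifted = [x - m for x in logits]
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]

# Finite logits give finite log-probabilities.
finite_result = log_softmax([1.0, 2.0, 3.0])
print(finite_result)

# Logits containing inf and -inf, like the "linear" output above,
# turn into nan because inf - inf is undefined.
inf_result = log_softmax([float("inf"), float("-inf"), 1.0])
print(inf_result)
```

Once any value in the forward pass is inf, a NaN loss follows almost inevitably, so the place to look is the first layer where inf appears rather than the loss function itself.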

Thanks for helping

Answer

I don't have enough reputation to comment, so I'm posting this as an answer. This may be caused by exploding gradients due to an excessively high learning rate. I recommend reducing the learning rate or using weight_decay.
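As a rough sketch of what that could look like: the question does not show which optimizer is used, so SGD, the stand-in model, the random data, and the specific lr and weight_decay values below are all assumptions for illustration. Gradient-norm clipping is included as an additional, commonly used safeguard that goes beyond what this answer suggests.

```python
import torch
from torch import nn

# Stand-in model for illustration only; substitute the Net class from the question.
model = nn.Linear(4, 2, dtype=torch.float64)
criterion = nn.CrossEntropyLoss()

# Assumed starting values, not tuned recommendations.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-5)

# Random, well-scaled example batch (not the asker's data).
inputs = torch.randn(8, 4, dtype=torch.float64)
labels = torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()

# Extra safeguard beyond the answer: rescale gradients whose total norm exceeds 1.0.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

print(f"loss: {loss.item()}, gradient norm before clipping: {float(total_norm)}")
```

Note that a smaller learning rate cannot help if inf already appears in the very first forward pass; in that case the scale of the inputs is the more likely culprit.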

https://en.xdnf.cn/q/71396.html
