Keras/TensorFlow - high acc, bad prediction

2024/9/20 12:41:02

I'm new to machine learning and I'm trying to train a model that detects the city of Prague in a sentence. The name can appear in many word forms:

Prague, PRAHA, Z Prahy, etc.

I have a training dataset of about 5000 examples, each consisting of a title and a result, where result is binary (1 or 0).

You can see a sample in the code comments.

My thoughts:

  1. load train dataset (title,result) and test dataset (title)
  2. set X_train, y_train
  3. convert title column from X_train to sequences of numbers
  4. create model and set layers (I'm not sure here if I do it right)
  5. train
  6. test

Training prints this:

Epoch 15/20 - 0s - loss: 0.0303 - acc: 0.9924
Epoch 16/20 - 0s - loss: 0.0304 - acc: 0.9922
Epoch 17/20 - 0s - loss: 0.0648 - acc: 0.9779
Epoch 18/20 - 0s - loss: 0.0589 - acc: 0.9816
Epoch 19/20 - 0s - loss: 0.0494 - acc: 0.9844
Epoch 20/20

But the test returns these values:

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

This means it detected Prague in only these two sentences from the test CSV:

  1. Silvestr v Dublinu z Prahy
  2. Přímé lety do BRUSELU z PRAHY od 518 Kč

The first sentence is a substring of one of the X_train sentences, and the second sentence is identical to one of the X_train sentences.

I tried increasing the number of epochs and the batch_size, without success.

Other test sentences have been created randomly or by modifying X_test sentences.

    import pandas
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.preprocessing import sequence
    from keras.preprocessing.text import Tokenizer


    def train():
        # load train dataset
        # "TIP! Ukraine Airlines - Thajsko - levné letenky Bangkok z Prahy (a zpět) 9.790,- kč",1
        # Predvianočná MALAGA s odletom z Viedne už za 18€,0
        # S 5* Singapore Airlines z Prahy do Singapuru a pak na Maledivy za 15.940 Kč,1
        # Athény z Katowic či Blavy,0
        # Z Prahy na kanárský ostrov Tenerife vč. zavazadla. Letenky od 1 990 Kč,1
        # Hotel v Praze i na víkend za 172Kč! (i jednolůžkové pokoje),1
        dataframe = pandas.read_csv("prague_train_set.csv")
        dataframe['title'] = dataframe['title'].str.lower()
        dataset = dataframe.values

        # load test dataset
        # v Praze je super # Should be 1, predicts 0
        # Silvestr v Dublinu z Prahy # Should be 1, predicts 1
        # do Prahy zavita peter # Should be 1, predicts 0
        # toto nie # Should be 0, predicts 0
        # xxx # Should be 0, predicts 0
        # Praha **** # Should be 1, predicts 0
        # z Prahy Přímo # Should be 1, predicts 0
        # Tip na dárek: Řím z Prahy za 778Kč (letfdenky tam i zpět) # Should be 1, predicts 0
        # lety do BRUSELU z PRAHY od 518 K # Should be 1, predicts 0
        # Přímé lety do BRUSELU z PRAHY od 518 Kč # Should be 1, predicts 1
        # Gelachovský stit # Should be 0, predicts 0
        tdataframe = pandas.read_csv("prague_test_set.csv")
        tdataframe['title'] = tdataframe['title'].str.lower()
        tdataset = tdataframe.values

        # Preprocess dataset: map every character to an integer and pad to a fixed length
        X_train = dataset[:, 0]
        X_test = tdataset[:, 0]
        y_train = dataset[:, 1]

        tokenizer = Tokenizer(char_level=True)
        tokenizer.fit_on_texts(X_train)
        X_train = tokenizer.texts_to_sequences(X_train)

        SEQ_MAX_LEN = 200
        X_train = sequence.pad_sequences(X_train, maxlen=SEQ_MAX_LEN)
        X_test = tokenizer.texts_to_sequences(X_test)
        X_test = sequence.pad_sequences(X_test, maxlen=SEQ_MAX_LEN)

        # create model
        model = Sequential()
        # model.add(Embedding(tokenizer.word_index.__len__(), 32, input_length=100))
        model.add(Dense(SEQ_MAX_LEN, input_dim=SEQ_MAX_LEN, kernel_initializer='uniform', activation='relu'))
        model.add(Dense(10, kernel_initializer='uniform', activation='relu'))
        model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))

        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

        # Fit the model
        model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=2)
        # model.save("trainmodel.h5")
        # model = load_model("trainmodel.h5")

        # calculate predictions
        predictions = model.predict(X_test)

        # round predictions
        rounded = [round(x[0]) for x in predictions]
        print(rounded)

Do you know what I should do to make it work correctly?

Answer

There are two possible problems here: data skew and overfitting.

  1. Data skew: your dataset might be skewed. For example, if it contains only 1% positives, a trivial algorithm that always predicts 0 will be 99% accurate. Accuracy is misleading in that case, so you need the following metrics to quantify "goodness":

    • precision and recall
    • F1-score
  2. Overfitting, also called a generalisation problem: if the model has many trainable parameters (the weights and biases of your neural network) relative to the amount of training data, it can fit those parameters to do well on the training set and still fail to generalise. In theory, the model's VC dimension together with the number of training examples (m) bounds the gap between training and test error. So you can try:

    • increasing the training data size (by collecting more examples)
    • adding regularisation
    • using dropout
    • looking into how many nodes the neural network should have
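
As a minimal sketch of the first point, the snippet below computes accuracy, precision, recall and F1 by hand for the eleven test sentences from the question. The expected labels are taken from the "Should be" comments in the question's code and the predictions from the printed output; the helper name `binary_metrics` is made up for illustration and is not part of Keras.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Labels from the "Should be ..." comments, predictions from the printed output.
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.45 precision=1.00 recall=0.25 f1=0.40
```

On this test set the recall is only 0.25: the model misses six of the eight sentences that mention Prague. Running the same function on a held-out part of the training data would show whether the 99% training accuracy carries over to unseen examples. scikit-learn's `precision_score`, `recall_score` and `f1_score` compute the same quantities.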