Keras/TensorFlow - high acc, bad prediction

2024/9/20 12:41:02

I'm new to machine learning and I'm trying to train a model that detects the city of Prague in a sentence. The name can appear in many word forms:

Prague, PRAHA, Z Prahy, etc.

So I have a training dataset that consists of title and result columns, where result is binary: 1 or 0 (about 5000 examples).

You can see a sample in the code comments.

My thoughts:

  1. load train dataset (title,result) and test dataset (title)
  2. set X_train, y_train
  3. convert title column from X_train to sequences of numbers
  4. create model and set layers (I'm not sure here if I do it right)
  5. train
  6. test

Training prints this:

Epoch 15/20 - 0s - loss: 0.0303 - acc: 0.9924
Epoch 16/20 - 0s - loss: 0.0304 - acc: 0.9922
Epoch 17/20 - 0s - loss: 0.0648 - acc: 0.9779
Epoch 18/20 - 0s - loss: 0.0589 - acc: 0.9816
Epoch 19/20 - 0s - loss: 0.0494 - acc: 0.9844
Epoch 20/20

But the test returns these values:

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

Which means it detected the word Prague only in these two sentences from the test CSV:

  1. Silvestr v Dublinu z Prahy
  2. Přímé lety do BRUSELU z PRAHY od 518 Kč

The first sentence is a substring of one of the X_train sentences and the second sentence is identical to one of the X_train sentences.

I tried increasing the number of epochs and the batch_size, without success.

Other test sentences have been created randomly or by modifying X_test sentences.

# Imports are assumed from the names used below
import pandas
from keras.layers import Dense
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer


def train():
    # load train dataset
    #  "TIP! Ukraine Airlines - Thajsko - levné letenky Bangkok z Prahy (a zpět) 9.790,- kč",1
    # Predvianočná MALAGA s odletom z Viedne už za 18€,0
    # S 5* Singapore Airlines z Prahy do Singapuru a pak na Maledivy za 15.940 Kč,1
    # Athény z Katowic či Blavy,0
    # Z Prahy na kanárský ostrov Tenerife vč. zavazadla. Letenky od 1 990 Kč,1
    # Hotel v Praze i na víkend za 172Kč! (i jednolůžkové pokoje),1
    dataframe = pandas.read_csv("prague_train_set.csv")
    dataframe['title'] = dataframe['title'].str.lower()
    dataset = dataframe.values

    # load test dataset
    # v Praze je super # Should be 1, predicts 0
    # Silvestr v Dublinu z Prahy # Should be 1, predicts 1
    # do Prahy zavita peter # Should be 1, predicts 0
    # toto nie # Should be 0, predicts 0
    # xxx # Should be 0, predicts 0
    # Praha **** # Should be 1, predicts 0
    # z Prahy Přímo # Should be 1, predicts 0
    # Tip na dárek: Řím z Prahy za 778Kč (letfdenky tam i zpět) # Should be 1, predicts 0
    # lety do BRUSELU z PRAHY od 518 K # Should be 1, predicts 0
    # Přímé lety do BRUSELU z PRAHY od 518 Kč # Should be 1, predicts 1
    # Gelachovský stit # Should be 0, predicts 0
    tdataframe = pandas.read_csv("prague_test_set.csv")
    tdataframe['title'] = tdataframe['title'].str.lower()
    tdataset = tdataframe.values

    # Preprocess dataset
    X_train = dataset[:,0]
    X_test = tdataset[:,0]
    y_train = dataset[:,1]

    # Character-level tokenizer: every character becomes an integer index
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(X_train)
    X_train = tokenizer.texts_to_sequences(X_train)
    SEQ_MAX_LEN = 200
    X_train = sequence.pad_sequences(X_train, maxlen=SEQ_MAX_LEN)
    X_test = tokenizer.texts_to_sequences(X_test)
    X_test = sequence.pad_sequences(X_test, maxlen=SEQ_MAX_LEN)

    # create model
    model = Sequential()
    # model.add(Embedding(tokenizer.word_index.__len__(), 32, input_length=100))
    model.add(Dense(SEQ_MAX_LEN, input_dim=SEQ_MAX_LEN, init='uniform', activation='relu'))
    model.add(Dense(10, init='uniform', activation='relu'))
    model.add(Dense(1, init='uniform', activation='sigmoid'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Fit the model
    model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=2)
    # model.save("trainmodel.h5")
    # model = load_model("trainmodel.h5")

    # calculate predictions
    predictions = model.predict(X_test)

    # round predictions
    rounded = [round(x[0]) for x in predictions]
    print(rounded)

Do you know what I should do to make it work correctly?


There are two possible problems here: 1. data skewness and 2. overfitting.

  1. Data skewness: your dataset might be skewed. For example, if it has only 1% positives, a trivial algorithm that always predicts 0 will be 99% accurate. In that case you need the following metrics to quantify "goodness":

    • precision and recall
    • f1-score
  2. Overfitting: also called the generalisation problem. In theory, if the model has many trainable parameters (the weights and biases of your neural network), it can fit them to do well on the training data yet fail to generalise. Theoretically the VC dimension bounds this, and the bound depends on the number of training examples (m). So you can try:

    • increasing the training data size (by getting more data)
    • adding regularisation
    • using dropout
    • reading up on how many nodes the neural network should have
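
As a rough illustration of the metrics from point 1, the sketch below computes accuracy, precision, recall and F1 by hand for the eleven test sentences in the question. The expected labels are taken from the "Should be" comments in the question's code and the predictions from its printed output; the function name binary_metrics is made up for this example.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Expected labels ("Should be ...") and predictions for the 11 test sentences
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

for name, value in binary_metrics(y_true, y_pred).items():
    print("%s: %.2f" % (name, value))
```

On this small test set the precision is 1.00 but the recall is only 0.25 (F1 0.40), which shows the model missing most of the positive sentences far more clearly than the training accuracy does.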

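To make the dropout suggestion a little more concrete, here is a minimal pure-Python sketch of what a dropout step does to a layer's activations during training (the "inverted dropout" variant). The function name, the activation values and the rate of 0.5 are all made up for illustration; this is not the question's model.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate` and
    scale the survivors by 1 / (1 - rate) so the expected value is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # rng.random() is uniform on [0, 1), so a unit is dropped with probability `rate`
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed only so that runs are repeatable
hidden = [0.5, 1.2, 0.0, 2.0, 0.7, 1.5, 0.3, 0.9, 1.1, 0.4]  # hypothetical 10-unit layer
print(dropout(hidden, rate=0.5, rng=rng))
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which tends to reduce overfitting. In Keras the same idea is available as a Dropout layer placed between the Dense layers, and it is only active during training.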
