Question 1

I'm new to machine learning and I'm trying to train a model that detects whether the city of Prague is mentioned in a sentence. The name can appear in many word forms:

Prague, PRAHA, Z Prahy, etc.

I have a training dataset of about 5000 examples, each consisting of a title and a result, where the result is binary (1 or 0).

You can see a sample in the code comments.

My thoughts:

load train dataset (title,result) and test dataset (title)
set X_train, y_train
convert title column from X_train to sequences of numbers
create model and set layers (I'm not sure I'm doing this right)
train
test

Training prints this:

Epoch 15/20- 0s - loss: 0.0303 - acc: 0.9924
Epoch 16/20- 0s - loss: 0.0304 - acc: 0.9922
Epoch 17/20- 0s - loss: 0.0648 - acc: 0.9779
Epoch 18/20- 0s - loss: 0.0589 - acc: 0.9816
Epoch 19/20- 0s - loss: 0.0494 - acc: 0.9844
Epoch 20/20

But the test returns these values:

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

This means it detected the word Prague only in these two sentences from the test CSV:

Silvestr v Dublinu z Prahy
Přímé lety do BRUSELU z PRAHY od 518 Kč

The first sentence is a substring of one of the X_train sentences, and the second is identical to one of the X_train sentences.

I tried increasing the number of epochs and the batch_size, without success.

Other test sentences have been created randomly or by modifying X_test sentences.

    import pandas
    from keras.layers import Dense, Embedding
    from keras.models import Sequential, load_model
    from keras.preprocessing import sequence
    from keras.preprocessing.text import Tokenizer


    def train():
        # load train dataset
        #  "TIP! Ukraine Airlines - Thajsko - levné letenky Bangkok z Prahy (a zpět) 9.790,- kč",1
        # Predvianočná MALAGA s odletom z Viedne už za 18€,0
        # S 5* Singapore Airlines z Prahy do Singapuru a pak na Maledivy za 15.940 Kč,1
        # Athény z Katowic či Blavy,0
        # Z Prahy na kanárský ostrov Tenerife vč. zavazadla. Letenky od 1 990 Kč,1
        # Hotel v Praze i na víkend za 172Kč! (i jednolůžkové pokoje),1
        dataframe = pandas.read_csv("prague_train_set.csv")
        dataframe['title'] = dataframe['title'].str.lower()
        dataset = dataframe.values

        # load test dataset
        # v Praze je super # Should be 1, predicts 0
        # Silvestr v Dublinu z Prahy # Should be 1, predicts 1
        # do Prahy zavita peter # Should be 1, predicts 0
        # toto nie # Should be 0, predicts 0
        # xxx # Should be 0, predicts 0
        # Praha **** # Should be 1, predicts 0
        # z Prahy Přímo # Should be 1, predicts 0
        # Tip na dárek: Řím z Prahy za 778Kč (letfdenky tam i zpět) # Should be 1, predicts 0
        # lety do BRUSELU z PRAHY od 518 K # Should be 1, predicts 0
        # Přímé lety do BRUSELU z PRAHY od 518 Kč # Should be 1, predicts 1
        # Gelachovský stit # Should be 0, predicts 0
        tdataframe = pandas.read_csv("prague_test_set.csv")
        tdataframe['title'] = tdataframe['title'].str.lower()
        tdataset = tdataframe.values

        # Preprocess dataset
        X_train = dataset[:,0]
        X_test = tdataset[:,0]
        y_train = dataset[:,1]

        tokenizer = Tokenizer(char_level=True)
        tokenizer.fit_on_texts(X_train)
        X_train = tokenizer.texts_to_sequences(X_train)
        SEQ_MAX_LEN = 200
        X_train = sequence.pad_sequences(X_train, maxlen=SEQ_MAX_LEN)
        X_test = tokenizer.texts_to_sequences(X_test)
        X_test = sequence.pad_sequences(X_test, maxlen=SEQ_MAX_LEN)

        # create model
        model = Sequential()
        # model.add(Embedding(tokenizer.word_index.__len__(), 32, input_length=100))
        model.add(Dense(SEQ_MAX_LEN, input_dim=SEQ_MAX_LEN, init='uniform', activation='relu'))
        model.add(Dense(10, init='uniform', activation='relu'))
        model.add(Dense(1, init='uniform', activation='sigmoid'))

        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

        # Fit the model
        model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=2)
        # model.save("trainmodel.h5")
        # model = load_model("trainmodel.h5")

        # calculate predictions
        predictions = model.predict(X_test)

        # round predictions
        rounded = [round(x[0]) for x in predictions]
        print(rounded)

Do you know what I should do to make it work correctly?

Answer

There are two possible problems here: 1. data skewness (class imbalance) and 2. overfitting.

Data skewness: your dataset might be skewed. For example, if it contains only 1% positives, then a trivial classifier that always predicts 0 will be 99% accurate. Accuracy is misleading in that case, so you need metrics like the following to quantify "goodness":
- precision and recall
- F1 score
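
As a rough illustration of why these metrics matter, the sketch below computes them in plain Python for the eleven test titles from the question, using the "Should be" values as labels and the printed predictions as model output. The helper name `precision_recall_f1` is made up for this example; it is not part of Keras.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Expected labels and model outputs for the 11 test titles in the question.
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.45 precision=1.00 recall=0.25 f1=0.40
```

The recall of 0.25 shows that the model misses most of the Prague titles, which a single accuracy number hides. If scikit-learn is available, its `precision_score`, `recall_score` and `f1_score` functions compute the same quantities.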
Overfitting: also called a generalisation problem. If the model has many trainable parameters (the weights and biases of your neural network) relative to the amount of training data, it can fit those parameters to do well on the training set without generalising to unseen data. In theory, the generalisation gap is bounded in terms of the model's VC dimension and the number of training examples (m). So you can try:
- increasing the training data size (by collecting more)
- adding regularisation
- using dropout
- looking into how many nodes the neural network should have
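
A minimal sketch of how regularisation and dropout could be added to a model like the one in the question, assuming a recent Keras release (the 2.x version bundled with TensorFlow, or Keras 3). The function name `build_model`, the layer sizes, the L2 strength and the dropout rate are all illustrative assumptions, not tuned values or the asker's actual settings.

```python
from keras import regularizers
from keras.layers import Dense, Dropout, Input
from keras.models import Sequential

SEQ_MAX_LEN = 200  # same padded sequence length as in the question


def build_model(input_dim=SEQ_MAX_LEN, l2_strength=1e-4, dropout_rate=0.5):
    """Small binary classifier with L2 weight penalties and dropout.

    All hyperparameters here are assumed starting points for experiments.
    """
    model = Sequential([
        Input(shape=(input_dim,)),
        # Fewer hidden units than the input width, plus an L2 weight penalty.
        Dense(32, activation="relu",
              kernel_regularizer=regularizers.l2(l2_strength)),
        # Randomly zeroes a fraction of activations during training only.
        Dropout(dropout_rate),
        Dense(10, activation="relu",
              kernel_regularizer=regularizers.l2(l2_strength)),
        Dropout(dropout_rate),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


model = build_model()
# Holding out part of the training data shows whether validation accuracy
# tracks training accuracy, for example:
# model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)
```

If training accuracy stays near 0.99 while validation accuracy is much lower, that gap is a sign of overfitting, and the dropout rate or L2 strength can be adjusted from there.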

Keras/TensorFlow - high acc, bad prediction
