I'm new to machine learning and I'm trying to train a model that detects the city of Prague in a sentence. It can appear in many word forms:
Prague, PRAHA, Z Prahy, etc.
So I have a training dataset which consists of title and result, where result is binary, 1 or 0 (about 5000 examples).
You can see a sample in the code comments.
My thoughts:
- load train dataset (title,result) and test dataset (title)
- set X_train, y_train
- convert title column from X_train to sequences of numbers
- create model and set layers (I'm not sure here if I do it right)
- train
- test
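To make the "convert title column to sequences of numbers" step concrete, here is a minimal pure-Python sketch of character-level encoding followed by left-padding. It is a hand-rolled stand-in, not the Keras `Tokenizer` itself: the helper names `build_char_index` and `encode` are made up for illustration, and ids are assigned in order of first appearance, which is an assumption that differs from how Keras orders its index.

```python
def build_char_index(texts):
    """Give each distinct character an integer id starting at 1 (0 is kept for padding)."""
    index = {}
    for text in texts:
        for ch in text:
            if ch not in index:
                index[ch] = len(index) + 1
    return index


def encode(text, index, max_len):
    """Map characters to ids, skip unknown characters, and left-pad with zeros."""
    ids = [index[ch] for ch in text if ch in index]
    ids = ids[-max_len:]  # if the text is too long, keep only the last max_len ids
    return [0] * (max_len - len(ids)) + ids


# Two lowercased forms of the city name, as they might appear in titles.
titles = ["z prahy", "v praze"]
char_index = build_char_index(titles)
encoded = [encode(t, char_index, max_len=10) for t in titles]

for title, row in zip(titles, encoded):
    print(title, row)
```

Each title becomes a fixed-length row of integers, so the position of a given character in the row depends on how long the title is.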
Training prints this:
Epoch 15/20 - 0s - loss: 0.0303 - acc: 0.9924
Epoch 16/20 - 0s - loss: 0.0304 - acc: 0.9922
Epoch 17/20 - 0s - loss: 0.0648 - acc: 0.9779
Epoch 18/20 - 0s - loss: 0.0589 - acc: 0.9816
Epoch 19/20 - 0s - loss: 0.0494 - acc: 0.9844
Epoch 20/20
But the test returns these values:
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
Which means it detected the word Prague in only these two sentences from the test CSV:
- Silvestr v Dublinu z Prahy
- Přímé lety do BRUSELU z PRAHY od 518 Kč
The first sentence is a substring of one sentence from X_train, and the second sentence is equal to one of the X_train sentences.
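For reference, this is a small sketch of how the rounded predictions line up with the test titles. The titles and the "Should be" labels are copied from the comments in my code and the predictions from the output above; the variable names `expected`, `detected` and `missed` are only for illustration.

```python
# Test titles and expected labels, as listed in the code comments.
titles = [
    "v Praze je super",
    "Silvestr v Dublinu z Prahy",
    "do Prahy zavita peter",
    "toto nie",
    "xxx",
    "Praha ****",
    "z Prahy Přímo",
    "Tip na dárek: Řím z Prahy za 778Kč (letfdenky tam i zpět)",
    "lety do BRUSELU z PRAHY od 518 K",
    "Přímé lety do BRUSELU z PRAHY od 518 Kč",
    "Gelachovský stit",
]
expected = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0]

# Rounded model output, as printed by the script.
rounded = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

# Titles the model flagged as containing Prague.
detected = [t for t, p in zip(titles, rounded) if p == 1.0]

# Titles that should have been flagged but were not.
missed = [t for t, e, p in zip(titles, expected, rounded) if e == 1 and p == 0.0]

print("detected:", detected)
print("missed:", len(missed), "of", sum(expected), "positive titles")
```

So 6 of the 8 titles that mention Prague are missed, and the only two that are detected are the ones that overlap with the training data.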
I tried increasing the number of epochs and the batch_size, without success...
The other test sentences were created randomly or by modifying X_test sentences.
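One way to check which test sentences overlap with the training data is sketched below. The helper `overlap_kind` is a hypothetical name, and `train_titles` is a tiny made-up stand-in for the real training file: the longer "Silvestr" title in particular is a placeholder I invented for the example, not an actual row from my data.

```python
def overlap_kind(test_title, train_titles):
    """Classify a test title as 'exact', 'substring', or 'unseen' relative to the training titles."""
    needle = test_title.lower()
    haystack = [t.lower() for t in train_titles]
    if needle in haystack:
        return "exact"
    if any(needle in t for t in haystack):
        return "substring"
    return "unseen"


# Hypothetical stand-in for the training titles (placeholder rows, not real data).
train_titles = [
    "Přímé lety do BRUSELU z PRAHY od 518 Kč",
    "Silvestr v Dublinu z Prahy za 2 990 Kč",  # made-up longer title
]

for title in [
    "Přímé lety do BRUSELU z PRAHY od 518 Kč",
    "Silvestr v Dublinu z Prahy",
    "v Praze je super",
]:
    print(title, "->", overlap_kind(title, train_titles))
```

With the real training file in place of the placeholder list, this would show whether the model only gets right the titles it has effectively already seen.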
import pandas
from keras.layers import Dense
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer


def train():
    # load train dataset
    # "TIP! Ukraine Airlines - Thajsko - levné letenky Bangkok z Prahy (a zpět) 9.790,- kč",1
    # Predvianočná MALAGA s odletom z Viedne už za 18€,0
    # S 5* Singapore Airlines z Prahy do Singapuru a pak na Maledivy za 15.940 Kč,1
    # Athény z Katowic či Blavy,0
    # Z Prahy na kanárský ostrov Tenerife vč. zavazadla. Letenky od 1 990 Kč,1
    # Hotel v Praze i na víkend za 172Kč! (i jednolůžkové pokoje),1
    dataframe = pandas.read_csv("prague_train_set.csv")
    dataframe['title'] = dataframe['title'].str.lower()
    dataset = dataframe.values

    # load test dataset
    # v Praze je super                                           # Should be 1, predicts 0
    # Silvestr v Dublinu z Prahy                                 # Should be 1, predicts 1
    # do Prahy zavita peter                                      # Should be 1, predicts 0
    # toto nie                                                   # Should be 0, predicts 0
    # xxx                                                        # Should be 0, predicts 0
    # Praha ****                                                 # Should be 1, predicts 0
    # z Prahy Přímo                                              # Should be 1, predicts 0
    # Tip na dárek: Řím z Prahy za 778Kč (letfdenky tam i zpět)  # Should be 1, predicts 0
    # lety do BRUSELU z PRAHY od 518 K                           # Should be 1, predicts 0
    # Přímé lety do BRUSELU z PRAHY od 518 Kč                    # Should be 1, predicts 1
    # Gelachovský stit                                           # Should be 0, predicts 0
    tdataframe = pandas.read_csv("prague_test_set.csv")
    tdataframe['title'] = tdataframe['title'].str.lower()
    tdataset = tdataframe.values

    # Preprocess dataset
    X_train = dataset[:, 0]
    X_test = tdataset[:, 0]
    y_train = dataset[:, 1]

    # Encode every title as a sequence of character ids, padded to a fixed length
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(X_train)
    X_train = tokenizer.texts_to_sequences(X_train)
    SEQ_MAX_LEN = 200
    X_train = sequence.pad_sequences(X_train, maxlen=SEQ_MAX_LEN)
    X_test = tokenizer.texts_to_sequences(X_test)
    X_test = sequence.pad_sequences(X_test, maxlen=SEQ_MAX_LEN)

    # create model
    model = Sequential()
    # model.add(Embedding(tokenizer.word_index.__len__(), 32, input_length=100))
    model.add(Dense(SEQ_MAX_LEN, input_dim=SEQ_MAX_LEN, init='uniform', activation='relu'))
    model.add(Dense(10, init='uniform', activation='relu'))
    model.add(Dense(1, init='uniform', activation='sigmoid'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Fit the model
    model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=2)
    # model.save("trainmodel.h5")
    # model = load_model("trainmodel.h5")

    # calculate predictions
    predictions = model.predict(X_test)

    # round predictions
    rounded = [round(x[0]) for x in predictions]
    print(rounded)
Do you know what I should do to make it work correctly?