Using pretrained GloVe word embeddings with scikit-learn

2024/10/11 16:24:44

I have used Keras with pre-trained word embeddings, but I am not sure how to do the same with a scikit-learn model.

I need to do this in sklearn as well because I am using vecstack to ensemble a Keras Sequential model and an sklearn model.

This is what I have done for the Keras model:

import os
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

glove_dir = '/home/Documents/Glove'

# Map each word to its 200-dimensional GloVe vector
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'), 'r', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

# max_words, maxlen and word_index are defined earlier (not shown)
embedding_dim = 200
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
.
.
# Load the GloVe weights into the embedding layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
model.compile(----)
model.fit(-----)

I am very new to scikit-learn. From what I have seen, to make a model in sklearn you do:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)

So my question is: how do I use pre-trained GloVe with this model? Where do I pass the pre-trained GloVe embedding_matrix?

Thank you very much and I really appreciate your help.

Answer

You can simply use the Zeugma library.

You can install it with pip install zeugma, then create and train your model with the following lines of code (assuming corpus_train and corpus_test are lists of strings):

from sklearn.linear_model import LogisticRegression
from zeugma.embeddings import EmbeddingTransformer

# Turn each document into a fixed-length vector using pre-trained GloVe
glove = EmbeddingTransformer('glove')
x_train = glove.transform(corpus_train)

model = LogisticRegression()
model.fit(x_train, y_train)

x_test = glove.transform(corpus_test)
model.predict(x_test)

You can also use different pre-trained embeddings (complete list here) or train your own (see Zeugma's documentation for how to do this).
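
If you would rather not add a dependency, the general idea can be sketched by hand: scikit-learn estimators have no embedding layer, so each document has to be turned into a fixed-length feature vector before calling fit, for example by averaging the GloVe vectors of its words. The snippet below is a minimal sketch of that averaging approach, not Zeugma's implementation. The helper mean_embedding, the whitespace tokenization, and the tiny 4-dimensional embeddings_index are assumptions made for illustration; in practice you would reuse the embeddings_index dict loaded from glove.6B.200d.txt in the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the embeddings_index dict built from the GloVe file
# (assumption: real vectors would be 200-dimensional).
embedding_dim = 4
embeddings_index = {
    "good": np.array([0.9, 0.1, 0.0, 0.2], dtype="float32"),
    "great": np.array([0.8, 0.2, 0.1, 0.3], dtype="float32"),
    "bad": np.array([-0.9, 0.1, 0.0, 0.2], dtype="float32"),
    "awful": np.array([-0.8, 0.2, 0.1, 0.3], dtype="float32"),
    "movie": np.array([0.0, 0.5, 0.5, 0.0], dtype="float32"),
}


def mean_embedding(texts, embeddings_index, embedding_dim):
    """Hypothetical helper: average the word vectors of each document.

    Words missing from embeddings_index are skipped; a document with no
    known words is left as an all-zero row.
    """
    features = np.zeros((len(texts), embedding_dim), dtype="float32")
    for row, text in enumerate(texts):
        # Naive lowercase + whitespace tokenization (an assumption)
        vectors = [
            embeddings_index[token]
            for token in text.lower().split()
            if token in embeddings_index
        ]
        if vectors:
            features[row] = np.mean(vectors, axis=0)
    return features


corpus_train = ["good movie", "great movie", "bad movie", "awful movie"]
y_train = [1, 1, 0, 0]
corpus_test = ["great good movie", "awful bad movie", "unknown words only"]

# One row per document, one column per embedding dimension
x_train = mean_embedding(corpus_train, embeddings_index, embedding_dim)
x_test = mean_embedding(corpus_test, embeddings_index, embedding_dim)

model = LogisticRegression()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
```

Averaging discards word order, so it is a simpler representation than the Keras Embedding layer feeding a sequence model, but it produces the 2-D array that sklearn estimators (and therefore vecstack) expect.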
