I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use cosine_similarity from sklearn.metrics.pairwise, I get a MemoryError on my 16 GB machine. Each array fits in memory just fine on its own, but I get the MemoryError during the internal np.dot() call.
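If it helps, here is a minimal sketch (with made-up random data) of what I believe triggers the error. My suspicion is that the pairwise output is a 500,001 x 500,001 float64 matrix (roughly 2 TB), even though the 500,001 x 100 input itself is only about 400 MB:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-in for my real data: 500,001 vectors of dimension 100 (~400 MB).
vectors = np.random.rand(500_001, 100)

# The result is a 500,001 x 500,001 float64 matrix (~2 TB), which seems
# to be where the MemoryError comes from.
similarities = cosine_similarity(vectors)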
Here's my use-case and how I am currently tackling it.
Here's my 100-dimensional parent vector, which I need to compare against 500,000 other vectors of the same dimension (i.e. 100):
parent_vector = [1, 2, 3, 4 ..., 100]
Here are my child vectors (with some made-up random numbers for this example)
child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]
My final goal is to get the top-N child vectors (with their names, such as child_vector_1, and their corresponding cosine scores) that have the highest cosine similarity with the parent vector.
My current approach (which I know is inefficient and memory-consuming):
Step 1: Create a super-dataframe of the following shape (a rough code sketch follows the table):
parent_vector 1, 2, 3, ....., 100
child_vector_1 2, 3, 4, ....., 101
child_vector_2 3, 4, 5, ....., 102
child_vector_3 4, 5, 6, ....., 103
......................................
child_vector_500000 3, 4, 5, ....., 103
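Roughly, here is how I build that dataframe (the names and values below are placeholders for my real data):

import numpy as np
import pandas as pd

# Placeholder data standing in for my real vectors.
rng = np.random.default_rng(0)
names = ["parent_vector"] + [f"child_vector_{i}" for i in range(1, 500_001)]
values = rng.random((500_001, 100))

# One row per vector, indexed by name -- shape (500001, 100).
df = pd.DataFrame(values, index=names)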
Step 2: Use

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)

to get pairwise cosine similarity between all vectors (shown in the above dataframe).
Step 3: Make a list of tuples storing each key (such as child_vector_1) and its value (the corresponding cosine similarity score) for all such combinations.
Step 4: Get the top-N by sorting the list, so that I end up with each child vector's name as well as its cosine similarity score with the parent vector.
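Concretely, steps 3 and 4 currently look roughly like this (continuing from the snippet above, and assuming the parent vector is row 0 of the dataframe):

# Row 0 of the pairwise matrix holds the parent's similarity to every row.
similarities = cosine_similarity(df)
parent_scores = similarities[0]

# Pair each child's name with its score, skipping the parent itself.
pairs = list(zip(df.index[1:], parent_scores[1:]))

# Sort descending by score and keep the top N.
N = 10
top_n = sorted(pairs, key=lambda p: p[1], reverse=True)[:N]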
PS: I know this is highly inefficient, but I couldn't think of a better way to compute cosine similarity faster between each child vector and the parent vector and then get the top-N values.
Any help would be highly appreciated.