Question 1

I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits comfortably in memory, but I get the MemoryError during the internal np.dot() call.

Here's my use-case and how I am currently tackling it.

Here's my 100-dimensional parent vector, which I need to compare with 500,000 other vectors of the same dimension (i.e. 100):

parent_vector = [1, 2, 3, 4 ..., 100]

Here are my child vectors (with some made-up random numbers for this example)

child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]

My final goal is to get the top-N child vectors (with their names, such as child_vector_1, and their corresponding cosine scores) that have the highest cosine similarity with the parent vector.

My current approach (which I know is inefficient and memory consuming):

Step 1: Create a super-dataframe of the following shape

parent_vector         1,    2,    3, .....,    100   
child_vector_1        2,    3,    4, .....,    101   
child_vector_2        3,    4,    5, .....,    102   
child_vector_3        4,    5,    6, .....,    103   
......................................   
child_vector_500000   3,    4,    5, .....,    103

Step 2: Use

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)

to get the pairwise cosine similarity between all vectors (shown in the dataframe above)

Step 3: Make a list of tuples that stores the key, such as child_vector_1, and the value, its cosine similarity score, for all such combinations.

Step 4: Get the top-N by calling sort() on the list, so that I get each child vector's name as well as its cosine similarity score with the parent vector.

PS: I know this is highly inefficient, but I couldn't think of a better way to compute the cosine similarity between each child vector and the parent vector faster and get the top-N values.

Any help would be highly appreciated.

Question 2

Even though your (500000, 100) array (the parent and its children) fits into memory, a full pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the similarity between every pair of rows, including every pair of children. Storing those values requires a (500000, 500000) array of floats, which at 8 bytes per float64 value takes about 2 TB of memory.

Thankfully, there is an easy solution to your problem. If I understand you correctly, you only want the similarity between each child and the parent, which results in a vector of length 500000 that is easily stored in memory.
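As a quick sanity check of the two sizes mentioned above, here is a back-of-the-envelope calculation. It assumes 8-byte float64 values, which is NumPy's default float type; with float32 every figure would be halved.

```python
import numpy as np

n_vectors = 500_000
n_dims = 100
bytes_per_float = np.dtype(np.float64).itemsize  # 8 bytes (assumed dtype)

# Input data: one row per vector.
input_bytes = n_vectors * n_dims * bytes_per_float
# Full pairwise result: one score for every pair of rows.
pairwise_bytes = n_vectors * n_vectors * bytes_per_float
# One-vs-all result: one score per row against the parent only.
one_vs_all_bytes = n_vectors * bytes_per_float

print(f"input array:       {input_bytes / 1e6:.0f} MB")
print(f"pairwise matrix:   {pairwise_bytes / 1e12:.1f} TB")
print(f"one-vs-all vector: {one_vs_all_bytes / 1e6:.0f} MB")
```

The input and the one-vs-all result both fit easily in 16 GB; only the full pairwise matrix does not.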

To do this, simply pass a second argument to cosine_similarity containing only the parent_vector:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in data: 500,000 random vectors of dimension 100.
df = pd.DataFrame(np.random.rand(500000, 100))

# Here I assume that the parent vector is stored as the first row in the dataframe,
# but you could also store it separately.
# cosine_similarity returns a (500000, 1) array, so flatten it before storing it as a column.
df['distances'] = cosine_similarity(df, df.iloc[0:1]).ravel()

n = 10  # or however many you want
# This contains the parent itself as the most similar entry, hence n + 1 to get n children.
n_largest = df['distances'].nlargest(n + 1)
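The question also asks for the child names alongside the scores. Below is a minimal NumPy-only sketch of that, assuming the parent is kept separately from the children. The function name top_n_similar and the names list are hypothetical, introduced only for illustration. It also assumes n >= 1 and that no vector has zero norm.

```python
import numpy as np

def top_n_similar(parent, children, names, n=10):
    """Return (name, score) pairs for the n children most similar to parent."""
    parent = np.asarray(parent, dtype=float)
    children = np.asarray(children, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(children, axis=1) * np.linalg.norm(parent)
    scores = (children @ parent) / norms
    n = min(n, len(scores))
    # argpartition picks the n largest scores without sorting all of them.
    top = np.argpartition(scores, -n)[-n:]
    # Sort only those n candidates, best first.
    top = top[np.argsort(scores[top])[::-1]]
    return [(names[i], float(scores[i])) for i in top]

# Small stand-in data (random, as in the example above).
rng = np.random.default_rng(0)
parent_vector = rng.random(100)
child_vectors = rng.random((1000, 100))
child_names = [f"child_vector_{i + 1}" for i in range(len(child_vectors))]

for name, score in top_n_similar(parent_vector, child_vectors, child_names, n=5):
    print(name, round(score, 4))
```

This only ever holds one score per child in memory, and the partial sort avoids ordering all 500,000 scores when only the top few are needed.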

Hope that solves your question.

Cosine similarity for very large dataset
