Cosine similarity for very large dataset

2024/10/12 12:30:13

I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits comfortably in memory, but the MemoryError is raised during the internal np.dot() call.

Here's my use-case and how I am currently tackling it.

Here's my 100-dimensional parent vector, which I need to compare with 500,000 other vectors of the same dimension (i.e. 100):

parent_vector = [1, 2, 3, 4 ..., 100]

Here are my child vectors (with some made-up random numbers for this example)

child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]

My final goal is to get the top-N child vectors (with their names, such as child_vector_1, and their corresponding cosine scores) that have the highest cosine similarity with the parent vector.

My current approach (which I know is inefficient and memory consuming):

Step 1: Create a super-dataframe of the following shape

parent_vector         1,    2,    3, .....,    100   
child_vector_1        2,    3,    4, .....,    101   
child_vector_2        3,    4,    5, .....,    102   
child_vector_3        4,    5,    6, .....,    103   
......................................   
child_vector_500000   3,    4,    5, .....,    103

Step 2: Use

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)

to get the pairwise cosine similarity between all vectors (shown in the dataframe above).

Step 3: Make a list of tuples storing the key, such as child_vector_1, and the value, i.e. the cosine similarity score, for all such combinations.

Step 4: Get the top-N by sorting the list with sort(), so that I get each child vector's name as well as its cosine similarity score with the parent vector.

PS: I know this is highly inefficient, but I couldn't think of a better way to quickly compute the cosine similarity between each child vector and the parent vector and get the top-N values.

Any help would be highly appreciated.

Answer

Even though your (500000, 100) array (the parent and its children) fits into memory, any pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the distance between every pair of rows. To store these distances you would need a (500000, 500000) array of floats, which at 8 bytes per float64 entry would take about 2 TB of memory.
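The size of that matrix can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; it assumes 8-byte float64 entries, and the variable names are made up for illustration.

```python
# Rough memory estimate, assuming every entry is an 8-byte float64.
n_vectors = 500_000
bytes_per_float64 = 8

# All-pairs result: one entry for every (row, row) combination.
full_matrix_bytes = n_vectors * n_vectors * bytes_per_float64
# Parent-vs-children result: one entry per child.
one_column_bytes = n_vectors * bytes_per_float64

print(f"full pairwise matrix: {full_matrix_bytes / 1e12:.1f} TB")       # 2.0 TB
print(f"parent-vs-children column: {one_column_bytes / 1e6:.1f} MB")    # 4.0 MB
```

The all-pairs result is far beyond a 16 GB machine, while the single column of parent-vs-child scores is tiny.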

Thankfully, there is an easy solution to your problem. If I understand you correctly, you only want the similarity between each child and the parent, which results in a vector of length 500000 that is easily stored in memory.

To do this, you simply need to provide a second argument to cosine_similarity containing only the parent_vector:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.rand(500000, 100))
# Here I assume that the parent vector is stored as the first row in the dataframe, but you could also store it separately
df['distances'] = cosine_similarity(df, df.iloc[0:1]).ravel()  # flatten the (500000, 1) result into a single column
n = 10  # or however many you want
n_largest = df['distances'].nlargest(n + 1)  # this contains the parent itself as the most similar entry, hence n+1 to get n children

Hope that answers your question.
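If you also want the child names next to their scores without going through a dataframe, the same idea can be sketched in plain NumPy. Treat this as a minimal sketch under assumptions: the function top_n_similar and the names parent_vector, child_vectors and child_names are hypothetical, no vector is assumed to be all zeros, and the demo uses 50,000 random rows only to keep it light (500,000 rows work the same way).

```python
import numpy as np

def top_n_similar(parent, children, names, n=10):
    """Return up to n (name, score) pairs, highest cosine similarity first."""
    parent = np.asarray(parent, dtype=float)
    children = np.asarray(children, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    # Assumes no all-zero vectors (those would cause a division by zero).
    norms = np.linalg.norm(children, axis=1) * np.linalg.norm(parent)
    scores = children @ parent / norms
    n = min(n, len(scores))
    if n <= 0:
        return []
    # argpartition selects the n largest scores without sorting everything;
    # only those n candidates are then sorted in descending order.
    top = np.argpartition(scores, -n)[-n:]
    top = top[np.argsort(scores[top])[::-1]]
    return [(names[i], float(scores[i])) for i in top]

# Made-up demo data standing in for the real vectors.
rng = np.random.default_rng(0)
parent_vector = rng.random(100)
child_vectors = rng.random((50_000, 100))
child_names = [f"child_vector_{i + 1}" for i in range(len(child_vectors))]

for name, score in top_n_similar(parent_vector, child_vectors, child_names, n=5):
    print(name, round(score, 4))
```

Because the parent is kept separate from the children here, there is no need to ask for n + 1 results and drop the parent afterwards.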

https://en.xdnf.cn/q/69651.html
