Cosine similarity of word2vec more than 1

2024/11/13 22:23:08

I used Spark's word2vec implementation to compute document vectors for a text.

I then used the model's findSynonyms function to get synonyms for a few words.

I see something like this:

w2vmodel.findSynonyms('science',4).show(5)
+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     physics| 1.714908638833209|
|     fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
|  psychology| 1.458865636374223|
+------------+------------------+

I do not understand why the cosine similarity comes out greater than 1. Cosine similarity should be between 0 and 1, or at most between -1 and +1 if negative values are allowed.

Why is it more than 1 here? What's going wrong?

Answer

You should normalize the word vectors that you get from word2vec; otherwise the dot product between them is unbounded, and any similarity score built on it can exceed 1.

From Levy et al., 2015 (and, actually, most of the literature on word embeddings):

Vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent.

How to do normalization?

You can do something like below.

import numpy as np

def normalize(word_vec):
    # Scale the vector to unit length; leave an all-zero vector unchanged.
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        return word_vec
    return word_vec / norm
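As a rough illustration of the Levy et al. point, the sketch below compares the raw dot product, the dot product of unit-length vectors, and the cosine similarity computed directly. The two vectors are made-up values rather than output from a word2vec model, and the helper names to_unit and cosine_similarity are hypothetical.

```python
import numpy as np

def to_unit(word_vec):
    # Same idea as normalize above: scale to unit length, keep zero vectors as-is.
    norm = np.linalg.norm(word_vec)
    if norm == 0:
        return word_vec
    return word_vec / norm

def cosine_similarity(a, b):
    # Textbook definition: dot product divided by the product of the norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors only (both have norm 5).
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

raw_dot = float(np.dot(a, b))                     # unbounded: grows with vector length
unit_dot = float(np.dot(to_unit(a), to_unit(b)))  # dot product after normalization
cos = cosine_similarity(a, b)                     # cosine similarity computed directly

print(raw_dot, unit_dot, cos)
```

Here the raw dot product is 24, while the normalized dot product and the cosine similarity agree at 0.96 (up to floating-point rounding), which is why normalizing first keeps the scores within the expected range.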

References

  • Should I do normalization to word embeddings from word2vec if I want to do semantic tasks?
  • Should I normalize word2vec's word vectors before using them?

Update: Why is the cosine similarity of word2vec greater than 1?

According to this answer, in the Spark implementation of word2vec, findSynonyms doesn't actually return cosine similarities, but rather cosine similarities multiplied by the norm of the query vector.

The ordering and relative values are consistent with the true cosine similarity, but the actual values are all scaled.
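The effect of that scaling can be sketched with plain NumPy. This is not Spark's code: the query and candidate vectors are made-up values, and the scaling by the query norm is simply applied by hand to mimic the behaviour described above.

```python
import numpy as np

def cosine(a, b):
    # True cosine similarity, always within [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical query vector with norm 2, and three hypothetical candidates.
query = np.array([2.0, 0.0])
candidates = {
    "near": np.array([1.0, 0.0]),
    "mid": np.array([1.0, 1.0]),
    "far": np.array([0.0, 1.0]),
}

query_norm = float(np.linalg.norm(query))
true_cos = {w: cosine(query, v) for w, v in candidates.items()}

# Assumed behaviour: each score is the cosine similarity times the query norm.
scaled = {w: c * query_norm for w, c in true_cos.items()}

# Dividing by the query norm recovers the true cosine similarity.
recovered = {w: s / query_norm for w, s in scaled.items()}

ranking_true = sorted(true_cos, key=true_cos.get, reverse=True)
ranking_scaled = sorted(scaled, key=scaled.get, reverse=True)
print(ranking_true == ranking_scaled, scaled["near"], recovered["near"])
```

Because every score is multiplied by the same positive constant, the ranking of synonyms is unchanged even though individual values can exceed 1. If the query word's vector is available, dividing the reported scores by its norm should give values back in the usual range.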

