Python NLTK WUP Similarity Score not unity for exact same word

2024/10/2 18:31:20

Simple code like the following gives a similarity score of 0.75 in both cases. As you can see, both words are exactly the same. To avoid any confusion I also compared a word with itself. The score refuses to budge from 0.75. What is going on here?

from nltk.corpus import wordnet as wn
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
# Compare two lookups of the same synset
similarity = actual.wup_similarity(predicted)
print(similarity)
# Compare the synset with itself
similarity = actual.wup_similarity(actual)
print(similarity)

Answer

This is an interesting problem.

TL;DR:

Sorry there's no short answer to this problem =(


Too long, want to read:

Looking at the code for wup_similarity(), the problem comes not from the similarity calculation but from the way NLTK traverses the WordNet hierarchies to get the lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).

Normally, the lowest common hypernyms between a synset and itself would have to be itself:

>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]

But in the case of orange it gives fruit too:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]

We'll have to take a look at the code for lowest_common_hypernyms(). From its docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:

Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned or if there are multiple such synsets at the same depth they are all returned.

However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

So let's try lowest_common_hypernyms() with use_min_depth=False:

>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]

Seems like that resolves the ambiguity of the tied path. But the wup_similarity() API doesn't have the use_min_depth parameter:

>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'

Note the difference: when use_min_depth==False, lowest_common_hypernyms() compares the candidate synsets by their maximum depth, but when use_min_depth==True, it compares them by their minimum depth. See https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602

So if we trace the lowest_common_hypernyms() code:

>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]
# if use_min_depth==True
>>> max_depth = max(x.min_depth() for x in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>> 
# if use_min_depth==False
>>> max_depth = max(x.max_depth() for x in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
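
The same tie can be reproduced without WordNet. The following is a minimal sketch on a made-up hierarchy: the node names, the HYPERNYMS table and the lowest_common() helper are all hypothetical and only mimic the selection logic traced above, they are not NLTK code. The point is that a node reachable by both a short and a long path can share its minimum depth with one of its own ancestors, while its maximum depth stays strictly larger.

```python
# Toy hypernym table (child -> list of parents). Entirely hypothetical;
# it does NOT reproduce the real WordNet paths for 'orange'.
HYPERNYMS = {
    "entity": [],
    "a": ["entity"],
    "b": ["a"],
    "fruit": ["b"],                      # only one path: depth 3
    "food": ["entity"],
    "edible_fruit": ["fruit", "food"],   # short path via food, long path via fruit
    "orange": ["edible_fruit"],
}


def min_depth(node):
    """Length of the shortest path from the root to node."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)


def max_depth(node):
    """Length of the longest path from the root to node."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


def ancestors_and_self(node):
    """Return node together with every node reachable through its parents."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in HYPERNYMS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lowest_common(a, b, use_min_depth):
    """Pick the deepest common nodes, measuring depth by min or max path."""
    common = ancestors_and_self(a) & ancestors_and_self(b)
    depth = min_depth if use_min_depth else max_depth
    deepest = max(depth(n) for n in common)
    return sorted(n for n in common if depth(n) == deepest)


print(lowest_common("orange", "orange", use_min_depth=True))   # ['fruit', 'orange']
print(lowest_common("orange", "orange", use_min_depth=False))  # ['orange']
```

In this toy graph 'orange' has minimum depth 3 (via 'food') and maximum depth 5 (via 'fruit'), so it ties with 'fruit' only when minimum depth is used.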

This odd behaviour of wup_similarity() is actually highlighted in the code comments, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843

# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)

The first subsumer in the list is then selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:

subsumer = subsumers[0]

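Naturally, in the case of the orange synset, fruit is selected since it comes first in the list of tied lowest common hypernyms. The score is therefore computed against an ancestor of orange rather than against orange itself, which is why it falls below 1.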
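
To see why picking an ancestor as the subsumer lowers the score, here is a minimal sketch of the Wu-Palmer arithmetic. It assumes the score is computed as 2 * depth / (len1 + len2), where depth is the subsumer's depth counted from 1 and len1, len2 add each synset's distance to the subsumer; that is my reading of the NLTK source, not a quotation of it. The function name wup() and the numbers passed to it are illustrative values chosen to be consistent with the 0.75 in the question, not values read from WordNet.

```python
def wup(subsumer_depth, dist1, dist2):
    """Wu-Palmer score from the subsumer's depth and each synset's distance to it.

    subsumer_depth is assumed to be counted with the root at depth 1.
    """
    len1 = dist1 + subsumer_depth
    len2 = dist2 + subsumer_depth
    return 2.0 * subsumer_depth / (len1 + len2)


# Subsumer is the synset itself: both distances are 0, so the score is 1.0
# whatever the depth.
print(wup(12, 0, 0))  # 1.0

# Subsumer is an ancestor three steps up (assumed values): the score drops.
print(wup(9, 3, 3))   # 0.75
```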

To conclude, the default is something of a feature rather than a bug: it is kept so that results stay reproducible with NLTK v2.x.

So one solution might be to manually change the NLTK source to force use_min_depth=False:

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
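
If you would rather not edit the installed package, the same idea can be sketched as a standalone function. This is an untested sketch under assumptions: the name wup_similarity_max_depth is hypothetical, it only relies on the synset methods lowest_common_hypernyms(), max_depth() and shortest_path_distance(), it ignores the simulate_root handling that the library version performs (so treat it as noun-only), and it follows the 2 * depth / (len1 + len2) form described above.

```python
def wup_similarity_max_depth(synset1, synset2):
    """Wu-Palmer similarity using use_min_depth=False to pick the subsumer.

    Hypothetical helper, not part of NLTK. Returns None when the two
    synsets share no hypernym or no path to the subsumer can be found.
    """
    subsumers = synset1.lowest_common_hypernyms(synset2, use_min_depth=False)
    if not subsumers:
        return None
    subsumer = subsumers[0]

    # Depth of the subsumer, counting the root as depth 1.
    depth = subsumer.max_depth() + 1

    dist1 = synset1.shortest_path_distance(subsumer)
    dist2 = synset2.shortest_path_distance(subsumer)
    if dist1 is None or dist2 is None:
        return None

    return (2.0 * depth) / ((dist1 + depth) + (dist2 + depth))


# Usage (requires NLTK and the WordNet corpus):
# from nltk.corpus import wordnet as wn
# x = wn.synsets('orange')[0]
# print(wup_similarity_max_depth(x, x))
```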


EDITED

Alternatively, to work around the problem, you can do an ad-hoc check for the same synset:

def wup_similarity_hacked(synset1, synset2):
    if synset1 == synset2:
        return 1.0
    else:
        return synset1.wup_similarity(synset2)