Question 1

Simple code like follows gives out similarity score of 0.75 for both cases. As you can see both the words are the exact same. To avoid any confusion I also compared a word with itself. The score refuses to bulge from 0.75. What is going on here?

from nltk.corpus import wordnet as wn
actual=wn.synsets('orange')[0]
predicted=wn.synsets('orange')[0]
similarity=actual.wup_similarity(predicted)
print similarity
similarity=actual.wup_similarity(actual)
print similarity

Question 2

This is an interesting problem.

TL;DR:

Sorry there's no short answer to this problem =(

Too long, want to read:

Looking at the code for wup_similarity(), the problem comes from not the similarity calculations but the way NLTK traverse the WordNet hierarchies to get the lowest_common_hypernym() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).

Normally, the lowest common hypernyms between a synset and itself would have to be itself:

>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]

But in the case of orange it gives fruit too:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]

We'll have to take a look at the code for the lowest_common_hypernym(), from the docstring of https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805

Get a list of lowest synset(s) that both synsets have as a hypernym.When use_min_depth == False this means that the synset which appears as ahypernym of both self and other with the lowest maximum depth is returnedor if there are multiple such synsets at the same depth they are all returnedHowever, if use_min_depth == True then the synset(s) which has/have the lowestminimum depth and appear(s) in both paths is/are returned

So let's try the lowest_common_hypernym() with use_min_depth=False:

>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]

Seems like that resolves the ambiguity of the tied path. But the wup_similarity() API doesn't have the use_min_depth parameter:

>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'

Note the difference is that when use_min_depth==False, the lowest_common_hypernym checks for maximum depth while traversing synsets. But when use_min_depth==True, it checks for minimum depth, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602

So if we trace the lowest_common_hypernym code:

>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]# if use_min_depth==True
>>> max_depth = max(x.min_depth() for x in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>> 
# if use_min_depth==False
>>> max_depth = max(x.max_depth() for x in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]

This weird phenomena with wup_similarity is actually highlighted in the code comments, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843

# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)

And when the first subsumer in the list is selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:

subsumer = subsumers[0]

Naturally, in the case of orange synset, fruit is selected first sense it's first of the list that have tied lowest common hypernyms.

To conclude, the default parameter is sort of a feature not a bug to maintain the reproducibility as with NLTK v2.x.

So the solution might be to either manually change the NLTK source to force use_min_depth=False:

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845

EDITED

To resolve the problem, possibly you can do an ad-hoc check for same synset:

def wup_similarity_hacked(synset1, synset2):if synset1 == synset2:return 1.0else:return synset1.wup_similarity(synset2)

Python NLTK WUP Similarity Score not unity for exact same word

Related Q&A

SystemExit: 2 error when calling parse_args() in iPython Notebook

How to convert numpy datetime64 [ns] to python datetime?

how to remove a back slash from a JSON file

Can sockets be used to connect multiple computers on different networks in python? [closed]

python logging close and application exit

LookupError: unknown encoding: cp0

Cant activate Python venv in Windows 10

Opening file path not working in python [duplicate]

Cython: Segmentation Fault Using API Embedding Cython to C

Computing AUC and ROC curve from multi-class data in scikit-learn (sklearn)?