difflib.SequenceMatcher isjunk argument not considered?

2024/10/8 18:36:13

In Python's difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading what the intended behavior is?

Why does the isjunk argument seem to not make any difference in this case?

difflib.SequenceMatcher(None, "AA", "A A").ratio() returns 0.8

difflib.SequenceMatcher(lambda x: x in ' ', "AA", "A A").ratio() returns 0.8

My understanding is that if space is omitted, the ratio should be 1.

Answer

This happens because ratio() divides by the total length of the two sequences as they were passed in; it does not first filter out the elements flagged by isjunk. So, as long as the matching blocks add up to the same number of matched elements with and without isjunk, the ratio is the same.

I assume that sequences are not filtered by isjunk because of performance reasons.

def ratio(self):
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    """
    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))

self.a and self.b are the strings (sequences) passed to the SequenceMatcher object ("AA" and "A A" in your example). The isjunk function lambda x: x in ' ' is only used to determine the matching blocks. Your example is quite simple, so the resulting ratio and matching blocks are the same for both calls.
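To make the formula concrete, here is a minimal sketch that recomputes the ratio by hand from the matching blocks. The names manual_ratio and is_space are hypothetical helpers for illustration, not part of difflib; the point is that T is always len(a) + len(b), whatever isjunk says.

```python
import difflib


def is_space(ch):
    """Hypothetical junk predicate: treat a single space as junk."""
    return ch == " "


def manual_ratio(isjunk, a, b):
    """Recompute 2.0 * M / T from the matching blocks (illustrative helper)."""
    matcher = difflib.SequenceMatcher(isjunk, a, b)
    # M: total size of all matching blocks (the final sentinel block has size 0).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: lengths of the sequences exactly as passed in; junk is NOT removed.
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


print(manual_ratio(None, "AA", "A A"))      # 0.8
print(manual_ratio(is_space, "AA", "A A"))  # 0.8
```

Both calls give 2.0 * 2 / 5, because the space still counts toward T even when it is flagged as junk.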

difflib.SequenceMatcher(None, "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

Same matching blocks, so the ratio is the same: M = 2, T = 2 + 3 = 5 => ratio = 2.0 * 2 / 5 = 0.8

Now consider the following example:

difflib.SequenceMatcher(None, "AA ", "A A").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=3, b=3, size=0)]

Now matching blocks are different, but the ratio will be the same because the number of matches is still equal:

When isjunk is None: M = 2, T = 6 => ratio = 2.0 * 2 / 6

When isjunk is lambda x: x == ' ': M = 1 + 1, T = 6 => ratio = 2.0 * 2 / 6
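One way to see why the blocks differ is to inspect the matcher's bjunk and b2j attributes. The sketch below assumes the same "AA " / "A A" pair; is_space is a hypothetical predicate name. Elements of the second sequence that isjunk flags are dropped from the b2j index, so they cannot start a match, yet both sequences keep their full length.

```python
import difflib


def is_space(ch):
    """Hypothetical junk predicate: treat a single space as junk."""
    return ch == " "


plain = difflib.SequenceMatcher(None, "AA ", "A A")
junky = difflib.SequenceMatcher(is_space, "AA ", "A A")

# b2j maps each non-junk element of b to the positions where it occurs.
print(plain.bjunk, sorted(plain.b2j))  # set() [' ', 'A']
print(junky.bjunk, sorted(junky.b2j))  # {' '} ['A']

# Different blocks, same number of matched elements.
m_plain = sum(block.size for block in plain.get_matching_blocks())
m_junky = sum(block.size for block in junky.get_matching_blocks())
print(m_plain, m_junky)  # 2 2
```

Without isjunk, the space can anchor the two-character match "A " (a[1:3] against b[0:2]); with isjunk, only "A" can anchor a match, which gives two one-character blocks instead.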

Finally, a different number of matches:

difflib.SequenceMatcher(None, "AA ", "A A ").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=4, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A ").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=2), Match(a=3, b=4, size=0)]

Here the number of matches differs, and so does the ratio:

When isjunk is None: M = 2, T = 7 => ratio = 2.0 * 2 / 7

When isjunk is lambda x: x == ' ': M = 1 + 2, T = 7 => ratio = 2.0 * 3 / 7
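If the goal is a ratio of 1 when the spaces are ignored, as the question expects, one option is to drop the unwanted elements before comparing. This is a minimal sketch of that idea, not something difflib does for you; ratio_ignoring and is_space are hypothetical names.

```python
import difflib


def is_space(ch):
    """Hypothetical junk predicate: treat a single space as junk."""
    return ch == " "


def ratio_ignoring(a, b, isjunk):
    """Compare a and b after removing every element that isjunk flags."""
    kept_a = [x for x in a if not isjunk(x)]
    kept_b = [x for x in b if not isjunk(x)]
    # Filtering first shrinks T as well as changing the matches.
    return difflib.SequenceMatcher(None, kept_a, kept_b).ratio()


print(ratio_ignoring("AA", "A A", is_space))    # 1.0
print(ratio_ignoring("AA ", "A A ", is_space))  # 1.0
```

The trade-off is that positions in the matching blocks would then refer to the filtered sequences rather than the original strings.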

