How to find the similarity between many strings and plot it

2024/9/20 13:50:12

I have an xls file with one column of 10000 strings, and I want to do a few things.

1- Make a heatmap or a cluster figure that shows the similarity percentage between each string and every other one.

To find the percentage of similarity between one string and another, I found this post, Find the similarity percent between two strings, and I tried to make it work for me.

As an example, I have these in an xls file, where each line is one string:

AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR
AAAAAGPLQPETENAGTSV
AAAAANNGAAPPDLSLMALAR
AAAAASAVNDYYGTWGQK
AAAAASGASNTDSSATKPK
AAAAGFNWDDADVK
AAAAGFNWDDADVKK

I could not figure out how to use that example when I have many combinations. In my example, I have 7 strings, and each one has a similarity with every other one.

import xlrd
from difflib import SequenceMatcher

workbook = xlrd.open_workbook('data.xlsx')

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Answer

Taking two of the strings from your list as a sample, I offer this way of calculating a measure.

>>> from collections import Counter
>>> stringA = 'AAAAAGPLQPETENAGTSV'
>>> stringB = 'AAAAANNGAAPPDLSLMALAR'
>>> unionSize = len(stringA) + len(stringB)
>>> A=Counter(list(stringA))
>>> B=Counter(list(stringB))
>>> A
Counter({'A': 6, 'G': 2, 'E': 2, 'T': 2, 'P': 2, 'V': 1, 'Q': 1, 'S': 1, 'N': 1, 'L': 1})
>>> B
Counter({'A': 9, 'L': 3, 'N': 2, 'P': 2, 'G': 1, 'M': 1, 'S': 1, 'R': 1, 'D': 1})
>>> symDiff = set(A.keys()).symmetric_difference(set(B.keys()))
>>> symDiff
{'M', 'V', 'Q', 'E', 'T', 'D', 'R'}
>>> symDiffSize = 0
>>> for key in symDiff:
...     if key in A.keys():
...         symDiffSize += A[key]
...     else:
...         symDiffSize += B[key]
...         
>>> symDiffSize, unionSize
(9, 40)

Dividing unionSize by symDiffSize gives a possible measure: the more letters two strings have in common and the fewer that are unshared, the greater the fraction. If the two strings had all letters in common, there would be no letters in their 'symmetric difference', which would make the denominator zero. You could perhaps take the logarithm of the fraction.
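One way to sidestep the zero denominator is to divide the other way round and subtract from one, which gives a score between 0 and 1. This is only a sketch of that variation, not the measure described above; the function name `letter_similarity` and the handling of empty strings are assumptions made for illustration.

```python
from collections import Counter

def letter_similarity(a, b):
    """Return 1 - symDiffSize / unionSize, a score between 0 and 1."""
    count_a, count_b = Counter(a), Counter(b)
    union_size = len(a) + len(b)
    if union_size == 0:
        return 1.0  # assumption: two empty strings count as identical
    # Letters that occur in exactly one of the two strings.
    unshared = set(count_a) ^ set(count_b)
    # A missing key in a Counter counts as 0, so only one term is non-zero.
    sym_diff_size = sum(count_a[key] + count_b[key] for key in unshared)
    return 1 - sym_diff_size / union_size

score = letter_similarity('AAAAAGPLQPETENAGTSV', 'AAAAANNGAAPPDLSLMALAR')
print(round(score, 3))  # 0.775, i.e. 1 - 9/40
```

Note that a score of 1 here only means both strings use the same set of letters; it says nothing about their order or how often each letter occurs.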


I don't have Excel. This code accepts a list of strings, which you could extract from Excel. It avoids recomputing the multisets (aka bags) for every pair, to save resources. It also returns a pair rather than a ratio, because the denominator can sometimes be zero.

from collections import Counter

strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR',
    'AAAAAGPLQPETENAGTSV',
    'AAAAANNGAAPPDLSLMALAR',
    'AAAAASAVNDYYGTWGQK',
    'AAAAASGASNTDSSATKPK',
    'AAAAGFNWDDADVK',
    'AAAAGFNWDDADVKK',
]

class NikDistance():
    def __init__(self, strings):
        self.stringLengths = [len(str) for str in strings]
        self.stringCounters = []
        for str in strings:
            self.stringCounters.append(Counter(list(str)))

    def __call__(self, i, j):
        unionDiff = self.stringLengths[i] + self.stringLengths[j]
        symDiff = set(self.stringCounters[i].keys()).symmetric_difference(set(self.stringCounters[j].keys()))
        symDiffSize = 0
        for key in symDiff:
            if key in self.stringCounters[i].keys():
                symDiffSize += self.stringCounters[i][key]
            else:
                symDiffSize += self.stringCounters[j][key]
        return (symDiffSize, unionDiff)

nikDistance = NikDistance(strings)

for i in range(len(strings)):
    for j in range(i+1, len(strings)):
        print(strings[i], strings[j], nikDistance(i, j))

Result:

AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAAGPLQPETENAGTSV (7, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAANNGAAPPDLSLMALAR (11, 54)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASAVNDYYGTWGQK (9, 51)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASGASNTDSSATKPK (9, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVK (13, 47)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVKK (14, 48)
AAAAAGPLQPETENAGTSV AAAAANNGAAPPDLSLMALAR (9, 40)
AAAAAGPLQPETENAGTSV AAAAASAVNDYYGTWGQK (10, 37)
AAAAAGPLQPETENAGTSV AAAAASGASNTDSSATKPK (8, 38)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVK (15, 33)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVKK (16, 34)
AAAAANNGAAPPDLSLMALAR AAAAASAVNDYYGTWGQK (14, 39)
AAAAANNGAAPPDLSLMALAR AAAAASGASNTDSSATKPK (9, 40)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVK (12, 35)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVKK (13, 36)
AAAAASAVNDYYGTWGQK AAAAASGASNTDSSATKPK (6, 37)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVK (6, 32)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVKK (6, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVK (10, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVKK (10, 34)
AAAAGFNWDDADVK AAAAGFNWDDADVKK (0, 29)

Consider the last item. There are 29 characters altogether, and there are no (zero) characters that don't appear in both strings.

Look at the penultimate item. There are a total of 34 characters. Ten (10) of them do not appear in both strings.
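For the heatmap part of the question, the pairwise values could be collected into a square matrix and handed to a plotting library. The sketch below is one possible way to do that, not part of the measure above: it uses the `SequenceMatcher` ratio from the question as the similarity, assumes the strings are already loaded into a list, and assumes matplotlib is installed for the plotting step. The function names `similarity_matrix` and `plot_heatmap` are made up for illustration.

```python
from difflib import SequenceMatcher

strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR',
    'AAAAAGPLQPETENAGTSV',
    'AAAAANNGAAPPDLSLMALAR',
    'AAAAASAVNDYYGTWGQK',
    'AAAAASGASNTDSSATKPK',
    'AAAAGFNWDDADVK',
    'AAAAGFNWDDADVKK',
]

def similarity_matrix(items):
    """Return an n x n list of lists of SequenceMatcher ratios (0 to 1)."""
    n = len(items)
    # Start from 1.0 everywhere so the diagonal (a string against itself) is set.
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = SequenceMatcher(None, items[i], items[j]).ratio()
            # Compute each pair once and mirror it across the diagonal.
            matrix[i][j] = matrix[j][i] = ratio
    return matrix

def plot_heatmap(matrix, labels):
    """Draw the matrix as a heatmap (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="viridis", vmin=0, vmax=1)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    colorbar = fig.colorbar(image, ax=ax)
    colorbar.set_label("similarity ratio")
    fig.tight_layout()
    plt.show()

matrix = similarity_matrix(strings)
# plot_heatmap(matrix, strings)  # uncomment to display the figure
```

Multiplying each ratio by 100 gives a percentage. With 10000 strings there are 49,995,000 distinct pairs, so this plain double loop would likely be slow and the axis labels unreadable; working on a sample first, or dropping the tick labels, may be more practical. For a clustered view rather than a plain heatmap, `seaborn.clustermap` is one option to look at.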

