I have an xls file with one column containing 10,000 strings.
I want to do a few things:
1- Make a heatmap or a cluster figure that shows the similarity percentage between each string and every other string.
To find the percentage of similarity between one string and another, I found this post, Find the similarity percent between two strings, and I tried to make it work for me.
As an example, I have these in an xls file, where each line is one string:
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR
AAAAAGPLQPETENAGTSV
AAAAANNGAAPPDLSLMALAR
AAAAASAVNDYYGTWGQK
AAAAASGASNTDSSATKPK
AAAAGFNWDDADVK
AAAAGFNWDDADVKK
I could not figure out how to use that example when I have many combinations.
For example, in my sample I have 7 strings, and each one has a similarity with every other one.
import xlrd
from difflib import SequenceMatcher

workbook = xlrd.open_workbook('data.xlsx')

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
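For reference, here is a minimal sketch of how a `similar` helper like the one above could be applied to every pair of strings. The names `pairwise_similarity` and `scores` are hypothetical, and the strings are hard-coded instead of being read from the workbook.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The seven sample strings, hard-coded here instead of being read from Excel.
strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR',
    'AAAAAGPLQPETENAGTSV',
    'AAAAANNGAAPPDLSLMALAR',
    'AAAAASAVNDYYGTWGQK',
    'AAAAASGASNTDSSATKPK',
    'AAAAGFNWDDADVK',
    'AAAAGFNWDDADVKK',
]

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, a, b).ratio()

def pairwise_similarity(items):
    # Hypothetical helper: one score per unordered pair of indices (i < j).
    return {
        (i, j): similar(items[i], items[j])
        for i, j in combinations(range(len(items)), 2)
    }

scores = pairwise_similarity(strings)
for (i, j), score in scores.items():
    print(f"{strings[i]}  {strings[j]}  {score * 100:.1f}%")
```

With 7 strings this gives 21 pairs. The number of pairs grows quadratically, so 10,000 strings would mean roughly 50 million comparisons.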
Taking two of the strings from your list as a sample, I offer this way of calculating a measure.
>>> from collections import Counter
>>> stringA = 'AAAAAGPLQPETENAGTSV'
>>> stringB = 'AAAAANNGAAPPDLSLMALAR'
>>> unionSize = len(stringA) + len(stringB)
>>> A=Counter(list(stringA))
>>> B=Counter(list(stringB))
>>> A
Counter({'A': 6, 'G': 2, 'E': 2, 'T': 2, 'P': 2, 'V': 1, 'Q': 1, 'S': 1, 'N': 1, 'L': 1})
>>> B
Counter({'A': 9, 'L': 3, 'N': 2, 'P': 2, 'G': 1, 'M': 1, 'S': 1, 'R': 1, 'D': 1})
>>> symDiff = set(A.keys()).symmetric_difference(set(B.keys()))
>>> symDiff
{'M', 'V', 'Q', 'E', 'T', 'D', 'R'}
>>> symDiffSize = 0
>>> for key in symDiff:
...     if key in A.keys():
...         symDiffSize += A[key]
...     else:
...         symDiffSize += B[key]
...
>>> symDiffSize, unionSize
(9, 40)
If the two strings had all letters in common then there would be no letters in their 'symmetric difference', so a fraction such as unionSize / symDiffSize would have a zero denominator. With that fraction, the more letters the strings have in common and the fewer that are unshared, the greater the value. You could perhaps take its logarithm.
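A rough sketch of how the pair could be turned into a single number while guarding against the zero denominator. The function names are hypothetical. Returning infinity for a zero denominator and the bounded `shared_fraction` variant are assumptions on my part, not something the measure itself dictates.

```python
import math

def union_over_sym_diff(sym_diff_size, union_size):
    # Larger means more alike; infinite when no letter is unshared (assumption).
    if sym_diff_size == 0:
        return math.inf
    return union_size / sym_diff_size

def log_ratio(sym_diff_size, union_size):
    # Logarithm of the fraction above, as suggested in the text.
    ratio = union_over_sym_diff(sym_diff_size, union_size)
    return math.inf if math.isinf(ratio) else math.log(ratio)

def shared_fraction(sym_diff_size, union_size):
    # Alternative that stays in [0, 1] and never divides by zero
    # unless both strings are empty, which is treated as identical (assumption).
    if union_size == 0:
        return 1.0
    return 1.0 - sym_diff_size / union_size

# The sample pair computed above: symDiffSize = 9, unionSize = 40.
print(union_over_sym_diff(9, 40))
print(log_ratio(9, 40))
print(shared_fraction(9, 40))
```

The bounded variant can be multiplied by 100 if a percentage is wanted.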
I don't have Excel. This code accepts a list of strings, which you could read from Excel. It computes each string's multiset (aka bag) only once to save resources. It also returns a pair rather than a ratio, because the denominator can sometimes be zero.
from collections import Counter

strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR',
    'AAAAAGPLQPETENAGTSV',
    'AAAAANNGAAPPDLSLMALAR',
    'AAAAASAVNDYYGTWGQK',
    'AAAAASGASNTDSSATKPK',
    'AAAAGFNWDDADVK',
    'AAAAGFNWDDADVKK',
]

class NikDistance():
    def __init__(self, strings):
        # Precompute lengths and letter counts once per string.
        self.stringLengths = [len(s) for s in strings]
        self.stringCounters = []
        for s in strings:
            self.stringCounters.append(Counter(list(s)))

    def __call__(self, i, j):
        unionDiff = self.stringLengths[i] + self.stringLengths[j]
        # Letters that occur in exactly one of the two strings.
        symDiff = set(self.stringCounters[i].keys()).symmetric_difference(set(self.stringCounters[j].keys()))
        symDiffSize = 0
        for key in symDiff:
            if key in self.stringCounters[i].keys():
                symDiffSize += self.stringCounters[i][key]
            else:
                symDiffSize += self.stringCounters[j][key]
        return (symDiffSize, unionDiff)

nikDistance = NikDistance(strings)

for i in range(len(strings)):
    for j in range(i+1, len(strings)):
        print (strings[i], strings[j], nikDistance(i,j))
Result:
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAAGPLQPETENAGTSV (7, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAANNGAAPPDLSLMALAR (11, 54)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASAVNDYYGTWGQK (9, 51)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASGASNTDSSATKPK (9, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVK (13, 47)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVKK (14, 48)
AAAAAGPLQPETENAGTSV AAAAANNGAAPPDLSLMALAR (9, 40)
AAAAAGPLQPETENAGTSV AAAAASAVNDYYGTWGQK (10, 37)
AAAAAGPLQPETENAGTSV AAAAASGASNTDSSATKPK (8, 38)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVK (15, 33)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVKK (16, 34)
AAAAANNGAAPPDLSLMALAR AAAAASAVNDYYGTWGQK (14, 39)
AAAAANNGAAPPDLSLMALAR AAAAASGASNTDSSATKPK (9, 40)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVK (12, 35)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVKK (13, 36)
AAAAASAVNDYYGTWGQK AAAAASGASNTDSSATKPK (6, 37)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVK (6, 32)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVKK (6, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVK (10, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVKK (10, 34)
AAAAGFNWDDADVK AAAAGFNWDDADVKK (0, 29)
Consider the last item. There are 29 characters altogether, and zero of them belong to letters that appear in only one of the two strings.
Look at the penultimate item. There are a total of 34 characters. Ten (10) of them belong to letters that appear in only one of the two strings.
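Since the question asks for a heatmap, the pairs could be arranged into a square matrix along these lines. This is only a sketch: `nik_pair` and `similarity_matrix` are hypothetical names, and scoring each pair as 1 - symDiffSize / unionSize is my assumption for getting a value between 0 and 1.

```python
from collections import Counter

strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR',
    'AAAAAGPLQPETENAGTSV',
    'AAAAANNGAAPPDLSLMALAR',
    'AAAAASAVNDYYGTWGQK',
    'AAAAASGASNTDSSATKPK',
    'AAAAGFNWDDADVK',
    'AAAAGFNWDDADVKK',
]

def nik_pair(a, b):
    # Returns (symDiffSize, unionSize), the same pair as in the listing above.
    count_a, count_b = Counter(a), Counter(b)
    only_one = set(count_a) ^ set(count_b)  # letters found in exactly one string
    # A Counter returns 0 for a missing key, so one of the two terms is always 0.
    sym_diff_size = sum(count_a[k] + count_b[k] for k in only_one)
    return sym_diff_size, len(a) + len(b)

def similarity_matrix(items):
    # Symmetric n x n matrix of scores; the diagonal is 1.0 by construction.
    n = len(items)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sym_diff_size, union_size = nik_pair(items[i], items[j])
            # Assumed scoring: 1.0 when no letter is unshared, lower otherwise.
            score = 1.0 - sym_diff_size / union_size if union_size else 1.0
            matrix[i][j] = matrix[j][i] = score
    return matrix

matrix = similarity_matrix(strings)
for row in matrix:
    print(' '.join(f'{value:.2f}' for value in row))
```

A nested list like this can be handed to a plotting function such as matplotlib's `imshow` to draw the heatmap.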