Question 1

I have an xls file with one column of 10000 strings, and I want to do a few things.

1- Make a heatmap or a cluster figure that shows the similarity percentage between each string and every other one.

To find the percentage of similarity between one string and another, I found the post "Find the similarity percent between two strings" and tried to make it work for me.

As an example, I have these in an xls file, where each line is one string:

AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR
AAAAAGPLQPETENAGTSV
AAAAANNGAAPPDLSLMALAR
AAAAASAVNDYYGTWGQK
AAAAASGASNTDSSATKPK
AAAAGFNWDDADVK
AAAAGFNWDDADVKK

I could not figure out how to use that example when there are many combinations. In my example I have 7 strings, and each one has a similarity with every other one.

import xlrd
from difflib import SequenceMatcher

workbook = xlrd.open_workbook('data.xlsx')

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Answer

Taking two of the strings from your list as a sample, I offer this way of calculating a measure.

>>> from collections import Counter
>>> stringA = 'AAAAAGPLQPETENAGTSV'
>>> stringB = 'AAAAANNGAAPPDLSLMALAR'
>>> unionSize = len(stringA) + len(stringB)
>>> A=Counter(list(stringA))
>>> B=Counter(list(stringB))
>>> A
Counter({'A': 6, 'G': 2, 'E': 2, 'T': 2, 'P': 2, 'V': 1, 'Q': 1, 'S': 1, 'N': 1, 'L': 1})
>>> B
Counter({'A': 9, 'L': 3, 'N': 2, 'P': 2, 'G': 1, 'M': 1, 'S': 1, 'R': 1, 'D': 1})
>>> symDiff = set(A.keys()).symmetric_difference(set(B.keys()))
>>> symDiff
{'M', 'V', 'Q', 'E', 'T', 'D', 'R'}
>>> symDiffSize = 0
>>> for key in symDiff:
...     if key in A.keys():
...         symDiffSize += A[key]
...     else:
...         symDiffSize += B[key]
...         
>>> symDiffSize, unionSize
(9, 40)

Think of the pair as a fraction with symDiffSize as the denominator (for example unionSize / symDiffSize). If the two strings had all letters in common, there would be no letters in their 'symmetric difference', which would make the denominator zero. So the more letters the strings have in common and the fewer that are unshared, the greater the fraction. You could perhaps take its logarithm.
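A minimal sketch of that fraction, taking it to be unionSize / symDiffSize. The function name `nik_ratio` and the choice to return `math.inf` when no letters are unshared are assumptions made for illustration, not part of the measure described above.

```python
import math
from collections import Counter


def nik_ratio(string_a, string_b):
    """Return (combined length) / (count of characters whose letter is unshared)."""
    count_a = Counter(string_a)
    count_b = Counter(string_b)
    combined = len(string_a) + len(string_b)
    # Letters that occur in exactly one of the two strings.
    unshared = set(count_a) ^ set(count_b)
    # A Counter returns 0 for a missing key, so only one term is non-zero.
    unshared_size = sum(count_a[k] + count_b[k] for k in unshared)
    if unshared_size == 0:
        return math.inf  # assumption: "nothing unshared" is treated as infinite
    return combined / unshared_size


ratio = nik_ratio('AAAAAGPLQPETENAGTSV', 'AAAAANNGAAPPDLSLMALAR')
print(ratio)            # 40 / 9
print(math.log(ratio))  # the logarithm compresses large ratios
```

Returning infinity keeps the result a single number, but it does not plot well; that is one reason to keep the raw pair instead.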

I don't have Excel. This code accepts a list of strings which you could glean from Excel. It avoids redundant calculations of multisets (aka bags) to save resources. Also, it returns a pair, rather than a ratio because sometimes the denominator can be zero.

from collections import Counter

strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR',
    'AAAAAGPLQPETENAGTSV',
    'AAAAANNGAAPPDLSLMALAR',
    'AAAAASAVNDYYGTWGQK',
    'AAAAASGASNTDSSATKPK',
    'AAAAGFNWDDADVK',
    'AAAAGFNWDDADVKK',
]

class NikDistance():
    def __init__(self, strings):
        self.stringLengths = [len(s) for s in strings]
        self.stringCounters = []
        for s in strings:
            self.stringCounters.append(Counter(list(s)))

    def __call__(self, i, j):
        unionDiff = self.stringLengths[i] + self.stringLengths[j]
        symDiff = set(self.stringCounters[i].keys()).symmetric_difference(set(self.stringCounters[j].keys()))
        symDiffSize = 0
        for key in symDiff:
            if key in self.stringCounters[i].keys():
                symDiffSize += self.stringCounters[i][key]
            else:
                symDiffSize += self.stringCounters[j][key]
        return (symDiffSize, unionDiff)

nikDistance = NikDistance(strings)

for i in range(len(strings)):
    for j in range(i+1, len(strings)):
        print(strings[i], strings[j], nikDistance(i, j))

Result:

AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAAGPLQPETENAGTSV (7, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAANNGAAPPDLSLMALAR (11, 54)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASAVNDYYGTWGQK (9, 51)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASGASNTDSSATKPK (9, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVK (13, 47)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVKK (14, 48)
AAAAAGPLQPETENAGTSV AAAAANNGAAPPDLSLMALAR (9, 40)
AAAAAGPLQPETENAGTSV AAAAASAVNDYYGTWGQK (10, 37)
AAAAAGPLQPETENAGTSV AAAAASGASNTDSSATKPK (8, 38)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVK (15, 33)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVKK (16, 34)
AAAAANNGAAPPDLSLMALAR AAAAASAVNDYYGTWGQK (14, 39)
AAAAANNGAAPPDLSLMALAR AAAAASGASNTDSSATKPK (9, 40)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVK (12, 35)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVKK (13, 36)
AAAAASAVNDYYGTWGQK AAAAASGASNTDSSATKPK (6, 37)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVK (6, 32)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVKK (6, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVK (10, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVKK (10, 34)
AAAAGFNWDDADVK AAAAGFNWDDADVKK (0, 29)

Consider the last item. There are 29 characters altogether, and there are no (zero) characters that don't appear in both strings.

Look at the penultimate item. There are a total of 34 characters. Ten (10) of them do not appear in both strings.
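For the heatmap part of the question, the pairs can be turned into a square matrix and drawn. The following is only a sketch under some assumptions: the helper names (`shared_fraction`, `similarity_matrix`, `plot_heatmap`) are made up for illustration, the pair is converted to a similarity as 1 - symDiffSize / total (my choice, not something fixed above), the strings are non-empty, and matplotlib is installed if you call the plotting function.

```python
from collections import Counter


def shared_fraction(a, b):
    """Fraction of all characters whose letter occurs in both strings."""
    count_a, count_b = Counter(a), Counter(b)
    unshared = set(count_a) ^ set(count_b)
    unshared_size = sum(count_a[k] + count_b[k] for k in unshared)
    return 1 - unshared_size / (len(a) + len(b))


def similarity_matrix(strings):
    """Build a symmetric matrix, computing each pair only once."""
    n = len(strings)
    matrix = [[1.0] * n for _ in range(n)]  # a string fully matches itself
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = shared_fraction(strings[i], strings[j])
    return matrix


def plot_heatmap(matrix, labels):
    """Draw the matrix; matplotlib is imported here so the rest runs without it."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    image = ax.imshow(matrix, vmin=0, vmax=1, cmap='viridis')
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    plt.show()


strings = ['AAAAASGASNTDSSATKPK', 'AAAAGFNWDDADVK', 'AAAAGFNWDDADVKK']
matrix = similarity_matrix(strings)
# plot_heatmap(matrix, strings)  # uncomment to display the figure
```

With 10000 strings there are about 50 million pairs, so a nested Python list like this will be slow and memory-hungry; you would probably want to sample the strings or work on a subset before plotting.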

how to find similarity between many strings and plot it
