I have a large list(over 1,000,000 items), which contains english words:
tokens = ["today", "good", "computer", "people", "good", ... ]
I'd like to get all the items that occurs only once in the list
now I'm using:
tokens_once = set(word for word in set(tokens) if tokens.count(word) == 1)
but it's really slow. how could I make this faster?
You iterate over a list and then for each element you do it again, which makes it O(N²). If you replace your count
by a Counter
, you iterate once over the list and then once again over the list of unique elements, which makes it, in the worst case, O(2N), i.e. O(N).
from collections import Countertokens = ["today", "good", "computer", "people", "good"]
single_tokens = [k for k, v in Counter(tokens).iteritems() if v == 1 ]
# single_tokens == ['today', 'computer', 'people']