I need to process a series of space-separated strings, i.e. text sentences. A ‘co-occurrence’ is when two tags (or words) appear in the same sentence. I need to list every pair of words that appear together on at least two lines (two sentences). The output list has to be ordered and space-separated.
Example of input:
tag1 tag2
tag1 tag3
tag2 tag4 tag3
tag2 tag3
The output should be:
tag2 tag3
I can’t assume that the input will fit in memory. What I do know is that there are not going to be more than 10,000 tags. My problem is that the brute-force approach of reading the whole input, building a matrix of all the words, and ticking off a cell whenever a co-occurrence appears will not work.
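To make the requirement concrete, here is a minimal Python sketch of that brute-force idea. It reads one line at a time, but it keeps a count for every pair it sees, so it is only a reference for small inputs and not a solution under the memory constraint. The names `cooccurring_pairs` and `min_lines` are made up for illustration, and I am assuming that a tag repeated within one line counts only once for that line.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(lines, min_lines=2):
    """Return the sorted tag pairs that appear together on at least `min_lines` lines."""
    counts = Counter()
    for line in lines:
        # Assumption: a tag repeated on one line counts once for that line.
        # Sorting gives every pair a single canonical order (a, b) with a < b.
        tags = sorted(set(line.split()))
        counts.update(combinations(tags, 2))
    # Keep only the pairs seen on enough lines, in sorted order.
    return sorted(pair for pair, n in counts.items() if n >= min_lines)


sample = ["tag1 tag2", "tag1 tag3", "tag2 tag4 tag3", "tag2 tag3"]
for a, b in cooccurring_pairs(sample):
    print(a, b)
```

On the four sample lines above this prints the single pair `tag2 tag3`, which is the output I expect.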
There must be an algorithm or methodology that I haven't found. I'd appreciate tips, links, or references to an algorithm or function that might be of use. I understand C, C++, MATLAB, and Python.