Text processing to find co-occurrences of strings

2024/10/5 7:54:48

I need to process a series of space-separated strings, i.e. text sentences. A ‘co-occurrence’ is when two tags (or words) appear in the same sentence. I need to list every pair of words that appears together on at least two lines (two sentences). The list has to be ordered and space-separated.

Example of input:

tag1 tag2

tag1 tag3

tag2 tag4 tag3

tag2 tag3

The output should be:

tag2 tag3

I can’t assume that the input will fit in memory. What I do know is that there will be no more than 10,000 tags. My problem is that the brute-force approach of reading the whole input, building a matrix of all the words, and ticking off a cell whenever a co-occurrence appears will not work.

There must be an algorithm or methodology that I haven't found. I'd appreciate tips, links or references to an algorithm or function that might be of use. I understand C, C++, MATLAB and Python.


Somewhat cumbersome:

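import re
from itertools import combinations

# input_string is assumed to hold the whole input, one sentence per line
tags = sorted(set(input_string.split()))
for tag1, tag2 in combinations(tags, 2):
    t1, t2 = re.escape(tag1), re.escape(tag2)
    # '.' does not cross newlines, so each match is one line with both tags (either order)
    pattern = r'\b{0}\b.+\b{1}\b|\b{1}\b.+\b{0}\b'.format(t1, t2)
    if len(re.findall(pattern, input_string)) > 1:
        print(tag1, tag2)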
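
Since the input may not fit in memory, a streaming variant may be a better fit than scanning one big string with a regex. The sketch below is only one possible approach, not a definitive implementation: it reads one line at a time and keeps a count per pair of tags seen together. The names cooccurring_pairs and min_lines are made up for illustration, and it assumes tags are separated by whitespace.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(lines, min_lines=2):
    """Return sorted tag pairs that share at least `min_lines` lines."""
    counts = Counter()
    for line in lines:
        # set() so a tag repeated on one line is counted once for that line
        tags = sorted(set(line.split()))
        # combinations of a sorted list yields each pair as (smaller, larger)
        counts.update(combinations(tags, 2))
    return sorted(pair for pair, n in counts.items() if n >= min_lines)


# The example input from the question
sample = ["tag1 tag2", "tag1 tag3", "tag2 tag4 tag3", "tag2 tag3"]
for a, b in cooccurring_pairs(sample):
    print(a, b)  # prints: tag2 tag3
```

Passing an open file object instead of a list lets Python iterate over the lines lazily, so only the counts stay in memory. With at most 10,000 tags there are at most 49,995,000 distinct pairs, so the counter is bounded but can still get large in the worst case. A dense triangular array of small saturating counters would be a more compact alternative.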
