Counting phrases in Python using NLTK

2024/10/13 13:19:34

I am trying to get a phrase count from a text file, but so far I can only get a word count (see below). I need to extend this logic to count how many times each two-word phrase appears in the text file.

From my understanding, phrases can be defined and grouped using logic from NLTK. I believe the collocations functionality is what I need to get the desired result, but after reading the NLTK documentation I'm not sure how to implement it. Any tips or help would be greatly appreciated.

import re
import string
frequency = {}
document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

frequency_list = frequency.keys()

for words in frequency_list:
    print(words, frequency[words])

Answer

You can find two-word phrases with NLTK's collocations module (nltk.collocations), which identifies words that often appear consecutively within a corpus.

To find two-word phrases, you first need to calculate the frequencies of the words and how often they appear next to other words. NLTK's BigramCollocationFinder class does this. The following finds the bigram collocations and prints the two with the highest PMI (pointwise mutual information) score:

import re
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Read the file and lowercase it so matching is case-insensitive
with open('Words.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# Keep only alphabetic words of 3 to 15 letters, as in the question
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Build the finder from the word list and rank bigrams by PMI
finder = BigramCollocationFinder.from_words(match_pattern)
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 2))  # the two highest-scoring bigrams
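
Note that nbest ranks bigrams by an association score and does not report how many times each one occurs, which is what the question asks for. As a minimal sketch of plain counting using only the standard library (the count_bigrams helper and the sample text are hypothetical, not part of the original answer), adjacent words can be paired up and tallied like this:

```python
from collections import Counter
import re


def count_bigrams(text):
    """Count how often each pair of adjacent words occurs in `text`."""
    # Same tokenization as the snippets above: lowercase words of 3 to 15 letters
    words = re.findall(r'\b[a-z]{3,15}\b', text.lower())
    # Pair each word with its successor and tally the pairs
    return Counter(zip(words, words[1:]))


# Hypothetical sample text standing in for the contents of Words.txt
sample = "the quick fox saw the quick dog and the quick fox ran"
bigram_counts = count_bigrams(sample)

# Print the three most frequent two-word phrases with their counts
for (first, second), count in bigram_counts.most_common(3):
    print(first, second, count)
```

Because the regular expression drops words shorter than three letters, two words can be counted as a pair even when a short word sat between them in the original text. The NLTK finder also keeps raw pair counts in its ngram_fd frequency distribution, so finder.ngram_fd.most_common(10) should give a similar tally for the finder built above.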

NLTK Collocations Docs: http://www.nltk.org/api/nltk.html?highlight=collocation#module-nltk.collocations
