How to get n-gram collocations and association measures in Python NLTK?

2024/10/9 2:26:37

The documentation has examples that use nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder.

It also shows how to find the n best bigrams and trigrams ranked by PMI. For example:

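>>> import nltk
>>> from nltk.collocations import BigramCollocationFinder
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(
...     nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)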

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from AbstractCollocationFinder, while BigramAssocMeasures() and TrigramAssocMeasures() inherit from NgramAssocMeasures.

How can I use the methods (e.g. nbest()) of AbstractCollocationFinder and NgramAssocMeasures for 4-grams, 5-grams, 6-grams, ..., n-grams, as easily as for bigrams and trigrams?

Should I create a class that inherits from AbstractCollocationFinder?



If you want n-grams beyond bigrams and trigrams, you can use scikit-learn's CountVectorizer to generate them and NLTK's FreqDist to count them. I tried doing this with nltk.collocations, but I don't think it can score anything longer than 3-grams, so I decided to go with n-gram counts instead. I hope this helps a little.

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.probability import FreqDist

query = "This document gives a very short introduction to machine learning problems"

# Build an analyzer that emits every 1-gram to 4-gram of the input text
vect = CountVectorizer(ngram_range=(1, 4))
analyzer = vect.build_analyzer()
listNgramQuery = analyzer(query)
print("listNgramQuery=", listNgramQuery)

# Count how often each n-gram occurs
NgramQueryWeights = FreqDist(listNgramQuery)
print("\nNgramQueryWeights=", NgramQueryWeights)

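On the Python 2 setup this was written for, it printed the output below. The u'' prefixes are Python 2 unicode strings, and the exact ordering and formatting vary with the Python, NLTK, and scikit-learn versions. Note that the default analyzer lowercases the text and drops one-letter tokens such as "a":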

listNgramQuery= [u'to machine learning problems', u'introduction to machine learning', u'short introduction to machine', u'very short introduction to', u'gives very short introduction', u'document gives very short', u'this document gives very', u'machine learning problems', u'to machine learning', u'introduction to machine', u'short introduction to', u'very short introduction', u'gives very short', u'document gives very', u'this document gives', u'learning problems', u'machine learning', u'to machine', u'introduction to', u'short introduction', u'very short', u'gives very', u'document gives', u'this document', u'problems', u'learning', u'machine', u'to', u'introduction', u'short', u'very', u'gives', u'document', u'this']

NgramQueryWeights= <FreqDist: u'document': 1, u'document gives': 1, u'document gives very': 1, u'document gives very short': 1, u'gives': 1, u'gives very': 1, u'gives very short': 1, u'gives very short introduction': 1, u'introduction': 1, u'introduction to': 1, ...>
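If a PMI-style ranking is still wanted for longer n-grams, one option is to compute it directly from the raw counts. The following is only a minimal sketch using the standard library, not NLTK's own implementation: the helper names ngrams, pmi_scores, and nbest are made up for illustration, Counter stands in for the role FreqDist plays above, and the score is assumed to be log2 of the n-gram probability divided by the product of its unigram probabilities, with every probability estimated as count / number of tokens. The tokens come from a plain whitespace split, so unlike CountVectorizer's default analyzer it keeps one-letter words such as "a".

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram of `tokens` as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def pmi_scores(tokens, n):
    """Map each n-gram to log2(P(w1..wn) / (P(w1) * ... * P(wn))).

    Assumption of this sketch: every probability is count / len(tokens).
    """
    unigram_counts = Counter(tokens)
    ngram_counts = Counter(ngrams(tokens, n))
    total = len(tokens)
    scores = {}
    for gram, count in ngram_counts.items():
        # log2(count/total) minus the sum of log2(unigram/total), rearranged
        scores[gram] = (
            math.log2(count)
            + (n - 1) * math.log2(total)
            - sum(math.log2(unigram_counts[w]) for w in gram)
        )
    return scores


def nbest(tokens, n, k):
    """Return the k n-grams with the highest PMI score."""
    scores = pmi_scores(tokens, n)
    return sorted(scores, key=scores.get, reverse=True)[:k]


query = "This document gives a very short introduction to machine learning problems"
tokens = query.lower().split()

# Plain frequency counts for 4-grams, as in the count-based approach
four_gram_counts = Counter(ngrams(tokens, 4))
print(four_gram_counts.most_common(3))

# The same 4-grams ranked by the PMI-style score
print(nbest(tokens, 4, 3))
```

On a sentence this short every 4-gram occurs once, so all scores tie; the ranking only becomes informative on a real corpus, and as with bigram PMI it tends to favour rare n-grams unless a minimum frequency filter is applied first.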
