How to extract word frequency from document-term matrix?

2024/10/11 17:20:07

I am doing LDA analysis with Python, and I used the following code to create a document-term matrix:

corpus = [dictionary.doc2bow(text) for text in texts]

Is there an easy way to count word frequencies over the whole corpus? Since I already have the dictionary, which maps each term to an id, I think I can match the word frequencies to the term ids.

Answer

You can use nltk to count word frequencies in a single string:

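import nltk
from nltk import FreqDist
texts = 'hi there hello there'
# word_tokenize needs NLTK's tokenizer data, installed once via nltk.download
words = nltk.tokenize.word_tokenize(texts)
fdist = FreqDist(words)  # maps each token to its count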

fdist gives you the frequency of each word in the string; here, fdist['there'] is 2.

However, you have a list of texts. One way to count frequencies across a list of strings is to use CountVectorizer from scikit-learn:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts = ['hi there', 'hello there', 'hello here you are']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
freq = np.ravel(X.sum(axis=0)) # sum each column to get the total count for each word

The entries of freq are ordered by the column indices stored as values in the dictionary vectorizer.vocabulary_, so the two can be matched up:

import operator
# get vocabulary keys, sorted by value
vocab = [v[0] for v in sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))]
fdist = dict(zip(vocab, freq)) # return same format as nltk
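
If you want to stay with the bag-of-words corpus from the question, the counts can also be summed directly from it. The following is only a minimal sketch: id2token and the small corpus are hypothetical stand-ins for your gensim dictionary and for the output of doc2bow, which is assumed to be a list of (token_id, count) pairs per document. The helper name corpus_frequencies is made up for this example.

```python
from collections import Counter

# Hypothetical stand-ins for the objects in the question:
# id2token plays the role of the dictionary (id -> token),
# corpus plays the role of [dictionary.doc2bow(text) for text in texts].
id2token = {0: "hi", 1: "there", 2: "hello"}
corpus = [
    [(0, 1), (1, 1)],  # bag of words for "hi there"
    [(1, 2), (2, 1)],  # bag of words for "hello there there"
]

def corpus_frequencies(corpus, id2token):
    """Sum the per-document counts into one total per token."""
    totals = Counter()
    for bow in corpus:
        for token_id, count in bow:
            totals[id2token[token_id]] += count
    return totals

freq = corpus_frequencies(corpus, id2token)
print(freq.most_common())  # tokens sorted by total count, highest first
```

With a real gensim Dictionary, passing the dictionary itself as id2token should work, since indexing it by id returns the token. Recent gensim versions may also expose collection frequencies on the dictionary directly, so it is worth checking the documentation for your installed version.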
