Python NLTK keyword extraction from a sentence

2024/9/27 5:48:10

"First thing we do, let's kill all the lawyers." - William Shakespeare

Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:

[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.

*Note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags).

Answer

I don't think there's any perfect answer to this question, because there is no gold-standard set of input/output mappings that everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you can describe precisely and unambiguously what you want your system to do, your problem will be more than half solved.

Until then, I can suggest some heuristics to help you get what you want.

Build an idf (inverse document frequency) dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
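
A minimal sketch of such a dictionary, assuming each document is already tokenised into a list of words. The function name `build_idf`, the toy corpus, and the smoothed formula log((1 + N) / (1 + df)) + 1 are illustrative choices rather than anything the question prescribes; any score that grows as a word gets rarer would serve.

```python
import math
from collections import Counter


def build_idf(documents):
    """Map each lowercased token to a smoothed inverse document frequency.

    `documents` is a list of token lists. A word that appears in few
    documents gets a high score, a word that appears in most gets a low one.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update({t.lower() for t in tokens})
    return {
        word: math.log((1 + n_docs) / (1 + df)) + 1.0
        for word, df in doc_freq.items()
    }


# Toy corpus, invented purely for illustration; use your own data here.
corpus = [
    ["we", "do", "the", "first", "thing"],
    ["do", "the", "right", "thing"],
    ["lawyers", "do", "the", "paperwork"],
    ["they", "kill", "the", "thing"],
]
idf = build_idf(corpus)
```

On this toy corpus `idf["kill"]` comes out higher than `idf["do"]`, and `idf["lawyers"]` higher than `idf["thing"]`. A real corpus needs to be much larger before the numbers mean anything.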

By combining the idf values of each word in your input sentence with their POS tags, you can answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence?', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so finding the rarest noun and the rarest verb in a sentence and returning just those two may do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see whether that does the job better.
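
As a rough sketch of that heuristic, the snippet below takes the tagged words from the question and picks the highest-idf word among the noun tags and among the verb tags. The helper `rarest_by_pos` is hypothetical, and the idf numbers are made up for illustration; in practice they would come from a dictionary built on your own corpus.

```python
def rarest_by_pos(tagged, idf, tag_prefix, default=0.0):
    """Return the word with the highest idf among tags starting with tag_prefix.

    Words missing from `idf` score `default`. Returns None if no word matches.
    """
    candidates = [word for word, tag in tagged if tag.startswith(tag_prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda w: idf.get(w.lower(), default))


# Tagged output copied from the question.
tagged = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
          ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

# Hypothetical idf scores, invented for this example only.
idf = {"first": 1.2, "thing": 1.5, "do": 1.1,
       "lets": 1.3, "kill": 3.0, "lawyers": 3.4}

verb = rarest_by_pos(tagged, idf, "VB")  # matches VB, VBP, ...
noun = rarest_by_pos(tagged, idf, "NN")  # matches NN, NNS, NNP, ...
print(verb, noun)  # kill lawyers
```

Note that the tagger labelled "lets" as a plural noun, so it competes with "lawyers" here; with different idf values a mis-tagged word like this could win, which is one reason the heuristic may need refining.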

Ways to expand this include trying to identify larger phrases using n-gram idfs, building a full parse tree of the sentence (for example with the Stanford Parser), and identifying patterns within these trees that show which parts of the tree the important words tend to sit in.
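
The n-gram variant could look roughly like the sketch below, which scores bigrams instead of single words. The names `ngrams`, `build_ngram_idf`, and `rarest_phrase` and the toy corpus are all hypothetical, and skipping n-grams never seen in the corpus is a simplifying assumption; you might instead want to treat unseen phrases as maximally rare.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_idf(documents, n=2):
    """Map each lowercased n-gram to a smoothed inverse document frequency."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        lowered = [t.lower() for t in tokens]
        # Count each n-gram at most once per document.
        doc_freq.update(set(ngrams(lowered, n)))
    return {
        gram: math.log((1 + n_docs) / (1 + df)) + 1.0
        for gram, df in doc_freq.items()
    }


def rarest_phrase(tokens, ngram_idf, n=2):
    """Return the highest-idf n-gram of the sentence that the corpus has seen."""
    lowered = [t.lower() for t in tokens]
    known = [g for g in ngrams(lowered, n) if g in ngram_idf]
    if not known:
        return None
    return max(known, key=ngram_idf.get)


# Toy corpus, invented purely for illustration.
corpus = [
    ["do", "the", "right", "thing"],
    ["do", "the", "first", "thing"],
    ["do", "the", "paperwork"],
    ["kill", "the", "lawyers"],
]
bigram_idf = build_ngram_idf(corpus, n=2)
phrase = rarest_phrase(["Do", "the", "lawyers"], bigram_idf)
```

Here `("do", "the")` occurs in three of the four documents while `("the", "lawyers")` occurs in one, so the latter is returned as the rarer phrase.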

https://en.xdnf.cn/q/71486.html
