Python: NLTK and TextBlob in French

2024/10/4 21:21:34

I'm using NLTK and TextBlob to find nouns and noun phrases in a text:

from textblob import TextBlob
import nltk

# Treat any Penn Treebank tag starting with 'NN' (NN, NNS, NNP, NNPS) as a noun
is_noun = lambda pos: pos[:2] == 'NN'

blob = TextBlob(text)
print(blob.noun_phrases)

tokenized = nltk.word_tokenize(text)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print(nouns)

This works fine when my text is in English, but it no longer works when my text is in French.

I was unable to find out how to adapt this code for French. How do I do that?

And is there a list somewhere of all the languages that can be parsed?

Answer

Extract words from a French sentence with NLTK

Under WSL2 Ubuntu with Python 3, I can download Punkt like this:

import nltk
nltk.download('punkt')

The zip archive has been downloaded under:

/home/my_username/nltk_data/tokenizers/punkt.zip

Once it has been unzipped, you've got many languages stored as pickled (serialized) objects.
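To see exactly which languages you have, you can list the pickle files in the unzipped folder. A minimal sketch (the helper name and the example path are mine, not part of NLTK; point it at your own `punkt` folder):

```python
from pathlib import Path

def available_punkt_languages(punkt_dir):
    """Return the language names for which a Punkt pickle exists."""
    return sorted(p.stem for p in Path(punkt_dir).glob("*.pickle"))

# e.g. available_punkt_languages('/home/my_username/nltk_data/tokenizers/punkt')
# would typically include 'english', 'french', 'german', 'spanish', ...
```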

Now with:

tokenizer = nltk.data.load('path/to/punkt_folder/french.pickle')

You can use the tokenizer's private _tokenize_words method:

words_generator = tokenizer._tokenize_words("Depuis huit jours, j'avais déchiré mes bottines Aux cailloux des chemins. J'entrais à Charleroi. - Au Cabaret-Vert : je demandai des tartines De beurre et du jambon qui fût à moitié froid.")
words = [word for word in words_generator]

words is a list of PunktToken objects:

>>> words
[PunktToken('Depuis', type='depuis', linestart=True), PunktToken('huit', ), PunktToken('jours', ),... PunktToken('à', ), PunktToken('moitié', ), PunktToken('froid.', )]
>>> str_words = [str(w) for w in words]
>>> str_words
['Depuis', 'huit', 'jours', ',', 'j', "'avais", 'déchiré', 'mes', 'bottines', 'Aux', 'cailloux', 'des', 'chemins.', 'J', "'entrais", 'à', 'Charleroi.', '-', 'Au', 'Cabaret-Vert', ':', 'je', 'demandai', 'des', 'tartines', 'De', 'beurre', 'et', 'du', 'jambon', 'qui', 'fût', 'à', 'moitié', 'froid.']

Use nltk.pos_tag with French sentences

The OP wants to use nltk.pos_tag, which is not possible with the method described previously.

One way to go seems to be installing the Stanford Tagger, which is written in Java (found in this other SO question).

Download the latest version of the Stanford Tagger (available here):

> wget https://nlp.stanford.edu/software/stanford-tagger-4.2.0.zip

Once unzipped, you've got a folder that looks like this (the OP asked for the list of available languages):

...
├── data
│   ....
├── models...
│   ├── arabic-train.tagger
│   ├── arabic-train.tagger.props
│   ├── arabic.tagger
│   ├── arabic.tagger.props
│   ├── chinese-distsim.tagger
│   ├── chinese-distsim.tagger.props
│   ├── chinese-nodistsim.tagger
│   ├── chinese-nodistsim.tagger.props
│   ├── english-bidirectional-distsim.tagger
│   ├── english-bidirectional-distsim.tagger.props
│   ├── english-caseless-left3words-distsim.tagger
│   ├── english-caseless-left3words-distsim.tagger.props
│   ├── english-left3words-distsim.tagger
│   ├── english-left3words-distsim.tagger.props
│   ├── french-ud.tagger
│   ├── french-ud.tagger.props
│   ├── german-ud.tagger
│   ├── german-ud.tagger.props
│   ├── spanish-ud.tagger
│   └── spanish-ud.tagger.props
├── stanford-postagger-4.2.0.jar
...

Java must be installed, and you must know where it is. Now you can do:

import os
from nltk.tag import StanfordPOSTagger
from textblob import TextBlob

jar = 'path/to/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
model = 'path/to/stanford-postagger-full-2020-11-17/models/french-ud.tagger'
os.environ['JAVAHOME'] = '/path/to/java'

blob = TextBlob("""Depuis huit jours, j'avais déchiré mes bottines Aux cailloux des chemins. J'entrais à Charleroi. - Au Cabaret-Vert : je demandai des tartines De beurre et du jambon qui fût à moitié froid.
""")

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
res = pos_tagger.tag(blob.split())
print(res)

It will display:

[('Depuis', 'ADP'), ('huit', 'NUM'), ('jours,', 'NOUN'), ("j'avais", 'ADJ'), ('déchiré', 'VERB'), ('mes', 'DET'), ('bottines', 'NOUN'), ('Aux', 'PROPN'), ('cailloux', 'VERB'), ('des', 'DET'), ('chemins.', 'NOUN'), ("J'entrais", 'ADJ'), ('à', 'ADP'), ('Charleroi.', 'PROPN'), ('-', 'PUNCT'), ('Au', 'PROPN'), ('Cabaret-Vert', 'PROPN'), (':', 'PUNCT'), ('je', 'PRON'), ('demandai', 'VERB'), ('des', 'DET'), ('tartines', 'NOUN'), ('De', 'ADP'), ('beurre', 'NOUN'), ('et', 'CCONJ'), ('du', 'DET'), ('jambon', 'NOUN'), ('qui', 'PRON'), ('fût', 'AUX'), ('à', 'ADP'), ('moitié', 'NOUN'), ('froid.', 'ADJ')]
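To come back to the OP's original goal of finding nouns: the Universal Dependencies tagset used by `french-ud.tagger` marks common nouns as `NOUN` and proper nouns as `PROPN`, so filtering the tagged pairs is a one-liner. A minimal sketch over a few pairs from the output above (the helper name is mine):

```python
def extract_nouns(tagged, tags=("NOUN", "PROPN")):
    """Keep the words whose UD part-of-speech tag marks a noun."""
    return [word for word, pos in tagged if pos in tags]

res = [('Depuis', 'ADP'), ('huit', 'NUM'), ('jours,', 'NOUN'),
       ('jambon', 'NOUN'), ('qui', 'PRON'), ('Charleroi.', 'PROPN')]
print(extract_nouns(res))  # ['jours,', 'jambon', 'Charleroi.']
```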

Et voilà !

