Python: NLTK and TextBlob in French

2024/10/4 21:21:34

I'm using NLTK and TextBlob to find nouns and noun phrases in a text:

from textblob import TextBlob
import nltk
blob = TextBlob(text)
tokenized = nltk.word_tokenize(text)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

This works fine when the text is in English, but the results are poor when the text is in French.

I could not find how to adapt this code for French. How do I do that?

Also, is there a list somewhere of all the languages that can be parsed?


Extract words from a French sentence with NLTK

Under WSL2 Ubuntu with Python3, I can download Punkt like this:

import nltk
nltk.download('punkt')

The zip archive has been downloaded under:


Once it has been unzipped, you get one tokenizer per language, each stored as a pickled (serialized) object.

Now with:

tokenizer = nltk.data.load('path/to/punkt_folder/french.pickle')

You can use the tokenizer._tokenize_words method:

words_generator = tokenizer._tokenize_words("Depuis huit jours, j'avais déchiré mes bottines Aux cailloux des chemins. J'entrais à Charleroi. - Au Cabaret-Vert : je demandai des tartines De beurre et du jambon qui fût à moitié froid.")
words = list(words_generator)

words is a list of PunktToken objects:

>>> words
[PunktToken('Depuis', type='depuis', linestart=True), PunktToken('huit', ), PunktToken('jours', ),... PunktToken('à', ), PunktToken('moitié', ), PunktToken('froid.', )]
>>> str_words = [str(w) for w in words]
>>> str_words
['Depuis', 'huit', 'jours', ',', 'j', "'avais", 'déchiré', 'mes', 'bottines', 'Aux', 'cailloux', 'des', 'chemins.', 'J', "'entrais", 'à', 'Charleroi.', '-', 'Au', 'Cabaret-Vert', ':', 'je', 'demandai', 'des', 'tartines', 'De', 'beurre', 'et', 'du', 'jambon', 'qui', 'fût', 'à', 'moitié', 'froid.']
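Note that the tokens above still carry sentence-final periods ('chemins.', 'Charleroi.', 'froid.'). If bare words are needed, something like the following sketch can detach them. `split_trailing_punct` is a hypothetical helper written for illustration, not part of NLTK, and it only knows ASCII punctuation, so treat it as a starting point rather than a proper tokenizer.

```python
import string

def split_trailing_punct(tokens):
    """Detach trailing ASCII punctuation from each token.

    Hypothetical helper for illustration; this is not an NLTK API.
    """
    out = []
    for tok in tokens:
        core = tok.rstrip(string.punctuation)  # word without trailing punctuation
        tail = tok[len(core):]                 # the punctuation that was removed
        if core:
            out.append(core)
        if tail:
            out.append(tail)
    return out

print(split_trailing_punct(['des', 'chemins.', '-', 'froid.']))
# ['des', 'chemins', '.', '-', 'froid', '.']
```

Leading apostrophes (as in "'avais") are left untouched, since only the end of each token is inspected.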

Use nltk.pos_tag with French sentences

The OP wants to use nltk.pos_tag, which is not possible with the method described above.

One way forward seems to be installing the Stanford Tagger, which is written in Java (found in this other SO question).

Download the latest version of the Stanford Tagger (available here):

> wget

Once unzipped, you get a folder that looks like this (the OP asked for the list of available languages):

├── data
│   ....
├── models...
│   ├── arabic-train.tagger
│   ├── arabic-train.tagger.props
│   ├── arabic.tagger
│   ├── arabic.tagger.props
│   ├── chinese-distsim.tagger
│   ├── chinese-distsim.tagger.props
│   ├── chinese-nodistsim.tagger
│   ├── chinese-nodistsim.tagger.props
│   ├── english-bidirectional-distsim.tagger
│   ├── english-bidirectional-distsim.tagger.props
│   ├── english-caseless-left3words-distsim.tagger
│   ├── english-caseless-left3words-distsim.tagger.props
│   ├── english-left3words-distsim.tagger
│   ├── english-left3words-distsim.tagger.props
│   ├── french-ud.tagger
│   ├── french-ud.tagger.props
│   ├── german-ud.tagger
│   ├── german-ud.tagger.props
│   ├── spanish-ud.tagger
│   └── spanish-ud.tagger.props
├── ...
├── stanford-postagger-4.2.0.jar
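To answer the "which languages" question programmatically, the listing above can be scanned for `.tagger` files. This is a minimal sketch: `available_languages` is a hypothetical helper, and it assumes that the language is the part of the file name before the first hyphen, which matches the names shown above but is not a documented convention.

```python
from pathlib import Path

def available_languages(models_dir):
    """Return the sorted language prefixes found among the *.tagger files.

    Hypothetical helper; assumes names like 'french-ud.tagger' or 'arabic.tagger'.
    """
    languages = set()
    for path in Path(models_dir).glob("*.tagger"):
        # 'french-ud.tagger' -> stem 'french-ud' -> 'french'
        languages.add(path.stem.split("-")[0])
    return sorted(languages)
```

Calling `available_languages('path/to/stanford-postagger-full-2020-11-17/models')` on the folder shown above should give arabic, chinese, english, french, german and spanish; the `.tagger.props` files are ignored because they do not end in `.tagger`.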

Java must be installed, and you need to know its install path. Now you can do:

import os
from nltk.tag import StanfordPOSTagger
from textblob import TextBlob
# Paths to the tagger jar and the French model inside the unzipped folder
jar = 'path/to/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
model = 'path/to/stanford-postagger-full-2020-11-17/models/french-ud.tagger'
# Tell NLTK where to find the Java binary
os.environ['JAVAHOME'] = '/path/to/java'
blob = TextBlob("""Depuis huit jours, j'avais déchiré mes bottines Aux cailloux des chemins. J'entrais à Charleroi. - Au Cabaret-Vert : je demandai des tartines De beurre et du jambon qui fût à moitié froid.
""")
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
# split() only splits on whitespace, so punctuation stays attached to words
res = pos_tagger.tag(blob.split())
print(res)

It will display:

[('Depuis', 'ADP'), ('huit', 'NUM'), ('jours,', 'NOUN'), ("j'avais", 'ADJ'), ('déchiré', 'VERB'), ('mes', 'DET'), ('bottines', 'NOUN'), ('Aux', 'PROPN'), ('cailloux', 'VERB'), ('des', 'DET'), ('chemins.', 'NOUN'), ("J'entrais", 'ADJ'), ('à', 'ADP'), ('Charleroi.', 'PROPN'), ('-', 'PUNCT'), ('Au', 'PROPN'), ('Cabaret-Vert', 'PROPN'), (':', 'PUNCT'), ('je', 'PRON'), ('demandai', 'VERB'), ('des', 'DET'), ('tartines', 'NOUN'), ('De', 'ADP'), ('beurre', 'NOUN'), ('et', 'CCONJ'), ('du', 'DET'), ('jambon', 'NOUN'), ('qui', 'PRON'), ('fût', 'AUX'), ('à', 'ADP'), ('moitié', 'NOUN'), ('froid.', 'ADJ')]
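Since the original goal was to find nouns, the tagged pairs can then be filtered on their tag. The sketch below assumes the Universal Dependencies tags shown in the output, and it counts both NOUN and PROPN as nouns, which is a choice rather than a rule. `extract_nouns` is a hypothetical helper; it also strips the punctuation that the whitespace split left attached to words such as 'jours,'.

```python
import string

# Universal Dependencies tags treated as nouns here (an assumption of this sketch)
NOUN_TAGS = {"NOUN", "PROPN"}

def extract_nouns(tagged):
    """Keep the words tagged as nouns, without surrounding ASCII punctuation."""
    nouns = []
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            cleaned = word.strip(string.punctuation)
            if cleaned:
                nouns.append(cleaned)
    return nouns

sample = [('Depuis', 'ADP'), ('huit', 'NUM'), ('jours,', 'NOUN'),
          ('bottines', 'NOUN'), ('Charleroi.', 'PROPN')]
print(extract_nouns(sample))
# ['jours', 'bottines', 'Charleroi']
```

The filter is only as good as the tags it receives: in the output above, 'Aux' is tagged PROPN and would be kept, so tokenizing the text properly before tagging is likely to help.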

Et voilà !

