Setting NLTK with Stanford NLP (both StanfordNERTagger and StanfordPOSTagger) for Spanish

2024/10/7 0:25:11

The NLTK documentation is rather poor in this integration. The steps I followed were:

  • Download http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip to /home/me/stanford

  • Download http://nlp.stanford.edu/software/stanford-spanish-corenlp-2015-01-08-models.jar to /home/me/stanford

Then in a ipython console:

In [11]: import nltk

In [12]: nltk.__version__
Out[12]: '3.1'In [13]: from nltk.tag import StanfordNERTagger

Then

st = StanfordNERTagger('/home/me/stanford/stanford-postagger-full-2015-04-20.zip', '/home/me/stanford/stanford-spanish-corenlp-2015-01-08-models.jar')

But when I tried to run it:

st.tag('Adolfo se la pasa corriendo'.split())
Error: no se ha encontrado o cargado la clase principal edu.stanford.nlp.ie.crf.CRFClassifier---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-14-0c1a96b480a6> in <module>()
----> 1 st.tag('Adolfo se la pasa corriendo'.split())/home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/tag/stanford.py in tag(self, tokens)64     def tag(self, tokens):65         # This function should return list of tuple rather than list of list
---> 66         return sum(self.tag_sents([tokens]), [])67 68     def tag_sents(self, sentences):/home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/tag/stanford.py in tag_sents(self, sentences)87         # Run the tagger and get the output88         stanpos_output, _stderr = java(cmd, classpath=self._stanford_jar,
---> 89                                                        stdout=PIPE, stderr=PIPE)90         stanpos_output = stanpos_output.decode(encoding)91 /home/nanounanue/.pyenv/versions/3.4.3/lib/python3.4/site-packages/nltk/__init__.py in java(cmd, classpath, stdin, stdout, stderr, blocking)132     if p.returncode != 0:133         print(_decode_stdoutdata(stderr))
--> 134         raise OSError('Java command failed : ' + str(cmd))135 136     return (stdout, stderr)OSError: Java command failed : ['/usr/bin/java', '-mx1000m', '-cp', '/home/nanounanue/Descargas/stanford-spanish-corenlp-2015-01-08-models.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', '/home/nanounanue/Descargas/stanford-postagger-full-2015-04-20.zip', '-textFile', '/tmp/tmp6y169div', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']

The same occur with the StandfordPOSTagger

NOTE: I need that this will be the spanish version. NOTE: I am running this in python 3.4.3

Answer

Try:

# StanfordPOSTagger
from nltk.tag.stanford import StanfordPOSTagger
stanford_dir = '/home/me/stanford/stanford-postagger-full-2015-04-20/'
modelfile = stanford_dir + 'models/english-bidirectional-distsim.tagger'
jarfile = stanford_dir + 'stanford-postagger.jar'st = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)# NERTagger
stanford_dir = '/home/me/stanford/stanford-ner-2015-04-20/'
jarfile = stanford_dir + 'stanford-ner.jar'
modelfile = stanford_dir + 'classifiers/english.all.3class.distsim.crf.ser.gz'st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)

For detailed information on NLTK API with Stanford tools, take a look at: https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software#stanford-tagger-ner-tokenizer-and-parser

Note: The NLTK APIs are for the individual Stanford tools, if you're using Stanford Core NLP, it's best to follow @dimazest instructions on http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html


EDITED

As for Spanish NER Tagging, I strongly suggest that you us Stanford Core NLP (http://nlp.stanford.edu/software/corenlp.shtml) instead of using the Stanford NER package (http://nlp.stanford.edu/software/CRF-NER.shtml). And follow @dimazest solution for JSON file reading.

Alternatively, if you must use the NER packge, you can try following the instructions from https://github.com/alvations/nltk_cli (Disclaimer: This repo is not affiliated with NLTK officially). Do the following on the unix command line:

cd $HOME
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2015-01-08-models.jar
unzip stanford-spanish-corenlp-2015-01-08-models.jar -d stanford-spanish
cp stanford-spanish/edu/stanford/nlp/models/ner/* /home/me/stanford/stanford-ner-2015-04-20/ner/classifiers/

Then in python:

# NERTagger
stanford_dir = '/home/me/stanford/stanford-ner-2015-04-20/'
jarfile = stanford_dir + 'stanford-ner.jar'
modelfile = stanford_dir + 'classifiers/spanish.ancora.distsim.s512.crf.ser.gz'st = StanfordNERTagger(model_filename=modelfile, path_to_jar=jarfile)
https://en.xdnf.cn/q/70307.html

Related Q&A

python variable scope in nested functions

I am reading this article about decorator.At Step 8 , there is a function defined as:def outer():x = 1def inner():print x # 1return innerand if we run it by:>>> foo = outer() >>> foo.…

How can I throttle Python threads?

I have a thread doing a lot of CPU-intensive processing, which seems to be blocking out other threads. How do I limit it?This is for web2py specifically, but a general solution would be fine.

get lastweek dates using python?

I am trying to get the date of the last week with python. if date is : 10 OCT 2014 meansIt should be print10 OCT 2014, 09 OCT 2014, 08 OCT 2014, 07 OCT 2014, 06 OCT 2014, 05 OCT 2014, 04 OCT 2014I trie…

Why is vectorized numpy code slower than for loops?

I have two numpy arrays, X and Y, with shapes (n,d) and (m,d), respectively. Assume that we want to compute the Euclidean distances between each row of X and each row of Y and store the result in array…

Handle TCP Provider: Error code 0x68 (104)

Im using this code to sync my db with the clients:import pyodbcSYNC_FETCH_ARRAY_SIZE=25000# define connection + cursorconnection = pyodbc.connect()cursor = connection.cursor()query = select some_column…

vectorized radix sort with numpy - can it beat np.sort?

Numpy doesnt yet have a radix sort, so I wondered whether it was possible to write one using pre-existing numpy functions. So far I have the following, which does work, but is about 10 times slower tha…

Which library should I use to write an XLS from Linux / Python?

Id love a good native Python library to write XLS, but it doesnt seem to exist. Happily, Jython does.So Im trying to decide between jexcelapi and Apache HSSF: http://www.andykhan.com/jexcelapi/tutoria…

put_records() only accepts keyword arguments in Kinesis boto3 Python API

from __future__ import print_function # Python 2/3 compatibility import boto3 import json import decimal#kinesis = boto3.resource(kinesis, region_name=eu-west-1) client = boto3.client(kinesis) with ope…

Setting a transparent main window

How to set main window background transparent on QT? Do I need an attribute or a style? Ive tried setting the opacity, but it didnt work for me. app.setStyleSheet("QMainWindow {opacity:0}"

Elementwise division of sparse matrices, ignoring 0/0

I have two sparse matrices E and D, which have non-zero entries at the same places. Now I want to have E/D as a sparse matrix, defined only where D is non-zero.For example take the following code:impor…