Question 1

I'm trying to apply the spaCy NLP (Natural Language Processing) pipeline to a big text file such as a Wikipedia dump. Here is my code, based on the example in spaCy's documentation:

from spacy.en import English

input = open("big_file.txt")
big_text = input.read()
input.close()

nlp = English()
out = nlp.pipe([unicode(big_text, errors='ignore')], n_threads=-1)
doc = out.next()

spaCy applies all NLP operations, such as POS tagging and lemmatizing, at once. It is a pipeline that takes care of everything you need in one step. The pipe method, though, is supposed to make the process a lot faster by multithreading the expensive parts of the pipeline. But I don't see a big improvement in speed, and my CPU usage is around 25% (only one of 4 cores working). I also tried reading the file in multiple chunks and increasing the batch of input texts:

out = nlp.pipe([part1, part2, ..., part4], n_threads=-1)

but the performance is still the same. Is there any way to speed up the process? I suspect that OpenMP has to be enabled when compiling spaCy in order to use the multithreading feature, but there are no instructions on how to do that on Windows.

Answer

I figured out what the problem was. OpenMP is what spaCy uses to implement multithreading in the pipe() method, and this option is disabled by default for the MSVC compiler. After I compiled the source code with OpenMP support, it works great. I also made a pull request to enable this in upcoming releases, so for releases after 0.100.7 (the latest version at the time of writing) multithreading with pipe() should work on Windows with no issues.
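One related point, separate from the OpenMP fix: as far as I understand it, pipe() spreads work across the texts in a batch, so a list holding one huge string (or only four chunks) gives it very little to parallelize. Below is a minimal sketch of feeding it many small texts instead. The helper name iter_texts and the assumption of one document per line are mine, not something from spaCy or from the question.

```python
import io


def iter_texts(path, encoding="utf-8"):
    """Yield one non-empty unicode text per line of the file.

    Hypothetical helper: it assumes the dump stores one document per line.
    """
    # io.open decodes while reading on both Python 2 and 3, and iterating
    # over the handle means the whole file is never held in memory at once.
    with io.open(path, encoding=encoding, errors="ignore") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield line
```

With a helper like this, the call would look something like nlp.pipe(iter_texts("big_file.txt"), batch_size=1000, n_threads=-1), and you would loop over the documents it returns instead of calling next() once. The batch_size value here is only an illustration; check the pipe() signature of the spaCy version you have installed.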

Multi-Threaded NLP with Spacy pipe
