Multi-Threaded NLP with Spacy pipe

2024/10/13 23:16:36

I'm trying to apply Spacy NLP (Natural Language Processing) pipline to a big text file like Wikipedia Dump. Here is my code based on Spacy's documentation example:

from spacy.en import Englishinput = open("big_file.txt")
big_text= input.read()
input.close()nlp= English()    out = nlp.pipe([unicode(big_text, errors='ignore')], n_threads=-1)
doc = out.next() 

Spacy applies all nlp operations like POS tagging, Lemmatizing and etc all at once. It is like a pipeline for NLP that takes care of everything you need in one step. Applying pipe method tho is supposed to make the process a lot faster by multithreading the expensive parts of the pipeline. But I don't see big improvement in speed and my CPU usage is around 25% (only one of 4 cores working). I also tried to read the file in multiple chuncks and increase the batch of input texts:

out = nlp.pipe([part1, part2, ..., part4], n_threads=-1)

but still the same performance. Is there anyway to speed up the process? I suspect that OpenMP feature should be enabled compiling Spacy to utilize multi-threading feature. But there is no instructions on how to do it on Windows.

Answer

I figured what the problem was. OpenMP is the package used in implementing multithreading for spacy pipe() method. This option is disabled for MSVC compiler by default. After I compiled the source code with openmp support it works great. I also made a pull request to enable this on the next releases. So for releases after 0.100.7 (which is the latest version) multithreading with pipe() should work on Windows with no issue.

https://en.xdnf.cn/q/69476.html

Related Q&A

Django Tastypie throws a maximum recursion depth exceeded when full=True on reverse relation.

I get a maximum recursion depth exceeded if a run the code below: from tastypie import fields, utils from tastypie.resources import ModelResource from core.models import Project, Clientclass ClientReso…

Adding a colorbar to two subplots with equal aspect ratios

Im trying to add a colorbar to a plot consisting of two subplots with equal aspect ratios, i.e. with set_aspect(equal):The code used to create this plot can be found in this IPython notebook.The image …

Why is C++ much faster than python with boost?

My goal is to write a small library for spectral finite elements in Python and to that purpose I tried extending python with a C++ library using Boost, with the hope that it would make my code faster. …

pandas: How to get .to_string() method to align column headers with column values?

This has been stumping me for a while and I feel like there has to be a solution since printing a dataframe always aligns the columns headers with their respective values.example:df = pd.DataFrame({Fir…

Do I need to use `nogil` in Cython

I have some Cython code that Id like to run as quickly as possible. Do I need to release the GIL in order to do this? Lets suppose my code is similar to this: import numpy as np# trivial definition ju…

supervisord environment variables setting up application

Im running an application from supervisord and I have to set up an environment for it. There are about 30 environment variables that need to be set. Ive tried putting all on one bigenvironment=line and…

Updating gui items withing the process

I am trying to make a GUI for my app and ran into a problem: using PySimpleGUI I have to define layout at first and only then display the whole window. Right now the code is like this:import PySimpleGU…

UnicodeDecodeError with Djangos request.FILES

I have the following code in the view call..def view(request):body = u"" for filename, f in request.FILES.items():body = body + Filename: + filename + \n + f.read() + \nOn some cases I getU…

bifurcation diagram with python

Im a beginner and I dont speak english very well so sorry about that. Id like to draw the bifurcation diagram of the sequence : x(n+1)=ux(n)(1-x(n)) with x(0)=0.7 and u between 0.7 and 4.I am supposed …

How do I use nordvpn servers as python requests proxies

Dont ask how, but I parsed the server endpoints of over 5000 nordvpn servers. They usually are something like ar15.nordvpn.com for example. Im trying to use nordvpn servers as request proxies. I know i…