Train spaCy for text classification

2024/10/15 7:29:48

After reading the docs and doing the tutorial, I figured I'd make a small demo. It turns out my model does not want to train. Here's the code:

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}],
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
nlp.add_pipe(category)
category.add_label("KAT")

# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [{"textcat": [entities]} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)

When I run this, the output suggests that nothing is learned: the loss is 0.0 at every checkpoint.

{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}

This feels wrong. There should be either an error or a meaningful tag. The predictions confirm this: every text gets the same score.

for text, d in TRAINING_DATA:
    print(text, nlp(text).cats)

# Dude, Totally, Yeah, Video Games {'KAT': 0.45303162932395935}
# The iPhone 8 reviews are here {'KAT': 0.45303162932395935}
# Noa is a great cat name. {'KAT': 0.45303162932395935}
# Should I pay $1,000 for the iPhone X? {'KAT': 0.45303162932395935}
# We got a new kitten! {'KAT': 0.45303162932395935}
# My little kitty is so special {'KAT': 0.45303162932395935}

It feels like my code is missing something but I can't figure out what.

Answer

If you update to spaCy 3, the code above will no longer work. The solution is to migrate it with a few changes. I've modified the example from cantdutchthis accordingly.

Summary of changes:

  • Use the config to change the architecture. The old default was "bag of words"; the new default is "text ensemble", which uses attention. Keep this in mind when tuning the models.
  • Labels now need to be one-hot encoded.
  • The add_pipe interface has changed slightly: it takes the component name and returns the component.
  • nlp.update now requires Example objects rather than tuples of text and annotation.
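
The one-hot point in the list above can be sketched in plain Python. This is only an illustration, not part of spaCy: the helper name `one_hot_cats` is hypothetical, and the label names `KAT0` and `KAT1` are assumed to match the ones used in this answer's code.

```python
from typing import Dict, Sequence


def one_hot_cats(label: str, labels: Sequence[str]) -> Dict[str, float]:
    """Return a cats dict with 1.0 for `label` and 0.0 for every other label."""
    if label not in labels:
        raise ValueError(f"unknown label {label!r}; expected one of {list(labels)}")
    return {name: 1.0 if name == label else 0.0 for name in labels}


# Assumed label names: KAT0 = about cats, KAT1 = not about cats
LABELS = ["KAT0", "KAT1"]

# Old-style boolean annotations, as in the question
OLD_DATA = [
    ("My little kitty is so special", {"KAT": True}),
    ("The iPhone 8 reviews are here", {"KAT": False}),
]

# Convert each boolean flag into a full one-hot cats dict
NEW_DATA = [
    (text, {"cats": one_hot_cats("KAT0" if ann["KAT"] else "KAT1", LABELS)})
    for text, ann in OLD_DATA
]

for text, annotation in NEW_DATA:
    print(text, annotation)
```

Each `annotation` dict has the `{"cats": {...}}` shape that `Example.from_dict(doc, annotation)` accepts for text classification.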

import random

import spacy
# Example wraps a Doc with its annotations; the config strings select the textcat architecture
from spacy.training import Example
from spacy.pipeline.textcat import single_label_bow_config, single_label_default_config
from thinc.api import Config

# Labels should be one-hot encoded
TRAINING_DATA = [
    ["My little kitty is so special", {"KAT0": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT1": True}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT1": True}],
    ["The iPhone 8 reviews are here", {"KAT1": True}],
    ["Noa is a great cat name.", {"KAT0": True}],
    ["We got a new kitten!", {"KAT0": True}],
]

# Bag of words:
# config = Config().from_str(single_label_bow_config)
# Text ensemble with attention:
config = Config().from_str(single_label_default_config)

nlp = spacy.blank("en")
# add_pipe now takes the component name and returns the component
category = nlp.add_pipe("textcat", last=True, config=config)
category.add_label("KAT0")
category.add_label("KAT1")

# Start the training (nlp.initialize is the spaCy 3 name for the deprecated nlp.begin_training)
nlp.initialize()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]
        # nlp.update takes Example objects rather than text/annotation tuples
        examples = [
            Example.from_dict(doc, annotation)
            for doc, annotation in zip(texts, annotations)
        ]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)
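
To check whether a migrated model has learned anything, one option is to look for the symptom from the question, where every text received exactly the same score. The sketch below works on plain dictionaries of the kind `doc.cats` returns. The helper name `scores_collapsed` and its tolerance are hypothetical, and the second set of scores is made up for illustration rather than taken from a real model.

```python
import math
from typing import Dict, List


def scores_collapsed(all_cats: List[Dict[str, float]], tol: float = 1e-6) -> bool:
    """Return True if every dict holds (nearly) the same score for each label."""
    if len(all_cats) < 2:
        return False  # nothing to compare against
    first = all_cats[0]
    return all(
        cats.keys() == first.keys()
        and all(math.isclose(cats[k], first[k], abs_tol=tol) for k in first)
        for cats in all_cats[1:]
    )


# Scores copied from the question's output: every text gets the same value
untrained = [{"KAT": 0.45303162932395935} for _ in range(6)]
print(scores_collapsed(untrained))  # True

# Made-up scores for illustration only, not real model output
varied = [{"KAT0": 0.9, "KAT1": 0.1}, {"KAT0": 0.2, "KAT1": 0.8}]
print(scores_collapsed(varied))  # False
```

In practice `all_cats` would be built from the trained pipeline, for example `[nlp(text).cats for text, _ in TRAINING_DATA]`.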