How do I use non-integer string labels with SVM from scikit-learn? Python

2024/10/1 3:22:48

Scikit-learn has fairly user-friendly Python modules for machine learning.

I am trying to train an SVM tagger for Natural Language Processing (NLP) where my labels and input data are words and annotations. For example, in part-of-speech tagging, rather than using double/integer data as input tuples such as [[1,2], [2,0]], my tuples will look like [['word','NOUN'], ['young', 'adjective']].

Can anyone give an example of how I can use the SVM with string tuples? The tutorial/documentation given here is for integer/double inputs: http://scikit-learn.org/stable/modules/svm.html

Answer

Most machine learning algorithms process input samples that are vectors of floats, such that a small (often Euclidean) distance between a pair of samples means that the two samples are similar in a way that is relevant for the problem at hand.

It is the responsibility of the machine learning practitioner to find a good set of float features to encode the data. This encoding is domain-specific, hence there is no general way to build that representation out of the raw data that would work across all application domains (various NLP tasks, computer vision, transaction log analysis...). This part of the machine learning modeling work is called feature extraction. When it involves a lot of manual work, it is often referred to as feature engineering.
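To make the idea of encoding concrete, here is a minimal sketch of one possible hand-written encoding (one-hot), using only the standard library. The tag list and the `one_hot` helper are made up for illustration; this is not the only or necessarily the best encoding for a given task.

```python
import math

# Hypothetical tag vocabulary; a real tagset would be larger.
TAGS = ["NOUN", "VERB", "ADJ"]

def one_hot(tag):
    """Return a float vector with 1.0 at the tag's position and 0.0 elsewhere."""
    return [1.0 if tag == t else 0.0 for t in TAGS]

noun = one_hot("NOUN")
verb = one_hot("VERB")

# Euclidean distance is now defined on the encoded tags.
same = math.dist(noun, one_hot("NOUN"))  # identical tags
diff = math.dist(noun, verb)             # different tags

print(noun, same, diff)
```

After encoding, identical tags sit at distance zero and different tags sit at a fixed non-zero distance, which is the kind of float representation a distance-based learner can work with.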

Now for your specific problem, POS tags of a window of words around a word of interest in a sentence (e.g. for sequence tagging such as named entity detection) can be encoded appropriately by using the DictVectorizer feature extraction helper class of scikit-learn.
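As a rough sketch of how that could look, the example below feeds dicts of string features through DictVectorizer into a linear-kernel SVC. The toy sentences, the feature names `word` and `prev_word`, and the tag labels are assumptions invented for illustration; a real tagger would need far more data and a richer feature window.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data (made up): one dict of string features per token,
# and one string label per token.
X_train = [
    {"word": "the", "prev_word": "<s>"},
    {"word": "young", "prev_word": "the"},
    {"word": "dog", "prev_word": "young"},
    {"word": "barks", "prev_word": "dog"},
    {"word": "a", "prev_word": "<s>"},
    {"word": "cat", "prev_word": "a"},
    {"word": "sleeps", "prev_word": "cat"},
]
y_train = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "VERB"]

# DictVectorizer one-hot encodes string-valued features into a sparse
# float matrix; the SVM then trains on that matrix. String labels can
# be passed to the classifier as they are.
model = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
model.fit(X_train, y_train)

predicted = model.predict([{"word": "dog", "prev_word": "the"}])
print(predicted[0])
```

Each string-valued entry such as `{"word": "dog"}` becomes a binary column named `word=dog`, so the SVM only ever sees floats, while predictions come back as the original string labels.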

