Normalize/Standardize a numpy recarray

2024/10/6 14:24:46

I wonder what the best way of normalizing/standardizing a numpy recarray is. To make it clear, I'm not talking about a mathematical matrix, but a record array that also has e.g. textual columns (such as labels).

a = np.genfromtxt("iris.csv", delimiter=",", dtype=None)
print a.shape
> (150,)

As you can see, I cannot e.g. process a[:,:-1] as the shape is one-dimensional.

The best I found is to iterate over all columns:

for nam in a.dtype.names[:-1]:col = a[nam]a[nam] = (col - col.min()) / (col.max() - col.min())

Any more elegant way of doing this? Is there some method such as "normalize" or "standardize" somewhere?

Answer

There are a number of ways to do it, but some are cleaner than others.

Usually, in numpy, you keep the string data in a separate array.

(Things are a bit more low-level than, say, R's data frame. You typically just wrap things up in a class for the association, but keep different data types separate.)

Honestly, numpy isn't optimized for handling "flexible" datatypes such as this (though it can certainly do it). Things like pandas provide a better interface for "spreadsheet-like" data (and pandas is just a layer on top of numpy).

However, structured arrays (which is what you have here) will allow you to slice them column-wise when you pass in a list of field names. (e.g. data[['col1', 'col2', 'col3']])

At any rate, one way is to do something like this:

import numpy as npdata = np.recfromcsv('iris.csv')# In this case, it's just all but the last, but we could be more general
# This must be a list and not a tuple, though.
float_fields = list(data.dtype.names[:-1])float_dat = data[float_fields]# Now we just need to view it as a "regular" 2D array...
float_dat = float_dat.view(np.float).reshape((data.size, -1))# And we can normalize columns as usual.
normalized = (float_dat - float_dat.min(axis=0)) / float_dat.ptp(axis=0)

However, this is far from ideal. If you want to do the operation in-place (as you currently are) the easiest solution is what you already have: Just iterate over the field names.

Incidentally, using pandas, you'd do something like this:

import pandas
data = pandas.read_csv('iris.csv', header=None)float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)
https://en.xdnf.cn/q/70354.html

Related Q&A

How to read /dev/log?

I would like to directly access to syslog messages from Python by reading /dev/log.My (very limited) understanding is that the correct way is to read from there is to bind a datagram socket. import soc…

Determining if a number evenly divides by 25, Python

Im trying to check if each number in a list is evenly divisible by 25 using Python. Im not sure what is the right process. I want to do something like this:n = [100, 101, 102, 125, 355, 275, 435, 134, …

OpenCV SimpleBlobDetector detect() call throws cv2.error: Unknown C++ exception from OpenCV code?

I need to detect semicircles on image and I find follow stuff for this: import cv2 import numpy as npdef get_circle(img_path):im = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)detector = cv2.SimpleBlobDet…

Is it possible to add a global argument for all subcommands in Click based interfaces?

I am using Click under a virtualenv and use the entry_point directive in setuptools to map the root to a function called dispatch.My tool exposes two subcommands serve and config, I am using an option …

Error importing h5py

Ive been trying to import h5py to read this type of file.Here is my code:import h5pyfile_1 = h5py.File("Out_fragment.h5py")print file_1The output is:Traceback (most recent call last):File &qu…

Tensorflow leaks 1280 bytes with each session opened and closed?

It seems that each Tensorflow session I open and close consumes 1280 bytes from the GPU memory, which are not released until the python kernel is terminated. To reproduce, save the following python scr…

Python Pandas -- Forward filling entire rows with value of one previous column

New to pandas development. How do I forward fill a DataFrame with the value contained in one previously seen column?Self-contained example:import pandas as pd import numpy as np O = [1, np.nan, 5, np.…

Microsoft Visual C++ 14.0 is required - error - pip install fbprophet

I am trying pip install fbprophet. I am getting that error: "Microsoft Visual C++ 14.0 is required" It has been discussed many times (e.g. Microsoft Visual C++ 14.0 is required (Unable to fin…

find position of item in for loop over a sequence [duplicate]

This question already has answers here:Closed 12 years ago.Possible Duplicate:Accessing the index in Python for loops list = [1,2,2,3,5,5,6,7]for item in mylist:...How can I find the index of the item…

How to convert BeautifulSoup.ResultSet to string

So I parsed a html page with .findAll (BeautifulSoup) to variable named result. If I type result in Python shell then press Enter, I see normal text as expected, but as I wanted to postprocess this res…