Data Normalization with TensorFlow tf.Transform


I'm doing neural network prediction with my own datasets using TensorFlow. First I built a model that works with a small dataset on my computer. After that, I changed the code a little so I could train and predict on Google Cloud ML Engine with bigger datasets.

I am normalizing the features in the pandas DataFrame, but this introduces training/serving skew and I get poor prediction results.

What I would really like is to use the tf.Transform library to normalize the data inside the graph. To do this, I would create a preprocessing_fn and use tft.scale_to_0_1: https://github.com/tensorflow/transform/blob/master/getting_started.md

The main problem I found is when trying to predict. I've searched the internet but I can't find any example of an exported model where the data is normalized during training. In all the examples I found, the data is NOT normalized anywhere.

What I would like to know is: if I normalize the data during training and then send a new instance to get a prediction, how is that new data normalized?

Maybe in the TensorFlow data pipeline? Are the variables used for the normalization saved somewhere?

In summary: I'm looking for a way to normalize the inputs to my model so that new instances are also standardized.

Answer

First of all, you don't really need tf.transform for this. All you need to do is write a function that you call from both the training/eval input_fn and from your serving input_fn.

For example, assuming that you've used Pandas on your whole dataset to figure out the min and max:

```python
def add_engineered(features):
    min_x = 22
    max_x = 43
    features['x'] = (features['x'] - min_x) / (max_x - min_x)
    return features
```

Then, in your input_fn, wrap the features you return with a call to add_engineered:

```python
def input_fn():
    features = ...
    label = ...
    return add_engineered(features), label
```

and in your serving_input_fn, make sure to similarly wrap the returned features (NOT the feature_placeholders) with a call to add_engineered:

```python
def serving_input_fn():
    feature_placeholders = ...
    features = feature_placeholders.copy()
    return tf.estimator.export.ServingInputReceiver(
        add_engineered(features), feature_placeholders)
```

Now, your JSON input at prediction time would only need to contain the original, unscaled values.
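For instance, with the hypothetical constants above (min 22, max 43), a request could send the raw value and the serving graph would apply the same scaling that was used in training. A minimal sketch of the arithmetic involved:

```python
# Raw JSON instance sent to the deployed model -- no scaling client-side:
#   {"x": 30.0}
# Inside the serving graph, add_engineered() applies the same formula
# used at training time (the constants 22 and 43 are baked into the graph):
min_x, max_x = 22.0, 43.0
raw = 30.0
scaled = (raw - min_x) / (max_x - min_x)
print(scaled)  # 8/21 ~= 0.381
```

Because the constants live in the exported graph rather than in client code, training and serving can never disagree about the scaling.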

Here's a complete working example of this approach.

https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/taxifare/trainer/model.py#L130

tf.transform provides a two-phase process: an analysis step to compute the min and max, and a graph-modification step that inserts the scaling into your TensorFlow graph for you. So, to use tf.transform, you first need to write a Dataflow pipeline that does the analysis, and then plug calls to tft.scale_to_0_1 into your TensorFlow code. Here's an example of doing this:

https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
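The two-phase idea can be illustrated without tf.transform itself: an analyze phase scans the full dataset once to compute the statistics, and a transform phase applies them to each row. This is only a plain-Python sketch of the concept, not the tf.transform API (which builds the transform into the TF graph):

```python
def analyze(dataset):
    """Phase 1: a full pass over the data to compute min and max
    (this is what the Dataflow analysis job does)."""
    xs = [row['x'] for row in dataset]
    return {'min_x': min(xs), 'max_x': max(xs)}

def make_transform_fn(stats):
    """Phase 2: build a function that scales to [0, 1] using the saved
    stats; tf.transform instead inserts this computation into the graph
    so training and serving apply the identical scaling."""
    def transform(row):
        row = dict(row)
        row['x'] = (row['x'] - stats['min_x']) / (stats['max_x'] - stats['min_x'])
        return row
    return transform

train = [{'x': 22.0}, {'x': 30.0}, {'x': 43.0}]
stats = analyze(train)                 # {'min_x': 22.0, 'max_x': 43.0}
transform = make_transform_fn(stats)
print(transform({'x': 43.0}))          # {'x': 1.0}
```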

The add_engineered() approach is simpler and is what I would suggest. The tf.transform approach is needed if your data distributions shift over time and you want to automate the entire pipeline (e.g. for continuous training).

