How to mix unbalanced Datasets to reach a desired distribution per label?


I am running my neural network on Ubuntu 16.04, with 1 GPU (GTX 1070) and 4 CPUs.

My dataset contains around 35,000 images, but it is not balanced: class 0 has 90% of the images, and classes 1-4 share the other 10%. Therefore I over-sample classes 1-4 using dataset.repeat(class_weight) (together with a function that applies random augmentation) and then concatenate them with class 0.

The re-sampling strategy is:

1) At the very beginning, class_weight[n] is set to a large number so that each class has the same number of images as class 0.

2) As training goes on and the number of epochs increases, the weights drop according to the epoch number, so that the sampled distribution becomes closer to the actual distribution (a rough sketch of such a schedule is shown right after this list).
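
For illustration only (this is not my exact code), such a decay schedule could be computed per epoch along these lines; the class counts are rough estimates from the 90% / 10% split, and decay_rate and the floor of 1 are assumptions for this sketch:

def compute_class_weight(epoch, class_counts, decay_rate=0.5):
    """Repeat factor per class: start fully balanced, decay towards 1."""
    majority = max(class_counts)
    weights = []
    for count in class_counts:
        full_balance = majority / count          # factor that matches class 0
        decayed = 1 + (full_balance - 1) * (decay_rate ** epoch)
        weights.append(max(1, int(round(decayed))))
    return weights

# Roughly 31,500 images in class 0 and ~875 in each of classes 1-4:
class_weight = compute_class_weight(epoch=0, class_counts=[31500, 875, 875, 875, 875])
# epoch 0 -> [1, 36, 36, 36, 36]; later epochs -> the factors shrink towards 1.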

Because my class_weight varies epoch by epoch, I can't shuffle the whole dataset once at the very beginning. Instead, I have to read in the data class by class, concatenate the over-sampled data from each class, and then shuffle the whole concatenated dataset element-wise in order to get balanced batches.

The following is part of my code.

def my_estimator_func():
    d0 = tf.data.TextLineDataset(train_csv_0).map(_parse_csv_train)
    d1 = tf.data.TextLineDataset(train_csv_1).map(_parse_csv_train)
    d2 = tf.data.TextLineDataset(train_csv_2).map(_parse_csv_train)
    d3 = tf.data.TextLineDataset(train_csv_3).map(_parse_csv_train)
    d4 = tf.data.TextLineDataset(train_csv_4).map(_parse_csv_train)
    d1 = d1.repeat(class_weight[1])
    d2 = d2.repeat(class_weight[2])
    d3 = d3.repeat(class_weight[3])
    d4 = d4.repeat(class_weight[4])
    dataset = d0.concatenate(d1).concatenate(d2).concatenate(d3).concatenate(d4)
    dataset = dataset.shuffle(180000)  # <- This is where the issue comes from
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label

def _parse_csv_train(line):
    parsed_line = tf.decode_csv(line, [[""], []])
    filename = parsed_line[0]
    label = parsed_line[1]
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    # my_random_augmentation_func will apply random augmentation on the image.
    image_aug = my_random_augmentation_func(image_decoded)
    image_resized = tf.image.resize_images(image_aug, image_resize)
    return image_resized, label

To make it clear, let me describe why I am facing this issue step by step:

  1. Because the classes in my dataset are not balanced, I want to over-sample the minority classes.

  2. Because of 1., I want to apply random augmentation to those classes and then concatenate the majority class (class 0) with them.

  3. After doing some research, I found that repeat() generates different results when the mapped function contains randomness (each repetition re-runs the random ops), so I use repeat() together with my_random_augmentation_func to achieve 2.

  4. Now, having achieved 2., I want to combine all the datasets, so I use concatenate().

  5. After 4., I face an issue: there are around 40,000 - 180,000 images in total (because class_weight changes epoch by epoch, there are about 180,000 images at the beginning and about 40,000 at the end), and they are concatenated class by class, so the dataset looks like [0000-1111-2222-3333-4444]. With batch size 100 and no shuffling, almost every batch contains only one class, which means the class distribution within each batch is imbalanced.

  6. In order to solve the "imbalanced batch" issue in 5., I came up with the idea that I should shuffle the whole dataset, so I use shuffle(180000).

  7. And finally, boom, my computer freezes when it comes to shuffling 180,000 items in the dataset.

So, is there a better way to get balanced batches while still keeping the characteristics I want (e.g. a distribution that changes epoch by epoch)?

--- Edit: Issue solved ---

It turned out that I should not apply the map function at the very beginning. I should read in only the filenames instead of the actual images, shuffle the filenames, and only then map them to the actual images.

More specifically: delete the .map(_parse_csv_train) part after d0 = tf.data.TextLineDataset(train_csv_0) and the other four lines like it, and add a new line dataset = dataset.map(_parse_csv_train) after shuffle(180000). A sketch of the resulting pipeline is shown below.
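
For reference, here is a minimal sketch of the pipeline after that change, using the same train_csv_*, class_weight and _parse_csv_train as above: only the cheap filename/label lines go through shuffle(), and decoding plus augmentation happen afterwards.

def my_estimator_func():
    # Read only the CSV lines ("filename,label" strings); no decoding yet.
    d0 = tf.data.TextLineDataset(train_csv_0)
    d1 = tf.data.TextLineDataset(train_csv_1).repeat(class_weight[1])
    d2 = tf.data.TextLineDataset(train_csv_2).repeat(class_weight[2])
    d3 = tf.data.TextLineDataset(train_csv_3).repeat(class_weight[3])
    d4 = tf.data.TextLineDataset(train_csv_4).repeat(class_weight[4])
    dataset = d0.concatenate(d1).concatenate(d2).concatenate(d3).concatenate(d4)
    # The 180,000-element buffer now holds short strings, not decoded images.
    dataset = dataset.shuffle(180000)
    # Decode, augment and resize only after shuffling.
    dataset = dataset.map(_parse_csv_train)
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label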

I also want to say thank you to @P-Gn: the blog link in his "shuffling" part is really helpful. It answered a question that was on my mind but that I didn't ask: "Can I get similar randomness by using many small shuffles vs. one large shuffle?" (I'm not going to give the answer here, check that blog!) The method in that blog might also be a potential solution to this issue, but I haven't tried it out.

Answer

I would suggest relying on tf.contrib.data.choose_from_datasets, with labels picked by a tf.multinomial distribution. The advantage of this, compared to other functions based on sample rejection, is that you do not lose I/O bandwidth reading unused samples.

Here is a working example for a case similar to yours, with a dummy dataset:

import tensorflow as tf

# create dummy datasets
class_num_samples = [900, 25, 25, 25, 25]
class_start = [0, 1000, 2000, 3000, 4000]
ds = [
    tf.data.Dataset.range(class_start[0], class_start[0] + class_num_samples[0]),
    tf.data.Dataset.range(class_start[1], class_start[1] + class_num_samples[1]),
    tf.data.Dataset.range(class_start[2], class_start[2] + class_num_samples[2]),
    tf.data.Dataset.range(class_start[3], class_start[3] + class_num_samples[3]),
    tf.data.Dataset.range(class_start[4], class_start[4] + class_num_samples[4])
]

# pick from dataset according to a parameterizable distribution
class_relprob_ph = tf.placeholder(tf.float32, shape=len(class_num_samples))
pick = tf.data.Dataset.from_tensor_slices(
    tf.multinomial(tf.log(class_relprob_ph)[None], max(class_num_samples))[0])

ds = tf.contrib.data.choose_from_datasets(ds, pick).repeat().batch(20)

iterator = ds.make_initializable_iterator()
batch = iterator.get_next()

with tf.Session() as sess:
    # choose uniform distribution
    sess.run(iterator.initializer, feed_dict={class_relprob_ph: [1, 1, 1, 1, 1]})
    print(batch.eval())
    # [   0 1000 1001    1 3000 4000 3001 4001    2    3 1002 1003 2000    4    5 2001 3002 1004    6 2002]

    # now follow input distribution
    sess.run(iterator.initializer, feed_dict={class_relprob_ph: class_num_samples})
    print(batch.eval())
    # [   0    1 4000    2    3    4    5 3000    6    7    8    9 2000   10   11   12   13 4001   14   15]
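
Not part of the original answer, but worth noting as an assumption about newer setups: on TF 1.13+ (and TF 2.x) the same idea is available as tf.data.experimental.sample_from_datasets, which takes the per-class mixing weights directly instead of an explicit multinomial pick:

# Assumes TF >= 1.13, where the contrib sampler graduated to tf.data.experimental.
import tensorflow as tf

class_num_samples = [900, 25, 25, 25, 25]
class_start = [0, 1000, 2000, 3000, 4000]
datasets = [tf.data.Dataset.range(s, s + n)
            for s, n in zip(class_start, class_num_samples)]

# Uniform mixing over the 5 classes; pass different weights to change the mix.
mixed = tf.data.experimental.sample_from_datasets(
    datasets, weights=[0.2, 0.2, 0.2, 0.2, 0.2]).repeat().batch(20)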

Note that the length of an "epoch" is now defined by the length of the multinomial sampling. I have set it somewhat arbitrarily to max(class_num_samples) here — there is indeed no good choice for a definition of an epoch when you start mixing datasets of different lengths.

However, there is a concrete reason to make it at least as large as the largest dataset: as you noticed, calling iterator.initializer restarts the Dataset from the beginning. Therefore, now that your shuffling buffer is much smaller than your data (which is usually the case), it is important not to restart too early, to make sure training sees all of the data.
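
To tie this back to the epoch-by-epoch distribution in the question: you can re-run iterator.initializer at each epoch boundary with updated relative probabilities. Here is a minimal sketch on top of the example above; num_epochs, steps_per_epoch and the linear interpolation schedule are assumptions for illustration only:

# Re-initialize the picker each "epoch" with new relative class probabilities,
# moving from uniform mixing towards the true class frequencies.
num_epochs = 10
steps_per_epoch = max(class_num_samples) // 20   # 20 = batch size above
total = float(sum(class_num_samples))

with tf.Session() as sess:
    for epoch in range(num_epochs):
        alpha = epoch / float(num_epochs - 1)    # 0 -> uniform, 1 -> true distribution
        relprob = [(1 - alpha) / len(class_num_samples) + alpha * n / total
                   for n in class_num_samples]
        sess.run(iterator.initializer, feed_dict={class_relprob_ph: relprob})
        for _ in range(steps_per_epoch):
            batch_np = sess.run(batch)           # feed batch_np to the training step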

About shuffling

This answer solves the problem of interleaving datasets with a custom weighting, not the problem of dataset shuffling, which is an unrelated one. Shuffling a large dataset requires making compromises: you cannot have efficient dynamic shuffling without sacrificing memory and performance somehow. There is, for example, an excellent blog post on that topic that illustrates graphically the impact of the buffer size on the quality of the shuffling.
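
As a small illustration of that trade-off (the buffer sizes below are arbitrary): shuffle() only keeps buffer_size elements in memory, so a small buffer mixes elements only locally, while a buffer as large as the dataset gives a uniform shuffle at the cost of buffering everything.

# Illustration of the buffer-size trade-off (sizes are arbitrary).
ds = tf.data.Dataset.range(10000)

# Weak shuffle: each output element is drawn from a 100-element buffer,
# so the ordering is only randomized locally.
local_shuffle = ds.shuffle(buffer_size=100)

# Uniform shuffle, but the whole dataset is held in the shuffle buffer.
full_shuffle = ds.shuffle(buffer_size=10000)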
