How to mix unbalanced Datasets to reach a desired distribution per label?


I am running my neural network on Ubuntu 16.04, with 1 GPU (GTX 1070) and 4 CPUs.

My dataset contains around 35,000 images, but it is not balanced: class 0 has 90% of the images, and classes 1-4 share the other 10%. Therefore I over-sample classes 1-4 using dataset.repeat(class_weight) (together with a function that applies random augmentation) and then concatenate them with class 0.

The re-sampling strategy is:

1) At the very beginning, class_weight[n] is set to a large number so that each class has the same number of images as class 0.

2) As training goes on and the number of epochs increases, the weights drop according to the epoch number, so that the sampled distribution becomes closer to the actual distribution (a rough sketch of such a schedule is shown right after this list).
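
For illustration only (this is not my exact code), such a decay schedule could be computed per epoch along these lines; the class counts are rough estimates from the 90% / 10% split, and decay_rate and the floor of 1 are assumptions for this sketch:

def compute_class_weight(epoch, class_counts, decay_rate=0.5):
    """Repeat factor per class: start fully balanced, decay towards 1."""
    majority = max(class_counts)
    weights = []
    for count in class_counts:
        full_balance = majority / count          # factor that matches class 0
        decayed = 1 + (full_balance - 1) * (decay_rate ** epoch)
        weights.append(max(1, int(round(decayed))))
    return weights

# Roughly 31,500 images in class 0 and ~875 in each of classes 1-4:
class_weight = compute_class_weight(epoch=0, class_counts=[31500, 875, 875, 875, 875])
# epoch 0 -> [1, 36, 36, 36, 36]; later epochs -> the factors shrink towards 1.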

Because my class_weight varies epoch by epoch, I can't shuffle the whole dataset once at the very beginning. Instead, I have to read in the data class by class, concatenate the over-sampled data from each class, and then shuffle the whole concatenated dataset element-wise in order to get balanced batches.

The following is part of my code.

def my_estimator_func():
    d0 = tf.data.TextLineDataset(train_csv_0).map(_parse_csv_train)
    d1 = tf.data.TextLineDataset(train_csv_1).map(_parse_csv_train)
    d2 = tf.data.TextLineDataset(train_csv_2).map(_parse_csv_train)
    d3 = tf.data.TextLineDataset(train_csv_3).map(_parse_csv_train)
    d4 = tf.data.TextLineDataset(train_csv_4).map(_parse_csv_train)
    d1 = d1.repeat(class_weight[1])
    d2 = d2.repeat(class_weight[2])
    d3 = d3.repeat(class_weight[3])
    d4 = d4.repeat(class_weight[4])
    dataset = d0.concatenate(d1).concatenate(d2).concatenate(d3).concatenate(d4)
    dataset = dataset.shuffle(180000)  # <- This is where the issue comes from
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label

def _parse_csv_train(line):
    parsed_line = tf.decode_csv(line, [[""], []])
    filename = parsed_line[0]
    label = parsed_line[1]
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    # my_random_augmentation_func will apply random augmentation on the image.
    image_aug = my_random_augmentation_func(image_decoded)
    image_resized = tf.image.resize_images(image_aug, image_resize)
    return image_resized, label

To make it clear, let me describe why I am facing this issue step by step:

  1. Because the classes in my dataset are not balanced, I want to over-sample the minority classes.

  2. Because of 1., I want to apply random augmentation to those classes and then concatenate the majority class (class 0) with them.

  3. After doing some research, I found that repeat() generates different results when the mapped function contains randomness (each repetition re-runs the random ops), so I use repeat() together with my_random_augmentation_func to achieve 2.

  4. Now, having achieved 2., I want to combine all the datasets, so I use concatenate().

  5. After 4., I face an issue: there are around 40,000 - 180,000 images in total (because class_weight changes epoch by epoch, there are about 180,000 images at the beginning and about 40,000 at the end), and they are concatenated class by class, so the dataset looks like [0000-1111-2222-3333-4444]. With batch size 100 and no shuffling, almost every batch contains only one class, which means the class distribution within each batch is imbalanced.

  6. In order to solve the "imbalanced batch" issue in 5., I came up with the idea that I should shuffle the whole dataset, so I use shuffle(180000).

  7. And finally, boom, my computer freezes when it comes to shuffling 180,000 items in the dataset.

So, is there a better way to get balanced batches while still keeping the characteristics I want (e.g. a distribution that changes epoch by epoch)?

--- Edit: Issue solved ---

It turned out that I should not apply the map function at the very beginning. I should read in only the filenames instead of the actual images, shuffle the filenames, and only then map them to the actual images.

More specifically: delete the .map(_parse_csv_train) part after d0 = tf.data.TextLineDataset(train_csv_0) and the other four lines like it, and add a new line dataset = dataset.map(_parse_csv_train) after shuffle(180000). A sketch of the resulting pipeline is shown below.
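
For reference, here is a minimal sketch of the pipeline after that change, using the same train_csv_*, class_weight and _parse_csv_train as above: only the cheap filename/label lines go through shuffle(), and decoding plus augmentation happen afterwards.

def my_estimator_func():
    # Read only the CSV lines ("filename,label" strings); no decoding yet.
    d0 = tf.data.TextLineDataset(train_csv_0)
    d1 = tf.data.TextLineDataset(train_csv_1).repeat(class_weight[1])
    d2 = tf.data.TextLineDataset(train_csv_2).repeat(class_weight[2])
    d3 = tf.data.TextLineDataset(train_csv_3).repeat(class_weight[3])
    d4 = tf.data.TextLineDataset(train_csv_4).repeat(class_weight[4])
    dataset = d0.concatenate(d1).concatenate(d2).concatenate(d3).concatenate(d4)
    # The 180,000-element buffer now holds short strings, not decoded images.
    dataset = dataset.shuffle(180000)
    # Decode, augment and resize only after shuffling.
    dataset = dataset.map(_parse_csv_train)
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label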

I also want to say thank you to @P-Gn: the blog link in his "shuffling" part is really helpful. It answered a question that was on my mind but that I didn't ask: "Can I get similar randomness by using many small shuffles vs. one large shuffle?" (I'm not going to give the answer here, check that blog!) The method in that blog might also be a potential solution to this issue, but I haven't tried it out.

Answer

I would suggest relying on tf.contrib.data.choose_from_datasets, with labels picked by a tf.multinomial distribution. The advantage of this, compared to other functions based on sample rejection, is that you do not lose I/O bandwidth reading unused samples.

Here is a working example for a case similar to yours, with a dummy dataset:

import tensorflow as tf

# create dummy datasets
class_num_samples = [900, 25, 25, 25, 25]
class_start = [0, 1000, 2000, 3000, 4000]
ds = [
    tf.data.Dataset.range(class_start[0], class_start[0] + class_num_samples[0]),
    tf.data.Dataset.range(class_start[1], class_start[1] + class_num_samples[1]),
    tf.data.Dataset.range(class_start[2], class_start[2] + class_num_samples[2]),
    tf.data.Dataset.range(class_start[3], class_start[3] + class_num_samples[3]),
    tf.data.Dataset.range(class_start[4], class_start[4] + class_num_samples[4])
]

# pick from dataset according to a parameterizable distribution
class_relprob_ph = tf.placeholder(tf.float32, shape=len(class_num_samples))
pick = tf.data.Dataset.from_tensor_slices(
    tf.multinomial(tf.log(class_relprob_ph)[None], max(class_num_samples))[0])

ds = tf.contrib.data.choose_from_datasets(ds, pick).repeat().batch(20)

iterator = ds.make_initializable_iterator()
batch = iterator.get_next()

with tf.Session() as sess:
    # choose uniform distribution
    sess.run(iterator.initializer, feed_dict={class_relprob_ph: [1, 1, 1, 1, 1]})
    print(batch.eval())
    # [   0 1000 1001    1 3000 4000 3001 4001    2    3 1002 1003 2000    4    5 2001 3002 1004    6 2002]

    # now follow input distribution
    sess.run(iterator.initializer, feed_dict={class_relprob_ph: class_num_samples})
    print(batch.eval())
    # [   0    1 4000    2    3    4    5 3000    6    7    8    9 2000   10   11   12   13 4001   14   15]
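
Not part of the original answer, but worth noting as an assumption about newer setups: on TF 1.13+ (and TF 2.x) the same idea is available as tf.data.experimental.sample_from_datasets, which takes the per-class mixing weights directly instead of an explicit multinomial pick:

# Assumes TF >= 1.13, where the contrib sampler graduated to tf.data.experimental.
import tensorflow as tf

class_num_samples = [900, 25, 25, 25, 25]
class_start = [0, 1000, 2000, 3000, 4000]
datasets = [tf.data.Dataset.range(s, s + n)
            for s, n in zip(class_start, class_num_samples)]

# Uniform mixing over the 5 classes; pass different weights to change the mix.
mixed = tf.data.experimental.sample_from_datasets(
    datasets, weights=[0.2, 0.2, 0.2, 0.2, 0.2]).repeat().batch(20)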

Note that the length of an "epoch" is now defined by the length of the multinomial sampling. I have set it somewhat arbitrarily to max(class_num_samples) here — there is indeed no good choice for a definition of an epoch when you start mixing datasets of different lengths.

However, there is a concrete reason to make it at least as large as the largest dataset: as you noticed, calling iterator.initializer restarts the Dataset from the beginning. Therefore, now that your shuffling buffer is much smaller than your data (which is usually the case), it is important not to restart too early, to make sure training sees all of the data.
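
To tie this back to the epoch-by-epoch distribution in the question: you can re-run iterator.initializer at each epoch boundary with updated relative probabilities. Here is a minimal sketch on top of the example above; num_epochs, steps_per_epoch and the linear interpolation schedule are assumptions for illustration only:

# Re-initialize the picker each "epoch" with new relative class probabilities,
# moving from uniform mixing towards the true class frequencies.
num_epochs = 10
steps_per_epoch = max(class_num_samples) // 20   # 20 = batch size above
total = float(sum(class_num_samples))

with tf.Session() as sess:
    for epoch in range(num_epochs):
        alpha = epoch / float(num_epochs - 1)    # 0 -> uniform, 1 -> true distribution
        relprob = [(1 - alpha) / len(class_num_samples) + alpha * n / total
                   for n in class_num_samples]
        sess.run(iterator.initializer, feed_dict={class_relprob_ph: relprob})
        for _ in range(steps_per_epoch):
            batch_np = sess.run(batch)           # feed batch_np to the training step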

About shuffling

This answer solves the problem of interleaving datasets with a custom weighting, not the problem of dataset shuffling, which is an unrelated one. Shuffling a large dataset requires making compromises: you cannot have efficient dynamic shuffling without sacrificing memory and performance somehow. There is, for example, an excellent blog post on that topic that illustrates graphically the impact of the buffer size on the quality of the shuffling.
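
As a small illustration of that trade-off (the buffer sizes below are arbitrary): shuffle() only keeps buffer_size elements in memory, so a small buffer mixes elements only locally, while a buffer as large as the dataset gives a uniform shuffle at the cost of buffering everything.

# Illustration of the buffer-size trade-off (sizes are arbitrary).
ds = tf.data.Dataset.range(10000)

# Weak shuffle: each output element is drawn from a 100-element buffer,
# so the ordering is only randomized locally.
local_shuffle = ds.shuffle(buffer_size=100)

# Uniform shuffle, but the whole dataset is held in the shuffle buffer.
full_shuffle = ds.shuffle(buffer_size=10000)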
