I am running my neural network on Ubuntu 16.04, with one GPU (GTX 1070) and 4 CPUs.
My dataset contains around 35,000 images, but it is not balanced: class 0 has 90% of them, and classes 1, 2, 3 and 4 share the other 10%. Therefore I over-sample classes 1-4 using dataset.repeat(class_weight) (I also use a function to apply random augmentation), and then concatenate them.
The re-sampling strategy is:
1) At the very beginning, class_weight[n] is set to a large number so that each class ends up with the same number of images as class 0.
2) As training progresses and the epoch number increases, the weights drop, so that the sampled distribution moves closer to the actual distribution (a hypothetical sketch of such a schedule follows).
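The exact decay schedule does not matter much here; as a hypothetical sketch (get_class_weight, counts and decay are made-up names, not code from my project), it could look like this:

def get_class_weight(counts, epoch, decay=0.5):
    # counts[i] = number of images in class i; class 0 is the majority class.
    weights = [1]  # class 0 is never repeated
    for c in counts[1:]:
        balanced = counts[0] // c  # repeat factor that matches class 0's size
        # shrink toward 1 as epochs go by, i.e. back toward the true distribution
        weights.append(max(1, int(round(balanced * decay ** epoch))))
    return weights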
Because my class_weight varies epoch by epoch, I can't shuffle the whole dataset once at the very beginning. Instead, I have to read the data in class by class and shuffle the whole dataset after concatenating the over-sampled data from each class; and to get balanced batches, that shuffle has to be element-wise over the whole dataset.
The following is part of my code.
def my_estimator_func():
    d0 = tf.data.TextLineDataset(train_csv_0).map(_parse_csv_train)
    d1 = tf.data.TextLineDataset(train_csv_1).map(_parse_csv_train)
    d2 = tf.data.TextLineDataset(train_csv_2).map(_parse_csv_train)
    d3 = tf.data.TextLineDataset(train_csv_3).map(_parse_csv_train)
    d4 = tf.data.TextLineDataset(train_csv_4).map(_parse_csv_train)
    d1 = d1.repeat(class_weight[1])
    d2 = d2.repeat(class_weight[2])
    d3 = d3.repeat(class_weight[3])
    d4 = d4.repeat(class_weight[4])
    dataset = d0.concatenate(d1).concatenate(d2).concatenate(d3).concatenate(d4)
    dataset = dataset.shuffle(180000)  # <- This is where the issue comes from
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label

def _parse_csv_train(line):
    parsed_line = tf.decode_csv(line, [[""], []])
    filename = parsed_line[0]
    label = parsed_line[1]
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    # my_random_augmentation_func will apply random augmentation on the image.
    image_aug = my_random_augmentation_func(image_decoded)
    image_resized = tf.image.resize_images(image_aug, image_resize)
    return image_resized, label
To make it clear, let me describe why I am facing this issue step by step:
1. Because the classes in my dataset are not balanced, I want to over-sample the minority classes.
2. Because of 1., I want to apply random augmentation to those classes and then concatenate the majority class (class 0) with them.
3. After doing some research, I found that repeat() generates different results if there is a random function in the mapped function, so I use repeat() together with my_random_augmentation_func to achieve 2.
4. Having achieved 2., I want to combine all the datasets, so I use concatenate().
5. After 4. I am now facing an issue: there are around 40,000 - 180,000 images in total (because class_weight changes epoch by epoch, at the beginning there are 180,000 images in total and finally about 40,000), and they are concatenated class by class, so the dataset looks like [0000-1111-2222-3333-4444]. With batch size 100 and no shuffling, almost every batch contains only one class, which means the class distribution within each batch is imbalanced.
6. To solve the "imbalanced batch" issue in 5., I came up with the idea of shuffling the whole dataset, so I use shuffle(180000).

And finally, boom, my computer freezes when it comes to shuffling 180,000 items in the dataset.
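In hindsight, the freeze makes sense: shuffle() keeps its whole buffer in memory, and at that point in the pipeline every element is already a decoded, augmented, resized float32 image rather than a short CSV line. A rough estimate (256x256 is an assumed resize target here, not my actual image_resize):

bytes_per_image = 256 * 256 * 3 * 4      # one float32 RGB image, ~0.79 MB
buffer_bytes = 180000 * bytes_per_image  # what shuffle(180000) tries to hold at once
print(buffer_bytes / 1e9)                # ~141 GB, far more RAM than the machine has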
So, is there a better way to get balanced batches while still keeping the characteristics I want (e.g. a distribution that changes epoch by epoch)?
--- Edit: Issue solved ---
It turned out that I should not apply the map function at the beginning. I should take in only the filenames instead of the actual files, shuffle the filenames, and only then map them to the actual files. More concretely: delete the map(_parse_csv_train) part after d0 = tf.data.TextLineDataset(train_csv_0) and the other four lines, and add a new line dataset = dataset.map(_parse_csv_train) after shuffle(180000).
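For reference, this is a sketch of the input function after that change (same names as above; only the position of map(_parse_csv_train) moves, so the shuffle buffer now holds short CSV lines instead of decoded images):

def my_estimator_func():
    # The datasets now yield raw CSV lines ("filename,label"), not decoded images.
    d0 = tf.data.TextLineDataset(train_csv_0)
    d1 = tf.data.TextLineDataset(train_csv_1).repeat(class_weight[1])
    d2 = tf.data.TextLineDataset(train_csv_2).repeat(class_weight[2])
    d3 = tf.data.TextLineDataset(train_csv_3).repeat(class_weight[3])
    d4 = tf.data.TextLineDataset(train_csv_4).repeat(class_weight[4])
    dataset = d0.concatenate(d1).concatenate(d2).concatenate(d3).concatenate(d4)
    dataset = dataset.shuffle(180000)        # cheap now: the buffer only holds strings
    dataset = dataset.map(_parse_csv_train)  # decode and augment after shuffling
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label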
I also want to say thank you to @P-Gn; the blog link in the "shuffling" part of his answer is really helpful. It answered a question that was on my mind but that I didn't ask: "Can I get similar randomness by using many small shuffles vs. one large shuffle?" (I'm not gonna give the answer here, check that blog!) The method in that blog might also be a potential solution to this issue, but I haven't tried it out.