I want to build a neural network that will take the feature map from the last layer of a CNN (VGG or resnet for example), concatenate an additional vector (for example , 1X768 bert vector) , and re-train the last layer on classification problem.
So the architecture should be like in:
but I want to concat an additional vector to each feature vector (I have a sentence to describe each frame).
I have 5 possible labels , and 100 frames in the input frames.
Can someone help me as to how to implement this type of network?
I would recommend looking into the Keras functional API.
Unlike a sequential model (which is usually enough for many introductory problems), the functional API allows you to create any acyclic graph you want. This means that you can have two input branches, one for the CNN (image data) and the other for any NLP you need to do (relating to the descriptive sentence that you mentioned). Then, you can feed in the combined outputs of these two branches into the final layers of your network and produce your result.
Even if you've already created your model using models.Sequential()
, it shouldn't be too hard to rewrite it to use the functional API.
For more information and implementation details, look at the official documentation here: https://keras.io/guides/functional_api/