12.2. Convolution Operation (CNN)
The convolution operation is the fundamental building block of a Convolutional Neural Network. It involves sliding a kernel, which is a small matrix of weights, across the input image or feature map and performing an element-wise multiplication followed by a sum for each subsection of the input.
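To make the multiply-and-sum concrete, here is a minimal sketch of a single window position. The patch values and the kernel weights below are made up purely for illustration; in a real network the kernel weights would be learned.

```python
import numpy as np

# One 3x3 subsection of an input image (values chosen for illustration).
patch = np.array([[1, 2, 0],
                  [0, 1, 3],
                  [2, 1, 1]], dtype=float)

# A 3x3 kernel of weights (hand-picked here, learned in practice).
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Element-wise multiply, then sum: one entry of the feature map.
value = np.sum(patch * kernel)  # -> -1.0
```

The single number produced here would occupy one cell of the output feature map; repeating this for every window position fills in the rest.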
Here are the steps involved in the convolution operation:
Initialization: First, you position a window the size of your filter (e.g., 3x3 or 5x5) at the top-left corner of the input matrix (the image). This window will slide over the entire input.
Element-wise Multiplication: Next, you perform element-wise multiplication between the values in the filter and the values in the image that currently reside within the window.
Summation: After the element-wise multiplication, you sum up all the resulting values. This sum is then placed into a new matrix called the convolved feature or feature map.
Sliding the Window: Then, you slide the window across the image. How far you move this window at each step is determined by the ‘stride’ value. If the stride is 1, you move the window one pixel at a time; if it’s 2, you move the window two pixels at a time, and so forth. Together with the filter size, the stride determines the size of the output: for an n×n input, a k×k filter, and stride s, each output dimension is ⌊(n − k)/s⌋ + 1.
Repeat: You repeat steps 2–4 for all positions of the window in the image. As you slide the window across the width and height of the input, you end up with a matrix of summed values, which is your feature map.
Activation Function: Finally, an activation function (like ReLU) is applied to the feature map to introduce non-linearity into the model. Non-linearities are important because they allow the model to learn more complex representations.
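The steps above can be sketched as a naive NumPy implementation. The function name `conv2d` and the example inputs are illustrative choices, not a standard API; note also that, like most deep-learning frameworks, this sketch computes cross-correlation (the kernel is not flipped), which is what "convolution" conventionally means in CNNs.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive 'valid' convolution: slide the kernel over the image,
    multiply element-wise, and sum each window."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    # Output size: floor((n - k) / stride) + 1 per dimension.
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)  # multiply and sum
    return out

# A toy 4x4 input and a 3x3 averaging kernel (illustrative values).
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0

feature_map = conv2d(image, kernel)        # -> [[5., 6.], [9., 10.]]
activated = np.maximum(feature_map, 0.0)   # ReLU activation
```

With a 3x3 kernel and stride 1, the 4x4 input shrinks to a 2x2 feature map, matching the output-size formula in the steps above.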
Through this process, the convolution operation helps to bring out features in the image that the model can learn from. For example, a filter might learn to recognize edges in the image, while another might learn to recognize textures. Each filter in the convolutional layer learns to recognize different features, and through the combination of these features, the model can learn to understand a wide array of complex visual information.
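As a sketch of the edge-detection idea, the following applies a hand-crafted vertical-edge kernel to a toy image whose left half is dark and right half is bright. The image and kernel values are illustrative assumptions; in a trained CNN such weights would be learned, not specified.

```python
import numpy as np

# Toy 5x5 image: dark left half (0.0), bright right half (1.0),
# i.e., a single vertical edge.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Hand-crafted vertical-edge kernel (learned in a real CNN).
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

# Slide the kernel over every 3x3 window (stride 1, no padding).
responses = np.array([
    [np.sum(image[i:i + 3, j:j + 3] * edge_kernel) for j in range(3)]
    for i in range(3)
])
```

The response is zero over the flat dark region and large where the window straddles the edge, which is exactly the kind of feature a filter in an early convolutional layer picks out.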