In computer vision and image processing, convolution is a way to mathematically filter images to do things like blurring, sharpening, or edge detection. I'll show you examples of those later, but first, let's see how convolution is performed. Convolution is the process of adding each element or pixel in an image to its neighbors, weighted by a kernel. A kernel is a small matrix that does not change throughout the convolution operation and holds the weights for this summation process. Kernels are usually three-by-three or five-by-five matrices, although other sizes do exist, and the numbers in the kernel determine the type of filtering that we do. I'll pick these for this example. Next, we take a section of pixels the same size as the kernel, starting at the top left of the image. Each pixel is multiplied by the corresponding kernel element, and all the values in this element-wise multiplication are then added up to give us the first value in our output image, or feature map. The window then slides over by one pixel and the operation is performed again: we use the same kernel but a different subsection of the image, which gives us our second value. The window slides again to give us the third value. When it reaches the end of the row, the window slides down one pixel and starts again on a new row. This process continues until we have used all of the pixels in the original image to create a new output image. All of this assumes a stride of one, which means the window moves over or down only one pixel at a time. With a stride of two, we would start by computing the weighted sum of the window set in the upper left, just like we did before. Then we would slide the window over by two pixels and repeat the calculation.
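The sliding-window procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the original; the example image and averaging kernel are made up for demonstration.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel window across the image; at each position,
    multiply element-wise with the kernel and sum the products."""
    m, n = kernel.shape        # kernel rows (M) and columns (N)
    h, w = image.shape         # image height (H) and width (W)
    out_h = (h - m) // stride + 1
    out_w = (w - n) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride : i * stride + m,
                           j * stride : j * stride + n]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(20, dtype=float).reshape(4, 5)  # a 4x5 "image"
kernel = np.full((3, 3), 1 / 9)                   # 3x3 averaging kernel
print(convolve2d(image, kernel).shape)            # stride 1 -> (2, 3)
print(convolve2d(image, kernel, stride=2).shape)  # stride 2 -> (1, 2)
```

Note how the output shrinks as the stride grows: with a stride of two, the window can only move once to the right and cannot move down at all.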
We would move the window down by two pixels as well to repeat, but since we cannot do that, the convolution is done, and we end up with a one-by-two array (a single row of two values) for the output image instead. You'll also notice that the output image in any of these cases is smaller than the input image. There are a few ways we can account for this, which I'll talk about in a minute, but first, here's the general formula for image convolution. Capital I is the input image matrix, and capital K is the kernel. We index into each of these arrays and sum the element-wise multiplications to get a value for one of the pixels of the output image O. Often in the math world, you'll see that matrices and vectors start with one for the first element. However, I've shown the equation starting with zero as the first index, as this should be easier to implement in code. Lowercase s is the stride, H is the height of the image, and W is the width of the image. Capital M is the number of rows in the kernel, and capital N is the number of columns. Lowercase i is the row index of the output image, and it counts from zero up to (but not including) the floor of the image height minus the kernel height divided by the stride, plus one. Lowercase j is the column index of the output image, and it counts from zero up to (but not including) the floor of the image width minus the kernel width divided by the stride, plus one. Here is our previous convolution example with the zero-based indices marked for each matrix; the output image would be a two-by-three matrix if we set the stride to one. Let's say that we're working with a really odd kernel like this, which causes the output values to be really large. If we're working with eight-bit grayscale values for our input image, then we would usually want to work with eight-bit grayscale values for the output image. These values must be between 0 and 255, so we usually set anything above 255 to 255, and anything below 0 to 0.
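That 0-to-255 limiting step is one line with NumPy's `clip`. The feature-map values below are made up to show values falling outside the valid range.

```python
import numpy as np

# A hypothetical feature map whose values overflow the 8-bit range.
feature_map = np.array([[-40.0, 120.0, 600.0],
                        [310.0, 255.0,  -5.0]])

# Clamp (clip) everything to [0, 255], then cast back to 8-bit grayscale.
clamped = np.clip(feature_map, 0, 255).astype(np.uint8)
print(clamped)  # every value now lies between 0 and 255

# For floating-point pixels, clamp to the [0, 1] range instead.
float_pixels = np.clip(feature_map / 255.0, 0.0, 1.0)
```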
You'll sometimes hear this referred to as clamping a value; in this case, we'd be clamping the value between 0 and 255. If you're curious, the output image would look something like this, as all values would just be clipped to 255. If you're working with floating-point values for your pixels, which can sometimes happen in machine learning, you'll also want to clip or clamp those. Zero to one is the normal range, but you might sometimes see negative one to one as well. Now, let's talk about padding. Vincent Dumoulin made these great animations and shared them on GitHub; I'll use them to demonstrate how padding works. There are a few types of padding that you can use on the input matrix. We've been looking at what's known as valid padding, which essentially means there's no padding of the input image: only the valid elements of the input array are used. In this instance, the input image is a four-by-four matrix and the kernel is three-by-three. The kernel isn't shown, but we can assume that the element-wise multiplication and summation are occurring to give us the element in the output array shown at the top. With a stride of one, the output array is a two-by-two matrix. We saw this example earlier, where we convolve a four-by-five input matrix with a three-by-three kernel and use a stride of two. Notice that while the top three rows are used, the bottom row is skipped because we would not be able to move our window down by two; we only use the elements in the input matrix that are valid with the given window size and stride setting. In same padding, we pad the input matrix with elements in order to give us an output matrix that is the same size as the input. The padded elements are shown by the dotted lines. For example, here we show a five-by-five input matrix with a three-by-three kernel and a stride of one. With these elements added around the input matrix, the output matrix will be five-by-five.
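A quick sketch of the size arithmetic behind same padding, assuming a stride of one, an odd-sized kernel, and NumPy's zero-filling `np.pad` (the 5x5 input values are arbitrary):

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input matrix
k = 3                                             # 3x3 kernel

# For an odd kernel and a stride of one, padding (k - 1) // 2 rows and
# columns of zeros on every side keeps the output the same size as the input.
pad = (k - 1) // 2
padded = np.pad(image, pad, mode="constant", constant_values=0)
print(padded.shape)                 # (7, 7)
print((padded.shape[0] - k) + 1,    # output rows: 5, matching the input
      (padded.shape[1] - k) + 1)    # output cols: 5, matching the input
```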
Note that even with padding, if you set the stride to more than one, you still may end up with an output matrix that is smaller than the input matrix. In some libraries like Keras, if you specify same padding, it will attempt to add as many padded rows and columns as necessary to give you an output matrix that is the same size as the input matrix. There are a few ways to fill the padded elements. One common approach is to just set all of the padded elements to zero. This technique is probably the most common, and it's what Keras does when you specify same padding. Another option is to copy the values on the edges of the matrix out to the padding. This has the effect of making the image look a little bigger than it is and preventing some skewing of the data at the borders that might occur if we used zero padding. In some applications, you don't need to maintain the information at the borders, so you can get away with no padding. In fact, it's more computationally efficient to work without padding. Not only does it save you convolution steps, it also gives you a smaller output matrix, which might be beneficial for future steps. For example, if you're using this filtered output image as an input to a neural network, having a smaller matrix means fewer input dimensions and therefore fewer computations in each node. However, there are times when you want to use padding. For example, if you're just looking to filter an image to blur or sharpen it, then you will likely want to maintain the same size and shape. Keeping the same shape also means you can perform more convolution steps without continually losing information at the borders. And sometimes you might determine that the information at the edges of the image really is important to whatever vision project you're doing, so you need to use padding to keep that relevant data. If you're working with color images, you would first break the image apart into its separate red, green, and blue color channels.
Each channel would then act like the grayscale images we were looking at earlier. A separate kernel would be convolved with each channel to produce three separate output images. These output images represent the new RGB channels and could be combined to create the new filtered color image. Note that the kernel can be the same for each channel, but it doesn't have to be. You could use different kernel values for each channel if you wanted to. This would create some interesting effects, as you might blur one color, sharpen another, and do edge detection on the third. When working with color images, you no longer have a simple two-dimensional kernel. Since you have one kernel for each channel, you can stack the kernels together to create a three-dimensional array. In this example, the first three-by-three matrix would be for the red channel, the second three-by-three matrix would be for green, and the third three-by-three matrix would be for blue. When working with something like NumPy, SciPy, or Keras, it's often easy to store these kernels together, so you would likely store them as a single three-by-three-by-three array. Using convolution, we can create a variety of filters for our image. Let's look at a few filters that have been applied to this elephant photo. Note that I'm using a small image, 200 by 130 pixels, so that our three-by-three kernels will have a noticeable effect. First up is the Gaussian blur. This simply mixes the pixels in the window in such a way as to soften transitions and edges. We can also use the kernel we saw earlier to create a sharpened image. Note that sharpening an image using this method will also amplify any noise present in the image. We can also do this funky emboss effect to make it look like the elephant is carved out of the page. In addition to forming the basis for fun effects for things like Instagram filters, we can use convolution to do edge detection.
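Here is a minimal sketch of that per-channel filtering, using `scipy.ndimage.convolve` on a random placeholder image; the three stacked kernels are illustrative values (blur, sharpen, and outline), not ones from the original.

```python
import numpy as np
from scipy.ndimage import convolve

# One 3x3 kernel per channel, stacked into a single 3x3x3 array
# (illustrative: blur the red channel, sharpen green, outline blue).
kernels = np.stack([
    np.full((3, 3), 1 / 9),                                       # red
    np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], float),       # green
    np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], float),   # blue
])

rgb = np.random.rand(130, 200, 3)  # placeholder H x W x C color image

# Convolve each channel with its own kernel, then restack and clamp.
channels = [convolve(rgb[:, :, c], kernels[c]) for c in range(3)]
filtered = np.clip(np.stack(channels, axis=-1), 0.0, 1.0)
print(filtered.shape)  # (130, 200, 3)
```

To filter all three channels identically instead, you would simply reuse one kernel for every channel.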
The first is a basic outline filter that picks out any areas of harsh transition between light and dark. The next is a Sobel filter, which is a combination of two filters that pick out edges in the x direction and y direction. Edge detection is extremely useful in computer vision, as it helps us determine the boundaries between objects in the image and the background. It can also help reduce noise. This is the kind of information that a machine learning model can use to pick out objects in an image or extract features like patterns and shapes to use for classification. We can set the kernel parameters manually if we know what kind of filter or filters we want to use. Alternatively, we can use learning algorithms to set them automatically during the training process if we are using something like a convolutional neural network. In this case, the algorithm essentially tries a bunch of different filters and picks the ones that give it the best features to meet its goal, like classification or object detection. I hope this helps you get a sense of how image convolution works, how we can use it to create fun filtering effects, and how it forms the basis for feature extraction for many computer vision applications.
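As a recap, the Sobel edge detection described above can be sketched like this: two kernels pick out horizontal and vertical gradients, and the two responses are combined into one edge map. This is a minimal illustration on a random placeholder image, using `scipy.ndimage.convolve`.

```python
import numpy as np
from scipy.ndimage import convolve

# Sobel kernels: gradients in the x and y directions.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

image = np.random.rand(130, 200)  # placeholder grayscale image

gx = convolve(image, sobel_x)     # edges in the x direction
gy = convolve(image, sobel_y)     # edges in the y direction

# Combine the two filter responses into a single edge-magnitude map.
edges = np.hypot(gx, gy)
print(edges.shape)  # (130, 200)
```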