There's no definite answer to this: filter size is one of hyperparameters you generally need to tune. However, there're some useful observations, that may help you. It's often preferred to choose smaller filters, but have greater number of those.
Example: four 5x5
filters have 100 parameters (ignoring bias), while 10 3x3
filters have 90 parameters. Through the larger of filters you still can capture the variety of features in the image, but with fewer parameters. More on this here.
Modern CNNs go even further with this idea and choose consecutive 3x1
and 1x3
convolutional layers. This reduces the number of parameters even more, but doesn't affect the performance. See the evolution of inception network.
The choice of stride is also important, but it affects the tensor shape after the convolution, hence the whole network. The general rule is to use stride=1
in usual convolutions and preserve the spatial size with padding, and use stride=2
when you want to downsample the image.