Such behaviour is normally governed by the Application Binary Interface (ABI), and the most widely used x86 ABIs (Win32 and System V) simply require that each parameter occupies at least 4 bytes. This is mainly because most x86 implementations suffer performance penalties if data is not properly aligned. While your example would not "de-align" the stack, a subroutine taking only three byte-sized parameters would. Of course, one could define special rules in the ABI to overcome this, but that complicates things for little gain.
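To make the alignment point concrete, here is a minimal sketch of the arithmetic. The helper names (`padded_bytes`, `packed_bytes`, `is_aligned`) are made up for illustration, and the 4-byte slot is the 32-bit ABI minimum mentioned above; this is just a model of the byte counts, not of any real calling convention.

```python
SLOT = 4  # assumed minimum bytes per stack argument under the 32-bit ABIs discussed above


def padded_bytes(arg_sizes, slot=SLOT):
    """Stack bytes used when every argument is rounded up to a whole slot."""
    return sum(-(-size // slot) * slot for size in arg_sizes)


def packed_bytes(arg_sizes):
    """Stack bytes used if arguments were pushed back to back with no padding."""
    return sum(arg_sizes)


def is_aligned(nbytes, alignment=SLOT):
    """True if moving the stack pointer by nbytes keeps it on an alignment boundary."""
    return nbytes % alignment == 0


four_byte_args = [1, 1, 1, 1]   # the case from the question
three_byte_args = [1, 1, 1]     # the problematic case

print(packed_bytes(four_byte_args), is_aligned(packed_bytes(four_byte_args)))    # 4 True
print(packed_bytes(three_byte_args), is_aligned(packed_bytes(three_byte_args)))  # 3 False
print(padded_bytes(three_byte_args), is_aligned(padded_bytes(three_byte_args)))  # 12 True
```

Four packed bytes happen to land on a 4-byte boundary, three do not, and padding every argument to a full slot sidesteps the question entirely.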
Also keep in mind that the x86 ABIs were designed around 1990. At that time, the number of instructions was a very good measure of the speed of a piece of code. Your example requires one extra instruction compared with four pushes if para1-para4 are located in registers, and five extra instructions in the worst case, where all parameters must be loaded from memory (x86 supports pushing memory locations directly).
Further, in your example you trade saving 12 bytes on the stack for 14 extra code bytes: your code sequence requires 18 bytes of code if para1-para4 are located in registers (e.g. al-dl), while four pushes require 4 bytes. The code cost is paid once per call site, whereas the stack saving applies to each frame that is live at the same time, so overall you reduce the memory footprint only if you have recursion in your code.
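A quick sketch of that trade-off, using only the byte counts quoted above (18 vs. 4 code bytes, 16 vs. 4 stack bytes). The function name `net_saving` and the idea of counting "live frames" are my own simplification, not something from the ABI documents.

```python
EXTRA_CODE_BYTES = 18 - 4    # packed sequence vs. four one-byte pushes
STACK_BYTES_SAVED = 16 - 4   # four 4-byte slots vs. four packed bytes


def net_saving(live_frames, call_sites=1):
    """Bytes saved overall: stack saved per live frame minus extra code per call site.

    A negative result means the packed version uses more memory in total.
    """
    return STACK_BYTES_SAVED * live_frames - EXTRA_CODE_BYTES * call_sites


print(net_saving(1))  # -2: a single, non-recursive call costs more than it saves
print(net_saving(2))  # 10: two nested frames from one call site start to pay off
```

With a single live frame the packed version is 2 bytes worse, and it only comes out ahead once the same call site has at least two frames on the stack at once, i.e. with recursion.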
push ax is encodable in any mode (16, 32, or 64-bit); it's just normally not useful outside of 16-bit mode. As you say, normal calling conventions pad stack args to fill a whole arg-passing "slot" (a register, or register-width chunk of stack memory). – Copperas