Let us say that at time t, for each particle, you have:
P, its position
V, its speed
and an N*(N-1)/2 array of pair information between particles A(i) and A(j), where i < j; by symmetry you only evaluate an upper triangular matrix instead of the full N*(N-1) grid.
MAT[i][j] = { dx, dy, dz, sx, sy, sz }.
which means that, with respect to particle i, particle j is at a distance made up of the three components dx, dy and dz, and has a relative velocity (delta-vee) multiplied by dt given by sx, sy and sz.
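For concreteness, here is one possible C layout for this state (a sketch: the flat upper-triangle indexing and all the names are mine, not prescribed by the method):

    enum { N = 1000 };             /* illustrative particle count            */

    typedef struct {
        double dx, dy, dz;         /* position of j relative to i            */
        double sx, sy, sz;         /* velocity of j relative to i, times dt  */
    } PairInfo;

    double px[N], py[N], pz[N];    /* P: positions                           */
    double vx[N], vy[N], vz[N];    /* V: speeds                              */
    PairInfo mat[N * (N - 1) / 2]; /* upper triangle only, i < j             */

    /* Flat index of pair (i, j), i < j, into the triangular array. */
    int pair_index(int i, int j) {
        return i * N - i * (i + 1) / 2 + (j - i - 1);
    }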
To move to instant t+dt, you tentatively update the positions of all particles based on their speed:

    px[i] += dx[i] // px,py,pz make up vector P; here dx[i],dy[i],dz[i] are particle i's own V
    py[i] += dy[i] // premultiplied by dt (not the pair quantities in MAT). We could even use
    pz[i] += dz[i] // "particle 0" as a fixed origin, except you can't collide with the origin, since it's virtual.
Then you check the whole N*(N-1)/2 array and tentatively calculate the new relative distance between every pair of particles:
    dx1 = dx + sx
    dy1 = dy + sy
    dz1 = dz + sz
    DN = dx1*dx1 + dy1*dy1 + dz1*dz1 // This is the new squared distance
If DN < D^2, with D the diameter of the particles, you have had a collision in the dt just past.
You then calculate exactly where this happened, i.e., you calculate the exact d't of collision, which you can do from the old squared distance D2 (dx*dx + dy*dy + dz*dz) and the new one DN:

    d't = [(SQRT(D2) - D) / (SQRT(D2) - SQRT(DN))] * dt

(the time needed to reduce the distance from SQRT(D2) to D, at a speed that covers the distance SQRT(D2) - SQRT(DN) in time dt). This assumes that particle j, seen from the reference frame of particle i, hasn't "overshot".
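In C, that interpolation might look like this (a sketch; D2 and DN are the squared distances defined above):

    #include <math.h>

    /* Fraction of dt elapsed before contact: linear interpolation of the
     * centre distance, shrinking from sqrt(D2) to D at the closing speed
     * (sqrt(D2) - sqrt(DN)) / dt.                                         */
    double collision_time(double D2, double DN, double D, double dt) {
        double d_old = sqrt(D2);   /* distance at time t      */
        double d_new = sqrt(DN);   /* distance at time t + dt */
        return (d_old - D) / (d_old - d_new) * dt;
    }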
It is a heftier calculation, but you only need it when a collision actually occurs.
Knowing d't, and d"t = dt - d't, you can repeat the position calculation on Pi and Pj using dx*d't/dt etc., and obtain the exact positions of particles i and j at the instant of collision; you update their speeds, then integrate for the remaining d"t and get the positions at the end of the time step dt.
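A sketch of that resolution step; resolve_velocities() is a placeholder I'm introducing (for hard spheres it would be an elastic exchange along the line of centres), and px/vx etc. follow the layout sketched earlier:

    /* Placeholder: update the two speeds at contact. Not spelled out above. */
    void resolve_velocities(int i, int j);

    /* Advance particle i by time tau. */
    void move_particle(int i, double tau) {
        px[i] += vx[i] * tau;
        py[i] += vy[i] * tau;
        pz[i] += vz[i] * tau;
    }

    /* i and j were already moved by the full dt in the tentative pass, so
     * stepping them by (d't - dt) lands them at the contact point; after
     * the bounce, the remaining d"t = dt - d't completes the step.         */
    void collide(int i, int j, double dt_hit, double dt) {
        move_particle(i, dt_hit - dt);  move_particle(j, dt_hit - dt);
        resolve_velocities(i, j);
        move_particle(i, dt - dt_hit);  move_particle(j, dt - dt_hit);
    }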
Note that if we stopped here, this method would break whenever a three-particle collision took place, since it only handles two-particle collisions.
So instead of running those calculations immediately, we just mark that a collision occurred at d't for particles (i,j), and at the end of the sweep we keep the minimum d't at which a collision occurred, and between whom.
I.e., say we check particles 25 and 110 and find a collision at 0.7 dt; then we find a collision between 110 and 139 at 0.3 dt. There are no collisions earlier than 0.3 dt.
We enter the collision-updating phase and "collide" 110 and 139, updating their positions and speeds. Then we repeat the 2*(N-2) pair calculations for each (i, 110) and (i, 139).
We will probably discover that there still is a collision with particle 25, but now at 0.5 dt, and maybe another, say between 139 and 80, at 0.9 dt. 0.5 dt is the new minimum, so we repeat the collision calculation between 25 and 110, and so on, suffering a slight "slow down" in the algorithm for each collision.
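Sketched as a loop (find_earliest() and recheck_pairs() stand for the minimum-d't scan and the 2*(N-2) re-tests described above; they are assumed helpers, not given here):

    int find_earliest(int *i, int *j, double *dt_hit); /* 0 when no collision remains  */
    void recheck_pairs(int k, double dt);              /* re-test k against all others */

    /* Process collisions earliest-first until none remain inside this dt. */
    void handle_collisions(double dt) {
        int i, j;
        double dt_hit;
        while (find_earliest(&i, &j, &dt_hit)) {
            collide(i, j, dt_hit, dt);  /* from the previous sketch */
            recheck_pairs(i, dt);
            recheck_pairs(j, dt);
        }
    }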
Thus implemented, the only remaining risk is that of "ghost collisions": a particle is at a distance D > diameter from a target at time t-dt, and is at D > diameter on the other side at time t.
You can only avoid this by choosing a dt such that no particle ever travels more than half its own diameter in any given dt. Actually, you might use an adaptive dt based on the speed of the fastest particle. Glancing ghost collisions are still possible; a further refinement is to reduce dt based on the nearest distance between any two particles.
This way, it is true that the algorithm slows down considerably in the vicinity of a collision, but it speeds up enormously when collisions aren't likely. If the minimum distance between any two particles (which we calculate at almost no cost during the loop) is such that the fastest particle (which we also find at almost no cost) can't cover it in fewer than fifty dts, that's a 4900% speed increase right there.
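One way to put numbers on that (a sketch; v_max and d_min are assumed to be tracked during the main loop, and I use 2*v_max as the worst-case closing speed of a pair, slightly more conservative than the single-fastest-particle criterion above):

    /* dt such that no particle moves more than half a diameter per step. */
    double choose_dt(double v_max, double D) {
        return 0.5 * D / v_max;
    }

    /* Whole steps we can take before the closest pair can possibly touch:
     * it must close a gap of d_min - D at a speed of at most 2*v_max.     */
    int safe_steps(double d_min, double D, double v_max, double dt) {
        int n = (int)((d_min - D) / (2.0 * v_max * dt));
        return n > 1 ? n : 1;
    }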
Anyway, in the generic no-collision case we have now done five sums (actually more like thirty-four, due to array indexing), three products and several assignments for every particle pair. If we include the (k,k) pair to account for the particle's own update, we have a good approximation of the cost so far.
This method has the advantage of being O(N^2) - scaling with the square of the particle count - instead of O(M^3), which scales with the volume of space involved.
I'd expect a C program on a modern processor to be able to manage, in real time, a number of particles on the order of tens of thousands.
P.S.: this is actually very similar to Nicolas Repiquet's approach, including the necessity of slowing down in the 4D vicinity of multiple collisions.