This is the fastest routine I've found so far that doesn't use Cython or a JIT like Numba. I takes about 1.6 μs on my machine to process a 4x4 array (average time over a list of 100K 4x4 arrays):
inds_cache = {}
def upper_triangular_to_symmetric(ut):
n = ut.shape[0]
try:
inds = inds_cache[n]
except KeyError:
inds = np.tri(n, k=-1, dtype=np.bool)
inds_cache[n] = inds
ut[inds] = ut.T[inds]
Here are some other things I've tried that are not as fast:
The above code, but without the cache. Takes about 8.3 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
n = ut.shape[0]
inds = np.tri(n, k=-1, dtype=np.bool)
ut[inds] = ut.T[inds]
A plain Python nested loop. Takes about 2.5 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
n = ut.shape[0]
for r in range(1, n):
for c in range(r):
ut[r, c] = ut[c, r]
Floating point addition using np.triu
. Takes about 11.9 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
ut += np.triu(ut, k=1).T
Numba version of Python nested loop. This was the fastest thing I found (about 0.4 μs per 4x4 array), and was what I ended up using in production, at least until I started running into issues with Numba and had to revert back to a pure Python version:
import numba
@numba.njit()
def upper_triangular_to_symmetric(ut):
n = ut.shape[0]
for r in range(1, n):
for c in range(r):
ut[r, c] = ut[c, r]
Cython version of Python nested loop. I'm new to Cython so this may not be fully optimized. Since Cython adds operational overhead, I'm interested in hearing both Cython and pure-Numpy answers. Takes about 0.6 μs per 4x4 array:
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def upper_triangular_to_symmetric(np.ndarray[np.float64_t, ndim=2] ut):
cdef int n, r, c
n = ut.shape[0]
for r in range(1, n):
for c in range(r):
ut[r, c] = ut[c, r]