I am running TensorFlow on an NVIDIA Jetson TX1 and ran into memory shortages when training large networks such as GoogLeNet.
The CPU and GPU on the TX1 do not have separate memories; they share a single physical memory. However, TensorFlow seems to allocate separate memory spaces anyway and copy from the CPU side to the GPU side, so it requests twice the memory it actually needs.
In my opinion, this situation could be handled by something like DMA between the CPU and GPU. As far as I know, TensorFlow already uses DMA between GPUs (I'm not sure whether TensorFlow itself or the GPU driver handles this). Can I also use DMA between the CPU and GPU in TensorFlow? Or are there any other suggestions?
EDIT: I just found that CUDA has a Zero Copy feature, which is exactly what I wanted. However, can I use this feature in TensorFlow?
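For reference, here is a minimal sketch of what I mean by zero copy in plain CUDA (assuming a device like the TX1 where host allocations can be mapped into the GPU address space). The GPU operates directly on pinned host memory, so no second copy and no `cudaMemcpy` are needed. My question is whether TensorFlow's allocator can be made to work this way:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // GPU writes directly into host-visible memory
}

int main() {
    const int n = 1024;

    // Allow host allocations to be mapped into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate pinned host memory that the GPU can access directly:
    // one physical buffer instead of separate CPU- and GPU-side copies.
    float *h_data;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Get the device-side pointer that aliases the same physical memory.
    float *d_data;
    cudaHostGetDevicePointer(&d_data, h_data, 0);

    scale<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    // The result is visible on the host with no explicit copy back.
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFreeHost(h_data);
    return 0;
}
```

On a unified-memory SoC like the TX1 this avoids the doubled allocation entirely, which is why I would like TensorFlow to allocate its tensors this way.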