Yes, you can. Have the first thread in the block initialize the array while the other threads skip that step, e.g.:
extern __shared__ unsigned int local_bin[]; // Size (in bytes) is specified in the kernel launch
if (threadIdx.x == 0) // Initialize on the first thread only; add "&& threadIdx.y == 0" and "&& threadIdx.z == 0" if the thread block has 2 or 3 dimensions instead of 1
{
    // Set every element of local_bin to the desired value. Note that cudaMemset cannot be used here:
    // it is a host-side runtime call that works on device global memory, not on shared memory inside a kernel.
    for (unsigned int i = 0; i < num_bins; ++i) // num_bins: hypothetical kernel parameter holding the array length
        local_bin[i] = 0; // 0 is a placeholder for whatever initial value you need
}
// Do stuff unrelated to local_bin here
__syncthreads(); // Ensures the initialization above has completed before any thread reads or writes local_bin
// Do stuff to local_bin here
Ideally, do as much work that does not touch local_bin as possible before the __syncthreads() call. The other threads can get on with that work while thread 0 is still initializing the array, so they spend less time idle at the barrier. This matters most when threads can finish at quite different times, for example when there is conditional branching.
Note that for the initialization loop in thread 0, you need to pass the length of the local_bin array to the kernel as a parameter. An extern __shared__ array carries no size information inside the kernel, so the loop has no other way of knowing how many elements to iterate over.
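To make the pattern concrete, here is a minimal sketch of a complete kernel plus a host wrapper. It assumes a simple histogram use case, which the answer above does not specify. The names histogram_kernel, run_histogram, data, global_bin and num_bins are all made up for illustration, a 1D thread block is assumed, and error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: counts how many inputs fall into each of num_bins bins.
__global__ void histogram_kernel(const unsigned int *data, unsigned int n,
                                 unsigned int *global_bin, unsigned int num_bins)
{
    extern __shared__ unsigned int local_bin[]; // byte size given at launch

    if (threadIdx.x == 0) // first thread of a 1D block initializes the array
    {
        for (unsigned int i = 0; i < num_bins; ++i)
            local_bin[i] = 0;
    }

    // Work that does not touch local_bin, done before the barrier.
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int bin = (idx < n) ? data[idx] % num_bins : 0;

    __syncthreads(); // local_bin is fully initialized past this point

    if (idx < n)
        atomicAdd(&local_bin[bin], 1u);

    __syncthreads(); // all shared-memory increments are finished

    if (threadIdx.x == 0) // merge this block's counts into global memory
    {
        for (unsigned int i = 0; i < num_bins; ++i)
            atomicAdd(&global_bin[i], local_bin[i]);
    }
}

// Hypothetical host wrapper: copies input to the device, launches the kernel,
// and returns the per-bin counts.
std::vector<unsigned int> run_histogram(const std::vector<unsigned int> &input,
                                        unsigned int num_bins)
{
    std::vector<unsigned int> result(num_bins, 0);
    const unsigned int n = static_cast<unsigned int>(input.size());
    if (n == 0 || num_bins == 0)
        return result;

    unsigned int *d_data = nullptr;
    unsigned int *d_bins = nullptr;
    cudaMalloc(&d_data, n * sizeof(unsigned int));
    cudaMalloc(&d_bins, num_bins * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, num_bins * sizeof(unsigned int)); // fine here: host call on global memory

    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    // Third launch argument is the dynamic shared memory size in bytes.
    histogram_kernel<<<blocks, threads, num_bins * sizeof(unsigned int)>>>(
        d_data, n, d_bins, num_bins);

    cudaMemcpy(result.data(), d_bins, num_bins * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    return result;
}
```

The second __syncthreads() is specific to this sketch: thread 0 must not read the shared counts back until every thread in the block has finished its atomicAdd.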
Original concept source