Yongjia Zhang authored
local memory from the opencv source code. It uses local shared memory to reduce global memory access, which cuts the kernel's processing time by 70 percent compared to the original one. Because of the barriers and the local work-size limitation, processing with a radius larger than 110 becomes slower than the original algorithm, so I keep the original kernels to handle box blur with a radius larger than 110. All tests were run on Intel Beignet and on Intel IvyBridge CPU and GPU.

v2: add a kernel attribute to restrict the local size to (256,1,1).

Signed-off-by: Yongjia Zhang <yongjia.zhang@intel.com>
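For readers unfamiliar with the technique, the following is a rough CPU-side sketch in plain C of the idea described above. It is not the committed kernel. The function names, the one-dimensional row layout, and the edge-replication border policy are all assumptions made for illustration; only the group size of 256 and the radius threshold of 110 come from the message. The small `tile` array stands in for local memory, and the outer loop over `base` stands in for one work-group per 256 outputs.

```c
#include <stddef.h>

#define LOCAL_SIZE 256        /* matches the (256,1,1) local size in the message */
#define MAX_TILED_RADIUS 110  /* radius threshold quoted in the message */

/* Clamp an index to [0, n-1]; edge replication is an assumed border policy. */
static size_t clamp_index(long i, size_t n) {
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* Original-style path: every output reads 2*radius+1 values straight from
 * the source row (the stand-in for global memory). */
static void box_blur_row_direct(const float *src, float *dst, size_t n, int radius) {
    for (size_t x = 0; x < n; ++x) {
        float sum = 0.0f;
        for (int k = -radius; k <= radius; ++k)
            sum += src[clamp_index((long)x + k, n)];
        dst[x] = sum / (float)(2 * radius + 1);
    }
}

/* Tiled path: each group of LOCAL_SIZE outputs first copies its inputs plus
 * a halo of `radius` on each side into `tile`, then sums only from `tile`.
 * Requires radius <= MAX_TILED_RADIUS so the tile fits. */
static void box_blur_row_tiled(const float *src, float *dst, size_t n, int radius) {
    float tile[LOCAL_SIZE + 2 * MAX_TILED_RADIUS];
    size_t tile_len = LOCAL_SIZE + 2 * (size_t)radius;

    for (size_t base = 0; base < n; base += LOCAL_SIZE) {
        /* tile[t] holds the source value at position base + t - radius. */
        for (size_t t = 0; t < tile_len; ++t)
            tile[t] = src[clamp_index((long)base + (long)t - radius, n)];

        /* A GPU kernel would need a work-group barrier at this point. */

        for (size_t l = 0; l < LOCAL_SIZE && base + l < n; ++l) {
            float sum = 0.0f;
            for (int k = 0; k <= 2 * radius; ++k)
                sum += tile[l + (size_t)k];
            dst[base + l] = sum / (float)(2 * radius + 1);
        }
    }
}

/* Dispatch: fall back to the direct path for large radii. */
void box_blur_row(const float *src, float *dst, size_t n, int radius) {
    if (radius > MAX_TILED_RADIUS)
        box_blur_row_direct(src, dst, n, radius);
    else
        box_blur_row_tiled(src, dst, n, radius);
}
```

In this sketch each source value is fetched into the tile once per group instead of once per output that needs it, which is the reduction in global memory traffic the message refers to. The halo grows with the radius while the group size stays fixed, which is one plausible reading of why the tiled path stops paying off at large radii.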
e7e640f3