Convolution layers are the core of Convolutional Neural Networks (CNNs), a class of deep neural networks that provide state-of-the-art results in pattern recognition tasks across various domains, such as image recognition, speech recognition, and natural language processing. However, CNNs are very time-consuming to train due to the computationally intensive nature of the convolution operation. As a result, numerous optimized implementations of convolution have emerged, based on matrix-matrix multiplication, FFTs, the Winograd algorithm, and direct convolution. Interestingly, most of these implementations target GPU architectures, mainly due to (1) the higher theoretical peak FLOPS of GPUs and (2) the popularity of CUDA-based deep learning libraries. However, little investigation has been done into the performance of convolution on high-end CPUs such as Intel's second-generation Xeon Phi, codenamed Knights Landing (KNL), which has a theoretical peak performance of 6 TFLOPS. Here, we shed light on the performance achievable for convolution on these high-end CPUs. In this work, we optimize direct convolution for Intel Xeon Phi systems with AVX-512 support. Our strategy combines a dynamic compilation (JIT) approach with standard compiler optimizations and software prefetching. We show that our JIT-based approach to direct convolution achieves close to peak performance on KNL in many cases. We also analyze the performance of the convolution layers of several state-of-the-art CNNs, identifying what contributes to performance gains and what limits our approach.
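To make the computation concrete, the sketch below shows the naive direct-convolution loop nest that an optimized kernel would start from; the NCHW layout, stride-1/no-padding assumptions, and all names are illustrative choices for this sketch, not the paper's actual kernel.

```c
#include <stdio.h>

/* Naive direct convolution (NCHW layout, stride 1, no padding).
 * C input channels, K output channels, H x W input, R x S filter,
 * producing a (H-R+1) x (W-S+1) output per output channel. */
static void direct_conv(const float *in, const float *wt, float *out,
                        int C, int K, int H, int W, int R, int S)
{
    int P = H - R + 1;          /* output height */
    int Q = W - S + 1;          /* output width  */
    for (int k = 0; k < K; k++)
        for (int p = 0; p < P; p++)
            for (int q = 0; q < Q; q++) {
                float acc = 0.0f;
                /* Reduce over input channels and the filter window. */
                for (int c = 0; c < C; c++)
                    for (int r = 0; r < R; r++)
                        for (int s = 0; s < S; s++)
                            acc += in[(c * H + p + r) * W + q + s]
                                 * wt[((k * C + c) * R + r) * S + s];
                out[(k * P + p) * Q + q] = acc;
            }
}

int main(void)
{
    enum { C = 3, K = 2, H = 5, W = 5, R = 3, S = 3 };
    float in[C * H * W], wt[K * C * R * S], out[K * (H - R + 1) * (W - S + 1)];
    for (int i = 0; i < C * H * W; i++)     in[i] = (float)(i % 7);
    for (int i = 0; i < K * C * R * S; i++) wt[i] = 0.1f;
    direct_conv(in, wt, out, C, K, H, W, R, S);
    printf("out[0] = %f\n", out[0]);
    return 0;
}
```

An optimized kernel of the kind described above would vectorize such a loop nest with AVX-512 (16 single-precision lanes) and JIT-specialize the loop bounds and strides to each layer's dimensions; this sketch only fixes the semantics of the operation.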