Publication Date

2017

Document Type

Dissertation

Committee Members

Travis Doom (Committee Member), Jack Jean (Committee Member), Meilin Liu (Advisor), Jun Wang (Committee Member)

Degree Name

Doctor of Philosophy (PhD)

Abstract

General purpose GPU (GPGPU) is an effective many-core architecture that can yield high throughput for many scientific applications with thread-level parallelism. However, several challenges still limit further performance improvements and make GPU programming difficult for programmers who lack knowledge of the GPU hardware architecture. In this dissertation, we describe an Optimization Compiler Framework Based on Polyhedron Model for GPGPUs to bridge the speed gap between the GPU cores and the off-chip memory and improve the overall performance of GPU systems. The optimization compiler framework includes a detailed data reuse analyzer based on the extended polyhedron model for GPU kernels, a compiler-assisted programmable warp scheduler, a compiler-assisted cooperative thread array (CTA) mapping scheme, a compiler-assisted software-managed cache optimization framework, and a compiler-assisted synchronization optimization framework. The extended polyhedron model is used to detect intra-warp data dependencies and cross-warp data dependencies, and to perform data reuse analysis. The compiler-assisted programmable warp scheduler for GPGPUs exploits inter-warp and intra-warp data locality simultaneously. The compiler-assisted CTA mapping scheme is designed to further improve the performance of the programmable warp scheduler by taking inter-thread-block data reuse into consideration. The compiler-assisted software-managed cache optimization framework is designed to make better use of the shared memory of GPU systems and bridge the speed gap between the GPU cores and global off-chip memory. The synchronization optimization framework automatically inserts synchronization statements into GPU kernels at compile time, while simultaneously minimizing the number of inserted synchronization statements. Experiments are designed and conducted to validate our optimization compiler framework.
Experimental results show that our optimization compiler framework can automatically optimize GPU kernel programs and correspondingly improve GPU system performance. Our compiler-assisted programmable warp scheduler improves the performance of the input benchmark programs by 85.1% on average. Our compiler-assisted CTA mapping algorithm improves the performance of the input benchmark programs by 23.3% on average. The compiler-assisted software-managed cache optimization framework improves the performance of the input benchmark applications by 2.01x on average. Finally, the synchronization optimization framework correctly inserts synchronization statements into GPU programs automatically. In addition, with our synchronization optimization framework, the number of synchronization statements in the optimized GPU kernels is reduced by 32.5%, and the number of synchronization statements executed is reduced by 28.2% on average.
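The shared-memory (software-managed cache) optimization and the synchronization insertion summarized above can be illustrated with a minimal hand-written CUDA sketch. This is an illustrative example, not code generated by the dissertation's framework: tiles of the inputs are staged in shared memory to exploit data reuse, and `__syncthreads()` barriers are placed only at the two points where threads depend on tile data written by other warps.

```cuda
// Illustrative sketch (not the framework's generated output): a tiled
// matrix multiply C = A * B that uses shared memory as a software-managed
// cache. Assumes n is a multiple of TILE for brevity.
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // staged tile of A
    __shared__ float Bs[TILE][TILE];   // staged tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread stages one element of the A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        // Barrier needed: the inner loop reads tile entries written by
        // threads in other warps of the same thread block.
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];

        // Barrier needed: the next iteration overwrites the shared tiles.
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

A synchronization optimizer of the kind described would keep exactly these two barriers, whereas a conservative placement (e.g., a barrier after every shared-memory access) executes many redundant synchronizations.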

Page Count

201

Department or Program

Department of Computer Science and Engineering

Year Degree Awarded

2017

Creative Commons License

Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

