Understanding the Challenges with Using Hardware Pre-Fetchers for CPU-Based Matrix Multiply Units