Recent research advocates large die-stacked DRAM caches in manycore servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates, and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches.

We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and a reduction in tag overheads, while predicting and fetching only the useful blocks within each page to minimize off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.
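To make the tag-collocation idea concrete, here is a minimal sketch of a direct-mapped cache in which each set stores the tag next to its data, so a single lookup returns both and the tag comparison happens afterwards, much as Alloy Cache (and Unison Cache's in-DRAM tags) retrieve tag and data in one DRAM access. The class and its mapping are hypothetical simplifications for illustration, not the papers' exact organizations.

```python
# Hypothetical sketch: tag collocated with data, so one access to a set
# ("one DRAM access") yields both, and the hit check needs no separate
# tag-array lookup. Direct-mapped for simplicity.
class TagDataCache:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.sets = [None] * num_sets  # each entry: (tag, data) or None

    def access(self, addr):
        s = addr % self.num_sets        # set index
        tag = addr // self.num_sets     # remaining address bits form the tag
        entry = self.sets[s]            # single access returns tag AND data
        if entry is not None and entry[0] == tag:
            return entry[1]             # hit: data is already in hand
        return None                     # miss: would go off-chip

    def fill(self, addr, data):
        s = addr % self.num_sets
        self.sets[s] = (addr // self.num_sets, data)
```

For example, after `fill(100, "payload")` on a 16-set instance, `access(100)` returns the data in one lookup, while `access(116)` maps to the same set but mismatches on the tag and misses.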
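The footprint idea described above can likewise be sketched as a predictor that records which 64B blocks of a 4KB page were actually touched during a page's residency, and replays that bit vector the next time a similar page is allocated, so only the predicted-useful blocks are fetched. The key type, defaults, and structure here are illustrative assumptions, not the exact Footprint Cache or Unison Cache predictor design.

```python
# Hypothetical sketch of footprint prediction: a 4KB page holds 64
# blocks of 64B, tracked as a 64-bit vector of "touched" blocks.
BLOCK_SIZE = 64
PAGE_SIZE = 4096
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE  # 64

class FootprintPredictor:
    def __init__(self):
        self.history = {}  # context key -> bit vector of touched blocks

    def record(self, key, touched_blocks):
        # Called when a page is evicted: remember which blocks were used.
        vec = 0
        for b in touched_blocks:
            vec |= 1 << b
        self.history[key] = vec

    def predict(self, key):
        # Called on a miss that allocates a page: returns the predicted
        # footprint; an empty vector means no history for this context.
        return self.history.get(key, 0)

def blocks_to_fetch(vec):
    # Expand the bit vector into the list of block indices to fetch.
    return [b for b in range(BLOCKS_PER_PAGE) if (vec >> b) & 1]
```

For instance, recording that context `"pc42"` touched blocks 0, 1, and 5 makes a later `predict("pc42")` request exactly those three blocks instead of the whole 4KB page, which is how a page-based cache can keep off-chip traffic close to a block-based one.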