6. Thread-level Parallelism
6.1 Cache coherence, memory consistency
6.2 Synchronization
6.3 Multithreaded processors
6.4 Chip Multiprocessor
Also called:
  • SMPs on a single chip
  • Multicore processors
 
Motivation
Moore's Law
    • Transistor count doubles every 18 months
    • graphic
    How to use additional transistors?
    Better execution cores
    • Enhance pipelining
    • Extend superscalar execution
    • Better vector processing (SIMD)
    Larger caches
    • improve average memory access time
    More execution cores; multi- and manycore
    How to speed up processors?
    Higher clock rate
    • increases power consumption
      • proportional to f · U² (see the sketch below)
      • higher frequency needs higher voltage
      • Smaller structures: Energy loss by leakage
    • increases heat output and cooling requirements
    • limits chip size (signal propagation, speed of light)
    • at fixed technology (e.g. 20 nm)
      • Smaller number of transistor levels per pipeline stage possible
      • More, simplified pipeline stages (P4: >30 stages)
      • Higher penalty of pipeline stalls (on conflicts, e.g. branch misprediction)
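    Why higher clock rates cost so much power, as a back-of-the-envelope sketch using the standard CMOS dynamic-power model (the cubic scaling is an approximation, not a figure from these notes):
      P_dyn ≈ α · C · U² · f        (α: activity factor, C: switched capacitance)
      U must rise roughly with f, so P_dyn ∝ f · U² ∝ f³
      Two cores at half the clock (and correspondingly lower voltage) can deliver similar throughput at a fraction of the power, which is the basic argument for multicore.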
    More parallelism
    data parallelism
    Instruction level parallelism
    • exploits parallelism found in an instruction stream
    • limited by data/control dependencies
    • can be increased by speculation
    • average ILP in typical programs: 6-7
    • modern superscalar processors cannot get much better…
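    A small C sketch of how data dependencies limit ILP (the loops and names are illustrative, not from these notes): the first loop forms a serial dependence chain, the second exposes independent additions that a superscalar core can issue in parallel.

      /* ILP illustration: a loop-carried dependence serializes the adds,
         while independent accumulators expose parallelism to the core. */
      #include <stddef.h>

      /* Each add depends on the previous one: ILP ≈ 1. */
      double chained_sum(const double *a, size_t n) {
          double s = 0.0;
          for (size_t i = 0; i < n; i++)
              s += a[i];                    /* waits for the previous s */
          return s;
      }

      /* Four independent accumulators break the chain: ILP ≈ 4
         (result may differ slightly due to FP reassociation). */
      double unchained_sum(const double *a, size_t n) {
          double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
          size_t i;
          for (i = 0; i + 4 <= n; i += 4) {
              s0 += a[i];                   /* these four adds are independent */
              s1 += a[i + 1];
              s2 += a[i + 2];
              s3 += a[i + 3];
          }
          for (; i < n; i++)
              s0 += a[i];
          return (s0 + s1) + (s2 + s3);
      }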
     
    Thread level parallelism
    • Hardware multithreading (e.g. SMT: Hyper-Threading)
    • better exploitation of superscalar execution units
    • Multiple cores
      • Legacy software must be parallelized
      • Challenge for whole software industry
      • Intel invested in the tools business
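    A minimal POSIX-threads sketch of thread-level parallelism (the work splitting, sizes and names are my own choices, not from these notes):

      /* TLP sketch: sum an array with NTHREADS OS threads.
         Build with: cc -O2 -pthread tlp_sum.c */
      #include <pthread.h>
      #include <stdio.h>

      #define N        (1 << 20)
      #define NTHREADS 4                       /* N is divisible by NTHREADS */

      static double data[N];

      struct part { int first, last; double sum; };

      static void *partial_sum(void *arg) {
          struct part *p = arg;
          p->sum = 0.0;
          for (int i = p->first; i < p->last; i++)
              p->sum += data[i];               /* each thread owns its slice */
          return NULL;
      }

      int main(void) {
          pthread_t tid[NTHREADS];
          struct part parts[NTHREADS];
          for (int i = 0; i < N; i++) data[i] = 1.0;

          for (int t = 0; t < NTHREADS; t++) {
              parts[t].first = t * (N / NTHREADS);
              parts[t].last  = (t + 1) * (N / NTHREADS);
              pthread_create(&tid[t], NULL, partial_sum, &parts[t]);
          }
          double total = 0.0;
          for (int t = 0; t < NTHREADS; t++) {
              pthread_join(tid[t], NULL);      /* combine after each thread finishes */
              total += parts[t].sum;
          }
          printf("sum = %.0f\n", total);
          return 0;
      }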
     
    Advantages and Disadvantages
    Advantages of chip multiprocessors
    • Efficient exploitation of available transistor budget
    • Improves throughput and speed of parallelized applications
    • Allows tight coupling of cores
      • better communication between cores than in SMP
      • shared caches
    • Low power consumption
      • low clock rates
      • idle cores can be suspended
    Disadvantages
    • Only improves speed of parallelized applications
    • Increased gap to memory speed
    Design decisions
    graphic
    Homogeneous vs. heterogeneous
    specialized accelerator cores
    • SIMD
    • GPU operations
    • cryptography
    • DSP functions (e.g. FFT)
    • FPGA (programmable circuits)
    • Homogeneous
      • ARM A9, A15
      • SUN Ultrasparc T2 (Niagara 2)
      • Intel Core Duo
      • AMD Bulldozer
    • Heterogeneous
      • Intel Westmere and Sandy Bridge with GPU
      • AMD Fusion with GPU
      • Nvidia Tegra 4 with Cortex A9 MPcore and low power companion core
    Shared vs. private last level cache
    On-chip network: broadcast vs non-broadcast
    Shared vs. private last level cache
    Shared LLC
    Shared L2 and private L1
    graphic
    • A snooping-based cache coherence protocol via shared bus.
    graphic
    • Scalable network with directory-based cache coherence
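    A rough sketch of a directory entry as used by directory-based coherence (the field layout and state names are illustrative, not a specific protocol from these notes):

      /* Illustrative directory entry: one per memory block, kept at the
         home node / shared LLC.  Not a specific real protocol.          */
      #include <stdint.h>

      enum dir_state {
          DIR_UNCACHED,     /* no core holds the block                 */
          DIR_SHARED,       /* one or more cores hold read-only copies */
          DIR_MODIFIED      /* exactly one core holds a dirty copy     */
      };

      struct dir_entry {
          enum dir_state state;
          uint32_t sharers;   /* bit vector: bit i set => core i has a copy */
          uint8_t  owner;     /* meaningful only in DIR_MODIFIED            */
      };

      /* Read miss from core c, handled at the home node:
         DIR_UNCACHED -> fetch from memory, set bit c, go to DIR_SHARED
         DIR_SHARED   -> forward data, set bit c
         DIR_MODIFIED -> have 'owner' write back, then supply data to c   */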
    Private LLC
    graphic
    • Lower effective cache capacity due to replication of blocks
    • Higher latency when accessing shared data
    • Private L2 cache is looked up before sending requests
    • Protocol overhead is larger due to inspection of all replicated cache tags
    • Several messages on the critical path to retrieve up-to-date copy from another cache.
    Advantages/Disadvantages
    Shared Cache: Advantages
    • No coherence protocol at shared cache level
    • Less latency of communication
    • Coherence protocol overhead is smaller (no inspection of replicated tags as with private caches)
    • Processors with overlapping working set
    • One processor may prefetch data for the other
    • Smaller cache size needed
    • Better usage of loaded cache lines before eviction (spatial locality)
    • Less congestion on limited memory connection
    • Dynamic sharing of cache space
      • if one processor needs less space, the other can use more
    • Avoidance of false sharing (see the sketch after this list)
    Disadvantages:
    • Multiple CPUs impose higher requirements
      • higher bandwidth
    • Shared cache is larger than the individual private caches. This induces higher latency.
    • Design more complex
    • One CPU can evict data of other CPU
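    The false-sharing sketch referenced above: with private caches, two counters that happen to share one cache line ping-pong between cores; padding each counter to its own line avoids this. The 64-byte line size and the padding scheme are assumptions, not from these notes.

      /* False sharing sketch. Build with: cc -O2 -pthread false_sharing.c */
      #include <pthread.h>
      #include <stdio.h>

      #define LINE 64                          /* assumed cache line size */

      struct padded_counter {
          volatile long value;
          char pad[LINE - sizeof(long)];       /* one counter per cache line */
      };

      static _Alignas(LINE) struct padded_counter counters[2];

      static void *worker(void *arg) {
          int id = *(int *)arg;
          for (long i = 0; i < 100000000L; i++)
              counters[id].value++;            /* no line shared with the other thread */
          return NULL;
      }

      int main(void) {
          pthread_t t[2];
          int ids[2] = { 0, 1 };
          for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &ids[i]);
          for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
          printf("%ld %ld\n", counters[0].value, counters[1].value);
          return 0;
      }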
     
    Implementation of shared caches
    Centralized vs distributed (tiled)
    • Centralized: designed as a central cache
    • Distributed: designed as a tiled cache
    Centralized shared cache
    • Centralized LLC controller
    graphic
    • L2 is typically organized in banks and the L2 cache controller functionality is replicated across the banks.
    graphic
    • Although the cache is banked and the controller is replicated across the banks to avoid going through a single central controller, the design is called centralized because the LLC occupies a contiguous area on the chip.
    • The interconnect from the cache banks to the on-chip memory controller is simplified by a centralized design.
    Distributed (tiled) shared cache
    graphic
    • Tiled architecture gives more flexibility to manufacturers.
    • In case of many cores, this design is favorable for thermal and power density reasons.
     
    Uniform vs non-uniform Cache Architecture
    • Same access latency independent of the bank vs different access latencies.
    Uniform Cache Architecture (UCA)
    • All accesses take the same time (are equally slow).
    • A UCA banked cache design often adopts an H-tree topology for the interconnect fabric connecting the banks to the cache controller.
    graphic
    • Relatively simple network where the requests are pipelined from the cache controller to the banks.
    Non-Uniform Cache Architecture (NUCA)
    • Either centralized with multiple banks and a replicated controller, or a distributed (tiled) cache architecture
    graphic
    • Requires complex network with routing and flow-control.
    • This complexity is required to provide low-latency access to a fraction of cached data.
    • In any case, future multicore processors will require complex on-chip networks to handle more or less arbitrary messaging between the numerous cores and cache banks.
    Static-NUCA or S-NUCA: mapping of blocks to banks is unique
    • Simple policy distributing entire cache sets across banks.
    • Thus, the bank that potentially holds the requested address can be computed from the address bits.
    • No search mechanism required.
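    A sketch of the S-NUCA lookup in C (line size, bank count and bit positions are example parameters I chose, not those of a real design):

      /* S-NUCA mapping sketch: the bank holding an address is a fixed
         function of the address bits, so no search is required.
         Example parameters (assumed): 64-byte lines, 16 banks.        */
      #include <stdint.h>

      #define LINE_BITS 6     /* 64-byte line  -> bits [5:0] are the block offset */
      #define BANK_BITS 4     /* 16 banks      -> bits [9:6] select the bank      */

      static inline unsigned snuca_bank(uint64_t paddr) {
          return (unsigned)((paddr >> LINE_BITS) & ((1u << BANK_BITS) - 1));
      }

      /* The bits above the bank field form set index and tag inside the
         selected bank, exactly as in a conventional cache.             */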
    Dynamic-NUCA or D-NUCA: blocks can be in multiple banks.
    • Sets and ways are distributed across banks.
    • Blocks need to be searched in a number of banks.
    • Frequently accessed blocks are moved nearer to the cache controller. These banks are searched first.
    • Implementation is much more complex than S-NUCA.
    Examples
    SUN Ultra Sparc T1 (Niagara)
    • UltraSparc T1 (Niagara):
    • 8 cores
    • 4-way multithreaded per core
    • one FPU for all cores
    • low power
    • UltraSparc T2 (Niagara 2)
    • graphic
    Intel Itanium 2 Dual core - Montecito
    • Two Itanium 2 cores
    • Multi-threading (2 Threads)
    • Simultaneous multi-threading for memory hierarchy resources
    • Temporal multi-threading for core resources
    • Besides end of time slice, an event, typically an L3 cache miss, might lead to a thread switch.
    • Caches
    • L1D 16 KB, L1I 16 KB
    • L2D 256 KB, L2I 1 MB
    • L3 9 MB
    • Caches private to cores
    • 1.7 billion transistors
    • graphic
    IBM Cell
    • IBM, Sony, Toshiba
    • Playstation 3 (Q1 2006)
    • 256 GFlops
    • only ~30 W at 3 GHz
    • entire PS3 only $300-400
    graphic
    • 9 parallel cores
    • 1 large PPE + 8 SPEs (Synergistic Processing Elements)
      • Specialized for different tasks
    • graphic
    • graphic
    • Cell: SPE Synergistic Processing Element
    • 128 registers, 128 bit wide
    • SIMD
    • Single Thread
    • 256 KB local memory (not a cache)
    • DMA executes memory transfers (see the sketch after this list)
    • Simple ISA
    • Less functionality to save space
    • Limitations can become a problem if memory access is too slow.
    • graphic
    • 25.6 GFlops single precision for multiply-add operations
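    A conceptual double-buffering sketch of how an SPE-style program stages data through its local store. dma_get(), dma_wait() and compute() are hypothetical placeholders, not the real Cell MFC API; here they are stubbed so the sketch compiles.

      /* Conceptual sketch: the 256 KB local store is not a cache, so the
         program itself overlaps DMA transfers with computation.          */
      #include <string.h>

      #define CHUNK 4096

      /* Stubs standing in for the (asynchronous) real DMA commands. */
      static void dma_get(void *local, const void *main_mem, unsigned bytes, int tag) {
          (void)tag;
          memcpy(local, main_mem, bytes);
      }
      static void dma_wait(int tag) { (void)tag; }      /* wait for tagged DMA */

      static void compute(float *local, int n) {        /* placeholder SIMD kernel */
          for (int i = 0; i < n; i++) local[i] *= 2.0f;
      }

      static float buf[2][CHUNK];              /* two buffers in the local store */

      void process(const float *main_mem, int nchunks) {
          int cur = 0;
          dma_get(buf[cur], &main_mem[0], sizeof buf[cur], cur);
          for (int i = 0; i < nchunks; i++) {
              int next = cur ^ 1;
              if (i + 1 < nchunks)             /* start fetching the next chunk ...  */
                  dma_get(buf[next], &main_mem[(i + 1) * CHUNK],
                          sizeof buf[next], next);
              dma_wait(cur);                   /* ... while waiting only for this one */
              compute(buf[cur], CHUNK);        /* work on data already in local store */
              cur = next;
          }
      }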
    Intel Westmere EX
    • Processor of the fat node of SuperMUC @ LRZ
    • 2.4 GHz
    • 9.6 Gflop/s per core
    • 96 Gflop/s per socket
    • 10 hyperthreaded cores, i.e. two logical cores each
    • Caches
    • 32 KB L1 private
    • 256 KB L2 private
    • 30 MB L3 shared
    • 2.9 billion transistors
    • Xeon E7-4870 (2.4 GHz, 10 cores, 30 MB L3)
    • graphic
    • graphic
    • On-chip NUMA
    • L3 Cache organized in 10 slices
    • Interconnection via a bidirectional ring bus
    • 10-way physical address hashing to avoid hot spots; handles five parallel cache requests per clock cycle
    • Mapping algorithm is not known, no migration support
    • Coherence based on Core Valid Bits in each slice
    • Details see presentation at Hot Chips conference
    • Off-chip NUMA
    • Glueless combination of up to 8 sockets into an SMP
    • 4 Quick Path Interconnect (QPI) interfaces
    graphic
    graphic
    graphic
    • 2 on-chip memory controllers
    Uncore Westmere EX and Sandy Bridge
    graphic
    • Cbox
    • Connects core to ring bus and one LLC cache bank
    • Responsible for processor read/write/writeback and external snoops, and returning cached data to core and QuickPath agents.
    • Distribution of physical addresses is determined by hash function
    • Sbox
    • Caching Agent
    • Each associated with 5 Cboxes
    • Bbox
    • Home agent
    • Responsible for cache coherence of the cache lines in its memory. Keeps track of the Cbox replies to coherence messages.
    • Directory Assisted Snoopy (DAS)
    • Access to data across multiple sockets.
    • Keeps states per cache line (I – Idle or no remote sharers, R – may be present on remote socket, E/D owned by IO Hub)
    • If line is in I state it can be forwarded without waiting for snoop replies.
     
    • graphic
    6.5 Multisocket Multiprocessor
    6.6 Large-scale NUMA systems
    SGI Altix 4700
    SGI UltraViolet
    SGI UV 2000
    • Based on Xeon E5-4600
    • up to 256 sockets (2048 cores, 4096 threads) in a 4-rack system
    • Maximum of 64 TB memory
    • NUMAlink 6 Interconnect with 6.7 GB/s bidirectional
    • Blades
      • 2 Intel Xeon E5 processors
      • 1 Intel Xeon E5 + 1 accelerator card
        • Intel Xeon Phi
        • Nvidia GPU
    • Large multi-partition UV 2000 systems
    • NUMAlink6 support for up to 16,384 socket system
    • Support for shared memory up to 8 petabytes
     
    6.7 Resources
    Heise article on Westmere EX