From: Raphael 'kena' Poss
Date: 12 January 2013 13:24:33 GMT+01:00
To:
Subject: Re: Practice exam

Hello everyone,

here is a worked solution for the practice exam of 12 November. The words/sentences *outside* brackets [...] form a basic, "sufficient" answer; the words/sentences *inside* brackets earn extra points when the basic answer is already present.

1.a) A pipeline enables executing independent sub-tasks in parallel and thus can increase the processor's performance measured in instructions per second [or computational throughput]. This is possible because once an operation is decomposed into sub-tasks, each sub-task [has a smaller critical path: the corresponding circuit] becomes smaller, and it becomes possible to run them at a higher frequency.

1.b) When an operation can be decomposed into [concurrent sub-tasks, e.g.] processes or threads, these can be distributed to multiple cores to perform the work in parallel. [However, parallelism alone does not increase performance/watt, because the extra processors needed for the extra performance must be powered too. Fortunately,] power consumption in processors grows linearly with frequency and quadratically with voltage (dynamic power ≈ C·V²·f). [Also, the maximum frequency of a circuit places a lower bound on the voltage.] Therefore, if we lower the frequency as we increase the number of cores, keeping the total performance constant, we can also lower the voltage, and the power consumption decreases more than linearly. So the performance/power ratio increases. (A numeric sketch follows after 1.d below.)

1.c) i. Bandwidth is the maximum capacity of a network link or processor, in operations/second or bits/second; whereas throughput is a measurement of the actual activity of these components (same unit). For example, a home DSL link can have a bandwidth of 30 Mbit/s, but browsing a Facebook page on that link only involves a throughput of 100 kbit/s.

ii. A cache replacement policy is involved during a cache miss in set-associative caches; it determines which line (or block) to replace within a set. For example, least recently used (LRU) and random replacement (RR) are replacement policies. The write policy determines how to handle store requests. For example, write-through [(forward the request to memory even if there is a hit)] and write-back [(only send the store to memory upon a later conflict)] are example write policies.

iii. Static RAM [uses only transistors (or gates) in silicon to hold the data. The circuit] is stable: as long as it is powered, reading the data does not affect the cell's value. In contrast, dynamic RAM [uses a transistor and a capacitor per cell, and it decays: when the data is read the cell's value is lost, and it also slowly degrades over time. Because of this, DRAM] must be refreshed: the data of each cell must be read and written back at regular intervals.

iv. Multithreading is the multiplexing of multiple threads of execution onto a single processor pipeline. Hardware multithreading changes the hardware architecture of the pipeline to hold multiple program counters (PCs) side by side, together with a scheduler in the fetch stage to select which PC to use next. Software multithreading works with a single hardware PC and relies on an operating system in software to replace the PC at regular intervals using a timer interrupt.

1.d) "Write-through" means that all store requests from the processor, including those that hit lines already in the cache, cause a store request from the cache to memory. So 90% of the 234 load requests hit in the cache and do not go to memory (234 - 10%*234 ≈ 211 requests); only the 10% that miss do, i.e. about 23 requests. To these we add 100% of the 115 stores, so the cache sends 23 + 115 = 138 requests to memory.
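The same arithmetic as a small C program (my own sketch, not part of the exam answer; the numbers are those from the question above):

  /* 1.d: requests from a write-through cache to memory. */
  #include <stdio.h>

  int main(void) {
      double loads = 234, stores = 115, hit_rate = 0.90;
      double load_misses = loads * (1.0 - hit_rate); /* only misses go to memory */
      double mem_requests = load_misses + stores;    /* write-through: all stores go */
      printf("%.1f requests to memory\n", mem_requests); /* 23.4 + 115 = 138.4 */
      return 0;
  }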
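Going back to 1.b: here is a numeric sketch of the power argument (my own illustration, not part of the exam answer). It assumes the first-order dynamic power model P ≈ C·V²·f and, additionally, that the voltage can scale down proportionally with the frequency:

  /* 1.b: first-order power model, P ~ n * V^2 * f (C normalized to 1). */
  #include <stdio.h>

  /* relative power of n cores at frequency f (GHz) and voltage v (V) */
  static double power(int n, double f, double v) {
      return n * v * v * f;
  }

  int main(void) {
      /* 1 core at 2 GHz/1.2 V vs. 2 cores at 1 GHz/0.6 V: the aggregate
         instruction rate is the same, but the power is 4x lower. */
      printf("1 core : %.2f\n", power(1, 2.0, 1.2)); /* 2.88 */
      printf("2 cores: %.2f\n", power(2, 1.0, 0.6)); /* 0.72 */
      return 0;
  }

Halving the voltage here is optimistic, since real circuits have a voltage floor, but the direction of the argument is the same.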
2.a) i. A data dependency, also called "true dependency" or Read-After-Write (RAW), occurs when the result of an operation is needed as input of a subsequent operation. For example:

  mul r1 <- r2, r3
  add r0 <- r1, r1

There is a RAW dependency between the two instructions[, because "add" uses as input the output of "mul"].

ii. An output dependency, also called Write-After-Write (WAW), occurs when two instructions write to the same output register, and the order of the writes must be preserved. For example:

  mul r1 <- r2, r3
  mul r1 <- r4, r5
  add r0 <- r1, r1

There is a WAW dependency between the two "mul" instructions [: they use separate inputs and thus could run in parallel; however, it is important that the value visible to the last "add" is the one set by the 2nd "mul"].

iii. An anti-dependency, also called Write-After-Read (WAR), occurs when an instruction uses as input a register which is written to by a subsequent instruction. For example:

  add r0 <- r1, r1
  mul r1 <- r2, r3

There is a WAR dependency between "add" and "mul" [: the "mul" cannot be allowed to complete before "add" has read its input operands].

2.b) The purpose of a branch predictor is to reduce the number of control hazards due to branches. To do this, it decides how to modify the PC used in the fetch stage of the pipeline at every cycle, and it can speculatively replace the PC when it predicts that the branch instruction that entered the pipeline at the previous cycle will be taken. [To predict the branch it first needs to recognize branch instructions; for this the pipeline either uses a combined fetch/decode stage (e.g. in MIPS), or, like the Athlon, a table (BHT) acting as a cache of previously encountered branch instructions. The prediction itself can then be made using either a simple state machine or a cache of branch targets (BTB).] (A sketch of such a state machine follows after 3.a below.)

2.c) Memory-mapped I/O means that programs can communicate with I/O devices by executing regular memory load and store instructions. The address used in the instruction determines whether the operation goes to RAM or to the I/O device. "Uncached" means that the operation is routed directly to the I/O system at the memory stage and bypasses the memory caches. (A sketch follows after 3.a below.)

2.d) CISC instructions typically have a variable length (1 to 6 bytes or more) and thus require a variable number of cycles to be fetched and translated to the RISC-like microcode used by the pipeline. This incurs extra complexity at the start of the pipeline, and causes unused cycles with an in-order pipeline. To reduce this impact, one can use:
- a cache of translated instructions [also called a trace cache], e.g. the one found in the Intel Pentium 4, or the translation cache of the Transmeta Crusoe;
- out-of-order execution, as found in most superscalar processors, e.g. the Pentium Pro and its successors.

3.a) A D-RAM memory is organized as an array of cells. The D-RAM controller can only read or write an entire row of the array at a time. So each operation is decomposed as follows: first the desired row is selected using the higher bits of the memory address, and its contents are copied to a row buffer. Then the desired column in the row buffer is selected using the lower bits of the memory address, and accessed according to the operation (either read or write). Finally the contents of the row buffer are written back to the D-RAM array. "RAS" is the acronym for "Row Address Strobe", and designates the selection of a row to place in the row buffer; "CAS", or "Column Address Strobe", designates the selection of a column within the row buffer.
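To illustrate the row/column split of 3.a (my own sketch, not part of the exam answer; the geometry of 1024 columns per row is made up):

  /* 3.a: splitting a memory address into DRAM row/column indices. */
  #include <stdint.h>
  #include <stdio.h>

  #define COL_BITS 10  /* hypothetical: 2^10 = 1024 columns per row */

  int main(void) {
      uint32_t addr = 0x12345678;
      uint32_t col = addr & ((1u << COL_BITS) - 1); /* lower bits -> CAS */
      uint32_t row = addr >> COL_BITS;              /* higher bits -> RAS */
      printf("address 0x%x: row %u, column %u\n",
             (unsigned)addr, (unsigned)row, (unsigned)col);
      return 0;
  }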
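Going back to 2.b: the "simple state machine" used for prediction is typically a 2-bit saturating counter per branch. A minimal sketch (my own, not from the exam):

  /* 2.b: a 2-bit saturating counter; states 0-1 predict not-taken,
     states 2-3 predict taken. */
  #include <stdbool.h>
  #include <stdio.h>

  static unsigned counter = 2;  /* start at "weakly taken" */

  static bool predict(void) { return counter >= 2; }

  static void update(bool taken) {  /* called once the outcome is known */
      if (taken && counter < 3) counter++;
      if (!taken && counter > 0) counter--;
  }

  int main(void) {
      bool outcomes[] = { true, true, false, true };  /* e.g. a loop branch */
      for (int i = 0; i < 4; i++) {
          printf("predicted %d, actual %d\n", predict(), outcomes[i]);
          update(outcomes[i]);
      }
      return 0;
  }

A real predictor keeps one such counter per BHT entry, indexed by (part of) the PC.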
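And for 2.c, this is what memory-mapped I/O looks like from the program's side (again my own sketch; the device register address is hypothetical). The "volatile" qualifier prevents the compiler from caching or removing the accesses:

  /* 2.c: writing to a hypothetical device register at address 0x10000000
     with a regular store instruction. */
  #include <stdint.h>

  #define UART_TX (*(volatile uint8_t *)0x10000000)

  void putchar_mmio(char c) {
      UART_TX = c;  /* a normal store, routed to the device, not to RAM */
  }

Whether such an access also bypasses the caches is determined by the mapping that the hardware/OS defines for that address range, as explained above.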
3.b) The minimum read latency is possible when a read can reuse a row that is already loaded; it is then precisely x seconds. The maximum read latency occurs when the row must be changed, i.e. z+x seconds. The maximum read bandwidth is the maximum number of requests that can be processed per second. [If the D-RAM controller is pipelined, different requests to the same row can be processed in parallel, so] two subsequent requests only need to be y seconds apart, and the maximum read bandwidth is 1/y requests/second. [If the D-RAM controller is *not* pipelined, each new request needs to wait for the completion of the previous request *and* for the request-to-request delay, so the maximum read bandwidth becomes 1/max(y, x) requests/second.] (A sketch follows in the PS below.)

3.c) The algorithm reads and writes data linearly, so it will never reuse a cache line after it has finished using it. Therefore a direct-mapped cache is sufficient and no replacement policy is necessary. [Its size can also be kept very small if there is no intermediate computation, e.g. 2 cache lines.] For the write policy, the goal is to minimize the number of writes from the cache to memory while preserving the average load latency. With a write-through policy the number of stores is not reduced. If we chose a write-back policy and writes could target the same regions as loads, then a load conflict could cause a longer read latency, because it must first wait for an eviction. However, we know that the writes target a separate region, so we can safely use a write-back policy to combine writes without impacting the latency of reads.

3.d) Memory consistency is a contract between a program and the architecture which establishes the logical relationship between stores and loads. For example, with a single processor, "a load returns the value sent to memory by the most recent store" defines memory consistency. [If there are more processors, consistency can be "fully sequential", e.g. "all stores are performed on memory in some global order, and any load returns the value sent to memory by the most recent store in that order"; or consistency can be "partially ordered", for example "the stores from each processor are performed in order, but stores from different processors may be reordered".] Cache coherency is the hardware protocol by which two or more caches at the same level ensure that they see the same contents. For example, MOESI is a coherency protocol.

3.e) An LRU replacement policy needs to maintain a list or table to remember when each line within a set was last accessed. If there are more than 2 lines per set, this needs a counter of two bits or more per line, as well as an adder circuit and a comparator to perform operations on the counters. If there are only 2 lines per set, the cache can use a single 1-bit cell for the entire set to record which line was last accessed. This is simpler, and also cheaper, because less hardware logic is necessary. (A sketch follows in the PS below.)

-- 
Raphael 'kena' Poss · r.poss@uva.nl
http://staff.science.uva.nl/~poss/
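PS: two more sketches of my own, not part of the exam answers.

For 3.b, the latency/bandwidth arithmetic as a computation; the values of x, y and z are made up:

  /* 3.b: DRAM read latency/bandwidth, with x = column access time,
     y = request-to-request delay, z = row change time. */
  #include <stdio.h>

  int main(void) {
      double x = 15e-9, y = 5e-9, z = 40e-9;  /* seconds, arbitrary */
      printf("min latency: %g s\n", x);       /* row already loaded */
      printf("max latency: %g s\n", z + x);   /* row must be changed */
      printf("pipelined bandwidth: %g req/s\n", 1.0 / y);
      double m = (x > y) ? x : y;             /* non-pipelined case */
      printf("non-pipelined bandwidth: %g req/s\n", 1.0 / m);
      return 0;
  }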
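For 3.e, the special case of 2 lines per set, where the whole LRU state is one bit:

  /* 3.e: LRU for a 2-way set: one bit records which line was
     accessed last; the victim is the other line. */
  #include <stdio.h>

  static unsigned last_used = 0;  /* the single LRU bit: 0 or 1 */

  static void access_line(unsigned way) { last_used = way; }

  static unsigned victim(void) {  /* on a miss, evict the LRU line */
      return 1 - last_used;
  }

  int main(void) {
      access_line(0);
      access_line(1);
      printf("evict way %u\n", victim());  /* way 0: least recently used */
      return 0;
  }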