ECE486/586 Homework No. 1
Due date: 04/16/2015

Problem No. 1 (10 points)

Read the following paper posted on the course website: S. Borkar, “Design Challenges of Technology Scaling”, IEEE Micro, pages 23-29, July 1999. Then, answer the following questions:

(a) What are the goals of technology scaling? (2 points)
(b) How much reduction in gate delay is achieved by scaling CMOS technology to the next technology generation? What is the resultant impact on processor frequency? (4 points)
(c) When this paper was published, processor clock frequencies were increasing by a factor of two every technology generation. Why was this rate higher than the frequency improvement provided by technology scaling calculated in part (b)? (4 points)

Problem No. 2 (10 points)

Read the following paper posted on the course website: T. Agerwala and S. Chatterjee, “Computer Architecture: Challenges and Opportunities for the next decade”, IEEE Micro, pages 58-69, May 2005. Then, answer the following questions:

(a) What is meant by CAGR? What are the projections for CAGR in the next few years? (2 points)
(b) What is meant by scale-out architectures? Provide two examples each of (i) scale-out platforms, (ii) scale-out workloads. (5 points)
(c) The paper states that: “Integrating a heterogeneous mixture of simple and complex cores on a chip might provide acceptable performance over a wider variety of workloads”. Under which scenarios would a heterogeneous multi-core achieve higher average performance than a homogeneous mixture of cores? Explain with the help of examples. (3 points)

Problem No. 3 (15 points)

Consider two processors which use identical designs but operate at different voltage/frequency points: (i) Processor-1 operates at 1 V and 3.5 GHz, (ii) Processor-2 operates at 0.75 V and 2 GHz. Both processors are used to run a real-time task in which specific deadlines must be met. Each processor is turned “OFF” after the task has been fully executed.
Assume that both processors are able to meet the timing deadline. Also assume that the processors consume zero power in the “OFF” state.

(a) At the time when both processors are “ON”, which processor consumes less dynamic power? Quantify the relative power savings. (8 points)
(b) Which processor is more energy-efficient at executing the task? Is the energy difference between the two processors identical to the difference in dynamic power consumption computed in part (a)? If not, why? (7 points)

Problem No. 4 (30 points)

Intel’s Diamondville die, implemented in a 32 nm process, is 3.27 mm x 7.94 mm. Assume a defect density of 0.028/cm^2, a process complexity factor of 12 and a wafer cost of $4,000. Intel desires a 55% (gross) profit margin. Intel’s newer Cedarview die is implemented in a 22 nm process and the die area is 20 mm^2. Assume the newer process has a defect density of 0.036/cm^2, a process complexity factor of 14, and a wafer cost of $4,200. Assume 300 mm wafers, $1/die for all the packaging and testing costs, 100% wafer yield, and 99% yield at final test. Show all your work. Round die counts to the nearest integer, yields and profits to the nearest 0.01%, and part prices to three significant digits.

(a) How many dies per wafer can Intel expect for the Diamondville on 32 nm and for the Cedarview on 22 nm? (8 points)
(b) What is the expected die yield (%) for each chip? (6 points)
(c) What is the number of good dies per wafer for each? (4 points)
(d) At what price must Intel sell the Diamondville parts to achieve its target profit margin? (7 points)
(e) If Intel sells Cedarview at the same price, what is its profit margin on Cedarview? (5 points)
(f) (Extra Credit Question) A process engineer proposes an optimization to the 22 nm process which reduces the defect density to 0.03/cm^2 while increasing the process complexity factor to 14.5. However, implementing this optimization will cost 1.5 million dollars in equipment and engineer salaries.
Assuming that Intel expects to sell 80 million Cedarview processors after using this process optimization, should Intel management invest in the proposed optimization? Be quantitative and specific. (5 points)

Problem No. 5 (15 points)

A newly designed processor “P1” running at 2.5 GHz is being evaluated to run a web server benchmark. The following tables show P1’s CPI for each instruction type and the instruction frequency statistics for the benchmark:

Instruction Type   Loads/Stores   Branches   Integer ALU   Integer Multiply
Clock Cycles       3              4          1             7

Instruction Type   Loads/Stores   Branches   Integer ALU   Integer Multiply
Frequency          40%            10%        45%           5%

(a) To compete with other products available in the market, the processor must be able to provide a throughput (instruction execution rate) of 1,100 instructions per microsecond on the web server benchmark. Will P1 be able to satisfy the desired throughput requirement? (7 points)
(b) A microarchitect is proposing a change that will cut down the time taken by “Loads/Stores” in P1 from 3 cycles to 2 cycles. Calculate the speedup obtained by this change. (4 points)
(c) Using Amdahl’s law, verify the speedup you computed in part (b). (4 points)

Problem No. 6 (15 points)

An engineer at a major processor company is asked to compare two different designs for the upcoming mobile processor. The first design “C1” is expected to operate at 2.5 GHz, whereas the second design “C2” is expected to operate at 2 GHz. To compare the two designs, the engineer uses a benchmark suite comprising three workloads: web browsing, word processing and e-mail.
After simulating the execution of these workloads on the two designs, the engineer obtains the following data about the number of processor cycles taken by each workload on each design:

                                Web browsing   Word processing   E-mail
C1 (Execution time in cycles)   50000          100000            30000
C2 (Execution time in cycles)   40000          20000             30000

Note that the numbers in the above table represent execution time for each processor in terms of “processor cycles” (not seconds). Also note that there is a difference in the operating frequencies of the two designs.

(a) Using C1 as the reference design, compute the normalized execution times for each workload on C2. (6 points)
(b) Using the geometric mean method of comparing performance, evaluate the speedup of processor C2 over C1. (9 points)

Problem No. 7 (15 points)

We are considering enhancing a processor by adding vector hardware to it. When a computation is run on the vector hardware, it is 6 times faster than the normal mode of execution. We call the percentage of time that could be spent using vector mode “the percentage of vectorization”.

(a) What is the maximum speedup attainable from using vector mode? (3 points)
(b) How much performance improvement is achieved if the percentage of vectorization is 50%? (4 points)
(c) What percentage of vectorization is needed to achieve a speedup of 4? (5 points)
(d) What percentage of the computation run time is spent in vector mode if a speedup of 4 is achieved? (3 points)

Problem No. 8 (15 points)

A server is built from the following components and subsystems: a multicore CPU motherboard with an MTTF of 10,000 hours, 4 disk drives (each of which has an MTTF of 100,000 hours), a disk controller with an MTTF of 50,000 hours, and a power supply with an MTTF of 20,000 hours.

(a) What is the MTTF for the server? (7 points)
(b) You need to double the disk storage capacity.
You have a choice of purchasing four additional disk drives identical to those you’re already using, or replacing the four you have with disk drives that have twice the capacity. The new disk drives have an MTTF of only 60,000 hours. A disk controller can handle up to four drives. Which choice yields a more reliable system, and by how much compared to the other choice? (8 points)
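
As a starting point for Problem No. 8, the usual way to combine component MTTFs can be sketched as follows. This is a minimal sketch that assumes the server fails as soon as any one component fails, and that components fail independently with constant failure rates, so the failure rates (1/MTTF) simply add. The function name `system_mttf` and the input values are hypothetical and are not the homework's numbers.

```python
def system_mttf(components):
    """Return the MTTF (in hours) of a system in which any failure is fatal.

    components: iterable of (mttf_hours, count) pairs.
    Assumption: independent components with constant failure rates,
    so the system failure rate is the sum of the component rates.
    """
    failure_rate = sum(count / mttf_hours for mttf_hours, count in components)
    return 1.0 / failure_rate


# Illustrative values only (not taken from the problem statement):
# one part with a 1,000-hour MTTF plus two parts with 2,000-hour MTTFs.
example = [(1_000, 1), (2_000, 2)]
print(f"System MTTF: {system_mttf(example):.1f} hours")
```

Adding components, or swapping in components with a lower MTTF, increases the summed failure rate, which is the comparison part (b) asks for.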
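
For Problem No. 5(c) and Problem No. 7, Amdahl's law can be written as a short helper. This is a sketch, assuming the fraction is measured as a share of the original (unenhanced) execution time; the function names and the example values (a 10x enhancement applied to 90% of the time) are hypothetical and differ from the homework's numbers.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the original time is accelerated."""
    remaining = 1.0 - fraction_enhanced
    return 1.0 / (remaining + fraction_enhanced / speedup_enhanced)


def fraction_for_speedup(target_speedup, speedup_enhanced):
    """Fraction of original time that must be enhanced to hit a target.

    Obtained by solving Amdahl's law for the fraction:
    f = (1 - 1/S_target) / (1 - 1/S_enhanced).
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / speedup_enhanced)


# Illustrative values only (not taken from the problem statement).
overall = amdahl_speedup(0.9, 10.0)
print(f"Overall speedup: {overall:.3f}")
print(f"Fraction needed for that speedup: {fraction_for_speedup(overall, 10.0):.2f}")
```

Setting the fraction to 1.0 returns the enhancement factor itself, which is the upper bound on the attainable speedup.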
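
Problem No. 4 does not state which yield model to use. The sketch below assumes the common textbook model (as presented in Hennessy and Patterson's "Computer Architecture: A Quantitative Approach"), in which dies per wafer is the wafer area divided by the die area minus an edge-loss term, and die yield falls off as 1/(1 + defect density x die area)^N, where N is the process complexity factor. Check your course notes before relying on it. All function names and example inputs are hypothetical, and all areas must be converted to cm^2 first because the defect densities are given per cm^2.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Approximate die count: wafer area / die area, minus edge loss."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2, die_area_cm2, complexity_factor, wafer_yield=1.0):
    """Assumed model: wafer_yield / (1 + defect_density * area) ** N."""
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** complexity_factor


def cost_per_good_die(wafer_cost, dies, yield_fraction):
    """Wafer cost spread over the dies that pass (before packaging and test)."""
    return wafer_cost / (dies * yield_fraction)


# Illustrative values only (not taken from the problem statement):
# a 30 cm wafer, a 1.5 cm x 1.5 cm die, 0.031 defects/cm^2, N = 13.5.
area = 1.5 * 1.5
dies = round(dies_per_wafer(30.0, area))
good_fraction = die_yield(0.031, area, 13.5)
print(f"Dies per wafer: {dies}")
print(f"Die yield: {good_fraction:.2%}")
print(f"Cost per good die: ${cost_per_good_die(5000.0, dies, good_fraction):.2f}")
```

Packaging and test cost, final test yield, and the profit margin are applied on top of the per-die cost and are left out of this sketch.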