From: Thomas Walker Lynch
- Although the Universal Turing Machine conceptually introduced programs stored as data on a tape, Turing could not provide a formal connection to modern architectures, simply because such architectures did not yet exist. Here, by modern, I refer architectures utilizing random-access system memory, dedicated instruction fetch streams with dynamic branching, and discrete processing units. Though Charles Babbage's 1842 Analytical Engine touched on these concepts, they would wait until the 1940s to re-emerge. The practical engineering context of 1936 was limited to calculating machines programmed via patch panels. Hence, for example, there is no explanation in his paper as to why a von Neumann architecture machine (1945) running a program would exhibit the computation theoretic results derived from a computation theory based on the Turing Machine (1936). To establish this missing connection, the volumes of the TTCA, starting in the following chapters, transform the Turing Machine into a modern architecture in a stepwise fashion, while ensuring that at each step the modifications are
@@ -87,10 +87,19 @@ Mathematical logic underpins the computation theory layer. Computation theory speaks of the time and space complexity of algorithms and the existence of solutions to decider problems, which in turn guides the goals of the architecture and organization layers.
++ The Turing Machine is a computation theory object that is suggestive of a simple architecture, and a computer organization. Any student who has had to do homework problems centered on Turing Machines will have tracked the flow of data through the machine, i.e. worked at the register transfer level. However, a little work is needed to complete the architecture analog. The fundamentals are present: the read/write head, the tape, and the procedure for using the tape; but other components are missing. The manipulation of symbols remains ungrounded. The tape is not well defined. The use of emptiness is not architecture-like. The tape transport is not articulated, though it is implied. The read buffer, which holds a value read so that the programmed controller can do a write without clobbering the read data needed for the next transition, is not identified as a component. As we proceed, we will likely discover other missing components. +
+
An
+ It is not a requirement of a computer architecture that it can be realized. The Turing Machine architecture developed in a later section serves as an example. Instead, an architecture can serve other purposes, in this case as a stepping stone to another architecture that could be realized. +
+ +
- The ENIAC, sometimes credited as the first electronic computer, is an extreme example of using an older implementation in a new process. Rings of ten tubes were wires in circular shift registers so as to act like gears. Binary switch implementations were found in relay computers, and then in the Anastov-Berry machine and ESAC. -
- The common understanding of the word 'architecture' is that of a what Hamacher and Zaky calls an organization. For example, even the most experienced of architects will say things like a microprocessor has a "superscalar architecture", though whether a processor is a scalar, superscalar, or VLIW machine is clearly a question of computer organization.
+ The common understanding of the word 'architecture' is what Hamacher and Zaky call an
@@ -130,155 +136,128 @@
- This cascades down the stack, as organization instructions implementation, etc. For example, if the architecture has an instruction that names one of N registers as an operand, then the organization has a register file that data flows to and from, and busses to carry that data, the design will specify a register file and layout the busses, and the manufacturing people will build them. + This cascades down the stack, as organization instructs implementation, etc. For example, if the architecture has an instruction that names one of N registers as an operand, then the organization has a register file that data flows to and from, and busses to carry that data; the design will specify a register file and lay out the busses, and the manufacturing people will build them.
-- The Turing Machine is an abstraction, as are architectures, organizations, and implementations. Only a computer realization is concrete, but even then we can make observations that are analogous to properties of an abstraction. + The Turing Machine is an abstraction, as are architectures, organizations, and implementations. Only a computer realization is concrete, but even then we can make observations that are analogous to properties of an abstraction. Hence, we can use the language of mathematics to talk about machines at all of the levels.
- When transforming machine
- Suppose that machine abstraction
- We can then assign a property to
- We begin with a set of questions that are legal to ask under computation theory
- Here
- Then we pose these questions, and receive the answers.
+ Now suppose that a machine
- Here
- If any question in
- If and only if
- Now let us generalize. First, let us define the set of all input domains that are legal for machine
- Then let us define the set of all questions of decidability, along with the two questions of time and space complexity, that can legally be asked under
- We must establish the theoretical baseline by posing these generalized questions to our initial machine:
+ Suppose we still have the given machines, and their corresponding inputs, that were used when determining transform
- If any question in
- If and only if
- - - -
- The very title of this book has built in the same pitfall, as mostly what we discuss is the flow of data between named components, i.e. register transfer level description, which is organization. Computer architecture is often expressed by describing a
- A key question then, do different computer organizations have consequential effect. By
- A slowing clock could happen due to some peculiar extenuating circumstances, perhaps because a battery is going dead, but is not a normal situation, and there is a reason for that. If a computer works with a clock period of p0, then any longer clock period would be unnecessarily wasting time. -
+
- Nor is it practical to go the other direction, and make an exponential time program run in linear time by having an every faster clock. The simple reason being that the designers will have already set the clock to the fastest that it can safely go.
+ then
- Adding two Arabic representation numbers of ever lengthening operands is asymptotically a linear time problem. For numbers of fixed length, where that fixed length is not very long, a lookup table can be used. As such each addition would require the same amount of time, but this is merely making the worse case time for bit widths up to the lookup table with the time for all addition. It hides the trend curve, but does not replace it. As th - - In theory any program could be run in constant time by using a lookup table. - - What about going the other direction? If we have an ever faster - - - To answer this question we would have to be specific about the organization, but on the face of it the answer is "no". Take a superscalar architecture. Suppose that it were possible to run two instructions on every clocks cycle, that is merely a linear term improvement in the step count for completing the program. It doesn't change the asymptotic 'complexity' of programs. A linear time program is still a linear time program, etc. + If, and only if, it is the case that:
-- The Turing Machine is a computation theory object that is suggestive of a simple architecture, and a computer organization. Any student who has had to do homework problems centered on Turing Machines, will have tracked the flow of data through the machine, i.e. worked at the register transfer level. This makes the Turing Machine a computer organization. All descriptions of Turing Machines - - - however, a little work is needed to complete architecture analog. The fundamentals are present, the read/write head, the tape, the procedure for using the tape, but other components are missing. The manipulation of symbols remains ungrounded. The tape is not well defined. The use of emptiness is non-architecture like. The tape transport is not articulated, though it is implied. The read buffer that holds a value read, so the programmed controller can do write without clobbering the read data needed for the next transition is not identified as a component. As we proceed, we will likely discover other missing components. -
+
- It is not a requirement of a computer architecture that it can be realized. The Turing Machine architecture developed in a later section serves as an example. Instead, an architecture can serve other purposes, in this case as a stepping stone to another architecture that could be realized.
+ then we can say without qualification that
- We can now further articulate the goal of these volumes: to first capture the Turing Machine as an architecture, and then to transform that architecture step by step, without making any changes of consequence relative to computation theoretic results, into a modern computer architecture that could be implemented. A secondary goal is to learn something along the way. -
+The original Turing Machine had an infinite tape. In contrast, the TTCA machine has a surprising property: for computational problems all of its components remain finite. This follows from the fact that during a computation a machine makes a finite number of steps, so the tape can only be expanded to a finite size.
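The finiteness argument can be illustrated with a small sketch. The class `GrowableTape` and the helper `tape_length_after` are hypothetical names introduced here for illustration; they are not the TTCA tape definition. The sketch assumes a tape that allocates a cell only when the head first steps onto it, so after k moves the tape holds at most k more cells than it started with.

```python
class GrowableTape:
    """Illustrative tape that allocates a cell only when the head first visits it."""

    def __init__(self, contents, blank="_"):
        self.blank = blank
        self.cells = list(contents) or [blank]  # an empty tape still has one cell
        self.head = 0

    def move_right(self):
        self.head += 1
        if self.head == len(self.cells):
            self.cells.append(self.blank)  # at most one new cell per step

    def move_left(self):
        if self.head == 0:
            self.cells.insert(0, self.blank)  # grow leftward; head stays at index 0
        else:
            self.head -= 1


def tape_length_after(initial, moves):
    """Apply a string of 'L'/'R' moves and return the resulting tape length."""
    tape = GrowableTape(initial)
    for move in moves:
        if move == "R":
            tape.move_right()
        else:
            tape.move_left()
    return len(tape.cells)
```

In this model the tape length after any finite run is bounded by the initial length plus the number of moves made.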
++ Let us take the example of adding two Arabic representation numbers. Logically this is considered to be a logarithmic time problem. We break the operands into fixed length pieces, and adding them in pairs results in a carry per block. By recursively pairing the blocks and applying the carries, we generate wider carries. Thus we can show that in terms of the logic gates that must be traversed, the sum is a log time operation. +
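As a sketch of the recursive carry pairing described above, the following uses a Kogge-Stone style parallel prefix scheme. This is one well-known way to realize the idea and is an assumption on my part, not necessarily the construction the text has in mind. The function names are illustrative. Bits are little-endian lists, operands have equal nonzero length, and `levels` counts the combining rounds, each of which could be evaluated in parallel.

```python
def to_bits(x, n):
    """Little-endian list of the low n bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


def prefix_add(a_bits, b_bits):
    """Add two equal-length bit lists; return (sum_bits, levels)."""
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]  # bit i generates a carry
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # bit i propagates a carry
    G, P = g[:], p[:]
    levels, span = 0, 1
    while span < n:
        # Each position merges with the block `span` places below it.
        # All merges within one level are independent of one another.
        G_next = [(G[i] | (P[i] & G[i - span])) if i >= span else G[i]
                  for i in range(n)]
        P_next = [(P[i] & P[i - span]) if i >= span else P[i]
                  for i in range(n)]
        G, P = G_next, P_next
        span *= 2
        levels += 1
    carries = [0] + G[:-1]  # carry into bit i comes from bits 0..i-1
    sum_bits = [p[i] ^ carries[i] for i in range(n)] + [G[-1]]
    return sum_bits, levels
```

Doubling the operand width from 8 to 16 bits adds one level rather than eight, which is the logarithmic behavior in gate depth.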
+ ++ Physics comes to a different conclusion. In the worst case, a carry into the least significant bit can affect the sum bit some physical distance away. As the operands get longer, this distance grows in proportion. So given the propagation of information at a fixed speed, the bounding evaluation time against growing operand width is linear time. Even if it is log time in gate count, at some point the interconnect delay will dominate. +
+ ++ The logical analysis of the adder given above allowed for unbounded resources, because as the adder operand increases in size, the number of block adders increases without bounds. In any realization there will be a limit on the number of blocks that can be added in parallel. These groups are then processed one by one, and the carry is propagated between them. Consequently, as the operands grow in length without bounds, the adder evaluation time becomes proportional to the number of groups processed. Processing groups in series is a linear time algorithm. +
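A minimal sketch of the resource-limited case follows. It assumes a fixed adder of `group_bits` bits that is reused group after group, with the carry handed from one group to the next. The function name and the 8-bit default are illustrative choices, not values from the text.

```python
def grouped_add(a, b, width, group_bits=8):
    """Add two `width`-bit integers one fixed-size group at a time.

    Returns (sum, groups_processed). `group_bits` stands in for the
    fixed amount of adder hardware available in a realization.
    """
    mask = (1 << group_bits) - 1
    groups = -(-width // group_bits)  # ceiling division
    carry, result = 0, 0
    for g in range(groups):
        shift = g * group_bits
        total = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (total & mask) << shift
        carry = total >> group_bits  # handed to the next group in series
    result |= carry << (groups * group_bits)
    return result, groups
```

With the hardware held fixed, the number of groups processed grows in direct proportion to the operand width: 4 groups for 32 bits, 8 groups for 64 bits.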
+ ++ It is notable that the time-multiplexed use of computer resources produces the same linear time result as the physics of information propagation analysis for the adder. +
+ ++ A Turing Machine program faces a situation analogous to physical constraints. Given the operands are found on the tape, and the carry-in can affect the msb of the sum, the head will have to move ever more cells rightward to convey that lsb information up to the msb. Based solely on the propagation time of that information, addition is found to be a linear time algorithm. This propagation remains computation class limiting even if the Turing Machine is given an unbounded number of independent heads. +
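The head travel can be seen in a simplified sketch. It models only the worst-case carry-in: a single head starts at the least significant cell holding a carry of 1 and walks toward the most significant end until the carry is absorbed. The function `carry_walk` and the least-significant-cell-first layout are assumptions made for illustration, not a full Turing Machine addition program.

```python
def carry_walk(tape):
    """Propagate a carry-in of 1 along `tape`, modifying it in place.

    `tape` is a list of '0'/'1' characters, least significant cell first.
    Returns the number of cells the head moved.
    """
    head, moves = 0, 0
    while True:
        if head == len(tape):
            tape.append("1")  # carry runs off the end: extend the tape
            break
        if tape[head] == "0":
            tape[head] = "1"  # carry absorbed here
            break
        tape[head] = "0"      # 1 + carry gives 0, carry continues
        head += 1
        moves += 1
    return moves
```

For an operand of all ones the head moves once per cell, so the move count equals the operand length.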
+ ++ There appears to be alignment among physical limitations, resource limited computing, and steps spent by Turing Machines while they carry information across a linear tape. This alignment indicates that a reasonable realization will be computation theoretic inconsequential. +
+ +
+ At this point we have arrived at questions of the
+ From Babbage's Analytical Engine of 1842 up to the transition to mechanical relays and vacuum tubes in the 1940s, calculating machines were implemented with gears. The basic principle is apparent to anyone who has seen a mechanical odometer. Consider adding numbers for example: given two odometers, step one back at the same time as stepping a second one forward; when the first one reaches zero, the second will hold the sum. This process can be optimized, but the general idea remains the same. For such machines, a step is a rotation of the main shaft. +
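The odometer procedure can be written out as a short sketch. The function name is illustrative, and the operands are assumed to be non-negative integers. One loop iteration corresponds to one step of the mechanism.

```python
def odometer_add(first, second):
    """Step `first` down and `second` up until `first` reads zero.

    Returns (sum, steps), where each step moves both odometers once.
    """
    steps = 0
    while first != 0:
        first -= 1
        second += 1
        steps += 1
    return second, steps
```

For example, `odometer_add(7, 5)` returns `(12, 7)`. In this unoptimized form the step count equals the value of the first operand.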
+ +
+ The Harvard Mark I machine had a main axle speed that maxed out at 3000 RPM, say 2700 RPM to keep our math simple. This is 2700 steps per minute. The ENIAC was a similar implementation, but one that called out the use of circular shift registers of vacuum tubes instead of mechanical gears. Because there were 10 tubes in a ring register, it took 10 clock ticks to complete one 'rotation'. The clock rate maxed out at 450 kilohertz. That would be one rotation every
+ Yet, the same program when run on the Mark I took the same number of steps as on the ENIAC. But more importantly, a linear time algorithm on the Mark I was still a linear time algorithm on the ENIAC, etc. Thus, these implementation differences were computation theoretic inconsequential. +
+ ++ It feels unsatisfactory to leave out the tremendous difference in clock rates. So let us address this feeling by naming an ENIAC main shaft 'rotation' as a standard 'step'. If we do this, then a Mark I shaft rotation would be 1000 ENIAC steps. Yet, this would merely affect the linear constant in the step count formulas. The same programs can be run, with the same inputs, and asymptotic behavior is the same for both machines, because computation classes do not include the constants on the step count equation. Constant time remains constant; linear, polynomial, and exponential time classes are the same as before. Programs that decide questions would get the same answers when they completed. +
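The factor of 1000 can be checked with a few lines of arithmetic. The sketch uses only the figures quoted in this section. The constant and function names are illustrative.

```python
MARK_I_RPM = 2700            # shaft rotations per minute, as quoted above
ENIAC_CLOCK_HZ = 450_000     # clock ticks per second, as quoted above
TICKS_PER_ROTATION = 10      # ten tubes per ring register

mark_i_steps_per_second = MARK_I_RPM // 60                     # 45
eniac_steps_per_second = ENIAC_CLOCK_HZ // TICKS_PER_ROTATION  # 45000
ratio = eniac_steps_per_second // mark_i_steps_per_second      # 1000


def mark_i_cost_in_eniac_steps(program_steps):
    """Cost of a Mark I run, measured in standard (ENIAC) steps."""
    return ratio * program_steps
```

Doubling the program's step count doubles the cost on either machine, so the ratio shows up only as a constant multiplier.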
+ ++ Because we made an ENIAC shift register turn completion a 'standard step', we have a relative measure, so there is something we can do to create a computation theoretic consequential difference. Suppose we have two ENIAC machines, and we send one speeding away from Earth at an exponentially increasing rate, i.e. increasing red shift, and we observe it from Earth. We will observe that the clock on the traveling ENIAC is growing ever slower, and that a linear time program running on it will be observed to have exponential time behavior. Unfortunately, relativity does not smile upon us, as the people on the spaceship would not see the inverse, a speeding computation on Earth, but rather they would also observe a slowing one. +
+ ++ So then, instead we send a spaceship towards Earth, with increasing blue shift, and we would observe that spaceship's ENIAC getting faster and faster. This is still not a computation theoretic speedup, because it is not asymptotic. In finite time, said spaceship would run into Earth, or pass it by and then be red shifting. +
++ A designer could purposely slow the clock on a second ENIAC so as to emulate red shift. For this to be more than mere theater there would have to be a physical reason to run a slower clock than necessary, for example to conserve an ever dwindling battery. But slowing computation down, or even stopping it, is typically not useful. Going the other direction, an ever faster clock does not work, as there is a finite maximum physical clock speed. +
++ We get an increasing blue shift situation with Moore's law. If with every generation transistors become exponentially smaller, and thus faster, and we consider step times in years, hopping from new realization to new realization, then indeed linear time algorithms on a single realization would be log time algorithms on the generational computer. But chances are this is not an asymptotic, i.e. limit to infinity, phenomenon either. +
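The generational argument can be sketched under a strong simplifying assumption: each generation lasts one unit of time and runs `speedup` times as many steps per unit as the previous one, with the first generation running one step. The function name and the default doubling are hypothetical, chosen only to illustrate the log time claim.

```python
def generations_needed(n_steps, speedup=2):
    """Count the generations hopped through before n_steps are completed."""
    done, generations, rate = 0, 0, 1
    while done < n_steps:
        done += rate      # work finished during this generation
        rate *= speedup   # the next realization is faster
        generations += 1
    return generations
```

Under these assumptions a thousand steps take 10 generations and a million steps take 20, so the generation count grows with the logarithm of the step count.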
+ ++ Superscalar and VLIW computers execute multiple instructions in parallel. Real data dependencies put limitations on how many instructions are available to be executed in parallel, but even discounting this, if a program were executed N instructions at a time, its time to execute would divide by N. This merely affects the linear component of the equation mapping step count to input length, and thus does not change the computation class. Superscalar and VLIW architectures do not affect decisions; indeed they are transparent to programs, so decider problem results cannot change. Hence these techniques are not computation theoretic consequential. +
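A brief sketch of the divide-by-N effect follows. It assumes an idealized machine that always finds N independent instructions to issue. The function name is illustrative.

```python
def cycles(instruction_count, issue_width):
    """Cycles needed when `issue_width` instructions retire per cycle."""
    return -(-instruction_count // issue_width)  # ceiling division
```

Whatever the issue width, doubling the instruction count doubles the cycle count: `cycles(16000, 4)` is twice `cycles(8000, 4)`, just as `cycles(16000, 1)` is twice `cycles(8000, 1)`. The width changes the constant and leaves the growth rate as it was.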
+ ++ In general, by definition, organizations do not change a program's view of the machine, as that is part of the architecture. So organizations will not affect decider results. Also, the memory operations will be the same, as that is viewable state, so space complexity does not change unless time complexity changes. +
+ ++ The realization sets fixed resources, so any attempt at parallelization will be bounded, as in the superscalar and VLIW discussion above. Thus at best it can divide the execution time by N. +
+ ++ Some organizations can arrange computation in a manner that the base clock can run faster than for other organizations. However, clocks run at a fixed maximum speed. On modern systems they can slow down to reduce heat dissipation or battery consumption, but that does not make programs faster. So if one organization has a faster clock than another, the ratio is merely a linear term contributor. Apart from stopping, there is nothing a clock can do to participate in the decision making of the program. +
+ ++ Caching of values sent to the system memory again does not participate in the decision making of a program. We are at best looking at improvements in the linear term. +
+ ++ Branch prediction saves the time required to do a full fetch, but fundamentally it does not change the data flow graph of the program. The same decisions are made. +
+ +
+ Suppose that an organization keeps the operands for a function in a content-addressable memory. When the operands are recognized, it then immediately returns the looked-up value. This approach, called
+ Common decisions made at the architecture level are those for supporting RISC or CISC, the bit layout and handling of operands, the size of the internal register file, how DMA is to be handled, whether to use memory-mapped I/O or have explicit instructions for it, how interrupts are to be implemented and the number of entries in the interrupt table, what special registers are present and what features are available through them, how virtual memory and its user and process IDs are to be implemented with the possible use of a translation lookaside buffer, what onboard execution units will have direct instructions, the built-in data types, questions of unaligned accesses, bus standards to be supported, if sleep modes are to be present, how the machine will get booted, the security rings that will be supported, details of the hardware virtualization layer, special support for the OS, how the system stack will be handled, potential partitioning of address space, support for large buffers, and memory sharing features: none of these are computation theoretic consequential. +
+ ++ As architecture enters the gray area with organization, cache architecture, bus layouts, bus buffers, direct inclusion of write buffers, perhaps a stack cache, prefetch buffers and split-transaction buses: none of these are computation theoretic consequential either. +
+ ++ Said features certainly affect performance, but none participate in the decisions the program makes, change the number of execution steps by more than a linear ratio, or alter the memory complexity of the program. +
+ + + +