Asymmetric Real Time Scheduling on a Multimedia Processor

Alessandro Forin
Andrew Raffman, Jerry Van Aken

February 1998

Technical Report
MSR-TR-98-09

Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052


Asymmetric Real Time Scheduling on a Multimedia Processor

Alessandro Forin, Andrew Raffman, Jerry Van Aken
Microsoft Corporation

Abstract

This paper describes the hardware and system software support for multi-tasking a dual processor used in a multimedia system. One of the two processors is a conventional RISC processor; the other is a vector processor of new design. There is no commonality between the instruction sets of the two processors. We schedule the former preemptively and the latter cooperatively via check points, resulting in an asymmetric scheduling algorithm. Media processing requires predictable scheduling behavior. Programmers can inform our scheduler of the Real Time requirements of their computations using time constraints. Scheduling decisions are very fine-grained, and place strong efficiency demands on the system implementation.

Introduction

We have designed and implemented the system support for a new type of processor [Le96], designed for a high-performance multimedia architecture code-named Talisman [To96]. This heterogeneous dual processor (MSP) contains a general-purpose ARM RISC processor, a general-purpose Vector Processor (VCP), plus caches and some peripherals, all in one chip. The VCP has a very large amount of user-mode state, about 8 KB. Using preemptive scheduling for the VCP would be very inefficient, because of the expensive memory traffic for state save and restore.
The ARM processor has a much more limited state size, and is therefore scheduled preemptively. We have devised a simple way in which an application can cooperate with the system scheduler in saving and restoring the VCP state. The programming model is not affected: it provides the customary multiple processes with multiple threads each. The compiler or the programmer simply inserts a special conditional instruction in the VCP code flow, at points where the active register set is minimal. Normally this instruction does nothing. When the scheduler actually needs to switch the VCP state, this instruction becomes a function call to a state save-and-restore routine. In addition to this asymmetric way of dispatching threads on the two processors, the system scheduler provides extremely fine-grained Real Time scheduling support via time constraints [Jo96].

The Talisman [To96] project at Microsoft has designed a new hardware architecture, which achieves state-of-the-art 3D graphics/audio performance and quality at consumer price points. The goal is to make 3D graphics and audio a commodity item, accessible on every personal computer and not just on high-end workstations and/or supercomputers. The Talisman reference hardware implementation is the first implementation of the architecture, a single expansion board that plugs into the PCI bus. This single board replaces the functionality that is typically provided by a Windows accelerator board, a 3D accelerator board, an MPEG playback board, a video-conferencing board, a sound board, and a modem. Our OS kernel is resident on the board, and dynamically loads/unloads application modules and device drivers from the host (which runs either the Windows NT or Windows 95 operating system).

There are two main reasons why our work on Talisman and the MSP processor is of general interest.
Consumer demand for rich media support is focusing software developers on a new set of algorithms and data types, especially in the video and audio domains. Movie playback, 3D graphics-intensive games and applications, sound spatialization, and video conferencing are just a few instances in which data compression/decompression and signal processing techniques have become mainstream computing. Many CPU manufacturers are optimizing their processors for the new algorithms and data types [In96]. The MSP design accommodates the new needs by associating a special (vector) coprocessor with an existing general-purpose architecture. This approach minimizes the impact on existing system and application software. Vector coprocessors have large register files, but media processing demands limited latencies. The problems we faced on the MSP are therefore likely to surface again [Pa97].

Embedded systems often employ specialized digital signal processors (DSPs), but their programmers long for a more general-purpose instruction set. Recently, hardware manufacturers have started to address this dilemma by packaging a general-purpose CPU core together with the DSP components. Our heterogeneous processor fits in this new category, and indeed the MSP is expected to be widely used in embedded applications and consumer devices. We demonstrate in this paper a general and economic model for structuring the system support for this new generation of more flexible DSPs.

MSP Overview

Samsung's Media Signal Processor [Le96] is the combination of an established, general-purpose RISC CPU core with a new processing architecture based on a SIMD computing model. Additionally, specialized instructions and computing elements enhance the execution speed of algorithms such as MPEG video decode, the 3D graphics pipeline, and audio signal transformations.
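As a rough illustration of the SIMD computing model just mentioned, a single vector instruction conceptually applies one operation to every element of a vector register at once. The sketch below is plain C, not MSP code; the operation and all names are made up for illustration:

```c
#include <stdint.h>

/* Rough illustration of the SIMD computing model: one instruction
   applies the same operation to every element of a vector register.
   This is plain C, not MSP code; the operation and names are made up. */

#define VLEN 32   /* a vector register viewed as 32 8-bit elements */

/* Saturating add across a whole "vector register" in one step. */
static void vadd_u8_sat(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < VLEN; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255u ? 255u : s);
    }
}
```

Saturating byte arithmetic of this kind is typical of the MPEG and audio workloads the MSP targets, which is why dedicated vector hardware pays off over scalar loops like this one.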
Also on-chip are instruction and data caches, and I/O elements commonly found on microcontrollers, such as audio CODECs, timers, and an interrupt controller.

The RISC core is the ARM-7 from ARM Ltd. It has been used successfully for building personal computers at Acorn, PDAs at Apple and DEC, and in the embedded market for printers and other controllers. This processor is initially expected to be clocked at 40 MHz. Two of the most interesting features of the ARM architecture are the extremely small die size, and the limited code size (even when compared to CISC processors) resulting from the instruction set encoding. From the system point of view, there are a couple of interesting aspects to the ARM. It does not architecturally define support for virtual memory, although at least one MMU has been implemented externally to the CPU core. The ARM supports a number of modes besides the typical user and supervisor modes. The ARM is otherwise similar to many other RISC processors; it has 16 general-purpose registers in each mode, and the standard load-store memory interface. Sometimes we had to work around errors in the architecture, for instance when implementing our synchronization primitives. But our algorithms and our findings are not tailored to the ARM architecture; any other RISC or CISC core could replace the ARM in what we will describe in the rest of the paper.

We will refer to the SIMD processor circuitry as the Vector CoProcessor (VCP), because that is the way we model it in software. In reality, this is a full general-purpose processor in its own right, with independent instruction fetch/decode units, both scalar and vector instructions, load and store instructions, and control transfer instructions. It delegates to the main ARM processor only interrupt and exception handling duties. Another aspect in which the VCP is somewhat subordinate to the ARM is that some of the VCP registers are accessible to the ARM but not vice-versa.
Architecturally, it would be easy to define the few additions needed to make the VCP into a truly standalone, general-purpose processor. The VCP has one bank of 32 scalar registers, which act just like any RISC set of general-purpose registers. It has two banks of 32 vector registers. Each vector register is seen, depending on the instruction operating on it, as a set of N-bit numbers. More specifically, a register can be seen as 32 9-bit integers, 32 8-bit integers, 16 16-bit integers, 8 32-bit integers, or 8 32-bit IEEE 754 single-precision floating-point numbers. There are a few additional control registers that control the behavior of the processor, and the privileged state. This processor is initially expected to be clocked at 80 MHz.

The MSP chip contains two other components intended for use by the VCP. One is a BitStream Processor, designed to efficiently perform certain portions of MPEG decoding that operate on bitfields, in parallel with the other two processors. The state of this processor is 256 bytes. The other component is the ScratchPad RAM, a portion of the cache that acts as 4 KB of memory. Stores to this memory address range are never written back to memory. The ScratchPad is intended as a high-speed memory buffer, to complement the register files.

The VCP is clearly a complex engine, but from the operating system point of view its most important aspect is the size of the state that would have to be stored to and loaded from memory during a full context switch. The grand total is 8600 x 2 = 17,200 bytes. Our techniques should be useful on any other coprocessor of similar complexity.

The VCP Model

The VCP is faster at signal processing than the ARM, and the overall performance of the system is very dependent upon efficient utilization and scheduling of the VCP resource.
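The overlapping element views of a vector register described above can be modeled in C as a union. This is only an illustrative sketch: the 9-bit view has no natural C type and is omitted, so the union below covers the 256-bit (32 x 8-bit) interpretation, and `vreg_t` is a made-up name:

```c
#include <stdint.h>

/* Illustrative C model of the overlapping element views of one vector
   register, using the 256-bit (32 x 8-bit) interpretation; the 9-bit
   view has no natural C type and is omitted. vreg_t is a made-up name. */
typedef union {
    uint8_t  b[32];   /* 32 8-bit integers  */
    uint16_t h[16];   /* 16 16-bit integers */
    uint32_t w[8];    /* 8 32-bit integers  */
    float    f[8];    /* 8 IEEE 754 single-precision floats */
} vreg_t;
```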
We made the following requirements and assumptions while developing a scheduling model for the VCP:

- The VCP scheduling mechanism must allow developers to implement tightly coupled threads between the ARM processor and the VCP. Some programmers want to use both processors together as a single unit to get the most performance out of the system. This cannot be done by treating the ARM processor and the VCP as totally asynchronous processors.

- The mechanism must also allow independent computations on the two processors. Some programmers will simply start the VCP on a long computation and block the ARM processor waiting for termination. In these instances we want the ARM processor to be available for other threads.

- Most VCP algorithms will be somewhat batch oriented, or will otherwise have locations in their code paths at which they maintain relatively little state. Therefore, software developers can design their code to give up the processor at well-defined places, and even (in some cases) restart a set of operations that was aborted.

- VCP algorithms will be single-threaded. The only synchronization is with the ARM counterpart, where all the explicit multi-threading takes place. Consequently, VCP code has no access to synchronization primitives such as condition variables and mutexes, or to time constraints.

- There is no separate VCP scheduling code. Indeed, there is little or no kernel code at all compiled for the VCP.

The VCP is a resource that must be shared among the threads that execute on the ARM processor. The method used to facilitate this sharing is to virtualize the VCP: each thread that uses the VCP can assume that it has exclusive use of it. Although each thread assumes that it has exclusive use of the VCP, the program that runs on the VCP is, in fact, treated by the operating system as a task that can be context-switched in and out of the VCP.
Associated with each ARM thread that uses the VCP, therefore, is a single VCP task that is managed behind the scenes by the kernel. As long as a thread accesses the VCP through the supported primitives, and the VCP application programmer follows a few simple rules, the possibility of interactions with other threads that are also using the VCP can be ignored.

The first and most basic rule is that an ARM thread must bracket access to the VCP hardware between calls to the VCPTake() and VCPRelease() functions. This indicates that the programmer's accesses to the VCP are intentional; any VCP access outside a block that begins with VCPTake() and ends with VCPRelease() is considered an error and generates an exception.

The second rule is that tasks running on the VCP must be ready to voluntarily release the processor, saving the necessary state in memory and reloading it when restarted. This is done synchronously; the code stream is interspersed with special instructions that mark check points. At these points the amount of state to be saved is much less than the worst case, typically just a handful of scalar registers or even none. The compiler (or the programmer) generates the VCP function to save and restore state, or uses a pessimistic default function.

The last rule is that check points must occur frequently enough. The worst-case latency with which a thread can regain ownership of the VCP is the system-wide longest code sequence between any two check points. Presently, programmers might have to help the compiler to make sure this rule is obeyed.

Preemptive Scheduling of the Main Processor

By design, the MSP is an unbalanced multiprocessor. One of the two processors is meant to execute the more system-intensive tasks, while the other is meant for the compute-intensive ones. This division of labor is not absolute.
There can be code for which the instruction set of the intended processor is sub-optimal, and the programmer certainly has the freedom to allocate a task where it performs best. Since interrupts are all routed to the main ARM processor, that is the one that runs the kernel, the device drivers, and the DMA engines' code. In addition, all the system code for inter-thread synchronization and data dispatching runs on the main processor, to take advantage of the tighter integration possible with the scheduler.

Threads running on the main processor are preemptively context-switched to support a multitasking environment. When a thread is switched out of the main processor, the kernel saves all the registers and other internal state of that thread into an area of memory. This state is later restored when the thread is switched back into the main processor. In this way, each thread has full access to all of the registers of the machine and does not need to worry about sharing those registers with other threads. Preemptive context switching minimizes the latency with which we can respond to asynchronous events. Saving and restoring the whole register set (16 words) is feasible on the ARM because it represents only a small percentage of the total time required to schedule a thread.

A thread can be preempted either because of a timer interrupt or because of a device interrupt. When the scheduler selects a new thread to run, it also decides how long the thread should be allowed to use the CPU. This time interval is then programmed into the system timer before dispatching the thread. When the timer expires, the scheduler is invoked and a new thread is selected for execution. Note that we reprogram the interval timer on each context switch instead of using a periodic interrupt. A periodic interrupt would generate a large number of useless interrupts; to provide fine-grained scheduling control it would have to run at a very high frequency, causing excessive overhead.
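One way to sketch this one-shot scheme: at each dispatch the interval programmed into the timer is the smaller of the thread's quantum and the time to the earliest pending deadline. The deadline bookkeeping and all names below are assumptions for illustration, not the MSP scheduler's actual data structures:

```c
/* Sketch of choosing the one-shot timer interval at dispatch time: the
   smaller of the thread's quantum and the time to the earliest pending
   deadline. The deadline bookkeeping is an assumption for illustration,
   not the MSP scheduler's actual data structures. Times in microseconds. */
static unsigned next_interval(unsigned now, unsigned quantum,
                              const unsigned *deadlines, int n)
{
    unsigned interval = quantum;
    for (int i = 0; i < n; i++) {
        /* only future deadlines can shorten the interval */
        if (deadlines[i] > now && deadlines[i] - now < interval)
            interval = deadlines[i] - now;
    }
    return interval;  /* programmed into the interval timer, then dispatch */
}
```

With a computation of this shape the timer fires exactly when a scheduling decision is due, instead of at a fixed high frequency.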
The precision and granularity of the hardware timer directly affect the scheduler's precision and accuracy. An interrupt from a device other than the timer does not necessarily cause any scheduling activity. In an Interrupt Service Routine (ISR) the device driver writer is entitled to use one (and only one) system primitive: signaling of a condition variable. Should this primitive be invoked, upon return from the ISR the kernel code might find it necessary to invoke the scheduler. This in turn might preempt the current thread in favor of another one that has now become more urgent.

Ideally, the ISR will do just the minimal amount of work needed to dismiss the cause of the interrupt, and wake up some thread to do the actual work of handling the interrupt. For instance, a very simple ISR approach is to disable further interrupts from that particular device, by clearing its interrupt enable bit in some control register. It will be the thread waiting on the device's condition variable that will do all the work that would normally be done in the ISR of a non-Real Time OS kernel. It is up to this thread to then re-enable the device's interrupts.

The scheduler is admittedly vulnerable to user-written ISRs. The scheduler does not account for any time spent in an ISR, for efficiency reasons. We also decided not to employ any interrupt nesting or prioritizing scheme. Doing so would only provide the illusion of low latency for one device (the highest-priority one) at the cost of a more costly dispatching of interrupts in the normal, no-collision case. It is never the case that the performance of the system depends exclusively on one device's ISR. Therefore our ISRs simply run with interrupts disabled. Unfortunately, high-efficiency dispatching of interrupts also means that scheduling is only accurate to the extent that ISR invocations occupy a negligible portion of the processor's time. Should a programmer install an expensive ISR, the whole system will visibly suffer.
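The minimal-ISR discipline above (disable the device's interrupt, signal the condition variable, let a thread do the real work and keep the ISR cheap) might be sketched as follows; the control register layout and all names are illustrative assumptions, not the actual MSP kernel interface:

```c
/* Sketch of the minimal-ISR discipline: the ISR stays cheap so that
   expensive user ISRs do not distort scheduling accuracy. The control
   register layout and all names are illustrative assumptions. */

#define IRQ_ENABLE_BIT 0x1u

static unsigned dev_ctrl = IRQ_ENABLE_BIT;  /* stand-in for the device's
                                               control register */
static int cond_signals = 0;

/* The one system primitive an ISR may use: signal a condition variable. */
static void ConditionSignal(void) { cond_signals++; }

/* The ISR dismisses the interrupt cheaply and wakes the handler thread. */
static void device_isr(void)
{
    dev_ctrl &= ~IRQ_ENABLE_BIT;   /* no further interrupts from this device */
    ConditionSignal();             /* wake the thread waiting on the device */
}

/* The woken thread does the work a non-Real Time kernel would do in the
   ISR itself, then re-enables the device's interrupts. */
static void handler_thread_body(void)
{
    /* ...service the device... */
    dev_ctrl |= IRQ_ENABLE_BIT;
}
```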
We did not install any of the many possible safeguards against this danger.

A thread can also be rescheduled non-preemptively upon invocation of one of the various synchronization primitives. A typical case is a thread that signals a condition waited upon by a thread with a more stringent time constraint. This happens, for instance, in a small data-dispatching package we wrote to support media streaming. A thread that is starving for data, and approaching its deadline, will most likely wake up and preempt the thread that just provided the data and signaled the non-empty condition.

Cooperative Scheduling of the Vector CoProcessor

Virtualizing the VCP in the same way that the main processor is virtualized would require preemptively context-switching the entire state of the VCP when switching from one VCP thread to another. The system scheduler lacks any information as to which VCP resources are actually in use, and therefore must make pessimistic assumptions. The amount of state information associated with the VCP is quite large, and the overhead of preemptive context switching would severely degrade the operation of the system.

We decided to adopt a hybrid approach: when the ARM-based kernel needs to context-switch the VCP, it initiates a context-switch request to the VCP. This request is honored when the VCP thread reaches a designated point in its execution, one at which its active context is much smaller than the worst case. In other words, we adopt the software convention that VCP context switches, as well as synchronization between a VCP task and its associated main-processor thread, can occur only at these designated check points within a VCP program.

We have identified three types of check points that the model must support: sync points, clean points, and exit points. The VCP hardware includes special instructions to assist in implementing these check points. The following sections document the implementation of these primitives using VCP instructions.
Clean Points

Clean points are check points at which only a small amount of VCP state information must be saved and restored in the event of a VCP context switch. The scheduler requests a context switch when some other VCP task needs to preempt the currently executing VCP task. To avoid the cost of a full VCP context switch, however, the scheduler does not forcefully initiate a context switch. Instead, it allows the VCP task to continue to execute until it reaches a clean point. At a clean point, it is understood that the VCP state to be saved is much smaller than the worst-case 17,200 bytes, and possibly null.

The responsibility for identifying clean points within a particular VCP task belongs to the application programmer, not the OS. The OS has no knowledge of which registers may be in use by the application at the time the OS requests a context switch. The application programmer must also ensure that clean points occur frequently enough within the VCP program, so that a VCP task is never forced to wait "too long" for a context switch. The meaning of the words "too long" in this context is system-dependent and varies with the nature of the applications that must share the VCP. In the Talisman architecture, the rule of thumb is that the time between successive clean points in a VCP program should be no longer than 100 microseconds. This figure is somewhat arbitrary. The nature of the clean-point method for performing context switches, however, does require the establishment of a maximum clean-point latency to which all VCP programs must conform.

A clean point in a VCP program is implemented by the Vector Coprocessor Context Switch (VCCS) instruction: VCCS
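In C terms, the intended effect of a conditional clean-point instruction can be sketched as follows. This is a simulation under assumed names, not actual VCP code: normally the check point does nothing, but when the scheduler has requested a switch it expands into a call to the task's save-and-restore routine, which moves only the small active state at that point.

```c
#include <string.h>

/* Simulation of the conditional clean-point behavior: all names
   (vcp_switch_requested, save_and_restore, ...) are illustrative,
   and this is plain C, not VCP code. */

static int vcp_switch_requested = 0;  /* set by the scheduler's request */
static int switches_taken = 0;

static int live_regs[4] = {1, 2, 3, 4};  /* small active state at a clean point */
static int saved_regs[4];

static void save_and_restore(void)
{
    memcpy(saved_regs, live_regs, sizeof live_regs);  /* save minimal state */
    switches_taken++;            /* ...another VCP task would run here... */
    memcpy(live_regs, saved_regs, sizeof live_regs);  /* restore on resume */
}

/* The clean point: normally a no-op, it expands into a call to the
   save-and-restore routine only when a switch has been requested. */
static void clean_point(void)
{
    if (vcp_switch_requested) {
        vcp_switch_requested = 0;
        save_and_restore();
    }
}

int checkpoint_demo(void)
{
    clean_point();                /* no request pending: does nothing */
    vcp_switch_requested = 1;     /* scheduler asks for the VCP */
    clean_point();                /* state round-trips through memory */
    return switches_taken;
}
```

In a real VCP program the role of clean_point() would be played by the VCCS instruction, and the Talisman rule of thumb above bounds the distance between successive occurrences to roughly 100 microseconds.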