Improved Multimedia Server I/O Subsystems

Hadj Batatia

In a disk-network scenario where expensive data transfers are the norm, such as multimedia streaming applications, a fast-path I/O architecture is generally considered to be “good practice.” Here, I/O performance can be improved through minimizing the number of in-memory data movements and context switches. In this paper, we report the results of the design and implementation of a high-performance streaming server using cheap hardware units assembled directly on a test card (i.e., NS card). The hardware part of our architecture is open to further reuse, extension, and integration with other applications even in the case of inexpensive and/or faster hardware. From the viewpoint of software-aided I/O, we offer Stream Disk Array (SDA) for scatter/gather-style block I/O, EXT3NS multimedia file system for large-scale file I/O, and interoperable streaming server for stream I/O.

In this paper we develop a VHDL-based scalable architecture of a multimedia-on-demand server model. The three server subsystems (the control subsystem, the communication subsystem and the storage subsystem) are modeled. The communication subsystem is designed as an interconnection network with meshed-pipe structure and delivers packets to their destination through the sub-optimal shortest path. A model for the video traffic generated at the server side for modeling the data sent to the users is also presented. We use simulation experiments to validate our modeling approach and compare the results with those obtained using the OPNET simulation tool. The comparison results show that our approach is accurate compared to the simulation results. Our modeling approach also shows that using VHDL (Very high speed integrated circuits Hardware Description Language) in modeling digital systems is powerful and accurate.

This paper analyzes the particular requirements that multimedia communication imposes on the network adapter and the I/O subsystem of a workstation. We show the drawbacks of current (parallel) communication subsystems and develop new architectural concepts applicable to multimedia communication subsystems. The key ideas are the separation of isochronous and asynchronous traffic, parallelism between sender and receiver, and the execution of per-byte operations in a hardware pipeline. We adapt these concepts to the design of a Gb/s adapter and to the design of a light-weight, slower speed adapter and present some scenarios for their application.

Improved Multimedia Server I/O Subsystems

Michael Weeks, Hadj Batatia, Reza Sotudeh
Computer Architecture Research Unit, University of Teesside, Middlesbrough, England, UK.
{michael.weeks, h.batatia, r.sotudeh}@tees.ac.uk
Telephone: 44+ (1642) 342494, Fax: 44+ (1642) 342401

1089-6503/98 $10.00 © 1998 IEEE

Abstract

The main function of a continuous media server is to concurrently stream data from storage to multiple clients over a network. The resulting streams will congest the host CPU bus, reducing access to the system's main memory, which degrades CPU performance. The purpose of this paper is to investigate ways of improving I/O subsystems of continuous media servers. Several improved I/O subsystem architectures are presented and their performances evaluated. The proposed architectures use an existing device, namely the Intel i960RP® processor. The objective of using an I/O processor is to move the stream and its control away from the host processor and the main memory. The ultimate aim is to identify the requirements for an integrated I/O subsystem for a high performance scalable media-on-demand server.

1. Introduction

Continuous media, such as audio and video, have different characteristics compared to conventional data. They are typically data intensive, even when compressed, and time dependent. These characteristics place a number of requirements on the servers [1 & 2], the communication, and even on the end-system configuration. At the server level, these isochronous media impose many constraints mainly on the architecture, the storage [3], and the operating system. They require real-time handling especially at the I/O subsystem.

The design of a media-on-demand (audio or video on demand) server must take into account a number of issues. The following sections describe some of the design factors.

a) QoS
The server should provide streams to the client with a guaranteed Quality of Service (QoS), by implementing disk scheduling and admission control algorithms. Real-time disk scheduling routines ensure continuity of the media stream by determining the most efficient method for retrieving rounds of data from the hard disk. This is more easily implemented with 'Constant Bit Rate' (CBR) coded streams than with 'Variable Bit Rate' (VBR) coded streams. Similarly, read-only files enhance disk-scheduling performance due to contiguous data placement on disk. Admission control algorithms guarantee end-to-end performance by preventing stream overload. Admission is not only a matter of guaranteeing QoS; other features may also be necessary. For example, the media contents of a video-on-demand server are a marketable commodity, therefore security measures will be necessary to validate the user before access permission is granted, and accounting services will be required to charge the users.

b) Transfer rate
The transfer rate should be sufficiently high to support multiple simultaneous clients. The transfer rate is dependent not just upon hardware, but also upon the operating system and the application software.

c) Server access style
Two access technologies influence multimedia server design, client 'pull' and server 'push' [4]. Client pull technology is similar to that used for file servers for handling text and other aperiodic data types, whereby the client explicitly requests data from the server. In this case, the design of the client application requires greater complexity than the design of the server. Server push is the traditional choice for continuous media server designs as it is more suited to the provision of, and interaction with, concurrent streams. To initiate a media stream the client transmits a request to the server, whereupon the server delivers, and manages, the selected data stream to the client. This type of server requires a more complex design, as it must store the state of each media stream.

d) Scalability
A scalable architecture allows an increase in the number of client streams for a proportional increase in the cost.

e) Interactivity
With multiple media files on a server, the user must be able to browse or search the server content before making a selection. In addition to file management, the server should store metadata that characterises each file's content. To provide true 'media-on-demand' features, the user must be able to interact with the data stream to perform functions such as PLAY, STOP, FAST FORWARD, REW, PAUSE, etc.

The purpose of this paper is to investigate the I/O subsystems of continuous media servers. Improved I/O subsystem architectures based on current technologies are suggested. The ultimate aim is to identify the requirements for an integrated I/O subsystem for a high performance scalable media-on-demand server.

2. Investigation

The server design under initial investigation is the traditional single CPU system that utilises 'push' technology (Figure 1). To simplify matters, we consider only non-editable CBR coded media streams. The system described utilises a single PCI bus for its I/O devices, which for this case study are SCSI for storage, and ATM for the network interface. The software drivers for the I/O devices utilise a double buffering scheme in the system's main memory, which enables the smooth transfer of media data from the SCSI adaptor to the network interface card.

Figure 1: Architecture target for improvement

Although client stream interaction will occur, it will be random and infrequent, and can therefore be considered negligible from the viewpoint of I/O subsystem design. The most frequent state for a stream will be playback mode, whereby multimedia data is streamed to the client without any user interaction.

During playback, the majority of the CPU's local bus traffic will be due to media data streaming from the storage device, via the dual buffering scheme in primary memory, to the network device. This duplicated traffic on the CPU local bus amounts to more than twice the data actually being transferred. This creates a bottleneck when accessing primary memory, which degrades CPU performance. Consequently, this bottleneck makes the scalability of the single CPU design very poor.

Using semi-autonomous I/O devices, CPU stream control can be reduced substantially. Instead of the CPU supervising the transfer of every item of data, it simply instructs the I/O device to transfer a block of data. On completion, the I/O device informs the CPU by the use of interrupts. For one stream these interrupts can amount to hundreds per second, each requiring a CPU response that switches context and executes an interrupt routine. With 10^2 to 10^3 media streams, this can amount to a sizeable proportion of the CPU's processing time.

3. I/O subsystem improvement

To maximise CPU utilisation, the stream and its control must be migrated from the processor. To achieve this, we looked at utilising current technology, in particular Intel's i960RP® intelligent I/O processor, modelled in several variations. The next section contains an overview of the i960RP® device.

3.1 i960RP®

The i960RP® is a high performance embedded processor that has been designed for use as an intelligent I/O processor [5]. The device's main features are (Figure 2):

• i960JF® core processor;
• PCI-to-PCI bridge unit;
• Primary and secondary PCI Address Translation Units (ATU);
• Messaging Unit (MU);
• Primary and secondary PCI DMA units;
• Memory controller;
• Bus arbitration units.

Figure 2: Simplified block diagram of the i960RP®

The core processor is a 32-bit superscalar RISC design that operates at 33 MHz, and utilises interleaved 32-bit memory via the 80960 local bus. This bus is a 32-bit wide local bus with multiplexed address and data lines. The i960RP® connects to a host processor via its primary PCI bus and appears as a multi-function PCI device.

The ATUs are the interfaces between the PCI buses and the 80960 local bus. The i960RP® contains two ATUs, one for the primary and the other for the secondary PCI bus. The ATUs can burst transfer up to 2 kB, and allow inbound and outbound address translations. They can handle multiple inbound transactions by simultaneously processing PCI read and write transactions. Address translation is achieved using an address windowing scheme that determines which addresses to claim and translate.

The PCI-to-PCI bridge operates as an address filter between the two PCI buses, in addition to extending the number of loads that a PCI bus may have. The bridge is programmed with a range of addresses that determine the secondary address space. All PCI read transactions traversing a PCI-to-PCI bridge are processed as delayed transactions.

3.2 Single I/O processor

Removing the bottleneck caused by media streaming to main memory would enhance the performance of the system under investigation. A first step in the design improvement consists of migrating the dual buffering scheme from the main memory to the i960RP® local memory, thereby increasing the host CPU's processing efficiency. In such a case, the i960® core processor is idling. However, host CPU efficiency can be further improved by migrating stream control, and in particular I/O device interrupt processing, to the i960RP®. To investigate this further, the PCI bus and i960RP® PCI-to-PCI bridge handling of interrupts are first presented.

The vehicles for interrupt passing between the system devices are the PCI bus INTx# lines. With the i960RP® connected to the primary PCI bus, the PCI bus interrupt lines are as shown in Figure 3.

Figure 3: PCI interrupt lines

The PCI-to-PCI bridge may individually route the secondary PCI bus interrupt lines onto the primary bus interrupt lines, or to the i960RP® core processor, depending upon the contents of the associated memory mapped register. The interrupt lines always travel upstream. The ATM and SCSI devices use the interrupt lines to signal to their drivers that they have finished their current task. Therefore a software driver must be executing on a processor upstream of the corresponding I/O device, in order to avoid complicated and time-consuming interrupt routing schemes. Providing maximum system performance requires rapid interrupt processing, which restricts possible designs to the following.

a) The ATM and SCSI devices are on the secondary PCI bus. The device drivers for both PCI devices reside on the i960RP®;
b) The ATM device is installed on the primary PCI bus, with its driver on the host processor. The SCSI adaptor is installed on the secondary PCI bus with its driver on the i960RP®;
c) The SCSI device is installed on the primary PCI bus, with its driver on the host processor. The ATM adaptor is installed on the secondary PCI bus with its driver on the i960RP®;
d) Any of the above device interconnections, but with the device drivers staying on the host processor, and the i960RP® acting as a PCI memory controller/PCI-to-PCI bridge.

The first connection scheme can be ignored as we wish to balance the sub-streams between the two PCI buses. Similarly, the objective is to remove the I/O drivers from the host processor, so the fourth scheme is not relevant. The second and third designs are compromises; the second design has been evaluated (Figure 4).

Figure 4: Proposed I/O subsystem using a single i960RP®

Performance figures were calculated based on data streaming over the system buses. These figures incorporate the effects of interrupt latencies but do not include the effects of the operating system on performance. We have assumed that no PCI transfer will be broken into multiple transactions. This assumption will not hold for high streaming scenarios, as the ATU queues are of insufficient size for media transfer, especially the SCSI traffic. Table 1 shows these performance figures. It illustrates the scale of traffic over the system buses and their percentage utilisation.

Table 1: Bus utilisation (table body not recoverable)

The chart in Figure 5 shows the comparative performances of the system using a single I/O processor with the original target architecture, based on worst-case bus traffic. From this comparison it must be noted that the streaming affects the software running on the original target architecture, whereas the streaming on the i960RP® only affects the SCSI software drivers.

Figure 5: Performances with a single I/O processor

With the proposed design (using a single I/O processor), the host CPU has recovered its access to main memory, previously lost to the stream, enabling it to manage more streams. However, each stream on the PCI bus has become less efficient, due to the i960RP® ATU delayed PCI transactions.

A new bottleneck has appeared on the i960RP® local bus. For ninety streams, the 80960 local bus would be overloaded when streaming data to and from i960RP® memory, whereas the PCI buses are under-utilised. This is due to the two 32-bit, 33 MHz PCI buses trying to access a single 32-bit, 33 MHz local bus. This clearly shows that the proposed architecture would not be a large improvement over the target system. This analysis does not contain any inter-processor communication, which would be necessary for communication between the operating system and the SCSI driver. This additional overhead would further reduce the performance of the proposed architecture.

The scalability of this architecture can be achieved by introducing a PCI-to-PCI bridge to isolate the I/O subsystem. This obviously incurs an added cost, but allows multiple i960RP® devices to be attached to the system for added stream capability.

3.3 Dual I/O processor

The single i960RP® device used in the previous design removed the stream from the main memory, but created another bottleneck at its own memory. The i960RP® could not run both I/O software drivers due to the interrupt problem stated earlier, therefore the ATM driver was operated from the host. One solution to this I/O driver problem could be to utilise two I/O processors, one for each of the I/O devices. The drivers could reside in their respective i960RP® devices, whilst the memory space of one I/O processor could contain the dual buffering scheme. Whilst this would remove the drivers and their interrupts from the host CPU, there would still be the i960RP® memory bottleneck. An improvement to this design would be an alternating dual buffering scheme, whereby the buffers would be equally split between the memory spaces of the two I/O processors. Operation for a single stream would be as shown in Figure 6 and Figure 7.

Figure 6: Proposed I/O subsystem with two i960RP®

Figure 7: Alternate substreams

Table 2 shows the bus utilisation in cycles per second per stream, for all of the components in the dual i960RP® I/O subsystem.

Table 2: Bus utilisation for a single stream (cycles/sec/stream) (table body not recoverable apart from one row of values, apparently the means: 232699, 203954, 186535, 187350, 153498)

With multiple concurrent streams, the mean value will be the important figure, and as can be seen from Table 2 the sub-streams have been more closely balanced around the system buses. The alternating buffer scheme has removed the I/O processor memory bottleneck, with the most activity being on the secondary PCI bus to which the SCSI adaptor is attached. Plotting this data onto a graph (Figure 8) illustrates the comparative performance between the dual i960RP® design and the initial target architecture.

Figure 8: Comparative performances

It can be clearly seen that this design has reduced maximum bus utilisation by 33%, but at the expense of increased complexity and cost. Again, scalability will only be achieved at the cost of an additional PCI-to-PCI bridge.

4. Conclusion

This paper has focused on the design of an I/O subsystem for a continuous media server. Several improved architectures have been proposed and their performances evaluated. All the proposed architectures were designed using an existing device, namely the Intel i960RP® processor.

The utilisation of the single i960RP® I/O processor solved the main memory bottleneck problem, but created a new bottleneck in i960RP® memory. This has highlighted the requirement for a streaming memory bandwidth twice that of the PCI bus. The proposed twin i960RP® I/O subsystem, utilising an alternating dual buffer arrangement, removed this bottleneck but at the expense of scalability, complexity, and cost.

This investigation clearly shows the need for an integrated I/O processor, optimised for continuous media. Such a processor would incorporate the following characteristics.

a) Two separate subordinate PCI buses for the I/O devices to isolate the sub-streams;
b) Memory bandwidth twice that of a single PCI bus;
c) Larger PCI-memory buffer queues, optimised for the transmission of media data;
d) Low interrupt latency to reduce the time taken to process streams;
e) High scalability so that multiple devices can be attached to the primary PCI bus to increase the number of streams.

Figure 9 shows the system's architecture using such a hypothetical I/O processor. Ongoing research is investigating the feasibility and characteristics of this architecture.

Figure 9: Scalable Server Architecture utilising Media I/O processors

5. References

[1] Gemmell, D. J., Vin, H. M., Kandlur, D. D., Venkat Rangan, P., and Rowe, L. (1995). Multimedia Storage Servers: A Tutorial and Survey. IEEE Computer, 28 (5), pp. 40-49.
[2] Shenoy, P., Goyal, P., and Vin, H. M. (1995). Issues in Multimedia Server Design. ACM Computing Surveys, 27 (4), pp. 636-639.
[3] Lougher, P., and Shepherd, D. (1993). The design of a Storage Server for Continuous Media. The Computer Journal, 36 (1), pp. 32-42.
[4] Rao, S., Vin, H. M., and Tarafdar, A. (1996). Comparative Evaluation of Server-push and Client-pull Architectures for Multimedia Servers. In Proceedings of 6th Networks and Operating Systems Support for Digital Video and Audio, April 1996.
[5] Gillespie, B. (1996). PCI Intelligent I/O Design for High Performance Servers. Intel® white paper.
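As an illustration of the interrupt load discussed in Section 2 (Investigation), the following Python sketch estimates how much of one CPU is consumed by I/O completion interrupts as the number of streams grows, and computes the nominal peak of a 32-bit, 33 MHz PCI bus. The interrupt rate of 200 per second per stream and the cost of 10 microseconds per interrupt are assumed values chosen only to make the arithmetic concrete; they are not figures from the paper, and the function names are hypothetical.

```python
def pci_peak_bytes_per_s(width_bits=32, clock_hz=33_000_000):
    """Nominal peak of a PCI bus: one transfer of width_bits per clock."""
    return width_bits // 8 * clock_hz


def interrupt_cpu_fraction(streams, irq_per_stream_per_s, seconds_per_irq):
    """Fraction of one CPU spent servicing I/O completion interrupts."""
    return streams * irq_per_stream_per_s * seconds_per_irq


# Assumed, illustrative values (not taken from the paper):
IRQ_RATE = 200      # interrupts per second per stream ("hundreds")
IRQ_COST = 10e-6    # seconds per interrupt, including the context switch

for n in (100, 1000):
    frac = interrupt_cpu_fraction(n, IRQ_RATE, IRQ_COST)
    print(f"{n} streams: {frac:.0%} of one CPU in interrupt handling")

print(pci_peak_bytes_per_s() / 1e6, "MB/s nominal peak per PCI bus")
```

Under these assumptions the interrupt cost alone exceeds the capacity of one CPU before 10^3 streams are reached, which is the motivation for moving interrupt processing onto the I/O processor.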

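The alternating dual buffering scheme of Section 3.3 can be sketched as a toy model in Python. It counts only block transfers touching each I/O processor's local memory, assuming each media block is written once by the SCSI adaptor and read once by the ATM adaptor, and assuming a hypothetical placement rule (even blocks in the first I/O processor, odd blocks in the second). It does not model PCI transactions, ATU queues, or interrupts, so it does not reproduce the paper's measured figures.

```python
from collections import Counter


def memory_traffic(num_blocks, alternating):
    """Count block transfers touching each I/O processor's local memory.

    Returns a Counter keyed by I/O processor index (0 or 1).
    """
    traffic = Counter({0: 0, 1: 0})
    for k in range(num_blocks):
        # Hypothetical placement rule: even blocks in IOP 0, odd in IOP 1.
        iop = k % 2 if alternating else 0
        traffic[iop] += 1  # SCSI adaptor writes the block into the buffer
        traffic[iop] += 1  # ATM adaptor reads the block out of the buffer
    return traffic


single = memory_traffic(1000, alternating=False)
split = memory_traffic(1000, alternating=True)
print("both buffers in one memory:", dict(single))
print("alternating buffers:       ", dict(split))
```

In this simplified model, splitting the buffers halves the traffic seen by the busiest local memory, which is the effect the alternating arrangement is intended to achieve.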