J Supercomput (2009) 50: 99–120
DOI 10.1007/s11227-008-0254-5

Fast-path I/O architecture for high performance streaming server

Yong-Ju Lee · Yoo-Hyun Park · Song-Woo Sok · Hag-Young Kim · Cheol-Hoon Lee

Published online: 25 November 2008
© Springer Science+Business Media, LLC 2008

Abstract  In a disk-network scenario where expensive data transfers are the norm, as in multimedia streaming applications, a fast-path I/O architecture is generally considered "good practice": I/O performance can be improved by minimizing the number of in-memory data movements and context switches. In this paper, we report the design and implementation of a high-performance streaming server using inexpensive hardware units assembled directly on a test card (i.e., the NS card). The hardware part of our architecture is open to further reuse, extension, and integration with other applications, even in the case of cheaper and/or faster hardware. On the software-aided I/O side, we offer the Stream Disk Array (SDA) for scatter/gather-style block I/O, the EXT3NS multimedia file system for large-scale file I/O, and an interoperable streaming server for stream I/O.

Keywords  Fast-path I/O · Zero-copy · EXT3NS file system · Network Storage card (NS card)

Y.-J. Lee · Y.-H. Park · S.-W. Sok · H.-Y. Kim
Dept. of Internet Platform Research, SW and Contents Laboratory, Electronics and Telecommunications Research Institute, Daejeon, Korea
Y.-J. Lee e-mail: yongju@etri.re.kr
Y.-H. Park e-mail: bakyh@etri.re.kr
S.-W. Sok e-mail: swsok@etri.re.kr
H.-Y. Kim e-mail: h0kim@etri.re.kr

C.-H. Lee
Dept. of Computer Engineering, Chungnam National University, Daejeon 305-764, Korea
e-mail: clee@cnu.ac.kr

1 Introduction

Rapid advances in computing and the Internet have spawned many new services and applications. Among them, applications such as the World Wide Web (WWW) have achieved great success and have transformed many facets of the community worldwide. Delivery of audio and video, however, poses far greater challenges than general data applications. Unlike web pages, multimedia contents often impose significant increases in both storage space and bandwidth requirements. Coupled with the demand for serving thousands or even tens of thousands of concurrent users, the challenge of designing scalable and cost-effective multimedia systems has been an area of intense research in the past decade.

In general, data delivery systems require data copy operations and several system calls. Data copy operations consume CPU cycles, memory, bus, and interface resources, and the system calls result in many switches between user and kernel space. In order to eliminate redundant data transfers and to improve the scalability of multimedia streaming servers, our fast-path I/O solution required studies of the following:

• Network Storage card (NS card): It consists of disk, network, and memory units integrated into a server via a PCI interface. The three units on one PCI board are peripheral devices offering functionality without host intervention. Our current NS card prototype is built from inexpensive units (i.e., a SCSI controller for the disk interface, dual gigabit Ethernet controllers for the network interface, and 512 MB of PCI memory).
• Zero-copy transmission and pipelined I/O: When a user requests a block, the NS card device driver splits the block into "n" small partitions.
This is similar to S/W striping in RAID; in particular, when a device driver reads data from multiple disks, differences in disk performance characteristics necessarily introduce waiting time, and in the worst case the read speed depends on the slowest disk. The NS card device driver, however, does not wait for the entire I/O to complete. As soon as a small partition's disk I/O operation is complete, the next partition on that disk is read. Because this is a pipelined I/O operation, all disks operate in nonblocking I/O mode. After reading all of the blocks of a user request, zero-copy transmission is achieved without any host intervention.
• Dedicated multimedia file system: A low-level device driver for the NS card performs block I/O requests by issuing a read or write command on a block special file. Since this offers no meaningful semantics, such as file I/O, for manipulating multimedia files, we have developed a scalable multimedia file system, called "EXT3NS", that is designed to handle streaming workloads on an NS card and extends the EXT3 file system. It provides the standard APIs to read and write files in the storage unit on the NS card, and it supports both legacy I/O and fast-path I/O operations.
• Deployment of fast-path I/O for streaming applications: The EXT3NS multimedia file system can offer a large block size defined by the disk striping driver of the NS card and can efficiently store a large number of multimedia files. Our media-streaming engine deploys this fast-path I/O interface. In the case of fast-path I/O, the streaming engine sets the PMEM pointer in the buffer structure to read a multimedia file. This structure is sent to the NS card disk driver, and the data is written into the PMEM memory area. Furthermore, the zero-copy transmission layer also sets the PMEM pointer in the socket buffer structure for sending the file. For legacy I/O, the streaming engine reads the multimedia file into main memory, after which general data transfers are performed.

Our fast-path I/O implementation of an in-kernel data path, together with a streaming application that serves multiple clients, is a specialized system both for improving I/O performance and for handling High Definition (HD) streaming with ease.

The rest of this paper is structured as follows. In Sect. 2, we review the literature on copy-free data paths. In Sect. 3, we discuss our fast-path I/O architecture and implementation in detail. In Sect. 4, we demonstrate the efficacy of our system through extensive experimentation, and in Sect. 5 we highlight various deployments of our high-performance streaming server. Finally, in Sect. 6, we conclude.

2 Related work

Research into the improvement of system performance for data delivery applications has received much attention. The IO-Lite prototype avoids redundant data copying and multiple buffering and allows for performance optimization across subsystems [1]. It uses a single copy of data that can be shared among interprocess communication (IPC), the file system, user buffers, and SCSI/IDE subsystems. To take advantage of IO-Lite, applications use an extended I/O application programming interface (API), e.g., IOL_read() and IOL_write(), that is based on buffer aggregates. Sharing buffers in IO-Lite introduces several problems, such as concurrent writing, physical consistency, and access control. It was implemented only in the FreeBSD operating system and some derivative works.
The Massively-parallel And Real-time Storage (MARS) project proposed and implemented a new kernel buffer management system, called Multimedia Mbuf (MMBUF), which shortens the data path from a storage device to the network interface [2]. It also provides a new API consisting of stream_open, stream_read, and stream_send system calls. However, the MMBUF mechanism is less flexible because its buffers are statically allocated.

In the Intermediate Storage Node Concept (INSTANCE) project [3, 4], a new architecture for Media-on-Demand storage nodes was proposed. It maximizes the number of concurrent clients a single node can support through a zero-copy-one-copy memory architecture, Network Level Framing (NLF), and integrated error management. It inherits from the earlier MMBUF mechanism and makes the following modifications to further increase performance: allocation and deallocation of MMBUFs to reduce the time spent handling buffers, and a network send routine that allows UDP/IP processing (the native MMBUF mechanism used ATM). An outstanding feature of INSTANCE is NLF, which reduces the per-stream resource requirements; combined with the in-kernel data path, it reduces kernel time and increases the total number of concurrent streams by a factor of two. However, the NLF mechanism covers neither the whole communication system nor the integration of NLF with on-board processing.

LyraNET is a zero-copy TCP/IP protocol stack embedded as a TCP/IP target component. It is derived from the Linux TCP/IP code and remodeled as a reusable software component independent of operating systems and hardware [5]. It provides a good reference for embedding the Linux TCP/IP stack into a target system that requires network connectivity, and it improves transmission efficiency through its zero-copy implementation.

Other data-copy-avoidance architectures, such as integrated layer processing [6, 7] and direct I/O, which in some form is available in several commodity OSs today (for example, Solaris [8, 9]), attempt to minimize data transfers and to eliminate copy operations across the user/kernel boundary. Another recent approach to reducing data copy operations is to use specialized hardware in the network adapters. Ethernet Message Passing (EMP) is an OS-bypass messaging layer for gigabit Ethernet on Alteon NICs in which the entire protocol processing is done at the NIC [10]. It is a NIC-level implementation of a zero-copy message-passing layer and exhibits good performance.

In summary, past fast-path I/O solutions have aimed at minimizing data copy operations and eliminating CPU burden, but they either lack dedicated hardware or are limited in scope (e.g., to the NIC or the kernel). They also require dedicated APIs with their own mechanisms and do not cover the combinations of techniques (such as disk-to-block I/O, block I/O-to-file I/O, and file I/O-to-streaming) that yield performance improvements. As a result, their streaming I/O throughput does not come close to the ideal network bandwidth.

3 Fast-path I/O architecture

3.1 Motivation

Rapid improvements in hardware technology, such as 10 GbE and InfiniBand, can provide temporary or permanent solutions for dealing with growing network traffic. Cheap memory and new multi-core CPUs also lead to significant reductions in computation time. However, improvements in media streaming systems seem to have eluded these hardware improvements.
In order to support efficient streaming of media, a good solution requires flexible choices that allow interoperability between disk I/O and network I/O operations. Hardware-aided I/O methods (e.g., 10 GbE, Myrinet, and InfiniBand) provide high-speed transmission rates. However, these methods address data transmission issues in networks, not disk-data reading issues. Because VoD servers cannot serve data at the maximum transmission rate from dedicated disks, blocks of a given media file may be replicated and stored on many VoD servers across the network, which effectively increases the maximum transmission rate to the end user. Storage area networks (SAN) and iSCSI are good choices for improving disk I/O throughput, and combining the benefits of a high-speed network with such storage solutions gives providers an ideal platform for hosting VoD servers. Unfortunately, these combined solutions are not cost effective. Note that it would certainly be beneficial to improve the overall performance of disk I/O and network I/O without the need for additional hardware.

Software-aided I/O methods (e.g., the sendfile() mechanism, the so-called zero-copy functionality under Linux) eliminate some of the copying between kernel and user buffers. They are among the cheapest ways to improve streaming throughput, as they require no additional hardware. Many operating systems implement the sendfile() system call and a zero-copy TCP/IP protocol stack, which enable a process to transfer data in the file system cache directly to the network interface. Although sendfile() (and other interfaces) operates successfully under a particular load, the implementation of zero copy under Linux is far from finished and is likely to change in the near future; more functionality (e.g., vectored transfers, abundant TCP options) should be added.
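As a point of reference for this software-aided baseline, the following sketch streams a file over a connected TCP socket with the Linux sendfile(2) system call, so the payload never passes through a user-space buffer. The descriptor setup and the 512 KB chunk size are illustrative assumptions, not part of the original system.

    /* Minimal sketch of the software-aided baseline: the file data flow
     * from the page cache to the socket without a user-space copy.      */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int stream_with_sendfile(int sock_fd, int file_fd)
    {
        struct stat st;
        if (fstat(file_fd, &st) < 0)
            return -1;

        off_t offset = 0;
        while (offset < st.st_size) {
            /* sendfile() advances `offset` by the number of bytes sent. */
            ssize_t sent = sendfile(sock_fd, file_fd, &offset, 512 * 1024);
            if (sent <= 0)
                return -1;
        }
        return 0;
    }

Even with such an interface, the file data are sent as-is: application-generated headers cannot be interleaved with the payload in the same call, which leads to the limitation discussed next.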
Most of all, one rather unpleasant limitation of the current hardware- and software-aided I/O methods is their inability to handle more in-depth functionality (i.e., arbitrarily interleaving blocks from files and user-space buffers to be sent as one or more packets). For example, multimedia systems support seamless transfers in networks (e.g., appending RTP headers) as well as protection from unauthorized access (e.g., malformed stream headers). Another example is "trick play," also known as transport controls: instant replay, rewind, and so on. This trade-off therefore points toward integrating hardware- and software-aided architectures.

In this paper, we have developed a fast-path I/O architecture on a sample network storage card that physically integrates three individual controllers. Even when a server has three individual cards (a dual NIC, a SCSI controller, and a PCI memory card), our solution has no limitations with respect to fast-path I/O. Furthermore, it is open to further reuse, extension, and integration with other applications, even in the case of cheaper and/or faster hardware. In terms of software-aided I/O, we have implemented the Stream Disk Array (SDA) for scatter/gather-style block I/O, the EXT3NS file system for large-scale file I/O, and an interoperable streaming server for stream I/O.

3.2 NS card

Figure 1 illustrates a sample hardware unit for testing and evaluation. It consists of three peripheral devices (i.e., dual GbE controllers, dual SCSI controllers, and a PCI memory). In terms of software device drivers, it has Stream Disk Array (SDA), PCI Memory (PMEM), and TOE drivers.

Fig. 1 Network Storage card (NS card)

Streaming data copied to the PMEM is transmitted directly to the network without additional memory copy operations. The SDA is a special-purpose disk array optimized for large sequential disk access via pipelined I/O. The peripheral memory is equipped with DRAM modules and is dedicated to temporarily buffering the disk-to-network data path without user context switching. Lastly, the TOE provides protocol-offload network interfaces that require no modification from the application developer's point of view.

The existing TCP/IP protocol stack cannot transmit video data from PMEM to the network interface directly, since it assumes that the payload data reside in main memory rather than in PMEM. Consequently, the protocol stack is modified to access data in PMEM and to transmit streaming data from PMEM directly to the network, while the existing TCP/IP path continues to operate on main memory.

3.3 Zero-copy transmission and pipelined I/O

Fig. 2 Comparison of traditional I/O and zero-copy I/O

Figure 2 compares traditional and zero-copy I/O. In traditional I/O, data is first transferred from the disk to main memory. It is then managed by many subsystems within the OS that are designed with different objectives in mind, run in their own domains (either in user or kernel space), and therefore manage their buffers differently. Due to different buffer representations and protection mechanisms, data is usually copied from domain to domain (e.g., from the file system to the application, from the application to the communication system), thus allowing the different subsystems to manipulate the data in question. Finally, the data is transferred to the network interface. In addition to all these data transfers, the data is loaded into the cache and CPU registers, where it is manipulated for checksumming.

Without zero-copy I/O, data on disk are copied to memory, duplicated many times for protocol packet handling, and then sent to the communication network. Clearly, sending the data directly to the network without duplication improves overall system performance. The size of streaming data on disk ranges from several hundred megabytes to gigabytes in continuous form; note that current file systems, which normally handle blocks of 512 bytes to several kilobytes, handle such streaming data inefficiently.

Fig. 3 Split blocks and pipelined I/O

The Stream Disk Array (SDA) is a new disk array control mechanism that provides at least 1 Gbps of bandwidth even for the worst block distribution. The SDA driver consists of a stream disk array interface, a request queue, a pipelined I/O manager, and a block splitter. The block splitter is interfaced with the Linux block I/O driver, which provides the function generic_make_request() that makes raw block I/O access possible. It divides a logical block into n small partitioned blocks, which are allocated to the individual disks. This allocation method splits each block across all of the disks in the array, whereas RAID interleaves whole blocks across disks. The logical block is divided by the number of disks in the array: the first split block is located on the first disk, and the N'th split block is located on the N'th disk. Using the block splitter, the pipelined I/O manager issues read requests for the first and second split blocks; whenever the N'th read operation completes and an (N + 2)'th read request exists, it issues the (N + 2)'th read request. Thus, the operations of the pipelined I/O manager proceed in parallel. Figure 3 shows an example of split blocks and pipelined I/O.
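To make the split-block, pipelined read pattern concrete, the following user-space sketch assumes one file descriptor per member disk and a fixed partition size; a worker per disk issues its next partition read as soon as the previous one completes, so a slow disk never stalls the others. The real SDA driver performs this inside the kernel below the block layer; the names and sizes here are illustrative only.

    /* Sketch of split-block, pipelined reads: logical block `blk` of size
     * NDISK * PART_SZ is split into NDISK partitions, one per disk.       */
    #include <pthread.h>
    #include <unistd.h>

    #define NDISK    4
    #define PART_SZ  (256 * 1024)      /* one partition per disk           */
    #define NBLOCK   8                 /* logical blocks to read ahead     */

    struct disk_job {
        int   fd;                      /* descriptor of this member disk   */
        int   disk_no;                 /* 0 .. NDISK-1                     */
        char *dst;                     /* buffer large enough for NBLOCK   */
    };

    static void *disk_worker(void *arg)
    {
        struct disk_job *job = arg;
        for (int blk = 0; blk < NBLOCK; blk++) {
            /* Partition `disk_no` of logical block `blk` lives at offset
             * blk * PART_SZ on this disk and fills its slot in `dst`.     */
            off_t disk_off = (off_t)blk * PART_SZ;
            char *slot = job->dst + (size_t)blk * NDISK * PART_SZ
                                  + (size_t)job->disk_no * PART_SZ;
            if (pread(job->fd, slot, PART_SZ, disk_off) != PART_SZ)
                break;                 /* stop this pipeline on error      */
        }
        return NULL;
    }

    /* Launch one read pipeline per disk and wait for all of them.         */
    static void pipelined_read(int fds[NDISK], char *dst)
    {
        pthread_t tid[NDISK];
        struct disk_job jobs[NDISK];

        for (int d = 0; d < NDISK; d++) {
            jobs[d] = (struct disk_job){ .fd = fds[d], .disk_no = d, .dst = dst };
            pthread_create(&tid[d], NULL, disk_worker, &jobs[d]);
        }
        for (int d = 0; d < NDISK; d++)
            pthread_join(tid[d], NULL);
    }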
The PCI memory (PMEM) driver is the software implementation for PMEM. With the help of an NS character device driver, the PMEM driver provides user processes with direct access through the PCI bus. The initialization part of the PMEM driver includes detecting, enabling, and setting the configuration registers of all of the PMEM units in the local server. Like the memory management of a general operating system, the PMEM driver manages the PMEM units used by user processes via the zero-copy mechanism. The PMEM management unit can be varied among 128 KB, 256 KB, 512 KB, 1 MB, and 2 MB to cope with different types of transferred multimedia data.
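The following user-space sketch illustrates this kind of fixed-size block management: a mapped aperture is carved into equal units and handed out with a simple bitmap. The real PMEM driver does this in the kernel against the card's PCI memory; the structure and function names below are assumptions made for illustration.

    /* Illustrative fixed-size block pool over a mapped memory aperture.   */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct pmem_pool {
        uint8_t *base;        /* start of the mapped memory aperture       */
        size_t   unit;        /* management unit: 128K/256K/512K/1M/2M     */
        size_t   nunits;      /* number of units in the aperture           */
        uint8_t  used[4096];  /* 1 bit per unit (enough for small pools)   */
    };

    static void pmem_pool_init(struct pmem_pool *p, void *base,
                               size_t aperture, size_t unit)
    {
        p->base   = base;
        p->unit   = unit;
        p->nunits = aperture / unit;
        memset(p->used, 0, sizeof(p->used));
    }

    /* Return one unit, or NULL if the pool is exhausted.                  */
    static void *pmem_block_alloc(struct pmem_pool *p)
    {
        for (size_t i = 0; i < p->nunits; i++) {
            if (!(p->used[i / 8] & (1u << (i % 8)))) {
                p->used[i / 8] |= (1u << (i % 8));
                return p->base + i * p->unit;
            }
        }
        return NULL;
    }

    static void pmem_block_free(struct pmem_pool *p, void *blk)
    {
        size_t i = ((uint8_t *)blk - p->base) / p->unit;
        p->used[i / 8] &= ~(1u << (i % 8));
    }

A single fixed unit size keeps allocation simple and avoids fragmentation for the large, uniform buffers that a streaming workload uses.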
Table 1 shows the pseudocode of the udp_sendmsg() function in kernel space. The udp_sendmsg() function gathers data from several transmit buffers before transmission. The msghdr structure among udp_sendmsg()'s parameters specifies the buffer argument of the sendmsg I/O function and allows an array of scatter/gather buffers to be specified. If an iov_base pointer indicates a PMEM address, the pmem_include value is set to "1."

Table 1 Pseudocode of udp_sendmsg()

    int udp_sendmsg(struct kiocb *iocb, struct sock *sk,
                    struct msghdr *msg, size_t len)
    {
        if (0 <= (pmem_id = (*pmem_check)((*__ns_virt_to_phys)(iov->iov_base))))
            pmem_include = 1;

        if (pmem_include)
            err = ip_append_pmem(sk, ip_generic_getfrag, msg->msg_iov, ulen,
                                 sizeof(struct udphdr), &ipc, rt,
                                 corkreq ? msg->msg_flags | MSG_MORE
                                         : msg->msg_flags,
                                 user_payload_len);
        else
            err = ip_append_data(sk, ip_generic_getfrag, msg->msg_iov, ulen,
                                 sizeof(struct udphdr), &ipc, rt,
                                 corkreq ? msg->msg_flags | MSG_MORE
                                         : msg->msg_flags);
    }

This value specifies that the ip_append_pmem() function appends a PMEM buffer for transmission; otherwise, the general ip_append_data() is called. Like udp_sendmsg(), the tcp_sendmsg() function within the TCP module detects a PMEM address for zero-copy transmission, depending on the iov_base pointer.

3.4 EXT3NS multimedia file system

EXT3NS is a file system built on top of the NS card. It enables applications to access the fast-path I/O of the NS card via the standard read and write system call interfaces, and it provides a legacy VFS-layer interface for existing file systems and related applications. The file system can easily accommodate the large block size defined in the SDA driver of the NS card, and its disk layout allows efficient storage of a large number of multimedia files. EXT3NS supports both page cache I/O and SDA fast-path I/O; this is possible because the SDA device supports both a fast-path I/O interface and a standard block device driver.

Fig. 4 Comparison of general file I/O and fast-path file I/O

Figure 4 illustrates the location of the EXT3NS file system in the Linux kernel and shows how it interacts with user/kernel space [11]. The leftmost vertical flow in the figure indicates the conventional streaming operation in a general-purpose Linux file system. The middle and rightmost flows depict general streaming operations and fast-path operations in our file system, respectively. When an application reads data from disk via the read system call, EXT3NS determines, using the user buffer address passed as an argument, whether the read targets the PMEM area of the NS card or system main memory. If it targets the PMEM area, EXT3NS performs the fast-path I/O, for example ext3ns_sda_file_read(), by using the NS device driver. If it targets main memory, EXT3NS performs legacy page cache I/O; in that case, EXT3NS simply reads from disk to main memory by calling a generic read function (e.g., generic_file_read() in the case of a mounted ext3 file system). Table 2 shows the pseudocode of ext3ns_file_read().

Table 2 Pseudocode of ext3ns_file_read()

    ssize_t ext3ns_file_read(struct file *filp, char *buf,
                             size_t count, loff_t *ppos)
    {
        pmem_addr_t pmem_paddr = (*__ns_virt_to_phys)((const void *)buf);

        if (pmem_paddr && (*pmem_check)((unsigned)pmem_paddr) >= 0)
            ext3ns_sda_file_read(filp, buf, count, ppos);
        else
            generic_file_read(filp, buf, count, ppos);
    }

The key features of the EXT3NS file system are, for the most part, inherited from EXT3; these include, for example, the mkfs utility and logical block addressing. The most notable differences follow. (1) The disk-block size of EXT3NS is much larger than that of EXT3 (e.g., between 128 KB and 2 MB); thus, the number of block groups in EXT3NS is much smaller than in EXT3. (2) The block allocation method is optimized for large multimedia files and sequential access. (3) Zero-copy data movement via the EXT3NS file system is achieved through the character device of the NS card. Hence, EXT3NS uses two types of SDA device I/O: legacy buffered I/O via the block device interface and fast-path I/O via the character device interface.

Fig. 5 Disk layout and data structures used to address the file's data blocks in EXT3NS

Figure 5a shows the layout of an EXT3NS partition and its block groups. All the block groups in the file system have the same size and are stored sequentially. Each block in a block group contains one of the following: a copy of the block group descriptors, a data block bitmap, an inode bitmap, a group of inodes, or data belonging to some file (i.e., a data block). Group descriptors are replicated. The inode and data block bitmaps each occupy exactly one block, so each block group contains one inode bitmap and one data block bitmap. The data block bitmap indicates which data blocks are in use. Data blocks hold unformatted file data or pointers to other data blocks. A block group, including its data structures, is the same as in EXT3; however, one bit of the data block bitmap now denotes an SDA block (e.g., a 1 MB data block) rather than a 4 KB block. Therefore, each block group can contain at most 8 * b blocks, where b is the block size in bytes, and the total number of block groups is roughly s/(8 * b), where s is the partition size in blocks. In the EXT3NS file system, the larger the block size, the smaller the number of block groups.

The EXT3NS file system, like EXT3, has a block-addressing array in the disk inode structure, as shown in Fig. 5b. It has 15 components of 4 different types (a summary formula follows the list):

• Direct addressing covers 12 components; the logical block numbers inside this first-order array range from 0 to 11, addressing file sizes from 0 to 12b, where b is the file system's block size. For instance, assume a block size of 4 KB in EXT3 and 1 MB in EXT3NS; the maximum file sizes addressed this way are 48 KB and 12 MB, respectively.
• Indirect addressing corresponds to file block numbers ranging from 12 to b/4 + 11. In the example above, the maximum file size reachable with a second-order array is 48K + 4M bytes for EXT3 and 12M + 256G bytes for EXT3NS.
• Double indirect addressing corresponds to file block numbers ranging from b/4 + 12 to (b/4)^2 + (b/4) + 11. The maximum file size reachable with a third-order array is 48K + 4M + 4K * 2^20 bytes for EXT3 and 12M + 256G + 1M * 2^36 bytes for EXT3NS.
• Triple indirect addressing corresponds to file block numbers ranging from (b/4)^2 + (b/4) + 12 to (b/4)^3 + (b/4)^2 + (b/4) + 11. The maximum file size is 48K + 4M + 4K * 2^20 + 4K * 2^30 bytes for EXT3 and 12M + 256G + 1M * 2^36 + 1M * 2^54 bytes for EXT3NS.
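For reference, these limits all follow from the same indexed-allocation geometry, assuming 4-byte block pointers (so each block holds b/4 of them); the following is a compact restatement of the per-scheme maxima listed above:

\[
  S_{\max}(b) \;=\; \underbrace{12\,b}_{\text{direct}}
  \;+\; \underbrace{\tfrac{b}{4}\,b}_{\text{indirect}}
  \;+\; \underbrace{\bigl(\tfrac{b}{4}\bigr)^{2} b}_{\text{double indirect}}
  \;+\; \underbrace{\bigl(\tfrac{b}{4}\bigr)^{3} b}_{\text{triple indirect}}
\]

For EXT3, $b = 4\,\mathrm{KB}$ and $b/4 = 2^{10}$, giving $48\mathrm{K} + 4\mathrm{M} + 4\mathrm{K}\cdot 2^{20} + 4\mathrm{K}\cdot 2^{30}$ bytes; for EXT3NS, $b = 1\,\mathrm{MB}$ and $b/4 = 2^{18}$, giving $12\mathrm{M} + 256\mathrm{G} + 1\mathrm{M}\cdot 2^{36} + 1\mathrm{M}\cdot 2^{54}$ bytes.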
3.5 Fast-path streaming operation in multimedia application

Figure 6 shows the streaming engine class diagram, which consists of many subclasses for manipulating streaming requests. Each filled box represents a representative class name, with member classes in lined boxes. First, the MainProcess class controls the RTSP protocol [12, 13], communicates with NSProcess, monitors system health, and manages client connections. Second, the NSProcess class consists of IPCSubController, Configurator, and SessionManager; the IPCSubController communicates with MainProcess through a reserved area. Third, the StreamingService class, loaded dynamically by the SessionManager, contains a variety of MPEG-1/2/4 services. The MemAllocator class has two subclasses (MainMemAllocator and PMEMAllocator). The DataReader class provides basic file manipulation methods such as open(), close(), seek(), and read(), and it has two derived classes (STANDARDDataReader and EXT3NSDataReader).

Fig. 6 Streaming engine class diagram

Algorithm 1 illustrates the legacy streaming operation in application space. Some user memory, for example main_mem_buf, is allocated, and then a media file is read via the read() system call. Whenever the pacing check mayIgo() indicates that data may be sent, the general streamer sends data to the user at the given bitrate. In contrast to the legacy streaming operation, the fast-path streaming operation uses a PMEM unit suitable for allocating and reading a large block, and it uses the sendmsg() function to consolidate a header with the media data held in PMEM. The "/dev/ns0" in the third line of Algorithm 2 is a special NS device name that consists of "n" small partitions, similar to S/W striping in a RAID system. The fourth line's pmem_alloc(pd) obtains one PMEM block from the PMEM memory of a specific NS card; pmem_free() at line 19 returns that block, and the twentieth line's pmem_fin(pd) releases the PMEM device handle.

Algorithm 1 Legacy streaming operation in multimedia application
 1: procedure void Pseudo_General_Streamer()
 2:   char main_mem_buf[PAYLOAD_SZ + HEADER_SZ];
 3:   while !eof do
 4:     read_sz = read(file_fd, main_mem_buf + HEADER_SZ, PAYLOAD_SZ);
 5:     while mayIgo() != READ_SZ_SENT do
 6:       memcpy(main_mem_buf, header_data, HEADER_SZ);
 7:       send(sock_fd, main_mem_buf, SENT_SZ, flags);   ⊲ general send() function
 8:     end while                                        ⊲ do until mayIgo()
 9:   end while                                          ⊲ do until eof
10: end procedure

Algorithm 2 Fast-path streaming operation in multimedia application
 1: procedure void Pseudo_FastPath_Streamer()
 2:   char main_mem[HEADER_SZ];
 3:   int pd = pmem_init("/dev/ns0");                    ⊲ first NS card
 4:   char *pmem_buf = pmem_alloc(pd);                   ⊲ allocate a PMEM block
 5:   while !eof do
 6:     read_sz = read(data_file_fd, pmem_buf, SDA_SIZE);
 7:     while mayIgo() != READ_SZ_SENT do
 8:       memcpy(main_mem, header_data, HEADER_SZ);
 9:       struct iovec datavec[2];
10:       datavec[0].iov_base = main_mem;                ⊲ pointer into main memory
11:       datavec[0].iov_len  = HEADER_SZ;
12:       datavec[1].iov_base = pmem_buf;                ⊲ pointer into PMEM
13:       datavec[1].iov_len  = PAYLOAD_SZ;
14:       struct msghdr msg; msg.msg_iov = datavec; msg.msg_iovlen = 2;
15:       sendmsg(sock_fd, &msg, flags);                 ⊲ scatter/gather sendmsg()
16:       pmem_buf += PAYLOAD_SZ;
17:     end while                                        ⊲ do until mayIgo()
18:   end while                                          ⊲ do until eof
19:   pmem_free(pmem_buf);                               ⊲ deallocate the PMEM block
20:   pmem_fin(pd);                                      ⊲ release the PMEM device
21: end procedure
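The pacing check mayIgo() is not defined in the paper; the following sketch shows one plausible form of the bitrate pacing it could perform, allowing another PAYLOAD_SZ-sized send only when the elapsed-time budget for the target bitrate permits. The interface, constants, and accounting below are assumptions for illustration only.

    /* Hypothetical mayIgo()-style pacing check for a fixed-bitrate stream. */
    #include <stdint.h>
    #include <time.h>

    #define PAYLOAD_SZ  (32 * 1024)
    #define TARGET_BPS  (10 * 1000 * 1000)        /* e.g., a 10 Mbps stream */

    static uint64_t bytes_sent;                   /* payload bytes sent so far */
    static struct timespec start;                 /* set once at stream start  */

    static void pacing_start(void)
    {
        clock_gettime(CLOCK_MONOTONIC, &start);
        bytes_sent = 0;
    }

    static void record_sent(uint64_t n)           /* call after each send      */
    {
        bytes_sent += n;
    }

    static int may_i_go(void)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);

        double elapsed = (now.tv_sec - start.tv_sec)
                       + (now.tv_nsec - start.tv_nsec) / 1e9;
        double budget  = elapsed * (TARGET_BPS / 8.0);  /* bytes allowed so far */

        return (bytes_sent + PAYLOAD_SZ) <= budget;     /* 1: send, 0: wait     */
    }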
4 Performance evaluation

Figure 7 shows a detailed view of our testbed system [14]. The testbed consists of a shooter Linux machine, a GbE switch, and a high-performance streaming server with two NS cards. Each NS card sits between a disk array serving large volumes and a gigabit interface that controls and transmits streaming data in zero-copy form with the help of the PCI memory [15, 16].

Fig. 7 Testbed environment

To measure streaming performance, the shooter Linux machine generates virtual client requests via RTSP messages using an in-house shooter utility, the so-called "PseudoPlayer." PseudoPlayer (Fig. 7a) is implemented to handle virtual user connections; depending on its input parameters, it can exercise various protocols, including UDP, TCP, and RTP. Through PseudoPlayer's requests (Figs. 7b, 7c, and 7d), the high-performance streaming server receives virtual RTSP messages and interacts with the individual clients. It selects an NS card (card selection is a separate research topic within the content distribution framework; we fix a simple round-robin policy for testing) and transfers data to the switch. As shown in Figs. 7e′ and 7e′′, if the switch receives a packet with a destination MAC address that does not exist in its bridge table, the switch sends that packet over all interfaces belonging to the same virtual LAN (VLAN) as the interface on which the packet arrived. To prevent this flooding, the switch's VLAN table must map the VLAN number of the arrival port to the list of ports over which the packet is spread [17]. In Fig. 7e′′, the dummy port of the switch, namely the "black hole," is physically unconnected. To exploit the switch's MAC learning, we set up an ARP table entry on a Linux system so that traffic flows to the selected port. In contrast to the black-hole port, Fig. 7e′ shows the video stream analyzer connected to the switch; it receives the stream data and handles them appropriately.

Figure 8 shows the read throughput and CPU load of the NS card. The size of the test file was 10 GB, and the PMEM block size was 512 KB. EXT3 + raid0 refers to a four-disk RAID 0 array mounted with the EXT3 file system under the standard kernel. EXT3NS + pure (also EXT3NS + raid0) refers to the same number of disks mounted with the EXT3NS file system under the fast-path-enabled kernel. EXT3NS + SDA uses the Stream Disk Array (SDA) mechanism to handle a large data chunk at once. Varying the number of threads affects read performance differently depending on the type of file system. EXT3NS + SDA shows tremendous performance gains compared with EXT3 + raid0. EXT3NS + pure also achieves good performance gains through the zero-copy mechanism and the use of a large data chunk (512 KB); note, however, that it does not include the SDA feature (e.g., pipelined I/O).
Fig. 8 Read throughput/CPU load of the NS card

Figure 8b shows lower CPU utilization for EXT3NS + SDA than for EXT3 + raid0. EXT3 + raid0 consumes about 12% CPU up to roughly 64 threads and shows a severe increase in CPU usage between 64 and 256 threads, whereas EXT3NS + SDA shows virtually no change in CPU time regardless of the number of threads.

In Fig. 9a, we measure the maximum number of streams for four different configurations. The "Ideal NIC Limitation" is computed by dividing the NS card's NIC capacity by the bitrate (e.g., 200 streams correspond to the two GbE NICs of one NS card divided by 10 Mbps). The parameters are 200 video objects, a 10 Mbps bitrate, and one, two, or three NS cards. Streaming requests are partitioned among a node's mount points (e.g., "/ns0" has 50 video objects) according to a Zipf distribution and arrive at the node according to a fixed-rate Poisson process [18, 19]; a sketch of this request-generation model is given at the end of this section.

Figures 9b, c, and d show CPU idle/non-idle utilization with two NS cards. We measure streaming throughput on three different operating kernels: (i) a standard Linux kernel without NS cards (i.e., a general system), (ii) a Linux 2.4 kernel with the zero-copy patch and NS cards, and (iii) a Linux 2.6 kernel with the zero-copy patch and NS cards. The horizontal axis denotes the number of NS cards (up to three), and the vertical axis denotes the maximum number of jitter-less streams. The Linux 2.6 kernel equipped with NS cards gives better performance than both the standard Linux kernel and the Linux 2.4 kernel. The 2.4 kernel's CPU idle pattern leads to more diverse and unstable values than the 2.6 kernel's; notably, its value is about 5 times lower than that of the 2.6 kernel in the case of 370 streams, as shown in Fig. 9b. Figure 9d also demonstrates non-interference even when a node is equipped with up to three NS cards, within the limit of the available PCI slots. Figures 9e and 9f depict, respectively, the number of streams and the CPU idle variation curve for 700 seamless streams over a two-week period.

Fig. 9 Number of streams/CPU usage patterns

Figure 10 shows the number of zero-loss streams for various content types. "Standard 2/4 NIC" means the standard Linux 2.4 kernel with legacy NICs, and "1/2 NS cards" means the Linux 2.4 kernel with the zero-copy patch and NS cards. This test report was provided by the Telecommunications Technology Association (TTA), a national organization for telecom standardization and certified testing [20].

Fig. 10 MPEG-2 4/10/20 Mbps and H.264 600 Kbps stream capacity of the standard versus the 1/2 NS-card-enabled systems under the Linux 2.4 kernel (source: TTA test reports)
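For completeness, the request-generation model referenced above can be sketched as follows: object popularity is drawn from a Zipf distribution over the video objects, and inter-arrival times are drawn from an exponential distribution for a fixed-rate Poisson process. The exponent and arrival rate chosen here are illustrative assumptions; the paper specifies only the distributions and the 200-object catalogue.

    /* Sketch of Zipf object selection and Poisson arrivals for load generation. */
    #include <math.h>
    #include <stdlib.h>

    #define N_OBJECTS  200
    #define ZIPF_SKEW  1.0        /* classic Zipf exponent (assumed)       */
    #define LAMBDA     5.0        /* mean request arrival rate, req/s      */

    /* Pick an object index in [0, N_OBJECTS) with Zipf(ZIPF_SKEW) popularity. */
    static int zipf_pick(void)
    {
        static double harmonic = 0.0;
        if (harmonic == 0.0)
            for (int i = 1; i <= N_OBJECTS; i++)
                harmonic += 1.0 / pow(i, ZIPF_SKEW);

        double u = (rand() + 1.0) / (RAND_MAX + 2.0);   /* uniform in (0,1) */
        double sum = 0.0;
        for (int i = 1; i <= N_OBJECTS; i++) {
            sum += 1.0 / pow(i, ZIPF_SKEW) / harmonic;
            if (u <= sum)
                return i - 1;
        }
        return N_OBJECTS - 1;
    }

    /* Exponential inter-arrival time (seconds) for a Poisson(LAMBDA) process. */
    static double next_interarrival(void)
    {
        double u = (rand() + 1.0) / (RAND_MAX + 2.0);
        return -log(u) / LAMBDA;
    }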
5 Deployment

An overall picture of a multimedia distribution network including the high-performance streaming server is shown in Fig. 11.

Fig. 11 Multimedia distribution network

For global deployment, global content network services are required in more than one geographical location or data center. One of the goals of the Global Distribution Server (GDS) is to enable the geographic distribution of the content distribution service. In addition to content deployment, the Digital Item Server (DIS) provides session mobility management. The edge of the Global Contents Network (GCN), or the front end of the several enterprise sites, is a gateway that employs real-time streaming or prefetching for seamless delivery. In order to satisfy these requirements, the Local Distribution Server (LDS) communicates with the Global Distribution Server (GDS) for content distribution, and the High Performance Streaming Server (HPSS) is responsible for context-sensitive multimedia streaming under reliable network bandwidth. Each enterprise site is accessible by a public or personalized device via a well-formed network infrastructure.

The primary role of the Global/Local Distribution Server is to deploy the right content in the right place at the right time in order to satisfy clients' requests. Additionally, when requested content is not locally available at the Local Distribution Server (LDS), it should be obtained from the Global Distribution Server (GDS) via an efficient distribution policy. Even though the server provides a large storage capacity, not all of the multimedia content to be serviced can be stored at a single server. In addition, if services are processed globally, traffic on the global network infrastructure increases, and this increased traffic can render the service unreliable or unavailable. To resolve this problem, a service provider generally adopts a Content Delivery Network (CDN) technique that locates many local servers or cache servers in specific regional areas. In a CDN infrastructure, multimedia streaming services are provided to end users through the nearest local server.

The GDS plays the role of content provider, and we introduce the Files Server Node (FSN) concept as part of the GDS. It is important to store the right content in the right place at the right time in order to satisfy clients' requests. In addition, when requested content is not locally available, it should be obtained from the FSN node as quickly and reliably as possible. To support these requirements, the server should provide an efficient content distribution mechanism. The content distribution software provides the following features for versatile transfer, distribution methods, and content usage monitoring.

• Content Transfer: The content distribution unit transfers content from the FSN to an HPSS. The content transfer job can be executed promptly (On-demand Transfer) or in a delayed manner (Scheduled Transfer).
• Preloading and Prefix Caching: The content distribution unit provides a mechanism to preload content from the FSN prior to service requests by using either on-demand or scheduled transfer. For better performance, storing as much content as possible is desirable.
• Dynamic Loading: When content is requested for service but is not yet completely available, the content distribution unit transfers the content dynamically.
• Content Purge: When a content transfer job is activated and the storage has no space for the content to be transferred, the transfer job fails. To prevent this situation, the content distribution unit prepares the needed storage space in advance by purging some existing content.
• Content Storage Management: To determine whether a purge operation is necessary, the content distribution unit should know the status of the storage.
• Content Usage Monitoring: The selection of appropriate content to cache at the server affects cache utilization and thus service performance. The unit should therefore provide information about content usage.

Our fast-path I/O solution supports three main products: IPTV, Internet VOD, and Cable TV (CATV).
First, the IPTV solution includes the full range of components for transmitting TV or video over IP networks, as shown in Fig. 12. These include Digital Signal Processor (DSP) encoders, a Conditional Access System (CAS), billing, a Subscriber Management System (SMS), and other built-in subsystems, namely the StreamXpert (SX) series. The content streaming system (SX LS) has an NS card to facilitate high-performance streaming and communicates with a VOD client module in the subscriber's set-top box (STB). The content distribution and management system (SX GS) controls adaptive distribution to alleviate the network burden.

Fig. 12 IPTV system diagram

Figure 13 shows a screen shot of an Electronic Program Guide (EPG) used for a first-generation IPTV system.

Fig. 13 Electronic Program Guide (EPG) of a first-generation IPTV system

As shown in Fig. 14, the Internet VOD system, similar to the IPTV system, has major components for interacting with web users. Various client devices, such as PCs, PDA/DMB devices, and mobile PCs, are supported.

Fig. 14 Internet VOD system diagram

For a Cable TV (CATV) service, many kinds of equipment are needed, as shown in Fig. 15. An Edge Quadrature Amplitude Modulation (QAM) device is used to receive the signal from Gigabit Ethernet; it gives QAM RF output to the coaxial network. In the subscriber's field, the cable set-top box with a smart card and the Open Cable Application Platform (OCAP) enables cable customers to receive a wide variety of services and applications, such as electronic program guides, pay-per-view, video on demand, interactive sports, and game shows.

Fig. 15 Cable TV (CATV) diagram

Across these three deployments, our system provides an integrated, flexible architecture in which IPTV, Internet VOD, and Cable TV services coexist, and it allows existing services to be enhanced in terms of streaming performance and suitable content distribution.

6 Conclusion

In this paper, we have presented a fast-path I/O architecture that deploys in-kernel zero-copy to eliminate data movements, the EXT3NS file system for large-scale file I/O, and a high-performance streaming server, a Korean-style Internet server specialized for HD-level services. Performance evaluation indicates improvements in streaming throughput without CPU burden: the average application-level throughput is 1.8 Gbps per NS card, which approaches the dual-gigabit bandwidth and guarantees high performance without degradation in disk-network scenarios. For emerging applications like wireless streaming, IPTV, or even custom internal streaming deployments, our fast-path solution successfully delivers high-performance streaming media.

References

1. Pai VS, Druschel P, Zwaenepoel W (2000) IO-Lite: a unified I/O buffering and caching system. ACM Trans Comput Syst (TOCS) 18(1):37–66
2. Buddhikot MM, Chen XJ, Wu D et al (1998) Enhancements to 4.4 BSD UNIX for efficient networked multimedia in project MARS. In: Proc IEEE int conf on multimedia computing and systems. IEEE Computer Society, Washington, DC, USA, pp 326–337
3. Plagemann T, Goebel V, Halvorsen P, Anshus O (2000) Operating system support for multimedia systems. Comput Commun J 23(3):267–289
4. Halvorsen P, Plagemann T, Goebel V (2003) Improving the I/O performance of intermediate multimedia storage nodes. Multimedia Syst J 9(1)
5. Chiang M, Li Y (2006) LyraNET: a zero-copy TCP/IP protocol stack for embedded systems. J Real-Time Syst 34(1)
6. Abbott MB, Peterson LL (1993) Increasing network throughput by integrating protocol layers. IEEE/ACM Trans Netw 1(5):600–610
7. Clark DD, Tennenhouse DL (1990) Architectural considerations for a new generation of protocols. In: Proc of ACM SIGCOMM, pp 200–208
8. Jerry Chu HK (1996) Zero-copy TCP in Solaris. In: Proc of the USENIX annual technical conference, pp 253–264
9. An efficient zero-copy I/O framework for Unix. http://research.sun.com/techrep/1995/smli_tr-95-39.pdf
10. Shivam P, Wyckoff P, Panda D (2001) EMP: zero-copy OS-bypass NIC-driven gigabit Ethernet message passing. In: Proc of the supercomputing conference, Denver
11. Ahn B-S, Sohn S-H, Kim C-Y, Cha G-I, Baek Y-C, Jung S-I, Kim M-J (2004) Implementation and evaluation of EXT3NS multimedia file system. In: Proc ACM international conference on multimedia, New York, NY, USA, Jun 2004
12. Schulzrinne H, Lanphier R, Rao A (1998) Real time streaming protocol (RTSP). RFC 2326, IETF
13. Schulzrinne H, Casner S, Frederick R, Jacobson V (1996) RTP: a transport protocol for real-time applications. RFC 1889, IETF
14. Lee Y-J, Min O-G, Kim H-Y (2005) Performance evaluation technique of the RTSP based streaming server. In: ACIS intl conf on computer and information science, vol 4, no 1, Jul 2005
15. Lee Y-J, Min O-G, Mun S-J, Kim H-Y (2004) Enabling high performance media streaming server on network storage card. In: Proc IASTED IMSA, USA, Aug 2004
16. Min O-G, Kim H-Y, Kwon T-G (2004) A mechanism for improving streaming throughput on the NGIS system. In: Proc IASTED IMSA, USA, Aug 2004
17. L2 switching basics—MAC learning. http://www.ciscopress.com/articles/article.asp?p=101367&rl=1
18. Zipf's law. http://en.wikipedia.org/wiki/Zipf's-law
19. Poisson distribution. http://mathworld.wolfram.com/PoissonDistribution.html
20. TTA. http://www.tta.or.kr/English/new/main/index.htm

Yong-Ju Lee received the B.S. and M.S. degrees in computer engineering from Chonbuk National University, Korea, in 1999 and 2001, respectively. He joined ETRI (Electronics and Telecommunications Research Institute), Daejeon, Korea, in 2001. Since 2007, he has been working toward the Ph.D. degree in computer engineering at Chungnam National University, Korea. From 2002 to 2006, he was involved in the development of the Next Generation Internet Server. His research interests include high-speed network architecture, multimedia streaming, distributed file systems, and so on.

Yoo-Hyun Park received the B.S. and M.S. degrees in computer science from Pusan National University, Korea, in 1996 and 1998, respectively, and the Ph.D. degree in computer science from Pusan National University in 2008. He worked at KIDA (Korea Institute for Defense Analyses) in Seoul, Korea, and he joined ETRI (Electronics and Telecommunications Research Institute), Daejeon, Korea, in 2001. His research interests include transcoding proxies, multimedia content delivery, network-storage systems, databases, and so on.

Song-Woo Sok received the B.S. and M.S. degrees in electronics engineering from Kyungpook National University, Korea, in 1999 and 2001. He joined ETRI in 2001. His research interests include network-storage systems, multimedia file systems, and so on.

Hag-Young Kim received the B.S. and M.S. degrees in electronics engineering from Kyungpook National University, Korea, in 1983 and 1985, respectively, and the Ph.D. degree in computer engineering from Chungnam National University, Korea, in 2003. He joined ETRI in 1988 and currently serves as the Project Leader of the Media Streaming Research Team of ETRI.
His current interests include computer architecture, high-speed network architecture, multimedia, middleware, and digital cable broadcast systems.

Cheol-Hoon Lee received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1983, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1988 and 1992, respectively. From 1983 to 1994, he worked at Samsung Electronics Company in Seoul, Korea, as a researcher. From 1994 to 1995, he was with the University of Michigan, Ann Arbor, as a research scientist at the Real-Time Computing Laboratory. Since 1995, he has been a professor in the Department of Computer Engineering, Chungnam National University, Daejeon, Korea. His research interests include parallel processing, operating systems, real-time and fault-tolerant computing, and microprocessor design.