J Supercomput (2009) 50: 99–120
DOI 10.1007/s11227-008-0254-5
Fast-path I/O architecture for high performance
streaming server
Yong-Ju Lee · Yoo-Hyun Park · Song-Woo Sok ·
Hag-Young Kim · Cheol-Hoon Lee
Published online: 25 November 2008
© Springer Science+Business Media, LLC 2008
Abstract In a disk-network scenario where expensive data transfers are the norm,
such as in multimedia streaming applications, a fast-path I/O architecture is generally
considered good practice: I/O performance can be improved by minimizing the number
of in-memory data movements and context switches. In this paper, we report the design
and implementation of a high-performance streaming server using inexpensive hardware
units assembled directly on a test card (i.e., the NS card). The hardware part of our
architecture is open to further reuse, extension, and integration with other applications,
even as cheaper and/or faster hardware becomes available. On the software-aided I/O side,
we offer the Stream Disk Array (SDA) for scatter/gather-style block I/O, the EXT3NS
multimedia file system for large-scale file I/O, and an interoperable streaming server for
stream I/O.
Keywords Fast-path I/O · Zero-copy · EXT3NS file system · Network Storage card
(NS card)
Y.-J. Lee · Y.-H. Park · S.-W. Sok · H.-Y. Kim
Dept. of Internet Platform Research, SW and Contents Laboratory, Electronics
and Telecommunications Research Institute, Daejeon, Korea
Y.-J. Lee
e-mail: yongju@etri.re.kr
Y.-H. Park
e-mail: bakyh@etri.re.kr
S.-W. Sok
e-mail: swsok@etri.re.kr
H.-Y. Kim
e-mail: h0kim@etri.re.kr
C.-H. Lee ()
Dept. of Computer Engineering, Chungnam National University, Daejeon 305-764, Korea
e-mail: clee@cnu.ac.kr
1 Introduction
Rapid advances in computing and the Internet have spawned many new services and
applications. Among them, applications such as the World Wide Web (WWW) (over
a communication network) have achieved great success and have transformed many
facets of communities worldwide. Delivery of audio and video, however, poses far
greater challenges than general data applications. Moreover, unlike web pages, multimedia
content often requires significantly more storage space and bandwidth. Coupled with the
demand for serving thousands or even tens of thousands of concurrent users, the challenge
of designing scalable and cost-effective multimedia systems has been an area of intense
research over the past decade. In general, data delivery systems require data copy
operations and several system calls. Data copy operations consume CPU cycles, memory,
bus, and interface resources, and the system calls result in many switches between user
and kernel spaces. In order to eliminate redundant data transfers and to improve the
scalability of multimedia streaming servers, our fast-path I/O solution required studies
of the following:
• Network Storage card (NS card): It consists of disk, network, and memory units
integrated into a server via a PCI interface. The three units on one PCI board act as
peripheral devices that offer their functionality without host intervention. Our current
NS card prototype integrates inexpensive units (i.e., a SCSI controller for the disk
interface, dual gigabit Ethernet controllers for the network interface, and 512 MB of
PCI memory).
• Zero-copy transmission and pipelined I/O: When a user requests a block, the NS
card device driver splits the block into “n” small partitions. This is similar to S/W
striping in RAID; in particular, when a device driver reads data from multiple disks,
differences in disk performance characteristics necessarily introduce waiting
time. In the worst case, the read speed depends on the slowest disk. However, the
NS card device driver does not wait for the entire I/O to complete. As soon as
a small partition disk I/O operation is complete, the next partition on the disk is
read. As this is a pipelined I/O operation, all disks operate in nonblocking I/O
mode. After reading all of the blocks of a user request, zero-copy transmission
is achieved without any host intervention.
• Dedicated multimedia file system: A low-level device driver for the NS card performs
block I/O requests by issuing a read or write command on a block special
file. Since this offers no meaningful semantics, such as file I/O, for manipulating
multimedia files, we have developed a scalable multimedia file system, called
“EXT3NS,” that is designed to handle streaming workloads on an NS card and extends
the EXT3 file system. It provides the standard APIs to read and write
files in the storage unit on the NS card, and it supports both legacy I/O and fast-path
I/O operations.
• Deployment of fast-path I/O for streaming application: The EXT3NS multimedia
file system can offer a large block size defined by the disk striping driver of the
NS card and can efficiently store a large number of multimedia files. Our media-streaming engine deploys this fast-path I/O interface. In the case of fast-path I/O,
a streaming engine sets the PMEM pointer in the buffer structure to read a multimedia file. This structure is sent to the NS card disk driver, and the data is written
into the PMEM memory area. Furthermore, the zero-copy transmission layer also
sets the PMEM pointer to the socket buffer structure for sending the file. For legacy
I/O, the streaming engine reads the multimedia file from main memory. After that,
general data transfers are performed.
Our fast-path I/O implementation, an in-kernel data path coupled with a streaming
application that serves multiple clients, is a specialized system that both improves I/O
performance and handles High Definition (HD) streaming with ease.
The rest of this paper is structured as follows. In Sect. 2, we perform a literature
review of copy-free data paths. In Sect. 3, we discuss our fast-path I/O architecture
and implementation in detail. Then we demonstrate the efficacy of our system through
extensive experimentation. In Sects. 4 and 5, we highlight various deployments of our
high-performance streaming server. Finally, in Sect. 6, we conclude.
2 Related work
Research into the improvement of system performance for data delivery applications
has received much attention.
A prototype of IO-Lite avoids redundant data copying and multiple buffering and
allows for performance optimization across subsystems [1]. It uses a single copy of
data that can be shared among interprocess communications (IPC), file system and
user buffers, and SCSI/IDE subsystems. To take advantage of IO-Lite, applications
can use an extended I/O application programming interface (API), e.g., IOL_read()
and IOL_write(), that is based on buffer aggregates. Sharing buffers in IO-Lite introduces several problems, such as concurrent writing, physical consistency, and access
control. It was implemented only in the FreeBSD operating system and some derivative works.
The Massively-parallel And Real-time Storage (MARS) project proposed and implemented a new kernel buffer management system called Multimedia M-buf (MMBUF) which shortens the data path from a storage device to the network interface [2].
It also has a new API consisting of stream_open, stream_read, and stream_send system
calls. However, the MMBUF mechanism is less flexible because its buffers are statically
allocated.
In the Intermediate Storage Node Concept (INSTANCE) project [3, 4], a new architecture for Media-on-Demand storage nodes was proposed. It maximizes the number of concurrent clients a single node can support via a zero-copy-one-copy memory,
Network Level Framing (NLF), and integrated error management. It inherits from
previous MMBUF mechanisms and makes the following modifications to further increase performance: allocation and deallocation of MMBUFs to reduce the time used
to handle buffers, and a network send routine to allow for UDP/IP processing (the native
MMBUF mechanism used ATM). An outstanding feature of INSTANCE is NLF, which
reduces the per-stream resource requirements; combined with the in-kernel data path,
it reduces kernel time and increases the total number of concurrent
streams by a factor of two. However, the NLF mechanism does not cover the whole
communication system, nor the integration of NLF with on-board processing.
LyraNET is a zero-copy TCP/IP protocol stack embedded as a reusable TCP/IP
target component. It is derived from the Linux TCP/IP code and remodeled as a
software component independent of operating systems and hardware [5].
It provides a good reference for embedding the Linux TCP/IP stack into a target
system that requires network connectivity and improved transmission efficiency
through a zero-copy implementation.
Other data-copy avoidance architectures attempt to minimize data transfers and
copy operations, such as integrated layer processing [6, 7] and direct I/O, which in
some form is available in several commodity OSs today, for example, Solaris [8, 9];
these attempt to eliminate all copy operations between user and kernel spaces. Another
recent approach to reducing data copy operations is to use specialized hardware in the network
adapters. Ethernet Message Passing (EMP) is an OS-bypass messaging layer for gigabit Ethernet on Alteon NICs where the entire protocol processing is done at the
NIC [10]. It is a NIC-level implementation of a zero-copy message-passing layer and
exhibits good performance.
In the past, fast-path I/O solutions have been studied extensively with the goal
of minimizing data copy operations and relieving the CPU burden, but they rely on
dedicated hardware or have limited scopes (e.g., the NIC or the kernel). They also
require dedicated APIs that use their own mechanisms, and they do not cover combinations
of techniques (such as disk-to-block I/O, block I/O-to-file I/O, and file I/O-to-streaming)
that result in performance improvements. In terms of streaming I/O performance, their
streaming throughput does not come close to the ideal network bandwidth.
3 Fast-path I/O architecture
3.1 Motivation
Rapid improvements in hardware technology, such as 10 GbE and InfiniBand, can
provide temporary or permanent solutions to deal with growing network traffic.
Cheap memory and new multi-core CPUs also lead to significant reductions in computation
time. However, media streaming systems do not seem to have benefited equally from
these hardware improvements. In order to support the efficient streaming of
media, a good solution requires flexible choices that allow interoperability between
disk I/O and network I/O operations.
Hardware-aided I/O methods (e.g., 10 GbE, Myrinet, and InfiniBand) provide
high-speed transmission rates. However, these methods address data transmission issues
in networks, not disk-data reading issues. Because VoD servers cannot serve data at the
maximum transmission rate from dedicated disks, blocks of a given media file may be
replicated and stored on many VoD servers across networks, which effectively increases
the maximum transmission rate to the end user. Storage area networks (SAN) and iSCSI
are good choices to improve disk I/O throughput, and combining the benefits of a
high-speed network with such storage solutions yields an attractive platform for hosting
VoD servers. Unfortunately, these combined
solutions are not cost effective. It is therefore beneficial to improve the overall
performance of disk I/O and network I/O without the need for additional
hardware.
Software-aided I/O methods (e.g., the sendfile() mechanism, the so-called zero-copy
functionality under Linux) eliminate some of the copying between kernel and
user buffers. They are among the cheapest ways to improve streaming throughput,
as they require no additional hardware. Many operating systems implement
the sendfile() system call and a zero-copy TCP/IP protocol stack, which enable a
process to transfer data in the file system cache directly to the network interface.
Although sendfile() (and other interfaces) operates successfully under a particular
load, the implementation of zero copy under Linux is far from finished and is likely
to change in the near future; more functionality (e.g., vectored transfers, abundant
TCP options) should be added.
Above all, one rather unpleasant limitation of the current hardware- and software-aided
I/O methods is the inability to handle more in-depth functionality (i.e., arbitrarily
interleaving blocks from files and user-space buffers to be sent as one or more
packets). For example, multimedia systems support seamless transfers (e.g., appending
RTP headers) in networks as well as protection from unauthorized access (e.g.,
malformed stream headers). Another example is “trick play,” also known as transport
controls: instant replay, rewind, etc. This trade-off therefore hints at how to integrate
hardware- and software-aided architectures.
In this paper, we have developed a fast-path I/O architecture on a sample network
storage card that is physically integrated with three individual controllers. Even when
a server has three individual cards (dual NIC, SCSI controller, and PCI memory card),
our solution has no limitations with respect to fast-path I/O. Furthermore, it is open
to further reuse, extension, and integration with other applications even in the case
of inexpensive and/or faster hardware. In terms of software-aided I/O, we have implemented the Stream Disk Array (SDA) for scatter/gather-style block I/O, EXT3NS
file system for large-scale file I/O, and interoperable streaming server for stream I/O.
3.2 NS card
Figure 1 illustrates a sample hardware unit for testing and evaluation. It consists of
three peripheral devices (i.e., dual GbE controllers, dual SCSI controllers, and a PCI memory).
In terms of software device drivers, it has Stream Disk Array (SDA), PCI Memory
(PMEM), and TOE drivers. Streaming data copied to the PMEM is transmitted directly
to the network without additional memory copy operations. The SDA is a
special-purpose disk array optimized for large sequential disk access via pipelined
I/O. The peripheral memory is equipped with DRAM memory modules and is dedicated
to temporarily buffering the disk-to-network data path without user context switching.
Lastly, the TOE provides protocol-offload network interfaces that require no modification
from the application developer's point of view. The existing TCP/IP protocol stack cannot
transmit video data from PMEM to the network interface directly, since it assumes
that the payload data are in main memory and not in PMEM. Consequently, the protocol
stacks are modified to access data in PMEM and to transmit streaming data directly from
PMEM to the network; the existing TCP/IP stacks still operate on main memory.
Fig. 1 Network Storage card (NS card)
Fig. 2 Comparison of traditional I/O and zero-copy I/O
3.3 Zero-copy transmission and pipelined I/O
Figure 2 compares traditional and zero-copy I/O.
In traditional I/O, data is first transferred from the disk to main memory. It is then
managed by many subsystems within the OS that are designed with different objectives
in mind, run in their own domains (either in user or kernel space), and, therefore,
manage their buffers differently. Due to different buffer representations and
protection mechanisms, data is usually copied from domain to domain (e.g., from the
file system to the application, from the application to the communication system), which
allows the different subsystems to manipulate the data in question. Finally, the data
is transferred to the network interface. In addition to all these data transfers, the data is
loaded into the cache and CPU registers, where it is manipulated for checksumming.
In short, data in a disk system is copied to memory, duplicated many times for protocol
packet handling, and then sent to the communication network. Clearly, sending the data
directly to the network without duplication, as our zero-copy I/O architecture does,
will improve overall system performance. The size of streaming data on a disk ranges
from several hundred megabytes to gigabytes in continuous form; note the inefficiency
of handling such streaming data with current file systems, which normally handle
512 bytes to several kilobytes at a time.
Fig. 3 Split blocks and pipelined I/O
Stream Disk Array (SDA) is a new disk array control mechanism that provides at
least 1 Gbps of bandwidth even for the worst-case block distribution. The stream disk array driver
consists of a stream disk array interface, a request queue, a pipelined I/O manager,
and a block splitter.
The block splitter is interfaced with the Linux block I/O driver, which provides
the function generic_make_request() that makes raw block I/O access possible. It
divides a logical block into n small partitioned blocks, which are allocated to each
disk. This new allocation method splits each block across all of the disks comprising
the disk array, whereas RAID interleaves blocks across disks. The logical block is
divided by the number of disks in the disk array: the first split block is located on
the first disk, and the N'th split block is located on the N'th disk. Using the block
splitter, the pipelined I/O manager first issues read requests for the first and second
split blocks. When the N'th read operation completes and an (N + 2)'th read request
exists, the manager issues the (N + 2)'th read request. Thus, the operations of the
pipelined I/O manager proceed in parallel. Figure 3 shows an example of split blocks
and pipelined I/O.
The PCI memory (PMEM) driver is the software implementation for PMEM. With
the help of an NS character device driver, the PMEM driver provides user processes
with direct access through the PCI bus. The initialization part of the PMEM driver
includes detecting, enabling, and setting the configuration registers of all of the
PMEM units in the local server. Like the memory management of a general operating
system, the PMEM driver manages the PMEM areas used by user processes via the
zero-copy mechanism. The management unit of PMEM can be varied among 128 KB,
256 KB, 512 KB, 1 MB, and 2 MB to cope with different types of transferred multimedia
data.
Table 1 shows the pseudocode of the udp_sendmsg() function in kernel space.
The udp_sendmsg() function gathers data from several transmit buffers before being transmitted. The msghdr structure in udp_sendmsg()’s parameter specifies the
buffer parameter for the sendmsg I/O function. The structure allows for specification
of an array of scatter/gather buffers. If an iov_base pointer indicates a
PMEM address, the pmem_include value is set to “1.” This value specifies that
Table 1 Pseudocode of udp_sendmsg()
int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t len)
{
    if (0 <= (pmem_id = (*pmem_check)((*__ns_virt_to_phys)(iov->iov_base))))
        pmem_include = 1;
    if (pmem_include)
        err = ip_append_pmem(sk, ip_generic_getfrag, msg->msg_iov, ulen,
                             sizeof(struct udphdr), &ipc, rt,
                             corkreq ? msg->msg_flags | MSG_MORE : msg->msg_flags,
                             user_payload_len);
    else
        err = ip_append_data(sk, ip_generic_getfrag, msg->msg_iov, ulen,
                             sizeof(struct udphdr), &ipc, rt,
                             corkreq ? msg->msg_flags | MSG_MORE : msg->msg_flags);
}
the ip_append_pmem() function appends a PMEM buffer for transmission; otherwise,
the general ip_append_data() is called. Like the udp_sendmsg() function, the
tcp_sendmsg() function within the TCP module detects a PMEM address for zero-copy
transmission, depending on the iov_base pointer.
3.4 EXT3NS multimedia file system
EXT3NS is a file system built on top of the NS card. It enables applications to access
fast-path I/O of the NS card via standard read and write system call interfaces. It provides a legacy VFS layer interface for existing file systems and related applications.
The file system can easily accommodate a large block size as defined in the SDA
driver of the NS card. It has a disk layout that allows for efficient storage of a large
number of multimedia files. EXT3NS supports both page cache I/O and SDA fast-path
I/O. This is possible because the SDA device supports both a
fast-path I/O interface and a standard block device driver.
Figure 4 illustrates the location of EXT3NS file system in the Linux kernel and
shows how it interacts with the user/kernel space [11]. The leftmost vertical flow in
that figure indicates the conventional streaming operation in a general-purpose Linux
file system. The middle flow and rightmost flow depict general streaming operations
and fast-path operations in our file system, respectively.
When the application reads data from disk via the read system call, EXT3NS determines, using the argument value of user buffer address, whether it is a read operation
to the PMEM area of NS card or to the system main memory. If it is to the PMEM
area, EXT3NS performs the fast-path I/O, for example, ext3ns_sda_file_read(), by
using the NS device driver.
If it is to main memory, EXT3NS performs the legacy page cache I/O. In the case
of legacy page cache I/O, EXT3NS just reads from disk to main memory by calling a
Fig. 4 Comparison of general file I/O and fast-path file I/O
Table 2 Pseudocode of ext3ns_file_read()
ssize_t ext3ns_file_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
{
    pmem_addr_t pmem_paddr = (*__ns_virt_to_phys)((const void *)buf);
    if (pmem_paddr && (*pmem_check)((unsigned)pmem_paddr) >= 0)
        ext3ns_sda_file_read(filp, buf, count, ppos);
    else
        generic_file_read(filp, buf, count, ppos);
}
generic read function (e.g., generic_file_read() in the case of a mounted ext3 file system).
Table 2 shows the pseudocode of ext3ns_file_read().
The key features of the EXT3NS file system are, for the most part, inherited from
EXT3; these include the mkfs utility and logical block addressing, for example. The
most notable differences follow. (1) The disk-block size of EXT3NS is much larger than
that of EXT3 (between 128 KB and 2 MB). Thus, the number of block groups in EXT3NS
is much smaller than that in EXT3. (2) The block allocation method is optimized for
large multimedia files and sequential access. (3) Zero-copy data movement via the
EXT3NS file system is achieved through the character device of the NS card. Hence,
EXT3NS uses two types of SDA device I/O: legacy buffered I/O via the block device
interface and fast-path I/O via the character device interface.
Figure 5a shows the layout of an EXT3NS partition and its block group. All the
block groups in the file system have the same size and are stored sequentially. Each
block in a block group contains one of the following pieces of information: a copy
Fig. 5 Disk Layout and data structures used to address the file’s data blocks in EXT3NS
of the group of block group descriptors, a data block bitmap, a group of inodes, an
inode bitmap, and data belonging to some file (e.g., a data block). Group descriptors
are replicated. The inode and data block bitmaps always occupy one block each, so one
inode bitmap and one data block bitmap are located in each block group. The data block
bitmap indicates the data blocks in use. Data blocks hold unformatted portions of file
data or pointers to other data blocks. A block group, including its data structures, is the
same as that of EXT3; however, one bit of a data block bitmap indicates an SDA block
(e.g., a 1 MB data block) rather than a 4 KB block. Therefore, each block group can
contain at most 8 ∗ b blocks, where b is the block size in bytes, and the total number
of block groups is roughly s/(8 ∗ b), where s is the partition size in blocks. In the
EXT3NS file system, the larger the block size, the smaller the number of block groups.
The EXT3NS file system, like EXT3, has a block-addressing array in the disk
inode structure, as shown in Fig. 5b. It has 15 components of 4 different types:
• Direct addressing uses 12 components. The logical block number inside the first-order
array ranges from 0 to 11, covering files of size 0 to 12b, where b is
the file system's block size. For instance, assume a block size of 4 KB
in EXT3 and 1 MB in EXT3NS. The maximum file sizes would then be 48 KB and
12 MB, respectively.
• Indirect addressing corresponds to the file block numbers ranging from 12 to
b/4 + 11. In the example above, the maximum file size reachable with a second-order
array is 48K + 4M bytes for EXT3 and 12M + 256G bytes for EXT3NS.
• Double indirect addressing corresponds to the file block numbers ranging from
b/4 + 12 to (b/4)^2 + (b/4) + 11. The maximum file size reachable with a third-order
array is 48K + 4M + 4K ∗ 2^20 bytes for EXT3 and 12M + 256G + 1M ∗ 2^36 bytes
for EXT3NS.
• Triple indirect addressing corresponds to the file block numbers ranging from
(b/4)^2 + (b/4) + 12 to (b/4)^3 + (b/4)^2 + (b/4) + 11. The maximum file size
is 48K + 4M + 4K ∗ 2^20 + 4K ∗ 2^30 bytes for EXT3 and 12M + 256G + 1M ∗ 2^36 + 1M ∗ 2^54 bytes,
for EXT3NS.
3.5 Fast-path streaming operation in multimedia application
Figure 6 shows the streaming engine class diagram, which consists of many subclasses
for manipulating streaming requests.
Each filled box represents a representative class name, including a member class
in a lined box. First, the MainProcess class controls the RTSP protocol [12, 13],
communicates with NSProcess, monitors system health, and manipulates clients'
connections. Second, the NSProcess class consists of IPCSubController, Configurator,
and SessionManager; the IPCSubController communicates with MainProcess via a
reserved area. Third, the StreamingService class, loaded dynamically by the
SessionManager, contains a variety of MPEG-1/2/4 services. The MemAllocator class
has two subclasses (e.g., MainMemAllocator and PMEMAllocator). The DataReader
class performs basic file manipulation methods such as open(), close(), seek(),
and read(); it has two kinds of inherited classes (e.g., STANDARDDataReader and
EXT3NSDataReader). Algorithm 1 illustrates a legacy streaming operation in application
space. Some user memory, for example, main_mem_buf, is allocated, and then
a media file is read via a read() system call. When the pacing check mayIgo() indicates
the data is ready, the general streamer sends the data to a user at the given bitrate.
Contrary to the legacy streaming operation, the fast-path streaming operation uses a
PMEM unit suitable for allocating and reading a large block.
It also uses the sendmsg() function to consolidate a header with media data in the
PMEM. The “/dev/ns0” in the third line of Algorithm 2 is a special NS device name
which consists of “n” small partitions, similar to S/W striping in a RAID system.
The fourth line's pmem_alloc(pd) gets one PMEM block from the PMEM memory
of a specific NS card, and the twentieth line's pmem_fin(pd) function releases the
allocated PMEM resources.
Fig. 6 Streaming engine class diagram
Algorithm 1 Legacy streaming operation in multimedia application
1: procedure void Pseudo_General_Streamer()
2:   char main_mem_buf[PAYLOAD_SZ+HEADER_SZ];
3:   while !eof do
4:     read_sz = read(file_fd, main_mem_buf+HEADER_SZ, PAYLOAD_SZ);
5:     while mayIgo() != READ_SZ_SENT do
6:       memcpy(main_mem_buf, header_data, HEADER_SZ);
7:       send(sock_fd, main_mem_buf, SENT_SZ, flags);   ⊲ general send() function
8:     end while   ⊲ do until mayIgo()
9:   end while   ⊲ do until eof
10: end procedure
Algorithm 2 Fast-path streaming operation in multimedia application
1: procedure void Pseudo_Fast-Path_Streamer()
2:   char main_mem[HEADER_SZ];
3:   int pd = pmem_init("/dev/ns0");   ⊲ first NS card
4:   char *pmem_buf = pmem_alloc(pd);   ⊲ PMEM allocation
5:   while !eof do
6:     read_sz = read(data_file_fd, pmem_buf, SDA_SIZE);
7:     while mayIgo() != READ_SZ_SENT do
8:       memcpy(main_mem, header_data, HEADER_SZ);
9:       struct iovec datavec[2];
10:      datavec[0].iov_base = main_mem;   ⊲ a pointer to main memory
11:      datavec[0].iov_len = HEADER_SZ;
12:      datavec[1].iov_base = pmem_buf;   ⊲ a pointer to PMEM
13:      datavec[1].iov_len = PAYLOAD_SZ;
14:      struct msghdr msg; msg.msg_iov = (struct iovec*)&datavec;
15:      sendmsg(sock_fd, msg, SENT_SZ, flags);   ⊲ sendmsg() function
16:      pmem_buf += PAYLOAD_SZ;
17:    end while   ⊲ do until mayIgo()
18:  end while   ⊲ do until eof
19:  pmem_free(pmem_buf);   ⊲ PMEM block deallocation
20:  pmem_fin(pd);   ⊲ release the PMEM device
21: end procedure
4 Performance evaluation
Figure 7 shows a detailed view of our testbed system [14]. The testbed consists of
a shooter Linux machine, a GbE switch, and a high-performance streaming server
with two NS cards. Each NS card is connected between a disk array, for serving large
volumes, and a gigabit interface, for controlling and transmitting streaming data in
zero-copy form with the help of the PCI memory [15, 16].
To measure streaming performance, the shooter Linux machine generates virtual client requests via RTSP messages using an in-house shooter utility called
Fig. 7 Testbed environment
“PseudoPlayer.” This is shown in Fig. 7a; PseudoPlayer is implemented to handle virtual
user connections. Using input parameters, it is able to exercise various
protocols, including UDP, TCP, and RTP. Through PseudoPlayer's requests (shown in
Figs. 7b, 7c, and 7d), the high-performance streaming server receives virtual RTSP
messages and interacts with individual clients. It selects an NS card (card selection
is itself another research topic, as part of a content distribution framework; we fixed a
simple round-robin approach for testing) and transfers data to the switch. As shown in
Figs. 7e′ and 7e′′, if the switch receives a packet with a destination MAC address
that does not exist in the bridge table, it sends that packet over all of its interfaces
that belong to the virtual LAN (VLAN) assigned to the interface on which the
packet arrived. To prevent this flooding, the switch's VLAN table must map
the VLAN number of the arrival port to the list of ports over which the
packet is spread [17]. In Fig. 7e′′, the dummy port of the switch, namely the “black
hole,” is not physically connected. To use the switch's MAC learning, we set up an
ARP table in a Linux system to add an address mapping entry so that traffic flows to
the selected port. In contrast to the black hole port, Fig. 7e′ indicates that the video
stream analyzer is connected to the switch; it receives the stream data and handles it
in a proper fashion.
Figure 8 shows the read throughput and CPU load of the NS card. The size of the
test file was 10 GB, and the PMEM block size was 512 KB. EXT3 + raid0
refers to a 4-disk RAID 0 array mounted with an EXT3 file system on a standard
kernel. EXT3NS + pure (also EXT3NS + raid0) refers to the same number of disks
mounted with the EXT3NS file system on a fast-path-enabled kernel. EXT3NS + SDA
uses the Stream Disk Array (SDA) mechanism to handle large data chunks at once.
Varying the number of threads affects read performance according to the type of
file system. EXT3NS + SDA shows tremendous performance gains compared with
EXT3 + raid0. EXT3NS + pure also achieves good performance gains through the
zero-copy mechanism and its use of large chunks of data (such as 512 KB); note,
however, that the SDA feature (e.g., pipelined I/O) is not included. Figure 8b shows
the lower CPU utilization of EXT3NS + SDA compared with that of EXT3 + raid0.
Fig. 8 Read throughput / CPU load of NS card
EXT3 + raid0 consumes about 12% CPU up to about 64 threads, and shows
a severe increase in CPU usage from about 64 to 256 threads. EXT3NS + SDA shows
no appreciable increase in CPU time regardless of the number of threads.
In Fig. 9a, we measure the maximum number of streams for four different configurations.
The “Ideal NIC Limitation” is computed by dividing the NS card's NIC capacity
by the stream bitrate (e.g., the 2 Gbps of the dual GbE NICs on 1 NS card divided by
10 Mbps yields 200 streams). The test parameters consist of 200 video objects, a 10 Mbps
bitrate, and 1/2/3 NS cards. Streaming requests are partitioned amongst a node's mount
points (e.g., “/ns0” has 50 video objects) according to a Zipf distribution and arrive at
the node according to a fixed-rate Poisson process [18, 19]. Figures 9b, c, and d show
CPU idle/nonidle utilization with two NS cards.
We measure streaming throughput on three different operating kernels:
(i) the standard Linux kernel without NS cards (i.e., a general system), (ii) the Linux 2.4
kernel with the zero-copy patch and NS cards, and (iii) the Linux 2.6 kernel with the zero-copy
patch and NS cards. The horizontal axis denotes the number of NS cards (up
to 3), and the vertical axis denotes the maximum number of jitter-less streams.
The Linux 2.6 kernel equipped with NS cards gives better performance than both the
standard Linux kernel and the Linux 2.4 kernel. The latter's CPU idle pattern leads to
more diverse and unstable values than the 2.6 kernel's; notably, its value is about 5
times lower than that of the 2.6 kernel for the case of 370 streams, as shown in Fig. 9b.
Figure 9d also demonstrates non-interference even when a node is equipped with up
to three NS cards, within the limit of available PCI slots. Figures 9e
and 9f depict, respectively, the number of streams and the CPU idle variation curve for
700 seamless streams over a 2-week period.
Fig. 9 Number of streams/CPU usage patterns
Figure 10 shows the number of zero-loss streams for various content types. "Standard
2.4 NIC" denotes the standard Linux 2.4 kernel with legacy NICs, and "1/2 NS
cards" denotes the Linux 2.4 kernel with the zero-copy patch and one or two NS cards. This test
report was provided by the Telecommunications Technology Association (TTA), an organization
for national telecom standardization and certified testing [20].
5 Deployment
An overall picture of a multimedia distribution network, including the high performance
streaming server, is shown in Fig. 11.
Fig. 10 MPEG-2 4/10/20 Mbps and H.264 600 Kbps stream capacity of standard and 1/2 NS-card-enabled
systems in the Linux 2.4 kernel (source: TTA test reports)
Fig. 11 Multimedia distribution network
For global deployment, global content network services are required in more than
one geographical location or data center. One goal of the Global Distribution
Server (GDS) is to enable geographic distribution of the content distribution
service. In addition to content deployment, the Digital Item Server (DIS) provides
session mobility management. The edge of the Global Contents Network (GCN), or the
front end of an enterprise site, acts as a gateway that employs real-time streaming or
prefetching for seamless delivery. To satisfy these requirements, the Local Distribution
Server (LDS) communicates with the GDS for content distribution, and the High
Performance Streaming Server (HPSS) is responsible for context-sensitive multimedia
streaming under reliable network bandwidth. Each enterprise site is accessible by public
or personalized devices via a well-formed network infrastructure. The primary role of the
Global/Local Distribution Servers is to deploy the right content in the right place at the
right time in order to satisfy clients' requests. When requested content is not locally
available in the LDS, it should be obtained from the GDS via an efficient distribution
policy. Even though the server provides a large storage capacity, not all of the multimedia
content to be serviced can be stored at the server. Moreover, if services are processed
globally, traffic on the global network infrastructure increases, which can render the
service unreliable or unavailable. To resolve this problem, a service provider generally
adopts a Content Delivery Network (CDN) technique that places many local or cache
servers in specific regional areas; in a CDN infrastructure, multimedia streaming services
are provided to end users through the nearest local server. The GDS plays the role of
content provider, and we introduce the File Server Node (FSN) concept as part of the GDS.
When requested content is not locally available, it should be obtained from the FSN
as quickly and reliably as possible. To support these requirements, the
server should provide an efficient content distribution mechanism. The content distribution
software provides the following features for versatile transfer, distribution
methods, and content usage monitoring:
• Content Transfer: The content distribution unit transfers content from the FSN to
an HPSS. The transfer job can be executed promptly (On-demand Transfer)
or in a delayed manner (Scheduled Transfer).
• Preloading and Prefix Caching: The content distribution unit provides a mechanism
to preload contents from the FSN prior to service requests using either on-demand
or scheduled transfer. For better performance, storing as much content as possible
is desirable.
• Dynamic Loading: When requested content is not completely available, the content
distribution unit transfers it dynamically.
• Content Purge: When a content transfer job is activated and the storage has no
space for the content to be transferred, the transfer job fails. To prevent this
situation, the content distribution unit prepares the needed storage space in advance by
purging some existing contents.
• Content Storage Management: To determine whether a purge operation is necessary,
the content distribution unit must know the storage status.
• Content Usage Monitoring: The selection of appropriate contents to cache at the server
affects cache utilization and thus service performance. The unit should therefore
provide information about content usage.
Fig. 12 IPTV system diagram

Fig. 13 Electronic Program Guide (EPG) of a first-generation IPTV system

Our fast-path I/O solution provides three main products: IPTV, Internet VOD, and
Cable TV (CATV). First, the IPTV solution includes the full range of components
for transmitting TV or video via IP networks, as shown in Fig. 12. These include:
Digital Signal Processor (DSP) encoders, Conditional Access System (CAS), billing,
Subscriber Management System (SMS), and other built-in subsystems, namely the
StreamXpert (SX) series. The content stream system (SX LS) has an NS card to facilitate
high-performance streaming and communicates with a VOD client module in the
subscriber’s STB. Content distribution and management system (SX GS) controls
adaptive distribution to alleviate network burden.
Figure 13 shows a screen shot of an Electronic Program Guide (EPG) used for a
first-generation IPTV system.
Fig. 14 Internet VOD system diagram
Fig. 15 Cable TV (CATV) diagram
As shown in Fig. 14, the Internet VOD system, similar to the IPTV system, has
major components for interacting with web users. Various client devices, such as PCs,
PDA/DMB devices, and mobile PCs, are supported.
For a Cable TV (CATV) service, many kinds of equipment are needed as shown in
Fig. 15. An Edge Quadrature Amplitude Modulation (QAM) device is used to receive
the signal from Gigabit Ethernet; it gives QAM RF output to the coaxial network.
In the subscriber’s field, the cable set-top box with smart card and Open Cable
Application Platform (OCAP) will enable cable customers to receive a wide variety
118
Y.J. Lee et al.
of services and applications, such as electronic program guides, pay per view, video
on demand, interactive sports, and game shows.
Across these three deployments, our system provides an integrated, flexible architecture
in which IPTV, Internet VOD, and Cable TV services coexist, and it allows existing
services to be enhanced in terms of streaming performance and suitable distribution.
6 Conclusion
In this paper, we have presented a fast-path I/O architecture that deploys in-kernel
zero-copy to eliminate data movements, the EXT3NS file system for large-scale
file I/O, and a high-performance streaming server for Korean-style Internet
servers specialized in HD-level services. Performance evaluation indicates improvements
in streaming throughput without CPU burden. The average application-level throughput
is 1.8 Gbps per NS card; ideally, this meets dual-gigabit bandwidth and guarantees
high performance without degradation in disk-network scenarios. For emerging
applications like wireless streaming, IPTV, or even custom internal streaming deployments,
our fast-path solution successfully delivers high-performance streaming media.
References
1. Pai VS, Druschel P, Zwaenepoel W (2000) I/O-Lite: a unified I/O buffering and caching system. ACM
Trans Comput Syst (TOCS) 18(1):37–66
2. Buddhikot MM, Chen XJ, Wu D et al (1998) Enhancements to 4.4 BSD UNIX for efficient networked
multimedia in project MARS. In: Proc the IEEE int conf multimedia computing and systems. IEEE
Computer Society, Washington DC, USA, 1998, pp 326–337
3. Plagemann T, Goebel V, Halvorsen P, Anshus O (2000) Operating system support for multimedia
systems. Comput Commun J 23(3):267–289
4. Halvorsen P, Plagemann T, Goebel V (2003) Improving the I/O performance of intermediate multimedia storage nodes. Multimedia Syst J 9:1
5. Chiang M, Li Y (2006) LyraNET: a zero-copy TCP/IP protocol stack for embedded systems. J Real-Time Syst 34(1)
6. Abbott MB, Peterson LL (1993) Increasing network throughput by integrating protocol layers.
IEEE/ACM Trans Netw 1(5):600–610
7. Clark DD, Tennenhouse DL (1990) Architectural considerations for a new generation of protocols.
In: Proc of ACM SIGCOMM, 1990, pp 200–208
8. Jerry Chu HK (1996) Zero-copy TCP in Solaris. In: Proc of the USENIX annual technical conference,
1996, pp 253–264
9. An efficient zero-copy I/O framework for Unix. http://research.sun.com/techrep/1995/smli_
tr-95-39.pdf
10. Shivam P, Wyckoff P, Panda D (2001) EMP: zero-copy OS-bypass NIC-driven gigabit Ethernet message passing. In: Proc of supercomputing conference, 2001, Denver
11. Ahn B-S, Sohn S-H, Kim C-Y, Cha G-I, Baek Y-C, Jung S-I, Kim M-J (2004) Implementation and
evaluation of EXT3NS multimedia file system. In: Proc ACM international conference on multimedia,
New York, NY, USA, Jun, 2004
12. Schulzrinne H, Lanphier R, Rao A (1998) Real time streaming protocol (RTSP) RFC 2326, IETF
13. Schulzrinne H, Casner S, Frederick R, Jacobson V (1996) RTP: a transport protocol for real-time
applications RFC 1889, IETF
14. Lee Y-J, Min O-G, Kim H-Y (2005) Performance evaluation technique of the RTSP based streaming
server. In: ACIS intl conf on computer and information science, vol 4, no 1, Jul 2005
15. Lee Y-J, Min O-G, Mun S-J, Kim H-Y (2004) Enabling high performance media streaming server on
network storage card. In: Proc. IASTED IMSA, USA, Aug 2004
16. Min O-G, Kim H-Y, Kwon T-G (2004) A mechanism for improving streaming throughput on the
NGIS system. In: Proc IASTED IMSA, USA, Aug 2004
17. L2 switching basics—MAC learning. http://www.ciscopress.com/articles/article.asp?p=101367&rl=1
18. Zipf’ Law. http://en.wikipedia.org/wiki/Zipf’s-law
19. Poisson Distribution. http://mathworld.wolfram.com/PoissonDistribution.html
20. TTA. http://www.tta.or.kr/English/new/main/index.htm
Yong-Ju Lee received the B.S. and M.S. degrees in computer engineering from Chonbuk National University, Korea, in 1999 and 2001,
respectively. He joined ETRI (Electronics and Telecommunications Research Institute), Daejeon, Korea, in 2001. Since 2007, he has been
working toward the Ph.D. degree in computer engineering at Chungnam National University, Korea. From 2002 to 2006,
he was involved in the development of the Next Generation Internet
Server. His research interests include high-speed network architecture,
multimedia streaming, distributed file systems, and so on.
Yoo-Hyun Park received the B.S. and M.S. degrees in computer science from Pusan National University, Korea, in 1996 and 1998, respectively, and the Ph.D.
degree in computer science from Pusan National University, Korea, in
2008. He worked at KIDA (Korea Institute for Defense Analyses) in
Seoul, Korea and he joined ETRI (Electronics and Telecommunications
Research Institute), Daejeon, Korea in 2001. His research interests include transcoding proxy, multimedia content delivery, network-storage
system, database and so on.
Song-Woo Sok received the B.S. and M.S. degrees in electronics engineering from Kyungpook National University, Korea, in 1999 and 2001, respectively.
In 2001, he joined ETRI. His research interests include network-storage systems, multimedia file systems, and so on.
Hag-Young Kim received the B.S. and M.S. degrees in electronics engineering from Kyungpook National University, Korea, in 1983 and 1985,
and the Ph.D. degree in computer engineering from Chungnam National
University, Korea, in 2003. He joined ETRI in 1988, and he currently
serves as the Project Leader of the Media Streaming Research Team of
ETRI. His current interests include computer architecture, high-speed
network architecture, multimedia, middleware, and digital cable broadcast system.
Cheol-Hoon Lee received the B.S. degree in electronics engineering
from Seoul National University, Seoul, Korea in 1983, and the M.S.
and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1988 and 1992, respectively. From 1983 to 1994, he worked at Samsung Electronics Company
in Seoul, Korea, as a researcher. From 1994 to 1995, he was with the
University of Michigan, Ann Arbor, as a research scientist at the Real-Time Computing Laboratory. Since 1995, he has been a professor in
the Department of Computer Engineering, Chungnam National University, Taejeon, Korea. His research interests include parallel processing,
operating systems, real-time and fault-tolerant computing, and microprocessor design.