Analysis of How Disk and Network I/O Work



It is worth briefly describing how data is transferred between slow I/O devices and memory.

  • PIO
    Take the disk as an example. Long ago, data transfer between the disk and memory was controlled by the CPU: when a disk file was read into memory, every byte was stored and forwarded by the CPU itself. This method is called PIO (Programmed I/O). It is clearly inefficient, since reading a file consumes a great deal of CPU time and can leave the system almost unresponsive while files are being accessed.

  • DMA
    Later, DMA (Direct Memory Access) replaced PIO: data can be exchanged between the disk and memory directly, without passing through the CPU. In DMA mode, the CPU only issues an instruction to the DMA controller, which then performs the transfer over the system bus and notifies the CPU when it is complete. This greatly reduces CPU utilization and saves system resources. The transfer speed, however, does not differ much from PIO, because it is mainly limited by the speed of the slow device itself.

Safe to say, PIO-mode computers are rarely seen today.

Standard file access method


Specific steps:

When an application calls the read interface, the operating system checks whether the requested data is already in the kernel's cache. If it is, the data is returned directly from the cache; if not, it is read from disk and then cached by the operating system.

When an application calls the write interface, the data is copied from the user address space into the cache in kernel address space. At that point, the write is complete as far as the user program is concerned; when the data actually reaches the disk is up to the operating system, unless the application explicitly issues a synchronization call such as sync.

Memory mapping (reduces data copying between user space and kernel space; well suited to transferring large amounts of data)

The Linux kernel provides a special way to access disk files: it can associate a region of the address space in memory with a specified disk file, converting accesses to that memory into accesses to the file. This technique is called memory mapping.

Memory mapping establishes a one-to-one correspondence between a file's location on disk and a region of the same size in the process's logical address space. Accessing a piece of that memory is translated into accessing the corresponding piece of the file. The purpose is to eliminate the copy of data between the kernel-space cache and the user-space cache, since the two spaces share the same data; when large amounts of data need to be transferred, accessing files through memory mapping is therefore more efficient.

When memory-mapped files are used to process files stored on disk, no explicit I/O operations need to be performed on them, which means there is no need to allocate and manage an application-level cache for the file: all caching is handled directly by the operating system. Because the steps of loading file data into memory, writing data back to the file, and releasing memory blocks are eliminated, memory-mapped files are especially valuable when processing files with large amounts of data.


Access steps


In most cases, memory mapping improves disk I/O performance. Instead of accessing files through system calls such as read() or write(), the application uses the mmap() system call to establish an association between memory and the disk file, and then accesses the file as freely as ordinary memory.
There are two types of memory mapping, shared and private. With a shared mapping, writes to the mapped memory are carried through to the disk file, and all processes mapping the same file see each other's modifications. With a private mapping, writes use copy-on-write: modifications are never synchronized back to the file, and they are not shared with other processes. Shared mappings are naturally less efficient, because when a file is mapped by many processes, every modification incurs some synchronization overhead.

Direct I/O (bypass the kernel buffer and manage the I/O buffer yourself)

In Linux 2.6, there is no essential difference between memory mapping and direct access to files, because data still has to undergo two copies on its way between the disk and the process's user-space memory: once between the disk and the kernel buffer, and once between the kernel buffer and user-space memory.
The kernel buffer exists to improve the performance of disk file access: when a process reads a disk file whose contents are already in the kernel buffer, the disk does not need to be accessed again. Likewise, when a process writes data to a file, it actually only writes to the kernel buffer and is told the write has succeeded; the actual write to disk is delayed according to some policy.

However, for more complex applications such as database servers, fully maximizing performance may require bypassing the kernel buffer and implementing and managing the I/O buffer in user space, including the caching mechanism and delayed-write mechanism, to support the application's own query machinery. A database, for example, can raise the hit rate of its query cache by using a more appropriate caching strategy. Bypassing the kernel buffer also reduces system memory overhead, since the kernel buffer itself consumes system memory.

With direct I/O, the application accesses disk data without going through the operating system's kernel buffer. The purpose is to eliminate the copy of data from the kernel buffer to the user program's cache. This approach is commonly used in database management systems, where caching of data is implemented by the application itself.
The drawback of direct I/O is that if the requested data is not in the application's cache, it must be loaded from disk every time, which is very slow. In general, direct I/O combined with asynchronous I/O achieves better performance.


Access steps


Linux supports this requirement: pass the O_DIRECT flag to the open() system call, and the file opened with it is accessed directly, bypassing the kernel buffer and thereby avoiding unnecessary CPU and memory overhead.

Incidentally, an option similar to O_DIRECT is O_SYNC, which affects only writes: data written to the kernel buffer is flushed to disk immediately, minimizing data loss in the event of a machine failure, but the data still passes through the kernel buffer.

Sendfile / zero copy (network I/O; Kafka uses this feature)

Common network transmission steps are as follows:

1) The operating system copies the data from the disk into the kernel's page cache.
2) The application copies the data from the kernel cache into the application's buffer.
3) The application writes the data back into the kernel's socket buffer.
4) The operating system copies the data from the socket buffer into the network card's buffer, then sends it out over the network.


1. When the read system call is made, the data is copied into kernel mode via DMA (Direct Memory Access).
2. The CPU then copies the data from kernel mode into a buffer in user mode.
3. After the read call completes, the write call first copies the data from the user-mode buffer into the kernel-mode socket buffer.
4. Finally, the data in the kernel-mode socket buffer is transferred to the network card by a DMA copy.

From this process it is clear that the data makes a pointless round trip from kernel mode to user mode and back, wasting two copies, and both of these are CPU copies that consume CPU resources.



Sendfile transfer requires only one system call. When sendfile is called:
1. The data is first read from disk into the kernel buffer by a DMA copy.
2. The CPU then copies the data from the kernel buffer into the socket buffer.
3. Finally, the data in the socket buffer is copied to the network card's buffer by a DMA copy and sent.
Compared with read/write, sendfile saves one mode switch and one CPU copy. However, the process above shows that copying the data from the kernel buffer to the socket buffer is still unnecessary.

To address this, the Linux 2.4 kernel improved sendfile, as shown in the following figure.
The improved process is as follows:
1. A DMA copy moves the disk data into the kernel buffer.
2. The position and offset of the data to be sent are appended to the socket buffer, rather than the data itself.
3. A DMA gather copy transfers the data directly from the kernel buffer to the network card, using the position and offset recorded in the socket buffer.
After this process, the data reaches the wire with only two copies in total. (Strictly speaking, this zero copy is from the kernel's point of view: the data is zero-copy within kernel mode.)
At present, many high-performance HTTP servers have adopted the sendfile mechanism, such as Nginx and lighttpd.

FileChannel.transferTo (zero copy in Java)

The FileChannel.transferTo(long position, long count, WritableByteChannel target) method in Java NIO transfers data from the current channel to the target channel. On Linux systems that support zero copy, the implementation of transferTo() relies on the sendfile() call.


The traditional mode, compared with zero copy:


The entire data path involves four data copies and two system calls. If sendfile is used, multiple data copies can be avoided: the operating system copies the data directly from the kernel page cache to the NIC buffer, which can greatly speed up the whole process.

Most of the time, we are requesting static files such as images and style sheets from a web server. From the earlier discussion, we know that while these requests are being processed, the disk file's data first passes through the kernel buffer before reaching the user's memory space. Because the data is static and needs no processing, it is then copied to the kernel buffer associated with the network card and sent to the network card for transmission.

The data leaves the kernel, makes a round trip, and returns to the kernel unchanged, which looks like a waste of effort. The Linux 2.4 kernel tentatively introduced a kernel-level web server program called khttpd, which handles only requests for static files. The aim was to complete request processing inside the kernel as far as possible, reducing the overhead of switching between kernel and user mode and of copying data through user space.

At the same time, Linux exposes this mechanism to developers through a system call, sendfile(). It can transfer a specific range of a disk file directly to the socket descriptor representing the client, speeding up requests for static files and reducing CPU and memory overhead.

Sendfile is not supported on OpenBSD or NetBSD. Tracing with strace shows that Apache uses the mmap() system call for memory mapping when serving a 151-byte small file, but when Apache serves large files, memory mapping would incur a large memory overhead that is not worth the cost, so Apache transfers them with sendfile64(). sendfile64() is an extended implementation of sendfile(), available in Linux kernels after 2.4.

This does not mean sendfile makes a significant difference in every scenario. It plays a smaller role for requests of small static files: in a stress test simulating 100 concurrent users requesting a 151-byte static file, the throughput with sendfile was almost unchanged. For small file requests, the time spent sending data over the link is a much smaller fraction of the whole process than for large file requests, so optimizing that part naturally yields little visible effect.