The IO Subsystem

{My notes on Linux IO and DMA – all basic stuff}

An application cannot write directly to devices or to the network; it must channel its requests through the kernel. Similarly, when an application wants to read data, it performs a read system call: the kernel loads the data into an input buffer and then signals the application that the data is ready to be consumed. While an application or an application server can be written without understanding what goes on under the hood, understanding the IO subsystem is often essential when tuning high-performance servers. The notes on this page try to concisely capture the key concepts of the IO subsystem of an operating system, and set the stage for understanding blocking IO, non-blocking IO, and epoll edge-triggered channel selection.

Some Basic Basics

  • An application cannot read directly from the devices, or the network. It needs to request via the kernel.
  • The memory is divided into two regions – the user space (a.k.a. high memory) and the kernel space (a.k.a. low memory). The application runs in the user space and does not have access to all parts of memory. This access restriction ensures that applications cannot do bad things like halting the CPU or overwriting the kernel’s data. The kernel code executes in the kernel space and has access to all parts of memory, including the user space.
  • IO refers to the data read from and written to one of the devices, or the network. For any IO, the application needs to request the kernel via system calls. The parts of the kernel that handle IO-related system calls are called drivers (which include specialized interrupt handler routines).
  • IO transfers are of two types– block oriented (for devices like the disk) and stream oriented (like the network I/O).

Direct Memory Access

IO does not require the CPU to move every byte itself, or else the benefits of multithreading and CPU optimization would be lost. The Direct Memory Access (DMA) module is specialized hardware that performs the IO and lets the CPU know when the data has been read or written. The DMA module controls the exchange of data between main memory and the devices. Many drivers, including those for the disk drive, the network, and the graphics card, leverage the DMA channel for efficient data transfers. It is also used for data transfers between cores of a multi-core processor.

  • The DMA is a specialized processor in its own right, with an address register and a logical unit.
  • The drivers can allocate DMA buffer, when a DMA hardware is available.
  • The CPU is interrupted only after the entire block of data has been transferred.

The IO Playground

There are four key players involved in any network IO operation:

  • The application: The application initiates the IO, usually from a thread. We shall use the terms application, process, and thread interchangeably to represent the code that initiates the IO.
  • The network driver: The network driver provides native read and write calls for receiving and sending data over the network. It also performs other functions such as opening a socket, accepting a connection, select, poll, etc. The network driver is usually a part of the kernel. Most of this discussion is not restricted to the network driver, and applies to other device drivers as well.
  • IO Buffers: IO buffers are the data structures used by the network driver. In particular, when an application issues a read call on the driver, it tries to read the data from the input buffer, and when it issues a write call, it tries to write the data to the output buffer. One of the specialized modules (like the DMA module) can take care of filling the buffer when a client is writing to the socket, or flushing the buffer when the application has written to the socket. Thus an application need not worry about reading or writing on a byte-by-byte basis. The device drivers can implement a single, double, or circular buffer (aka DMA ring buffer) to optimize this process.
    cat /proc/sys/net/ipv4/tcp_rmem shows the min, default, and max sizes of the TCP receive buffer on Linux.
    cat /proc/sys/net/ipv4/tcp_wmem shows the min, default, and max sizes of the TCP write buffer on Linux.
  • Application Buffers: Optionally, the application may also require an app-side buffer to manage the data it intends to read from or write to the network. To appreciate the need for an application buffer, consider the scenarios below:
  • Consider a scenario where the application is reading a huge amount of data from a network client. The app must read a chunk of this data into its own data structure and process it before attempting to read another chunk; or else it will soon be out of memory.
    while( haveMoreToRead() && !eof() ) {
        appBuffer = read();    //read the next chunk of incoming data into the app buffer
        process(appBuffer);    //consume the chunk before reading more
    }
  • Consider another scenario where the application is writing a huge amount of data to a network client. The app must prepare a small chunk of data to be written in its own data structure, write it to the network, and then prepare the next chunk; or else it will exhaust the memory.
    while( haveMoreToWrite() ) {
        generateChunkOfDataToWrite(appBuffer);  //generate piecemeal data for writing out
        write(appBuffer);                       //flush the chunk to the network
    }

Scatter and Gather

In order to further reduce the number of system calls, an application can pass a list of buffers to the driver, in a single read or write call. The driver can then fill or drain multiple buffers in sequence. Scattering refers to reading the IO buffer into multiple app buffers. Gathering refers to writing data from multiple app buffers into the IO buffer through a single system call.

TBD: Image

Virtual Memory

Virtual memory is used for the reasons below:

  • Provides an addressable space larger than the physical memory. This is done using Memory Paging.
  • Allows for overlapping the user space and the kernel space in physical memory. This is a very neat technique, because it can eliminate the copies between the kernel and the user space (provided they share the same page alignment). Even though the drivers copy the data in kernel space, the same data is reflected in the user space that is mapped to the same physical address range. This is what makes Memory Mapped File IO possible.

TBD: Image

Memory Paging

Memory Paging is required to support an addressable space that is larger than the physical memory. It works using the below process, which is transparent to the application referencing the memory location:

  • A Paging Area on the disk acts as the entire Virtual Memory (addressable space). The physical memory acts as a cache for this paging area.
  • A Memory Page is a fixed-size, contiguous range of memory locations.
  • Memory Management Unit (MMU) hardware sits between the CPU and the Main Memory. It contains mappings of Virtual Pages -> Physical Pages, for all data in the physical memory.
  • When CPU references a memory location, the MMU finds which page the location resides in.
  • If there is no mapping between the virtual page and the physical page, the MMU raises a Page Fault.
  • This results in a page-in: the kernel issues a read to the disk driver, which fetches the page from the paging area into physical memory.
  • This in turn might require a page-out to make room for the new page. A page-out writes a page back to the paging area on disk (only if the page is dirty; a clean page can simply be discarded).
  • The MMU is updated with the new mapping.
  • This entire process is called Demand Paging.
  • The condition in which paging demand becomes so high that little else can be done is called thrashing.

File IO

File IO uses a technique similar to demand paging to load parts of a file into physical memory. Instead of a mapping between Virtual Pages -> Physical Pages, a mapping is maintained between Disk File Pages -> Physical Memory Pages. Page faults are generated if the requested file page is not in physical memory. Memory paging and File IO on paginated operating systems are two sides of the same coin.

Memory Mapped Files

Memory Mapped Files are another concept linked to Virtual Memory, and are managed by the MMU. In conventional File IO, the data is transferred between the kernel space and the user space. Memory mapped IO is a special type of File IO that allows applications to take full advantage of page-oriented IO, and to completely avoid copying data between the IO and application buffers. It uses the file system to establish a virtual memory mapping from the user space to the file pages. Advantages of memory mapped IO:

  • The application views the file as memory data. It does not need to issue read/write calls.
  • As the application references this memory, page faults are generated that would load the file pages in memory. The dirty pages in memory are paged out and flushed to disk.
  • Copying between the IO buffer and the app buffer is completely eliminated. Thus, very large files can be memory mapped without consuming memory for two different buffers.
  • Most operating systems handle page-aligned buffers, sized in multiples of the native page size, most efficiently.

File Locking

File locks allow an app to prevent another app from accessing a file region. Some key points:

  • It is not necessary to lock the entire file. Applications can choose to lock just a file region.
  • File locks are of two types – shared and exclusive. A shared lock is acquired by an app that wants to read a file region. An exclusive lock is acquired by an app that wants to write to a file region. An exclusive lock on a file region can only be granted when no app holds a shared lock on that region.
  • File locks can be advisory or mandatory. Advisory locks are hints; cooperating applications are expected to check for the lock and act accordingly. Mandatory locks are enforced by the kernel, and block apps until the lock is available. The OS documentation must be checked to find whether the OS supports advisory locks, mandatory locks, or both.
  • File locks are arbitrated at a process level, and not at the thread level. Thus, if one thread acquires an exclusive lock, another thread in the same process will also be granted an exclusive lock. However, another process will be denied, until the first process releases the lock.
