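Compared with linear I/O, scatter/gather I/O has several advantages:
(1) a more natural coding pattern (2) efficiency (3) performance (4) atomicity
Inside the Linux kernel, all I/O is performed in vectored form.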
readv, writev, preadv, pwritev - read or write data into multiple buffers
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
Each iovec structure describes one independent buffer, called a segment; a set of segments is called a vector.
struct iovec {
void *iov_base; /* Starting address */
size_t iov_len; /* Number of bytes to transfer */
};
iovcnt must be ≥ 0 and ≤ IOV_MAX (1024 on Linux).
If the sum of the iovcnt iov_len values exceeds SSIZE_MAX, no data is transferred and the call fails with EINVAL.
grep SSIZE_MAX -r /usr/include -n
SSIZE_MAX = 9223372036854775807
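To make the readv()/writev() prototypes above concrete, here is a minimal sketch. The helper names write_three and read_two are hypothetical and only illustrate how an iovec array is filled in; error handling is left to the caller.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: gather three separate strings into one writev() call.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_three(int fd, const char *a, const char *b, const char *c)
{
    struct iovec iov[3];

    /* iov_base is a plain void *, so the const is cast away here;
     * writev() only reads from these buffers. */
    iov[0].iov_base = (void *)a;  iov[0].iov_len = strlen(a);
    iov[1].iov_base = (void *)b;  iov[1].iov_len = strlen(b);
    iov[2].iov_base = (void *)c;  iov[2].iov_len = strlen(c);

    return writev(fd, iov, 3);
}

/* Hypothetical helper: scatter incoming bytes into two buffers with one
 * readv() call. The first buffer is filled completely before the second. */
ssize_t read_two(int fd, char *head, size_t head_len,
                 char *tail, size_t tail_len)
{
    struct iovec iov[2] = {
        { .iov_base = head, .iov_len = head_len },
        { .iov_base = tail, .iov_len = tail_len },
    };

    return readv(fd, iov, 2);
}
```

A typical use is writing a header and a body held in separate buffers without first copying them into one contiguous buffer.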
2. event poll (Epoll)
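Each call to poll() or select() must supply the complete list of file descriptors to watch, and the kernel must then walk every descriptor in that list. As the list grows (it may contain hundreds or thousands of file descriptors), walking it on every call becomes a scalability bottleneck.
epoll avoids this problem by decoupling the registration of watches from the actual monitoring of events.
(1) The epoll_create1 system call initializes an epoll context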
epoll_create1 - open an epoll file descriptor
int epoll_create1(int flags);
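It returns a file descriptor associated with the new epoll instance. This descriptor has no relation to a real file; it is only a handle for subsequent calls that use the epoll facility. The flags argument currently has only one valid flag, EPOLL_CLOEXEC, which enables close-on-exec behavior.
(2) The epoll_ctl system call adds file descriptors to be watched to the context, or removes them from it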
epoll_ctl - control interface for an epoll descriptor
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
typedef union epoll_data {
void *ptr;
int fd;
uint32_t u32;
uint64_t u64;
} epoll_data_t;
struct epoll_event {
uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
};
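The op parameter specifies the operation to perform on the file associated with fd
EPOLL_CTL_ADD: add a watch on the file (fd) to the epoll instance (epfd)
EPOLL_CTL_DEL: remove the watch on the file (fd) from the epoll instance (epfd)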
epoll ET/LT
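The event parameter further describes the behavior of the operation
EPOLLIN: the file is available to be read from without blocking
EPOLLET: enable edge-triggered behavior for the watch on the file (the default is level-triggered)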
poll() / select(): Level-triggered
epoll(): Level-triggered (default) and Edge-triggered
Edge-triggered mode usually requires nonblocking I/O and careful checking for EAGAIN
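The usual edge-triggered pattern can be sketched as follows: put the descriptor in nonblocking mode, then after each notification keep reading until read() reports EAGAIN. The helper names set_nonblocking and drain_fd are hypothetical, and the data read is simply discarded here.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to nonblocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    if (fl == -1)
        return -1;
    return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Hypothetical helper: after an edge-triggered EPOLLIN notification, read
 * until nothing is left. Returns the number of bytes consumed, or -1 on a
 * real error. A real program would process buf instead of dropping it. */
long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;              /* end of file */
        if (errno == EINTR)
            continue;                  /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;              /* fully drained, wait for next edge */
        return -1;                     /* genuine error */
    }
}
```

Stopping before EAGAIN would leave data in the buffer with no further notification, because an edge-triggered watch only reports new activity.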
(3) The epoll_wait system call performs the actual wait for events.
epoll_wait, epoll_pwait - wait for an I/O event on an epoll file descriptor
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
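The three calls fit together roughly as in this sketch. wait_readable is a hypothetical helper that creates an epoll instance, registers one descriptor for EPOLLIN (level-triggered, the default), and waits once. Creating a fresh instance per call is done only to keep the example short.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to become
 * readable. Returns 1 if EPOLLIN was reported, 0 on timeout (or if only
 * other events such as EPOLLHUP arrived), and -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    /* (1) create the epoll context */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1)
        return -1;

    /* (2) register fd; data.fd is handed back unchanged by epoll_wait */
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    /* (3) wait for events; maxevents caps how many are returned at once */
    struct epoll_event out[8];
    int n = epoll_wait(epfd, out, 8, timeout_ms);

    int result;
    if (n < 0)
        result = -1;                   /* includes EINTR in this sketch */
    else if (n == 0)
        result = 0;                    /* timed out */
    else
        result = (out[0].events & EPOLLIN) ? 1 : 0;

    close(epfd);
    return result;
}
```

A long-running server would instead call epoll_create1 once, register each descriptor once, and call epoll_wait in a loop, which is where the advantage over poll() and select() comes from.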
3. memory-mapped I/O (mmap)
mmap, munmap - map or unmap files or devices into memory
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
prot:
PROT_READ Pages may be read.
PROT_WRITE Pages may be written.
PROT_EXEC Pages may be executed.
flags:
MAP_SHARED
Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
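As an illustration of the mmap() call above, the following sketch maps a file read-only with MAP_PRIVATE and scans it as an ordinary byte array. count_newlines is a hypothetical helper, not part of any library.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a file through a read-only
 * private mapping. Returns the count, or -1 on error. */
long count_newlines(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {             /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                         /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    long count = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (p[i] == '\n')
            count++;

    munmap(p, st.st_size);
    return count;
}
```

Note that failure is signalled by MAP_FAILED, not by NULL.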
Resizing a mapping
mremap - remap a virtual memory address
void *mremap(void *old_address, size_t old_size,
size_t new_size, int flags, ... /* void *new_address */);
mremap() uses the Linux page table scheme. mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).
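The realloc-like use mentioned above might look like this sketch. map_alloc and map_realloc are hypothetical names; the example assumes glibc on Linux, where mremap() and MREMAP_MAYMOVE are only declared when _GNU_SOURCE is defined before the first include.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                    /* mremap() is a Linux extension */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate len bytes of zero-filled anonymous memory.
 * Returns NULL on failure. */
void *map_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: grow or shrink a mapping. MREMAP_MAYMOVE lets the
 * kernel relocate the mapping if it cannot be resized in place, so the
 * caller must use the returned pointer. Returns NULL on failure, in which
 * case the old mapping is left untouched. */
void *map_realloc(void *old, size_t old_len, size_t new_len)
{
    void *p = mremap(old, old_len, new_len, MREMAP_MAYMOVE);
    return p == MAP_FAILED ? NULL : p;
}
```

Existing contents are preserved across the resize, and newly added anonymous pages read as zero.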
Changing the protection flags of a mapping
mprotect - set protection on a region of memory
int mprotect(void *addr, size_t len, int prot);
prot:
PROT_NONE The memory cannot be accessed at all.
PROT_READ The memory can be read.
PROT_WRITE The memory can be modified.
PROT_EXEC The memory can be executed.
Synchronizing a file with a mapping
msync - synchronize a file with a memory map
int msync(void *addr, size_t length, int flags);
flags:
MS_SYNC
MS_ASYNC
MS_INVALIDATE
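As a sketch of how msync() pairs with MAP_SHARED, the hypothetical helper overwrite_via_mmap below modifies a file through a shared mapping and uses MS_SYNC to wait until the change has been written back. It assumes the file already exists and is at least len bytes long.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: overwrite the first len bytes of an existing file
 * through a shared mapping. Precondition: the file is at least len bytes
 * long and len > 0. Returns 0 on success, -1 on error. */
int overwrite_via_mmap(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd == -1)
        return -1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, data, len);              /* modifies the page cache copy */

    /* MS_SYNC blocks until the dirty pages are written back;
     * MS_ASYNC would only schedule the write and return at once. */
    int rc = msync(p, len, MS_SYNC);

    munmap(p, len);
    return rc;
}
```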
Giving advice about how a mapping will be used
madvise - give advice about use of memory
int madvise(void *addr, size_t length, int advice);
4. file advice
5. asynchronous I/O
kqueue