Sunday, August 25, 2019

Observations on using memory mapped files to improve I/O performance

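For work I once had to write a program that processed files of a gigabyte or more. At first I read the files with a conventional file stream, but it was slow enough to drag down the overall performance of the downstream system. After switching to a memory mapped file (mmap), read performance improved markedly.
With mmap, the file data is loaded into the cache buffer in kernel space and that memory is then mapped to a virtual address in user space, so the application can read and write the data without copying it into user space. This is generally considered to improve I/O performance.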

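Running cat /proc/<pid>/maps shows the mmap state of a process. In the example below, the third line shows that the contents of the file test2.csv are mapped to the process's virtual addresses 43376000-76e17000.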
00010000-00011000 r-xp 00000000 b3:07 393579     /home/pi/Downloads/mytest
00020000-00021000 rw-p 00000000 b3:07 393579     /home/pi/Downloads/mytest
43376000-76e17000 r--p 00000000 b3:07 393560     /home/pi/Downloads/test2.csv
76e17000-76f42000 r-xp 00000000 b3:07 402094     /lib/arm-linux-gnueabihf/libc-2.19.so
76f42000-76f52000 ---p 0012b000 b3:07 402094     /lib/arm-linux-gnueabihf/libc-2.19.so
76f52000-76f54000 r--p 0012b000 b3:07 402094     /lib/arm-linux-gnueabihf/libc-2.19.so
76f54000-76f55000 rw-p 0012d000 b3:07 402094     /lib/arm-linux-gnueabihf/libc-2.19.so
76f55000-76f58000 rw-p 00000000 00:00 0
76f6c000-76f71000 r-xp 00000000 b3:07 790330     /usr/lib/arm-linux-gnueabihf/libarmmem.so
76f71000-76f80000 ---p 00005000 b3:07 790330     /usr/lib/arm-linux-gnueabihf/libarmmem.so
76f80000-76f81000 rw-p 00004000 b3:07 790330     /usr/lib/arm-linux-gnueabihf/libarmmem.so
76f81000-76fa1000 r-xp 00000000 b3:07 400332     /lib/arm-linux-gnueabihf/ld-2.19.so
76fab000-76fb0000 rw-p 00000000 00:00 0
76fb0000-76fb1000 r--p 0001f000 b3:07 400332     /lib/arm-linux-gnueabihf/ld-2.19.so
76fb1000-76fb2000 rw-p 00020000 b3:07 400332     /lib/arm-linux-gnueabihf/ld-2.19.so
7eb79000-7eb9a000 rwxp 00000000 00:00 0          [stack]
7ed8e000-7ed8f000 r-xp 00000000 00:00 0          [sigpage]
7ed8f000-7ed90000 r--p 00000000 00:00 0          [vvar]
7ed90000-7ed91000 r-xp 00000000 00:00 0          [vdso]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]
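
The same check can be made from inside a Java process. The following is a minimal sketch, not part of the original test program: the class name ProcMapsCheck and the temporary file are hypothetical, and it assumes a Linux system where /proc/self/maps is readable. It maps a small temporary file and returns the line of /proc/self/maps that mentions it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only.
final class ProcMapsCheck {

  // Maps a small temporary file and returns the matching line of /proc/self/maps, if any.
  static Optional<String> findOwnMapping() {
    Path maps = Paths.get("/proc/self/maps");
    Path tmp = null;
    try {
      tmp = Files.createTempFile("mmap-demo", ".bin");
      Files.write(tmp, new byte[4096]);
      try (FileChannel fc = FileChannel.open(tmp, StandardOpenOption.READ)) {
        MappedByteBuffer buff = fc.map(MapMode.READ_ONLY, 0, fc.size());
        if (!Files.isReadable(maps)) {
          return Optional.empty();  // no Linux-style /proc here, nothing to inspect
        }
        String name = tmp.getFileName().toString();
        List<String> lines = Files.readAllLines(maps);
        Optional<String> hit = lines.stream().filter(l -> l.contains(name)).findFirst();
        buff.get(0);  // touch the buffer so the mapping is still in use at this point
        return hit;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      if (tmp != null) {
        try {
          Files.deleteIfExists(tmp);
        } catch (IOException ignored) {
          // best effort cleanup of the temporary file
        }
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(findOwnMapping().orElse("no mapping line found"));
  }
}
```

On a system without /proc/self/maps the method returns an empty Optional rather than failing.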

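System performance, however, is the result of many interacting factors, and mmap does not always help; the system has to be considered as a whole.
Here I use a simple Java program to test reading large files with mmap, and compare the results of running it under different system conditions.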

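import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;

// filename and log(...) belong to the enclosing test class and are not shown here.

// reading with memory mapped file...
  private static int mapped(){
    int c = 0;
    log("mapped file processing...");
    try (RandomAccessFile f = new RandomAccessFile(filename, "r");
         FileChannel fc = f.getChannel()) {
      long pos = 0, len = f.length();
      while (len > 0) {
        // one mapping holds at most Integer.MAX_VALUE bytes, so map the file in chunks
        MappedByteBuffer buff = fc.map(MapMode.READ_ONLY, pos,
          (len > Integer.MAX_VALUE) ? Integer.MAX_VALUE : len);
        log("%d, %d, %d", buff.remaining(), buff.position(), buff.limit());
        while (buff.hasRemaining()) {
          byte b = buff.get();
          if (b == '\n') c++;  // count newline bytes
        }
        // advance to the next chunk
        len -= buff.limit();
        pos += buff.limit();
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    return c;
  }

// reading with file stream...
  private static int normal(){
    int c = 0;
    log("standard io processing...");
    try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
      String r = br.readLine();
      while (r != null) {
        c++;  // count lines
        r = br.readLine();
      }
    } catch (IOException e) {  // also covers FileNotFoundException
      e.printStackTrace();
    }
    return c;
  }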
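
To convince myself that the chunked mapping loop and the file stream loop count the same thing, a small self-check helps. The sketch below is an illustration under assumptions, not the original test program: the class LineCountCheck, its method names and the chunk parameter are hypothetical, and the chunk size is made configurable only so that the multi-chunk path can be exercised on a tiny temporary file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only.
final class LineCountCheck {

  // Counts '\n' bytes by mapping the file in windows of at most chunk bytes.
  // chunk must be between 1 and Integer.MAX_VALUE, the limit of a single mapping.
  static long countMapped(Path file, long chunk) {
    long count = 0;
    try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
      long pos = 0, len = fc.size();
      while (len > 0) {
        long size = Math.min(len, chunk);
        MappedByteBuffer buff = fc.map(MapMode.READ_ONLY, pos, size);
        while (buff.hasRemaining()) {
          if (buff.get() == '\n') count++;
        }
        pos += size;
        len -= size;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return count;
  }

  // Counts lines with a buffered reader, as the file stream version does.
  static long countStream(Path file) {
    long count = 0;
    try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.US_ASCII)) {
      while (br.readLine() != null) count++;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return count;
  }

  // Writes a temporary file with the given number of lines and counts it both ways.
  static long[] demo(int lines, long chunk) {
    try {
      Path tmp = Files.createTempFile("lines", ".csv");
      try {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) sb.append("row,").append(i).append('\n');
        Files.write(tmp, sb.toString().getBytes(StandardCharsets.US_ASCII));
        return new long[] { countMapped(tmp, chunk), countStream(tmp) };
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    long[] r = demo(1000, 64);
    System.out.println("mapped=" + r[0] + " stream=" + r[1]);
  }
}
```

The two counts agree only when every line ends with '\n'; a file whose last line has no trailing newline would give a stream count one higher than the mapped count.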

Test environment A: Raspberry Pi 3B, Raspbian, Oracle JDK, 1GB RAM, 32GB microSD
Test environment B: Rock64, Armbian, OpenJDK, 4GB RAM, 64GB eMMC
Test environment C: Rock64, Armbian, OpenJDK, 4GB RAM, 32GB microSD

T1 : test file size 866782000 bytes, read with file stream
T2 : test file size 1733564000 bytes, read with file stream
T3 : test file size 2783650000 bytes, read with file stream
T1': test file size 866782000 bytes, read with memory mapped file
T2': test file size 1733564000 bytes, read with memory mapped file
T3': test file size 2783650000 bytes, read with memory mapped file

While the test program runs, data is collected with vmstat:
vmstat -t -n 1|awk -Winteractive 'BEGIN {OFS=","} NR==2 {print $18,$3,$4,$5,$6,$7,$8,$9,$10,$13,$14,$16} NR>2 {print $19,$3,$4,$5,$6,$7,$8,$9,$10,$13,$14,$16}'

If vmstat does not support the -t option, the time column can be obtained with the following command:
vmstat -n 1 | awk -Winteractive 'BEGIN {OFS=","} NR==2 {print "time",$3,$4,$5,$6,$7,$8,$9,$10,$13,$14,$16} NR > 2 {cmd="date +%H:%M:%S"; cmd | getline t;print t,$3,$4,$5,$6,$7,$8,$9,$10,$13,$14,$16;close(cmd)}'
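
The averages reported in the results can be derived from the comma separated output of these commands. The sketch below shows one way to do it, assuming the output was saved as text with the header row first; the class VmstatAverages and its method are hypothetical, and the sample rows in main are made up to show the column order, not measured data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
final class VmstatAverages {

  // Averages the named columns of CSV lines whose first line is the header.
  static Map<String, Double> average(List<String> csvLines, String... columns) {
    List<String> header = Arrays.asList(csvLines.get(0).split(","));
    Map<String, Double> result = new LinkedHashMap<>();
    for (String col : columns) {
      int idx = header.indexOf(col);
      if (idx < 0) throw new IllegalArgumentException("no such column: " + col);
      double sum = 0;
      int n = 0;
      for (String line : csvLines.subList(1, csvLines.size())) {
        if (line.trim().isEmpty()) continue;  // skip blank lines
        sum += Double.parseDouble(line.split(",")[idx]);
        n++;
      }
      result.put(col, n == 0 ? Double.NaN : sum / n);
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up sample rows in the column order printed by the awk command.
    List<String> sample = Arrays.asList(
        "time,swpd,free,buff,cache,si,so,bi,bo,us,sy,wa",
        "10:00:01,0,100000,2000,300000,0,0,20000,0,6,2,18",
        "10:00:02,0,99000,2000,301000,0,0,22000,4,8,1,16");
    System.out.println(average(sample, "bi", "us", "sy", "wa"));
  }
}
```

Looking columns up by header name rather than by position keeps the sketch independent of which vmstat fields the awk command selects.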

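Results (sec: run time in seconds; avg_bi: average of vmstat bi, blocks read in per second; avg_us, avg_sy, avg_wa: average user, system and I/O wait CPU percentages; NA: no result):
         T1        T2         T3         T1'       T2'        T3'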
filesize 866782000 1733564000 2783650000 866782000 1733564000 2783650000
A(RPi)   --------- ---------- ---------- --------- ---------- ----------
 sec     62        123        NA         62        122        NA
 avg_bi  13714.90  13882.83   NA         13769.10  13959.90   NA
 avg_us  20.21     20.32       NA        23.84     24.14      NA
 avg_sy  2.47      2.41        NA        1.68      1.56       NA
 avg_wa  3.19      3.65        NA        0.45      0.62       NA
B(Rock64)--------- ---------- ---------- --------- ---------- ----------
 sec     13        24         54         9         17         27
 avg_bi  60465.14  67717.44   51339.10   94052     99588.10   97086
 avg_us  24.79     25.24      25.83      20.22     19.12      20.82
 avg_sy  7.07      7.48       5.83       6.44      6.59       6.43
 avg_wa  0.29      0.16       0.11       5.56      5.35       4.32
C(Rock64)--------- ---------- ---------- --------- ---------- ----------
 sec     39        76         123        39        77         123
 avg_bi  21566.97  22056.26   22081.50   21829.54  21882.49   22096.29
 avg_us  7.13      6.63       10.68      3.46      3.34       3.91
 avg_sy  1.85      1.95       1.90       1.05      1.00       1.03
 avg_wa  17.31     17.71      12.97      20.97     20.77      20.31
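
Differences, memory mapped file minus file stream (the percentage in the sec rows is the relative reduction in run time):
         T1'-T1     T2'-T2     T3'-T3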
A(RPi)   ---------  ---------  ---------
 sec     0          -1(0.8%)   NA
 us      +3.63      +3.82      NA
 sy      -0.79      -0.85      NA
 wa      -2.74      -3.03      NA
B(Rock64)---------  ---------  ---------
 sec     -4(30.7%)  -7(29.1%)  -27(50%)
 us      -4.57      -6.12      -5.01
 sy      -0.63      -0.89      +0.6
 wa      +5.27      +5.19      +4.21
C(Rock64)---------  ---------  ---------
 sec     0          +1(-1.3%)  0
 us      -3.67      -3.29      -6.77
 sy      -0.80      -0.95      -0.87
 wa      +3.66      +3.06      +7.34
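
The percentages in the sec rows follow from the run times as (file stream time - mmap time) / file stream time. A minimal sketch of that arithmetic, using the environment B run times from the results table (the class TimeReduction and its method are hypothetical names); the figures in the table appear to be cut to one decimal place.

```java
// Hypothetical helper for illustration only.
final class TimeReduction {

  // Relative reduction in run time, in percent; a negative value means mmap was slower.
  static double reductionPercent(double streamSec, double mmapSec) {
    return (streamSec - mmapSec) / streamSec * 100.0;
  }

  public static void main(String[] args) {
    // {file stream sec, mmap sec} for T1, T2, T3 in environment B
    int[][] b = { {13, 9}, {24, 17}, {54, 27} };
    for (int[] p : b) {
      System.out.printf("%d s -> %d s: %.2f%%%n", p[0], p[1], reductionPercent(p[0], p[1]));
    }
  }
}
```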

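The data show that in test environment B, reading with a memory mapped file gives the expected overall improvement: the average block in rate rises, and the distribution of CPU usage shifts somewhat.
In test environments A and C, overall performance and the average block in rate are almost unchanged, even though the distribution of CPU usage also changes.
Test environments B and C are both Rock64 boards and differ only in storage: eMMC vs. microSD. Running sysbench --test=fileio --file-test-mode=seqrd run on each shows that the file read limits are 105.06Mb/sec and 21.799Mb/sec respectively.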
----- fileio benchmark on system C: Rock64 with microSD -----
Operations performed:  131072 Read, 0 Write, 0 Other = 131072 Total
Read 2Gb  Written 0b  Total transferred 2Gb  (21.799Mb/sec)
 1395.16 Requests/sec executed

Test execution summary:
    total time:                          93.9476s
    total number of events:              131072
    total time taken by event execution: 93.8404
    per-request statistics:
         min:                                  0.01ms
         avg:                                  0.72ms
         max:                                 10.36ms
         approx.  95 percentile:               5.64ms

Threads fairness:
    events (avg/stddev):           131072.0000/0.00
    execution time (avg/stddev):   93.8404/0.00

----- fileio benchmark on system B: Rock64 with emmc -----
Operations performed:  131072 Read, 0 Write, 0 Other = 131072 Total
Read 2Gb  Written 0b  Total transferred 2Gb  (105.06Mb/sec)
 6723.53 Requests/sec executed

Test execution summary:
    total time:                          19.4945s
    total number of events:              131072
    total time taken by event execution: 19.3889
    per-request statistics:
         min:                                  0.01ms
         avg:                                  0.15ms
         max:                                 13.04ms
         approx.  95 percentile:               1.08ms

Threads fairness:
    events (avg/stddev):           131072.0000/0.00
    execution time (avg/stddev):   19.3889/0.00

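In test environment B, when the test set (T1, T2, T3) reads with a file stream, block in stays below the system's file read limit, so the test set (T1', T2', T3') that reads with a memory mapped file has room to improve.
In test environment C, when the test set (T1, T2, T3) reads with a file stream, block in is already close to the system's file read limit, so reading with a memory mapped file has no room left to improve. At that point there is probably no remedy other than faster storage hardware.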
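
This comparison can be made explicit by converting avg_bi to the same unit as the sysbench figure. The sketch below rests on two assumptions that I have not verified on these boards: that vmstat reports bi in 1024-byte blocks, and that the "Mb" printed by sysbench means 1024*1024 bytes. The class ThroughputCheck and its methods are hypothetical names. Under those assumptions, environment C already reaches roughly 99% of the sysbench limit with file stream reads (T3), while environment B goes from about 48% (T3) to about 90% (T3').

```java
// Hypothetical helper for illustration only.
final class ThroughputCheck {

  // Assumption: one vmstat block is 1024 bytes.
  static final double BLOCK_BYTES = 1024.0;

  // Converts an average bi value (blocks per second) to MiB per second.
  static double mibPerSec(double avgBi) {
    return avgBi * BLOCK_BYTES / (1024.0 * 1024.0);
  }

  // Share of the sysbench sequential read limit that the measured rate reaches, in percent.
  static double percentOfLimit(double avgBi, double limitMibPerSec) {
    return 100.0 * mibPerSec(avgBi) / limitMibPerSec;
  }

  public static void main(String[] args) {
    double limitB = 105.06;  // sysbench result, Rock64 with eMMC
    double limitC = 21.799;  // sysbench result, Rock64 with microSD
    System.out.printf("B T3  (file stream): %.1f%%%n", percentOfLimit(51339.10, limitB));
    System.out.printf("B T3' (mmap):        %.1f%%%n", percentOfLimit(97086.0, limitB));
    System.out.printf("C T3  (file stream): %.1f%%%n", percentOfLimit(22081.50, limitC));
    System.out.printf("C T3' (mmap):        %.1f%%%n", percentOfLimit(22096.29, limitC));
  }
}
```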

