Data Travel Guide

Just for fun

Accelerating NVMe with dm-pcache: Achieving 4µs Latency with DAX Persistent Cache

This post shows how to use dm-pcache — a Device-Mapper based persistent caching layer — to accelerate an NVMe block device using a DAX (devdax) memory device.

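With dm-pcache, 4K random writes at iodepth=1 go from 59K to 224K IOPS with 1 job and from 446K to 1.38M IOPS with 8 jobs, and average latency with 1 job drops from 15.6µs to 4.0µs.

This guide includes device preparation, a baseline NVMe benchmark, dm-pcache setup and benchmark, an XFS usage example, safe cache removal, and the full fio logs.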


1. Prepare devdax and NVMe

Create a devdax device:

ndctl create-namespace -f -e namespace0.0 --mode=devdax

Example output:

{
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"dev",
  "size":"31.50 GiB (33.82 GB)",
  "uuid":"6cbbe77d-aff8-422b-8251-510676edb293",
  "daxregion":{
    "id":0,
    "size":"31.50 GiB (33.82 GB)",
    "align":2097152,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":"31.50 GiB (33.82 GB)",
        "target_node":0,
        "align":2097152,
        "mode":"devdax"
      }
    ]
  },
  "align":2097152
}

Check the NVMe partition size in 512-byte sectors:

blockdev --getsz /dev/nvme0n1p7
67108864
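
blockdev --getsz reports the size in 512-byte sectors, which is also the unit used for the length in the device-mapper table later on. As a quick sanity check, the sector count can be converted to GiB with plain shell arithmetic. The helper name sectors_to_gib is only illustrative.

```shell
# Convert a count of 512-byte sectors (as printed by blockdev --getsz)
# to whole GiB. sectors_to_gib is a hypothetical helper, not part of any tool.
sectors_to_gib() {
    echo $(( $1 * 512 / 1024 / 1024 / 1024 ))
}

sectors_to_gib 67108864   # prints 32
```

So the backing partition here is 32 GiB, slightly larger than the 31.50 GiB devdax cache device.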

2. Baseline NVMe Performance (fio)

Warning: these fio runs write directly to the raw device and destroy any data on it. Skip this step if the NVMe partition holds data you need.
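
Because the benchmark overwrites its target, it may be worth checking that the partition is not mounted before starting fio. The following is a minimal sketch, assuming a Linux /proc/mounts layout. The function name is_mounted and its optional second argument are made up for illustration, and the check does not detect other users of the device such as swap or LVM.

```shell
# Succeed (exit 0) if device $1 is listed as a mount source.
# $2 optionally names a mounts file; it defaults to /proc/mounts.
# is_mounted is a hypothetical helper, not an existing command.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

if is_mounted /dev/nvme0n1p7; then
    echo "refusing to benchmark: /dev/nvme0n1p7 is mounted"
else
    echo "/dev/nvme0n1p7 is not mounted"
fi
```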

4K random write, 1 job, iodepth=1

4K random write, 8 jobs, iodepth=1

Full logs are included in Appendix [1] (NVMe Performance Logs).


3. Create a dm-pcache Device

dmsetup create pcache_nvme0n1p7 \
  --table '0 67108864 pcache /dev/dax0.0 /dev/nvme0n1p7 2 data_crc false'

Arguments:

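| Argument | Meaning |
|---|---|
| 67108864 | Device length in 512-byte sectors, the result of blockdev --getsz /dev/nvme0n1p7 |
| /dev/dax0.0 | Cache device |
| /dev/nvme0n1p7 | Backing NVMe device |
| 2 | Number of optional arguments that follow |
| data_crc false | Disable data CRC |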
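
The table line can also be assembled from variables, so that the sector count always comes from the device being mapped. This is only a sketch: pcache_table is a hypothetical helper, and its output simply mirrors the layout of the command above.

```shell
# Print a dm-pcache table line in the same layout as the command above:
#   0 <sectors> pcache <cache_dev> <backing_dev> 2 data_crc false
# pcache_table is a hypothetical helper name.
pcache_table() {
    sectors=$1
    cache_dev=$2
    backing_dev=$3
    echo "0 $sectors pcache $cache_dev $backing_dev 2 data_crc false"
}

pcache_table 67108864 /dev/dax0.0 /dev/nvme0n1p7

# Possible use (needs root and real devices, so it is left commented out):
#   dmsetup create pcache_nvme0n1p7 \
#     --table "$(pcache_table "$(blockdev --getsz /dev/nvme0n1p7)" /dev/dax0.0 /dev/nvme0n1p7)"
```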

Resulting device:

/dev/mapper/pcache_nvme0n1p7

4. dm-pcache Performance

4K random write, 1 job

4K random write, 8 jobs

Comparison Table

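| Test | NVMe | dm-pcache | Improvement |
|---|---|---|---|
| 1 job, IOPS | 59K | 224K | 3.8× |
| 8 jobs, IOPS | 446K | 1.38M | 3.1× |
| 1 job, average latency | 15.6µs | 4.0µs | 3.9× lower |

The 8-job IOPS figures are the averages of fio's periodic iops samples; the run summary lines in the appendix report 390K and 1146K IOPS, a 2.9× improvement.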
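
The improvement column is the ratio of the two measurements, rounded to one decimal place. A small awk-based sketch to reproduce it (the ratio helper is illustrative, not part of fio):

```shell
# Print a/b rounded to one decimal place. ratio is a hypothetical helper.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 224 59      # 1 job, IOPS in thousands: prints 3.8
ratio 1380 446    # 8 jobs, IOPS in thousands: prints 3.1
ratio 15.6 4.0    # 1 job, average latency in usec: prints 3.9
```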

Full logs are included in the appendix below.


5. Using XFS on dm-pcache

mkfs.xfs /dev/mapper/pcache_nvme0n1p7
mount /dev/mapper/pcache_nvme0n1p7 /media/
echo "pcache testing" > /media/test
umount /media

6. Remove pcache Safely (Writeback Flush)

Check flush status:

dmsetup status pcache_nvme0n1p7

Wait until dirty counters match:

root@sr650:/workspace/linux_compile# dmsetup status
pcache_nvme0n1p7: 0 67108864 pcache 0 2015 2015 602 70 1182 523:11010360 523:11009528 0:0
root@sr650:/workspace/linux_compile# dmsetup status
pcache_nvme0n1p7: 0 67108864 pcache 0 2015 2015 602 70 1182 523:11010360 523:11010360 0:0

Make sure the two values before the final 0:0 are equal (here 523:11010360 and 523:11010360). That means all dirty data in the cache device has been written back to the backing device.
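
This comparison can be scripted. The sketch below assumes the status layout seen in the output above, where the 7th and 8th fields after the pcache keyword are the two seg:offset pairs that must match. Field positions could differ in other kernel versions, and pcache_flushed is a hypothetical helper name.

```shell
# Read `dmsetup status` output on stdin and succeed when, on a pcache line,
# the 7th and 8th fields after the "pcache" keyword are identical.
# Assumption: field positions match the sample output in this post.
pcache_flushed() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i == "pcache" && $(i + 7) != "" && $(i + 7) == $(i + 8))
                    ok = 1
        }
        END { exit !ok }
    '
}

# The two sample lines from above: still flushing, then fully written back.
echo "pcache_nvme0n1p7: 0 67108864 pcache 0 2015 2015 602 70 1182 523:11010360 523:11009528 0:0" \
    | pcache_flushed || echo "still flushing"
echo "pcache_nvme0n1p7: 0 67108864 pcache 0 2015 2015 602 70 1182 523:11010360 523:11010360 0:0" \
    | pcache_flushed && echo "flushed"

# Possible wait loop (needs root and the mapped device, so it is commented out):
#   until dmsetup status pcache_nvme0n1p7 | pcache_flushed; do sleep 1; done
```

Locating the pcache keyword first lets the same function handle both dmsetup status (which prefixes each line with the device name) and dmsetup status pcache_nvme0n1p7 (which does not).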

Remove the cache:

dmsetup remove pcache_nvme0n1p7

Verify backend data:

mount /dev/nvme0n1p7 /media/
cat /media/test

Output:

pcache testing

✔ Data safe ✔ Writeback succeeded ✔ Backing device data intact


Appendix: FULL fio Logs (NVMe + pcache)

Everything below is the raw fio output, unmodified.


Appendix [1] — NVMe Performance Logs

## NVMe — 4K randwrite, 1 job

fio --name=test --iodepth=1 --numjobs=1 --rw=randwrite --bs=4K --filename=/dev/nvme0n1p7 --ioengine=libaio --direct=1 --eta-newline=1 --group_reporting --size=1G --runtime 10

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [w(1)][75.0%][w=238MiB/s][w=60.9k IOPS][eta 00m:01s]
Jobs: 1 (f=1): [w(1)][100.0%][w=228MiB/s][w=58.3k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=31059: Tue Dec  2 14:02:25 2025
  write: IOPS=59.8k, BW=234MiB/s (245MB/s)(1024MiB/4381msec); 0 zone resets
    slat (nsec): min=1552, max=335779, avg=2551.75, stdev=1593.71
    clat (nsec): min=660, max=232329, avg=13063.51, stdev=3259.11
     lat (usec): min=10, max=391, avg=15.62, stdev= 4.62
    clat percentiles (nsec):
     |  1.00th=[11072],  5.00th=[11200], 10.00th=[11200], 20.00th=[11328],
     | 30.00th=[11328], 40.00th=[11456], 50.00th=[11584], 60.00th=[11968],
     | 70.00th=[12864], 80.00th=[14016], 90.00th=[16320], 95.00th=[21888],
     | 99.00th=[22912], 99.50th=[24192], 99.90th=[32640], 99.95th=[35072],
     | 99.99th=[48384]
   bw (  KiB/s): min=208416, max=260944, per=98.84%, avg=236579.00, stdev=18904.51, samples=8
   iops        : min=52104, max=65236, avg=59144.75, stdev=4726.13, samples=8
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.02%, 20=91.86%, 50=8.10%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=17.85%, sys=31.60%, ctx=262045, majf=0, minf=46
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=234MiB/s (245MB/s), 234MiB/s-234MiB/s (245MB/s-245MB/s), io=1024MiB (1074MB), run=4381-4381msec

Disk stats (read/write):
  nvme0n1: ios=56/245301, sectors=4160/1962408, merge=0/0, ticks=12/2296, in_queue=2309, util=50.55%


## NVMe — 4K randwrite, 8 jobs

fio --name=test --iodepth=1 --numjobs=8 --rw=randwrite --bs=4K --filename=/dev/nvme0n1p7 --ioengine=libaio --direct=1 --eta-newline=1 --group_reporting --size=1G --runtime 10

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 8 processes
Jobs: 8 (f=8): [w(8)][60.0%][w=1746MiB/s][w=447k IOPS][eta 00m:02s]
Jobs: 1 (f=1): [_(4),w(1),_(3)][100.0%][w=1299MiB/s][w=333k IOPS][eta 00m:00s]
test: (groupid=0, jobs=8): err= 0: pid=31228: Tue Dec  2 14:02:46 2025
  write: IOPS=390k, BW=1524MiB/s (1598MB/s)(8192MiB/5377msec); 0 zone resets
    slat (nsec): min=1752, max=256135, avg=2425.68, stdev=937.20
    clat (nsec): min=672, max=5666.7k, avg=14292.19, stdev=19111.77
     lat (usec): min=12, max=5668, avg=16.72, stdev=19.21
    clat percentiles (nsec):
     |  1.00th=[11840],  5.00th=[12224], 10.00th=[12480], 20.00th=[13248],
     | 30.00th=[13504], 40.00th=[13632], 50.00th=[13760], 60.00th=[13888],
     | 70.00th=[14016], 80.00th=[14528], 90.00th=[15680], 95.00th=[17792],
     | 99.00th=[23680], 99.50th=[24960], 99.90th=[30592], 99.95th=[34560],
     | 99.99th=[55552]
   bw (  MiB/s): min= 1654, max= 1806, per=100.00%, avg=1745.36, stdev= 7.23, samples=73
   iops        : min=423536, max=462538, avg=446812.94, stdev=1851.77, samples=73
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=96.62%, 50=3.35%
  lat (usec)   : 100=0.01%, 250=0.01%
  lat (msec)   : 4=0.01%, 10=0.01%
  cpu          : usr=17.81%, sys=26.83%, ctx=2096931, majf=0, minf=163
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2097152,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1524MiB/s (1598MB/s), 1524MiB/s-1524MiB/s (1598MB/s-1598MB/s), io=8192MiB (8590MB), run=5377-5377msec

Disk stats (read/write):
  nvme0n1: ios=166/2083890, sectors=10720/16671120, merge=0/0, ticks=19/22509, in_queue=22528, util=78.89%

Appendix [2] — dm-pcache Performance Logs

## pcache — 4K randwrite, 1 job

fio --name=test --iodepth=1 --numjobs=1 --rw=randwrite --bs=4K --filename=/dev/mapper/pcache_nvme0n1p7 --ioengine=libaio --direct=1 --size=1G --runtime=10

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 1 process
Jobs: 1 (f=1)
test: (groupid=0, jobs=1): err= 0: pid=36200: Tue Dec  2 14:22:35 2025
  write: IOPS=224k, BW=876MiB/s (919MB/s)(1024MiB/1169msec); 0 zone resets
    slat (usec): min=2, max=588, avg= 3.58, stdev= 1.50
    clat (nsec): min=446, max=80264, avg=507.05, stdev=232.83
     lat (usec): min=3, max=597, avg= 4.09, stdev= 1.58
    clat percentiles (nsec):
     |  1.00th=[  458],  5.00th=[  462], 10.00th=[  466], 20.00th=[  470],
     | 30.00th=[  470], 40.00th=[  474], 50.00th=[  478], 60.00th=[  478],
     | 70.00th=[  482], 80.00th=[  494], 90.00th=[  548], 95.00th=[  564],
     | 99.00th=[ 1208], 99.50th=[ 1640], 99.90th=[ 1912], 99.95th=[ 2576],
     | 99.99th=[ 5280]
   bw (  KiB/s): min=890256, max=902456, per=99.93%, avg=896356.00, stdev=8626.70, samples=2
   iops        : min=222564, max=225614, avg=224089.00, stdev=2156.68, samples=2
  lat (nsec)   : 500=82.05%, 750=15.39%, 1000=1.00%
  lat (usec)   : 2=1.50%, 4=0.04%, 10=0.02%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%
  cpu          : usr=16.78%, sys=82.96%, ctx=13, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=876MiB/s (919MB/s), 876MiB/s-876MiB/s (919MB/s-919MB/s), io=1024MiB (1074MB), run=1169-1169msec


## pcache — 4K randwrite, 8 jobs

fio --name=test --iodepth=1 --numjobs=8 --rw=randwrite --bs=4K --filename=/dev/mapper/pcache_nvme0n1p7 --ioengine=libaio --direct=1 --size=1G --runtime=10

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.36
Starting 8 processes
Jobs: 1 (f=0)
test: (groupid=0, jobs=8): err= 0: pid=36297: Tue Dec  2 14:22:41 2025
  write: IOPS=1146k, BW=4477MiB/s (4694MB/s)(8192MiB/1830msec); 0 zone resets
    slat (usec): min=3, max=826, avg= 4.96, stdev= 1.26
    clat (nsec): min=443, max=176653, avg=501.82, stdev=211.06
     lat (usec): min=3, max=827, avg= 5.46, stdev= 1.33
    clat percentiles (nsec):
     |  1.00th=[  458],  5.00th=[  466], 10.00th=[  466], 20.00th=[  474],
     | 30.00th=[  478], 40.00th=[  478], 50.00th=[  482], 60.00th=[  482],
     | 70.00th=[  486], 80.00th=[  490], 90.00th=[  556], 95.00th=[  564],
     | 99.00th=[  812], 99.50th=[ 1208], 99.90th=[ 1912], 99.95th=[ 2960],
     | 99.99th=[ 5472]
   bw (  MiB/s): min= 5305, max= 5529, per=100.00%, avg=5402.12, stdev=28.04, samples=18
   iops        : min=1358322, max=1415642, avg=1382944.00, stdev=7178.11, samples=18
  lat (nsec)   : 500=82.78%, 750=16.03%, 1000=0.31%
  lat (usec)   : 2=0.82%, 4=0.03%, 10=0.03%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=13.23%, sys=86.74%, ctx=37, majf=0, minf=128
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2097152,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=4477MiB/s (4694MB/s), 4477MiB/s-4477MiB/s (4694MB/s-4694MB/s), io=8192MiB (8590MB), run=1830-1830msec