I recently got involved in an interesting performance problem in which the customer was seeing the time to copy a 500MB file ("timex cp <500mb file> /vxfs_filesystem") differ by up to a factor of 4 between two VxFS filesystems on the same box. The only things which had changed (I’m actually smiling at the word *only*) between the two dates were:
* Various OS Patches
* Symantec Veritas Volume Manager Patch
* Upgraded LPFC Driver Servers
* Upgraded firmware
* Upgraded Power Path Servers
* Upgraded SYMCLI
which doesn’t immediately help in narrowing down probable causes, because far too many things changed at once! So how does one actually track down what’s happening in the lifecycle of the cp on a Solaris 9 box? TNF to the rescue again. Using the TNF wrapper script from one of my previous posts, we could enable and disable TNF whilst the cp was happening, and once we’d got that data back from the customer and extracted and processed the TNF buffers, I first looked at the TNF truss-style output.
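For anyone who hasn’t got the wrapper script: it just drives the standard Solaris prex(1), tnfxtract(1) and tnfdump(1) tools, so a minimal manual equivalent might look something like the following (the 4MB buffer size and the enable-everything probe selection are assumptions on my part, not necessarily what the wrapper does):

# allocate a kernel trace buffer, enable the kernel probes and switch tracing on
prex -k <<'EOF'
buffer alloc 4m
enable $all
trace $all
ktrace on
EOF

timex cp <500mb file> /vxfs_filesystem        # the copy we want to observe

# switch tracing off, extract the kernel buffer and turn it into readable text
prex -k <<'EOF'
ktrace off
EOF
tnfxtract /var/tmp/tnf.rawbuf
tnfdump /var/tmp/tnf.rawbuf > /var/tmp/tnf.out

First, the copy to the filesystem that wasn’t working: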
[andharr@exdev:~/tnf/tnf/notworking]$ more tnf.truss
looking for pid 12045
Now(ms) duration(ms) pid thread call return errno
48335.67488000 15.83081400 12045 0x30101ab09a0 mmap64 0x7fffffff 0
48351.53089800 254.21403800 12045 0x30101ab09a0 write 0x800000 0
48605.75773800 16.39489300 12045 0x30101ab09a0 mmap64 0x7fffffff 0
48622.17487400 244.79986500 12045 0x30101ab09a0 write 0x800000 0
48866.98866100 16.51242700 12045 0x30101ab09a0 mmap64 0x7fffffff 0
48883.52309100 263.08493800 12045 0x30101ab09a0 write 0x800000 0
49146.62003100 16.96265000 12045 0x30101ab09a0 mmap64 0x7fffffff 0
49163.60188400 262.34315500 12045 0x30101ab09a0 write 0x800000 0
49425.95560000 13.35143400 12045 0x30101ab09a0 mmap64 0x7fffffff 0
49439.32559600 256.47619000 12045 0x30101ab09a0 write 0x800000 0
49695.81298800 13.16556800 12045 0x30101ab09a0 mmap64 0x7fffffff 0
49708.99975900 246.71828900 12045 0x30101ab09a0 write 0x800000 0
49955.73060900 15.93346900 12045 0x30101ab09a0 mmap64 0x7fffffff 0
49971.68440100 254.45911200 12045 0x30101ab09a0 write 0x800000 0
50226.15543500 16.82535100 12045 0x30101ab09a0 mmap64 0x7fffffff 0
50242.98686700 262.94091700 12045 0x30101ab09a0 write 0x800000 0
50505.93946600 15.95571100 12045 0x30101ab09a0 mmap64 0x7fffffff 0
50521.91406000 254.57880900 12045 0x30101ab09a0 write 0x800000 0
50776.50887100 16.54139300 12045 0x30101ab09a0 mmap64 0x7fffffff 0
50793.06922600 263.14118400 12045 0x30101ab09a0 write 0x800000 0
51056.22369200 13.62531200 12045 0x30101ab09a0 mmap64 0x7fffffff 0
51069.86692700 265.74658200 12045 0x30101ab09a0 write 0x800000 0
51335.62423100 14.17114700 12045 0x30101ab09a0 mmap64 0x7fffffff 0
51349.81458000 266.32114200 12045 0x30101ab09a0 write 0x800000 0
51616.14980400 14.28020100 12045 0x30101ab09a0 mmap64 0x7fffffff 0
<snip>
[andharr@exdev:~/tnf/tnf/working]$ more tnf.truss
looking for pid 16545
<snip>
Now(ms) duration(ms) pid thread call return errno
74.48367000 0.09929400 16545 0x30129365ce0 mmap64 0x7fffffff 0
74.60224600 145.95268900 16545 0x30129365ce0 write 0x800000 0
220.56909700 15.92858800 16545 0x30129365ce0 mmap64 0x7fffffff 0
236.51624700 145.57631700 16545 0x30129365ce0 write 0x800000 0
382.11048600 17.13331300 16545 0x30129365ce0 mmap64 0x7fffffff 0
399.25084000 148.40430500 16545 0x30129365ce0 write 0x800000 0
547.67306700 15.89682400 16545 0x30129365ce0 mmap64 0x7fffffff 0
563.59125400 143.89640600 16545 0x30129365ce0 write 0x800000 0
707.50126200 16.40793300 16545 0x30129365ce0 mmap64 0x7fffffff 0
723.91583600 144.97527400 16545 0x30129365ce0 write 0x800000 0
868.90455200 15.84185600 16545 0x30129365ce0 mmap64 0x7fffffff 0
884.76377000 145.10353100 16545 0x30129365ce0 write 0x800000 0
1029.88114300 15.60142300 16545 0x30129365ce0 mmap64 0x7fffffff 0
1045.50632900 145.67129000 16545 0x30129365ce0 write 0x800000 0
1191.19362100 16.41825600 16545 0x30129365ce0 mmap64 0x7fffffff 0
1207.63660000 144.74260200 16545 0x30129365ce0 write 0x800000 0
1352.39696400 17.08674700 16545 0x30129365ce0 mmap64 0x7fffffff 0
1369.49027200 143.29896300 16545 0x30129365ce0 write 0x800000 0
1512.80107700 14.10729700 16545 0x30129365ce0 mmap64 0x7fffffff 0
1526.91309500 144.05898900 16545 0x30129365ce0 write 0x800000 0
1670.98672600 16.63788500 16545 0x30129365ce0 mmap64 0x7fffffff 0
1687.63237200 141.60897100 16545 0x30129365ce0 write 0x800000 0
1829.25798500 15.96867300 16545 0x30129365ce0 mmap64 0x7fffffff 0
1845.23249900 144.46080400 16545 0x30129365ce0 write 0x800000 0
1989.70738500 15.94683000 16545 0x30129365ce0 mmap64 0x7fffffff 0
<snip>
Now, that’s interesting: we see a clear difference in the time taken for write() (roughly 245-265ms per 8MB write on the non-working filesystem against roughly 140-150ms on the working one), while the mmap64() times are much the same. So what’s happening in the write() lifecycle?
<not working>
48545.230384 0.006560 12045 1 0x30101ab09a0 224 strategy device: 0x11100000faf block: 39690624 size: 0x10000 buf: 0x3004185e058 flags: 558353 0,0
48545.277271 0.000080 12045 1 0x30101ab09a0 224 strategy device: 0x1240000007a block: 39692928 size: 0x10000 buf: 0x312eec85688 flags: 1289 0,0
48545.416170 0.000800 12045 1 0x30101ab09a0 224 strategy device: 0x11100000faf block: 39690752 size: 0x10000 buf: 0x327286721b0 flags: 558353 0,0
48545.454655 0.002640 12045 1 0x30101ab09a0 224 strategy device: 0x1240000007a block: 39693056 size: 0x10000 buf: 0x506b36947c8 flags: 1289 0,0
48545.544348 0.008562 12045 1 0x30101ab09a0 224 strategy device: 0x11100000faf block: 39690880 size: 0x10000 buf: 0x3004185f458 flags: 558353 0,0
48545.580753 0.012962 12045 1 0x30101ab09a0 224 strategy device: 0x1240000007a block: 39693184 size: 0x10000 buf: 0x312ead7be08 flags: 1289 0,0
48545.642601 0.004721 12045 1 0x30101ab09a0 224 strategy device: 0x11100000faf block: 39691008 size: 0x10000 buf: 0x30070b4af30 flags: 558353 0,0
48545.676366 0.003361 12045 1 0x30101ab09a0 224 strategy device: 0x1240000007a block: 39693312 size: 0x10000 buf: 0x3006ca9bb20 flags: 1289 0,0
48545.735814 0.003681 12045 1 0x30101ab09a0 224 strategy device: 0x11100000faf block: 39691136 size: 0x10000 buf: 0x327cd4835c8 flags: 558353 0,0
<snip>
So on the non-working filesystem our I/O is being issued in 65536-byte (0x10000) chunks, but how long did a single I/O actually take?
48545.230384 – 12045 1 0x30101ab09a0 224 strategy device: 273,4015 block: 39690624 size: 0x10000 buf: 0x3004185e058 flags: 0x88511
48549.243496 – 0 0 0x2a100477d20 64 biodone device: 273,4015 block: 39690624 buf: 0x3004185e058
4.013112ms
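(Matching a strategy probe up with its biodone by eye gets tedious fast; a small nawk filter over the tnfdump output, keyed on the buf address, will do the pairing and print the elapsed time. This is just a sketch based on the field layout in the snippets above, not the script I actually used:)

# print the elapsed ms between each strategy and its matching biodone
nawk '
/strategy/ { for (i = 1; i <= NF; i++) if ($i == "buf:") start[$(i+1)] = $1 }
/biodone/  { for (i = 1; i <= NF; i++)
                 if ($i == "buf:" && ($(i+1) in start)) {
                     printf("%s %.6fms\n", $(i+1), $1 - start[$(i+1)])
                     delete start[$(i+1)]
                 }
           }
' /var/tmp/tnf.out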
<working>
236.516247 0.000720 16545 1 0x30129365ce0 65 syscall_start sysnum: 4 write
236.972870 0.004481 16545 1 0x30129365ce0 65 address_fault address: 0xfe800000 fault_type: 2 access: 1
238.215601 0.068810 16545 1 0x30129365ce0 65 strategy device: 0x11100000fa9 block: 24838144 size: 0x100000 buf: 0x300359fa3a8 flags: 289 0,0
238.247765 0.002240 16545 1 0x30129365ce0 65 strategy device: 0x12400000022 block: 24840448 size: 0x100000 buf: 0x30038cd24a0 flags: 297 0,0
238.461314 0.002240 16545 1 0x30129365ce0 65 thread_block reason: 0x300359fa468 stack: 4 0x114190801048bb0 0x10f28ec01080b30 0x78a4d834789aa770 0x78a7dd6078a798b0
<snip>
238.215601 – 16545 1 0x30129365ce0 65 strategy device: 273,4009 block: 24838144 size: 0x100000 buf: 0x300359fa3a8 flags: 0x121
254.402304 – 0 0 0x2a100471d20 64 biodone device: 273,4009 block: 24838144 buf: 0x300359fa3a8
16.186703ms
So the time for 8MB issued as 1MB chunks:
16.186703 * 8 = 129.493624ms
and 8MB issued as 65536-byte chunks:
4.013112 * 128 = 513.678336ms (in practice it will come in under that, because the I/Os can be queued and completed in parallel rather than strictly one after another)
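(If you want to sanity-check the sums, bc will do; the chunk counts are just 8388608 divided by the I/O size:)

echo '16.186703 * (8388608 / 1048576)' | bc -l    # 1MB chunks:  ~129.5ms per 8MB write
echo '4.013112 * (8388608 / 65536)' | bc -l       # 64KB chunks: ~513.7ms if issued strictly serially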
So it would seem that the difference in time can be explained by the way the I/O is broken up on each filesystem. Perhaps the way the filesystems have been configured explains that? Unfortunately, I didn’t see any differences between the filesystems’ tunables themselves (a quick way of checking that is sketched after the listing below), but some of the configuration values did match the physical write sizes we saw:
Filesystem i/o parameters for /filesystem
read_pref_io = 65536
read_nstream = 1
read_unit_io = 65536
write_pref_io = 65536
write_nstream = 1
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 1048576
discovered_direct_iosz = 262144
max_direct_iosz = 1048576
default_indir_size = 8192
qio_cache_enable = 0
write_throttle = 0
max_diskq = 1048576
initial_extent_size = 1
max_seqio_extent_size = 2048
max_buf_data_size = 8192
hsm_write_prealloc = 0
read_ahead = 1
inode_aging_size = 0
inode_aging_count = 0
fcl_maxalloc = 8130396160
fcl_keeptime = 0
fcl_winterval = 3600
oltp_load = 0
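(For comparing tunables across the two mount points, capturing the vxtunefs output from each and diffing it is the quickest check I know of; something along these lines, with the mount point names obviously being placeholders:)

vxtunefs /vxfs_working > /var/tmp/tune.working
vxtunefs /vxfs_notworking > /var/tmp/tune.notworking
diff /var/tmp/tune.working /var/tmp/tune.notworking    # no output means the tunables match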
Now our write pattern ties in: the non-working cp is issuing I/O at the preferred write size of 65536 (write_pref_io above), whereas the working one is using 1048576, which could be down to discovered direct I/O (each write is well above discovered_direct_iosz, and max_direct_iosz/buf_breakup_size are both 1048576). Each write() is trying to complete 8388608 bytes (8MB), which we can see from the truss:
29986: 0.0146 mmap64(0xFE800000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x07000000) = 0xFE800000
29986: 0.2394 write(4, "