As an engineer it's always a good feeling getting to the bottom of a problem, and none more so than this one. Take a T5140, create a 3-way RAID 0 LUN using the internal disks and stick Oracle on top so you can do something useful with it for your application… and what do you get? A problem. I suspect after that opening some of you are thinking "where's he going with this?" The answer: nowhere. I'm not picking holes in either the T5140 or Oracle… good, I'm glad we got that one clear! 🙂
Anyhow, a customer came to us complaining that their application wasn't running as expected on this platform, and they really wanted to know whether there was a hardware fault or bug with the platform or the operating system running on it. From the tests the customer had been running themselves, they believed the bottleneck was the underlying I/O subsystem, in this case the LSI hardware RAID. Essentially, they had configured a 3-disk RAID 0 stripe using the default 64K stripe size, like this:
bash-3.00# raidctl -l c1t1d0
Volume                  Size    Stripe  Status   Cache  RAID
        Sub                     Size                    Level
                Disk
----------------------------------------------------------------
c1t1d0                  409.9G  64K     OPTIMAL  OFF    RAID0
                0.1.0   136.6G          GOOD
                0.2.0   136.6G          GOOD
                0.3.0   136.6G          GOOD
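For reference, a volume like that is built with raidctl along these lines (a sketch rather than the customer's exact command; the disk names are assumed, and RAID 0 volumes get the default 64K stripe size unless you specify otherwise):

bash-3.00# raidctl -c -r 0 c1t1d0 c1t2d0 c1t3d0
bash-3.00# raidctl -l c1t1d0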
They had then created a single slice on which Oracle was installed, configured for Direct I/O (a good thing anyway on a UFS filesystem, typically via the forcedirectio mount option), so we were avoiding the filesystem buffer cache and double buffering. A 64K stripe per disk across three disks gives us a total stripe width of 192K. The throughput of each of these disks is between 50-60MB per second, which gives a theoretical throughput across all stripes of 150-180MB per second for reads. We can forget writes: Oracle is pwrite()'ing in synchronous 8K chunks to a volume with the write cache disabled, and each 8K write only hits one disk (because 8K is less than the 64K stripe size), so writes proceed at single-disk speed. That is why we saw a 1GB tablespace creation take 18 seconds at an average throughput of 56MB per second, which is exactly what we would expect from a single disk.
SQL> set timing on
SQL> create tablespace andy3
2 datafile '/u01/oracle/oradata/SUN/andy03.dbf'
3 size 1g;
Tablespace created.
Elapsed: 00:00:18.12
and iostat -xnz 1 shows us:
                    extended device statistics
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
   13.0  128.0  104.0  56369.0  0.0  1.6    0.0   11.2   1  93 c1t1d0
   14.0   78.0  112.0  56281.3  0.0  1.2    0.1   13.5   0  93 c1t1d0
   13.0   93.0  112.0  53734.0  0.0  1.4    0.0   13.4   1  93 c1t1d0
   13.0   95.0  104.0  58397.6  0.0  1.4    0.0   12.7   1  92 c1t1d0
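The arithmetic stacks up too; a quick back-of-the-envelope check (bc used purely as a calculator here):

bash-3.00# echo "scale=1; 1024/18.12" | bc
56.5

1GB in 18.12 seconds is 56.5MB per second, which tallies with the ~56,000 kw/s iostat reports above and with what a single disk can sustain.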
This result was the same as the customer's, but things get interesting when we start looking at full table scan parallel queries. The customer ended up with these results:
Parallelism | Time (s) | Throughput (MB/s) | db_file_multiblock_read_count
------------|----------|-------------------|------------------------------
0           | 719      | 13.8              | 16 (16 x 8K = 128K)
2           | 195      | 50.9              | 16
4           | 208      | 47.7              | 16
0           | 420      | 23.6              | 32 (32 x 8K = 256K)
2           | 163      | 60.9              | 32
4           | 208      | 53.9              | 32
Now, those look bad, especially when you consider that theoretically we should be able to achieve 150-180MB per second based on a three-disk stripe (3 x 60MB).
Using the same parallel test plan as the customer:
oracle@v4v-t5140a-gmp03~$more para.sql
set timing on;
select /*+ FULL(t) */ count(*) from contact_methods t;
select /*+ FULL(t) PARALLEL(t,2) */ count(*) from contact_methods t;
select /*+ FULL(t) PARALLEL(t,4) */ count(*) from contact_methods t;
select /*+ FULL(t) PARALLEL(t,64) */ count(*) from contact_methods t;
oracle@v4v-t5140a-gmp03~/oradata/SUN$ls -alh test01.dbf
-rw-r----- 1 oracle dba 9.7G Oct 24 08:25 test01.dbf
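I kept iostat -xnz 1 running in a second window throughout. The sqlplus invocation below is a sketch (any account that can read the table will do):

oracle@v4v-t5140a-gmp03~$sqlplus / as sysdba @para.sql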
I got these:
SQL> @para

  COUNT(*)
----------
  15700000

Elapsed: 00:00:47.85

  COUNT(*)
----------
  15700000

Elapsed: 00:00:32.53

  COUNT(*)
----------
  15700000

Elapsed: 00:00:34.68

  COUNT(*)
----------
  15700000

Elapsed: 00:00:42.17
Whilst the first full table scan is running I see the following in iostat -xnz 1:
                    extended device statistics
    r/s    w/s      kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  108.0    0.0   93122.4    0.0  0.0  0.4    0.1    4.0   1  35 c1t1d0
  151.0    3.0   95796.9   48.0  0.0  0.5    0.1    3.0   1  34 c1t1d0
  115.0    0.0  103367.6    0.0  0.0  0.3    0.1    2.6   1  28 c1t1d0
  116.0    0.0  102232.7    0.0  0.0  0.3    0.1    3.0   1  29 c1t1d0
  122.0    3.0  105326.4   48.0  0.0  0.3    0.1    2.5   1  29 c1t1d0
  116.0    0.0   96467.2    0.0  0.0  0.5    0.1    4.1   1  34 c1t1d0
and then
                    extended device statistics
    r/s    w/s      kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  193.0    0.0  159383.3    0.0  0.0  8.0    0.1   41.4   1 100 c1t1d0
  195.5    3.0  163681.0   48.1  0.0  8.1    0.1   40.8   1 100 c1t1d0
  220.1    0.0  188770.3    0.0  0.0  7.7    0.1   34.8   3 100 c1t1d0
  192.0    0.0  168156.9    0.0  0.0  7.2    0.1   37.8   1 100 c1t1d0
  191.0    3.0  162361.2   48.0  0.0  7.4    0.1   38.1   1 100 c1t1d0
  190.0    0.0  162776.0    0.0  0.0  7.3    0.1   38.7   1 100 c1t1d0
  192.0    0.0  162737.6    0.0  0.0  6.9    0.1   35.9   1 100 c1t1d0
  186.0    3.0  153754.2   48.0  0.0  8.4    0.1   44.4   1 100 c1t1d0
  191.0    1.0  160412.4    8.0  0.0  7.7    0.1   40.1   1 100 c1t1d0
when the parallel jobs are running.
This is because I changed db_file_multiblock_read_count to 128 (128 x 8K = 1MB), and in fact I also saw improvements using 192K (64K x 3) to match the stripe width. I also went with some recommendations from here, which helped, along with running the latest T5140 firmware and the latest KJP 137111-08 to avoid some known performance issues.
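For reference, the parameter change itself is a one-liner from sqlplus (scope=both assumes the instance runs off an spfile; with a plain init.ora you would edit the file and bounce the instance instead):

SQL> alter system set db_file_multiblock_read_count=128 scope=both;

System altered.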
It's amazing that tuning one little option can have such a dramatic effect on the results. It also shows that just because you don't get the results you're expecting, you shouldn't assume there is a fault with the hardware or software. For me personally, it's always good to understand the results and how you got there, although as with all performance-related issues you can sometimes get sidetracked by what's gone on before and end up going down the wrong path. To avoid that, make sure you chat with your colleagues when you feel like you're not getting anywhere; a fresh set of eyes can bring you back on the path and closer to resolution.