Bloom Filter for Non-Parallel Query on Non-Partitioned Table

I noticed an enhancement in the 12.1.0.2.0 beta: the bloom filter is enabled for non-parallel queries on non-partitioned tables. This enhancement may help improve hash join performance, and this post shows a small demo. The test was run on an Exadata X4-2 instance.

> SELECT * FROM PRODUCT_COMPONENT_VERSION;

PRODUCT          VERSION        STATUS
---------------------------------------- -------------------- --------------------
NLSRTL           12.1.0.2.0       Beta
Oracle Database 12c Enterprise Edition   12.1.0.2.0       64bit Beta
PL/SQL           12.1.0.2.0       Beta
TNS for Linux:         12.1.0.2.0       Beta

> select platform_id, platform_name from v$database;

PLATFORM_ID PLATFORM_NAME
----------- ----------------
   13 Linux x86 64-bit

Prepare two tables for the hash join: 10K rows in t1 and 10M rows in t2, with 100 rows per distinct value of t1.n1 and t2.n1.

drop table t1;
drop table t2;

-- t1 with 10,000 rows
create table t1
as
select
    rownum                      id,
    lpad(rownum, 10, '0')       v1,
    trunc((rownum - 1)/100)     n1,
    rpad(rownum, 100)           padding
from
    dual
connect by level <= 10000
;

-- t2 with 10,000,000 rows
create table t2
as
select
    rownum                      id,
    lpad(rownum, 10, '0')       v1,
    trunc((rownum - 1)/100)     n1,
    rpad(rownum, 100)           padding
from
    t1, t1
where
    rownum <= 1e7;

begin
    dbms_stats.gather_table_stats(
        ownname      => user,
        tabname      =>'T1',
        method_opt   => 'for all columns size 1'
    );

    dbms_stats.gather_table_stats(
        ownname      => user,
        tabname      =>'T2',
        method_opt   => 'for all columns size 1'
    );
end;
/

The hash join between t1 and t2 returns only 100 rows. If the bloom filter is created after t1 is scanned and applied while t2 is scanned, most of the rows of t2 are filtered out up front. The SQL Monitor reports show how the bloom filter takes effect. With the bloom filter enabled, the smart scan on t2 returns 15K rows and the cell offload efficiency is 100% (only 2MB returned from the Exadata cells). Without the bloom filter, the smart scan on t2 returns 10M rows and the cell offload efficiency is 84% (230MB returned from the cells).

alter system flush buffer_cache;
select
    /*+ monitor */
    count(t2.v1)
from
    t1, t2
where
    t1.id = t2.id
and t1.n1 = 0;

alter system flush buffer_cache;
select
    /*+ monitor no_px_join_filter(t2) */
    count(t2.v1)
from
    t1, t2
where
    t1.id = t2.id
and t1.n1 = 0;
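
The SQL Monitor reports mentioned above can be pulled after each run with DBMS_SQLTUNE; a minimal sketch (the sql_id below is a placeholder for the id of the monitored query):

-- substitute the sql_id of the monitored statement
select dbms_sqltune.report_sql_monitor(
           sql_id       => '&sql_id',
           type         => 'TEXT',
           report_level => 'ALL')
from dual;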

Update on Mar 25th

Serial generation of bloom filters on Exadata is enabled in 12.1.0.2 and backported to 11.2.0.4 under bug 16710321. The hidden parameter _bloom_serial_filter controls this feature.

> @pd _bloom_serial_filter

NAME                  VALUE     DESCRIPTION
--------------------- --------- -------------------------------------------------------
_bloom_serial_filter  ON        enable serial bloom filter on exadata
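
If the pd script is not at hand, roughly the same information can be read from the underlying x$ views when connected as SYS, for example:

select x.ksppinm  name,
       y.ksppstvl value,
       x.ksppdesc description
from   x$ksppi x, x$ksppcv y
where  x.indx = y.indx
and    x.ksppinm = '_bloom_serial_filter';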

Avoid Excess Data Volume in the Execution Path

For a complex correlated filter combined with a function, the optimizer may not be able to transform the correlated filter into a join, so the filter can only be applied at the last step of the plan, which may lead to a performance issue. Here is an example.

Prepare three tables with 10K rows each; the database version is 11.2.0.3.0.

drop table t1;
drop table t2;
drop table t3;

create table t1
as
select
    rownum                      id,
    lpad(rownum,10,'0')         v1,
    mod(rownum, 100)            n1,
    rpad(rownum,100)            padding
from dual
connect by level <= 1e4
;

create table t2 as select * from t1;
create table t3 as select * from t1;
 
begin
    dbms_stats.gather_table_stats(
        ownname      => user,
        tabname      =>'T1',
        method_opt   => 'for all columns size 1'
    );

    dbms_stats.gather_table_stats(
        ownname      => user,
        tabname      =>'T2',
        method_opt   => 'for all columns size 1'
    );

    dbms_stats.gather_table_stats(
        ownname      => user,
        tabname      =>'T3',
        method_opt   => 'for all columns size 1'
    );
end;
/

create index t1_pk on t1(id) nologging;
create index t2_pk on t2(id) nologging;
create index t3_pk on t3(id) nologging;

The example SQL is below and takes 21s. The join condition between t1 and t2 is a correlated filter, a decode function combined with a subquery: filter(T2.N1=DECODE(T1.V1,'0000000001', (SELECT T.N1 FROM T1 T WHERE T.ID=:B1),T1.N1)). The optimizer can only apply this filter at the last step of the execution plan, which causes a CARTESIAN join between t1 and t3; as a result, 100M rows are produced. But only 9,900 rows survive the filter at plan line 2, which means that under this plan Oracle has to process (100M - 9,900) unnecessary rows. This massive volume of unnecessary data flowing through the execution path is why 21.9s of CPU time is wasted. From the SQL Monitor report we know that the complex filter is the most effective predicate at reducing the data, so it needs to be applied first to minimize the data volume. Since the optimizer is unable to understand the logic of the decode function, we have to help it by rewriting the SQL. This is an example of how SQL tuning often requires rewriting the SQL itself.

select /*+ monitor original*/
    sum(t1.n1)
from
    t1, t2, t3
where
    decode(t1.v1, '0000000001', (select t.n1 from t1 t where t.id = t1.id), t1.n1) = t2.n1 
and t2.id = t3.id;

SUM(T1.N1)
----------
    495000

Elapsed: 00:00:21.90

Plan hash value: 3560224838

-------------------------------------------------------------------------------------------
| Id  | Operation                         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |       |     1 |    29 |    12M  (1)| 00:03:58 |
|   1 |  SORT AGGREGATE                   |       |     1 |    29 |            |          |
|*  2 |   FILTER                          |       |       |       |            |          |
|*  3 |    HASH JOIN                      |       |   100M|  2765M| 57733   (2)| 00:00:02 |
|   4 |     TABLE ACCESS STORAGE FULL     | T2    | 10000 | 70000 |    52   (0)| 00:00:01 |
|   5 |     MERGE JOIN CARTESIAN          |       |   100M|  2098M| 57360   (1)| 00:00:02 |
|   6 |      TABLE ACCESS STORAGE FULL    | T1    | 10000 |   175K|    52   (0)| 00:00:01 |
|   7 |      BUFFER SORT                  |       | 10000 | 40000 | 57308   (1)| 00:00:02 |
|   8 |       INDEX STORAGE FAST FULL SCAN| T3_PK | 10000 | 40000 |     6   (0)| 00:00:01 |
|   9 |    TABLE ACCESS BY INDEX ROWID    | T1    |     1 |     7 |     2   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN              | T1_PK |     1 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("T2"."N1"=DECODE("T1"."V1",'0000000001', (SELECT "T"."N1" FROM "T1"
              "T" WHERE "T"."ID"=:B1),"T1"."N1"))
   3 - access("T2"."ID"="T3"."ID")
  10 - access("T"."ID"=:B1)

Replacing the decode function with a UNION ALL, the SQL completes in about 0.02s. Although the buffer gets increase from 381 to 759, most of the excess data is eliminated: at most 10K rows exist at any point in the execution path.

select sum(n1)
from
(
  select /*+ monitor re-write*/
	   sum(t1.n1) n1
  from
	   t1, t2, t3, t1 t
  where
	   t1.v1 = '0000000001'
  and t1.id = t.id
  and t.n1  = t2.n1
  and t2.id = t3.id
  union all
  select
	   sum(t1.n1) n1
  from
	   t1, t2, t3
  where
	   t1.v1 <> '0000000001'
  and t1.id = t2.n1
  and t2.id = t3.id
);

   SUM(N1)
----------
    495000

Elapsed: 00:00:00.02

Plan hash value: 840496

---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |     1 |    13 |   226   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE                     |       |     1 |    13 |            |          |
|   2 |   VIEW                              |       |     2 |    26 |   226   (1)| 00:00:01 |
|   3 |    UNION-ALL                        |       |       |       |            |          |
|   4 |     SORT AGGREGATE                  |       |     1 |    36 |            |          |
|*  5 |      HASH JOIN                      |       |   100 |  3600 |   114   (1)| 00:00:01 |
|   6 |       TABLE ACCESS STORAGE FULL     | T2    | 10000 | 70000 |    52   (0)| 00:00:01 |
|   7 |       MERGE JOIN CARTESIAN          |       | 10000 |   283K|    61   (0)| 00:00:01 |
|   8 |        NESTED LOOPS                 |       |       |       |            |          |
|   9 |         NESTED LOOPS                |       |     1 |    25 |    54   (0)| 00:00:01 |
|* 10 |          TABLE ACCESS STORAGE FULL  | T1    |     1 |    18 |    52   (0)| 00:00:01 |
|* 11 |          INDEX RANGE SCAN           | T1_PK |     1 |       |     1   (0)| 00:00:01 |
|  12 |         TABLE ACCESS BY INDEX ROWID | T1    |     1 |     7 |     2   (0)| 00:00:01 |
|  13 |        BUFFER SORT                  |       | 10000 | 40000 |    59   (0)| 00:00:01 |
|  14 |         INDEX STORAGE FAST FULL SCAN| T3_PK | 10000 | 40000 |     7   (0)| 00:00:01 |
|  15 |     SORT AGGREGATE                  |       |     1 |    29 |            |          |
|* 16 |      HASH JOIN                      |       | 10000 |   283K|   112   (1)| 00:00:01 |
|  17 |       INDEX STORAGE FAST FULL SCAN  | T3_PK | 10000 | 40000 |     7   (0)| 00:00:01 |
|* 18 |       HASH JOIN                     |       | 10000 |   244K|   105   (1)| 00:00:01 |
|  19 |        TABLE ACCESS STORAGE FULL    | T2    | 10000 | 70000 |    52   (0)| 00:00:01 |
|* 20 |        TABLE ACCESS STORAGE FULL    | T1    |  9999 |   175K|    52   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("T"."N1"="T2"."N1" AND "T2"."ID"="T3"."ID")
  10 - storage("T1"."V1"='0000000001')
       filter("T1"."V1"='0000000001')
  11 - access("T1"."ID"="T"."ID")
  16 - access("T2"."ID"="T3"."ID")
  18 - access("T1"."ID"="T2"."N1")
  20 - storage("T1"."V1"<>'0000000001')
       filter("T1"."V1"<>'0000000001')

Prevent Filter Pushdown

Normally we want to apply filters as early as possible, so that most rows are filtered out up front and the data volume flowing through the execution path is reduced. But there are also cases where we want to postpone a filter to the last steps of the execution plan. Here is an example on 11.2.0.3.0.

drop table t1;
create table t1
as
with generator as (
    select  --+ materialize
        rownum id
    from dual
    connect by
        level <= 1e3
)
select
    rownum                      id,
    lpad(rownum,10,'0')         v1,
    rownum                      n1,
    rpad(rownum,100)            padding
from
    generator   v1,
    generator   v2
where
    rownum <= 1e5
;

begin
    dbms_stats.gather_table_stats(
        ownname      => user,
        tabname      =>'T1',
        method_opt   => 'for all columns size 1'
    );
end;
/

create index t_pk on t1(id) nologging;

create or replace function f(p_id number) return number
as
l_n1 number;
begin
  select n1 into l_n1 from t1 where id = trunc(dbms_random.value(0, p_id))+1;
  return l_n1;
end;
/

I create a table t1 with 100K rows, gather statistics and create an index on t1.id, then create a function to be called from the SQL. The following query takes 17s and 300K buffer gets: filter("F"("V1"."ID")>=0) is applied for every row of the full scan of t1 (alias v1), which we can confirm from the 100,079 recursive calls. The result set is actually only 999 rows, which means there are 99,001 (100,000 - 999) unnecessary evaluations of F. We can pull the filter f(v1.id) out to the last step of the execution plan to avoid the redundant function evaluations.

select
    count(v1.n1)
from
    t1 v1, t1 v2
where
    v1.n1 = v2.n1
and f(v1.id) > 0
and v2.id < 1e3;

COUNT(V1.N1)
------------
         999

Elapsed: 00:00:16.74

Execution Plan
----------------------------------------------------------
Plan hash value: 3220252481

--------------------------------------------------------------------------------------
| Id  | Operation                     | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |      |     1 |    20 |   526   (2)| 00:00:07 |
|   1 |  SORT AGGREGATE               |      |     1 |    20 |            |          |
|*  2 |   HASH JOIN                   |      |   999 | 19980 |   526   (2)| 00:00:07 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1   |   999 |  9990 |    22   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | T_PK |   999 |       |     4   (0)| 00:00:01 |
|*  5 |    TABLE ACCESS FULL          | T1   |  5000 | 50000 |   504   (2)| 00:00:07 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("V1"."N1"="V2"."N1")
   4 - access("V2"."ID"<1e3)
   5 - filter("F"("V1"."ID")>=0)


Statistics
----------------------------------------------------------
     100079  recursive calls
     302050  consistent gets

Rewriting the SQL with an inline view, so that the filter on the function f(id) > 0 is pulled out to the last step of the execution path, provides a big improvement: the elapsed time drops to 0.18s and the buffer gets to 4,806. There are two tricks to prevent the filter from being pushed into the inline view: 1) add a rownum column to the inline view; 2) set "_optimizer_filter_pushdown" to false. The optimizer parameter _optimizer_filter_pushdown was introduced in 11.2.0.3.0.

-- inline view with rownum column
select count(n1)
from
    (
    select
        v1.id, v1.n1, rownum rn
    from
        t1 v1, t1 v2
    where
        v1.n1 = v2.n1
    and v2.id < 1e3
    )
where
    f(id) > 0;

 COUNT(N1)
----------
       999

Elapsed: 00:00:00.18

Execution Plan
----------------------------------------------------------
Plan hash value: 1568965735

----------------------------------------------------------------------------------------
| Id  | Operation                       | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |      |     1 |    26 |   518   (1)| 00:00:07 |
|   1 |  SORT AGGREGATE                 |      |     1 |    26 |            |          |
|*  2 |   VIEW                          |      |   999 | 25974 |   518   (1)| 00:00:07 |
|   3 |    COUNT                        |      |       |       |            |          |
|*  4 |     HASH JOIN                   |      |   999 | 19980 |   518   (1)| 00:00:07 |
|   5 |      TABLE ACCESS BY INDEX ROWID| T1   |   999 |  9990 |    22   (0)| 00:00:01 |
|*  6 |       INDEX RANGE SCAN          | T_PK |   999 |       |     4   (0)| 00:00:01 |
|   7 |      TABLE ACCESS FULL          | T1   |   100K|   976K|   496   (1)| 00:00:06 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("F"("ID")>0)
   4 - access("V1"."N1"="V2"."N1")
   6 - access("V2"."ID"<1e3)


Statistics
----------------------------------------------------------
       1000  recursive calls
       4806  consistent gets


-- inline view with optimizer parameter _optimizer_filter_pushdown set to false
select /*+ NO_MERGE(v) OPT_PARAM('_OPTIMIZER_FILTER_PUSHDOWN' 'FALSE') */count(n1)
from
    (
    select
        v1.id, v1.n1, rownum rn
    from
        t1 v1, t1 v2
    where
        v1.n1 = v2.n1
    and v2.id < 1e3
    )
where
    f(id) > 0;

 COUNT(N1)
----------
       999

Elapsed: 00:00:00.24

Execution Plan
----------------------------------------------------------
Plan hash value: 1568965735

----------------------------------------------------------------------------------------
| Id  | Operation                       | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |      |     1 |    26 |   518   (1)| 00:00:07 |
|   1 |  SORT AGGREGATE                 |      |     1 |    26 |            |          |
|*  2 |   VIEW                          |      |   999 | 25974 |   518   (1)| 00:00:07 |
|   3 |    COUNT                        |      |       |       |            |          |
|*  4 |     HASH JOIN                   |      |   999 | 19980 |   518   (1)| 00:00:07 |
|   5 |      TABLE ACCESS BY INDEX ROWID| T1   |   999 |  9990 |    22   (0)| 00:00:01 |
|*  6 |       INDEX RANGE SCAN          | T_PK |   999 |       |     4   (0)| 00:00:01 |
|   7 |      TABLE ACCESS FULL          | T1   |   100K|   976K|   496   (1)| 00:00:06 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("F"("ID")>0)
   4 - access("V1"."N1"="V2"."N1")
   6 - access("V2"."ID"<1e3)


Statistics
----------------------------------------------------------
       1000  recursive calls
       4808  consistent gets

Oracle 12c new feature: Temporary undo

This is a quick post to demonstrate a small new feature in 12c: temporary undo. Temporary undo is introduced in 12c and is well documented.

By default, undo records for temporary tables are stored in the undo tablespace and are logged in the redo, which is the same way undo is managed for persistent tables. However, you can use the TEMP_UNDO_ENABLED initialization parameter to separate undo for temporary tables from undo for persistent tables. When this parameter is set to TRUE, the undo for temporary tables is called temporary undo.

Temporary undo is controlled by the parameter TEMP_UNDO_ENABLED. With TEMP_UNDO_ENABLED set to TRUE, significantly less redo is generated: only the very first DML (insert/update/delete) on the global temporary table generates a small amount of redo, and the following DML on the temporary table within the same transaction generates no redo at all. This can be a great benefit for ETL processes that use a temporary table as a staging table, loading a large volume of data into it for heavy processing before moving the data into the final fact table. With less redo, both the SQL performance on the global temporary table and the overall system load may improve considerably.

Performing the same insert/update/delete on the same global temporary table (see the script below), the session statistics comparison shows: 1) with TEMP_UNDO_ENABLED disabled, the total redo size for the insert/update/delete is 4MB, while with it enabled only 280 bytes of redo are generated for the first insert statement and the following update and delete produce zero redo; 2) with TEMP_UNDO_ENABLED disabled, the temporary space requirement is 2MB, while with it enabled it is 7MB.

Here is the script for the demo:

1. Prepare the table.

drop table t1 purge;
create global temporary table t1
(
  id number,
  padding varchar2(100)
)
on commit preserve rows;

2. Run the insert/update/delete statements with TEMP_UNDO_ENABLED disabled and enabled.

alter session set temp_undo_enabled = FALSE;
--alter session set temp_undo_enabled = TRUE;
show parameter temp_undo_enabled;

exec snap_stats.start_snap;
insert into t1 select rownum, lpad(rownum, 100, ' ') from dual connect by level <= &row_count;  -- the row-count limit was garbled in the original post
exec snap_stats.end_snap;
select count(*) from t1;

exec snap_stats.start_snap;
update t1 set padding = padding;
exec snap_stats.end_snap;

exec snap_stats.start_snap;
delete from t1;
exec snap_stats.end_snap;
commit;
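
The snap_stats package used above is assumed to be a session-statistics snapshot helper (such as Jonathan Lewis's snapshot scripts). If no such helper is available, a minimal check of the redo generated so far in the session can be made directly against v$mystat, for example:

select sn.name, ms.value
from   v$mystat ms, v$statname sn
where  sn.statistic# = ms.statistic#
and    sn.name in ('redo size', 'redo entries');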

P.S. The TEMP_UNDO_ENABLED can’t be changed dynamically within a session: “When a session uses temporary objects for the first time, the current value of the TEMP_UNDO_ENABLED initialization parameter is set for the rest of the session.”

How to get extreme Exadata IOPS with SLOB

From the Exadata X3-2 datasheet, the maximum flash IOPS is 1.5M, which is pretty impressive.

With SLOB and help from Kevin Closson, I can get around 1.4M maximum IOPS on an X3-2 full rack. This post shows how to use Kevin Closson's SLOB to push the IOPS on an X3-2 over 1M. To push IOPS as hard as possible, I use all 8 db nodes of the X3-2 to drive random reads concurrently against the 14 cell nodes. The procedure is to set up 256 SLOB schemas, keep the SLOB tables in the smart flash cache and warm up the cache with smart scans, and prepare SQL*Net connections to the 8 db nodes. The buffer cache has to be small enough that logical reads cannot be satisfied from the buffer cache and must issue physical read requests to the cells. Here are the preparation steps and the final observations.

1. SLOB 2

Get the SLOB 2 source (2013.05.05.slob2.tar) and patch it with SLOB2_patch_2014.01.13.tar.

SLOB 2 — A Significant Update. Links Are Here.
http://kevinclosson.wordpress.com/2013/05/02/slob-2-a-significant-update-links-are-here/

Patch README :
https://my.syncplicity.com/share/ien1nsqok0equbk/README

Patch Source:
https://my.syncplicity.com/share/vba5x0tgn05a3bk/SLOB2_patch_2014.01.13.tar

2. Set up 256 SLOB schemas.

You need to hack setup.sh a little to set up more than 128 schemas.

NOTIFY: Load Parameters (slob.conf):

LOAD_PARALLEL_DEGREE == 32

...

NOTIFY: Seed table loading procedure has exited.
NOTIFY: Setting up user   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
NOTIFY: Waiting for background processes - Sun Jan 26 04:58:02 PST 2014
NOTIFY: Setting up user   33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64
NOTIFY: Waiting for background processes - Sun Jan 26 05:03:13 PST 2014
NOTIFY: Setting up user   65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
NOTIFY: Waiting for background processes - Sun Jan 26 05:08:13 PST 2014
NOTIFY: Setting up user   97  98  99  100  101  102  103  104  105  106  107  108  109  110  111  112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128
NOTIFY: Waiting for background processes - Sun Jan 26 05:13:06 PST 2014
NOTIFY: Setting up user   129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160
NOTIFY: Waiting for background processes - Sun Jan 26 05:17:59 PST 2014
NOTIFY: Setting up user   161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192
NOTIFY: Waiting for background processes - Sun Jan 26 05:23:35 PST 2014
NOTIFY: Setting up user   193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  224
NOTIFY: Waiting for background processes - Sun Jan 26 05:29:02 PST 2014
NOTIFY: Setting up user   225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255  256
NOTIFY: Waiting for background processes - Sun Jan 26 05:34:54 PST 2014

Connect to the SLOB database as user1/user1 via sqlplus and execute @./procedure.

3. Prepare 8 SLOB services and add the 8 entries to tnsnames.ora

srvctl add service -d exa -s slob1 -r exa1 -a exa2
srvctl add service -d exa -s slob2 -r exa2 -a exa1
srvctl add service -d exa -s slob3 -r exa3 -a exa1
srvctl add service -d exa -s slob4 -r exa4 -a exa1
srvctl add service -d exa -s slob5 -r exa5 -a exa1
srvctl add service -d exa -s slob6 -r exa6 -a exa1
srvctl add service -d exa -s slob7 -r exa7 -a exa1
srvctl add service -d exa -s slob8 -r exa8 -a exa1

srvctl start service -d exa -s slob1
srvctl start service -d exa -s slob2
srvctl start service -d exa -s slob3
srvctl start service -d exa -s slob4
srvctl start service -d exa -s slob5
srvctl start service -d exa -s slob6
srvctl start service -d exa -s slob7
srvctl start service -d exa -s slob8

slob1 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = demo10db01)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = slob1)))
slob2 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = demo10db02)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = slob2)))
slob3 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = demo10db03)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = slob3)))
slob4 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = demo10db04)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = slob4)))
slob5 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = demo10db05)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = slob5)))
slob6 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = demo10db06)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = slob6)))
slob7 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = demo10db07)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = slob7)))
slob8 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = demo10db08)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = slob8)))

4. slob.conf

UPDATE_PCT=0
RUN_TIME=300
WORK_LOOP=0
SCALE=1000000
WORK_UNIT=256
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=32
SHARED_DATA_MODULUS=0

# Settings for SQL*Net connectivity:
SQLNET_SERVICE_BASE=slob
SQLNET_SERVICE_MAX=8

5. init parameters

Huge pages are enabled, except that one of the eight nodes does not have enough huge pages to hold the whole SGA. 128M seems to be the smallest buffer cache size allowed with CPU_COUNT=32.

*.use_large_pages='TRUE'
*.db_cache_size  = 128m

6. Keep the SLOB tables in the smart flash cache and warm them up with 3 smart scans.

declare
n number;
begin
  for i in 1 .. 256
  loop
    execute immediate 'alter table user' || i || '.cf1 storage (CELL_FLASH_CACHE keep)';
    execute immediate 'select /*+ full(cf1)*/ count(*) from user' || i||'.cf1' into n;
    execute immediate 'select /*+ full(cf1)*/ count(*) from user' || i||'.cf1' into n;
    execute immediate 'select /*+ full(cf1)*/ count(*) from user' || i||'.cf1' into n;
  end loop;
end;
/
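
As a sanity check (a sketch assuming DBA privileges, and that DBA_TABLES exposes the CELL_FLASH_CACHE attribute on this version), the KEEP setting can be verified with:

select owner, table_name, cell_flash_cache
from   dba_tables
where  table_name = 'CF1'
and    owner like 'USER%';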

7. Observations

I tried 96, 128, 160, 192, 224 and 256 SLOB sessions. With 160 sessions I get the maximum IOPS, on average 1.36M. From the screen capture, the peak IOPS is over 1.4M, which is pretty close to the 1.5M stated in the X3-2 datasheet, and each of the 14 cells provides about 100K IOPS evenly, which is good. From the AWR report of the 160-session run, 'cell list of blocks physical read' averages 3ms and 'cell single block physical read' 1.25ms. CPU utilization is up to 90% on the cells while it is only 37% on the db nodes, so in this extreme case the cell processing power is not sufficient and becomes the first bottleneck to scaling.

Top Timed Events                            DB/Inst: EXA/exa1  Snaps: 994-995

                               Wait                                   Event                      Wait Time                    Summary Avg Wait Time (ms)
       ---------------------------------------------------   ----------------------   ------------------------------   ----------------------------------------
  I#   Class      Event                                             Waits %Timeouts        Total(s) Avg(ms) %DB time        Avg      Min      Max  Std Dev  Cnt
----   ---------- ----------------------------------------   ------------ ---------   ------------- ------- --------   -------- -------- -------- -------- ----
   *   User I/O   cell list of blocks physical read            11,416,579       0.0       35,283.85     3.1    43.04       3.09     2.98     3.12     0.05    8
       User I/O   cell single block physical read              25,295,624       0.0       31,685.75     1.3    38.65       1.25     1.25     1.26     0.00    8
                  DB CPU                                              N/A       N/A       24,707.27     N/A    30.14                                          8
       Cluster    gc cr grant 2-way                               758,555       0.0          154.92     0.2     0.19       0.21     0.20     0.26     0.02    8
       System I/O log file parallel write                           9,559       0.0           46.95     4.9     0.06       4.93     4.46     5.22     0.30    8
       User I/O   read by other session                            11,890       0.0           12.14     1.0     0.01       1.02     0.99     1.07     0.03    8
       System I/O control file sequential read                      8,279       0.0            9.62     1.2     0.01       1.16     1.13     1.18     0.02    8
       Cluster    gc buffer busy acquire                            1,638       0.0            8.86     5.4     0.01       6.77     3.35    11.89     3.36    8
       Other      CSS operation: data query                         1,600       0.0            7.74     4.8     0.01       4.84     3.76     6.00     0.81    8
       Concurrenc latch: cache buffers chains                      49,387       0.0            4.55     0.1     0.01       0.09     0.09     0.10     0.00    8

System Statistics - Per Second               DB/Inst: EXA/exa1  Snaps: 994-995

             Logical     Physical     Physical         Redo        Block         User
  I#         Reads/s      Reads/s     Writes/s   Size (k)/s    Changes/s      Calls/s      Execs/s     Parses/s   Logons/s       Txns/s
---- --------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ---------- ------------
   1      170,727.68    170,406.3          2.7      5,017.9         17.2          1.6        664.6          3.8       0.14          0.0
   2      160,987.67    162,898.8          2.5      4,833.9         22.4          1.4        630.4          5.9       0.13          0.1
   3      163,967.64    163,394.0          2.1      4,825.7         15.0          1.4        638.2          3.6       0.14          0.0
   4      174,957.44    174,551.8          3.1      5,130.8         12.7          1.3        679.2          2.1       0.13          0.0
   5      169,568.64    170,028.9          1.8      5,012.8         10.4          1.4        658.4          2.1       0.14          0.0
   6      172,990.65    172,771.3          3.4      5,082.2         13.6          1.4        671.8          2.3       0.14          0.0
   7      172,572.60    172,256.4          1.8      5,064.6         14.0          1.4        671.6          3.5       0.14          0.0
   8      171,493.42    170,899.7          2.6      5,016.6          9.8          1.4        665.8          2.1       0.14          0.0
 ~~~ ~~~~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~~~
 Sum    1,357,265.74  1,357,207.1         20.0     39,984.6        115.1         11.5      5,280.0         25.3       1.10          0.3
 Avg      169,658.22    169,650.9          2.5      4,998.1         14.4          1.4        660.0          3.2       0.14          0.0
 Std        4,777.34      4,268.8          0.6        111.3          4.0          0.1         17.1          1.3       0.00          0.0
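
To watch the IOPS live during a run, instead of waiting for AWR, a rough check can be made against gv$sysmetric; the metric names below are from memory and worth double-checking on your version:

select inst_id, metric_name, round(value) value
from   gv$sysmetric
where  metric_name in ('Physical Read Total IO Requests Per Sec',
                       'Physical Write Total IO Requests Per Sec')
order by inst_id, metric_name;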

Update on 2014-Feb-08:

With WORK_UNIT=256 and only 160 SLOB sessions, I can push the IOPS over 1.3M. Although the cells are out of CPU (90%), CPU is fairly idle on the db nodes (37%). Chasing 1.5M IOPS, I set up 768 SLOB schemas and tried smaller WORK_UNIT values with more SLOB sessions. With WORK_UNIT=8 or WORK_UNIT=16, the db nodes run out of CPU much faster than the storage cells. With WORK_UNIT=64, the CPU on the db nodes and storage cells is relatively balanced. The maximum IOPS observed with WORK_UNIT=64 is still about 1.4M (I still cannot get closer to 1.5M). Here are the graphs with WORK_UNIT=64, showing how IOPS, db node CPU utilization, average active sessions and I/O latency scale with the number of SLOB sessions.

--slob.conf
UPDATE_PCT=0
RUN_TIME=300
WORK_LOOP=0
SCALE=500000
WORK_UNIT=64
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=32
SHARED_DATA_MODULUS=0

SLOB on Exadata X3-2: Random Read IOPS

I tried to benchmark the random read IOPS capacity of one Exadata X3-2 db node with SLOB; here are some figures.

1. Parameters

Disable prefetching and avoid multi-block reads by adding the 3 init parameters below.

_db_block_prefetch_limit      0
_db_block_prefetch_quota      0
_db_file_noncontig_mblock_rea 0

The block size is 8K and db_cache_size is set to 128M (this seems to be the lowest db_cache_size I can set in 11.2.0.3).

cpu_count                     32
db_block_size                 8192
db_cache_size                 134217728
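
For reference, a sketch of one way these could be set (assuming an spfile and a subsequent instance restart; underscore parameters need double quotes):

alter system set "_db_block_prefetch_limit" = 0 scope=spfile;
alter system set "_db_block_prefetch_quota" = 0 scope=spfile;
alter system set "_db_file_noncontig_mblock_read_count" = 0 scope=spfile;
alter system set db_cache_size = 128m scope=spfile;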

2. Segment size and SQL

Prepare 128 users, each owning a table CF1 of around 8GB (with SCALE=1000000 in slob.conf).

setup.sh IOPS 128

The 8K-block random read workload is driven by the query below.

SELECT COUNT(C2) FROM CF1 WHERE CUSTID > ( :B1 - :B2 ) AND (CUSTID < :B1 )

Observations

The maximum IOPS is around 140K, reached with 96 concurrent processes. The average wait time of 'db file sequential read' increases once the number of concurrent processes goes beyond 64, and CPU utilization drops when the process count goes beyond 96; it seems the system is overloaded and performance degrades.

Update on 2013-Feb-09:

After the two changes below, I can get 250K IOPS with 64 sessions on a single db node (almost twice the IOPS obtained before the changes), with CPU at 90%.

Change 1. Comment out the parameters below to re-enable db block prefetch.

--*._db_block_prefetch_limit = 0
--*._db_block_prefetch_quota = 0
--*._db_file_noncontig_mblock_read_count = 0

Change 2. Update slob.conf with WORK_UNIT=16.

--slob.conf
UPDATE_PCT=0
RUN_TIME=300
WORK_LOOP=0
SCALE=1000000
WORK_UNIT=16
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=8
SHARED_DATA_MODULUS=0

The AWR numbers for single block reads and physical reads, with 32/64/96/128 SLOB sessions:

[oracle@demo01db01 awr]$ grep "cell single block physical rea" * | sort -t_ -k1 -n
32_awr.txt:cell single block physical rea    9,257,339       3,778      0   39.3 User I/O
64_awr.txt:cell single block physical rea   15,161,265       8,301      1   43.2 User I/O
96_awr.txt:cell single block physical rea   19,802,040      13,751      1   47.6 User I/O
128_awr.txt:cell single block physical rea   17,269,328      15,964      1   41.4 User I/O
[oracle@demo01db01 awr]$ grep "Physical reads:" * | sort -t_ -k1 -n
32_awr.txt:  Physical reads:          197,068.2       59,670,688.0
64_awr.txt:  Physical reads:          249,996.2        3,594,314.9
96_awr.txt:  Physical reads:          253,909.9        3,669,155.5
128_awr.txt: Physical reads:          255,982.8        3,733,850.4

[Translated] Oracle 11.2 and the direct path read event

Translated from Frits Hoogland's blog; original post: http://fritshoogland.wordpress.com/2013/01/04/oracle-11-2-and-the-direct-path-read-event/

In my previous article I mentioned that Oracle uses a new operating-system-level implementation for full segment scans. A full segment scan can be recognized by 'TABLE ACCESS FULL', 'FAST FULL INDEX SCAN' and 'BITMAP FULL SCAN' in the execution plan; the segment can be a partition, a table or an index.

You can read in another of my articles how Oracle chooses between multiblock reads and direct path reads, and what the difference is. Oddly, Oracle has published almost no documentation about 'direct path reads'.

This article investigates how 'direct path reads' are implemented in Oracle 11.2.0.3. The test environment is Linux 6.3 x86-64, kernel 2.6.39-300.17.3.el6uek.x86_64, database 11.2.0.3 64-bit with no PSU/CPU applied. The database uses two ASM disks, with the clusterware in standalone (single-machine) mode. The virtual machine runs on VMware Fusion 5.0.2 on a MacBook with OS X 10.8.2, with the disks on SSD.

I traced a simple query, 'select count(*) from t2', with gdb. Table t2 has no indexes or constraints and is large enough that the database engine chooses 'direct path reads' for the full segment scan. To improve performance, Oracle has introduced several undocumented mechanisms, especially for 'direct path reads', which let the foreground process submit multiple I/O requests in one call and reap them later, increasing the number of concurrent I/O requests.
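
The original post does not show the test table; a minimal sketch of a setup that would typically qualify (any plain heap table well beyond the small-table threshold relative to the buffer cache, so that a serial full scan picks direct path reads in 11.2) could look like this:

-- hypothetical test table, large enough to trigger serial direct path reads
create table t2 as
select rownum id, rpad('x', 200, 'x') padding
from dual
connect by level <= 1e6;

exec dbms_stats.gather_table_stats(user, 'T2');

select count(*) from t2;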

1. Getting a clear view of the I/O calls

First, as in the previous article, start a database foreground process with sqlplus and attach to it as root with 'gdb -p PID'.

Now set breakpoints on the I/O-related operating system calls io_submit() and io_getevents(), and let the process continue with the 'c' command:

(gdb) rbreak ^io_.*
Breakpoint 1 at 0xa08c20
 io_prep_pwritev;
Note: breakpoint 1 also set at pc 0xa08c20.
...
Breakpoint 45 at 0x7f2e71b1dc0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 1-45, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Execute a full-segment-scan SQL in sqlplus and watch the output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()

First of all, why are there two io_submit entries (io_submit@plt() and io_submit() from /lib64/libaio.so.1) but only one io_getevents (io_getevents@plt())?

The plt needs a little explanation: plt is short for procedure linkage table. The plt contains stubs for the functions that an executable uses from dynamically linked libraries, and this applies to the Oracle executable too. If you look at the address of the @plt call, you can see it lies inside the Oracle executable (see /proc/<pid>/maps), while 'io_submit () from /lib64/libaio.so.1' is the function inside the dynamically linked library libaio.so.1 (also visible in /proc/<pid>/maps).

For io_getevents we only see io_getevents@plt, which could mean the executable fakes the system call, or that we are missing something. This calls for a closer look at the symbols of the libaio library itself, using 'nm -D':

# nm -D /lib64/libaio.so.1
0000000000000000 A LIBAIO_0.1
0000000000000000 A LIBAIO_0.4
0000003f38200710 T io_cancel
0000003f38200670 T io_cancel
0000003f38200690 T io_destroy
0000003f382006a0 T io_getevents
0000003f38200620 T io_getevents
0000003f38200570 T io_queue_init
0000003f38200590 T io_queue_release
0000003f382005b0 T io_queue_run
0000003f382005a0 T io_queue_wait
0000003f382006e0 T io_queue_wait
0000003f38200680 T io_setup
0000003f38200660 T io_submit

Aha! Interestingly, there are two io_getevents entries!

Let's see how the Oracle executable uses the io_getevents call:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) rbreak ^io_get.*
Breakpoint 46 at 0x3f382006a0
 io_getevents;
...
Breakpoint 53 at 0xa09030
 io_getevents@plt;
And list the breakpoints:

(gdb) info break
Num     Type           Disp Enb Address            What
46      breakpoint     keep y   0x0000003f382006a0 io_getevents
47      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
48      breakpoint     keep y   0x0000003f382006a0 io_getevents
49      breakpoint     keep y   0x0000000000a09030 io_getevents@plt
50      breakpoint     keep y   0x0000003f382006a0 io_getevents
51      breakpoint     keep y   0x0000003f382006a0 io_getevents
52      breakpoint     keep y   0x0000003f382006a0 io_getevents
53      breakpoint     keep y   0x0000000000a09030 io_getevents@plt

So the io_getevents breakpoint is set at address 0x0000003f382006a0! If we look back at the contents of libaio, io_getevents actually exists at two addresses: 0x0000003f382006a0 and 0x0000003f38200620.

Let's also set a breakpoint on the second io_getevents address in gdb:

(gdb) break *0x0000003f38200620
Breakpoint 54 at 0x3f38200620
(gdb) commands
Type commands for breakpoint(s) 54, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now run the full-segment-scan SQL in sqlplus again; this is the gdb output:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1

Good, now we see both io_getevents calls!

With the breakpoints set properly in gdb, we can observe the operating system calls used for asynchronous I/O, io_submit and io_getevents. Oracle uses other system calls as well, but here we focus only on io_submit and io_getevents.

2. Adding the wait events: how the direct path read I/O calls are instrumented

Now I add two more breakpoints to see how the wait interface instruments the I/O, on kslwtbctx (begin wait) and kslwtectx (end wait):

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 55 at 0x8f9a652
 kslwtbctx;
Breakpoint 56 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 55-56, one per line.
End with a line saying just "end".
>silent
>f
>c
>end

If we run the full segment scan again, we can see that these I/O-related calls (io_submit -> io_getevents) are not instrumented by a wait event, because kslwtbctx and kslwtectx do not appear around them. Only after the I/O calls do we see kslwtbctx and kslwtectx, which is the expected, normal behaviour here: the virtual machine's disks are on SSD, and those disks (really just a file on the VMware Fusion host) are most likely already in the host OS X cache, so the I/Os complete so quickly that Oracle does not time them.
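
A sketch of a cross-check from the SQL side: query v$session_event for the test session and see whether any direct path read waits were actually timed:

select event, total_waits, time_waited_micro
from   v$session_event
where  sid = sys_context('userenv', 'sid')
and    event like 'direct path read%';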

As described in the earlier article, the final kslwtbctx call marks the start of the wait event 'SQL*Net message from client': the foreground process is idle, waiting for input from the client:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

Now let's use cgroups to throttle the I/O. If I throttle the ASM disks to 1 IOPS and run the full-segment-scan SQL, the gdb output looks like this:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200620 in io_getevents () from /lib64/libaio.so.1
#0  0x0000000008fa1334 in kslwtectx ()

This is in fact the typical asynchronous I/O pattern when I/O is slow: io_submit issues an I/O request, followed by four io_getevents calls; if the I/O is not very fast (there are still outstanding I/O requests), the foreground process registers a wait event, issues another io_getevents call, and then ends the wait (depending on the I/O capability of a real-world system, you are more likely to see something in between these two outputs).

This pattern has several variants. For example, when the scan starts, after the asynchronous I/O initialization (the operating system I/O context and the Oracle process I/O slots), there are first two io_submit calls, so that at least two asynchronous I/O requests are outstanding from the start.

The four io_getevents calls after io_submit are clearly non-blocking, because on screen I can see these four calls complete very quickly, up to the registration of the wait event, after which the output stops at another io_getevents call. That is of course my guess; can we prove it?

3. A deeper look with debug information

To look deeper into the io_getevents calls we need the arguments of the io_getevents function, which requires installing the libaio package with debug information (downloadable from http://oss.oracle.com/ol6/debuginfo). After installing it, restart the gdb session to get the detailed debug information.

This is the new gdb output:

#0 0x0000000002cfb352 in io_prep_pread ()
#0 0x0000000000a09bb0 in io_submit@plt ()
#0 io_submit (ctx=0x7ff6ceb2c000, nr=1, iocbs=0x7fff2a4e09e0) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e9048, timeout=0x7fff2a4ea050) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ec128, timeout=0x7fff2a4ed130) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4e8e48, timeout=0x7fff2a4e9e50) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=2, nr=128, events=0x7fff2a4ebf28, timeout=0x7fff2a4ecf30) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008f9a652 in kslwtbctx ()
#0 0x0000000000a09030 in io_getevents@plt ()
#0 io_getevents_0_4 (ctx=0x7ff6ceb2c000, min_nr=1, nr=128, events=0x7fff2a4e8e38, timeout=0x7fff2a4e9e40) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
#0 0x0000000008fa1334 in kslwtectx ()

Now we can see lots of interesting things!

First, all the I/O calls use the same 'aio_context': ctx=0x7ff6ceb2c000. This means that every asynchronous I/O of this SQL query is visible in it, and that I/O requests are not necessarily reaped in the order in which they were submitted.

Next, the min_nr argument: the non-blocking io_getevents calls use min_nr=2, while the blocking io_getevents call uses min_nr=1.

The nr argument is the maximum number of I/Os that can be reaped from this aio_context. As far as I know, the number of I/O slots per process cannot exceed 32, so in practice no more than 32 requests are reaped per call (even though we see nr=128).

The last io_getevents argument is timeout, which is key to understanding the I/O flow. timeout is a pointer to a timeout struct; to get the timeout value we need to print the contents of that struct.

Let's have gdb print the timeout struct at the breakpoint:

(gdb) del
Delete all breakpoints? (y or n) y
(gdb) break *0x0000003f38200620
Breakpoint 38 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 38, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) rbreak ^kslwt[be]ctx
Breakpoint 39 at 0x8f9a652
 kslwtbctx;
Breakpoint 40 at 0x8fa1334
 kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 39-40, one per line.
End with a line saying just "end".
>silent
>f
>end
(gdb) rbreak ^io_.*
Breakpoint 41 at 0x3f38200570: file io_queue_init.c, line 28.
int io_queue_init(int, io_context_t *);
...
Breakpoint 74 at 0x7f549fd44c0c
 io_prep_poll;
(gdb) commands
Type commands for breakpoint(s) 41-74, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Note the 'print *timeout' at breakpoint 38 (*0x0000003f38200620, the real io_getevents call).
'print *timeout' prints the contents that the pointer points to; now we can verify whether my earlier guess was correct:

#0  0x0000000002cfb352 in io_prep_pread ()
#0  0x0000000000a09bb0 in io_submit@plt ()
#0  io_submit (ctx=0x7f54a1956000, nr=1, iocbs=0x7fff78f059b0) at io_submit.c:23
23  io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
#0  0x0000000000a09030 in io_getevents@plt ()
 
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0dfd8, timeout=0x7fff78f0efe0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$21 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()
 
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f110b8, timeout=0x7fff78f120c0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()
 
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f0ddd8, timeout=0x7fff78f0ede0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000000a09030 in io_getevents@plt ()
 

Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=3, nr=128, events=0x7fff78f10eb8, timeout=0x7fff78f11ec0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
 
Breakpoint 38, io_getevents_0_4 (ctx=0x7f54a1956000, min_nr=1, nr=128, events=0x7fff78f0ddc8, timeout=0x7fff78f0edd0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 600, tv_nsec = 0}

Yes! The contents of the timeout struct in the gdb output confirm my guess! The four un-instrumented io_getevents calls are non-blocking (tv_sec=0): they simply check the asynchronous I/O completion queue (ctx) to see whether 3 I/O requests have completed. Then a wait event is registered and a blocking io_getevents call is made against the same completion queue (ctx), this time needing only one completed I/O request, with a timeout of 600 seconds (tv_sec = 600).

4. Conclusion

This article looked at how 'direct path reads' really work. There is clearly much more to explore on this subject, especially since 11g introduced a clever 'auto tune' mechanism (to make better use of asynchronous I/O for performance).

[Translated] Oracle 11.2.0.1 and the 'kfk: async disk IO' wait event

Translated from Frits Hoogland's blog; original post: http://fritshoogland.wordpress.com/2012/12/27/oracle-11-2-0-1-and-the-kfk-async-disk-io-wait-event/

Recently I discussed some I/O-related waits with a friend. The wait I focus on here is 'kfk: async disk IO'. This wait shows up in 11.2.0.1 but not in 11.2.0.2 and later versions, and this article investigates why.

First of all: the wait has not disappeared in 11.2.0.2 and later, which is easy to prove (this is the output from an 11.2.0.3 database):

SYS@v11203 AS SYSDBA> select name, parameter1, parameter2, parameter3, wait_class from v$event_name where name like 'kfk: async disk IO';
 
NAME             PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
-------------------- ---------- ---------- ---------- ----------------------------------------------------------------
kfk: async disk IO   count  intr       timeout    System I/O

Interestingly, the wait class is 'System I/O'. I do not know the official definition of the 'System I/O' wait class, but it suggests waits by background processes rather than by my foreground process. Of course, I could be wrong...

Let's first look at a trace file snippet from a full segment scan that uses asynchronous I/O, on Oracle 11.2.0.1, running on an OL6.3 x86-64 virtual machine.
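
The snippet below presumably comes from an extended SQL trace with wait events; one standard way to produce such a trace (an assumption, not shown in the original post) is:

alter session set events '10046 trace name context forever, level 8';
select count(*) from t2;
alter session set events '10046 trace name context off';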

PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356620409310181 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=120,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310180
EXEC #3:c=0,e=63,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356620409310297
WAIT #3: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1356620409310351
WAIT #3: nam='Disk file operations I/O' ela= 9 FileOperation=2 fileno=5 filetype=2 obj#=-1 tim=1356620409311203
WAIT #3: nam='Disk file operations I/O' ela= 238 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409312218
WAIT #3: nam='kfk: async disk IO' ela= 14 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409313703
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409314449
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315169
WAIT #3: nam='kfk: async disk IO' ela= 5 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409315850
WAIT #3: nam='Disk file operations I/O' ela= 42 FileOperation=2 fileno=0 filetype=15 obj#=73426 tim=1356620409316451
WAIT #3: nam='kfk: async disk IO' ela= 33 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409317903
WAIT #3: nam='kfk: async disk IO' ela= 403 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409320529
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409321950
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409323627
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409325424
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409327121
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409329153
WAIT #3: nam='kfk: async disk IO' ela= 7 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409330861
WAIT #3: nam='kfk: async disk IO' ela= 8 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409332534
WAIT #3: nam='kfk: async disk IO' ela= 6 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356620409334179

It is quite clear that the database must be doing some work behind the wait event 'kfk: async disk IO', but 'kfk: async disk IO' appears very often and each wait takes very little time (mostly ela < 10 microseconds). Let's dig a little deeper. Most wait events are instrumentation of operating system calls, so we can use the Linux tool 'strace' to list the waits and the system calls together. Using 'strace -e write=all -p PID', here is a snippet of the output:

io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
io_getevents(139892671860736, 2, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 122880, 0}}, {0, 0}) = 1
write(9, "WAIT #3: nam=’kfk: async disk IO"…, 107) = 107
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam=’kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 35  33 20 63 6f 75 6e 74 3d  ‘ ela= 5 3 count= |
 | 00030  31 20 69 6e 74 72 3d 30  20 74 69 6d 65 6f 75 74  1 intr=0  timeout |
 | 00040  3d 34 32 39 34 39 36 37  32 39 35 20 6f 62 6a 23  =4294967 295 obj# |
 | 00050  3d 37 33 34 32 36 20 74  69 6d 3d 31 33 35 36 36  =73426 t im=13566 |
 | 00060  32 31 31 31 30 32 35 38  30 31 34                 21110258 014      |
write(9, "\n", 1)                       = 1
 | 00000  0a                                                .                 |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b585420, 0x7f3b4b585420, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b5851e0, 0, 0, 0, 11}}) = 1
write(9, "WAIT #3: nam=’kfk: async disk IO"…, 106) = 106
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam=’kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 36  20 63 6f 75 6e 74 3d 31  ‘ ela= 6  count=1 |
 | 00030  20 69 6e 74 72 3d 30 20  74 69 6d 65 6f 75 74 3d   intr=0  timeout= |
 | 00040  34 32 39 34 39 36 37 32  39 35 20 6f 62 6a 23 3d  42949672 95 obj#= |
 | 00050  37 33 34 32 36 20 74 69  6d 3d 31 33 35 36 36 32  73426 ti m=135662 |
 | 00060  31 31 31 30 32 35 39 37  35 39                    11102597 59       |
write(9, "\n", 1)                       = 1
 | 00000  0a                                                .                 |
io_getevents(139892671860736, 1, 128, {{0x7f3b4b5851e0, 0x7f3b4b5851e0, 1032192, 0}}, {0, 0}) = 1
io_submit(139892671860736, 1, {{0x7f3b4b585420, 0, 0, 0, 12}}) = 1
write(9, "WAIT #3: nam=’kfk: async disk IO"…, 106) = 106
 | 00000  57 41 49 54 20 23 33 3a  20 6e 61 6d 3d 27 6b 66  WAIT #3:  nam=’kf |
 | 00010  6b 3a 20 61 73 79 6e 63  20 64 69 73 6b 20 49 4f  k: async  disk IO |
 | 00020  27 20 65 6c 61 3d 20 39  20 63 6f 75 6e 74 3d 31  ‘ ela= 9  count=1 |
 | 00030  20 69 6e 74 72 3d 30 20  74 69 6d 65 6f 75 74 3d   intr=0  timeout= |
 | 00040  34 32 39 34 39 36 37 32  39 35 20 6f 62 6a 23 3d  42949672 95 obj#= |
 | 00050  37 33 34 32 36 20 74 69  6d 3d 31 33 35 36 36 32  73426 ti m=135662 |
 | 00060  31 31 31 30 32 36 32 31  37 38                    11102621 78       |

Interestingly, 'kfk: async disk IO' is recorded after the two calls io_getevents() and io_submit(). The io_submit call submits an I/O request to the operating system, and the io_getevents call checks the operating system asynchronous I/O (AIO) completion queue to see whether any I/O requests have completed. So in my view 'System I/O' is not an appropriate class: this is what my foreground process is waiting on. We also know this wait event is I/O related, but what exactly does it measure?

To go further we need to know what the database is doing internally. The most obvious way is to read the source code, but that is impossible unless you work for Oracle or have explicit authorization. There are other ways to investigate the internals, such as 'perf', which I have written about before; in this article I will use gdb (the GNU debugger).

To debug a running Oracle process, attach to it as root with 'gdb -p PID'. To get output similar to strace, use the following gdb commands:

(gdb) set pagination off
(gdb) break io_submit
Breakpoint 1 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) break 'io_getevents@plt'
Breakpoint 2 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>frame
>continue
>end
(gdb) continue
Continuing.

Run the full segment scan again.

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x00000000009d1a58 in io_getevents@plt ()

Now we need to find out how Oracle registers wait events, and we can use the earlier SQL trace file for that. After the full-scan SQL, the first wait event is 'SQL*Net message from client', and Oracle records this wait event with a write() call. We clear the current breakpoints and set a breakpoint on write() (use CTRL-C to interrupt the debugger's current activity).

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break write
Breakpoint 3 at 0x3f38a0e4c0: file ../sysdeps/unix/syscall-template.S, line 82. (3 locations)
(gdb) c
Continuing.
('c' is an abbreviation of 'continue')

Run the full segment scan again. Because of this breakpoint, gdb stops execution when write() is called.

Breakpoint 3, write () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Next, we print the whole stack of function calls that led to the write() call, with the 'backtrace' command ('bt').

(gdb) bt
#0  write () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000086014dd in sdbgrfuwf_write_file ()
#2  0x0000000008601426 in sdbgrfwf_write_file ()
#3  0x00000000085caafb in dbgtfdFileWrite ()
#4  0x00000000085ca685 in dbgtfdFileAccessCbk ()
#5  0x00000000085c9d6d in dbgtfPutStr ()
#6  0x0000000001a8d93a in dbktWriteTimestamp ()
#7  0x0000000001a8d6b5 in __PGOSF61_dbktWriteNfy ()
#8  0x00000000085caa06 in dbgtfdFileWrite ()
#9  0x00000000085ca685 in dbgtfdFileAccessCbk ()
#10 0x00000000085c9d6d in dbgtfPutStr ()
#11 0x0000000001a8e291 in dbktPri ()
#12 0x00000000009f143e in ksdwrf ()
#13 0x00000000009f1d8f in ksdwrfn ()
#14 0x0000000005abf0e1 in kxstTraceWait ()
#15 0x000000000821e88d in kslwtectx ()
#16 0x00000000083c2a79 in __PGOSF24_opikndf2 ()
#17 0x000000000143b790 in opitsk ()
#18 0x00000000014406da in opiino ()
#19 0x00000000083c54fd in opiodr ()
#20 0x0000000001437b60 in opidrv ()
#21 0x00000000018aac97 in sou2o ()
#22 0x00000000009d3ef1 in opimai_real ()
#23 0x00000000018aff26 in ssthrdmain ()
#24 0x00000000009d3e5d in main ()

The interesting entry is frame 15, kslwtectx(). We know this write() records the end of the 'SQL*Net message from client' wait, so kslwtectx should stand for 'Kernel Service Layer WaiT End ConTeXt' (a reasonable guess on my part; Tanel Poder helped identify this function). The function that records the start of a wait is kslwtbctx, and the one that records its end is kslwtectx. We set breakpoints on io_submit, io_getevents, kslwtbctx and kslwtectx, and continue:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break 'io_getevents@plt'
Breakpoint 5 at 0x9d1a58
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtbctx
Breakpoint 6 at 0x8217fda
(gdb) commands
Type commands for breakpoint(s) 6, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) break kslwtectx
Breakpoint 7 at 0x821e3d8
(gdb) commands
Type commands for breakpoint(s) 7, one per line.
End with a line saying just "end".
>silent
>f
>c
>end
(gdb) c
Continuing.

Now run the full segment scan; this is the debugger output:

#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()

The pattern above repeats until the IO calls are finished; the foreground process then ends the way it started, waiting repeatedly on 'SQL*Net message from client' (kslwtbctx -> kslwtectx).

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()

OK, back to 'kfk: async disk IO'. We can see that every io_getevents() call is surrounded by a kslwtbctx()/kslwtectx() pair. Note that this is the behaviour of 11.2.0.1. It means 'kfk: async disk IO' is a measurement of the io_getevents() call, or rather that io_getevents() is part of that wait event.

To prove this guess, we make the io_getevents() calls take longer and check that the wait time grows accordingly. This can be done with the debugger's 'catch' command:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) catch syscall io_getevents
Catchpoint 8 (syscall 'io_getevents' [208])
(gdb) commands
Type commands for breakpoint(s) 8, one per line.
End with a line saying just "end".
>silent
>shell sleep 0.01
>c
>end
(gdb) c
Continuing.

Note that this is io_getevents rather than io_getevents@plt: the catch command resolves the call by its system call name, not via the symbol table of the executable.

Now rerun the full segment scan with 10046 tracing switched on. Earlier we saw that most 'kfk: async disk IO' waits took less than 10 microseconds. Here is a snippet of the trace:

PARSING IN CURSOR #3 len=23 dep=0 uid=85 oct=3 lid=85 tim=1356626115353145 hv=1020534364 ad='7f1b14a0' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #3:c=0,e=203,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115353116
EXEC #3:c=1000,e=80,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626115355176
WAIT #3: nam='SQL*Net message to client' ela= 30 driver id=1413697536 #bytes=1 p3=0 obj#=73426 tim=1356626115355832
WAIT #3: nam='kfk: async disk IO' ela= 26404 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115384177
WAIT #3: nam='kfk: async disk IO' ela= 25050 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115411677
WAIT #3: nam='kfk: async disk IO' ela= 24885 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115437990
WAIT #3: nam='kfk: async disk IO' ela= 25002 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115464244
 
*** 2012-12-27 17:35:15.492
WAIT #3: nam='kfk: async disk IO' ela= 25192 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115492891
WAIT #3: nam='kfk: async disk IO' ela= 25175 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115521419
WAIT #3: nam='kfk: async disk IO' ela= 24804 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115549124
WAIT #3: nam='kfk: async disk IO' ela= 24980 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115576955
WAIT #3: nam='kfk: async disk IO' ela= 24988 count=1 intr=0 timeout=4294967295 obj#=73426 tim=1356626115604879
First, an explanation: the debugger fires the 'catch' command both on entry to and on exit from the io_getevents call. So we now see 'kfk: async disk IO' waits of slightly more than 20 milliseconds instead of a few microseconds, which strongly suggests that io_getevents and 'kfk: async disk IO' are related. We can also see it from the debugger:
(gdb) info break
Num     Type           Disp Enb Address            What
8       catchpoint     keep y                      syscall "io_getevents" 
    catchpoint already hit 174 times
        silent
        shell sleep 0.01
        c

The catchpoint fired 174 times. So 'kfk: async disk IO' is a measurement of the io_getevents() call, or more precisely, io_getevents() is part of the 'kfk: async disk IO' wait.

Right, now let's switch to 11.2.0.3.
Here is the trace file for the same full segment scan:

PARSING IN CURSOR #140382932318408 len=23 dep=0 uid=84 oct=3 lid=84 tim=1356626776003874 hv=1020534364 ad='7f491298' sqlid='94dwfa8yd87kw'
select count(*) from t2
END OF STMT
PARSE #140382932318408:c=0,e=41,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003873
EXEC #140382932318408:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356626776003962
WAIT #140382932318408: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356626776004009
WAIT #140382932318408: nam='direct path read' ela= 340 file number=5 first dba=28418 block cnt=126 obj#=75579 tim=1356626776045733
FETCH #140382932318408:c=164975,e=188719,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356626776192752
STAT #140382932318408 id=1 cnt=1 pid=0 pos=1 obj=0 op='SORT AGGREGATE (cr=20944 pr=20941 pw=0 time=188708 us)'
STAT #140382932318408 id=2 cnt=1000000 pid=1 pos=1 obj=75579 op='TABLE ACCESS FULL T2 (cr=20944 pr=20941 pw=0 time=68726 us cost=5738 siz=0 card=1000000)'

First of all, the 'kfk: async disk IO' event is gone. We see the PARSE, the SQL*Net messages and just a single 'direct path read' wait, followed by the FETCH. This is completely different from 11.2.0.1.
What happened? I think the only way to understand this is to set breakpoints on io_submit, io_getevents, kslwtbctx and kslwtectx again:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

This is very strange: during the IO processing neither kslwtbctx nor kslwtectx is called. So although 'kfk: async disk IO' still exists in 11.2.0.3, it is no longer used to time io_getevents(). What if we slow io_getevents() down again in the same way (with the catch command, 0.01 s per call)?

PARSE #140382932318408:c=0,e=168,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973832834
EXEC #140382932318408:c=0,e=68,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356627973833385
WAIT #140382932318408: nam='SQL*Net message to client' ela= 671 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356627973835367
WAIT #140382932318408: nam='asynch descriptor resize' ela= 680 outstanding #aio=0 current aio limit=152 new aio limit=182 obj#=75579 tim=1356627973837361
 
*** 2012-12-27 18:06:16.440
FETCH #140382932318408:c=185972,e=2604507,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1356627976440376

No: the FETCH does take longer (elapsed e=2604507 versus cpu c=185972, so 2418535 microseconds were not spent on CPU), but no wait event accounts for that time. A plausible explanation is that the 'kfk: async disk IO' event works differently in this version (11.2.0.3). I tried lengthening io_submit in the same way, with the same result: the FETCH takes longer, but no wait event records the extra time.

What if I make the disks themselves slow, using cgroups as described in another post, and throttle my ASM disks to 1 IOPS? That should make wait events appear. The trace now looks like this:

PARSE #140382932318408:c=0,e=20,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964787836
EXEC #140382932318408:c=0,e=64,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1356633964788256
WAIT #140382932318408: nam='SQL*Net message to client' ela= 703 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1356633964790122
WAIT #140382932318408: nam='asynch descriptor resize' ela= 699 outstanding #aio=0 current aio limit=238 new aio limit=264 obj#=75579 tim=1356633964792014
WAIT #140382932318408: nam='direct path read' ela= 497938 file number=5 first dba=23939 block cnt=13 obj#=75579 tim=1356633965295483
 
*** 2012-12-27 19:46:05.795
WAIT #140382932318408: nam='direct path read' ela= 495498 file number=5 first dba=23953 block cnt=15 obj#=75579 tim=1356633965795382
 
*** 2012-12-27 19:46:06.295
WAIT #140382932318408: nam='direct path read' ela= 495890 file number=5 first dba=23969 block cnt=15 obj#=75579 tim=1356633966295208
WAIT #140382932318408: nam='direct path read' ela= 495889 file number=5 first dba=23985 block cnt=15 obj#=75579 tim=1356633966795127

So once I slow the IO down, the 'direct path read' wait event shows up. What do the gdb breakpoints show in this case?

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000000a09030 in io_getevents@plt ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()
#0  0x0000000008fa1334 in kslwtectx ()
#0  0x0000000008f9a652 in kslwtbctx ()

The io_getevents() calls are interesting: four consecutive io_getevents() calls appear before kslwtbctx(), then io_getevents() is called once or twice before kslwtectx(), at which point the wait time is recorded. And there is no 'kfk: async disk IO' anywhere.

Now let's look at how 11.2.0.1 behaves with the IO throttled:

#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000003f38200660 in io_submit () from /lib64/libaio.so.1
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x00000000009d1a58 in io_getevents@plt ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()
#0  0x000000000821e3d8 in kslwtectx ()
#0  0x0000000008217fda in kslwtbctx ()

The result is quite similar to 11.2.0.3!

Summary

The database issues many io_getevents() calls. After io_submit() has submitted the asynchronous IO requests, these io_getevents() calls are non-blocking and return immediately; they check the AIO completion queue, and any IO found there is processed. In 11.2.0.1 there are 3 such non-blocking io_getevents calls, in 11.2.0.3 there are 4 (note: the difference is probably related to the slot mechanism of asynchronous IO, which limits how many asynchronous IO requests can be outstanding at once). If the non-blocking calls return no IO, the wait interface is invoked (kslwtbctx()) and a blocking io_getevents() call is issued, which must return at least one request; the time of that call is what shows up in the 'direct path read' wait.

In Oracle 11.2.0.1, the first non-blocking io_getevents() call is timed by the 'kfk: async disk IO' event; for reasons unknown, that measurement is gone in 11.2.0.3.
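
To make that pattern concrete, here is a minimal stand-alone sketch of it using libaio on Linux. This is my own illustration, not Oracle code: the file name, block size and call counts are assumptions; only the shape (io_submit, a non-blocking probe of the completion queue, then a blocking reap, which is the part a wait event such as 'direct path read' would time) mirrors what the gdb sessions above show.

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev[1];
    void *buf;
    int fd = open("/tmp/aio_demo.dat", O_RDONLY | O_DIRECT);   /* hypothetical data file */

    if (fd < 0 || io_setup(128, &ctx) < 0) return 1;
    if (posix_memalign(&buf, 512, 4096)) return 1;

    io_prep_pread(&cb, fd, buf, 4096, 0);
    io_submit(ctx, 1, cbs);                      /* like Oracle's io_submit() */

    struct timespec nowait = {0, 0};             /* non-blocking probe of the completion queue */
    int done = io_getevents(ctx, 1, 1, ev, &nowait);
    if (done == 0) {
        /* nothing finished yet: roughly where kslwtbctx() would start the wait */
        struct timespec block = {600, 0};        /* blocking reap, 600 s cap, at least 1 IO */
        done = io_getevents(ctx, 1, 1, ev, &block);
        /* ...and kslwtectx() would end it */
    }
    printf("reaped %d IO(s)\n", done);
    io_destroy(ctx);
    return 0;
}

Compile with something like 'gcc aio_demo.c -laio'; the only point is that the final, blocking io_getevents is the piece that corresponds to the timed wait.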

[Translation] A Deep Dive into Log Writer IO and Wait Events

Preface:

This is Frits Hoogland's study of log writer IO and wait events. I learned a lot from it, so I have translated the article here. For the article on database writer IO and wait events, see here.

Original link: http://fritshoogland.wordpress.com/2013/08/30/oracle-io-on-linux-log-writer-io-and-wait-events/

Oracle IO on linux: log writer IO and wait events

This article is about log writer (lgwr) IO.
My test environment is Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3 and ASM, with all database files in ASM.

To observe the log writer's behaviour, enabling a 10046 level 8 trace gives us a first impression. 10046 can be switched on with oradebug; be very careful using this on a production system, as it can crash the instance.

These are the oradebug commands:

SYS@v11203 AS SYSDBA> oradebug setospid 2491
Oracle pid: 11, Unix process pid: 2491, image: oracle@ol63-oracle.local (LGWR)
SYS@v11203 AS SYSDBA> oradebug unlimit
Statement processed.
SYS@v11203 AS SYSDBA> oradebug event 10046 trace name context forever, level 8
Statement processed.

Linux process 2491 is clearly the log writer, as the image field confirms.

For an idle database, the log writer trace looks like this:

*** 2013-08-29 21:10:45.593
WAIT #0: nam='rdbms ipc message' ela= 3000575 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803445592696

*** 2013-08-29 21:10:48.595
WAIT #0: nam='rdbms ipc message' ela= 3000566 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803448594695

*** 2013-08-29 21:10:51.597
WAIT #0: nam='rdbms ipc message' ela= 3000812 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803451596813

*** 2013-08-29 21:10:54.599
WAIT #0: nam='rdbms ipc message' ela= 3000733 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803454598806

*** 2013-08-29 21:10:57.601
WAIT #0: nam='rdbms ipc message' ela= 3000460 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803457600717

*** 2013-08-29 21:11:00.603
WAIT #0: nam='rdbms ipc message' ela= 3000526 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803460602699

*** 2013-08-29 21:11:03.605
WAIT #0: nam='rdbms ipc message' ela= 3000640 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803463604715

*** 2013-08-29 21:11:04.218
WAIT #0: nam='rdbms ipc message' ela= 612031 timeout=300 p2=0 p3=0 obj#=-1 tim=1377803464218054

The log writer spends 3 seconds (ela=3000812) on the 'rdbms ipc message' event, because the timeout is set to 3 s (timeout=300; the unit is hundredths of a second, Oracle's old way of measuring time). To understand this from the operating system's point of view, we use strace to watch the system calls:

[root@ol63-oracle ~]# strace -e write=all -e all -p 2491
Process 2491 attached - interrupt to quit
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
write(32, "\n*** 2013-08-29 21:35:50.740\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 30  2e 37 34 30 0a           21:35:50 .740.    |
lseek(32, 0, SEEK_CUR)                  = 118213
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 32 30  38 37 34 35 32 20 74 69   ela= 20 87452 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 30 37 34 30 33 37 33  13778049 50740373 |
write(33, "J?Jqz7~UW1\n", 11)           = 11
 | 00000  4a 3f 4a 71 7a 37 7e 55  57 31 0a                 J?Jqz7~U W1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028646
semtimedop(327683, {{15, -1, 0}}, 1, {0, 910000000}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
write(32, "\n*** 2013-08-29 21:35:51.651\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 31  2e 36 35 31 0a           21:35:51 .651.    |
lseek(32, 0, SEEK_CUR)                  = 118339
write(32, "WAIT #0: nam='rdbms ipc message'"..., 94) = 94
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 39 31  30 36 37 37 20 74 69 6d   ela= 91 0677 tim |
 | 00030  65 6f 75 74 3d 39 31 20  70 32 3d 30 20 70 33 3d  eout=91  p2=0 p3= |
 | 00040  30 20 6f 62 6a 23 3d 2d  31 20 74 69 6d 3d 31 33  0 obj#=- 1 tim=13 |
 | 00050  37 37 38 30 34 39 35 31  36 35 31 37 32 34        77804951 651724   |
write(33, "J?HRU3~UU1\n", 11)           = 11
 | 00000  4a 3f 48 52 55 33 7e 55  55 31 0a                 J?HRU3~U U1.      |
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430028737
semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
write(32, "\n*** 2013-08-29 21:35:54.652\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 32 39 20  .*** 201 3-08-29  |
 | 00010  32 31 3a 33 35 3a 35 34  2e 36 35 32 0a           21:35:54 .652.    |
lseek(32, 0, SEEK_CUR)                  = 118463
write(32, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 33 30  30 30 33 33 32 20 74 69   ela= 30 00332 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 30 34 39  35 34 36 35 32 38 32 36  13778049 54652826 |
write(33, "J?MZSB~UW1\n", 11)           = 11
 | 00000  4a 3f 4d 5a 53 42 7e 55  57 31 0a                 J?MZSB~U W1.      |
lseek(32, 0, SEEK_CUR)                  = 118559
write(32, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(33, "3?uG~01\n", 8)               = 8
 | 00000  33 3f 75 47 7e 30 31 0a                           3?uG~01.          |
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 109983}, ru_stime={0, 420936}, ...}) = 0
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037
times({tms_utime=10, tms_stime=42, tms_cutime=0, tms_cstime=0}) = 430029037

Let's first look at one detail, even though it is not necessarily specific to the log writer: whenever a timestamp appears in an Oracle trace file (e.g. 2013-08-29 21:38:58.721), Oracle first issues a times() system call. If you want to know what this call does, read the manual with 'man 2 times' on a Linux terminal. In short: it returns the user-mode and kernel-mode CPU time of the current process and its children.
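
As a small stand-alone illustration (not Oracle code) of what a times() call gives back:

#include <sys/times.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    struct tms t;
    clock_t now = times(&t);                      /* fills in CPU usage, returns elapsed ticks */
    long hz = sysconf(_SC_CLK_TCK);               /* ticks per second on this system */
    printf("utime=%ld stime=%ld cutime=%ld cstime=%ld (1/%ld s ticks, now=%ld)\n",
           (long)t.tms_utime, (long)t.tms_stime,
           (long)t.tms_cutime, (long)t.tms_cstime, hz, (long)now);
    return 0;
}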

Back to the log writer: we see the 'rdbms ipc message' wait. It measures the time the log writer actually sleeps on a semaphore, with a timeout of 3 seconds.

semtimedop(327683, {{15, -1, 0}}, 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable)

The semtimedop call is likewise documented in the manual; the last structure ({3,0}) is the timeout, the first number being seconds and the second nanoseconds. In other words: when the log writer spends time on 'rdbms ipc message' it is idle, waiting for work to arrive. The semaphore provides an inter-process communication mechanism with which one process can tell another to start some work immediately.
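
A minimal sketch of that idle sleep, assuming a hypothetical semaphore set id and semaphore number (Oracle's real values differ per instance): wait up to 3 seconds to be posted; EAGAIN on timeout corresponds to a full 3-second 'rdbms ipc message' wait.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* returns 1 if posted (work to do), 0 on timeout, -1 on error */
int wait_for_work(int semid)
{
    struct sembuf op = {15, -1, 0};               /* semaphore #15, decrement by 1 */
    struct timespec timeout = {3, 0};             /* {seconds, nanoseconds} */

    if (semtimedop(semid, &op, 1, &timeout) == -1) {
        if (errno == EAGAIN) {
            printf("timed out after 3 s, no work posted\n");
            return 0;
        }
        perror("semtimedop");
        return -1;
    }
    printf("posted: there is work to do\n");
    return 1;
}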

Anyone reading this blog, and who has read this far, knows what the log writer does (it writes the contents of the log buffer to disk). It starts working when it finds content in the buffer that needs writing, or when another process notifies it to write redo records. A process that notifies the log writer waits on the 'log file sync' event until the log writer signals that the write has completed.

With extended SQL trace added, you see output like this:

io_submit(140612782833664, 1, {{0x7fe2f5369450, 0, 1, 0, 257}}) = 1
io_getevents(140612782833664, 1, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 1024, 0}}, {600, 0}) = 1
lseek(26, 0, SEEK_CUR)                  = 5190
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 30 37 20  write' e la= 107  |
 | 00030  66 69 6c 65 73 3d 31 20  62 6c 6f 63 6b 73 3d 32  files=1  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 31 20 6f 62 6a 23   request s=1 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 37 30  =-1 tim= 13778070 |
 | 00060  32 34 39 35 31 33 31 38                           24951318          |

We see io_submit() submit one asynchronous IO request and io_getevents() confirm that the request has completed, after which the log writer writes the 'log file parallel write' event to the trace file to record this IO. The wait event data (files=1) shows that only one online redo log file was written (both IO system calls also indicate a single IO) and that the write was two blocks in size. We know that on Linux the redo log block size is 512 bytes, and io_getevents confirms two blocks (1024 = 512 * 2). The IO took 107 microseconds (ela=107).

On Linux the redo log block size is 512 bytes, although in newer Oracle versions it can also be 4KB for modern solid-state devices (to make full use of their IO characteristics).

An IO time of 107 microseconds looks reasonable, but we will come back to that later. Now let's add a second member to every log group and switch logs, so that each write goes to two redo log files. Here is the new log writer output:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2
lseek(26, 0, SEEK_CUR)                  = 101469
write(26, "WAIT #0: nam='log file parallel "..., 104) = 104
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 31 37 20  write' e la= 117  |
 | 00030  66 69 6c 65 73 3d 32 20  62 6c 6f 63 6b 73 3d 32  files=2  blocks=2 |
 | 00040  20 72 65 71 75 65 73 74  73 3d 32 20 6f 62 6a 23   request s=2 obj# |
 | 00050  3d 2d 31 20 74 69 6d 3d  31 33 37 37 38 30 38 36  =-1 tim= 13778086 |
 | 00060  33 32 30 36 37 31 32 35                           32067125          |

We see a single io_submit() call submit two IOs at once, and io_getevents then confirms both; io_getevents must see at least 2 completed IOs before it can return (so the redo records are guaranteed to be persisted). The writes behind 'log file parallel write' therefore really are done in parallel. The block count is slightly confusing: the event shows blocks=2, while io_getevents shows 512 bytes written to each log file, i.e. one redo block per file; so blocks=2 here appears to be the total across both files rather than a per-file figure.

How does Oracle write the redo logs when asynchronous IO is not enabled? We set disk_asynch_io to false and look at the log writer output again:

pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 1415701504) = 512
pwrite(256, "\1\"\0\0\362\0\0\0\n\1\0\0\20\200\330\227\330\1\0\0\r\0\0\0\275\203O\0\1\0\0\0"..., 512, 6560015360) = 512
lseek(25, 0, SEEK_CUR)                  = 4600
write(25, "WAIT #0: nam='log file parallel "..., 103) = 103
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 6c 6f  WAIT #0:  nam='lo |
 | 00010  67 20 66 69 6c 65 20 70  61 72 61 6c 6c 65 6c 20  g file p arallel  |
 | 00020  77 72 69 74 65 27 20 65  6c 61 3d 20 31 35 20 66  write' e la= 15 f |
 | 00030  69 6c 65 73 3d 32 20 62  6c 6f 63 6b 73 3d 32 20  iles=2 b locks=2  |
 | 00040  72 65 71 75 65 73 74 73  3d 32 20 6f 62 6a 23 3d  requests =2 obj#= |
 | 00050  2d 31 20 74 69 6d 3d 31  33 37 37 38 30 39 33 39  -1 tim=1 37780939 |
 | 00060  37 38 38 31 38 31 39                              7881819           |

There are no asynchronous IO calls now: Oracle issues two pwrite calls one after the other. No parallel write takes place!

Now think about what 'log file parallel write' means under the two IO modes. With asynchronous IO (which every current system should be using), it times both IOs together, i.e. the time until both have completed; because the IOs run in parallel, and provided your storage has enough bandwidth, the wait time is essentially the time of the slowest IO. With synchronous IO, the wait time is the sum of the two IO times, and you would have to divide it by the number of IOs to get an average latency. Note that in this case the IOs happen one after another, with no concurrency.
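
A trivial sketch with made-up latencies (the 0.8 ms and 1.6 ms figures are invented for illustration), just to make the accounting difference explicit:

#include <stdio.h>

int main(void)
{
    double io_ms[2] = {0.8, 1.6};                                   /* hypothetical per-file write latencies */
    double async_wait = io_ms[0] > io_ms[1] ? io_ms[0] : io_ms[1];  /* parallel: slowest IO */
    double sync_wait  = io_ms[0] + io_ms[1];                        /* sequential: sum of IOs */

    printf("async wait ~= %.1f ms (slowest IO)\n", async_wait);
    printf("sync  wait ~= %.1f ms (sum); average latency = %.1f ms\n",
           sync_wait, sync_wait / 2);
    return 0;
}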

But is that conclusion actually correct?

Let's dig deeper with gdb! For the synchronous IO case, attach gdb to the log writer (gdb -p PID) and use the following debugger commands. (Note: kslwtbctx starts the timing of a wait event in Oracle 11.2, and is called kslwtb_tm in Oracle 10.2; kslwtectx ends it, kslwte_tm in 10.2. 'rbreak' means 'break by regular expression' and can set breakpoints on several matching functions at once.)

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 3 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) 

Now insert some data and commit:

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 3, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
Hey! Did you notice something interesting in the output above? At first the log writer sleeps like a baby: kslwtbctx() starts timing a wait event, the semtimedop() call means the log writer is asleep, and after the timeout kslwtectx() ends the wait. But when the log writer then starts writing redo with pwrite64(), kslwtbctx() does not appear; it only shows up after the pwrite64() calls have finished (meaning the write IO is not being timed).

Am I missing something, or is Oracle so clever that it does not even need to time the IO calls? Let's enable extended SQL trace and throttle the IO so that the log writer's write latency goes way up (my other post, http://fritshoogland.wordpress.com/2012/12/15/throttling-io-with-linux/, describes how to throttle IO; for the log writer you need to throttle write IOPS, not read IOPS). With that in place, here is the extended SQL trace output:
*** 2013-08-29 23:11:54.509
WAIT #0: nam='log file parallel write' ela= 512 files=2 blocks=4 requests=2 obj#=-1 tim=1377810714509122

*** 2013-08-29 23:11:54.509
Warning: log write elapsed time 2004ms, size 1KB

*** 2013-08-29 23:11:56.509
WAIT #0: nam='log file parallel write' ela= 541 files=2 blocks=2 requests=2 obj#=-1 tim=1377810716509653

*** 2013-08-29 23:11:56.509
Warning: log write elapsed time 2000ms, size 0KB

Here is the evidence, caught red-handed! The 'log file parallel write' event reports an extremely fast IO time (ela=512 microseconds), yet right after it we get a warning that the log write was very slow. With synchronous IO, Oracle does not time the log writer's write IO at all, and the 'log file parallel write' wait time is simply wrong!

Now let's look at asynchronous IO (disk_asynch_io=true). Same approach: attach gdb to lgwr and set breakpoints on the asynchronous IO calls, io_submit and io_getevents (io_getevents_0_4 on Linux).

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 3 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

At first you should see kslwtbctx and kslwtectx timing the semtimedop() calls. After inserting some data, you see the following output:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001750) at io_submit.c:23
23	io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40078f8, timeout=0x7fffc4008900) at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)

There is something worth noting here. The first line shows the end of a wait event (kslwtectx); the log writer then submits an IO request (io_submit). Notice that no wait event is being timed at that point, because kslwtbctx is only called after io_submit. This means 'log file parallel write' only records the time spent reaping the IO, not the whole IO latency. I know submitting an IO request is only a small fraction of the total IO time, but even so, the wait covers only the reap, not the complete IO.

Now let's look again at the io_submit/io_getevents calls from the earlier strace output:

io_submit(140612782833664, 2, {{0x7fe2f5369450, 0, 1, 0, 256}, {0x7fe2f53696a8, 0, 1, 0, 256}}) = 2
io_getevents(140612782833664, 2, 128, {{0x7fe2f5369450, 0x7fe2f5369450, 512, 0}, {0x7fe2f53696a8, 0x7fe2f53696a8, 512, 0}}, {0, 0}) = 2

The timespec structure in this io_getevents call is {0, 0}. That makes it a non-blocking call that simply checks the operating system's completion queue for the submitted IOs. In other words, even if the submitted IOs take a long time to complete, the call returns immediately and execution carries on. This is just like the asynchronous IO implementation of direct path reads (the earlier article discusses these non-blocking io_getevents calls).

Let's throttle lgwr so that its IO slows down, and print the timespec structure at the io_getevents breakpoint:

(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>print *timeout
>c
>end

Now insert data again and commit so the log writer has work to do, remembering that the IO is throttled:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Breakpoint 3, io_submit (ctx=0x7f8615d0c000, nr=2, iocbs=0x7fffc4001640) at io_submit.c:23
23 io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)

Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077e8, timeout=0x7fffc40087f0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$6 = {tv_sec = 0, tv_nsec = 0}

Breakpoint 4, io_getevents_0_4 (ctx=0x7f8615d0c000, min_nr=2, nr=128, events=0x7fffc40077d8, timeout=0x7fffc40087e0) at io_getevents.c:46
46 if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$7 = {tv_sec = 600, tv_nsec = 0}

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

Ah! I see the io_submit call submit two IO requests (nr=2), then the timing of the 'log file parallel write' event starts (kslwtbctx), and then we see two io_getevents calls: the first is non-blocking and almost certain to find nothing, followed by a blocking io_getevents call (tv_sec in its timespec is 600).

I do not yet know why a non-blocking io_getevents call precedes the blocking one (the write IO is unlikely to have completed immediately, so the first call is probably useless).

Summary

This is how the Oracle 11.2 reference guide describes 'log file parallel write':

'Wait time: the time it takes for the IOs to complete. Although the redo records are written in parallel, the wait does not end until the last IO has completed.'

That is actually a very good description, and for asynchronous IO it is quite accurate.

For synchronous IO, however, it is wrong. With synchronous IO, Oracle performs the write IOs one after another, those operations are not timed at all, and the reported wait time is simply wrong.

For asynchronous IO, the 'log file parallel write' wait time records the time needed to get the redo records onto disk. From the wait you can infer the time of the slowest IO; all the other writes finished faster than that. The wait time is measured from the process's point of view and is not a calculation of IO latency. It is a good indication of the slowest IO and is accurate enough for judging how write IO affects the log writer, but it is not sufficient for measuring actual IO latency. To go beyond "how long did the log writer spend on write IO", you must understand exactly how the event is implemented. For example, you cannot divide the 'log file parallel write' time by the number of IO requests; the result does not mean anything (it is certainly not the average write IO latency).

[Translation] A Deep Dive into the Database Writer and the db file parallel write Wait Event

Preface:

In an AWR report, comparing data file read and write IO shows something odd: write IO (db file parallel write) appears faster than read IO, whether single-block reads (db file sequential read) or multi-block reads (db file scattered read). For a SAN whose cache is configured write-through, writes should be slower than reads, because reads may benefit from the operating system page cache or the IO controller cache, while writes have to go straight to disk. In the AWR wait-event snippet below (DB environment: Linux 64bit, Oracle 10.2.0.5), write IO is clearly faster than read IO: single-block and multi-block reads average 4 ms and 8 ms respectively, while write IO averages only 1 ms.

                                                                  Avg
                                            %Time  Total Wait    wait     Waits
Event                                 Waits -outs    Time (s)    (ms)      /txn
---------------------------- -------------- ----- ----------- ------- ---------
db file sequential read           4,160,209   N/A      15,787       4       7.6
db file scattered read               40,348   N/A         316       8       0.1
db file parallel write              233,282   N/A         164       1       0.4

If you compare the average 'db file parallel write' wait time with iostat on the OS, or with the IO statistics of the IO controller, Oracle's write wait time comes out shorter. Using 10046 traces, strace and gdb, Frits Hoogland investigated how Oracle measures the database writer's write IO time; his conclusion is that the 'db file parallel write' wait does not cover the whole write IO cycle, which is why it looks shorter than the real IO time, and therefore 'db file parallel write' is not a reliable basis for judging the write performance of the storage system. Below is my translation of the article (the original may require a proxy to reach):

Translated content:

Oracle IO on linux: database writer IO and wait events

Original link: http://fritshoogland.wordpress.com/2013/09/06/oracle-io-on-linux-database-writer-io-and-wait-events/

This article is about the IO of the database writer (dbwr, most of the time the dbw0 process).

The test environment is: Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3 and ASM, with all database files in ASM. The server is a VMware Fusion virtual machine with two CPUs.

You may want to read my earlier article about log writer IO.

The number of database writers depends on the number of CPUs the instance uses; when DB_WRITER_PROCESSES is not set, it is CEIL(CPU_COUNT/8). Other factors, such as NUMA, may also influence the number of database writers. Since my VM has only two CPUs, I get exactly one database writer (dbw0).

Let's start investigating by enabling a 10046 level 8 trace on the database writer. When Oracle is idle, the database writer sleeps on the 'rdbms ipc message' wait event (waking every 3 seconds), and the trace looks much like the log writer's. The snippet below shows the database writer idle, waiting for work.

*** 2013-08-30 12:42:00.291
WAIT #0: nam='rdbms ipc message' ela= 3001013 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859320291675
 
*** 2013-08-30 12:42:03.292
WAIT #0: nam='rdbms ipc message' ela= 3000870 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859323292672
 
*** 2013-08-30 12:42:06.293
WAIT #0: nam='rdbms ipc message' ela= 3000806 timeout=300 p2=0 p3=0 obj#=-1 tim=1377859326293581

In short, 'rdbms ipc message' means the database writer is idle, waiting for work to arrive.

If we create a small table, insert one row and commit, the buffer cache now contains dirty blocks. An 'alter system flush buffer_cache' then forces Oracle to write those dirty blocks to the data files, and the process is recorded in the database writer's trace file.

WAIT #0: nam='rdbms ipc message' ela= 328100 timeout=300 p2=0 p3=0 obj#=-1 tim=1377868410713860
WAIT #0: nam='db file async I/O submit' ela= 1 requests=14 interrupt=0 timeout=0 obj#=-1 tim=1377868410715372
WAIT #0: nam='db file parallel write' ela= 26 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1377868410715427

The first line's 'rdbms ipc message' wait lasts 0.3281 s (ela=328100; ela is in microseconds), less than 3 s, because the 'alter system flush buffer_cache' triggered the database writer to start working: it was notified and stopped waiting.

Next we see two waits: 'db file async I/O submit' and 'db file parallel write'. 'db file parallel write' is the one I am interested in: Oracle has dirty blocks to write, the IO takes time, so it is reasonable for a wait event to cover it. I have noticed 'db file async I/O submit' in a few AWR reports, but never as a top wait event, so I will not investigate it in this article.

Let's trace the database writer more deeply, combining strace output with the 10046 level 8 trace file, using 'strace -e write=all -e all -p PID'.

semtimedop(851971, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
write(18, "\n*** 2013-08-30 15:32:44.063\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 35 3a 33 32 3a 34 34  2e 30 36 33 0a           15:32:44 .063.    |
lseek(18, 0, SEEK_CUR)                  = 682279
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 32  38 31 32 31 31 20 74 69   ela= 12 81211 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 36 39 35  36 34 30 36 33 36 30 31  13778695 64063601 |
write(19, "J?2-u4~UW1\n", 11)           = 11
 | 00000  4a 3f 32 2d 75 34 7e 55  57 31 0a                 J?2-u4~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=34, tms_stime=85, tms_cutime=0, tms_cstime=0}) = 433153247
io_submit(139931766530048, 24, {{0x7f44671a7a50, 0, 1, 0, 256}, {0x7f44671a7ca8, 0, 1, 0, 257}, {0x7f44671a7f00, 0, 1, 0, 256}, {0x7f44671a8158, 0, 1, 0, 256}, {0x7f44671a83b0, 0, 1, 0, 256}, {0x7f44671a8608, 0, 1, 0, 257}, {0x7f44671a8860, 0, 1, 0, 257}, {0x7f44671a8ab8, 0, 1, 0, 257}, {0x7f44671a8d10, 0, 1, 0, 257}, {0x7f44671a8f68, 0, 1, 0, 257}, {0x7f44671a91c0, 0, 1, 0, 257}, {0x7f44671a9418, 0, 1, 0, 257}, {0x7f44671a9670, 0, 1, 0, 257}, {0x7f446719e1f8, 0, 1, 0, 256}, {0x7f44671a77f8, 0, 1, 0, 257}, {0x7f44671a75a0, 0, 1, 0, 257}, {0x7f44671a7348, 0, 1, 0, 256}, {0x7f44671a70f0, 0, 1, 0, 256}, {0x7f44671a6e98, 0, 1, 0, 256}, {0x7f44671a6c40, 0, 1, 0, 256}, {0x7f44671a69e8, 0, 1, 0, 256}, {0x7f44671a6790, 0, 1, 0, 256}, {0x7f44671a6538, 0, 1, 0, 256}, {0x7f44671a62e0, 0, 1, 0, 256}}) = 24
lseek(18, 0, SEEK_CUR)                  = 682376
write(18, "WAIT #0: nam='db file async I/O "..., 109) = 109
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 61 73  79 6e 63 20 49 2f 4f 20   file as ync I/O  |
 | 00020  73 75 62 6d 69 74 27 20  65 6c 61 3d 20 31 20 72  submit'  ela= 1 r |
 | 00030  65 71 75 65 73 74 73 3d  32 34 20 69 6e 74 65 72  equests= 24 inter |
 | 00040  72 75 70 74 3d 30 20 74  69 6d 65 6f 75 74 3d 30  rupt=0 t imeout=0 |
 | 00050  20 6f 62 6a 23 3d 2d 31  20 74 69 6d 3d 31 33 37   obj#=-1  tim=137 |
 | 00060  37 38 36 39 35 36 34 30  36 36 31 36 37           78695640 66167    |
write(19, "J?JV~1j1\n", 9)              = 9
 | 00000  4a 3f 4a 56 7e 31 6a 31  0a                       J?JV~1j1 .        |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
io_getevents(139931766530048, 2, 128, {{0x7f44671a7a50, 0x7f44671a7a50, 8192, 0}, {0x7f44671a7ca8, 0x7f44671a7ca8, 8192, 0}, {0x7f44671a7f00, 0x7f44671a7f00, 8192, 0}, {0x7f44671a8158, 0x7f44671a8158, 8192, 0}, {0x7f44671a83b0, 0x7f44671a83b0, 8192, 0}, {0x7f44671a8608, 0x7f44671a8608, 8192, 0}, {0x7f44671a8860, 0x7f44671a8860, 8192, 0}, {0x7f44671a8ab8, 0x7f44671a8ab8, 8192, 0}, {0x7f44671a8d10, 0x7f44671a8d10, 8192, 0}, {0x7f44671a8f68, 0x7f44671a8f68, 8192, 0}, {0x7f44671a91c0, 0x7f44671a91c0, 8192, 0}, {0x7f44671a9418, 0x7f44671a9418, 8192, 0}, {0x7f44671a9670, 0x7f44671a9670, 8192, 0}, {0x7f446719e1f8, 0x7f446719e1f8, 8192, 0}, {0x7f44671a77f8, 0x7f44671a77f8, 8192, 0}, {0x7f44671a75a0, 0x7f44671a75a0, 8192, 0}, {0x7f44671a7348, 0x7f44671a7348, 16384, 0}, {0x7f44671a70f0, 0x7f44671a70f0, 8192, 0}, {0x7f44671a6e98, 0x7f44671a6e98, 8192, 0}, {0x7f44671a6c40, 0x7f44671a6c40, 16384, 0}, {0x7f44671a69e8, 0x7f44671a69e8, 8192, 0}, {0x7f44671a6790, 0x7f44671a6790, 8192, 0}, {0x7f44671a6538, 0x7f44671a6538, 8192, 0}, {0x7f44671a62e0, 0x7f44671a62e0, 8192, 0}}, {600, 0}) = 24
write(18, "WAIT #0: nam='db file parallel w"..., 117) = 117
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 35 33 20 72  rite' el a= 153 r |
 | 00030  65 71 75 65 73 74 73 3d  32 20 69 6e 74 65 72 72  equests= 2 interr |
 | 00040  75 70 74 3d 30 20 74 69  6d 65 6f 75 74 3d 32 31  upt=0 ti meout=21 |
 | 00050  34 37 34 38 33 36 34 37  20 6f 62 6a 23 3d 2d 31  47483647  obj#=-1 |
 | 00060  20 74 69 6d 3d 31 33 37  37 38 36 39 35 36 34 30   tim=137 78695640 |
 | 00070  36 36 35 35 30                                    66550             |

This trace fragment starts with a semtimedop system call. Looking at the 10046 level 8 output, the database writer slept on the semaphore for 1.28 seconds. It then writes some records to the .trc and .trm files and uses a times() system call to get the process's user-mode and kernel-mode CPU time. After that comes an important system call: io_submit(), which submits 24 IOs.

After the io_submit() system call, the process records a 'db file async I/O submit' wait event with a wait time of 1 microsecond (ela=1).

So it looks as if Oracle measures the time the database writer spends submitting IO. That differs from the log writer and from foreground processes, which do not time the io_submit() system call.

To check whether the previously submitted asynchronous writes have completed, an io_getevents system call appears. This io_getevents call needs to see at least two completed IOs before it can return; here it actually reaps all 24 submitted IOs (count the iocbs in the io_getevents arguments: there are 24). Oracle times this reap as 'db file parallel write', as the following write call confirms. The io_getevents timeout is 600 s, which appears to be the usual timeout for Oracle asynchronous IO on Linux. Most notably, the wait event carries no information about how many IO requests actually completed; the call could have ended as soon as two IOs were done.
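
As a rough stand-alone sketch of that submit-and-reap shape with libaio (my illustration, not Oracle code: the 24 writes of 8 KB, the min_nr of 2 and the 600 s timeout are taken from the trace above; the file name and everything else are assumptions):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NIO 24                                   /* 24 writes, as in the io_submit above */

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cbs[NIO], *ptrs[NIO];
    struct io_event ev[128];
    struct timespec reap_to = {600, 0};          /* 600 s timeout, as seen in the trace */
    void *buf;
    int fd = open("/tmp/dbw_demo.dat", O_RDWR | O_CREAT | O_DIRECT, 0600);  /* hypothetical */

    if (fd < 0 || io_setup(128, &ctx) < 0) return 1;
    if (posix_memalign(&buf, 4096, 8192)) return 1;

    for (int i = 0; i < NIO; i++) {
        io_prep_pwrite(&cbs[i], fd, buf, 8192, (long long)i * 8192);
        ptrs[i] = &cbs[i];
    }
    io_submit(ctx, NIO, ptrs);                   /* one call submits the whole batch */

    /* block until at least 2 writes are done, but take as many as have completed */
    int done = io_getevents(ctx, 2, 128, ev, &reap_to);
    printf("submitted %d writes, reaped %d in this call\n", NIO, done);

    io_destroy(ctx);
    return 0;
}

The return value of io_getevents is the only place the number of completed IOs is visible; as noted above, the wait event itself does not expose it.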

Now let's switch to synchronous IO. As I have said many times, it would be a serious mistake not to enable asynchronous IO on your system. Many newer Oracle features are built on top of it; direct path reads are one example: with asynchronous IO, requests are submitted in batches rather than waiting for each IO to complete in turn, which greatly reduces the overall IO latency.

With disk_asynch_io set to false, Oracle uses only synchronous IO. Here is a combined extended SQL trace and strace fragment, taken when 'alter system flush buffer_cache' triggers the database writer to start working:

semtimedop(983043, {{14, -1, 0}}, 1, {3, 0}) = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
write(18, "\n*** 2013-08-30 16:34:19.162\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  33 2d 30 38 2d 33 30 20  .*** 201 3-08-30  |
 | 00010  31 36 3a 33 34 3a 31 39  2e 31 36 32 0a           16:34:19 .162.    |
lseek(18, 0, SEEK_CUR)                  = 9915
write(18, "WAIT #0: nam='rdbms ipc message'"..., 96) = 96
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 72 64  WAIT #0:  nam='rd |
 | 00010  62 6d 73 20 69 70 63 20  6d 65 73 73 61 67 65 27  bms ipc  message' |
 | 00020  20 65 6c 61 3d 20 31 36  35 34 35 38 36 20 74 69   ela= 16 54586 ti |
 | 00030  6d 65 6f 75 74 3d 33 30  30 20 70 32 3d 30 20 70  meout=30 0 p2=0 p |
 | 00040  33 3d 30 20 6f 62 6a 23  3d 2d 31 20 74 69 6d 3d  3=0 obj# =-1 tim= |
 | 00050  31 33 37 37 38 37 33 32  35 39 31 36 32 37 36 32  13778732 59162762 |
write(19, "J?e7K6~UW1\n", 11)           = 11
 | 00000  4a 3f 65 37 4b 36 7e 55  57 31 0a                 J?e7K6~U W1.      |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522701
pwrite(256, "\6\242\0\0\207 \f\0\214\255O\0\0\0\n\6\236\316\0\0\1\0\0\0%1\1\0X\231O\0"..., 8192, 6221258752) = 8192
pwrite(257, "\2\242\0\0\326\22\300\0\214\255O\0\0\0\t\4\10\207\0\0\1\0\16\0\371\4\0\0005\2\32\32"..., 8192, 3414867968) = 8192
pwrite(256, "\2\242\0\0*\2\300\0\205\255O\0\0\0\1\4\236W\0\0\10\0\33\0=\6\0\0\305\2\v\v"..., 8192, 1359298560) = 8192
pwrite(256, "\2\242\0\0d\1\300\0\207\255O\0\0\0\1\4\321e\0\0\4\0\33\0#\5\0\0\35\3\20\20"..., 8192, 1358725120) = 8192
pwrite(257, "&\242\0\0\360\0\300\0\206\255O\0\0\0\1\4[\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1368260608) = 8192
pwrite(257, "&\242\0\0\260\0\300\0\210\255O\0\0\0\1\4\320\351\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367736320) = 8192
pwrite(257, "&\242\0\0\200\0\300\0\214\255O\0\0\0\2\4.)\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 1367343104) = 8192
pwrite(257, "\6\242\0\0\6\t@\0\210\255O\0\0\0\1\6Z\37\0\0\1\0\0\0N&\1\0\21\254O\0"..., 8192, 749780992) = 8192
pwrite(256, "\6\242\0\0\231\2@\0\206\255O\0\0\0\1\6q\207\0\0\1\0\0\0J\0\0\0\253\254O\0"..., 8192, 734208000) = 8192
lseek(18, 0, SEEK_CUR)                  = 10012
write(18, "WAIT #0: nam='db file parallel w"..., 107) = 107
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 32 34 20 72 65  rite' el a= 24 re |
 | 00030  71 75 65 73 74 73 3d 39  20 69 6e 74 65 72 72 75  quests=9  interru |
 | 00040  70 74 3d 30 20 74 69 6d  65 6f 75 74 3d 30 20 6f  pt=0 tim eout=0 o |
 | 00050  62 6a 23 3d 2d 31 20 74  69 6d 3d 31 33 37 37 38  bj#=-1 t im=13778 |
 | 00060  37 33 32 35 39 31 38 33  32 32 37                 73259183 227      |
write(19, "J?ny4~1h1\n", 10)            = 10
 | 00000  4a 3f 6e 79 34 7e 31 68  31 0a                    J?ny4~1h 1.       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
write(18, "WAIT #0: nam='db file parallel w"..., 106) = 106
 | 00000  57 41 49 54 20 23 30 3a  20 6e 61 6d 3d 27 64 62  WAIT #0:  nam='db |
 | 00010  20 66 69 6c 65 20 70 61  72 61 6c 6c 65 6c 20 77   file pa rallel w |
 | 00020  72 69 74 65 27 20 65 6c  61 3d 20 31 20 72 65 71  rite' el a= 1 req |
 | 00030  75 65 73 74 73 3d 39 20  69 6e 74 65 72 72 75 70  uests=9  interrup |
 | 00040  74 3d 30 20 74 69 6d 65  6f 75 74 3d 30 20 6f 62  t=0 time out=0 ob |
 | 00050  6a 23 3d 2d 31 20 74 69  6d 3d 31 33 37 37 38 37  j#=-1 ti m=137787 |
 | 00060  33 32 35 39 31 38 33 34  32 30                    32591834 20       |
write(18, "\n", 1)                      = 1
 | 00000  0a                                                .                 |
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semctl(983043, 16, SETVAL, 0x1)         = 0
times({tms_utime=0, tms_stime=5, tms_cutime=0, tms_cstime=0}) = 433522703
semtimedop(983043, {{14, -1, 0}}, 1, {1, 320000000}) = -1 EAGAIN (Resource temporarily unavailable)

Hmm... I see a lot of suspicious things. The trace starts with the sleep on the semaphore, after which the process writes an idle wait event. Then there are 9 pwrite system calls, as the database writer writes dirty blocks back to the data files. These 9 writes are done sequentially, one after another. After the 9 writes complete, Oracle records the wait event. Since this is not asynchronous IO we do not see 'db file async I/O submit', only 'db file parallel write'; the number of requests is 9, matching the 9 pwrite calls.

Let's pause at the first 'db file parallel write' wait event. Its wait time is highly suspicious: can 9 writes really complete in 24 microseconds (ela=24)? Wait, it gets stranger.

I see another 'db file parallel write' right after the first one, with no pwrite() system calls in between, which is odd. This wait also reports 9 requests, yet took only 1 microsecond (ela=1). These two back-to-back wait events appear on every 'alter system flush buffer_cache', and that 1-microsecond wait time is highly suspicious (no real IO can be that fast).

Let's look more closely at how Oracle measures the wait time. I use gdb with the following commands. (Note: kslwtbctx starts the timing of a wait event in Oracle 11.2, kslwtb_tm in Oracle 10.2; kslwtectx ends it, kslwte_tm in 10.2. 'rbreak' means 'break by regular expression' and can set breakpoints on several matching functions at once.)

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) command
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) command
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 4 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

After gdb continues, I issue 'alter system flush buffer_cache'. Here is the gdb output:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 4, pwrite64 () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

The output begins and ends with semaphore calls. It is easy to see that the pwrite64() calls in between are not timed by kslwtbctx/kslwtectx. That is why the 'db file parallel write' wait time is so short: no actual IO is being timed at all. Unless someone can prove me wrong, I have to conclude that with synchronous IO, Oracle's implementation of the database writer wait events is simply wrong.

Of course, as I said: you really should be using asynchronous IO, not synchronous IO.

Let's switch back to asynchronous IO. These are the gdb commands I use:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3f386ea7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 4 at 0x3f38200660: file io_submit.c, line 23.
(gdb) command
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 5 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should see the semtimedop() system call, surrounded by kslwt[be]ctx, appear on screen every three seconds. I insert a few rows into the test table, commit, and run 'alter system flush buffer_cache'. Here is the gdb output:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23  io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)

This fragment begins and ends with the semaphore wait.
Next, io_submit() is clearly not timed, because it is not surrounded by the kslwtbctx()/kslwtectx() pair. Unfortunately, although the wait event 'db file async I/O submit' seems to suggest the time spent submitting IO to the operating system, that step is not actually being timed.
Then we see kslwtbctx(), io_getevents_0_4(), kslwtectx(): this is Oracle timing the check on whether the previously submitted write IOs have completed.

To further confirm what happens with io_submit(), we make that call last longer, using the following gdb commands (breakpoint 4 is the io_submit breakpoint):

(gdb) commands 4
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end
(gdb) 

The injected 'shell sleep 1' means that when io_submit() is entered, the debugger pauses the database writer, runs the shell command 'sleep 1', and then lets the database writer continue. As a result, the io_submit() system call takes more than 1 second.

Yet the 10046 level 8 trace file on my system looks like this:

*** 2013-09-06 11:12:16.337
WAIT #0: nam='db file async I/O submit' ela= 470 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458736337487
WAIT #0: nam='db file parallel write' ela= 1386 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458736340152
 
*** 2013-09-06 11:12:17.344
WAIT #0: nam='db file async I/O submit' ela= 545 requests=1 interrupt=0 timeout=0 obj#=-1 tim=1378458737344837
WAIT #0: nam='db file parallel write' ela= 1554 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=1378458737346981

Even with io_submit() taking more than a second, 'db file async I/O submit' still shows only about 0.5 milliseconds. We can now say with confidence that 'db file async I/O submit' does not time the submission of IO requests to the operating system.

Now look at the 'db file parallel write' event. Its 'requests' field does not reflect the number of IO requests that were actually reaped, but the minimum number of completed IOs the io_getevents() call has to collect from the operating system's completion queue; the number actually completed is not shown. Another oddity is the timeout: 2147483647 is clearly 2^31-1, and with the unit being hundredths of a second (according to the Oracle 11.2 reference guide) the write would time out only after roughly 21,474,836 seconds, almost 250 days. That is far too long.
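
A quick sanity check of that arithmetic (assuming the unit really is hundredths of a second):

#include <stdio.h>

int main(void)
{
    long long cs = 2147483647LL;                 /* timeout value shown in the wait event */
    double seconds = cs / 100.0;                 /* hundredths of a second -> seconds */
    printf("%.0f seconds, about %.0f days\n", seconds, seconds / 86400.0);
    return 0;
}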

Fortunately we can verify the actual timeout of the io_getevents() system call: strace shows it as the last argument of the call, or we can have the debugger print the timeout structure.

(gdb) commands 5
Type commands for breakpoint(s) 5, one per line.
End with a line saying just "end".
>print *timeout
>c
>end
(gdb) c
Continuing.

Here is the gdb output:

Breakpoint 3, semtimedop () at ../sysdeps/unix/syscall-template.S:82
82  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=3, iocbs=0x7fff872031a0) at io_submit.c:23
23  io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 5, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=1, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$1 = {tv_sec = 600, tv_nsec = 0}
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()

So the actual io_getevents() timeout is 600 seconds ($1 = {tv_sec = 600, tv_nsec = 0}), not the enormous value implied by the Oracle documentation. This conclusion about the timeout applies to the database writer; Oracle's code might of course re-issue io_getevents, but I doubt it.

So far, the database writer only had to write a handful of blocks to disk. Because the volume is small and my test system runs in a VM, the I/Os are almost certainly cached, which means their latency is extremely low. What happens if we throttle the database writer's I/O? Another article describes how to slow down I/O on Linux.

My test table holds a few million rows (to generate more dirty blocks; data can be created by running: 'insert into t select 1+(level-1) from dual connect by level < 100000;'). Using cgroups I set up an I/O throttle and put the database writer into the controlled task list. Then I delete all the data with 'delete from t'. The database writer still has 10046 level 8 tracing enabled, so I can see what it does at the Oracle level. In addition, I change the io_getevents() breakpoint in gdb: with the 'finish' command, gdb runs until the current function call returns and then prints its return value, and io_getevents() returns the number of I/O requests that actually completed. The downside of 'finish' is that execution stops afterwards and must be continued manually. Here is the gdb output:

Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=153, iocbs=0x7fff872031a0) at io_submit.c:23
23  io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=38, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$12 = {tv_sec = 600, tv_nsec = 0}
c
0x0000000002cfa74e in skgfrliopo ()
Value returned is $13 = 38
(gdb) c
Continuing.
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$14 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $15 = 0
(gdb) c
Continuing.
 
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=115, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$16 = {tv_sec = 0, tv_nsec = 0}
0x0000000002cfa74e in skgfrliopo ()
Value returned is $17 = 18
(gdb) 

It starts with kslwtectx(), after which io_submit() submits 153 I/O requests. The kslwtbctx()-kslwtectx() pair that follows is the timing of the event 'db file async I/O submit'.
Then we see the kslwtbctx() that starts timing 'db file parallel write', followed by io_getevents(). This time io_getevents must reap at least 38 completed I/O requests, with the timeout again set to 600 seconds. Because both read and write I/O are throttled to 1 IOPS, only those 38 I/O requests are reaped here.

I could observe this thanks to gdb's 'finish' command: it returned into the function that called io_getevents, skgfrliopo(), which is an Oracle function.

Continuing the process we see something unexpected: the wait is ended (kslwtectx()), and then Oracle tries to reap the remaining 115 I/O requests with two non-blocking io_getevents() calls. Once those two calls are done, either new I/O requests are submitted (and the sequence just described repeats), or Oracle registers another 'db file parallel write' event and issues a blocking io_getevents() call, this time asking for fewer I/O requests than are actually still outstanding. If the number of outstanding I/O requests keeps growing, a single 'db file parallel write' wait event may even contain two blocking io_getevents() calls; a sketch of this pattern follows the next gdb listing's introduction.
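
Sketched in C, the reaping pattern suggested by this gdb session looks roughly like this (again my own illustration, not Oracle's code): one blocking io_getevents() that is timed as 'db file parallel write', followed by up to two non-blocking sweeps, outside any wait event, for whatever else has completed.

#include <libaio.h>
#include <time.h>

/* Illustrative sketch of the two-phase reap seen in the gdb output. */
static int reap_like_dbwr(io_context_t ctx, long min_nr, long outstanding)
{
    struct io_event events[128];
    struct timespec blocking = { .tv_sec = 600, .tv_nsec = 0 };
    struct timespec poll     = { .tv_sec = 0,   .tv_nsec = 0 };  /* non-blocking */
    long reaped = 0;
    int  n;

    /* timed as 'db file parallel write': wait for at least min_nr completions */
    n = io_getevents(ctx, min_nr, 128, events, &blocking);
    if (n > 0) reaped += n;

    /* untimed: two non-blocking sweeps for the remaining completions */
    for (int i = 0; i < 2 && reaped < outstanding; i++) {
        long want = outstanding - reaped;
        if (want > 128) want = 128;                  /* min_nr may not exceed nr */
        n = io_getevents(ctx, want, 128, events, &poll);
        if (n > 0) reaped += n;
    }
    return (int)reaped;
}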

Here is the gdb output while the number of outstanding I/O requests keeps growing:

Breakpoint 4, io_submit (ctx=0x7f9d8dbd1000, nr=685, iocbs=0x7fff872031a0) at io_submit.c:23
23  io_syscall3(int, io_submit, io_submit, io_context_t, ctx, long, nr, struct iocb **, iocbs)
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 1, 0x0000000008f9a652 in kslwtbctx ()
 
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$22 = {tv_sec = 600, tv_nsec = 0}
 
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=114, nr=128, events=0x7fff8720b088, timeout=0x7fff8720c090) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$23 = {tv_sec = 600, tv_nsec = 0}
 
Breakpoint 2, 0x0000000008fa1334 in kslwtectx ()
 
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720afe8, timeout=0x7fff8720bff0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$24 = {tv_sec = 0, tv_nsec = 0}
 
Breakpoint 6, io_getevents_0_4 (ctx=0x7f9d8dbd1000, min_nr=128, nr=128, events=0x7fff8720e0c8, timeout=0x7fff8720f0d0) at io_getevents.c:46
46      if (ring==NULL || ring->magic != AIO_RING_MAGIC)
$25 = {tv_sec = 0, tv_nsec = 0}

Summary

The database writer's wait events are odd and do not reveal what it is really doing.

'db file parallel write' only implies parallelism when asynchronous I/O is used; with synchronous I/O, the writes are performed one after another.

The database-writer-specific event 'db file async I/O submit' does not time the submission of asynchronous I/O to the operating system, despite what its name suggests. It now also shows up when synchronous I/O is used.

With synchronous I/O, the wait event 'db file parallel write' does not time the write I/Os at all, and it even emits two wait events for the same batch of sequential writes. I consider this implementation simply wrong: in that case the wait time of 'db file parallel write' has no value whatsoever.

The wait event 'db file async I/O submit' shows the number of I/O requests submitted to the operating system with io_submit().

The wait event 'db file parallel write' shows the minimum number of I/O requests the io_getevents() call must reap, which can be less than the number actually submitted with io_submit(). The timeout shown in the wait event is misleading: the real timeout is 600 seconds, not the hundreds of days implied by 2147483647 centiseconds.

When the 'db file parallel write' wait reaps fewer I/O requests than io_submit() submitted, the database writer tries to reap the remainder with two non-blocking io_getevents() calls, after which new I/O requests can be submitted.

My final conclusion is that 'db file parallel write' covers only part of the database writer's work and is completely unusable for measuring the latency of Oracle's actual write I/O.