Restrictions on Global Hints with ANSI JOIN

A recent case: when the query used ANSI join syntax, the hint cardinality(view_name.table_name 1000000000) had no effect.

This restriction is mentioned in the official documentation:

The view_name.table_name notation for global hints does not work for queries using ANSI join syntax because the optimizer generates additional views during parsing. When using ANSI join syntax, specify the query block name in the hint instead of the view_name.table_name notation.

To reproduce it, create two tables, t1 and t2:

create table t1 as
with v1 as (
select /*+ materialize */
rownum id from dual connect by level <= 100
)
select
    rownum                      id,
    rpad(rownum, 10, '0')       v1,
    trunc((rownum - 1)/100)     n1,
    rpad(rownum, 10)           padding
from v1, v1;

create table t2 as select * from t1;

Create view v:

create or replace view v
as select * from t2;

The hint cardinality(v.t2 1000000000) takes effect with Oracle join syntax:

explain plan for
select /*+ cardinality(v.t2 1000000000)*/
    sum(t1.n1)
from
    t1, v
where
    t1.id = v.id;

------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |     1 |    11 |  9703 (100)| 00:00:01 |
|   1 |  SORT AGGREGATE             |      |     1 |    11 |            |          |
|*  2 |   HASH JOIN                 |      |  1000M|    10G|  9703 (100)| 00:00:01 |
|   3 |    TABLE ACCESS STORAGE FULL| T1   | 10000 | 70000 |    16   (0)| 00:00:01 |
|   4 |    TABLE ACCESS STORAGE FULL| T2   |  1000M|  3814M|  1628 (100)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="ID")

The hint cardinality(v.t2 1000000000) does not take effect with ANSI join syntax:

explain plan for
select /*+ cardinality(v.t2 1000000000)*/
    sum(t1.n1)
from
    t1 inner join v
on
    t1.id = v.id;

------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |     1 |    11 |    33   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE             |      |     1 |    11 |            |          |
|*  2 |   HASH JOIN                 |      | 10000 |   107K|    33   (4)| 00:00:01 |
|   3 |    TABLE ACCESS STORAGE FULL| T2   | 10000 | 40000 |    16   (0)| 00:00:01 |
|   4 |    TABLE ACCESS STORAGE FULL| T1   | 10000 | 70000 |    16   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="ID")

Workaround: use a qb_name hint inside the view to name its query block, e.g. inner_view, and reference table t2 in the view as t2@inner_view in the statement's global hint. As you can see below, the cardinality hint then takes effect.

create or replace view test_hw
as select /*+ qb_name(inner_view) */* from t2;

explain plan for
select /*+ cardinality(t2@inner_view 1000000000)*/
    sum(t1.n1)
from
    t1 inner join test_hw
on
    t1.id = test_hw.id;

------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |     1 |    11 |  9703 (100)| 00:00:01 |
|   1 |  SORT AGGREGATE             |      |     1 |    11 |            |          |
|*  2 |   HASH JOIN                 |      |  1000M|    10G|  9703 (100)| 00:00:01 |
|   3 |    TABLE ACCESS STORAGE FULL| T1   | 10000 | 70000 |    16   (0)| 00:00:01 |
|   4 |    TABLE ACCESS STORAGE FULL| T2   |  1000M|  3814M|  1628 (100)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="ID")

Understanding Oracle Parallel Execution in Depth (Part 1)

This is a paper on Oracle parallel execution, published by ACOUG and OTN. It covers the following topics:

  • Why Oracle parallel execution uses the producer-consumer model.
  • How to read parallel execution plans.
  • Which scenarios each data distribution method suits.
  • Combining partition wise join with parallel execution to improve performance.
  • How data skew affects the different distribution methods.
  • Blocking points that may appear in execution plans due to the limitations of the producer-consumer model.
  • How Bloom filtering improves parallel execution performance.
  • The most common problems with parallel execution in the real world.
  • Article links: OTN, ACOUG
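
As a small taste of the first two topics, the producer-consumer model is visible directly in a parallel plan. A minimal sketch (written from general experience, not taken from the paper; t1 and t2 are the demo tables used elsewhere in these posts):

    explain plan for
    select /*+ parallel(4) */
        count(*)
    from
        t1, t2
    where
        t1.id = t2.id;

    select * from table(dbms_xplan.display);

    -- In the plan, each PX SEND / PX RECEIVE pair is a producer set handing
    -- rows to a consumer set through a table queue (the TQ column), e.g.
    -- PX SEND BROADCAST :TQ10000 under the scan of t1, and
    -- PX SEND QC (RANDOM) returning the final rows to the query coordinator.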

MOATS RAC Dashboard, a Lightweight RAC Monitoring Tool

1. Introduction

MOATS RAC is a lightweight RAC monitoring tool, similar in effect to the top command. It grew out of an extension to MOATS (Mother Of All Tuning Scripts), a tool published by Tanel Poder and Adrian Billington in 2011 that monitors the activity statistics of a single instance, including Top Events and Top SQL. MOATS RAC adds several enhancements over MOATS:

• Monitors all instances of a RAC; tested on RACs with up to 8 nodes.
• Adds an Active Session Graph, showing the activity of the whole RAC over a recent period.
• Adds AAS (Average Active Sessions) to Top Event and Top SQL; for each Top SQL, computes the top two wait events and their percentages.
• The monitoring screen can be adjusted to the actual terminal size: the height of the whole window, the display heights of the Top Event and Active Session Graph sections, and the maximum number of Top SQL lines shown.
• Supports 11g R1 and later; adds Exadata-related statistics such as the offload ratio, physical IO eliminated by Storage Index, and Smart Flash Cache hit counts.

2. Monitoring Screen

3. Installation

MOATS RAC is essentially one package plus a few types:
1. To install under a non-SYS user, log in as sys and run moats_privs_grant.sql to grant the necessary privileges to the target user.
2. Log in as the target user and run moats_install.sql to install.

4. MOATS TOP Usage Examples

With the default settings, simply run q.sql; the default display window size is 40 * 175.

    SQL> !cat q.sql
    set arrays 80 lines 2000 trims on head off tab off pages 0
    SELECT /*+ no_monitor */* FROM TABLE(moats.top(5));
    
    SQL> @q 
    

You can adjust the height of the whole window, as well as the heights of the Active Session Graph and Top SQL sections. The SQL*Plus arraysize must equal twice the height of the whole window.

    SQL> set arraysize 120                             
    SQL> SELECT *                                      
         FROM   TABLE(moats.top( p_screen_size=>60,
                                 p_ash_height=>20,     
                                 p_sql_height=>15));   
    

See my github for the detailed README and the code.

Using maxvalue with Multiple Range Partition Keys

Suppose a range-partitioned table uses multiple partition keys, such as (month_id, status_id) in this example. If the common query condition is month_id=201407, we would expect that after partition pruning only the partitions starting with 201407 are scanned. Is that what actually happens? Let's run a simple experiment:

Create two tables t1 and t2 with identical structure, differing only in the high value of the second partition key in the last partition of each month: 'V40' for t1 and maxvalue for t2.

    sid@ORA12C> create table t1
      2  (
      3      id             number,
      4      month_id       number,
      5      status_id      varchar2(5)
      6  )
      7  partition by range(month_id, status_id)
      8  (
      9      partition p_201407_V10 values less than (201407,'V10'),
     10      partition p_201407_V20 values less than (201407,'V20'),
     11      partition p_201407_V30 values less than (201407,'V30'),
     12      partition p_201407_V40 values less than (201407,'V40'),
     13      partition p_201408_V10 values less than (201408,'V10'),
     14      partition p_201408_V20 values less than (201408,'V20'),
     15      partition p_201408_V30 values less than (201408,'V30'),
     16      partition p_201408_V40 values less than (201408,'V40')
     17  );
    
    Table created.
    
    sid@ORA12C> create table t2
      2  (
      3      id             number,
      4      month_id       number,
      5      status_id      varchar2(5)
      6  )
      7  partition by range(month_id, status_id)
      8  (
      9      partition p_201407_V10 values less than (201407,'V10'),
     10      partition p_201407_V20 values less than (201407,'V20'),
     11      partition p_201407_V30 values less than (201407,'V30'),
     12      partition p_201407_V40 values less than (201407,maxvalue),
     13      partition p_201408_V10 values less than (201408,'V10'),
     14      partition p_201408_V20 values less than (201408,'V20'),
     15      partition p_201408_V30 values less than (201408,'V30'),
     16      partition p_201408_V40 values less than (201408,maxvalue)
     17  );
    
    Table created.
    

Scan the data with the condition month_id=201407 and observe pstart and pstop in the execution plans.
For table t1, the high value of the last 201407 partition, p_201407_V40, is (201407,'V40'). A row such as (201407,'ZZZ') is greater than (201407,'V40') and therefore has to go into the first 201408 partition, p_201408_V10. So with the condition month_id=201407, partition p_201408_V10 must also be scanned, and the execution plan shows pstart=1, pstop=5.
For table t2, the high value of the last 201407 partition, p_201407_V40, is (201407,maxvalue), so every 201407 row greater than or equal to (201407,'V30') stays in the last July partition. With the condition month_id=201407, partition pruning only needs the partitions starting with 201407, and the execution plan shows pstart=1, pstop=4, avoiding an unnecessary scan of the first 201408 partition.
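
The boundary behavior can be verified by inserting a row whose status_id sorts above 'V40' ('ZZZ' here is just an illustrative value):

    insert into t1 values (1, 201407, 'ZZZ');
    insert into t2 values (1, 201407, 'ZZZ');
    commit;

    -- t1: (201407,'ZZZ') is above the high value (201407,'V40'),
    -- so the row spills into the first 201408 partition; count = 1
    select count(*) from t1 partition (p_201408_V10);

    -- t2: the (201407,maxvalue) bound keeps the row in the last
    -- July partition; count = 1
    select count(*) from t2 partition (p_201407_V40);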

    sid@ORA12C> explain plan for
      2  select * from t1 where month_id = 201407;
    
    Explained.
    
    sid@ORA12C> select * from table(dbms_xplan.display('plan_table',null,'typical'));
    
    PLAN_TABLE_OUTPUT
    --------------------------------------------------------------------------------------------------
    Plan hash value: 277861402
    
    ---------------------------------------------------------------------------------------------------
    | Id  | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
    ---------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT           |      |     1 |    30 |     2   (0)| 00:00:01 |       |       |
    |   1 |  PARTITION RANGE ITERATOR  |      |     1 |    30 |     2   (0)| 00:00:01 |     1 |     5 |
    |*  2 |   TABLE ACCESS STORAGE FULL| T1   |     1 |    30 |     2   (0)| 00:00:01 |     1 |     5 |
    ---------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - storage("month_id"=201407)
           filter("month_id"=201407)
    
    Note
    -----
       - dynamic statistics used: dynamic sampling (level=2)
    
    19 rows selected.
    
    sid@ORA12C> explain plan for
      2  select * from t2 where month_id = 201407;
    
    Explained.
    
    sid@ORA12C> select * from table(dbms_xplan.display('plan_table',null,'typical'));
    
    PLAN_TABLE_OUTPUT
    --------------------------------------------------------------------------------------------------
    Plan hash value: 462758725
    
    ---------------------------------------------------------------------------------------------------
    | Id  | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
    ---------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT           |      |     1 |    30 |     2   (0)| 00:00:01 |       |       |
    |   1 |  PARTITION RANGE ITERATOR  |      |     1 |    30 |     2   (0)| 00:00:01 |     1 |     4 |
    |*  2 |   TABLE ACCESS STORAGE FULL| T2   |     1 |    30 |     2   (0)| 00:00:01 |     1 |     4 |
    ---------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - storage("month_id"=201407)
           filter("month_id"=201407)
    
    Note
    -----
       - dynamic statistics used: dynamic sampling (level=2)
    
    19 rows selected.
    

    Real World Performance Online Learning Video

Andrew Holdsworth, who leads our group (the Real World Performance Group), has recorded a series of videos on Oracle performance. I strongly recommend them to anyone interested in the topic. Each video is about 10 minutes long, and the 20 videos are a distillation of RWP's three-and-a-half-day training. Every topic comes with a real demo, and I believe everyone can benefit from RWP's experience.

    http://www.oracle.com/goto/oll/rwp

    Exadata Write-back Smart Flash Cache Random Write Performance

I recently tested the random write performance of the Write-back Smart Flash Cache with SLOB on an X4-2 half rack; these are my notes. Only 6 cell nodes were available during the test, instead of 7.

The Oracle background process database writer (dbwr) writes dirty blocks from the buffer cache back to the data files, and these writes are almost all random. To keep dbwr as busy as possible with random writes, the buffer cache is set small enough, 192M in this test, and each block of table CF1 holds only one row, so that each update of 64 rows dirties about 64 blocks. The redo logfiles are sized large enough, and setting C2='A' generates as little redo as possible, so that lgwr does not become the bottleneck.
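
The relevant instance settings can be sketched as below; 192M is the buffer cache size used in the test, while the 32G redo log size is only an illustrative assumption:

    -- a small buffer cache forces dbwr into constant random writes
    alter system set db_cache_size=192M scope=spfile;

    -- oversized redo logs keep log switches and lgwr from
    -- becoming the bottleneck (the size here is an assumption)
    alter database add logfile size 32g;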

The update and select statements are as follows, at a ratio close to 50%-50%.

            Elapsed                  Elapsed Time
            Time (s)    Executions  per Exec (s)  %Total   %CPU    %IO    SQL Id
    ---------------- -------------- ------------- ------ ------ ------ -------------
             1,996.3         88,329          0.02   83.1   32.0   77.8 f335amzu9kgy8
    Module: SQL*Plus
    UPDATE CF1 SET C2='A' WHERE ( CUSTID > ( :B1 - :B2 ) ) AND (CUSTID < :B1 )
    
               382.6         95,873          0.00   15.9   56.0   60.1 bhdvtsvjhgvrh
    Module: SQL*Plus
    SELECT COUNT(C2) FROM CF1 WHERE ( CUSTID > ( :B1 - :B2 ) ) AND (CUSTID < :B1 )
    

When the flash cache is in write-back mode, dirty blocks in the buffer cache only need to be written to flash on the storage cells, and the average db file parallel write time is around 1ms. The top 10 wait events are as follows:

    Top 10 Foreground Events by Total Wait Time
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                Tota    Wait   % DB
    Event                                 Waits Time Avg(ms)   time Wait Class
    ------------------------------ ------------ ---- ------- ------ ----------
    cell single block physical rea    5,921,681 1635       0   68.1 User I/O
    DB CPU                                      877.           36.5
    cell list of blocks physical r      190,164 148.       1    6.2 User I/O
    log file switch completion               16   .4      22     .0 Configurat
    reliable message                         11   .2      18     .0 Other
    enq: WF - contention                     20   .1       4     .0 Other
    gc current block busy                    41    0       1     .0 Cluster
    gc current block 3-way                  111    0       0     .0 Cluster
    gc cr multi block request                10    0       3     .0 Cluster
    control file sequential read            136    0       0     .0 System I/O
    
                                                                 Avg
                                            %Time Total Wait    wait    Waits   % bg
    Event                             Waits -outs   Time (s)    (ms)     /txn   time
    -------------------------- ------------ ----- ---------- ------- -------- ------
    db file parallel write          395,538     0        276       1      1.7   19.1
    

When the flash cache is in write-through mode, dirty blocks in the buffer cache have to be written directly to disk. The average db file parallel write time is no less than 21ms, causing a large number of free buffer waits. The top 10 wait events are as follows:

    Top 10 Foreground Events by Total Wait Time
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                               Total Wait       Wait   % DB Wait
    Event                                Waits Time (sec)    Avg(ms)   time Class
    ------------------------------ ----------- ---------- ---------- ------ --------
    free buffer waits                1,039,389      12.2K      11.72   63.0 Configur
    cell single block physical rea   9,244,945     4251.9       0.46   22.0 User I/O
    DB CPU                                         2561.1              13.3
    cell list of blocks physical r     358,004      736.8       2.06    3.8 User I/O
    latch: gc element                  144,537      368.1       2.55    1.9 Other
    write complete waits                   571      224.7     393.47    1.2 Configur
    cursor: pin S wait on X              1,793       54.4      30.35     .3 Concurre
    enq: RO - fast object reuse             92       43.5     472.32     .2 Applicat
    latch free                          15,992       18.6       1.16     .1 Other
    latch: gcs resource hash            12,772       17.9       1.40     .1 Other
    
                                                    Total        Avg
                                           %Time     Wait       wait    Waits   % bg
    Event                            Waits -outs Time (s)       (ms)     /txn   time
    -------------------------- ----------- ----- -------- ---------- -------- ------
    db file parallel write           7,019     0      153      21.74      0.0    7.4
    

In write-back mode, the workload was run for five minutes each at 32/64/128/192/256/320/392 concurrent sessions. The charts below show various statistics.

CPU utilization of the compute nodes: since the workload is mostly IO, compute node CPU utilization is not high.

IOPS from the AWR report: write IOPS peak around 250,000.

MBPS from the AWR report: write throughput peaks at 2GB/s.

CPU utilization and MBPS of the storage cells: write throughput peaks at 4GB/s.

IOPS of the storage cells: at 392 concurrent sessions, the flash card IOPS reach 1 million; a full rack with 14 storage cells would approach the 2,660,000 flash IOPS quoted in the datasheet.

InfiniBand traffic of the storage cells, close to the storage cells' MBPS.

The IOPS/MBPS of the storage cells are slightly higher than the values in the AWR report because, in write-back mode, each write goes twice to the flash cards (normal redundancy). This also means that enabling write-back sacrifices half of the flash capacity, which is the trade-off to weigh when turning it on.

The practical value of write-back:

For write-intensive applications, such as a busy stock trading system with heavy data updates, the disk IOPS of Exadata are actually easy to saturate without write-back, for example:
1. Dirty blocks in the buffer cache have to be written directly to disk, twice under normal redundancy and three times under high redundancy;
2. Because Data Guard is configured, or for other reasons, the control files are read and written frequently.

In these cases, enabling write-back greatly improves the IO performance of such applications.
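
For reference, the flash cache mode is switched per storage cell with CellCLI. The sequence below is a sketch of the commonly documented procedure; verify it against the MOS note for your storage server version first, since dropping the flash cache discards its contents:

    CellCLI> drop flashcache all
    CellCLI> alter cell shutdown services cellsrv
    CellCLI> alter cell flashCacheMode=writeback
    CellCLI> alter cell startup services cellsrv
    CellCLI> create flashcache all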

A Dynamic Sampling Example

For SQL with complex filter conditions, statistics do not necessarily help the optimizer arrive at the correct cardinality in the execution plan, and that includes extended statistics. Examples are the combination of in and like conditions below, or user-defined functions in the where clause.

    status in ('COM', 'ERR')
    and v1 like '10%'

In such cases, dynamic sampling may be the only option. Below is an example: with sampling at level 6, the cardinality gets much closer to the real data.

    create table t6 as
    with v1 as (
    select /*+ materialize */
    rownum id from dual connect by level <= 1000
    )
    select
        rownum                      id,
        rpad(rownum, 10, '0')       v1,
        trunc((rownum - 1)/100)     n1,
        case
            when mod(rownum,100000) = 7                 then 'ERR'
            when rownum <= 9990000                      then 'COM'
            when mod(rownum,10) =0                      then 'NEW'
            when mod(rownum,10) between 1 and 5         then 'PRP'
            when mod(rownum,10) between 6 and 8         then 'FKC'
            when mod(rownum,10) = 9                     then 'LDD'
        end status,
        rpad(rownum, 100)           padding
    from v1, v1
    where rownum <= 1e6;
    
    begin
        dbms_stats.gather_table_stats(user,'T6');
    end;
    /
    
      1  select
      2      count(*)
      3  from
      4      t6
      5  where
      6      status in ('COM', 'ERR')
      7* and v1 like '10%'
    sid@OFA> /
    
      COUNT(*)
    ----------
         11113
    
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 4096694858
    
    -----------------------------------------------------------------------------------
    | Id  | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT           |      |     1 |    15 |  4983   (1)| 00:00:01 |
    |   1 |  SORT AGGREGATE            |      |     1 |    15 |            |          |
    |*  2 |   TABLE ACCESS STORAGE FULL| T6   |   395 |  5925 |  4983   (1)| 00:00:01 |
    -----------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - storage("V1" LIKE '10%' AND ("STATUS"='COM' OR "STATUS"='ERR'))
           filter("V1" LIKE '10%' AND ("STATUS"='COM' OR "STATUS"='ERR'))
    
    
    Statistics
    ----------------------------------------------------------
              1  recursive calls
              0  db block gets
          18188  consistent gets
              0  physical reads
              0  redo size
            544  bytes sent via SQL*Net to client
            547  bytes received via SQL*Net from client
              2  SQL*Net roundtrips to/from client
              0  sorts (memory)
              0  sorts (disk)
              1  rows processed
    
    
    sid@OFA> select /*+ OPT_PARAM('OPTIMIZER_DYNAMIC_SAMPLING', 6) */
      2      count(*)
      3  from
      4      t6
      5  where
      6      status in ('COM', 'ERR')
      7  and v1 like '10%';
    
      COUNT(*)
    ----------
         11113
    
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 4096694858
    
    -----------------------------------------------------------------------------------
    | Id  | Operation                  | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT           |      |     1 |    15 |  4983   (1)| 00:00:01 |
    |   1 |  SORT AGGREGATE            |      |     1 |    15 |            |          |
    |*  2 |   TABLE ACCESS STORAGE FULL| T6   | 16595 |   243K|  4983   (1)| 00:00:01 |
    -----------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - storage("V1" LIKE '10%' AND ("STATUS"='COM' OR "STATUS"='ERR'))
           filter("V1" LIKE '10%' AND ("STATUS"='COM' OR "STATUS"='ERR'))
    
    Note
    -----
       - dynamic statistics used: dynamic sampling (level=6)
    

Improving a connect by statement with a parallel pipelined function

Where Oracle does not natively support parallel processing, the parallel pipelined function feature lets you build the parallelism you want yourself, for example exporting data in parallel to different data files to speed up an export. This post shows an optimization example for a SQL statement containing connect by. In real customer systems, combining Exadata's compute power with this kind of parallel SQL rewrite commonly yields performance improvements of hundreds or even thousands of times.

Script setup: create a table with 100k rows.

    drop table t1;
    -- t1 with 100,000 rows
    create table t1
    as
    select
        rownum                      id,
        lpad(rownum, 10, '0')       v1,
        trunc((rownum - 1)/100)     n1,
        rpad(rownum, 100)           padding
    from
        dual
    connect by level <= 100000
    ;
    
    begin
        dbms_stats.gather_table_stats(user,'T1');
    end;
    /
    

The following query took 3m5s; it cannot use parallel processing because it contains connect by.

    select /*+ monitor */
        count(*)
    from
    (
        select
            CONNECT_BY_ROOT ltrim(id) root_id,
            CONNECT_BY_ISLEAF is_leaf,
            level as t1_level,
            a.v1
        from t1 a
        start with a.id <= 100
        connect by NOCYCLE id = prior id + 1000
    );
    
      COUNT(*)
    ----------
         10000
    
    Elapsed: 00:03:05.17
    

Prepare the parallel pipelined table function: for each root, its rowid is the input, and a separate connect by tree walk is done per root. The scripts for the parallel pipelined function follow.

    create or replace package refcur_pkg
    AS
        TYPE R_REC IS RECORD (row_id ROWID);
        TYPE refcur_t IS REF CURSOR RETURN R_REC;
    END;
    /
    
    create or replace package connect_by_parallel 
    as
       /*  Navigates a shallow hierarchy in parallel, where we do a tree walk for each root */
    
        CURSOR C1 (p_rowid ROWID) IS     -- Cursor done for each subtree. This select is provided by the customer
        select  CONNECT_BY_ROOT ltrim(id) root_id, CONNECT_BY_ISLEAF is_leaf, level as t1_level, a.v1
              from t1 a
              start with rowid = p_rowid
              connect by NOCYCLE id = prior id + 1000; 
    
        TYPE T1_TAB is TABLE OF C1%ROWTYPE;
     
        FUNCTION treeWalk (p_ref refcur_pkg.refcur_t) RETURN T1_TAB
                 PIPELINED
        PARALLEL_ENABLE(PARTITION p_ref BY ANY);
    
    END connect_by_parallel;
    / 
    
    create or replace package body connect_by_parallel 
    as  
    FUNCTION treeWalk (p_ref refcur_pkg.refcur_t) RETURN T1_TAB
              PIPELINED PARALLEL_ENABLE(PARTITION p_ref BY ANY) 
    IS
      in_rec p_ref%ROWTYPE;
    BEGIN
       execute immediate 'alter session set "_old_connect_by_enabled"=true';
       LOOP -- for each root
        FETCH p_ref INTO in_rec; 
        EXIT WHEN p_ref%NOTFOUND;
        FOR c1rec IN c1(in_rec.row_id)  LOOP -- retrieve rows of subtree
            PIPE ROW(c1rec);
        END LOOP;
      END LOOP;
      execute immediate 'alter session set "_old_connect_by_enabled"=false';  
      RETURN;
    END  treeWalk;
    
    END connect_by_parallel;
    /
    

After the rewrite, the SQL took 41s, more than a 4x improvement over the original.

    SELECT
      /*+ monitor */
      COUNT(*)
    FROM TABLE(connect_by_parallel.treeWalk (CURSOR
      (SELECT /*+ parallel (a 100) */
        rowid FROM t1 a WHERE id <= 100))) b;
    
      COUNT(*)
    ----------
         10000
    
    Elapsed: 00:00:41.21
    

Bloom Filter for Non-Parallel Query on Non-Partitioned Table

I noticed an enhancement in the 12.1.0.2.0 beta: the bloom filter is enabled for non-parallel queries on non-partitioned tables. The enhancement may help improve hash join performance, and this page shows a little demo. The test ran on an Exadata X4-2 instance.

    > SELECT * FROM PRODUCT_COMPONENT_VERSION;
    
    PRODUCT          VERSION        STATUS
    ---------------------------------------- -------------------- --------------------
    NLSRTL           12.1.0.2.0       Beta
    Oracle Database 12c Enterprise Edition   12.1.0.2.0       64bit Beta
    PL/SQL           12.1.0.2.0       Beta
    TNS for Linux:         12.1.0.2.0       Beta
    
    > select platform_id, platform_name from v$database;
    
    PLATFORM_ID PLATFORM_NAME
    ----------- ----------------
       13 Linux x86 64-bit
    

Prepare two tables for the hash join: 10k rows in t1 and 10m rows in t2, with 100 records per value in columns t1.n1 and t2.n1.

    drop table t1;
    drop table t2;
    
    -- t1 with 10,000 rows
    create table t1
    as
    select
        rownum                      id,
        lpad(rownum, 10, '0')       v1,
        trunc((rownum - 1)/100)     n1,
        rpad(rownum, 100)           padding
    from
        dual
    connect by level <= 10000
    ;
    
    -- t2 with 10,000,000 rows
    create table t2
    as
    select
        rownum                      id,
        lpad(rownum, 10, '0')       v1,
        trunc((rownum - 1)/100)     n1,
        rpad(rownum, 100)           padding
    from
        t1, t1
    where
        rownum <= 1e7;
    
    begin
        dbms_stats.gather_table_stats(
            ownname      => user,
            tabname      =>'T1',
            method_opt   => 'for all columns size 1'
        );
    
        dbms_stats.gather_table_stats(
            ownname      => user,
            tabname      =>'T2',
            method_opt   => 'for all columns size 1'
        );
    end;
    /
    

The hash join between t1 and t2 returns only 100 rows. If the bloom filter is created after t1 is scanned and applied while t2 is scanned, most rows of t2 are filtered out in the first place. The SQL monitor reports show how the bloom filter takes effect: with the bloom filter enabled, smart scan on t2 returns 15k rows and the cell offload efficiency is 100% (only 2MB returned from Exadata); without the bloom filter, smart scan on t2 returns 10M rows and the cell offload efficiency is 84% (230MB returned from Exadata).

    alter system flush buffer_cache;
    select
        /*+ monitor */
        count(t2.v1)
    from
        t1, t2
    where
        t1.id = t2.id
    and t1.n1 = 0;
    

    alter system flush buffer_cache;
    select
        /*+ monitor no_px_join_filter(t2) */
        count(t2.v1)
    from
        t1, t2
    where
        t1.id = t2.id
    and t1.n1 = 0;
    

    Update on Mar 25th

Serial generation of bloom filters on Exadata is enabled in 12.1.0.2 and backported to 11.2.0.4 via bug 16710321. The hidden parameter _bloom_serial_filter controls this feature.

    > @pd _bloom_serial_filter
    
    NAME                  VALUE     DESCRIPTION
    --------------------- --------- -------------------------------------------------------
    _bloom_serial_filter  ON        enable serial bloom filter on exadata
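
Being a hidden parameter, it can be toggled at the session level on a test system to compare the two behaviors (a sketch; the on/off values follow the value shown above):

    -- disable serial bloom filters for this session, then re-run the test
    alter session set "_bloom_serial_filter"=off;

    -- restore the default
    alter session set "_bloom_serial_filter"=on;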