Strange things with the Oracle optimizer

Recently I needed to investigate a performance problem for two customers, both of which turned out to be caused by problematic execution plans. They had the smell of Oracle optimizer bugs all over them, although the first one is perhaps a somewhat self-inflicted injury.
Both environments were running Oracle 12.1.0.2 with the October 2017 patch set.

Case 1: functions on the “left” side

In the first case the client had a large (generated, of course) query, which would run for hours until it eventually failed with an ORA-01555 snapshot too old error.
Investigating the execution plan, I spotted the part where the Oracle optimizer had a completely wrong estimate of the number of returned rows, expecting only 1 row (SQL Monitor rocks).
This underestimate was further amplified by the nested loops and resulted in a full table scan being executed over 2 billion times…

The SQL part with the underestimated number of returned rows looked something like this:

  trunc(snapshotdate) 
    between to_date('01/01/2018','DD/MM/YYYY')  and to_date('31/01/2018', 'DD/MM/YYYY') 

And the Oracle optimizer estimated this would return 1 row. In reality, around 144.000 rows fall within that date range.
Checking the CBO stats showed that they were recent and that a frequency histogram was in place for that column (which included the dates used in the predicate).
I then changed the predicate to move the trunc from the left side to the right side (which is best practice anyway):

  snapshotdate
    between to_date('01/01/2018','DD/MM/YYYY') and to_date('31/01/2018', 'DD/MM/YYYY') + 1 - 1/(24*60*60)

(The date values in the query came from a BO entry screen and could not easily be changed.)
This changed the CBO estimate from 1 to 144.000 expected rows, and with it the entire execution plan.

I continued by building a test case, which showed that as soon as the trunc was used on the column, the estimated number of rows dropped to 1.
Cranking up the number of records in the table, however, showed that the CBO was not simply returning 1: it calculated the number of returned rows as “num_rows / 100” for an equality predicate and as “num_rows / 400” for a closed range predicate (between). The information from the histogram was completely ignored.
This shows why you need to build test cases with a significant number of records, or use the CBO trace to look at the selectivity.
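
For reference, a minimal test case along these lines could look like the sketch below. The table, column names and data distribution are purely illustrative (not the customer's objects), so the actual estimates you get may differ from the ones described above.

  -- Sketch of a test case: a date column with a frequency histogram,
  -- queried once with trunc() on the column and once without.
  create table trunc_test as
  select rownum                                id,
         date '2018-01-01' + mod(rownum, 200)  snapshotdate,
         rpad('x', 50, 'x')                    padding
  from   dual
  connect by level <= 1000000;

  -- 200 distinct dates, so "size 254" gives a frequency histogram on the column.
  begin
    dbms_stats.gather_table_stats(
      ownname    => user,
      tabname    => 'TRUNC_TEST',
      method_opt => 'for columns snapshotdate size 254');
  end;
  /

  -- With trunc() on the column the histogram is ignored and the estimate
  -- falls back to num_rows / 400 (or even 1, as in the original case).
  explain plan for
  select count(*)
  from   trunc_test
  where  trunc(snapshotdate) between to_date('01/01/2018', 'DD/MM/YYYY')
                                 and to_date('31/01/2018', 'DD/MM/YYYY');
  select * from table(dbms_xplan.display);

  -- Without the function the frequency histogram is used and the estimate is realistic.
  explain plan for
  select count(*)
  from   trunc_test
  where  snapshotdate between to_date('01/01/2018', 'DD/MM/YYYY')
                          and to_date('31/01/2018', 'DD/MM/YYYY') + 1 - 1/(24*60*60);
  select * from table(dbms_xplan.display);

  -- The selectivity the CBO actually used can be checked in a 10053 trace:
  -- alter session set events '10053 trace name context forever, level 1';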

However, that does not explain why, on a table of approximately 15 million rows, the CBO would expect 1 row instead of 37.500 (15.000.000 / 400).
I started playing with the original query again, changing the date ranges, when I suddenly noticed that two consecutive explain plans were very different.
One had an estimate of 1 row for the date range and the other had 38.163, which corresponds to num_rows / 400.
And this was for the exact same query, in the same session, with no changed parameters (and optimizer_adaptive_features set to false).
So my guess is that there is some bug in the optimizer causing this behavior. In the original case, the table and index were partitioned (not on snapshotdate), which possibly plays a role in this.

Changing the predicate to remove the trunc and instead add a time part to the between clause is a better solution anyway, so we decided to implement this and no longer pursue the cause of this bug.

Case 2: nchar and ranges

The second case also involved an execution plan in which the Oracle optimizer estimated the number of rows as 1 instead of over 100.000. And again, this was amplified by nested loops. This time, however, there was no function on the predicate column. Instead, there was a closed range predicate on an nvarchar column (JDE, don’t ask).
After a few rounds of testing, I was able to trigger this behavior with the following conditions (a minimal setup sketch follows the list):

  • A range filter (>=, <=) or “like” operation is done on an nchar data type column.
  • The column values start with a number of spaces.
  • A frequency histogram exists on the involved column.
  • The values in the predicate filter are not end points in the frequency histogram.
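
For what it's worth, a minimal setup matching those conditions could look something like the sketch below. The HISTO_TEST table and the vchar/vnchar columns match the plans that follow, but the column lengths and the generated data are my own assumptions (so the row counts will not match the 9.400 mentioned below); the complete test case is linked at the end of this post.

  -- Sketch of a reproduction: the same values stored once as CHAR and once as NCHAR,
  -- starting with a number of spaces, with a frequency histogram on both columns.
  -- Note that neither '    16' nor '    16ZZZZZ' ends up as a histogram endpoint.
  create table histo_test (
    vchar  char(30),
    vnchar nchar(30)
  );

  insert into histo_test (vchar, vnchar)
  select '    ' || to_char(mod(rownum, 50), 'FM00') || 'A',
         '    ' || to_char(mod(rownum, 50), 'FM00') || 'A'
  from   dual
  connect by level <= 100000;
  commit;

  -- Only 50 distinct values, so "size 254" produces frequency histograms.
  begin
    dbms_stats.gather_table_stats(
      ownname    => user,
      tabname    => 'HISTO_TEST',
      method_opt => 'for all columns size 254');
  end;
  /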

In the query below I would expect 9.400 records to be returned.
When the datatype of the predicate column is char, the optimizer uses the frequency histogram and comes up with the correct estimate:

sys@FDHT> explain plan for select count(*) from fdh.histo_test where vchar >= '    16' and vchar <= '    16ZZZZZ';
sys@FDHT> @xplain
sys@FDHT> set linesize 170
sys@FDHT> set pages 9999
sys@FDHT> 
sys@FDHT> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------
Plan hash value: 505401197

---------------------------------------------------------------------------------
| Id  | Operation          | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |            |     1 |    51 |    86   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |            |     1 |    51 |            |          |
|*  2 |   TABLE ACCESS FULL| HISTO_TEST |  9400 |   468K|    86   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("VCHAR">='    16' AND "VCHAR"<='    16ZZZZZ')

But when changing the datatype from char to nchar:

sys@FDHT> explain plan for select count(*) from fdh.histo_test where vnchar >= '    16' and vnchar <= '    16ZZZZZ';
sys@FDHT> @xplain
sys@FDHT> set linesize 170
sys@FDHT> set pages 9999
sys@FDHT> 
sys@FDHT> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------
Plan hash value: 505401197

---------------------------------------------------------------------------------
| Id  | Operation          | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |            |     1 |    81 |    86   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |            |     1 |    81 |            |          |
|*  2 |   TABLE ACCESS FULL| HISTO_TEST |     1 |    81 |    86   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("VNCHAR">=U'    16' AND "VNCHAR"<=U'    16ZZZZZ')

I could reproduce this on 11.2, 12.1 and 18.1. I didn’t test it on 12.2, but I expect no difference there.
I have no idea if this type of query is typical for JDE or if it’s due to some customizations, but it is certainly the first time I have seen something like it :-/

The complete test case: histogram_test.
