Hashing rows…



By Patch ~ September 15th, 2010. Filed under: Oracle, PL/SQL, SQL.

During my work at my current customer site I was asked to create files containing the changes to the data in the database. Normally you would use triggers on the tables to log the records that changed, or a Flashback Data Archive, but that is only available starting with Oracle 11g and I am working with Oracle 10g. The trouble in this situation is that we want to see whether something changed in the data of a materialized view, and this materialized view is fully refreshed overnight.

The first idea I had was to check the ROWID; I figured it would change when the row was changed. The problem is that the ROWID either always changes, whether or not the data changed, or doesn't change even when the data changed. This proved not to be a reliable way to check for changes in the data.

Of course I could make a copy of the MV before it gets refreshed, but that would mean duplicating a lot of data. And of course we would have to check every column for changes. This would be an easy task if the tables don’t have too many columns and if they are simple columns (i.e. varchar2s, numbers and dates).

How about making some sort of hash value for the entire row and storing that value? This involved a lot of thinking and some investigating. The chance that two rows will hash to the same value is really small. According to Stack Overflow:

(there are 2^128 possible hashes in total, so the chance of two random strings producing the same hash is theoretically close to 1 in 10^38).

That means hashing is not flawless, but it is good (enough).
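The same intuition can be sketched outside the database. A minimal Python illustration (using MD5, the same algorithm dbms_crypto will apply later; the sample strings are made up) of why hash comparison is good enough for change detection:

```python
import hashlib

# Two different inputs almost never collide: MD5 maps any string to one
# of 2**128 possible 128-bit values.
a = hashlib.md5(b"SCOTT,SMITH,800").hexdigest()
b = hashlib.md5(b"SCOTT,SMITH,801").hexdigest()
print(a == b)   # False: a single changed character produces a different hash
print(len(a))   # 32: the hex digest fits a VARCHAR2(32)

# The same input always produces the same hash, which is what makes
# hash comparison usable for detecting changed rows.
print(a == hashlib.md5(b"SCOTT,SMITH,800").hexdigest())  # True
```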

I want to make this dynamic; I don't want to create similar code for every table (or materialized view in this case) I need to check, since there can be tens, maybe hundreds or even thousands of them. How can I make this a generic solution? I can put all the fields of a record in one big string and hash that string. Let's check some documentation. First of all, I need to find all the columns in the table. Luckily Oracle provides us with a data dictionary in which we can find out just about everything we need to know about our objects. Column information is stored in the ALL_TAB_COLUMNS view (ALL_ for all objects we have access to; we could also use USER_TAB_COLUMNS, but that only shows the columns of tables in our own schema, or DBA_TAB_COLUMNS, which is pretty much the same as ALL_TAB_COLUMNS but requires the DBA role and shows every column of every table in every schema).

I need the names and types of columns in a certain table (possibly owned by a different user) so I need to retrieve the columns COLUMN_NAME and DATA_TYPE based on TABLE_NAME and OWNER. The SQL statement I need to execute is something like this:

SELECT atc.column_name
     , atc.data_type
  FROM all_tab_columns atc
 WHERE atc.table_name = <table_name>
   AND atc.owner = <owner>

Since I am using PL/SQL, it is easy to return everything I need in one round trip to the SQL engine using bulk operations. That means I want to BULK COLLECT whatever is returned. When I retrieve a single value I can use a scalar variable to SELECT INTO, but now I will probably retrieve more than one row, so I need a collection to fetch into. Actually I need a collection of records, so I create a record type and a collection type in my package:

TYPE column_rt IS RECORD(
  column_name user_tab_columns.column_name%TYPE
, data_type   user_tab_columns.data_type%TYPE);

TYPE column_aat IS TABLE OF column_rt INDEX BY BINARY_INTEGER;

Using this collection type and my SQL statement I create a function like this:

FUNCTION get_column_names(table_name_in IN VARCHAR2
                         ,owner_in      IN VARCHAR2 DEFAULT NULL) RETURN column_aat IS
  l_returnvalue column_aat;
BEGIN
  SELECT atc.column_name
       , atc.data_type
    BULK COLLECT
    INTO l_returnvalue
    FROM all_tab_columns atc
   WHERE atc.table_name = table_name_in
     AND (atc.owner = owner_in OR owner_in IS NULL);
  RETURN l_returnvalue;
END get_column_names;

Now that I know what columns are in the table I can construct an SQL statement to retrieve all columns concatenated in one column. Something like this:

SELECT <concatenated_columns> <alias>

FROM <user>.<table> <table_alias>

I also need to find a way to link the result back to the originating table. The only way (I can think of) I can do this (without relying on Primary Keys which may or may not be defined) is using the ROWID. So my SQL statement will become something like this:

SELECT <concatenated_columns> <alias>, rowid

FROM <user>.<table> <table_alias>

Using Native Dynamic SQL I can construct the SQL statement at runtime, using the data dictionary to determine what I actually select.

Oracle does a lot of implicit conversion that I could rely on, but I think it is better (and faster) to do explicit conversion. That way I have more control over the actual statement being generated. Also, my statement would fail if no implicit conversion is available (and it did, since I have a couple of Oracle Spatial columns in my test materialized views).
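The value of explicit conversion can be sketched outside the database. A minimal Python sketch (the row values and the date format are made-up assumptions, mirroring a TO_CHAR format mask) of converting every value explicitly before concatenating, so the result never depends on implicit conversion or session settings:

```python
from datetime import date

# Hypothetical row of mixed types; names and values are invented for illustration.
row = {"ID": 7369, "NAME": "SMITH", "HIREDATE": date(1980, 12, 17)}

def to_text(value):
    """Convert each value explicitly, like TO_CHAR(col, 'YYYYMMDDHH24MISS'),
    so the string form is deterministic regardless of session settings."""
    if isinstance(value, date):
        return value.strftime("%Y%m%d%H%M%S")
    return str(value)

canonical = "|".join(to_text(v) for v in row.values())
print(canonical)  # 7369|SMITH|19801217000000
```

If the conversion were left implicit (e.g. relying on a default date format), the same data could hash differently between sessions, which would break the comparison.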

So, to build the dynamic select list of the statement, I created another function:

FUNCTION create_select(columns_in    IN column_aat
                      ,table_name_in IN VARCHAR2
                      ,owner_in      IN VARCHAR2) RETURN VARCHAR2 IS
  l_prefix      VARCHAR2(30);
  l_suffix      VARCHAR2(30);
  l_skip        BOOLEAN;
  l_returnvalue maxvarchar;
BEGIN
  l_returnvalue := '';
  FOR indx IN nvl(columns_in.first, 0) .. nvl(columns_in.last, -1) LOOP
    l_skip := FALSE;
    CASE columns_in(indx).data_type
      WHEN c_varchar THEN
        BEGIN
          l_prefix := c_prefix_varchar;
          l_suffix := c_suffix_varchar;
        END;
      WHEN c_number THEN
        BEGIN
          l_prefix := c_prefix_number;
          l_suffix := c_suffix_number;
        END;
      WHEN c_date THEN
        BEGIN
          l_prefix := c_prefix_date;
          l_suffix := c_suffix_date;
        END;
      ELSE
        BEGIN
          l_prefix := c_prefix_empty;
          l_suffix := c_suffix_empty;
          l_skip   := TRUE;
          logging.add(9, 'data_type', columns_in(indx).data_type);
        END;
    END CASE;
    IF NOT (l_skip) THEN
      l_returnvalue := l_returnvalue || '|| ' || l_prefix || columns_in(indx).column_name || l_suffix;
    END IF;
  END LOOP;
  RETURN l_returnvalue;
END create_select;

As you can see in the code, I am currently only hashing the scalar fields of the table, which makes it easier for me to test. Special columns (like the Spatial data type) are skipped for now, but I do log that the data type is not yet processed (in the currently working version I do have some code to hash the SDO_GEOMETRY column type). This function can 'easily' be expanded to handle new, different column types that I might encounter. I could even add support for user-defined data types that I may store in a table.

I now have a SQL statement that returns two columns: the concatenated values of a row, and the rowid. The rowid must be returned as-is; it is the link between the originating table and the value returned by this bit of PL/SQL. The concatenated value of the entire row is hashed so it returns a value of a predefined size (VARCHAR2(32)).
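The shape of that resultset can be sketched in Python (the rowids and values here are invented for illustration, and MD5 stands in for dbms_crypto.hash):

```python
import hashlib

# Hypothetical resultset: (rowid, concatenated row value), as produced by
# the generated SELECT statement.
rows = [
    ("AAFXh7AAEAAAA9EAAA", "7369|SMITH|CLERK"),
    ("AAFXh7AAEAAAA9EAAB", "7499|ALLEN|SALESMAN"),
]

# Hash the concatenated value; keep the rowid as-is so the hash can be
# linked back to the originating row.
hashed = [(rid, hashlib.md5(val.encode("utf-8")).hexdigest().upper())
          for rid, val in rows]

for rid, h in hashed:
    print(rid, h)   # every hash is a fixed-size 32-character hex string
```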

To hash a value I use the hash function in the dbms_crypto package. Since I don't want to repeat the code needed in my statements (hide the logic behind a layer of code, as Steven Feuerstein puts it), I created a simple function for it:

FUNCTION hashvalue(value_in IN maxvarchar) RETURN VARCHAR2 IS
BEGIN
  RETURN dbms_crypto.hash(src => utl_i18n.string_to_raw(value_in, 'AL32UTF8')
                         ,typ => dbms_crypto.hash_md5);
END hashvalue;

(maxvarchar is a subtype I defined so I don't need to remember the maximum size of a VARCHAR2 in PL/SQL:

SUBTYPE maxvarchar IS VARCHAR2(32767);
)

Now I need to create a table function (a function that can be called from SQL using the TABLE() operator) that gets me the hashed values for all the rows.

To be able to access more than one field in the resultset I need this function to return a schema level defined collection type. In this case a Nested Table Type.

CREATE OR REPLACE TYPE hash_rt AS OBJECT
( hrt_rowid     VARCHAR2(18)
, hrt_hashvalue VARCHAR2(32)
)
/

CREATE OR REPLACE TYPE hash_ntt AS TABLE OF hash_rt
/

The code needs to know what table it should run against and who the owner of this object is. So the function specification becomes:

FUNCTION hashrows(table_name_in IN VARCHAR2
                 ,owner_in      IN VARCHAR2 DEFAULT USER) RETURN hash_ntt

By adding the keyword PIPELINED to this specification, rows are streamed back to the calling SQL as soon as they are produced (and, combined with PARALLEL_ENABLE, the function could even take part in parallel execution).

FUNCTION hashrows(table_name_in IN VARCHAR2
                 ,owner_in      IN VARCHAR2 DEFAULT USER) RETURN hash_ntt
  PIPELINED IS
  TYPE valrow_r IS RECORD(
         vr_value maxvarchar
       , vr_rowid maxvarchar);
  TYPE valrow_aat IS TABLE OF valrow_r INDEX BY BINARY_INTEGER;
  l_valrow  valrow_aat;
  l_columns column_aat;
  l_cv      cv_type;
  l_sql     maxvarchar;
BEGIN
  -- get the columns for this table
  l_columns := get_column_names(upper(table_name_in), upper(owner_in));
  -- build the select statement
  l_sql := 'SELECT ' || ltrim(create_select(l_columns, table_name_in, owner_in), '|') ||
           ' fullrow, rowid from ' || owner_in || '.' || table_name_in || ' ' || c_table_alias;
  -- open the cursor using this select statement
  OPEN l_cv FOR l_sql;
  -- bulk collect the results
  FETCH l_cv BULK COLLECT
   INTO l_valrow;
  -- check to see if there is anything to process
  IF l_valrow.count > 0 THEN
    -- loop through the result
    FOR indx IN l_valrow.first .. l_valrow.last LOOP
      -- pipe the row out of the function as soon as it is known
      PIPE ROW(hash_rt(l_valrow(indx).vr_rowid, hashvalue(l_valrow(indx).vr_value)));
    END LOOP;
  END IF;
  -- close the cursor
  CLOSE l_cv;
  -- return control
  RETURN;
EXCEPTION
  WHEN OTHERS THEN
    logging.add(9, 'Error', SQLERRM);
    logging.add(9, 'Error Stack', dbms_utility.format_error_stack);
    logging.add(9, 'SQL Statement', l_sql);
    RETURN;
END hashrows;

By choosing the pipelined option I am obliged to PIPE each ROW out of the function as soon as it is done processing. At the end of the function I have nothing to return but control, hence the simple RETURN at the end (no value is returned).

Since my SQL returns multiple rows and I want to BULK COLLECT them I need a collection. In this case (again) a collection of records, since I am fetching more than one column per record. I could have created a package level type, but there is no need for this type outside this function, so I decided to declare it here.

I PIPE out each row which is of the rowtype I defined. It is kind of strange that the function says it will return a collection but actually it is returning single rows. But if you think about it, it does return more than one row, just one at a time.
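A Python generator behaves in much the same way, which may make the "declared to return a collection, but hands back one row at a time" idea easier to picture (a sketch; the rowids and values are invented, and MD5 stands in for the hashing):

```python
import hashlib

# A generator is a close analogue of a PIPELINED table function: callers
# iterate over it as if it were a collection, but each row is produced
# and handed back on demand.
def hashrows(rows):
    for rid, value in rows:
        # "pipe" each (rowid, hash) pair out as soon as it is computed
        yield rid, hashlib.md5(value.encode("utf-8")).hexdigest()

rows = [("AAA", "7369|SMITH"), ("AAB", "7499|ALLEN")]
for rid, h in hashrows(rows):
    print(rid, h)   # the caller consumes rows while they are being produced
```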

Since the create_select function returns the column names with two bars at the beginning of the string (which saves a lot of if-then processing), I need to trim those bars off when building the statement:

ltrim(create_select(l_columns, table_name_in, owner_in), '|')

Now, let's see what this function returns. Unfortunately a table function like this cannot be called directly from PL/SQL. Running this test script:

declare
  result hash_ntt;
begin
  -- Call the function
  result := hashing.hashrows(table_name_in => 'MV_HASH');
end;

Results in an Oracle error:

ORA-06550: line 19, column 12:

PLS-00653: aggregate/table functions are not allowed in PL/SQL scope

So I need to call it from SQL using a statement like this:

select * from table(hashing.hashrows('MV_HASH'))

This results in:

HRT_ROWID          HRT_HASHVALUE
------------------ --------------------------------
AAFXh7AAEAAAA9EAAA AC5B43F5442765E5C27C96B069EC9B6A
AAFXh7AAEAAAA9EAAB 3FEF7C64CBEF9C04D6FD767787343E67
AAFXh7AAEAAAA9EAAC 4F22E776622A31C15AF0D3CF0E7DCA50
AAFXh7AAEAAAA9EAAD 49FDD474072F7EB55A2073E3B2DA5475
AAFXh7AAEAAAA9EAAE 8CEA084275D4010464E0965EADD1E910
AAFXh7AAEAAAA9EAAF 234E9FA7FD39430372C60FFEAD77FD6F
...

Now that looks cool. Not something you would want to read though.

Let's do some testing. First I create two tables: one for the 'real' data and one to store the hashed values in (along with the rowid).

create table mv_hash (

  id number

, name varchar2(30)

)

/

create table ht_hash (

  ht_rowid varchar2(18)

, ht_hashvalue varchar2(32)

)

/

Then I insert some rows into the first table (which is prefixed mv_ because in the real world I will be using this on a materialized view):

insert into mv_hash (id, name) (select t.object_id, t.OBJECT_NAME from user_objects t)

/

603 rows inserted

commit

/

Now, insert the hash values of the mv_ table into the ht_ (hash table) table.

insert into ht_hash (ht_rowid, ht_hashvalue) (select hrt_rowid, hrt_hashvalue from table(hashing.hashrows('MV_HASH')))

/

603 rows inserted

commit

/

Then it’s time to update a row in the mv_ table

update mv_hash
   set name = 'Patrick'
 where rownum = 1
/

1 row updated

commit
/

Let’s find out which row was updated

select *

  from mv_hash m

 
left outer join ht_hash h on (m.rowid = h.ht_rowid)

where h.ht_rowid is null

/

It seems the rowid wasn't changed by the update. Let's see if we can find the changed row using the entire table, the hash table and the table function too.

select *
  from mv_hash m
  join (table(hashing.hashrows('MV_HASH')) hm
        left outer join ht_hash h on (hm.hrt_hashvalue = h.ht_hashvalue)
       ) on (hm.hrt_rowid = m.rowid)
 where h.ht_hashvalue is null
/

This is the way to get to the changed row. What I am actually doing is an anti-join (an outer join where one of the joined columns is null, i.e. has no match).

This check works even when the entire table is truncated and refilled, since we don't compare the rowid but the hash value of the row. If the hash value hasn't changed, the row won't show up.
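The whole detection scheme can be sketched in a few lines of Python (a simplified stand-in for the SQL anti-join; the table contents are invented):

```python
import hashlib

def rowhash(row):
    """Hash a row's concatenated values, analogous to the hashrows
    table function, minus the database."""
    return hashlib.md5("|".join(map(str, row)).encode("utf-8")).hexdigest()

old = [(1, "KING"), (2, "BLAKE"), (3, "CLARK")]
new = [(1, "KING"), (2, "Patrick"), (3, "CLARK")]   # one row updated

stored = {rowhash(row) for row in old}              # hashes captured before the refresh
changed = [row for row in new
           if rowhash(row) not in stored]           # the anti-join: new hash, no old match

print(changed)  # [(2, 'Patrick')]
```

Even if every rowid changed between the two snapshots (as after a full refresh), only the row whose data actually changed produces a hash that has no match in the stored set.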

Reader's Comments

  1. Theo | September 16th, 2010 at 12:56 pm

Very cool post! I will keep this one in mind when I stumble upon a comparable situation at a customer.

    One thing puzzles me though in your post. At the beginning of the post you state that (I quote):

    “[The] problem is that the RowID either always changes, whether or not the data changed, or didn’t change, even when the data changed. This proved not to be a reliable way to check for changes in the data.”

A bit further in the post you still decide to store the hashes with the RowIDs. Wouldn't it be safer to store the hashes with the primary key or unique key of the materialized view? That would mean that your code would be a little less generic, but you could opt for an extra parameter that contains the PK/UK columns.

  2. Patch | September 16th, 2010 at 3:41 pm

It was an option to take the PK/UK as an extra column (and it still is), but its type can vary: number, varchar2, that sort of thing.
    I am not using the RowID as a comparison, but only the hashvalue. The chance of a different row hashing to the same value is small enough (not 0, but nearly). The only thing I use the RowID for is to correlate the hash value back to the originating row.
    But, I am thinking about it a bit more, and I might just add a PK/UK column.

  3. Optimus | September 18th, 2010 at 5:59 pm

    Hi Patch,

    while browsing I stumbled upon your site.
    There is another trick to create a hash table.
    Take a look at the following select statement.
It also handles the geometry columns. I'm not quite sure about the performance though.
    Excuse the formatting.

-- don't forget: GRANT EXECUTE ON DBMS_CRYPTO TO SCOTT;

    select empno as key
    , sys.dbms_crypto.hash(to_clob(dbms_xmlgen.getxmltype('select ' || cs.concat_column_string || ' from scott.emp where rowid = ''' || t.rowid || '''')),2) as hashvalue
    from scott.emp t
    , (select rtrim(xmlagg(xmlelement("column", column_name || ',')).extract('//text()'), ',') concat_column_string
    from all_tab_columns tc
    where owner = 'SCOTT'
    and table_name = 'EMP'
    and column_name not in ('EMPNO') -- exclude the ones you don't need
    order by column_id) cs;

    Regards!

  4. Alex Nuijten | September 24th, 2010 at 5:23 am

    Nice post. One comment though: “Problem is that the RowID either always changes” is not correct. ROWID points to the physical location of a record. With a complete refresh, data will be truncated and reinserted based on the SQL statement. The data will be “new”, so ROWIDs are “handed out” when the data is inserted and considered new for the MV. Just because the data in the record is the same doesn’t mean it is stored in the same place. Tiny thing, but worth noting, I think.

  5. Patch | September 24th, 2010 at 5:43 am

    The RowID changes with a full refresh of the MV. The row is logically the same (all the same data) but the RowID is different. It is logically the same row, but physically it’s a different row. The hash value will be the same, so we will not see a change. The RowID is not a good thing to store to check for changes, because that one always changes, like Alex states.
