Fastest way to compute the hash of a whole table [duplicate]

We need to be able to compute table hashes for an external environment and compare them to pre-computed hashes from an internal environment. The purpose is to ensure that data in the external environment has not been tampered with by a "rogue" database administrator. Users insist on this feature.

Currently, we do this by computing the individual hash of each column value, performing a bit-xor on the column hashes to get the row hash, and then performing a bit-xor on all the row hashes to arrive at the table hash. Pseudo-script below:

cursor hash_cur is
  -- one keyed hash (HMAC) per column, keyed by the column name
  select /*+ PARALLEL(4) */
         dbms_crypto.mac(column1_in_raw_type, dbms_crypto.hmac_sh512, string_to_raw('COLUMN1_NAME')) as COLUMN1_NAME
         ...
  from TABLE_NAME;

open hash_cur;
fetch hash_cur bulk collect into hashes;
close hash_cur;

for i in 1..hashes.count
loop
  rec := hashes(i);

  -- xor the column hashes together to get the row hash
  record_xor := rec.COLUMN1;
  record_xor := bit_xor(record_xor, rec.COLUMN2);
  ...
  record_xor := bit_xor(record_xor, rec.COLUMNN);

  -- fold the row hash into the running table hash
  table_xor := bit_xor(table_xor, record_xor);
end loop;

The pseudo-script above is run in parallel using dbms_job.

The problem with this is that we have terabytes of data in certain tables, and the current performance falls short of what we need. Hashing must be done "on the fly", because users want to perform the hash check themselves.

  1. Do you guys have a better way to perform whole-table hashing, or basically to compare tables from different environments connected by a low-latency and relatively low-bandwidth network?

It seems to me that the operation is more CPU-bound than I/O-bound. I am thinking of storing the table data in a BLOB instead, with the data properly arranged by record and then by column, and then performing the hash on that output. This should make the operation completely I/O-bound.
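Roughly what I have in mind -- a minimal sketch, assuming a hypothetical my_table(pk_col, col1, col2) and a '|' separator between values (all names and the separator are illustrative only):

declare
  l_buf  blob;
  l_raw  raw(32767);
  l_hash raw(20);  -- SHA-1 output is 20 bytes
begin
  dbms_lob.createtemporary(l_buf, true);
  -- Serialize every row, column by column, into one temporary BLOB.
  -- The ORDER BY keeps the serialization (and thus the hash) deterministic.
  for r in (select pk_col, col1, col2 from my_table order by pk_col) loop
    l_raw := utl_i18n.string_to_raw(r.pk_col || '|' || r.col1 || '|' || r.col2, 'AL32UTF8');
    dbms_lob.writeappend(l_buf, utl_raw.length(l_raw), l_raw);
  end loop;
  -- One hash over the whole serialized image.
  l_hash := dbms_crypto.hash(l_buf, dbms_crypto.hash_sh1);
  dbms_lob.freetemporary(l_buf);
  dbms_output.put_line(rawtohex(l_hash));
end;
/

Note that the row-by-row writeappend still round-trips between the PL/SQL and SQL engines, so this mainly trades hashing CPU for LOB I/O.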

  2. What is the fastest way to do this? Is there any way to do it within the SELECT clause of a query, to remove the overhead of PL/SQL-to-SQL engine context switches? (See the sketch after this list.)
    • I was thinking of modifying a global BLOB for this.
    • I would also like to remove the I/O overhead of bulk-collecting the results.
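On the "within the SELECT clause" idea: if we can rely on Oracle 12c or later, STANDARD_HASH runs entirely in the SQL engine, so the per-row hashing involves no PL/SQL context switch at all. A sketch against the same hypothetical my_table (combining the row hashes into a single table hash would still need an aggregation step):

select /*+ PARALLEL(4) */
       standard_hash(pk_col || '|' || col1 || '|' || col2, 'SHA512') as row_hash
from   my_table;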

Any suggestions that could lead me to a better-performing script would be greatly appreciated. Thanks.

Eppes asked 20/11, 2015 at 10:39 Comment(9)
I am not sure, but you could append all the column data into a single value, compute a single hash key, do the same with the other table, and then compare the hash keys.Archaimbaud
@vishnusable that assumes that all data types can be converted to varchar and appended. Better to aggregate all data into a binary LOB, as pointed out in #2.Eppes
To calculate a hash of all rows and columns of a table, the server has to read all of that data. Read the whole table. It is definitely I/O-bound first. I'd try to find a hashing algorithm that makes it possible to update the hash as values in the table change. If you have a million rows in a table and one row changes, you don't want to read all million rows again to recalculate the hash. I don't know if such an algorithm exists, though.Isomerism
@VladimirBaranov the reason it is necessary to read all records again is to ensure that the data has not been tampered with by an admin with elevated access rights. This is explicitly requested by our users. There's a trigger in the internal database that updates the table hash based on the edited row (see the sketch after these comments), but for the external database the whole table hash needs to be recomputed. If this is not done, it can never be known whether data in the external environment has been tampered with.Eppes
Well, then you have to read all the external data, and it will be slow. BTW, if the external admin has full rights (say, he can adjust the triggers that recompute the hash), he can always find a way to mimic the necessary hash and hide his tampering, can't he?Isomerism
@VladimirBaranov he won't be able to do so, since the entry point of the hash calculation is a web application that also generates the scripts dynamically. No script is stored in the database. I don't see how a DB admin can overcome this.Eppes
This might be a duplicate of this question.Nomination
Do you think that your script will detect values in two different rows being swapped?Chaplet
@DavidAldridge initially I thought you were right; however, the data swapping should only matter if the PK values are NOT swapped, and if the PK values are not swapped this should result in a different hash.Eppes
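To make the trigger idea mentioned above concrete: the XOR scheme is incrementally updatable, because bit-xor is its own inverse. A pseudo-sketch in the same style as the script above, where old_row_xor and new_row_xor are computed from the :old and :new column values exactly like record_xor:

-- inside a row-level update trigger on the internal table (sketch only)
table_xor := bit_xor(table_xor, old_row_xor);  -- xor cancels the old row's contribution
table_xor := bit_xor(table_xor, new_row_xor);  -- xor adds the new row's contribution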

First of all, I think the way to approach "rogue administrators" is with a combination of Oracle's audit trail and Database Vault features.

That said, here's what I might try:

1) Create a custom ODCI aggregate function to compute a hash of multiple rows as an aggregate.
2) Create a VIRTUAL NOT NULL column on the table that is a SHA hash of all the columns in the table -- or all the ones you care about protecting. You'd keep this around all the time -- basically trading away some insert/update/delete performance in exchange for being able to compute hashes more quickly.
3) Create a non-unique index on that virtual column.
4) SELECT my_aggregate_hash_function(virtual_hash_column) FROM my_table to get the results.

Here's code:

Create an aggregate function to compute a SHA hash over a bunch of rows

CREATE OR REPLACE TYPE matt_hash_aggregate_impl AS OBJECT
(
  hash_value RAW(32000),
  CONSTRUCTOR FUNCTION matt_hash_aggregate_impl(SELF IN OUT NOCOPY matt_hash_aggregate_impl ) RETURN SELF AS RESULT,  
-- Called to initialize a new aggregation context
-- For analytic functions, the aggregation context of the *previous* window is passed in, so we only need to adjust as needed instead 
-- of creating the new aggregation context from scratch
  STATIC FUNCTION ODCIAggregateInitialize (sctx IN OUT matt_hash_aggregate_impl) RETURN NUMBER,
-- Called when a new data point is added to an aggregation context  
  MEMBER FUNCTION ODCIAggregateIterate (self IN OUT matt_hash_aggregate_impl, value IN raw ) RETURN NUMBER,
-- Called to return the computed aggregate from an aggregation context
  MEMBER FUNCTION ODCIAggregateTerminate (self IN matt_hash_aggregate_impl, returnValue OUT raw, flags IN NUMBER) RETURN NUMBER,
-- Called to merge two aggregation contexts into one (e.g., merging results of parallel slaves) 
  MEMBER FUNCTION ODCIAggregateMerge (self IN OUT matt_hash_aggregate_impl, ctx2 IN matt_hash_aggregate_impl) RETURN NUMBER,
  -- ODCIAggregateDelete
  MEMBER FUNCTION ODCIAggregateDelete(self IN OUT matt_hash_aggregate_impl, value raw) RETURN NUMBER  
);

/

CREATE OR REPLACE TYPE BODY matt_hash_aggregate_impl IS

CONSTRUCTOR FUNCTION matt_hash_aggregate_impl(SELF IN OUT NOCOPY matt_hash_aggregate_impl ) RETURN SELF AS RESULT IS
BEGIN
  SELF.hash_value := null;
  RETURN;
END;


STATIC FUNCTION ODCIAggregateInitialize (sctx IN OUT matt_hash_aggregate_impl) RETURN NUMBER IS
BEGIN
  sctx := matt_hash_aggregate_impl ();
  RETURN ODCIConst.Success;
END;


MEMBER FUNCTION ODCIAggregateIterate (self IN OUT matt_hash_aggregate_impl, value IN raw ) RETURN NUMBER IS
BEGIN
  IF self.hash_value IS NULL THEN
    self.hash_value := dbms_crypto.hash(value, dbms_crypto.hash_sh1);
  ELSE 
      self.hash_value := dbms_crypto.hash(self.hash_value || value, dbms_crypto.hash_sh1);
  END IF;
  RETURN ODCIConst.Success;
END;

MEMBER FUNCTION ODCIAggregateTerminate (self IN matt_hash_aggregate_impl, returnValue OUT raw, flags IN NUMBER) RETURN NUMBER IS
BEGIN
  returnValue := dbms_crypto.hash(self.hash_value,dbms_crypto.hash_sh1);
  RETURN ODCIConst.Success;
END;

MEMBER FUNCTION ODCIAggregateMerge (self IN OUT matt_hash_aggregate_impl, ctx2 IN matt_hash_aggregate_impl) RETURN NUMBER IS
BEGIN
    self.hash_value := dbms_crypto.hash(self.hash_value || ctx2.hash_value, dbms_crypto.hash_sh1);
  RETURN ODCIConst.Success;
END;

-- ODCIAggregateDelete
MEMBER FUNCTION ODCIAggregateDelete(self IN OUT matt_hash_aggregate_impl, value raw) RETURN NUMBER IS
BEGIN
  raise_application_error(-20001, 'Invalid operation -- hash aggregate function does not support windowing!');
END;  

END;
/

CREATE OR REPLACE FUNCTION matt_hash_aggregate ( input raw) RETURN raw
PARALLEL_ENABLE AGGREGATE USING matt_hash_aggregate_impl;
/

Create a test table to work with (you can skip this, since you have your real table)

create table mattmsi as select * from mtl_system_items where rownum <= 200000;

Create a virtual column hash of each row's data. Make sure it is NOT NULL

alter table mattmsi add compliance_hash generated always as ( dbms_crypto.hash(to_clob(inventory_item_id || segment1 || last_update_date || created_by || description), 3 /*dbms_crypto.hash_sh1*/) ) VIRTUAL not null ;

Create an index on the virtual column; this way you can compute your hash with a full scan of the narrow index instead of a full scan of the fat table. (The NOT NULL declaration matters here: NULL keys are not stored in a b-tree index, so the optimizer can only substitute the index scan for the table scan when the column is known to be NOT NULL.)

create index msi_compliance_hash_n1 on mattmsi (compliance_hash);  

Put it all together to compute your hash

SELECT matt_hash_aggregate(compliance_hash) from (select compliance_hash from mattmsi order by compliance_hash);

A few comments:

  1. I think it is important to use a hash to compute the aggregate (rather than merely doing a SUM() over the row-level hashes), because an attacker could forge the correct sum very easily.
  2. I don't think you can (easily?) use parallel query because it is important that the rows be fed to the aggregate function in a consistent order, or else the hash value will change.
Colored answered 20/11, 2015 at 21:40 Comment(7)
You might want to take a hash of the source for matt_hash_aggregate_impl as well then.Chaplet
The OP said in another comment that the suspect DBAs wouldn't have access to the hash-checking algorithm, but maybe it wouldn't hurt. Again, I think Oracle Database Vault and auditing are probably more robust ways to deal with the issue.Colored
Thanks Matthew, I had never come across virtual columns before; they seem promising. However, wouldn't this have the same "security issue" as triggers, where a "rogue admin" can just disable a hash-computing trigger so that changes in the data are not reflected in the hash?Eppes
Upon reading more about virtual columns, I don't see an advantage for my case: the virtual column value gets recomputed every time. Besides, someone could just drop the virtual column. Better off creating a physical column and a trigger to compute the physical column's value.Eppes
I assumed that when you compute the hash, you can verify that the virtual column is in place. A rogue administrator could tamper with it, but not in a way that you could not detect. A trigger is far worse: DBAs could just update the hash directly to whatever they needed it to be to conceal their updates.Colored
@MatthewMcPeak couldn't a rogue administrator just copy the virtual column hashes to some other location, then alter the virtual column computation to refer to the copied hashes instead?Eppes
You can check for that easily. But that is the problem with this sort of do-it-yourself security -- you need to think of everything yourself. That's part of the reason I encouraged you to look into the Audit Trail and Database Vault products from Oracle.Colored

You can use ORA_HASH and pass several columns as the expression:

select sum(ORA_HASH(col1||col2||col3)) as hash from my_table

But here on AskTom there's a similar discussion about why this is not a good way to do it: Creating a unique HASH value for the contents of a table

Exhaustive answered 20/11, 2015 at 14:34 Comment(3)
I don't think sum(hash) is strong enough. A rogue DBA could not easily compute a single row to give a particular hash value, but he could easily insert/delete rows with true hash values to get the sum() back to the correct value.Colored
Can someone explain what SUM is doing in this case? I thought SUM only supported arithmetic on base-10 numbers?Rurik
@Rurik Per ORA_HASH's own documentation, it returns a NUMBER (hence one per row here), so there's no problem adding them together. But that also shows that the above will not detect some changes, since it sums all the values. If one table has 2 rows for which ORA_HASH is -2 and +2, and another table has ORA_HASH values of -1 and +1, then the SELECT will give the same result, and you will believe the tables are the same when they are not (a simplistic example, of course, but it shows the real problem).Panocha
