Several months ago I was working on a problem for one of my clients. It involved taking a blob of data and performing a transformation on it (what else would you do with a database, eh?). The catch is that the transformation is performed according to user-defined rules, which are dynamic: the users have the ability to add, remove and update these rules.
A rule consisted of several columns of sparse data. The first columns described the data pattern which caused a match, and the remainder gave the split of the new data. Basically it took financial data and split it out by low-level organisation, so a company could see how each element of cost/income was split across the business.
The rules matched according to a hierarchical ordering: rules with more populated pattern columns were matched first, and rules with fewer populated columns were matched later. This ordering was controlled by a separate table which was used to sort the rules. More detail below.
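To make the ordering concrete, here is a hypothetical set of rules (the real system had more pattern columns, and these values are made up):

cost_centre  expense_type  year
CC100        ET05          2010    <- rule A: 3 pattern columns filled, matched first
CC100        ET05          -       <- rule B: 2 filled
CC100        -             -       <- rule C: 1 filled, matched last

A data row for CC100/ET05/2010 takes rule A’s split; a CC100/ET05 row from any other year falls through to rule B, and so on.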
So how do you go about solving this problem? The first instinct is simply to join the data table to the rules, sort, and then use a massive CASE to remove the duplicates and select the proper rule to insert into the result table. Yes, that would work, but it would consume vast amounts of temporary space as Oracle built the join, sorted it, and then selected the proper rule for each row of data. In most instances several candidate rows (if not dozens) would be created for each real data row, and at most one would survive.
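For illustration, that brute-force attempt looked roughly like this. The names trans_rules, rule_sort_order and priority are hypothetical, and I’ve sketched the dedup with an analytic ROW_NUMBER rather than the large CASE, but the shape is the same:

select data_rid, rule_id
  from (select d.rowid as data_rid,
               r.uid_  as rule_id,
               row_number() over (partition by d.rowid
                                  order by so.priority) as rn
          from reslt_m d
          join trans_rules r
            on (r.cost_centre_id is null or r.cost_centre_id = d.cost_centre_id)
           and (r.expense_type_id is null or r.expense_type_id = d.expense_type_id)
           /* ... one predicate per pattern column ... */
          join rule_sort_order so
            on so.rule_id = r.uid_)
 where rn = 1;

Every data row fans out to every candidate rule, the whole lot gets sorted, and only the most specific match survives. All that discarded work has to live somewhere while the query runs.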
Still, it seemed the fastest way to do it. The plan didn’t look too bad, so I tested it on some real data. It was slow, very slow. And occasionally it would cause an error (OMG, Oracle has bugs!). So I had to find another way to get the job done, and fast.
I came upon the idea of loading the rules all into memory, basically into an array with the rules sorted in the proper order. I could then take a couple of columns of the data – which were mandatory – and use them to reduce the number of rules to compare down to a handful. From there I could trim down to the single matching rule easily enough and return its rule id for further processing. As the rules were all in memory this should be really, really fast. Now to hook it up!
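A minimal sketch of the in-memory structure (trans_rules is a hypothetical name for the rule table; the mandatory key columns are the ones that appear in the join later):

create or replace package rule_cache as
  -- one bucket of rules, kept in match-priority order
  type rule_list_t is table of trans_rules%rowtype index by pls_integer;
  -- buckets keyed by the mandatory columns, e.g. 'CC100|ET05'
  type rule_map_t is table of rule_list_t index by varchar2(100);
  rules rule_map_t;
end rule_cache;
/

A lookup then only has to walk the handful of rules in the matching bucket, in order, and take the first one whose remaining pattern columns fit.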
So I can load the rules up pretty fast using the BULK COLLECT facility of Oracle PL/SQL and keep them in memory for fast lookup. But how do I hook that process up with the data in the table and have it run as fast as possible? The secret here is two things: pipelining and parallel processing.
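The load itself might look something like this (again with the hypothetical table names; the ORDER BY goes against the sorting table mentioned earlier):

declare
  type raw_tab_t is table of trans_rules%rowtype index by pls_integer;
  l_raw raw_tab_t;
  l_key varchar2(100);
  l_pos pls_integer;
begin
  -- one round trip: every rule, already in match-priority order
  select tr.* bulk collect into l_raw
    from trans_rules tr
    join rule_sort_order so on so.rule_id = tr.uid_
   order by so.priority;

  -- bucket the rules by the mandatory columns for fast lookup
  for i in 1 .. l_raw.count loop
    l_key := l_raw(i).cost_centre_id || '|' || l_raw(i).expense_type_id;
    l_pos := case when rule_cache.rules.exists(l_key)
                  then rule_cache.rules(l_key).count + 1
                  else 1
             end;
    rule_cache.rules(l_key)(l_pos) := l_raw(i);
  end loop;
end;
/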
Pipelining is where a PL/SQL function can return results as each item is calculated, so Oracle can consume them as part of a larger SQL statement as it goes. Otherwise the database would have to wait for the entire set of results to be returned before continuing. It might sound like that isn’t needed in this situation, but it is.
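As a toy example (not the client code), a pipelined function streams each row out with PIPE ROW the moment it is produced:

create or replace type num_tab as table of number;
/
create or replace function squares(p_n in number) return num_tab
  pipelined
is
begin
  for i in 1 .. p_n loop
    pipe row (i * i);  -- each row is available to the caller immediately
  end loop;
  return;
end;
/
-- usage: select column_value from table(squares(5));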
As the results of each invocation of the function are independent of any other, having the function run in parallel can boost performance significantly; in my experience it makes a real difference. Obviously you’ll need Oracle to have multiple cores or processors available for this to work in practice.
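To be parallel-enabled, the function must take its input as a ref cursor and declare how Oracle may split the rows across parallel slaves. PARTITION BY ANY says any split will do, which is exactly right when every row is processed independently. A sketch, reusing the num_tab type from above:

create or replace function pass_through(p_cur in sys_refcursor)
  return num_tab
  pipelined
  parallel_enable (partition p_cur by any)
is
  l_val number;
begin
  loop
    fetch p_cur into l_val;
    exit when p_cur%notfound;
    pipe row (l_val);
  end loop;
  close p_cur;
  return;
end;
/
-- usage: select * from table(pass_through(cursor(
--          select /*+ parallel(t,2) */ n from t)));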
So, now I’ve got all the pieces and the PL/SQL looked something like this:
The SQL Statement:
insert into output( ENT_ID, VALUE_, RULE_ID )
select /*+ parallel(data,2) */ aqsr.ent_id, data.value_*aqsr.ent_percent, asm.rule_id
  from reslt_m data
 inner join table(GroupdData(CURSOR(select /*+ parallel(GD,2) */ * from Groupd_Data GD))) asm
    on data.cost_centre_id=asm.cost_centre_id and
       data.expense_type_id=asm.expense_type_id and data.measure=asm.measure_id and
       data.year_=asm.year_ and data.data_type_id=asm.data_type_id
 inner join rules_ratios aqsr on asm.rule_id=aqsr.UID_
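Note the CURSOR(...) expression: it hands the function a ref cursor over Groupd_Data, which is what gives Oracle the handle it needs to partition the input rows across the parallel slaves.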
and the pipelined function GroupdData:
function GroupdData(p_inpcur in ref_basedatacur_t) return G_GrpdData_tab
  pipelined parallel_enable (partition p_inpcur by any)
is
  /* definitions left out */
begin
  loop
    -- fetch the input in chunks rather than one row at a time
    fetch p_inpcur bulk collect into l_groupify_rec limit 10000;
    exit when l_groupify_rec.count = 0;
    for l_idx in l_groupify_rec.first .. l_groupify_rec.last loop
      /* tedious code to check and find rules left out ..
         fill in rule_id to return to query */
      l_rec.rule_id := rules(l_idx)(l_rul_idx).UID_;
      pipe row (l_rec);  -- stream the matched row straight back to the SQL
    end loop;
  end loop;
  close p_inpcur;
  return;
end GroupdData;
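A quick note on the LIMIT 10000 in the bulk fetch: that figure is a trade-off, large enough to amortise the cost of each fetch, small enough to keep the memory used by each parallel slave bounded.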
And that’s all there is to it! Okay, so I’ve left out some of the nitty-gritty implementation details, but you get the idea.
This code might look like it would run slowly, but don’t be deceived: it is very, very fast. And I intend to use similar techniques in the future, but only when a single SQL statement is either too slow or problematic.