In my previous post, we introduced some basic features of OML4Py. In this post, I'll illustrate using some in-database machine learning algorithms.
Creating in-database predictive models
In this example, we use the Random Forest algorithm with an optional cost matrix to predict Species from the IRIS table. First, we get a DataFrame proxy object for the IRIS table and use the overloaded split function to obtain training and test data sets, which are also proxy objects (the data remains in the database). We then create and display a demo cost matrix, purely for illustration, before building a Random Forest model. Note that the cost matrix is itself a named database table, RF_COST.
import oml
import pandas as pd

IRIS = oml.sync(table = "IRIS")         # obtain proxy object
IRIS_TRAIN, IRIS_TEST = IRIS.split()    # split data using default ratio
TRAIN_x = IRIS_TRAIN.drop('Species')
TRAIN_y = IRIS_TRAIN['Species']

# Create a cost matrix table in the database
cost_matrix = [['setosa',     'setosa',     0],
               ['setosa',     'virginica',  0.2],
               ['setosa',     'versicolor', 0.8],
               ['virginica',  'virginica',  0],
               ['virginica',  'setosa',     0.5],
               ['virginica',  'versicolor', 0.5],
               ['versicolor', 'versicolor', 0],
               ['versicolor', 'setosa',     0.4],
               ['versicolor', 'virginica',  0.6]]
COST = oml.create(pd.DataFrame(cost_matrix,
                               columns = ['ACTUAL_TARGET_VALUE',
                                          'PREDICTED_TARGET_VALUE',
                                          'COST']),
                  "RF_COST")
COST.pull().pivot(index = 'ACTUAL_TARGET_VALUE',
                  columns = 'PREDICTED_TARGET_VALUE',
                  values = 'COST')
With the output:
PREDICTED_TARGET_VALUE  setosa  versicolor  virginica
ACTUAL_TARGET_VALUE
setosa                     0.0         0.8        0.2
versicolor                 0.4         0.0        0.6
virginica                  0.5         0.5        0.0
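To see how such a cost matrix influences classification, here is a small, self-contained sketch, using plain pandas with no database required. The class probabilities are made up for illustration; the decision rule picks the prediction that minimizes expected cost, which can differ from simply picking the most probable class:

```python
import pandas as pd

# Cost matrix from above: rows = actual class, columns = predicted class
cost = pd.DataFrame(
    [[0.0, 0.8, 0.2],
     [0.4, 0.0, 0.6],
     [0.5, 0.5, 0.0]],
    index=['setosa', 'versicolor', 'virginica'],    # actual
    columns=['setosa', 'versicolor', 'virginica'])  # predicted

# Hypothetical class probabilities for a single row to be scored
probs = pd.Series({'setosa': 0.0, 'versicolor': 0.48, 'virginica': 0.52})

# Expected cost of each candidate prediction: weight each column's
# costs by the probability of the corresponding actual class
expected_cost = cost.mul(probs, axis=0).sum(axis=0)

# Minimum expected cost picks versicolor even though
# virginica has the highest probability
decision = expected_cost.idxmin()
```

Here probs.idxmax() would return virginica, but the asymmetric costs make versicolor the cheaper bet, which is exactly the kind of bias the RF_COST table injects into the in-database model.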
To build our Random Forest model, we create an rf object, specifying a maximum tree depth of 4, and invoke fit on our training data with the cost matrix. To display model details, we print the object. Note that the Random Forest model also reports attribute importance.
from oml import rf

rf_mod = rf(tree_term_max_depth = '4')
rf_mod = rf_mod.fit(TRAIN_x, TRAIN_y, cost_matrix = COST)
rf_mod    # show model details
With the output:
Algorithm Name: Random Forest
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_RANDOM_FOREST
1 CLAS_COST_TABLE_NAME "PYQUSER"."RF_COST"
2 CLAS_MAX_SUP_BINS 32
3 CLAS_WEIGHTS_BALANCED OFF
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_RANDOM_SEED 0
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
9 RFOR_NUM_TREES 20
10 RFOR_SAMPLING_RATIO .5
11 TREE_IMPURITY_METRIC TREE_IMPURITY_GINI
12 TREE_TERM_MAX_DEPTH 4
13 TREE_TERM_MINPCT_NODE .05
14 TREE_TERM_MINPCT_SPLIT .1
15 TREE_TERM_MINREC_NODE 10
16 TREE_TERM_MINREC_SPLIT 20
Computed Settings:
setting name setting value
0 RFOR_MTRY 2
Global Statistics:
AVG_DEPTH AVG_NODECOUNT MAX_DEPTH MAX_NODECOUNT MIN_DEPTH \
0 3.25 5.6 4.0 6.0 4.0
MIN_NODECOUNT NUM_ROWS
0 4.0 102.0
Attributes:
SEPAL_LENGTH
SEPAL_WIDTH
PETAL_LENGTH
PETAL_WIDTH
Partition: NO
Importance:
ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
0 PETAL_LENGTH None 0.584287
1 PETAL_WIDTH None 0.512901
2 SEPAL_LENGTH None 0.136424
3 SEPAL_WIDTH None 0.038918
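The importance values above can also be post-processed on the client once pulled into pandas. A minimal sketch, using the numbers printed above, that ranks attributes by their share of total importance:

```python
import pandas as pd

# Attribute importance values as reported by the model above
imp = pd.DataFrame({
    'ATTRIBUTE_NAME': ['PETAL_LENGTH', 'PETAL_WIDTH',
                       'SEPAL_LENGTH', 'SEPAL_WIDTH'],
    'ATTRIBUTE_IMPORTANCE': [0.584287, 0.512901, 0.136424, 0.038918]})

# Each attribute's share of total importance, sorted descending
imp['SHARE'] = imp['ATTRIBUTE_IMPORTANCE'] / imp['ATTRIBUTE_IMPORTANCE'].sum()
imp = imp.sort_values('SHARE', ascending=False).reset_index(drop=True)
```

As expected for iris, the petal measurements dominate: together they account for well over 80% of total importance.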
Use this Random Forest in-database model to make predictions
The rf_mod object is itself a proxy object for the corresponding in-database model. We predict using the IRIS_TEST proxy object, optionally specifying supplemental columns from the original data. Supplemental columns are useful for comparing actual target values against predicted values, or for including unique keys for subsequent row identification or joins with other tables. Recall that results from a relational database have no implicit row order unless the user specifies a sort (i.e., ORDER BY). Why? Sorting is often unnecessary and, depending on data size, potentially expensive. The predicted results in pred are also a DataFrame proxy object, since the result could be large (e.g., millions or billions of rows).
pred = rf_mod.predict(IRIS_TEST.drop('Species'),
                      supplemental_cols = IRIS_TEST[:, ['SEPAL_LENGTH',
                                                        'SEPAL_WIDTH',
                                                        'PETAL_LENGTH',
                                                        'Species']])
print("Shape:", pred.shape)
pred.head(10)
With the output:
Shape: (48, 5)
SEPAL_LENGTH SEPAL_WIDTH PETAL_LENGTH Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
4 5.1 3.5 1.4 setosa setosa
5 5.0 3.4 1.6 setosa setosa
6 5.2 3.5 1.5 setosa setosa
7 5.4 3.4 1.5 setosa setosa
8 4.9 3.1 1.5 setosa setosa
9 5.0 3.2 1.2 setosa setosa
Next, let's compute a confusion matrix for the model predictions to assess model accuracy. Using the overloaded crosstab function, we compute the confusion matrix between the actual values in Species and the predicted values in the PREDICTION column directly in the database.
res_ct = pred.crosstab('Species', 'PREDICTION', pivot = True)
print("Type:",type(res_ct))
print("Columns:",res_ct.columns)
res_ct.sort_values(by='Species')
With the output:
Type: <class 'oml.core.frame.DataFrame'>
Columns: ['Species', 'count_(setosa)', 'count_(versicolor)', 'count_(virginica)']
Species count_(setosa) count_(versicolor) count_(virginica)
0 setosa 16 0 0
1 versicolor 0 13 2
2 virginica 0 0 17
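From the confusion matrix above, overall accuracy and per-class recall follow directly. A small client-side sketch, using plain pandas and the counts printed above:

```python
import numpy as np
import pandas as pd

# Confusion matrix as printed above: rows = actual, columns = predicted
cm = pd.DataFrame([[16,  0,  0],
                   [ 0, 13,  2],
                   [ 0,  0, 17]],
                  index=['setosa', 'versicolor', 'virginica'],
                  columns=['setosa', 'versicolor', 'virginica'])

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm.values) / cm.values.sum()

# Per-class recall: each diagonal entry over its actual class's row total
recall = pd.Series(np.diag(cm.values), index=cm.index) / cm.sum(axis=1)
```

The model misclassifies only two versicolor rows as virginica, giving an overall accuracy of 46/48, or about 96%, with perfect recall on setosa and virginica.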