Monday, March 1, 2021

Using Oracle Machine Learning for Python - building predictive models

In my previous post, we introduced some basic features of OML4Py. In this post, I'll illustrate using some in-database machine learning algorithms.

Creating in-database predictive models

In this example, we use the Random Forest algorithm with the optional cost matrix input to predict Species from the IRIS table. First, we get a DataFrame proxy object for the IRIS table and use the overloaded split function to obtain training and test data sets, which are also proxy objects (the data remains in the database). We then create and display a demo cost matrix - just for illustration - before building a Random Forest model. Note that the cost matrix is itself a named database table, RF_COST.

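The examples that follow assume an active OML4Py connection to the database schema that owns the IRIS table. A minimal connection sketch is shown below; the user, password, and dsn values are placeholders that you would replace with your own environment's settings.

import oml

# Connect to the schema that holds the IRIS table.
# The credentials and connect string below are placeholders for illustration.
oml.connect(user = 'pyquser', password = '<password>', dsn = 'mypdb_service')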

import pandas as pd

IRIS = oml.sync(table = "IRIS")        # obtain a proxy object for the IRIS table
IRIS_TRAIN, IRIS_TEST = IRIS.split()   # split the data using the default ratio

TRAIN_x = IRIS_TRAIN.drop('Species')   # predictor columns
TRAIN_y = IRIS_TRAIN['Species']        # target column

# Create a cost matrix table in the database
cost_matrix = [['setosa',     'setosa',     0],
               ['setosa',     'virginica',  0.2],
               ['setosa',     'versicolor', 0.8],
               ['virginica',  'virginica',  0],
               ['virginica',  'setosa',     0.5],
               ['virginica',  'versicolor', 0.5],
               ['versicolor', 'versicolor', 0],
               ['versicolor', 'setosa',     0.4],
               ['versicolor', 'virginica',  0.6]]

COST = oml.create(pd.DataFrame(cost_matrix,
                               columns = ['ACTUAL_TARGET_VALUE',
                                          'PREDICTED_TARGET_VALUE',
                                          'COST']),
                  "RF_COST")

# Pull the cost matrix to the client and pivot it for display
COST.pull().pivot(index='ACTUAL_TARGET_VALUE',
                  columns='PREDICTED_TARGET_VALUE', values='COST')

With the output:

PREDICTED_TARGET_VALUE   setosa   versicolor   virginica
ACTUAL_TARGET_VALUE
setosa                      0.0          0.8         0.2
versicolor                  0.4          0.0         0.6
virginica                   0.5          0.5         0.0

To build our Random Forest model, we create an rf object, specifying a maximum tree depth of 4, and invoke fit on our training data with the cost matrix. To display model details, we print the object. Note that the Random Forest model also reports attribute importance.

from oml import rf

rf_mod = rf(tree_term_max_depth = '4')
rf_mod = rf_mod.fit(TRAIN_x, TRAIN_y, COST)
rf_mod   # Show model details

With the output:

Algorithm Name: Random Forest
Mining Function: CLASSIFICATION
Target: Species

Settings:
                    setting name            setting value
0                      ALGO_NAME       ALGO_RANDOM_FOREST
1           CLAS_COST_TABLE_NAME      "PYQUSER"."RF_COST"
2              CLAS_MAX_SUP_BINS                       32
3          CLAS_WEIGHTS_BALANCED                      OFF
4                   ODMS_DETAILS              ODMS_ENABLE
5   ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
6               ODMS_RANDOM_SEED                        0
7                  ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
8                      PREP_AUTO                       ON
9                 RFOR_NUM_TREES                       20
10           RFOR_SAMPLING_RATIO                       .5
11          TREE_IMPURITY_METRIC       TREE_IMPURITY_GINI
12           TREE_TERM_MAX_DEPTH                        4
13         TREE_TERM_MINPCT_NODE                      .05
14        TREE_TERM_MINPCT_SPLIT                       .1
15         TREE_TERM_MINREC_NODE                       10
16        TREE_TERM_MINREC_SPLIT                       20

Computed Settings:
  setting name setting value
0    RFOR_MTRY             2

Global Statistics:
   AVG_DEPTH  AVG_NODECOUNT  MAX_DEPTH  MAX_NODECOUNT  MIN_DEPTH  MIN_NODECOUNT  NUM_ROWS
0       3.25            5.6        4.0            6.0        4.0            4.0     102.0

Attributes:
SEPAL_LENGTH
SEPAL_WIDTH
PETAL_LENGTH
PETAL_WIDTH

Partition: NO

Importance:
  ATTRIBUTE_NAME ATTRIBUTE_SUBNAME  ATTRIBUTE_IMPORTANCE
0   PETAL_LENGTH              None              0.584287
1    PETAL_WIDTH              None              0.512901
2   SEPAL_LENGTH              None              0.136424
3    SEPAL_WIDTH              None              0.038918

Using the in-database Random Forest model to make predictions

The rf_mod object is itself a proxy object for its corresponding in-database model. We predict using the IRIS_TEST proxy object, optionally specifying supplemental columns from the original data. Supplemental columns are useful for comparing actual target values against predicted values, or for including unique keys for subsequent row identification or joins with other tables. Recall that row order is not guaranteed in results from a relational database unless the user specifies a sort (i.e., ORDER BY). Why? Sorting is often unnecessary and can be an expensive operation depending on data size. The predicted results in pred are also a DataFrame proxy object, since the result could be large (e.g., millions or billions of rows).

pred = rf_mod.predict(IRIS_TEST.drop('Species'),
                      supplemental_cols = IRIS_TEST[:, ['SEPAL_LENGTH',
                                                        'SEPAL_WIDTH',
                                                        'PETAL_LENGTH',
                                                        'Species']])
print("Shape:", pred.shape)
pred.head(10)

With the output:

Shape: (48, 5)

   SEPAL_LENGTH  SEPAL_WIDTH  PETAL_LENGTH Species PREDICTION
0           4.9          3.0           1.4  setosa     setosa
1           4.9          3.1           1.5  setosa     setosa
2           4.8          3.4           1.6  setosa     setosa
3           5.8          4.0           1.2  setosa     setosa
4           5.1          3.5           1.4  setosa     setosa
5           5.0          3.4           1.6  setosa     setosa
6           5.2          3.5           1.5  setosa     setosa
7           5.4          3.4           1.5  setosa     setosa
8           4.9          3.1           1.5  setosa     setosa
9           5.0          3.2           1.2  setosa     setosa
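Because the test split here contains only a few dozen rows, it is reasonable to pull the predictions to the client for a quick check. The following sketch (not from the original post) uses the pull method shown earlier plus standard pandas to list any rows where the actual species differs from the prediction.

# Pull the (small) prediction result into a local pandas DataFrame and
# list the rows the model misclassified. For large results, keep the data
# in the database and filter there instead of pulling.
local_pred = pred.pull()
mismatches = local_pred[local_pred['Species'] != local_pred['PREDICTION']]
print("Misclassified rows:", len(mismatches))
print(mismatches)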

Next, let's compute a confusion matrix for the model predictions to assess model accuracy. Using the overloaded crosstab function, we compute the confusion matrix between the actual values in Species and the predicted result in the PREDICTION column directly in the database.

res_ct = pred.crosstab('Species', 'PREDICTION', pivot=True)
print("Type:", type(res_ct))
print("Columns:", res_ct.columns)
res_ct.sort_values(by='Species')

With the output:

Type: <class 'oml.core.frame.DataFrame'>
Columns: ['Species', 'count_(setosa)', 'count_(versicolor)', 'count_(virginica)']

     Species  count_(setosa)  count_(versicolor)  count_(virginica)
0      setosa              16                   0                  0
1  versicolor               0                  13                  2
2   virginica               0                   0                 17
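The confusion matrix itself is tiny, so as a final check we can pull it to the client and compute overall accuracy with plain pandas. This is a minimal sketch (not from the original post) that relies on the count_(...) column naming shown above.

# Pull the small confusion matrix locally and compute overall accuracy:
# correctly classified rows (the diagonal) divided by all rows scored.
cm = res_ct.pull().set_index('Species')
correct = sum(cm.loc[s, 'count_({})'.format(s)] for s in cm.index)
total = cm.values.sum()
print("Overall accuracy: {:.3f}".format(correct / total))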
