Introduction to FunctionXform in pmmlTransformations

Dmitriy Bolotov

2018-08-13

Introduction

This vignette provides examples of how to use the FunctionXform transformation to create new data features for PMML models.

Given a WrapData object and a transformation expression, FunctionXform calculates data for a new feature and creates a new WrapData object. When PMML is produced with pmml::pmml(), the transformation is inserted into the LocalTransformations node as a DerivedField.

FunctionXform makes it possible to use multiple data fields and functions to produce a new feature.

While FunctionXform is part of the pmmlTransformations package, the code to produce PMML from R is in the pmml package. The following examples assume that both packages are installed and loaded. The kable function, part of knitr, is used to make tables more readable.
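The setup described above can be sketched as follows. This is a minimal sketch that assumes the three packages are available under the names used in this vignette; the requireNamespace check is an addition for illustration and is not part of the original examples.

```r
# Packages the examples in this vignette rely on.
pkgs <- c("pmml", "pmmlTransformations", "knitr")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) == 0) {
  library(pmml)
  library(pmmlTransformations)
  library(knitr)
} else {
  message("Install before running the examples: ",
          paste(missing_pkgs, collapse = ", "))
}
```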

Single numeric field

Using the iris dataset as an example, let’s construct a new feature by transforming one variable. Load the dataset and show the first few lines:

data(iris)
kable(head(iris,3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa

Create the irisBox object with WrapData:

irisBox <- WrapData(iris)

irisBox contains the data and transform information that will be used to produce PMML later. The original data is in irisBox$data. Any new features created with a transformation are added as columns to this data frame.

kable(head(irisBox$data,3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa

Transform and field information is in irisBox$fieldData. The fieldData data frame contains information on every field in the dataset, as well as every transform used. The functionXform column contains expressions used in the FunctionXform transform.

kable(irisBox$fieldData)
              type      dataType  origFieldName  sampleMin  sampleMax  xformedMin  xformedMax  centers  scales  fieldsMap  transform  default  missingValue  functionXform
Sepal.Length  original  numeric   NA             NA         NA         NA          NA          NA       NA      NA         NA         NA       NA            NA
Sepal.Width   original  numeric   NA             NA         NA         NA          NA          NA       NA      NA         NA         NA       NA            NA
Petal.Length  original  numeric   NA             NA         NA         NA          NA          NA       NA      NA         NA         NA       NA            NA
Petal.Width   original  numeric   NA             NA         NA         NA          NA          NA       NA      NA         NA         NA       NA            NA
Species       original  factor    NA             NA         NA         NA          NA          NA       NA      NA         NA         NA       NA            NA

Now add a new feature, Sepal.Length.Sqrt, using FunctionXform:

irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length",
                         newFieldName="Sepal.Length.Sqrt",
                         formulaText="sqrt(Sepal.Length)")

The new feature is calculated and added as a column to the irisBox$data data frame:

kable(head(irisBox$data,3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.Sqrt
5.1 3.5 1.4 0.2 setosa 2.258318
4.9 3.0 1.4 0.2 setosa 2.213594
4.7 3.2 1.3 0.2 setosa 2.167948
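As a quick sanity check, the new column can be reproduced in base R without pmmlTransformations. This sketch only mirrors the arithmetic of the formulaText expression; it does not record any transformation metadata.

```r
data(iris)

# Same expression as formulaText="sqrt(Sepal.Length)",
# evaluated directly on the column.
sepal_length_sqrt <- sqrt(iris$Sepal.Length)

head(sepal_length_sqrt, 3)
#> [1] 2.258318 2.213594 2.167948
```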

irisBox$fieldData now contains a new row with the transformation expression:

kable(irisBox$fieldData[6,c(1:3,14)])
                   type     dataType  origFieldName  functionXform
Sepal.Length.Sqrt  derived  numeric   Sepal.Length   sqrt(Sepal.Length)

Construct a linear model for Petal.Width using this new feature, and convert it to PMML:

fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data=irisBox$data)
fit_pmml <- pmml(fit, transform=irisBox)

Since the model predicts Petal.Width using a variable based on Sepal.Length, the PMML will contain these two fields in the DataDictionary and MiningSchema:

fit_pmml[[2]] #Data Dictionary node
#> <DataDictionary numberOfFields="2">
#>  <DataField name="Petal.Width" optype="continuous" dataType="double"/>
#>  <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
#> </DataDictionary>
fit_pmml[[3]][[1]] #Mining Schema node
#> <MiningSchema>
#>  <MiningField name="Petal.Width" usageType="predicted" invalidValueTreatment="returnInvalid"/>
#>  <MiningField name="Sepal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#> </MiningSchema>

The LocalTransformations node contains Sepal.Length.Sqrt as a derived field:

fit_pmml[[3]][[3]]
#> <LocalTransformations>
#>  <DerivedField name="Sepal.Length.Sqrt" dataType="double" optype="continuous">
#> <Apply function="sqrt">
#>   <FieldRef field="Sepal.Length"/>
#> </Apply> 
#>  </DerivedField>
#> </LocalTransformations>

Single categorical field

FunctionXform can also operate on categorical data. In this example, let’s create a binary indicator feature that equals 1 when Species is setosa and 0 otherwise:

irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Species",
                         newFieldName="Species.Setosa",
                         formulaText="if (Species == 'setosa') {1} else {0}")
kable(head(irisBox$data,3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species.Setosa
5.1 3.5 1.4 0.2 setosa 1
4.9 3.0 1.4 0.2 setosa 1
4.7 3.2 1.3 0.2 setosa 1
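The first three rows are all setosa, so they only show the value 1. A base R sketch of the same logic, using the vectorized ifelse in place of the if/else expression, makes it easier to see both values; iris contains 50 setosa rows out of 150. This is an illustration of the arithmetic only and does not go through FunctionXform.

```r
data(iris)

# Vectorized equivalent of "if (Species == 'setosa') {1} else {0}".
species_setosa <- ifelse(iris$Species == "setosa", 1, 0)

# Count how many rows fall into each value of the indicator.
table(species_setosa)
```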

Create a linear model and check the LocalTransformations node:

fit <- lm(Petal.Width ~ Species.Setosa, data=irisBox$data)
fit_pmml <- pmml(fit, transform=irisBox)
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#>  <DerivedField name="Species.Setosa" dataType="double" optype="continuous">
#> <Apply function="if">
#>   <Apply function="equal">
#>     <FieldRef field="Species"/>
#>     <Constant dataType="string">setosa</Constant>
#>   </Apply>
#>   <Constant dataType="double">1</Constant>
#>   <Constant dataType="double">0</Constant>
#> </Apply> 
#>  </DerivedField>
#> </LocalTransformations>

Multiple input fields

It is possible to create new features by combining several fields. Let’s create a new field from the ratio of sepal and petal lengths:

irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length",
                         newFieldName="Length.Ratio",
                         formulaText="Sepal.Length / Petal.Length")

As before, the new field is added as a column to the irisBox$data data frame:

kable(head(irisBox$data,3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length.Ratio
5.1 3.5 1.4 0.2 setosa 3.642857
4.9 3.0 1.4 0.2 setosa 3.500000
4.7 3.2 1.3 0.2 setosa 3.615385

Fit a linear model using this new feature, and convert it to PMML:

fit <- lm(Petal.Width ~ Length.Ratio, data=irisBox$data)
fit_pmml <- pmml(fit, transform=irisBox)

The PMML will contain Sepal.Length and Petal.Length in the DataDictionary and MiningSchema, since these were used in FunctionXform:

fit_pmml[[2]] #Data Dictionary node
#> <DataDictionary numberOfFields="3">
#>  <DataField name="Petal.Width" optype="continuous" dataType="double"/>
#>  <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
#>  <DataField name="Petal.Length" optype="continuous" dataType="double"/>
#> </DataDictionary>
fit_pmml[[3]][[1]] #Mining Schema node
#> <MiningSchema>
#>  <MiningField name="Petal.Width" usageType="predicted" invalidValueTreatment="returnInvalid"/>
#>  <MiningField name="Sepal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#>  <MiningField name="Petal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#> </MiningSchema>

The LocalTransformations node contains Length.Ratio as a derived field:

fit_pmml[[3]][[3]]
#> <LocalTransformations>
#>  <DerivedField name="Length.Ratio" dataType="double" optype="continuous">
#> <Apply function="/">
#>   <FieldRef field="Sepal.Length"/>
#>   <FieldRef field="Petal.Length"/>
#> </Apply> 
#>  </DerivedField>
#> </LocalTransformations>

Using a previously derived feature

It is possible to pass a feature derived with FunctionXform to another FunctionXform call. To do this, the second call must list the original data field names in the origFieldName argument rather than the derived field; the derived field name appears only in formulaText.

irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length",
                         newFieldName="Length.Ratio",
                         formulaText="Sepal.Length / Petal.Length")

irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length,Sepal.Width",
                         newFieldName="Length.R.Times.S.Width",
                         formulaText="Length.Ratio * Sepal.Width")
kable(irisBox$fieldData[6:7,c(1:3,14)])
                        type     dataType  origFieldName                          functionXform
Length.Ratio            derived  numeric   Sepal.Length,Petal.Length              Sepal.Length / Petal.Length
Length.R.Times.S.Width  derived  numeric   Sepal.Length,Petal.Length,Sepal.Width  Length.Ratio * Sepal.Width

fit <- lm(Petal.Width ~ Length.R.Times.S.Width, data=irisBox$data)
fit_pmml <- pmml(fit, transform=irisBox)

The PMML will contain Sepal.Length, Petal.Length, and Sepal.Width in the DataDictionary and MiningSchema, since these were used in FunctionXform:

fit_pmml[[2]] #Data Dictionary node
#> <DataDictionary numberOfFields="4">
#>  <DataField name="Petal.Width" optype="continuous" dataType="double"/>
#>  <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
#>  <DataField name="Petal.Length" optype="continuous" dataType="double"/>
#>  <DataField name="Sepal.Width" optype="continuous" dataType="double"/>
#> </DataDictionary>
fit_pmml[[3]][[1]] #Mining Schema node
#> <MiningSchema>
#>  <MiningField name="Petal.Width" usageType="predicted" invalidValueTreatment="returnInvalid"/>
#>  <MiningField name="Sepal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#>  <MiningField name="Petal.Length" usageType="active" invalidValueTreatment="returnInvalid"/>
#>  <MiningField name="Sepal.Width" usageType="active" invalidValueTreatment="returnInvalid"/>
#> </MiningSchema>

The LocalTransformations node contains Length.Ratio and Length.R.Times.S.Width as derived fields:

fit_pmml[[3]][[3]]
#> <LocalTransformations>
#>  <DerivedField name="Length.Ratio" dataType="double" optype="continuous">
#> <Apply function="/">
#>   <FieldRef field="Sepal.Length"/>
#>   <FieldRef field="Petal.Length"/>
#> </Apply> 
#>  </DerivedField>
#>  <DerivedField name="Length.R.Times.S.Width" dataType="double" optype="continuous">
#> <Apply function="*">
#>   <FieldRef field="Length.Ratio"/>
#>   <FieldRef field="Sepal.Width"/>
#> </Apply> 
#>  </DerivedField>
#> </LocalTransformations>

PMML functions supported by FunctionXform

The following R functions and operators are directly supported by FunctionXform. Their PMML equivalents are listed on the second line:

For these functions, no extra code is required for translation.

The R function prod can be used as long as only numeric arguments are specified. Although prod accepts an na.rm argument in R, specifying it in FunctionXform will not produce PMML equivalent to the R expression.

Similarly, the R function log can be used directly as long as the second argument (the base) is not specified.
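One possible workaround, offered here as a sketch rather than a documented feature, is to rewrite a logarithm with an explicit base using the change-of-base identity, so that the expression uses only single-argument log and division. The formulaText string at the end is a hypothetical example and has not been run through FunctionXform.

```r
data(iris)
x <- iris$Sepal.Length

# R's two-argument form, which should not be passed to FunctionXform directly.
log_base2 <- log(x, base = 2)

# Change of base: log_b(x) = log(x) / log(b), using single-argument log only.
log_base2_rewritten <- log(x) / log(2)

# The two forms agree up to floating-point tolerance.
all.equal(log_base2, log_base2_rewritten)
#> [1] TRUE

# Hypothetical expression that could be supplied as formulaText.
formula_text <- "log(Sepal.Length) / log(2)"
```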

PMML functions not supported by FunctionXform

Some built-in functions defined in PMML cannot be used in FunctionXform as described above, because no R function with the same name exists.

In that case, R throws an error when it tries to calculate the new feature, since the function passed to FunctionXform is not found in the environment.
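A pre-flight check along the following lines can surface such missing functions before FunctionXform is called. This is a base R sketch; undefined_functions is a hypothetical helper, not part of pmml or pmmlTransformations, and it assumes the expression is evaluated in an environment that sees the same functions as the calling session.

```r
# Hypothetical helper: names of functions used in an expression
# that R cannot currently find.
undefined_functions <- function(formula_text) {
  expr <- parse(text = formula_text)
  # all.names() returns functions and variables;
  # all.vars() returns variables only.
  fns <- setdiff(all.names(expr, unique = TRUE), all.vars(expr))
  Filter(function(f) !exists(f, mode = "function"), fns)
}

# sqrt exists in base R, so nothing is reported.
undefined_functions("sqrt(Sepal.Length)")
#> character(0)

# A made-up function name is reported as missing.
undefined_functions("notDefinedAnywhere(Species, 'setosa')")
#> [1] "notDefinedAnywhere"
```

Operators such as / and == and the if keyword are ordinary functions in R, so they pass the check without special handling.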

It is still possible to make FunctionXform work, but the PMML function must be defined in the R environment first.

Let’s use isIn, a PMML function, as an example. The function returns a boolean indicating whether the first argument is contained in a list of values. Detailed specification for this function is available on this DMG page.

One way to implement this in R is by using %in%, with the list of values being represented by ...:

isIn <- function(x, ...) {
  # TRUE when x matches any of the values supplied through ...
  x %in% c(...)
}

isIn(1,2,1,4)
#> [1] TRUE

This function can now be passed to FunctionXform. The following code creates a feature that indicates whether Species is either setosa or versicolor:

irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Species",
                         newFieldName="Species.Setosa.or.Versicolor",
                         formulaText="isIn(Species,'setosa','versicolor')")

The irisBox$data data frame now contains the new feature:

kable(head(irisBox$data,3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species.Setosa.or.Versicolor
5.1 3.5 1.4 0.2 setosa TRUE
4.9 3.0 1.4 0.2 setosa TRUE
4.7 3.2 1.3 0.2 setosa TRUE

Create a linear model and view the corresponding PMML for the function:

fit <- lm(Petal.Width ~ Species.Setosa.or.Versicolor, data=irisBox$data)
fit_pmml <- pmml(fit, transform=irisBox)
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#>  <DerivedField name="Species.Setosa.or.Versicolor" dataType="double" optype="continuous">
#> <Apply function="isIn">
#>   <FieldRef field="Species"/>
#>   <Constant dataType="string">setosa</Constant>
#>   <Constant dataType="string">versicolor</Constant>
#> </Apply> 
#>  </DerivedField>
#> </LocalTransformations>

PMML functions not supported by FunctionXform - another example

As another example, let’s use R’s mean function to create a new feature. PMML has a built-in avg, so we will define an R function with this name.

avg <- function(...) {
  dots <- c(...)
  return(mean(dots))
}

Now use this function to take an average of several other features and combine with another field:

irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length,Sepal.Width",
                         newFieldName="Length.Average.Ratio",
                         formulaText="avg(Sepal.Length,Petal.Length)/Sepal.Width")

The irisBox$data data frame now contains the new feature:

kable(head(irisBox$data,3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length.Average.Ratio
5.1 3.5 1.4 0.2 setosa 0.9285714
4.9 3.0 1.4 0.2 setosa 1.0500000
4.7 3.2 1.3 0.2 setosa 0.9375000

Create a simple linear model and view the corresponding PMML for the function:

fit <- lm(Petal.Width ~ Length.Average.Ratio, data=irisBox$data)
fit_pmml <- pmml(fit, transform=irisBox)
fit_pmml[[3]][[3]]
#> <LocalTransformations>
#>  <DerivedField name="Length.Average.Ratio" dataType="double" optype="continuous">
#> <Apply function="/">
#>   <Apply function="avg">
#>     <FieldRef field="Sepal.Length"/>
#>     <FieldRef field="Petal.Length"/>
#>   </Apply>
#>   <FieldRef field="Sepal.Width"/>
#> </Apply> 
#>  </DerivedField>
#> </LocalTransformations>

In the PMML, avg will be recognized as a valid function.

PMML for arbitrary functions

The function functionToPMML (part of the pmml package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values.

As long as the expression passed to the function is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R. Variables in the expression are always assumed to be field names and are not substituted. That is, even if x has a value in the R environment, the resulting expression will still use x.

functionToPMML("1 + 2")
#> <Apply function="+">
#>   <Constant dataType="double">1</Constant>
#>   <Constant dataType="double">2</Constant>
#> </Apply>

x <- 3
functionToPMML("foo(bar(x * y))")
#> <Apply function="foo">
#>   <Apply function="bar">
#>     <Apply function="*">
#>       <FieldRef field="x"/>
#>       <FieldRef field="y"/>
#>     </Apply>
#>   </Apply>
#> </Apply>
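The same principle, parsing an expression without evaluating its names, can be seen in base R itself. The sketch below uses parse, all.vars, and all.names; it illustrates the general idea and is not a description of how functionToPMML is implemented.

```r
x <- 3  # has a value in the session, but parsing never looks it up

expr <- parse(text = "foo(bar(x * y))")

# Variable names appear as symbols, not as their values.
all.vars(expr)
#> [1] "x" "y"

# foo and bar need not exist for the expression to parse.
setdiff(all.names(expr), all.vars(expr))
#> [1] "foo" "bar" "*"
```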

More notes on functions

There are several limitations to parsing expressions in FunctionXform.

Each transformation operates on one data row at a time. For example, it is not possible to compute the mean of an entire feature column in FunctionXform.
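If a column-level statistic is needed, one option is to compute it in R beforehand and embed the result in the expression as a constant. The sketch below builds such an expression string in base R; passing it as formulaText is an assumption that has not been verified against FunctionXform here.

```r
data(iris)

# Column-level statistic, computed once in R (not inside the transformation).
sepal_length_mean <- mean(iris$Sepal.Length)

# Embed the number as a literal so the expression needs only row-level values.
formula_text <- sprintf("Sepal.Length / %.6f", sepal_length_mean)
formula_text
#> [1] "Sepal.Length / 5.843333"
```

The constant is fixed at the time the expression is built, so it reflects the training data only.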

An expression such as foo(x) is treated as a function foo with argument x. Consequently, passing in an R vector c(1,2,3) will produce PMML where c is a function and 1,2,3 are the arguments:

functionToPMML("c(1,2,3)")
#> <Apply function="c">
#>   <Constant dataType="double">1</Constant>
#>   <Constant dataType="double">2</Constant>
#>   <Constant dataType="double">3</Constant>
#> </Apply>

We can also see what happens when passing an na.rm argument to prod, as mentioned above:

functionToPMML("prod(1,2,na.rm=FALSE)") #produces incorrect PMML
#> <Apply function="product">
#>   <Constant dataType="double">1</Constant>
#>   <Constant dataType="double">2</Constant>
#>   <FieldRef field="FALSE"/>
#> </Apply>
functionToPMML("prod(1,2)") #produces correct PMML
#> <Apply function="product">
#>   <Constant dataType="double">1</Constant>
#>   <Constant dataType="double">2</Constant>
#> </Apply>

Additionally, passing in a vector to prod produces incorrect PMML:

prod(c(1,2,3))
#> [1] 6
functionToPMML("prod(c(1,2,3))")
#> <Apply function="product">
#>   <Apply function="c">
#>     <Constant dataType="double">1</Constant>
#>     <Constant dataType="double">2</Constant>
#>     <Constant dataType="double">3</Constant>
#>   </Apply>
#> </Apply>
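A way around this, sketched in base R, is to list the values as separate arguments, which is the form that produced correct PMML for prod(1,2) above. The PMML output for the three-argument string is not shown because it was not generated here.

```r
# Both calls give the same value in R ...
prod(c(1, 2, 3))
#> [1] 6
prod(1, 2, 3)
#> [1] 6

# ... but only the second form keeps c() out of the expression text.
expression_text <- "prod(1,2,3)"
"c" %in% all.names(parse(text = expression_text))
#> [1] FALSE
```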

More examples of functions

The following are additional examples of PMML produced from R expressions.

Extra parentheses:

functionToPMML("pmmlT(((1+2))*(x))")
#> <Apply function="pmmlT">
#>   <Apply function="*">
#>     <Apply function="+">
#>       <Constant dataType="double">1</Constant>
#>       <Constant dataType="double">2</Constant>
#>     </Apply>
#>     <FieldRef field="x"/>
#>   </Apply>
#> </Apply>

If-else expressions:

functionToPMML("if(a<2) {x+3} else if (a>4) {4} else {5}")
#> <Apply function="if">
#>   <Apply function="lessThan">
#>     <FieldRef field="a"/>
#>     <Constant dataType="double">2</Constant>
#>   </Apply>
#>   <Apply function="+">
#>     <FieldRef field="x"/>
#>     <Constant dataType="double">3</Constant>
#>   </Apply>
#>   <Apply function="if">
#>     <Apply function="greaterThan">
#>       <FieldRef field="a"/>
#>       <Constant dataType="double">4</Constant>
#>     </Apply>
#>     <Constant dataType="double">4</Constant>
#>     <Constant dataType="double">5</Constant>
#>   </Apply>
#> </Apply>

References