`library(anticlust)`

In this vignette I explore two ways to incorporate categorical variables with anticlustering. The main function of `anticlust` is `anticlustering()`, and it has an argument `categories`. It can be used easily enough: We just pass the numeric variables as first argument (`x`) and our categorical variable(s) to `categories`. I will use the penguin data set from the `palmerpenguins` package to illustrate the usage:

```
library(palmerpenguins)
# First exclude cases with missing values
df <- na.omit(penguins)
head(df)
#> # A tibble: 6 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen 36.7 19.3 193 3450
#> 5 Adelie Torgersen 39.3 20.6 190 3650
#> 6 Adelie Torgersen 38.9 17.8 181 3625
#> # ℹ 2 more variables: sex <fct>, year <int>
nrow(df)
#> [1] 333
```

In the data set, each row represents a penguin, and the data set has four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) and several categorical variables (species, island, sex) as descriptions of the penguins.

Let’s call `anticlustering()` to divide the 333 penguins into 3 groups. We use the four numeric variables as first argument (i.e., the anticlustering objective is computed on the basis of the numeric variables), and the penguins’ sex as categorical variable:

```
numeric_vars <- df[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df$sex
)
```

Let’s check out how well our categorical variables are balanced:

```
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
```

A perfect split! Similarly, we could use the species as categorical variable:

```
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df$species
)
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40
```

As good as it could be! Now, let’s use both categorical variables at the same time:

```
groups <- anticlustering(
  numeric_vars,
  K = 3,
  categories = df[, c("species", "sex")]
)
table(groups, df$sex)
#>
#> groups female male
#> 1 54 57
#> 2 56 55
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 49 22 40
#> 2 49 23 39
#> 3 48 23 40
```

The results for the sex variable are worse than previously, when we only considered one variable at a time. This is because when using multiple variables with the `categories` argument, all columns are “merged” into a single column, and each combination of sex / species is treated as a separate category. Some information on the original variables is lost, and the results may become less optimal (while still being pretty okay here). Alas, using only the `categories` argument, we cannot improve this balancing even if a better split with regard to both categorical variables would be possible.
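To see what this merging amounts to, here is a minimal base-R sketch on made-up data. The toy data frame is hypothetical, and `interaction()` is used only to illustrate the idea; it is not necessarily how `anticlust` combines the columns internally.

```r
# Hypothetical toy data: two species crossed with two sexes
toy <- data.frame(
  species = factor(c("Adelie", "Adelie", "Gentoo", "Gentoo", "Adelie", "Gentoo")),
  sex     = factor(c("female", "male", "female", "male", "male", "female"))
)

# Merge both columns into one factor: every observed combination
# of species and sex becomes its own category
merged <- interaction(toy$species, toy$sex, drop = TRUE)

nlevels(merged)  # number of combined categories
table(merged)    # counts per combination
```

Balancing the merged factor balances the combinations, but the separate totals per sex and per species are no longer an explicit target.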

K-means anticlustering offers a second possibility to distribute categorical variables evenly between groups. This approach can lead to better results when multiple categorical variables are available, and / or if the group sizes are unequal. To use this approach, we first generate a matrix of the categorical variables in binary representation using the `anticlust` convenience function `categories_to_binary()`.^{1} Because k-means anticlustering optimizes similarity with regard to means, k-means anticlustering applied to this binary matrix will even out the proportion of each category in each group (this is because the mean of a binary variable is the proportion of `1`s in that variable).
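The link between means and proportions can be checked with a few lines of base R. The indicator vector and the group assignment below are made up for illustration only.

```r
# Hypothetical binary indicator: 1 = male, 0 = female
is_male <- c(1, 0, 0, 1, 1, 0, 1, 0)
# Hypothetical assignment of the eight cases to two groups
grp <- c(1, 1, 1, 1, 2, 2, 2, 2)

# Group means of the binary indicator ...
group_means <- tapply(is_male, grp, mean)

# ... are the same numbers as the proportion of males in each group
group_props <- prop.table(table(grp, is_male), margin = 1)[, "1"]

all.equal(as.numeric(group_means), as.numeric(group_props))
```

So making the group means of such an indicator similar is the same as making the category proportions similar.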

```
binary_categories <- categories_to_binary(df[, c("species", "sex")]) # see ?categories_to_binary
head(binary_categories)
#> speciesChinstrap speciesGentoo sexmale
#> 1 0 0 1
#> 2 0 0 0
#> 3 0 0 0
#> 4 0 0 0
#> 5 0 0 1
#> 6 0 0 0
```
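Footnote 1 mentions that `categories_to_binary()` wraps `model.matrix()`. As a rough sketch of what such a dummy coding looks like in base R, assuming default treatment contrasts and a dropped intercept column (the toy data frame is hypothetical, and the package's exact call may differ):

```r
# Hypothetical toy data with two categorical variables
toy <- data.frame(
  species = factor(c("Adelie", "Chinstrap", "Gentoo", "Adelie")),
  sex     = factor(c("male", "female", "female", "male"))
)

# Dummy-code both factors with the default treatment contrasts,
# then drop the intercept column
binary_toy <- model.matrix(~ species + sex, data = toy)[, -1, drop = FALSE]
binary_toy
```

The first level of each factor (here `Adelie` and `female`) serves as reference category and gets no column of its own, which matches the column names in the output above.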

```
groups <- anticlustering(
  binary_categories,
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 48 23 40
#> 2 49 23 39
#> 3 49 22 40
```

The results are quite convincing. In particular, the penguins’ sex is better balanced than previously when we used the argument `categories`. If we have multiple categorical variables and / or unequal-sized groups, it may be useful to try out the k-means optimization version of including categorical variables, instead of (only) using the `categories` argument. If we also wish to ensure that the categorical variables *in their combination* are balanced between groups (i.e., the proportion of the penguins’ sex is roughly the same for each species in each group), we could set the optional argument `use_combinations` of `categories_to_binary()` to `TRUE`:

```
binary_categories <- categories_to_binary(df[, c("species", "sex")], use_combinations = TRUE)
groups <- anticlustering(
  binary_categories,
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 48 23 40
#> 2 49 22 40
#> 3 49 23 39
table(groups, df$sex, df$species)
#> , , = Adelie
#>
#>
#> groups female male
#> 1 23 25
#> 2 25 24
#> 3 25 24
#>
#> , , = Chinstrap
#>
#>
#> groups female male
#> 1 12 11
#> 2 10 12
#> 3 12 11
#>
#> , , = Gentoo
#>
#>
#> groups female male
#> 1 20 20
#> 2 20 20
#> 3 18 21
```

Note that so far we only distributed the categorical variables evenly between groups and did not consider the numeric variables, even though balancing those is usually the reason we call `anticlustering()` in the first place. Fortunately, we can also take the numeric variables into account, and we can accomplish that in two different ways:

- we first optimize similarity with regard to the categorical variable(s) via k-means anticlustering, and then insert the resulting group assignment as a “hard constraint” into `anticlustering()`
- we simultaneously optimize similarity with regard to numeric and categorical variables

We discuss both approaches in the following.

We use the output vector `groups` of the previous call to `anticlustering()` (which convincingly balanced our categorical variables) as input to the `K` argument in an additional call to `anticlustering()`. The `groups` vector is used as the initial group assignment before the anticlustering optimization starts. In this group assignment, the categories are already well balanced. We additionally pass the two categorical variables to `categories`, thus ensuring that the balancing of the categorical variables is never changed throughout the optimization process:^{2}

```
final_groups <- anticlustering(
  numeric_vars,
  K = groups,
  standardize = TRUE,
  method = "local-maximum",
  categories = df[, c("species", "sex")]
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 48 23 40
#> 2 49 22 40
#> 3 49 23 39
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "44.00 (5.50)" "17.16 (1.98)" "201.00 (14.17)" "4208.78 (809.97)"
#> 2 "44.00 (5.50)" "17.17 (2.00)" "200.93 (13.97)" "4206.31 (798.76)"
#> 3 "43.98 (5.45)" "17.16 (1.95)" "200.97 (14.03)" "4206.08 (814.14)"
```

The results are convincing, both with regard to the numeric variables and the categorical variables.

We can simultaneously consider the numeric and categorical variables in the optimization process. Note that this approach only works with the k-means and k-plus objectives, because only k-means adequately deals with the categorical variables (at least when using the approach described here). Using the simultaneous approach, we just pass all variables (representing binary categories and numeric variables) as a single matrix to the first argument of `anticlustering()`. Do not use the `categories` argument here!

```
final_groups <- anticlustering(
  cbind(numeric_vars, binary_categories),
  K = 3,
  standardize = TRUE,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 48 23 40
#> 2 49 22 40
#> 3 49 23 39
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "43.99 (5.29)" "17.16 (2.16)" "200.97 (13.20)" "4207.21 (760.65)"
#> 2 "43.99 (5.77)" "17.16 (1.95)" "200.96 (15.01)" "4206.98 (887.34)"
#> 3 "43.99 (5.39)" "17.17 (1.80)" "200.96 (13.91)" "4206.98 (768.74)"
```

The following code extends the simultaneous optimization approach towards k-plus anticlustering, which ensures that standard deviations as well as means are similar between groups (and not only the means, which is achieved via standard k-means anticlustering):

```
final_groups <- anticlustering(
  cbind(kplus_moment_variables(numeric_vars, T = 2), binary_categories),
  K = 3,
  method = "local-maximum",
  objective = "variance",
  repetitions = 10
)
table(groups, df$sex)
#>
#> groups female male
#> 1 55 56
#> 2 55 56
#> 3 55 56
table(groups, df$species)
#>
#> groups Adelie Chinstrap Gentoo
#> 1 48 23 40
#> 2 49 22 40
#> 3 49 23 39
mean_sd_tab(numeric_vars, final_groups)
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> 1 "44.00 (5.49)" "17.16 (1.98)" "200.98 (14.05)" "4205.41 (807.48)"
#> 2 "43.98 (5.48)" "17.16 (1.97)" "200.96 (14.06)" "4207.21 (807.60)"
#> 3 "44.00 (5.48)" "17.17 (1.98)" "200.95 (14.06)" "4208.56 (807.87)"
```

While we use `objective = "variance"` (indicating that the k-means objective is used), this code actually performs k-plus anticlustering because the first argument takes as input the augmented k-plus variable matrix.^{3} We see that the standard deviations are now also quite evenly matched between groups (unlike with standard k-means anticlustering).
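The idea behind the augmented matrix can be sketched in base R. The snippet below is only an illustration, under the assumption that the second-moment “k-plus” variables are squared deviations from each column mean; see `?kplus_moment_variables` for what the package actually computes (it may, for example, also standardize the columns). The data are simulated.

```r
set.seed(1)
# Simulated numeric data with two variables
x <- cbind(a = rnorm(20), b = rnorm(20))

# Assumed form of the extra variables: squared deviation from the column mean
sq_dev <- sweep(x, 2, colMeans(x))^2
colnames(sq_dev) <- paste0(colnames(x), "_sq")

# Augmented matrix: original variables plus the squared deviations
augmented <- cbind(x, sq_dev)

# The mean of a squared-deviation column is (n - 1) / n times the variance
n <- nrow(x)
all.equal(colMeans(sq_dev), apply(x, 2, var) * (n - 1) / n,
          check.attributes = FALSE)
```

Because the mean of each extra column tracks the variance of the original variable, a criterion that only evens out means will, on the augmented matrix, also even out the spread.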

In the end: You should try out the different approaches for dealing with categorical variables and see which one works best for you!

^{1} Internally, `categories_to_binary()` is just a thin wrapper around the base `R` function `model.matrix()`.

^{2} Only elements that have the same value in `categories` are exchanged between clusters throughout the optimization algorithm, so the initial balancing of the categories is never changed when the algorithm runs.

^{3} This is how k-plus anticlustering actually works: It reuses the k-means criterion but uses additional “k-plus” variables as input. More information on the k-plus approach is given in the documentation: `?kplus_moment_variables` and `?kplus_anticlustering`.