Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the `freq` function, you can create univariate frequency tables. Multiple variables will be pasted into one variable, which forces a univariate distribution. We take the `septic_patients` dataset (included in this AMR package) as an example.

To quickly review the content of a single variable, you can select this variable in various ways. Let’s say we want to get the frequencies of the `gender` variable of the `septic_patients` dataset:

**Frequency table of gender from a data.frame (2,000 x 49)**

Class: `character` (`character`)

Length: 2,000 (of which NA: 0 = 0.00%)

Unique: 2

Shortest: 1

Longest: 1

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | M | 1,031 | 51.6% | 1,031 | 51.6% |
| 2 | F | 969 | 48.5% | 2,000 | 100.0% |

This immediately shows the class of the variable, its length and availability (i.e. the number of `NA` values), the number of unique values and (most importantly) that among septic patients men are more prevalent than women.
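To make the table columns concrete, here is a minimal base R sketch of how counts, percentages and their cumulative versions can be derived. It uses a small hypothetical vector rather than `septic_patients`, and it is not the implementation used by `freq`.

```r
# Hypothetical toy data, not the septic_patients dataset
gender <- c("M", "F", "M", "M", "F", "M")

# Count each unique value and sort by descending count
counts <- sort(table(gender), decreasing = TRUE)

# Build the frequency table column by column
freq_tbl <- data.frame(
  item = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)
freq_tbl$percent <- freq_tbl$count / sum(freq_tbl$count)
freq_tbl$cum_count <- cumsum(freq_tbl$count)
freq_tbl$cum_percent <- cumsum(freq_tbl$percent)

print(freq_tbl)
```

The cumulative columns depend on the row order, which is why sorting happens before they are computed.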

Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.
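The idea of pasting variables together can be sketched in base R as follows. The data frame `isolates` is hypothetical, and joining the values with a space is an assumption based on the output shown in this document.

```r
# Hypothetical toy data with two variables
isolates <- data.frame(
  genus = c("Escherichia", "Staphylococcus", "Escherichia", "Staphylococcus"),
  species = c("coli", "aureus", "coli", "epidermidis"),
  stringsAsFactors = FALSE
)

# Paste both variables into one, so every row becomes a single value
combined <- paste(isolates$genus, isolates$species)

# The result can be counted like any other single variable
combined_counts <- sort(table(combined), decreasing = TRUE)
print(combined_counts)
```

Because every combination becomes one string, the result is still a univariate frequency table.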

For illustration, we could add some more variables to the `septic_patients` dataset to learn about bacterial properties:

Now all variables of the `microorganisms` dataset have been joined to the `septic_patients` dataset. The `microorganisms` dataset consists of the following variables:

```
colnames(microorganisms)
# [1] "mo" "col_id" "fullname" "kingdom" "phylum"
# [6] "class" "order" "family" "genus" "species"
# [11] "subspecies" "rank" "ref" "species_id" "source"
# [16] "prevalence"
```

If we compare the dimensions of the old and the new dataset, we can see that 15 of these variables were added (49 versus 64 columns); the `mo` variable, presumably the join key, was already present:

So now the `genus` and `species` variables are available. A frequency table of these combined variables can be created like this:

**Frequency table of genus and species from a data.frame (2,000 x 64)**

Columns: 2

Length: 2,000 (of which NA: 0 = 0.00%)

Unique: 95

Shortest: 8

Longest: 34

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | Escherichia coli | 467 | 23.4% | 467 | 23.4% |
| 2 | Staphylococcus coagulase-negative | 313 | 15.7% | 780 | 39.0% |
| 3 | Staphylococcus aureus | 235 | 11.8% | 1,015 | 50.7% |
| 4 | Staphylococcus epidermidis | 174 | 8.7% | 1,189 | 59.5% |
| 5 | Streptococcus pneumoniae | 117 | 5.9% | 1,306 | 65.3% |
| 6 | Staphylococcus hominis | 81 | 4.1% | 1,387 | 69.4% |
| 7 | Klebsiella pneumoniae | 58 | 2.9% | 1,445 | 72.3% |
| 8 | Enterococcus faecalis | 39 | 2.0% | 1,484 | 74.2% |
| 9 | Proteus mirabilis | 36 | 1.8% | 1,520 | 76.0% |
| 10 | Pseudomonas aeruginosa | 30 | 1.5% | 1,550 | 77.5% |
| 11 | Serratia marcescens | 25 | 1.3% | 1,575 | 78.8% |
| 12 | Enterobacter cloacae | 23 | 1.2% | 1,598 | 79.9% |
| 13 | Enterococcus faecium | 21 | 1.1% | 1,619 | 81.0% |
| 14 | Staphylococcus capitis | 21 | 1.1% | 1,640 | 82.0% |
| 15 | Bacteroides fragilis | 20 | 1.0% | 1,660 | 83.0% |

(omitted 80 entries, n = 340 [17.0%])

Frequency tables can be created from any input.

For numeric values (integers, doubles, etc.), additional descriptive statistics are calculated and shown in the header:

```
# get age distribution of unique patients
library(AMR)
library(dplyr) # provides %>% and distinct()

septic_patients %>%
  distinct(patient_id, .keep_all = TRUE) %>%
  freq(age, nmax = 5, header = TRUE)
```

**Frequency table of age from a data.frame (981 x 49)**

Class: `numeric` (`numeric`)

Length: 981 (of which NA: 0 = 0.00%)

Unique: 73

Mean: 71.08

SD: 14.05 (CV: 0.20, MAD: 13.34)

Five-Num: 14 | 63 | 74 | 82 | 97 (IQR: 19, CQV: 0.13)

Outliers: 15 (unique count: 12)

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | 83 | 44 | 4.5% | 44 | 4.5% |
| 2 | 76 | 43 | 4.4% | 87 | 8.9% |
| 3 | 75 | 37 | 3.8% | 124 | 12.6% |
| 4 | 82 | 33 | 3.4% | 157 | 16.0% |
| 5 | 78 | 32 | 3.3% | 189 | 19.3% |

(omitted 68 entries, n = 792 [80.7%])

So the following properties are determined, where `NA` values are always ignored:

- **Mean**
- **Standard deviation**
- **Coefficient of variation** (CV), the standard deviation divided by the mean
- **Five numbers of Tukey** (min, Q1, median, Q3, max)
- **Coefficient of quartile variation** (CQV, sometimes called coefficient of dispersion), calculated as (Q3 - Q1) / (Q3 + Q1) using `quantile` with `type = 6` as quantile algorithm to comply with SPSS standards
- **Outliers** (total count and unique count)

So for example, the above frequency table quickly shows the median age of patients being 74.
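The statistics listed above can be reproduced with base R functions. The sketch below uses a hypothetical `age` vector. Using `boxplot.stats` to find outliers is an assumption for illustration, since the text does not say how the package detects them.

```r
# Hypothetical ages, including one missing value
age <- c(14, 63, 70, 74, 78, 82, 97, NA)

# NA values are ignored for all statistics
x <- age[!is.na(age)]

# Coefficient of variation: standard deviation divided by the mean
cv <- sd(x) / mean(x)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
five_num <- fivenum(x)

# Coefficient of quartile variation with quantile algorithm type 6
q <- quantile(x, probs = c(0.25, 0.75), type = 6, names = FALSE)
cqv <- (q[2] - q[1]) / (q[2] + q[1])

# Outliers according to the boxplot rule (assumption, see above)
outliers <- boxplot.stats(x)$out

print(list(cv = cv, five_num = five_num, cqv = cqv, outliers = outliers))
```

With the quartiles from the header above (63 and 82), the same formula gives (82 - 63) / (82 + 63), which rounds to the reported CQV of 0.13.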

To sort frequencies of factors on factor level instead of item count, use the `sort.count` parameter.

`sort.count` is `TRUE` by default. Compare this default behaviour…

**Frequency table of hospital_id from a data.frame (2,000 x 49)**

Class: `factor` (`numeric`)

Length: 2,000 (of which NA: 0 = 0.00%)

Levels: 4: `A`, `B`, `C`, `D`

Unique: 4

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | D | 762 | 38.1% | 762 | 38.1% |
| 2 | B | 663 | 33.2% | 1,425 | 71.3% |
| 3 | A | 321 | 16.1% | 1,746 | 87.3% |
| 4 | C | 254 | 12.7% | 2,000 | 100.0% |

… with this, where items are now sorted on factor level:

**Frequency table of hospital_id from a data.frame (2,000 x 49)**

Class: `factor` (`numeric`)

Length: 2,000 (of which NA: 0 = 0.00%)

Levels: 4: `A`, `B`, `C`, `D`

Unique: 4

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | A | 321 | 16.1% | 321 | 16.1% |
| 2 | B | 663 | 33.2% | 984 | 49.2% |
| 3 | C | 254 | 12.7% | 1,238 | 61.9% |
| 4 | D | 762 | 38.1% | 2,000 | 100.0% |
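The difference between the two orderings can be illustrated in base R, where `table` on a factor returns counts in level order and `sort` reorders them by count. The `hospital_id` vector below is hypothetical, and the sketch is not how `freq` is implemented.

```r
# Hypothetical factor with four levels
hospital_id <- factor(
  c("D", "B", "D", "A", "B", "D", "C"),
  levels = c("A", "B", "C", "D")
)

# Counts in the order of the factor levels
by_level <- table(hospital_id)

# The same counts sorted from most to least frequent
by_count <- sort(by_level, decreasing = TRUE)

print(by_level)
print(by_count)
```

Only the row order differs between the two results; the counts per item are identical.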

All classes of a variable are printed in the header (the header itself defaults to `FALSE` when using markdown, as in this document). Variables with the new `rsi` class of this AMR package are actually ordered factors and have three classes (look at `Class` in the header):

**Frequency table of amox from a data.frame (2,000 x 49)**

Class: `factor` > `ordered` > `rsi` (`numeric`)

Length: 2,000 (of which NA: 771 = 38.55%)

Levels: 3: `S` < `I` < `R`

Unique: 3

Drug: Amoxicillin

%IR: 55.82%

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | R | 683 | 55.6% | 683 | 55.6% |
| 2 | S | 543 | 44.2% | 1,226 | 99.8% |
| 3 | I | 3 | 0.2% | 1,229 | 100.0% |
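The `%IR` value in the header appears to be the share of `I` and `R` results among all non-missing results. This is an inference from the numbers shown, not something the text states. The counts from the table above can be used to check it:

```r
# Counts copied from the frequency table of amox above
amox_counts <- c(R = 683, S = 543, I = 3)

# Share of I and R among all non-missing results, as a percentage
pct_ir <- (amox_counts[["I"]] + amox_counts[["R"]]) / sum(amox_counts) * 100

print(round(pct_ir, 2))
```

The denominator is 1,229 rather than 2,000, because the 771 `NA` values are left out.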

Frequencies of dates will show the oldest and newest date in the data, and the number of days between them:

**Frequency table of date from a data.frame (2,000 x 49)**

Class: `Date` (`numeric`)

Length: 2,000 (of which NA: 0 = 0.00%)

Unique: 1,140

Oldest: 2 January 2002

Newest: 28 December 2017 (+5,839)

Median: 31 July 2009 (47.39%)

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | 2016-05-21 | 10 | 0.5% | 10 | 0.5% |
| 2 | 2004-11-15 | 8 | 0.4% | 18 | 0.9% |
| 3 | 2013-07-29 | 8 | 0.4% | 26 | 1.3% |
| 4 | 2017-06-12 | 8 | 0.4% | 34 | 1.7% |
| 5 | 2015-11-19 | 7 | 0.4% | 41 | 2.1% |

(omitted 1,135 entries, n = 1,959 [98.0%])
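The date statistics in the header can be checked with base R date arithmetic. The sketch below uses only the three dates printed in the header, not the full data. Reading the percentage next to the median as its position within the date range is an inference from the numbers, not a documented definition.

```r
# The three dates printed in the header above
oldest <- as.Date("2002-01-02")
newest <- as.Date("2017-12-28")
median_date <- as.Date("2009-07-31")

# Number of days between the oldest and the newest date
span_days <- as.numeric(difftime(newest, oldest, units = "days"))

# Position of the median within that range, as a percentage (assumed meaning)
median_position <- as.numeric(difftime(median_date, oldest, units = "days")) /
  span_days * 100

print(span_days)
print(round(median_position, 2))
```

Both results agree with the header: a span of 5,839 days and a median at 47.39% of that span.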

A frequency table is actually a regular `data.frame`, with the exception that it contains an additional class.

`[1] "frequency_tbl" "data.frame"`

Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation:

`[1] 74 5`
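How an additional class can change printing while the object keeps all of its rows can be sketched with a small S3 example. The class name `my_freq_tbl` and its print method are hypothetical and only illustrate the mechanism. They are not the code behind `frequency_tbl`.

```r
# A regular data.frame with 20 rows
full_tbl <- data.frame(
  item = letters[1:20],
  count = 20:1,
  stringsAsFactors = FALSE
)

# Add a hypothetical extra class in front of "data.frame"
class(full_tbl) <- c("my_freq_tbl", "data.frame")

# Print method that shows only the first nmax rows
print.my_freq_tbl <- function(x, nmax = 5, ...) {
  shown <- head(as.data.frame(x), nmax)
  print(shown, ...)
  omitted <- nrow(x) - nrow(shown)
  if (omitted > 0) cat("(omitted", omitted, "entries)\n")
  invisible(x)
}

print(full_tbl) # prints five rows and a note about the omitted ones
dim(full_tbl)   # the object itself still holds every row
```

Since the object still inherits from `data.frame`, it can be subset, filtered or exported like any other data frame.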

`na.rm`

With the `na.rm` parameter, you can include `NA` values in the frequency table. It defaults to `TRUE`, which leaves them out of the table, although the number of `NA` values is always shown in the header:

**Frequency table of amox from a data.frame (2,000 x 49)**

Class: `factor` > `ordered` > `rsi` (`numeric`)

Length: 2,000 (of which NA: 771 = 38.55%)

Levels: 3: `S` < `I` < `R`

Unique: 4

Drug: Amoxicillin

%IR: 55.82%

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | (NA) | 771 | 38.6% | 771 | 38.6% |
| 2 | R | 683 | 34.2% | 1,454 | 72.7% |
| 3 | S | 543 | 27.2% | 1,997 | 99.9% |
| 4 | I | 3 | 0.2% | 2,000 | 100.0% |
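The effect of including or excluding missing values can be mimicked in base R with the `useNA` argument of `table`. The `amox` vector below is hypothetical, and the sketch is not the package's implementation.

```r
# Hypothetical ordered factor with two missing values
amox <- factor(
  c("R", "S", NA, "R", "I", NA, "S", "R"),
  levels = c("S", "I", "R"),
  ordered = TRUE
)

# Default: NA values are left out of the counts
without_na <- table(amox)

# NA values get their own row
with_na <- table(amox, useNA = "always")

# The number of missing values can be reported separately either way
n_missing <- sum(is.na(amox))

print(without_na)
print(with_na)
```

Including `NA` changes the denominator, so all percentages shift, as the two `amox` tables in this document show.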

`row.names`

The default frequency table shows row indices. To remove them, use `row.names = FALSE`:

**Frequency table of hospital_id from a data.frame (2,000 x 49)**

Class: `factor` (`numeric`)

Length: 2,000 (of which NA: 0 = 0.00%)

Levels: 4: `A`, `B`, `C`, `D`

Unique: 4

| Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|
| D | 762 | 38.1% | 762 | 38.1% |
| B | 663 | 33.2% | 1,425 | 71.3% |
| A | 321 | 16.1% | 1,746 | 87.3% |
| C | 254 | 12.7% | 2,000 | 100.0% |

`markdown`

The `markdown` parameter is `TRUE` by default in non-interactive sessions, such as reports created with R Markdown. In that case all rows are always printed, unless `nmax` is set.

**Frequency table of hospital_id from a data.frame (2,000 x 49)**

Class: `factor` (`numeric`)

Length: 2,000 (of which NA: 0 = 0.00%)

Levels: 4: `A`, `B`, `C`, `D`

Unique: 4

| | Item | Count | Percent | Cum. Count | Cum. Percent |
|---|---|---|---|---|---|
| 1 | D | 762 | 38.1% | 762 | 38.1% |
| 2 | B | 663 | 33.2% | 1,425 | 71.3% |
| 3 | A | 321 | 16.1% | 1,746 | 87.3% |
| 4 | C | 254 | 12.7% | 2,000 | 100.0% |