HTML Tables

Duncan Garmonsway

2019-01-02

This vignette for the unpivotr package demonstrates how to unpivot HTML tables of various kinds.

The example HTML files ship with the package; their paths are returned by system.file("extdata", c("rowspan.html", "colspan.html", "nested.html"), package = "unpivotr").

library(dplyr)
library(rvest)
## Loading required package: xml2
library(htmltools)
library(unpivotr)

Rowspan and colspan examples

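If a table has cells merged across rows or columns (or both), as_cells() does not fill the merged cell's contents across the rows or columns it spans. This differs from other packages, e.g. rvest, which repeat the contents in every position. Instead, as_cells() returns the merged cell once, at its top-left position, and returns NA for the other positions it covers. Likewise, if merged cells leave a table non-rectangular, as_cells() pads the missing cells with NA.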
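The comparison described above can be sketched as follows. This is a minimal illustration rather than the vignette's original code: it assumes that rvest and unpivotr are installed and that rowspan.html is available at the path given earlier. The object names page, filled and cells are chosen here for illustration.

```r
library(rvest)     # also makes read_html() available
library(unpivotr)

# Path to one of the example files that ship with unpivotr
rowspan <- system.file("extdata", "rowspan.html", package = "unpivotr")
page <- read_html(rowspan)

# rvest: a list with one data frame per table; merged cells are filled
filled <- html_table(page)

# unpivotr: a list with one tibble per table, one row per cell position
cells <- as_cells(page)[[1]]

# Positions covered by a merged cell have NA in the html column
cells[is.na(cells$html), c("row", "col")]
```

Because as_cells() keeps the row and col of every position, the extent of a merged cell can still be worked out from the rowspan and colspan attributes in its html string.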

Rowspan

HTML table with rowspan
Header (1:2, 1) Header (1, 2)
cell (2, 2)

Output of rvest's html_table():

## [[1]]
##   Header (1:2, 1) Header (1, 2)
## 1 Header (1:2, 1)   cell (2, 2)

Output of unpivotr's as_cells():

## [[1]]
## # A tibble: 4 x 4
##     row   col data_type html                                    
##   <int> <int> <chr>     <chr>                                   
## 1     1     1 html      "<th rowspan=\"2\">Header (1:2, 1)</th>"
## 2     2     1 html      <NA>                                    
## 3     1     2 html      <th>Header (1, 2)</th>                  
## 4     2     2 html      <td>cell (2, 2)</td>

Colspan

HTML table with colspan
Header (1, 1:2)
cell (2, 1) cell (2, 2)

Output of rvest's html_table():

## [[1]]
##   Header (1, 1:2) Header (1, 1:2)
## 1     cell (2, 1)     cell (2, 2)

Output of unpivotr's as_cells():

## [[1]]
## # A tibble: 4 x 4
##     row   col data_type html                                    
##   <int> <int> <chr>     <chr>                                   
## 1     1     1 html      "<th colspan=\"2\">Header (1, 1:2)</th>"
## 2     2     1 html      <td>cell (2, 1)</td>                    
## 3     1     2 html      <NA>                                    
## 4     2     2 html      <td>cell (2, 2)</td>

Both rowspan and colspan: non-square

HTML table with colspan
Header (1:2, 1:2) Header (2, 3)
cell (3, 1) cell (3, 2) cell (3, 3)

Output of rvest's html_table():

## [[1]]
##   Header (1:2, 1:2) Header (1:2, 1:2) Header (2, 3)
## 1 Header (1:2, 1:2) Header (1:2, 1:2)   cell (3, 1)

Output of unpivotr's as_cells(), padded with NA:

## [[1]]
## # A tibble: 10 x 4
##      row   col data_type html                                              
##    <int> <int> <chr>     <chr>                                             
##  1     1     1 html      "<th colspan=\"2\" rowspan=\"2\">Header (1:2, 1:2…
##  2     2     1 html      <NA>                                              
##  3     1     2 html      <NA>                                              
##  4     2     2 html      <NA>                                              
##  5     1     3 html      <th>Header (2, 3)</th>                            
##  6     2     3 html      <td>cell (3, 1)</td>                              
##  7     1     4 html      <NA>                                              
##  8     2     4 html      <td>cell (3, 2)</td>                              
##  9     1     5 html      <NA>                                              
## 10     2     5 html      <td>cell (3, 3)</td>

Nested example

as_cells() never descends into cells. If a cell contains a nested table, the whole inner table is returned as part of that cell's HTML string. To parse the inner table, read that string as HTML and apply as_cells() to it again.
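A minimal sketch of this two-step parse, assuming rvest and unpivotr are installed and nested.html is available at the path given earlier. The names outer, inner_html and inner are illustrative, and selecting the cell with grepl() is one possible approach rather than the vignette's own method.

```r
library(rvest)     # also makes read_html() available
library(unpivotr)

nested <- system.file("extdata", "nested.html", package = "unpivotr")

# Outer table only: the inner table stays as raw HTML inside its cell
outer <- as_cells(read_html(nested))[[1]]

# Select the cell whose HTML string contains a <table> element
inner_html <- outer$html[grepl("<table", outer$html, fixed = TRUE)]

# Re-read that string as HTML and unpivot the inner table
inner <- as_cells(read_html(inner_html))[[1]]
inner
```

The row and col values in inner are relative to the inner table, so the position of the containing cell in outer has to be kept separately if it matters.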

Nested HTML table
Header (1, 1) Header (1, 2)
cell (2, 1)
Header (2, 2)(1, 1) Header (2, 2)(1, 2)
cell (2, 2)(2, 1) cell (2, 2)(2, 1)

Output of rvest's html_table(), which parses both the outer and the inner table:

## [[1]]
##         Header (1, 1)
## 1         cell (2, 1)
## 2 Header (2, 2)(1, 1)
## 3   cell (2, 2)(2, 1)
##                                                                                                            Header (1, 2)
## 1 Header (2, 2)(1, 1)\n              Header (2, 2)(1, 2)\n            cell (2, 2)(2, 1)\n              cell (2, 2)(2, 1)
## 2                                                                                                    Header (2, 2)(1, 2)
## 3                                                                                                      cell (2, 2)(2, 1)
##                    NA                  NA                NA
## 1 Header (2, 2)(1, 1) Header (2, 2)(1, 2) cell (2, 2)(2, 1)
## 2                <NA>                <NA>              <NA>
## 3                <NA>                <NA>              <NA>
##                  NA
## 1 cell (2, 2)(2, 1)
## 2              <NA>
## 3              <NA>
## 
## [[2]]
##   Header (2, 2)(1, 1) Header (2, 2)(1, 2)
## 1   cell (2, 2)(2, 1)   cell (2, 2)(2, 1)

Output of unpivotr's as_cells() for the outer table:

## # A tibble: 4 x 4
##     row   col data_type html                                               
##   <int> <int> <chr>     <chr>                                              
## 1     1     1 html      <th>Header (1, 1)</th>                             
## 2     2     1 html      <td>cell (2, 1)</td>                               
## 3     1     2 html      <th>Header (1, 2)</th>                             
## 4     2     2 html      "<td>\n          <table>\n<tr>\n<th>Header (2, 2)(…
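
HTML of the cell at row 2, column 2, which contains the inner table:

## [1] "<td>\n          <table>\n<tr>\n<th>Header (2, 2)(1, 1)</th>\n              <th>Header (2, 2)(1, 2)</th>\n            </tr>\n<tr>\n<td>cell (2, 2)(2, 1)</td>\n              <td>cell (2, 2)(2, 1)</td>\n            </tr>\n</table>\n</td>"

Output of as_cells() applied to that cell's HTML:

## [[1]]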
## # A tibble: 4 x 4
##     row   col data_type html                        
##   <int> <int> <chr>     <chr>                       
## 1     1     1 html      <th>Header (2, 2)(1, 1)</th>
## 2     2     1 html      <td>cell (2, 2)(2, 1)</td>  
## 3     1     2 html      <th>Header (2, 2)(1, 2)</th>
## 4     2     2 html      <td>cell (2, 2)(2, 1)</td>

URL example

One motivation for using unpivotr::as_cells() is that it keeps each cell's full HTML rather than only its text, so you can extract whatever part of the HTML you need.

Here, we extract the text and URL of every link in a table. A cell that contains several links yields one row per link.
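One way to do this is sketched below, assuming rvest and unpivotr are installed. The inline table is a hypothetical stand-in for a real page, and cell_urls() is an illustrative helper, not part of either package. It re-reads each cell's HTML string and collects the href attribute of every link.

```r
library(rvest)     # also makes read_html() available
library(unpivotr)

# Hypothetical table: one cell with two links, one cell with no link
page <- read_html('
  <table>
    <tr><td><a href="example1.co.nz">Scraping</a>
            <a href="example2.co.nz">HTML.</a></td></tr>
    <tr><td>no link here</td></tr>
  </table>')

cells <- as_cells(page)[[1]]

# Illustrative helper: all href values found in one cell's HTML string
cell_urls <- function(html) {
  if (is.na(html)) return(NA_character_)
  links <- html_attr(html_nodes(read_html(html), "a"), "href")
  if (length(links) == 0) NA_character_ else links
}

# List column: one character vector of URLs per cell
cells$url <- lapply(cells$html, cell_urls)
urls <- unlist(cells$url)
urls
```

Keeping the URLs in a list column preserves one row per cell. Unnesting that column, for example with tidyr::unnest(), would give one row per link instead.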

HTML table with rowspan
Scraping HTML.
Sweet as? Yeah,
right.

Output of as_cells() with the link text and URL extracted from each cell:

## # A tibble: 8 x 6
##     row   col data_type html                             text   url        
##   <int> <int> <chr>     <chr>                            <chr>  <chr>      
## 1     1     1 html      "<td colspan=\"2\">\n<a href=\"… Scrap… example1.c…
## 2     1     1 html      "<td colspan=\"2\">\n<a href=\"… HTML.  example2.c…
## 3     2     1 html      "<td><a href=\"example3.co.nz\"… Sweet  example3.c…
## 4     1     2 html      <NA>                             <NA>   <NA>       
## 5     2     2 html      "<td><a href=\"example4.co.nz\"… as?    example4.c…
## 6     1     3 html      <NA>                             <NA>   <NA>       
## 7     2     3 html      "<td>\n<a href=\"example5.co.nz… Yeah,  example5.c…
## 8     2     3 html      "<td>\n<a href=\"example5.co.nz… right. http://www…