Converting nested XML to dataframe in R - a tidyverse approach

 798 words  |   4-min read  |    XML, R, tidyverse, data manipulation


2021-03-xml-dataframe-r/xml-to-dataframe.png

TL;DR

tidyr (in tidyverse) provides functions unnest_wider and unnest_longer to transform XML data into dataframe quickly, using the same ideology of pivot_wider and pivot_longer in dplyr.


I rarely deal with XML data. However, sometimes the government discloses open data in XML format only. Currently I need to find data about the restaurant licenses in Hong Kong, and FEHD provides an open dataset. Unluckily, the dataset is available in XML format only.

The XML data file looks like something below:

XML of the restaurant license data

The tree structure of the XML is in the below form:

  • DATA
    • DEPARTMENT
    • GENERATION_DATE
    • LINK
    • TYPE_CODE
    • DIST_CODE
    • INFO_CODE
    • LPS
      • LP
        • TYPE
        • DIST
        • LICNO
        • SS
        • ADR
        • INFO
        • EXPDATE

DEPARTMENT, GENERATION_DATE and LINK are metadata. TYPE_CODE, DIST_CODE and INFO_CODE are data dictionary to store the ID used to encode data. Therefore, only the parts wrapped inside the LPS is the “actual” data I need.

Composition of the XML

Transform xml to dataframe with R

library(xml2)
library(tidyverse)

First thing first, load the packages.

Transform xml to list

xml_address = "http://www.fehd.gov.hk/english/licensing/license/text/LP_Restaurants_EN.XML"

restaurant_license_xml = as_list(read_xml(xml_address))

Here read_xml from xml2 is used to read the XML file. The as_list function then turn it into an equivalent R list.

Expand the data to multiple rows by tags

xml_df = tibble::as_tibble(restaurant_license_xml) %>%
  unnest_longer(DATA)

The XML becomes an extremely long list. As the XML is nested into multiple layers, the thing to do is to unnest the first layer. unnest_longer is a function in tidyr which unnest a list the split the values of the list to multiple rows (thus longer).

DATA in the parameter of unnest_longer refers to the DATA tag on the top layer of the xml. When it is unnested, the structure of the dataframe will return the data inside each tag. In addition, a new column named DATA_id is added to indicate which xml tag the data belongs to. Since we are unnesting the first layer, the DATA_id column includes these 7 values:

  • DEPARTMENT
  • GENERATION_DATE
  • LINK
  • TYPE_CODE
  • DIST_CODE
  • INFO_CODE
  • LP

Unnest the xml list into longer form

The data of each LP tag is in a list form like this:

list(
  TYPE = list("MR"),
  DIST = list("15"),
  LICNO = list("3715038319"),
  SS = list("Jumbo"),
  ADR = list("JUMBO FLOATING RESTAURANT, ABERDEEN TYPHOON SHELTER, SHUM WAN, WONG CHUK HANG, SOUTHERN, HONG KONG"), 
  INFO = list("#F#G#H"),
  EXPDATE = list("2021-06-30")
)

Compare to the XML version of the data:

<LP>
  <TYPE>MR</TYPE>
  <DIST>15</DIST>
  <LICNO>3715038319</LICNO>
  <SS>Jumbo</SS>
  <ADR>JUMBO FLOATING RESTAURANT, ABERDEEN TYPHOON SHELTER, SHUM WAN, WONG CHUK HANG, SOUTHERN, HONG KONG</ADR>
  <INFO>#F#G#H</INFO>
  <EXPDATE>2021-06-30</EXPDATE>
</LP>

The structure of the data is thus the same. Still, it is easier to work with list in R.

Expand the data to multiple columns by tags

lp_wider = xml_df %>%
  dplyr::filter(DATA_id == "LP") %>%
  unnest_wider(DATA) 

First, data not in the LP tags are discarded using dplyr::filter. Then, I need to expand the attributes of each license into each column (i.e. a column for TYPE, another column for DIST, etc.). Unlike the step above, this time unnest_wider is used to “expand” the data in a single column to multiple columns (thus wider).

Unnest the xml list into wider form

Unnest the list in each cell

lp_df = lp_wider %>%
  # 1st time unnest to release the 2-dimension list?
  unnest(cols = names(.)) %>%
  # 2nd time to nest the single list in each cell?
  unnest(cols = names(.)) %>%
  # convert data type
  readr::type_convert() 

The data then has to be unnested two times. I do not quite understand why this is required though. Maybe because the values in the cells a in the form of “list in list”/“nested list”? Following are the results after lp_wider is passed through 1. the first unnest and 2. the second unnest. After the first unnest, each column is still a type of list with length of 1. The data are exposed only after the second unnest.

Tibble after the first unnest function

Tibble after the second unnest function

And finally, the xml is converted to a tidy tabular format for further analysis, and we can use write_csv to export the tibble into csv.

The converted tibble

Other approaches

Manual editing and conversion

The quick (though dumb) way to get the nested part of a XML is - manually delete the “outer” part! In the case of this restaurant license XML, I only need the <LPS> part, and all other nodes (I am not sure about this is the correct terminology for XML) could be deleted. After that, I just need to use an online XML to CSV converter for the data format conversion.

Using XML package

The XML package is an older package to transform XML to dataframes.

https://blog.gtwang.org/r/r-xml-package-parsing-and-generating-xml-tutorial/


References

How to convert a partly nested XML to data frame using xml2
https://community.rstudio.com/t/how-to-convert-a-partly-nested-xml-to-data-frame-using-xml2/36705/2

nested XML to data frame in R
https://stackoverflow.com/questions/47254923/nested-xml-to-data-frame-in-r

Converting xml to tibble in R
https://megapteraphile.wordpress.com/2020/03/29/converting-xml-to-tibble-in-r/

The Latest