How to Import XML into Tableau without a Web Connector. Because who knows how they work?

How to Import XML into Tableau without a Web Connector

Because who knows how they work?

by:  
dale  dale  on 11 Jan '21


For this recent article I wrote for Foodie, I downloaded the 2019 and 2020 restaurant licencing data from data.gov.hk to investigate whether there has been a significant decline in registration over the year. The data comes in XML format, which is easy enough to understand when you look at it but is a bit cumbersome when it comes to importing it into Tableau.


<DATA xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<DEPARTMENT>Food and Environmental Hygiene Department</DEPARTMENT>

<GENERATION_DATE>2019-12-30</GENERATION_DATE>

<LINK>http://www.fehd.gov.hk</LINK>

<TYPE_CODE>

<CODE ID="RL">General Restaurant Licence</CODE>

<CODE ID="RR">Light Refreshment Restaurant Licence</CODE>

<CODE ID="MR">Marine Restaurant Licence</CODE>

</TYPE_CODE>

...

</DATA>

The XML data from fehd.gov.hk


It is apparently possible to use Web Connectors to read XML data into Tableau. I’ve played with Web Connectors before, and the payback for my time was limited, so I was keen to avoid that path. So I turned to Python to parse and extract the data I needed into a CSV file.

Using this page as a guide, I pip-installed the library lxml and iteratively determined how to access each element in the file, which I wrote to a CSV file, one restaurant per line.


#!/usr/bin/python

# -*- coding: utf-8 -*-

from lxml import etree


xml2019 = '2019 restaurant data.xml'

xml2020 = '2020 restaurant data.xml'


myfile = open('restaurants.csv', 'a')

myfile.write("TYPE,DIST,LICNO,NAME,INFO,EXPDATE,YEAR" + "\n")

events = ("start", "end")

tags = ("TYPE", "DIST", "LICNO", "SS", "ADR", "INFO", "EXPDATE")


context = etree.iterparse(xml2019, events=events)

for action, elem in context:

if (elem.tag in tags) and (action in ("start")):

print("%s, %s" % (elem.tag, elem.text))

if elem.tag != 'ADR':

string = str(elem.text)

myfile.write(string.replace(',', '') + ",")

if elem.tag == 'EXPDATE':

myfile.write("2019,\n")


context = etree.iterparse(xml2020, events=events)

for action, elem in context:

if (elem.tag in tags) and (action in ("start")):

print("%s, %s" % (elem.tag, elem.text))

if elem.tag != 'ADR':

string = str(elem.text)

myfile.write(string.replace(',', '') + ",")

if elem.tag == 'EXPDATE':

myfile.write("2020,\n")


myfile.close()


Although not very dry, this code writes all the 2019 and 2020 restaurant registrations to one CSV file with an additional distinguishing field of “year” and avoids the need to blend data sources in Tableau.

In 2019, there were 15,656 registered restaurants, and in 2020, there were 16,071 – an increase of 415 licences. The odd entry with missing district or restaurant type data was omitted.



The code metadata for districts and restaurant types were manually entered into Tableau as a calculated field, as Tableau can get pretty clunky when connecting more than one data set – especially when one of them is as large as this data set.

Tableau-calculated field for district codes


If you’re interested, you can read the results article here.


dale

dale

IT, burger & beer specialist at Foodie.
Loving sichuan like a drug, hit me up with that peppery numbness