Create XML from Excel for CRS reporting

make use of Element tree in lxml and Data frame in Pandas

Financial institutions are obligated under the Inland Revenue Ordinance to furnish return reporting the required information of reportable accounts to the Department in a specified format on an annual basis.

The IT department is responsible for developing a program which converts XML from Excel based on the CRS XML Schema. There are several steps to make ready this report.

1. Preparation

Dive into the XML scheme and original Excel file (contain what columns and the structure of account information)

First of all, we must learn this specific XML scheme and the structure of existing data by using excel as pictures below:

dev-source.png

Open the Developer option in Excel

open_scheme.png

Import scheme and display the hierarchy on the right side

2. Programming language and packages

Python definitely was the first one which flash out in my brain, for Python has three methods to parse XML: SAX, DOM and ElementTree. Because of simple and fast, Element tree has been utilized to create data in XML scheme.

The XML output should look like th below

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<crs:AEOI_Report xmlns:cfc="http://www.ird.gov.hk/AEOI/aeoitypes/v2" xmlns:crs="http://www.ird.gov.hk/AEOI/crs/v2" version="2.0">
  <crs:MessageSpec>
    <crs:AeoiId>AL07730</crs:AeoiId>
    <crs:FIName>Newton Futures Limited</crs:FIName>
    <crs:MessageRefId>2020AL055302021052617421300</crs:MessageRefId>
    <crs:MessageTypeIndic>CRS701</crs:MessageTypeIndic>
    <crs:ReturnYear>2020</crs:ReturnYear>
    <crs:Timestamp>2021-05-26T17:42:13</crs:Timestamp>
  </crs:MessageSpec>
  <crs:CrsBody>
    <crs:ReportingGroup>
      <crs:AccountReport>
        <crs:DocSpec>
          <crs:DocTypeIndic>OECD1</crs:DocTypeIndic>
          <crs:DocRefId>45282e59-0734-4034-b1a1-bd13</crs:DocRefId>
        </crs:DocSpec>
        <crs:AccountNumber>5391537</crs:AccountNumber>
        <crs:AccountHolder>
          <crs:Individual>
            <crs:ResCountryCode>CN</crs:ResCountryCode>
            <crs:TIN>37246695862</crs:TIN>
            <crs:Name>
              <crs:FirstName>YEPING </crs:FirstName>
              <crs:LastName>ZHAO</crs:LastName>
            </crs:Name>
            <crs:Address>
              <cfc:CountryCode>CN</cfc:CountryCode>
              <cfc:AddressFree>
                <cfc:Line>中國上海市賈秀路</cfc:Line>
                <cfc:Line>地球小區研究院</cfc:Line>
                <cfc:Line>郵編: 097750</cfc:Line>
              </cfc:AddressFree>
            </crs:Address>
            <crs:BirthInfo>
              <crs:BirthDate>1991-10-23</crs:BirthDate>
            </crs:BirthInfo>
          </crs:Individual>
        </crs:AccountHolder>
        <crs:AccountBalance currCode="HKD">0.00</crs:AccountBalance>
      </crs:AccountReport>
    </crs:CrsBody>
</crs:AEOI_Report>

3. Walk through the program

First of all, source file was read from Excel.

import pandas as pd
xls = pd.ExcelFile('wxt.xlsx')
df_mess_head  = pd.read_excel(xls, "Message Header")
df_account_holder = pd.read_excel(xls, "Account Holder", skiprows=[0,2])
df_tin = pd.read_excel(xls, "Account Holder Country & TIN", skiprows=[0,2])
df_account_balance = pd.read_excel(xls, "Account Balance", skiprows=[0,2])
df_payment = pd.read_excel(xls, "Payment", skiprows=[0,2])

You can notice the prefix tag like 'crs:' and 'cfc:'. The first challenge was to set specific namespace in line with this scheme. The code below indicate how to put the namespace in right .

from lxml import etree as et
#begin to construct xml file 
MY_NAMESPACES = {'crs': 'http://www.ird.gov.hk/AEOI/crs/v2', 'cfc': 'http://www.ird.gov.hk/AEOI/aeoitypes/v2'}
root_tags = et.Element('{%s}AEOI_Report' % MY_NAMESPACES['crs'], nsmap=MY_NAMESPACES, version='2.0' )

These are the tag names for each row.

head = et.SubElement(root_tags, '{%s}MessageSpec' % MY_NAMESPACES['crs'])
head_column_1 = et.SubElement(head, '{%s}AeoiId' % MY_NAMESPACES['crs'])
head_column_2 = et.SubElement(head, '{%s}FIName' % MY_NAMESPACES['crs'])
head_column_3 = et.SubElement(head, '{%s}MessageRefId' % MY_NAMESPACES['crs'])
head_column_4 = et.SubElement(head, '{%s}MessageTypeIndic' % MY_NAMESPACES['crs'])
head_column_5 = et.SubElement(head, '{%s}ReturnYear' % MY_NAMESPACES['crs'])
head_column_6 = et.SubElement(head, '{%s}Timestamp' % MY_NAMESPACES['crs'])

This section outputs the data to an xml file

head_column_1.text = str(df_mess_head.iat[2,1])
head_column_2.text = str(df_mess_head.iat[2,2])
head_column_3.text = str(df_mess_head.iat[2,5]) + str(df_mess_head.iat[2,1]) + now.strftime('%Y%m%d%H%M%S') +'00'
head_column_4.text = 'CRS701'
head_column_5.text = str(df_mess_head.iat[2,5])
head_column_6.text = now.strftime('%Y-%m-%dT%H:%M:%S')

Lastly, the program is to create XML file and populate it with the data from the source file.

tree = et.ElementTree(root_tags)
#this just formats in a way that the XML is readable
et.indent(tree, space='\t', level=0)
tree.write('ireFrom.xml', encoding = 'utf-8', xml_declaration=True)

Reference

1. https://dataanalyticsireland.ie/2021/10/04/how-to-create-an-xml-file-from-excel-using-python/#comment-4444