We used machine learning to calculate the predicted impact of adding a new tax center at any particular postcode.
The impact was measured in terms of the number of tax returns filed and the total, which hopefully allows the ATO to identify the best locations for new Tax Help Centers which benefit the most people.
It's important to note that the model we developed is extremely flexible, and is not at all limited to predicting the effect of adding new Tax Help Centers. There are some 400+ input parameters that can be adjusted. We could, for example, use it to predict changes in taxation due to demographic shifts with respect to age, sex, occupational status, etc.
o calculate the fraction of tax returns filed, we need to know the total number of people who ought to have filed tax returns in each postcode. The ABS data included in
atoabsgovhack2018.csv is general population information, and not specifically about the working population.
We can get the working population in each postcode from GCCPOA G43. However, the data is also paramterised with respect to age and sex.
Marginalise G43 over age, sex, to get working population in each postcode
Append ATO data in
atoabsgovhack2018.csvwith the results from (1)
As postal area approximates postcode, we could perform a join of many datasets seen in
munging/DataMunge.ipynb trivially. There were only ~400 postcodes where tax help centres were built (or at least labeled). We labeled the remaining postcodes as having no tax centre, this allowed the machine to learn from both positive and negative information. We were left with ~2500 rows and ~450 columns, in the analytics that we have presented as Combined ATO and Census we have given all of this information to the machine to learn from. With the analytics labeled ATO only data there are ~40 columns.
Given the mix of continuous and discrete types in the data, and the relatively small size of the sample set (i.e. number of taxed postcodes), we have used a gradient-boosted regression tree (GBRT) method. We have employed the DART tree booster (described in Rashmi Korlakai Vinayak, Ran Gilad-Bachrach. “DART: Dropouts meet Multiple Additive Regression Trees.” JMLR.) which adapts dropout regularisation from deep learning to boosted trees, ameliorating the tendency of these models to overfit their training data. The model hyperparameters were optimised using a hybrid coarse grid-search and meta GBRT optimiser
We trained models using a subset of the ATO data only, and another model which used the combined ATO/ABS data. We found that the ATO data was overly optimistic about the impact of a new Tax Help Center; whereas the model trained on the ATO/ABS data was somewhat pessimistic. We included interactive graphs of both models for comparison.
Evidence of Work
ABS GeoSpatial API, Postal Area ArcGIS
Description of Use: We integrated this into our mapping interface
Census DataPacks 2016, General Community Profile, Postal Area
Description of Use: The General Community Profile has a lot of data in it, the datasets of interest for us were in G43 and have been saved in `data/census/gcppoa/raw`
GovHack ATO 2018
Description of Use: We were most interested in this dataset for the correlation between postcode and the total number of people who have lodged a tax return.
Region: Australian Capital TerritoryChallenge
The Friendly ATO
Bounty: Tax Help Centers