Many important social indicators are based on sample data. Regional disaggregation is often required in order to adequately monitor progress. However, sampling error limits how finely such results can be broken down. This is particularly true for many indicators of the Sustainable Development Goals.
The innovation project "Machine Learning for Sample Data Geographic information systems" (LEARN4SDGis) was funded by a Eurostat grant. It addressed two main questions: How can important indicators be estimated for small areas? And how can valid information be provided, for example on poverty risk, at any regional level, such as grid cell, census district, municipality, district or NUTS level?
The project started in early 2018 and was completed in mid-2020. The final product is an atlas for five small-scale indicators on education, poverty and health. These indicators are an integral part of the national set of indicators to monitor progress on the Sustainable Development Goals (SDGs). Each indicator represents the percentage of the population with a specific characteristic (e.g. the at-risk-of-poverty rate). Upon completion of the project, a consultation on the methods and results was conducted with the Quality Committee of the Statistics Council, the Advisory Board for Social Statistics and the regional statistical offices. In spring 2021, the methods and databases were updated on the basis of this consultation.
Sample data were linked with geoinformation and additional administrative data. Integrating existing data provides valuable information for the required small-area estimations. For example, poverty indicators are by definition connected with income sources that are covered in income tax or other register data. Regionally available indicators such as the number of unemployed according to the Austrian Public Employment Service (AMS), the frequency of newborns with a low birth weight or the rate of new vehicle registrations also showed plausible relationships.
Due to the abundance of possible additional information, machine learning algorithms were applied to automatically recognise the correlations in the sample data that are relevant for an improved estimate and to model them according to defined optimisation criteria.
The following machine learning approaches were tested:
In each case, the variable required for the calculation of an indicator (e.g. poverty risk) was modelled. The models were trained on sample data and then applied to population data to estimate the relevant characteristic for every individual that lives in a private household on a specified reference day.
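The train-on-sample, predict-on-population step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the logistic regression is a simple stand-in for whichever machine learning approach is used, the data are synthetic, and all names (`fit_logistic`, `make_person`, the covariates `unemployed` and `low_income`, the area codes) are hypothetical.

```python
import math
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the synthetic data are reproducible


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit a logistic regression by batch gradient descent (stand-in model)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def make_person(area):
    """Synthetic person with two register covariates (hypothetical)."""
    unemployed = 1.0 if random.random() < 0.05 + 0.05 * area else 0.0
    low_income = 1.0 if random.random() < 0.2 else 0.0
    p = sigmoid(-2.5 + 1.5 * unemployed + 2.0 * low_income)
    at_risk = 1 if random.random() < p else 0
    return {"area": area, "x": [unemployed, low_income], "at_risk": at_risk}


# Population register: covariates are known for everyone.
population = [make_person(area) for area in range(4) for _ in range(500)]
# Survey sample: the only place where the target variable is used.
sample = random.sample(population, 400)

# 1) Train on the sample.
model = fit_logistic([p["x"] for p in sample], [p["at_risk"] for p in sample])

# 2) Apply to every person in the population, 3) aggregate by area.
sums, counts = defaultdict(float), defaultdict(int)
for person in population:
    sums[person["area"]] += predict_proba(model, person["x"])
    counts[person["area"]] += 1
area_rates = {area: sums[area] / counts[area] for area in sorted(counts)}

for area, rate in area_rates.items():
    print(f"area {area}: estimated at-risk-of-poverty rate {rate:.3f}")
```

Because the prediction exists for every individual, the same aggregation could be run over any grouping key (grid cell, municipality, NUTS region) without refitting the model.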
As its main substantive result, the project delivered maps of poverty, health and education at the small-area level. This was achieved through machine learning and the integration of different data sources. These data provide valuable initial insights, but the methodological work is still in progress and cannot yet rely on any European harmonisation.
The small-area results are ultimately based on the available administrative data. Where the administrative data do not adequately reflect people's actual living conditions, counterintuitive results are to be expected. This applies in particular to discrepancies between the registration address and the household affiliation, and to households that rely on income which is not, or not entirely, recorded in tax or other administrative data. The actual income situation of self-employed persons, for instance, is only insufficiently reflected in the administrative data. This may explain why results for areas with an above-average share of self-employed workers appear less plausible than those for areas with a high share of employees or pensioners. Regional concentrations of people who live predominantly on capital income, tips or seasonal employment can have a similar effect.
As a general rule, the smaller the area in which the results are presented, the more critically they should be assessed. The results have an estimation error, which stems partly from the modelling and partly from the sample. Particularly unreliable results were therefore suppressed.
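A suppression rule of this kind could look like the sketch below. The actual criterion used in the project is not stated here, so everything in the example is an assumption: the two error components are treated as independent and combined in quadrature, and results are suppressed when the relative standard error exceeds a hypothetical threshold `CV_THRESHOLD`. The function names and the numbers are illustrative only.

```python
import math

# Hypothetical cut-off for the relative standard error (not from the project).
CV_THRESHOLD = 0.30


def total_standard_error(se_model, se_sample):
    """Combine modelling and sampling error, assuming they are independent."""
    return math.sqrt(se_model ** 2 + se_sample ** 2)


def publish_or_suppress(estimate, se_model, se_sample, cv_threshold=CV_THRESHOLD):
    """Return the estimate, or None if it is too unreliable to publish."""
    if estimate <= 0:
        return None  # relative error is undefined for a zero estimate
    cv = total_standard_error(se_model, se_sample) / estimate
    return estimate if cv <= cv_threshold else None


# Illustrative values: the same rate estimated for a large and a small area.
results = {
    "large_district": publish_or_suppress(0.14, se_model=0.010, se_sample=0.015),
    "small_grid_cell": publish_or_suppress(0.14, se_model=0.040, se_sample=0.060),
}
print(results)
```

With these made-up numbers, the large district is published and the small grid cell is suppressed, which mirrors the general rule that smaller areas call for more caution.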
The results are accessible in an interactive atlas.
The original project report to Eurostat can be read here.
A detailed methodological description can be found in an article published in September 2020 (available in German only).
A summary of main methodological modifications can be found here (available in German only).