Tuesday 6 March 2018

k-means clustering

I want to play with data. In this case I want to find regimes (clusters) in a dataset I came across in the book "The Spirit Level" from the Equality Trust. I took the data (for the year 2010) from GapMinder (Hans Rosling is an idol of mine!).

The method I used is called k-means clustering. It belongs to a field of statistical computing that goes by many names (machine learning, statistical learning, artificial intelligence, data mining, unsupervised learning, etc). The 'k' simply refers to the number of groups (clusters) you ask the algorithm to find in the dataset.
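To make the idea concrete, here is a minimal sketch of the standard k-means loop (often called Lloyd's algorithm) in plain Python. It is not the code I used for the plots. The function name kmeans, its parameters (init, max_iter, seed) and the toy points are all invented for illustration, and the points are not GapMinder values.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Minimal k-means sketch. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initial centroids: caller-supplied, or k points picked at random.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Made-up 2-D toy points in three obvious blobs (NOT real country data).
toy_points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
    (9.0, 1.0), (9.1, 1.2), (8.8, 0.9),
]

# Seed one centroid in each blob so the run is fully deterministic.
centroids, labels = kmeans(
    toy_points, k=3, init=[toy_points[0], toy_points[3], toy_points[6]]
)
print(labels)
```

With random starting centroids (init left out) the result can depend on the seed, because k-means only finds a local optimum. That is why real implementations usually try several random starts and keep the best one.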

In this case the idea is that once a country reaches a certain level of wealth (about 5000-10000 USD per capita), life expectancy no longer increases much (or at all) with further increases in wealth.

k=3. When I was at school we were taught mostly about three worlds: under-developed, emerging and developed economies. So the first group includes countries like Mozambique, Malawi and Afghanistan. The second group includes Russia, India, Chile and China. The final group includes most western countries (USA, UK, Germany, etc).

k=9. Interestingly, with more clusters, smaller features start to emerge, like a group for South Africa and Botswana, where increases in wealth have not translated into increases in life expectancy.

k-means tends to produce clusters of roughly equal size, so very small clusters don't emerge on their own. For example, Bermuda and Luxembourg, the two richest countries per capita, don't get a cluster to themselves.