delectableData: March 2018

Tuesday, 13 March 2018

Decision trees for UK voting

The next data-analysis method I'm playing with is decision-tree regression.

It's a method often said to be included in a field of statistical computing (called machine learning, statistical learning, artificial intelligence, data mining, supervised learning, etc). Decision trees usefully split up datasets into groups, often using YES/NO questions at each split.

I'm using data from Qriously (dated 2017-06-07) from the run-up to the UK general election. I'm looking only at England & Wales, and I've only considered 3 regressors: gender (0=F, 1=M), age and income. For each of the 5 biggest political parties I've considered a YES/NO voting intention.
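As a rough sketch of how a single YES/NO split in such a tree might be chosen (this is not the code behind the trees in this post, and the respondents below are made up rather than taken from the Qriously data), one can try every possible threshold on every regressor and keep the one that leaves the two resulting groups as uniform as possible. The helper names `variance` and `best_split` are my own, purely for illustration.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(rows, targets):
    """Return (feature_index, threshold, score) for the YES/NO question
    'is feature <= threshold?' that minimises the weighted variance
    of the targets in the two resulting groups."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted(set(row[feature] for row in rows))
        # Candidate thresholds lie halfway between neighbouring values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
            score = (len(left) * variance(left) + len(right) * variance(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Made-up respondents: (gender 0=F/1=M, age, income in thousands of GBP).
rows = [(0, 22, 15), (1, 25, 40), (0, 31, 22), (1, 38, 30),
        (0, 52, 35), (1, 58, 18), (0, 67, 28), (1, 71, 45)]
# Made-up YES (1) / NO (0) voting intention for one party.
targets = [0, 0, 0, 0, 1, 1, 1, 1]

feature, threshold, score = best_split(rows, targets)
print(feature, threshold, score)
```

With this toy data the chosen question is about age (feature index 1), because age separates the YES and NO answers cleanly. A full tree would repeat the same search inside each of the two groups until some stopping rule is reached.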

The trees are below. Here are some key aspects that jump out:

Age seems to be the most important regressor for most parties.
CON seems to get many votes from older voters (except if they're poor).
CON gets few votes from younger voters (especially poorer voters).
CON's best group were older females (not males as one might expect - maybe this is simply a bias of longer life expectancy for females).
LAB/CON results are fairly inverted (as we might expect), i.e. poorer and younger voters favouring LAB.
LAB's best group were young, poor females.
LAB's worst group are the 65+.
LIB seems to do best among low- and middle-income voters, more so among male voters.
LIB's two worst groups are from (a) elderly richer females, and (b) poorer older voters.
GRN's voters are generally younger (the one exception being wealthier older females) -- young males are one key group.
GRN does badly with (a) older, poorer voters and (b) older, richer males.
UKIP voters are generally poorer, one key group being poorer younger males.

There are so many assumptions and drawbacks in these sorts of analyses - but anyway, interesting all the same.

CON:

LAB:

LIB:

UKIP:

GRN:

Wednesday, 7 March 2018

Logistic regression

The next data-analysis method I'm playing with is logistic regression (LR).

It's a method often said to be included in a field of statistical computing (called machine learning, statistical learning, artificial intelligence, data mining, supervised learning, etc). LR is mostly used for yes/no classification.

I'm trying to use data from my previous blogs where possible. So the dataset is not really super suitable for LR (I don't believe I could really use the model result to predict the probability of future undisturbed nights of sleep) - but I'll use it anyway as a data-exploration tool. The data are taken from camera imagery of our baby/toddler's sleep each night (see earlier blog posts).

It is probably possible to estimate, by eye from the red crosses, that there have been more undisturbed nights of sleep in the later months. But the logistic regression makes it clear that there is a positive relationship - i.e. our baby/toddler is sleeping through more and more nights (apparently sleeping through most nights by the end of the year).
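For anyone curious what such a fit involves, here is a minimal sketch of a one-variable logistic regression fitted by plain gradient descent. It is not the code used for the plot in this post, and the months and YES/NO nights below are invented rather than taken from the camera data. The names `sigmoid` and `fit_logistic`, the learning rate and the number of steps are all my own choices for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit p(y=1) = sigmoid(a + b*x) by gradient descent on the log-loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(a + b * x) - y for x, y in zip(xs, ys)]
        grad_a = sum(errors) / n
        grad_b = sum(e * x for e, x in zip(errors, xs)) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Made-up data: month of the year, and whether a sampled night
# in that month was undisturbed (1) or not (0).
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
undisturbed = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

# Centre the months around zero so that gradient descent behaves well.
xs = [m - 6.5 for m in months]
a, b = fit_logistic(xs, undisturbed)

p_january = sigmoid(a + b * (1 - 6.5))
p_december = sigmoid(a + b * (12 - 6.5))
print(b, p_january, p_december)
```

A positive slope `b` is what "a positive relationship" means here: the fitted probability of an undisturbed night rises as the year goes on.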

Tuesday, 6 March 2018

k-means clustering

I want to play with data. In this case I want to find regimes/clusters in a dataset I saw in the book "The Spirit Level" from the Equality Trust. I took the data (year 2010) from GapMinder (Hans Rosling is an idol of mine!).

It's a method in a field of statistical computing (called machine learning, statistical learning, artificial intelligence, data mining, unsupervised learning, etc). The method is called k-means clustering. 'k' simply refers to the number of groups (clusters) in the dataset.
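A minimal sketch of the k-means idea, assuming nothing beyond the Python standard library: assign every point to its nearest cluster centre, move each centre to the mean of its points, and repeat until nothing changes. The numbers below are invented (they are not the GapMinder values), the function names are my own, and the starting centres are picked in a simple deterministic way rather than at random as is more usual.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    """Return (labels, centres) after alternating assignment and update steps."""
    # Deterministic start: spread the initial centres across the sorted points.
    ordered = sorted(points)
    centres = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = mean_point(members)
    return labels, centres


# Made-up countries: (log10 of income per capita in USD, life expectancy in years).
points = [(2.7, 52), (2.9, 55), (3.0, 58),
          (3.7, 68), (3.9, 71), (4.0, 73),
          (4.5, 79), (4.6, 80), (4.7, 81)]

labels, centres = k_means(points, k=3)
print(labels)
print(centres)
```

Note that the two axes are on very different scales here, so life expectancy dominates the distances. In practice one would probably rescale both variables before clustering.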

In this case the idea is that once a country reaches a certain level of wealth (about 5000-10000 USD/capita), life expectancy no longer increases much (or at all) with further increases in wealth.

k=3. When I was at school we were taught mostly about three worlds: under-developed, emerging and developed economies. So the first group includes countries like Mozambique, Malawi, Afghanistan. The second group includes Russia, India, Chile, China. The final group includes most western countries (USA, UK, Germany, etc).

k=9. Interestingly with more clusters small features start to emerge, like a group for South Africa and Botswana where increases in wealth have not translated into increases in life expectancy.

k-means tends to produce clusters of roughly equal size, so very small clusters don't emerge, e.g. one containing just Bermuda and Luxembourg, the two richest countries per capita.

Monday, 5 March 2018

k-nearest neighbour

I previously made some observations of lake depth in central Finland and plotted them in Octave. I let Octave use its own smoothing and interpolation between my datapoints. Now I'm curious to look a little at a simple algorithm of my own.

It's a simple method in a field of statistical computing (called machine learning, statistical learning, artificial intelligence, data mining, supervised learning, etc). The method is called k-nearest neighbours. 'k' simply refers to the number of nearest neighbours one takes into account.
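A minimal sketch of the idea, written here in Python rather than Octave and using invented soundings rather than my actual lake measurements: to estimate the depth at a new location, find the k observations closest to it and average their depths. The function name `knn_depth` and the coordinate layout (x, y, depth) are assumptions for illustration only.

```python
import math


def knn_depth(observations, x, y, k=3):
    """Estimate the depth at (x, y) as the mean depth of the k nearest observations.

    Each observation is a tuple (x, y, depth).
    """
    nearest = sorted(
        observations,
        key=lambda obs: math.hypot(obs[0] - x, obs[1] - y),
    )[:k]
    return sum(obs[2] for obs in nearest) / len(nearest)


# Made-up soundings: (x in metres, y in metres, depth in metres).
observations = [(0, 0, 1.0), (10, 0, 2.0), (0, 10, 3.0),
                (10, 10, 4.0), (50, 50, 12.0)]

estimate_k1 = knn_depth(observations, 1, 1, k=1)
estimate_k3 = knn_depth(observations, 1, 1, k=3)
print(estimate_k1, estimate_k3)
```

With k=1 the estimate simply copies the closest sounding, which gives a blocky map. Larger k gives a smoother result, at the cost of blurring real features of the lake bed.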