Rubik's Cube World Records Analysis

Posted by Pythux on Wed 27 January 2016

Following the recent hype around the robot that is able to solve a Rubik's Cube in a second (which seems a big improvement over the previous robot that was able to solve it in about 3 seconds), I got interested in human records and how they have evolved over the years.

The World Cube Association provides data about rankings, records, and players in Rubik's Cube competitions. They offer a downloadable dataset for data science purposes. But since it would be too easy to use a TSV ;), I'll show how we can extract the data from the website itself, and use Pandas to get some insight.

Players

We'll first explore the results of worldwide competitions, which we can find on this page.

Let's use requests to fetch the HTML from the URL. If you go directly to the page, you'll see that you can select the number of results you're interested in, countries, years, etc. All these parameters can be specified in the URL too. In this post, I'll use all the data (all players, all years, all countries):

In [174]:
import requests
r = requests.get("https://www.worldcubeassociation.org/results/events.php?eventId=333&regionId=&years=&show=All%2BPersons&single=Single")
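If you want to play with the filters, the query string can be built from a dictionary instead of being typed by hand. Here is a minimal sketch using only the standard library; the function name build_events_url and its defaults are my own, and I'm assuming the parameter names are exactly the ones visible in the URL above:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.worldcubeassociation.org/results/events.php"

def build_events_url(event_id="333", region_id="", years="", show="All+Persons"):
    """Build the results URL; an empty string means "no filter" (assumption)."""
    # urlencode percent-encodes the literal "+" in "All+Persons" as %2B,
    # which is how it appears in the URL used in this post.
    query = urlencode({
        "eventId": event_id,
        "regionId": region_id,
        "years": years,
        "show": show,
        "single": "Single",
    })
    return BASE_URL + "?" + query

print(build_events_url())
```

With the defaults, this prints the same URL as the one passed to requests.get above.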

BeautifulSoup lets you manipulate HTML without dealing too much with low-level details.

In [175]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

You can ask it to find and extract one particular part of the HTML. In our case, that is the div with the class table-responsive, which contains the table of results:

In [176]:
table = soup.find("div", {"class": "table-responsive"})
raw_lines = [line.contents[:5] for line in table.find_all("tr") if len(line) == 6]
raw_lines[0]
Out[176]:
[<td class="r">1</td>,
 <td><a class="p" href="/results/p.php?i=2011ETTE01">Lucas Etter</a></td>,
 <td class="R2">4.90</td>,
 <td>USA</td>,
 <td><a class="c" href="/results/c.php?i=RiverHillFall2015">River Hill Fall 2015</a></td>]

We get a list of HTML elements that we will process to extract the data we need. But first, let's automate this extraction process with a function, since every table on this website follows the same format:

In [333]:
from bs4 import BeautifulSoup
import requests

def get_table_raws(url, cols):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find("div", {"class": "table-responsive"})
    raw_lines = [line.contents for line in table.find_all("tr") if len(line) == cols]
    return [[c.text for c in line] for line in raw_lines]
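To experiment with the "keep only rows of the expected width" idea without hitting the website, here is a small offline sketch. Note that it swaps BeautifulSoup for the standard library's html.parser, so it is not the function above, just the same filtering idea. The RowCollector class, rows_with_width and the SAMPLE markup are all made up for illustration:

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <a>) is still part of the open cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def rows_with_width(html, cols):
    """Return only the rows that contain exactly `cols` cells."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [row for row in parser.rows if len(row) == cols]

# Hand-written sample: one filler row and one result row
SAMPLE = """
<div class="table-responsive"><table>
<tr><td colspan="5">Some filler row</td></tr>
<tr><td class="r">1</td><td><a href="/results/p.php?i=2011ETTE01">Lucas Etter</a></td>
<td class="R2">4.90</td><td>USA</td><td><a>River Hill Fall 2015</a></td></tr>
</table></div>
"""

rows = rows_with_width(SAMPLE, 5)
print(rows)
```

Only the result row survives the filter, which is the same trick get_table_raws uses to skip header and separator rows.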

Since we'll have to deal with dates and timings, we can use these little helpers to do the conversions:

In [263]:
def extract_year_from_competition(competition):
    # Format is [name] [year]
    return int(competition.strip().rsplit(' ', 1)[-1])

def convert_timing_to_seconds(timing):
    try:
        # Try to convert directly in seconds
        return float(timing)
    except ValueError:
        try:
            # Format should be [minutes]:[seconds].[hundredths]
            minutes, seconds = timing.split(":", 1)
            # float() already handles the fractional part of the seconds
            return float(minutes) * 60 + float(seconds)
        except ValueError:
            # Empty cell or unexpected format
            return None
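A quick sanity check of the two helpers. The definitions are repeated in condensed form (with minutes and seconds parsed in one step) so that the snippet runs on its own, and the input strings are illustrative values I picked, not cells copied from the site:

```python
def extract_year_from_competition(competition):
    # The year is the last space-separated token of the competition name
    return int(competition.strip().rsplit(" ", 1)[-1])

def convert_timing_to_seconds(timing):
    try:
        # Plain seconds, e.g. "4.90"
        return float(timing)
    except ValueError:
        try:
            # minutes:seconds, e.g. "10:48.00"
            minutes, seconds = timing.split(":", 1)
            return float(minutes) * 60 + float(seconds)
        except ValueError:
            # Empty cell or unexpected format
            return None

print(extract_year_from_competition(" River Hill Fall 2015 "))  # 2015
print(convert_timing_to_seconds("4.90"))                        # 4.9
print(convert_timing_to_seconds("10:48.00"))                    # 648.0
print(convert_timing_to_seconds(""))                            # None
```

Returning None for cells that cannot be parsed is convenient later on, because Pandas turns it into NaN in a numeric column.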

Let's use our helpers to extract all the data:

In [339]:
url = "https://www.worldcubeassociation.org/results/events.php?eventId=333&regionId=&years=&show=All%2BPersons&single=Single"
clean_table = []
for rank, (_, person, timing, country, competition, _) in enumerate(get_table_raws(url, 6)):
    year = extract_year_from_competition(competition)
    timing = convert_timing_to_seconds(timing)
    clean_table.append((rank + 1, person, timing, country, competition, year))

clean_table[:5]
Out[339]:
[(1, u'Lucas Etter', 4.9, u'USA', u'River Hill Fall 2015', 2015),
 (2, u'Keaton Ellis', 5.09, u'USA', u'River Hill Fall 2015', 2015),
 (3, u'Collin Burns', 5.25, u'USA', u'Doylestown Spring 2015', 2015),
 (4, u'Feliks Zemdegs', 5.39, u'Australia', u'World Championship 2015', 2015),
 (5, u'Mats Valk', 5.55, u'Netherlands', u'Zonhoven Open 2013', 2013)]
In [340]:
len(clean_table)
Out[340]:
47465

Let's see if we can gain some insight from this dataset using Pandas:

In [341]:
import pandas as pd
import numpy as np
In [342]:
df = pd.DataFrame(
    data=clean_table, 
    columns=('Rank', 'Person', 'Timing', 'Country', 'Competition', 'Year'))
In [343]:
df.head()
Out[343]:
   Rank          Person  Timing      Country              Competition  Year
0     1     Lucas Etter    4.90          USA     River Hill Fall 2015  2015
1     2    Keaton Ellis    5.09          USA     River Hill Fall 2015  2015
2     3    Collin Burns    5.25          USA   Doylestown Spring 2015  2015
3     4  Feliks Zemdegs    5.39    Australia  World Championship 2015  2015
4     5       Mats Valk    5.55  Netherlands       Zonhoven Open 2013  2013

We see that the first entry is the world record by Lucas Etter. If you haven't seen it yet, watch it now, it's very impressive.

In [211]:
df["Timing"].describe()
Out[211]:
count    47465.000000
mean        35.078421
std         27.492358
min          4.900000
25%         17.350000
50%         26.720000
75%         43.830000
max        648.000000
Name: Timing, dtype: float64

We have 47465 entries. The mean solve time in this table is about 35 seconds, which is actually pretty fast. The maximum of 648 seconds (almost 11 minutes) seems more reasonable to me...

In [213]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

We can count the number of players for each country:

In [231]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts
country_count = df["Country"].value_counts()
country_count[:20]
Out[231]:
Country
USA 8918
China 6286
India 4659
Brazil 2039
Poland 1739
Canada 1526
Germany 1092
Indonesia 1090
France 1065
Japan 999
Philippines 993
Spain 977
Mexico 932
Korea 869
Ukraine 848
Taiwan 828
Russia 817
Australia 766
Peru 671
Colombia 627

No surprise, bigger countries are more represented in Rubik's Cube competitions. It would be interesting to compare these numbers with the actual population, to see if Rubik's Cube is more "popular" in some countries.
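As a first stab at that comparison, here is a tiny sketch that divides a few of the counts above by population. Be careful: the population figures are rough, hand-rounded numbers (in millions) that I'm plugging in as an assumption, not data from the WCA, so treat the result as an illustration of the computation only:

```python
# Player counts copied from the table above
players = {"USA": 8918, "China": 6286, "India": 4659, "Poland": 1739}

# ASSUMPTION: rough mid-2010s populations in millions, rounded by hand.
# Replace them with figures from a proper source before concluding anything.
population_millions = {"USA": 320, "China": 1370, "India": 1300, "Poland": 38}

# Players per million inhabitants for each country
per_million = {
    country: players[country] / population_millions[country]
    for country in players
}

# Print the countries from the highest to the lowest rate
for country, rate in sorted(per_million.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country:<8}{rate:6.1f} players per million inhabitants")
```

With these rough inputs the ordering changes quite a bit compared to the raw counts, which suggests the per-capita view is worth doing properly.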

Let's visualize the countries with more than 200 players using a barplot:

In [234]:
country_count[country_count > 200].dropna().plot(kind="bar")
Out[234]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9b477ebfd0>

We can do the same for players:

In [355]:
# First count rows per name, then count how many names share each row count
person_count = df["Person"].value_counts().value_counts()
person_count
Out[355]:
Person
1 46709
2 305
3 26
4 13
5 2
6 1

It appears that the vast majority of players appear only once, and only about 0.7% of names appear more than once. Can we conclude that players are more hobbyists than professionals? One caveat: this ranking appears to list each person's best result, so repeated names may simply be different people who share a name. Another piece of information that could be interesting is the age of the participants: since the World Champion is very young, it would be fun to have more insight into the mean or median age of Rubik's Cube champions.
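The share of names that show up more than once can be recomputed by hand from the counts printed above. This is just arithmetic on the notebook output, with variable names of my own choosing:

```python
# Rows per name -> number of names, copied from the output above
names_by_appearances = {1: 46709, 2: 305, 3: 26, 4: 13, 5: 2, 6: 1}

# Number of distinct names, and number of table rows they account for
total_names = sum(names_by_appearances.values())
total_rows = sum(k * n for k, n in names_by_appearances.items())

# Names that appear on at least two rows
repeated_names = sum(n for k, n in names_by_appearances.items() if k > 1)
share = repeated_names / total_names

print(total_rows)                    # 47465, same as len(clean_table)
print(repeated_names, total_names)   # 347 47056
print(f"{share:.2%}")                # 0.74%
```

So roughly 7 names out of 1000 appear on more than one row.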

Records

Let's now get to the world records, and try to see how they evolve over the years. The code is pretty similar to the previous section, so we'll just use our generic function for table extraction:

In [336]:
url = "https://www.worldcubeassociation.org/results/regions.php?regionId=&eventId=333&years=&history=History"
clean_table = []
for (_, single, avg, person, country, competition, _) in get_table_raws(url, 7):
    single = convert_timing_to_seconds(single.strip())
    avg = convert_timing_to_seconds(avg.strip())
    year = float(extract_year_from_competition(competition))
    clean_table.append((person, single, avg, country, competition, year))

clean_table[:5]
Out[336]:
[(u'Lucas Etter', 4.9, None, u'USA', u'River Hill Fall 2015', 2015.0),
 (u'Collin Burns', 5.25, None, u'USA', u'Doylestown Spring 2015', 2015.0),
 (u'Mats Valk', 5.55, None, u'Netherlands', u'Zonhoven Open 2013', 2013.0),
 (u'Feliks Zemdegs',
  5.66,
  None,
  u'Australia',
  u'Melbourne Winter Open 2011',
  2011.0),
 (u'Feliks Zemdegs',
  6.18,
  None,
  u'Australia',
  u'Melbourne Winter Open 2011',
  2011.0)]
In [307]:
df = pd.DataFrame(
    data=clean_table, 
    columns=('Person', 'Single', 'Avg', 'Country', 'Competition', 'Year'))
In [308]:
df.head()
Out[308]:
           Person  Single  Avg      Country                 Competition  Year
0     Lucas Etter    4.90  NaN          USA        River Hill Fall 2015  2015
1    Collin Burns    5.25  NaN          USA      Doylestown Spring 2015  2015
2       Mats Valk    5.55  NaN  Netherlands          Zonhoven Open 2013  2013
3  Feliks Zemdegs    5.66  NaN    Australia  Melbourne Winter Open 2011  2011
4  Feliks Zemdegs    6.18  NaN    Australia  Melbourne Winter Open 2011  2011
In [332]:
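# DataFrame.sort was removed from pandas; sort_values is the replacement
df = df.sort_values(["Year", "Single"], ascending=[True, False])
# Keep only rows that have a single-solve record, then plot them over time
df[df["Single"].notnull()].plot(x="Year", y="Single", legend=True)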
plt.title("World records of Rubik's Cube")
Out[332]:
<matplotlib.text.Text at 0x7f9b45833e90>
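The sort order used before plotting deserves a word: years go up while, within a year, times go down, so that the line never jumps back up inside a single year. Here is the same ordering in plain Python, using only the five records listed earlier in this post (the full table has more rows):

```python
# (year, single) pairs for the five most recent records shown above
records = [(2015, 4.90), (2015, 5.25), (2013, 5.55), (2011, 5.66), (2011, 6.18)]

# Year ascending, then single time descending within the same year
chronological = sorted(records, key=lambda r: (r[0], -r[1]))
print(chronological)

# Each record in this sample is faster than the one before it
singles = [single for _, single in chronological]
print(all(a > b for a, b in zip(singles, singles[1:])))
```

Negating the time in the sort key is a small trick to get "descending" on one field while the other stays ascending.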

Last words

It would seem reasonable to conclude that we are now on a plateau. One thing that could bring even more insight would be to compare this evolution to, say, the evolution of sprinters' world records over time. I guess we're hitting the same kind of limitation.

A major difference is that in Rubik's Cube, you're limited both by your fingers' dexterity, and the speed at which your brain can process information, whereas in athletics, it's more about physical performance. I would be curious to see which one is more limiting: brain, or body.

Another open question for me is: how much does the final time depend on the initial configuration of the Rubik's Cube? I know that competitors are allowed to look at the cube before the solve begins, so maybe this is negligible, maybe not.

