Rubik's Cube records

└─ 2016-01-27 • Reading time: ~9 minutes

Following the recent hype around the robot that can solve a Rubik’s Cube in about a second (a big improvement over the previous robot, which needed about 3 seconds), I got interested in human records and their evolution over the years.

The World Cube Association publishes data about rankings, records, and players in Rubik’s Cube competitions. It offers a downloadable dataset for data science purposes. But since it would be too easy to use a TSV ;), I’ll show how we can extract the data from the website itself and use Pandas to get some insight.

Players

We’ll first explore the results of worldwide competitions, which can be found on this page.

Let’s use requests to fetch the HTML from the URL. If you go directly to the page, you’ll see that you can select the number of results you’re interested in, countries, years, etc. All these parameters can be specified in the URL too. In this post, I’ll use all the data (all players, all years, all countries):

import requests
r = requests.get("https://www.worldcubeassociation.org/results/events.php?eventId=333&regionId=&years=&show=All%2BPersons&single=Single")
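The query string in that URL can also be assembled from a dictionary, which makes it easier to see which filters are in play. Here is a minimal sketch using only the Python 3 standard library. The parameter names are copied from the URL above, the reading of the empty values as "all" is inferred from the page, and the helper name `build_events_url` is made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.worldcubeassociation.org/results/events.php"

def build_events_url(event_id="333", region_id="", years="",
                     show="All+Persons", single="Single"):
    """Rebuild the rankings URL from its query parameters (hypothetical helper)."""
    params = {
        "eventId": event_id,
        "regionId": region_id,  # empty appears to mean "all regions"
        "years": years,         # empty appears to mean "all years"
        "show": show,
        "single": single,
    }
    # urlencode percent-encodes the literal "+" as %2B, as in the URL above
    return BASE_URL + "?" + urlencode(params)

url = build_events_url()
print(url)
```

The resulting string can be passed to `requests.get(url)` exactly as before.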

BeautifulSoup lets you manipulate HTML without dealing too much with low-level details.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

You can ask it to find and extract one particular part of the HTML. In our case, that is the div with the class table-responsive, which holds the table of results:

table = soup.find("div", {"class": "table-responsive"})
raw_lines = [
    line.contents[:5]
    for line in table.find_all("tr")
    if len(line) == 6
]
raw_lines[0]

1 | Lucas Etter | 4.90 | USA | River Hill Fall 2015

We get a list of HTML elements that we will process to extract the data we need. But first, let’s automate this extraction with a function, since every table on this website follows the same format:

from bs4 import BeautifulSoup
import requests

def get_table_raws(url, cols):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find("div", {
        "class": "table-responsive"
    })

    raw_lines = [
        line.contents
        for line in table.find_all("tr")
        if len(line) == cols
    ]

    return [[c.text for c in line] for line in raw_lines]

Since we’ll have to deal with dates and timings, we can use these little helpers to do the conversions:

def extract_year_from_competition(competition):
    # Format is [name] [year]
    return int(competition.strip().rsplit(' ', 1)[-1])

def convert_timing_to_seconds(timing):
    try:
        # Try to convert directly to seconds
        return float(timing)
    except ValueError:
        try:
            # Format should be [minutes]:[seconds].[hundredths]
            minutes, seconds = timing.split(":", 1)
            # Parsing "[seconds].[hundredths]" as one float also handles
            # a fractional part that is not exactly two digits long
            return float(minutes) * 60 + float(seconds)
        except ValueError:
            # Not a timing at all
            return None
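A quick way to check helpers like these is to run them on a few strings of the kind found in the tables. The sketch below restates both helpers compactly so that it runs on its own (Python 3). The value "DNF" is an assumed example of a non-numeric cell, not something taken from the page.

```python
def extract_year_from_competition(competition):
    # Format is "[name] [year]": the year is the last space-separated token
    return int(competition.strip().rsplit(" ", 1)[-1])

def convert_timing_to_seconds(timing):
    try:
        return float(timing)  # plain seconds, e.g. "4.90"
    except ValueError:
        try:
            minutes, seconds = timing.split(":", 1)  # e.g. "10:48.00"
            return float(minutes) * 60 + float(seconds)
        except ValueError:
            return None  # anything that is not a timing

samples = ["4.90", "10:48.00", "DNF"]
converted = [convert_timing_to_seconds(s) for s in samples]
year = extract_year_from_competition(" River Hill Fall 2015 ")
print(converted, year)
```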

Let’s use our helpers to extract all the data:

url = "https://www.worldcubeassociation.org/results/events.php?eventId=333&regionId=&years=&show=All%2BPersons&single=Single"
clean_table = []
for rank, (_, person, timing, country, competition, _) in enumerate(get_table_raws(url, 6)):
    year = extract_year_from_competition(competition)
    timing = convert_timing_to_seconds(timing)
    clean_table.append((rank + 1, person, timing, country, competition, year))

clean_table[:5]
[(1, u'Lucas Etter', 4.9, u'USA', u'River Hill Fall 2015', 2015),
 (2, u'Keaton Ellis', 5.09, u'USA', u'River Hill Fall 2015', 2015),
 (3, u'Collin Burns', 5.25, u'USA', u'Doylestown Spring 2015', 2015),
 (4, u'Feliks Zemdegs', 5.39, u'Australia', u'World Championship 2015', 2015),
 (5, u'Mats Valk', 5.55, u'Netherlands', u'Zonhoven Open 2013', 2013)]
len(clean_table)
47465

Let’s see if we can gain some insight from this dataset using Pandas:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    data=clean_table,
    columns=(
        'Rank',
        'Person',
        'Timing',
        'Country',
        'Competition',
        'Year'
    ))

df.head()
Rank  Person          Timing  Country      Competition              Year
1     Lucas Etter     4.9     USA          River Hill Fall 2015     2015
2     Keaton Ellis    5.09    USA          River Hill Fall 2015     2015
3     Collin Burns    5.25    USA          Doylestown Spring 2015   2015
4     Feliks Zemdegs  5.39    Australia    World Championship 2015  2015
5     Mats Valk       5.55    Netherlands  Zonhoven Open 2013       2013

We see that the first entry is the world record by Lucas Etter. If you haven’t seen it yet, watch it now; it’s very impressive.

df["Timing"].describe()

count    47465.000000
mean        35.078421
std         27.492358
min          4.900000
25%         17.350000
50%         26.720000
75%         43.830000
max        648.000000
Name: Timing, dtype: float64

We have 47465 entries. The average solve time in these competitions is 35 seconds, which is actually pretty fast. The maximum of 648 seconds (almost 11 minutes) seems more reasonable to me…
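To read a value such as the 648-second maximum more easily, the seconds can be converted back into the minutes:seconds form used on the site. This is a small sketch; `format_timing` is a hypothetical helper, not part of the code in this post.

```python
def format_timing(seconds):
    """Format a duration in seconds as [minutes]:[seconds].[hundredths]."""
    minutes, rest = divmod(seconds, 60)
    if minutes == 0:
        # Under a minute: keep the plain seconds form
        return "{:.2f}".format(rest)
    return "{:d}:{:05.2f}".format(int(minutes), rest)

slowest = format_timing(648.0)  # the maximum reported by describe()
fastest = format_timing(4.9)    # the minimum reported by describe()
print(slowest, fastest)
```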

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

We can count the number of players for each country:

# Series.value_counts() replaces the older pd.value_counts helper
country_count = df["Country"].value_counts()
country_count[:20]
Country      Count
USA           8918
China         6286
India         4659
Brazil        2039
Poland        1739
Canada        1526
Germany       1092
Indonesia     1090
France        1065
Japan          999
Philippines    993
Spain          977
Mexico         932
Korea          869
Ukraine        848
Taiwan         828
Russia         817
Australia      766
Peru           671
Colombia       627

No surprise, bigger countries are more represented in Rubik’s Cube competitions. It would be interesting to compare these numbers with the actual population, to see if Rubik’s Cube is more “popular” in some countries.
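A per-capita comparison could look like the sketch below. The player counts are taken from the table above, but the population figures are rough, rounded values (in millions, around 2015) that I am assuming for illustration. They are not part of the WCA data and should be replaced with a proper source.

```python
import pandas as pd

# Player counts taken from the table above
players = pd.Series({"USA": 8918, "China": 6286, "Poland": 1739})

# ASSUMPTION: rough, rounded populations in millions (circa 2015),
# used only to illustrate the computation
population_millions = pd.Series({"USA": 320, "China": 1370, "Poland": 38})

# Division aligns both series on the country labels
players_per_million = (players / population_millions).sort_values(ascending=False)
print(players_per_million.round(1))
```

With these rough figures, Poland would come out ahead of the USA per capita, which is the kind of effect the raw counts hide.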

Let’s visualize the countries with more than 200 players using a barplot:

country_count[country_count > 200].dropna().plot(kind="bar")
[Figure: Players per country]

We can do the same for players:

# Count how often each name appears, then count those counts
df["Person"].value_counts().value_counts()
Number of participations  Count
1                         46709
2                           305
3                            26
4                            13
5                             2
6                             1

It appears that the vast majority of players appear only once, and only about 0.7% of them (347 out of 47056) appear more than once. Can we conclude that players are more hobbyists than professionals? Probably not from this table alone: it seems to list one best result per person, so repeated names may simply be namesakes rather than repeat participants. Another piece of information that could be interesting is the age of the participants: since the World Champion is very young, it would be fun to have more insight into the average or median age of Rubik’s Cube champions.
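The proportion of names that appear more than once can be recomputed directly from the counts in the table above. A small sketch (Python 3), with the numbers copied from that output:

```python
# Number of appearances -> how many names have that many appearances
appearances = {1: 46709, 2: 305, 3: 26, 4: 13, 5: 2, 6: 1}

# Each name with n appearances contributes n rows to the table
total_rows = sum(n * count for n, count in appearances.items())
distinct_names = sum(appearances.values())
repeated_names = distinct_names - appearances[1]
share_repeated = repeated_names / distinct_names

print(total_rows)  # should match len(clean_table)
print(distinct_names, repeated_names)
print("{:.2%}".format(share_repeated))
```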

Records

Let’s now get to the world records, and try to see how they evolve over the years. The code is pretty similar to the previous section, so we’ll just use our generic function for table extraction:

url = "https://www.worldcubeassociation.org/results/regions.php?regionId=&eventId=333&years=&history=History"
clean_table = []
for (_, single, avg, person, country, competition, _) in get_table_raws(url, 7):
    single = convert_timing_to_seconds(single.strip())
    avg = convert_timing_to_seconds(avg.strip())
    year = float(extract_year_from_competition(competition))
    clean_table.append((person, single, avg, country, competition, year))

clean_table[:5]
[(u'Lucas Etter', 4.9, None, u'USA', u'River Hill Fall 2015', 2015.0),
 (u'Collin Burns', 5.25, None, u'USA', u'Doylestown Spring 2015', 2015.0),
 (u'Mats Valk', 5.55, None, u'Netherlands', u'Zonhoven Open 2013', 2013.0),
 (u'Feliks Zemdegs', 5.66, None, u'Australia', u'Melbourne Winter Open 2011', 2011.0),
 (u'Feliks Zemdegs', 6.18, None, u'Australia', u'Melbourne Winter Open 2011', 2011.0)]
df = pd.DataFrame(
    data=clean_table,
    columns=('Person', 'Single', 'Avg', 'Country', 'Competition', 'Year'))

df.head()

   Person          Single  Avg  Country      Competition                 Year
0  Lucas Etter     4.90    NaN  USA          River Hill Fall 2015        2015
1  Collin Burns    5.25    NaN  USA          Doylestown Spring 2015      2015
2  Mats Valk       5.55    NaN  Netherlands  Zonhoven Open 2013          2013
3  Feliks Zemdegs  5.66    NaN  Australia    Melbourne Winter Open 2011  2011
4  Feliks Zemdegs  6.18    NaN  Australia    Melbourne Winter Open 2011  2011

# DataFrame.sort() was removed from pandas; sort_values() is its replacement
df = df.sort_values(["Year", "Single"], ascending=[True, False])
df[df["Single"].notnull()].plot(x="Year", y="Single", legend=True)
plt.title("World records of Rubik's Cube")
[Figure: World records of Rubik’s Cube]
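One way to put numbers on this curve is to look at how much each new record improved on the previous one. The sketch below uses only the five most recent records listed in the output above, so it says nothing about the earlier history:

```python
# (year, single time in seconds), oldest first, copied from the output above
records = [
    (2011, 6.18),
    (2011, 5.66),
    (2013, 5.55),
    (2015, 5.25),
    (2015, 4.90),
]

# Improvement of each record over the one it replaced, rounded to hundredths
improvements = [
    (year, round(previous - current, 2))
    for (_, previous), (year, current) in zip(records, records[1:])
]

for year, gain in improvements:
    print(year, gain)
```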

Last words

It would seem reasonable to conclude that we are now on a plateau. One thing that could bring even more insight would be to compare this evolution to, say, the evolution of sprinters’ world records over time. I guess we’re hitting the same kind of limitation.

A major difference is that in Rubik’s Cube, you’re limited both by your fingers’ dexterity, and the speed at which your brain can process information, whereas in athletics, it’s more about physical performance. I would be curious to see which one is more limiting: brain, or body.

Another open question for me is: how much does the final time depend on the initial configuration of the Rubik’s Cube? I know that competitors are allowed to inspect the cube before they start, so maybe this is negligible, maybe not.