Following on the recent hype around the robot that is able to solve a Rubik’s cube in a second (which seems a big improvement over the previous robot that was able to solve it in about 3 seconds), I got interested in human records, and their evolution over the year.

The World Cube Association provides dataset about ranking, records, players in Rubik’s Cube competitions. They offer a downloadable dataset for data science purpose. But since it would be too easy to use a TSV ;), I’ll show how we can extract data from this website, and use Pandas get some insight.

Players

We’ll first try to explore data about results of Worldwide competitions that we can find on this page.

Let’s use requests to fetch HTML from the URL. If you go directly on the page, you’ll see that you can select the number of results you’re interested in, countries, years, etc. All theses parameters can be specified in the URL too. In this post, I’ll use all the data (all players, all years, all countries):

import requests
r = requests.get("https://www.worldcubeassociation.org/results/events.php?eventId=333&regionId=&years=&show=All%2BPersons&single=Single")

Beautifulsoup will let you manipulate HTML without dealing too much with low-level details.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

You can ask him to find and extract one particular part of the HTML, in our case it would be table-responsive which represents the array of results:

table = soup.find("div", {"class": "table-responsive"})
raw_lines = [
    line.contents[:5]
    for line in table.find_all("tr")
    if len(line) == 6
]
raw_lines[0]

1 | Lucas Etter | 4.90 | USA | River Hill Fall 2015 — | — | — | — |

We get a list of HMTL Entities that we will process to extract the data we need. But first, let’s automate this extraction process with a function, since every tables on this website respect the same format:

from bs4 import BeautifulSoup
import requests

def get_table_raws(url, cols):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find("div", {
        "class": "table-responsive"
    })

    raw_lines = [
        line.contents
        for line in table.find_all("tr")
        if len(line) == cols
    ]

    return [[c.text for c in line] for line in raw_lines]

Since we’ll have to deal with dates and timings, we could use this little helpers to do the conversions:

def extract_year_from_competition(competition):
    # Format is [name] [year]
    return int(competition.strip().rsplit(' ', 1)[-1])

def convert_timing_to_seconds(timing):
    try:
        # Try to convert directly in seconds
        return float(timing)
    except ValueError:
        try:
            # Format should be [minutes]:[seconds].[tenth]
            minutes, rest = timing.split(":", 1)
            seconds, tenth = rest.split(".", 1)
            return (
                float(minutes) * 60 +
                float(seconds) +
                float(tenth) / 100
            )
        except ValueError:
            return None

Let’s use our helpers to extract all the data:

url = "https://www.worldcubeassociation.org/results/events.php?eventId=333&regionId=&years=&show=All%2BPersons&single=Single"
clean_table = []
for rank, (_, person, timing, country, competition, _) in enumerate(get_table_raws(url, 6)):
    year = extract_year_from_competition(competition)
    timing = convert_timing_to_seconds(timing)
    clean_table.append((rank + 1, person, timing, country, competition, year))

clean_table[:5]

[(1, u'Lucas Etter', 4.9, u'USA', u'River Hill Fall 2015', 2015),
 (2, u'Keaton Ellis', 5.09, u'USA', u'River Hill Fall 2015', 2015),
 (3, u'Collin Burns', 5.25, u'USA', u'Doylestown Spring 2015', 2015),
 (4, u'Feliks Zemdegs', 5.39, u'Australia', u'World Championship 2015', 2015),
 (5, u'Mats Valk', 5.55, u'Netherlands', u'Zonhoven Open 2013', 2013)]

len(clean_table)

Let’s see if we can gain some insight from this dataset using Pandas:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    data=clean_table,
    columns=(
        'Rank',
        'Person',
        'Timing',
        'Country',
        'Competition',
        'Year'
    ))

df.head()

Rank Person Timing Country Competition Year
1 Lucas Etter 4.9 USA River Hill Fall 2015 2015
2 Keaton Ellis 5.09 USA River Hill Fall 2015 2015
3 Collin Burns 5.25 USA Doylestown Spring 2015 2015
4 Feliks Zemdegs 5.39 Australia World Championship 2015 2015
5 Mats Valk 5.55 Netherlands Zonhoven Open 2013 2013

Rank	Person	Timing	Country	Competition	Year
1	Lucas Etter	4.9	USA	River Hill Fall 2015	2015
2	Keaton Ellis	5.09	USA	River Hill Fall 2015	2015
3	Collin Burns	5.25	USA	Doylestown Spring 2015	2015
4	Feliks Zemdegs	5.39	Australia	World Championship 2015	2015
5	Mats Valk	5.55	Netherlands	Zonhoven Open 2013	2013

We see that the first entry is the World record by Lucas Etter. If you didn’t see, watch it now, it’s very impressive.

df["Timing"].describe()

| — | — count | 47465.000000 mean | 35.078421 std | 27.492358 min | 4.900000 25% | 17.350000 50% | 26.720000 75% | 43.830000 max | 648.000000

Name: Timing, dtype: float64

We have 47465 entries. The average time of resolution in these competitions is 35 seconds, which is actually pretty fast. The maximum, of 648 seconds (10 minutes) seems more reasonable to me…

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

We can count the number of players for each country:

country_count = df[["Country"]].apply(pd.value_counts)
country_count[:20]

Country Count
USA 8918
China 6286
India 4659
Brazil 2039
Poland 1739
Canada 1526
Germany 1092
Indonesia 1090
France 1065
Japan 999
Philippines 993
Spain 977
Mexico 932
Korea 869
Ukraine 848
Taiwan 828
Russia 817
Australia 766
Peru 671
Colombia 627

Country	Count
USA	8918
China	6286
India	4659
Brazil	2039
Poland	1739
Canada	1526
Germany	1092
Indonesia	1090
France	1065
Japan	999
Philippines	993
Spain	977
Mexico	932
Korea	869
Ukraine	848
Taiwan	828
Russia	817
Australia	766
Peru	671
Colombia	627

No surprise, bigger countries are more represented in Rubik’s Cube competitions. It would be interesting to compare these numbers with the actual population, to see if Rubik’s Cube is more “popular” in some countries.

Let’s visualize the countries with more than 200 players using a barplot:

country_count[country_count > 200].dropna().plot(kind="bar")

We can do the same for players:

df[["Person"]]
    .apply(pd.value_counts)
    .apply(pd.value_counts)

Number of participations Count
1 46709
2 305
3 26
4 13
5 2
6 1

Number of participations	Count
1	46709
2	305
3	26
4	13
5	2
6	1

It appears that the vast majority of player only participated one time. And only about 0.007% participated more than once. Can we conclude that players are more hobbyist than professional? Another information that could be interesting is the age of the participants, since the World Champion is very young, it would be fun to have more insight about the average or mean age of Rubik’s Cube champions.

Records

Let’s now get to the world records, and try to see how they evolve over the years. The code is pretty similar to the previous section, so we’ll just use our generic function for table extraction:

url = "https://www.worldcubeassociation.org/results/regions.php?regionId=&eventId=333&years=&history=History"
clean_table = []
for (_, single, avg, person, country, competition, _) in get_table_raws(url, 7):
    single = convert_timing_to_seconds(single.strip())
    avg = convert_timing_to_seconds(avg.strip())
    year = float(extract_year_from_competition(competition))
    clean_table.append((person, single, avg, country, competition, year))

clean_table[:5]

[
  (u'Lucas Etter', 4.9, None, u'USA', u'River Hill Fall 2015', 2015.0),
  (u'Collin Burns', 5.25, None, u'USA', u'Doylestown Spring 2015', 2015.0),
  (u'Mats Valk', 5.55, None, u'Netherlands', u'Zonhoven Open 2013', 2013.0),
  (u'Feliks Zemdegs', 5.66, None, u'Australia', u'Melbourne Winter Open 2011',
  2011.0),
  (u'Feliks Zemdegs',
   6.18,
   None,
   u'Australia',
   u'Melbourne Winter Open 2011',
   2011.0)
]

df = pd.DataFrame(
    data=clean_table,
    columns=('Person', 'Single', 'Avg', 'Country', 'Competition', 'Year'))

df.head()

Person | Single | Avg | Country | Competition | Year — | — | — | — | — | 0 | Lucas Etter | 4.90 | NaN | USA | River Hill Fall 2015 | 2015 1 | Collin Burns | 5.25 | NaN | USA | Doylestown Spring 2015 | 2015 2 | Mats Valk | 5.55 | NaN | Netherlands | Zonhoven Open 2013 | 2013 3 | Feliks Zemdegs | 5.66 | NaN | Australia | Melbourne Winter Open 2011 | 2011 4 | Feliks Zemdegs | 6.18 | NaN | Australia | Melbourne Winter Open 2011 | 2011

df = df.sort(["Year", "Single"], ascending=[True, False])
df[df["Single"].notnull()].plot(x="Year", y="Single", legend="Test")
plt.title("World records of Rubik's Cube")

Last words

It would seem reasonable to conclude that we are now on a plateau. One think that could bring even more insight would be to compare this evolution to, say, evolution among sprinters’ world records over time, etc. I guess we’re hitting the same kind of limitation.

A major difference is that in Rubik’s Cube, you’re limited both by your fingers’ dexterity, and the speed at which your brain can process information, whereas in athletics, it’s more about physical performance. I would be curious to see which one is more limiting: brain, or body.

Another open question for me is: how much does the final score depends on the initial configuration of the Rubik’s Cube? I know that they are able to look at it before the beginning, so maybe this is negligible, maybe not.

../

Rubik's Cube records

Players

Records

Last words