From 1980 to…how clean is this data?
Introduction
The Vault of Classics, my first DataCamp Competition, was based on a Kaggle dataset containing console video game sales from 1980 to 2020. You can find it here. This competition was an invaluable experience. It gave me a deadline and the pace of the competition forced me to become functional in matplotlib and seaborn data visualization techniques in about a month.
My goals were to
- visualize how each genre’s popularity changed over time.
- determine the most powerful publishers and examine how their market share changed over time.
- identify the most beloved games and the most beloved game series.
Data Cleaning and Overview
The video game dataset that DataCamp provided contains 16, 598 records and 11 variables; rank, game title, publisher, platform (console), genre, release year, global sales, European sales, North American sales, Japanese sales, and Other sales. The rank variable was based on the ranking at the dataset’s origin so I dropped it in favor of creating my own ranking based on global sales.
After cleaning up two duplicate entries, I turned to missing data.
Recovering Missing Release Years
Missing release years account for $100 million, or just over 1%, of total sales and the largest single record affected by missing release year data accounts for $5.23 million in global sales.

However, we can see that three of the five top selling games with a null release year are not missing year data at all! The year data is in the title of the game.
In order to determine which release year to assign to each of these games, let’s look at a game with a year in the title and a non-null release year.

FIFA 15 was released in 2014. It is a common practice to release a game the autumn before it’s title is current. I used string methods to extract the year data from game titles when possible and assigned the previous year to the release year field.
After assigning release year based on game titles, 227 missing years remain. Let’s take a look at a few examples of these 227.

Interestingly, Call of Duty: Black Ops and LEGO Batman: The Videogame are each missing year values for certain platforms, but the year is consistently labeled for other platforms. To recapture the missing release year data, I grouped all games with complete data by title, year, and publisher. Then I merged the dataset of games with missing years with the table of games with complete data, joining on the title and publisher fields. This added 55 new rows with complete year data.
Of course, everything can’t be cleaned up, but here’s a summary of my success recapturing missing year data.

The Impact of Missing Years
Because I’m going to be looking at sales by genre and by publisher over time, I want to see how much those remaining missing years affect individual genres and publishers.

Miscellaneous and Adventure style games are the only genres missing more than one percent of their release year information.

Here, many publishers are missing year information. However, these are largely small publishers that will not be considered among the top performers. Only Konami is among the top ten highest grossing publishers and Konami’s missing release year data just crests 1%.
Recovering Missing Publisher Information
Missing publishers account for roughly $56 million or about half a percent of all sales.

Missing publisher data is not as simple as missing year data. Let’s look at a two games with missing publisher information.

Looking at the NASCAR Thunder 2003 example, there is no publisher listed for the year 2003. (It also looks like the year field was mishandled in the original dataset.) NASCAR clearly has only one possible publisher. I joined games with missing publishers to games of the exact same title with complete publisher data. This cleaned up titles with multiple release years, like NASCAR. Then, I merged remaining missing publisher games with complete publisher games of the exact same title and release year.
Teenage Mutant Ninja Turtles is missing publisher information for the Game Boy Advance platform in 2003. It looks like all 2003 editions of this game were published by Konami Digital Entertainment. To recapture these publishers, I assigned Konami to the unknown publisher field for the 2003 GBA edition of the game by merging the missing publisher dataset with the complete dataset, joining on title and year. This method only recaptured Teenage Mutant Ninja Turtles.
I am not sure that these final efforts were worth my labor or computing power. Though the first recapture added nearly sixty new records, the original impact of missing publishers was rather low at about half a percent of gross global sales.

There are still over 200 records without publisher data, some publishers of which are very findable. Let’s take a look at which titles remain with missing publisher data.

Smackdown and Gran Turismo are familiar titles and there are likely to be more instances of these titles. If the missing publisher records comprised a greater percentage of the total global sales, it might be worth the computational expense of a records matching algorithm. However, recovering a few million dollars in missing publisher data will not affect the results of nearly $9 billion in video game sales.
Examining Genre and Publisher Data Integrity
Next, I wanted to be sure that genres were being correctly assigned to games. The following is a list of games assigned to more than one genre.

After some internet research, I found that Culdcept is a strategy game and is similar to a board game; Little Busters! and Steins; Gate: Hiyoku Renri no Darling are visual novels; and Syndicate is strategy with shooting. These titles evade categorization and account for a tiny fraction of total sales. I will ignore their genre labeling inconsistencies.
One hundred and seventy-nine publishers share a game title with another publisher. Because 179 is a manageable number of records, I decided to read through the alphabetized publisher names. Similarities between publishing names convinced me to do some digging. For instance, is Falcom Corporation a typo for Palcom? No, it is not. However, EA Games is an abbreviation of Electronic Arts.
Acquisitions and subsidiaries account for many of the games being released by different publishers. Kadokawa Shoten is a division of Kadokawa Games. Hudson Entertainment is a subsidiary of Hudson Soft. How should these publishers and associated titles be handled? Ultimately, I chose to leave publisher data as is. Were I providing this data analysis to client or superior, I would have a discussion with the end user regarding the purpose of the analysis. A deep understanding of publisher acquisitions and mergers would require more manpower and may not be cost effective for the project at hand.
Quality of the Financial Data
Since the dataset provides sales numbers for Japan, North America, Europe, and other locations and separate global sales totals, I wanted to see if there were any discrepancies between a sum of the regional sales numbers and the reported global sales.

The net difference in global sales and total regional sales is an insignificant portion of the total global sales. In my analysis, except where stated otherwise, I will use the reported global sales.
Data Completeness
As one final data integrity check, I graphed the number of games in the dataset by release year.

The number of games published generally increases each year until 2008 where it stabilizes and then begins to decrease. It is my suspicion that the true number of games published in a year did not decrease, but the data completeness deteriorates. For this reason, I have chosen to ignore the last four years of this dataset.
The Final Dataset
After recapturing missing release year and publisher data where I could, noting publisher, genre, and sales discrepancies, and creating a more complete subset of the dataset, the video games sales data is prepared for analysis and visualization. Most of the data issues are small enough to be of little impact on the truth of the final analyses. The lack of complete data is the largest hurdle to drawing true conclusions about the video game market.
The Python code for this project can be found here.