Pandas Data Structures: DataFrame basics
In the previous post I’ve summed up some of the basics of using pandas Series data structure. Now, let’s take a quick look at DataFrame.
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It’s sort of like an SQL table, or a dict of Series objects.
This is the object you use a lot when digging through some data with pandas. I’m not going to describe how to create a DataFrame from lists and dicts, because for me it’s more common that I need to dig through some existing data rather than create new data. As it happens with existing data, it comes in all kinds of formats, and for the most typical ones, it’s easy to read it into a DataFrame for further manipulation.
For my examples I’ll be using movie_metadata.csv
that has been scraped from IMDB.com by Chuan Sun (@sundeepblue on Github) .
Read to DataFrame
Quite often your data comes as an excel or .csv file.
To read it into a DataFrame, just use, well, read_excel
or read_csv
.
import pandas as pd
df = pd.read_csv('movie_metadata.csv')
In ideal world (and in this example) that would be it, but most of the time the original excel or csv files are full of mess. Luckily you don’t need to write any code to get only one sheet, or skip rows, or fix encoding - read_excel
and read_csv
have a bunch of options to help you fix things.
There are also standard methods to read from an SQL query or database table, or JSON.
Unfortunately, if you have XML, you’ll need to do some parsing yourself. Perhaps, I’ll write an example of how I did it in some other post. But here, we have a nice clean csv.
What’s in a DataFrame?
Now, when you have a DataFrame (with loads of data), the fun part begins. Pandas allow you to do a lot with a DataFrame with a couple of lines of code. But first, let’s see what we’ve got in it.
Let’s see how big is the DataFrame we’ve got:
df.shape
--------------------
(5043, 28)
--------------------
There you go - 5043 rows and 28 columns.
What are the column names?
list(df)
--------------------
['color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes']
--------------------
Like with Series you can use .head() or tail() to “preview” what the data looks like:
df.head(3)
--------------------
color director_name num_critic_for_reviews duration \
0 Color James Cameron 723.0 178.0
1 Color Gore Verbinski 302.0 169.0
2 Color Sam Mendes 602.0 148.0
director_facebook_likes actor_3_facebook_likes actor_2_name \
0 0.0 855.0 Joel David Moore
1 563.0 1000.0 Orlando Bloom
2 0.0 161.0 Rory Kinnear
actor_1_facebook_likes gross genres \
0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi
1 40000.0 309404152.0 Action|Adventure|Fantasy
2 11000.0 200074175.0 Action|Adventure|Thriller
...
Let’s get some data
Now that we have an idea of what’s in this DataFrame, we can start digging some data out of it.
We can start by getting a column by its name, and set a range of indices we want, say we want first three movie titles:
df['movie_title'][:3]
--------------------
0 Avatar
1 Pirates of the Caribbean: At Worlds End
2 Spectre
Name: movie_title, dtype: object
--------------------
We can specify multiple columns too:
df[['movie_title','director_name']][:3]
--------------------
movie_title director_name
0 Avatar James Cameron
1 Pirates of the Caribbean: At Worlds End Gore Verbinski
2 Spectre Sam Mendes
--------------------
We can see top 5 directors with the most movies made by them (at least from these records).
df['director_name'].value_counts()[:5]
--------------------
Steven Spielberg 26
Woody Allen 22
Martin Scorsese 20
Clint Eastwood 20
Ridley Scott 17
Name: director_name, dtype: int64
--------------------
We can see what’s the average number of movies for a director:
df['director_name'].value_counts().mean()
--------------------
2.05963302752
--------------------
Let’s find all the movies where Woody Allen is a director and plays the main character:
df[(df['director_name'] =='Woody Allen') & (df['actor_1_name']=='Woody Allen')]['movie_title']
--------------------
1861 The Curse of the Jade Scorpion
2287 Deconstructing Harry
2430 Small Time Crooks
2457 Anything Else
2577 Hollywood Ending
2695 New York Stories
3889 Annie Hall
4250 Sleeper
4252 Everything You Always Wanted to Know About Sex...
4324 Bananas
Name: movie_title, dtype: object
--------------------
How many movies are there where a director and the main actor are the same person?
len(df[(df['director_name'] == df['actor_1_name'])])
--------------------
64
--------------------
Let’s say we’d like to find out whose movies have brought more gross overall. We’ll need to use groupby and aggregate for that
movies = df[['director_name','gross']]
movies.groupby('director_name').aggregate(sum).sort_values(by="gross", ascending=False).head(10)
# This line means "Group the rows by director_name.
# Add up all the values for each director_name,
# then sort values in column gross in descending order. Show only first 10".
--------------------
gross
director_name
Steven Spielberg 4.114233e+09
Peter Jackson 2.592969e+09
Michael Bay 2.231243e+09
Tim Burton 2.071275e+09
Sam Raimi 2.049549e+09
James Cameron 1.948126e+09
Christopher Nolan 1.813228e+09
George Lucas 1.741418e+09
Joss Whedon 1.730887e+09
Robert Zemeckis 1.619309e+09
--------------------
Just like that. One line, no loops. Isn’t it great?