Pandas Data Structures: Series

Pandas is a widely used Python library for data manipulation and analysis. It offers data structures and operations that make it easier to work with data. One of its fundamental data structures is the Series.

Series is a one-dimensional labeled array capable of holding any data type. The axis labels are collectively referred to as the index.

This post collects some of the things I’ve learned about Series so that I have them in one place for quick reference. It is only a basic intro; there’s a lot more to this data structure.

Before doing anything with Series, we need to import pandas:

import pandas as pd

Series Basics

You can create a Series from a list with any data type:

fruit = ['banana', 'apple', 'orange', 42, ['blueberries','raspberries']]
my_fruit_series = pd.Series(fruit)
----------------
0                        banana
1                         apple
2                        orange
3                            42
4    [blueberries, raspberries]
dtype: object
----------------

By default, each item receives an integer index label from 0 to N-1, where N is the length of the Series. You can find the length of the Series with the built-in len() or the size attribute:

len(my_fruit_series)
----------------
5
----------------
my_fruit_series.size
----------------
5
----------------

If you want to set the indices yourself, you can do that too:

fruit = ['banana', 'apple', 'orange', 42, ['blueberries','raspberries']]
my_fruit_series = pd.Series(fruit, index=list('abcde'))
----------------
a                        banana
b                         apple
c                        orange
d                            42
e    [blueberries, raspberries]
dtype: object
----------------

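You can also create a Series from a dictionary; in this case the keys are used to build the index. (The output below comes from an older pandas version that sorted the keys; recent versions keep the dictionary’s insertion order.)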

GOT_cast = {'Tyrion Lannister': 'Peter Dinklage', 'Cersei Lannister': 'Lena Headey',
           'Daenerys Targaryen':'Emilia Clarke','Jon Snow':'Kit Harington'}

GOT_series = pd.Series(GOT_cast)
----------------
Cersei Lannister         Lena Headey
Daenerys Targaryen     Emilia Clarke
Jon Snow               Kit Harington
Tyrion Lannister      Peter Dinklage
dtype: object
----------------
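
When building a Series from a dictionary, you can also pass index explicitly to pick and order the keys you want. Here is a minimal sketch; the cast dictionary and the wanted list are made up for illustration. Keys that are missing from the dictionary come back as NaN:

```python
import pandas as pd

cast = {'Tyrion Lannister': 'Peter Dinklage', 'Jon Snow': 'Kit Harington'}
wanted = ['Jon Snow', 'Tyrion Lannister', 'Bronn']

# Values are looked up by key; 'Bronn' is not in the dict, so it becomes NaN
picked = pd.Series(cast, index=wanted)
print(picked)
```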

Sometimes it’s useful to create a large Series of random numbers. For example, here’s a Series of 10000 random integers from 0 to 499 (the upper bound of randint() is exclusive):

import numpy as np
random_numbers = pd.Series(np.random.randint(0,500,10000))
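
If you need the same random numbers on every run, one option is a seeded NumPy generator. This is only a sketch: the seed value 42 is arbitrary, and default_rng() assumes a reasonably recent NumPy.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same numbers come back on every run
rng = np.random.default_rng(42)

# integers() excludes the upper bound by default, so values fall in 0..499
random_numbers = pd.Series(rng.integers(0, 500, 10000))
print(random_numbers.head())
```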

If you have a large Series, you may want to preview it with the head() or tail() method. By default head() shows the first 5 elements of the Series and tail() shows the last 5, but you can pass any other number.

GOT_cast = {'Tyrion Lannister': 'Peter Dinklage', 'Cersei Lannister': 'Lena Headey',
           'Daenerys Targaryen':'Emilia Clarke','Jon Snow':'Kit Harington',
           'Sansa Stark':'Sophie Turner', 'Arya Stark':'Maisie Williams',
           'Jaime Lannister':'Nikolaj Coster-Waldau', 'Jorah Mormont':'Iain Glen',
           'Theon Greyjoy':'Alfie Allen','Samwell Tarly':'John Bradley'}

GOT_series = pd.Series(GOT_cast)
GOT_series.head(3)
----------------
Arya Stark            Maisie Williams
Cersei Lannister          Lena Headey
Daenerys Targaryen      Emilia Clarke
dtype: object
----------------
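
tail() works the same way from the other end. A small sketch with a made-up Series of letters:

```python
import pandas as pd

letters = pd.Series(list('abcdefgh'))

first_two = letters.head(2)   # positions 0 and 1
last_three = letters.tail(3)  # positions 5, 6 and 7
print(first_two)
print(last_three)
```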

If you want all the values and don’t really care about the index, you can get them as an array with the values attribute:

GOT_series.values
----------------
['Maisie Williams' 'Lena Headey' 'Emilia Clarke' 'Nikolaj Coster-Waldau'
 'Kit Harington' 'Iain Glen' 'John Bradley' 'Sophie Turner' 'Alfie Allen'
 'Peter Dinklage']
----------------

You can get the Index object too:

GOT_series.index
----------------
Index(['Arya Stark', 'Cersei Lannister', 'Daenerys Targaryen',
       'Jaime Lannister', 'Jon Snow', 'Jorah Mormont', 'Samwell Tarly',
       'Sansa Stark', 'Theon Greyjoy', 'Tyrion Lannister'],
      dtype='object')
----------------

Speaking of indices, let’s see how to select something from a Series. If we know the label of the value, we can use .loc or []:

GOT_series.loc['Tyrion Lannister']
----------------
Peter Dinklage
----------------
GOT_series['Tyrion Lannister']
----------------
Peter Dinklage
----------------

If the label isn’t in the index, it’ll raise a KeyError:

GOT_series.loc['Bronn']
----------------
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1434, in _has_valid_type
    error()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1429, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: 'the label [Bronn] is not in the [index]'
----------------
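
If you’d rather not deal with the exception, you can check for the label first or use get() with a default. A minimal sketch with a shortened, made-up version of the cast Series:

```python
import pandas as pd

cast = pd.Series({'Tyrion Lannister': 'Peter Dinklage', 'Jon Snow': 'Kit Harington'})

# `in` tests membership in the index (the labels), not in the values
has_bronn = 'Bronn' in cast

# get() returns the default instead of raising KeyError
actor = cast.get('Bronn', 'unknown')
print(has_bronn, actor)
```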

But you can assign to a label that doesn’t exist yet to “append” to the Series:

GOT_series.loc['Bronn'] = 'Jerome Flynn'
GOT_series
----------------
Arya Stark                  Maisie Williams
Cersei Lannister                Lena Headey
Daenerys Targaryen            Emilia Clarke
Jaime Lannister       Nikolaj Coster-Waldau
Jon Snow                      Kit Harington
Jorah Mormont                     Iain Glen
Samwell Tarly                  John Bradley
Sansa Stark                   Sophie Turner
Theon Greyjoy                   Alfie Allen
Tyrion Lannister             Peter Dinklage
Bronn                          Jerome Flynn
dtype: object
----------------

It’s also possible to select by integer position with .iloc:

GOT_series.iloc[2]
----------------
Emilia Clarke
----------------
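
One difference between the two that is easy to trip over is slicing: .iloc slices exclude the stop position, like regular Python slices, while .loc slices include the stop label. A small sketch with a made-up Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=list('abcd'))

by_position = s.iloc[1:3]   # positions 1 and 2; the stop position is excluded
by_label = s.loc['b':'c']   # labels 'b' through 'c'; the stop label is included
print(by_position)
print(by_label)
```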

.loc, .iloc, and [] can all accept a callable as the indexer, which I find pretty cool. The callable receives the Series and returns something valid for indexing, so you can do something like this:

random_numbers = pd.Series(np.random.randint(0,500,10000))
random_numbers.loc[lambda s: s > 400].head()
----------------
17    451
19    442
22    488
24    431
32    479
dtype: int64
----------------
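
For comparison, the callable form selects the same rows as a plain boolean mask. The sketch below uses a small made-up Series so the result is easy to check by eye; the callable form is mostly handy when the Series has no name yet, for example in the middle of a method chain.

```python
import pandas as pd

s = pd.Series([120, 450, 399, 480, 401])

with_callable = s.loc[lambda x: x > 400]
with_mask = s[s > 400]   # same selection, written as a boolean mask
print(with_callable)
```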

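At first, it seemed to me that the where() method would return the same result as selection by callable, given the same condition, but there’s a big difference. where() returns a Series of exactly the same shape: values that match the condition stay where they are, and the rest become NaN (which is why the dtype changes to float64). Here random_numbers is a small Series of ten random integers from 0 to 9:

# ten random integers from 0 to 9, so your values will differ from the output below
random_numbers = pd.Series(np.random.randint(0,10,10))
random_numbers.loc[lambda s: s > 5]
random_numbers.where(random_numbers>5)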
----------------
3    9
6    9
7    6
9    9
dtype: int64
0    NaN
1    NaN
2    NaN
3    9.0
4    NaN
5    NaN
6    9.0
7    6.0
8    NaN
9    9.0
dtype: float64
----------------

By default where() returns a new Series and doesn’t modify the original data. There is an optional inplace parameter (inplace=True) that modifies the original data instead.
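
where() also takes a second argument to use instead of NaN, and mask() is its mirror image. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([2, 9, 4, 6])

kept = s.where(s > 5)        # non-matching values become NaN, dtype turns float64
filled = s.where(s > 5, 0)   # non-matching values become 0 instead
inverse = s.mask(s > 5, 0)   # mask() replaces where the condition is True
print(kept)
print(filled)
print(inverse)
```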

Some Math with Series

Let’s take this Series as an example:

s = pd.Series([3,12,1,7,15])
----------------
0     3
1    12
2     1
3     7
4    15
dtype: int64
----------------

You can get the sum of the values with sum() from numpy:

total = np.sum(s)
total
----------------
38
----------------
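
Series also has its own aggregation methods, so numpy isn’t strictly needed here. A quick sketch on the same values:

```python
import pandas as pd

s = pd.Series([3, 12, 1, 7, 15])

total = s.sum()      # 38, same as np.sum(s)
average = s.mean()   # 7.6
largest = s.max()    # 15
print(total, average, largest)
```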

You can add N to each item in a Series using broadcasting (the same goes for division, multiplication, and subtraction):

s+=2
----------------
0     5
1    14
2     3
3     9
4    17
dtype: int64
----------------

You can also add two Series together:

s = pd.Series([3,12,1,7,15])
s1 = pd.Series([1,2,3,4,5,6])
s+s1
----------------
0     4.0
1    14.0
2     4.0
3    11.0
4    20.0
5     NaN
dtype: float64
----------------

It doesn’t matter if the two Series differ in length, or even if their indices don’t match exactly. The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN).
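
If NaN isn’t what you want for the unmatched labels, the add() method accepts a fill_value that stands in for the missing side. A minimal sketch reusing the two Series from above:

```python
import pandas as pd

s = pd.Series([3, 12, 1, 7, 15])
s1 = pd.Series([1, 2, 3, 4, 5, 6])

# Label 5 exists only in s1; fill_value=0 stands in for the missing side
total = s.add(s1, fill_value=0)
print(total)
```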

With describe() you can get a quick statistical summary of your data:

random_numbers = pd.Series(np.random.randint(0,500,10000))
random_numbers.describe()
----------------
count    10000.000000
mean       251.338500
std        145.344209
min          0.000000
25%        124.750000
50%        252.000000
75%        379.000000
max        499.000000
dtype: float64
----------------

To see if two Series are exactly the same (both indices and values), you can use equals():

s = pd.Series([1,2,3,4,5])
s1 = pd.Series([1,2,3,4,5], index=list('abcde'))
s.equals(s1)
----------------
False
----------------
s = pd.Series([1,2,3,4,5])
s2 = pd.Series([1,2,3,4,5])
s.equals(s2)
----------------
True
----------------
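
One place where equals() differs from comparing with == is missing data: equals() treats NaNs in the same position as equal, while an element-wise comparison does not. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

same = a.equals(b)            # True: NaNs in the same position count as equal
elementwise = (a == b).all()  # False: NaN == NaN is False element by element
print(same, elementwise)
```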

You can do element-wise comparisons with a scalar value:

s = pd.Series(['foo', 'bar', 'baz'])
s == 'foo'
----------------
0     True
1    False
2    False
dtype: bool
----------------
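
The boolean Series that comes out of a comparison can be used directly as a filter. A minimal sketch, which also uses isin() to match against several values at once:

```python
import pandas as pd

s = pd.Series(['foo', 'bar', 'baz'])

only_foo = s[s == 'foo']                 # keep rows where the comparison is True
foo_or_baz = s[s.isin(['foo', 'baz'])]   # match against several values at once
print(only_foo)
print(foo_or_baz)
```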

You can locate the labels of the minimum and maximum values with the idxmin() and idxmax() methods:

s = pd.Series([3,12,1,7,15], index=list('abcde'))
s.idxmax()
s.idxmin()
----------------
e
c
----------------

Modifying Data in a Series

You can replace values in a Series with a new value using replace(), which returns a new Series:

s = pd.Series([3,7,12,1,7,15])
s.replace(7,777)
----------------
0      3
1    777
2     12
3      1
4    777
5     15
dtype: int64
----------------
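
replace() also accepts a dictionary when several values need to be swapped in one call. A small sketch; the replacement values are made up:

```python
import pandas as pd

s = pd.Series([3, 7, 12, 1, 7, 15])

# A dict maps several old values to new ones in one call
replaced = s.replace({7: 777, 1: 111})
print(replaced)
```

The original s is left untouched, since replace() returns a new Series here.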

Modifying data with apply()

Let’s say we have a Series with heights of my imaginary friends in cm:

s = pd.Series([175,168,154,183], index=['Tim', 'Kate', 'Ann', 'Jon'])
----------------
Tim     175
Kate    168
Ann     154
Jon     183
dtype: int64
----------------

If I want to convert their height from cm to inches, I can do it with apply():

s = pd.Series([175,168,154,183], index=['Tim', 'Kate', 'Ann', 'Jon'])
s = s.apply(lambda x: x/2.54)
----------------
Tim     68.897638
Kate    66.141732
Ann     60.629921
Jon     72.047244
dtype: float64
----------------

Or, I can define a function, and pass it:

s = pd.Series([175,168,154,183], index=['Tim', 'Kate', 'Ann', 'Jon'])
def convert_cm_to_inch(x):
   return x/2.54
s = s.apply(convert_cm_to_inch)
----------------
Tim     68.897638
Kate    66.141732
Ann     60.629921
Jon     72.047244
dtype: float64
----------------

If the function takes more arguments, you can pass them as a tuple with args=:

s = pd.Series([175,168,154,183], index=['Tim', 'Kate', 'Ann', 'Jon'])
def add_height_of_their_hat(x, hat_height):
   return x+hat_height
s = s.apply(add_height_of_their_hat, args=(15,))
----------------
Tim     190
Kate    183
Ann     169
Jon     198
dtype: int64
----------------
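
For simple arithmetic like the cm-to-inches conversion, plain broadcasting gives the same result as apply() and is usually the more direct way to write it. A minimal sketch comparing the two:

```python
import pandas as pd

s = pd.Series([175, 168, 154, 183], index=['Tim', 'Kate', 'Ann', 'Jon'])

with_apply = s.apply(lambda x: x / 2.54)
vectorized = s / 2.54   # same arithmetic, applied to the whole Series at once
print(vectorized)
```

apply() is still the tool to reach for when the per-item logic can’t be expressed as an operation on the whole Series.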

I’d say that should do it for getting started with Series. In the next post I’ll cover DataFrame.