print(df.groupby('year')['pop'].mean())
print(df.groupby('year')['gdpPercap'].mean())
So far, so good. But what if we want to group our data by more than one column? We can do this by passing a list of column names to .groupby():
print(df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean())
                  lifeExp     gdpPercap
year continent
1952 Africa     39.135500   1252.572466
     Americas   53.279840   4079.062552
     Asia       46.314394   5195.484004
     Europe     64.408500   5661.057435
     Oceania    69.255000  10298.085650
1957 Africa     41.266346   1385.236062
     Americas   55.960280   4616.043733
     Asia       49.318544   5787.732940
     Europe     66.703067   6963.012816
     Oceania    70.295000  11598.522455
1962 Africa     43.319442   1598.078825
     Americas   58.398760   4901.541870
     Asia       51.563223   5729.369625
     Europe     68.539233   8365.486814
     Oceania    71.085000  12696.452430
This .groupby() operation takes our data and groups it first by year, and then by continent. Then, it generates mean values from the life-expectancy and GDP columns. The order of the column names in the list sets the order of the grouping levels, so you control both how the groups are formed and how they are nested in the output.
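As a rough illustration of what that nested result looks like in practice, the sketch below builds a tiny stand-in frame and selects from the grouped means by year and by (year, continent) pair. The frame `toy` and all of its numbers are made up for the example and are not taken from the dataset used in this article.

```python
import pandas as pd

# Hypothetical stand-in for df: the values below are invented for illustration
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "continent": ["Asia", "Asia", "Europe", "Asia", "Europe", "Europe"],
    "lifeExp": [40.0, 50.0, 60.0, 45.0, 65.0, 67.0],
    "gdpPercap": [1000.0, 3000.0, 5000.0, 2000.0, 6000.0, 8000.0],
})

means = toy.groupby(["year", "continent"])[["lifeExp", "gdpPercap"]].mean()

# The result is indexed by two levels, in the order the keys were given
print(means.index.names)

# Select every continent for a single year (outer level)
print(means.loc[1952])

# Select one specific (year, continent) group
print(means.loc[(1952, "Asia")])
```

Selecting by the outer level alone returns a smaller frame indexed by continent, while a full (year, continent) tuple returns the means for that single group.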
If you want to “flatten” the results into a single frame with a plain integer index, you can use the .reset_index() method on the results:
gb = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
flat = gb.reset_index()
print(flat.head())
| year continent lifeExp gdpPercap
| 0 1952 Africa 39.135500 1252.572466
| 1 1952 Americas 53.279840 4079.062552
| 2 1952 Asia 46.314394 5195.484004
| 3 1952 Europe 64.408500 5661.057435
| 4 1952 Oceania 69.255000 10298.085650
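An alternative to calling .reset_index() afterward is to pass as_index=False to .groupby(), which keeps the grouping keys as ordinary columns from the start. The sketch below shows this on a small stand-in frame; `toy` and its values are invented for the example.

```python
import pandas as pd

# Hypothetical stand-in for df: the values below are invented for illustration
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "continent": ["Asia", "Asia", "Europe", "Asia", "Europe", "Europe"],
    "lifeExp": [40.0, 50.0, 60.0, 45.0, 65.0, 67.0],
    "gdpPercap": [1000.0, 3000.0, 5000.0, 2000.0, 6000.0, 8000.0],
})

# as_index=False leaves 'year' and 'continent' as regular columns
flat = toy.groupby(["year", "continent"], as_index=False)[
    ["lifeExp", "gdpPercap"]
].mean()

print(flat)
```

Which form to use is mostly a matter of taste: .reset_index() is handy when you already have a grouped result, while as_index=False avoids the intermediate step.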
Grouped frequency counts
Something else we often do with data is compute frequencies. The nunique method returns the number of distinct values in a series, and value_counts returns how often each value appears. For instance, here’s how to find out how many countries we have in each continent:
print(df.groupby('continent')['country'].nunique())
continent
Africa 52
Americas 25
Asia 33
Europe 30
Oceania 2
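The value_counts method mentioned above answers a slightly different question: it counts rows per value rather than distinct values. The sketch below contrasts the two on a small stand-in frame (`toy` is hypothetical and holds only a few rows), assuming, as in the data used here, that each country appears once per year.

```python
import pandas as pd

# Hypothetical stand-in for df: one row per country per year
toy = pd.DataFrame({
    "country": ["Japan", "Japan", "India", "India", "France", "France"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "year": [1952, 1957, 1952, 1957, 1952, 1957],
})

# value_counts() counts rows for each continent
row_counts = toy["continent"].value_counts()
print(row_counts)

# nunique() counts distinct countries within each continent
country_counts = toy.groupby("continent")["country"].nunique()
print(country_counts)
```

Because every country contributes one row per year, the row counts are larger than the country counts, which is why nunique is the right tool for counting countries.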
Basic plotting with Pandas and Matplotlib
Most of the time, when you want to visualize data, you’ll use another library such as Matplotlib to generate those graphics. However, Pandas can drive Matplotlib (and some other plotting libraries) for you, so you can generate visualizations directly from a DataFrame or Series.
To use the simple Matplotlib extension for Pandas, first make sure you’ve installed Matplotlib with pip install matplotlib.
Now let’s look at the yearly life expectancies for the world population again:
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)
| year
| 1952 49.057620
| 1957 51.507401
| 1962 53.609249
| 1967 55.678290
| 1972 57.647386
| 1977 59.570157
| 1982 61.533197
| 1987 63.212613
| 1992 64.160338
| 1997 65.014676
| 2002 65.694923
| 2007 67.007423
| Name: lifeExp, dtype: float64
To create a basic plot from this, use:
import matplotlib.pyplot as plt
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
# .plot() returns a Matplotlib Axes; get its parent Figure so we can save it
fig = global_yearly_life_expectancy.plot().get_figure()
fig.savefig("output.png")
The plot will be saved to a file in the current working directory as output.png. The axes and other labeling on the plot can all be set manually, but for quick exports this method works fine.
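As a minimal sketch of setting the labels manually, the example below uses the Axes object that .plot() returns. To stay self-contained, it rebuilds a short series from the first three yearly means printed earlier. The title and label text, the marker style, the "Agg" backend choice, and the file name labeled_output.png are all assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# First three yearly means from the output shown earlier
life_exp = pd.Series(
    [49.057620, 51.507401, 53.609249],
    index=pd.Index([1952, 1957, 1962], name="year"),
    name="lifeExp",
)

# .plot() returns a Matplotlib Axes object that can be customized
ax = life_exp.plot(marker="o")
ax.set_title("Mean life expectancy by year")
ax.set_xlabel("Year")
ax.set_ylabel("Life expectancy (years)")

# Save through the parent Figure, then release it
fig = ax.get_figure()
fig.savefig("labeled_output.png")
plt.close(fig)
```

Working through the Axes object keeps the labeling tied to one specific plot, which helps once a script produces more than one figure.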
Conclusion
Python and Pandas offer many features you can’t get from spreadsheets. For one, they let you automate your work with data and make the results reproducible. Rather than write spreadsheet macros, which are clunky and limited, you can use Pandas to analyze, segment, and transform data—and use Python’s expressive power and package ecosystem (for instance, for graphing or rendering data to other formats) to do even more than you could with Pandas alone.



