How to use Pandas for data analysis in Python


print(df.groupby('year')['pop'].mean())
print(df.groupby('year')['gdpPercap'].mean())

So far, so good. But what if we want to group our data by more than one column? We can do this by passing a list of column names:


print(df.groupby(['year', 'continent'])
  [['lifeExp', 'gdpPercap']].mean())
                  lifeExp     gdpPercap
year continent
1952 Africa     39.135500   1252.572466
     Americas   53.279840   4079.062552
     Asia       46.314394   5195.484004
     Europe     64.408500   5661.057435
     Oceania    69.255000  10298.085650
1957 Africa     41.266346   1385.236062
     Americas   55.960280   4616.043733
     Asia       49.318544   5787.732940
     Europe     66.703067   6963.012816
     Oceania    70.295000  11598.522455
1962 Africa     43.319442   1598.078825
     Americas   58.398760   4901.541870
     Asia       51.563223   5729.369625
     Europe     68.539233   8365.486814
     Oceania    71.085000  12696.452430

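This .groupby() operation takes our data and groups it first by year, and then by continent. Then, it generates mean values from the life-expectancy and GDP columns. The order of the columns in the list sets the nesting of the result: the first column becomes the outer level of the index and the second becomes the inner level, so you control both how the groups are formed and how they are presented.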
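To see how the grouping order shapes the result, here is a minimal sketch on a tiny hand-made frame. The numbers are invented for illustration and do not come from the dataset used above, and the names `toy`, `means`, and `asia_1952` are hypothetical.

```python
import pandas as pd

# Tiny invented dataset; values are illustrative only.
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "continent": ["Asia", "Asia", "Europe", "Asia", "Asia", "Europe"],
    "lifeExp": [40.0, 50.0, 65.0, 44.0, 52.0, 67.0],
    "gdpPercap": [800.0, 1200.0, 6000.0, 900.0, 1500.0, 7000.0],
})

# Group by year first, then continent; the result has a two-level index.
means = toy.groupby(["year", "continent"])[["lifeExp", "gdpPercap"]].mean()
print(means)

# A single group can be looked up with a (year, continent) tuple.
asia_1952 = means.loc[(1952, "Asia")]
print(asia_1952["lifeExp"])  # mean of 40.0 and 50.0
```

Swapping the list to `["continent", "year"]` would produce the same means, but nested by continent first.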

If you want to “flatten” the results into a single, incrementally indexed frame, you can use the .reset_index() method on the results:


gb = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
flat = gb.reset_index()
print(flat.head())
   year continent    lifeExp     gdpPercap
0  1952    Africa  39.135500   1252.572466
1  1952  Americas  53.279840   4079.062552
2  1952      Asia  46.314394   5195.484004
3  1952    Europe  64.408500   5661.057435
4  1952   Oceania  69.255000  10298.085650
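As an alternative to flattening after the fact, groupby also accepts an `as_index=False` argument that keeps the grouping keys as ordinary columns. The sketch below compares the two approaches on a small invented frame; the data and the names `toy`, `flat_a`, and `flat_b` are assumptions for illustration only.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "continent": ["Asia", "Europe", "Asia", "Europe"],
    "lifeExp": [45.0, 65.0, 48.0, 67.0],
})

# Option 1: group, aggregate, then flatten the index.
flat_a = toy.groupby(["year", "continent"])[["lifeExp"]].mean().reset_index()

# Option 2: ask groupby not to build an index from the keys at all.
flat_b = toy.groupby(["year", "continent"], as_index=False)[["lifeExp"]].mean()

print(flat_a)
print(flat_b)
```

Which one to use is mostly a matter of taste: .reset_index() is handy when you already have a grouped result, while `as_index=False` avoids creating the multi-level index in the first place.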

Grouped frequency counts

Something else we often do with data is compute frequencies. The nunique method counts the distinct values in a series, and value_counts reports how often each value occurs. For instance, here’s how to find out how many countries we have in each continent:


print(df.groupby('continent')['country'].nunique()) 
continent
Africa    52
Americas  25
Asia      33
Europe    30
Oceania    2
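The difference between the two methods is easier to see side by side. The following is a minimal sketch on invented rows (one row per country-and-year observation); the frame `toy` and the variable names are hypothetical and not part of the original dataset.

```python
import pandas as pd

# Invented rows: one row per (country, year) observation.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "country": ["Japan", "Japan", "India", "France", "France"],
})

# nunique: number of distinct countries per continent.
distinct = toy.groupby("continent")["country"].nunique()
print(distinct)

# value_counts: how many rows each country contributes.
rows_per_country = toy["country"].value_counts()
print(rows_per_country)
```

When a dataset holds several rows per country, calling value_counts on the continent column counts rows rather than countries, which is why the grouped nunique call is the right tool for the question above.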

Basic plotting with Pandas and Matplotlib

Most of the time, when you want to visualize data, you’ll use another library such as Matplotlib to generate those graphics. However, Pandas can call Matplotlib (and some other plotting libraries) for you, so you can generate visualizations directly from a DataFrame or Series.

To use the simple Matplotlib extension for Pandas, first make sure you’ve installed Matplotlib with pip install matplotlib.

Now let’s look at the yearly life expectancies for the world population again:


global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean() 
print(global_yearly_life_expectancy) 
year
1952  49.057620
1957  51.507401
1962  53.609249
1967  55.678290
1972  57.647386
1977  59.570157
1982  61.533197
1987  63.212613
1992  64.160338
1997  65.014676
2002  65.694923
2007  67.007423
Name: lifeExp, dtype: float64

To create a basic plot from this, use:


import matplotlib.pyplot as plt

# Compute the yearly means and draw them as a line plot
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
global_yearly_life_expectancy.plot()

# Save the current figure to disk
plt.savefig("output.png")

The plot will be saved to a file in the current working directory as output.png. The axes and other labeling on the plot can all be set manually, but for quick exports this method works fine.
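For anything beyond a quick export, the labels can be set on the Matplotlib Axes object that .plot() returns. The sketch below uses a short invented series so that it runs on its own; the values, the title and label text, and the file name `labeled_output.png` are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented yearly averages, for illustration only.
yearly = pd.Series(
    [49.0, 51.5, 53.6],
    index=pd.Index([1952, 1957, 1962], name="year"),
    name="lifeExp",
)

# Series.plot() returns a Matplotlib Axes, which exposes the label setters.
ax = yearly.plot()
ax.set_title("Mean life expectancy by year")
ax.set_xlabel("Year")
ax.set_ylabel("Life expectancy (years)")

# Save through the figure that owns the axes.
ax.get_figure().savefig("labeled_output.png")
```

Working through the Axes object keeps the labeling tied to one specific plot, which helps once a script produces more than one figure.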

Conclusion

Python and Pandas offer many features you can’t get from spreadsheets. For one, they let you automate your work with data and make the results reproducible. Rather than write spreadsheet macros, which are clunky and limited, you can use Pandas to analyze, segment, and transform data—and use Python’s expressive power and package ecosystem (for instance, for graphing or rendering data to other formats) to do even more than you could with Pandas alone.
