The Python Data Analysis Library (pandas) is a data structures and analysis library.
Intro to pandas data structures, working with pandas data frames and Using pandas on the MovieLens dataset form a well-written three-part blog series that introduces pandas, with each post building on the one before it.
pandas exercises is a GitHub repository with Jupyter Notebooks that let you practice sorting, filtering, visualizing, grouping, merging and more with pandas.
A simple way to anonymize data with Python and Pandas is a good tutorial on removing sensitive data from your unfiltered data sets.
Learn a new pandas trick every day! is a running list of great pandas tips that the author originally posted on Twitter and then aggregated onto a single webpage.
Time Series Analysis with Pandas shows you how to combine Python 3.6, pandas, matplotlib and seaborn to analyze and visualize open data from Germany's power grid. This is a great tutorial for learning these tools with a realistic data set.
Analyzing a photographer's flickr stream using pandas explains how the author grabbed a bunch of Flickr data using the flickr-api library then analyzed the EXIF data in the photos using pandas.
Pandas Crosstab Explained shows how to use the crosstab function in pandas so you can summarize and group data.
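As a rough illustration of what a cross-tabulation produces, here is a minimal sketch using `pd.crosstab`. The `region` and `plan` columns and their values are invented for this example and are not taken from the linked post.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "plan": ["free", "paid", "paid", "paid", "free"],
})

# Count how often each (region, plan) combination occurs.
# Rows come from the first argument, columns from the second.
counts = pd.crosstab(df["region"], df["plan"])
print(counts)
```

Combinations that never occur in the data, such as south/free here, are filled with a count of zero.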
Calculating streaks in pandas shows how to measure and report on streaks in data, that is, runs where the same event happens several times in a row.
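One common way to compute streaks, which may or may not match the approach in the linked post, is to mark every row where the value changes, take a cumulative sum to label each run, and then count positions within each run. The sketch below assumes a made-up series of game results where 1 is a win and 0 is a loss.

```python
import pandas as pd

# Hypothetical results: 1 = win, 0 = loss.
won = pd.Series([1, 1, 0, 1, 1, 1, 0, 0])

# A new streak starts wherever the value differs from the previous row.
starts = won != won.shift()

# The running total of streak starts gives every streak its own label.
streak_id = starts.cumsum()

# Position of each row within its streak, counted from 1.
streak_len = won.groupby(streak_id).cumcount() + 1

longest_win_streak = streak_len[won == 1].max()
print(streak_len.tolist())
print(longest_win_streak)
```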
How to Convert a Python Dictionary to a Pandas DataFrame is a straightforward tutorial with example code for loading and adding data stored in a typical Python dictionary into a DataFrame.
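For a quick sense of what this conversion looks like, the following sketch shows two common dictionary shapes. The data and variable names are invented for illustration and are not taken from the tutorial.

```python
import pandas as pd

# A dict of equal-length lists: each key becomes a column.
columns = {"name": ["Ada", "Grace"], "born": [1815, 1906]}
df = pd.DataFrame(columns)

# A dict of dicts keyed by row label: orient="index" makes
# the outer keys the row index and the inner keys the columns.
records = {"a": {"x": 1, "y": 2}, "b": {"x": 3, "y": 4}}
df_rows = pd.DataFrame.from_dict(records, orient="index")

print(df)
print(df_rows)
```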
This two-part series on loading data into a pandas DataFrame presents what to do when CSV files do not match your expectations and how to handle missing values so you can start performing your analysis rather than getting frustrated with common issues at the beginning of your workflow.
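As a small sketch of the kind of problem described here, the example below reads a CSV in which missing scores are marked inconsistently. The file contents, the `missing` sentinel and the mean-fill strategy are assumptions made for illustration, not the series' own data or recommendation.

```python
import io

import pandas as pd

# Hypothetical CSV: one score uses a custom sentinel, one is blank.
raw = "name,score\nAda,90\nBob,missing\nCy,\nDee,70\n"

# na_values tells read_csv to treat the extra sentinel as missing;
# empty fields are already treated as missing by default.
df = pd.read_csv(io.StringIO(raw), na_values=["missing"])

n_missing = df["score"].isna().sum()

# One possible strategy: fill the gaps with the column mean.
filled = df["score"].fillna(df["score"].mean())
print(n_missing)
print(filled.tolist())
```

Whether to fill, drop or flag missing values depends on the analysis, so treat the mean fill as a placeholder choice.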
Building a financial model with pandas explains how to create an amortization schedule with corresponding table and charts that show the pay off period broken down by interest and principal.
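To make the idea of an amortization schedule concrete, here is a minimal sketch that uses the standard fixed-payment loan formula. The loan terms and column names are hypothetical, and the linked post may build its table differently.

```python
import pandas as pd

# Hypothetical loan terms, chosen only for illustration.
principal = 1000.0
annual_rate = 0.12
months = 12

r = annual_rate / 12  # monthly interest rate
# Fixed monthly payment for a fully amortizing loan.
payment = principal * r / (1 - (1 + r) ** -months)

rows = []
balance = principal
for month in range(1, months + 1):
    interest = balance * r               # interest accrued this month
    principal_paid = payment - interest  # remainder reduces the balance
    balance -= principal_paid
    rows.append({
        "month": month,
        "payment": payment,
        "interest": interest,
        "principal": principal_paid,
        "balance": balance,
    })

schedule = pd.DataFrame(rows).set_index("month")
print(schedule.round(2))
```

Early payments are mostly interest and later ones mostly principal, which is the breakdown such charts typically show.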
Efficiently cleaning text with pandas is a practical tutorial on different approaches for cleaning a large data set so that you can begin your analysis. The tutorial also shows how to use the sidetable library, which creates summary tables of a DataFrame.
tabula-py: Extract table from PDF into Python DataFrame presents how to use the Python wrapper for the Tabula library that makes it easier to extract table data from PDF files.
Time Series Forecast Case Study with Python: Monthly Armed Robberies in Boston walks through the data wrangling, analysis and visualization steps with a public data set of armed robberies in Boston from 1966 to 1975. This particular data problem may not be your thing, but by going through the process you can learn a lot that can be applied to any data set.
A Gentle Visual Intro to Data Analysis in Python Using Pandas presents spreadsheet-like pictures to show conceptually what pandas is doing with your data as you apply various functions like groupby and loc.
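For readers who prefer code to pictures, this short sketch shows `groupby` and `loc` on a tiny table. The `city` and `sales` columns and the row labels are invented for illustration.

```python
import pandas as pd

# Hypothetical table with explicit row labels.
df = pd.DataFrame(
    {"city": ["Austin", "Boston", "Austin", "Boston"],
     "sales": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# groupby splits the rows by city, then sums sales within each group.
totals = df.groupby("city")["sales"].sum()

# loc selects by label: a single cell, or rows matching a boolean mask.
cell = df.loc["c", "sales"]
big = df.loc[df["sales"] > 15, ["city", "sales"]]

print(totals)
print(cell)
print(big)
```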
Data Manipulation with Pandas: A Brief Tutorial uses some example data sets to show how the most commonly used functions in pandas work.
Analyzing Pronto CycleShare Data with Python and Pandas uses Seattle bikeshare data as a source for wrangling, analysis and visualization.
Stylin' with pandas shows how to add colors and sparklines to your output when using pandas for data visualization.
Python and JSON: Working with large datasets using Pandas is a well-done, detailed tutorial that shows how to munge and analyze JSON data.
Fun with NFL Stats, Bokeh, and Pandas uses National (American) Football League data as a source for wrangling and visualization.
Analyzing my Spotify Music Library With Jupyter And a Bit of Pandas shows how to grab all of your user data from the Spotify API then analyze it using pandas in Jupyter Notebook.
Scalable Python Code with Pandas UDFs explains that pandas operations can often be parallelized for better performance using the Pandas UDFs feature in PySpark version 2.3 or greater.
How to use Pandas read_html to Scrape Data from HTML Tables has a bunch of great code examples that show how to load data from HTML directly into your DataFrames.
How to download fundamentals data with Python shows how to obtain financial data, such as balance sheets, stock prices and various ratios, and use it to perform your own analysis.
How to convert JSON to Excel with Python and pandas provides instructions for creating a spreadsheet out of a JSON file.
Loading large datasets in Pandas explains how to get around the MemoryError issue that occurs when using read_csv because the data set is larger than the available memory on a machine. You can use chunking with the read_csv function to divide the data set into smaller parts that each can be loaded into memory. Alternatively, you can use a SQLite database to create a relational database with the data then use SQL queries or an object-relational mapper (ORM) to load the data and perform analysis in pandas.
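The chunking approach can be sketched as follows. To keep the example self-contained it reads from an in-memory string rather than a large file on disk, and the column names and the chunk size of 4 are arbitrary assumptions. With a real file you would pass its path and a much larger `chunksize`.

```python
import io

import pandas as pd

# Stand-in for a large CSV file: ten rows built in memory.
csv_text = "user_id,amount\n" + "\n".join(
    f"{i % 3},{i}" for i in range(10)
)

total = 0
rows = 0
n_chunks = 0

# With chunksize set, read_csv yields DataFrames of at most that many
# rows, so only one chunk needs to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["amount"].sum()
    rows += len(chunk)
    n_chunks += 1

print(n_chunks, rows, total)
```

This pattern works well for aggregates that can be accumulated chunk by chunk, such as sums and counts. Operations that need the whole data set at once are where the SQLite route becomes more attractive.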
Real-world Excel spreadsheets are often a mess of unstructured data, so this tutorial on Reading Poorly Structured Excel Files with Pandas gives example code for extracting only part of a file as well as reading ranges and tables.