Data analysis involves a broad set of activities to clean, process and transform a data collection to learn from it. Python is commonly used as a programming language to perform data analysis because many tools, such as Jupyter Notebook, pandas and Bokeh, are written in Python and can be quickly applied rather than coding your own data analysis libraries from scratch.
The following series on data exploration uses Python as the implementation language while walking through various stages of how to analyze a data set.
PyData 101 presents slides from one of the leading developers in the Python ecosystem on how to orient yourself if you are new to data science.
The Python Data Science Handbook is available to read for free online, although I also recommend buying the book as it is a great resource for learning the topic.
PyData TV contains all the videos from the PyData conference series. The conference talks are often given by professional data scientists and the developers who write these analysis libraries, so there is a wealth of information not necessarily captured anywhere else.
Python Plotting for Exploratory Data Analysis is a great tutorial on how to use simple data visualizations to bootstrap your understanding of a data set. The walkthrough covers histograms, time series analysis, scatter plots and various forms of bar charts.
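The tutorial above relies on plotting libraries such as matplotlib, but the core histogram idea can be sketched with just the standard library. The following is a minimal, dependency-free illustration using made-up sample data (the values and bin width are hypothetical, not from the tutorial):

```python
from collections import Counter

# Hypothetical sample: response times (ms) pulled from a web service log.
samples = [102, 98, 110, 250, 97, 105, 240, 101, 99, 260, 103, 100]

# Bucket each value into a 50 ms bin, then count the values per bin.
bins = Counter((value // 50) * 50 for value in samples)

# Print a crude text histogram, one row per bin.
for bin_start in sorted(bins):
    print(f"{bin_start:4d}-{bin_start + 49:<4d} {'#' * bins[bin_start]}")
```

Even this rough view immediately shows where values cluster and which ones are outliers, which is exactly the "bootstrap your understanding" role histograms play in exploratory analysis.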
This series entitled "Agile Analytics" has three parts that cover how to work in a data science team and how to operate one if you are a manager:
Learning Seattle's Work Habits from Bicycle Counts provides a great example of using open data, in this case from the city of Seattle, wrangling it using Python and pandas, then charting it using scikit-learn. You can do this type of analysis on almost any data set to find out its patterns.
Exploring the shapes of stories using Python and sentiment APIs is a wonderful read with context for the problem being solved, plenty of insight into how to reproduce the results with your own code and a good number of charts that show how sentiment analysis can extract information from blocks of text.
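The article uses hosted sentiment APIs; the simplest form of the underlying technique is lexicon-based scoring, where you count positive and negative words. Here is a minimal sketch with a tiny, hypothetical word list standing in for a real sentiment lexicon:

```python
# Hypothetical mini-lexicon; real lexicons contain thousands of scored words.
POSITIVE = {"good", "great", "happy", "love", "wonderful"}
NEGATIVE = {"bad", "sad", "terrible", "hate", "awful"}

def sentiment_score(text):
    """Return (positive words - negative words) / total words, in [-1, 1]."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / len(words)

print(sentiment_score("what a wonderful happy story"))  # positive score
print(sentiment_score("terrible awful ending"))         # negative score
```

Scoring each sentence or chapter this way and plotting the scores over the length of a book is what produces the "shape of the story" charts the article describes.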
How to automate creating high end virtual machines on AWS for data science projects walks through setting up a development environment on Amazon Web Services so that you can perform data analysis without owning a high-end computer. Also check out the Introduction to AWS for Data Scientists for another tutorial that shows you how to set up additional commonly-used data science tools on AWS.
Analyzing bugs.python.org uses extracted data from CPython development to show the most-commented issues and issues by version number throughout the project's history.
Divergent and Convergent Phases of Data Analysis examines the flow most people doing data science and analysis projects go through during the exploration, synthesis, modeling and narration phases.
Forget privacy: you're terrible at targeting anyway is a different type of article. It is a strong piece of commentary rather than a tutorial on a specific data analysis topic. The author argues that collecting data is typically easy but doing the dirty analysis work often yields little in the way of definitive, actionable insight. Overall it's a well-written thought piece that will make you at least stop and ask yourself, "do we really need to collect this user data?"
Gender Distribution in North Korean Posters with Convolutional Neural Networks is a fascinating post that uses convolutional neural networks as a mechanism to identify gender by faces in North Korean posters. The article's analysis on this messy data set and the results it produces using some Python glue code with various open source libraries is a great example of how data analysis can answer questions that would be very time consuming for a person to figure out without a computer.
Time Series Analysis in Python: An Introduction shows how to use the open source Prophet library to perform time series analysis on a data set.
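Prophet handles trend and seasonality modeling for you; the most basic building block of time series smoothing, a trailing moving average, can be sketched in a few lines. The daily counts below are invented for illustration:

```python
def moving_average(series, window):
    """Average each trailing slice of length `window` across the series."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

# Hypothetical daily counts: noisy values around an upward trend.
daily = [10, 12, 9, 14, 13, 15, 18, 16, 19, 21]
smoothed = moving_average(daily, window=3)
print(smoothed)
```

The smoothed output makes the underlying trend easier to see, which is the same goal Prophet pursues with a far richer decomposition into trend, seasonality and holiday effects.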
Python Data Wrangling Tutorial: Cryptocurrency Edition uses the pandas library to clean up a messy cryptocurrency data set and shift the data into a structure that is useful for the analysis the author wants to perform.
Handy Python Libraries for Formatting and Cleaning Data provides a short overview of the libraries such as Arrow and Dora that make it easier to wrangle your data before doing analysis.
Analyzing one million robots.txt files explains what a robots.txt file is, why it matters, how to download a large batch of them and then perform some analysis with NumPy.
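The aggregate analysis in that post needs NumPy, but parsing an individual robots.txt file requires no third-party libraries at all: the standard library ships `urllib.robotparser`. A minimal sketch, using an invented file body rather than a downloaded one:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the article downloads real ones in bulk.
robots_txt = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Rules are checked in order, so /admin/ paths are blocked for all agents.
print(parser.can_fetch("my-crawler", "https://example.com/admin/secret"))  # False
print(parser.can_fetch("my-crawler", "https://example.com/index.html"))    # True
```

Running a parser like this over many downloaded files, then tabulating which paths sites block, is the kind of aggregate question the post answers with NumPy.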
Safely Analyzing Popular Licenses on GitHub Projects uses a Google BigQuery Python helper library to work with a massive 3 terabyte data set provided by GitHub.
Cleaning and Preparing Data in Python shows how to use pandas to do the "boring" part of a data analysis job and convert dirty data into a more consistent, structured format.
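The article works in pandas; the same cleanup steps, trimming whitespace, normalizing case, coercing a numeric column and dropping incomplete rows, can be sketched with just the standard library's `csv` module. The messy rows below are invented sample data:

```python
import csv
import io

# Hypothetical messy input: inconsistent case, stray whitespace, a blank price.
raw = """name, price
 Widget ,19.99
GADGET, 5
doohickey,
"""

cleaned = []
reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
for row in reader:
    name = row["name"].strip().lower()
    price_text = (row["price"] or "").strip()
    if not price_text:  # drop rows that are missing a price entirely
        continue
    cleaned.append({"name": name, "price": float(price_text)})

print(cleaned)
```

pandas collapses each of these steps into vectorized one-liners such as `str.strip` and `astype`, which is why it is the standard tool once data sets grow beyond toy size.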
9 obscure Python libraries for data science presents several lesser-known but still very useful libraries for performing data analysis such as fuzzywuzzy and gym.
Nvidia's series on defining data analysis, machine learning and deep learning is worth reading for the background and how it breaks down the problem domains: