Notes on Python Data Science Handbook
This notebook is on my local filesystem under ~/github/learning-python/notes-PDSH.ipynb and contains notes I made while studying topics from the Python Data Science Handbook by Jake VanderPlas. His GitHub repo of the book is here. I forked the repo, and my clone of the fork is located on my local filesystem at
~/github/jakevdp_PythonDataScienceHandbook
Chapter 3: Data Manipulation with Pandas (p. 97)
Introducing Pandas objects (p. 98)
The Pandas Index object (p. 105)
Index objects are immutable. They can also be treated as ordered sets because
(p. 106) The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.
import pandas as pd
indA = pd.Index([1,3,5,7,9])
indB = pd.Index([2,3,5,7,11])
indA.intersection(indB) # intersection (the book writes indA & indB, which pandas 2.x treats as an element-wise AND)
Int64Index([3, 5, 7], dtype='int64')
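The set-like behaviour and the immutability mentioned above can be sketched together. This is a minimal example, assuming a reasonably recent pandas; it uses the explicit methods rather than operators, and the variable names other than indA and indB are made up for illustration.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Set-style operations through explicit Index methods
both = indA.intersection(indB)              # values present in both
either = indA.union(indB)                   # values present in either
only_a = indA.difference(indB)              # values in indA but not in indB
one_side = indA.symmetric_difference(indB)  # values in exactly one of the two

print(both.tolist())      # [3, 5, 7]
print(either.tolist())    # [1, 2, 3, 5, 7, 9, 11]
print(only_a.tolist())    # [1, 9]
print(one_side.tolist())  # [1, 2, 9, 11]

# Immutability: item assignment on an Index raises TypeError
try:
    indA[0] = 100
    modified = True
except TypeError:
    modified = False
print(modified)  # False
```

Each operation returns a new Index and leaves indA and indB untouched.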
Chapter 5: Machine Learning (p. 331)
In Depth: Linear Regression (p. 390)
These are described as standard imports for linear regression:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
Let’s create some data and plot it:
rng = np.random.RandomState(1) # GJL: setting a seed
x = 10 * rng.rand(50) # creates 50 uniform random numbers in [0, 10)
y = 2 * x - 5 + rng.randn(50) # linear relationship + Gaussian noise
plt.scatter(x,y);

Let’s fit a regression:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
What is model? Remember that everything in Python is an object: model is an instance of the class LinearRegression. Try help(model) to see its methods and attributes.
The following model.fit() call modifies the model instance in place, storing the fitted parameters as attributes such as model.coef_ and model.intercept_. That way, model can be used in later lines. fit() also returns the model itself, which is why the notebook echoes the estimator after the call.
model.fit(x[:, np.newaxis],y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
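To see what fit() left behind on the instance, here is a short sketch that rebuilds the same data and inspects the fitted model. It assumes scikit-learn and NumPy are installed; the names fitted, xfit and yfit are just for illustration, and plotting is left out so the block runs anywhere.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as above: y = 2x - 5 plus Gaussian noise
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

model = LinearRegression(fit_intercept=True)
fitted = model.fit(x[:, np.newaxis], y)

# fit() returns the estimator itself, so both names refer to one object
print(fitted is model)

# Learned parameters live on the instance (trailing underscore = set by fit)
print(model.coef_.shape)                  # one coefficient per feature
print(model.coef_[0], model.intercept_)   # should land near 2 and -5

# The fitted model can now predict for new inputs (again a 2D array)
xfit = np.linspace(0, 10, 5)
yfit = model.predict(xfit[:, np.newaxis])
print(yfit.shape)
```

The slope and intercept should come out close to the values used to generate the data, though not exactly equal because of the added noise.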
What is the purpose of np.newaxis? See here. Basically,
np.newaxis might come in handy when you want to explicitly convert an 1D array to either a row vector or a column vector
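In my case above, x[:, np.newaxis] makes x into a column vector, i.e. a 2D array of shape (50, 1). The same treatment isn’t required for y because scikit-learn expects the feature matrix X to be 2D, with shape (n_samples, n_features), while the target y can be a 1D array of shape (n_samples,).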
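A quick way to see what np.newaxis does is to compare shapes. This is a small sketch using a made-up five-element array rather than the x from the regression.

```python
import numpy as np

x = np.arange(5)            # 1D array, shape (5,)

col = x[:, np.newaxis]      # column vector: new axis added at the end
row = x[np.newaxis, :]      # row vector: new axis added at the front

print(x.shape, col.shape, row.shape)  # (5,) (5, 1) (1, 5)

# x[:, np.newaxis] is equivalent to reshaping to (n, 1)
same = np.array_equal(col, x.reshape(-1, 1))
print(same)  # True
```

The (5, 1) shape is the "n_samples by n_features" layout with a single feature, which is the form scikit-learn estimators expect for X.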
Misc notes
Universal functions and array broadcasting
This info is also covered in PDSH. Broadcasting starts on p. 63, where it’s defined as
Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of different sizes.
This page of the SciPy documentation describes universal functions and array broadcasting. Basically, array broadcasting is the automatic adjustment of array shapes that makes a set of arrays compatible for element-wise operations, as if each array had been expanded to a common shape. Here’s a really good example from that page.
Example:
If a.shape is (5,1), b.shape is (1,6), c.shape is (6,) and d.shape is () so that d is a scalar, then a, b, c, and d are all broadcastable to dimension (5,6); and
- a acts like a (5,6) array where a[:,0] is broadcast to the other columns,
- b acts like a (5,6) array where b[0,:] is broadcast to the other rows,
- c acts like a (1,6) array and therefore like a (5,6) array where c[:] is broadcast to every row, and finally,
- d acts like a (5,6) array where the single value is repeated.
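The shapes in that example can be checked directly. This sketch fills the four arrays with arbitrary values (my choice, not from the documentation) and lets NumPy do the broadcasting.

```python
import numpy as np

a = np.arange(5).reshape(5, 1)   # shape (5, 1)
b = np.arange(6).reshape(1, 6)   # shape (1, 6)
c = np.arange(6)                 # shape (6,)
d = np.array(2.0)                # shape (), a scalar

# A binary ufunc (here +) broadcasts all operands to a common shape
result = a + b + c + d
print(result.shape)  # (5, 6)

# broadcast_arrays shows each input expanded to the common shape
shapes = [arr.shape for arr in np.broadcast_arrays(a, b, c, d)]
print(shapes)  # [(5, 6), (5, 6), (5, 6), (5, 6)]

# Element [i, j] is a[i, 0] + b[0, j] + c[j] + d
print(result[4, 5])  # 16.0
```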
Orphaned notes (forgot what part of PDSH I was looking at)
- In Python, everything’s an object, which means many “things” can be used in two ways: the object-oriented way, as a method called on the object, and the non-object-oriented way, as a standalone function that takes the object as an argument.
- I thought the good code for this idea was around the Pandas Index object (p. 105 or so), but on second look, I can’t find it there.
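The original example is lost, so the following is only a guess at the kind of code the note had in mind: a NumPy array, where the same operation is available both as a method and as a module-level function. The array and variable names are invented for illustration.

```python
import numpy as np

x = np.array([3, 1, 2])

# Same computation, two spellings
total_method = x.sum()        # object-oriented: method on the array
total_function = np.sum(x)    # function taking the array as an argument
print(total_method, total_function)  # 6 6

# The two spellings are not always interchangeable
sorted_copy = np.sort(x)      # returns a new sorted array, x is unchanged
unchanged = x.tolist()
print(unchanged)              # [3, 1, 2]

x.sort()                      # sorts x in place and returns None
print(x.tolist())             # [1, 2, 3]
```

The sort case is worth remembering: the function form returns a copy, while the method form mutates the object.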