Let's start with something basic - with data. Since in machine learning we solve problems by learning from data, we need to prepare and understand our data well. This time we explore the classic Boston house pricing dataset using Python and a few great libraries. We'll learn the big picture of the process and a lot of small everyday tips.
I'll follow a piece of great advice from the Machine Learning Mastery course which is probably applicable to any domain: in order to master a subject it is good to make a lot of small projects, each with a clear set of goals and producing a set of artifacts - a GitHub repo, a blog article, etc. This way we can focus on one thing at a time, finish each project quickly and happily, and build momentum which can lead us towards bigger projects. Otherwise, if we immediately set a big goal, the path can be very long and complex, the goal blurs and we might lose our motivation.
Goal
When I get a dataset and a ML problem associated with it, what should I do?
For this evening we'll explore one of the classic machine learning datasets - Boston house pricing.
The goals are to:
- understand the dataset and problem associated with it
- examine the tools which help us describe and visualize the data
- make some artifacts - report of our findings, plots and code
- be prepared for the next dataset with the right questions and code
To ask the right questions in our analysis we can start with a path outlined in the Applied Machine Learning Process book. More details and motivations are described in that book or can follow from our curiosity.
- Problem definition
  - What is the problem?
    - Informal description
    - Formal description
    - Assumptions
    - Similar problems
    - Description of provided data
  - Why does the problem need to be solved?
    - Motivation
    - Benefits
    - Use
  - How would I solve the problem (manually)?
- Data analysis
  - Summarize data
    - Data structure
    - Data distributions
    - Correlations with target variable
    - Correlations among attributes (redundancy)
  - Visualize data
    - Attribute histograms
    - Pairwise scatter-plots
Solution
In this article I'd like to describe my solution, which is available in the ml-playground GitHub repo dedicated to learning machine learning. The results of both parts (problem definition and data analysis) are presented in two Markdown files along with a lot of images. For the data analysis part I wrote a rather simple Python script, data_analysis.py. All the plots in data_analysis.md can be generated automatically by running this script. The dataset is obtained via one of the Python libraries.
$ git clone https://github.com/bzamecnik/ml-playground.git
$ cd ml-playground/boston_dataset_exploration
$ python data_analysis.py
Results: the problem definition and data analysis reports (problem-definition.md, data_analysis.md).
Details of the Python implementation
The dataset is available either for download from the UCI ML repository or via the Python library scikit-learn. The Python language and its ecosystem of libraries make it an excellent tool for data analysis and machine learning, so we'll use it in this mini-project.
I assume you have at least basic knowledge of Python and installing packages, and that you have a running Python 3 distribution available. We'll utilize some useful packages:
- numpy - efficient numerical computations
- pandas - data structures for data analysis
- scikit-learn - machine learning algorithms, dataset access
- matplotlib - plotting (both interactive and to files)
- seaborn - extra plot types, elegant and readable plot style
I encourage you to work interactively and proceed one step at a time. A great interactive console for Python is ipython. Also, if you don't use them already, google for pyvenv and pip.
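If you're starting from scratch, one possible setup might look like this (just a sketch using venv and pip; adjust to your own environment):
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install numpy pandas scikit-learn matplotlib seaborn ipython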
Loading dataset into a pandas DataFrame
The dataset consists of a table - columns are attributes, rows are instances (individual observations). In order to do computations easily and efficiently, and not reinvent the wheel, we can use a suitable tool - pandas. So the first step is to obtain the dataset and load it into a DataFrame.
Fortunately the dataset is already available via the scikit-learn package, which gives it to us as a Python data structure:
import pandas as pd # conventional alias
from sklearn.datasets import load_boston
dataset = load_boston()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
The result of load_boston() is a map-like object with four components: ['target', 'data', 'DESCR', 'feature_names']:
- dataset['target'] - 1D numpy array of target attribute values
- dataset['data'] - 2D numpy array of attribute values
- dataset['feature_names'] - 1D numpy array of names of the attributes
- dataset['DESCR'] - text description of the dataset
So it is easy to convert it to a pandas DataFrame.
In case we got just a plain CSV file, we can read it too, like this:
df = pd.read_csv('dataset.csv')
The process of cleaning and preparing data always depends on data at hand.
Exercise
As an exercise you can try to clean and load the raw Boston dataset from the UCI archive.
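A rough sketch of how that could look (assuming you have downloaded the housing.data file from the UCI archive; the column names come from the accompanying housing.names description):
import pandas as pd

# column names taken from the UCI housing.names description
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
           'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# the raw file is whitespace-separated and has no header row
raw_df = pd.read_csv('housing.data', delim_whitespace=True, names=columns)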
Data summary as a text report
Now we have the data available in Python waiting for us to explore them!
We'd like to get a simple report like this: data_analysis_report.txt.
Data size
The most basic thing is to get the data size: number of attributes (features) and instances (data points).
instance_count, attr_count = df.shape
In fact the attributes can be divided into inputs and targets. In this case there's just one target and several inputs.
The concrete meaning of each attribute is described in the report.
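For later steps it can be handy to separate the input attributes from the target; a minimal sketch (using the 'target' column name from the loading code above):
# input attributes (everything except the target) and the target attribute
X = df.drop('target', axis=1)
y = df['target']
print(X.shape, y.shape)  # e.g. (506, 13) and (506,)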
Distributions of each attribute
Next we'd like to know how the values of each attribute are distributed. We can readily get the basic statistics (count, mean, min, max, quartiles) via the pandas df.describe() function. These can also be computed separately: df.count(), df.min(), df.max(), df.median(), df.quantile(q). Other statistics such as the mode are available as well, e.g. df.mode():
>>> df.describe()
CRIM ZN
count 506.000000 506.000000
mean 3.593761 11.363636
std 8.596783 23.322453
min 0.006320 0.000000 ...
25% 0.082045 0.000000
50% 0.256510 0.000000
75% 3.647423 12.500000
max 88.976200 100.000000
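If the default quartiles are not enough, describe() also accepts a percentiles parameter (a small sketch; the chosen percentiles are arbitrary):
# include the 10th and 90th percentiles in the summary
df.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])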
Missing values
It is good to know if some values are missing. In pandas missing values are represented by np.nan. The expression pd.isnull(df).any() tells us whether each column contains any missing values, and pd.isnull(df).sum() counts the missing values per column.
There are (un)fortunately no missing values in this dataset.
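If there were missing values, pandas would also give us simple tools to handle them; a minimal sketch (the fill strategy is just an illustration):
# count missing values per column
missing_counts = pd.isnull(df).sum()
# either drop rows that contain any missing value...
df_complete = df.dropna()
# ...or fill the gaps, e.g. with the per-column median
df_filled = df.fillna(df.median())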
Correlations between attributes
It is useful to know whether some pairs of attributes are correlated and how much. For many ML algorithms correlated features can cause trouble; ideally we should try to get a set of independent features. However, there exist methods for deriving features that are as uncorrelated as possible (google for PCA, ICA, autoencoders, dimensionality reduction, manifold learning, etc.).
Pandas offers us three correlation coefficients out of the box via the DataFrame.corr() function: the standard Pearson correlation coefficient, the Spearman rank correlation and the Kendall Tau correlation coefficient. See Wikipedia for more information.
>>> df.corr(method='pearson')
CRIM ZN INDUS CHAS
CRIM 1.000000 -0.199458 0.404471 -0.055295
ZN -0.199458 1.000000 -0.533828 -0.042697 ...
INDUS 0.404471 -0.533828 1.000000 0.062938
CHAS -0.055295 -0.042697 0.062938 1.000000
...
In practice it returns another DataFrame with all pairs of attributes and values in the range [-1; 1]: 1 means total positive correlation, -1 total negative correlation and 0 no correlation. Note that the three methods differ in computational cost. A quick test shows the difference:
>>> %timeit df.corr(method='pearson')
1000 loops, best of 3: 490 µs per loop
>>> %timeit df.corr(method='spearman')
100 loops, best of 3: 8.2 ms per loop
>>> %timeit df.corr(method='kendall')
1 loops, best of 3: 1.64 s per loop
%timeit is a command in ipython useful for micro-benchmarking.
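Outside of ipython a similar micro-benchmark can be done with the standard timeit module; a rough sketch:
import timeit

# time 100 runs of the Pearson correlation (df must already be loaded)
elapsed = timeit.timeit(lambda: df.corr(method='pearson'), number=100)
print('{:.3f} ms per loop'.format(1000 * elapsed / 100))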
Predictivity of attributes
Besides the correlations between attributes, we'd like to know the correlation between the input attributes and the target one, i.e. how well each input attribute is able to predict the target. This is called predictivity.
pearson = df.corr(method='pearson')
# assume the target attr is the last column, then remove its correlation with itself
corr_with_target = pearson.iloc[-1][:-1]
# attributes sorted from the most predictive
predictivity = corr_with_target.sort_values(ascending=False)
And the result:
RM 0.695360
ZN 0.360445
B 0.333461
DIS 0.249929
CHAS 0.175260
AGE -0.376955
RAD -0.381626
CRIM -0.385832
NOX -0.427321
TAX -0.468536
INDUS -0.483725
PTRATIO -0.507787
LSTAT -0.737663
Name: MEDV, dtype: float64
Since we might also be interested in strong negative correlations, it would be better to sort the correlations by absolute value:
>>> corr_with_target[abs(corr_with_target).argsort()[::-1]]
LSTAT -0.737663
RM 0.695360
PTRATIO -0.507787
INDUS -0.483725
TAX -0.468536
NOX -0.427321
CRIM -0.385832
RAD -0.381626
AGE -0.376955
ZN 0.360445
B 0.333461
DIS 0.249929
CHAS 0.175260
Name: MEDV, dtype: float64
The np.argsort() function returns the array indices that, when applied to the array, return it in sorted order.
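A tiny illustration of argsort:
import numpy as np

a = np.array([0.3, -0.7, 0.5])
order = np.argsort(a)   # indices that sort the array ascending: [1, 0, 2]
print(a[order])         # [-0.7  0.3  0.5]
print(a[order[::-1]])   # reversed for descending order: [ 0.5  0.3 -0.7]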
Important correlations between input attributes
It might be interesting to select some strong correlations between attribute pairs. With a bit of Python magic it is possible:
attrs = pearson.iloc[:-1,:-1] # all except target
# only important correlations and not auto-correlations
threshold = 0.5
# {('LSTAT', 'TAX'): 0.543993, ('INDUS', 'RAD'): 0.595129, ...
important_corrs = (attrs[abs(attrs) > threshold][attrs != 1.0]) \
.unstack().dropna().to_dict()
# attribute pair correlation
# 0 (AGE, INDUS) 0.644779
# 1 (INDUS, RAD) 0.595129
# ...
unique_important_corrs = pd.DataFrame(
list(set([(tuple(sorted(key)), important_corrs[key]) \
for key in important_corrs])), columns=['attribute pair', 'correlation'])
# sorted by absolute value
unique_important_corrs = unique_important_corrs.iloc[
    abs(unique_important_corrs['correlation']).argsort()[::-1]]
unique_important_corrs:
attribute pair correlation
(RAD, TAX) 0.910228
(DIS, NOX) -0.769230
(INDUS, NOX) 0.763651
(AGE, DIS) -0.747881
(AGE, NOX) 0.731470
(INDUS, TAX) 0.720760
(DIS, INDUS) -0.708027
(NOX, TAX) 0.668023
(DIS, ZN) 0.664408
(AGE, INDUS) 0.644779
(CRIM, RAD) 0.622029
(LSTAT, RM) -0.613808
(NOX, RAD) 0.611441
(INDUS, LSTAT) 0.603800
(AGE, LSTAT) 0.602339
(INDUS, RAD) 0.595129
(LSTAT, NOX) 0.590879
(CRIM, TAX) 0.579564
(AGE, ZN) -0.569537
(LSTAT, TAX) 0.543993
(DIS, TAX) -0.534432
(INDUS, ZN) -0.533828
(NOX, ZN) -0.516604
(AGE, TAX) 0.506456
Visualization
In order to better understand the dataset we can make things visual. Basically we generate a lot of various 1D or 2D plots and look at them, hoping to find something interesting. Indeed, we humans are highly trained to absorb information and find patterns visually.
There are Python packages that can help us a lot with this task. In particular, we use the standard matplotlib package plus seaborn for some extra statistical plots and for more elegant and comprehensible plot styles.
While examining the dataset interactively we can utilize the default interactive plotting features. However, for reproducibility and for publishing our results, e.g. on a blog, we'd like to export the plots to images, e.g. PNG. Python can help us automate this.
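In the data_analysis.py script such exports can be wrapped in a small helper; one possible sketch (the function name and the images directory are my own choices, not a library API):
import os
import matplotlib.pyplot as plt

def save_plot(file_name, image_dir='images'):
    """Save the current matplotlib figure to a file and close it."""
    os.makedirs(image_dir, exist_ok=True)
    plt.savefig(os.path.join(image_dir, file_name), bbox_inches='tight')
    plt.close()  # free the figure so that plots don't accumulate in memory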
Attribute correlations
We'd like to plot the value of the correlation for each pair of attributes, i.e. a 2D matrix. To avoid duplicates (due to pair symmetry) just one triangle is enough (e.g. the lower triangular matrix). The range of the correlation value is [-1; 1], so the value can be color-coded using a diverging color map - one hue for each sign (e.g. red/blue) and saturation for the absolute value.
The seaborn package offers corrplot(), which is exactly what we need (example).
import matplotlib.pyplot as plt
import seaborn as sns # just a conventional alias, don't know why
sns.corrplot(df) # compute and plot the pair-wise correlations
# save to file, remove the big white borders
plt.savefig('attribute_correlations.png', bbox_inches='tight')
The lower triangle shows the correlation values as colored squares, the diagonal holds the attribute names, and the upper triangle holds the actual correlation values with significance represented by stars. There is also a colorbar.
We can change the correlation method with the method parameter (the same variants are available as in pandas).
We might want to adjust the plot to look better. The default font size is quite big and looks ugly. I'm not aware of a clean way to decrease it, so as a work-around we can enlarge the plot. If you know one, please comment.
# make a plot of specified dimension (in inches), just a 1x1 subplot
fig, ax = plt.subplots(figsize=(10, 10))
# pass the axis to draw on
sns.corrplot(df, ax=ax)
Also we might want to customize the colormap to some other diverging one:
# red-blue (hues 250 and 10, just 3 colors) diverging palette with white center
sns.corrplot(df, cmap=sns.diverging_palette(250, 10, n=3, as_cmap=True))
In case there are just too many attributes, we might hide the actual values and put the attribute names aside:
sns.corrplot(df, annot=False, sig_stars=False, diag_names=False)
Attribute histograms
Next we'd like to plot a histogram of values for each attribute. It acts as an estimate of the probability density function and gives us a better understanding of what the values of each attribute look like.
A simple option is the standard hist() function from matplotlib:
import matplotlib.pyplot as plt
attr = df['MEDV']
plt.hist(attr)
It automatically quantizes the value range into several bins. The number of bins can be specified via a parameter:
plt.hist(attr, bins=50)
Seaborn has a plot dedicated to distributions - distplot(). Moreover, the various possibilities for plotting distributions are beautifully covered in its documentation: Visualizing distributions of data.
distplot() is able to combine a histogram with a kernel density estimation (KDE) plot (which looks like a smoothed histogram) into a single plot. For us this is useful for plotting real-valued data.
sns.distplot(attr)
However, for integer-valued data (like categories) automatic quantization into a pre-defined number of bins might not be the best option. We'd rather quantize according to the original distinct values. For that we can just compute this kind of histogram ourselves and use a bar plot.
An example for RAD: int (category) - index of accessibility to radial highways:
cat_attr = df['RAD']
h = cat_attr.value_counts()
values, counts = h.index, h
plt.bar(values, counts)
# or more compactly:
plt.bar(*list(zip(*cat_attr.value_counts().items())))
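Pandas can also produce the same kind of bar chart directly through its plotting wrapper (a sketch):
# counts per distinct value, ordered by the value itself
df['RAD'].value_counts().sort_index().plot(kind='bar')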
Pair-wise joint plots
So far we've visualized the distribution of values of individual attributes and the correlation of attribute pairs. We can dig deeper into the relation of attribute pairs by examining their joint distributions. We'd like a 2D plot with each axis representing a particular attribute's range and the points on the plot representing the probability that both attributes take those particular values at once. Plotting 2D distributions is also nicely described with many examples in the seaborn docs.
If we have a few data points we can plot a little dot at the 2D position of each observed point - this is called a scatter plot and is available via plt.scatter(X, Y).
plt.scatter(df['MEDV'], df['LSTAT'])
Seaborn enhances this plot by adding a small histogram for each attribute on the side and a label with the correlation value and significance:
sns.jointplot(df['MEDV'], df['LSTAT'], kind='scatter')
Rather than individual observations, we'd rather estimate the joint PDF and plot it color-coded. There are various approaches. Having a lot of data points, we can fake it with a scatter plot whose semi-transparent markers add up.
x, y = df['MEDV'], df['LSTAT']
plt.scatter(x, y, alpha=0.5)
# or via jointplot (with histograms aside):
sns.jointplot(x, y, kind='scatter', joint_kws={'alpha':0.5})
Another possibility is to aggregate data points over 2D areas and estimate the PDF this way. It's a 2D generalization of a histogram. We can use either a rectangular grid, or even a hexagonal one.
sns.jointplot(df['MEDV'], df['LSTAT'], kind='hex')
Otherwise we can estimate the PDF smoothly by convolving each data point with a kernel function. This is possible with sns.kdeplot().
sns.kdeplot(df['MEDV'], df['LSTAT'], shade=True)
# or
sns.jointplot(df['MEDV'], df['LSTAT'], kind='kde')
Plotting the exact observations (as with a scatter plot) may lead to "over-fitting", while plotting a smoothed kernel density estimate may lead to too much generalization. A medium trade-off between the two is the hexagon-aggregated 2D histogram. But as always, the human observer who assesses the data must use their own judgement.
Pair-wise scatter matrix
There are N * (N - 1) / 2 unique pairs of attributes, i.e. quite a lot. In order to have a big picture for a quick overview we might want to display a rough joint distribution plot for each pair in a single image. That's exactly what pairplot() from seaborn does - scatter plots, one for each pair, are aligned into a matrix and the diagonal is filled with attribute histograms.
sns.pairplot(df)
That's it! You can of course do it by hand using matplotlib, but why reinvent the wheel and not leverage the power of existing libraries? Note that the resulting image is quite big.
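Since pairplot() returns a seaborn grid object, the resulting image can be saved straight to a file (a sketch; the file name is arbitrary):
# the returned grid object can save the whole figure to a file
grid = sns.pairplot(df)
grid.savefig('pairwise_scatter_matrix.png')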
Tips
Matplotlib opens the GUI window even for batch plotting to files
When plotting non-interactively I found that the matplotlib window still opens. To prevent this there is a work-around: before importing the matplotlib.pyplot package, switch the backend to 'Agg'.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
Logging
While developing and running a script it is useful to know the intermediate status. The most basic tool in Python is the print() function. The problem is differentiating between the program results and helpful debugging messages.
The standard solution is to use logging. For each message we can specify a log level (e.g. DEBUG, INFO, WARNING, ERROR) and then filter only the relevant messages - e.g. DEBUG and above during development, and INFO or WARNING and above during normal usage.
At the simplest we can log directly to the standard output (console), or we can log to files, etc.
The basic usage in Python:
import logging
# set up logging to standard output with a simple format and INFO level
log_format = '%(asctime)s %(levelname)s %(message)s'
logging.basicConfig(format=log_format, level=logging.INFO)
logging.info('this gets printed')
logging.debug("this doesn't get printed unless we set the level to DEBUG or lower")
For more information look at the documentation of the Python logging module.
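To log into a file instead of the console we can simply pass a file name to basicConfig; a minimal sketch (the log file name is arbitrary):
import logging

# append log messages to a file instead of the standard output
logging.basicConfig(filename='data_analysis.log',
                    format='%(asctime)s %(levelname)s %(message)s',
                    level=logging.DEBUG)
logging.debug('this goes into data_analysis.log')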
Printing the report to file
Also, we'd like to make a text report that's easy to publish. One way would be to copy and paste the script output, but that's manual work and we like automation and replicability. Another almost-as-simple solution is to print to a file. The print() function supports this out of the box; the only thing needed is to specify an open file.
# a context manager makes sure the file gets closed
with open('report.txt', 'w') as report:
    print('some result', file=report)
Conclusion
In this article we described the basic process of examining a dataset for further usage, e.g. in machine learning or data visualization. We took the outline of basic questions from the Applied Machine Learning Process book and applied them to the classic Boston housing dataset. We described how powerful tools like Python and its libraries can help us get to results quickly while leaving us the freedom to go deeper if needed. The actual results, along with a lot of plots, are presented separately in two reports (problem-definition.md, data_analysis.md).
I was happy to try the process by hand without the usual time pressure of work projects. Now I have a template of which questions to ask and what the answers might look like, plus a sample script implementing the basic computations and plots which can be tailored to other datasets encountered in the future. It's released as open source, so you can base your further data explorations on it.
I hope you found something useful or inspiring here. Feel free to share your own data exploration reports or code.
Get in touch on Twitter: @bzamecnik.