Let's start with something basic - data. Since in machine learning we solve problems by learning from data, we need to prepare and understand our data well. This time we'll explore the classic Boston house pricing dataset using Python and a few great libraries. We'll learn the big picture of the process along with a lot of small everyday tips.

I'll be following great advice from the Machine Learning Mastery course, which is probably applicable to any domain: in order to master a subject it is good to make a lot of small projects, each with a clear set of goals and producing a set of artifacts - a GitHub repo, a blog article, etc. This way we can focus on one thing at a time, finish each project quickly and happily, and build momentum that can lead us towards bigger projects. Otherwise, if we immediately set a big goal, the path can be very long and complex, the goal blurs and we might lose our motivation.

Goal

When I get a dataset and a ML problem associated with it, what should I do?

For this evening we'll explore one of the classic machine learning datasets - Boston house pricing.

The goals are to:

  • understand the dataset and problem associated with it
  • examine the tools which help us describe and visualize the data
  • make some artifacts - a report of our findings, plots and code
  • be prepared for the next dataset with the right questions and code

To ask the right questions in our analysis we can start with the path outlined in the Applied Machine Learning Process book. More details and motivations are described in that book or can follow from our own curiosity.

  • Problem definition
    • What is the problem?
      • Informal description
      • Formal description
      • Assumptions
      • Similar problems
      • Description of provided data
    • Why does the problem need to be solved?
      • Motivation
      • Benefits
      • Use
    • How would I solve the problem (manually)?
  • Data analysis
    • Summarize data
      • Data structure
      • Data distributions
        • Correlations with target variable
        • Correlations among attributes (redundancy)
    • Visualize data
      • Attribute histograms
      • Pairwise scatter-plots

Solution

In this article I'd like to describe my solution, which is available in the ml-playground GitHub repo dedicated to learning machine learning. The results of both parts (problem definition and data analysis) are presented in two Markdown files along with a lot of images. For the data analysis part I wrote a rather simple Python script, data_analysis.py. All the plots in data_analysis.md can be automatically generated by running this script. The dataset is obtained via one of the Python libraries.

$ git clone https://github.com/bzamecnik/ml-playground.git
$ cd ml-playground/boston_dataset_exploration
$ python data_analysis.py

Results are summarized in two reports: problem-definition.md and data_analysis.md.

Data analysis

Details of the Python implementation

The dataset is available either for download from the UCI ML repository or via the Python library scikit-learn. The Python language and its ecosystem of libraries make it an excellent tool for data analysis and machine learning, so we'll use it in this mini-project.

I assume you have at least basic knowledge of Python and installing packages, and that you have a working Python 3 distribution available. We'll utilize some useful packages:

  • numpy - efficient numerical computations
  • pandas - data structures for data analysis
  • scikit-learn - machine learning algorithms, dataset access
  • matplotlib - plotting (both interactive and to files)
  • seaborn - extra plot types, elegant and readable plot style

I encourage you to work interactively and proceed one step at a time. A great interactive console for Python is ipython. Also, if you don't use them already, google for pyvenv and pip.

Loading dataset into a pandas DataFrame

The dataset consists of a table - columns are attributes, rows are instances (individual observations). In order to do computations easily and efficiently, and not to reinvent the wheel, we can use a suitable tool - pandas. So the first step is to obtain the dataset and load it into a DataFrame.

Fortunately the dataset is already available via the scikit-learn package, which gives it to us as a Python data structure:

import pandas as pd # conventional alias
from sklearn.datasets import load_boston

dataset = load_boston()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['MEDV'] = dataset.target  # add the target attribute (median house value) under its original name

The result of load_boston() is a map-like object with four components: ['target', 'data', 'DESCR', 'feature_names']:

  • dataset['target'] - 1D numpy array of target attribute values
  • dataset['data'] - 2D numpy array of attribute values
  • dataset['feature_names'] - 1D numpy array of names of the attributes
  • dataset['DESCR'] - text description of the dataset

So it is easy to convert it to a pandas DataFrame.

In case we get just a plain CSV file, we can read it too, like this:

df = pd.read_csv('dataset.csv')

The process of cleaning and preparing data always depends on the data at hand.

Exercise

As an exercise you can try to clean and load the raw Boston dataset from the UCI archive.
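
A possible starting point is pd.read_csv (a sketch; the URL and the whitespace-separated, headerless layout are assumptions to verify against the dataset description):

import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
           'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
raw_df = pd.read_csv(url, delim_whitespace=True, header=None, names=columns)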

Data summary as a text report

Now we have the data available in Python, waiting for us to explore it!

We'd like to get a simple report like this: data_analysis_report.txt.

Data size

The most basic thing is to get the data size: number of attributes (features) and instances (data points).

instance_count, attr_count = df.shape

In fact the attributes can be divided into input attributes and target attributes. In this case there's just one target and several inputs.
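
For illustration, a minimal sketch of keeping them apart (assuming the target was added as the last column, MEDV, as above):

X = df.drop('MEDV', axis=1)  # input attributes (features)
y = df['MEDV']               # the target attribute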

The concrete meaning of attributes is described in the report.

Distributions of each attribute

Next we'd like to know how the values of each attribute are distributed. We can readily get the basic statistics (count, mean, min, max, quartiles) via the pandas df.describe() function. These can also be computed separately: df.count(), df.min(), df.max(), df.median(), df.quantile(q). Other statistics such as the mode are also available, e.g. df.mode():

>>> df.describe()
             CRIM          ZN
count  506.000000  506.000000
mean     3.593761   11.363636
std      8.596783   23.322453
min      0.006320    0.000000 ...
25%      0.082045    0.000000
50%      0.256510    0.000000
75%      3.647423   12.500000
max     88.976200  100.000000

Missing values

It is good to know whether some values are missing. In pandas, missing values are represented by np.nan. The pd.isnull(df).any() command tells us for each column whether it contains any missing values; pd.isnull(df).sum() then counts the missing values per column.
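
For example:

# does any column contain missing values?
pd.isnull(df).any()
# how many missing values are there per column?
pd.isnull(df).sum()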

There are (un)fortunately no missing values in this dataset.

Correlations between attributes

It is useful to know whether some pairs of attributes are correlated and how much. For many ML algorithms correlated features can cause trouble, so ideally we should try to get a set of independent features. However, there exist methods for deriving features that are as uncorrelated as possible (google for PCA, ICA, autoencoder, dimensionality reduction, manifold learning, etc.).
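
As a taste of the latter, here is a minimal sketch of decorrelating the input attributes with PCA from scikit-learn (the number of components is an arbitrary choice for illustration):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to scale, so standardize the inputs first
inputs = StandardScaler().fit_transform(df.drop('MEDV', axis=1))
# project onto 5 mutually uncorrelated principal components
components = PCA(n_components=5).fit_transform(inputs)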

Pandas offers us three different correlation coefficients out of the box via the DataFrame.corr() function: the standard Pearson correlation coefficient, the Spearman rank correlation and the Kendall Tau correlation coefficient. See Wikipedia for more information.

>>> df.corr(method='pearson')
             CRIM        ZN     INDUS      CHAS
CRIM     1.000000 -0.199458  0.404471 -0.055295
ZN      -0.199458  1.000000 -0.533828 -0.042697 ...
INDUS    0.404471 -0.533828  1.000000  0.062938
CHAS    -0.055295 -0.042697  0.062938  1.000000
                        ...

In practice it returns another DataFrame with all pairs of attributes and values in the range [-1; 1]: 1 means perfect positive correlation, -1 perfect negative correlation and 0 no correlation. Note that the three methods differ in computational complexity. A quick test shows the difference:

>>> %timeit df.corr(method='pearson')
1000 loops, best of 3: 490 µs per loop
>>> %timeit df.corr(method='spearman')
100 loops, best of 3: 8.2 ms per loop
>>> %timeit df.corr(method='kendall')
1 loops, best of 3: 1.64 s per loop

%timeit is an ipython magic command useful for micro-benchmarking.

Predictivity of attributes

Besides the correlations among input attributes, we'd like to know the correlation between each input attribute and the target one, i.e. how well each input attribute is able to predict the target. This is called predictivity.

pearson = df.corr(method='pearson')
# assume the target attribute is the last column, then drop its correlation with itself
corr_with_target = pearson.iloc[-1][:-1]
# attributes sorted from the most predictive
predictivity = corr_with_target.sort_values(ascending=False)

And the result:

RM         0.695360
ZN         0.360445
B          0.333461
DIS        0.249929
CHAS       0.175260
AGE       -0.376955
RAD       -0.381626
CRIM      -0.385832
NOX       -0.427321
TAX       -0.468536
INDUS     -0.483725
PTRATIO   -0.507787
LSTAT     -0.737663
Name: MEDV, dtype: float64

Since we might also be interested in strong negative correlations, it would be better to sort the correlations by absolute value:

>>> corr_with_target.iloc[abs(corr_with_target).argsort()[::-1]]
LSTAT     -0.737663
RM         0.695360
PTRATIO   -0.507787
INDUS     -0.483725
TAX       -0.468536
NOX       -0.427321
CRIM      -0.385832
RAD       -0.381626
AGE       -0.376955
ZN         0.360445
B          0.333461
DIS        0.249929
CHAS       0.175260
Name: MEDV, dtype: float64

The np.argsort() function returns the array indices that, when applied to the array, return its elements in sorted order.
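
A tiny illustration:

import numpy as np

a = np.array([3.0, -1.0, 2.0])
order = np.argsort(a)  # array([1, 2, 0]) - positions that would sort a
a[order]               # array([-1., 2., 3.])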

Important correlations between input attributes

It might be interesting to select the strong correlations between pairs of input attributes. With a bit of Python magic it is possible:

attrs = pearson.iloc[:-1,:-1] # all except target
# only important correlations and not auto-correlations
threshold = 0.5
# {('LSTAT', 'TAX'): 0.543993, ('INDUS', 'RAD'): 0.595129, ...
important_corrs = (attrs[abs(attrs) > threshold][attrs != 1.0]) \
    .unstack().dropna().to_dict()
#     attribute pair  correlation
# 0     (AGE, INDUS)     0.644779
# 1     (INDUS, RAD)     0.595129
# ...
unique_important_corrs = pd.DataFrame(
    list(set([(tuple(sorted(key)), important_corrs[key]) \
    for key in important_corrs])), columns=['attribute pair', 'correlation'])
# sorted by absolute value
unique_important_corrs = unique_important_corrs.iloc[
    abs(unique_important_corrs['correlation']).argsort()[::-1]]

unique_important_corrs:

attribute pair  correlation
    (RAD, TAX)     0.910228
    (DIS, NOX)    -0.769230
  (INDUS, NOX)     0.763651
    (AGE, DIS)    -0.747881
    (AGE, NOX)     0.731470
  (INDUS, TAX)     0.720760
  (DIS, INDUS)    -0.708027
    (NOX, TAX)     0.668023
     (DIS, ZN)     0.664408
  (AGE, INDUS)     0.644779
   (CRIM, RAD)     0.622029
   (LSTAT, RM)    -0.613808
    (NOX, RAD)     0.611441
(INDUS, LSTAT)     0.603800
  (AGE, LSTAT)     0.602339
  (INDUS, RAD)     0.595129
  (LSTAT, NOX)     0.590879
   (CRIM, TAX)     0.579564
     (AGE, ZN)    -0.569537
  (LSTAT, TAX)     0.543993
    (DIS, TAX)    -0.534432
   (INDUS, ZN)    -0.533828
     (NOX, ZN)    -0.516604
    (AGE, TAX)     0.506456

Visualization

In order to better understand the dataset we can make things visual. Basically, we generate a lot of various 1D or 2D plots and look at them, hoping to find something interesting. Indeed, we humans are highly trained to take in information and find patterns visually.

There are Python packages that can help us a lot with this task. In particular, we use the standard matplotlib package, plus seaborn for some extra statistical plots and for a more elegant and readable plot style.

While examining the dataset interactively we can utilize the default interactive plotting features. However, for reproducibility and for publishing our results, e.g. on a blog, we'd like to export the plots to images, e.g. PNG. Python can help us automate this.
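
A minimal sketch of such an export (the file name is arbitrary; see the Tips section below for suppressing the GUI window during batch plotting):

import matplotlib.pyplot as plt

plt.hist(df['MEDV'], bins=50)
plt.savefig('medv_histogram.png', bbox_inches='tight')  # export to a PNG file
plt.close()  # free the figure when generating many plots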

Attribute correlations

We'd like to plot the correlation value for each pair of attributes, i.e. a 2D matrix. To avoid duplicates (due to pair symmetry) just one triangle is enough (e.g. the lower triangular matrix). The range of the correlation value is [-1; 1], thus it can be color coded using a diverging color map - one hue for each sign (e.g. red/blue) and saturation for the absolute value.

The seaborn package offers corrplot(), which is exactly what we need (example).

import matplotlib.pyplot as plt
import seaborn as sns  # just a conventional alias

sns.corrplot(df)  # compute and plot the pair-wise correlations
# save to file; bbox_inches='tight' removes the big white borders
plt.savefig('attribute_correlations.png', bbox_inches='tight')

attribute correlations default

The lower triangle shows the correlation values as colored squares, the diagonal holds the attribute names, and the upper triangle shows the actual correlation values with their significance represented by stars. There is also a colorbar.

We can change the correlation method with the method parameter (the same variants available as in pandas).
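
For example (assuming the method names match those in pandas):

sns.corrplot(df, method='spearman')  # Spearman rank correlation instead of Pearson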

We might want to adjust the plot to look better. The default font size is quite big and looks ugly. I'm not aware of a clean way to decrease it, thus as a work-around we can enlarge the plot. If you know a better way, please comment.

# make a plot of specified dimension (in inches), just a 1x1 subplot
fig, ax = plt.subplots(figsize=(10, 10))
# pass the axis to draw on
sns.corrplot(df, ax=ax)

attribute correlations smaller font size

Also we might want to customize the colormap to some other diverging one:

# red-blue (hues 250 and 10, just 3 colors) diverging palette with a white center
sns.corrplot(df, cmap=sns.diverging_palette(250, 10, n=3, as_cmap=True))

attribute correlations smaller font size

In case there are just too many attributes, we might hide the actual values and put the attribute names aside:

sns.corrplot(df, annot=False, sig_stars=False, diag_names=False)

attribute correlations smaller font size

Attribute histograms

Next we'd like to plot a histogram of the values of each attribute. It acts as an estimate of the probability density function and gives us a better idea of what the values of each attribute look like.

A simple option is the standard hist() function from matplotlib:

import matplotlib.pyplot as plt
attr = df['MEDV']
plt.hist(attr)

It automatically quantizes the value range into several bins. The number of bins can be specified via a parameter:

plt.hist(attr, bins=50)

Seaborn has a plot dedicated to distributions - distplot(). Moreover, the various possibilities for plotting distributions are beautifully covered in its documentation: Visualizing distributions of data.

distplot is able to combine a histogram with a kernel density estimate (KDE) plot (which looks like a smoothed histogram) in a single figure. It is useful for real-valued data.

sns.distplot(attr)

distribution of MEDV - distplot

However, for integer-valued data (like categories) automatic quantization into a pre-defined number of bins might not be the best option. We'd rather quantize according to the original distinct values. For that we can just compute this kind of histogram ourselves and use a bar plot.

An example for RAD (an integer category - index of accessibility to radial highways):

cat_attr = df['RAD']
h = cat_attr.value_counts()
values, counts = h.index, h
plt.bar(values, counts)
# or more compactly:
plt.bar(*list(zip(*cat_attr.value_counts().items())))

distribution of RAD - bar plot

Pair-wise joint plots

So far we've visualized the distributions of individual attributes and the correlations of attribute pairs. We can dig deeper into the relation of an attribute pair by examining its joint distribution. We'd like a 2D plot with each axis representing the range of one attribute and the points on the plot representing the probability that both attributes take those particular values at once. Plotting 2D distributions is also nicely described with many examples in the seaborn docs.

If we have a few data points we can plot a little dot at that 2D position for each observed point - this is called a scatter plot and is available via plt.scatter(x, y).

plt.scatter(df['MEDV'], df['LSTAT'])

Seaborn enhances this plot by adding a small histogram of each attribute on the side and a label with the correlation value and its significance:

sns.jointplot(df['MEDV'], df['LSTAT'], kind='scatter')

DIS-MEDV jointplot scatter

Rather than plotting individual observations, we'd like to estimate the joint PDF and plot it color-coded. There are various approaches. With a lot of data points we can fake it using a scatter plot with semi-transparent markers that add up.

x, y = df['MEDV'], df['LSTAT']
plt.scatter(x, y, alpha=0.5)

# or via jointplot (with histograms aside):
sns.jointplot(x, y, kind='scatter', joint_kws={'alpha':0.5})

DIS-MEDV jointplot scatter with alpha 0.5

Another possibility is to aggregate the data points over 2D areas and estimate the PDF this way. It's a 2D generalization of a histogram. We can use either a rectangular grid or even a hexagonal one.

sns.jointplot(df['MEDV'], df['LSTAT'], kind='hex')

DIS-MEDV jointplot hexagonal histogram

Alternatively, we can estimate the PDF smoothly by convolving each data point with a kernel function. This is possible with sns.kdeplot().

sns.kdeplot(df['MEDV'], df['LSTAT'], shade=True)
# or 
sns.jointplot(df['MEDV'], df['LSTAT'], kind='kde')

DIS-MEDV jointplot kernel density estimation

Plotting the exact observations (as with a scatter plot) may lead to "over-fitting", while plotting a smoothed kernel density estimate may lead to too much generalization. A reasonable trade-off between the two is the hexagon-aggregated 2D histogram. But as always, the human observer who assesses the data must use their own judgement.

Pair-wise scatter matrix

There are N * (N - 1) / 2 unique pairs of attributes, i.e. quite a lot. In order to get a quick overview of the big picture, we might want to display a rough joint distribution plot for each pair in a single image. That's exactly what pairplot() from seaborn does - scatter plots, one for each pair, are aligned into a matrix and the diagonal is filled with attribute histograms.

sns.pairplot(df)

That's it! You can of course do it by hand using matplotlib, but why reinvent the wheel instead of leveraging the power of existing libraries? Note that the resulting image is quite big.

pairwise scatter plot matrix

Tips

Matplotlib opens the GUI window even for batch plotting to files

When plotting non-interactively I found that the matplotlib window still opens. To prevent this, there's a simple work-around: before importing the matplotlib.pyplot package, select the 'Agg' backend.

import matplotlib
matplotlib.use('Agg')  # select a non-interactive backend before importing pyplot
import matplotlib.pyplot as plt

Logging

While developing and running a script it is useful to know its intermediate status. The most basic tool in Python is the print() function. The problem is differentiating between the actual program results and helpful debugging messages.

The standard solution is to use logging. We can assign a log level to each message (e.g. DEBUG, INFO, WARNING, ERROR) and then filter only the relevant messages: e.g. DEBUG and more important messages during development, and INFO or WARNING and above during normal usage.

In the simplest case we can log directly to the standard output (console), but we can also log to files, etc.

The basic usage in Python:

import logging

# set up logging to standard output with a simple format and the INFO level
log_format = '%(asctime)s %(levelname)s %(message)s'
logging.basicConfig(format=log_format, level=logging.INFO)

logging.info('this gets printed')
logging.debug("this doesn't get printed unless we set the level to DEBUG or lower")

For more information look at the Python logging module documentation.

Printing the report to file

Also we'd like to produce a text report that's easy to publish. One way would be to copy and paste the script output, but that's manual work, and we like automation and reproducibility. Another, almost as simple, solution is to print to a file. The print() function supports this out of the box; the only thing needed is to pass it an open file.

# using a context manager ensures the file gets closed
with open('report.txt', 'w') as report:
    print('some result', file=report)

Conclusion

In this article we described the basic process of examining a dataset for further usage, e.g. in machine learning or data visualization. We took the outline of basic questions from the Applied Machine Learning Process book and applied them to the classic Boston housing dataset. We described how powerful tools like Python and its libraries can help us get to results quickly while leaving us the freedom to go deeper when needed. The actual results, along with a lot of plots, are presented separately in two reports (problem-definition.md, data_analysis.md).

I was happy to try the process by hand without the usual time pressure of work projects. Now I have a template of which questions to ask and what the answers might look like, plus a sample script implementing the basic computations and plots that can be tailored to other datasets encountered in the future. It's released as open source, so you can base your further data explorations on it.

I hope you found something useful or inspiring here and feel free to share your own data exploration reports or code.

Get in touch on Twitter: @bzamecnik.