QuantRocket logo

© Copyright Quantopian Inc.
© Modifications Copyright QuantRocket LLC
Licensed under the Creative Commons Attribution 4.0.

Disclaimer

Introduction to Notebooks

Jupyter notebooks allow one to perform a great deal of data analysis and statistical validation. We'll demonstrate a few simple techniques here.

Code Cells vs. Text Cells

As you can see, each cell can be either code or text. To switch between them, select the cell type ('Code' or 'Markdown') from the dropdown menu at the top of the notebook.

Executing a Command

A code cell will be evaluated when you press the play button or use the keyboard shortcut Shift-Enter. Evaluating a cell runs each line of code in sequence and displays the result of the last line below the cell.

In [1]:
2 + 2
Out[1]:
4

Sometimes there is no result to be printed, as is the case with assignment.

In [2]:
X = 2

Remember that only the result from the last line is printed.

In [3]:
2 + 2
3 + 3
Out[3]:
6

However, you can print whichever lines you want using the print function.

In [4]:
print(2 + 2)
3 + 3
4
Out[4]:
6

Knowing When a Cell is Running

While a cell is running, a [*] will display on the left. When a cell has yet to be executed, [ ] will display. When it has finished, a number will display indicating the order in which it was run during the notebook's execution (e.g., [5]). Run the cell below and watch the indicator change.

In [5]:
#Take some time to run something
c = 0
for i in range(10000000):
    c = c + i
c
Out[5]:
49999995000000

Importing Libraries

The vast majority of the time, you'll want to use functions from pre-built libraries. Here I import numpy and pandas, the two most common and useful libraries in quant finance. I recommend copying this import statement to every new notebook.

Notice that you can rename libraries to whatever you want after importing them; the as keyword allows this. Here we use np and pd as aliases for numpy and pandas. These aliases are conventional and appear in most code snippets around the web. The point is to let you type fewer characters when you access these libraries frequently.

In [6]:
import numpy as np
import pandas as pd

# This is a plotting library for pretty pictures.
import matplotlib.pyplot as plt

Tab Autocomplete

Pressing Tab will give you a list of Python's best guesses for what you might want to type next. This is incredibly valuable and will save you a lot of time. If there is only one possible option for what you could type next, Python will fill it in for you. Try pressing Tab frequently; it will seldom fill in anything you don't want, because a list is shown whenever there is ambiguity. This is also a great way to see what functions are available in a library.

Try placing your cursor after the . and pressing tab.

In [ ]:
np.random.

Getting Documentation Help

Placing a question mark after a function and executing that line of code will give you the documentation Python has for that function. It's often best to do this in a new cell, as you avoid re-executing other code and running into bugs.

In [7]:
np.random.normal?
Docstring:
normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first
derived by De Moivre and 200 years later by both Gauss and Laplace
independently [2]_, is often called the bell curve because of
its characteristic shape (see the example below).

The normal distributions occurs often in nature.  For example, it
describes the commonly occurring distribution of samples influenced
by a large number of tiny, random disturbances, each with its own
unique distribution [2]_.

Parameters
----------
loc : float or array_like of floats
    Mean ("centre") of the distribution.
scale : float or array_like of floats
    Standard deviation (spread or "width") of the distribution.
size : int or tuple of ints, optional
    Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
    ``m * n * k`` samples are drawn.  If size is ``None`` (default),
    a single value is returned if ``loc`` and ``scale`` are both scalars.
    Otherwise, ``np.broadcast(loc, scale).size`` samples are drawn.

Returns
-------
out : ndarray or scalar
    Drawn samples from the parameterized normal distribution.

See Also
--------
scipy.stats.norm : probability density function, distribution or
    cumulative density function, etc.

Notes
-----
The probability density for the Gaussian distribution is

.. math:: p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }}
                 e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },

where :math:`\mu` is the mean and :math:`\sigma` the standard
deviation. The square of the standard deviation, :math:`\sigma^2`,
is called the variance.

The function has its peak at the mean, and its "spread" increases with
the standard deviation (the function reaches 0.607 times its maximum at
:math:`x + \sigma` and :math:`x - \sigma` [2]_).  This implies that
`numpy.random.normal` is more likely to return samples lying close to
the mean, rather than those far away.

References
----------
.. [1] Wikipedia, "Normal distribution",
       http://en.wikipedia.org/wiki/Normal_distribution
.. [2] P. R. Peebles Jr., "Central Limit Theorem" in "Probability,
       Random Variables and Random Signal Principles", 4th ed., 2001,
       pp. 51, 51, 125.

Examples
--------
Draw samples from the distribution:

>>> mu, sigma = 0, 0.1 # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)

Verify the mean and the variance:

>>> abs(mu - np.mean(s)) < 0.01
True

>>> abs(sigma - np.std(s, ddof=1)) < 0.01
True

Display the histogram of the samples, along with
the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, normed=True)
>>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
...                np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
...          linewidth=2, color='r')
>>> plt.show()
Type:      builtin_function_or_method
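The trailing ? is an IPython feature, so it won't work in a plain Python script. Outside of a notebook, you can read the same documentation through a function's __doc__ attribute or the built-in help(). A minimal sketch:

```python
import numpy as np

# The ? syntax is IPython-only; __doc__ and help() work in any Python session.
doc = np.random.normal.__doc__
print("normal" in doc)  # True: the docstring shown above lives in __doc__
```

help(np.random.normal) prints the same text in a pager-friendly form.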

Sampling

We'll sample some random data using a function from numpy.

In [8]:
# Sample 100 points with a mean of 0 and an std of 1. This is a standard normal distribution.
X = np.random.normal(0, 1, 100)
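One practical note (not covered above): random draws differ on every run. If you want reproducible samples, numpy's Generator API lets you seed the random stream. A small sketch:

```python
import numpy as np

# Seeding the generator makes "random" samples repeatable across runs.
draw1 = np.random.default_rng(42).normal(0, 1, 100)
draw2 = np.random.default_rng(42).normal(0, 1, 100)
print((draw1 == draw2).all())  # True: same seed, identical samples
```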

Plotting

We can use the plotting library we imported as follows.

In [9]:
plt.plot(X)
Out[9]:
[<matplotlib.lines.Line2D at 0x7f8d45a0d518>]

Squelching Line Output

You might have noticed the annoying line of the form [<matplotlib.lines.Line2D at 0x7f72fdbc1710>] before the plots. This is because the .plot function actually returns a value. When we don't want to display that output, we can suppress it with a semicolon, as follows.

In [10]:
plt.plot(X);

Adding Axis Labels

No self-respecting quant leaves a graph without labeled axes. Here are some commands to help with that.

In [11]:
X = np.random.normal(0, 1, 100)
X2 = np.random.normal(0, 1, 100)

plt.plot(X);
plt.plot(X2);
plt.xlabel('Time') # The data we generated is unitless, but don't forget units in general.
plt.ylabel('Returns')
plt.legend(['X', 'X2']);

Generating Statistics

Let's use numpy to take some simple statistics.

In [12]:
np.mean(X)
Out[12]:
0.22668626421117452
In [13]:
np.std(X)
Out[13]:
1.0360427089489999
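One caveat worth knowing when taking statistics: np.std defaults to the population standard deviation (ddof=0). For data treated as a sample from a larger population, the sample standard deviation (ddof=1) divides by n - 1 instead of n. A small sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
pop = np.std(x)           # divides by n (ddof=0), numpy's default
samp = np.std(x, ddof=1)  # divides by n - 1, the unbiased sample estimate
print(round(pop, 3), round(samp, 3))  # 1.118 1.291
```

The difference shrinks as the number of observations grows, but it matters for small samples.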

Getting Real Pricing Data

Randomly sampled data can be great for testing ideas, but let's get some real data. In QuantRocket, all securities are referenced by sid (short for "security ID") rather than by symbol since symbols can change. So, first, we'll use the get_securities function to look up the sid for MSFT.

(Notice the use of vendors='usstock' in the get_securities function call. This limits the query to securities from the US Stock dataset. This filter isn't necessary if you've only collected US Stock data, but is a best practice when looking up securities by symbol in case you've also collected data from other global exchanges where the same ticker symbols are re-used.)

In [14]:
from quantrocket.master import get_securities
securities = get_securities(symbols='MSFT', fields=['Sid','Symbol','Exchange'], vendors='usstock')
securities
Out[14]:
               Symbol Exchange
Sid
FIBBG000BPH459   MSFT     XNAS

This returns a pandas dataframe, where sids are stored in the dataframe's index.
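To show how the sid comes out of that index without needing a live QuantRocket connection, here is a tiny stand-in DataFrame built from the output above (the values are copied from it; only the construction is ours):

```python
import pandas as pd

# A stand-in for the securities DataFrame returned by get_securities above.
securities = pd.DataFrame(
    {"Symbol": ["MSFT"], "Exchange": ["XNAS"]},
    index=pd.Index(["FIBBG000BPH459"], name="Sid"),
)
MSFT = securities.index[0]  # sids live in the index, not in a column
print(MSFT)  # FIBBG000BPH459
```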

Then we use get_prices to query our data bundle. Although the bundle contains minute data, here we use the data_frequency parameter to request the data at daily frequency:

In [15]:
MSFT = securities.index[0]
from quantrocket import get_prices
data = get_prices("usstock-free-1min", data_frequency='daily', sids=MSFT, start_date='2012-01-01', end_date='2015-06-01', fields="Close")

Our data is now a dataframe. You can see the datetime index and the columns with different pricing data.

In [16]:
data.head()
Out[16]:
Sid              FIBBG000BPH459
Field Date
Close 2012-01-03         25.657
      2012-01-04         26.265
      2012-01-05         26.534
      2012-01-06         26.941
      2012-01-09         26.591

This is a pandas dataframe, so we can index in to get just the closing price for MSFT like this. For more info, see the pandas documentation.

In [17]:
X = data.loc['Close'][MSFT]

Because there is now also date information in our data, we provide two series to .plot. X.index gives us the datetime index, and X.values gives us the pricing values. These are used as the X and Y coordinates to make a graph.

In [18]:
plt.plot(X.index, X.values)
plt.ylabel('Price')
plt.legend(['MSFT']);

We can get statistics again on real data.

In [19]:
np.mean(X)
Out[19]:
35.30557759626604
In [20]:
np.std(X)
Out[20]:
7.173065802833085

Getting Returns from Prices

We can use the pct_change method to get returns. Notice how we drop the first element afterward: it is NaN, because there is no prior price to compute a change from.

In [21]:
R = X.pct_change()[1:]
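To see the NaN-dropping behavior on something small, here is pct_change applied to a toy price series:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.96])
returns = prices.pct_change()   # first element is NaN: no prior price
clean = returns[1:]             # drop the leading NaN, as above
print(clean.round(4).tolist())  # [0.02, -0.02]
```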

We can plot the returns distribution as a histogram.

In [22]:
plt.hist(R, bins=20)
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.legend(['MSFT Returns']);

Get statistics again.

In [23]:
np.mean(R)
Out[23]:
0.0008173587738800674
In [24]:
np.std(R)
Out[24]:
0.014426725203892897

Now let's go backwards and generate data out of a normal distribution using the statistics we estimated from Microsoft's returns. We'll see that we have good reason to suspect Microsoft's returns may not be normal, as the resulting normal distribution looks far different.

In [25]:
plt.hist(np.random.normal(np.mean(R), np.std(R), 10000), bins=20)
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.legend(['Normally Distributed Returns']);
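One quick numerical way to probe the "not normal" suspicion is excess kurtosis, which is near 0 for normal data and positive for fat-tailed data such as stock returns. The helper below is our own sketch, not a QuantRocket or numpy function; here we only verify it on simulated normal data:

```python
import numpy as np

def excess_kurtosis(x):
    # Fourth standardized moment minus 3; approximately 0 for a normal distribution.
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

rng = np.random.default_rng(0)
normal_sample = rng.normal(0, 1, 100_000)
print(abs(excess_kurtosis(normal_sample)) < 0.1)  # True: close to 0, as expected
```

Applied to real returns like R above, a clearly positive value would support the fat-tails suspicion; for a formal test you would reach for something like scipy.stats.normaltest.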

Generating a Moving Average

pandas has some nice tools to allow us to generate rolling statistics. Here's an example. Notice how there's no moving average for the first 59 days: the window needs 60 observations before it can produce a value.

In [26]:
# Take the average of the last 60 days at each timepoint.
MAVG = X.rolling(window=60).mean()
plt.plot(X.index, X.values)
plt.plot(MAVG.index, MAVG.values)
plt.ylabel('Price')
plt.legend(['MSFT', '60-day MAVG']);
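The same NaN padding is easy to see on a toy series with a 3-observation window:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
mavg = s.rolling(window=3).mean()
print(mavg.tolist())  # [nan, nan, 2.0, 3.0, 4.0] -- NaN until the window fills
```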

This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian") or QuantRocket LLC ("QuantRocket"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, neither Quantopian nor QuantRocket has taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information believed to be reliable at the time of publication. Neither Quantopian nor QuantRocket makes any guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.