Exploratory Data Analysis (EDA) Tutorial

Summary

This post downloads data from Yahoo Finance and performs some basic exploratory data analysis (EDA).

In [1]:
!conda env list
%matplotlib inline
# conda environments:
#
base                  *  /Users/xianshi/anaconda3
blog                     /Users/xianshi/anaconda3/envs/blog

In [2]:
# Importing the Data


from pandas_datareader import data
import pandas as pd

# Define the instruments to download. Here we look at a handful of retail REITs.
# tickers = ['W', 'HD', 'AMZN', 'LOW', 'LL']
tickers = ['ROIC', 'SKT', 'TCO', 'SPG', 'MAC']
# Define which online source one should use
data_source = 'yahoo'

# We would like all available data from 12/01/2016 until 12/31/2017.
start_date = '2016-12-01'
end_date = '2017-12-31'

# Use pandas_datareader.data.DataReader to load the desired data. As simple as that.
panel_data = data.DataReader(tickers, data_source, start_date, end_date)
In [3]:
panel_data.head()
Out[3]:
Attributes High Low ... Volume Adj Close
Symbols MAC ROIC SKT SPG TCO MAC ROIC SKT SPG TCO ... MAC ROIC SKT SPG TCO MAC ROIC SKT SPG TCO
Date
2016-12-01 68.279999 20.629999 34.669998 179.380005 72.290001 66.809998 20.190001 34.090000 175.869995 71.010002 ... 927500.0 1041700.0 745100.0 2075900.0 355000.0 58.302998 18.167982 29.123262 157.796829 63.129143
2016-12-02 68.690002 20.799999 34.990002 180.789993 72.790001 67.540001 20.370001 34.380001 177.419998 71.459999 ... 676100.0 601200.0 418800.0 1273900.0 250200.0 58.925732 18.436207 29.420441 160.063568 63.632339
2016-12-05 68.839996 20.770000 34.990002 180.809998 72.709999 67.680000 20.549999 34.330002 178.029999 71.739998 ... 931000.0 648500.0 478000.0 1027300.0 245000.0 59.505215 18.534555 29.692146 161.312943 64.100204
2016-12-06 69.610001 21.020000 35.310001 182.139999 73.550003 68.099998 20.590000 34.830002 179.130005 72.209999 ... 915600.0 590800.0 391100.0 1202000.0 269000.0 59.721443 18.597143 29.946865 160.732864 64.877075
2016-12-07 70.830002 21.350000 35.950001 183.759995 75.320000 69.169998 20.860001 35.310001 180.509995 73.660004 ... 810800.0 843500.0 628000.0 1278400.0 406700.0 61.165833 18.954781 30.481783 163.981232 65.768707

5 rows × 30 columns

In [4]:
df = panel_data['Adj Close']
In [5]:
# Basic Description of the Data

df.describe()
Out[5]:
Symbols MAC ROIC SKT SPG TCO
count 272.000000 272.000000 272.000000 272.000000 272.000000
mean 55.118781 18.442282 24.788283 151.898158 55.258763
std 4.208408 0.656658 3.412448 7.387024 6.499111
min 47.703194 16.672773 19.759792 138.740128 42.535595
25% 51.428945 17.981809 21.999384 146.275017 50.471438
50% 55.492500 18.473031 23.172514 150.871696 54.953302
75% 59.109798 18.914228 28.274007 157.328503 60.127033
max 62.869678 20.064341 31.560102 167.898178 67.782234
In [6]:
first = df.head()
last = df.tail()
print(first)
print(last)
Symbols           MAC       ROIC        SKT         SPG        TCO
Date                                                              
2016-12-01  58.302998  18.167982  29.123262  157.796829  63.129143
2016-12-02  58.925732  18.436207  29.420441  160.063568  63.632339
2016-12-05  59.505215  18.534555  29.692146  161.312943  64.100204
2016-12-06  59.721443  18.597143  29.946865  160.732864  64.877075
2016-12-07  61.165833  18.954781  30.481783  163.981232  65.768707
Symbols           MAC       ROIC        SKT         SPG        TCO
Date                                                              
2017-12-22  59.747002  18.089575  22.627144  155.758743  59.732700
2017-12-26  60.232689  18.398554  23.054577  157.893692  60.799690
2017-12-27  60.370136  18.520279  23.428577  159.208237  60.966686
2017-12-28  60.489265  18.735632  23.597769  159.562500  60.911030
2017-12-29  60.186863  18.679453  23.606674  160.112564  60.706913
In [7]:
df.sample(6)
Out[7]:
Symbols MAC ROIC SKT SPG TCO
Date
2017-03-28 56.465996 18.949013 27.943995 151.426422 57.989056
2017-03-10 56.103745 18.396067 26.769661 151.534576 58.936821
2017-04-04 56.819401 19.276661 28.149717 153.688156 59.192852
2017-05-05 55.062401 19.003618 24.801085 148.245621 54.970566
2017-07-13 51.444580 17.836662 22.929958 144.016022 54.236202
2017-03-24 56.430656 19.222054 27.866848 151.336319 58.420258
In [8]:
# A Closer Look At Your Data: Queries

df.query('MAC == ROIC')
Out[8]:
Symbols MAC ROIC SKT SPG TCO
Date
In [9]:
#cleaning
print(df.columns[df.isnull().any()])
Index([], dtype='object', name='Symbols')
In [10]:
# Getting all weekdays between 12/01/2016 and 12/31/2017
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')

# How do we align the existing prices in df with our new set of dates?
# All we need to do is reindex df using all_weekdays as the new index
df = df.reindex(all_weekdays)

# Reindexing will insert missing values (NaN) for the dates that were not present
# in the original set. To cope with this, we can fill the missing by replacing them
# with the latest available price for each instrument.
df = df.fillna(method='ffill')
df.isnull().head()
Out[10]:
Symbols MAC ROIC SKT SPG TCO
2016-12-01 False False False False False
2016-12-02 False False False False False
2016-12-05 False False False False False
2016-12-06 False False False False False
2016-12-07 False False False False False
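To make the reindex-and-forward-fill step concrete, here is a toy sketch with synthetic prices (not the REIT data above): one business day is missing from the observed series, and forward filling carries the last known price onto it.

```python
import pandas as pd

# Prices observed on three business days; Dec 5 (a Monday) is missing
prices = pd.Series(
    [10.0, 11.0, 12.0],
    index=pd.to_datetime(['2016-12-01', '2016-12-02', '2016-12-06']),
)

# Reindex onto every business day in the range; new dates get NaN
all_days = pd.date_range('2016-12-01', '2016-12-06', freq='B')
filled = prices.reindex(all_days).ffill()

# Dec 5 now carries forward the last available price, 11.0
print(filled)
```

The same idea scales to the whole DataFrame above: each column is forward filled independently.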
In [11]:
# Define your own bins (note: range() stops short of the upper bound, so the
# handful of prices above the last integer edge fall outside these bins)
mybins = range(int(df.MAC.min()), int(df.MAC.max()), 2)

# Cut the data with the help of the bins
df['MAC_bucket'] = pd.cut(df.MAC, bins=mybins)

# Count the number of values per bucket
df['MAC_bucket'].value_counts()
Out[11]:
(51, 53]    62
(59, 61]    61
(57, 59]    46
(49, 51]    35
(55, 57]    26
(53, 55]    21
(47, 49]    17
Name: MAC_bucket, dtype: int64

Simple exercise analysing stocks using IPython

In [1]:
from pandas_datareader import data
import pandas as pd
%matplotlib inline


# Define the instruments to download. Here we look at a handful of retail REITs.
# tickers = ['W', 'HD', 'AMZN', 'LOW', 'LL']
tickers = ['ROIC', 'SKT', 'TCO', 'SPG', 'MAC']
# Define which online source one should use
data_source = 'yahoo'

# We would like all available data from 12/01/2016 until 12/31/2017.
start_date = '2016-12-01'
end_date = '2017-12-31'

# Use pandas_datareader.data.DataReader to load the desired data. As simple as that.
panel_data = data.DataReader(tickers, data_source, start_date, end_date)
In [2]:
# Getting just the (unadjusted) closing prices. This will return a pandas DataFrame.
# (Panel.ix is deprecated; selecting the top-level column works the same way.)
close = panel_data['Close']

# Getting all weekdays between 12/01/2016 and 12/31/2017
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')

# How do we align the existing prices in close with our new set of dates?
# All we need to do is reindex close using all_weekdays as the new index
close = close.reindex(all_weekdays)

close.head()
Out[2]:
MAC ROIC SKT SPG TCO
2016-12-01 67.410004 20.320000 34.299999 176.820007 71.510002
2016-12-02 68.129997 20.620001 34.650002 179.360001 72.080002
2016-12-05 68.800003 20.730000 34.970001 180.759995 72.610001
2016-12-06 69.050003 20.799999 35.270000 180.110001 73.489998
2016-12-07 70.720001 21.200001 35.900002 183.750000 74.500000
In [3]:
# Getting just the adjusted closing prices. This will return a pandas DataFrame.
# (Panel.ix is deprecated; selecting the top-level column works the same way.)
adj_close = panel_data['Adj Close']

# Getting all weekdays between 12/01/2016 and 12/31/2017
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')

# How do we align the existing prices in adj_close with our new set of dates?
# All we need to do is reindex adj_close using all_weekdays as the new index
adj_close = adj_close.reindex(all_weekdays)

# Reindexing will insert missing values (NaN) for the dates that were not present
# in the original set. To cope with this, we can fill the missing by replacing them
# with the latest available price for each instrument.
adj_close = adj_close.fillna(method='ffill')
adj_close.describe()
Out[3]:
MAC ROIC SKT SPG TCO
count 282.000000 282.000000 282.000000 282.000000 282.000000
mean 60.359130 19.702806 27.499412 163.018773 59.680361
std 4.765473 0.695494 3.795979 7.915956 7.001413
min 52.056969 17.806831 21.890348 148.815491 45.844936
25% 56.158371 19.218705 24.376531 156.912205 54.580725
50% 60.819279 19.729540 25.674796 161.862656 59.257576
75% 64.790953 20.204208 31.491758 168.811760 64.934494
max 69.344276 21.429085 34.961044 180.091019 73.055801
In [1]:
# Define your own bins
mybins = range(df.MAC.min(), df.MAC.max(), 10)

# Cut the data with the help of the bins
df['MAC_bucket'] = pd.cut(df.MAC, bins=mybins)

# Count the number of values per bucket
df['MAC_bucket'].value_counts()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-5cc82bb1cc93> in <module>()
      1 # Define your own bins
----> 2 mybins = range(df.MAC.min(), df.MAC.max(), 10)
      3 
      4 # Cut the data with the help of the bins
      5 df['MAC_bucket'] = pd.cut(df.MAC, bins=mybins)

NameError: name 'df' is not defined
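A working version of the cell above might look like the following sketch. Two fixes are needed: `df` must exist (this fresh notebook never defined it, hence the NameError, so a synthetic stand-in is used here in place of the Adj Close frame from the first section), and `range()` requires integer arguments, so the bin edges are cast to int as in the earlier notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Adj Close frame from the first section
rng = np.random.default_rng(0)
df = pd.DataFrame({'MAC': rng.uniform(47, 63, size=250)})

# range() needs integer arguments; extend the upper bound so the
# maximum price still falls inside the last bin
mybins = range(int(df.MAC.min()), int(df.MAC.max()) + 2, 2)

# Cut the data with the help of the bins and count values per bucket
df['MAC_bucket'] = pd.cut(df.MAC, bins=mybins)
print(df['MAC_bucket'].value_counts())
```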

Lump Sum vs. Dollar Cost Averaging (DCA) in Python

This section uses Apple's stock as an example to create some graphs and compare lump-sum investing with dollar cost averaging.

Reference: http://tetration.xyz/lumpsum_vs_dca/

In [1]:
import pandas as pd
import pandas_datareader.data as web
import datetime
pd.set_option('display.width', 200) # Displaying more columns in one row

# Data date range; prices are pulled from the Yahoo source below
start = datetime.datetime(2017, 9, 1) 
end = datetime.datetime(2018, 2, 3)

spy = web.DataReader("AAPL", "yahoo", start, end)  # note: the variable is named spy, but the ticker here is AAPL

print(spy.head()) # See first few rows
                  Open        High         Low       Close   Adj Close    Volume
Date                                                                            
2017-09-01  164.800003  164.940002  163.630005  164.050003  163.462372  16591100
2017-09-05  163.750000  164.250000  160.559998  162.080002  161.499435  29468500
2017-09-06  162.710007  162.990005  160.520004  161.910004  161.330048  21651700
2017-09-07  162.089996  162.240005  160.360001  161.259995  160.682358  21928500
2017-09-08  160.860001  161.149994  158.529999  158.630005  158.061798  28611500
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from matplotlib import style
style.use('fivethirtyeight')

spy['Adj Close'].plot(figsize=(20,10))
ax = plt.subplot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: '${:,.0f}'.format(x))) # Y axis dollarsymbols
plt.title('AAPL Historical Price on Close')
plt.xlabel('')
plt.ylabel('Stock Price ($)');
In [3]:
value_price = spy['Adj Close'][-1] # The final value of our stock
initial_investment = 10000 # Our initial investment of $10k

num_stocks_bought = initial_investment / spy['Adj Close']
lumpsum = num_stocks_bought * value_price
lumpsum.name = 'Lump Sum'

lumpsum.plot(figsize=(20,10))
ax = plt.subplot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: '${:,.0f}'.format(x))) # Y axis dollarsymbols
plt.title('Lump sum - Value today of $10,000 invested on date')
plt.xlabel('')
plt.ylabel('Investment Value ($)');
In [6]:
def doDCA(investment, start_date):
    # Get 12 investment dates in 7 day increments starting from start date
    investment_dates_all = pd.date_range(start_date,periods=12,freq='7D')
    # Remove those dates beyond our known data range
    investment_dates = investment_dates_all[investment_dates_all < spy.index[-1]]

    # Get closest business dates with available data
    closest_investment_dates = spy.index.searchsorted(investment_dates)

    # How much to invest on each date
    portion = investment/12.0 # (Python 3 does true division by default; Python 2.7 does not)

    # Get the total of all stocks purchased for each of those dates (on the Close)
    stocks_invested = sum(portion / spy['Adj Close'][closest_investment_dates])

    # Add uninvested amount back
    uninvested_dollars = portion * sum(investment_dates_all >= spy.index[-1])

    # value of stocks today
    total_value = value_price*stocks_invested + uninvested_dollars
    return total_value

# Generate DCA series for every possible date
dca = pd.Series(spy.index.map(lambda x: doDCA(initial_investment, x)), index=spy.index, name='Dollar Cost Averaging (DCA)')
In [7]:
dca.plot(figsize=(20,10))
ax = plt.subplot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: '${:,.0f}'.format(x))) # Y axis dollarsymbols
plt.title('Dollar Cost Averaging - Value today of $10,000 invested on date')
plt.xlabel('')
plt.ylabel('Investment Value ($)');

Your First Step by Step Machine Learning Project

Step 1: Download and install Anaconda

Step 2: Launch IPython

Step 3: Copy code from below and paste to your Notebook

     # Load libraries
     import pandas
     from pandas.plotting import scatter_matrix  # pandas.tools.plotting is deprecated
     import matplotlib.pyplot as plt
     from sklearn import model_selection
     from sklearn.metrics import classification_report
     from sklearn.metrics import confusion_matrix
     from sklearn.metrics import accuracy_score
     from sklearn.linear_model import LogisticRegression
     from sklearn.tree import DecisionTreeClassifier
     from sklearn.neighbors import KNeighborsClassifier
     from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
     from sklearn.naive_bayes import GaussianNB
     from sklearn.svm import SVC
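With those libraries loaded, a minimal end-to-end sketch might look like the following. It uses scikit-learn's built-in iris dataset purely as a stand-in for whatever data your project loads, and fits just one of the classifiers imported above:

```python
from sklearn.datasets import load_iris
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small labeled dataset and hold out 20% for validation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=7)

# Fit one of the classifiers imported above and score it on the holdout set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```

The other imported classifiers (KNeighborsClassifier, DecisionTreeClassifier, GaussianNB, SVC, LinearDiscriminantAnalysis) can be swapped in on the same two `fit`/`predict` lines to compare accuracies.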

First Blog Using IPython Notebook

  1. Download and install Anaconda (this gives you IPython and a nice environment)

  2. Install Nikola

  3. Follow the tutorial

Helpful Links:

http://www.damian.oquanta.info/posts/ipython-plugin-for-nikola-updated.html

https://shankarmsy.github.io/posts/blogging-with-the-awesome-nikola-ipython-and-github.html

In [1]:
# http://nbviewer.jupyter.org/gist/theandygross/4544012
import sys
print('Python: {}'.format(sys.version))
Python: 3.6.0 |Anaconda custom (x86_64)| (default, Dec 23 2016, 13:19:00) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
In [2]:
print ('This is really exciting')
This is really exciting
In [3]:
print (12)
12
In [ ]:
print ('hello, notebook blog :)')

Blog with the Nikola, IPython and Github

https://shankarmsy.github.io/posts/blogging-with-the-awesome-nikola-ipython-and-github.html

Problem 1: conda virtual environments didn't work for me at first.

https://conda.io/docs/using/envs.html#create-an-environment

conda create -n blog python

jakehku blog $ conda create -n blog
Fetching package metadata ...........
Solving package specifications:
Package plan for installation in environment /anaconda/envs/blog:

Proceed ([y]/n)? y

#
# To activate this environment, use:
# > source activate blog
#
# To deactivate an active environment, use:
# > source deactivate
#

Deploying to GitHub

Nikola provides a separate command github_deploy to deploy your site to GitHub Pages. The command builds the site, commits the output to a gh-pages branch and pushes the output to GitHub. Nikola uses the ghp-import command for this.

In order to use this feature, you need to configure a few things first. Make sure you have nikola and git installed on your PATH.

Initialize a Nikola site, if you haven’t already.

Initialize a git repository in your Nikola source directory by running:

git init .
git remote add origin git@github.com:user/repository.git

Set up branches and remotes in conf.py:

GITHUB_DEPLOY_BRANCH is the branch where Nikola-generated HTML files will be deployed. It should be gh-pages for project pages and master for user pages (user.github.io).

GITHUB_SOURCE_BRANCH is the branch where your Nikola site source will be deployed. We recommend and default to src.

GITHUB_REMOTE_NAME is the remote to which changes are pushed.

GITHUB_COMMIT_SOURCE controls whether or not the source branch is automatically committed to and pushed. We recommend setting it to True, unless you are automating builds with Travis CI.

Create a .gitignore file. We recommend adding at least the following entries:

cache
.doit.db
__pycache__
output

If you set GITHUB_COMMIT_SOURCE to False, you must switch to your source branch and commit to it. Otherwise, this is done for you.
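The settings described above might be sketched in conf.py like this; the branch and remote names shown are the documented defaults, used here only as an illustration:

```python
# conf.py (excerpt) -- deployment settings for nikola github_deploy
GITHUB_SOURCE_BRANCH = 'src'       # branch holding the Nikola site source
GITHUB_DEPLOY_BRANCH = 'gh-pages'  # branch receiving the generated HTML
GITHUB_REMOTE_NAME = 'origin'      # remote that both branches push to
GITHUB_COMMIT_SOURCE = True        # auto-commit and push the source branch
```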

Run nikola github_deploy. This will build the site, commit the output folder to your deploy branch, and push to GitHub. Your website should be up and running within a few minutes.

If you want to use a custom domain, create your CNAME file in files/CNAME on the source branch. Nikola will copy it to the output directory. To add a custom commit message, use the -m option, followed by your message.

git pull origin src --allow-unrelated-histories