Exploratory Data Analysis (EDA) Tutorial

Summary

This post downloads data from Yahoo Finance and performs some basic exploratory data analysis (EDA).

In [1]:
!conda env list
%matplotlib inline
# conda environments:
#
base                  *  /Users/xianshi/anaconda3
blog                     /Users/xianshi/anaconda3/envs/blog

In [2]:
# Importing the Data


from pandas_datareader import data
import pandas as pd

# Define the instruments to download. Here we look at a handful of retail REITs.
# tickers = ['W', 'HD', 'AMZN', 'LOW', 'LL']
tickers = ['ROIC', 'SKT', 'TCO', 'SPG', 'MAC']
# Define which online source one should use
data_source = 'yahoo'

# We would like all available data from 12/01/2016 until 12/31/2017.
start_date = '2016-12-01'
end_date = '2017-12-31'

# Use pandas_datareader.data.DataReader to load the desired data. As simple as that.
panel_data = data.DataReader(tickers, data_source, start_date, end_date)
In [3]:
panel_data.head()
Out[3]:
Attributes High Low ... Volume Adj Close
Symbols MAC ROIC SKT SPG TCO MAC ROIC SKT SPG TCO ... MAC ROIC SKT SPG TCO MAC ROIC SKT SPG TCO
Date
2016-12-01 68.279999 20.629999 34.669998 179.380005 72.290001 66.809998 20.190001 34.090000 175.869995 71.010002 ... 927500.0 1041700.0 745100.0 2075900.0 355000.0 58.302998 18.167982 29.123262 157.796829 63.129143
2016-12-02 68.690002 20.799999 34.990002 180.789993 72.790001 67.540001 20.370001 34.380001 177.419998 71.459999 ... 676100.0 601200.0 418800.0 1273900.0 250200.0 58.925732 18.436207 29.420441 160.063568 63.632339
2016-12-05 68.839996 20.770000 34.990002 180.809998 72.709999 67.680000 20.549999 34.330002 178.029999 71.739998 ... 931000.0 648500.0 478000.0 1027300.0 245000.0 59.505215 18.534555 29.692146 161.312943 64.100204
2016-12-06 69.610001 21.020000 35.310001 182.139999 73.550003 68.099998 20.590000 34.830002 179.130005 72.209999 ... 915600.0 590800.0 391100.0 1202000.0 269000.0 59.721443 18.597143 29.946865 160.732864 64.877075
2016-12-07 70.830002 21.350000 35.950001 183.759995 75.320000 69.169998 20.860001 35.310001 180.509995 73.660004 ... 810800.0 843500.0 628000.0 1278400.0 406700.0 61.165833 18.954781 30.481783 163.981232 65.768707

5 rows × 30 columns

In [4]:
df = panel_data['Adj Close']
In [5]:
# Basic Description of the Data

df.describe()
Out[5]:
Symbols MAC ROIC SKT SPG TCO
count 272.000000 272.000000 272.000000 272.000000 272.000000
mean 55.118781 18.442282 24.788283 151.898158 55.258763
std 4.208408 0.656658 3.412448 7.387024 6.499111
min 47.703194 16.672773 19.759792 138.740128 42.535595
25% 51.428945 17.981809 21.999384 146.275017 50.471438
50% 55.492500 18.473031 23.172514 150.871696 54.953302
75% 59.109798 18.914228 28.274007 157.328503 60.127033
max 62.869678 20.064341 31.560102 167.898178 67.782234
In [6]:
first = df.head()
last = df.tail()
print(first)
print(last)
Symbols           MAC       ROIC        SKT         SPG        TCO
Date                                                              
2016-12-01  58.302998  18.167982  29.123262  157.796829  63.129143
2016-12-02  58.925732  18.436207  29.420441  160.063568  63.632339
2016-12-05  59.505215  18.534555  29.692146  161.312943  64.100204
2016-12-06  59.721443  18.597143  29.946865  160.732864  64.877075
2016-12-07  61.165833  18.954781  30.481783  163.981232  65.768707
Symbols           MAC       ROIC        SKT         SPG        TCO
Date                                                              
2017-12-22  59.747002  18.089575  22.627144  155.758743  59.732700
2017-12-26  60.232689  18.398554  23.054577  157.893692  60.799690
2017-12-27  60.370136  18.520279  23.428577  159.208237  60.966686
2017-12-28  60.489265  18.735632  23.597769  159.562500  60.911030
2017-12-29  60.186863  18.679453  23.606674  160.112564  60.706913
In [7]:
df.sample(6)
Out[7]:
Symbols MAC ROIC SKT SPG TCO
Date
2017-03-28 56.465996 18.949013 27.943995 151.426422 57.989056
2017-03-10 56.103745 18.396067 26.769661 151.534576 58.936821
2017-04-04 56.819401 19.276661 28.149717 153.688156 59.192852
2017-05-05 55.062401 19.003618 24.801085 148.245621 54.970566
2017-07-13 51.444580 17.836662 22.929958 144.016022 54.236202
2017-03-24 56.430656 19.222054 27.866848 151.336319 58.420258
In [8]:
# A Closer Look At Your Data: Queries

df.query('MAC == ROIC')
Out[8]:
Symbols MAC ROIC SKT SPG TCO
Date
In [9]:
#cleaning
print(df.columns[df.isnull().any()])
Index([], dtype='object', name='Symbols')
In [10]:
# Getting all weekdays between 12/01/2016 and 12/31/2017
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')

# How do we align the existing prices in df with our new set of dates?
# All we need to do is reindex df using all_weekdays as the new index
df = df.reindex(all_weekdays)

# Reindexing will insert missing values (NaN) for the dates that were not present
# in the original set. To cope with this, we can fill the missing by replacing them
# with the latest available price for each instrument.
df = df.fillna(method='ffill')
df.isnull().head()
Out[10]:
Symbols MAC ROIC SKT SPG TCO
2016-12-01 False False False False False
2016-12-02 False False False False False
2016-12-05 False False False False False
2016-12-06 False False False False False
2016-12-07 False False False False False
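To make the reindex-and-forward-fill step concrete, here is a toy sketch with synthetic prices (not the REIT data above): one business day is missing from the observed series, and forward filling carries the last known price onto it.

```python
import pandas as pd

# Prices observed on three business days; Dec 5 (a Monday) is missing
prices = pd.Series(
    [10.0, 11.0, 12.0],
    index=pd.to_datetime(['2016-12-01', '2016-12-02', '2016-12-06']),
)

# Reindex onto every business day in the range; new dates get NaN
all_days = pd.date_range('2016-12-01', '2016-12-06', freq='B')
filled = prices.reindex(all_days).ffill()

# Dec 5 now carries forward the last available price, 11.0
print(filled)
```

The same idea scales to the whole DataFrame above: each column is forward filled independently.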
In [11]:
# Define your own bins (note: range() stops short of the upper bound, so the
# handful of prices above the last integer edge fall outside these bins)
mybins = range(int(df.MAC.min()), int(df.MAC.max()), 2)

# Cut the data with the help of the bins
df['MAC_bucket'] = pd.cut(df.MAC, bins=mybins)

# Count the number of values per bucket
df['MAC_bucket'].value_counts()
Out[11]:
(51, 53]    62
(59, 61]    61
(57, 59]    46
(49, 51]    35
(55, 57]    26
(53, 55]    21
(47, 49]    17
Name: MAC_bucket, dtype: int64

Simple exercise analysing stocks using IPython

In [1]:
from pandas_datareader import data
import pandas as pd
%matplotlib inline


# Define the instruments to download. Here we look at a handful of retail REITs.
# tickers = ['W', 'HD', 'AMZN', 'LOW', 'LL']
tickers = ['ROIC', 'SKT', 'TCO', 'SPG', 'MAC']
# Define which online source one should use
data_source = 'yahoo'

# We would like all available data from 12/01/2016 until 12/31/2017.
start_date = '2016-12-01'
end_date = '2017-12-31'

# Use pandas_datareader.data.DataReader to load the desired data. As simple as that.
panel_data = data.DataReader(tickers, data_source, start_date, end_date)
In [2]:
# Getting just the (unadjusted) closing prices. This will return a pandas DataFrame.
# (Panel.ix is deprecated; selecting the top-level column works the same way.)
close = panel_data['Close']

# Getting all weekdays between 12/01/2016 and 12/31/2017
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')

# How do we align the existing prices in close with our new set of dates?
# All we need to do is reindex close using all_weekdays as the new index
close = close.reindex(all_weekdays)

close.head()
Out[2]:
MAC ROIC SKT SPG TCO
2016-12-01 67.410004 20.320000 34.299999 176.820007 71.510002
2016-12-02 68.129997 20.620001 34.650002 179.360001 72.080002
2016-12-05 68.800003 20.730000 34.970001 180.759995 72.610001
2016-12-06 69.050003 20.799999 35.270000 180.110001 73.489998
2016-12-07 70.720001 21.200001 35.900002 183.750000 74.500000
In [3]:
# Getting just the adjusted closing prices. This will return a pandas DataFrame.
# (Panel.ix is deprecated; selecting the top-level column works the same way.)
adj_close = panel_data['Adj Close']

# Getting all weekdays between 12/01/2016 and 12/31/2017
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')

# How do we align the existing prices in adj_close with our new set of dates?
# All we need to do is reindex adj_close using all_weekdays as the new index
adj_close = adj_close.reindex(all_weekdays)

# Reindexing will insert missing values (NaN) for the dates that were not present
# in the original set. To cope with this, we can fill the missing by replacing them
# with the latest available price for each instrument.
adj_close = adj_close.fillna(method='ffill')
adj_close.describe()
Out[3]:
MAC ROIC SKT SPG TCO
count 282.000000 282.000000 282.000000 282.000000 282.000000
mean 60.359130 19.702806 27.499412 163.018773 59.680361
std 4.765473 0.695494 3.795979 7.915956 7.001413
min 52.056969 17.806831 21.890348 148.815491 45.844936
25% 56.158371 19.218705 24.376531 156.912205 54.580725
50% 60.819279 19.729540 25.674796 161.862656 59.257576
75% 64.790953 20.204208 31.491758 168.811760 64.934494
max 69.344276 21.429085 34.961044 180.091019 73.055801
In [1]:
# Define your own bins
mybins = range(df.MAC.min(), df.MAC.max(), 10)

# Cut the data with the help of the bins
df['MAC_bucket'] = pd.cut(df.MAC, bins=mybins)

# Count the number of values per bucket
df['MAC_bucket'].value_counts()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-5cc82bb1cc93> in <module>()
      1 # Define your own bins
----> 2 mybins = range(df.MAC.min(), df.MAC.max(), 10)
      3 
      4 # Cut the data with the help of the bins
      5 df['MAC_bucket'] = pd.cut(df.MAC, bins=mybins)

NameError: name 'df' is not defined
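A working version of the cell above might look like the following sketch. Two fixes are needed: `df` must exist (this fresh notebook never defined it, hence the NameError, so a synthetic stand-in is used here in place of the Adj Close frame from the first section), and `range()` requires integer arguments, so the bin edges are cast to int as in the earlier notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Adj Close frame from the first section
rng = np.random.default_rng(0)
df = pd.DataFrame({'MAC': rng.uniform(47, 63, size=250)})

# range() needs integer arguments; extend the upper bound so the
# maximum price still falls inside the last bin
mybins = range(int(df.MAC.min()), int(df.MAC.max()) + 2, 2)

# Cut the data with the help of the bins and count values per bucket
df['MAC_bucket'] = pd.cut(df.MAC, bins=mybins)
print(df['MAC_bucket'].value_counts())
```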

Lump Sum vs. Dollar Cost Averaging (DCA) in Python

This section uses Apple's stock as an example to create some graphs and compare lump-sum investing with dollar cost averaging.

Reference: http://tetration.xyz/lumpsum_vs_dca/

In [1]:
import pandas as pd
import pandas_datareader.data as web
import datetime
pd.set_option('display.width', 200) # Displaying more columns in one row

# Data date range; prices are pulled from the Yahoo source below
start = datetime.datetime(2017, 9, 1) 
end = datetime.datetime(2018, 2, 3)

spy = web.DataReader("AAPL", "yahoo", start, end)  # note: the variable is named spy, but the ticker here is AAPL

print(spy.head()) # See first few rows
                  Open        High         Low       Close   Adj Close    Volume
Date                                                                            
2017-09-01  164.800003  164.940002  163.630005  164.050003  163.462372  16591100
2017-09-05  163.750000  164.250000  160.559998  162.080002  161.499435  29468500
2017-09-06  162.710007  162.990005  160.520004  161.910004  161.330048  21651700
2017-09-07  162.089996  162.240005  160.360001  161.259995  160.682358  21928500
2017-09-08  160.860001  161.149994  158.529999  158.630005  158.061798  28611500
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from matplotlib import style
style.use('fivethirtyeight')

spy['Adj Close'].plot(figsize=(20,10))
ax = plt.subplot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: '${:,.0f}'.format(x))) # Y axis dollarsymbols
plt.title('AAPL Historical Price on Close')
plt.xlabel('')
plt.ylabel('Stock Price ($)');
In [3]:
value_price = spy['Adj Close'][-1] # The final value of our stock
initial_investment = 10000 # Our initial investment of $10k

num_stocks_bought = initial_investment / spy['Adj Close']
lumpsum = num_stocks_bought * value_price
lumpsum.name = 'Lump Sum'

lumpsum.plot(figsize=(20,10))
ax = plt.subplot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: '${:,.0f}'.format(x))) # Y axis dollarsymbols
plt.title('Lump sum - Value today of $10,000 invested on date')
plt.xlabel('')
plt.ylabel('Investment Value ($)');
In [6]:
def doDCA(investment, start_date):
    # Get 12 investment dates in 7 day increments starting from start date
    investment_dates_all = pd.date_range(start_date,periods=12,freq='7D')
    # Remove those dates beyond our known data range
    investment_dates = investment_dates_all[investment_dates_all < spy.index[-1]]

    # Get closest business dates with available data
    closest_investment_dates = spy.index.searchsorted(investment_dates)

    # How much to invest on each date
    portion = investment/12.0 # (Python 3 does true division by default; Python 2.7 does not)

    # Get the total of all stocks purchased for each of those dates (on the Close)
    stocks_invested = sum(portion / spy['Adj Close'][closest_investment_dates])

    # Add uninvested amount back
    uninvested_dollars = portion * sum(investment_dates_all >= spy.index[-1])

    # value of stocks today
    total_value = value_price*stocks_invested + uninvested_dollars
    return total_value

# Generate DCA series for every possible date
dca = pd.Series(spy.index.map(lambda x: doDCA(initial_investment, x)), index=spy.index, name='Dollar Cost Averaging (DCA)')
In [7]:
dca.plot(figsize=(20,10))
ax = plt.subplot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: '${:,.0f}'.format(x))) # Y axis dollarsymbols
plt.title('Dollar Cost Averaging - Value today of $10,000 invested on date')
plt.xlabel('')
plt.ylabel('Investment Value ($)');

Your First Step by Step Machine Learning Project

Step 1: Download and install Anaconda

Step 2: Launch IPython

Step 3: Copy code from below and paste to your Notebook

     # Load libraries
     import pandas
     from pandas.plotting import scatter_matrix  # pandas.tools.plotting is deprecated
     import matplotlib.pyplot as plt
     from sklearn import model_selection
     from sklearn.metrics import classification_report
     from sklearn.metrics import confusion_matrix
     from sklearn.metrics import accuracy_score
     from sklearn.linear_model import LogisticRegression
     from sklearn.tree import DecisionTreeClassifier
     from sklearn.neighbors import KNeighborsClassifier
     from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
     from sklearn.naive_bayes import GaussianNB
     from sklearn.svm import SVC
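With those libraries loaded, a minimal end-to-end sketch might look like the following. It uses scikit-learn's built-in iris dataset purely as a stand-in for whatever data your project loads, and fits just one of the classifiers imported above:

```python
from sklearn.datasets import load_iris
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small labeled dataset and hold out 20% for validation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=7)

# Fit one of the classifiers imported above and score it on the holdout set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```

The other imported classifiers (KNeighborsClassifier, DecisionTreeClassifier, GaussianNB, SVC, LinearDiscriminantAnalysis) can be swapped in on the same two `fit`/`predict` lines to compare accuracies.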

First Blog Using IPython Notebook

  1. Download and install Anaconda (this gives you IPython and a nice environment)

  2. Install Nikola

  3. Follow the tutorial

Helpful Links:

http://www.damian.oquanta.info/posts/ipython-plugin-for-nikola-updated.html

https://shankarmsy.github.io/posts/blogging-with-the-awesome-nikola-ipython-and-github.html

In [1]:
# http://nbviewer.jupyter.org/gist/theandygross/4544012
import sys
print('Python: {}'.format(sys.version))
Python: 3.6.0 |Anaconda custom (x86_64)| (default, Dec 23 2016, 13:19:00) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
In [2]:
print ('This is really exciting')
This is really exciting
In [3]:
print (12)
12
In [ ]:
print ('hello, notebook blog :)')

Blog with the Nikola, IPython and Github

https://shankarmsy.github.io/posts/blogging-with-the-awesome-nikola-ipython-and-github.html

Problem 1: conda virtual environments didn't work for me at first.

https://conda.io/docs/using/envs.html#create-an-environment

conda create -n blog python

jakehku blog $ conda create -n blog
Fetching package metadata ...........
Solving package specifications:
Package plan for installation in environment /anaconda/envs/blog:

Proceed ([y]/n)? y

#
# To activate this environment, use:
# > source activate blog
#
# To deactivate an active environment, use:
# > source deactivate
#

Deploying to GitHub

Nikola provides a separate command github_deploy to deploy your site to GitHub Pages. The command builds the site, commits the output to a gh-pages branch and pushes the output to GitHub. Nikola uses the ghp-import command for this.

In order to use this feature, you need to configure a few things first. Make sure you have nikola and git installed on your PATH.

Initialize a Nikola site, if you haven’t already.

Initialize a git repository in your Nikola source directory by running:

git init .
git remote add origin git@github.com:user/repository.git

Set up branches and remotes in conf.py:

GITHUB_DEPLOY_BRANCH is the branch where Nikola-generated HTML files will be deployed. It should be gh-pages for project pages and master for user pages (user.github.io).

GITHUB_SOURCE_BRANCH is the branch where your Nikola site source will be deployed. We recommend and default to src.

GITHUB_REMOTE_NAME is the remote to which changes are pushed.

GITHUB_COMMIT_SOURCE controls whether or not the source branch is automatically committed to and pushed. We recommend setting it to True, unless you are automating builds with Travis CI.

Create a .gitignore file. We recommend adding at least the following entries:

cache
.doit.db
__pycache__
output

If you set GITHUB_COMMIT_SOURCE to False, you must switch to your source branch and commit to it. Otherwise, this is done for you.
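The settings described above might be sketched in conf.py like this; the branch and remote names shown are the documented defaults, used here only as an illustration:

```python
# conf.py (excerpt) -- deployment settings for nikola github_deploy
GITHUB_SOURCE_BRANCH = 'src'       # branch holding the Nikola site source
GITHUB_DEPLOY_BRANCH = 'gh-pages'  # branch receiving the generated HTML
GITHUB_REMOTE_NAME = 'origin'      # remote that both branches push to
GITHUB_COMMIT_SOURCE = True        # auto-commit and push the source branch
```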

Run nikola github_deploy. This will build the site, commit the output folder to your deploy branch, and push to GitHub. Your website should be up and running within a few minutes.

If you want to use a custom domain, create your CNAME file in files/CNAME on the source branch. Nikola will copy it to the output directory. To add a custom commit message, use the -m option, followed by your message.

git pull origin src --allow-unrelated-histories