Introduction
Let's take a very common kind of analysis. You've probably done this yourself. Suppose you want to analyze stock performance. You might:
 Find a stock on Yahoo finance.
 Download the historical data as a CSV file.
 Import the CSV file into Excel.
 Perform mathematical analysis: Regression, descriptive statistics,or linear optimization using the Excel Solver tool.
That's fine, but this article shows you a simpler, more intuitive, and more powerful way to do the same kind of analysis using IPython and pandas.
The IPython library is one of the key tools for a data scientist who uses Python. This tool is a key differentiator from Excel, specifically because you can interactively explore data and analysis from an interactive prompt. The examples in this article mainly use IPython as the mechanism to run them.
The Python Data Analysis Library (pandas) is an open source, BSDlicensed library that provides highperformance, easytouse data structures and data analysis tools for the Python programming language.
Getting started
To get started with IPython and pandas, set up your Linux or Unix operating system such as Ubuntu or OS X.
 Install pip, a tool for installing and managing Python packages. You might have used easy_install before — pip is a replacement for easy_install. To install pip, go to the pip index page on the Python website and follow the instructions.
 Now that you have pip, install IPython using this command:
sudo pip install IPython
 Use pip to install pandas:
sudo pip install pandas
 One more tool to install — matplotlib, a plotting library for the Python program language and its NumPy numerical mathematics extension. Use this command:
sudo pip install matplotlib
You have all the pieces you need, so let's begin.
Pandas introduction
To feed your investment portfolio data into pandas, use this code:
Listing 1. Pandas portfolio ingestion
In [1]: import pandas.io.data as web In [2]: from pandas import DataFrame In [3]: data_feed = {} In [4]: symbols=['AAPL','FB', 'GOOG', 'SPLK', 'YELP', 'GG','BP','SCPJ','JNJ', 'OMG'] In [5]: for ticker in symbols: ...: data_feed[ticker] = web.get_data_yahoo(ticker, '05/21/2012', '11/1/2012') ...: In [6]: price = DataFrame({tic: data['Adj Close'] ...: for tic, data in data_feed.iteritems()}) In [7]: volume = DataFrame({tic: data['Volume'] ...: for tic, data in data_feed.iteritems()}) In [8]: returns = price.pct_change()
To determine the return percentages for the year and plot them, the return DataFrame
method can be called along with plot
.
This is done by calling sum
which sums each column in the DataFrame, and by using matplotlib, which does the heavy lifting to create the chart shown in Figure 1.
Listing 2. Returns for year
In [9]: import matplotlib.pyplot as plt In [10]: returns.sum().plot(kind='bar',title="% Return For Year") Out[10]: <matplotlib.axes.AxesSubplot at 0x10c1b0350> In [11]: plt.show()
Here's the result:
Figure 1. Returns for year
As you can see in Figure 1, Facebook had a rough IPO, and yeartodate it has lost close to 40% of its IPO value.
In comparison, Yelp — in the same industry — has gained almost 40%.
In hindsight, shorting Facebook and going long on Yelp could have led a yeartodate return that would have almost doubled the original investment.
The text output of the sum()
command shows the real raw values of the yearly returns in this code:
Listing 3. Raw sum output
In [12]: returns.sum() Out[12]: AAPL 0.077139 BP 0.155668 FB 0.376935 GG 0.285309 GOOG 0.124510 JNJ 0.140735 OMG 0.145005 SCPJ 0.189855 SPLK 0.021382 YELP 0.357202
Another way to look at the data is to create a histogram of the daily return percentage change for the year and see if this reveals any underlying insights about the data. Fortunately, this is fairly straight forward as the code example below shows:
Listing 4. Creating a histogram of daily returns
In [13]: returns.diff().hist() Out[13]: array([[Axes(0.125,0.677778;0.158163x0.222222), Axes(0.330612,0.677778;0.158163x0.222222), Axes(0.536224,0.677778;0.158163x0.222222), Axes(0.741837,0.677778;0.158163x0.222222)], [Axes(0.125,0.388889;0.158163x0.222222), Axes(0.330612,0.388889;0.158163x0.222222), Axes(0.536224,0.388889;0.158163x0.222222), Axes(0.741837,0.388889;0.158163x0.222222)], [Axes(0.125,0.1;0.158163x0.222222), Axes(0.330612,0.1;0.158163x0.222222), Axes(0.536224,0.1;0.158163x0.222222), Axes(0.741837,0.1;0.158163x0.222222)]], dtype=object) In [14]: plt.show()
The code in Listing 4 gives this chart:
Figure 2. Histogram daily returns
Another way to look at the data is to take the daily returns and do a line chart for the year. The code sample below shows how to do this:
Listing 5. Pandas portfolio correlation line chart for the year
In [15]: returns.plot(title="% Daily Change For Year") Out[15]: <matplotlib.axes.AxesSubplot at 0x10b56e850> In [16]: plt.show()
Here's the result:
Figure 3. Histogram daily returns as a line chart
One issue with this simple graph is that it is a bit tough to figure out what's going on.
A way to deal with this time series data is to use the cumsum
function, and then chart that.
Listing 6. Cumulative sum
In [17]: ts = returns.cumsum() In [18]: plt.figure(); ts.plot(); plt.legend(loc='upper left') Out[18]: <matplotlib.legend.Legend at 0x10c69cb50> In [19]: plt.show()
The result, shown in Figure 4, tells us even more information about your portfolio. By doing a time series analysis and charting the results, it's apparent that Facebook had a rougher time than originally thought. While it was down for the year by 40%, at one point it was down 60% in September. Intuitively, additional data about the movement of the stock indicates that the standard deviation is quite high for Facebook. Since standard deviation is a rough proxy for risk, this is something to watch out for while formulating this portfolio and determining weights.
Figure 4. Cumulative sum of portfolio
Determining the correlation of the percentage change between the ten stocks is as simple as calling the corr
method on the DataFrame returns.
Listing 7. Pandas portfolio correlation of percentage change
In [9]: returns.corr() Out[9]: AAPL BP FB GG GOOG JNJ OMG SCPJ SPLK YELP AAPL 1.000000 0.169053 0.094286 0.134131 0.376466 0.163904 0.411568 0.117152 0.368266 0.124856 BP 0.169053 1.000000 0.011832 0.294994 0.291391 0.437816 0.436781 0.009499 0.224151 0.084014 FB 0.094286 0.011832 1.000000 0.065156 0.081912 0.020755 0.130815 0.039980 0.038010 0.343646 GG 0.134131 0.294994 0.065156 1.000000 0.302844 0.138329 0.206255 0.066144 0.148690 0.006135 GOOG 0.376466 0.291391 0.081912 0.302844 1.000000 0.144882 0.305486 0.001538 0.226364 0.154207 JNJ 0.163904 0.437816 0.020755 0.138329 0.144882 1.000000 0.268308 0.021108 0.190023 0.009803 OMG 0.411568 0.436781 0.130815 0.206255 0.305486 0.268308 1.000000 0.117257 0.279653 0.146944 SCPJ 0.117152 0.009499 0.039980 0.066144  0.001538 0.021108 0.117257 1.000000 0.017114 0.058541 SPLK 0.368266 0.224151 0.038010 0.148690 0.226364 0.190023 0.279653 0.017114 1.000000 0.215260 YELP 0.124856 0.084014 0.343646 0.006135 0.154207 0.009803 0.146944 0.058541 0.215260 1.000000 In [58]: plt.show()
Portfolio theory
So far, the analysis has been fairly simple. Investment analysis usually involves determining an optimal portfolio that can "beat the market." The market is typically referred to as the Standard and Poor 500 index. There is no definitive proof that mutual funds can actually beat the market through skill (not luck). The odds are in your favor that you can beat most mutual funds by a passive investment in index funds since the vast majority of mutual funds cannot beat a market portfolio year after year.
Despite the fact that trained professionals with armies of Ph.D. mathematicians, rocket scientists, and billions of dollars of capitol can't beat the market, let's try to do it with pandas in your free time. The first step is to see how your portfolio stacks up against the market portfolio, the Standard and Poor 500.
Listing 8. SPY cumulative time chart
In [116]: market_data_feed = {} In [117]: market_symbols=['SPY'] In [118]: for ticker in market_symbols: .....: market_data_feed[ticker] = web.get_data_yahoo (ticker, '05/21/2012', '11/1/2012') .....: In [119]: market_price = DataFrame({tic: data['Adj Close'] .....: for tic, data in market_data_feed.iteritems()}) In [120]: In [120]: market_volume = DataFrame({tic: data['Volume'] .....: for tic, data in market_data_feed.iteritems()}) In [121]: In [121]: market_returns = market_price.pct_change() In [122]: market_returns.cumsum() In [123]: mts = market_returns.cumsum() In [124]: plt.figure(); mts.plot(); plt.legend(loc='upper left') Out[124]: <matplotlib.legend.Legend at 0x10b8f4650> In [125]: plt.show()
In this example, another DataFrame is created, for the same time period, and it acts as your "market portfolio". The chart in Figure 5 shows the return generated for SPY, which is a proxy for the Standard and Poor 500 index.
Figure 5. Return generated for SPY
With two separate timeseries charts completed, the next step in doing an analysis is to look at your portfolio against the market portfolio. Two quick and dirty ways to do this are (a) to look at the average/mean return of your portfolio versus the market portfolio, and (b) to look at the standard deviation (stdev), a very rough proxy for risk on your portfolio versus the market portfolio.
Listing 9. Beating the market
In [126]: sum_returns = returns.sum() In [127]: sum_returns.mean() Out[127]: 0.11198689337482581 In [128]: market_returns.sum().mean() Out[128]: 0.093679854637400028 In [239]: market_returns.std() Out[239]: minor SPY 0.008511 In [240]: returns.std().mean() Out[240]: 0.025706773344634132
In conclusion
In the final interactive example in Listing 9, you beat the market by getting an 11% return on your portfolio versus the market portfolio of 9%. Before you can start a hedge fund, though, you might need a good story to explain why the market portfolio had a stdev of .8% and your portfolio had a 2% stdev. The quick story is that you took on much more risk and simply got lucky. Further analysis would involve determining the alpha, beta, expected return, and doing more advanced analysis like FamaFrench and efficient frontier optimization.
In this article, Python was used to perform a quick and dirty investment portfolio analysis. Python is turning into a goto language for real world data analysis. Libraries like Pyomo, pandas, Numpy, and IPython take much of the pain out of doing advanced applied math in Python. If you want to learn more about portfolio analysis, see some reading suggestions in Resources.
Download
Description  Name  Size 

CSV file and source code  src.zip  6.30KB 
Resources
Learn
 In the book Python For Unix and Linux System Administration by Noah Gift and Jeremy Jones (O'Reily Media, 2008), chapter 2 covers IPython.
 Visit the IPython web site.
 In the book SciPy and Numpy by Eli Bressert (O'Reilly Media, 2012), learn how to use NumPy for numerical processing, including array indexing, math operations, and loading and saving data. Learn how to use SciPy in work with advanced mathematical functions such as optimization, interpolation, integration, clustering, statistics, and other tools for scientific programming.
 Statusmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
 The book Python for Data Analysis by Wes McKinney (O'Reilly Media, 2012) is a practical introduction to scientific computing in Python, tailored for dataintensive applications.
 The book Pyomo – Optimization Modeling in Python by W. E. Hart et al is highly recommended.
 You can read about investment portfolio management in Harvard Business School Case Study: Dimensional Fund Advisors by Jay O. Light (Harvard Business School, 1993).
 Also see Investments (McGrawHill/Irwin Series in Finance, Insurance and Real Estate), by A. Bodie, A. Kane, and A. Marcus (McGrawHill/Irwin, 2010).
 Statistics for Management and Economics teaches how to apply statistics to realworld business problems.
 In his book The Signal and the Noise: Why So Many Predictions Fail  but Some Don't (Penguin Press, 2012), Nate Silver examines the world of prediction, investigating how we can distinguish a true signal from a universe of noisy data.
 R in Action by Robert Kabacoff (Manning Publications, 2011) presents the R system, a powerful language for statistical computing and graphics.
 Study the book Probability Theory: The Logic of Science by E. T. Jaynes (Cambridge University Press, 2003)
 Learn more about cloud computing technologies at cloud at developerWorks.
 Access IBM SmartCloud Enterprise.
 Follow developerWorks on Twitter.
 Watch developerWorks demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
Get products and technologies
 Download pip and use it to install IPython.
 Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
Discuss
 Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developerdriven blogs, forums, groups, and wikis.
Comments
Dig deeper into Cloud computing on developerWorks

developerWorks Premium
Exclusive tools to build your next great app. Learn more.

Cloud newsletter
Crazy about Cloud? Sign up for our monthly newsletter and the latest cloud news.

Try SoftLayer Cloud
Deploy public cloud instances in as few as 5 minutes. Try the SoftLayer public cloud instance for one month.