Data science in the cloud

Investment analysis with IPython and pandas

Data Science is a growing specialization that can touch on many of the following topics: Cloud computing, big data, math, business theory, and computer science theory. A scripting language like Python is often a great choice for the typical cycle of prototyping to make sure the math of the problem works, then "productizing" the result to a distributed farm of cloud servers. This article presents some hands-on examples of investment analysis and statistical analysis using IPython and pandas.

Share:

Noah Gift (noah.gift@giftcs.com), Founder, CTO, Giftcs

Noah GiftNoah Gift is an experienced technical leader and software developer at AT&T Interactive. He solves interesting problems in a variety of languages including Python/Iron Python, Erlang, F#, C#, and JavaScript. (He's also worked at Caltech, Disney Feature Animation, Sony Imageworks, and Weta Digital.) A member of the Python Software Foundation, he is also an author of many developerWorks articles and the co-author of Python For Unix and Linux System Administration. He earned a BS in Nutritional Science from Cal Poly San Luis Obispo, an MS in Computer Information Systems from CSULA, and is an MBA Candidate at UC Davis specializing in business analytics, finance, and entrepreneurship. In his spare time, he composes for the piano and runs in marathons. Find him at his web site, on Twitter, or for consulting.



21 February 2013

Also available in Chinese Spanish

Introduction

Let's take a very common kind of analysis. You've probably done this yourself. Suppose you want to analyze stock performance. You might:

  1. Find a stock on Yahoo finance.
  2. Download the historical data as a CSV file.
  3. Import the CSV file into Excel.
  4. Perform mathematical analysis: Regression, descriptive statistics,or linear optimization using the Excel Solver tool.

That's fine, but this article shows you a simpler, more intuitive, and more powerful way to do the same kind of analysis using IPython and pandas.

The IPython library is one of the key tools for a data scientist who uses Python. This tool is a key differentiator from Excel, specifically because you can interactively explore data and analysis from an interactive prompt. The examples in this article mainly use IPython as the mechanism to run them.

The Python Data Analysis Library (pandas) is an open source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Getting started

To get started with IPython and pandas, set up your Linux or Unix operating system such as Ubuntu or OS X.

  1. Install pip, a tool for installing and managing Python packages. You might have used easy_install before — pip is a replacement for easy_install. To install pip, go to the pip index page on the Python website and follow the instructions.
  2. Now that you have pip, install IPython using this command:
    sudo pip install IPython
  3. Use pip to install pandas:
    sudo pip install pandas
  4. One more tool to install — matplotlib, a plotting library for the Python program language and its NumPy numerical mathematics extension. Use this command:
    sudo pip install matplotlib

You have all the pieces you need, so let's begin.


Pandas introduction

To feed your investment portfolio data into pandas, use this code:

Listing 1. Pandas portfolio ingestion
    In [1]: import pandas.io.data as web
    
    In [2]: from pandas import DataFrame
    
    In [3]: data_feed = {}

    In [4]: symbols=['AAPL','FB', 'GOOG', 'SPLK', 'YELP', 'GG','BP','SCPJ','JNJ', 'OMG']
    
    In [5]: for ticker in symbols:
    ...:         data_feed[ticker] = web.get_data_yahoo(ticker, '05/21/2012', '11/1/2012')
    ...:
    
    In [6]: price = DataFrame({tic: data['Adj Close']
    ...:     for tic, data in data_feed.iteritems()})
    
    In [7]: volume = DataFrame({tic: data['Volume']
    ...:     for tic, data in data_feed.iteritems()})
    
    In [8]: returns = price.pct_change()

To determine the return percentages for the year and plot them, the return DataFrame method can be called along with plot. This is done by calling sum which sums each column in the DataFrame, and by using matplotlib, which does the heavy lifting to create the chart shown in Figure 1.

Listing 2. Returns for year
    In [9]: import matplotlib.pyplot as plt
    In [10]: returns.sum().plot(kind='bar',title="% Return For Year")
    Out[10]: <matplotlib.axes.AxesSubplot at 0x10c1b0350>
    In [11]: plt.show()

Here's the result:

Figure 1. Returns for year
Returns for year

As you can see in Figure 1, Facebook had a rough IPO, and year-to-date it has lost close to 40% of its IPO value. In comparison, Yelp — in the same industry — has gained almost 40%. In hindsight, shorting Facebook and going long on Yelp could have led a year-to-date return that would have almost doubled the original investment. The text output of the sum() command shows the real raw values of the yearly returns in this code:

Listing 3. Raw sum output
    In [12]: returns.sum()
    Out[12]:
    AAPL    0.077139
    BP      0.155668
    FB     -0.376935
    GG      0.285309
    GOOG    0.124510
    JNJ     0.140735
    OMG     0.145005
    SCPJ    0.189855
    SPLK    0.021382
    YELP    0.357202

Another way to look at the data is to create a histogram of the daily return percentage change for the year and see if this reveals any underlying insights about the data. Fortunately, this is fairly straight forward as the code example below shows:

Listing 4. Creating a histogram of daily returns
    In [13]: returns.diff().hist()
    Out[13]:
    array([[Axes(0.125,0.677778;0.158163x0.222222),
    Axes(0.330612,0.677778;0.158163x0.222222),
    Axes(0.536224,0.677778;0.158163x0.222222),
    Axes(0.741837,0.677778;0.158163x0.222222)],
    [Axes(0.125,0.388889;0.158163x0.222222),
    Axes(0.330612,0.388889;0.158163x0.222222),
    Axes(0.536224,0.388889;0.158163x0.222222),
    Axes(0.741837,0.388889;0.158163x0.222222)],
    [Axes(0.125,0.1;0.158163x0.222222),
    Axes(0.330612,0.1;0.158163x0.222222),
    Axes(0.536224,0.1;0.158163x0.222222),
    Axes(0.741837,0.1;0.158163x0.222222)]], dtype=object)
    
    In [14]: plt.show()

The code in Listing 4 gives this chart:

Figure 2. Histogram daily returns
Chart showing histogram daily returns

Another way to look at the data is to take the daily returns and do a line chart for the year. The code sample below shows how to do this:

Listing 5. Pandas portfolio correlation line chart for the year
    In [15]: returns.plot(title="% Daily Change For Year")
    Out[15]: <matplotlib.axes.AxesSubplot at 0x10b56e850>
    
    In [16]: plt.show()

Here's the result:

Figure 3. Histogram daily returns as a line chart
Chart showing histogram daily returns as a line chart

One issue with this simple graph is that it is a bit tough to figure out what's going on. A way to deal with this time series data is to use the cumsum function, and then chart that.

Listing 6. Cumulative sum
In [17]: ts = returns.cumsum()

In [18]: plt.figure(); ts.plot(); plt.legend(loc='upper left')
Out[18]: <matplotlib.legend.Legend at 0x10c69cb50>
    
In [19]: plt.show()

The result, shown in Figure 4, tells us even more information about your portfolio. By doing a time series analysis and charting the results, it's apparent that Facebook had a rougher time than originally thought. While it was down for the year by 40%, at one point it was down 60% in September. Intuitively, additional data about the movement of the stock indicates that the standard deviation is quite high for Facebook. Since standard deviation is a rough proxy for risk, this is something to watch out for while formulating this portfolio and determining weights.

Figure 4. Cumulative sum of portfolio
Chart showing cumulative sum of portfolio

Determining the correlation of the percentage change between the ten stocks is as simple as calling the corr method on the DataFrame returns.

Listing 7. Pandas portfolio correlation of percentage change
    In [9]: returns.corr()
    Out[9]:
            AAPL        BP        FB        GG      GOOG       JNJ       
                      OMG      SCPJ      SPLK      YELP
    AAPL  1.000000  0.169053  0.094286  0.134131  0.376466  
                      0.163904  0.411568  0.117152  0.368266  0.124856
    BP    0.169053  1.000000  0.011832  0.294994  0.291391  
                      0.437816  0.436781 -0.009499  0.224151  0.084014
    FB    0.094286  0.011832  1.000000 -0.065156  0.081912  
                      0.020755  0.130815  0.039980  0.038010  0.343646
    GG    0.134131  0.294994 -0.065156  1.000000  0.302844  
                      0.138329  0.206255 -0.066144  0.148690 -0.006135
    GOOG  0.376466  0.291391  0.081912  0.302844  1.000000  
                      0.144882  0.305486 -0.001538  0.226364  0.154207
    JNJ   0.163904  0.437816  0.020755  0.138329  0.144882  
                      1.000000  0.268308  0.021108  0.190023 -0.009803
    OMG   0.411568  0.436781  0.130815  0.206255  
                      0.305486  0.268308  1.000000  0.117257  0.279653  0.146944
    SCPJ  0.117152 -0.009499  0.039980 -0.066144 -
                      0.001538  0.021108  0.117257  1.000000 -0.017114  0.058541
    SPLK  0.368266  0.224151  0.038010  0.148690  0.226364  
                      0.190023  0.279653 -0.017114  1.000000  0.215260
    YELP  0.124856  0.084014  0.343646 -0.006135  0.154207 
                    -0.009803  0.146944  0.058541  0.215260  1.000000
    In [58]: plt.show()

Portfolio theory

So far, the analysis has been fairly simple. Investment analysis usually involves determining an optimal portfolio that can "beat the market." The market is typically referred to as the Standard and Poor 500 index. There is no definitive proof that mutual funds can actually beat the market through skill (not luck). The odds are in your favor that you can beat most mutual funds by a passive investment in index funds since the vast majority of mutual funds cannot beat a market portfolio year after year.

Despite the fact that trained professionals with armies of Ph.D. mathematicians, rocket scientists, and billions of dollars of capitol can't beat the market, let's try to do it with pandas in your free time. The first step is to see how your portfolio stacks up against the market portfolio, the Standard and Poor 500.

Listing 8. SPY cumulative time chart
        In [116]: market_data_feed = {}
        
        In [117]: market_symbols=['SPY']
        
        In [118]: for ticker in market_symbols:
        .....:         market_data_feed[ticker] = web.get_data_yahoo
                      (ticker, '05/21/2012', '11/1/2012')
        .....:
        
        In [119]: market_price = DataFrame({tic: data['Adj Close']
        .....:     for tic, data in market_data_feed.iteritems()})
        
        In [120]:
        
        In [120]: market_volume = DataFrame({tic: data['Volume']
        .....:     for tic, data in market_data_feed.iteritems()})
        
        In [121]: 
        
        In [121]: market_returns = market_price.pct_change()
        
        In [122]: market_returns.cumsum()
        In [123]: mts = market_returns.cumsum()
        
        In [124]: plt.figure(); mts.plot(); plt.legend(loc='upper left')
        Out[124]: <matplotlib.legend.Legend at 0x10b8f4650>
        
        In [125]: plt.show()

In this example, another DataFrame is created, for the same time period, and it acts as your "market portfolio". The chart in Figure 5 shows the return generated for SPY, which is a proxy for the Standard and Poor 500 index.

Figure 5. Return generated for SPY
Chart of return generated for SPY

With two separate time-series charts completed, the next step in doing an analysis is to look at your portfolio against the market portfolio. Two quick and dirty ways to do this are (a) to look at the average/mean return of your portfolio versus the market portfolio, and (b) to look at the standard deviation (stdev), a very rough proxy for risk on your portfolio versus the market portfolio.

Listing 9. Beating the market
In [126]: sum_returns = returns.sum()
In [127]: sum_returns.mean()
Out[127]: 0.11198689337482581
In [128]: market_returns.sum().mean()
Out[128]: 0.093679854637400028
In [239]: market_returns.std()
Out[239]:
minor
SPY      0.008511        
In [240]: returns.std().mean()
Out[240]: 0.025706773344634132

In conclusion

In the final interactive example in Listing 9, you beat the market by getting an 11% return on your portfolio versus the market portfolio of 9%. Before you can start a hedge fund, though, you might need a good story to explain why the market portfolio had a stdev of .8% and your portfolio had a 2% stdev. The quick story is that you took on much more risk and simply got lucky. Further analysis would involve determining the alpha, beta, expected return, and doing more advanced analysis like Fama-French and efficient frontier optimization.

In this article, Python was used to perform a quick and dirty investment portfolio analysis. Python is turning into a go-to language for real world data analysis. Libraries like Pyomo, pandas, Numpy, and IPython take much of the pain out of doing advanced applied math in Python. If you want to learn more about portfolio analysis, see some reading suggestions in Resources.


Download

DescriptionNameSize
CSV file and source codesrc.zip6.30KB

Resources

Learn

Get products and technologies

  • Download pip and use it to install IPython.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

  • Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Cloud computing on developerWorks


  • Bluemix Developers Community

    Get samples, articles, product docs, and community resources to help build, deploy, and manage your cloud apps.

  • developerWorks Labs

    Experiment with new directions in software development.

  • DevOps Services

    Software development in the cloud. Register today to create a project.

  • Try SoftLayer Cloud

    Deploy public cloud instances in as few as 5 minutes. Try the SoftLayer public cloud instance for one month.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Cloud computing, Open source
ArticleID=858854
ArticleTitle=Data science in the cloud
publish-date=02212013