Design Patterns for Portfolio Backtesting

jmh530 Wrote:

> The best reason I have to use multidimensional arrays when doing some of this stuff is because it simplifies the code. For instance, if I have a function that creates portfolio returns and performance statistics, it’s less code to do everything in terms of a multidimensional array than it is to call the function for every portfolio.

If you have no need for dynamism of the portfolios, then the larger dimensionality is a fine solution. Nonetheless, you can keep a list of the portfolios and iterate over that just as easily. The amount of code is the same either way.
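For what it’s worth, the two styles read about the same in Matlab. A minimal sketch (the random returns and the statistic are stand-ins, not anyone’s actual code):

```matlab
% Hypothetical example: nP portfolios over T periods of returns.
T = 250; nP = 5;
retArray = randn(T, nP) * 0.01;            % style 1: one T-by-nP array
statsA = mean(retArray) ./ std(retArray);  % one vectorized call for all portfolios

retList = num2cell(retArray, 1);           % style 2: a list (cell array) of streams
statsB = zeros(1, nP);
for k = 1:numel(retList)                   % iterating the list is just as short
    statsB(k) = mean(retList{k}) / std(retList{k});
end
```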

I don’t generally use languages like R and Matlab for these kinds of things because I find them clunky (I use R and Mathematica for short, i.e., < 100-line programs). I use Delphi, C#, and C++, and I have no confidence at all that object models translate from one to the other. One difference is that I always try not to use arrays, because an array is a big clunky thing designed to give you access to random points like stock i on day j. That’s not an important part of designing a backtesting engine, so almost everything I do is with chronological lists. That means I have array-like functions, but they involve a search (which takes no time at all since I have things in chronological order). If I were using R I would certainly use arrays. I would never use Matlab because I hate it.

I think that any software that does this kind of thing imposes some level of philosophy on how trades should be selected and how portfolios should be built. In particular, if you try to include lots of interrelationships between portfolios and individual trades, you get something circular, opaque, and of unknown complexity. I have three fundamental types of objects:

a) A trading system object. This is something that can take code (written in this sort of enhanced Pascal for historical reasons) and data streams, and outputs a series of trades on one stock, futures contract, or whatever. There is no connection between the trading system and the portfolio, and I have never used those kinds of rules (e.g., if you are down 10% off the high-water mark, you need a better reason to put on a trade than if you are at the high-water mark). I think real-world implementation of rules like that is a bad idea, so I don’t even allow them in the program. A trading system object can take any set of inputs and output a series of trades that are independent of portfolio size. Hence, the stream that comes out looks like ((1/1/2008, -100), (2/2/2008, -50), (3/3/2008, 0), (4/4/2009, 100)). A negative value means short; 100 means “a full position” (defined by the portfolio), 50 means “half a full position”, etc.

b) An equity stream object. This contains all the daily data for the security for every day the security is traded, a list of trades, and a “weight” that is filled in by the portfolio object. This object can do lots of stuff, but it knows nothing of how it got the trades or of any other securities.

c) A portfolio object. This object takes code and combines equity streams into portfolios. It is a list of dates with methods like AddEquityStream and RunIt. A RunIt command takes the equity stream data and combines it into a daily return stream by running each equity stream through the portfolio methodology on each day (portfolio methodologies can theoretically change each day but in practice don’t). In general, portfolio objects combine equity streams without knowing about their relative positions. That means that if you have a small long position in corn and a giant long position in wheat, the portfolio calculation would be the same as if you had a giant long position in corn and a small long position in wheat.

An important choice here is how portfolio calculations get combined with trades driven by individual systems. For example:

1) every time you make a trade, you can bring things to portfolio-optimal; or
2) you can bring everything to portfolio-optimal periodically; or
3) you can make bands around optimal and only trade when you are outside those bands.

3) is the most dangerous, and I’ve sworn off it even though it seems good. 1) is almost always the best performing but the hardest to implement (it’s much easier to tell a trader to exit half the position than to make up trade sheets representing 38% of the position). 2) seems naive but is operationally the simplest.

Back in the day when I was a statistician, I had a boss at RAND who wrote a paper about data management. His central point was that data management is half a practicing statistician’s job and that there were best practices that were elegant. It seemed profound to me then, although nobody had ever heard of a DBA then. I think I would go nuts using a language like R or Matlab for this.

I also think the notion of separating your data management from your portfolio development, as mentioned above, is naive. I mostly think about trading futures, and there is no such thing as a corn price stream. What exists is a set of prices that roll somehow into another set of prices. There is no clearly best roll methodology, and the market constrains what would be reasonable. However, how you roll is really important (e.g., you might want to roll by holding a calendar spread for a while), the roll can represent a trading opportunity or not, and you might want to roll according to basis gaps. Having an iconic corn stream and trading off that doesn’t work so well (I’ve certainly done it, so it’s possible).

The other issue about portfolios is that there are always missing-data and unequal-count issues. One of the ways that I think markets are inefficient is that most portfolio methodologies use correlations as measures of association for reducing risk and doing stress tests. The problem is that correlations are very unstable, change with volatility (despite that very silly CFA reading that was in the curriculum for years that was just mathematically wrong), and are only the right measure of association under multivariate normality. That means that good portfolio methodology needs to include ways of estimating copula functions with missing data. I always do that using EM, which is the world’s slowest-converging algorithm but very stable. Note that you can’t even do Markowitz-style rebalancing with missing data by ignoring it, because the pairwise-omission covariance matrix always ends up not being positive definite; it isn’t really a covariance matrix (it took me two weeks of my life once to figure that out). There are other copula problems that get even more messed up when you do things like ignore that nothing traded in France on Bastille Day while a nuclear bomb went off in Tokyo. Amazing that it took two days for that news to reach France. A serious market inefficiency, maybe? Anyway, my two cents…
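To make the division of labor concrete, here is a minimal Matlab sketch of the three objects, using hypothetical names and a struct-based stand-in rather than Joey’s actual Pascal classes. It assumes a trade stream is a list of (date, signed percent of a full position) pairs, as in the example above:

```matlab
% Sketch of the three-object design (hypothetical names throughout).
% A trade stream is a list of (date, signed % of a full position).

% a) Trading system: inputs -> trades, knows nothing about the portfolio.
dates  = datetime(2008,1,1) + caldays(0:3)';
prices = [100; 98; 97; 99];                       % daily closes for one security
trades = table(dates([1 3]), [-100; 0], ...
               'VariableNames', {'Date','PctOfFullPosition'});

% b) Equity stream: daily data + trades + a weight the portfolio fills in.
es = struct('Dates', dates, 'Prices', prices, 'Trades', trades, 'Weight', NaN);

% c) Portfolio: holds equity streams; RunIt combines them into daily returns.
portfolio = {es};                                 % AddEquityStream == append
portfolio{1}.Weight = 1.0;                        % filled in by the portfolio

% RunIt: the position each day is the most recent trade's signed size,
% scaled by the portfolio-assigned weight.
ret = diff(prices) ./ prices(1:end-1);            % simple daily returns
pos = zeros(numel(dates), 1);
for t = 1:numel(dates)
    k = find(trades.Date <= dates(t), 1, 'last');
    if ~isempty(k), pos(t) = trades.PctOfFullPosition(k) / 100; end
end
dailyRet = portfolio{1}.Weight * pos(1:end-1) .* ret;
```

RunIt in this sketch just carries the last trade’s size forward between trades and scales by the weight; the real objects presumably do much more, but the isolation between the three pieces is the point.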

@justin88 I disagree. It’s not just iterating them; it may also be about combining them together into something that is easy to iterate on. If I have 5 risk aversion coefficients and want to create a portfolio for each, a 5-element vector of risk aversions (or some matrix of risk aversions) and a multidimensional array of weights is not only simpler to iterate through than some list of variables, but it is far more flexible. Imagine if instead of one input you have two: what naming scheme will you use for the variables? It gets very tricky to deal with. If I’m testing a strategy, I may construct literally hundreds of portfolios (at each point in time in a backtest), adjusting some of the inputs. For instance, I was constructing one strategy that dynamically adjusts risk aversion. The portfolio construction itself (i.e., getting the weights) might be done based on methods you’re familiar with (though I may also incorporate parallel processing to speed it up, which is its own separate set of problems). I compile them into a multidimensional array and then calculate the portfolio returns and the average returns/information ratio over some period, and plot some 2- or 3-D graph of how the returns/information ratio changes as the inputs are adjusted (so I know the strategy isn’t sensitive to the inputs). Doing something like this is virtually impossible without some multidimensional array or something along those lines.
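A sketch of that workflow in Matlab. getWeights here is a hypothetical stand-in (a trivial inverse-variance rule) and the returns are random; the point is the shape of the weights array and the sensitivity surface:

```matlab
% Hypothetical parameter sweep: risk aversion x transaction cost.
T = 250; nAssets = 10;
riskAversions = linspace(1, 10, 10);
tCosts        = linspace(0, 0.002, 10);
rets = randn(T, nAssets) * 0.01;             % stand-in asset returns

W = NaN(T, nAssets, numel(riskAversions), numel(tCosts));
for i = 1:numel(riskAversions)
    for j = 1:numel(tCosts)
        for t = 1:T                          % weights for every period and combo
            W(t,:,i,j) = getWeights(rets(1:t,:), riskAversions(i), tCosts(j));
        end
    end
end

portRet = squeeze(sum(W .* rets, 2));        % T x nRA x nTC portfolio returns
IR = squeeze(mean(portRet) ./ std(portRet)) * sqrt(252);
surf(tCosts, riskAversions, IR);             % sensitivity of IR to the two inputs
xlabel('transaction cost'); ylabel('risk aversion'); zlabel('annualized IR');

function w = getWeights(pastRets, ra, tc)    % hypothetical stand-in rule
    v = var(pastRets, 0, 1) + 1e-6;          % trivial inverse-variance weighting
    w = 1 ./ (ra * v + tc);
    w = w / sum(w);
end
```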

@JoeyDVivre That was a very interesting post. I have often wondered about the best way to incorporate futures data in Matlab and haven’t yet taken that plunge. One method I had considered was to simply work with the spot prices and then model the futures curve off of them. Seemed like it would be a pain, but that’s the best I had come up with. I think some of the issues you were talking about, like missing data or other timing issues, are perhaps more significant for something like a very active global strategy. If I have a 1- or 3-month horizon, then I don’t feel so bad interpolating the missing data before estimating the correlations (I don’t think 1 period will make much of a difference for that horizon). Also, I try to do some dimension reduction (as I’m sure you do). If I regress Microsoft against the US index and the US tech index, then I don’t have to worry at this level about whether other markets are open, so long as the correlations I estimate for the US index vs. some world indices, or for the US tech index against the world tech index, take this into account.
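A minimal sketch of that kind of dimension reduction in Matlab, with random stand-in return series:

```matlab
% Hypothetical factor regression: reduce MSFT risk to two index exposures.
T = 260;
usMkt  = randn(T,1) * 0.01;                    % US index returns (stand-in)
usTech = 0.8*usMkt + randn(T,1) * 0.008;       % US tech index returns (stand-in)
msft   = 1.1*usMkt + 0.5*usTech + randn(T,1) * 0.01;

X = [ones(T,1), usMkt, usTech];
b = X \ msft;                                  % [alpha; beta_mkt; beta_tech]
resid   = msft - X*b;
idioVar = var(resid);                          % idiosyncratic variance

% Cross-market open/close timing now only needs handling in the factor
% covariances (US index vs. world indices), not per single stock.
```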

Joey, thanks for the design pattern. Great for expanding my thinking here. I’m having a little trouble understanding the equity stream object. I understand that the trading object figures out whether to have a position on and what size (compared to some kind of “full position”). And I understand that the portfolio object is basically a collection of equity stream objects, and figures out the allocations to each at any given time. The equity object presumably holds either levels or returns data for a single asset, and adjusts these based on inputs from the trading object. Is that right? Can you elaborate a little more on how you like to do the equity object?

jmh530 Wrote:

> @justin88 I disagree. It’s not just iterating them; it may also be about combining them together into something that is easy to iterate on. If I have 5 risk aversion coefficients and want to create a portfolio for each, a 5-element vector of risk aversions (or some matrix of risk aversions) and a multidimensional array of weights is not only simpler to iterate through than some list of variables, but it is far more flexible. Imagine if instead of one input you have two: what naming scheme will you use for the variables? It gets very tricky to deal with.

Iterating over any built-in data structure is trivial. I don’t understand what you mean about naming… you don’t have to create variables for the contents of arrays, lists, etc.

> If I’m testing a strategy, I may construct literally hundreds of portfolios (at each point in time in a backtest), adjusting some of the inputs. For instance, I was constructing one strategy that dynamically adjusts risk aversion. The portfolio construction itself (i.e., getting the weights) might be done based on methods you’re familiar with (though I may also incorporate parallel processing to speed it up, which is its own separate set of problems). I compile them into a multidimensional array and then calculate the portfolio returns and the average returns/information ratio over some period, and plot some 2- or 3-D graph of how the returns/information ratio changes as the inputs are adjusted (so I know the strategy isn’t sensitive to the inputs). Doing something like this is virtually impossible without some multidimensional array or something along those lines.

I don’t see how this is relevant – we were talking about arrays versus dynamic data structures like lists. Both are adequate solutions. There are only two reasons to use an array: (1) performance, either time- or space-wise; (2) some API forces you to do so.

OK, this is going to be a long post, but I think there may be something useful that comes out of it… I think what we’re getting at is that it would be nice to have a data structure that does something like this. If we start with:

# assetReturnsList
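A guess at what such an assetReturnsList might hold, sketched in Matlab with hypothetical field names (one entry per asset, each allowed its own dates, so unequal lengths are fine):

```matlab
% Hypothetical assetReturnsList: one entry per asset, each with its own
% (possibly unequal-length) date and return vectors.
assetReturnsList = { ...
    struct('Name', 'Corn',  'Dates', datetime(2008,1,1:5)', 'Returns', randn(5,1)*0.01), ...
    struct('Name', 'Wheat', 'Dates', datetime(2008,1,2:5)', 'Returns', randn(4,1)*0.01) };

% Iterate without naming each asset's variables individually.
for k = 1:numel(assetReturnsList)
    a = assetReturnsList{k};
    fprintf('%s: %d observations\n', a.Name, numel(a.Returns));
end
```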

Well, I’m talking about what I use in Matlab. So far as I’m aware, Matlab doesn’t have an easy-to-use system for creating something like a dynamic list. In R, I suppose you could create a list and just do wgt[[i]] = fun_portfolio(inputs[i]), so I suppose you’re right as it relates to R. As far as other languages are concerned, I’ve never used C# and my understanding of C++ is rather elementary. Imagine the structure in Matlab like this: I can input returns, covariance, risk aversion, and transaction cost into a function that translates these into portfolio weights. I create a data structure (it’s an array because that’s the easiest thing to do in Matlab) that contains some 100 combinations of risk aversion and transaction cost. I want to loop through every period of time and create portfolio weights (for some n assets) for each of those 100 combinations. If you were doing it with only 5 different risk aversions, you could probably name the weights port_wgt1 through port_wgt5 and it wouldn’t be much of a problem to just do it individually. My point to the OP was that it is better to use something, anything, rather than naming them individually. If that something is a dynamic list that is easy to iterate through, fine. So far as I’m aware, the best way to store these weights in Matlab is in a multidimensional array.

I thought Matlab had full-fledged OOP now. Someone out there must have implemented dynamic data structures like vectors.

A vector in Matlab is just a one-dimensional array.

There are many types of one-dimensional data structures, e.g. arrays, vectors, linked lists, etc. Are arrays in matlab dynamic?

What does it mean for an array to be dynamic? Just that it can be resized/reshaped after you first use and/or declare it?

I’m by no means an expert on programming, so I’m not sure of the exact terminology here. It is easy to pull out specific dimensions of the array or add on new ones in Matlab (like A = NaN(m,n); A(m+1,:) = NaN(1,n);). However, if you dynamically create an array in a for loop, it is faster to pre-allocate memory for it than to let it grow in the loop (though Matlab will still do it). I don’t recall being able to create an m×n array and then resize it to p×q when m*n != p*q (you can of course create a new p×q array and try to put the m×n one into it). In R, you can put arrays of different sizes into a list and then loop through the list, but in Matlab you would have to do it with cell arrays if they are different sizes. I’m not sure whether these would make it technically “dynamic” or not.
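A quick illustration of those three behaviors (growing an array, the reshape element-count rule, and cell arrays for ragged sizes):

```matlab
% Growing an array works, but preallocation is faster inside loops.
A = NaN(3, 4);
A(4, :) = NaN(1, 4);            % grows A to 4x4

% reshape only works when the element counts match.
B = reshape(A, 2, 8);           % OK: 4*4 == 2*8
% reshape(A, 3, 5)              % error: 16 ~= 15 elements

% Differently sized arrays go in a cell array (like an R list).
C = {NaN(3,4), NaN(10,2), NaN(1,7)};
for k = 1:numel(C)
    fprintf('element %d is %dx%d\n', k, size(C{k},1), size(C{k},2));
end
```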

It’s definitely faster if you pre-allocate memory. If I want to fill an array with stuff that I compute via loops, I’ll often do this first, assuming that for each date and asset I will be using a value:

outputArray
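A hedged completion of the cut-off snippet, assuming numDates and numAssets stand for the date and asset counts:

```matlab
% Hypothetical completion: preallocate with NaN so any (date, asset) cell
% that never gets filled is visibly missing, then fill inside the loops.
numDates = 250; numAssets = 20;
outputArray = NaN(numDates, numAssets);
for d = 1:numDates
    for a = 1:numAssets
        outputArray(d, a) = rand;   % stand-in for the real computation
    end
end
```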

I agree with jmh’s suggestion. It’s nice to use dataset arrays in Matlab. Another suggestion, by justin, is very valuable: switching to the matrix/vector paradigm. That really speeds up applications.
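A small example of the speed difference (the equal weights and random returns are stand-ins):

```matlab
% Loop versus matrix/vector paradigm for portfolio returns.
T = 100000; n = 50;
rets = randn(T, n) * 0.01;          % stand-in returns
w = ones(n, 1) / n;                 % stand-in equal weights

tic;
portLoop = zeros(T, 1);
for t = 1:T
    portLoop(t) = rets(t, :) * w;   % one dot product per period
end
toc

tic;
portVec = rets * w;                 % one matrix-vector product, same result
toc
```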

justin88 Wrote:

> While programming in R or MATLAB can be convenient for simple tasks, nontrivial software engineering (like a portfolio backtesting engine) is significantly more difficult.

I have implemented a backtesting engine in Matlab and didn’t find it too difficult.

maratikus Wrote:

> I have implemented a backtesting engine in Matlab and didn’t find it too difficult.

Oh, it’s definitely possible. Most Matlab programmers I’ve met are quants who never learned how to program. You seem not to belong to this group. :slight_smile: