Rethinking R with Chains %>% tidyr + dplyr + magrittr + rCharts

Finance Case with French Factors

R seems to be experiencing a quiet revolution led by pipes | chains borrowed from JavaScript, F#, and Unix. dplyr and magrittr are independent projects, but they have benefited greatly from each other. Chaining results in much more readable code, and as a nice side benefit, Romain Francois' C++ magic makes dplyr extremely fast. I thought I would collect a couple of example workflows with the Fama-French factors and xts data. dplyr and magrittr are not designed to work with xts time series out of the box, so these time series require a couple of extra steps.
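
To make the readability claim concrete, here is a minimal sketch comparing a nested call with the same computation written as a chain. The returns vector is made up for illustration and has nothing to do with the factor data.

```r
library(magrittr)

# Made-up daily returns, purely for illustration
returns <- c(0.01, -0.02, 0.015, 0.003)

# Nested calls have to be read from the inside out
nested <- round(mean(abs(returns)), 4)

# The same steps as a chain read left to right:
# take absolute values, average them, round to 4 digits
chained <- returns %>% abs %>% mean %>% round(4)

identical(nested, chained)
```

Each %>% passes the result on its left as the first argument of the function on its right, so both versions run exactly the same calls.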

I will also use tidyr, which is Hadley Wickham's rethought reshape2. tidyr is designed to fit nicely into the dplyr/magrittr workflow. Its simplicity makes its power deceptive.

Best practices with chains in R are not yet settled, and magrittr is evolving rapidly, so much might change, but I think we have already moved far enough in this direction that a return to our old ways is unlikely.

Let's require all the libraries. If you do not have them, install_github from devtools will get you up to date.

    require(quantmod)
    require(PerformanceAnalytics)
    require(dplyr)
    require(tidyr)
    require(magrittr)
    #not necessary but include for examples
    require(lattice)
    require(ggplot2)

Similar to lots of posts, I will use this ugly R code to load in the data from the Kenneth French data library.

    #daily factors from Kenneth French Data Library
    #get Mkt.RF, SMB, HML, and RF
    #UMD is in a different file
    my.url <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_daily.zip"
    #file.path builds the paths portably instead of hard-coding "\\"
    my.tempfile <- file.path(tempdir(), "frenchfactors.zip")
    my.usefile <- file.path(tempdir(), "F-F_Research_Data_Factors_daily.txt")
    download.file(my.url, my.tempfile, method = "auto",
                  quiet = FALSE, mode = "wb", cacheOK = TRUE)
    unzip(my.tempfile, exdir = tempdir(), junkpaths = TRUE)
    #read space delimited text file extracted from zip
    #skip and nrows are hard-coded for the file as downloaded for this post
    french_factors <- read.table(file = my.usefile,
                                 header = TRUE, sep = "",
                                 as.is = TRUE,
                                 skip = 4, nrows = 23215)
    #get xts for analysis
    french_factors_xts <- as.xts(
      french_factors,
      order.by = as.Date(
        rownames(french_factors),
        format = "%Y%m%d"
      )
    )
    #now get the momentum factor
    my.url <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Momentum_Factor_daily.zip"
    my.usefile <- file.path(tempdir(), "F-F_Momentum_Factor_daily.txt")
    download.file(my.url, my.tempfile, method = "auto",
                  quiet = FALSE, mode = "wb", cacheOK = TRUE)
    unzip(my.tempfile, exdir = tempdir(), junkpaths = TRUE)
    #read space delimited text file extracted from zip
    french_momentum <- read.table(file = my.usefile,
                                  header = TRUE, sep = "",
                                  as.is = TRUE,
                                  skip = 13, nrows = 23114)
    #get xts for analysis
    french_momentum_xts <- as.xts(
      french_momentum,
      order.by = as.Date(
        rownames(french_momentum),
        format = "%Y%m%d"
      )
    )
    #merge UMD (momentum) with other french factors
    french_factors_xts <- na.omit( merge( french_factors_xts, french_momentum_xts ) )
    #the files quote returns in percent, so convert to decimals
    french_factors_xts <- french_factors_xts/100

I have noticed that rolling analysis with xts can sometimes be slow. as.matrix is my favorite way to speed things up, since I usually do not need the powerful indexing and subsetting features of xts. I thought the additional complexity of rolling analysis would offer a nice challenge to improve my understanding of xts + dplyr. Here is a quick test. I would love thoughts on a better approach. The comments show the comparable melt and ddply method.

    #now we should have all the french factor data that we need
    #we can start to do our exploration
    #but this time use dplyr
    system.time(
      df_dplyr <-
        #get xts as data.frame to take advantage of new features
        data.frame("date"=index(french_factors_xts),french_factors_xts) %>%
        # long form similar to melt(
        #   data.frame(
        #     date=as.Date(index(french_factors_xts)),
        #     french_factors_xts
        #   ),
        #   id.vars = "date",
        #   variable.name = "mkt_factor",
        #   value.name = "roc"
        #)
        gather(ff_factor,roc,-date) %>%
        # group it and apply a function similar to ddply(
        #   df,
        #   .(ff_factor,roc),
        #   summarise(
        #     date = french_factors_xts$date[seq(1,nrow(french_factors_xts)-199,by=1)],
        #     omega = function(x) {
        #       rollapply( as.numeric(x$roc), Omega, width = 200, by = 1)
        #     }
        #   )
        # )
        #%>% throughout, since the older dplyr %.% operator is deprecated
        group_by( ff_factor ) %>%
        do(
          data.frame(
            date = .$date[seq(1,nrow(.)-199,by=1)],
            omega = rollapply( as.numeric(.$roc) , Omega, width=200, by=1)
          )
        )
    )

    |=========                                        | 20% ~14 s remaining
    |===================                              | 40% ~10 s remaining
    |=============================                    | 60% ~6 s remaining
    |=======================================          | 80% ~3 s remaining
    Completed after 16 s
       user  system elapsed
      15.74    0.02   16.28
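
The shape of that chain is easier to see on a toy data frame. What follows is only a sketch: it assumes a 3-row window and a plain mean in place of the 200-day Omega, and toy, toy_roll, and roll_mean are names invented here.

```r
library(dplyr)
library(tidyr)
library(zoo)

# Toy wide data: one date column plus one column per (made-up) factor
toy <- data.frame(
  date = as.Date("2014-01-01") + 0:4,
  A = c(1, 2, 3, 4, 5),
  B = c(10, 20, 30, 40, 50)
)

toy_roll <- toy %>%
  # wide to long: one row per date and factor
  gather(ff_factor, roc, -date) %>%
  group_by(ff_factor) %>%
  # do() runs the block once per group, with . bound to that group's rows
  do(
    data.frame(
      # as in the post, each window is labeled with its first date
      date = .$date[seq(1, nrow(.) - 2)],
      roll_mean = rollapply(as.numeric(.$roc), width = 3, FUN = mean)
    )
  )

toy_roll
```

Each group of 5 rows yields 5 - 3 + 1 = 3 windows, which is why the date vector is trimmed by width - 1 (199 in the full example).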

This might be the longest I have gone without a plot, so let's use lattice to create a very quick and admittedly ugly line plot.

    xyplot(omega~date, groups = ff_factor, data = df_dplyr,type="l",ylim=c(-1,4))

plot of chunk oldway_plot

I am ashamed to admit how long it took me to realize that plotting could integrate nicely into chains. Below I show how we can use Gmisc htmlTable to nicely output a table with the last 5 daily returns from each of the factors.

    require(Gmisc)
    data.frame(
      "date"=format(index(french_factors_xts)),
      french_factors_xts
    ) %>%
      gather(ff_factor,roc,-date) %>%
      mutate(
        date = as.character(date),
        ff_factor = as.character(ff_factor),
        roc = paste0(format(roc*100,digits=4),"%")
      ) %>%
      group_by( ff_factor ) %>%
      top_n(n=5,date) %>%
      htmlTable %>%
      cat

         date        ff_factor  roc
     1   2014-04-24  Mkt.RF      0.080%
     2   2014-04-25  Mkt.RF     -1.040%
     3   2014-04-28  Mkt.RF      0.110%
     4   2014-04-29  Mkt.RF      0.560%
     5   2014-04-30  Mkt.RF      0.350%
     6   2014-04-24  SMB        -0.390%
     7   2014-04-25  SMB        -0.870%
     8   2014-04-28  SMB        -0.610%
     9   2014-04-29  SMB        -0.220%
    10   2014-04-30  SMB         0.230%
    11   2014-04-24  HML        -0.080%
    12   2014-04-25  HML         0.630%
    13   2014-04-28  HML        -0.430%
    14   2014-04-29  HML        -0.230%
    15   2014-04-30  HML        -0.030%
    16   2014-04-24  RF          0.000%
    17   2014-04-25  RF          0.000%
    18   2014-04-28  RF          0.000%
    19   2014-04-29  RF          0.000%
    20   2014-04-30  RF          0.000%
    21   2014-04-24  UMD        -0.540%
    22   2014-04-25  UMD        -1.240%
    23   2014-04-28  UMD        -1.150%
    24   2014-04-29  UMD         0.670%
    25   2014-04-30  UMD         0.530%

I do not think it was intentional, but ggplot2 also fits nicely and cleanly into our chains. Often, I think data cleaning and aggregation should be separated from the output, but it is nice to be able to walk from raw data to final output in one uninterrupted block of code.

    data.frame("date"=index(french_factors_xts),french_factors_xts) %>%
      gather(ff_factor,roc,-date) %>%
      ggplot(data = .,aes(x=date,y=roc,colour=ff_factor)) + geom_line()

plot of chunk ggplot_returns
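
The data = . in the ggplot call is the magrittr dot placeholder, which sends the piped value to a named argument instead of the first position. A minimal sketch of the same mechanism, using lm and the built-in mtcars data rather than the factor data:

```r
library(magrittr)

# lm() takes the formula first, so the piped data frame
# has to be routed to the data argument with the dot
fit <- mtcars %>% lm(mpg ~ wt, data = .)

# The equivalent call without a pipe
direct <- lm(mpg ~ wt, data = mtcars)

all.equal(coef(fit), coef(direct))
```

Because the dot appears as a direct argument, magrittr does not also insert the data frame as the first argument.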

The previous plot did not do any calculations, so let's add a simple cumsum to get a cumulative line chart of the returns for each factor. These calculations could be much more complex using this same technique.

    data.frame("date"=index(french_factors_xts),french_factors_xts) %>%
      gather(ff_factor,roc,-date) %>%
      group_by( ff_factor ) %>%
      mutate(cumul = cumsum(roc)) %>%
      ggplot(data = .,aes(x=date,y=cumul,colour=ff_factor)) + geom_line()

plot of chunk ggplot_cumul
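
One caveat on the calculation itself: cumsum adds the daily returns arithmetically, which is simple but not the same as a compounded return. A small sketch with made-up numbers shows the gap; the names r, arithmetic, and compounded are mine.

```r
# Three made-up period returns
r <- c(0.10, -0.10, 0.05)

# Arithmetic running sum, as used in the chain above
arithmetic <- cumsum(r)

# Geometric (compounded) cumulative return
compounded <- cumprod(1 + r) - 1

data.frame(r, arithmetic, compounded)
```

If compounding is what you want, swapping the mutate step for mutate(cumul = cumprod(1 + roc) - 1) should be the only change needed, since the rest of the chain does not care how cumul is computed.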

As the R world moves to chains and pipes, the entire vis world is simultaneously moving to interactive charts. Within R visualization, we can see this parallel shift to interactivity with rCharts, ggvis, googleVis, and animint. Since ggvis and dplyr share the same source, I am sure we will see ggvis chains soon, so here I will show rCharts in our chain.

    require(rCharts)
    data.frame(
      "date"= french_factors_xts %>% index %>% format,
      french_factors_xts,
      row.names= NULL
    ) %>%
      tbl_df %>%
      gather(ff_factor,roc,-date) %>%
      group_by( ff_factor ) %>%
      mutate(cumul = cumsum(roc)) %>%
      #demo filter to get end of month instead of daily
      filter(
        date %in% format(
          index(
            french_factors_xts[french_factors_xts %>% endpoints(on="months")]
          )
        )
      ) %>%
      dPlot(
        cumul~date
        ,groups="ff_factor"
        ,data = .
        ,type="line"
        ,xAxis = list(
          type = "addTimeAxis"
          , inputFormat = '%Y-%m-%d'
          , outputFormat = "%b %Y"
        )
        ,yAxis = list( outputFormat = ".2f")
      )
[Interactive rCharts line chart: cumul (y axis, -1.00 to 7.00) against date (x axis, one tick per year from Jan 1927 to Jan 2014)]
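
The filter step in that chain relies on endpoints from xts to find the last observation in each month. A minimal sketch on a toy series (the dates and values are made up, and toy_xts, ep, and month_end_dates are names invented here):

```r
library(xts)

# Six made-up observations spanning three months
toy_xts <- xts(
  1:6,
  order.by = as.Date(c("2014-01-30", "2014-01-31",
                       "2014-02-27", "2014-02-28",
                       "2014-03-28", "2014-03-31"))
)

# endpoints() returns the row number of the last observation
# in each period, with a leading 0
ep <- endpoints(toy_xts, on = "months")

# Drop the leading 0 explicitly, then format the dates as strings
# so they can be matched with %in% against a character date column
month_end_dates <- format(index(toy_xts[ep[ep > 0]]))
month_end_dates
```

Changing on = "months" to on = "quarters" gives the quarter-end version of the same filter.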

Refined output currently requires some additional manipulation. In the chart above, I do not like the x axis, and I want to add some code that draws a tick mark only for each decade. For this to fit into a chain, rCharts functions might need to be redesigned to return the chart instead of modifying the object in place. I will appeal to expert R gurus for the best approach to this. Here is my ugly first hack.

    #very hacky way of accomplishing
    #need to iterate to something better
    #copies the chart, changes one element, and returns the copy
    #so that the result can continue down the chain
    modifyChartList <- function( x, element, val ) {
      rTemp <- x$copy()
      rTemp[[element]] <- modifyList(rTemp[[element]], val)
      return(rTemp)
    }
    data.frame(
      #maybe chaining here makes more confusing
      "date"= french_factors_xts %>% index %>% format,
      french_factors_xts,
      row.names= NULL
    ) %>%
      tbl_df %>%
      gather(ff_factor,roc,-date) %>%
      group_by( ff_factor ) %>%
      mutate(cumul = cumsum(roc)) %>%
      #demo filter to get end of quarter instead of daily
      filter(
        date %in% format(index(french_factors_xts[french_factors_xts %>% endpoints(on="quarters")]))
      ) %>%
      dPlot(
        cumul~date
        ,groups="ff_factor"
        ,data = .
        ,type="line"
        ,xAxis = list(
          type = "addTimeAxis"
          , inputFormat = '%Y-%m-%d'
          , outputFormat = "%b %Y"
        )
        ,yAxis = list( outputFormat = ".2f")
      ) %>%
      modifyChartList(
        element = "templates",
        val = list(afterScript = '
          <script>
            {{chartId}}[0].axes[0].timePeriod = d3.time.years;
            {{chartId}}[0].axes[0].timeInterval = 10;
            {{chartId}}[0].draw();
          </script>
        '
        )
      )
[Interactive rCharts line chart: cumul (y axis, -1.00 to 7.00) against date (x axis, one tick per decade from Jan 1930 to Jan 2010)]

Fortunately, a thoughtful reader commented with a better way to add afterScript using the %T>% operator from magrittr. I have modified the code from above with what I think is a better workflow, which removes the need for our helper modifyChartList.

    data.frame(
      #maybe chaining here makes more confusing
      "date"= french_factors_xts %>% index %>% format,
      french_factors_xts,
      row.names= NULL
    ) %>%
      tbl_df %>%
      gather(ff_factor,roc,-date) %>%
      group_by( ff_factor ) %>%
      mutate(cumul = cumsum(roc)) %>%
      #demo filter to get end of quarter instead of daily
      filter(
        date %in% format(index(french_factors_xts[french_factors_xts %>% endpoints(on="quarters")]))
      ) %>%
      dPlot(
        cumul~date
        ,groups="ff_factor"
        ,data = .
        ,type="line"
        ,xAxis = list(
          type = "addTimeAxis"
          , inputFormat = '%Y-%m-%d'
          , outputFormat = "%b %Y"
        )
        ,yAxis = list( outputFormat = ".2f")
      ) %T>%
      #%T>% runs setTemplate for its side effect and passes the chart on
      .$setTemplate(afterScript = '
        <script>
          {{chartId}}[0].axes[0].timePeriod = d3.time.years;
          {{chartId}}[0].axes[0].timeInterval = 10;
          {{chartId}}[0].draw();
        </script>
      '
      )
[Interactive rCharts line chart: cumul (y axis, -1.00 to 7.00) against date (x axis, one tick per decade from Jan 1930 to Jan 2010)]
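
Why %T>% helps here is easier to see away from rCharts. The sketch below uses an environment as a hypothetical stand-in for a chart object that is modified in place; new_chart and set_template are invented for this illustration and are not rCharts functions.

```r
library(magrittr)

# Hypothetical stand-in for a reference-class chart:
# an environment is changed in place, not copied
new_chart <- function() {
  chart <- new.env()
  chart$templates <- list()
  chart
}

# A setter that changes the object and returns nothing useful
set_template <- function(chart, ...) {
  chart$templates <- modifyList(chart$templates, list(...))
  invisible(NULL)
}

# %>% passes on the setter's return value, so the chart is lost
with_pipe <- new_chart() %>% set_template(afterScript = "<script></script>")

# %T>% calls the setter for its side effect and passes on the chart
with_tee <- new_chart() %T>% set_template(afterScript = "<script></script>")

is.null(with_pipe)
with_tee$templates$afterScript
```

The tee operator returns its left-hand side, so it suits any step that is called only for a side effect, such as a setter, a print, or a plot in the middle of a chain.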

After a little bit of experimentation, chains and pipes quickly become quite natural. I will eagerly read any new code and closely follow magrittr to become even more skilled at this, so June 23, 2014 might be the last bit of code that I share with no chains.

As I hope you can tell, this post was more a function of the efforts of others than of my own.

Thanks specifically: