LobsteR - NASDAQ under a "tidy" Microscope

During my PhD studies, I have been working with high-frequency trading data provided by Lobster a lot for some of my research projects.
In this short series of posts, I want share some of my code and routines to efficiently handly the extremely large amounts of data that go through NASDAQs servers on a daily basis. In fact, if you look at the figure below, there is plenty to explore: during less than 2 minutes on March 17th, 2020, thousands of trades have been executed for SPY, a large ETF. The red line shows the traded prices during that period and the blue shaded areas show the dynamics of the orderbook. The darker the areas, the more liquidity (measured as size of the order book levels).

First, I provide some snippets to read-in Lobster files and to compute some potentially interesting statistics. In a second post, I illustrate long-run characteristics of the orderbook dynamics and I’ll finally focus some really recent events: the days since the outbreak of COVID19 have been extremely bumpy for SPY, the largest ETF in the world and it is amazing to see, how liquidity supply changed during these rough days.

Handling Lobster Data

Lobster is an online limit order book data tool to provide easy-to-use, high-quality limit order book data for the entire universe of NASDAQ traded stocks. I requested some of the data based on their online interface and stored it before running the code below. The actual data which I will use for the next post is much larger. I downloaded all trading messages for ticker SPY (order submissions, cancellations, trades, …) that went through NASDAQ since July, 27th 2007 until March, 25th, 2020. The files contain the entire orderbooks until level 10.

First steps

I work in R with message level data from Lobster in a tidy and (hopefully) efficient way.

library(tidyverse)
library(lubridate)

As an example, I illustrate the computations for a tiny glimpse of March 17th, 2020. Lobster files always come with the same naming convention ticker_date_34200000_57600000_filetype_level.csv, whereas filetype either denotes message or the corresponding orderbook snapshots.

asset <- "SPY"
date <- "2020-03-17"
level <- 10
messages_filename <- paste0(asset,"_",date,"_34200000_57600000_message_", level,".csv")
orderbook_filename <- paste0(asset, "_",date,"_34200000_57600000_orderbook_", level,".csv")

Let’s have a look at the raw message feed first.

messages_raw <- read_csv(messages_filename, 
                col_names = c("ts", "type", "order_id", "m_size", "m_price", 
                              "direction", "null"),
                col_types = cols(ts = col_double(), 
                                 type = col_integer(),
                                 order_id = col_integer(),
                                 m_size = col_double(),
                                 m_price = col_double(),
                                 direction = col_integer(),
                                 null = col_skip())) %>% 
  mutate(ts = as.POSIXct(ts, origin=date, tz="GMT"), 
         m_price = m_price / 10000)

messages_raw
## # A tibble: 20,000 x 6
##    ts                   type order_id m_size m_price direction
##    <dttm>              <int>    <int>  <dbl>   <dbl>     <int>
##  1 2020-03-17 09:30:00     4 24654260    230    245          1
##  2 2020-03-17 09:30:00     3 24683304    500    245.        -1
##  3 2020-03-17 09:30:00     3 24690848    500    245.         1
##  4 2020-03-17 09:30:00     1 24699256    500    245.        -1
##  5 2020-03-17 09:30:00     3 24690812    500    245.        -1
##  6 2020-03-17 09:30:00     3 24699256    500    245.        -1
##  7 2020-03-17 09:30:00     1 24699992    500    245.         1
##  8 2020-03-17 09:30:00     1 24700384    500    245.         1
##  9 2020-03-17 09:30:00     3 24700384    500    245.         1
## 10 2020-03-17 09:30:00     1 24700516    500    245.         1
## # ... with 19,990 more rows

By default, ts denotes the time in seconds since midnight (decimals are precise until nanosecond level) and price always comes in 10.000 USD. type denotes the message type: 4, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail here.

Next, the corresponding orderbook snapshots contain price and quoted size for each of the 10 levels.

orderbook_raw <- read_csv(orderbook_filename,
    col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level),
                      rep(1:level, each=4), sep="_"),
    cols(.default = col_double())) %>% 
  mutate_at(vars(contains("price")), ~./10000)

Putting the files together

Each message is associated with the corresponding orderbook snapshot at that point in time. After merging message and orderbook files, the entire data thus looks as follows

orderbook <- bind_cols(messages_raw, orderbook_raw) 
ts type order_id m_size m_price ask_price_1 ask_size_1 bid_price_1 bid_size_1
2020-03-17 09:30:00 4 24654260 230 245.00 245.10 500 244.88 1000
2020-03-17 09:30:00 3 24683304 500 245.14 245.10 500 244.88 1000
2020-03-17 09:30:00 3 24690848 500 244.88 245.10 500 244.88 500
2020-03-17 09:30:00 1 24699256 500 245.03 245.03 500 244.88 500
2020-03-17 09:30:00 3 24690812 500 245.10 245.03 500 244.88 500
2020-03-17 09:30:00 3 24699256 500 245.03 245.11 500 244.88 500

Compute summary statistics

Next, I compute summary statistics on 20 second levels. In particular I am interested in quoted prices, spreads, and depth (the amount of tradeable units in the orderbook):

  • Midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best bid and best ask)
  • Spread \(S_t= (a_t - b_t)\) (values below are computed in basis points relative to the concurrent midquote)
  • Volume is the aggretate sum of traded units of the stock. I do differentiate between hidden (type==5) and visible volume.
orderbook <- orderbook %>% mutate(midquote = ask_price_1/2 + bid_price_1/2, 
                     spread = (ask_price_1 - bid_price_1)/midquote * 10000,
                     volume = if_else(type ==4|type ==5, m_size, 0),
                     hidden_volume = if_else(type ==5, m_size, 0))

As a last step, depth of the orderbook denotes the number of assets that can be traded without moving the quoted price more than a given range (measured in basis points) from the concurrent midquote. The function below takes care of the slightly involved computations.

compute_depth <- function(df, side = "bid", bp = 0){
  if(side =="bid"){
    value_bid <- (1-bp/10000)*df %>% select("bid_price_1") 
    index_bid <- df %>% select(contains("bid_price")) %>% 
      mutate_all(function(x) {x >= value_bid})
    sum_vector <- (df %>% select(contains("bid_size"))*index_bid) %>% rowSums()
  }else{
    value_ask <- (1+bp/10000)*df %>% select("ask_price_1")
    index_ask <- df %>% select(contains("ask_price")) %>% 
      mutate_all(function(x) {x <= value_ask})
    sum_vector <- (df %>% select(contains("ask_size"))*index_ask) %>% rowSums()
    
  }
  return(sum_vector)
}

orderbook <- orderbook %>% mutate(depth_bid = compute_depth(orderbook),
                                  depth_ask = compute_depth(orderbook, side="ask"),
                                  depth_bid_5 = compute_depth(orderbook, bp = 5),
                                  depth_ask_5 = compute_depth(orderbook, bp = 5, side="ask"))

Almost there! The snippet below splits the data into 20 second intervals and computes the averages of the computed summary statistics.

orderbook_dense <- orderbook %>%
  mutate(ts_minute = floor_date(ts, "20 seconds")) %>% 
  select(midquote:ts_minute) %>% 
  group_by(ts_minute) %>% 
  mutate(messages = n(),
         volume = sum(volume),
         hidden_volume = sum(hidden_volume)) %>%
  summarise_all(mean)

Here we go: during the first 100 seconds on March 17th, 20.000 messages related to the orderbook of SPY have been processed by NASDAQ. The quoted spread on average was around 3bp. On average, roughly 90.000 contracts have been traded during each 20 second slot - in other words, assets worth roughly 90 million USD have been exchanged. Quoted liquidity at the best bid and best ask seems rather small relative to the tremendous amounts of trading activity during this (very short) period of time.

ts_minute midquote spread volume hidden_volume depth_bid depth_ask depth_bid_5 depth_ask_5 messages
2020-03-17 09:30:00 245.0332 4.010257 89606 19923 353.7358 354.1362 1854.152 2516.916 5890
2020-03-17 09:30:20 245.2229 3.142070 54733 23716 190.3232 238.8164 2099.857 2041.646 3165
2020-03-17 09:30:40 245.5052 2.177630 53273 18188 121.9574 182.5553 2113.945 2282.149 4246
2020-03-17 09:31:00 245.2010 1.488751 146974 86780 297.4000 254.3316 1985.406 2416.603 4210
2020-03-17 09:31:20 244.6590 1.514445 26286 6655 122.6870 115.6107 2174.080 2325.517 2489

Finally, some visualisation of the data at hand: The code below creates the figure at the beginning of the post and shows the dynamics of the traded prices (red line) and the quoted prices at the higher levels of the orderbook.

orderbook_trades <- orderbook %>% 
  filter(type==4|type==5) %>% 
  select(ts, m_price)

orderbook_quotes <- orderbook %>% 
  mutate(id = row_number()) %>%
  select(ts, id, matches("bid|ask")) %>% 
  gather(level, price, -ts, -id) %>%
  separate(level, into=c("side","variable","level"), sep="_") %>%
  mutate(level = as.numeric(level))  %>% 
  spread(variable, price)

p1 <- ggplot() + 
  theme_bw() +
  geom_point(data = orderbook_quotes, aes(x=ts, y=price, color=level, size = size/max(size)), alpha = 0.1)+
  geom_line(data = orderbook_trades, aes(x=ts, y=m_price), color='red') + 
  labs(title="SPY: Orderbook Dynamics",
       y="Price",
       x="") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position ="none") +
  scale_y_continuous()