hadoop, hdfs, hstore, hdfstore, hft

Using Hadoop for storing stock market tick data



I'm having fun learning about Hadoop and the various projects around it, and I currently have two different strategies I'm considering for building a system to store a large collection of market tick data. I'm just getting started with both Hadoop/HDFS and HBase, but I'm hoping someone can help me plant a system seed that I won't have to junk later using these technologies. Below is an outline of my system and requirements, with some query and data-usage use cases, and lastly my current thinking about the best approach from the little documentation I have read. It is an open-ended question, and I'll gladly upvote any answer that is insightful and accept the best one, so feel free to comment on any or all of the points below. - Duncan Krebs

System Requirements - Be able to leverage the data store for historical backtesting of systems, historical data charting, and future data mining. Once stored, the data will always be read-only; fast data access is desired but not a must-have when backtesting.

Static Schema - Very Simple, I want to capture 3 types of messages from the feed:

  1. Timestamp including date, day, time
  2. Quote including Symbol, timestamp, ask, askSize, bid, bidSize, volume... (about 40 columns of data)
  3. Trade including Symbol, timestamp, price, size, exchange... (about 20 columns of data)

Data Insert Use Cases - Either from a live market data stream or via lookup through a broker API

Data Query Use Cases - Below demonstrates how I would like to logically query my data.

  1. Get me all Quotes, Trades, and Timestamps for GOOG on 9/22/2014
  2. Get me all Trades for GOOG, FB BEFORE 9/1/2014 AND AFTER 5/1/2014
  3. Get me the number of trades for these 50 symbols for each day over the last 90 days.

The Holy Grail - Can MapReduce be used for use cases like the ones below?

  1. Generate metadata from the raw market data through distributed agents. For example, write a job that will compute the average trading volume on a 1-minute interval for all stocks and all sessions stored in the database. Create the job so that there is an agent for each stock/session, which I tell what stock and session it should compute this value for. (Is this what MapReduce can do?)

  2. On the classpath of the agents, can I add my own util code so that the use case above, for example, could publish its value to a central repo or messaging server? Can I deploy an agent as an OSGi bundle?

  3. Create different types of agents for different types of metrics and scores that are executed every morning before pre-market trading?

High Frequency Trading
I'm also interested in hearing if anyone can share some experience using Hadoop in the context of high-frequency trading systems. Just getting into this technology, my initial sense is that Hadoop can be great for storing and processing large volumes of historic tick data; if anyone is using this for real-time trading I'd be interested in learning more! - Duncan Krebs


Solution

  • Based on my understanding of your requirements, Hadoop would be a really good solution for storing your data and running your queries over it using Hive.

    Storage: You can store the data in Hadoop in a directory structure like:

    ~/stock_data/years=2014/months=201409/days=20140925/hours=01/file
    

    Inside the hours folder, the data specific to that hour of the day can reside.

    One advantage of using such structure is that you can create external tables in Hive over this data with your partitions on years, months, days and hours. Something like this:

    Create external table stock_data (schema)
    PARTITIONED BY (years bigint, months bigint, days bigint, hours int)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '~/stock_data'
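
    Note that Hive won't see the partitions until they are registered in the metastore. Because the directory names follow Hive's key=value convention, you can discover them in bulk with MSCK REPAIR TABLE, or add them one at a time (a small sketch using the example path above):

    MSCK REPAIR TABLE stock_data;

    -- or, per partition:
    ALTER TABLE stock_data ADD PARTITION (years=2014, months=201409, days=20140925, hours=1)
      LOCATION '~/stock_data/years=2014/months=201409/days=20140925/hours=01';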
    

    Coming to the queries, once you have the data stored in the format mentioned above, you can easily run simple queries.

    Get me all Quotes, Trades, and Timestamps for GOOG on 9/22/2014

    select * from stock_data where stock = 'GOOG' and days = 20140922
    

    Get me all Trades for GOOG, FB BEFORE 9/1/2014 AND AFTER 5/1/2014

    select * from stock_data where stock in ('GOOG', 'FB') and days > 20140501 and days < 20140901
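
    Your third query case is a plain aggregation over the same table. A sketch, assuming one row per trade (add a message-type filter if quotes and trades share the table):

    select stock, days, count(*) as trade_count
    from stock_data
    where stock in ('GOOG', 'FB')   -- extend the IN list to your 50 symbols
      and days >= 20140627          -- roughly 90 days back
    group by stock, days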
    

    You can run any such aggregation queries once a day and use the output to come up with the metrics before pre-market trading. Since Hive internally runs MapReduce, these queries won't be very fast.
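
    For instance, the 1-minute average-volume computation from your "Holy Grail" list can be expressed as one such scheduled query instead of a hand-written MapReduce job. A rough sketch, assuming a unix-seconds timestamp column named ts and a volume column (illustrative names, not your actual schema):

    -- floor(ts / 60) buckets a unix-seconds timestamp into minutes
    select stock, floor(ts / 60) as minute_bucket, avg(volume) as avg_volume
    from stock_data
    group by stock, floor(ts / 60)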

    In order to get faster results, you can use some of the in-memory projects like Impala or Spark. I have used Impala myself to run queries on my Hive tables and have seen a major improvement in the run time of my queries (around 40x). Also, you wouldn't need to make any changes to the structure of the data.
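
    Since Impala runs against the same Hive metastore, the queries above work unchanged; the one extra step (sketched below) is refreshing Impala's metadata after tables or partitions are created through Hive:

    -- in impala-shell
    INVALIDATE METADATA stock_data;
    select * from stock_data where stock = 'GOOG' and days = 20140922;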

    Data Insert Use Cases: You can use tools like Flume or Kafka for inserting data in real time into Hadoop (and thus into the Hive tables). Flume is linearly scalable and can also help in processing events on the fly while transferring them.
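
    As a rough sketch of what the Flume side could look like (the agent name, Avro source, and port below are assumptions, not requirements), the HDFS sink's timestamp escape sequences can write events straight into the partition layout above:

    # flume-tick.conf - one agent: source -> memory channel -> HDFS sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # assumed Avro source; swap in whatever matches your feed handler
    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 4141
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory

    # HDFS sink writing into the Hive partition layout shown earlier
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /stock_data/years=%Y/months=%Y%m/days=%Y%m%d/hours=%H
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.channel = c1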

    Overall, a combination of multiple big data technologies can provide a really decent solution to the problem you proposed, and this solution would scale to huge amounts of data.