KDB+/q: How to implement an aggregation table for features?

I am trying to implement an aggregation table that collates data from multiple disparate tables into one for feature engineering, preprocessing and normalization. I am facing numerous problems, the first of which is that I have to construct the schema for this aggregation table without hard-coding it, so that I retain enough flexibility to add additional data feeds. The feeds are structured roughly as follows:

trades
  - exch1
    - sym1
    - sym2
  - exch2
    - sym1
    - sym2
book
  - exch1
    - sym1
    - sym2
  - exch2
    - sym1
    - sym2
sentiment
   - sym1
   - sym2

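For example, something along the lines of the sketch below is what I have in mind (the feed names, columns and type characters are placeholders, not my actual feeds):

/ per-feed column schemas kept as data rather than hard-coded table definitions
schemas:`trades`book`sentiment!(
    `time`sym`price`size!"psfj";   / p=timestamp, s=symbol, f=float, j=long
    `time`sym`bid`ask!"psff";
    `time`sym`score!"psf");

/ build an empty table from a colname!typechar dictionary
mkTable:{flip key[x]!value[x]$\:()};

/ aggregation table built from the union of all configured columns
agg:mkTable (,/) value schemas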

The problem arises, as I have noted in previous questions, when I want to insert new aggregations that may or may not have different schemas into the aggregation table (after aggregation) within the kdb-tick architecture.

I have noted the uj operator, which seemed adequate given that the output rate will only be roughly 0.5-1 Hz. However, I have been told that this can be seen as an anti-pattern because it may cause problems with persistence, is not an efficient operation, etc.
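
For instance (a toy example, column names invented), uj does null-fill the missing columns, which is why it looked attractive at that rate:

/ two aggregation batches with different columns
a:([] time:2#.z.p; sym:`s1`s2; f1:1.1 2.2);
b:([] time:1#.z.p; sym:enlist `s1; f2:enlist 3.3);
a uj b    / f2 is null for rows from a, f1 is null for the row from b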

I have thought of checking the schema before each insert/upsert (if the schema is different, update the schema, then insert). However, this might also be inefficient.
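
Something like the following is what I had in mind (a rough sketch; agg and newBatch are placeholders): upsert directly when the columns already match, and only fall back to uj when a new column appears.

/ t is the table name (symbol), b the incoming batch
upsertFlex:{[t;b] $[cols[b]~cols t; t upsert b; t set value[t] uj b]};
upsertFlex[`agg; newBatch]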

I have taken note of the answers to previous questions; however, all seem to have drawbacks that may outweigh their benefits.

The nature of the aggregation means that I only need a table of roughly 1,000 rows on the RTE subscriber/worker to run aggregations effectively; older records will be purged to disk. However, the number of columns could change intermittently (new feeds added etc.), not necessarily intra-day.

The nature of the data also means that aggregations need to be run continuously, i.e. cutting the data into days would be ineffective.

I have also thought of maintaining a separate table for each new feature; however, the number of tables would also cause inefficiency.

Issues also arise when one tries to send the old/purged aggregated rows to a worker that then persists those aggregations periodically: how does one modify the kdb-tick publisher-subscriber architecture to support .u.upd[] when the columns of the aggregated data might change? The problem is not the kdb-tick architecture itself, but how one might conduct aggregations on the data therein whilst maintaining backwards schema compatibility and efficiency.

I have even contemplated creating my own domain-specific database in Rust, partitioning the data into sharded flat files, but have opted to stick with kdb+/q on account of the superior query operations I would be able to conduct/create.

I think running aggregations like this online/in real time would be an important feature for the use of ML with kdb+; however, I have not been able to find any documentation alluding to this.

My questions can therefore be summarized as follows: What is the canonical implementation of this? How does one effectively aggregate data from multiple sources, as depicted above, in kdb+? Your advice is truly appreciated.

Swiss asked 5/9, 2019 at 23:24

Your tickerplant should have a fixed/set schema for each feed/table. You should use the vanilla/stock tick.q code for this. If a new feed is added then you configure a new table in the tickerplant.
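
For example, the schema file loaded by the tickerplant (sym.q in the stock setup) would just gain one definition per feed; the columns below are only illustrative:

/ sym.q-style schema file: one fixed-schema table per feed
trade:([] time:`timespan$(); sym:`symbol$(); price:`float$(); size:`long$());
book:([] time:`timespan$(); sym:`symbol$(); bid:`float$(); ask:`float$());
sentiment:([] time:`timespan$(); sym:`symbol$(); score:`float$())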

Your subscriber/aggregator should subscribe to all raw tables from the tickerplant and it should do the clever work of figuring out what to do with the incoming records based on the table name.
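
For example (a sketch; featTrade, featBook and featSentiment are placeholders for your own feature logic), the subscriber's upd can dispatch on the table name:

/ toy per-feed feature functions: each returns rows for the features table
featTrade:{select source:`trade, time:last time, feature:`f1, output:avg price from x};
featBook:{select source:`book, time:last time, feature:`f3, output:avg 0.5*bid+ask from x};
featSentiment:{select source:`sentiment, time:last time, feature:`f2, output:avg score from x};

handlers:`trade`book`sentiment!(featTrade;featBook;featSentiment);

/ upd as called by the tickerplant: t is the table name, x the batch
upd:{[t;x] if[t in key handlers; `features upsert handlers[t] x]; }   / features is the fixed-schema table shown below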

You shouldn't have to merge different tables/schemas into a single table. Your subscriber/aggregator (or perhaps your model process) should take incoming records and produce a feature. Then you can insert this feature into a table with a fixed schema:

source    time                          feature output
---------------------------------------------------------
trade     2019.09.06D08:47:56.525854000 f1      0.4707883
trade     2019.09.06D08:47:56.525855000 f10     0.6346716
book      2019.09.06D08:47:56.525856000 f3      0.9672398
trade     2019.09.06D08:47:56.525857000 f5      0.2306385
sentiment 2019.09.06D08:47:56.525858000 f2      0.949975
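
In q that fixed schema might be declared along these lines (a sketch):

/ fixed-schema feature table; grows regardless of which feeds exist upstream
features:([] source:`symbol$(); time:`timestamp$(); feature:`symbol$(); output:`float$());

/ adding a single feature observation
`features insert (`trade; .z.p; `f1; 0.4707883)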

Then you continually grow this table. If you need to pivot the table such that the features become columns then you can do that after-the-fact.
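
The standard pivot idiom works on that layout, e.g. (assuming one value per (time;feature) pair):

/ features become columns, keyed by time
P:asc exec distinct feature from features;
exec P#(feature!output) by time:time from features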

It's going to be hard for anyone here to fully understand your system and what you're trying to do. Stack is more a place to ask small, focused questions than large architectural ones, because the latter require so much additional information.

Smarten answered 6/9, 2019 at 7:49 Comment(1)
OK thanks Terry, I tried to break the question down into smaller chunks as per Stack's format, though I felt that my questions lacked context. Your response has sparked an idea, thanks. - Swiss
