Decompress and read Dukascopy .bi5 tick files
To cut a long story short: I need to open a .bi5 file and read its contents. The problem: I have tens of thousands of .bi5 files containing time-series data that I need to decompress and process (read, then dump into pandas).

I ended up installing Python 3 (I normally use 2.7) specifically for the lzma library, after running into compilation nightmares with the lzma back-ports for Python 2.7. So I conceded and went with Python 3, but with no success. The problems are too numerous to list here, and no one reads long questions!

I have included one of the .bi5 files, if someone could manage to get it into a Pandas Dataframe and show me how they did it, that would be ideal.

P.S. The file is only a few KB; it will download in a second. Thanks very much in advance.

(The file) http://www.filedropper.com/13hticks

Rondure answered 16/12, 2016 at 1:56 Comment(3)
Do you know the format of the original data? Is it for example: (int, int, int, float, float) in each line?Inchoation
32-bit integer: milliseconds since epoch, 32-bit float: Ask price, 32-bit float: Bid price, 32-bit float: Ask volume, 32-bit float: Bid volume. There is a C++ library out there specifically for this task, but I need to work from Python. You can look at it here github.com/ninety47/dukascopyRondure
Python 2.7 that is.Rondure

The code below should do the trick. First it opens the file and decompresses it with lzma, then uses struct to unpack the binary data.

import lzma
import struct
import pandas as pd


def bi5_to_df(filename, fmt):
    chunk_size = struct.calcsize(fmt)
    data = []
    with lzma.open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk:
                data.append(struct.unpack(fmt, chunk))
            else:
                break
    df = pd.DataFrame(data)
    return df

The most important thing is to know the right format. I googled around, tried to guess, and '>3i2f' (or '>3I2f') works quite well. (That's big-endian: 3 ints, 2 floats. The format you suggest, 'i4f', doesn't produce sensible floats, regardless of endianness.) For struct and format syntax see the docs.

df = bi5_to_df('13h_ticks.bi5', '>3i2f')
df.head()
Out[177]: 
      0       1       2     3     4
0   210  110218  110216  1.87  1.12
1   362  110219  110216  1.00  5.85
2   875  110220  110217  1.00  1.12
3  1408  110220  110218  1.50  1.00
4  1884  110221  110219  3.94  1.00

Update

To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy, I compiled and run test_read_bi5 from there. The first lines of the output are:

time, bid, bid_vol, ask, ask_vol
2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5

And bi5_to_df on the same input file gives:

bi5_to_df('01h_ticks.bi5', '>3I2f').head()
Out[295]: 
      0       1       2     3    4
0  3581  131966  131945  1.50  1.5
1  5142  131964  131943  1.50  1.5
2  5202  131964  131943  2.25  1.5
3  5321  131964  131944  1.50  1.5
4  5441  131964  131944  1.50  1.5

So everything seems to be fine (ninety47's code reorders columns).

Also, it's probably more accurate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
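As a follow-up, here is a sketch of turning the raw columns into real timestamps and prices. The hour-start timestamp (taken here from the dated folder/file name, since the file is 13h_ticks.bi5) and the price scale factor (1e5 for EURUSD-style pairs) are assumptions; neither is stored in the file:

```python
import pandas as pd

# Raw rows as produced by bi5_to_df: ms offset, ask, bid, ask volume, bid volume
df = pd.DataFrame(
    [(210, 110218, 110216, 1.87, 1.12),
     (362, 110219, 110216, 1.00, 5.85)],
    columns=['ms', 'ask', 'bid', 'ask_vol', 'bid_vol'],
)

hour_start = pd.Timestamp('2016-12-16 13:00')  # assumed: from the folder/file name
point = 1e5                                    # assumed: EURUSD price scale

# Millisecond offsets -> absolute timestamps; scaled ints -> decimal prices
df['time'] = hour_start + pd.to_timedelta(df['ms'], unit='ms')
df[['ask', 'bid']] = df[['ask', 'bid']] / point
```

As noted in the comments, the ms column only covers one hour, so the hour start has to come from the directory structure, not from the file itself.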

Inchoation answered 18/12, 2016 at 4:31 Comment(11)
It looks plausible ptrj. The 1st column is supposed to be a timestamp, but if you are correct, and that is a true representation of the data I will have to take the initial timestamp from the dated folder structure and add to it (long story). I will check it out tomorrow just to be sure, but it looks like you banked the bounty. Speak soon.Rondure
The first column is "milliseconds since epoch". If you run pd.TimedeltaIndex(df[0], 'ms'), you'll see that it covers one hour. To get the timestamps, do for example ts + pd.TimedeltaIndex(df[0], 'ms'), where ts is your starting timestamp.Inchoation
the first two columns are supposed to be floats though. It's the price of a currency (EURUSD). Would there be any chance that it has been compressed this way, to save space perhaps?Rondure
When I say the first two columns, I mean the first two columns after the milliseconds, so that would be the second and third to be clear. Or 1 and 2 this case.Rondure
I am going with the answer regardless, it's just a case of munging from here. Thanks for your time.Rondure
@Rondure These two columns indeed look like the EURUSD prices * 10000. And if I'm not mistaken it's always df[1] > df[2]. I doubt it's a matter of space - maybe rather the resolution of prices matters.Inchoation
the convention is usually to put the bid price before the ask price which would normally result in df[1] < df[2], where the difference between is the spread, but that's not even a consideration at this point, I suppose it's down to broker preference, or whoever captures their data for them in this case. Anyway thanks for your efforts, much appreciated.Rondure
@Rondure Glad I could help. See also the update - comparison with ninety47's C++ code.Inchoation
Thank you @ptrj. One thing though: how do I transform columns 1 and 2 so that they show the decimals as well? Here is an example: from your function I get the number 7106260, but the actual price is 7106,26Oriane
@Oriane Not sure what you mean. You can simply divide the numbers in both columns, for example df.iloc[:, [1, 2]] = df.iloc[:, [1, 2]] / 100. For this currency pair the factor is 100. It can be different for another pair - it's not encoded in the input file - you have to know it.Inchoation
thank you @ptrj; i just realized that my question was a bit stupid. Of course, i just had to divide it :)Oriane
import requests
import struct
from lzma import LZMADecompressor, FORMAT_AUTO

# download the compressed EURUSD 2020/06/15/10h_ticks.bi5 file
res = requests.get("https://www.dukascopy.com/datafeed/EURUSD/2020/06/15/10h_ticks.bi5", stream=True)
print(res.headers.get('content-type'))

rawdata = res.content

decomp = LZMADecompressor(FORMAT_AUTO, None, None)
decompresseddata = decomp.decompress(rawdata)

firstrow = struct.unpack('!IIIff', decompresseddata[0: 20])
print("firstrow:", firstrow)
# firstrow: (436, 114271, 114268, 0.9399999976158142, 0.75)
# time = 10h on the URL date + 436 ms; the URL month is zero-based (06 = July), so the actual date is 2020/07/15

secondrow = struct.unpack('!IIIff', decompresseddata[20: 40])
print("secondrow:", secondrow)
# secondrow: (537, 114271, 114267, 4.309999942779541, 2.25)

# time = 10h on the URL date (2020/07/15, see above) + 537 ms
# ask = 114271 / 100000 = 1.14271
# bid = 114267 / 100000 = 1.14267
# askvolume = 4.31
# bidvolume = 2.25

# note that month 00 -> January
# "https://www.dukascopy.com/datafeed/EURUSD/2020/00/15/10h_ticks.bi5" for january
# "https://www.dukascopy.com/datafeed/EURUSD/2020/01/15/10h_ticks.bi5" for february

# iterate over all 20-byte tick records
print(len(decompresseddata), int(len(decompresseddata) / 20))
for i in range(0, int(len(decompresseddata) / 20)):
    print(struct.unpack('!IIIff', decompresseddata[i * 20: (i + 1) * 20]))
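Building on the loop above, the unpacked rows can be loaded into pandas with real timestamps and decimal prices. A sketch, where the hour start (remembering the zero-based month, so 06 in the URL is July) and the 1e5 price scale are assumptions, not stored in the file:

```python
import struct
import pandas as pd

def ticks_to_df(decompressed, hour_start, point=1e5):
    """Unpack 20-byte tick records and attach timestamps and scaled prices."""
    rows = [struct.unpack('!IIIff', decompressed[i:i + 20])
            for i in range(0, len(decompressed), 20)]
    df = pd.DataFrame(rows, columns=['ms', 'ask', 'bid', 'ask_vol', 'bid_vol'])
    df['time'] = pd.Timestamp(hour_start) + pd.to_timedelta(df['ms'], unit='ms')
    df[['ask', 'bid']] = df[['ask', 'bid']] / point
    return df

# The two records printed above, re-packed for illustration:
raw = (struct.pack('!IIIff', 436, 114271, 114268, 0.94, 0.75)
       + struct.pack('!IIIff', 537, 114271, 114267, 4.31, 2.25))
df = ticks_to_df(raw, '2020-07-15 10:00')
```

In practice you would pass `decompresseddata` from the snippet above instead of the re-packed `raw`.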
Sensual answered 20/7, 2020 at 1:9 Comment(1)
Thanks for that, much appreciated, do you know how often they update their feed by any chance?Rondure

I know I'm reopening an old thread, but if someone needs to know the structure of a .bi5 file, here it is. I hope somebody finds it helpful even after all these years.

The bi5 structure. One tick record in a bi5 file is 5 × 4 = 20 bytes. These bytes carry the timestamp, the bid and ask quotes, and the tick volume for both sides. The structure of a bi5 record is the following:

1st 4 bytes → The time part of the timestamp

2nd 4 bytes → Ask

3rd 4 bytes → Bid

4th 4 bytes → Ask volume

5th 4 bytes → Bid volume

(Some sources list Bid before Ask, but the data samples in the answers above, cross-checked against ninety47's C++ reader, show Ask first.)

The timestamp is not a date and time; it's a count of milliseconds from the starting hour of the file. The bytes must also be decompressed first, and the values are stored big-endian, so they must be byte-swapped on little-endian machines during translation.
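That layout can be read with struct. A minimal sketch; the '>IIIff' format string and the field names are mine, with the ask/bid ordering matching the data samples earlier in the thread:

```python
import struct
from collections import namedtuple

# One record: ms offset, ask, bid, ask volume, bid volume (big-endian, 20 bytes)
Tick = namedtuple('Tick', 'ms ask bid ask_vol bid_vol')
TICK_FMT = '>IIIff'

def read_tick(record: bytes) -> Tick:
    # struct handles the big-endian byte order, so no manual byte-swapping is needed
    assert len(record) == struct.calcsize(TICK_FMT)  # 20 bytes
    return Tick._make(struct.unpack(TICK_FMT, record))
```

The named fields make the downstream code self-documenting compared to bare tuples.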

Embitter answered 22/7, 2023 at 19:36 Comment(0)

In case someone uses the endpoint https://datafeed.dukascopy.com/datafeed/EURUSD/2022/11/06/BID_candles_min_1.bi5

the format is '>IIIIIf' (big-endian, 5 × unsigned int, 1 × float) and the columns are Seconds, Open, High, Low, Close, Volume.
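A sketch of parsing such a candle file once it has been lzma-decompressed. The point scale (e.g. 1e5 for EURUSD) is an assumption, consistent with the tick answers above but not stored in the file:

```python
import struct

# Candle record: seconds offset, then open/high/low/close as scaled ints, volume as float
CANDLE_FMT = '>IIIIIf'

def parse_candles(decompressed: bytes, point=1e5):
    """Yield (seconds, open, high, low, close, volume) tuples with decimal prices."""
    size = struct.calcsize(CANDLE_FMT)  # 24 bytes per candle
    candles = []
    for i in range(0, len(decompressed), size):
        sec, o, h, l, c, v = struct.unpack(CANDLE_FMT, decompressed[i:i + size])
        candles.append((sec, o / point, h / point, l / point, c / point, v))
    return candles
```

As with the ticks, the seconds column is an offset, here from the start of the day named in the URL.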

Keyhole answered 16/12, 2022 at 16:29 Comment(0)

Did you try using numpy to parse the data before transferring it to pandas? It may be a roundabout solution, but it would allow you to manipulate and clean the data before you do the analysis in pandas, and the integration between the two is pretty straightforward.

Bumpy answered 18/12, 2016 at 4:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.