COAP Implementation Deep Dive (1): How to Parse Bitcoin data.

Author: Peiling Ding (ArcBlock Software Engineer, Lead Developer of OCAP project)

To help our community understand the technical details of ArcBlock's open chain access protocol (OCAP), our engineering team will regularly write technical articles "decrypting" the design and development of OCAP to help readers better understand these concepts. With this knowledge, developers can participate in the implementation of more OCAP chain adapters so that OCAP can support more blockchain protocols.

As always, we welcome feedback on how our design can be improved.

Why parse Bitcoin data?

We released the first version of OCAP at the end of July, as scheduled. In this version, OCAP provides a query service for Bitcoin data. Users can not only quickly query a block of transaction information through the hash, but also make complex cascading queries. For example, it is possible to query transactions between specific two addresses. The query below returns the famous “pizza deal.”

The native Bitcoin API does not support such data queries, so if OCAP is to have such a query function, we must preprocess the data on Bitcoin and store it the way we want it. Which brings us to our topic today -- how do we parse Bitcoin data?

Summary

Before we go into the technical details, let's take a look at the basics of what we want to do. For the sake of clarity, let’s compare it to a search engine.

As we all know, the data on the Internet is complicated, and out of this endless ocean of data, a search engine can quickly find the result a user wants. The reason it can do this is that there are two components behind the search engine: the Internet crawler and inverted indexing. The crawler is responsible for collecting data from the Internet continuously, and the inverted index is responsible for storing data in specific forms that search engines can quickly query.

Essentially, blockchain queries supported by ArcBlock OCAP are equivalent to providing "search engine" services for blockchains, and similar data preprocessing is required. Unlike search engines, we do not need a crawler because Bitcoin's data is stored on the disk of the node in binary form. However, the data is organized in a way that is not very human-friendly, so we need a parser to read these binary data and restore them to their original appearance. Then some further processing must be applied to the parsed data in order to form the data in the way that OCAP needs.

Technical details

In this section, I will focus on how Bitcoin's data is stored and what additional calculations need to be done after this data is parsed from the raw binary file.

How Bitcoin data is stored

Files to store data

The original data of Bitcoin can be found below $HOME/.bitcoin/blocks. There are two main types of files under this directory, one is blk00952.dat (or similar), and the other is rev000952.dat (or similar). The first type of file starting with “blk” is the one that stores Bitcoin source data, and the second one is used for rewind. We’ll talk about the rewind file next time. Right now, let’s focus on source data files.

Let's first look at what is stored in these documents through the following commands.

The data displayed above is Bitcoin's Genesis Block. It is binary data (in hexadecimal), with one byte per two characters. A “blk” file is composed of multiple blocks like it. The upper limit of size for each “blk” file is 128MB. When a file reaches the maximum allowed size, the Bitcoin program creates another file to store new blocks.

At first glance, it's a mess, but when you get to know the data format, it's not so hard.

Data storage structure

Bitcoin has a detailed explanation of the data format on its website, but the organization of their documents is not easy to read or understand, so we have created a more intuitive chart.

block structure

Description of figure:

  1. In the figure above, blue represents the composite data type, orange represents the simple data type, and green is annotations.
  2. This figure represents a section, and a blkxxx data file is composed of many sections.
  3. Each section consists of a preamble and a block. A block consists of a block header, the number of transactions, plus those transactions. Each transaction consists of several simple data types and several transaction inputs/outputs. And so on.
  4. Almost all data is stored in little-endian, except for scripts in Transact Input/Output. They are stored in big-endian.
  5. Magic Number is a fixed number with a value of 0xD9B4BEF9
  6. The length of the variable integers is uncertain, between 1 and 9 bytes. The detailed rules to parse them are as follows:

    • If the first byte is less than 253, then directly returns the value represented by the byte
    • If the first byte is equal to 253, read the next byte as the return value
    • If the first byte is equal to 254, read the next four bytes as the return value
    • If the first byte is equal to 255, read the next eight bytes as the return value

If we fill the Bitcoin Genesis Block data into this chart, it will look like this:

filled block structure

This makes the data format much less mystifying. Note that I have not changed the data into big-endian. If you would like to see that, please convert it yourself.

Calculating additional fields

Some of you may have noticed that this chart display does not include several very important fields such as block hash, block height, transaction hash, or addresses. As the father of Bitcoin, Satoshi Nakamoto was extremely "stingy" when he/she/it/they were designing Bitcoin. If a data is calculable, it is not stored on the chain. This is a vital feature of blockchain, as it can save a lot of disk space, but it does create the need for some extra work. We need to figure out those data by ourselves. So let's talk about how to calculate those values respectively.

Calculating block hash and transaction hash

Block hash and transaction hash values are derived from the same algorithm. The only difference between them is that the data involved in the calculation. For the block hash, our input data are the 80 bytes of block header, but for the transaction hash, it is the entire transaction data. This is because a block header includes the Merkle root, so the entire block is not required when calculating the block hash.

The hash value is derived from the two SHA-256 operations of the input data. The pseudo-code is as follows:

We can get the corresponding hash value by bringing the block header and the whole transaction into the above formula respectively.

Calculating block height

As we know, Bitcoin's raw data is stored in blkxxxxxx.dat, with many blocks in each data file. If we read from byte 0 sequentially and parse all the blocks out of it, we will find that the blocks are not sequentially ordered. For example, you may read the 178th block before you read out 177th block from the “blk” file. To understand why, you can think about what the progress bar was like when downloading something with BitTorrent. To get the correct block height, we have to sort the blocks.

Anyone who is familiar with blockchain should know that the data structure of a blockchain is an inverted single-linked list, and we can rejoin these blocks through block hash and previous block hash. The previous block hash of the genesis block is fixed to 0.

Calculating addresses

To support the address-based queries in the “pizza deal” example, we have to find the addresses of the payer and the payee for each transaction. These two data are contained in the script fields of transaction input and output, respectively. Different strategies should be used to parse them:

  1. For transaction output, find out the public key or public key hash, and then calculate the address according to a certain algorithm.
  2. For transaction input, get the address from its corresponding previous transaction output.

We will address transaction output in-depth in a separate article, but for now, let’s focus on the transaction input.

Although you can get the public key in transaction input, it is not recommended that you compute the address directly from transaction input’s script. This is because after the advent of Segregated Witness, the way the address is calculated has changed. It is very likely that parsing the address directly from input will result in an address that is not the same as previous transaction output. Simply put, you may have received of coins in the name of Mat and spent it in the name of Matthew. Mat and Matthew are the same people, but because of this error, you very likely end up with two incorrect balance records in the database instead of one correct record for Mat.

Conclusion

After all the steps described above, we completed the first step of OCAP data preprocessing - parsing the data. After this, we will need to reinterpret the history of Bitcoin to get the whole UTXO Pool and calculate some statistics on each address, such as the number of transactions they made and their balance.

As the father of blockchain, Satoshi Nakamoto had a lot of wonderful ideas when he designed Bitcoin, and his control of the details was amazing. In this project, I am learning while doing, and I have learned so much. If you have time, you too will learn a lot by writing some code to try to parse the raw data. This is the most helpful way to gain a thorough understanding of Bitcoin.

Let's first look at what is stored in these documents through the following commands.