Bitcoin network dataset

This is the updated dataset, containing all Bitcoin transactions in the first 508241 blocks (approximately up to 9 Feb 2018). For background information and previous versions, see the original project pages at and

Also, see our papers on the topic:

Note: transaction and address IDs differ from the ones used in our previously published datasets! Please do not mix the old and new data.

This dataset contains the following files:

All output files use numeric IDs to refer to transactions, blocks and addresses (these are all counters starting from 0). These are mapped to hashes in the three files txh.dat, bh.dat and addresses.dat respectively. A special value of -1 for a txID indicates a bug in the processing (this should not happen). A special value of -1 for an addrID means that the address could not be decoded. This is not necessarily an error: it can happen for certain nonstandard transactions. The flow of bitcoins can still be followed in these cases, as all transaction inputs are linked to the corresponding previous transaction outputs in txin.dat.

All sums are in Satoshis (1e-8 BTC).
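Since amounts are stored as integer Satoshis, a conversion to BTC for display is probably best done without binary floating point. The following is only a minimal sketch; the function name satoshi_to_btc is hypothetical and not part of the dataset or its tools.

```python
from decimal import Decimal

SATOSHI_PER_BTC = 100_000_000  # 1 Satoshi = 1e-8 BTC


def satoshi_to_btc(amount: int) -> Decimal:
    """Convert an integer Satoshi amount to BTC without float rounding."""
    return Decimal(amount) / Decimal(SATOSHI_PER_BTC)


# Example: an amount of 150,000,000 Satoshis expressed in BTC.
example_btc = satoshi_to_btc(150_000_000)
```

Keeping sums as integers during aggregation and converting only at the end avoids accumulating rounding errors.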

Transaction inputs and outputs include a sequence number (input_seq and output_seq respectively), which identifies the input / output within its transaction. These are counters starting from 0 for each transaction. I am not sure whether these correspond to the indices used by other Bitcoin clients, but they can be used to map inputs to previous outputs. The txin.dat file includes this information: the prev_txID and prev_output_seq columns refer to the previous transaction output that is being spent.
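As an illustration of how inputs can be mapped to the outputs they spend, here is a minimal sketch operating on rows already loaded into memory. The tuple layouts below (column order, and the presence of address and sum columns in the output rows) are assumptions made for this example, not a description of the actual file format; only the names prev_txID, prev_output_seq, input_seq and output_seq come from the text above.

```python
# Assumed (hypothetical) row layouts for this sketch:
#   output rows: (txID, output_seq, addrID, sum_in_satoshi)
#   input rows:  (txID, input_seq, prev_txID, prev_output_seq)


def index_outputs(txout_rows):
    """Build a lookup from (txID, output_seq) to (addrID, sum)."""
    return {(tx, seq): (addr, value) for tx, seq, addr, value in txout_rows}


def resolve_inputs(txin_rows, outputs):
    """Attach the spent output's address and value to every input."""
    resolved = []
    for tx, in_seq, prev_tx, prev_seq in txin_rows:
        addr, value = outputs[(prev_tx, prev_seq)]
        resolved.append((tx, in_seq, addr, value))
    return resolved


# Toy data: transaction 0 creates one output, transaction 1 spends it.
txout_rows = [
    (0, 0, 10, 5_000_000_000),
    (1, 0, 11, 3_000_000_000),
    (1, 1, 10, 2_000_000_000),
]
txin_rows = [(1, 0, 0, 0)]

outputs = index_outputs(txout_rows)
resolved = resolve_inputs(txin_rows, outputs)
```

For the full dataset an in-memory dictionary is unlikely to be practical; a sorted merge join or a database would be a more realistic choice.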

Mining rewards (coinbase transactions) can be identified by having zero inputs. For all other transactions, the sum of inputs should be greater than or equal to the sum of outputs (the difference is the transaction fee), but this is not checked explicitly during processing.
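Since the input / output balance is not verified during processing, a user may want to check it. The sketch below identifies coinbase transactions as those with outputs but no inputs, and computes the fee of every other transaction. It assumes the data has already been reduced to (txID, value) pairs; the function name and this layout are hypothetical.

```python
from collections import defaultdict


def coinbase_and_fees(input_values, output_values):
    """Return (coinbase txIDs, {txID: fee}) from (txID, satoshi) pairs."""
    in_sum = defaultdict(int)
    out_sum = defaultdict(int)
    for tx, value in input_values:
        in_sum[tx] += value
    for tx, value in output_values:
        out_sum[tx] += value
    # Transactions that have outputs but no inputs are mining rewards.
    coinbase = sorted(tx for tx in out_sum if tx not in in_sum)
    # Fee = sum of inputs minus sum of outputs; it should never be negative.
    fees = {tx: total_in - out_sum.get(tx, 0) for tx, total_in in in_sum.items()}
    return coinbase, fees


# Toy data: transaction 0 is a mining reward, transaction 1 spends it.
input_values = [(1, 5_000_000_000)]
output_values = [(0, 5_000_000_000), (1, 3_000_000_000), (1, 1_999_990_000)]

coinbase, fees = coinbase_and_fees(input_values, output_values)
suspicious = [tx for tx, fee in fees.items() if fee < 0]
```

Any transaction ending up in the suspicious list would violate the expected balance and deserves a closer look.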

Note: some files are compressed with xz, which gives a higher compression ratio. For other files, the extra processing time did not seem to be worth it, so they are compressed with standard gzip. This is the case for the files containing Bitcoin addresses and transaction hashes, which are basically random data, so the only "compression" comes from storing them in a binary format instead of human-readable hex / base32 / base58.

The modified bitcoind client to generate this dataset can be downloaded here: or here.

Further code which can be used to convert this dataset to a weighted directed graph (list of edges) is available here:

The "address contraction" dataset describes a possible grouping of Bitcoin addresses to entities / users that control them using a simple heuristic of assuming that all input addresses of a transaction are controlled by the same entity. See for the steps how this was created from the transaction inputs.
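The heuristic described above can be sketched with a union-find (disjoint set) structure: all input addresses of the same transaction are merged into one group. This is only an illustration and not necessarily the procedure used to create the published files. The (txID, addrID) input layout, the function names, and the decision to skip undecodable addresses (addrID of -1) are assumptions of this example.

```python
def find(parent, a):
    """Return the representative of a's group, with path halving."""
    while parent.setdefault(a, a) != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a


def contract_addresses(tx_inputs):
    """Group addresses that appear together as inputs of a transaction.

    tx_inputs: iterable of (txID, addrID) pairs, one per transaction input.
    Returns a dict mapping each addrID to a representative addrID.
    """
    parent = {}
    first_addr = {}  # first input address seen for each transaction
    for tx, addr in tx_inputs:
        if addr == -1:
            continue  # assumption: ignore addresses that could not be decoded
        root = find(parent, addr)
        if tx in first_addr:
            other = find(parent, first_addr[tx])
            if other != root:
                parent[root] = other  # merge the two groups
        else:
            first_addr[tx] = addr
    return {addr: find(parent, addr) for addr in list(parent)}


# Toy data: addresses 10, 11 and 12 are linked through transactions 1 and 2.
tx_inputs = [(1, 10), (1, 11), (2, 11), (2, 12), (3, 20), (3, -1)]
groups = contract_addresses(tx_inputs)
```

Addresses sharing the same representative are treated as controlled by the same entity. Note that the heuristic can merge unrelated users, for example when several parties jointly sign one transaction.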

Send questions to kondor . dani [at] gmail . com