Data Insertion in Bitcoin's Blockchain

This paper provides the first comprehensive survey of methods for inserting arbitrary data into Bitcoin’s blockchain. Historical methods of data insertion are described, along with lesser-known techniques that are optimized for efficiency. Insertion methods are compared on the basis of efficiency, cost, convenience of data reconstruction, permanence, and potentially negative impact on the Bitcoin ecosystem.


Introduction
From its genesis block, and the now infamous headline that Satoshi chose to inscribe as the first permanent message in the Blockchain, Bitcoin has been utilized as a free speech platform [1,2]. In addition to exchanging digital currency on a global scale, Bitcoin also provides users with the ability to publish information that cannot be censored or retracted, and will be permanently available to the world (as long as Bitcoin itself persists) [3]. However, the Bitcoin community is divided with regard to whether this use of Bitcoin as a platform for data publication/storage is an appropriate one:

"The use of bitcoin's blockchain to store data unrelated to bitcoin payments is a controversial subject. Many developers consider such use abusive and want to discourage it. Others view it as a demonstration of the powerful capabilities of blockchain technology and want to encourage such experimentation." -Andreas Antonopoulos [4]

Everyone has their own vision of what Bitcoin can and should be used for. While we are inclined to favor the view that the insertion of data can be a legitimate and valuable use of the Blockchain, the purpose of this article is not to argue in favor of (or against) the practice; rather, it is to enumerate the historical and efficient methods of data publication, and to examine the benefits and drawbacks corresponding to each method. Specifically, we will compare data publication methods on the basis of efficiency, cost, convenience of data reconstruction, permanence, and the potentially negative impact on the Bitcoin ecosystem.
We believe this work will be of interest to several audiences:

(1) For those who wish to store data in the Blockchain, we identify which methods optimize data storage (and minimize the associated cost) given the constraints of the protocol.
(2) For those who are concerned that the Blockchain is being "co-opted" for data publication/storage, we provide a clear outline of the presently-available methods and explain which methods mitigate negative side effects for other users.
(3) For future digital archeologists, we provide a valuable point of reference that may allow posterity to unearth virtual artifacts that otherwise might remain hidden forever in the Blockchain in binary format [5].

Related Work
It is common knowledge that extrinsic data can be stored in the Blockchain. Numerous websites provide access to a subset of that data [6,7,8], and some excellent sleuthing has uncovered a variety of interesting historical artifacts that have previously been stored [9]. Nevertheless, there remains confusion and misinformation about the variety of different methods by which data can be (and has been) stored. For instance, a recent comprehensive textbook on Bitcoin included the following:

"There's no good way to prevent people from writing arbitrary data into the Bitcoin block chain [sic]. One possible countermeasure is to only accept Pay-to-Script-Hash transactions. This would make it a bit more expensive to write in arbitrary data, but it still wouldn't prevent it." [10]

The first claim, that one cannot prevent arbitrary data insertion, is correct, since there is no general way to distinguish between legitimate address hashes and arbitrary binary data. However, the second claim is false, as P2SH (Pay-to-Script-Hash) transactions actually provide the least expensive and most efficient methods for storing large amounts of arbitrary data (see section 5).
There are also a variety of websites that provide user-friendly tools to publish data of one's choice [11,12]. However, these tools are currently using the Pay-to-Fake-Key-Hash (P2FKH) method, which has serious drawbacks (discussed in section 4.2) that make it inefficient for users of the service and harmful to the Bitcoin infrastructure.
While some previous works have analyzed the graph structure and anonymity of Bitcoin's transaction ledger [13,14], there is a dearth of academic work studying the publication and storage of arbitrary data, with only a few notable exceptions. In 2015, apparently unaware that Bitcoin users were already embedding arbitrary (ASCII and binary) data into Blockchain transactions, Sleiman et al. naively proposed a protocol and developed software for including text messages in the Blockchain by using the transaction currency amount field to encode data [15]. The result was a highly inefficient publication mechanism that could only store up to 8 lowercase English letters per transaction output. More recently, Bartoletti and Pompianu analyzed the metadata attached to transactions that use one specific data insertion method (OP_RETURN, see section 4.5) to build protocol layers on top of the Bitcoin protocol (e.g., for asset management, notarization, etc.) [16]. In a different vein, Permacoin proposed the idea of building an alternative to Bitcoin that uses "proof-of-retrievability" rather than proof-of-work, which

Background: The Bitcoin Script Language
3.1. Standard Scripts - Bitcoin's stack-based scripting language for creating transactions is simply called "Script." Bitcoin transactions contain input scripts and output scripts. The input scripts are solutions (unlocking scripts) to previous output scripts (locking scripts) in prior transactions stored in the Blockchain [18]. There are currently 5 standard script types that are used and accepted on the Bitcoin network for transactions [19,20]. The standard script types include Pay-to-Public-Key (P2PK), Pay-to-Public-Key-Hash (P2PKH), Multi-Signature, Pay-to-Script-Hash (P2SH), and OP_RETURN (see Appendix B for the Script formats). Sections 4 and 5 will demonstrate how each of these script types can be used to store arbitrary data in Bitcoin's blockchain.
3.2. Methods - For analysis of historical methods and testing each of the data insertion methods in this paper, we used the open-source Java library, BitcoinJ [21]. With this tool we iterated through the Blockchain and searched for scripts that do not fit the standard script types, as well as specific script formats. We also used BitcoinJ to build scripts to use in our own transactions, and we tested these by broadcasting them to the Bitcoin network (through Blockchain.info) [22]. Example code for iterating through the Blockchain and building scripts can be found in Appendix D.
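The kind of template matching described above can be illustrated with a short sketch. This is not the BitcoinJ code from Appendix D; it is a minimal Python illustration, and the function name `classify_output_script` and its coarse byte-level checks are assumptions made for this example. In particular, the multisig test only inspects the first and last opcodes.

```python
# Opcode values from the Bitcoin Script language.
OP_DUP, OP_HASH160, OP_EQUAL = 0x76, 0xA9, 0x87
OP_EQUALVERIFY, OP_CHECKSIG = 0x88, 0xAC
OP_CHECKMULTISIG, OP_RETURN = 0xAE, 0x6A
OP_1, OP_16 = 0x51, 0x60


def classify_output_script(script: bytes) -> str:
    """Return a coarse label for a raw output script (hypothetical helper)."""
    n = len(script)
    # P2PKH: OP_DUP OP_HASH160 <20 bytes> OP_EQUALVERIFY OP_CHECKSIG
    if (n == 25 and script[0] == OP_DUP and script[1] == OP_HASH160
            and script[2] == 20 and script[23] == OP_EQUALVERIFY
            and script[24] == OP_CHECKSIG):
        return "P2PKH"
    # P2SH: OP_HASH160 <20 bytes> OP_EQUAL
    if (n == 23 and script[0] == OP_HASH160 and script[1] == 20
            and script[22] == OP_EQUAL):
        return "P2SH"
    # P2PK: <33- or 65-byte key> OP_CHECKSIG
    if n in (35, 67) and script[0] == n - 2 and script[-1] == OP_CHECKSIG:
        return "P2PK"
    if n >= 1 and script[0] == OP_RETURN:
        return "OP_RETURN"
    # Multisig: OP_m <keys...> OP_n OP_CHECKMULTISIG (only the ends are checked)
    if (n >= 3 and script[-1] == OP_CHECKMULTISIG
            and OP_1 <= script[0] <= OP_16 and OP_1 <= script[-2] <= OP_16):
        return "MULTISIG"
    return "NONSTANDARD"
```

Scripts labeled NONSTANDARD by a check of this kind are the candidates one would inspect by hand for unusual data insertion formats.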

3.3. Technical Limitations on Scripts and Transactions - At the time of writing, a standard Bitcoin transaction is limited to 100 KB, each input script is limited to 1650 bytes [23], and any single element being pushed onto the execution stack is limited to 520 bytes. After script execution, the stack must contain exactly one non-false element [24]. Input scripts may not contain any OP codes other than OP_PUSHDATA (except within the special Redeem Script portion of a P2SH). The minimum output value (min non-dust; see definition in Appendix A) for a P2PKH is currently 546 satoshis. Transactions that deviate from these rules are considered non-standard and will not be picked up by most miners [25].

3.4. Standard Script Enforcement - Most of the above restrictions on scripts are enforced by a method in the Bitcoin Core source code called isStandard() [26]. These limitations were imposed in the Bitcoin Core client for a variety of reasons, including performance considerations and preventing an issue known as transaction malleability (see section 6.2). However, they severely restrict the input and output scripts that one can write. An input script that spends a P2SH transaction is the only place that affords some flexibility in the use of the Bitcoin Script language. This flexibility allows more complex logical operations for financial transactions, and it also allows the greatest variety of data insertion mechanisms. We will first explain each of the four simpler non-P2SH methods (section 4), then explain the more sophisticated P2SH-based methods (section 5).

Data Insertion Methods Not Involving P2SH
4.1. Coinbase - The coinbase data is the content of the input of a generation transaction. The coinbase data is arbitrary and can be up to 100 bytes in size [27,28]. The coinbase data has been left to the discretion of the miners and has typically been a field where miners insert ASCII encoded strings declaring the name of their mining pool, or other short messages. The coinbase data is also used by miners to signal support for various proposed changes to the Bitcoin protocol. Some, if not all, of the coinbase data may be commandeered by developers in future versions of the Bitcoin protocol. While this field is a way of storing arbitrary data in the Blockchain, it is available only to miners and not general Bitcoin users; it is therefore included in this paper for thoroughness, but will not be mentioned again.

ledgerjournal.org ISSN 2379-5980 (online) DOI 10.5915/LEDGER.2018.101
4.2. P2FKH - A very common and controversial data insertion method utilizes the standard Pay-to-Public-Key-Hash script, storing the data in the <PubKeyHash> field of the output script along with a non-dust amount of Bitcoin to "burn." We refer to this as Pay-to-Fake-Key-Hash (P2FKH). The user does not have a public key that would hash to the data they are storing; because of this, these transaction outputs can never be spent. However, because they are valid Unspent Transaction Outputs (UTXOs; see Appendix A) and the miners have no way of knowing whether the hash corresponds to a real public key that someone possesses, the miners must keep track of these UTXOs (forever). The storage afforded by the P2FKH method is 20 bytes per output, but many outputs can be included in a single transaction. This method has been used to store text [29], images (see Fig. 1), and mp3 files in Bitcoin's blockchain [30], and is currently the method employed by tools like Apertus.io [11].

Fig. 1. This JPEG image of Nelson Mandela was stored on 7 December 2013 as P2FKHs spread across multiple transactions, within block 273,536. Size: 14,400 bytes [31].

4.3. P2FK - Data can also be stored as a fake public key (P2FK), instead of a fake public key hash. An uncompressed public key is 65 bytes [32], and the overall script has 3 fewer OP codes, making this a much more efficient method for data storage than P2FKH. However, it does not seem to be in prevalent use by the community as a method for storing data. One possible reason for this is that it would be relatively easy for nodes to detect fake (uncompressed) public keys, and the Bitcoin developers (or miners) could shut down this approach in the future [33]. Storing data using a fake compressed public key (33 bytes) could work around this, and would still provide more data efficiency than P2FKH. However, this method also suffers from the problem of creating unspendable UTXOs.
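To make the P2FKH encoding of section 4.2 concrete, the following Python sketch splits a payload into 20-byte chunks, places each chunk in the <PubKeyHash> slot of a standard P2PKH output script, and recovers the payload again. The function names are hypothetical, and padding the final chunk with null bytes is an assumption for illustration rather than a documented convention of any particular tool.

```python
def p2fkh_output_scripts(data: bytes, pad: bytes = b"\x00") -> list[bytes]:
    """Split data into 20-byte chunks and wrap each one in a P2PKH script."""
    scripts = []
    for i in range(0, len(data), 20):
        # Pad the final chunk so that it fills the 20-byte hash field.
        chunk = data[i:i + 20].ljust(20, pad)
        # OP_DUP OP_HASH160 <push 20 bytes> chunk OP_EQUALVERIFY OP_CHECKSIG
        scripts.append(b"\x76\xa9\x14" + chunk + b"\x88\xac")
    return scripts


def p2fkh_extract(scripts: list[bytes]) -> bytes:
    """Concatenate the 20-byte <PubKeyHash> fields of P2PKH scripts."""
    return b"".join(script[3:23] for script in scripts)


message = b"Hello, Bitcoin blockchain!"
outputs = p2fkh_output_scripts(message)
recovered = p2fkh_extract(outputs).rstrip(b"\x00")
```

Each 25-byte script carries only 20 bytes of payload, and every such output must also be funded with at least the min non-dust value, which is the source of the inefficiency discussed in this section.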

(3) These methods irretrievably "burn" Bitcoin. P2FKH and P2FK both require the user to send a small amount of Bitcoin (greater than or equal to the min non-dust value) to each fake address.
(4) Storing arbitrary data in the Blockchain will create "bloat" to the overall ledger size.

The first three problems can be addressed by using improved data storage methods. The fourth objection will apply to any data insertion method, and the Blockchain is destined to grow larger as long as blocks are mined and transactions are occurring, regardless of what the transactions themselves actually represent. Whether the value of the data being stored is a worthwhile use of the Bitcoin network's resources is a point the community will continue to debate. Regardless of the data storage use case, Bitcoin will face scalability issues, which developers are already attempting to address (e.g., segwit [34], and Peter Todd's Merkle Mountain Range proposal to use commitments to obviate the need to store the full UTXO set [35]).

Table 1 shows an estimated amount of Bitcoin that has been burned to fake mostly-text addresses using the P2FKH method, as of 7 June 2017. Specifically, we aggregated the balances for all P2PKH UTXOs for which the address has never been used as an input script, and the key hash contains 18 (or more) consecutive bytes from the set of printable ASCII characters, plus tabs, newlines, and null ('\x00') characters that may have been used as padding around textual data [36].

4.5. OP_RETURN - The OP_RETURN standard script was added as a response to the increasing number of users using P2FKH to store data (or metadata) in transactions [37]. OP_RETURN allows a small amount of data to be included in each transaction, creating a provably unspendable UTXO that the miners do not need to track, and that does not require a non-dust burn value.
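As a minimal sketch of how an OP_RETURN output script is assembled, the Python function below prepends the OP_RETURN opcode and an appropriate push opcode to a payload. The helper name `op_return_script` is hypothetical, and the 80-byte cap is taken from the limit on OP_RETURN data discussed in this section; it is a relay policy value that has changed over time, so it is exposed as a constant rather than treated as fixed.

```python
MAX_OP_RETURN_DATA = 80  # bytes of payload per transaction, as cited in the text


def op_return_script(data: bytes) -> bytes:
    """Build OP_RETURN <data> as raw script bytes (hypothetical helper)."""
    if len(data) > MAX_OP_RETURN_DATA:
        raise ValueError("payload exceeds the OP_RETURN data limit")
    if len(data) <= 75:
        # Lengths up to 75 bytes are pushed with a single length opcode.
        push = bytes([len(data)])
    else:
        # Longer elements need OP_PUSHDATA1 followed by a one-byte length.
        push = b"\x4c" + bytes([len(data)])
    return b"\x6a" + push + data  # 0x6a is OP_RETURN
```

For a full 80-byte payload the resulting script is 83 bytes long: one byte for OP_RETURN, two for the push, and 80 for the data.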
There can be many outputs in a single Bitcoin transaction, but only one of these can be an OP_RETURN in a standard transaction [38]. OP_RETURN can currently only store 80 bytes per transaction. This limit has fluctuated over time (see Bartoletti and Pompianu for a discussion about the history of OP_RETURN [16]). To use more than one OP_RETURN, multiple transactions are required [39]. The order in which these transactions are mined by the decentralized Bitcoin network is difficult to control. Overall, this method is appropriate for inserting small amounts of data (or transaction metadata), but it is not suitable for large quantities of data. Some community members have also expressed concern about the robustness of storing data using OP_RETURN, since provably unspendable UTXOs can be pruned by nodes, and may not be permanently stored/distributed by as many nodes [40].

4.6. P2FMS - Another data insertion method (Pay-to-Fake-Multisig) that commonly appears in the Blockchain is a 1-of-2 or 1-of-3 multisig script [41], with one real public key, and 1 or 2 fake keys containing arbitrary data [42]. Because these transactions are spendable, a user can avoid creating UTXO bloat. For the lowest overhead cost, one would use a (real) compressed public key, and store the data using two fake uncompressed public keys (65 bytes each). This method would keep the data in the UTXO set only until the user decides to spend these outputs (using the one real key). Multiple P2FMS outputs can be stored within a single transaction, consistently using the same real public key in all of them, making data reconstruction straightforward.
However, transactions containing a single OP_CHECKMULTISIG must be larger than 400 bytes; specifically, the default requirement is 20 bytes per sigop [43], and one instance of OP_CHECKMULTISIG counts as 20 sigops [44]. This limitation makes redemption of these UTXOs uneconomical: the cost in fees for spending these UTXOs will be greater than the min non-dust values that would typically be sent to them [45]. Therefore, users with no regard for the UTXO bloat can simply use all 3 pubkey fields to store arbitrary data with a burn amount.
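The 1-of-3 layout and the sigop arithmetic above can be checked with a short Python sketch. The helper name `p2fms_script` is hypothetical. The fee rate of 20 satoshis/byte is the minimal rate used in the cost comparison later in the paper, and the 546-satoshi figure is the P2PKH min non-dust value from section 3.3, used here only as a stand-in for the value typically sent to a P2FMS output (an assumption).

```python
SIGOPS_PER_CHECKMULTISIG = 20  # from the text
BYTES_PER_SIGOP = 20           # default requirement cited in the text
FEE_RATE = 20                  # satoshis per byte (minimal rate used later in the paper)
MIN_NON_DUST = 546             # satoshis; P2PKH figure used as a stand-in (assumption)


def p2fms_script(real_key: bytes, fake_keys: list[bytes]) -> bytes:
    """Build OP_1 <real key> <fake key>... OP_n OP_CHECKMULTISIG."""
    keys = [real_key] + fake_keys
    if not 2 <= len(keys) <= 3:
        raise ValueError("expected a 1-of-2 or 1-of-3 script")
    # Keys are 33 or 65 bytes, so a single length byte is a valid push opcode.
    body = b"".join(bytes([len(key)]) + key for key in keys)
    return b"\x51" + body + bytes([0x50 + len(keys)]) + b"\xae"


real = b"\x02" + b"\x11" * 32                 # placeholder compressed key
fakes = [b"A" * 65, b"B" * 65]                # 130 bytes of arbitrary data
script = p2fms_script(real, fakes)

min_spend_size = SIGOPS_PER_CHECKMULTISIG * BYTES_PER_SIGOP   # 400 bytes
min_spend_fee = min_spend_size * FEE_RATE                     # 8000 satoshis
uneconomical = min_spend_fee > MIN_NON_DUST
```

Under these assumptions the spending fee is roughly fifteen times the value locked in the output, which is why redemption is described as uneconomical.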

Data Insertion Methods Using P2SH
5.1. P2FSH - Similar to P2FKH, the Pay-to-Fake-Script-Hash (P2FSH) method simply stores data as a fake hash. P2FSH requires two fewer OP codes than P2FKH (making it slightly more efficient) but still creates an unspendable UTXO. The remainder of section 5 is dedicated to methods that store data in the input script that spends a P2SH output, rather than in the output script.

5.2. Two Stages of P2SH Transactions - There are two stages of P2SH: creating the UTXO and spending the UTXO. To create a P2SH UTXO, the user first creates a Redeem Script, and then applies the HASH160 algorithm to this script [46]. The output script is then:

OP_HASH160 <RedeemScriptHash> OP_EQUAL

To spend this UTXO, the user creates an input script (referencing the UTXO above) consisting of the Redeem Script itself (as a single stack element, thus limited to 520 bytes) preceded by a sequence of Script operations that will leave only a true value on the stack after the Redeem Script executes [47]. There are two approaches to data insertion: store arbitrary data inside the Redeem Script itself, and/or store arbitrary data in the portion of the input script that precedes the Redeem Script. For instance, a user might simply make a Redeem Script that contains an OP_PUSHDATA2 (3 bytes) followed by a 517-byte data element [48]. Since any stack element other than OP_0 is evaluated as "true," this script will successfully redeem the UTXO. However, because of the 520-byte Redeem Script limit, it is more efficient to store large amounts of data in the portion of the input script that precedes the Redeem Script (see Fig. 6 for a visual representation). We will next discuss such methods (see Appendix C for the full scripts). Variations of the following P2SH-based methods have been used to store data in the Blockchain since June 2014 [49].

5.3. Data Drop Method - The Data Drop method pushes data onto the stack and drops it off the stack during script execution, typically with the use of the OP_DROP operation. Consider the following Redeem Script: OP_DROP ... OP_DROP <PubKey> OP_CHECKSIG [50]. The preceding input script operations are then <Sig> <Data>...<Data>. The stored data must be split into chunks of at most 520 bytes each. The signature is 71-73 bytes and the Redeem Script is 37 bytes, which leaves 1529 bytes for arbitrary data after accounting for the pushdata OP codes. Recall that the input script is constrained by the input size limit of 1650 bytes (see section 3.3), but these inputs can be chained together within a single transaction (up to the 100 KB TX size limit) to store large amounts of data in a nearly contiguous and easy-to-reconstruct format (more about reconstruction in section 8). This method has been used to store relatively large image files within a single transaction in the Blockchain (see Fig. 2).
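The 1529-byte figure for the Data Drop method can be reproduced with a small calculation. The sketch below is an illustration in Python, with hypothetical function names; it assumes the worst-case 73-byte signature and the 37-byte Redeem Script stated in the text, and it charges each pushed element the push opcode bytes its length requires.

```python
MAX_INPUT_SCRIPT = 1650  # bytes per input script (section 3.3)
MAX_ELEMENT = 520        # bytes per stack element (section 3.3)


def push_overhead(n: int) -> int:
    """Bytes of push opcode needed in front of an n-byte element."""
    if n <= 75:
        return 1   # single length opcode
    if n <= 255:
        return 2   # OP_PUSHDATA1 + 1-byte length
    return 3       # OP_PUSHDATA2 + 2-byte length


def data_drop_capacity(sig_len: int = 73, redeem_len: int = 37) -> int:
    """Payload bytes that fit in one input script next to <Sig> and the Redeem Script."""
    budget = MAX_INPUT_SCRIPT
    budget -= push_overhead(sig_len) + sig_len
    budget -= push_overhead(redeem_len) + redeem_len
    capacity = 0
    while True:
        # Find the largest chunk that still fits together with its push opcode.
        n = min(MAX_ELEMENT, budget)
        while n > 0 and n + push_overhead(n) > budget:
            n -= 1
        if n <= 0:
            break
        capacity += n
        budget -= n + push_overhead(n)
    return capacity
```

With the defaults, the budget after the signature and Redeem Script is 1538 bytes, which is filled by chunks of 520, 520, and 489 bytes plus three 3-byte push opcodes, giving 1529 bytes of payload.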
We include a compressed <PubKey> as part of the Redeem Script to ensure that the Redeem Script hashes to something new each time this method is used with a new key, and the use of a signature (<Sig> ... OP_CHECKSIG) prevents a double-spend attack (see section 6.1).The data insertion method that provides the lowest known overhead (and publication cost) is a variant of this (Data Drop w/o Sig) that eschews the use of signatures and keys in order to pack more data into each transaction input, at the cost of potential adversarial tampering.However, even using signatures, an adversary could perform an online attack to tamper with data stored using the Data Drop method (see section 6.2).The trade-off between maximizing storage capacity and ensuring transaction security and data integrity is discussed further below.

5.4. Data Hash Method - The Data Hash method is a more sophisticated method for inserting data in the Blockchain [52]. The largest input script in Blockchain history is an example of this script type; this transaction was included on 27 November 2014, by an unknown author [53,54]. This transaction included a parody of a Western Union advertisement (see Fig. 3). Similar to the Data Drop method, the input script preceding the Redeem Script contains repeated chunks of <Data>...<Data>. The Redeem Script is of the form: OP_HASH160 <DataElementHash> OP_EQUALVERIFY. These three commands are then repeated for each data element that is pushed onto the stack by the input script. Rather than merely dropping each data element off the stack, this script uses hashes to verify that each chunk of data has not been tampered with. Since the hashes are stored in the Redeem Script, and the hash of the Redeem Script was recorded in the first stage UTXO, no other data can be substituted into the input script that spends this UTXO, even if the inputs for this transaction were not signed. However, signing each input (by inserting <Sig> at the beginning of the input script and <PubKey> OP_CHECKSIG at the end of the Redeem Script) is still necessary to prevent an adversary from potentially reordering the inputs, or including a subset of the inputs, in a competing transaction. These security concerns are further discussed in the next section.

Fig. 3. This JPEG image is stored in Bitcoin's blockchain as a GZIP archive file inside one input script of a P2SH output. (Compressed) size: 9,265 bytes. This input script is the largest input script present in the Blockchain to date [55].
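A signed Data Hash Redeem Script of the form described above can be sketched as follows. This is an illustrative Python outline, not the script from Appendix C: the function names are hypothetical, and the reversed iteration reflects our reading of the stack semantics (the last chunk pushed by the input script is on top of the stack, so its hash must be checked first). The default `hash160` relies on `hashlib` exposing `ripemd160`, which depends on the local OpenSSL build, so the hash function is passed in as a parameter.

```python
import hashlib
from typing import Callable, Sequence


def hash160(data: bytes) -> bytes:
    """RIPEMD160(SHA256(data)); requires ripemd160 support in the local OpenSSL."""
    return hashlib.new("ripemd160", hashlib.sha256(data).digest()).digest()


def data_hash_redeem_script(
    chunks: Sequence[bytes],
    pubkey: bytes,
    hasher: Callable[[bytes], bytes] = hash160,
) -> bytes:
    """Build (OP_HASH160 <hash> OP_EQUALVERIFY)... <PubKey> OP_CHECKSIG."""
    script = b""
    # Input script pushes <Sig> <Data1> ... <DataN>; DataN is on top of the
    # stack when the Redeem Script starts, so hashes are emitted in reverse.
    for chunk in reversed(chunks):
        digest = hasher(chunk)
        if len(digest) != 20:
            raise ValueError("expected a 20-byte hash")
        script += b"\xa9" + b"\x14" + digest + b"\x88"
    # After all chunks are verified only <Sig> remains for OP_CHECKSIG.
    return script + bytes([len(pubkey)]) + pubkey + b"\xac"
```

Each data element costs 23 bytes of Redeem Script, which is one reason this method offers less capacity per input than Data Drop: the Redeem Script must itself fit within a single 520-byte stack element.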

Security and Data Integrity
6.1. Sniping UTXOs - We refer to sniping as the process of re-appropriating a transaction's unsigned inputs to a new transaction with different outputs (created by the sniper and broadcast simultaneously) to hijack the funds those inputs represent [56]. Only one of these double-spend attempts may be included in the Blockchain. Signatures are designed to protect against sniping because they prohibit adversaries from making any changes to the signed portion of the transaction (to do so would require generating a new valid signature, which the adversary cannot do without the user's private key). However, when a user creates a signature for an input script, the output scripts are secured, but not the input scripts [57]. Redeem Scripts that do not require a signature are thus vulnerable. If such a script is used multiple times, it may become associated with its hash, and UTXOs that use this hash may be spent by anyone who provides the corresponding Redeem Script. One could include a unique element in the Redeem Script so that the hash (of the Redeem Script) is different with each use [58]. These transactions, however, could still be sniped in real-time by sophisticated bots.
6.2. Transaction Malleability - We define transaction malleability to mean any change to a transaction that is broadcast (prior to block acceptance). Transaction malleability is a problem that has plagued Bitcoin for years, and has been addressed in a variety of ways by the Bitcoin Core development team. The threat to normal users is now rarely more than an annoyance, but for data publishers it is a potentially severe problem that warrants discussion [59]. When a new transaction is broadcast to the P2P Bitcoin network, it gets passed from node to node, with nodes verifying it and storing it into the mempool of possible transactions to include in a block. An adversarial node may receive a transaction and create a modified version of this transaction to pass along to others in the network. These changes may be as innocuous as changing a PUSHDATA OP code [60], but a more detrimental change could be to alter the arbitrary data stored using the Data Drop method. As long as the scripts themselves still result in valid execution, the modified transaction will have a new transaction ID and could be included in the Blockchain in this modified form. No "functional" transaction data has been changed: the inputs and outputs are still accounted for correctly.
Since the Data Drop with signatures method prevents sniping, it does not currently appear to be a target for malicious agents [61]. However, the Data Drop method includes no measures to prevent an agent on the network from modifying the arbitrary data a user is trying to store, even if each input is signed. In contrast, the Data Hash method ensures data integrity because the hash of each data element is checked during execution of the Redeem Script. While a Data Hash transaction that does not contain a signature could be easily sniped, the sniper would still have to include the exact unmodified data as input. However, for data spanning multiple (unsigned) inputs, a sniper could rearrange the inputs, or only spend some of the inputs and not others, causing the data to be stored in the Blockchain in an unintended order. Adversaries motivated by mere financial gain can be discouraged by assigning only the min non-dust Bitcoin value for each (unsigned) P2SH input, making the sniper effectively pay more in fees (to store your desired data) than they would recoup from redirecting the output to their own address [62]. Thus, the only method guaranteed to preserve data integrity when using multiple outputs is Data Hash with signatures.
None of the simpler data insertion methods (P2FKH, P2FK, P2FMS, P2FSH, OP_RETURN) suffer from malleability or sniping concerns, since the data is stored within signed outputs [63].

* If sniped, multiple inputs within the transaction can be reordered, even though the data within each input cannot be changed.
Table 2 summarizes the two P2SH-based methods with and without signatures in terms of security and data capacity.Although the Data Hash w/ Sig method provides the least data capacity of these methods, the benefit of guaranteed data integrity likely outweighs the loss of efficiency.

Efficiency Comparison and Costs
First, regarding efficiency concerns about bloating the UTXO set using fake addresses in UTXOs (as discussed in section 4.4), which impacts the scalability of the Bitcoin ecosystem:
• P2FKH and P2FSH are both extremely wasteful, providing only 20 bytes of data per unspendable UTXO.
• P2FK is also quite wasteful, although using uncompressed keys currently affords 65 bytes of data per unspendable UTXO [64].
• The currently allowed form of P2FMS (with all 3 addresses fake) could store as much as 195 bytes (using 3 uncompressed keys) per unspendable UTXO. Versions of P2FMS with 1 real key are spendable, but there is currently no economic benefit to retrieve min non-dust values.
• OP_RETURN does not bloat the UTXO set, since it is provably unspendable and nodes may prune it.
• Both forms of the P2SH-based methods that store the data in input scripts (Data Drop and Data Hash) do not increase the UTXO set at all, since all created TXOs get redeemed.
Next, we consider two additional measures of efficiency:
(1) The total amount of data (i.e., including overhead) that is required to be added to the Blockchain in order to store a specified amount of arbitrary data (shown in Fig. 4 and Table 3). This relates to scalability issues, and will be of interest to those concerned with storing full copies of the Blockchain.
(2) The total cost in satoshis, using current minimal (20 satoshis/byte) fee and min non-dust burn rates necessary for a transaction to be accepted, for storing a specified amount of arbitrary data (shown in Fig. 5 and Table 3) [65]. This measure is of interest to those who wish to store data in the Blockchain inexpensively.

As Fig. 4 and Fig. 5 show, OP_RETURN is the most efficient choice for storing small amounts of data (up to 80 bytes). For medium amounts of data (between 80 and 800 bytes), P2FMS is the most cost-effective option, and it provides the least data overhead up to ≈ 10 KB. For large amounts of data (beyond 800 bytes), the Data Drop w/o Sig method provides the least expensive option, and it requires the least data overhead beyond 10 KB. The P2SH-based methods that store data in the input script (Data Drop and Data Hash) have a higher fixed overhead (due to needing an initial transaction to set up the UTXOs that the second transaction redeems), but offer competitive levels of data overhead compared to P2FK and P2FMS for larger amounts of data at much lower costs (since they avoid the burn costs for each UTXO).

Example: for a 50 KB file, the most cost-effective secure method (Data Hash w/ Sig) costs approximately 0.012 Ƀ, which is a 61% savings compared to P2FKH (≈ 0.03 Ƀ). At current exchange rates (1 BTC ≈ 2500 USD), this would cost about $30 to publish in the Blockchain.
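The P2FKH side of this example can be approximated with a back-of-the-envelope calculation. The Python sketch below is a rough estimate under stated assumptions, not the model behind Fig. 5 or Table 3: it takes 50 KB as 50,000 bytes, counts 34 bytes per P2PKH output (8-byte value, 1-byte script length, 25-byte script), and ignores the funding input and fixed transaction overhead. The Data Hash w/ Sig cost is simply the 0.012 Ƀ figure quoted in the text.

```python
FEE_RATE = 20                 # satoshis per byte (minimal fee used in the text)
MIN_NON_DUST = 546            # satoshis burned per P2PKH output
OUTPUT_SIZE = 34              # bytes per serialized P2PKH output (assumption above)
SATOSHIS_PER_BTC = 100_000_000
USD_PER_BTC = 2500            # exchange rate quoted in the text


def p2fkh_cost_satoshis(data_len: int) -> int:
    """Approximate fee plus burn for storing data_len bytes via P2FKH."""
    outputs = -(-data_len // 20)              # ceiling division: 20 bytes per output
    fee = outputs * OUTPUT_SIZE * FEE_RATE
    burn = outputs * MIN_NON_DUST
    return fee + burn


p2fkh_cost = p2fkh_cost_satoshis(50_000)      # about 0.03 BTC
data_hash_cost = 1_200_000                    # 0.012 BTC, as quoted in the text
savings_percent = round(100 * (1 - data_hash_cost / p2fkh_cost))
data_hash_usd = data_hash_cost * USD_PER_BTC / SATOSHIS_PER_BTC
```

Under these assumptions the P2FKH estimate splits into 1,700,000 satoshis of fees and 1,365,000 satoshis of burned outputs, which illustrates how much of the cost comes from the burn rather than the data itself.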

Data Reconstruction
8.1. Methods Involving Burns - All methods relying on fake keys and/or hashes are cumbersome to reconstruct. For P2FKH, each output contains 20 bytes of data to be retrieved, and many ordered outputs can be used to store a contiguous data set. To reconstruct the data, extract the data from the key or hash in each output script [66]. One must be careful to avoid any P2PKH outputs in the transaction that represent "change" addresses; the data outputs are typically marked by their min non-dust values. There does not seem to be a defined limit on the number of outputs a transaction can have [67]. Under the 100 KB size limit, P2FKH has a maximum storage size of 58,680 bytes with a total transaction size of 99,983 bytes. Files larger than this will have to be split among different transactions, and subsequently linked together (either within the Blockchain itself or by external information) [68]. This makes fully automatic reconstruction of datasets stored in the Blockchain more difficult. For P2FMS, reconstruction also means avoiding the pushdata OP codes between the fake keys.

8.2. Methods Not Involving Burns - For both Data Drop and Data Hash methods, the data is stored in the input script in the same way. To reconstruct the data, ignore any signature data if present, the pushdata OP codes between the data elements, and the Redeem Script itself. Assuming no malleability concerns, the data will be stored in the same order in which it was broadcast, within a single transaction (up to 100 KB with overhead), achieving a maximum file size of 96,060 bytes [69]. An OP_RETURN output can be used for metadata, such as the name of the file, or the TX ID of the next chunk of data for files larger than 100 KB [70]. To ease retrievability, one may include a single P2FKH output that pays to the hash of the data file being stored, similar to the approach taken by Cryptograffiti [12]. This method allows anyone with this hash to use common blockchain exploration tools to find the transaction where the data was stored. A figure showing the anatomy of an input script is provided (see Fig. 6).
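The reconstruction procedure for Data Drop and Data Hash inputs can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the input script consists only of data pushes (single-byte length opcodes and OP_PUSHDATA1/2/4), with an optional leading signature and the Redeem Script as the final element.

```python
def parse_pushes(script: bytes) -> list[bytes]:
    """Split a push-only input script into its stack elements."""
    elements, i = [], 0
    while i < len(script):
        op = script[i]
        i += 1
        if op <= 0x4B:                      # direct push of `op` bytes
            n = op
        elif op == 0x4C:                    # OP_PUSHDATA1: 1-byte length
            n = script[i]
            i += 1
        elif op == 0x4D:                    # OP_PUSHDATA2: 2-byte little-endian length
            n = int.from_bytes(script[i:i + 2], "little")
            i += 2
        elif op == 0x4E:                    # OP_PUSHDATA4: 4-byte little-endian length
            n = int.from_bytes(script[i:i + 4], "little")
            i += 4
        else:
            raise ValueError("non-push opcode in input script")
        if i + n > len(script):
            raise ValueError("truncated push")
        elements.append(script[i:i + n])
        i += n
    return elements


def extract_input_data(script: bytes, signed: bool) -> bytes:
    """Drop the Redeem Script (last element) and the signature (first, if signed)."""
    elements = parse_pushes(script)
    payload = elements[1:-1] if signed else elements[:-1]
    return b"".join(payload)
```

Applying `extract_input_data` to each input of a transaction in order, and concatenating the results, yields the stored file, provided the inputs have not been reordered.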
As a point of reference for reconstruction, consider the following transaction, which contains a JPEG image stored using the Data Hash w/o Sigs method:
TX ID: 033d185d1a04c4bd6de9bb23985f8c15aa46234206ad29101c31f4b33f1a0e49
Block: 474586

The Redeem Script data is easily identified as the last data element of each input.The JPEG data precedes the Redeem Scripts, three data elements at a time.The second-to-last input contains only two data elements preceding the Redeem Script.The final input does not contain image data; it is used to pay fees.

Conclusion
A comprehensive survey of the benefits and drawbacks of extant methods revealed that there is no optimal data insertion method that dominates all of the others. Instead, different methods will be optimal depending on one's priorities, and the amount of data to store. For small quantities of data, using OP_RETURN is a solid choice, and is probably also the closest to an "approved" standard for data publication. For larger amounts of data, if quantity at low cost is paramount and security is unimportant, the Data Drop w/o Sig method may be the best choice. Alternatively, the Data Hash w/ Sig method provides a nice balance of data integrity with an efficient cost function for large data. However, many in the community believe that storing large quantities of data is not an appropriate use of the Blockchain, and that it should be used for storing short hashes of documents (i.e., as time-stamped existence proofs) rather than the full documents themselves. Others in the community take a strong free market stance, and hold that if users are willing to bear the cost of data insertion, they should be able to use the technology as they see fit. The purpose of this paper is not to cast value judgments about these perspectives, but rather to encourage informed discussion about the technical and economic issues at stake. On a pragmatic level, given Bitcoin exchange rates in recent times, even the most efficient methods may be prohibitively expensive for publishing large files, unless the insertion of that data has significant/lasting value to the publisher.
It is striking that P2FKH (which appears to be a dominant approach used by several data publication tools) fares poorly in almost all regards: it creates the most unspendable UTXO bloat, it requires the largest overhead, and it costs the most [11,12]. We have several hypotheses that may explain its (possibly unwarranted) popularity:

(1) It is one of the simplest to implement. 71
(2) Most people are unaware that more sophisticated approaches (like using input scripts to store data) exist. 72
(3) Tool-makers are concerned that more complex methods may be banned in future versions of Bitcoin, which would break compatibility.
(4) Users are concerned that any data that does not create unspendable UTXOs will not be sufficiently permanent, as it may end up being pruned in the future.

This last hypothesis is the most interesting one. On the one hand, as long as Bitcoin survives, surely some nodes will always keep a full and complete ledger (including input scripts), in order to have a complete archive of past transactions, and to be able to verify the hashes of all blocks from the beginning. On the other hand, UTXOs themselves may not be immune to pruning, as the future might bring the possibility of using cryptographic data structures with commitments to store the status of UTXOs without storing the UTXO data directly. 35 However, this would likely serve as a caching optimization for miners/nodes, and the full record including very old UTXOs would still be archived on disk.
As a final caveat, we have attempted to provide a comprehensive review of the major current and past data insertion techniques, but the knowledge and methods contained in this article are based on a scripting protocol that is subject to continual change, and thus some of the methods discussed may become unavailable in the future. For instance, the impact of the Segregated Witness (segwit) BIP on the feasibility of long-term data storage using input scripts is an important question for future testing and research. 34 However, even if future changes to the Bitcoin Core disable or enable new features relating to data storage, there is important academic value in documenting the methods that have been used to date. Knowledge of these methods will be useful for historical research, and may form the building blocks of future methods of data publication for Bitcoin, as well as other cryptocurrencies.

We list some common definitions and abbreviations used in the paper.
• Dust: If the fees to spend a transaction output (determined from the size of the output and the input required to spend it) would cost more than one third the value of that output, the output value is considered dust. Transactions with dust output values are considered non-standard.
• Min Non-Dust: The minimum non-dust value is the least value one can send without the output being flagged as dust. The minimum output value for a P2PKH is currently 546 satoshis. This minimum threshold value changes depending on the script being used.
• Provably Unspendable: An OP_RETURN UTXO is provably unspendable, meaning that the Bitcoin protocol has marked it as impossible to spend. Thus, it does not need to be included in the set of UTXOs that may be spent in the future. In contrast, some transaction outputs are effectively unspendable because their scripts have no known solution. Spending a P2FKH output would require generating a private key that corresponded to that public key, which is astronomically improbable, but does not render the UTXO provably unspendable.
• Redeem Script: A script that is hashed, and this hash is used as the output of a Pay-to-Script-Hash transaction. The Redeem Script and any inputs it takes are supplied when a user wishes to spend the output that was created. These inputs and the Redeem Script itself are executed and must return true in order for the transaction to be valid.
• Snipeable: In this paper, sniping refers to the process of re-appropriating the unsigned inputs of an unconfirmed transaction by creating a new transaction with different outputs to hijack the funds those inputs represent. If a transaction has unsecured inputs that can be re-appropriated, it is snipeable.
• Transaction Malleability: Transaction malleability refers to the ability to change any part of a transaction without invalidating that transaction. Transaction malleability is often no more than a minor inconvenience, but some of the data storage methods described are subject to a malicious actor potentially changing some or all of the data to be stored.
• UTXO: Unspent Transaction Output. Each (non-coinbase) input references a previous UTXO to spend the coins associated with that UTXO. One can think of the set of UTXOs as places where Bitcoin is stored and can potentially be used as sources for future transactions.
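
The Dust and Min Non-Dust definitions above can be checked with a little arithmetic. The sketch below assumes a relay fee of 1,000 satoshis per 1,000 bytes, a 34-byte P2PKH output, and a 148-byte input to spend it. These sizes and the fee rate are assumptions not stated in the definitions; they are used here because they reproduce the 546-satoshi figure. The function names are hypothetical.

```python
def dust_threshold(output_size: int, spend_input_size: int,
                   relay_fee_per_kb: int = 1000) -> int:
    """Smallest output value (in satoshis) that is not dust.

    An output is dust when the fee to create and spend it exceeds one
    third of its value, so the threshold is three times that fee.
    """
    spend_fee = (output_size + spend_input_size) * relay_fee_per_kb // 1000
    return 3 * spend_fee


def is_dust(value: int, output_size: int, spend_input_size: int,
            relay_fee_per_kb: int = 1000) -> bool:
    """True if an output of `value` satoshis falls below the threshold."""
    return value < dust_threshold(output_size, spend_input_size, relay_fee_per_kb)


# Assumed sizes for a P2PKH output (34 bytes) and the input spending it (148 bytes).
P2PKH_OUTPUT_SIZE = 34
P2PKH_INPUT_SIZE = 148
p2pkh_min_non_dust = dust_threshold(P2PKH_OUTPUT_SIZE, P2PKH_INPUT_SIZE)
```

Because the threshold depends on the output and input sizes, a larger script yields a larger minimum value, which is why the threshold changes with the script being used.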

Fig. 4. Total data required vs. stored data size, for small (LEFT) and large (RIGHT) data sizes, up to the maximum size possible within a single transaction.

Fig. 5. Currency cost vs. stored data size for small (LEFT) and large (RIGHT) data sizes, up to the maximum size possible within a single transaction. This graph assumes a transaction fee of 20 satoshis/byte and burn values of 1100 satoshis for P2FMS and 546 for other methods requiring burns.

Table 2. P2SH-based Data Insertion Method Summary (Single Input)
* See Appendix C for the full scripts used for these calculations.

Table 3. Method Summary (Max Size and Cost) for a Fee of 20 Satoshi/byte
* Data in bytes ** Cost in bitcoin *** Efficiency in satoshis per byte of arbitrary data stored
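
The cost and efficiency figures rest on two inputs given in the captions: a fee of 20 satoshis per byte of transaction data, and a burn value for each unspendable output (1100 satoshis for P2FMS, 546 for other methods requiring burns). A minimal sketch of that arithmetic follows. The function names are hypothetical, and the transaction size and number of burned outputs must be supplied by the caller, since they depend on the method and are not derived here.

```python
SATOSHIS_PER_BITCOIN = 100_000_000


def insertion_cost(tx_size_bytes: int, burn_outputs: int,
                   fee_rate: int = 20, burn_value: int = 546) -> int:
    """Total cost in satoshis: transaction fee plus value burned in outputs."""
    return tx_size_bytes * fee_rate + burn_outputs * burn_value


def efficiency(cost_satoshis: int, data_bytes: int) -> float:
    """Satoshis spent per byte of arbitrary data stored."""
    return cost_satoshis / data_bytes


def to_bitcoin(cost_satoshis: int) -> float:
    """Convert a satoshi amount to bitcoin for reporting."""
    return cost_satoshis / SATOSHIS_PER_BITCOIN
```

For example, a hypothetical 1,000-byte transaction with two burned outputs would be costed as `insertion_cost(1000, 2)`, or `insertion_cost(1000, 2, burn_value=1100)` for a method using the P2FMS burn value.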