Using AI & Python to Populate Form 8949 for your Crypto Gains & Losses

Mackenzie Patel
Mackenzie Patel
save time
July 10, 2023

Related articles

Browse all articles

Using AI & Python to Populate Form 8949 for your Crypto Gains & Losses

July 10, 2023
Crypto Tax

Form 8949 might not be the most thrilling tax form (although I think it is!), but it’s an essential piece of the crypto tax stack. Its longer, less snappy subtitle is “Sales and other Dispositions of Capital Assets,” which means the form records all gains and losses related to capital assets (like property). In the US, digital assets are taxed as property, so any fluctuations in pricing when you send, trade or ape into an LP position wind up on their own line in Form 8949. The equation for calculating the gain or loss for a transaction is:

Gain/Loss = Fair market value of crypto - cost basis of crypto

  • Fair Market Value (FMV) = current price of the crypto when you swap or send it → think of the current prices listed on Coingecko, Coinbase, Binance, etc.
  • Cost Basis = the price at which you originally acquired the unit of crypto in question.

Crypto prices are volatile, so the two prices listed above are usually never the same (unless we’re talking about non-sketchy stablecoins like USDC or DAI). The difference that arises is our lovely gain (or loss), which is required to be reported on Form 8949 in either the short term (crypto held for < 1 year) or long term (crypto held for > 1 year) sections.

If you’re an individual using an off-the-shelf crypto tax software like Koinly, Cointracker or Taxbit, Form 8949 is automatically tabulated for you and delivered in a neat pdf. However, for enterprise crypto accounting products (think Cryptio, Bitwave, Integral, etc), this feature manifests itself as a csv of line-by-line gains and losses, not as a preformatted pdf split into short and long term parts.

I completed my first Form 1120 this year (🎉) for dOrg, a DAO that is incorporated as an LLC in Vermont. I realized that LLCs taxed as C Corporations have to report their gains & losses on Form 8949 as well (before, I thought the form was strictly for individuals). As I quickly discovered, copying 500 lines of miniscule data from our crypto software into the pdf template wasn’t going to cut it. As a shortcut, I summed up the data (i.e. proceeds, cost basis and gain/loss) by digital asset, but the end result wasn’t as detailed or “buttoned up” as I would’ve liked.

And so I went down the rabbit hole of python scripts to help me automate this (admittedly lackluster) process. My problem statement was this:

I need a python script that will intake a csv of crypto data split by proceeds, cost basis and gain/loss and “write” this data to an empty Form 8949 pdf template.

I’m obsessed with Perplexity (aka “Perp”), an AI powered search engine that’s a more accurate and user-friendly version of Chat GPT-4. So I tinkered around with search queries and back & forth debugging with this semi-sentient and overly apologetic being (“My apologies for the confusion in the previous response”). This was my first prompt to Perp:

Hi! I need you to help write me some code. I'm looking to upload a preformatted csv with data into Form 8949 for taxes. I have a list of assets, their amount, cost basis, proceeds, gain/loss and long or short term status. I want these rows of data in my csv to be uploaded so that it fills out Form 8949 for me. If the rows in Form 8949 run out, then I want a new Form 8949 to be populated with the remaining data in my csv. Please write this in Python so I can run in vs code. I use a Mac.

Seems simple enough for a supercomputer to tackle on its own, right? 

Unfortunately, the reality of using AI as a tool is much different than the sensational headlines of AI automating all of our jobs. Perplexity and other AIs like it are incredibly powerful and revolutionary, but you still have to know the basics of what query you’re asking, especially if it’s a technical query. For coding queries, you need to understand how programming works so you know what questions to ask the AI. For accounting questions, you have to be debit and credit savvy to spot mistakes in the responses. This utopia of AI automating everything isn’t here (yet), so for now, AI is the perfect jam session partner and guided technical people pleaser.

I took a few Python courses back in 2020, but that knowledge has withered away to make room for payroll accounting and new FASB standards. I was struggling with reinstalling the latest python version and the PypdfForm library - Perplexity was helping, but my programming fundamentals were weak / missing altogether. Memories from my high school AP Computer Science class came flooding back and I was 17 again, grinding my teeth in frustration over a damn nested for loop. Suffice it to say, programming turns me into a little angry gremlin, so I impetuously told Perplexity I was done for the day and rang up my boyfriend, Nikhil (who is a software engineer at Perplexity).

Lucky for me, Nikhil is happiest when he’s programming (nerd!), so we spent all of a Tuesday night starting this program and getting 70% of the way through in about two hours. As an engineer, Nikhil knew exactly what queries to ask Perplexity (pro tip: turn on writing mode for coding) and it returned thorough answers. Perplexity and Github Copilot speeded up the programming process considerably and within hours, we had the shell completed. We spent the following Friday night finishing and refactoring the code and there it was, a beautiful, shiny beacon of logic that erased hours of busywork. Nikhil explained the code as he was writing it, so I have a high level understanding of the functions called and variables used. 

Overall, it was a humbling experience. I thought with an ounce of Python knowledge and an amiable AI model, the software world was my oyster. But we’re not there yet - humans still need mental frameworks and a deep understanding of the why in order to guide our padawan AI friends to the right answer.

Let’s dive into it!

Materials Needed

  • csv file of short term gains & losses
  • csv file of long term gains & losses

In this example, I used a cleaned up download from Cryptio (see the “From Cryptio” tab here). The “short term” and “long term” tabs will need to be exported as separate csv files.

  • Git clone https://github.com/mackaronipatel/8949.git
  • Includes csv_to_8949.py, the main program that will be run

This article assumes the reader has a Macbook OS.

The Code 🤖🌀

The first concept to understand is that pdfs are composed of bite-size interactive elements. When you open a pdf in Finder and hover over the elements, they’ll turn a light shade of blue:

Indices on Form 8949

On the first page of Form 8949, there are a total of 122 interactive elements or “annotations” with the blue shading (this includes the checkboxes in the middle). Each element can be assigned a number or index, starting with an index of 0 for “Name(s) shown on return” and an index of 122 for the “Total” related to Column H. The table itself is also discreet and self-contained - it’s a matrix of 8 columns and 14 rows (112 annotations), plus five sum cells at the bottom.

In coding, there’s a data structure called a “list” that is a collection of various items (i.e. strings, integers, objects) that are connected under a single variable. The basic idea is that our program is looping through this list of annotations (blue boxes) and writing data from the csv files to the boxes.

This begs the question - what if our csv data has more than 14 rows? Let’s say we have 620 rows of short term gains & losses that need to be written to the pdf (as we do in the example data here). We need the code to create enough copies of page 1 of Form 8949 so that all of the data is included.

Enter the first bug! 🐛

Creating copies of pdf pages isn’t as simple as it seems (honestly, the theme for this article). In programming, each piece of data has a “memory address”, which is used to “store and retrieve data in a computer's memory, allowing the CPU to track the location of each memory byte."1 Each element in the pdf has its own unique address and when copies are created, the elements on each “copied” page are carried over. In other words, the new pdf copies have the same memory addresses as the original pdf, which means the same data from the csv is overwritten again and again. Nikhil fixed this by creating “deep copies”2 of the pdf and writing a debugging function to make sure all interactive cells had unique addresses.

{Back to the regularly scheduled “programming”} 📺

Now that we could make independent copies of the pdf, the next question to tackle was “How many copies should we make?” In the process_csv_data function, the code looks at the length of the csv data and computes the number of pages needed:

num_pages = (len(csv_data) // 14) + 1

  • // = integer divide by
  • len grabs the number of rows in the csv_data variable. It divides by 14 since there are 14 rows in the table section of Form 8949 and adds 1 so that the first page is included.

It then creates a shell pdf with the required number of pages and independent cell references by looping through the num_pages variable:

As a side note, the csv data was brought in as a dictionary data type. Per Perplexity, “A dictionary data type is a data structure that stores items in key-value pairs. A key is a unique identifier for an item, and a value is the data associated with that key.” In the Form 8949 data, the keys are the row headers (i.e. ['Asset', 'Date Acquired', 'Date Sold', 'Proceeds', 'Cost Basis', 'Empty1', ' Empty2', 'Gain or Loss']) and the values are the information related to the sends/sells of crypto.

Learning Checkpoint 💡

  • We understand the dimensions of Form 8949 (122 elements) and what a list of annotations is
  • We know how to create independent copies of Form 8949 with unique memory addresses
  • We can calculate how many pages of the shell pdf we need to accommodate all the data in our csv’s (for short and long term gains and losses)
  • We know what a dictionary data type is and how the csv data is introduced to the program

Here comes the exhilarating part - copying in our csv data to the pdf! 🎉 This action is broken down into three distinct functions that fill in different pieces of the pdf:

fill_initial_fields()

This function creates static variables that fill in the “Name shown on return”, “Social Security Number” and checkbox question. We want these variables to be static because they’re the same for every page of the return (for gains/losses from a crypto accounting software, checkbox (c) should be filled in).

fill_table()

This function is critical since it fills in the meat & potatoes of Form 8949: the assets that were sold and their proceeds3, cost basis and gain/loss. The code starts off by looping through the applicable annotations. It starts at index 5 and goes until index -5 (which is 122 - 5 = 117, don’t ask me why programmers do it this way!).

Thefor loop grabs the correct row and column from the csv and updates the current annotation in the pdf to the value.

Brief math lesson in case you’re curious about this line of code:

When the for loop gets started, the i counter is equal to zero. Therefore, 0 // 8 = 0 and 0 % 8 = 0 as well.

The reference is then (0,0) which corresponds to the first cell of data in our example short term dataset (0.035566343927705465 ETH). We integer divide and modulus4 by 8 since there are 8 columns in our table.

When the counter equals 1, the row equals 1 // 8 and the column is 1 % 8 = (0,1). This corresponds to a Date Acquired of 11/8/21. The code loops through all possible values until it reaches the last row of the csv.

The final piece of this function is tabulating the sum totals for the proceeds, cost basis and gain or loss columns. Since the pdf thinks these values are strings (bits of text), the code converts the string to a number so it can be added up across all the columns (the dollar signs and commas have to be axed so this can happen).

fill_sum_fields()

Easy peasy! This function fills in the sum fields in the pdf with the calculation we already performed in the previous function.

The last function, main, is the mega function that laces all of our smaller functions together. Importantly, it defines our csv variables (broken into short.csv and long.csv) and Form 8949 variable (f8949.pdf). These are separate files that need to live within the same folder as our csv_to_8949.py code so everything runs smoothly. All of the functions defined earlier are called and the short and long term pieces of Form 8949 are stitched together into the combined_pdf. The new pdf is published as output_filled, which drops into your existing folder like logical magic.✨

And there you have it! 

Working on this code (or rather, sitting next to Nikhil as he hacked away after dinner) was an enriching and fabulous experience. Not only did my rusty Python knowledge peep its head, but we also built a tool with immediate functionality. This humble program will save at least 30 - 45 minutes of manual work next tax season, which is gold for a number cruncher like myself. And just imagine - GPT-5 will probably be out next spring, so who knows what other programs we’ll be able to build in the future.

Loom video of how to fork the code & run it yourself in your local environment:

Get $10 off Perplexity Pro by following this link.

Special thanks to Nikhil for writing this program for me. I had an amazing time learning from you and watching you problem-solve in real time. Your mind astounds me! 

Disclaimer: this post was not sponsored by Perplexity.

Thanks for reading!

1. Find the definition of memory address here.

2. “A deep copy refers to the process of creating a new object that is a fully independent clone of the original object, including all of its nested objects. This means that any changes made to the new object will not affect the original object, and vice versa. A deep copy is different from a shallow copy, which only creates a new object with references to the original object's elements, meaning that changes made to the new object can affect the original object.” Source.

3. Proceeds = fair market value

4. For more info on the integer division and modulus operations, see here.

Mackenzie Patel

Related Content

BACK TO BLOGBACK TO BLOG