Andy
Software developer, entrepreneur. Currently exploring Blockchain, ML, and CV.

That one time we crowdsourced the price of marijuana

The idea

A few years ago, my friend Cory and I launched a project, subtly named PriceOfWeed.com, to answer some curiosities we had about the real street value of marijuana.

The idea came after watching a National Geographic documentary on marijuana that cited some interesting (but seemingly exorbitant) figures about the price of the plant as it travels across borders - facing different economic and legal statuses from state to state. We realized that nobody really knows the true street price, since the flow of information is nearly nonexistent due to its black-market status.

And so we decided to start an experiment in crowdsourcing this data by simply asking consumers how much they paid. Initially, the idea seemed so stupid it would be more aptly named a highdea; who the hell would voluntarily submit data about what they paid for an illegal product?

Launching

The plan was to create the site with as little investment as possible - a single page with a web form for posting new submissions. We had one call-to-action: "We crowdsource the street value of marijuana from the most accurate source possible: you, the consumer. Help by anonymously submitting data on the latest transaction you've made." It looked something like this:

Initial version of the homepage

Once we had the site working, we launched it by posting on three online communities - Hacker News and two of the major Reddit forums focused on pot (/r/marijuana/ and /r/trees):

Our initial post on reddit

The posts picked up a ton of traction, rocketing to the front page of all three websites; people were definitely interested in the mission and its potential findings. Furthermore, with California's Proposition 19 just around the corner, the timing couldn't have been better.

Most importantly though, users were submitting data. A lot of it! We displayed the submitted data in a simple table:

Showing the data in a table

Good data, bad data

As the rate at which people were submitting data picked up - hundreds, then thousands of entries - so too did the number of bogus prices. Initially, we removed them manually by sorting and deleting outliers directly in the database, but we'd soon need a more scalable solution.

We decided on a simple outlier filter using standard deviations: any data point too far (roughly 2 standard deviations) from the mean is excluded from the data set.

Removing outliers with standard deviations

To calculate the standard deviation of our data set, we use the following formula:

Formula for calculating standard deviation
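
In symbols, this is the population standard deviation - matching the calculation in the code below - where N is the number of submissions, x_i is the i-th submitted price, and \mu is the mean price:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$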

To implement this in PHP, we did the following:

// fetch the submission set for the requested location

$submissions = Submissions::find(array(
    'country' => $country,
    'region' => $region,
    'city' => $city
));

// calculate the sum and mean of the submitted prices

$count = count($submissions);

$sum_prices = array_reduce($submissions, function($carry, $submission) {
    return $carry + $submission['price'];
}, 0);

$mean_price = $sum_prices / $count;

// calculate the (population) standard deviation

$sum_of_squared_differences = array_reduce($submissions, function($carry, $submission) use ($mean_price) {
    return $carry + pow($submission['price'] - $mean_price, 2);
}, 0);

$std_deviation = sqrt($sum_of_squared_differences / $count);

// remove outliers: keep only submissions within 2 standard deviations of the mean

$filtered_submissions = array_filter($submissions, function($submission) use ($mean_price, $std_deviation) {
    return abs($submission['price'] - $mean_price) < 2 * $std_deviation;
});

This cleaned up the data a lot. With outliers no longer skewing the averages, the numbers appeared much more accurate.

Mapping out the data

In just a few days we had data for all 50 US states and all 10 Canadian provinces. The site would eventually collect enough data for Europe, Australia, and even city-level statistics.

The logical next step was to visualize all this data. We plotted the data points on top of a map using Google's API. Green pins for cheap, red for expensive.

Mapping the data using Google Maps API
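
To give a feel for the server side, here's a hypothetical sketch (the endpoint, field names, and sample numbers are all assumptions, not the real production code) that emits pin data as JSON for the Maps front end, coloring each pin by how a city's average compares to the overall mean:

// hypothetical pins endpoint - names and numbers are illustrative only

$cities = array(
    array('city' => 'Toronto',  'lat' => 43.65, 'lng' => -79.38, 'avg_price' => 180),
    array('city' => 'New York', 'lat' => 40.71, 'lng' => -74.01, 'avg_price' => 390)
);

$overall_mean = array_sum(array_column($cities, 'avg_price')) / count($cities);

$pins = array_map(function($city) use ($overall_mean) {
    return array(
        'lat'   => $city['lat'],
        'lng'   => $city['lng'],
        'label' => $city['city'] . ': $' . $city['avg_price'] . '/oz',
        // green pin when cheaper than the overall mean, red when pricier
        'color' => $city['avg_price'] < $overall_mean ? 'green' : 'red'
    );
}, $cities);

header('Content-Type: application/json');
echo json_encode($pins);

The front end would then drop a google.maps.Marker at each pin's coordinates with the matching icon color.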

Immediately, we noticed some obvious trends:

For example, the price difference between Southern Ontario and New York - only a few hours' drive apart - is over $200 per ounce! Does this reveal some sort of arbitrage opportunity?

Adding social metrics

Our (somewhat obvious) hypothesis was that regional prices increased with the legal and social hostility towards the drug.

Although data on the legal status of pot in different regions could be found online, it didn't tell you much about how heavily the law was enforced, and certainly nothing about the general public's attitudes towards it.

So again, we found ourselves crowdsourcing this data. We decided to add two new metrics - "Social Acceptance" and "Law Enforcement". To avoid cluttering the page and taking away from the main goal of the site, we added these as a secondary poll shown on the landing page once a user had submitted a price.

Asking for social data
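
As a rough sketch (the post doesn't describe the actual schema, so the model and field names here are assumptions, mirroring the Submissions model above), the poll answers could be stored alongside each price submission like so:

// hypothetical handler for the secondary poll - names are assumptions

$rating = array(
    'submission_id'     => $submission_id,
    // answers on a 1-5 scale from the two poll questions
    'social_acceptance' => (int) $_POST['social_acceptance'],
    'law_enforcement'   => (int) $_POST['law_enforcement']
);

Ratings::save($rating);

Tying each rating to a submission's region lets the two metrics be averaged per region, just like the prices.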

Blowing up in the press and opening up our data

"It's either anonymous, or an ingeniously devious DEA sting operation" - LA Weekly

We began receiving quite a bit of traffic, driven by coverage from many of the major news outlets, including the front page of TIME, Forbes, FOX, ABC, CBS, etc. A beautiful full-page infographic also appeared in the September 2011 issue of WIRED magazine.

In the Sept. 2011 issue of WIRED magazine

We also received a ton of requests from professors, researchers, students, hobbyists, etc. asking for access to the raw data for their studies or personal interests. Excited about the possibility of awesome projects built on top of ours, we began actively distributing raw database dumps, with plans for an open API. A few highlights of how people have made use of the data:
