About Petra

Hi. I am a Computer Science student at the University of Victoria in BC, Canada. As of May 2013 I have finished my third year and in September I'll be starting my fourth. I am also acting President for the Women in Engineering and Computer Science (WECS) group at my school. Check out my fan page @ bandcamp.com/petraclementson and follow some really amazing music on Bandcamp!

Thanks Bandcamp


And that’s it. My internship at Bandcamp is officially over. I like the idea of summarizing my summer as simply adding an asterisk to a CSV file, but it was obviously much more than that. I really could write pages about how much I enjoyed this experience, but I’d rather just finish with a story about Swedish Fish.

powers

There happens to be a lovely deli in Vermont where they still support the idea of single candy purchases. Obviously as an attraction for those who like to hearken back to the ‘good old days’ rather than out of any serious motivation to make a profit. Among the old fashioned treats there is a huge jar of Swedish Fish. Given its novelty status, I was surprised to hear from an employee about the many responsibilities associated with this jar of candy. For something that appears to not have a significant impact on the store’s success, she explained that it required a significant amount of maintenance. Staff had to ensure that special orders were made, and that the fish were rotated, and checked regularly to make sure they didn’t get stale. All for something that would boost morale within the community, and please the tourists and the grandparents taking their grandkids out for a treat. And all of a sudden I found myself feeling a sense of respect and admiration for that jar of Swedish Fish.

It makes me think about Bandcamp.

Bandcamp is like a jar of Swedish Fish, it evokes similar feelings through its emphasis on serving the happiness of others and building community. Most especially where the connection rises, though, is through my observations of and interactions with the Bandcamp team. Now although the story suggests that people don’t necessarily notice the hard work behind a service such as Bandcamp, I am not suggesting that BC goes unappreciated. What I am saying is that I have become Bandcamp’s number 1 fan through witnessing their hard work first hand. Bandcamp does get a lot of praise via fans, users and bloggers, and I can assure you from an insider’s perspective that it is all extremely well deserved.

I will forever be proud to rightfully have ‘this one time at Bandcamp’ as part of my vocabulary. Thanks again Bandcamp, there certainly is no better way to spend a summer.

Auf Wiedersehen

Finishing up


As promised, here are some of the results from 2 weeks of observing completing IPNs. Below is a nice looking figure that shows when pending transactions are completed per day based on the date the transaction was initiated. Notably, there is a huge peak on the first day, but also on the third and fifth days, which is the time indicated on the PayPal website for the completing of echeck payments (one of our significant pending transactions).

part1

You can also see this trend in the next two plots where the major peaks on the total pending transactions completing line up with the peaks on only echeck payments completing. Also the significance of echeck payments completing is shown via the units in plot 3, which represent the percent of echeck payments that completed out of the total each day represented in plot 2.

part2

part3

Look at how many of the pending transactions are completed with IPNs! Two weeks seems to be an appropriate upper limit for wait time after the pending transaction was created.

completing

Obviously our efforts to obtain the IPN to complete payments are not fruitless.

So next we actually want to do something with the IPN that’s not simply looking at it. So back to the initial goal of updating the fee amount, our next step would be recording the updated fee in the tables. We could easily just write a table insert for IPNs that looked something like

UPDATE paypal_auths 
SET paypal_fee=#{fee_amt} 
WHERE paypal_transaction_id=#{txn_id}

but this isn’t a completely effective way to do it because we will never know which transactions were pending or which ones are expected to update. To see this we’ve added two columns to the table, pending reason and date completed. We need to update the table to reflect this, and normally you could just make a change like this:

ALTER TABLE paypal_auths
ADD (pending_reason CHAR(1) CHARACTER SET ascii, date_completed DATETIME);

using a preset value to distinguish what single characters represent which kind of pending reason (defined below) so we only need to use one char per row for the pending reason.

ADDRESS - a
AUTHORIZATION - z
ECHECK - e
INTL - i
MULTI_CURRENCY - c
ORDER - o
PAYMENTREVIEW - r
UNILATERAL - u
UPGRADE - y
VERIFY - v
OTHER - x

Bandcamp has support in code for migrations to the table. A schema is a snapshot of the specs of a table. A new schema is made every time there is a change to the table and the tables are then diff’d to the new layout. This is useful because each engineer has their own database to update and being able to simply update your code with the most recent schema and restart the app to update your table is very efficient.

Once the table is ready to be updated, we need to properly process the IPNs by implementing the specs described in What is an IPN? And that completed looks like this inside the IPN listener:

query_string = request.body.string

# Contact PP again to verify the IPN
verify = PayPalApi.do_notify_validate(query_string) # makes an api call to paypal
if !verify.has_key?("VERIFIED")
    if !verify.has_key("INVALID")
        BC::Log.warn("This IPN is not verified: #{verify}.  txn_id=#{txn_id}")
        respond('Not Found', 404)
    end
    BC::Log.warn("This IPN is INVALID. txn_id=#{txn_id}")
    respond('OK', 200)
end

if request["payment_status"] != "Completed"
    respond('OK', 200)
end

q_txn_id = SQL.escape(txn_id)
bc_transaction = SQL.query("SELECT seller_paypal_email, sub_total, currency, 
                                   paypal_pending_reason, paypal_pending_completed_date
                            FROM   paypal_auths WHERE paypal_transaction_id = #{q_txn_id}").first

if !bc_transaction
    respond('OK', 200)
else
    if request["receiver_email"] != bc_transaction.seller_paypal_email
        respond('OK', 200)
    end

    if request["mc_gross"].to_f != bc_transaction.sub_total.to_f
        respond('OK', 200)
    end

    if request["mc_currency"] != bc_transaction.currency
        respond('OK', 200)
    end

    if !bc_transaction.paypal_pending_completed_date && bc_transaction.paypal_pending_reason
        # Update paypal_auth
        fee_amt = request.params["mc_fee"]
        q_fee_amt = SQL.escape(fee_amt)
        sql = "UPDATE paypal_auths 
               SET completed_date = UTC_TIMESTAMP(), fee_amt = #{q_fee_amt}
               WHERE paypal_transaction_id=#{q_txn_id}"
        SQL.update(sql)
    end
end

Success! We are now inserting the newly acquired fee amount via IPNs retrieved for pending payments. We can even see those changes in the CSVs! This is great, but is that it?

We also want to indicate in the band’s sales report where the fee is expected to be returned. Right now an empty fee_amt cell means the item is part of a cart of items and 0 means the fee could be pending, or is really 0. So we want to add a marker to note that the fee is pending and expected to be filled in eventually, and have 0 only mean there is no fee. This marker is going to be an asterisk that will be added to the CSV upon download by using this logic:

if item[:pending_reason] && !item[:completed_date] && item[:fee_amt].to_f == 0.0
    item[:fee_amt] = '*'
end

If you have a band, you will see this change by downloading your sales and importing them into a spreadsheet program. It should look something like this:

with_asterisk

And 14 weeks later, it looks like that’s it, my final project is completed.

Receiving IPNs

(About time I posted some Coeur de pirate.)


At the end of the last post, I had found the information about IPNs and catching them. This post will be more about observing the execution of the plan developed from the last two weeks.

This week the final product is in sight! We set up receiving IPNs over the weekend and caught a good amount. That’s good news but I needed to figure out if any of these IPNs were completed pending payments. So I looked at all the transaction IDs of the pending transactions to see if I could find them with a ‘Completed’ status at a later date/time. It became obvious that grep would be the right tool for me so I started reading the man page learning some important features like
-c which counts all the instances of your match. I started thinking about how I could use regular expressions within the grep command to select a specific section of the line search. I started with something simple by printing just the pending transactions:

$ grep 'payment_status=Pending' log

Since the message is 1 long string, we print the whole transaction when we search like this. It would be better to only print out the transaction ID since that is all that we need to identify the transaction if it becomes completed. So with a bit more research of the man page, this should do that:

grep -Eo 'txn_id=[^[&]*]*\&'

The –Eo functionality is a combination of E, which interprets the pattern as an extended regular expression, and o, which shows only the part of a matching line that matches the pattern. Then we can combine the commands by piping one to the other from less, which can understand zip files:

$ less log | grep 'payment_status=Pending' | grep -Eo 'txn_id=[^[&]*]*\&'

Ok, so I have a list of transaction IDs, probably hundreds of them too, which will look something like this:

txn_id=39RKW48294IR823&
txn_id=KF7EI393U2R849F&
txn_id=E23J393093OW93H&
txn_id=38DI2O4FJ8JF822&
…

What would be ideal is if I could strip all of these down even further to just get the transaction ID without the ‘txn_id=’ and ‘&’. This took a fair bit of interneting to find the sed command, a text stream editor, that did the trick. And dun da da dah! The whole thing:

$ less log | grep 'payment_status=Pending' | grep -Eo 'txn_id=[^[&]*]*\&' | sed 's/txn_id=//;s/&//'

Now I can use the exact same command to find the completed payments by greping with ‘payment_status=Completed’ and then compare the two lists: pending transaction ids to completed transaction ids. Since I don’t have write privileges on the webapp where I was reading the logs, I copied the two types of lists to my machine and created a Python function to read two files and print out the ones that matched. Only about 10% of the pending transactions had completed. I think this is probably OK to see such a low number this far into the game because I’m sure we will see many more complete as time progresses. In fact, here is a sneak peak of what the trend of progression will look like over the next two weeks:

blog8

This is showing an interesting flat line at ~50% of transactions that begin as pending and never complete. We’ll see how this changes after another week of data. This is also giving us some foresight into what the average wait time is for the vast majority of pending transactions to complete.

Also notable is the kinds of pending transactions that were completing since we noticed before that the reasons for the most pending transactions are either unilateral (the seller’s email has not been verified) or echeck payments. Although over the weekend we mostly received unilateral pending payments, some other, some multi_currency, and some verify, none were echeck! Explicit definitions of all of these reasons can be found on the PayPal website. Although it is good that we are seeing payments completing, it is important to know why we aren’t seeing any echeck payments completing. It’s possible that the delay may be caused by the functioning hours of finantial institutions and mostly looking at data receiving over the weekend. And, according to the PayPal website, it usually takes between 3 and 5 business days for an eCheck to clear and the money to appear in the recipient’s PayPal account.

After a few days of viewing our very own IPNs, on Wednesday I finally found my first echeck payment completed. A monumental day indeed! The rest of the week I continued to look at how many payments had completed and approximately how many days it took to complete pending transactions.

After a bit more data I’ll post some interesting stats, in particular some graphs because pictures are way more fun.

What is an IPN?

(One of my favorite artists of all time!)


Week 2

Last week I learned that 0 values in the sales records are caused by a pending transaction that can be solved by retrieving Instant Payment Notifications from PayPal. This week I will be looking into how to receive IPN notifications from PayPal.

An IPN is an asynchronous message that notifies the host site about events occurring that were not initiated by the site itself. You can detect IPN messages with a method called a ‘listener’. When the listener is hit, the data from the message is similar to the feedback given from any API call. Since messages can become lost or delayed, PayPal periodically sends the message until our site responds that we have successfully received the message. According to the PayPal website, messages may be resent up to 4 days after the original message and they look something like this:

mc_gross=19.95&protection_eligibility=Eligible&address_status=confirmed&
payer_id=LPLWNMTBWMFAY&tax=0.00&address_street=1+Main+St&
payment_date=20%3A12%3A59+Jan+13%2C+2009+PST&payment_status=Completed&
charset=windows-1252&address_zip=95131&first_name=Test&mc_fee=0.88&
address_country_code=US&address_name=Test+User&notify_version=2.6&
custom=&payer_status=verified&address_country=United+States&
address_city=San+Jose&quantity=1&
verify_sign=AtkOfCXbDm2hu0ZELryHFjY-Vb7PAUvS6nMXgysbElEn9v-1XcmSoGtf&
payer_email=gpmac_1231902590_per%40paypal.com&txn_id=61E67681CH3238416&
payment_type=instant&last_name=User&address_state=CA&
receiver_email=gpmac_1231902686_biz%40paypal.com&payment_fee=0.88&
receiver_id=S8XGHLYDW9T3S&txn_type=express_checkout&item_name=&
mc_currency=USD&item_number=&residence_country=US&test_ipn=1&
handling_amount=0.00&transaction_subject=&payment_gross=19.95&
shipping=0.00

In order to receive these messages, there must be a url endpoint that PayPal can ping us with. The design of the url looks something like this https://bandcamp.com/ipn_endpoint, where ipn_endpoint is the same name given to the listener. Once we have the message, we need to use Ruby to read it somehow, that’s where the listener method comes in. One of the engineers had started looking into retrieving IPNs so I took a look at his old code. With a little bit of poking around, I found that there is checkout-ipn.rb which looks like it works as our listener and is what interprets the message sent to the HTTP endpoint. Interestingly, the listener was using a mutex for the case where two messages are sent at the same time with the same transaction ID.

The mutex is a clever way to avoid race conditions and similar problems by preventing processes or threads from occurring at the same time while they are sharing resources. From the time when the IPN is beginning to be processed to the instance where it is inserted into the table there should be a lock on all other messages that contain the same transaction ID, so we don’t find ourselves with a table insertion error since all transaction ids must be unique. It was interesting to see a high level implementation of a mutex when my experience with them mainly consisted of drawing circuits like these for my Digital Logic course:

mutex

As a side note, this was one of my favorite classes that I have taken at university so if you have the opportunity I highly recommend taking a digital logic course. Once you see the convenience of drawing karnaugh maps there’s no going back!

While investigating the implementation of a mutex, I noticed that the comments also referred to the mutex script as a semaphore. From what I’ve been able to gather, a semaphore can have a similar purpose as a mutex, but there is also much more that you can do with it. A mutex involves locking a shared resource for only one process to access at a time, and is released once the process is done using it. The semaphore can also do this, but uses a signal to let other processes know that a certain resource is being used. With this method you can control multiple resources that are being accessed by multiple processes without any collisions. I think technically in our case the design would be a mutex because we are locking and releasing the process as it accesses our one resource, the paypal_auth table.

theMoreYouKnow

The design for storing the IPN message before was that there would be a separate table with a few attributes, but I don’t think this is necessary to solve my problem because I only want to update the PayPal fee. So using mutex won’t be necessary since I’m simply going to be altering the paypal_auths table and won’t be running into insertion errors. By design, we will know if the table has already been updated for a pending transaction or not by adding two columns to the paypal_auths table: the payment pending reason and the date the transaction completed. Although, I’ll have to figure out if it’s a problem if a row in the table is altered more than once in the unlikely case where two threads will update the same row in the table if we happen to receive to IPNs simultaneously with the same transaction ID. It should be the case that the IPNs with the same transaction IDs that make it though all the filters are exactly the same message and this shouldn’t change the information in the table. Although if we allow this, we will be allowing redundancy by occasionally letting some IPN messages update the table with information that is already there.

Since this is more detail than is needed for this week, let’s get back to simply receiving the IPN. In order to really understand how we can see the message the URL and listener are receiving, I needed to read up about the web framework that Bandcamp uses. Instead of the ever popular Ruby on Rails, Bandcamp has decided to go for Ramaze, which has a reputation of being a leaner framework for Ruby.

The listener can use the instance method off Ramaze called request to store the message in a struct that can be easily searched. Once we have received the message we need to let PayPal know that we got it so they can stop sending the message (Also to make sure that the message was sent from PayPal). To do this, we make an API call back to PayPal with the exact message they sent us with this prefix: ‘cmd=_notify-validate’. PayPal should respond with either ‘VERIFIED’ or ‘INVALID’ and we should only process the payments that are verified.

Once the above is set up, we are ready to receive the IPN messages, but we still need to tell PayPal that we want them. There are two ways to do this: 1) specify our listener’s url in the IPN’s preferences on the PayPal website, or 2) Request the notification during the DoExpressCheckoutPayment API call. We have decided to use the later and dynamically request the IPNs so that our site can have more control over the requests.

Further checks we will need to do on the received IPNs:

  1. Make sure we have not already processed the transaction ID.
  2. Verify the transaction is ours by checking the email against our records.
  3. Only update our records with ‘completed’ payment types.
  4. Confirm currency and payment amount.

Week 3

All I need to do in order to complete this week’s goal is force my branch to only process messages through the Sandbox, take one of the webapps out of rotation and restart it with my branch of code, have this written as the listener:

class MainController
	def ppipn
		query_string = request.body.string
		BC::Log.debug("This is the IPN message: #{query_string}")
	end
end

and make sure this code is called from somewhere within our source code.

Since we’re making the requests for the IPN dynamic, we are actually asking for them when we send the DoExpressCheckoutPayment API by filling the field PAYMENTREQUEST_0_NOTIFYURL. To have more control we created a config feature that gives us the ability to turn on and off this request and also to allow only specific web servers to be able to send requests. So we can start testing by only requesting IPNs for DoExpressCheckoutPayment APIs that are requested on the webapp that is offline and is running with my code. But before we start allowing real IPNs through, I need to send some fake ones using the PayPal IPN simulator.

After sending the first one I immediately found an error in the log.

ESC[32m[07-11 10:55:20.072 23100 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:20.511 85240 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:22.650 29020 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:22.655 99000 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:23.269 54920 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:23.523 96420 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:25.460 13620 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:25.598 57060 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:26.309 13300 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:26.433 77960 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 127.0.0.1: /check?; host: bandcamp.com; referrer: none; client id: none
ESC[32m[07-11 10:55:27.995 24660 ramaze/log/hub.rb:34] ESC[0mESC[32mINFOESC[0m  Dynamic request from 173.0.82.126: /ppipn; host: bandcamp.com; referrer: none; client id: none
ESC[31m[07-11 10:55:27.997 24660 ramaze/log/hub.rb:34] ESC[0mESC[31mERRORESC[0m undefined method `Log' for BC:Module in: #
ESC[31m[07-11 10:55:27.997 24660 ramaze/log/hub.rb:34] ESC[0mESC[31mERRORESC[0m #
ESC[31m[07-11 10:55:27.998 24660 ramaze/log/hub.rb:34] ESC[0mESC[31mERRORESC[0m ./src/cart/checkout-ipn.rb:19:in `ppipn'

Yikes! But it’s actually not that bad because I was simply referencing the ‘Log’ class as a method like this BC.Log.debug, instead of as an embedded class, like this BC::Log.debug. Whoops.

I sent a couple more test IPNs and I eventually deemed my code ready to receive real IPNs. So first I turned them on using the config feature for one webabb, received about 7 per minute and decided it was safe to receive all the IPNs from each webapp and log them on the one I was using for testing. Now all I have to do is watch the log and check out what we get.

All right, so we’ll receive IPNs over the weekend and fish for them on Monday!

Nulls and Zeros

(Discover some really great music by listening to the Bandcamp weekly!)


Although we know that pending transactions will not have an associated fee value, we are a little uncertain about what a pending transaction will look like in the sales records displayed for a band. Currently we get three categories of fees: ‘0’, NULL, or some float value. So the main question is what does it mean when there is a zero or null value in the table? In the database there are many instances of both null and zero fee values:

+---------+--------+
| fee_amt | count% |
+---------+--------+
|    NULL |   33.7 |
|       0 |    2.6 |
+---------+--------+

What we want to know is if this reflects the values we are seeing in the exported CSV files.
Well, first off, what does it mean to have a value inserted into the paypal_auths table where the fee value is stored? Everytime a payment is initiated and information is recorded, then an entry must be made in the table. This means that if someone puts an item in their cart, clicks ‘checkout’, and enters their personal information, there will be a record of that before the actual payment is processed. At least in cases where the transaction is terminated after this initial step, no PayPal fee will have been calculated because no purchase will have been made. In order to identify these kinds of payments I looked at the types of status values that were assigned to each payment:

select status, count(*) from paypal_auths group by status;
+----------+--------+
| status   | count% |
+----------+--------+
| checkout |    .09 |
| declined |   1.78 |
| error    |    .48 |
| finish   |    .01 |
| ignore   |    .02 |
| paid     |  66.46 |
| refund   |    .02 |
| start    |  31.15 |
+----------+--------+

Turns out these are values that are assigned to the transaction depending on how far through the code the transaction has hit. So we have 8 different statuses for payments. The start, checkout, finish and paid statuses represent stages of the payment with respect to the site’s interaction with PayPal and are assigned in the code. The declined and error statuses are given to incomplete payments for errors to do with the PayPal side of things, while the refund and ignore statuses, used sparingly, are manually set by BC to payments that should not show up as paid.
There is also a status called payer_paypal_status with three options: verified, unverified, and NULL. I was interested what the two types of statuses have in common. What stands out is that where the PayPal status is verified, there are no ‘start’ statuses. So in order for a payment to become verified it must have at least been initiated by the Bandcamp site and sent to and returned from the PayPal site before executing the payment. Good to know, but not really that helpful for us. Since the PayPal statuses are less detailed I decided to focus my attention on the null and zero values with respect to the Bandcamp set status. First, looking at the status options for payments that had 0-valued fees,

SELECT status, COUNT(*) 
FROM paypal_auths 
WHERE fee_amt=0 
GROUP BY status;
+--------+---------+
| status | count%  |
+--------+---------+
| error  |    .003 |
| ignore |    .07  |
| paid   |  99.55  |
| refund |    .38  |
+--------+---------+

then null-valued fees.

SELECT status, COUNT(*) 
FROM paypal_auths 
WHERE fee_amt IS NULL 
GROUP BY status;
+----------+---------+
| status   | count%  |
+----------+---------+
| checkout |    .29  |
| declined |   5.31  |
| error    |   1.42  |
| finish   |    .02  |
| paid     |    .002 |
| refund   |    .002 |
| start    |  92.95  |
+----------+---------+

A bit more interesting, so I looked at each status and how they were associated with fees based on three different categories: 0, null, or other fee values. In fact, in several states there was never a fee amount that wasn’t 0 or null. For example, ‘checkout’, ‘declined’, ‘finish’ and ‘start’ only ever have a fee of null. Those with ‘error’ status can either be zero or null, but more notable is that the only statuses that can have a value which isn’t null or zero are ‘ignore’, ‘refund’, and ‘paid’. (Promise I’m going somewhere with this.) These are important because ‘paid’ is the final stage of the transaction where you would expect the fee to be filled in and ‘ignore’ and ‘refund’ are the statuses that are changed manually by Bandcamp employees in unusual and uncommon circumstances. Hizzah! Things are starting to make sense.

Screen Shot 2013-07-16 at 2.58.54 PM

We’re about half way there. Here is a bit of comic relief for your patience. (referencing the xkcd comic I linked in ‘The Race Condition’)

Next step: how does this relate to the values seen in the CSV records? Well, from above, lots of paid transactions have fee values of zero (205267), but very few are null (44). Though in the CSV files it would seem that there are quite a few nulls, they even seem to appear more frequently than zero values. Ideally there should only be ‘paid’ transactions in the sales report, so what’s going on there?

Fairly randomly selecting attributes to look at, I queried for the countries that the paid transactions with null fee values were coming from. That didn’t show anything unusual, so I looked at what types of purchases were being made and both digital and merchandise orders were culprits, so also not unusual. Though considering there are millions of transaction records in the database, it is fairly unlikely that the 44 null values with a paid status represent the null values in the bands sale records so investigating them further would surely be fruitless.

I started looking at specific column values for any null fee values in the table. I ended up querying the payment types and payment date, which, apart from one entry, were both always null when the fee value was null. Looks like the null payment fees come from dud payments!

SELECT paypal_payment_type, paypal_payment_date, COUNT(*) 
FROM paypal_auths 
WHERE fee_amt IS NULL 
GROUP BY paypal_payment_type, paypal_payment_date;
+---------------------+---------------------+--------+
| paypal_payment_type | paypal_payment_date | count% |
+---------------------+---------------------+--------+
| NULL                | NULL                |    100 |
+---------------------+---------------------+--------+

Alright, now we’re getting somewhere. So null fee values are essentially incomplete transactions that will never be seen in the UI. This means that only 0 and float valued fee entries will show up in the table. It would appear that the problem is somewhere in the translation between the database tables to the CSV. So I went to investigate specific examples from the CSV where the fee value was null. Perhaps the missing fee value has something to do with inserting the values into the CSV. Using an email from a band’s sales record and the date of the transaction I searched the table:

SELECT fee_amt, paypal_payment_type, payment_date 
FROM paypal_auths 
WHERE payer_paypal_email='me@bandcamp.com' 
    AND status='paid' 
    AND payment_date='2012-03-04 22:30:26';
+---------+---------------------+---------------------+
| fee_amt | paypal_payment_type | payment_date        |
+---------+---------------------+---------------------+
|    0.36 | instant             | 2012-03-04 22:30:26 |
+---------+---------------------+---------------------+

A fee does exist! So I checked another email address where the fee was null. A fee value also existed for that buyer. Though, looking at instances where the fee value was zero in the spreadsheet, there was also a value of zero in the table. So the zero value made sense but the null value didn’t yet. Also notable from these investigations was that of the two types of payments, instant and echeck, all echeck payments had a fee value of 0. This means that once we start receiving the IPN, then those 0 entries should slowly turn into accurate values. But what about the null values we were seeing?

Let’s check out some more null payments on the CSV. While blankly staring at an excel spreadsheet of existing sales records, I started noticing that whenever there was a group of null fee value table entries, the same buyer email was listed for each. This meant that someone who was making a lot of purchases was not receiving the fee amount for many of the purchases. WOOAAAH, I think I know why the PayPal fee doesn’t show up! All the purchases were made at the exact same time, indicating that only one payment was made. Upon checking the sales records again it looked like a fee value was shown in the correct field for only one of the purchases made during the single payment. This should be consistent with all purchases of 1 or more items since Bandcamp only lets you make a payment with items from the same artist if multiple items are in your cart. This means that the Null value in the sales records isn’t really a value at all, but more of a placement filler for large order purchases. See an example below.

Screen Shot 2013-07-16 at 9.14.33 AM

We probably won’t change the null values in the sales records because technically all the information is there, but we may add some notes in the Bandcamp faq. Otherwise the 0 values will simply be the placement holders for pending transactions until they are updated with the information via IPNs. So next step, how do we go about receiving IPNs?

Milestones

(Good, local music.)


Time to investigate this problem with the PayPal fees! First question is where to get started. My first thought is that I need to understand what the tables look like and figure out how Bandcamp currently processes the fee.

Usually when I start a project in school, I open a new file and begin writing code before I even properly understand what my objective is. Poor practice, I know, but with only my future self to worry about it’s hard to resist writing code straight away. This simply isn’t an option when working with an existing codebase for a company like Bandcamp because everything is integrated. So, I need to begin each project by learning a lot and figuring out exactly what I’m doing. When I first started looking into this problem there were a few questions that were getting me started with my research. Though, upon further investigation I think I came up with more questions than answers. I began with figuring out what the data that needed processing looked like.

When we get the PayPal fee from DoExpressCheckoutPayment, we store the information in a table called paypal_auths (see below). This table doesn’t necessarily store every detail about the transaction, but it does store a payment status and transaction ID. In fact, it stores 2 statuses, one simply called status that Bandcamp assigns to the payment and one called payer_paypal_status which PayPal provides. I will need to query the tables further to really understand what those mean. The transaction ID is useful because when PayPal sends you the IPN (instant payment notification) after the transaction has completed, you can use that id to update the table. We know that the PayPal fee comes from this magical IPN signal, but I don’t know how to receive it or what it looks like. Apparently one of Bandcamp’s engineers, Robbie, had already started the movement towards grabbing the IPN notifications, so there is a collection of code somewhere that I need to find that will eventually (with my help) get the fee information for a transaction that was previously pending. Just to get an idea of what the data we store looks like in paypal_auths, here is what a table describe spits out:

+-------------------------+------------------+------+-----+---------+-------+
| Field                   | Type             | Null | Key | Default | Extra |
+-------------------------+------------------+------+-----+---------+-------+
| id                      | int(10) unsigned | NO   | PRI | NULL    |       |
| status                  | varchar(20)      | YES  |     | NULL    |       |
| checkout_date           | datetime         | YES  | MUL | NULL    |       |
| order_total             | float            | YES  |     | NULL    |       |
| item_name               | varchar(127)     | YES  |     | NULL    |       |
| item_desc               | varchar(127)     | YES  |     | NULL    |       |
| is_set_price            | tinyint(4)       | YES  |     | 1       |       |
| payer_firstname         | varchar(40)      | YES  |     | NULL    |       |
| payer_lastname          | varchar(40)      | YES  |     | NULL    |       |
| payer_business          | varchar(127)     | YES  |     | NULL    |       |
| payer_paypal_id         | varchar(13)      | YES  | MUL | NULL    |       |
| payer_paypal_email      | varchar(127)     | YES  | MUL | NULL    |       |
| payer_paypal_status     | varchar(10)      | YES  |     | NULL    |       |
| payer_countrycode       | varchar(2)       | YES  |     | NULL    |       |
| paypal_token            | varchar(20)      | YES  | MUL | NULL    |       |
| paypal_correlation_id   | varchar(13)      | YES  |     | NULL    |       |
| paypal_api_version      | varchar(8)       | YES  |     | NULL    |       |
| paypal_build            | varchar(12)      | YES  |     | NULL    |       |
| ip_address              | varchar(39)      | YES  |     | NULL    |       |
| payment_date            | datetime         | YES  | MUL | NULL    |       |
| fee_amt                 | float            | YES  |     | NULL    |       |
| paypal_transaction_id   | varchar(19)      | YES  | UNI | NULL    |       |
| paypal_payment_date     | datetime         | YES  |     | NULL    |       |
| paypal_payment_type     | varchar(7)       | YES  |     | NULL    |       |
| currency                | varchar(3)       | YES  | MUL | USD     |       |
| usd_total               | float            | YES  |     | NULL    |       |
| quantity                | int(11)          | YES  |     | NULL    |       |
| sub_total               | float            | YES  |     | NULL    |       |
| shipping                | float            | YES  |     | NULL    |       |
| tax                     | float            | YES  |     | NULL    |       |
| payer_shiptoname        | varchar(60)      | YES  |     | NULL    |       |
| payer_shiptostreet      | varchar(150)     | YES  |     | NULL    |       |
| payer_shiptostreet2     | varchar(150)     | YES  |     | NULL    |       |
| payer_shiptocity        | varchar(60)      | YES  |     | NULL    |       |
| payer_shiptostate       | varchar(60)      | YES  |     | NULL    |       |
| payer_shiptozip         | varchar(20)      | YES  |     | NULL    |       |
| payer_shiptocountrycode | varchar(2)       | YES  |     | NULL    |       |
| payer_shiptocountryname | varchar(60)      | YES  |     | NULL    |       |
| payer_note              | varchar(255)     | YES  |     | NULL    |       |
| ship_date               | datetime         | YES  | MUL | NULL    |       |
| email_date              | datetime         | YES  |     | NULL    |       |
| ship_notes              | text             | YES  |     | NULL    |       |
| payer_shiptophonenum    | varchar(30)      | YES  |     | NULL    |       |
| item_price              | float            | YES  |     | NULL    |       |
| discount_code_id        | int(10) unsigned | YES  | MUL | NULL    |       |
| seller_paypal_email     | varchar(127)     | YES  |     | NULL    |       |
| paypal_error_code       | varchar(12)      | YES  |     | NULL    |       |
| item_detail             | varchar(50)      | YES  |     | NULL    |       |
| band_id                 | int(10) unsigned | YES  | MUL | NULL    |       |
| account_id              | int(10) unsigned | YES  | MUL | NULL    |       |
| share                   | int(10) unsigned | YES  |     | 0       |       |
| item_title              | varchar(300)     | YES  |     | NULL    |       |
| item_title2             | varchar(300)     | YES  |     | NULL    |       |
| item_credit             | varchar(100)     | YES  |     | NULL    |       |
| share_balance           | int(11)          | YES  |     | 0       |       |
| currency_rate           | float            | YES  |     | NULL    |       |
| fan_receipt_date        | datetime         | YES  |     | NULL    |       |
| band_receipt_date       | datetime         | YES  |     | NULL    |       |
| preorder                | tinyint(4)       | YES  | MUL | NULL    |       |
| release_receipt_date    | datetime         | YES  |     | NULL    |       |
+-------------------------+------------------+------+-----+---------+-------+

And looking into the statuses mentions above, there seem to be 8 status options and 3 payer_paypal_status options by executing a couple simple queries:

+----------+----------+
| status   | count(*) |
+----------+----------+
| checkout |     7793 |
| declined |   145313 |
| error    |    38976 |
| finish   |      517 |
| ignore   |     1424 |
| paid     |  5422767 |
| refund   |     1616 |
| start    |  2539145 |
+----------+----------+
 
+---------------------+----------+
| payer_paypal_status | count(*) |
+---------------------+----------+
| NULL                |  2561365 |
| unverified          |  2734676 |
| verified            |  2861602 |
+---------------------+----------+

From a look at the source code it seems like the status should only be ‘paid’ if the transaction is complete. Although there are payments with both null and zero values that are indicated as ‘paid’. Also, looking at the attributes of the table paypal_auths, there doesn’t seem to be any obvious indication when a payment is pending. It would be useful to know when that is the case.

After this initial investigation into changing the PayPal fee, I hadn’t even started thinking about writing any code and it had become obvious that this task was going to be a lot more work than that of my first bug. So, Joe decided it would be a good idea to flush out a proper plan and put it up on Bandcamp’s wiki page. I came up with a list of steps that I thought needed to be done in order to complete this task. Then Joe and I set up a meeting to polish the plan and divide the steps into weekly goals.

In school, we are introduced to planning in our first year by being encouraged to make flow charts of our programs before we begin. Unfortunately, the lack of persistence in encouraging planning in the classroom results in students easily ignoring design until well into writing the code. In other engineering disciplines, most of their degree is spent on planning, but software engineering is so accessible and is such a fast moving industry that planning is often ignored and quality, lacking. In conclusion, make a plan first.

standard flow chart.

standard flow chart.

In the phone call, we discussed several key points to creating a good plan. The first mentioned above is having distinct goals within a plan. You are essentially giving yourself a set of accomplishments that you can hand in to your supervisor and say, “look what I did!” A plan, on the other hand, will provide the steps to help you get to each of these goals. It is important to set your goals in increments of similar difficulty, and this is where you need the skill of ‘sizing’ your work load. It can feel a lot like guess work to give an estimate of how long a project will take, in particular, how much time debugging will take. But, the more specific and detailed your plan is, the easier it will be to stay on track. The tears and hours spent over debugging poorly written code is pretty much a right of passage in your undergrad. Although you can quite easily avoid that stress by designing yourself a detailed timeline for solving the problem. In this case, I divided my work into weekly increments.

Finally, we talked about the kind of detail that I would need to discuss my objectives in the wiki. Since Bandcamp employees are spread out across the world, the emphasis on good communication is very high. This means that we don’t want to alienate different members of the team by using very specific language, but still fully communicate the idea. We came up with a persona who I would essentially write to any time I had to explain something, and the specific person in mind was my friend Alexa. She is a very bright astrophysics major (recently graduated!) who has an interest in programming, but doesn’t have a ton of experience. I will need to be able to explain to her what I’m trying to accomplish, how I’m going to do it, and what the project will look like at the end of each milestone, especially what will not be accomplished yet.

Here is what we came up with:

WEEK 1
The first week looking into this problem will involve interpreting the values inserted into the table paypal_auths. The values inserted can be understood via notes in the code and the documentation on retruned values from PayPal. Instances where there is no value or the value is zero, there will be more problem solving involved for values like the fee, status and type. By the end of the week I should have a write up describing what the important values are, what it means when they are missing, how to use them, and how they are inserted into the table.
GOAL: Understand PayPal transactions

WEEK 2
This is the first of two weeks that will be investigating how to receive the IPNs sent by PayPal. No coding will be done this week. All efforts will be focused on understanding the content of an IPN and how bandcamp currently interprets IPNs. Documentation of all findings will be provided at the end of the week.
GOAL: Preparing to get IPNs

WEEK 3
This is the second week of the IPN investigation and we should know enough by now to write the code to read in the IPN sent when the transaction has finally completed. At this point we should know what the code needs to do and where the changes need to be made. No interpretation of the data will be done, simply reading it in and printing it to the logs.
GOAL: Get IPNs

WEEK 4
Focus on altering the paypal_auth table this week. I will add two columns that will include the pending payment reason and the date of the completed transaction. In order to make a smooth transition, transactions previous to the date these columns are added will need to be filled. Whether the entry is null, 0, or a value will need to be determined this week.
GOAL: Migrate paypal_auths table

WEEK 5
This week we combine the efforts from the results of week 3 and 4 so that the IPN that we are now receiving is inserted into the improved paypal_auths table. In order to do this, we will need to see where in the code insertions into the paypal_auths table are occuring and will need to update the code accordingly. At the end of this week, everything back end should be doing what it is supposed to, including storing the information we need to update the bands’ sales records.
GOAL: Update database with info returned from IPN

WEEK 6
Final week: actually getting the updated fee information to the UI. The value of the PayPal fee should directly translate from the database since no new information is needed, but how the data is pulled to the CSV will need to be understood prior to this update.
GOAL: Update CSV

I’m feeling pretty good about keeping on track with this plan, but you never know what I might run into. Keep an eye out for more blogging about my progress throughout the next 6 weeks!

Bump in the Road

(Some soothing and allusive tunes for the following post.)

Before I can continue with my new and exciting project, I have to fix my development environment. I’d like to think that most developers have to go through these types of problems before they can become great, but even thinking that doesn’t make me feel more comforted. So, despite causing myself a bit of embarrassment, I’m going to explain what happened in order to maintain a sense of honesty and realism about the job. Here goes.

It all started when uninstalled MacPorts. I followed a web forum’s instructions and executed a few commands that ended up scraping my local file. In all fairness, it wasn’t the forum’s fault, I should have paid more attention to the delicacy of the environment set up I was using for Bandcamp. Reasonably, some of the tools used by my local Bandcamp app were now uninstalled and I was unable to continue developing without them. This isn’t really the worst thing that could have happened because under normal circumstances, I would simply reinstall any lost programs by running setup.rb like I did on my first day. Unfortunately my computer is quite old (an early 2008 macbook) and mechanically more sensitive to major software installations. So the ruby program got stuck at the make install command during the first tool installation. Although this problem didn’t directly affect my team, I was disarmed and essentially useless to Bandcamp (except for my awesome blogging) until I could get my machine back up and running.

I began by breaking down the installations and trying to do what the set up script did by hand. After installing several tools, including the one that was hanging at the make install while running the setup.rb script, and feeling close to a solution, I hit the same hang while installing ffmpeg. (Since ‘got stuck’ and ‘was hanging’ is a very confusing way to explain the problem I was having, I will explain in more detail what that means in the next paragraph when I use my utilities to look at the problem.) Scouring online tech communities for solace and coming up empty handed, I desperately asked Joe for advice. He suggested my hard drive might be the problem.

The most difficult part of the problem was that make install wasn’t generating an error, it was just an errant process. When make install didn’t work, the command would run forever (if I let it), and yet, it didn’t look like anything was happening from observing the Activity Monitor. The Activity Monitor is similar to the command line program top, but also shows stats related to the system memory, and disk usage and activity and provides a useful interface to delete, inspect, and sample processes. But, as far as I could tell from all that, make install was not even running.

So, I ran a check to verify the integrity of my hard drive using Disk Utility, another application in my ~/Applications/utilities/ folder. A couple red lines of text told me that my drive was corrupt and missing an expected file, so the next step was to perform a disk repair. By restarting my computer and holding down shift key, I started my computer in safe mode and accessed my disk utility from there and did a disk repair.

Back from a successful disk repair, I tried to run setup.rb again and still had the same problem. Deciding to check my utilities for something else that could be useful, I actually found a lot of cool stuff. Have you every tried using the graphing utility, Grapher? It is so fun:

Screen Shot 2013-06-25 at 4.49.51 PM

It even gives you the option to take derivatives and integrals! I would have loved to use that in the 1st and 2nd year calculus courses I had to take. You can also use Grab to capture screen shots and look at the colours on your screen pixel by pixel with the DigitalColor Meter. There is also Console which provides a very detailed system log. This one I could actually use, so I executed a make install command and kept an eye on the console. I noticed an indexing error with mds, an indexing tool for mac, as make install was ‘running’:

Screen Shot 2013-06-14 at 4.04.11 PM

Interesting, so seemingly still a hard drive issue after repairing the disk? To make sure the repair actually worked I double checked the disk quality by verifying disk again, but it returned with a clean check. In one last ditch effort to get my computer back up on its feet, I reinstalled my OS. Surprisingly, a few hours later, it worked! I was able to run setup.rb successfully and could once again play with the bandcamp app on my personal machine.

Although, the fact that I had a problem in the first place makes me question the integrity of the mechanical parts of my computer. Joe says he “would bet dollars and donuts that [my] drive is failing” from a combination of the age of my macbook and the added stress caused by uninstalling and reinstalling a lot of items.

Perhaps one day I’ll will have a post about what it’s like to replace your hard drive, but for now, on to investigating PayPal fees.