Playing with SQL

(My favorite song from a friend’s band in Vancouver.)


Next up, I’ve been asked to figure out some statistics about trends in purchasing, specifically, what percent of purchasers in the last two years have only bought digital items? This is going to be an SQL query on one of the slave database servers named munda. The slave, munda, mirrors the changes made by the two master servers dwayne and morris. Managing multiple master servers by dealing with concurrency complications is handled by the Multi-Master Replication Manager for MySQL. The idea behind a slave is that it takes the pressure off of the master servers by responding to most of the read-only queries. These servers are really huge and always active. Hence, while I’m learning I need to be particularly careful when making a query on one. In fact, every time I wanted to run a query I had to send it to Joe and have it approved first.

This was a little bit intimidating because I didn’t get to show my boss the beautiful final product, but instead, had to advertise every mistake along the way. Although, I liked the process. Usually I’m used to trying out whatever idea my mind wanders to rather than logically stepping through each thought completely in an organized fashion. I really liked seeing how I properly problem solve step by step, and after a few show tables and describe commands, I had written my first query.

In order to make sure I wasn’t stressing the database, we had a look at a MySQL version of top called innotop to monitor the servers. From this you get information about cache usage, bit transfer and a list of queries. In addition this information is divided into queries per second (QPS), query cache hits (QCacheHit), use of the buffer pool cache (KCacheHit), bits per second incoming (BpsIn), and bits per second outgoing (BpsOut). All of these items are shown in a table of rows ‘Now’ and ‘Total’ so you can compare the current stress to the normal work the server performs. So, while running innotop in a neighboring window, I executed my first query:

SELECT DISTINCT payment_id 
FROM sale_items
WHERE sold_date >= '2013-05-31';

Below the innotop stats there is a list of the queries running at that moment and details about how long it has been running, and sure enough mine popped up there. After 30 seconds the color of the query changed to green and again changed after a minute to yellow. If the query has been running too long and looks like it is affecting the speed at which queries are being processed, you can kill it by typing ‘x’ and entering the query id. Joe had me practice killing queries by running

SELECT SLEEP(30);

The payment_id query above didn’t seem to cause too much unpleasantness and finished in just over a minute. The result gave me a list of payment ids indicating all the payments in the last week. Next I tried this, still leaving innotop open:

SELECT COUNT(distinct paypal_auths.payer_paypal_email) 
FROM sale_items
JOIN payments     ON sale_items.payment_id = payments.id 
JOIN paypal_auths ON payments.auth_id = paypal_auths.id 
WHERE sold_date >= '2013-05-31';

Now I was getting something more interesting, a count of how many different buyers there were via joining all the appropriate tables. This query ran a little bit faster and didn’t quite graduate to the yellow state. Next I started thinking about how to distinguish between types of sales. The sale item records are tagged with an item_type, but I didn’t know what the values could be:

SELECT item_type 
FROM sale_items 
LIMIT 30;

From the semi-random result I discovered that ‘a’ meant album, ‘p’ – package, ‘t’ – track, and ‘s’ – service (like subscribing to Pro accounts). Then I realized I wanted to be executing two different queries, one for discovering how many distinct buyers have purchased just a package:

SELECT COUNT(distinct paypal_auths.payer_paypal_email) AS packages_sold
FROM sale_items
JOIN payments     ON sale_items.payment_id = payments.id
JOIN paypal_auths ON payments.auth_id = paypal_auths.id
WHERE sale_items.sold_date >= '2011-05-31'
    AND sale_items.item_type = 'p'
    AND sale_items.state = 's';

and one for how many buyers of any kind of music (i.e., not a service):

SELECT COUNT(DISTINCT paypal_auths.payer_paypal_email) AS albums_tracks_packages_sold 
FROM sale_items
JOIN payments     ON sale_items.payment_id = payments.id 
JOIN paypal_auths ON payments.auth_id = paypal_auths.id 
WHERE sale_items.sold_date >= '2011-05-31'
    AND sale_items.item_type != 's' 
    AND sale_items.state = 's';

From these two queries I was able to obtain enough information to answer my original question. Of course for confidential reasons (and fear for my life) I can’t say what the result was, but it is awesome.

I think I have an idea about what I’ll be doing next week, but let’s leave that as a cliff hanger.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s