Well, the embargo has lifted (funny thing when you are publishing a paper: before you submit it you tell everyone about it, but once the journal accepts it you have to keep it all quiet) and now I can finally talk about my latest research paper!

This time the paper is coming out in Science Translational Medicine (one of my favorite up-and-coming journals). The paper describes how we developed methods to mine adverse drug events on a very large scale. The trick is to account for all the nasty bias in the database that has historically inhibited such data mining. We adjust for that bias by identifying “better controls” for the drugs we are studying.

Looks like local news is starting to pick it up and has even put together a video on the topic. The message they took away from the paper wasn’t really what I intended, but it’s still cool. Curante and something called Mother Nature News have picked it up as well.

Here’s an embedded version of that video:

One Curante commenter asks, “And how many ‘side effects’ are there from eating an apple? Or drinking a cup of tea?” A good question, I say, a good question indeed!

Week 13 was rivalry week with lots of potentially exciting games. However, there weren’t too many surprises as the favored team won in nearly all the top 25 match-ups.

A quality win over Notre Dame pushed Stanford up a few notches, while OSU fell a notch on a bye week.

At this point BCSRank would put LSU vs Houston in the BCS Championship game, a scenario that is incredibly unlikely. Also, while Bama sits at a solid #2 on nearly all ranking systems, according to BCSRank they are down at #7. Again, I suspect there is bias against teams in good conferences and I need a way to address it, ideally without sacrificing the generality of the method.

As a student of statistics/computer science and a fan of college football, I got pretty interested in just how these computer rankings work. This page lists the different algorithms that are used, but the sites don’t really describe how the teams are ranked (probably trade secrets or something).

It occurred to me that Larry Page and Sergey Brin invented a little algorithm to rank web pages on the internet based on their credibility. The idea behind their ranking algorithm is that more credible pages will have more links to them AND that links from more credible pages are worth more. Well, this is quite similar to college football. Teams that win lots of games are better teams (i.e. more credible). So instead of links between pages we have links between teams. A link is established between each pair of teams that have completed a game. The link is directional and goes from the team that lost to the team that won.

This is quite intuitive in that if you beat a team that loses to a lot of teams (e.g. a webpage that promiscuously links to all other pages) you are going to get less credit than if you beat a team that has never lost (e.g. a page that all other pages point to, but which only points to you). Unless two teams play twice in a year (which typically won’t happen in the regular season), there will be at most one link between each pair of teams.

My implementation is far from finished and there are definitely some limitations, but I figure the internet may be able to help me improve the method so here it is.

As an example, let’s say we have three teams (A, B, and C). We define our link matrix according to the results of those teams playing each other.

L = ((0, 0, 0),
     (1, 0, 0),
     (1, 1, 0))

We call the matrix L (for links or for losses, you pick). When we try to invert this matrix according to the algorithm we cannot, because it is singular, so to make it non-singular I set the diagonal equal to 1.

L = ((1, 0, 0),
     (1, 1, 0),
     (1, 1, 1))

This makes the matrix invertible and we can implement the algorithm as described on Wikipedia. The links encode that team A has never lost, team B lost only to team A, and team C lost to both A and B.

The other parameter we need is the damping factor, which for the PageRank algorithm is the probability that a random surfer follows one of the links rather than jumping to a random page on the internet. When this value is set to 1 the surfer always uses the links and will end up on a terminally linked page. For our football ranking this means the score will congregate in the top one or two teams (analogous to the surfer ending up at a terminally linked page). We want to spread the scores out a bit more than that, so we will set the damping factor quite low (~0.3).
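To make this concrete, here is a minimal power-iteration sketch of the ranking on the three-team example above, in Python. The function name, the dictionary representation of the links, and the handling of undefeated “dangling” teams are my own choices for illustration, not part of any official ranking system:

```python
# Teams: 0 = A, 1 = B, 2 = C. links[i] lists the teams that team i has
# lost to, i.e. a directed link from each loser to each winner.
links = {0: [], 1: [0], 2: [0, 1]}  # A undefeated; B lost to A; C lost to A and B

def pagerank(links, d=0.3, n_iter=100):
    """Power-iteration PageRank. d is the damping factor: the probability
    that the random surfer follows a loss-link instead of jumping to a
    random team."""
    n = len(links)
    ranks = {t: 1.0 / n for t in links}  # start with all teams equal
    for _ in range(n_iter):
        new = {t: (1 - d) / n for t in links}
        for loser, winners in links.items():
            if winners:
                # The loser's score is split among the teams that beat it.
                share = ranks[loser] / len(winners)
                for w in winners:
                    new[w] += d * share
            else:
                # Undefeated teams have no outgoing links; spread evenly.
                for t in links:
                    new[t] += d * ranks[loser] / n
        ranks = new
    return ranks

scores = pagerank(links)
print(scores)  # A ends up ranked above B, and B above C
```

With the damping factor at 0.3 the scores stay fairly spread out; pushing d toward 1 concentrates more and more of the score in team A, the “terminally linked” team.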

So, in week 13 of the 2011 season, what does PageRank say about the way the teams should be ranked? Here are the top 25.

Now this is pretty good for a first pass at ranking. LSU is clearly the best team going into week 13 and indeed they come out on top of our rankings. One interesting feature of our rankings is that Houston is right there at #3, above Alabama. This is the result of our algorithm assuming all teams are equal at the start. It’s a little disappointing that Stanford is sitting at #8 behind a two-loss TCU, but TCU has beaten “higher quality” opponents (i.e. Boise State) and that contributes to their score.

Another interesting/curious ranking is Arkansas sitting very low at #13. This makes me wonder about biases against conferences that are filled with good teams (i.e. the SEC West). Because a team has to play the teams in its own conference more often than teams from other conferences, this may bias against teams in good conferences. I have not figured out a way to visualize or estimate this bias yet. Once we understand it we can invent methods for removing it.

If you have ideas for improving this method please feel free to comment. This may ultimately prove a futile effort and not perform as well as the established ranking systems, but at least it’s an open solution based on some sound CS models.

JAMA just published a very nice article by Dr. Hampton covering our recent discovery of a drug-drug interaction between paroxetine and pravastatin. You can check it out here.

I once was told that submitting a scientific paper is like sending your child off to college. You’ve done what you can to prepare them — you’ve tried to work out all the kinks and polish off the edges. But, ultimately, they have to go out there and stand on their own. I won’t pretend to know what sending your grown baby away to start their life is like, but getting this paper published has been one crazy ride. And I’m very happy to report that today my paper is graduating from college and I’m so proud of the little guy. I mean, of course, that the paper has finally been accepted and is being published today. And by one of my very favorite journals, Clinical Pharmacology & Therapeutics!

Once I get the PMID and all that I’ll post it here for your viewing pleasure. The proofs were quite aesthetically pleasing, so I expect it to make a good over-the-mantel piece.

Science is stupid-fun. I mean, honestly, as a graduate student in informatics I get to sit in front of my computer all day, musing about problems, programming scripts, and testing hypotheses. It’s the greatest job in the world and every so often you hit on something really cool and get to publish a paper. Sometimes you figure out a new way to analyze the data or you apply an old technique to a new field. Either way, sharing your discovery is both fun and gratifying.

The papers you publish are the lifeblood of a scientist’s career, and for me as a graduate student they will form the basis of my PhD thesis. I’ve been lucky so far to have worked with some very talented people and to publish a couple of papers in my first two years. However, recently I stumbled upon a discovery which could get me a paper in, by most metrics, the most prestigious journal in the world, The New England Journal of Medicine. Just typing this sentence blows my mind. Anyway, because I don’t think about much else, I’m going to blog about my experience submitting a paper to NEJM and will share all the ups and downs…. cross your fingers for ups!

Update: April 28, 2010

The goal was to submit the paper today. But right before my advisor and I were about to submit the final draft, I realized that I had left out a whole set of analyses! Doh, now I have to go back to the database to extract more patient records and analyze another set. Looks like it won’t be today.

Update: April 29, 2010

Okay, I made a whole bunch of new figures today, modified the text where needed, and sat down with my advisor again. We made some final edits (some of which we should have caught in the previous 16 iterations of the manuscript), but at last we feel great about the paper. Personally, I think it’s a work of art and couldn’t imagine it getting rejected. Don’t worry, I’m not allowing my hopes to get too high. My advisor let me submit the paper and click on all the “double-check-your-submission” links. In his words: “you’re not going to be doing this that often, so you might as well click around.” It was fun, and it’s amazing I even get to submit to the NEJM, really.

Paper status is now “Submitted” and we got a confirmation email that it will be forwarded to the editors. Time to order some flowers and wine for our collaborators!

Update: April 30, 2010

I have been neurotically checking the author website at NEJM, and today our paper was “Assigned to an Editor.” I have no idea exactly what that means, but it’s movement!

Update: May 2nd, 2010

I went on the website today (yes, and yesterday too), not expecting any updates over the weekend, but much to my surprise the status of the paper has changed. It looks like it made it past the editor and they are now looking for peer reviewers. According to a close friend, getting past the editors is a big deal and a lot of papers don’t make it past that point. Very exciting! I just read on the NEJM website about the editorial process, and it seems the paper has been read by the Editor-in-Chief and also another expert editor and has passed both of their filters. Here’s to the reviewers liking it! [crosses fingers]

Paper status is now “Searching for Reviewers.”

Update: May 10th, 2010

It’s been one week and one day since the last update, and I was starting to think that maybe the next update would come when the reviews came back. That is not the case, however. Looks like they found some reviewers for the paper! Not sure how long they have to review it, probably a few weeks at least, I imagine. I’ll check with my PI and add that info here. Looks like it’s 2 to 3 weeks.

Paper status is now “Out for Review.”

Update: May 20th, 2010

The reviews of the paper have been returned to the Journal, but a decision has not yet been made. The next step is for the editorial board to discuss the paper and make a decision. I’m not sure how long this process takes, but every other time the status included the word “editor” it only took a few days to get another update. Perhaps I will hear something by week’s end. YIKES! This is getting real.

Paper status is now “With the Editor.”

Update: May 25th, 2010

It’s been a serious roller-coaster ride and I have had quite the experience submitting to the world’s most prestigious journal. The reviews came back a few days ago and we just got a chance to read them today. As far as reviews for The Journal go, they were quite benevolent and really quite positive. They made some great suggestions on how to improve the paper and I have already made most of the changes they suggested (they were quite minor). Unfortunately for us, the editors were not as favorable toward us as the reviewers were, and they rejected our paper today. Perhaps the paper is too bold for the New England Journal, perhaps it’s ahead of its time. Perhaps they just don’t like me or my parentage; I simply can’t be sure. But what I do know is that this is some of the best work I have ever done and there are many other journals out there. I will get the good word out about this discovery one way or another.

I just found this blog post on kpumuk.info here (http://kpumuk.info/mac-os-x/customizing-iterm-creating-a-display-profile-with-pastel-colors/) and it has changed my iTerm life. I used to be a Terminal.app fanboy, but between TextMate’s iTerm integration and this color theme, I am fully an iTerm man now. Because this has changed my life so much, I’m going to repost the script that sets the default iTerm color scheme to this beautiful pastel-on-dark style.

I’m only posting this for posterity’s sake; you should really go to the original blog post (link above) for more detailed information.

Close down iTerm and run this bash script from Terminal. Open up iTerm and you’ll see the changes.

Sigmoid functions are our friends and sometimes you have data which you would like to fit with a sigmoid function. We can use R to find such a fit. First let us look at a sigmoid function.

y = 1 / (1 + exp( a*x + b ) )

Now let’s say you are given vectors x and y, say:

x = c(0.00,0.02,0.04,0.06,0.08,0.10,0.12,0.14,0.16,0.18,0.20,0.24,0.26,

In this case the y values represent probabilities, and one thing you’ll notice is that we have probabilities of exactly 1 and 0. Both are bad, because the log transform we apply below is undefined there. So we apply a little “Laplace smoothing” to them:

y[y==0] = 0.001

y[y==1] = 0.999

Now let’s look at what the data looks like.

plot(x, y)

Well, it may be sigmoidal, maybe not. For now let’s assume it is, which we mostly believe anyway.

Okay, now let’s solve for a line in our sigmoid function:

y = 1 / (1 + exp( a*x + b ) )

1 + exp( a*x + b ) = 1/y

a*x + b = log( (1/y) - 1 )

Now the left hand side of the equation is a line and the right hand side is some logarithm of the y data. We can plot x versus this right hand side:

new_y = log( 1/y - 1 )

plot(x, new_y)

It looks pretty interesting, and hopefully at this point it also looks kinda linear (which it kinda does).

Now let’s fit it with a line:

lm.res <- lm( new_y ~ x )

lm.res

Which produces this output:

Coefficients:
(Intercept)            x
      1.122      -11.647

We can also test the significance of the fit with an ANOVA.

anova(lm.res)

Which produces this output:

Analysis of Variance Table

Response: new_y
   Df Sum Sq Mean Sq F value    Pr(>F)
x   1 172.80 172.802  14.641 0.0009834 ***

And we can plot the resulting fit in linear space:

abline(lm.res)

Now let’s see how our fit looks back in normal space using our formula with our derived a and b values.

a = -11.647

b = 1.122

plot(x, y)

sim_x = (1:101-1)/100

points(sim_x, 1/(1+exp(a*sim_x+b)), type="l")

Voila! We have fit a sigmoid function to our data.
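The same linearize-then-fit trick carries over to other languages. Here is a sketch in Python using only the standard library. Note that the x and y data are synthetic stand-ins generated from made-up “true” parameters (the real vectors from this post are not reproduced in full), and the hand-rolled least-squares fit plays the role of R’s lm:

```python
import math

# Synthetic sigmoid data as a stand-in for the real vectors; the "true"
# parameters are made up to mirror the fitted values above.
true_a, true_b = -11.647, 1.122
x = [i * 0.65 / 50 for i in range(51)]
y = [1.0 / (1.0 + math.exp(true_a * xi + true_b)) for xi in x]

# Laplace-style smoothing so log(1/y - 1) is always defined.
y = [min(max(yi, 0.001), 0.999) for yi in y]

# Linearize: a*x + b = log(1/y - 1), then fit a line by least squares.
new_y = [math.log(1.0 / yi - 1.0) for yi in y]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(new_y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, new_y))
a = sxy / sxx           # slope
b = y_bar - a * x_bar   # intercept
print(a, b)  # recovers roughly a = -11.647 and b = 1.122
```

On noise-free data like this the fit recovers the generating parameters almost exactly; with real, noisy probabilities you would see the same kind of scatter as in the R plots above.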

Just found a new data set that I couldn’t help running some stats on. OpenSecrets.org published this table which lists all of the members of the house, the number of earmarks they requested, and the total dollar amounts.

The columns of the data are:

Representative Name

State

Number of Earmarks

Total Cost

Solo Earmarks

Solo Cost

The solo columns are for earmarks where that representative was the only representative who requested the earmark.

Republicans vs. Democrats

The first obvious division is to split the data on party lines and see if their behavior is any different.

Column           Mean Democrats   Mean Republicans   P-value
Total Earmarks   26.8             22.3               3.26E-4
Total Cost       $37,402,953      $30,683,681        0.02873
Solo Earmarks    10.3             9.2                0.0925
Solo Cost        $7,606,210       $7,746,574         0.7782

Red: Republicans, blue: Democrats (in the per-row histograms, not reproduced here). The significant p-values are those for Total Earmarks and Total Cost.

Table 1. The above table shows that the average number of earmarks approved is significantly higher for Democrats than for Republicans, and also that Democrats get significantly more money for their earmarks than Republicans do. Please note that when I say “significantly” I mean it in the statistical sense. The p-values (roughly, the probability that a difference this large would arise purely by chance) for the first two rows of the table are significant (less than 0.05). You can interpret this as a less than 5% chance of the difference occurring completely by chance. However, when looking at solo earmarks there is not a significant difference in either the number of earmarks granted or their cost.

For the Statisticians: To calculate the p-values I used the Wilcoxon rank-sum test, as the distributions are not normal.
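For the curious, the core of the rank-sum idea can be computed by hand. Here is a Python sketch on made-up earmark counts (illustrative numbers only, not the real OpenSecrets data); the U statistic counts, over all Democrat/Republican pairs, how often the Democrat’s value is larger:

```python
# Made-up earmark counts for illustration (not the real data).
dem = [31, 28, 25, 40, 22, 35, 27, 30, 26, 33]
rep = [20, 24, 18, 25, 21, 19, 23, 22, 17, 26]

# Mann-Whitney U: for each (dem, rep) pair, score 1 if the Democrat's
# count is larger, 0.5 for a tie, 0 otherwise, then sum over all pairs.
U = sum((d > r) + 0.5 * (d == r) for d in dem for r in rep)
print(U)  # 93.5 out of 100 pairs; 50 would mean no difference
```

In practice a statistics package (e.g. R’s wilcox.test) computes this statistic and the corresponding p-value, which is what the table above reports.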

However, I feel obligated to point out that because the House has a majority of Democrats (237 to 163), it may be easier for Democrats to get their earmarks passed, and thus there are more for Democrats. For comparison, we would need data from a period when the GOP had control of the House.