
Thursday, May 28, 2015

First they pay, then the delay

Imagine it’s late afternoon on a weekday, let’s say Friday. You decide to start the weekend a little early and head to a local establishment for an adult beverage or two. You’ve planned well and arrived at that magical time, happy hour, when select drinks can be bought at discounted prices. You venture up to the bar and put in your order, eager to enjoy the drink and bask in the glory of being a savvy money-saving consumer. The bartender hands you your cocktail and the bill, which causes you to freeze for a moment. You’ve been charged full price. Frustrated, you ask, "What happened to the happy hour discount?" The bartender calmly replies, "All the local bars got together and agreed to abolish happy hour. Sorry about that." You wander back to your seat, drink in hand and wallet lighter than you'd like. This situation is not far off from what’s happening with brand-name and generic pharmaceuticals.


Pay the piper


“When patients demand brand, it’s like insisting on paying full price at happy hour,” said Kyle Weiler, a pharmacist from Phoenix, Arizona. A reverse payment settlement agreement, also known by the catchier name ‘pay-for-delay’, is the pharmaceutical version of charging full prices when happy hour prices are available. It is a legal tactic which some branded drug manufacturers use “to stifle competition from lower-cost generic medicines,” according to the US Federal Trade Commission (FTC) website.


“These drug makers have been able to sidestep competition by offering patent settlements that pay generic companies not to bring lower-cost alternatives to market.” -FTC



In situations involving pay-for-delay, especially for blockbuster drugs with the potential for billions of dollars in annual sales, it’s a win-win for the drug companies. The generic firm wins by avoiding a potentially lengthy and costly patent dispute, and makes a significant chunk of change without having to manufacture a thing. The brand firm wins by keeping any generic competitors out of the market until after the brand patent expires, creating a monopoly where prices stay high and profits more than cover the costs of settling.


The big losers in the delayed entry of cheaper generics are those who have to pay up for the more expensive brands. For example, it would be as if the generic versions of Advil, which have the same effectiveness but cost half as much as the brand-name version, suddenly weren't available. The resulting universal high price would create a transfer of wealth from consumers to drug manufacturers, with both brand-name and generic firms sharing in the spoils. Altogether, these settlements cost US consumers and taxpayers $3.5 billion in elevated drug prices every year, according to an FTC study.


All for one, and one for all


1984 was a big year for pharmaceuticals. This was the year when the Hatch-Waxman bill, officially the Drug Price Competition and Patent Term Restoration Act, brought landmark changes which made it easier for generic drugs to enter the market. Generics are important because they make medicine affordable for millions of people and help keep down the cost of healthcare.


Before 1984, only about one third of brand name drugs faced competition from generics. Since then, nearly every branded drug has at least one generic competitor which, in certain instances, can account for more than half of the market share, significantly reducing the cost to consumers.


1984 was also the beginning of a pattern typical for the release of major drugs: launch, challenge, sue. This process spawned the pay-for-delay tactic within a decade. Here’s how it goes: a brand-name firm launches a new patented drug on the market; one, or a few, generic firms challenge the brand drug by marketing a competing product; the brand-name firm sues for patent infringement; rinse and repeat.


In an article examining the frequency and evolution of brand-generic settlements, author C. Scott Hemphill brings to light details of nearly all the significant pay-for-delay settlements between 1984 and 2009. By analysing archived press releases, trade publications, financial analyst reports, analyst calls with management, court filings of patent and antitrust litigation, SEC filings, FDA dockets, and FTC reports, Hemphill was able to uncover the terms of these settlements.




According to the aggregated data, there is an upward trend in the sum of the annual sales of the drugs involved in brand-generic settlements, and a transition from purely monetary agreements to more involved terms which include retained exclusivity. Retained exclusivity is also a byproduct of the Hatch-Waxman Act, whereby if several generic firms want to launch competing versions of a brand-name drug, the first to submit what is called an Abbreviated New Drug Application (ANDA) may be granted a 180-day exclusive right to market its generic formulation directly against the brand-name. If this ‘first filer’ gets paid upfront to delay and gets to keep its half-a-year head start over other generics when it does finally enter the market, that’s a pretty good reason to accept a pay-for-delay settlement.


Of the nine blockbuster (over $1 billion annual sales) drugs identified, eight involved retained exclusivity for first filers. These included Lipitor ($7.2 billion), Nexium ($3.4 billion) and Plavix ($3.4 billion) for which a 180-day market with only two competing firms would be worth hundreds of millions of dollars.


[Image: quotation attributed to Walter Savage Landor]


On June 17, 2013, four years and four and a half months after the FTC first filed a complaint in the US District Court for the Central District of California, the Supreme Court ruled five to three that profit-sharing deals between drug companies that delay the entry of generic drugs can be challenged as anticompetitive.


In a statement regarding the decision in FTC v. Actavis, Inc., FTC Chairwoman Edith Ramirez said: “The Supreme Court’s decision is a significant victory for American consumers, American taxpayers, and free markets. The Court has made it clear that pay-for-delay agreements between brand and generic drug companies are subject to antitrust scrutiny, and it has rejected the attempt by branded and generic companies to effectively immunize these agreements from the antitrust laws.”



The judicial floodgates may have finally been opened. On April 20, Teva Pharmaceutical agreed to pay $512 million in the first resolution of a pay-for-delay allegation. This resolves nearly a decade of litigation against Cephalon Inc., which Teva acquired in 2011, over allegations that Cephalon paid $136 million in cash to delay sales of generic versions of its narcolepsy pill Provigil.


Teva was in the headlines again on May 7, when a California appeals court ruled that a $398.1 million payment between Bayer and Barr Pharmaceuticals (now owned by Teva) could be challenged under antitrust law. The 1997 agreement allegedly delayed the release of a generic version of Bayer’s Cipro antibiotic until 2003, a period in which Bayer made profits of about $6 billion according to court documents.


Despite the FTC’s prioritisation of going after anticompetitive pharmaceutical agreements and the recent court victories in antitrust settlements, the complex nature of the current pharmaceutical-patent system makes one thing clear: the pattern of drug launch, challenge, and sue isn’t going away anytime soon. And neither is pay-for-delay.



Thursday, May 7, 2015

♫ These are a few of my favourite data-things... ♫

This is the era of data, and journalism is evolving with the times. There are so many tools and resources available for data journalists (especially science data journalists), with more seemingly added every other day. It can get a bit overwhelming (and downright impossible) to keep up with all the latest developments, but I've put together a list of some of my favourite sources to help get you started.

Sources: 

World Health Organization (WHO) - Global Health Observatory Data Repository
Provides data (for viewing &/or downloading) pertaining to health-related topics such as Health systems, Infectious diseases, and Public health and environment.

World Bank - Open Data
Free and open access to data about development in countries around the world.

NASA - Data Portal
Growing catalog of publicly available datasets relating to both Space and Earth Science.

ClinicalTrials.gov - Registry & Results Database
Database of publicly and privately supported clinical studies of human participants conducted around the world. A service of the US National Institutes of Health (NIH).

Data.gov - US Government's Open Data
Over 130,000 datasets on topics such as Agriculture, Education, Public Safety, and Science & Research.
(see also: Data.gov.uk - UK Transparency and Open Data team)

Scrapers:

kimono - A wonderful web browser plugin that allows you to easily (no coding required) turn websites into APIs to extract only the data you want. Its infinite scroll and pagination functions are extremely useful for websites where the contents of the page expand as you scroll to the bottom or continue on more pages. A chat window that connects you with a helping hand and the kimono blog are also excellent features for problem-solving any issues and being involved in the wider kimmunity.

import.io - Another great online tool to turn web pages into data with no coding required. This also comes with a blog showcasing how the community is using import.io. A new feature allows you to send your extracted data straight to Plotly for streamlining the visualization of your just-scraped data.

Cleaner:

Google/Open Refine - My favourite tool for cleaning and transforming messy data (see previous blogpost). The best feature is that every action or operation on the data is recorded and stored in the order performed. This allows mistakes to be corrected with a simple undo, and lets you copy the sequence of operations to quickly repeat the process on another (similar) dataset.

*I've shared tools for scraping and cleaning that don't require coding, but can be modified and optimized with some coding knowledge. A great (free) online resource for learning pretty much everything online-coding related is W3Schools.

Visualizations:

tableau public - By far my favourite tool for building interactive charts. It also has a feature called dashboard which allows for combining multiple charts and/or maps to build more complex visualizations that can accentuate a particular point or angle and help weave together a narrative (see my CO2 emissions example, health spending and life expectancy example, and my other CO2 emissions example).

cartoDB - A mapmaking tool, for anything from the more localized city level, to countries on the global scale. Torque is a new-ish feature which allows the map to change over time in an automatic and dynamic way (see my earthquake example). CartoDB uses CartoCSS, a fairly straightforward styling language modelled on CSS, but the interface is designed so as not to require any coding. Just in case you do want to modify your maps in a way requiring CartoCSS code, or just to get an idea of the basics and special tricks for making interactive maps using CartoDB, they offer free webinars.

plotly - An easy-to-use and very useful tool for graphing data and finding the best chart-type to maximize the soul of the data (see earthquake depth example). The Plotly Blog is also a great resource for tips on choosing the right type of chart, seeing what other people have created, and maybe even showcasing a bit of your own work.

Datawrapper - Chart/map-making tool with the tagline: "create charts and maps in just four steps." Like Plotly it's very easy to use, and has a simple interface for customizing your visualization. A chart gallery shows the more than 100,000 charts that have been created using Datawrapper.

Websites:

The Upshot - online news and data visualization site for the New York Times.

FiveThirtyEight - Started by Nate Silver as a politics data blog, later published by the New York Times until it was acquired by ESPN. In addition to covering politics, FiveThirtyEight also touches on economics, sports, and SCIENCE!

theguardian datablog - data journalism courtesy of The Guardian.

Science data journalists:

Peter Aldhous - Currently a science and health reporter for BuzzFeed News; previously worked at Nature and New Scientist.

David Herzog - A veteran investigative reporter and data journalist, and the academic adviser to the National Institute for Computer-Assisted Reporting.

Christie Aschwanden - Lead science writer for FiveThirtyEight.com and health columnist for The Washington Post.


This list is by no means perfect or exhaustive. There are so many different sources of data, a constant progression of the tools to acquire, clean, and visualize data, and so many websites/blogs/journalists/data-nerds with their own unique skills and perspectives. The idea is not to tell you what you should or shouldn't do, but to give you a grounding for what's out there, what I personally use, and to help you find your unique voice and style, and impart a bit of your soul into the data.

Thursday, April 30, 2015

Critical analysis of a data-driven news story

For this post I will be analyzing a piece of data-driven science journalism by Alister Doyle for Reuters with the headline "China to surpass U.S. as top cause of modern global warming".

The top line for this story is that if you add up all the CO2 emissions for each country from 1990 until now, China has caught up to the US in cumulative emissions and is projected to exceed it by the end of this year or sometime next year. Or as Mr Doyle put it in his article:
"China is poised to overtake the United States as the main cause of man-made global warming since 1990, the benchmark year for U.N.-led action, in a historic shift that may raise pressure on Beijing to act.
China's cumulative greenhouse gas emissions since 1990, when governments were becoming aware of climate change, will outstrip those of the United States in 2015 or 2016, according to separate estimates by experts in Norway and the United States."
I think that both my summation and Mr Doyle's descriptions are a bit wordy and could be aided by the use of a chart or visualization. Mr Doyle uses carbon dioxide emissions data from two independent sources for this story: the Center for International Climate and Environmental Research, Oslo (CICERO) in Norway, and the World Resources Institute (WRI), a US-based think-tank. The main point of the story is supported by just two numbers, 151 and 147 billion tonnes, which correspond to the cumulative CO2 emissions between 1990 and 2016 for China and the US respectively. I think that using a bar chart (similar to the one I made in a previous post on CO2 emissions) would be a really nice way to break up all the text in the original article and provide some sense of scale for the reader.

I played around a bit with CO2 emissions data I got from The World Bank, since there is no link in the article to the data mentioned, and came up with a pretty simple and straightforward visualization that I think would strengthen the article without distracting from the story told in the text.
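Here's a minimal sketch in Python with matplotlib of the kind of two-bar comparison I have in mind, using only the 151 and 147 billion tonne figures quoted in the article; treat the labels and styling as my own placeholder choices rather than anything from the Reuters piece.

    import matplotlib.pyplot as plt

    # The two cumulative-emissions figures quoted in the article
    # (billion tonnes of CO2, 1990-2016).
    countries = ["China", "United States"]
    cumulative_co2 = [151, 147]

    plt.bar(countries, cumulative_co2)
    plt.ylabel("Cumulative CO2 emissions 1990-2016 (billion tonnes)")
    plt.title("China vs US: cumulative CO2 emissions since 1990")
    plt.tight_layout()
    plt.show()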

The article does an excellent job of contextualizing the carbon emissions data with regards to the consequences of increased CO2 levels and international efforts to control them. It also has a good diversity of relevant and useful quotes from experts on the topic.

"A few years ago China's per capita emissions were low, its historical responsibility was low. That's changing fast," said Glen Peters of CICERO.

The rise of cumulative emissions "obviously does open China up to claims of responsibility from other developing countries," said Daniel Farber, a professor of law at the University of California, Berkeley. 

"All countries now have responsibility. It's not just a story about China -- it's a story about the whole world," said Ottmar Edenhofer of the Potsdam Institute for Climate Impact Research and co-chair of a U.N. climate report last year.
"China is acting. It has acknowledged its position as a key polluter," said Saleemel Huq, of the International Institute for Environment and Development in London. 
Any fair formula for sharing out that trillion tonnes, or roughly 30 years of emissions at current rates, inevitably has to consider what each country has done in the past, said Myles Allen, a scientist at Oxford University. 
My only criticisms are the lack of a chart or some kind of visualization to really emphasize the numbers and how they relate, and that the data used for the story is not provided, nor linked to anywhere in the article.

Thursday, April 16, 2015

Two of the most useful spreadsheet tricks ever

Before I get going on making some sweet looking data visualizations, there are a couple of data-cleaning steps which make a world of difference. Both can be done relatively quickly (and painlessly) using OpenRefine (also known as Google Refine), "a powerful tool for working with messy data: cleaning it; transforming it from one format into another; extending it with web services; and linking it to databases."

There are so many useful options for working with data, including what OpenRefine calls "facets" and "filters". Also, a simple interface tracks every step you make, allowing easy undos and redos at any point along the way. I'm going to go over two functions which I find incredibly useful in preparing data for later visualization: lengthening a wide dataset, and merging two datasets with a common column.

First, giving your data a growth spurt by trimming the fat. What the heck do I mean by that?

Well, lots of datasets which are downloadable from the internet come in a wide format. For example, World Bank data has its datasets sorted by year across columns. I'm going to use the data for carbon dioxide emissions to illustrate this function, and you can download the data here to follow along as well.

Create a project and upload the .csv file, leaving the default settings as is. You should get something that looks like this:


As you can see, individual years run across the columns making the dataset wide. What we want is to get all the yearly data into two columns, one with the year and the other with CO2 emissions. Here comes the easy part. Left click on the downward arrow next to the column heading 1960, hover over Transpose and select Transpose cells across columns into rows.


Fill in the dialogue box like below (from column 1960 to the last column it will create 2 new columns). There are some missing data values for certain countries so we don't want to Ignore blank cells (uncheck) but it's important to Fill down in other columns (check).


Now we have our data nice and tight width-wise, and 13640 rows long.
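If you'd rather do this reshaping step in code, here's a rough pandas equivalent of the transpose above; the file name and identifier column names ("country", "iso_a3") are assumptions about how the download is laid out, so adjust them to match your actual headers.

    import pandas as pd

    # Read the wide World Bank file.
    wide = pd.read_csv("co2_emissions.csv")
    id_cols = ["country", "iso_a3"]
    year_cols = [c for c in wide.columns if c not in id_cols]

    # Melt the year columns into two new columns: "year" and "co2_emissions".
    long = wide.melt(id_vars=id_cols, value_vars=year_cols,
                     var_name="year", value_name="co2_emissions")

    print(long.shape)  # roughly: number of countries x number of years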


Second, adding data from another source can help add valuable details and more categorizations/filters for your original data. You can download the second dataset here. This has information about the region and income group for every country, which we're going to merge with our recently lengthened CO2 emissions dataset.

Like before, create a new project and import the nations.csv file, keeping the default import settings. Return to the CO2 emissions browser tab. Left click on the downward arrow next to country, hover over Edit columns and select Add column based on this column. (note: you could also do this using the iso_a3 column since it is also featured in the nations.csv dataset)


Now we're going to use the GREL (Google Refine Expression Language) command: cell.cross("string projectName", "string commonColumn").cells["string columnName"].value[0]

Keep the quotation marks, but replace string projectName with nations csv, string commonColumn with country, and string columnName with region, giving: cell.cross("nations csv", "country").cells["region"].value[0]


This will add a new column to the CO2 emissions dataset with the appropriate regions for the corresponding countries. Repeat the GREL cell.cross command to add a column containing income_group.
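Here's the same join sketched in pandas rather than GREL, merging the lengthened CO2 table with the nations table on their shared country column; the file names are placeholders for whatever you exported in the previous steps.

    import pandas as pd

    co2 = pd.read_csv("co2_emissions_long.csv")  # the lengthened dataset from above
    nations = pd.read_csv("nations.csv")

    # Left join on the shared "country" column, pulling in region and income_group.
    merged = co2.merge(nations[["country", "region", "income_group"]],
                       on="country", how="left")

    print(merged.head())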


Voila! This dataset is now ready for a nice visualization makeover. I used it to make this treemap bar chart using tableau.

For other OpenRefine functions see https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

Wednesday, April 15, 2015

Oklahoma - where the earth quakes, fracking down the plains


I made this visualization using data from The USGS after reading an article in The Guardian about how the spike in earthquakes in Oklahoma and nearby states is likely man-made, caused by the injection of fracking wastewater into deep underground disposal wells.

I was inspired to make a map which could show the epicentre and magnitude of earthquakes over a period of time, and found a great example of this with a CartoDB Torque map.

I really like this visualization because I think it gives a good historical context of earthquakes in this area going back to 1975, and then shows the explosion that begins in 2009. The map may proceed through time a bit quickly, which is a fairly simple element to change using CartoDB's interface. I think an accompanying static bar chart with the number and magnitudes of earthquakes over time in Oklahoma and the surrounding states would help supplement the dynamic nature of the map.
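For anyone who wants to try building that static bar chart, here's a rough Python sketch pulling magnitude 3+ events from the USGS earthquake catalogue; the bounding box, magnitude cutoff and date range are my own rough choices rather than values taken from the original map.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Query the USGS earthquake catalog for magnitude 3+ events in a rough
    # bounding box around Oklahoma and its neighbours.
    url = ("https://earthquake.usgs.gov/fdsnws/event/1/query?format=csv"
           "&starttime=1975-01-01&endtime=2015-04-15&minmagnitude=3"
           "&minlatitude=33.6&maxlatitude=37.0"
           "&minlongitude=-103.0&maxlongitude=-94.4")

    quakes = pd.read_csv(url, parse_dates=["time"])

    # Count events per year and draw the static bar chart.
    per_year = quakes.groupby(quakes["time"].dt.year).size()
    per_year.plot(kind="bar", figsize=(10, 4))
    plt.xlabel("Year")
    plt.ylabel("Earthquakes (magnitude 3+)")
    plt.title("Magnitude 3+ earthquakes in the Oklahoma region")
    plt.tight_layout()
    plt.show()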


On a side note, I also found the following chart made by Steve Maier using Plotly, which tells another very important story about the history of earthquakes in Oklahoma:
Oklahoma Earthquakes, 1990-1999: 788 quakes (green); 2004-2013: 6,569 quakes (red)

Monday, April 13, 2015

Another interactive chart courtesy of my new best friend...tableau


Data from The World Bank

The inspiration for this visualization was one I saw on the website Gapminder charting the Wealth & Health of Nations. The graph shows how long people live (average life expectancy) and how much money they earn (GDP per capita) for each country from 1800 to 2013. In my opinion, this interactive graphic is a bit overwhelming for inclusion in an article. However, it is an excellent example for taking bits and pieces and using them in your own visualizations.

For example, in my tableau version where I replaced GDP per capita with health expenditure as a per cent of GDP, I kept several elements from the Gapminder chart:

1) I used a bubble chart with the area of the bubble corresponding to the population size of the country (note: Gapminder allows you to change this feature for a whole range of indicators). A rough static sketch of this bubble-size encoding follows this list.
2) I included a slider which changes the chart according to the specific year (note: Gapminder allows manual control over the time slider or you can hit Play and it will automatically proceed through time).
3) I included a side bar for selecting particular countries by their name, region, and/or income group. While Gapminder went with a list of all the countries to select, I used a search box for mine.
4) A map colour-coded by region is also on display and changes as specific countries are selected. This is one feature in which I think I improved over Gapminder's. My graphic highlights specific countries when selected, whereas the Gapminder one only highlights the bigger region.
5) When hovering over the bubble chart, the data for the specific country is displayed. This is another aspect I think I improved over the Gapminder example. My chart shows this data next to the bubble hovered over. The Gapminder version just highlights the data values on the x- and y-axes.
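To show the bubble-size idea from point 1 outside of tableau, here's a rough, static Python sketch; the file and column names are my own assumptions, not the actual World Bank export, so rename them to match your data.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumes a CSV with one row per country for a single year and columns
    # named country, life_expectancy, health_exp_pct_gdp and population
    # (placeholder names).
    df = pd.read_csv("health_wealth_2013.csv")

    plt.scatter(df["health_exp_pct_gdp"],      # x: health expenditure (% of GDP)
                df["life_expectancy"],         # y: life expectancy (years)
                s=df["population"] / 1e6,      # bubble area scaled by population
                alpha=0.5)
    plt.xlabel("Health expenditure (% of GDP)")
    plt.ylabel("Life expectancy (years)")
    plt.title("Bubble area proportional to population")
    plt.show()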

Saturday, April 11, 2015

Tableau = awesome; CO2 emissions = not-so-awesome

What I really like about this visualization is that it is helping to tell two stories about the same thing, CO2 emissions.

The treemap bar chart on the left clearly shows that overall CO2 emissions are rising annually, and that China/SE Asia is the region contributing the largest share of those emissions.

The line chart on the right tells a different story. It shows that even though China emits the most CO2 overall, its per capita emissions are actually quite low, significantly lower than those of the US.

What is alarming is pairing the total emissions story with the per capita story. China is trending upwards in both instances, which has major global consequences for climate change.


Thursday, March 26, 2015

Kimono is the Koolest

kimono is a must-have app/Chrome extension for easily and quickly scraping data from a website with no code writing required. Download it here.

What's really nice is that kimono can be used to scrape data from twitter pretty much hassle-free, without the need to write any code. The kimono data extractor recognizes patterns in web content which you can easily tune to get exactly the data you want, leaving the rest behind.

The kimono blog is an excellent resource for seeing all the cool stuff other people have used it for, and for ideas on what you could do yourself. I'm going to give you an example of using kimono to scrape a twitter account (Neil deGrasse Tyson's to be specific) and then analyzing the text of his tweets by visualizing the most common words with a word cloud/map using Wordle.

First, I just want to say that what takes kimono to the next level is a chat window that starts up automatically when you sign in to your kimono account. A very helpful person is available to answer any questions and problem-solve any issues you face. I learned this first-hand as I attempted to scrape twitter but ran into problems. The kind assistant worked with me to find a solution, which, it turns out, is to use the mobile twitter site, and that's where we'll begin.

So, once you have the kimono plugin installed for Chrome and have navigated to mobile.twitter.com, click on the kimono icon next to the address bar. You will notice that a yellow box appears describing Auth mode:

'Kimonify'ing a mobile twitter page requires you to log in, so continue by clicking the pulsating lock icon.

Follow the instructions by clicking the yellow USERNAME circle followed by the Phone, email or username box. Repeat for PASSWORD and SUBMIT as well, then click Done.


Enter your twitter login information and let the kimono magic happen. Now we want to navigate to the twitter page we are going to scrape. This requires us to click the NAVIGATION MODE icon and then we can search for the twitter account we want (the new Mr Cosmos, Neil deGrasse Tyson). We have to click NAVIGATION MODE one more time and then select the appropriate twitter handle (@neiltyson). Now we're at the page we need to begin scraping.


Change the dialogue box at the top left which says property1 to something like tweet. Now hover over any of the tweets in the tweet feed until a light yellow box overlays the text and left-click. You'll notice that the box becomes solid yellow and there is a 1 in the yellow circle at the top left. There should also be light yellow boxes overlaying other tweets with x✓ on the right. Click ✓ next to another tweet and all the tweets in the feed should turn yellow and a 30 should now be in the top left yellow circle.

If we were to run the kimono API at this point, it would only give us results for the 30 tweets displayed on the page. But there are many more tweets (over 4,000 for Mr Tyson) that we can extract, and it only takes one more step. Scrolling down to the bottom of the page, you will see a Load older Tweets button, which changes the page to the preceding 30 tweets. Click the bluish circle at the top right which is called PAGINATION, then click this Load older Tweets button, and then Done.

Give your API a name, like Neil deGrasse Tyson Tweets, and click Create API. Follow the link provided and now you're ready to Start Crawling (extracting the data from Mr Tyson's twitter feed). Under the CRAWL SETUP tab, you can see the status of the crawl and can change the PAGINATION LIMIT from 1 page to 1,000 pages. The DATA PREVIEW tab lets you see 10 rows at a time, or to copy/download the entire dataset. Download the data as a CSV file.

[Analyzing text data is called text mining and can be done using tools like MonkeyLearn. An example of analyzing news headlines using kimono and MonkeyLearn can be found here.]

For our purposes, we're just going to simply visualize the most common words from Neil deGrasse Tyson's tweets. Copy the tweet column from the downloaded CSV file and paste it into the text box here to get an image looking something like this:

There may not be anything statistically significant about this, but it's a nice insight into the message being shared by one of the most prominent scientists of our time.
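If you'd rather skip the copy-and-paste into Wordle, here's a minimal Python sketch that builds a word cloud straight from the downloaded CSV; it assumes the column is called tweet (the property name we set earlier), the file name is a placeholder, and it uses the third-party wordcloud package.

    import pandas as pd
    from wordcloud import WordCloud  # third-party package: pip install wordcloud

    # Join all the tweet text into one big string.
    tweets = pd.read_csv("neil_degrasse_tyson_tweets.csv")
    text = " ".join(tweets["tweet"].dropna().astype(str))

    # Build the word cloud and save it as an image.
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    wc.to_file("tyson_wordcloud.png")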

Monday, March 9, 2015

The data blogging adventure begins!

"Maybe stories are just data with soul." 

Brené Brown

This blog will chronicle my education and experimentation with data journalism. Whether it's documenting my exploration and scraping of data from the web; critiquing already-published pieces; or trying (& most likely failing) to master the plethora of tools available to visualize data, I hope to give you an entertaining and interesting look inside the life of an aspiring data journalist.

And as they say here in Britain, Tally-ho!


Oh and if you want to make your own image with personalized text, check out Chisel. That's how I made the photo above (photoshopped mad scientist with my face not included).