Chapter 22. All Your Data Are Belong to Us: Liberating Government Data

When government refuses to make itself transparent and open and fails to make public information meaningfully available, hackers will liberate the data. It has happened many times over, and it will doubtlessly happen again. Each time government data is freed, citizens gain useful access to valuable information that rightly belongs to them. But perhaps more importantly, government is forced to deal with the new reality of a networked world in which the people demand free online access to public information.

Data is liberated by hacking government. Because of how the popular press has used it, the word hack is often misunderstood to mean only illicit access to computer networks. In fact, to techies that is only one possible meaning. According to Wikipedia, the term usually means “a clever or quick fix to a computer program problem” or “a modification of a program or device to give the user access to features that were otherwise unavailable to them.”

Liberating Government Data: Carl Malamud Versus the Man

Carl Malamud is probably the original government hacker. A technologist, activist, and entrepreneur, Malamud is well known for having forced the U.S. Securities and Exchange Commission (SEC) to make much of its public data available to the public over the Internet in 1995. At that time, the SEC did not provide free access to the corporate filings it collected. Instead, the SEC’s database, the Electronic Data Gathering, Analysis, and Retrieval system (EDGAR), was operated under contract with information wholesaler Mead Data, which provided data feeds to data retailers who in turn sold access to the public. According to John Markoff, author of the New York Times article “Plan Opens More Data to Public”, “Under this system, a retail information provider, like Mead Data’s own Nexis service, charge[d] about $15 for each S.E.C. document, plus a connection charge of $39 an hour and a printing charge of about $1 a page.” One can imagine customers were largely limited to firms on Wall Street.

After first trying and failing to convince the SEC that it should make its database available on the Internet, Malamud began to purchase the SEC’s wholesale data and made it available on his own website free of charge to anyone. The service included corporate annual reports, 10-K filings, proxy statements, and other data valuable to investors, journalists, and others. In December of that year, Malamud expanded his free offerings by adding large portions of the U.S. Patent and Trademark Office’s (PTO) patent and trademark database, including full text of all patents, and text and images from the trademark database. Malamud, however, always believed that it was government’s responsibility to provide its data for free to the public, especially since the then-recently enacted Paperwork Reduction Act mandated that agencies make public information available electronically.

On August 11, 1995, Malamud announced on his website that the site would discontinue its free access to government data on October 1. As Malamud later recounted:

Our goal, however, wasn’t to be in the database business. Our goal was to have the SEC serve their own data on the Internet. After we built up our user base, I decided it was time to force the issue. That’s when the fireworks began. When users visited our EDGAR system in August 1995, they got an interesting message:

This Service Will Terminate in 60 Days
Click Here For More Information

Click here they did! One of the lessons I’ve learned from building Internet services is that when people get something for free, they want their money’s worth.

The message informed users that under no circumstances would the unofficial service be continued, and suggested that it was “time for the stakeholders in this data to step up to the plate and forge a solution.” It also asked users to write to Congress and the administration, which they did.

The SEC at first resisted. Eventually, however, it relented and the agency took over Malamud’s service as the core of a new online EDGAR system. The public uproar apparently caught SEC commissioners off guard and they took on the responsibility of making data available before Malamud’s October deadline. According to Malamud, “The commissioners of the SEC had clearly not been aware of the issue, but there is nothing like pieces in the Wall Street Journal and 15,000 messages to the Chairman to raise the profile of an issue” (http://www.mundi.net/cartography/EDGAR).

Malamud had similar plans for his patent and trademark database. In 1998, he wrote to Vice President Al Gore and Commerce Secretary William Daley (who oversaw the PTO), announcing that unless the PTO began offering its databases on the Internet, he would create a free and robust online alternative. In the years since Malamud first put a patent database online in 1994, the PTO had not been as accommodating as the SEC. The agency was self-financed by user fees, a large portion of which came from requests for paper copies of patent and trademark information. As the commissioner for patents told the New York Times, “If he can [put the patent and trademark database online] we’d be out all $20 million we now receive in fees…. Why would anyone want paper?”

Malamud’s strategy to overcome government’s resistance was a familiar one. “I’m going to buy the trademark data and will build the user base as big as I can in a year,” Malamud said at the time. “At the end of the year, I’ll pull the rug out from the users and give them Al Gore’s E-mail address” (http://www.nytimes.com/1998/05/04/business/us-is-urged-to-offer-more-data-on-line.html). The gambit worked, and less than two months later the Clinton administration announced that it would put the full patent database online. In each instance, by forcefully but legally releasing online data that the government had either not disclosed on the Internet or not made easily accessible, Malamud was able to effect a change in policy that led to a more open and transparent government (see Chapter 14).

Disclosing Government Data: Paper Versus the Internet

The United States is one of the most open and transparent countries in the world. Citizens generally have the right to inspect the records, minutes, balance sheets, and votes of almost all public bodies.

Laws encouraging government transparency and accountability have been a feature of the U.S. system of government since the founding of the Republic. The Constitution, for example, requires that each house of Congress “keep a Journal of its Proceedings, and from time to time publish the same, excepting such Parts as may in their Judgment require Secrecy.” Today, the Congressional Record satisfies this requirement.

Unfortunately, many of the statutory requirements for disclosure do not take Internet technology into account. For example, the 1978 Ethics in Government Act requires the disclosure of financial information—including the source, type, and amount of income—by many federal employees, elected officials, and candidates for office, including the president and vice president, and members of Congress. The act further requires that all filings be available to the public, subject to certain limited exceptions. One might imagine, then, that every representative’s or senator’s information would be just a web search away, but one would be wrong.

Members of the House of Representatives must file their disclosures with the clerk of the House of Representatives, while senators must do the same with the secretary of the Senate. Each of these offices maintains a searchable electronic database of the filings. However, until very recently, to access these databases citizens had to go to Washington, D.C., and visit those Capitol Hill offices during business hours. There were no other means of searching the databases, something that presented a major barrier to the widespread dissemination of nominally publicly available information. Even today, the disclosure forms offered are scanned images that are not easily searched or parsed.

The result is that to make the information available online, third parties such as transparency website LegiStorm must acquire paper copies of the forms and manually scan and parse them. In contrast, the clerk of the House and the secretary of the Senate could likely make their existing databases available online at little extra cost. LegiStorm also offers something the official sites still won’t: financial disclosure forms for congressional staffers, not just members.

So, why would government fail to take advantage of the benefits of online disclosure? Not only would using standard Internet technologies make it easier for citizens to find and access government information, but it would probably also present efficiencies and cost savings to government itself. In most cases, the obstacle is likely bureaucratic inertia. In other cases, however, government will have little incentive, and often a disincentive, to make public information easily accessible.

Sometimes, as we saw with the PTO, government agencies make data freely available but collect user fees for easy access to that data. This can create an incentive to protect the revenue stream at the expense of wider public access to government information. In other instances, government reluctance to make data easily accessible can have political motivations.

Accessing Government Data: Open Distribution Versus Jealous Control

Much like Josh Tauberer’s GovTrack.us, the Washington Post’s Congress Votes database allows users to easily search and sort through a database of congressional bills and votes. When the Post was building its site, the House offered its roll call votes in XML, a standard machine-readable format, while the Senate did not. This forced the Post, like GovTrack, to rely on cumbersome “screen scraping” of Senate web pages to make the Senate’s roll call votes usable (see Chapter 18).
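To make the difference between a structured feed and screen scraping concrete, here is a minimal sketch in Python. Both samples are invented for illustration: the element names in SAMPLE_XML and the markup in SAMPLE_HTML are assumptions, not the actual format of any House or Senate page. The point is only that the XML reader asks for data by name, while the scraper depends on how the page happens to be laid out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical structured feed; element names are illustrative only.
SAMPLE_XML = """
<roll_call_vote>
  <vote_number>17</vote_number>
  <members>
    <member><name>Smith</name><state>OH</state><vote_cast>Yea</vote_cast></member>
    <member><name>Jones</name><state>TX</state><vote_cast>Nay</vote_cast></member>
  </members>
</roll_call_vote>
"""

# Hypothetical HTML tally page presenting the same vote for human readers.
SAMPLE_HTML = """
<html><body><h1>Vote 17</h1>
<span class="contenttext">Smith (OH), <b>Yea</b><br>Jones (TX), <b>Nay</b><br></span>
</body></html>
"""


def votes_from_xml(text):
    """Read each member's vote by element name, independent of page layout."""
    root = ET.fromstring(text.strip())
    return {
        member.findtext("name"): member.findtext("vote_cast")
        for member in root.iter("member")
    }


def votes_from_html(text):
    """Scrape votes with a pattern tied to the page's current markup."""
    pattern = re.compile(r"(\w+) \((\w{2})\), <b>(\w+)</b>")
    return {name: vote for name, _state, vote in pattern.findall(text)}
```

On these samples both functions return the same mapping of names to votes. If the hypothetical page swapped its bold tags for another tag, the scraper would silently return nothing while the XML reader would be unaffected, which is the kind of fragility that makes scraping cumbersome.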

In 2007, however, Derek Willis, a co-creator of the Post’s database, was poking around the Senate website when he discovered a directory of XML files of vote data for past sessions. This demonstrated that the Senate had the ability to make its votes available online in a structured format. Willis was elated at the thought that perhaps there was easy access to Senate vote data after all. He wrote to the Senate webmaster asking whether structured voting data was available for the current session and, if so, whether this data would be made public. The telling response read, in part, as follows:

A few representative votes (only a few from the early congresses) were published out to the active site during some testing periods. I really need to remove them from the site.

We are not authorized to publish the XML structured vote information. The Committee on Rules and Administration has authorized us to publish vote tally information in HTML format [not a structured format]. Senators prefer to be the ones to publish their own voting records. As you know, looking at a series of vote results by Senator or by subject does not tell the whole story. Senators have a right to present and comment on their votes to their constituents in the manner they prefer. This issue was reviewed again recently and the policy did not change.

Senators doubtlessly would “prefer to be the ones to publish their own voting records.” But jealous control over information by government is anathema to democracy. Looking at a series of votes by a senator does in fact tell the “whole story” of that senator’s voting record, and despite what the webmaster said, senators do not have a “right” to present their votes to the public “in the manner they prefer.” Of course, this only motivated hackers such as Willis and Tauberer further.

When third parties make government data available, it demonstrates that it is possible to do so cheaply and efficiently. In some cases, officials can be unaware of what is technically possible, or they may believe that state-of-the-art technology is prohibitively expensive. Freeing information can also generate awareness of and demand for the newly accessible data among citizens. This can lead to embarrassing questions for government: why isn’t it making the data available itself? Why are citizens forced to hack the data in order to access it? Also, hacking government data can demonstrate to cautious officials that when information is made accessible and useful, the world does not end.

In fact, since GovTrack.us and the Washington Post Congress Votes database brought attention to the issue, the Senate Rules Committee finally relented and has recently begun to make roll call votes available in XML. Two years after Derek Willis was rebuffed by the Senate webmaster, a group of seemingly embarrassed senators wrote to Committee Chairman Chuck Schumer demanding a repeal of the prohibition on XML.

“This policy has created a situation where outside groups are forced to create databases that are more likely to contain errors and omissions,” they wrote. “The suggestion that the Senate would intentionally hamstring the distribution of roll call votes so Senators could put a better spin on them is concerning. The public is capable of interpreting our votes on its own.”

The release of Congressional Research Service (CRS) reports is another example of hacking that is slowly leading to change. CRS is a think tank for Congress that is funded by U.S. taxpayers to the tune of $100 million per year. It produces objective in-depth briefings and high-quality research papers on topical public policy issues. The studies it produces are well regarded, and members of Congress and their staffs rely on them as they legislate. By law, however, CRS can make its reports available only to members of Congress.

There are several oft-cited rationales for this policy. First is the idea that by releasing reports directly to the public, CRS would come between lawmakers and their constituents. There is also the concern that if CRS reports were widely disseminated, they would come to be written with a public audience in mind, rather than focusing on congressional needs. Finally, some fear a burden on CRS and congressional staff members who might have to respond to questions and comments generated by publicly available reports. Not surprisingly, several third parties who are not persuaded by these rationales have taken it upon themselves to collect and make the reports available on the Web.

Members of Congress are free to release copies of the reports to citizens if they wish, and they often do so at the request of a constituent. Unfortunately, constituents must first know of a report’s existence before they can request it. There is no public listing of all the available titles, so the fear is that embarrassing reports are never released. Once a report is released, however, one is free to copy and disseminate it because works of the U.S. government are not protected by copyright. For several years, many organizations, including the Federation of American Scientists and the National Council for Science and the Environment, have published on their websites hundreds of CRS reports related to their research areas that they have collected over the years.

In 2005, the Center for Democracy and Technology (CDT), a nonprofit dedicated to Internet public policy issues, brought the different CRS collections under one roof at OpenCRS.com. There, users can search the combined collections of CDT and partner groups, which total more than 8,000 CRS reports. More importantly, the site invites users to upload to the online library reports that they have acquired. It also provides a list of CRS reports that are missing from the library and instructions on how citizens can request them from their representatives. CDT acquires this list from a sympathetic but anonymous member of Congress who provides it on a regular basis.

Further compromising the secrecy surrounding CRS reports, in February 2009 whistleblower site WikiLeaks.org released 6,780 CRS reports, which it said “represents the total output of the Congressional Research Service electronically available to Congressional offices” (http://wikileaks.org/wiki/Change_you_can_download:_a_billion_in_secret_Congressional_reports).

For more than a decade, lawmakers sympathetic to open government have perennially introduced bills or resolutions to make CRS reports public. While these efforts have so far failed, the vast number of CRS reports now available to the public on third-party sites undercuts the rationales for a policy of selective release. Citizens have access to a wide array of CRS reports, yet the quality of those reports has not suffered, CRS’s institutional character has not been diminished, and constituent relationships with representatives remain intact. As unofficial collections of CRS reports continue to grow, not only will citizens benefit from access to this information, but the selective release policy will become increasingly untenable and ripe for change.

RECAP: Freeing PACER Documents for Public Use

As long as some public documents are out of reach online, third parties will be motivated to free them. Carl Malamud’s most recent intellectual descendants may be a team of hacker scholars dedicated to liberating the millions of pages of public records now locked behind a paywall on the federal courts’ online database. Stephen Schultze, Tim Lee, and Harlan Yu of the Center for Information Technology Policy at Princeton developed a web browser plug-in, RECAP, that distributes the hacking of the court database among many users.

Each day, federal courts around the country generate thousands of pages of court filings, transcripts, judgments, and opinions, all of public interest. Free access to these documents is guaranteed to all citizens, so long as they visit the courthouse in question during business hours.

The Public Access to Court Electronic Records (PACER) system was created in 1988 as a dial-up service that charged by the minute and afforded attorneys convenient remote access to court dockets and filings. In 1998, the system was migrated to the Web, and the per-minute charge was replaced by a charge per page downloaded. Today, PACER charges 8 cents per page accessed, which doesn’t sound like much. However, the charge is completely out of proportion with how much it costs the court system to make the data available. As a result, the court system has found itself with a substantial surplus in its IT budget. Some estimate that in 2006, the cost to maintain PACER was about $27 million, yet the system brought in $62 million in revenue.

This is not a concern for lawyers, who pass their costs on to clients, or for commercial data retailers, such as Westlaw and LexisNexis, that purchase their data in bulk and benefit from the lack of convenient free access to court records. However, scholars, journalists, and average citizens are left without free online access to court records.

Schultze, Lee, and Yu’s scheme to free the documents on PACER is an ingenious one. They have built a Firefox plug-in called RECAP that attorneys, librarians, and other regular users of PACER can install on their computers. When a user downloads a document from PACER, the plug-in sends a copy to RECAP’s server, where it is made publicly available. If enough PACER users install RECAP, it will only be a matter of time before the entire database is liberated.

This is a brilliant system, but it raises the obvious question: altruistic motives aside, why would attorneys or other PACER users install the RECAP plug-in? The answer is just as brilliant. When a user clicks to download a document from PACER, the RECAP plug-in first searches its own server to see whether that document has already been made publicly available. If it has, then it is served to the user, who avoids PACER’s download charges. Additionally, the RECAP plug-in adds some nice touches, including reformatting the document filename and adding useful metadata.
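The check-the-archive-first flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not RECAP’s actual code: the class names, the fetch function, and the in-memory dictionaries are all hypothetical stand-ins for the plug-in, the public repository, and the paywalled database. The only figure taken from the text is the 8 cents per page charge.

```python
from dataclasses import dataclass, field

FEE_CENTS_PER_PAGE = 8  # per-page charge cited in the text


@dataclass
class Document:
    doc_id: str
    pages: int


@dataclass
class PaidCourtSystem:
    """Hypothetical stand-in for the paywalled database."""
    documents: dict

    def download(self, doc_id):
        # Every download is billed by the page.
        doc = self.documents[doc_id]
        return doc, doc.pages * FEE_CENTS_PER_PAGE


@dataclass
class PublicArchive:
    """Hypothetical stand-in for the shared public repository."""
    store: dict = field(default_factory=dict)

    def lookup(self, doc_id):
        return self.store.get(doc_id)

    def contribute(self, doc):
        self.store[doc.doc_id] = doc


def fetch(doc_id, archive, paid):
    """Return (document, cost in cents), preferring the free public copy."""
    cached = archive.lookup(doc_id)
    if cached is not None:
        return cached, 0  # someone already paid; this user pays nothing
    doc, cost = paid.download(doc_id)
    archive.contribute(doc)  # share the purchased copy with everyone else
    return doc, cost
```

In this sketch the first user to request a ten-page document pays 80 cents and the copy lands in the archive; every later request for the same document is served from the archive at no charge. That is the incentive the chapter describes: each paying user both benefits from and adds to the shared collection.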

Although it is a shot across the bow of PACER’s revenue stream, RECAP is entirely legal. The court documents that RECAP shares, like CRS reports, are not subject to copyright and can be freely distributed by citizens. By creating a public repository of these documents, the RECAP team is forcing the issue: the court system must either implement legal or technical countermeasures or seek a new business model that includes public access to its documents.

Conclusion

This tough-love approach is a necessary counterpart to a public education and advocacy strategy for realizing a more open government. While we must convince citizens and government officials of the merits of transparency, talented hackers can often simply show them. When this happens, the burden of proof shifts to government. And, in turn, it’s difficult to make the case that more, better, and easier access to public data isn’t a good thing.