How to get rid of Screen Scrapers from your Website

While driving on a long trip this weekend, I had a bit of time to think. One topic that came to my mind was screen scraping, with a focus on APIs. It hit me: screen scraping is more of a problem with the content producer than it is with the “unauthorized scraping” application.

Screen scraping is the process of taking information that is rendered on the client and transforming it in another process. Typically, the information obtained is then filtered, saved, or fed into a calculation. Everyone has performed some [legitimate] form of screen scraping: when you print a web page, the content is reformatted for the printer. Many of the unauthorized forms of screen scraping have involved collecting information on live gambling games [poker, etc.], redirecting CAPTCHAs, and collecting airline fare/availability information.

The scrapee’s [the organization that the scraper is targeting] argument against the process is typically a claim that the tool puts an unusual demand on their service, and that this demand does not come with the predictable chance of profit that their usual traffic brings. Another argument is that the scraper gives its user an unfair advantage over other users of the service. A third argument is that the content is being misappropriated, or that the scraper is gaining some value at the scrapee’s expense. In most cases, the scrapee fights back through legal or technical means.

The problem I have with fighting back against scrapers is that it never solves the problem that the scrapers are trying to fix. Let’s take a few examples to illustrate my point: the KVS tool, TV schedules, and poker bots. The KVS tool uses [frequently updated] plugins to scrape airline sites for accurate pricing and seat availability details. The tool is really good for people who want a fair bit of information on what fares are available and when. It does not provide any information that is not already available elsewhere; it just makes many more queries than most people can make manually. Airlines fight against this because they make a lot of money on uninformed users. Their business model is to guarantee that their passengers are not buying up cheap seats. When an airline claims to have a “lowest price guarantee,” that typically means that they show the discount tickets for as long as possible, until they’re gone.

Another case where web scraping has caused an issue is TV schedules. During the MythTV craze a few years ago, many open source users were using MythTV to record programs via their TV card. It’s a great technology; however, the schedule is not provided in the cable TV feed, at least not in an unencrypted manner. Users had to resort to scraping television sites for publicly available “copyrighted” schedules.

The poker bots are a bit more of an ethical issue, because they change the real-world rules of the game. When playing poker off the internet, players do not have access to real-time statistics tools. Online poker providers aggressively fight against the bots. It makes sense; bots can perform the calculations a lot faster than humans can.

Service providers try to block scrapers in a few different ways. The end of the Wikipedia article lists more; this is a shortened version. Web sites try to deny or misinform scrapers in a few ways: profiling the web request traffic (clients that have difficulty with cookies and do not load JavaScript/images are big warning signs), blocking the requesting provider, providing “invisible false data” (honeypot-like paths in the content), etc. Application-based services [as with poker bots] focus more on looking for processes that may influence the running executable, securing the internal message handling, and sometimes recording the session (also typically done in MMORPGs).
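To make the traffic-profiling idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `ClientProfile` fields, the `scraper_score` function, and the weights are made up and are not taken from any real detection product.

```python
from dataclasses import dataclass


@dataclass
class ClientProfile:
    """Observed behavior of one client over a session (all fields hypothetical)."""
    requests: int            # pages requested
    assets_loaded: int       # images/scripts/styles fetched
    cookies_returned: bool   # did the client send back the session cookie?
    honeypot_hits: int       # requests to links hidden from human visitors


def scraper_score(p: ClientProfile) -> int:
    """Return a rough suspicion score; higher means more scraper-like."""
    score = 0
    if not p.cookies_returned:
        score += 1
    # Browsers fetch images and scripts; many scrapers fetch only the HTML.
    if p.requests > 0 and p.assets_loaded == 0:
        score += 1
    # A human cannot click a link they cannot see.
    if p.honeypot_hits > 0:
        score += 2
    return score


browser = ClientProfile(requests=12, assets_loaded=85, cookies_returned=True, honeypot_hits=0)
bot = ClientProfile(requests=400, assets_loaded=0, cookies_returned=False, honeypot_hits=3)
print(scraper_score(browser), scraper_score(bot))  # prints "0 4"
```

A real site would tune the signals and thresholds against its own traffic; the point is only that each warning sign above can be turned into a cheap check.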

In all three cases, my point is not to argue whether the service is justified in attempting to block scrapers; my point is that the service providers are ignoring an untapped secondary market. Those providers have refused to address the needs of this market, or maybe they just haven’t seen the market as viable and are merely ignoring it.

If people wish to make poker bots, create a service that allows just the bots to compete against each other. The developers of these bots are [generally] interested in the technology, not so much in ripping off non-bot users.

For airlines, do not try to hide your data. Open up API keys for individual users. If an individual user abuses the data by reselling it or by creating a Hipmunk/Kayak clone, revoke the key. Even if the individual user’s service requests don’t fit the profile of a reseller, there are ways of catching this behavior: mapmakers solved this problem a long time ago by creating trap streets [fictitious entries that expose anyone who copies the map]. Scrapers are typically a last resort; they’re used to do something that the current process makes very difficult.
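As a rough sketch of how per-user keys and trap streets could work together, the Python below gives each key its own fictitious fare code and revokes the key whose code later turns up in republished data. The `KeyRegistry` class, its method names, and the fare codes are all hypothetical; no airline API is being described here.

```python
import secrets


class KeyRegistry:
    """Hypothetical per-user API key store with one 'trap street' per key."""

    def __init__(self):
        self._canary_by_key = {}   # api key -> fake fare code unique to that key
        self._revoked = set()

    def issue(self) -> str:
        key = secrets.token_hex(8)
        # A real canary would look like a plausible fare; the prefix is for clarity.
        self._canary_by_key[key] = "TRAP-" + secrets.token_hex(4)
        return key

    def fares(self, key: str, real_fares: list) -> list:
        if key in self._revoked or key not in self._canary_by_key:
            raise PermissionError("unknown or revoked API key")
        # Mix this key's fictitious entry into the genuine results.
        return real_fares + [self._canary_by_key[key]]

    def revoke_leaker(self, fares_seen_elsewhere: list):
        """Revoke whichever key's canary shows up in republished data."""
        for key, canary in self._canary_by_key.items():
            if canary in fares_seen_elsewhere:
                self._revoked.add(key)
                return key
        return None


registry = KeyRegistry()
alice, bob = registry.issue(), registry.issue()
republished = registry.fares(bob, ["CLT-SFO-199"])   # bob resells what he fetched
leaker = registry.revoke_leaker(republished)
print(leaker == bob)  # prints "True"
```

The design choice is the same one mapmakers made: the honest user never notices the fake entry, while a reseller identifies themselves by copying it.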

Warning, more ranting: with airline sites, it’s difficult to get a good impression of the cost differences of flying out of different markets [like flying from Greensboro rather than Charlotte] or even of changing tickets, so purchasing from an airline is difficult without the aid of this kind of tool. Most customers want to book a single round trip ticket, but some may have a complex itinerary that has them leaving Charlotte, stopping over in Texas, continuing to San Francisco, and then returning to Texas before flying back to where they started. That could be accomplished by purchasing separate round trip tickets, but the rules of the tickets allow such combinations to exist on a single itinerary. Why not allow your users to take advantage of these rules [without the aid of a costly customer service representative]?

People who use scrapers do not represent the majority of the service’s customers. In the case of the television schedules, the users were not profiting from the information; retrieving it wasn’t motivated by profit at all. Luckily, an organization stepped in and provided this information at a reasonable [$25/yr] cost. The organization is SchedulesDirect.

The silver lining to the battle over scrapers can get interesting. The poker clients have prompted scraper developers to come up with clever solutions. The “Coding the Wheel” blog has an interesting article about this, covering how they inject DLLs into running applications, use OCR, and abuse Windows message handling [again, of another process]. Web scraping also introduces interesting topics that deal with machine learning [to create profiles] and with identifying usage patterns.

In conclusion, solve the issue that the screen scrapers attempt to solve, and if you have a situation like poker, prevent the behavior you wish to deny.

The Linguistics of Webservices

It is frequently recommended that readable, “good” code should closely resemble speech. For example, name a variable representing a collection “library” and a method “remove.” With those two items, one can have the library remove a book. For example:

 
library.remove(book);

Services are in an odd situation: they are meant to hide the actual networking from the client, yet be treated like neighboring code within the application. Without the proxies and method-like calls, one might as well write their own protocol and avoid web services altogether. Therefore, I thought it would be interesting to do a thought experiment on how one can refer to web services.

REST services ask two basic questions: “Can you _____ for me?” and “What do you know about ______?” The subject “you” in this case is the web service; “me” is the service client. Asking whether the service can do something for you refers to REST calls with the POST, DELETE, or PUT method. Asking what the service knows hints that the GET [and sometimes POST] HTTP method will be used. With REST running over HTTP, the body travels in whatever format the Content-Type header declares [usually text such as JSON or XML, with any embedded binary data commonly Base64 encoded], so the receiving web service knows which encoding to expect.
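The two questions can be sketched as a small lookup from question to HTTP method. The Python below only builds the requests and never sends them; the base URL, the `METHOD_FOR` table, and the `build_request` helper are assumptions made up for this illustration.

```python
import urllib.request
from typing import Optional

BASE = "https://api.example.com/library"  # hypothetical URL, not a real service

# "What do you know about ___?" maps to GET; "Can you ___ for me?" maps to
# the methods that change state on the server.
METHOD_FOR = {"know": "GET", "add": "POST", "replace": "PUT", "remove": "DELETE"}


def build_request(question: str, resource: str,
                  body: Optional[bytes] = None) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for one of the questions."""
    return urllib.request.Request(f"{BASE}/{resource}", data=body,
                                  method=METHOD_FOR[question])


lookup = build_request("know", "books/42")
removal = build_request("remove", "books/42")
print(lookup.get_method(), lookup.full_url)    # GET https://api.example.com/library/books/42
print(removal.get_method(), removal.full_url)  # DELETE https://api.example.com/library/books/42
```

Note how the second request is the web-service cousin of `library.remove(book)`: the resource is the noun and the HTTP method is the verb.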

SOAP is a little more complex. Scratch that, it is a lot more complex. When SOAP was created, the powers that be went a little overboard for 90%* of the use cases. SOAP includes information about how the web service is composed, the structure of complex data types, error handling, and the calls themselves. Therefore, there should be a new way to refer to them. SOAP web services can be asked the following questions:

  1. What can you do?
  2. What is that weird data type that is being returned by this method?
  3. What exceptions can be expected from this method?
  4. Can you run _____ [method]?
  5. Can we keep a secret? If so, then please run this privately _______. [WS-Security]
  6. Can you return/receive the results from [method] in another language? [WS-DIME]

Given any extra extensions to the WS-* specification, more questions may be asked.
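Question 4 above, “Can you run _____?”, is the one the client actually puts on the wire; questions 1 through 3 are normally answered by the service’s WSDL description. A minimal sketch of such a call in Python, using only the standard library, might look like the following. The `urn:example:library` namespace, the `RemoveBook` method, and the `isbn` parameter are invented for the example; only the SOAP 1.1 envelope namespace is real.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:library"                      # hypothetical service namespace


def build_call(method: str, **params: str) -> bytes:
    """Wrap a method call and its arguments in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope)


print(build_call("RemoveBook", isbn="12345").decode())
```

Compare the amount of wrapping here with a plain REST DELETE on a resource URL, and the “a lot more complex” remark starts to feel fair.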

* Statistics here are not an accurate measure, more of a wild guess. Nevertheless, it sounds right, and that’s what counts, right?

APIs: A Strategy Guide

When considering the importance of APIs, I took a look at the book APIs: A Strategy Guide by Daniel Jacobson and Dan Woods. It didn’t meet my hopes, but it did go on an interesting journey. I was going to write some of my thoughts on APIs, but this book covers them better than I could. When I saw the subject of the book and the publisher [O’Reilly Media], I had the impression that the book was going to be developer-centric and would cover many of the popular APIs [Twitter, Facebook, Weather.com, Reddit, Last.fm/Audioscrobbler, Amazon, etc.]. I was sorely wrong. This book is more about selling the idea of creating an API, marketing it to future developers, and discussing the technical problems introduced by an in-place API.

What I had hoped: I had hoped that the book would be more of a Rosetta Stone of APIs: how to critique an API, how to find weaknesses [it does hint at this, but it wasn’t a serious look], how to optimize usage, and how to potentially adapt software into an API. I can’t call myself an API designer; however, I do have some authority on the subject. My experience includes adapting a massive legacy system into a license-controlled API. I can’t go into too many of the details, as it was an assignment for work.

While I’m still off subject, I should add to my rant. APIs are notoriously annoying to learn. Every provider has a new way of accessing data and passing around data IDs [an identifier specifying a user, their credentials to make API calls, etc.]. Some providers include good documentation; some don’t. Some providers make their API incredibly difficult to understand; some don’t. Some APIs have synchronous and asynchronous calls. These technical details were not mentioned, compared, or criticized in the book. I was left wishing that they were. As a developer, those details interest me more than how to sell the idea to my boss.

Getting back on the subject, I wish that the book discussed more about the communication channels over which APIs may be carried. The book highly promoted REST and mentioned SOAP. That’s great, but what about XMPP, in-proc messages, binary web services [Hessian], etc.? I finished the book wondering: what about separating the API layer from the actual implementation? It also did not address the issue of multi-language environments. Should you support non-Java-based clients, etc.? I realize that the answer to that should be “it must be language independent.” But it brings up an important question about how you, the producer, want the API to be used. That is an important consideration if your service/product is meant to be scalable, or if the tasks are intended to take a long time. [Hint: Use a message queuing service.]
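The message-queue hint can be sketched in a few lines of Python. This is only an in-process stand-in, assuming the standard library’s `queue.Queue` plays the role of a real message queuing service; the `submit` and `worker` functions and the uppercase “task” are placeholders invented for the example.

```python
import queue
import threading
import uuid

jobs = queue.Queue()      # stand-in for a real message queuing service
results = {}              # job id -> result, filled in by the worker


def submit(payload: str) -> str:
    """API entry point: enqueue the work and return a ticket immediately."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, payload))
    return job_id


def worker() -> None:
    """Runs apart from the API layer and drains the queue."""
    while True:
        job_id, payload = jobs.get()
        results[job_id] = payload.upper()   # placeholder for the slow task
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()
ticket = submit("build report")
jobs.join()               # a real client would poll or be notified instead
print(results[ticket])    # prints "BUILD REPORT"
```

The API layer only hands out tickets, so the implementation behind the queue can be slow, scaled out, or written in another language without the caller noticing.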

According to the write-ups on the authors, they have the credentials to talk about APIs in a business sense. But when it comes to actually implementing one, I wish they had had a subject-matter expert[s] go into more detail. Technical detail and a ridiculous amount of depth are what I have come to expect from an O’Reilly book.

Overall, this book is interesting in an architectural sense. From a developer’s perspective, however, it’ll give you things to argue about in the planning stage, but won’t help later during implementation. Want to give your developer users an API for your product? Give your manager this book, and soon you’ll be asked to write an API.