I don’t like Virtual Machines

I don’t like virtual machines. Given that my current position and employer involve building, maintaining, supporting, optimizing, and selling these solutions/services, this statement is a bit ironic. I don’t hate the benefits the technology has given us. It’s amazing what it has provided, and it’s amazing how you can scale up a service without having to bring in lots of new hardware and maintain that as well. It’s more efficient and cost effective than the old way of doing things, and it has pushed forward the development of operating systems and drivers.

The problem I have with virtual machines is more about what they fundamentally are. Put very simply, a VM is an emulation of a physical machine that runs on a host under a piece of software known as a hypervisor. Oftentimes, lots of virtual machines are run on the same box using systems like VMware ESX Server, Xen, or KVM. In most cases the differences between a physical machine and a virtual machine are very small. The differences show up with 3D/low-latency applications and with VMs that depend on hardware input that cannot be emulated (e.g., hardware random number generation). After reading that, and considering the downsides, there should be something that sticks out to you, something that should make you feel uncomfortable.

For me, what is very uncomfortable is the fact that we have many VMs running on the same box. Much of the processing, storage, and memory is consumed by redundant operating system processes and their associated files. It seems really inefficient to have 30 instances of Windows Server 2012 all running IIS at the same time. The alternative to this madness is containers. I like the idea, and I would love to get a chance to learn more about OpenVZ and LXC when I have more time. I like containers because they are sandboxed, managed environments that push the processing capacity toward the actual job being performed. It feels more efficient, and more in line with solving the problem rather than creating more infrastructure.

Prior to the virtualization era, we were encouraged to build grid services. This was great: you could throw a lot of machines at a problem and have them work in harmony. However, it didn’t work as well as we hoped because the tools and frameworks offered at the time were immature. In place of grid computing, the next approach was to split the problem into individual processing units and simply throw a lot of machines at it. After all, VMs are “nearly free.” This doesn’t really fix the problem; it just seems like we’re timesharing on a powerful server once again.

Two book reviews that were not done

This year, I haven’t blogged as much as I have previously. Things have kept me busy… However, in the whirlwind of everything I’ve been up to, I forgot to write a recommendation/review of a few books.

Instead of boring you with the details of everything I liked or disliked about each book, here are a few of the books that I greatly approve of:

  • DSLs In Action – Ghosh
  • Everything is Negotiable – Kennedy
  • 33 Strategies of War – Greene
  • The Obstacle is the Way – Holiday

Why I’ll buy more Logitech Wireless mice

If you’ve read much of my blog, you know I tend to be more critical than complimentary. It’s human nature: negative emotions and experiences tend to be amplified more than positive ones. However, I feel I must mention my experience with Logitech wireless mice. I first bought a Bluetooth wireless mouse and had a horrible experience with it; the battery would last 1-2 weeks at most. However, a friend of mine owned an M325 [which has its own custom wireless protocol and adapter], so I followed suit. In my experience, I have never had to change the battery that came with the mouse, and it’s been nearly a year now.

Recently I moved, and one of the USB micro adapters broke. I bought a new mouse and then found out that another one of the mice had been lost. [I previously had two.] It turns out that the adapters are easily re-paired using the Logitech Unifying software. Although the software is for Windows/Mac only, it is slick. Additionally, you can pair multiple keyboards and mice to one adapter.

How to get rid of Screen Scrapers from your Website

While driving on a long trip this weekend, I had a bit of time to think. One topic that came to my mind was screen scraping, with a focus on APIs. It hit me: screen scraping is more of a problem with the content producer than it is with the “unauthorized scraping” application.

Screen scraping is the process of taking information that is rendered on a client and then transforming it in another process. Typically the information obtained is later filtered, saved, or used in a calculation. Everyone has performed some [legitimate] form of screen scraping: when you print a web page, the content is reformatted for printing. Common unauthorized forms of screen scraping include collecting information on in-progress gambling games [poker, etc.], redirecting CAPTCHAs, and collecting airline fare/availability information.
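As a rough illustration of the mechanics (my own sketch, not taken from any real tool or airline site), here is a minimal Java example using the jsoup library; the URL, the CSS selectors, and the table layout are all made-up assumptions.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FareScraperSketch {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the rendered page. The URL is a placeholder.
            Document page = Jsoup.connect("https://example.com/fares")
                    .userAgent("fare-scraper-sketch/0.1")
                    .get();

            // Pull out the fragments we care about and hand them to another
            // process (here we just print them). The selectors assume a
            // made-up table layout.
            for (Element row : page.select("table.fares tr")) {
                String route = row.select("td.route").text();
                String price = row.select("td.price").text();
                if (!route.isEmpty()) {
                    System.out.println(route + " -> " + price);
                }
            }
        }
    }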

The scrapee’s [the organization that the scraper is targeting] argument against the practice is typically that the tool puts an unusual demand on its service, a demand that does not come with the predictable probability of profit it is used to. Another argument is that the scraper gives its user an unfair advantage over other users of the service. A third argument is that the content is being misappropriated, or that some value is being gained by the scraper and defrauded from the scrapee. In most cases, the scrapee fights back through legal or technical means.

The problem I have with fighting back against scrapers is that it never solves the problem the scrapers are trying to fix. Let’s take a few examples to illustrate my point: the KVS tool, TV schedules, and poker bots. The KVS tool uses [frequently updated] plugins to scrape airline sites for accurate pricing and seat availability details. The tool is really useful for people who want a fair amount of information on what fares are available and when. It does not provide any information that isn’t already available to everyone else; it just makes many more queries than most people could do manually. Airlines fight against this because they make a lot of money on uninformed customers. Their business model depends on ensuring that passengers are not all buying up the cheap seats. When an airline advertises a “lowest price guarantee,” that typically just means they show the discounted tickets for as long as possible, until they’re gone.

Another case where web scraping has caused friction is TV schedules. During the MythTV craze a few years ago, many open source users were using MythTV to record programs via their TV tuner cards. It’s a great technology; however, the schedule data is not provided in the cable TV feed, at least not in an unencrypted form. Users had to resort to scraping television sites for publicly available “copyrighted” schedules.

Poker bots are a bit more of an ethical issue, because they change the real-world rules of the game: when playing poker outside of the internet, players do not have access to real-time statistical tools. Online poker providers aggressively fight the bots, and it makes sense; bots can perform the calculations a lot faster than humans can.

Service providers try to block scrapers in a few different ways. [The end of the Wikipedia article on screen scraping lists more; this is a shortened version.] Web sites try to deny or misinform scrapers by profiling the request traffic (clients that have difficulty with cookies and do not load JavaScript/images are big warning signs), blocking the requesting provider, serving “invisible false data” (honeypot-like paths in the content), and so on. Application-based services [poker clients] focus more on looking for processes that may influence the running executable, securing their internal message handling, and sometimes recording the session (as is also typically done in MMORPGs).
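To make the “invisible false data”/honeypot idea concrete, here is a minimal sketch of a servlet filter that flags any client requesting a hidden link no normal visitor would follow. This is my own illustration, not a description of what any of these sites actually do; the honeypot path and the in-memory flag list are placeholder assumptions.

    import java.io.IOException;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical honeypot filter: a link to HONEYPOT_PATH is hidden in the
    // page markup where no human-driven browser would follow it; any client
    // that requests it is treated as a scraper from then on.
    public class HoneypotFilter implements Filter {
        private static final String HONEYPOT_PATH = "/listings/full-export"; // placeholder path
        private final Set<String> flaggedAddresses = ConcurrentHashMap.newKeySet();

        public void init(FilterConfig config) { }

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpServletResponse response = (HttpServletResponse) res;
            String client = request.getRemoteAddr();

            if (request.getRequestURI().startsWith(HONEYPOT_PATH)) {
                flaggedAddresses.add(client); // remember who tripped the honeypot
            }
            if (flaggedAddresses.contains(client)) {
                // Block the flagged client; a real site might serve false data instead.
                response.sendError(HttpServletResponse.SC_FORBIDDEN);
                return;
            }
            chain.doFilter(req, res);
        }

        public void destroy() { }
    }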

In all three cases, my point is not to argue whether the service is justified in attempting to block the scrapers; my point is that the service providers are ignoring an untapped secondary market. They have refused to address the needs of this market, or perhaps they simply haven’t seen it as viable and are merely ignoring it.

If people wish to make poker bots, create a service that lets the bots compete only against each other. The developers of these bots are [generally] interested in the technology, not so much in ripping off non-bot users.

For airlines: do not try to hide your data. Open up API keys to individual users. If an individual user abuses the data to resell it or to create a Hipmunk/Kayak clone, revoke the key. Even if an abusive user’s requests don’t fit an obvious profile, there are ways of catching the behavior; mapmakers solved this problem a long time ago by creating trap streets. Scrapers are typically a last resort, used to do something that the current process makes very difficult.

Warning, more ranting: with airline sites it’s difficult to get a good sense of the cost differences of flying from different markets [like flying from Greensboro rather than Charlotte] or of changing tickets, so purchasing from an airline is difficult without the aid of this kind of tool. Most customers want to book a single round-trip ticket, but some may have a complex itinerary that has them leaving Charlotte, stopping over in Texas, continuing to San Francisco, and then returning through Texas back to their original destination. That could be accomplished by purchasing separate round-trip tickets, but the fare rules allow such combinations to exist on a single itinerary. Why not let your users take advantage of these rules [without the aid of a costly customer service representative]?

People who use scrapers do not represent the majority of a service’s customers. In the television schedules example, the users do not profit from the information; the scraping wasn’t even motivated by profit. Luckily, an organization, SchedulesDirect, stepped in and now provides this information at a reasonable cost [$25/yr].

The silver lining to the battle over scrapers is that it can get interesting. The poker clients have prompted scraper developers to come up with clever solutions; the “Coding the Wheel” blog has an interesting article about how they inject DLLs into running applications, use OCR, and abuse Windows message handles [again, those of another process]. Web scraping also introduces interesting topics around machine learning [to build profiles] and identifying usage patterns.

In conclusion, solve the issue that the screen scrapers attempt to solve, and if you have a situation like poker, prevent the behavior you wish to deny.

Things I Wish Existed a Long Time ago: Unit Test Reporting

Reporting is, generally, one of the most boring tasks/subsections of software development. The level of “uninterestingness” is exemplified by the “TPS reports” in Office Space. However, reports can be awesome. They are incredibly useful when they help you accomplish your tasks, are interactive, and show data in a convenient, uncluttered fashion.

Unit testing is absolutely necessary in development [even more so for larger systems]; however, the tools for scouring these tests are often unhelpful. The tools that interact with NUnit, JUnit, and GUnit tend to just display the results of the tests that were run. Code coverage tools help by reporting back the percentage of code covered by the tests [sometimes integrated into IDEs, as with TestCocoon (C++) and EMMA/EMMA-compatible tools (Java)]. However, these tools lack an overall reporting capability that can say “hey, these bits/sections/methods/components lack coverage the most in your code.”

The only tool I have found so far that handles this is Clover [by Atlassian]. The features look great. Unfortunately, the product requires a centralized build server, which is a little too much to ask for hobby projects. [Speculation warning] Using it would require a continuous integration/build server setup. Also, the cost may be a little high [$1,200 for a one-machine server] to make an off-the-cuff suggestion to management. I realize the return from such a product may well exceed its cost, however that is a little difficult to determine [without an evaluation trial, and if the trial requires a build server you don’t have, that’s a lot to ask for an evaluation]. They do have a desktop version, but from the screenshots and the description I wonder how well it would work [being a web client]. Also, I doubt that it includes C++/GUnit support.

 

Fixing Classpath Issues with JasperReports, J2EE, and Maven

If you are having issues bundling JasperReports, a J2EE server, and Maven, you are not alone. There are many bumps in the road to getting JasperReports integrated with a Tomcat/J2EE container and building with Maven. Hopefully, this blog entry will make things slightly easier.

First, there is confusion about where the JasperReports dependency lives in the Maven 2 repository. There is a very suggestive entry, {groupId: jasperreports, artifactId: jasperreports}; however, that entry only hosts versions 0.5.0 to 3.5.3. The entry {groupId: net.sf.jasperreports, artifactId: jasperreports} contains versions 3.6 up to the latest. In addition, the net.sf.jasperreports group contains an artifactId of jasperreports-fonts, handy if special reporting fonts are needed by your reports.
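In pom.xml terms, the newer entry looks something like this (the version shown is only an example; any 3.6+ release lives under this groupId):

    <dependency>
        <groupId>net.sf.jasperreports</groupId>
        <artifactId>jasperreports</artifactId>
        <!-- example version only; pick whichever 3.6+ release you need -->
        <version>3.6.0</version>
    </dependency>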

Second, if you are getting a ClassNotFoundException or NoClassDefFoundError for the class net/sf/jasperreports/engine/JRException, your issue is still with the POM configuration for JasperReports. ClassNotFoundException is a runtime error, so you can see it despite a successful build, and you may see it even with the jasperreports-x.x.x.jar file sitting in your WEB-INF folder.

You must modify the scope of the dependency in the pom.xml. Change the scope from compile [the default, which is only needed to build] to “provided” (http://maven.apache.org/pom.html#Dependencies). With provided scope, Maven still compiles against the dependency but expects the container’s classpath to supply it at runtime, so the container’s copy is the one the application actually uses.
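A sketch of what the dependency block might look like after the change (again, the version is only an example):

    <dependency>
        <groupId>net.sf.jasperreports</groupId>
        <artifactId>jasperreports</artifactId>
        <version>3.6.0</version>
        <!-- compile against the jar, but let the container supply it at runtime -->
        <scope>provided</scope>
    </dependency>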
