Part 1 of 2: How To Make Your Life Easier As A Tech Worker (IT/Software Engineering/System Administrator): Automation

After some time as a student and then as a tech worker [Software Engineer, IT Consultant, or System Administrator], you should start looking for ways to make your job easier. These are some of the things I have found helpful:

  • Create templates for daily status emails. If your employer/client requires a daily status, a template makes writing it up easier. Outlook can keep track of email templates.
    • If the daily status is sent to multiple people, it may be helpful to create a distribution list. With a distribution list you are less likely to lose contacts between statuses.
    • One template a coworker of mine uses is this template.
  • Create filters for emails. This can help reduce the amount of email that requires attention. For example, if a group communicates through a shared distribution list, create a rule or filter that sorts those messages into a separate folder. This can also clear out “Build succeeded” emails [if your company automatically subscribes you to those].
    How To Create Rules In Outlook.
    How to Create GMail Filters.
  • Use SSH keys: SSH keys can save a little bit of time by creating password-less logins. Also, with SSH keys, scripts are able to run commands on other SSH-enabled boxes. This means that many repetitive system administration tasks can be automated. Gentoo Guide to SSH [This includes generic non-Gentoo instructions.]
  • SSH Commands: SSH Commands are shell-less sessions in which a user logs into an SSH box with a key that kicks off only one specific command. For example, say you had to reset an environment, and it required quite a few commands. You would combine all of those commands into one script and tie that script to its own key as an SSH command. From then on, you could kick off the whole reset just by connecting to the SSH box with that key. More information.
  • Backups: It can’t hurt to back up your data. I’ve never found a single person that was reliable at doing this frequently by hand. Setting up rsync with SSH keys and making it a cron job is a good first step. I’ve heard good things about luckyBackup, but I haven’t had a chance to try it.
  • Use Salt/Puppet to automate system administration tasks on multiple machines. This is most helpful for those who regularly roll out new software to multiple machines, or need to perform other system admin tasks across them.
  • IfThisThenThat (Review) is an online service that connects other online services to each other to produce an intended result. For example, if a Google search reveals specific keywords, you can have that update a Twitter account.
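
The SSH command and backup items above can be sketched as two small configuration fragments. This is only a rough sketch: the script path, key names, host name, and directories are all made up for illustration, and the public key text is deliberately left elided. The first fragment is a line for the remote box's authorized_keys file, the second a crontab entry on the machine being backed up.

```text
# ~/.ssh/authorized_keys on the remote box (one key per line).
# command= pins this key to a single script; the other options switch off
# forwarding and terminal allocation. Path and key comment are hypothetical.
command="/usr/local/bin/reset-env.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA... reset-env-key

# crontab entry on the workstation: run rsync over SSH every night at 02:00,
# using a dedicated password-less key. Host and directories are hypothetical.
0 2 * * * rsync -a -e "ssh -i $HOME/.ssh/backup_key" "$HOME/work/" backup@backuphost.example:/srv/backups/work/
```

With the first line in place, connecting with that key runs the reset script and nothing else, whatever command the client asks for.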

SSH Mastery, A Book Review

“SSH Mastery: OpenSSH, PuTTY, Tunnels, and Keys” by Michael Lucas is one of those technical books that you wouldn’t keep on your bookshelf. It’s one of those books that will sit near the keyboard with its binding bent and many pages bookmarked. Well, that is until most of the information is second nature to you. This is one of the rare books that lives up to its title: it is short, concise, very valuable, and rather inexpensive.

The author could have approached the material like a dull professor, or regurgitated the man page. Instead, he addresses an audience of IT professionals as a professional himself. The book introduces the reader to the SSH toolset, then goes over the clients (for Mac, Linux, and Windows), key configuration, the various SSH configuration options, a little on X forwarding, and how to do tunneling (enough to emulate a VPN connection).

Besides the writing style and the well-targeted audience, I think that the strong points of this book lie in the subtle details. For example, instead of marketing SSH as a mysterious and obscure protocol that is “secure,” the book describes it as a wrapping protocol. Many online tutorials and “top X SSH tips” blog posts market SSH as the former. Another strong point is that the author takes a side on SSH protocol v1 vs. v2: he strongly discourages v1 due to its flaws. Lastly, he answered a question or two that I did not realize I had about SSH: How can you prevent man-in-the-middle attacks? How can you improve performance? The answers being key exchanges and multiplexing.

The concerns that I have about this book are rather minor. I would have liked the author to provide some data demonstrating the performance of the protocol. Examples could come from comparing file transfer speeds [with the different ciphers available], or from measuring how much overhead is involved in using a terminal through SSH. The other criticism that I have is that the book did not go into detail on X forwarding. The reason given: it varies quite a bit from platform to platform and is subject to change. It still would have been nice to get a review of the different X clients and protocols available.

How to get rid of Screen Scrapers from your Website

While driving on a long trip this weekend, I had a bit of time to think. One topic that came to my mind was screen scraping, with a focus on APIs. It hit me: screen scraping is more of a problem with the content producer than it is with the “unauthorized scraping” application.

Screen scraping is the process of taking information that is rendered on the client and transforming it in another process. Typically, the information obtained is later filtered, saved, or used in a calculation. Everyone has performed some [legitimate form] of screen scraping: when you print a web page, the content is reformatted to be printed. Many of the unauthorized forms of screen scraping have involved collecting information on current gambling games [poker, etc.], redirecting CAPTCHAs, and collecting airline fare/availability information.

The scrapee’s [the organization that the scraper is targeting] argument against the process is typically a claim that the tool puts an unusual demand on their service, and that this demand does not come with the predictable chance of profit that they are used to. Another argument is that the scraper provides an unfair advantage over other users of the service. A third argument is that the content is being misappropriated, or that some value is being gained by the scraper and taken from the scrapee. In most cases, the scrapee fights back through legal or technical means.

The problem I have with fighting back against scrapers is that it never solves the problem that the scrapers are trying to fix. Let’s take a few examples to make my point: the KVS tool, TV schedules, and poker bots. The KVS tool uses [frequently updated] plugins to scrape airline sites for accurate pricing and seat availability details. The tool is really good for people who want a fair bit of information on what fares are available and when. It does not provide any information that is not already available elsewhere. It just makes many more queries than most people can make manually. Airlines fight against this because they make a lot of money on uninformed users. Their business model is to guarantee that their passengers are not buying up cheap seats. When an airline claims that they have a “lowest price guarantee,” that typically means that they show the discount tickets for as long as possible, until they’re gone.

Another case where web scraping has become an issue is TV schedules. With the MythTV craze a few years ago, many open source users were using MythTV to record programs via their TV card. It’s a great technology; however, the schedule is not provided in the cable TV feed, at least not in an unencrypted manner. Users had to resort to scraping television sites for publicly available “copyrighted” schedules.

The poker bots are a little bit of an ethical issue, because they change the game from its real-world rules. When playing poker outside of the internet, players do not have access to real-time statistics tools. Online poker providers aggressively fight against the bots. It makes sense; bots can perform the calculations a lot faster than humans can.

Service providers try to block scrapers in a few different ways. The end of the Wikipedia article lists more; this is a shortened version. Web sites try to deny or misinform scrapers by profiling the web request traffic (clients that have difficulty with cookies and do not load JavaScript/images are big warning signs), blocking the requesting provider, serving “invisible false data” (honeypot-like paths in the content), etc. Application-based services [poker clients] are more focused on looking for processes that may influence the running executable, securing the internal message handling, and sometimes recording the session (also typically done in MMORPGs).
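
One of the profiling signals above, a client that pulls pages but never loads images or scripts, can be sketched against an access log. This is a rough heuristic and not how any particular site does it: the log lines and addresses are made up, the field positions assume Apache's common log format, and the list of file extensions is my own guess at what counts as a static asset.

```shell
#!/bin/sh
# Hypothetical access log in Apache common log format (made-up addresses).
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.5 - - [01/Jan/2012:10:00:00 +0000] "GET /fares?from=CLT HTTP/1.1" 200 5120
203.0.113.5 - - [01/Jan/2012:10:00:01 +0000] "GET /fares?from=GSO HTTP/1.1" 200 5090
203.0.113.5 - - [01/Jan/2012:10:00:02 +0000] "GET /fares?from=SFO HTTP/1.1" 200 5301
198.51.100.7 - - [01/Jan/2012:10:00:03 +0000] "GET /fares?from=CLT HTTP/1.1" 200 5120
198.51.100.7 - - [01/Jan/2012:10:00:03 +0000] "GET /static/site.js HTTP/1.1" 200 1020
198.51.100.7 - - [01/Jan/2012:10:00:04 +0000] "GET /static/logo.png HTTP/1.1" 200 2048
EOF

# Count page and asset requests per client, then report the clients that
# fetched pages but never an image, stylesheet, or script.
suspects=$(awk '
  {
    ip = $1; path = $7
    sub(/\?.*/, "", path)                      # drop the query string
    if (path ~ /\.(js|css|png|jpg|gif|ico)$/) assets[ip]++
    else pages[ip]++
  }
  END {
    for (ip in pages)
      if (!(ip in assets)) print ip, pages[ip]
  }
' "$log")

echo "$suspects"
rm "$log"
```

With the sample log this prints the one client that requested three fare pages and nothing else. A real profile would combine several such signals, since plenty of legitimate clients also skip images.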

In these three cases, my point is not to argue about whether the service is justified in attempting to block the scrapers. My point is that the service providers are ignoring an untapped secondary market. Those service providers have refused to address the needs of this market, or maybe they just haven’t seen the market as viable and are merely ignoring it.

If people wish to make poker bots, create a service that allows just the bots to compete against each other. The developers of these bots are [generally] interested in the technology, not so much in ripping off non-bot users.

For airlines, do not try to hide your data. Open up API keys for individual users. If an individual user is trying to abuse the data to resell it, or to create a Hipmunk/Kayak clone, revoke the key. Even if the individual user’s service requests don’t fit the profile, there are ways of catching this behavior. Mapmakers solved this problem a long time ago by creating trap streets. Scrapers are typically used as a last resort: they’re used to do something that the current process makes very difficult to do.

Warning, more ranting: with airline sites, it’s difficult to get a good impression of the cost differences of flying to different markets [like flying from Greensboro rather than Charlotte] or even of changing tickets, so purchasing from an airline is difficult without the aid of this kind of tool. Most customers want to book a single round trip ticket, but some may have a complex itinerary that has them leaving Charlotte, stopping over in Texas, continuing to San Francisco, and then returning to Texas before flying back to their original departure city. That could be accomplished by purchasing separate round trip tickets, but the rules of the tickets allow such combinations to exist on a single itinerary. Why not allow your users to take advantage of these rules [without the aid of a costly customer service representative]?

People who use scrapers do not represent the majority of the service’s customers. In the case of the television schedules example, they do not profit off the information, and the content that they wished to retrieve wasn’t even motivated by profit. Luckily, an organization stepped in and provided this information at a reasonable [$25/yr] cost. The organization is SchedulesDirect.

The silver lining to the battle over scrapers can get interesting. The poker clients have prompted scraper developers to come up with clever solutions. The “Coding the Wheel” blog has an interesting article about this and how they inject DLLs into running applications, use OCR, and abuse Windows Message Handles [again, of another process]. Web scraping introduces interesting topics that deal with machine learning [to create profiles] and identifying usage patterns.

In conclusion, solve the issue that the screen scrapers attempt to solve, and if you have a situation like poker, prevent the behavior you wish to deny.

The Linguistics of Webservices

It is frequently recommended that readable, “good” code should closely resemble speech. For example, name a variable representing a collection “library” and a method “remove.” From those two items one can have the library remove a book:

 
library.remove(book);

Services are in an odd position: they are meant to hide the actual networking from the client, yet be treated like neighboring code within the application. Without the proxies and method-like calls, one might as well write their own protocol and avoid web services altogether. Therefore, I thought it would be a little interesting to do a thought experiment on how one can talk to web services.

REST services ask two basic questions: “Can you _____ for me?” and “What do you know about ______?” The subject “you” in this case is the webservice; “me” is the service client. Asking the service to do something for you refers to REST calls with the POST, DELETE, or PUT method. Asking the service what it knows hints that the GET [and sometimes POST] HTTP method will be used. With REST running over HTTP, the payload usually travels as text [JSON or XML, for example], and binary data is often Base64-encoded so that it fits inside that text, meaning that the receiving webservice has only one encoding to deal with.

SOAP is a little more complex. Scratch that, it is a lot more complex. When SOAP was created, the powers that be went a little overboard for 90%* of the use cases. SOAP includes information about how the webservice is composed, the structure of complex data types, error handling, and the calls themselves. Therefore, there should be a new way to refer to them. SOAP web services can be asked the following questions:

  1. What can you do?
  2. What is that weird data type that is being returned by this method?
  3. What exceptions can be expected from this method?
  4. Can you run _____ [method]?
  5. Can we keep a secret? If so, then please run this privately _______. [WS-Security]
  6. Can you return/receive the results from [method] in another language? [WS-DIME]

Given any extra extensions to the WS-* specification, more questions may be asked.

* Statistics here are not an accurate measure, more of a wild guess. Nevertheless, it sounds right, and that’s what counts, right?

APIs: A Strategy Guide

When considering the importance of APIs, I took a look at the book APIs: A Strategy Guide by Daniel Jacobson, Greg Brail, and Dan Woods. It didn’t meet my hopes, but it did go on an interesting journey. I was going to write some of my thoughts on APIs, but this book covers them better than I could. When I saw the subject of the book and the publisher [O’Reilly Media], I had the impression that the book was going to be developer-centric and would cover many of the popular APIs [Twitter, Facebook, Weather.com, Reddit, Last.fm/Audioscrobbler, Amazon, etc.]. I was sorely wrong. This book is more about selling the idea of creating an API, marketing it to future developers, and discussing the technical problems introduced by an in-place API.

What I had hoped: I had hoped that the book would be more of a Rosetta stone of APIs: how to critique an API, how to find weaknesses [it does hint at this, but it wasn’t a serious look into it], how to optimize usage, and how to potentially adapt software into an API. I can’t qualify myself as an API designer; however, I do have some authority on the subject. My experience includes adapting a massively large legacy system into a license-controlled API. I can’t go into too many of the details, since it was an assignment for work.

While I’m still off subject, I should also add to my rant. APIs are notoriously annoying to learn. Every provider has a new way of accessing data and passing around data IDs [an identifier specifying a user, their credentials to make API calls, etc.]. Some providers include good documentation, some don’t. Some providers make their API incredibly difficult to understand, some don’t. Some APIs have synchronous and asynchronous calls. These technical details were not mentioned, compared, or criticized in the book. I was left wishing that they were. As a developer, those details interest me more than how to sell the idea to my boss.

Getting back on the subject, I wish that the book discussed more about the communication channels that APIs may be served over. The book highly promoted REST, and mentioned SOAP. That’s great, but what about XMPP, in-proc messages, binary web services [Hessian], etc.? I finished the book wondering: what about separating the API layer from the actual implementation? It also did not address the issue of multi-language environments. Should you support non-Java based clients? I realize that the answer to that should be “it must be language independent.” But it brings up an important question about how you, the producer, want the API to be used. That is an important consideration if your service/product is meant to be scalable, or if the tasks are intended to take a long time. [Hint: Use a message queuing service.]

According to the write-ups on the authors, they have the credentials to talk about APIs in a business sense. But when it comes to actually implementing one, I wish they had brought in subject expert[s] to go into more detail. Technical detail and a ridiculous amount of depth are what I have come to expect from an O’Reilly book.

Overall, this book is interesting in an architectural sense. However, from a developer’s perspective, it’ll give you things to argue about in the planning stage, but won’t help out later during implementation. Want to give your developer users an API for your product? Give your manager this book, and soon you’ll be asked to write an API.

“Wanted Java Developer” could you be a little more ambiguous?

I have a bone to pick with the industry I associate with. The tech industry struggles to clearly define the expectations of whatever is desired. This is pretty much a universal issue with the industry. There are always hidden requirements. Job listings are no different.

The bone I have to pick is with the titles/job requirements. One of the worst offenders is the Java Software Engineer title. Given the number of frameworks and meta-frameworks available, asking for a Java engineer is quite ambiguous. It could mean a graphics developer, API designer, web developer, computer vision expert [CV in Java is possible with native libraries, trust me!], core libraries developer, or even a micro-JVM developer. The amount of variability within the language makes a generic “Java developer” title frustrating.

As a potential candidate, it is incredibly frustrating to see the “Java Developer” title. Some listings ask for experience in Spring, Hibernate, Solr, Lucene, JAI, J2EE, etc. It’s incredibly frustrating to go into an interview where the listing asked for all of these, and then only be grilled on the minute [rarely used] inner workings of Spring RMI. Just the Spring framework alone asks for a lot. The potential avenues of Spring that one could master include: Message queuing, Web services, Roo, Security, Integration, Web flow, MVC, BlazeDS, Batch, Social, and Mobile. That’s not even accounting for the frameworks that you can substitute along those pathways. I’ve worked with late Spring 2.x and early 3.0, and I was not even aware of the new BlazeDS, Batch, Social, or Mobile options for the framework. Things change quite quickly.

Besides ranting, what is the purpose of this article? I believe that the Java title should be a bit more specific. If you want a Java developer that knows a lot of frameworks, still label it as a Java developer position. However, don’t expect him or her to know all of the inner workings: that is just silly. If your business involves multimedia display, request a developer that knows the Java 2D and 3D graphics, and maybe the JAI libraries. There is a lot there for practically any task you want to throw at Java, given you know which library to use.

Scientific or mathematical tasks? Ask for a Scientific Java Developer. What libraries should they know? Colt, EJML, JAMA, etc.

Writing a Java API? Ask for a Java API Designer. Expect them to list their favorite APIs, what works, what doesn’t, and why.

Web applications? Ask for a Java Web Developer. Maybe they should have experience with a SOAP framework, MVC framework, Play, GWT, basic Web skills, and maybe even a non-SOAP based webservice library [REST, Hessian/Binary]. Please don’t use the enterprise title unless you need someone who knows EJB, and/or ESB frameworks.

Moreover, as an employer, be more specific in the first interview/inquiry about what the job is asking for. If a job listing asks for the common ORM Hibernate, how much should one know about it before applying for the position? Should they know how to wire up beans to their DB analog? Should they know how to write their own dialect for a new data source? With JUnit, should the applicant be expected to transition your current development methodology to TDD? Should they know how to extend the JUnit framework?

At this point, you should be getting the picture. Different tasks demand different skills. Specify what you’re looking for. Find people that can learn new skills and that are interested in the same problems you are. People who have an active interest can learn what is needed to get up to speed. If you go to a butcher and ask for meat, you’re always going to get what you asked for, but not what you were craving.

Addendum: This can apply to Erlang, C++, Python, and most other developers. Java is currently in such great demand, and has such an ambiguous title, that it made the point of this article a little easier to communicate.

“10 Ten Reasons why You’re Programming Wrong” — How to bore/annoy your audience

Watching the video “The Web Will Die When OOPs dies” and reading a few hip articles on “how to improve your code” gave me an idea. If someone is writing to give advice on improvements, they should provide real-world examples of why their suggestions/fixes to a language/framework are necessary.

Instead of “Top 10 ways to improve your code” articles, I’d rather see an article about how something was improved. Take small segments from your own application or an open source project and write up a small report on how your suggestion[s] improved it. If you’re writing an article for developers, we’re interested in the proof and technical content. If your suggestions didn’t work out, we’re still interested in the story about the journey, or the technical observations that you made. Making vague suggestions and recommendations (without evidence) bores a technical audience. When you make vague suggestions to someone, you’re making a statement that you know more about their work or problem than they do. Your audience should be intelligent; if it isn’t, still write for an intelligent audience. Vague statements don’t address interesting problems or new approaches.

So, going back to the video I mentioned earlier: I like his tactless approach. He’s taken a good deal of experience from web development and he is presenting his argument. He doesn’t attempt to ask others for their opinion, nor does he attempt to tell his audience how to do their job. He merely presents issues that he found, and his way of trying to solve them. The last thing that I loved about his talk is that he never once tried to push the latest hip language or framework. He mentioned a few frameworks, and then made a criticism.

My first attempt at open source: PageRecommender

For personal/private [non-work related] projects, I tend to shy away from creating/working on open source projects. Typically open source projects tend to resemble the same work I do during the day, working and dealing with others. That’s no fun when you just want to build something you need. However, I am trying something new. I’m releasing one of the personal projects into the world of open source. It is the component that is used on this website for making recommendations for projects. [Example: see the bottom of the Financial Strategy Simulator page] A project page that contains more technical information about this project can be found under /a/Projects.

To find this source pull it [using git] from: https://github.com/monksy/PageRecommender

What is this project about? This project is designed to analyze Apache request logs, attempt to piece together sessions, and then create an Amazon/Newegg-like statistical recommendation. The desired output is XML representing a parent/child relationship between a page and the next connecting page. The output is written to standard out. The component is designed to be used as a quiet utility.
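
To give a feel for the idea, here is a rough shell sketch of counting which page tends to follow which. This is not the code from the repository: the log lines are made up, a “session” is simplified to consecutive /p/ requests from the same address [no time gaps are considered], and the output is plain text rather than the XML the project produces.

```shell
#!/bin/sh
# Hypothetical access log in Apache common log format (made-up addresses).
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.5 - - [01/Jan/2012:10:00:00 +0000] "GET /p/Alpha HTTP/1.1" 200 5120
198.51.100.7 - - [01/Jan/2012:10:00:05 +0000] "GET /p/Alpha HTTP/1.1" 200 5120
203.0.113.5 - - [01/Jan/2012:10:01:00 +0000] "GET /p/Beta HTTP/1.1" 200 4200
198.51.100.7 - - [01/Jan/2012:10:01:30 +0000] "GET /p/Beta HTTP/1.1" 200 4200
192.0.2.9 - - [01/Jan/2012:10:02:00 +0000] "GET /p/Alpha HTTP/1.1" 200 5120
192.0.2.9 - - [01/Jan/2012:10:03:00 +0000] "GET /p/Gamma HTTP/1.1" 200 3100
203.0.113.5 - - [01/Jan/2012:10:04:00 +0000] "GET /favicon.ico HTTP/1.1" 200 318
EOF

# For every client, remember the last /p/ page seen and count each
# "previous page -> next page" pair. Most frequent pairs are listed first.
transitions=$(awk '
  $7 ~ /^\/p\// {
    ip = $1; page = $7
    if ((ip in last) && last[ip] != page) count[last[ip] " " page]++
    last[ip] = page
  }
  END { for (pair in count) print count[pair], pair }
' "$log" | sort -k1,1nr -k2)

echo "$transitions"
rm "$log"
```

With the sample log, /p/Alpha is followed by /p/Beta twice and by /p/Gamma once, so /p/Beta would be the recommendation to show on the Alpha page.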

What this project isn’t: This project isn’t a completely generic solution that’ll fit your site. It is designed within the context of my current website and the format of the standard Apache log files. Want to have it look for pages that don’t fall under the /p/{Name} syntax? Change up the Apache log file format? Well, this project won’t work for you without modification. Also, there is no warranty provided with this code. It’s open source, it’s free as in speech but not as in beer.

What could be improved: I realize that this project isn’t perfect. I could have designed it to be slightly easier to read. It could be documented much, much better. However, this is an internal project. Want to improve it? Github should allow you to make such changes. [I’m rather new to Github, so don’t hold me to that statement] The XStream dependency could be removed. But for right now it works.

What do I recommend?

Grab the source! Add more tests, and send me input on how you think it could be improved.

Review: “Test-Driven Development” by Kent Beck [The creator of JUnit]

It is always interesting to read a book about a technical topic from its creator. Creators tend to identify the motivating factors and history that went into the product better than other authors could. Kent Beck is the creator of JUnit, and happens to be the author of this book. The book is designed for people who are new to unit testing and Test-Driven Development. In the current software development industry, that is very few people.

The book is divided into three separate sections: a walk-through with JUnit (by example), a section on xUnit, and then a “best practices guide.” The JUnit section, which makes up the majority of the book, is focused on developing a currency class. While there were some interesting design decisions, this part irritated me the most. It produced blatant errors just to create a test that would fail. I would have preferred it if these errors had been described rather than presented to the reader. The currency example was a good one; however, adding a second example would have been better. The xUnit section of the book focused on creating a test environment for Python. I am not a Python developer, but I have written a very small amount of Python code. I was not a fan of this section. However, I was pleased with the best practices section. The only issue I had with it is that, if my memory serves me correctly, it did not go into why the recommendations are better than what they replace.

This is a great resource for developers who have never heard of TDD. It may even be a great book for students that have just completed half of a class in Java. However, it is not such a great book for people who have already worked with Test-Driven Development. For them, a book from Manning’s “In Action” series may be more suitable, although their JUnit title looks to be a bit dated. That dated JUnit book does include information on integrations with other frameworks, which are slightly harder to test against [J2EE, XML, Servlets, EJB, DBs….].

In summary, I was probably looking for another book similar to “Pragmatic Unit Testing in Java with JUnit” (Hunt and Thomas). Their approach was more focused on how to find trouble spots and improve existing code.

Given that this was written in 2002, the following is not exactly fair criticism, but it should be addressed by anyone who claims that TDD is beneficial. I would have liked to see more support [research/studies] for why Test-Driven Development is beneficial. Every TDD-related article/publication/book that I have read attempts to convince the reader that it’s necessary with marketing-speak. The claim is that it is good for “high stress environments,” or that it reduces defects. However, I have never seen evidence supporting those claims. I, personally, have found TDD to be a significant improvement in development. However, I have never been able to quantify how much of an improvement it has made. What is the amount of time saved by TDD? Has it made today’s software more reliable than before? Has it affected the developer job market? [Increased/Decreased] Has it made the users of the products built with TDD happier? Most importantly, has test-driven development reduced the stress of software development?

Well That Was Silly Of Me, Issues with Sed….

Refreshing my memory on sed caused me to run into two issues tonight. Firstly, the -n option turns off sed’s automatic printing, so only the lines that you explicitly print are shown. Secondly, the order of deleting and printing lines matters. It turns out that it matters a lot.

Let’s say you have a file named content. It contains:

Gooogle
GooooogleBot
Gooooogle Pictures
Google Plus
Reddit
Yahoo

Let’s assume that you wanted to just show all lines that contained “Gooogle” [and its similar brothers] with sed. You would write a line that contains this:

 
sed -e '/Goo[o]\+gle/p' content

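Right? Nope. It’ll show every line in the file [with the matching lines shown twice], even though you used the print command to display only the items that matched that pattern. By default sed prints every line once it has finished processing it, and p adds a second copy. To fix this, put in the -n option before -e.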

That’s great… But that returns: Gooogle, GooooogleBot, and Gooooogle Pictures. In this example, we don’t like GooooogleBot. So let’s remove it. You may now write something like:

 
sed -n -e '/Goo[o]\+gle/p' -e '/Goo[o]\+gleBot/d' content

It seems like a logical extension. Right? The next regular expression should pass over the printed lines that are left and make an evaluation. Nope, it doesn’t. It’ll display the same results as before the second expression was added. What’s going on? It’s not a bad expression. It’s not a bad command. It’s due to the order of the prints and deletes. sed does not run one expression over the whole file and then hand the result to the next expression. It takes one line at a time and runs every expression against it, in the order given, before moving on to the next line. The p command prints the line right away, so by the time the d command deletes the pattern space, the line is already on the screen. The correct way of doing it is to rearrange everything so that the deletes come first and the prints come after them: d ends the processing of that line, so the later p never sees it. [The -n option applies to the whole command, so where it sits among the -e expressions makes no difference.]

So the correct form is:

sed -e '/Goo[o]\+gleBot/d' -n -e '/Goo[o]\+gle/p' content

Bizarre? Yes, very much so, but it works.
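
The ordering effect is easy to reproduce with a throwaway copy of the file. A small sketch, assuming GNU sed [for the \+ in the pattern] and a temporary directory from mktemp; the variable names are mine.

```shell
#!/bin/sh
# Recreate the sample file from the post in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/content" <<'EOF'
Gooogle
GooooogleBot
Gooooogle Pictures
Google Plus
Reddit
Yahoo
EOF

# Print first, delete second: the Bot line has already been printed by the
# time the d command runs, so it still shows up.
print_then_delete=$(sed -n -e '/Goo[o]\+gle/p' -e '/Goo[o]\+gleBot/d' "$workdir/content")

# Delete first, print second: d ends the cycle for the Bot line, so the p
# command never sees it.
delete_then_print=$(sed -n -e '/Goo[o]\+gleBot/d' -e '/Goo[o]\+gle/p' "$workdir/content")

printf 'print then delete:\n%s\n\n' "$print_then_delete"
printf 'delete then print:\n%s\n' "$delete_then_print"

rm -r "$workdir"
```

The first result lists Gooogle, GooooogleBot, and Gooooogle Pictures; the second lists only Gooogle and Gooooogle Pictures. Here -n sits at the front of both commands, and the second one still behaves like the form shown above.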