A New Blog Series

This is the start of a new series where I’ll be documenting my (previous) adventures in ‘vibe coding.’ These experiments happened a while ago (over six months ago), but my goal is to share what worked, what didn’t, and how we can all use AI agents more effectively.

Here’s what I’ll cover in the coming posts:

  1. Introduction/Goals
  2. Tools/Models
  3. Results
  4. What Succeeded?
  5. What Failed?
  6. Reflection/What I Would Do Differently

Let’s get started.

Movie Nights

The problem: there doesn’t seem to be a good way to negotiate with friends over when, where, and what to watch when hosting a “Movie Night Party.” At one movie night I hosted, we were all there, but nobody had a clear idea of what we wanted to watch. It would have been more productive to figure that out beforehand. Traditional event planning tools felt like overkill and fairly invasive for something we wanted to keep self-hosted and private. (For example, Partiful founders are Palantir alums.)

Essentially, the app I set out to build is a blend of event hosting and polling features, designed for easy movie night coordination.

At a high level, the system was built with these core components:

Database → Movie App → Notifier (Email)

The system catered to two types of users:

  1. Host: Responsible for creating the event, inviting guests, and managing RSVPs.
  2. Guest: Receives an invitation, can view the event details, vote for a movie, suggest new options, and indicate their availability.
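The polling side of the guest role can be sketched in a few lines. This is a minimal illustration in Python, not the Elixir code from the project; the `MovieNight` class, its field names, and the one-vote-per-guest rule are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class MovieNight:
    """Hypothetical event record: one host, a pool of movie options, guest votes."""

    host: str
    options: Set[str] = field(default_factory=set)
    votes: Dict[str, str] = field(default_factory=dict)  # guest -> chosen movie

    def suggest(self, movie: str) -> None:
        """Any guest (or the host) can add a movie to the pool."""
        self.options.add(movie)

    def vote(self, guest: str, movie: str) -> None:
        """Record a vote; assumed rule: a later vote replaces the guest's earlier one."""
        if movie not in self.options:
            raise ValueError(f"{movie!r} has not been suggested yet")
        self.votes[guest] = movie

    def tally(self) -> List[Tuple[str, int]]:
        """Return (movie, vote count) pairs, most popular first."""
        return Counter(self.votes.values()).most_common()
```

Keying votes by guest keeps the tally honest when someone changes their mind, which is the kind of rule that is easy to state up front and easy for a generated implementation to get wrong.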

I wanted to avoid the friction of traditional usernames and passwords. Instead, users would log in with “magic links,” a much simpler experience that didn’t burden invitees with creating an official account. Asking friends to create accounts felt like too much overhead for a casual movie night.
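For readers unfamiliar with magic links, the core idea is a signed, expiring token embedded in the emailed URL. The sketch below is a generic Python illustration, not the project's Elixir implementation; `SECRET_KEY`, `LINK_TTL_SECONDS`, and both function names are hypothetical.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical server-side secret; a real app would load this from configuration.
SECRET_KEY = b"replace-with-a-long-random-secret"
# Assumed link lifetime of 15 minutes.
LINK_TTL_SECONDS = 15 * 60


def make_magic_token(email: str, now: Optional[float] = None) -> str:
    """Return a URL-safe token that binds an email address to an expiry time."""
    issued = time.time() if now is None else now
    expires = int(issued) + LINK_TTL_SECONDS
    payload = f"{email}|{expires}".encode()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_magic_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the email if the token is authentic and unexpired, otherwise None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature, expected):
        return None
    email, _, expires = payload.decode().rpartition("|")
    current = time.time() if now is None else now
    if current > int(expires):
        return None
    return email
```

The token would be appended to a login URL and emailed to the guest. This sketch does not make links single-use; that would need a server-side record of redeemed tokens.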

Technologies Used

I chose Elixir for this, for a few reasons:

  1. Elixir is a well-regarded language for building web applications, and I found it very approachable.
  2. It has excellent support for databases and notification services.
  3. And, of course, it’s functional! (A personal preference.)

So far, so good, right?

How it went

Early Experiments with Aider

I started with Aider, a straightforward CLI tool for interacting with LLMs. Since I didn’t want to subscribe to any services initially, I explored running models locally. However, that approach didn’t last long.

The local models struggled for a couple of key reasons:

  1. They were too small, in parameter size, to handle code effectively.
  2. They were slow, and I couldn’t easily provide feedback to improve their output.

The prompt-based model of Aider didn’t end up working well. The models would produce code that met what you asked for, but they never took into account the context surrounding the changes. Additionally, Aider has limited tool calling support, so it didn’t iterate on the code it changed.

I was dealing with small 30-70B models split between a GPU and system RAM. Of course, the more capable models (300-600 billion parameters) can require a terabyte or more of memory: far beyond what my desktop system can handle, even though it’s fairly powerful.
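The memory gap is easy to check with back-of-the-envelope arithmetic: the weights alone take roughly the parameter count times the bytes per parameter. The helper below is an illustrative sketch; the function name and the choice of 16-bit and 4-bit precisions are assumptions, and it ignores the KV cache, activations, and runtime overhead, so real requirements are higher.

```python
def weights_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# Compare the model sizes mentioned above at two assumed precisions.
for size in (30, 70, 300, 600):
    fp16 = weights_memory_gb(size, 16)
    q4 = weights_memory_gb(size, 4)
    print(f"{size}B parameters: ~{fp16:.0f} GB at 16-bit, ~{q4:.0f} GB at 4-bit")
```

By this estimate a 70B model needs about 140 GB at 16-bit, which is why it has to be quantized or split across GPU and system RAM, while a 600B model needs about 1,200 GB at 16-bit.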

Moving to MiniMax on Aider

I decided to try a MiniMax subscription on Aider for a few reasons:

  1. It received positive reviews on Reddit.
  2. The $80/year price was very reasonable.
  3. I wasn’t planning on storing sensitive data, as I was cautious about where my prompts and data might end up.

First, MiniMax on Aider was much better at generating larger pieces of code, and it was usually more successful. Second, I still had issues where the generation would just cut out. The generated code was better, but the problem then became getting it to compile.

MiniMax was better, but not great. It was capable of generating code, but it wasn’t as good at accounting for context and structure.

Next, I experimented with OpenCode and oh-my-opencode (which later became oh-my-openagent). These frameworks support tool calling and agents, so I used Sisyphus within oh-my-openagent to continue generating code. I still maintained a healthy skepticism about the generated code, so this remained a manual, iterative process to build functionality incrementally and reliably.

Continuing on OpenCode and Sisyphus

Sisyphus proved to be a great tool for creating agent loops, but it struggled with more complex problem-solving. For example, when I asked it to address linting issues, it gave up and simply disabled the linting rule instead of fixing the underlying problems.

Again, something frustrating to learn about.

Timeouts with tool calls

Tool calls are inherently risky on a development machine. An LLM could, in theory, execute commands like “rm -rf” and wipe your system. Because it took me too long to review and approve (or reject) those tool call requests, the agent harness would often time out or attempt alternative actions, leading to further complications.
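One way to reduce the number of approval prompts is to pre-classify commands so that only ambiguous ones reach a human. The following Python sketch is an assumption-laden illustration, not a feature of OpenCode or any other harness; the `AUTO_APPROVE` and `ALWAYS_DENY` sets and the `classify_command` function are hypothetical, and a real policy would need far more care.

```python
import shlex

# Hypothetical policy: programs the agent may run without asking.
AUTO_APPROVE = {"ls", "cat", "grep"}
# Hypothetical policy: programs that are always refused.
ALWAYS_DENY = {"rm", "sudo", "dd", "mkfs"}
# Characters that can chain or redirect commands in a shell.
SHELL_METACHARACTERS = set(";|&><`$")


def classify_command(command: str) -> str:
    """Return 'allow', 'deny', or 'ask' (send to a human) for a proposed command."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "deny"  # unbalanced quotes: refuse rather than guess
    if not tokens:
        return "deny"
    # Anything that could hide a second command goes to a human for review.
    if SHELL_METACHARACTERS & set(command):
        return "ask"
    program = tokens[0]
    if program in ALWAYS_DENY:
        return "deny"
    if program in AUTO_APPROVE:
        return "allow"
    return "ask"
```

A filter like this is not a security boundary (a listed program can still be abused), which is why running the agent inside a disposable sandbox is the more robust answer.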

Combining multiple features in one commit

Maintaining discipline and focusing on single features proved surprisingly difficult. Anything you prompt for can come into existence, including a fix for a bug you just discovered. The perceived power of the LLMs made it tempting to address unrelated issues, in order to “fix that one little thing,” which ultimately led to more code drift than actual progress.

TMDB Integration

Integrating TMDB support for movie search and lookup within the client and UI was a clear success. It worked seamlessly, likely because the LLM had been trained on the TMDB API documentation. The bigger challenge came with implementing an auto-completing text field. The LLM struggled with the specific version of Elixir I was using, and how different versions respond to even reasonable key changes.

Results

Ultimately, I grew increasingly frustrated with this project. Each new feature seemed to introduce regressions (“AI drift”). Eventually, I had to abandon it.

The result was a functional website, but with features that didn’t work as intended. The codebase is questionable at best, and any attempt to add new features risked reintroducing flawed assumptions from the LLM.

Reflection (Well, 9 months later)

Looking back, I believe a spec-driven development approach, with a robust library of specifications, would have significantly improved my chances of success. I also wish I’d started with more powerful models and the better context management techniques I’ve learned since. Perhaps I’ll revisit this project someday.

Specifically, spec-driven development would have prevented issues with authentication and user interaction patterns. I also discovered the value of planning tools like Prometheus *after* this project; I now use them regularly.

Furthermore, maintaining an “Agents.md” file to track tool usage would have mitigated much of the churn. A code context tool (similar to Graphify or CocoIndex) would also have been beneficial.

Finally, I learned that AI-assisted development is a journey. It’s not something you can pick up instantaneously.

Lessons Learned and Future Approach

I wouldn’t attempt to salvage the existing project; I’d start from scratch.

Here’s what I would do differently next time:

  1. Plan with the LLM First: I’d begin by using an LLM to thoroughly plan the application’s architecture.
  2. Generate, Don’t Delegate: I’d generate the initial codebase myself, rather than relying on tool calls that could be using outdated versions.
  3. Spec-Driven Development: I’d adopt a spec-driven development approach to guide feature creation.
  4. Automated Spec Execution: I’d leverage a tool like Code-On-Incus to automate spec execution, avoiding the tool permission request issues I encountered.
  5. Rollback, Don’t Repair: When things go wrong, I’d revert to a previous working state instead of trying to fix the generated code. Focus on refining the prompt and context.
  6. Incremental Milestones: I’d prioritize small, achievable milestones, like getting a basic web app running, before tackling larger features. I’d delay complex elements like custom authentication.