Two years in, we tore the heart out of SeenWant and rebuilt it.
Specifically, we rebuilt the swipe engine — the system that decides which film, show, book, or game gets shown to you next. The old engine worked fine. It was the kind of "fine" that quietly destroys a product. Users were swiping. They weren't finishing lists. They were watching their Want list grow without ever pulling from it. The signal was good but the loop was broken.
This post is what we learned, what we built, and why we think most algorithmic recommendation feeds are quietly bad in ways their owners can't see.
The original engine
Our v1 swipe engine was straightforward. It pulled candidates from four sources:
- Genre matches — films/shows in genres the user had previously Wanted or Seen.
- Collaborative filtering — "users who liked what you liked also liked..."
- Trending — globally popular titles, weighted by recency.
- Cold-start fillers — for new users, a curated set of "obvious" titles to bootstrap signal.
The swipe deck was a weighted blend of these sources. Tunable. Reasonable. Industry-standard.
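As a rough illustration of what a weighted blend like this can look like, here is a minimal sketch. The source names follow the list above. The weights, the per-source scores, and the `blend_deck` helper are assumptions made up for the example, not the values or code v1 actually ran.

```python
from collections import defaultdict

# Hypothetical blend weights; the real v1 values are not given in this post.
SOURCE_WEIGHTS = {
    "genre": 0.4,
    "collaborative": 0.3,
    "trending": 0.2,
    "cold_start": 0.1,
}


def blend_deck(candidates_by_source, weights=SOURCE_WEIGHTS, deck_size=30):
    """Merge per-source candidate scores into one ranked deck.

    candidates_by_source maps a source name to {title: score in [0, 1]}.
    A title's blended score is the weighted sum over the sources that
    proposed it.
    """
    blended = defaultdict(float)
    for source, scored_titles in candidates_by_source.items():
        weight = weights.get(source, 0.0)
        for title, score in scored_titles.items():
            blended[title] += weight * score
    # Highest blended score first; ties broken alphabetically for stable output.
    ranked = sorted(blended.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:deck_size]]
```

A blend like this rewards titles that several sources agree on, which is one way a deck ends up "good enough" but rarely surprising.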
It worked the way industry-standard recommendation works: it showed users things that were "good enough" without ever being surprising.
What we noticed
Six months into the v1 engine being live, we started getting weird signals from analytics:
- Want-to-Seen conversion was low. Users were Wanting things but not Seeing them. The Want list was a graveyard.
- Time-to-first-Seen for new users was high. Even after onboarding, it took new users two weeks on average to mark anything as actually watched.
- Repeat-swipe sessions were short. Users would swipe for two minutes, get bored, leave.
- Power users were less engaged with recommendations than casual users. The most active users were ignoring our recommendations entirely and adding things from external sources (Letterboxd lists, friend recs).
Each of these signals had explanations. The compound message was: the engine was failing at the actual job.
What the actual job is
We had an internal debate about this for months. The job we'd built for was "show users things they'll enjoy." The job we should have built for was something different: "help users decide what to consume next."
These sound the same. They aren't.
"Show users things they'll enjoy" optimizes for swipe-right rate — how often the next card gets a Want. This is what every recommendation engine in entertainment is optimized for. It's the wrong target.
"Help users decide what to consume next" optimizes for something harder: the rate at which a Want becomes a Seen. The rate at which a casual evening becomes a watched film. The conversion from interest to action.
Once we shifted to that target, almost everything in the engine had to change.
The five things we got wrong
Looking back at v1, the systematic errors:
1. We optimized for "more candidates"
The v1 engine assumed that the more options, the better. So we'd surface 30 candidates per session.
Wrong. After about 12 candidates, fatigue set in. By candidate 20, swipe quality dropped. By 30, users were just clearing the deck to be done.
v2 fix: cap each session at 12 candidates. Make each one earn its slot.
2. We treated all four media types the same
A book recommendation works differently than a film recommendation. A book is a 10-hour commitment. A film is a 2-hour commitment. A game is potentially a 50-hour commitment.
The v1 engine showed all four side-by-side with the same weighting. Users were Wanting books at the same rate as films, but only Seeing 5% of the books they Wanted.
v2 fix: type-specific weighting. Books require much stronger signal before we recommend one. Games even more so. Films are the fastest decision; we're more permissive.
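One simple way to express "stronger signal for bigger commitments" is a per-type minimum score. The sketch below assumes a match score between 0 and 1. The threshold numbers, and where shows sit in the ordering, are invented for illustration; only the ordering of films, books, and games comes from the description above.

```python
# Hypothetical minimum match scores per media type.
# Films are the most permissive, games the strictest; shows' position is assumed.
MIN_SCORE_BY_TYPE = {
    "film": 0.50,
    "show": 0.60,
    "book": 0.75,
    "game": 0.85,
}


def passes_type_threshold(media_type, match_score, thresholds=MIN_SCORE_BY_TYPE):
    """Return True if the match score clears the bar for this media type."""
    return match_score >= thresholds[media_type]


def filter_by_type(candidates, thresholds=MIN_SCORE_BY_TYPE):
    """Keep only (title, media_type, score) tuples that clear their type's bar."""
    return [c for c in candidates if passes_type_threshold(c[1], c[2], thresholds)]
```

With these example numbers, a 0.7 match is enough to surface a film but not a book or a game.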
3. We trusted the genre system too much
"Drama" is a genre. So is "horror." So is "thriller." So is "psychological thriller." The taxonomy is a mess, and films cross-tag liberally.
v1 used genre overlap as a primary signal. The result was that someone who liked one slow Korean film got recommended every title tagged "Drama" in the database, including American 90s family films.
v2 fix: we replaced genre matching with embeddings. Films are now placed in a learned similarity space based on actual user behavior, not metadata tags. Past Lives is now recommended next to Aftersun, not next to The Pursuit of Happyness.
4. We didn't model the room
Most of our users don't watch alone. They watch with partners, families, roommates. The engine treated every recommendation as if the user was the sole audience.
v2 fix: a "watching with" mode that lets users swipe with someone else's profile in mind. Recommendations adjust accordingly. We're still iterating on this; it's the v2 feature with the most room left.
5. We over-rewarded recency
Trending titles were too heavily weighted. A film that came out last month would beat a 2003 film of the same actual relevance, every time.
v2 fix: recency is a small bonus, not a multiplier. Films from any year can win.
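The difference between a multiplier and a bonus is easiest to see with numbers. Both functions below are assumed forms chosen for illustration (an exponential decay multiplier and a capped additive bonus). The parameter values are hypothetical, not the production ones.

```python
def score_multiplicative(relevance, age_years, decay=0.9):
    """Assumed v1-style form: recency scales the whole score."""
    return relevance * (decay ** age_years)


def score_additive(relevance, age_years, max_bonus=0.05, half_life_years=2.0):
    """Assumed v2-style form: recency adds a small, bounded bonus."""
    bonus = max_bonus * 0.5 ** (age_years / half_life_years)
    return relevance + bonus


# A 20-year-old film that is the better match (0.9) against a new release (0.8).
old_mult, new_mult = score_multiplicative(0.9, 20), score_multiplicative(0.8, 0)
old_add, new_add = score_additive(0.9, 20), score_additive(0.8, 0)

# Under the multiplier the older film loses despite higher relevance;
# under the additive bonus it wins, because the bonus can never exceed max_bonus.
print(old_mult < new_mult, old_add > new_add)
```

The useful property of the additive form is the cap: recency can break near-ties, but it cannot override a clear relevance gap.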
The new architecture
The v2 engine has four main parts. They're worth being specific about, because the architectural decisions are the actual story.
Embedding space
Every title in our database has an embedding: a 384-dimensional vector that captures, very roughly, "what kind of title is this." We learned the space from user co-occurrence data: titles that are frequently Wanted, and then Seen, by the same users cluster together in the space.
This is not new technology. What's new in our category is how the embeddings are trained: on Want-to-Seen transitions, not on Want signals alone. A Want without a follow-up Seen is a weak signal. A Want followed by a 5-star Seen is a strong one. The embeddings learn from the strong signals.
Per-user taste vector
Each user has a corresponding taste vector in the same 384-dim space. It's the running weighted average of their Seen titles. It updates after every Seen, not every Want.
This is the second key decision: we update on Seen, not Want. Most engines update on engagement. We update on completion. The difference is enormous.
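A running weighted average can be maintained incrementally, without storing the user's whole history. The sketch below uses 2-dimensional vectors instead of 384 to stay readable, and it assumes the weight of each Seen is its star rating. That weighting is an illustration only; the actual weighting scheme isn't specified here.

```python
def update_taste_vector(taste, total_weight, title_embedding, rating):
    """Fold one Seen title into a running weighted average.

    taste: current taste vector (list of floats), or None for a new user.
    total_weight: sum of the weights folded in so far.
    rating: star rating of the Seen title, used as its weight (an assumption).
    Returns (new_taste, new_total_weight).
    """
    weight = float(rating)
    if taste is None or total_weight == 0:
        return list(title_embedding), weight
    new_total = total_weight + weight
    updated = [
        (t * total_weight + e * weight) / new_total
        for t, e in zip(taste, title_embedding)
    ]
    return updated, new_total


# Called only when a title is marked Seen; a Want never reaches this function.
taste, total = None, 0.0
for embedding, stars in [([1.0, 0.0], 5), ([0.0, 1.0], 5)]:
    taste, total = update_taste_vector(taste, total, embedding, stars)
# taste is now [0.5, 0.5]
```

Because the update is triggered by Seen events only, a long Want list leaves the vector untouched until the user actually finishes something.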
Candidate selection
The candidate set for a session is the set of titles closest to the user's taste vector, filtered by:
- Things the user hasn't Wanted, Seen, or Skipped
- Things available on the user's connected streaming services (we know which they have)
- Things that match the user's settings (preferred languages, content filters)
We then re-rank by a small set of additional signals: friends' ratings (if any), recency boost, diversity injection (don't show 12 thrillers in a row).
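The filter-then-rank step can be sketched as follows. This assumes cosine similarity as the notion of "closest" and a simple in-memory catalog of dicts; the field names, the service names, and the `select_candidates` helper are all hypothetical. Content filters and the re-ranking pass are left out to keep the example short.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_candidates(taste, catalog, already_swiped, user_services,
                      user_languages, limit=12):
    """Return up to `limit` titles nearest the taste vector that pass the filters.

    catalog items are dicts with keys: title, embedding, services (a set),
    language. already_swiped holds every title Wanted, Seen, or Skipped.
    """
    eligible = [
        item for item in catalog
        if item["title"] not in already_swiped
        and item["services"] & user_services      # on at least one connected service
        and item["language"] in user_languages
    ]
    eligible.sort(key=lambda item: cosine(taste, item["embedding"]), reverse=True)
    return [item["title"] for item in eligible[:limit]]


catalog = [
    {"title": "A", "embedding": [1.0, 0.0], "services": {"svc_a"}, "language": "en"},
    {"title": "B", "embedding": [0.9, 0.1], "services": {"svc_b"}, "language": "ko"},
    {"title": "C", "embedding": [0.0, 1.0], "services": {"svc_a"}, "language": "en"},
    {"title": "D", "embedding": [1.0, 0.1], "services": {"svc_c"}, "language": "en"},
]
picks = select_candidates([1.0, 0.0], catalog, {"A"}, {"svc_a", "svc_b"}, {"en", "ko"})
# picks == ["B", "C"]: A was already swiped, D is on a service the user lacks
```

Filtering before ranking matters for the conversion goal: a perfect match the user cannot actually stream tonight is unlikely to become a Seen.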
Diversity injection
A pure embedding-similarity engine will collapse — it'll show you the same 30 films until the heat death of the universe.
We force diversity by reserving 2-3 slots in every 12-card session for off-vector recommendations. These are titles the engine isn't sure about. Some land badly. Some land brilliantly. The point is to avoid the algorithmic collapse that makes most recommendation feeds feel sterile after a few weeks.
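Reserving slots can be as simple as the sketch below. It assumes the off-vector titles have already been picked by some other process (how they are chosen isn't covered here) and places them at random positions so they don't always appear at the end of the deck. The `build_session` helper and its parameters are hypothetical.

```python
import random


def build_session(on_vector, off_vector, session_size=12, explore_slots=2, rng=None):
    """Build a session deck with a few slots reserved for off-vector titles.

    on_vector: titles ranked by similarity to the taste vector, best first.
    off_vector: pool of titles the engine is unsure about.
    """
    rng = rng or random.Random()
    explore = rng.sample(off_vector, min(explore_slots, len(off_vector)))
    # Fill the remaining slots with the best on-vector titles, in rank order.
    deck = list(on_vector[: session_size - len(explore)])
    for title in explore:
        deck.insert(rng.randrange(len(deck) + 1), title)
    return deck
```

Passing a seeded `random.Random` makes a session reproducible, which helps when comparing how individual off-vector picks landed.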
What we measure now
Three metrics replace the old one (swipe-right rate):
- Want-to-Seen conversion (W2S). Of the films a user Wants, what % do they actually mark as Seen within 30 days? Target: 35%. Current: 31%. Up from 14% pre-rebuild.
- Time-to-decision in active sessions. From "I'm bored" to "I'm watching this one," how long? Target: under 3 minutes. Current: 2:48. Down from 4:12.
- Recommendation reuse rate. When we recommend a film and a user Wants it, how likely are they to come back to find it again later? Target: 70% of Wants get re-engaged within 60 days. Current: 64%.
These metrics are harder to move than swipe-right rate. They are also the right things to measure.
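For concreteness, W2S for a single user could be computed along these lines. The input shape (two dicts of timestamps keyed by title) and the `want_to_seen_conversion` name are assumptions for the example, not a description of the analytics pipeline.

```python
from datetime import datetime, timedelta


def want_to_seen_conversion(wants, seens, window_days=30):
    """Fraction of Wants that were marked Seen within the window.

    wants: {title: datetime the title was Wanted}
    seens: {title: datetime the title was marked Seen}
    """
    if not wants:
        return 0.0
    window = timedelta(days=window_days)
    converted = sum(
        1
        for title, wanted_at in wants.items()
        if title in seens and timedelta(0) <= seens[title] - wanted_at <= window
    )
    return converted / len(wants)


start = datetime(2024, 1, 1)
wants = {"A": start, "B": start, "C": start, "D": start}
seens = {"A": start + timedelta(days=10), "B": start + timedelta(days=45)}
rate = want_to_seen_conversion(wants, seens)
# rate == 0.25: only A was Seen inside the 30-day window
```

One design question this sketch glosses over is whether Wants younger than 30 days should count in the denominator; including them pulls the rate down for recently active users.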
What's next
A few experiments running:
- Mood-conditioned recommendations. Instead of "what do you like?" → "what do you want to feel right now?" The taste vector becomes one input; mood becomes another.
- Cold-start improvement. New users still take 5-10 swipes to see meaningful personalization. We want it down to 3.
- Friend-graph integration. When a friend Sees a film and rates it 5 stars, that film should surface in the swipe deck in the very next session. It does, but the signal strength is conservative; we're tuning it up.
The lesson
The biggest thing we learned wasn't technical. It was a reframing.
Recommendation engines are mostly bad because they optimize for the wrong thing: engagement, swipe rate, time-on-platform. These metrics are easy to measure and easy to game. The harder, more honest metric is whether the user ended up with a better experience. That's what we're trying to build for now.
If you've used SeenWant before our rebuild, you'll feel the difference. If you haven't — give it a swipe and let us know.
Curious about the architecture in more detail? We're publishing a follow-up on the embedding training pipeline. Subscribe to the RSS feed to catch it.


