Building the Corpus
Author: Ben Lesh
Every attribution system has a chicken-and-egg problem. To attribute influences, you need a corpus to compare against. But to build a corpus, you need creators submitting work. And creators won't submit if there's no corpus to create value. This is the first-mover problem, and it's a real challenge.
Here's how I'm solving it, and why my approach is honest about the limitations while building toward the vision.
The Public Domain Solution
My strategy: Don't wait for users to build the corpus. Seed it myself with public domain works.
What is public domain?
Creative works whose copyright has expired. Terms vary by country: many use the creator's life plus 70 years, while in the US older works generally enter the public domain 95 years after publication. These works are free to use, analyze, and incorporate without permission or payment. They're humanity's shared cultural heritage.
Examples:
- Literature: Shakespeare, Jane Austen, Mark Twain, H.P. Lovecraft, Edgar Allan Poe
- Visual art: Van Gogh, Monet, Rembrandt, Hokusai
- Music: Bach, Mozart, Beethoven, traditional folk songs
- Film: Early cinema (pre-1928), many classic films
Why this solves the first-mover problem:
Immediate benefits:
- Instant corpus: From day one, I have 1,000+ significant works across media types
- Culturally significant: Public domain includes many of the most influential works ever created
- Foundation for derivatives: Modern works often build on classics—I can detect that
- Zero licensing cost: No permissions needed, no payment obligations
- Community validation: These works are well-documented, so attribution quality can be verified
What I'm seeding:
- Text: 500+ novels, short stories, poems from major movements (Gothic, Romantic, Victorian, Early Modernist, etc.)
- Visual art: 300+ paintings and illustrations from major artists and movements
- Music: 200+ compositions across classical, folk, early jazz
- Film: Early cinema where available (limited by digitization)
This gives me a meaningful corpus before a single user submits anything.
The Novelty Dilemma
The problem: early submissions will seem more novel than they actually are, because the corpus they are compared against is still small.
The scenario:
Month 1: Alice submits a cyberpunk story. The corpus has no modern cyberpunk yet. Attribution: 20% matches "Neuromancer" (an early seed work), 80% novel.
Month 6: Bob submits a cyberpunk story heavily influenced by Alice's work. Attribution: 60% matches Alice, 15% matches "Neuromancer", 25% novel.
Month 12: Carol submits a cyberpunk story. By now I've seeded the Blade Runner screenplay, Gibson's other works, and a classic cyberpunk corpus. I re-analyze Alice's work. Updated attribution: 55% matches corpus, 45% novel (not 80%).
Alice's novelty score decreased, but she didn't change her work. The corpus grew.
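The arithmetic behind the scenario can be sketched in a few lines. This is a minimal illustration, assuming novelty is simply whatever share of a work is not attributed to a corpus match; the function name `novelty_pct` and the dictionary layout are hypothetical, not part of Proteus.

```python
def novelty_pct(matches: dict[str, int]) -> int:
    """Return the share of a work (in percent) not attributed to any corpus match."""
    matched = sum(matches.values())
    if not 0 <= matched <= 100:
        raise ValueError(f"matched share must be within 0..100, got {matched}")
    return 100 - matched


# Alice's story, analyzed against the Month 1 corpus.
alice_month_1 = {"Neuromancer": 20}
# Bob's story at Month 6, after Alice's work joined the corpus.
bob_month_6 = {"Alice": 60, "Neuromancer": 15}
# Alice's story re-analyzed at Month 12; the scenario gives only the 55% total.
alice_month_12 = {"expanded corpus (total)": 55}

print(novelty_pct(alice_month_1))   # 80
print(novelty_pct(bob_month_6))     # 25
print(novelty_pct(alice_month_12))  # 45
```

The work is the same in both of Alice's calls; only the set of matches the corpus can produce has changed.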
Is this a problem?
It's a limitation of the system but not a dealbreaker.
What doesn't change:
- Alice still gets paid for her 45% original contribution
- Alice's derivatives still pay her based on the matches
- Alice's work is still valuable and influential
What does change:
- Alice's "novelty" score adjusts to reflect better knowledge
- Works that matched Alice initially might now match her influences directly
- Attribution graph becomes more accurate over time
How Proteus will handle it:
- Transparency: Be upfront that novelty scores may decrease as the corpus grows
- Retroactive attribution: Re-analyze works quarterly as corpus expands
- Payment protection: Decrease in novelty doesn't retroactively change past payments
- Notification: Creators are notified when their attribution changes significantly
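One way to picture the payment protection rule is an append-only ledger: each payout records the novelty score in force at the time, and re-analysis only changes the score used for later payouts. The sketch below is an assumption-laden illustration, not the Proteus payment model. `WorkAccount`, `Payment`, and the "pool times novelty share" formula are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Payment:
    """A finalized payout; frozen so it cannot be edited after the fact."""
    period: str
    novelty_pct: int
    amount_cents: int


@dataclass
class WorkAccount:
    novelty_pct: int
    ledger: list[Payment] = field(default_factory=list)

    def pay(self, period: str, pool_cents: int) -> Payment:
        # Hypothetical formula: the creator's share is the pool times the novelty share.
        payment = Payment(period, self.novelty_pct, pool_cents * self.novelty_pct // 100)
        self.ledger.append(payment)
        return payment

    def reanalyze(self, new_novelty_pct: int) -> None:
        # Only the score used for future payouts changes; the ledger is untouched.
        self.novelty_pct = new_novelty_pct


alice = WorkAccount(novelty_pct=80)
alice.pay("month-1", pool_cents=1000)
alice.reanalyze(45)
alice.pay("month-12", pool_cents=1000)

print([p.amount_cents for p in alice.ledger])  # [800, 450]
```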
Alice's perspective:
"I submitted early when the corpus was small. My novelty score was 80%, now it's 45% after they added more cyberpunk works. Did I get screwed?"
No, because:
- She earned based on 80% while that score was accurate relative to the corpus at the time
- Her 45% is now measured against a richer corpus (more defensible)
- She still earns from derivatives that match her 45% contribution
- Early adopter advantage: Her work influenced Month 6 submissions before the corpus caught up
The honest pitch to early adopters:
"Submit now even though the corpus is small. Your work will seem more novel initially, which is both accurate (relative to the corpus) and temporary (as the corpus grows). You'll earn based on current attribution, which adjusts over time. Early adopters build influence networks before the corpus fills in around them."
The Retroactive Attribution Process
As the corpus grows, Proteus re-analyzes existing works to improve attribution accuracy.
What triggers re-analysis:
- Corpus grows by 20%+ since last quarter
- New coverage areas added (e.g., first cyberpunk works seeded)
- Creator requests review (claiming missing influence)
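The three triggers above could be checked with something like the following. This is a sketch under stated assumptions: `CorpusStatus` and `reanalysis_triggers` are hypothetical names, growth is measured as relative change in the number of corpus works since last quarter, and exactly 20% counts as meeting the threshold.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusStatus:
    size_last_quarter: int
    size_now: int
    new_coverage_areas: list[str] = field(default_factory=list)
    review_requests: list[str] = field(default_factory=list)


def reanalysis_triggers(status: CorpusStatus, growth_threshold: float = 0.20) -> list[str]:
    """Return the reasons a re-analysis run is due; an empty list means none."""
    reasons = []
    if status.size_last_quarter > 0:
        growth = (status.size_now - status.size_last_quarter) / status.size_last_quarter
        if growth >= growth_threshold:
            reasons.append(f"corpus grew {growth:.0%}")
    if status.new_coverage_areas:
        reasons.append("new coverage: " + ", ".join(status.new_coverage_areas))
    if status.review_requests:
        reasons.append(f"{len(status.review_requests)} creator review request(s)")
    return reasons


busy = CorpusStatus(1000, 1250, new_coverage_areas=["cyberpunk"])
quiet = CorpusStatus(1000, 1100)

print(reanalysis_triggers(busy))   # ['corpus grew 25%', 'new coverage: cyberpunk']
print(reanalysis_triggers(quiet))  # []
```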
What gets re-analyzed:
- Works with high viewership (value in accuracy)
- Works with >70% novelty scores (likely to change)
- Works in newly-covered genres/styles
What doesn't get re-analyzed:
- Works with stable, well-attributed scores
- Works with low viewership (cost vs benefit)
- Works in mature corpus areas (unlikely to change)
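The selection rules can be read as a filter over the catalog. The text does not say how to resolve conflicts (for example, a low-viewership work with a high novelty score), so this sketch assumes that any one of the three inclusion criteria is enough and everything else is skipped. The `HIGH_VIEWS` cutoff, the `Work` type, and every title except 'Neon Dreams' are hypothetical.

```python
from dataclasses import dataclass

HIGH_VIEWS = 10_000      # hypothetical cutoff for "high viewership"
HIGH_NOVELTY_PCT = 70    # from the text: works with >70% novelty are likely to change


@dataclass
class Work:
    title: str
    views: int
    novelty_pct: int
    genre: str


def should_reanalyze(work: Work, newly_covered_genres: set[str]) -> bool:
    """Assumption: a work qualifies if any single inclusion criterion holds."""
    return (
        work.views >= HIGH_VIEWS
        or work.novelty_pct > HIGH_NOVELTY_PCT
        or work.genre in newly_covered_genres
    )


catalog = [
    Work("Neon Dreams", views=500, novelty_pct=75, genre="cyberpunk"),
    Work("Moorland Letters", views=200, novelty_pct=30, genre="gothic"),
    Work("Harbor Lights", views=50_000, novelty_pct=40, genre="romance"),
]
selected = [w.title for w in catalog if should_reanalyze(w, {"cyberpunk"})]

print(selected)  # ['Neon Dreams', 'Harbor Lights']
```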
Creator notification:
"Your work 'Neon Dreams' has been re-analyzed against the expanded cyberpunk corpus. Your novelty score decreased from 75% to 52%. Your attribution now includes matches to William Gibson's 'Count Zero' (15%) and Bruce Sterling's 'Islands in the Net' (8%), which were recently added to the corpus. Your payment graph has been updated for future earnings. Past payments are unaffected."
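A notification like this could be generated from the before and after scores plus the newly found matches. The sketch below assumes a "significant" change means at least 10 percentage points. That cutoff, the function name `build_notification`, and the shortened message wording are hypothetical.

```python
from typing import Optional

SIGNIFICANT_CHANGE_POINTS = 10  # hypothetical cutoff for "changes significantly"


def build_notification(
    title: str, old_pct: int, new_pct: int, new_matches: dict[str, int]
) -> Optional[str]:
    """Return a creator-facing message, or None if the change is too small to report."""
    if abs(old_pct - new_pct) < SIGNIFICANT_CHANGE_POINTS:
        return None
    direction = "decreased" if new_pct < old_pct else "increased"
    parts = [
        f"Your work '{title}' has been re-analyzed.",
        f"Your novelty score {direction} from {old_pct}% to {new_pct}%.",
    ]
    if new_matches:
        listed = ", ".join(f"{name} ({pct}%)" for name, pct in new_matches.items())
        parts.append(f"New matches: {listed}.")
    parts.append("Past payments are unaffected.")
    return " ".join(parts)


message = build_notification(
    "Neon Dreams", 75, 52, {"Count Zero": 15, "Islands in the Net": 8}
)
print(message)
```

In the example, the two new matches (15% and 8%) account for the full 23-point drop from 75% to 52%.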
Success Metrics: How I Will Know It's Working
I will track corpus growth and value creation transparently, against the following thresholds.
What good looks like:
- Corpus growing 20%+ monthly (early phase)
- Attribution accuracy >90% (validated against known influences)
- Users finding 8+ influences on average (rich attribution)
- Retention >60% month-over-month (value being delivered)
- Novelty fade <20% on re-analysis (corpus maturing)
What bad looks like:
- Corpus growth <5% monthly (not attracting submissions)
- Attribution accuracy <70% (extraction quality issues)
- Users finding <3 influences on average (corpus too sparse)
- Retention <30% month-over-month (not delivering value)
- Novelty fade >50% on re-analysis (corpus gaps too large)
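The two lists leave a middle band between "good" and "bad". A small classifier makes that explicit. This is a sketch with several assumptions: the metric keys and the "watch" label are hypothetical, values exactly at a good threshold are treated as good, and novelty fade is measured in percentage points (so Alice's move from 80% to 45% is a fade of 35).

```python
# metric: (good threshold, bad threshold, higher_is_better)
THRESHOLDS = {
    "corpus_growth_pct": (20, 5, True),
    "attribution_accuracy_pct": (90, 70, True),
    "avg_influences_found": (8, 3, True),
    "retention_pct": (60, 30, True),
    "novelty_fade_pct": (20, 50, False),
}


def rate(metric: str, value: float) -> str:
    """Classify a metric value as 'good', 'bad', or 'watch' (the band in between)."""
    good, bad, higher_is_better = THRESHOLDS[metric]
    if not higher_is_better:
        # Flip signs so the same comparisons work for lower-is-better metrics.
        value, good, bad = -value, -good, -bad
    if value >= good:
        return "good"
    if value < bad:
        return "bad"
    return "watch"


report = {
    "corpus_growth_pct": rate("corpus_growth_pct", 25),
    "retention_pct": rate("retention_pct", 25),
    "novelty_fade_pct": rate("novelty_fade_pct", 35),
}
print(report)
# {'corpus_growth_pct': 'good', 'retention_pct': 'bad', 'novelty_fade_pct': 'watch'}
```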
Once Proteus reaches that point, I will publish the metrics quarterly to show progress.
What This Means for Different Stakeholders
For creators submitting now:
- You're building the foundation, not just using it
- Your work becomes corpus that future works attribute against
- Novelty scores may adjust as corpus grows (this is expected)
- Early adopter advantage: Build influence networks first
- Honest expectation: Month 1 is limited, Month 12 is valuable
For consumers subscribing now:
- Discovery works best for classical influences initially
- Modern attribution improves monthly as corpus grows
- You're supporting the infrastructure buildout
- Early access to attribution graphs as they form
For investors evaluating:
- Public domain seeding de-risks the cold-start problem
- Corpus growth metrics show product-market fit
- Network effects create defensible moat
- First-mover advantage compounds over time
For partners considering integration:
- Corpus quality determines integration value
- I'm tracking toward comprehensive coverage
- Integration makes more sense at Month 12+ than Month 1
- But early partnership shapes the corpus toward your use case
