Building the Corpus

Every attribution system has a chicken-and-egg problem. To attribute influences, you need a corpus to compare against. But to build a corpus, you need creators submitting work. And creators won't submit if there's no corpus to create value. This is the first-mover problem, and it's a real challenge.

Here's how I'm solving it, and why my approach is honest about the limitations while building toward the vision.

The Public Domain Solution

My strategy: Don't wait for users to build the corpus. Seed it myself with public domain works.

What is public domain?

Creative works whose copyright has expired (typically 70+ years after creator's death in the US, varies by country). These works are free to use, analyze, and incorporate without permission or payment. They're humanity's shared cultural heritage.

Examples:

Literature: Shakespeare, Jane Austen, Mark Twain, H.P. Lovecraft, Edgar Allan Poe
Visual art: Van Gogh, Monet, Rembrandt, Hokusai
Music: Bach, Mozart, Beethoven, traditional folk songs
Film: Early cinema (pre-1928), many classic films

Why this solves the first-mover problem:

Immediate benefits:

Instant corpus: From day one, I have 1,000+ significant works across media types
Culturally significant: Public domain includes many of the most influential works ever created
Foundation for derivatives: Modern works often build on classics—I can detect that
Zero licensing cost: No permissions needed, no payment obligations
Community validation: These works are well-documented, so attribution quality can be verified

What I'm seeding:

Text: 500+ novels, short stories, poems from major movements (Gothic, Romantic, Victorian, Early Modernist, etc.)
Visual art: 300+ paintings and illustrations from major artists and movements
Music: 200+ compositions across classical, folk, early jazz
Film: Early cinema where available (limited by digitization)

This gives me a meaningful corpus before a single user submits anything.

The Novelty Dilemma

The problem is early submissions will seem more novel than they actually are.

The scenario:

Month 1: Alice submits a cyberpunk story. Corpus has no modern cyberpunk yet. Attribution: 20% matches "Neuromancer" (public domain seed), 80% novel.
Month 6: Bob submits cyberpunk story heavily influenced by Alice's work. Attribution: 60% matches Alice, 15% matches "Neuromancer", 25% novel.
Month 12: Carol submits cyberpunk story. I've now seeded Blade Runner screenplay, Gibson's other works, classic cyberpunk corpus. I re-analyze Alice's work. Updated attribution: 55% matches corpus, 45% novel (not 80%).

Alice's novelty score decreased, but she didn't change her work. The corpus grew.

Is this a problem?

It's a limitation of the system but not a dealbreaker.

What doesn't change:

Alice still gets paid for her 45% original contribution
Alice's derivatives still pay her based on the matches
Alice's work is still valuable and influential

What does change:

Alice's "novelty" score adjusts to reflect better knowledge
Works that matched Alice initially might now match her influences directly
Attribution graph becomes more accurate over time

How Proteus will handle it:

Transparency: upfront that novelty scores may decrease as corpus grows
Retroactive attribution: Re-analyze works quarterly as corpus expands
Payment protection: Decrease in novelty doesn't retroactively change past payments
Notification: Creators are notified when their attribution changes significantly

Alice's perspective:

"I submitted early when the corpus was small. My novelty score was 80%, now it's 45% after they added more cyberpunk works. Did I get screwed?"

No, because:

She earned based on 80% while it was accurate
Her 45% is now measured against a richer corpus (more defensible)
She still earns from derivatives that match her 45% contribution
Early adopter advantage: Her work influenced Month 6 submissions before the corpus caught up

The honest pitch to early adopters:

"Submit now even though the corpus is small. Your work will seem more novel initially, which is both accurate (relative to the corpus) and temporary (as the corpus grows). You'll earn based on current attribution, which adjusts over time. Early adopters build influence networks before the corpus fills in around them."

The Retroactive Attribution Process

As the corpus grows, re-analyze existing works to improve attribution accuracy.

What triggers re-analysis:

Corpus grows by 20%+ since last quarter
New coverage areas added (e.g., first cyberpunk works seeded)
Creator requests review (claiming missing influence)

What gets re-analyzed:

Works with high viewership (value in accuracy)
Works with >70% novelty scores (likely to change)
Works in newly-covered genres/styles

What doesn't get re-analyzed:

Works with stable, well-attributed scores
Works with low viewership (cost vs benefit)
Works in mature corpus areas (unlikely to change)

Creator notification:

"Your work 'Neon Dreams' has been re-analyzed against my expanded cyberpunk corpus. Your novelty score decreased from 75% to 52%. Your attribution now includes matches to William Gibson's 'Count Zero' (15%) and Bruce Sterling's 'Islands in the Net' (8%), which were recently added to the corpus. Your payment graph has been updated for future earnings. Past payments are unaffected."

Success Metrics: how I will know it's Working

Tracking corpus growth and value creation transparently:

What good looks like:

Corpus growing 20%+ monthly (early phase)
Attribution accuracy >90% (validated against known influences)
Users finding 8+ influences on average (rich attribution)
Retention >60% month-over-month (value being delivered)
Novelty fade <20% on re-analysis (corpus maturing)

What bad looks like:

Corpus growth <5% monthly (not attracting submissions)
Attribution accuracy <70% (extraction quality issues)
Users finding <3 influences on average (corpus too sparse)
Retention <30% month-over-month (not delivering value)
Novelty fade >50% on re-analysis (corpus gaps too large)

Once Proteus is there, publish the metrics quarterly to show progress.

What This Means for Different Stakeholders

For creators submitting now:

You're building the foundation, not just using it
Your work becomes corpus that future works attribute against
Novelty scores may adjust as corpus grows (this is expected)
Early adopter advantage: Build influence networks first
Honest expectation: Month 1 is limited, Month 12 is valuable

For consumers subscribing now:

Discovery works best for classical influences initially
Modern attribution improves monthly as corpus grows
You're supporting the infrastructure buildout
Early access to attribution graphs as they form

For investors evaluating:

Public domain seeding de-risks cold start problem
Corpus growth metrics show product-market fit
Network effects create defensible moat
First-mover advantage compounds over time

For partners considering integration:

Corpus quality determines integration value
I'm tracking toward comprehensive coverage
Integration makes more sense at Month 12+ than Month 1
But early partnership shapes the corpus toward your use case