Column
Primitive Accumulation for the LLM Age
By Oz Gultekin
The scraping of the public web was not a bug. It was the opening move of a classic enclosure. A column on how the training corpus became private property.
In 1773, the Parliament of Great Britain passed the Inclosure Act. The Act converted common land, worked collectively by peasants for grazing, foraging, and subsistence, into private property for landlords. The peasants were not consulted. The peasants were told they had been made freer, because now they could sell their labour in a city.
Marx called this primitive accumulation. The original theft. The precondition for everything capitalism did next.
The scraping of the open Web between roughly 2016 and 2023 is primitive accumulation for the LLM age. The theft has already happened. The enclosure is already complete. The question on the table in 2026 is only who gets to write the legislation that retroactively legalizes it.
What was enclosed
Every public-facing webpage on the Internet in the late 2010s was, in practice, a commons. A writer published a blog post. A researcher posted a preprint. A photographer uploaded a portfolio. A musician posted a demo. A grandmother uploaded a recipe. A union local posted meeting minutes.
None of it was free in the sense of being worthless. All of it was free in the sense that the cost of looking at it was zero and the social licence for looking at it was “please link back.” That was the commons. Informal, unenforced, universal.
Between roughly 2016 and 2023, the commons was scraped in bulk by roughly a dozen entities. Common Crawl scraped it in public. LAION scraped it in public. The frontier labs scraped it privately and lied about it later. The grandmother’s recipe is now a training example in a model trained by a company worth more than the GDP of Portugal. The grandmother got an AdSense impression for her trouble, or more likely nothing.
This is the enclosure. The peasant commons became private property through a one-sided transaction the peasant was not party to.
The opt-out is theatre
The current state of the art in data governance is the opt-out. A website operator can add a robots.txt directive telling a specific AI scraper not to crawl the site. A website operator can file a takedown notice. A website operator can sign up for a licensing scheme operated by the same labs that did the scraping in the first place.
All of this is theatre. Consider the order of operations.
- The scraper took everything before the opt-out existed. The model is already trained.
- The robots.txt directive only applies going forward, and only to the scrapers that announce themselves. The ones that use residential proxies and rotate user agents do not announce themselves.
- Takedown notices target individual URLs. You cannot take down a training example because the training example is now a set of weights, not a file.
- The licensing schemes are opt-in, run by the labs, priced by the labs, and available only to rightsholders with the legal staff to negotiate them. Your local newspaper, your grandmother, the indie musician, the union local — none of them have the legal staff. So none of them opt in.
The opt-out is a system designed to produce no outs. It is the same trick a twenty-first-century landlord pulls with an arbitration clause in a rental agreement. Sure, you can opt out. You just have to notarize a form and mail it in the first thirty days and hire a lawyer to enforce it. In practice, nobody does.
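The second point in the list above, that a robots.txt directive binds only the scrapers that announce themselves, can be shown in a few lines. What follows is a minimal sketch using Python's standard-library robots.txt parser. The file contents, the example.org URL, and the helper name may_fetch are all hypothetical, and "GPTBot" is used only as an example of a crawler that publishes a user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site. "GPTBot" stands in for
# any AI crawler that publishes a user-agent token.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical page on the illustrative site.
URL = "https://example.org/recipes/borscht"


def may_fetch(user_agent: str, url: str) -> bool:
    """Return what a compliant crawler would conclude from ROBOTS_TXT.

    Nothing here blocks a request. The check is one the crawler runs on
    itself, keyed to the name the crawler chooses to report.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that announces itself is told to stay out.
print(may_fetch("GPTBot", URL))
# The same crawler reporting a generic browser name is told it may proceed.
print(may_fetch("Mozilla/5.0", URL))
```

The directive is matched against whatever name the client supplies, so the rule constrains only the crawlers that choose to identify themselves and to honour it.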
The Canadian legal gap
Canada has a Copyright Act. The Copyright Act has a set of statutory exceptions. One of them is fair dealing for the purpose of research, an exception that long predates the 2012 amendments. It is the exception the labs will invoke, in court, when somebody finally sues them in Canada.
The argument will be that training a model is “research.” The argument will be that the output of a model does not reproduce the training inputs, so no copyright is infringed. The argument will cite a line of American cases on “transformative use” that has nothing to do with Canadian law and will be wheeled in anyway.
The Copyright Act was last substantively amended in 2012. The drafters did not anticipate a regime in which every word, image, and song on the Internet is ingested en masse and converted into a commercial product by a firm that is not a researcher in any sense of the word Canadian law would recognize.
A Canadian Parliament that took primitive accumulation seriously would do four things.
- Define training as a restricted act. Explicitly name the ingestion of copyrighted work into a training corpus as an act covered by copyright. Not fair dealing. Not research. A use the copyright holder must license.
- Create a collective licence. For the same reason radio stations do not negotiate with every songwriter individually, training data should be licensed collectively through a body like SOCAN. A blanket licence, a public tariff, and a distribution formula. This is the opposite of the lab-run opt-in scheme. The lab pays; the public body distributes.
- Require provenance disclosure. Any model offered commercially in Canada must publish the list of datasets it was trained on. No black-box corpora. No “proprietary mix.” A public register.
- Back-date the licence. The training has already happened. The licence is therefore retroactive, priced as a percentage of revenue from any product using a model trained before the Act comes into force. This is the only way to claw back the surplus from the scrape that has already happened.
None of this will happen in the current Parliament. The lobbying has already started, and the lobbying is well-funded.
“But language is just patterns”
The industry’s philosophical defence of mass scraping is that the model learns patterns, not copies. A model does not reproduce the New York Times article. A model learns, from the New York Times article among millions of others, to produce Times-like prose on demand.
This is true at the token level. It is irrelevant at the economic level. If I hire an apprentice, I am allowed to have that apprentice read the New York Times, learn from it, and then write prose in a similar style. I am not allowed to bottle the apprentice’s brain, replicate it a million times, and sell the replicas.
The model is not an apprentice. The model is the bottling operation. The commodity is the bottled pattern, not the pattern itself. Commodifying the pattern is the novel economic act, and it is the one the law has not caught up with.
Every industry adjacent to this one understands the distinction perfectly. A radio station pays SOCAN because it broadcasts music, not because the music itself is proprietary. A hospital pays Elsevier because it accesses research, not because research itself is proprietary. The principle is that commercial use of a corpus triggers payment to the corpus, regardless of whether the use “copies” in a 1976 sense.
The LLM is a commercial use of a corpus. The corpus should be paid. This is the entire argument. It is not complicated. The industry pretends it is complicated because the answer, if we admit it is not complicated, is expensive.
The argument
The scraping was enclosure. The enclosure was the opening move. The move is complete.
What remains is not a prevention question. It is a restitution question. The public commons was converted into private property, and the private property is now producing private profit, and no share of that profit flows back to the commons.
A serious anti-capitalist politics of AI starts here, not with the model, not with the product, not with the job loss. It starts with the question of who owns the training corpus, and the answer is: the public did, the public still morally does, and the law should say so in black ink.
Every other argument in AI policy is downstream of that one.