Jewish intellectual property consistently used for AI training purposes

By ASAF ELIA-SHALEV/JTA

MARCH 26, 2024 06:58

Growing up Jewish in New York City, Heila Precel absorbed the lesson that education can set you on a path toward personal success and protect against the forces that have marginalized Jews throughout history.

“I was told by my family and by my culture versions of ‘They can’t take away your education.’ Investing in education has been a tremendously successful strategy for American Jews,” Precel said.

Precel heeded her childhood lesson and made her way to Boston University, where today she is working on a doctorate in computing and data sciences. But a research paper she just published, in partnership with other scholars, suggests that the formula for success that countless American Jews like herself have banked on could be in peril.

The threat comes from the rise of artificial intelligence systems powering the kind of chatbots that communicate like humans — ChatGPT, for example. Those systems are trained on books, articles and other texts that have been fed into the machine largely without the permission of their authors.

That means anyone who produces intellectual property can wind up seeing their work used without license. Those creators face potential copyright infringement and, in the longer term, possible job displacement as AI tools may come to replace many white-collar workers.

A slogan related to Artificial Intelligence (AI) is displayed on a screen in Intel pavilion, during the 54th annual meeting of the World Economic Forum in Davos, Switzerland, January 16, 2024. (credit: DENIS BALIBOUSE/REUTERS)

Jewish intellectual property used for AI

Precel discovered through research that Jews are overrepresented among authors whose intellectual property is being used for AI training purposes. Compared to their numbers in the overall US population, Jewish authors are overrepresented by a factor of two to six-and-a-half based on an analysis of available data. Among those authors are comedian Sarah Silverman and novelist Michael Chabon, both of whom have sued OpenAI, the company behind ChatGPT, for alleged copyright infringement.

Developers of AI systems are likely glad to hoover up all the content they get without regard for the identity of its authors, and no one is alleging that antisemitism is at play in the overrepresentation of Jewish authors. In fact, Precel acknowledges that the premise of her research can sound like a bit of a humblebrag: Jews make up a tiny portion of the population but have produced so much knowledge that, to a worrying degree, the future of AI research relies on them.

But she said a narrow interpretation like that would miss the point of her paper.

For one thing, the paper emphasizes that further research would likely find that other groups, such as Hindu Americans and Asian Americans, are also overrepresented. Precel also says exposing biases that harm Jews often reveals broader issues. That idea is reflected in an analogy in the title of the paper, “A Canary in the AI Coal Mine: American Jews May Be Disproportionately Harmed by Intellectual Property Dispossession in Large Language Model Training.”

“We are not saying that all of the lawyers are Jews, and therefore replacing lawyers is going to be bad for the Jews,” Precel said. “There are many lawyers who are not Jewish, and what we are seeing is going to be bad for everyone. It just might be especially bad for Jews, because Jews have historically put a lot of our eggs into this basket of educational attainment. In other words, we are shining a light on this overall problem with the canary-in-the-coal-mine analogy — while making sure to remember that canary itself does not fare too well in this story.”

Precel grew up in a Conservative Jewish household and attended Jewish day school as a child. As an adult she has become more observant and attends synagogue weekly. The label she gives herself is traditional egalitarian. That is all to say that Precel has had many chances to discuss her research with other Jews whose texts may be found in databases used for AI training without permission.

In fact, her new paper is published in such a database. She says she’s encountered people with concerns, but many others don’t understand where the training data comes from or how it’s used.

“I get a lot of surprised reactions and some anxieties but also optimism,” Precel said.

Her paper belongs to a larger genre of research into the impacts and implications of technological advancements in the areas of artificial intelligence and machine learning. But Precel’s co-author, Nicholas Vincent, said the issue is often examined “through the lens of underrepresentation” rather than overrepresentation.

“The most famous example is models that performed really poorly on people with dark skin,” said Vincent, a computer science professor at Simon Fraser University in Burnaby, Canada, referring to the problem of image analysis software mislabeling Black people as gorillas. In the realm of text-based systems, he said, “if you’re not from the predominant cultural background, you’re more likely to sort of receive poor outcomes with models used for hiring or credit scoring.”

A new paper released this month tested how AI relates to people speaking an African-American dialect of English as opposed to using what’s known as Standard American English. The study found that the AI makes racist assumptions based on the difference. One chatbot, for example, was more likely to recommend the death penalty for defendants when they spoke African-American English.

One of the limitations of all these studies is that many artificial intelligence systems operate as black boxes. With ChatGPT, for example, it’s not possible to know what content developers used to train the system, because its owner, OpenAI, considers that information proprietary.

For the Jewish authorship paper, what researchers tried to do, then, is study not the systems but the data that is fed into them. They looked at what data the open-source systems use and at digital repositories of knowledge that are likely being used by the proprietary systems. These repositories contain massive amounts of scientific literature, published books, legal opinions and other kinds of texts.

But since authorship information typically doesn’t indicate that someone is Jewish, the researchers searched for a way to identify and classify authors en masse. For that task they turned to the field of Jewish demographic studies.

Many different techniques exist to identify and count Jews; each has its own strengths and weaknesses. Using surveys to study Jews, for example, can help answer granular questions but is very costly because Jews are a small minority scattered across a wide geography.

“You end up spending a huge amount of money reaching out to people who are not Jewish,” Precel said. “There have been a lot of methods developed in the Jewish demographic literature to try to solve this problem.”

The team settled on a method that infers Jewish identity based on a set of distinctive Jewish last names. Many Jews have last names that are not distinctively Jewish, but demographers have repeatedly found over recent decades of American Jewish history that distinctive Jewish names can be used as a statistical proxy for the overall Jewish population. The method is not helpful for research about Jewish diversity, but it can be used in certain scenarios, such as estimating the number of Jews in a long list of authors of AI training texts.
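To make the name-based approach concrete, here is a minimal sketch of how such an estimate might be computed. It is not the researchers' code. The surname list, the author names and the baseline share are all invented for illustration, and the function names are hypothetical. The idea is to measure what fraction of authors in a list carry a surname from the distinctive set, then divide by the fraction of the general population carrying those same surnames.

```python
# Illustrative only: NOT the surname list, data or figures used in the paper.
DISTINCTIVE_SURNAMES = {"cohen", "levy", "goldberg"}


def surname(full_name: str) -> str:
    """Return the lowercased last whitespace-separated token of a name."""
    parts = full_name.strip().split()
    return parts[-1].lower() if parts else ""


def distinctive_share(authors: list[str]) -> float:
    """Fraction of authors whose surname is in the distinctive set."""
    if not authors:
        return 0.0
    hits = sum(1 for name in authors if surname(name) in DISTINCTIVE_SURNAMES)
    return hits / len(authors)


def overrepresentation_factor(corpus_share: float, baseline_share: float) -> float:
    """Ratio of the share seen in the corpus to the share in the baseline population."""
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    return corpus_share / baseline_share


# Invented author list: 2 of the 8 names carry a surname from the set above.
authors = [
    "A. Cohen", "B. Smith", "C. Levy", "D. Jones",
    "E. Lee", "F. Garcia", "G. Brown", "H. Davis",
]
share = distinctive_share(authors)

# Invented baseline, chosen only to keep the arithmetic easy to follow.
factor = overrepresentation_factor(share, baseline_share=0.125)
print(f"share={share}, factor={factor}")
```

Because the same surname set is counted in both the author list and the baseline, the ratio does not require identifying every Jewish author, which is why the approach works as a proxy but says little about individuals or about diversity within the group.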

Much of the paper is spent on what might be done to address the concerns raised by the findings. The researchers imagine a future in which AI is allowed not to replace human work but to augment it, while avoiding large-scale economic disruption.

One possibility for achieving that scenario is using the findings to help inform policymakers and AI developers concerned with the ethical dimension of the technology. But the researchers also suggest another route.

“If people organize collectively around their intellectual property, there can be a more level playing field to negotiate with operators of AI technologies,” Vincent said. “Individually, your data is of really low value, but when we get enough people together, we have a lot of leverage.”

The Jewish community might already be organized enough to make collective advocacy possible. While there isn’t a union of Jewish writers, for example, informal coalitions of creative professionals have responded to anti-Israel sentiment in the literary world and in Hollywood.

In a hypothetical scenario, a group representing Jewish writers could come together and agree to adopt measures on their websites blocking bots from collecting content.
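One common way a website can ask bots to stay away is a robots.txt file. The sketch below is an assumption about what such a measure could look like, not something described in the paper. It uses Python's standard-library parser to show the effect of a rule aimed at GPTBot, the crawler name OpenAI publishes for its web crawler. The page address is a placeholder, and compliance with robots.txt is voluntary on the crawler's part.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a writer's site: refuse one AI crawler,
# leave the site open to every other visitor.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL, used only to query the rules above.
page = "https://example.org/essays/my-essay.html"

ai_crawler_allowed = parser.can_fetch("GPTBot", page)
other_visitor_allowed = parser.can_fetch("ExampleBrowser", page)

print("GPTBot allowed:", ai_crawler_allowed)
print("Other visitors allowed:", other_visitor_allowed)
```

A rule like this only affects future crawling by bots that choose to honor it. It does nothing about text that has already been collected.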

“So going forward, that group is particularly hard to get data for, and then all of a sudden there’s a big gap in the data,” Vincent said.

Jews have always been prolific writers. Has AI wound up with too much of their work?

Jews are overrepresented among authors whose intellectual property is being used for AI training purposes, and while they make up a small percent of the population, the future of AI relies on them.
