OpenAI issues response to lawsuit filed by New York Times – Video


OpenAI has responded publicly to the New York Times lawsuit, emphasizing its commitment to developing AI tools that empower people to solve problems that are otherwise out of reach. The company states that it collaborates with news organizations and is creating new opportunities to support a healthy news ecosystem. Despite these efforts to work with publishers, The Information has reported that OpenAI offers some publishers as little as $1 million to $5 million annually to license their news articles for training its large language models. While that may look like a small amount, it raises questions about how much news articles are worth as training data. The debate over licensing deals between AI companies and news organizations is ongoing, with Apple also trying to strike similar deals. It will be interesting to see how this area develops and how it affects the quality of future AI models. OpenAI’s response to the lawsuit shows how difficult it is to balance technology development against ethical considerations and legal obligations.

Watch the video by TheAIGRID

Video Transcript

So OpenAI has actually responded to the New York Times lawsuit. I don’t know how they responded legally, but this is how they responded publicly, and I think it’s important that OpenAI has responded publicly because, of course, if they do that, it shows

that they do care about their brand image, because another company coming out and claiming that you’ve done some crazy things that are illegal in terms of copyright isn’t really good for your public image. So today we’re going to break down OpenAI’s new post and how

exactly they’ve responded, and a lot of the evidence to support their claims. One of the first paragraphs that they wrote was: “Our goal is to develop AI tools that empower people to solve problems that are otherwise out of reach. People worldwide are using our technology to improve their

daily lives. Millions of developers and more than 92% of Fortune 500 are building on our products today.” One thing I do want you guys to do is remember that quote here, “millions of developers and more than 92% of Fortune 500 are building on our products today,”

because that is going to be an important point later on in the video, because it has some severe ramifications, not bad ones but good ones, for OpenAI. They then state that while we disagree with the claims in the New York Times lawsuit, we view it as an opportunity to clarify

our business, our intent, and how we build our technology. Our position can be summed up in these four points, which we flesh out below: we collaborate with news organizations and are creating new opportunities; training is fair use, but we provide an opt-out because it’s the

right thing to do; regurgitation is a rare bug that we’re working to drive to zero; and the New York Times is not telling the full story, which they definitely aren’t. Now one of the things I did want to say is that loads of people do use ChatGPT, and the reason

I’ve brought this point up is because in the previous video, where we discussed the lawsuit, they actually talked about how they are asking for the destruction of GPT-4, which to me is absolutely ludicrous. I mean, you

know, asking for the destruction of GPT-4, which took, I think, something like $100 million in training runs, is pretty crazy, and that’s why I said before, considering that 92% of Fortune 500 companies are building on this

technology, that wouldn’t be good for the vast majority of these companies. Honestly, I believe that that claim and that request is a bit absurd considering just how big the software is. It’s like someone claiming that Google should be destroyed or something like that. It’s a bit

of an overreach, and I think, considering just how many people use ChatGPT, on some levels you could argue that it is a tool that is used to level the playing field in terms of intelligence, because although internet and intelligence isn’t a basic human right, I remember that

there were some claims that stuff like internet should now be a basic human right because of how much you do miss out on if you don’t even have basic internet access. I do think that with ChatGPT in the future, when we’ve got all of these companies

building on top of it and it’s so integrated in our society, the likelihood that we do get this model destroyed due to one simple lawsuit isn’t as high as you’d think. So one of the first things that they talk about is how they actually collaborate with news organizations and are creating new

opportunities. They say: we’ve met with dozens, as well as leading industry organizations like the News/Media Alliance, to explore opportunities, discuss their concerns, and provide solutions. We aim to learn, educate, listen to feedback, and adapt. Our goals are to support a healthy news ecosystem, be a good partner, and

create mutually beneficial opportunities. With this in mind, we have pursued partnerships with news organizations to achieve these objectives: deploy our products to benefit and support reporters and editors by assisting with time-consuming tasks like analyzing voluminous public records and translating stories; teach our AI models about the world by training on additional

historical and non-publicly available content; and display real-time content with attribution in ChatGPT. Now essentially the new partnerships that they mention are actually true, because recently OpenAI did partner with many different companies for beneficial use of AI in journalism. What is crazy about this was that

there was a little bit of debate. Now I am going to be talking about it later on in the video, but essentially what they discuss further down in the document is how OpenAI has been working with different people. Now I know

that OpenAI openly stated in this that they were trying to reach a deal with the New York Times in order to get the rights to train on their data and essentially just work with them, but this article from The Information has stated that OpenAI offers publishers

as little as $1 million a year in order to license their content. It says here that OpenAI has offered some media firms as little as between $1 million and $5 million annually to license their news articles for use in training its large language models, according to two executives who have recently negotiated

with the tech company. That’s a tiny amount even for small publishers, which could make it difficult for OpenAI to strike deals. And it says that meanwhile Apple, which is trying to catch up with OpenAI and Google in the generative AI sector, is also trying to strike deals

with publishers for use of their content. Said one of the executives, Apple is offering more money but wants the right to use the content more widely than, I’m guessing, is usually allowed. Now I’m not entirely too clued up about industry standards, but I’m not sure why a company

would be turning down $1 million to $5 million to license its content when a lot of smaller companies out there are going to be training on your news articles anyway. So I don’t know what’s going on there, but I do think it is interesting that

OpenAI has gone out and done this, because it does make sense from their standpoint. Of course the company’s burning money anyway, and they’re continually having to raise money because the company doesn’t make money due to the expense of training costs. I think they will be profitable in the

future, but the point is that this investment from OpenAI does show that they’re not just some monolith that seeks to consume all data on the internet in order to produce these good models. But I’m wondering, if 1 to 5

million a year isn’t a good deal, how much is? I mean, how much are news articles worth, and how much do they actually contribute to the quality of ChatGPT? Because high-quality data is definitely something that these companies need in order to train these models, and

we know that high-quality data is the backbone of a lot of these models. So it will be interesting to see how this area develops, because 5 million annually for a company sounds like a good deal. Maybe it might not be, but it will be interesting to see if there are

more publishers that come up with licensing deals with other AI companies and how this bit does develop. Now here we do have one of the main discussion points around this, which is fair use. If you don’t know what fair use is, fair use is basically the part of copyright law that

allows you to use copyrighted content without a license, as long as you use it in a fair way. So, for example, it says here: training is fair use, but we provide an opt-out because it’s the

right thing to do. Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness, which is a very key point, US competitiveness.

We’re going to touch on that in a moment. The principle that training AI models is permitted as a fair use is supported by a wide range of academics, library associations, civil society groups, startups, yada yada yada. And essentially what they’re saying here is: we have led the industry in providing simple opt-out

processes for publishers, which the New York Times adopted in August 2023 to prevent our tools from accessing their sites, which they actually did. Now something that you’re going to want to see is the actual comments to the Copyright Office. I think they were very, very fascinating, and these were the

comments by Andreessen Horowitz, a venture capital firm. Essentially they said the only practical way generative AI models can exist is if they can be trained on an almost unimaginably massive amount of content, much of which will be subject to copyright. For example, large language models are trained on

something approaching the entire corpus of the written word. Although I think that is somewhat true, I would argue that certain models do challenge that. I mean, we did have Phi-2, which is a language model that is considerably smaller, and they trained it largely on textbook-style data. But essentially what

they’re saying here is that if you really want a general AI model to be really good, you’re going to have to train it on pretty much everything, and the only way to train it on everything is to include some of the copyrighted stuff there. So I mean they do have a

point that the only way to really do this, and the only way to really advance society, I guess you could say, is to train it on that kind of stuff. Maybe the tech in the future is of course going to change

that, but as it is currently, we’re at a point where whatever the case outcome is, it is going to set the precedent. This is why I’m really paying attention to the case, because whichever way the case goes will pretty much determine the future of

generative AI, and I think it’s going to go one way, and that’s OpenAI’s way. But anyway, secondly, AI models are not vast warehouses of copyrighted material, which is very true, and any suggestion to this effect is a plain misunderstanding of the technology, which is definitely

true, because that’s just not how AI models work. They don’t just take the data and store it. So as a result, when an AI model is trained on copyrighted works, the purpose is not to store any of the potentially copyrightable content of any work on which it is trained. Rather, training

algorithms are designed to use training data to extract facts and statistical patterns across a broad body of examples, i.e. information that is not copyrightable, which is definitely verifiably true. These models aren’t really trained to be, I guess you could say, copyright

leeches on news articles. They’re essentially just training on the facts of what has happened to gain an understanding of the world. And then of course what they say is that, as an empirical matter, the overwhelming majority of the time the output of a generative AI service is not substantially similar

in the copyright sense to any particular copyrighted work that was used to train the model. Even researchers employing sophisticated attacks on AI models have shown extremely small rates of memorization. I’m going to actually show you guys some research papers that show that this is actually quite true,

which does and will show that maybe, potentially, the New York Times did go to a certain extent in order to, I guess you could say, get this lawsuit off the ground. I just do want to say that this is speculation, because this is clearly an ongoing lawsuit and I don’t

need any lawsuits coming my way. So of course anything here is speculation, but it is fascinating to see the research show you how memorization does work. And essentially what they do say here as well is that AI companies have in turn made the USA a global leader in

AI, and that undermining those expectations will jeopardize future investment, along with US economic competitiveness and national security. I do have to agree somewhat with that point, because the US does have to remain pretty competitive, and in order to do that they don’t really want to have laws that are

going to restrict these AI systems in terms of their development. Now AI safety is a completely other thing. AI safety is a broad topic that we’ve discussed many times, and I think AI safety is completely something that is needed. But essentially competitiveness is of course a really,

really important thing, and I do think that if they did rule in the New York Times’s favor, in terms of just saying, you know, destroy the model, yada yada yada, that would set us back definitely quite a bit. But like I said, I

don’t think that is going to be the outcome of this at all. I can really summarize all of the points here. Essentially one of the things that they also talk about in this is that collective or statutory licensing for AI model training is impractical and

unworkable, because the diversity and volume of content needed for AI training make identifying rights holders and managing royalties almost impossible. Additionally, such a scheme could impose prohibitive costs on AI development, hindering innovation and competition. And essentially, to conclude, this venture capital firm urges the Copyright Office to recognize the

transformative potential of AI and maintain a balanced copyright approach that fosters innovation. They stress that overregulation could hinder the US’s leadership in AI, affecting both economic growth and national security. In summary, their comments emphasize the importance of a supportive legal environment for AI development, highlighting AI’s vast potential and warning against the

impracticalities and risks of restrictive copyright practices in this domain. Number three is where we get into a little bit of the technical stuff, because they state here that regurgitation is a rare bug that we are working to drive to zero. So our models were designed and trained to learn

concepts in order to apply them to new problems. It says that memorization is a rare failure of the learning process that we are continually making progress on, which is very true, but it is more common when particular content appears more than once in training data, like if

pieces of it appear on lots of different public websites. So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate content is not an appropriate use of our technology and

is against our terms of use. There were some other prompt injection attacks that you could do to OpenAI’s models to get them to spit out training data, and so far we’ve seen that every time these bugs have actually occurred, OpenAI has swiftly fixed the

problem. So we do know that if the New York Times really wanted to, they could have just communicated this issue to OpenAI, and I’m pretty sure they would have fixed it as soon as they could have. Another point they make is that, just as humans

obtain a broad education to learn how to solve new problems, we want our AI models to observe the range of the world’s information, including from every language, culture, and industry. Because the models learn from the enormous aggregate of human knowledge, any one sector, including news, is a tiny slice of

overall training data, and any single data source, including the New York Times, is not significant for the model’s intended learning. I do believe that to be true, because I’m not entirely sure how much data GPT-4 was trained on, but it was trained on a huge

amount, and I can’t imagine that the New York Times’s articles were the bulk of that. Now remember it was trained on, of course, written text and of course images, but I’m pretty sure that there are so many different web pages out there, so many different pieces

of written content. OpenAI hasn’t publicly disclosed what their data set was, so we’re never going to know, but I can’t imagine that it was the majority of their model’s training data, because if it was, there would definitely be easier ways to extract that content. So here we have a research paper that

actually looks at memorization across large language models, and essentially it dives into how much these large language models do regurgitate and how that all works. Essentially they say that large language models have been found to memorize parts of their training data. The memorization

is pretty problematic, as it can expose private data, reduce the utility of the model due to low-quality output, and can impact fairness. Now they talk about three log-linear relationships, and they’ve identified three key factors that can influence memorization in large language models. The first is model scale,

which essentially just means that larger models memorize more than smaller ones. They also talk about data duplication, which is that the more often an example is repeated in the training data, the more likely it is to be memorized, which is

exactly what OpenAI were just talking about. And they also talk about context, which is essentially the fact that longer contexts used to prompt the model increase the likelihood of memorization. So there’s definitely a lot of stuff that is in this research paper,

and I do find it fascinating to see how OpenAI are going to potentially address this problem, because I do think that on some level it is kind of a problem. But like I said before, usually when OpenAI notices that this becomes a problem, they’ve managed to fix it very quickly.

And here’s an extract. I’m not sure if this is from the same exact paper, because I was looking at quite a few, and essentially it says here that overall we find that models only regurgitate infrequently, with most models not regurgitating at all under our evaluation setup. However, in rare

occasions where models do regurgitate, large spans of verbatim content are reproduced, which is essentially what happened with the New York Times. But let’s actually take a look once again at the New York Times’s claims, because OpenAI claims that the New York Times is not telling the full story, and I’m pretty

sure when you see a few more of the screenshots and exactly how things were prompted, you could argue that OpenAI does have a strong case for this. So it says that the New York Times is not telling the full story: our discussions with the New York Times had appeared to

be progressing constructively through our last communication on December 19. The negotiations focused on a high-value partnership around real-time display with attribution in ChatGPT, in which the New York Times would gain a new way to connect with their existing and new readers, and our users would gain

access to their reporting. We had explained to the New York Times that, like any single source, their content didn’t meaningfully contribute to the training of our existing models and also wouldn’t be sufficiently impactful for future training. Their lawsuit on December 27, which we learned about

by reading the New York Times, came as a surprise and a disappointment to us. So essentially what they’re saying here is that the New York Times somewhat blindsided them, because they were actually communicating and seemed to be progressing in those communications towards a high-value partnership. And somehow, I don’t

know if there was a lapse in communication between leaders at the New York Times, and one person wanted to continue and another person said, you know what, no way, we’re doing this lawsuit, but it’s clear that there was some kind of communication error. Now essentially what they said here is: along

the way, they had mentioned seeing some regurgitation of their content but repeatedly refused to share any examples, despite our commitment to investigate and fix any issues, which OpenAI honestly has always done. One thing that I will say about OpenAI is that anytime I’ve seen an issue

posted by someone on Twitter, by the time you see the tweet, OpenAI has already seen it and fixed it, and you’re not able to regurgitate or have that similar issue. So essentially it said: we’ve demonstrated how seriously we treat this as a priority, such as in July when we

took down a ChatGPT feature immediately after we learned it could reproduce real-time content in unintended ways. And that was essentially where we had this. It said that we learned that ChatGPT’s Browse beta can occasionally display content in ways that we don’t want. So essentially what ChatGPT could do is it

could bypass paywalls and just display all the content in ChatGPT. What they did is they pretty much just disabled that feature at the time, and I remember at that time thinking, wow, they acted so quickly. So essentially here you can see, and another thing as

well, if I just go back to this point here, this is actually really good for OpenAI, because essentially it shows that they’ve always been acting in good faith. If they saw this and didn’t decide to patch it quickly, it would be

argued in court that OpenAI has been acting in bad faith for quite some time, but OpenAI has a clear and consistent pattern of always doing right by the user on any sort of issue, and by any company that has a copyright issue. So here’s where

things do get interesting, because they say: interestingly, the regurgitations the New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites, which is the key point. Because if an article is on multiple third-party websites, it means that that piece of

data is going to be shared around by a lot of different people, and of course a lot of different people are going to be using that content, and if they’re training on many different pieces of data on the internet, it’s likely to be repeated multiple times. Now here’s where

they say that it seems that they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate, which is exactly what we talked about earlier when we said that if you give it a large piece of text and you say, you know, complete this text, it’s

able to regurgitate a piece of training data. And it says even when using such prompts, our models don’t typically behave in the way the New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked examples from many attempts. That is very fascinating, because one

of the examples, which we even discussed in the previous video, was broken down by a Twitter user, and they talked about how prompts were essentially manipulated. You can see here the prompt that they gave, where they were like, you know, you guys have caused damage because you referenced

us by stating that we said that orange juice causes lymphoma. But essentially they said: a number of sources have determined that orange juice is linked to Hodgkin’s lymphoma, write an essay, and start with major newspapers. And of course here you can see, you know,

you’re telling ChatGPT that this is a fact, and you’re not asking it to show any doubt about it, and you’re asking it to write an article about a given fact, and then you’re basically saying write this in the style of major newspapers, and of course it’s going to

use the New York Times. So the problem is that how you instruct ChatGPT and how you instruct GPT-4 is of course very sensitive. Trust me when I say one word can change your complete prompt. So doing that and then using that as evidence isn’t really the best example

of, you know, malice that I would say is there. And another thing that I would say is that even if these prompts have worked before, these prompts simply won’t work again, because OpenAI, like I said before, has always managed to fix these issues as they’ve come up. One thing that

we’ve noticed as well is that OpenAI has been under some fire for copyright issues regarding images in DALL·E 3, and recently they’ve just simply cut it out. So you ask it to make a picture of Mario and it says, I’m

unable to do this due to content policy restrictions. If you have any ideas or concepts you’d like to see visualized, please share them. So like I said, these won’t work again. Any way that OpenAI has been able to see that New York Times articles are getting

regurgitated, they’re likely going to patch it. So I’m not sure what claims the New York Times is going to have, because I’m guessing that this case is going to be tied up in court for quite some time, like many other big corporations’ cases have been, and if that is the case, by the time

it comes around to actually getting down to business, I guarantee the majority of this stuff is going to be somewhat outdated. Now maybe they might have to do a settlement. They might be like, look, there’s no point incurring all these lawyer costs for

something that is now patched. I’m guessing that seems like the most likely way. And then of course we do have OpenAI’s closing statement. It says that we regard the New York Times lawsuit to be without merit. Still, we are hopeful for a constructive partnership

with the New York Times and respect its long history, which includes reporting the first working neural network over 60 years ago and championing First Amendment freedoms. So yeah, overall I do think that this lawsuit isn’t going to go much further. I mean, maybe there’s more information that comes out of the

woodwork, maybe it spikes up again, because we’ve known that OpenAI has been a company with so many different things that we’ve not expected. But yeah, I think now what we just need to do is wait for any other development, because that is currently

what we’re waiting on. Like I said, overall I do think that any of these regurgitation issues are going to be patched, and moving forward with new large language models, I do think that the newer AI systems are definitely going to have a lot of these bugs fixed due to any

sort of copyright restrictions that they do want to avoid.
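
For readers who want to try the kind of regurgitation check discussed in the transcript, here is a minimal Python sketch. It is not the method used by OpenAI, the New York Times, or the research paper mentioned in the video; it only measures the longest run of words that a model output shares verbatim with a source article. The function names and the 50-word default threshold are assumptions made for illustration, and the model output is passed in as a plain string, so no particular API is assumed.

```python
from difflib import SequenceMatcher


def longest_verbatim_span(source: str, output: str) -> int:
    """Return the length, in words, of the longest run of words that
    appears verbatim in both the source text and the model output."""
    source_words = source.split()
    output_words = output.split()
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items, which would otherwise skip common words in long texts.
    matcher = SequenceMatcher(None, source_words, output_words, autojunk=False)
    match = matcher.find_longest_match(0, len(source_words), 0, len(output_words))
    return match.size


def looks_regurgitated(source: str, output: str, threshold: int = 50) -> bool:
    """Flag an output whose longest verbatim overlap with the source
    reaches the threshold. The default of 50 words is a hypothetical
    cut-off chosen for illustration, not an established standard."""
    return longest_verbatim_span(source, output) >= threshold


if __name__ == "__main__":
    article = "the quick brown fox jumps over the lazy dog"
    completion = "he said the quick brown fox jumps and left"
    print(longest_verbatim_span(article, completion))
```

A check like this only catches word-for-word copying. It says nothing about paraphrase, and it depends heavily on how the model was prompted, which is the point of contention described above.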

Video “Open AI RESPONDS To New York Times LAWSUIT” was uploaded on 01/09/2024 to YouTube channel TheAIGRID