Skip to content Skip to footer

Advice on using LLMs wisely. Ten of my LinkedIn posts on LLMs | by Lak Lakshmanan | Jan, 2024

Ten of my LinkedIn posts on LLMs

1. Non-determinism in LLMs

The best LLM use cases are where you use LLM as a tool rather than expose it directly. As Richard Seroter says, how many chatbots do you need?

However, this use case of replacing static product pages by personalized product summaries is like many other LLM use cases in that it faces unique risks due to non-determinism. Imagine that a customer sues you a year from now, saying that they bought the product because your product summary claimed (wrongly) that the product was flameproof and their house burned down. The only way to protect yourself would be to have a record of every generated summary and the storage costs will quickly add up …

One way to avoid this problem (and what I suggest) is to generate a set of templates using LLMs and use an ML model to choose which template to serve. This also has the benefit of allowing human oversight of your generated text, so you are not at the mercy of prompt engineering. (This is, of course, just a way to use LLMs to efficiently create different websites for different customer segments — the more things change, the more they rhyme with existing ideas).

Many use cases of LLMs are like this: you’ll have to reduce the non-deterministic behavior and associated risk through careful architecture.

2. Copyright issues with LLMs

The New York Times is suing OpenAI and Microsoft over their use of the Times’ articles. This goes well beyond previous lawsuits, claiming that:

1. OpenAI used millions of articles, and weighted them higher thus implicitly acknowledging the importance of the Times’ content.

2. Wirecutter reviews reproduced verbatim, but with the affiliate links stripped out. This creates a competitive product.

3. GenAI mimics the Times’ expressive style leading to trademark dilution.

4. Value of the tech is trillions of dollars for Microsoft and billions of dollars for OpenAI based on the increase in their market caps.

5. Producing close summaries is not transformative given that the original work was created at considerable expense.

The lawsuit also goes after the corporate structure of Open AI, the nature of the close collaborations with Open AI that Microsoft relied on to build Azure’s computing platform and selection of datasets.

The whole filing is 69 pages, very readable, and has lots of examples. I strongly recommend reading the full PDF that’s linked from the article.

I am not a lawyer, so I’m not going to weigh in on the merits of the lawsuit. But if the NYTimes wins, I’d expect that:

1. The cost of LLM APIs will go up as LLM providers will have to pay their sources. This lawsuit hits on training and quality of the base service not just when NYTimes articles are reproduced during inference. So, costs will go up across the board.

2. Open source LLMs will not be able to use Common Crawl (where the NYTimes is the 4th most common source). Their dataset quality will degrade, and it will be harder for them to match the commercial offerings.

3. This protects business models associated with producing unique and high quality content.

4. SEO will further privilege being the top 1 or 2 highest authority on a topic. It will be hard for others to get organic traffic. Expect customer acquisition costs through ads to go up.

3. Don’t use a LLM directly; Use a bot creation framework

A mishap at a Chevy dealership

demonstrates why you should never implement the chatbot on your website directly on top of an LLM API or with a custom GPT — you will struggle to tame the beast. There will also be all kinds of adversarial attacks that you will spend a lot of programmer dollars guarding against.

What should you do? Use a higher level bot-creation framework such as Google Dialogflow or Amazon Lex. Both these have a language model built in, and will respond to only a limited number of intents. Thus saving you from an expensive lesson.

4. Gemini demonstrates Google’s confidence in their research team

What a lot of people seem to be missing is the ice-cold confidence Google leadership had in their research team.

Put yourself in the shoes of Google executives a year ago. You’ve lost first-mover advantage to startups that have gone to market with tech you deemed too risky. And you need to respond.

Would you bet on your research team being able to build a *single* model that would outperform OpenAI, Midjourney, etc? Or would you spread your bets and build multiple models? [Gemini is a single model that has beat the best text model on text, the best image model on images, the best video model on video, and the best speech model on speech.]

Now, imagine that you have two world class labs: Google Brain and Deep Mind. Would you combine them and tell 1000 people to work on a single product? Or would you hedge the bet by having them work on two different approaches in the hope one is successful? [Google combined the two teams calling it Google Deep Mind under the leadership of Demis, the head of Deep Mind, and Jeff Dean, the head of Brain, became chief scientist.]

You have an internally developed custom machine learning chip (the TPU). Meanwhile, everyone else is building models on general purpose chips (GPUs). Do you double down on your internal chip, or hedge your bets? [Gemini was trained and is being served fromTPUs.]

On each of these decisions, Google chose to go all-in.

5. Who’s actually investing in Gen AI?

Omdia estimates of H100 shipments:

A good way to cut past marketing hype in tech is to look at who’s actually investing in new capacity. So, the Omdia estimates of H100 shipments is a good indicator of who’s winning in Gen AI.

Meta and Microsoft bought 150k H100s apiece in 2023 while Google, Amazon, and Oracle bought 50k units each. (Google internal usage and Anthropic are on TPUs, so their Gen AI spend is higher than the 50k would indicate.)

1. Apple is conspicuous by its absence.
2. Very curious what Meta is up to. Look for a big announcement there?
3. Oracle is neck-and-neck with AWS.

Chip speed improvements these days don’t come from packing more transistors on a chip (physics limitation). Instead, they come from optimizing for specific ML model types.

So, H100 gets 30x inference speedups over A100 (the previous generation) on transformer workloads by (1) dynamically switching between 8bit and 16bit representation for different layers of a transformer architecture (2) increasing the networking speed between GPUs allowing for model parallelism (necessary for LLMs), not just data parallelism (sufficient for image workloads). You wouldn’t spend $30,000 per chip unless your ML models had this specific set of specific need.

Similarly, the A100 got its improvement over the V100 by using a specially designed 10-bit precision floating point type that balances speed and accuracy on image and text embedding workloads.

So knowing what chips a company is buying lets you guess what AI workloads a company is investing in. (to a first approximation: the H100 also has hardware instructions for some genomics and optimization problems, so it’s not 100% clear-cut).

6. People like AI-generated content, until you tell them it is AI generated

Fascinating study from MIT:

1. If you have content, some AI-generated and some human-generated, people prefer the AI one! If you think AI-generated content is bland and mediocre, you (and I) are in the minority. This is similar to how the majority of people actually prefer the food in chain restaurants — bland works for more people.

2. If you label content as being AI-generated or human-generated, people prefer the human one. This is because they now score human-generated content higher while keeping scores for AI the same. There is some sort of virtue-signalling or species-favoritism going on.

Based on this, when artists ask for AI-generated art to be labeled or writers ask for AI-generated text to be clearly marked, is it just special pleading? Are artists and writers lobbying for preferred treatment?

Not LLM — but my first love in AI — methods in weather forecasting — are having their moment

Besides GraphCast, there are other global machine learning based weather forecasting models that are run in real time. Imme Ebert-Uphoff ‘s research group shows them side-by-side (with ECMWF and GFS numerical weather forecast as control) here:

Side-by-side verification in a setting such as the Storm Prediction Center Spring Experiment is essential before these forecasts get employed in decision making. Not sure what the equivalent would be for global forecasts, but such evaluation is needed. So happy to see that CIRA is providing the capability.

7. LLMs are plateau-ing

I was very unimpressed after OpenAI’s Dev day.

8. Economics of Gen AI software

There are two unique characteristics associated with Gen AI software —(1) the computational cost is high because it needs GPUs for training/inference (2) the data moat is low because smaller models finetuned on comparitively little data can equal the performance of larger models. Given this, the usual expectation that software has low marginal cost and provides huge economies of scale may no longer apply.

9. Help! My book is part of the training dataset of LLMs

Many of the LLMs on the market include a dataset called Books3 in their training corpus. The problem is that this corpus includes pirated copies of books. I used a tool created by the author of the Atlantic article

to check whether any of my books is in the corpus. And indeed, it seems one of the books is.

It was a humorous post, but captures the real dilemma since no one writes technical books (entire audience is a few thousands of copies) to make money.

10. A way to detect Hallucinated Facts in LLM-generated text

Because LLMs are autocomplete machines, they will pick the most likely next phrase given the preceding text. But what if there isn’t enough data on a topic? Then, the “most likely” next phrase is an average of many different articles in the general area, and so the resulting sentence is likely to be factually wrong. We say that the LLM has “hallucinated” a fact.

This update from Bard takes advantage of the relationship between frequency in the training dataset and hallucination to mark areas of the generated text that are likely to be factually incorrect.

Follow me on LinkedIn:

Source link