Deploy Tiny-Llama on AWS EC2 | by Marcello Politi

Tiny-Llama logo (src: https://github.com/jzhang38/TinyLlama)

Learn how to deploy a real ML application using AWS and FastAPI

Introduction

I have always thought that even the best project in the world has little value if people cannot use it. That is why learning how to deploy Machine Learning models matters. In this article we focus on deploying a small large language model, Tiny-Llama, on an AWS EC2 instance.

List of tools I’ve used for this project:

Deepnote: a cloud-based notebook that’s great for collaborative data science projects and good for prototyping
FastAPI: a web framework for building APIs with Python
AWS EC2: a web service that provides resizable compute capacity in the cloud
Nginx: an HTTP and reverse proxy server. I use it to forward incoming requests on the EC2 instance to the FastAPI server
GitHub: a hosting service for software projects
HuggingFace: a platform to host and collaborate on models, datasets, and applications

About Tiny Llama

TinyLlama-1.1B is a project aiming to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. It uses the same architecture as Llama 2.

Today’s large language models have impressive capabilities but are extremely expensive in terms of hardware. In many settings the hardware is limited: think smartphones or satellites. So there is a lot of research on creating smaller models that can be deployed on edge devices.
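To make the hardware cost a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It only estimates the memory needed to hold the model weights (parameter count times bytes per parameter) and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The helper name `weight_memory_gib` is hypothetical, and the 7B comparison uses the nominal parameter count of Llama 2 7B, which is my own addition for illustration.

```python
# Rough estimate of the memory needed just to store model weights.
# Assumption: weights only; activations, KV cache and runtime overhead are ignored.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Nominal parameter counts. The 7B entry is only an illustrative comparison point.
MODELS = {"TinyLlama-1.1B": 1.1e9, "Llama-2-7B (nominal)": 7e9}


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the approximate size of the weights in GiB."""
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    for name, num_params in MODELS.items():
        for dtype, nbytes in BYTES_PER_PARAM.items():
            size = weight_memory_gib(num_params, nbytes)
            print(f"{name:22s} {dtype:8s} ~{size:5.1f} GiB")
```

Under these assumptions, the half-precision weights of a 1.1B-parameter model take roughly 2 GiB, which is why a model of this size is a realistic candidate for a modest cloud instance.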

Here is a list of “small” models that are catching on: