
Deploy Tiny-Llama on AWS EC2 | by Marcello Politi | Jan, 2024

[Image: Tiny-Llama logo]

Learn how to deploy a real ML application using AWS and FastAPI


I have always thought that even the best project in the world has little value if people cannot use it. That is why it is so important to learn how to deploy Machine Learning models. In this article we focus on deploying a small language model, Tiny-Llama, on an AWS EC2 instance.

List of tools I’ve used for this project:

  • Deepnote: a cloud-based notebook that’s great for collaborative data science projects and good for prototyping
  • FastAPI: a web framework for building APIs with Python
  • AWS EC2: a web service that provides resizable compute capacity in the cloud
  • Nginx: an HTTP and reverse proxy server. I use it to forward incoming requests on the EC2 instance to the FastAPI server
  • GitHub: a hosting service for software projects
  • HuggingFace: a platform to host and collaborate on models, datasets, and applications

About Tiny Llama

TinyLlama-1.1B is a project aiming to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. It uses the same architecture as Llama 2.

Today’s large language models have impressive capabilities but are extremely expensive in terms of hardware. In many settings the available hardware is limited: think of smartphones or satellites. So there is a lot of research on creating smaller models that can be deployed on edge devices.
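To put the hardware cost in numbers, here is a rough back-of-the-envelope sketch of how much memory the weights alone require at different precisions. The helper name `weight_memory_gb` and the 7B comparison model are assumptions chosen for illustration, and the estimate ignores activations, the KV cache and framework overhead, so real usage is higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical comparison: TinyLlama-1.1B versus a 7B-parameter model.
models = [("TinyLlama-1.1B", 1.1e9), ("7B model", 7e9)]

for name, params in models:
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gb(params, dtype)
        print(f"{name:>15} {dtype:>8}: {size:5.1f} GB")
```

In float16 the weights of a 1.1B-parameter model take about 2.2 GB, against about 14 GB for a 7B-parameter model, which is why small models are attractive for constrained hardware.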

Here is a list of “small” models that are catching on:

  • Mobile VLM (Multimodal)
  • Phi-2
  • Obsidian (Multimodal)
