Open source and non-commercial license: Llama 2 is freely available, so researchers and programmers can use and modify it at no cost.
Llama 2 improves on the original Llama in most respects. These characteristics make it a powerful tool for many applications, such as chatbots, virtual assistants, and natural language understanding.
To start building an application, you must set up a development environment to isolate your project from existing projects on your machine.
First, create a virtual environment using Pipenv:
pipenv shell
Next, install the necessary libraries to build the chatbot.
pipenv install streamlit replicate
Streamlit: an open source framework for quickly building web apps for machine learning and data science projects.
Replicate: a cloud platform that hosts large open source machine learning models and exposes them through an API so you can use them in your projects.
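If you have not used Streamlit before, the minimal sketch below (a standalone script, not part of the chatbot; the file name hello_app.py is just an example) shows how a few lines of Python render as an interactive web page when you run streamlit run hello_app.py:

# hello_app.py: minimal standalone Streamlit example (illustration only).
import streamlit as st

st.title("Hello, Streamlit")
name = st.text_input("What is your name?")
if name:
    st.write(f"Nice to meet you, {name}!")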
To get a Replicate API token, first register a Replicate account using your GitHub account.
Once you are in the dashboard, click the Explore button and search for "Llama 2 chat" to find the llama-2-70b-chat model.
Click llama-2-70b-chat to view its Llama 2 API endpoint. Then click the API button in the model's navigation bar, and on the right side of the page click the Python button. This gives you the API token to use in Python applications.
Copy REPLICATE_API_TOKEN and store it safely for future use.
First, create a Python file named llama_chatbot.py and a .env file. You will write the code in llama_chatbot.py and keep the secret key and API token in the .env file.
In llama_chatbot.py, import the libraries as follows:
import streamlit as st
import os
import replicate
Next, set the global variables for the Replicate API token and the 7B, 13B, and 70B model endpoints.
# Global variables
REPLICATE_API_TOKEN = os.environ.get('REPLICATE_API_TOKEN', default='')

# Define the model endpoints as independent variables
LLaMA2_7B_ENDPOINT = os.environ.get('MODEL_ENDPOINT7B', default='')
LLaMA2_13B_ENDPOINT = os.environ.get('MODEL_ENDPOINT13B', default='')
LLaMA2_70B_ENDPOINT = os.environ.get('MODEL_ENDPOINT70B', default='')
In the .env file, add the Replicate token and the model endpoints in the following format:
REPLICATE_API_TOKEN='Paste_Your_Replicate_Token'
MODEL_ENDPOINT7B='a16z-infra/llama7b-v2-chat:4f0a4744c7295c024a1de15e1a63c880d3da035fa1f49bfd344fe076074c8eea'
MODEL_ENDPOINT13B='a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5'
MODEL_ENDPOINT70B='replicate/llama70b-v2-chat:e951f18578850b652510200860fc4ea62b3b16fac280f83ff32282f87bbd2e48'
Paste the Replicate token and save the .env file.
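Note that os.environ.get can only read these values if the .env file has been loaded into the environment. Pipenv loads .env automatically when you start the app with pipenv shell or pipenv run; if you run the script outside Pipenv, one option (an assumption here, requiring the extra python-dotenv package) is to load it explicitly:

# Optional: load the .env file manually when not running under Pipenv.
# Assumes python-dotenv is installed: pipenv install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into os.environ
print(bool(os.environ.get("REPLICATE_API_TOKEN")))  # True if the token was loaded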
Create a pre-prompt to steer the Llama 2 model toward the task you want it to perform. In this example, the model should act as an assistant.
# Set the pre-prompt
PRE_PROMPT = ("You are a helpful assistant. You do not respond as "
              "'User' or pretend to be 'User'. "
              "You only respond once as Assistant.")
Set up the page configuration for the chatbot as follows:
# Set the initial page configuration
st.set_page_config(
    page_title="LLaMA2Chat",
    page_icon=":volleyball:",
    layout="wide"
)
Write a function that initializes and sets the session state variables.
# Constants
LLaMA2_MODELS = {
    'LLaMA2-7B': LLaMA2_7B_ENDPOINT,
    'LLaMA2-13B': LLaMA2_13B_ENDPOINT,
    'LLaMA2-70B': LLaMA2_70B_ENDPOINT,
}

# Session state variables
DEFAULT_TEMPERATURE = 0.1
DEFAULT_TOP_P = 0.9
DEFAULT_MAX_SEQ_LEN = 512
DEFAULT_PRE_PROMPT = PRE_PROMPT

def setup_session_state():
    st.session_state.setdefault('chat_dialogue', [])
    selected_model = st.sidebar.selectbox(
        'Choose a LLaMA2 model:', list(LLaMA2_MODELS.keys()), key='model')
    st.session_state.setdefault(
        'llm', LLaMA2_MODELS.get(selected_model, LLaMA2_70B_ENDPOINT))
    st.session_state.setdefault('temperature', DEFAULT_TEMPERATURE)
    st.session_state.setdefault('top_p', DEFAULT_TOP_P)
    st.session_state.setdefault('max_seq_len', DEFAULT_MAX_SEQ_LEN)
    st.session_state.setdefault('pre_prompt', DEFAULT_PRE_PROMPT)
This function sets basic variables such as chat_dialogue, pre_prompt, llm, top_p, max_seq_len, and temperature in the session state. It also sets the Llama 2 model endpoint based on the model the user selects.
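Session state is what keeps these values alive, because Streamlit re-executes the entire script on every interaction. The short standalone sketch below (illustration only, not part of the chatbot) shows how st.session_state persists a value across reruns:

# Standalone illustration: st.session_state behaves like a dictionary
# that survives reruns, so the counter is not reset on every click.
import streamlit as st

st.session_state.setdefault("clicks", 0)
if st.button("Click me"):
    st.session_state["clicks"] += 1
st.write("Button clicked", st.session_state["clicks"], "times")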
Write a function that renders the sidebar content of the Streamlit app.
def render_sidebar():
    st.sidebar.header("LLaMA2 Chatbot")
    st.session_state['temperature'] = st.sidebar.slider(
        'Temperature:', min_value=0.01, max_value=5.0,
        value=DEFAULT_TEMPERATURE, step=0.01)
    st.session_state['top_p'] = st.sidebar.slider(
        'Top P:', min_value=0.01, max_value=1.0,
        value=DEFAULT_TOP_P, step=0.01)
    st.session_state['max_seq_len'] = st.sidebar.slider(
        'Max Sequence Length:', min_value=64, max_value=4096,
        value=DEFAULT_MAX_SEQ_LEN, step=8)
    new_prompt = st.sidebar.text_area(
        'Prompt before the chat starts. Edit here if desired:',
        DEFAULT_PRE_PROMPT, height=60)
    if new_prompt != DEFAULT_PRE_PROMPT and new_prompt != "" and new_prompt is not None:
        st.session_state['pre_prompt'] = new_prompt + "\n"
    else:
        st.session_state['pre_prompt'] = DEFAULT_PRE_PROMPT
This function displays the header of the Llama 2 chatbot and its adjustable settings: temperature, top p, maximum sequence length, and the pre-prompt.
Write a function that renders the chat history in the main content area of the Streamlit app.
def render_chat_history():
    response_container = st.container()
    for message in st.session_state.chat_dialogue:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])
This function iterates through the chat_dialogue saved in the session state, displaying each message with the corresponding role (user or assistant).
Handle user input using the function below.
def handle_user_input():
    user_input = st.chat_input(
        "Type your question here to talk to LLaMA2"
    )
    if user_input:
        st.session_state.chat_dialogue.append(
            {"role": "user", "content": user_input}
        )
        with st.chat_message("user"):
            st.markdown(user_input)
This function presents users with an input field where they can type their messages and questions. When the user submits a message, it is appended to chat_dialogue in the session state with the user role.
Write a function that generates responses from the Llama 2 model and displays them in the chat area.
def generate_assistant_response():
    message_placeholder = st.empty()
    full_response = ""
    string_dialogue = st.session_state['pre_prompt']
    for dict_message in st.session_state.chat_dialogue:
        speaker = "User" if dict_message["role"] == "user" else "Assistant"
        string_dialogue += f"{speaker}: {dict_message['content']}\n"
    output = debounce_replicate_run(
        st.session_state['llm'],
        string_dialogue + "Assistant: ",
        st.session_state['max_seq_len'],
        st.session_state['temperature'],
        st.session_state['top_p'],
        REPLICATE_API_TOKEN
    )
    for item in output:
        full_response += item
        message_placeholder.markdown(full_response + "▌")
    message_placeholder.markdown(full_response)
    st.session_state.chat_dialogue.append(
        {"role": "assistant", "content": full_response}
    )
This function builds a chat history string containing both user and assistant messages, then calls debounce_replicate_run to get the assistant's response. It updates the response in the UI as the output streams in, to deliver a real-time experience.
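The streaming effect comes from st.empty(), which reserves a single slot on the page that can be overwritten repeatedly. The standalone sketch below (using a hypothetical fake_token_stream generator in place of the Replicate call) illustrates the same pattern:

# Standalone illustration of the placeholder streaming pattern (not part of the chatbot).
import time
import streamlit as st

def fake_token_stream():
    # Hypothetical stand-in for the iterator returned by the model call.
    for word in ["Hello", " there!", " How", " can", " I", " help", " you?"]:
        time.sleep(0.2)
        yield word

placeholder = st.empty()
text = ""
for token in fake_token_stream():
    text += token
    placeholder.markdown(text + "▌")  # overwrite the same slot on each iteration
placeholder.markdown(text)            # final render without the cursor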
The following function is responsible for rendering the entire Streamlit app.
def render_app():
    setup_session_state()
    render_sidebar()
    render_chat_history()
    handle_user_input()
    generate_assistant_response()
It calls all the previously defined functions in a logical order: set up the session state, render the sidebar, render the chat history, handle user input, and generate the assistant's response.
Write a main function that calls render_app and starts the application when the script is run.
def main():
    render_app()

if __name__ == "__main__":
    main()
Now the application is ready to be deployed.
Create a utils.py file in the project folder and add the function below:
import replicate
import time

# Initialize debounce variables
last_call_time = 0
debounce_interval = 2  # Set the debounce interval (in seconds)

def debounce_replicate_run(llm, prompt, max_len, temperature, top_p, API_TOKEN):
    global last_call_time
    print("last call time: ", last_call_time)
    current_time = time.time()
    elapsed_time = current_time - last_call_time

    if elapsed_time < debounce_interval:
        print("Debouncing")
        return ("Hello! Your requests are too fast. Please wait a few"
                " seconds before sending another request.")

    last_call_time = time.time()
    output = replicate.run(
        llm,
        input={"prompt": prompt + "Assistant: ",
               "max_length": max_len,
               "temperature": temperature,
               "top_p": top_p,
               "repetition_penalty": 1},
        api_token=API_TOKEN
    )
    return output
This function implements a debouncing mechanism that prevents the app from sending API requests too frequently in response to user input.
Next, import the debounce function into llama_chatbot.py as follows:
from utils import debounce_replicate_run
Now run this application:
streamlit run llama_chatbot.py
Expected results:
The result shows a conversation between the model and a human.
Llama 2 already powers real-life applications such as chatbots, virtual assistants, and other natural language tools.
Above is how to build a chatbot using Streamlit and Llama 2. Good luck!