How to run Qwen 3.5 locally on a single GPU

Qwen 3.5 is Alibaba's latest Qwen model line, building on the strong performance of earlier Qwen models in reasoning, coding, and multimodal tasks.

Independent benchmark tests show that the Qwen3.5-397B-A17B model scores highly on widely used benchmarks such as LiveCodeBench and AIME26, outperforming leading models like GPT-5.2 and Claude Opus 4.5 in most evaluated categories and delivering significantly higher throughput than previous Qwen generations.

Hardware and software requirements for Qwen 3.5

Before running Qwen 3.5 locally, you need to ensure your setup meets both hardware and software requirements for smooth inference. This guide will use an NVIDIA H200 GPU with 141GB of VRAM, combined with 240GB of system RAM, providing sufficient memory to efficiently run the MXFP4_MOE version of Qwen 3.5 with MoE offloading enabled.

In general, for best performance, your combined VRAM + RAM should be at least as large as the quantized model you are downloading. Otherwise, llama.cpp can still run by offloading to SSD, but inference will be slower.
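The sizing rule is simple arithmetic, and a minimal sketch of it follows. The function name and the 200 GB model size are illustrative assumptions, not figures from the model page; substitute the real on-disk size of the quantized files you plan to download.

```python
def fits_in_memory(model_size_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """Return True when VRAM + RAM together can hold the quantized model."""
    return vram_gb + ram_gb >= model_size_gb


# Hardware used in this guide: one H200 (141 GB VRAM) plus 240 GB of system RAM.
VRAM_GB = 141
RAM_GB = 240

# Placeholder size (assumption): replace with the total size of your GGUF files.
example_model_size_gb = 200.0

print(f"Combined memory: {VRAM_GB + RAM_GB} GB")
print("Fits in VRAM + RAM:", fits_in_memory(example_model_size_gb, VRAM_GB, RAM_GB))
```

If the check fails, expect llama.cpp to lean on SSD reads and run noticeably slower.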

Regarding software, you need to install the latest NVIDIA GPU driver, along with a recent version of the CUDA Toolkit, to ensure full compatibility with llama.cpp and CUDA-accelerated inference.

How to run Qwen 3.5 locally

Now that you've met the prerequisites, let's look at a step-by-step guide on how to use Qwen 3.5 locally:

1. Set up the local environment.

To run Qwen 3.5 locally, you need access to a computer with a powerful GPU. Since most laptops and desktops don't have enough VRAM or memory to handle models of this size, we'll be using a cloud-based GPU virtual machine.

This guide uses Hyperbolic to run the model privately. You can also use other providers such as RunPod, Vast.ai, or any GPU virtual machine platform you prefer. Hyperbolic was chosen because it currently offers some of the most cost-effective GPU instances available.

Start by launching a new instance with a single H200 GPU.

After the machine boots up, you will see its public IP address and the SSH command needed to connect from your local terminal.

Before connecting, make sure SSH is set up on your local machine and that you added your public SSH key when creating the virtual machine.

Once the instance is ready, connect to it via SSH with port forwarding. This is important because we want to access the local llama.cpp inference server through port 8080:

ssh -L 8080:localhost:8080 root@129.212.191.53

The first time you connect, type yes to confirm, then authenticate with your SSH key.

After logging in, verify that the GPU is correctly recognized:

nvidia-smi 

You will see NVIDIA H200 listed in the results.
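If you would rather check the GPU from a script, the sketch below calls nvidia-smi with its CSV query options and parses each row. The helper names are hypothetical, and the parser assumes rows of the form "name, memory MiB"; on a machine without nvidia-smi it simply reports no GPUs.

```python
import shutil
import subprocess


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Parse one 'name, <number> MiB' CSV row into (name, memory in MiB)."""
    name, memory = line.rsplit(",", 1)
    return name.strip(), int(memory.split()[0])


def list_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not installed on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]


gpus = list_gpus()
if not gpus:
    print("No NVIDIA GPU detected.")
for name, mem_mib in gpus:
    print(f"{name}: {mem_mib / 1024:.0f} GiB")
```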

Finally, install the necessary Linux packages to download, compile, and run llama.cpp:

sudo apt update
sudo apt install pciutils build-essential cmake curl libcurl4-openssl-dev -y

Once complete, your environment is ready to install llama.cpp and run Qwen 3.5 locally.

2. Install llama.cpp with CUDA support.

llama.cpp is an open-source C/C++ inference engine that lets you run large language models locally with minimal setup, on CPU, GPU, or both.

First, clone the llama.cpp repository:

git clone https://github.com/ggml-org/llama.cpp

Next, we configure the build to support CUDA with CMake. We enable CUDA using -DGGML_CUDA=ON and set the CUDA architecture to 90a since we're using NVIDIA H200 (Hopper class). This helps the build generate GPU code optimized for Hopper features.

cmake llama.cpp -B llama.cpp/build \
    -DGGML_CUDA=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES="90a"

Now compile the server binary. llama-server is llama.cpp's built-in HTTP server, which exposes the model as an API endpoint:

cmake --build llama.cpp/build --config Release -j --clean-first --target llama-server

Finally, copy the compiled binaries into the llama.cpp directory for easy execution:

cp llama.cpp/build/bin/llama-* llama.cpp

3. Download the Qwen 3.5 model.

Now that you've installed llama.cpp, the next step is to download the Qwen 3.5 model weights in GGUF format. These files are very large, so the Hugging Face CLI is the most reliable way to download them directly onto your GPU machine.

Python needs to be installed first because Hugging Face's download tools are distributed as Python packages. Although llama.cpp itself is written in C++, Python makes managing model downloads much easier.

Start by installing pip:

sudo apt install python3-pip

Next, install the Hugging Face Hub client along with performance-enhancing tools. hf_transfer and hf-xet significantly speed up downloads, which is crucial when downloading hundreds of gigabytes of model files.

pip -q install -U huggingface_hub hf-xet
pip -q install -U hf_transfer

Now, let's download the Qwen 3.5 model from Hugging Face. In this tutorial, we'll only download the MXFP4_MOE variant, which is optimized for efficient MoE inference:

hf download unsloth/Qwen3.5-397B-A17B-GGUF \
    --local-dir models/Qwen3.5 \
    --include "*MXFP4_MOE*"

Once the download is complete, the model files will be stored in models/Qwen3.5, ready to be loaded into llama.cpp for local inference.
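Because the model is split across several GGUF files, an interrupted download can leave a shard missing. The sketch below is a hypothetical helper, not part of the Hugging Face tooling; it assumes the "-00001-of-00006.gguf" naming seen in the launch command later in this guide and reports which shard numbers are absent.

```python
import re
from pathlib import Path

# Matches the trailing "-00001-of-00006.gguf" part of a split GGUF file name.
SHARD_PATTERN = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def missing_shards(filenames: list[str]) -> list[int]:
    """Return the shard numbers that are absent from a split GGUF download."""
    found: set[int] = set()
    total = 0
    for name in filenames:
        match = SHARD_PATTERN.search(name)
        if match:
            found.add(int(match.group(1)))
            total = max(total, int(match.group(2)))
    return [n for n in range(1, total + 1) if n not in found]


model_dir = Path("models/Qwen3.5")
names = [p.name for p in model_dir.rglob("*.gguf")] if model_dir.exists() else []
print("GGUF files found:", len(names))
print("Missing shards:", missing_shards(names))
```

An empty "Missing shards" list only means something when at least one GGUF file was found, so check the file count first.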

4. Launch the Qwen 3.5 model on a single GPU.

Now we can launch Qwen 3.5 using llama-server. This gives us an OpenAI-compatible API endpoint that can be called from local tools and applications.

We optimize the server for a single-GPU setup in three ways. First, we enable the --fit option so that llama.cpp automatically splits the model between GPU VRAM and system RAM instead of failing when the model doesn't fit entirely in VRAM.

Second, we set a larger context window with --ctx-size 16384 so the server can handle longer prompts. Third, we enable --jinja and pass --chat-template-kwargs to control chat formatting and disable thinking mode for faster, more direct responses.

Run the server with the command:

./llama.cpp/llama-server \
    --model models/Qwen3.5/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf \
    --alias "Qwen3.5" \
    --host 0.0.0.0 \
    --port 8080 \
    --fit on \
    --jinja \
    --ctx-size 16384 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.00 \
    --chat-template-kwargs '{"enable_thinking": false}'
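The value passed to --chat-template-kwargs is JSON, so its inner double quotes must survive the shell. One way to avoid quoting mistakes is to assemble the command in a script, as in this minimal sketch: the function name is hypothetical, the flags and values are the ones used in this guide, and json.dumps plus shlex.join handle the encoding and quoting.

```python
import json
import shlex


def build_server_command(
    model_path: str,
    alias: str = "Qwen3.5",
    port: int = 8080,
    ctx_size: int = 16384,
    enable_thinking: bool = False,
) -> list[str]:
    """Assemble the llama-server argument list used in this guide."""
    template_kwargs = json.dumps({"enable_thinking": enable_thinking})
    return [
        "./llama.cpp/llama-server",
        "--model", model_path,
        "--alias", alias,
        "--host", "0.0.0.0",
        "--port", str(port),
        "--fit", "on",
        "--jinja",
        "--ctx-size", str(ctx_size),
        "--temp", "0.7",
        "--top-p", "0.8",
        "--top-k", "20",
        "--min-p", "0.00",
        "--chat-template-kwargs", template_kwargs,
    ]


command = build_server_command(
    "models/Qwen3.5/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf"
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The same list can be handed directly to subprocess.run, which skips the shell and its quoting rules entirely.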

While the model is loading, you'll see it using both GPU VRAM and system memory, which is normal for a large MoE model.

Once the loading process is complete, the server will be accessible at:

  • 0.0.0.0:8080 on the virtual machine
  • http://127.0.0.1:8080 on your local machine after SSH port forwarding

Leave the server running. The tunnel from your first SSH session stays active as long as that session is open; if it has dropped, open a new terminal window on your local computer and reconnect with SSH port forwarding:

ssh -L 8080:localhost:8080 root@129.212.191.53

Next, test the server by listing the available models:

curl -s http://127.0.0.1:8080/v1/models

If you see Qwen3.5 in the response, your server is running correctly and you are ready to call it from the OpenAI SDK and your local applications.
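The same check can be scripted with only the Python standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the OpenAI-style response shape in which models are listed under a "data" array with an "id" field.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [entry.get("id", "") for entry in payload.get("data", [])]


def fetch_models(
    base_url: str = "http://127.0.0.1:8080/v1", timeout: float = 3.0
) -> Optional[dict]:
    """Return the parsed /models response, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


payload = fetch_models()
if payload is None:
    print("Server not reachable; check llama-server and the SSH tunnel.")
else:
    print("Alias present:", "Qwen3.5" in model_ids(payload))
```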

5. Test the Qwen 3.5 model using the OpenAI SDK

Now that the Qwen 3.5 inference server is running, the next step is to verify that it works with real client applications. One of the biggest advantages of llama.cpp is that llama-server exposes an OpenAI-compatible API, meaning you can use the official OpenAI SDK without changing your code structure.

First, install the OpenAI Python package on your local computer (or on the virtual machine if you prefer):

pip install openai 

Now, let's run a simple test script. This script connects to your local forwarded endpoint at http://127.0.0.1:8080/v1 instead of OpenAI's cloud server.

python3 - <<'PY'
from openai import OpenAI

# Point the SDK at the forwarded llama-server endpoint instead of OpenAI's cloud.
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="sk-no-key-required"
)

response = client.chat.completions.create(
    model="Qwen3.5",
    messages=[
        {"role": "user", "content": "Write one sentence about AI agents."}
    ]
)

print(response.choices[0].message.content)
PY

There are a few important details to understand here:

  • base_url points to your local Qwen 3.5 server, not the OpenAI API.
  • The SDK still requires an API key, but llama-server doesn't require authentication by default, so any placeholder value will work.
  • model="Qwen3.5" matches the alias set when the server was started.

If everything is configured correctly, you will receive a quick, clear response from the model.

This confirms that:

  • The Qwen 3.5 model has been successfully loaded.
  • The llama.cpp server is running correctly.
  • Your SSH port forwarding is working.
  • The endpoint is fully compatible with OpenAI-style applications.

At this point, you can integrate Qwen 3.5 into any local tool, agent workflow, or application that already supports the OpenAI API format.

6. Build a stock screener text-based user interface (TUI) using llama.cpp's WebUI.

llama.cpp includes a built-in WebUI, similar to ChatGPT, which you can use to chat directly with the model in your browser. This is useful for quick testing, iteration, and code generation without having to write any client scripts first.

Because SSH port forwarding has been set up, you can open the WebUI on your local computer and it will function as if the server were running on your laptop.

By default, the WebUI is available at:

http://127.0.0.1:8080

If this page loads, it confirms two things: your SSH tunnel is working correctly, and the Qwen 3.5 server is locally accessible while still running privately on a GPU virtual machine.
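If the page does not load, it helps to check the tunnel separately from the server. The sketch below only tests whether something accepts TCP connections on the forwarded port; the function name is hypothetical, and a True result does not by itself prove that llama-server is the process answering.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Port 8080 reachable:", port_is_open("127.0.0.1", 8080))
```

If this prints False, re-establish the SSH session with port forwarding before debugging the server itself.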

Once the WebUI is open, paste the following prompt. The goal is for the model to generate both Python code and a brief user guide.

Build a simple Python text-based user interface (TUI) application called "Stock Screener Trainer" that runs with `python app.py` and uses the rich library (not a web interface). The application lets me enter a list of stock tickers, choose a mode (growth/value/dividend) and a risk level (low/medium/high), fetches public fundamental metrics for each ticker from a free source, shows live loading status, then produces a nicely formatted table and a "Top 5 by my scoring rules" section with a clear "for educational purposes only, not financial advice" disclaimer, and saves all results to `results.csv`.

Within seconds, Qwen 3.5 will generate an `app.py` file, usually along with a brief explanation of how to run it.

Now switch to the terminal on your local machine (laptop) and install the libraries the generated application needs:

pip install rich yfinance

This operation installs:

  • rich for text-based user interface (TUI) layouts, tables, prompts, and progress indicators.
  • yfinance for fetching free, publicly available stock metrics.

Create a file named app.py, paste the code generated by the model, and run it:

python3 app.py

After running the script, you will see the TUI launch in your terminal. The application will prompt you to enter the stock ticker symbols you want to analyze, along with your preferred screening mode and risk level.

For example, the author tested it with three popular stocks.

After a short loading period, the tool returns a complete table of stock metrics, highlights the top results according to the scoring rules, and saves everything to the results.csv file.
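The generated code will differ from run to run, but the ranking and export step usually reduces to something like the sketch below. Everything here is an assumption for illustration: the tickers and numbers are made-up placeholders, and the scoring rule is a hypothetical "value" heuristic, not the one Qwen 3.5 produced for the author.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder rows with made-up numbers, used only to exercise the logic.
rows = [
    {"ticker": "AAA", "pe_ratio": 12.0, "dividend_yield": 0.03},
    {"ticker": "BBB", "pe_ratio": 30.0, "dividend_yield": 0.00},
    {"ticker": "CCC", "pe_ratio": 18.0, "dividend_yield": 0.05},
]


def score(row: dict) -> float:
    """Hypothetical 'value' rule: reward dividend yield, penalize a high P/E."""
    return row["dividend_yield"] * 100 - row["pe_ratio"] / 10


def rank(rows: list[dict], top_n: int = 5) -> list[dict]:
    """Attach a rounded score to each row and return the best top_n rows."""
    scored = [{**row, "score": round(score(row), 2)} for row in rows]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_n]


def save_csv(rows: list[dict], path: Path) -> None:
    """Write the rows to a CSV file with a header; do nothing if rows is empty."""
    if not rows:
        return
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


top = rank(rows)
out_path = Path(tempfile.gettempdir()) / "results.csv"
save_csv(top, out_path)
print("Top tickers:", [r["ticker"] for r in top])
print("Educational use only, not financial advice.")
```

The sketch writes to the system temp directory to avoid overwriting a real results.csv; the generated app saves it next to app.py instead.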

This is a good example of how Qwen 3.5 can create a fully functional application in one shot, using only a 4-bit quantized model endpoint and a simple prompt.
