Deploy model GPT-OSS by using vLLM v0.10.0

Introduction

GPT-OSS is OpenAI's latest open-weight model series, designed for powerful reasoning, agentic tasks, and versatile developer use cases. The series includes two models, each with its own VRAM requirement:

  • openai/gpt-oss-20b: for lower latency and for local or specialized use cases
    • The smaller model
    • Requires only about 16GB of VRAM
  • openai/gpt-oss-120b: recommended for production, general-purpose, and high-reasoning use cases
    • The larger, full-sized model
    • Best with ≥60GB of VRAM
    • Fits on a single H100 or on a multi-GPU setup
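
The VRAM figures above can be turned into a quick selection rule. The following is a minimal sketch, not part of the platform: `pick_model` is a hypothetical helper name, and the thresholds are simply the 16GB and 60GB figures from the list.

```shell
# Hypothetical helper: suggest a GPT-OSS model ID for a given amount of VRAM (in GB).
# Thresholds are taken from the list above: about 16 GB for the 20b model,
# 60 GB or more for the 120b model.
pick_model() {
  vram_gb="$1"
  if [ "$vram_gb" -ge 60 ]; then
    echo "openai/gpt-oss-120b"
  elif [ "$vram_gb" -ge 16 ]; then
    echo "openai/gpt-oss-20b"
  else
    # Below the smaller model's stated requirement: report and fail.
    echo "insufficient VRAM for either model" >&2
    return 1
  fi
}
```

For example, `pick_model 80` prints `openai/gpt-oss-120b`, and `pick_model 24` prints `openai/gpt-oss-20b`.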

Step 1: Deploy a container with the vLLM v0.10.1 template

  1. Click the Create a new container button.
  2. In the template selection, choose the latest vLLM template (v0.10.1).
  3. For GPU selection, only 1xH100 GPU is required to serve the model.
  4. To serve openai/gpt-oss-20b, keep all other settings at their defaults. To serve openai/gpt-oss-120b, change the model in the command section.
  5. Click Create Container to create your container.
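
The exact contents of the template's command section are not shown on this page. As a rough sketch, assuming the command launches the server with vLLM's `vllm serve <model>` CLI, switching models amounts to swapping the model ID. `serve_command` is a hypothetical helper that only prints the command; it does not start a server.

```shell
# Hypothetical helper: print the serve command for one of the two model IDs.
# Assumption: the template's command field starts the server with
# `vllm serve <model>`; check the actual field contents in the portal.
serve_command() {
  model="${1:-openai/gpt-oss-20b}"   # default mirrors the template default described above
  case "$model" in
    openai/gpt-oss-20b|openai/gpt-oss-120b)
      printf 'vllm serve %s\n' "$model"
      ;;
    *)
      echo "unknown model: $model" >&2
      return 1
      ;;
  esac
}
```

Calling `serve_command openai/gpt-oss-120b` prints the command line to compare against what the command section contains.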

Wait for your container to initialize. Downloading the gpt-oss-20b model usually takes around 15 minutes; the gpt-oss-120b model can take up to 2 hours. You can monitor the progress in the Container Logs.

The logs may pause at a line similar to this one (red box in the screenshot):

Using model weights format [*.safetensors]

This means the model is still downloading or initializing, and the endpoint is not yet ready to receive requests.

The model is fully loaded and ready to serve only when all checkpoint shards are reported as complete, like this (yellow box in the screenshot):

Loading safetensors checkpoint shards: 100% Completed [3/3]

This indicates that all model files have been successfully loaded.
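
If you save the container logs to a file, the two markers described above can be checked automatically. The sketch below is illustrative only: `model_load_state` is a hypothetical helper, the marker strings are the log lines quoted above, and the extra "starting" state (neither marker seen yet) is an assumption added for completeness.

```shell
# Hypothetical helper: read container log text on stdin and report the load state.
# "ready"    -> the 100% checkpoint-shards line has appeared
# "loading"  -> only the "Using model weights format" line has appeared
# "starting" -> neither marker seen yet (an assumed extra state)
model_load_state() {
  logs="$(cat)"
  if printf '%s\n' "$logs" | grep -Eq 'Loading safetensors checkpoint shards: +100%'; then
    echo "ready"
  elif printf '%s\n' "$logs" | grep -Fq 'Using model weights format'; then
    echo "loading"
  else
    echo "starting"
  fi
}
```

Usage, assuming the logs were copied into a local file named `container.log`: `model_load_state < container.log`.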

Step 2: Send a run request

After your container is running and the model is downloaded, you can send a run request to test the setup.
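
Because the download can take a long time, it may help to poll the endpoint instead of checking by hand. The following is a minimal sketch under stated assumptions: `wait_until_ready` is a hypothetical helper, the attempt count and delay defaults are arbitrary, and the endpoint is assumed to need no authentication header.

```shell
# Hypothetical helper: poll <endpoint>/v1/models until it answers with a successful HTTP status.
# Arguments: endpoint base URL, max attempts (default 60), seconds between attempts (default 30).
wait_until_ready() {
  endpoint="$1"
  attempts="${2:-60}"
  delay="${3:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors; -s: silent; --max-time: per-request timeout in seconds
    if curl -fs --max-time 10 -o /dev/null \
         -H 'accept: application/json' "$endpoint/v1/models"; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

Usage: `wait_until_ready '{your endpoint}'`, replacing the placeholder with your container's endpoint URL.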

  1. Check the available model list:
curl -X 'GET' \
  '{your endpoint}/v1/models' \
  -H 'accept: application/json'
  2. Test the model by asking a simple question. Set "model" to the ID your container serves (openai/gpt-oss-120b in this example):
curl -X 'POST' \
  '{your endpoint}/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {
        "content": "Tell me what is GPT-OSS?",
        "role": "user",
        "name": "admin"
      }
    ],
    "model": "openai/gpt-oss-120b"
  }'
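
To try several questions without editing the JSON by hand, the request can be wrapped in small shell functions. This is a sketch rather than an official client: `chat_payload` and `ask_model` are hypothetical names, the optional "name" field is omitted, and the escaping covers only backslashes and double quotes in the question.

```shell
# Hypothetical helper: build the JSON body for /v1/chat/completions.
# Escapes backslashes and double quotes only; other special characters
# (for example newlines) in the question are not handled.
chat_payload() {
  model="$1"
  question="$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')"
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$model" "$question"
}

# Hypothetical helper: POST a question to the endpoint and print the raw JSON response.
# Arguments: endpoint base URL, model ID, question text.
ask_model() {
  endpoint="$1"
  curl -sS -X POST "$endpoint/v1/chat/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload "$2" "$3")"
}
```

Usage: `ask_model '{your endpoint}' 'openai/gpt-oss-120b' 'Tell me what is GPT-OSS?'`, with the placeholder replaced by your endpoint and the model ID matching one returned by the model list.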

