Our data scientists are shipping more and more models, but deployment efficiency on the engineering side has become the bottleneck. Every new model arrives with its own ad-hoc Flask API, a hand-written Dockerfile full of "best practices" that nobody maintains, and a set of deployment scripts that can't even be found on the wiki. The result: models iterate quickly, but time-to-production is long, and every live service differs from the others in subtle yet fatal ways, in both tech stack and operations. That sprawl translates directly into unstable services and high maintenance costs.
What we need is a standardized "AI service unit". The unit must be self-contained and reproducible, and it must shield data scientists from the complexity of the underlying infrastructure. The goal: taking a model from "development done" to a pre-production deployment should be an automated flow of pushing a commit and letting the CI pipeline take over, not a cross-team coordination meeting.
The initial design revolves around a "golden image". By pre-building a virtual machine image that already contains every runtime dependency, the security baseline, and the monitoring agent, we eliminate environment drift at the root. The service itself consists of two parts: an API gateway that handles business logic, authentication, and rate limiting, and a high-performance model serving layer with support for distributed computation. Together they form our "AI service unit" template.
After a few rounds of evaluation, we settled on the stack:
- Base image build: Packer. The obvious choice. It defines VM image builds as code, supports multiple cloud platforms, and integrates cleanly with our CI/CD tooling (Jenkins, GitLab CI).
- Model serving layer: BentoML + Ray. BentoML provides solid model packaging and serving abstractions, and its "Runner" concept is very flexible. For workloads that need parallelism or complex inter-model dependencies, the BentoML/Ray integration gives us distributed computation out of the box, which is exactly what our heavy feature engineering and model ensembles require.
- API gateway/BFF: NestJS. Simply exposing BentoML's API endpoints is nowhere near enough for a real production environment. We need a robust framework for authentication, API version management, request validation, data transformation, and all the other non-ML yet critical business logic. NestJS is TypeScript-based, has a clean architecture (inspired by Angular) and a mature ecosystem, and is well suited to building this reliable "control plane".
The core of the combination: Packer standardizes the environment, BentoML on Ray handles the scalable AI computation, and NestJS acts as the stable, outward-facing API facade. Time to wire them together.
Phase 1: Building Immutable Infrastructure with Packer
Everything rests on that "golden image". The goal is an AMI (Amazon Machine Image) that ships with the CUDA driver, a specific Python environment, the Ray cluster dependencies, and a Node.js runtime. In a real project, this image build is triggered by the CI/CD pipeline.
The heart of it is the packer.pkr.hcl file. We deliberately avoid baking model code or business logic into the image; those are pulled from git or mounted as data volumes when an instance boots. The image's only job is to provide a stable, reliable, secure, and consistent runtime environment.
// file: base-ml-service.pkr.hcl
packer {
  required_plugins {
    amazon = {
      version = ">= 1.2.1"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

variable "aws_access_key" {
  type      = string
  sensitive = true
}

variable "aws_secret_key" {
  type      = string
  sensitive = true
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "instance_type" {
  type    = string
  default = "g4dn.xlarge" // A GPU instance for ML workloads
}

// HCL2 does not interpolate the legacy {{timestamp}} template;
// use the timestamp() function through a local instead.
locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "ubuntu-ml" {
  access_key    = var.aws_access_key
  secret_key    = var.aws_secret_key
  region        = var.aws_region
  instance_type = var.instance_type

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = ["099720109477"] # Canonical's owner ID
  }

  ssh_username = "ubuntu"
  ami_name     = "bento-ray-nestjs-base-${local.timestamp}"

  tags = {
    OS_Version = "Ubuntu 20.04"
    Base_For   = "Standard ML Service Unit"
  }
}

build {
  name    = "ml-service-unit-build"
  sources = ["source.amazon-ebs.ubuntu-ml"]

  provisioner "shell" {
    inline = [
      "echo 'Waiting for cloud-init to finish...'",
      "cloud-init status --wait",
      "sudo apt-get update -y",
      "sudo apt-get install -y wget software-properties-common"
    ]
  }

  // Provisioner 1: Install NVIDIA drivers for the GPU
  provisioner "shell" {
    script = "./scripts/install_nvidia_drivers.sh"
  }

  // Provisioner 2: Install Miniconda and create a Python environment.
  // A common mistake is to install Python packages globally;
  // a dedicated conda environment keeps dependency management clean.
  provisioner "shell" {
    script = "./scripts/setup_python_env.sh"
  }

  // Provisioner 3: Install Node.js and PM2
  provisioner "shell" {
    inline = [
      "curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -",
      "sudo apt-get install -y nodejs",
      "sudo npm install pm2 -g" // PM2 for process management
    ]
  }

  // Provisioner 4: System hardening and cleanup
  provisioner "shell" {
    inline = [
      "sudo apt-get autoremove -y",
      "sudo apt-get clean",
      "sudo rm -rf /tmp/*"
    ]
  }
}
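The build references two helper scripts. A minimal sketch of install_nvidia_drivers.sh is shown below; it assumes Ubuntu's ubuntu-drivers tooling is acceptable for driver selection (teams that need to pin an exact CUDA toolkit version would pull from NVIDIA's apt repository instead):
#!/bin/bash
# file: scripts/install_nvidia_drivers.sh (sketch)
set -e

# Let Ubuntu select a driver compatible with the attached GPU (g4dn -> NVIDIA T4).
sudo apt-get update -y
sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall

# The driver only loads after a reboot. Packer can handle this with an extra
# shell provisioner that sets expect_disconnect = true and reboots, or the
# final AMI can simply be validated by running `nvidia-smi` on first boot.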
The companion script setup_python_env.sh is critical: it creates an isolated Python environment and pre-installs the core dependencies:
#!/bin/bash
set -e # Exit immediately if a command exits with a non-zero status.

# Install Miniconda.
# The Packer shell provisioner runs as the unprivileged "ubuntu" user, so we
# create /opt/conda with the right ownership first instead of writing to /opt as root.
sudo mkdir -p /opt/conda
sudo chown "$(whoami)" /opt/conda
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -u -p /opt/conda
rm /tmp/miniconda.sh
export PATH="/opt/conda/bin:$PATH"

# Create a dedicated environment
conda create --name ml_env python=3.9 -y
source /opt/conda/bin/activate ml_env

# Install core ML libraries.
# Note: we pin specific versions to ensure reproducibility.
pip install "ray[default]==2.6.3"
pip install "bentoml==1.1.10"
pip install "torch==2.0.1" "torchvision==0.15.2" --index-url https://download.pytorch.org/whl/cu118

# A common trap is forgetting to make the conda env accessible to all users,
# or not setting up PATH correctly in service files.
# We make the activation available system-wide.
echo "source /opt/conda/bin/activate ml_env" | sudo tee /etc/profile.d/conda.sh
With this phase done, we have a clean, reproducibly built runtime environment. Any instance launched from this image carries every piece of base software the service unit needs.
Phase 2: The BentoML Service and Distributed Execution with Ray
Next we define a slightly more involved BentoML service. Say the scenario is an image-processing pipeline with two stages: 1) a feature-extraction model, and 2) a classifier that works on those features. Feature extraction is the expensive, parallelizable step, and that is exactly where Ray comes in.
We wrap the models in BentoML's Runner abstraction and designate one of the runners to run on Ray (the RayRunner path) so it can leverage Ray's distributed execution.
# file: service.py
import typing as t

import bentoml
import numpy as np
from PIL.Image import Image


# For demonstration, we use dummy models.
# In a real project, these would be loaded from a model store.
class FeatureExtractor:
    def __init__(self):
        # Heavy initialization would go here
        pass

    def extract(self, img_batch: t.List[Image]) -> np.ndarray:
        # Simulate a heavy, parallelizable task
        print(f"FeatureExtractor running on batch of size: {len(img_batch)}")
        return np.random.rand(len(img_batch), 512)


class Classifier:
    def __init__(self):
        pass

    def predict(self, features: np.ndarray) -> t.List[str]:
        # Simulate classification
        print(f"Classifier predicting for features of shape: {features.shape}")
        return ["cat" if feat[0] > 0.5 else "dog" for feat in features]


# 1. Define Runners for our models.
# This runner will be executed in a distributed fashion by Ray.
feature_extractor_runner = bentoml.Runner(
    name="feature_extractor",
    runnable_class=FeatureExtractor,
    method_name="extract",
    # Here is the key part: specifying Ray options for the runner.
    # This creates a pool of Ray actors for parallel execution.
    ray_options=bentoml.ray.RayOptions(
        num_replicas=4,  # Number of parallel Ray actors
        num_cpus=1,      # CPUs per actor
    ),
)

classifier_runner = bentoml.Runner(
    name="classifier",
    runnable_class=Classifier,
)

# 2. Define the BentoML Service
svc = bentoml.Service(
    name="image_classification_service",
    runners=[feature_extractor_runner, classifier_runner],
)


# 3. Define the service API endpoint
@svc.api(input=bentoml.io.Image(), output=bentoml.io.JSON())
async def classify(image: Image) -> t.Dict[str, str]:
    """
    The core logic of our service.
    It orchestrates calls to different runners.
    """
    # Convert the single image into a batch for runner processing
    image_batch = [image]

    # The call to `feature_extractor_runner.async_run` is scheduled
    # on the Ray actor pool by BentoML automatically.
    features = await feature_extractor_runner.async_run(image_batch)

    # The result is then passed to the next stage.
    results = await classifier_runner.predict.async_run(features)

    return {"prediction": results[0]}
The bentofile.yaml declares the service's metadata and Python dependencies; BentoML uses it to build a portable "Bento".
# file: bentofile.yaml
service: "service:svc"
labels:
  owner: ml-platform-team
  stage: development
include:
  - "*.py"
python:
  packages:
    # These dependencies are already pre-installed in the Packer image,
    # but declaring them here keeps the Bento self-contained so it can
    # be run elsewhere if needed. It's a best practice.
    - numpy
    - Pillow
    - torch
    - bentoml
    - "ray[default]"
Running bentoml build packages the code above, plus any models, into a versioned Bento. That Bento is the ML asset we deploy.
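Concretely, the packaging step in CI (or on a laptop) is only a couple of commands; the export target below is a placeholder for wherever built Bentos are archived:
cd bentoml-service
bentoml build        # reads bentofile.yaml and prints the new version tag
bentoml list         # inspect the local Bento store
bentoml export image_classification_service:latest ./image_classification_service.bento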
Phase 3: NestJS as a Solid Control Plane
Now for NestJS. It is the single entry point of the service unit and handles every external request. We create an MlProxyModule that receives requests, applies the necessary processing (authentication, logging), and forwards them to the BentoML service running locally.
The benefits of this split are obvious:
- Separation of concerns: NestJS handles the generic web-service logic, while BentoML focuses on high-performance model inference.
- Strong typing and maintainability: TypeScript plus NestJS's modular structure keeps a large codebase manageable.
- Ecosystem: we can easily plug in class-validator for complex request validation, Passport.js for authentication, and so on.
// file: src/ml-proxy/ml-proxy.controller.ts
import {
  Controller,
  Post,
  UploadedFile,
  UseInterceptors,
  HttpException,
  HttpStatus,
  Logger,
} from '@nestjs/common';
import { FileInterceptor } from '@nestjs/platform-express';
import { MlProxyService } from './ml-proxy.service';

@Controller('v1/classify')
export class MlProxyController {
  private readonly logger = new Logger(MlProxyController.name);

  constructor(private readonly mlProxyService: MlProxyService) {}

  @Post('image')
  @UseInterceptors(FileInterceptor('image')) // 'image' is the field name in multipart/form-data
  async classifyImage(@UploadedFile() file: Express.Multer.File) {
    if (!file) {
      throw new HttpException('Image file is required', HttpStatus.BAD_REQUEST);
    }

    // In a real project, you would add more validation here:
    // - File size limits
    // - MIME type checking (e.g., only allow image/jpeg, image/png)
    this.logger.log(`Received image for classification: ${file.originalname}`);

    try {
      // Delegate the actual call to the service layer
      return await this.mlProxyService.forwardToBento(file.buffer);
    } catch (error) {
      this.logger.error(`Error forwarding request to BentoML: ${error.message}`, error.stack);
      // Here, we must decide what error to expose to the client.
      // Avoid leaking internal implementation details.
      throw new HttpException(
        'Failed to process the image due to an internal error.',
        HttpStatus.INTERNAL_SERVER_ERROR,
      );
    }
  }
}
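The validation noted as a TODO in the handler can be expressed declaratively with the file pipes that ship with recent NestJS releases (9+). A sketch of the changed handler signature, with an arbitrarily chosen size limit:
// Hypothetical refinement of classifyImage(); only the decorator changes.
import { ParseFilePipe, MaxFileSizeValidator, FileTypeValidator } from '@nestjs/common';

async classifyImage(
  @UploadedFile(
    new ParseFilePipe({
      validators: [
        new MaxFileSizeValidator({ maxSize: 5 * 1024 * 1024 }), // 5 MB
        new FileTypeValidator({ fileType: /image\/(jpeg|png)/ }),
      ],
    }),
  )
  file: Express.Multer.File,
) { /* ... body unchanged ... */ }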
MlProxyService uses NestJS's built-in HttpModule (backed by axios) to talk to the BentoML service.
// file: src/ml-proxy/ml-proxy.service.ts
import { Injectable, Logger } from '@nestjs/common';
import { HttpService } from '@nestjs/axios';
import { firstValueFrom } from 'rxjs';
import { ConfigService } from '@nestjs/config';
import * as FormData from 'form-data';

@Injectable()
export class MlProxyService {
  private readonly logger = new Logger(MlProxyService.name);
  private readonly bentoServiceUrl: string;

  constructor(
    private readonly httpService: HttpService,
    private readonly configService: ConfigService, // For managing config like URLs
  ) {
    // A common error is hardcoding URLs. Use a config service.
    this.bentoServiceUrl = this.configService.get<string>(
      'BENTOML_SERVICE_URL',
      'http://127.0.0.1:3000', // Default for local dev
    );
  }

  async forwardToBento(imageBuffer: Buffer): Promise<any> {
    const endpoint = `${this.bentoServiceUrl}/classify`;
    const form = new FormData();
    form.append('image', imageBuffer, { filename: 'upload.jpg', contentType: 'image/jpeg' });

    this.logger.log(`Forwarding request to BentoML endpoint: ${endpoint}`);

    // The key here is to correctly handle multipart/form-data proxying.
    // We need to pass the headers from the form-data library to axios.
    try {
      const response = await firstValueFrom(
        this.httpService.post(endpoint, form, {
          headers: form.getHeaders(),
        }),
      );
      return response.data;
    } catch (error) {
      if (error.response) {
        this.logger.error(`BentoML service responded with error: ${error.response.status}`, error.response.data);
      } else {
        this.logger.error(`Failed to connect to BentoML service: ${error.message}`);
      }
      // Re-throw to be caught by the controller's error handler
      throw new Error('Upstream BentoML service failed');
    }
  }
}
Finally, we combine the controller, the service, HttpModule, and ConfigModule into the complete MlProxyModule.
// file: src/ml-proxy/ml-proxy.module.ts
import { Module } from '@nestjs/common';
import { HttpModule } from '@nestjs/axios';
import { ConfigModule } from '@nestjs/config';
import { MlProxyController } from './ml-proxy.controller';
import { MlProxyService } from './ml-proxy.service';

@Module({
  imports: [
    HttpModule,
    ConfigModule, // Make sure ConfigModule is imported globally in AppModule
  ],
  controllers: [MlProxyController],
  providers: [MlProxyService],
})
export class MlProxyModule {}
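For completeness, a minimal AppModule that satisfies the "imported globally" note above could look like this (the file path is the NestJS default; the env-file location is an assumption):
// file: src/app.module.ts (sketch)
import { Module } from '@nestjs/common';
import { ConfigModule } from '@nestjs/config';
import { MlProxyModule } from './ml-proxy/ml-proxy.module';

@Module({
  imports: [
    // isGlobal makes ConfigService injectable everywhere, including MlProxyService
    ConfigModule.forRoot({ isGlobal: true, envFilePath: '.env' }),
    MlProxyModule,
  ],
})
export class AppModule {}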
Phase 4: Assembly and Startup
We now have the Packer image, the BentoML service code, and the NestJS application. The last step is to assemble and start everything when a VM instance boots, typically through cloud-init or the instance's user-data script.
This bootstrap script is the glue that holds the whole automated flow together.
#!/bin/bash
set -ex

# --- Configuration ---
export APP_HOME="/srv/app"
export GIT_REPO_URL="<your-git-repo-for-the-service-code>"
export BENTO_TAG="image_classification_service:latest" # This would be dynamic in CI/CD

# Activate the conda environment pre-built by Packer
source /etc/profile.d/conda.sh

# --- 1. Fetch Application Code ---
# In a real setup, use a specific commit hash, not main.
git clone ${GIT_REPO_URL} ${APP_HOME}
cd ${APP_HOME}

# --- 2. Setup NestJS Application ---
cd nestjs-gateway
npm install
npm run build
cd ..

# --- 3. Setup BentoML Service ---
# The Bento has normally already been built by a CI pipeline and stored
# in a Bento store (e.g. on S3). Here we would download and import it.
# For simplicity, we assume the BentoML project lives in the repo.
cd bentoml-service
# bentoml pull ${BENTO_TAG} --bento-store s3://my-bento-store
# For this example, we build it on the fly, which is not ideal for production.
bentoml build
cd ..

# --- 4. Start Services with PM2 ---
# PM2 lets us manage and monitor these different processes.

# Start the Ray head node. `--block` keeps the process in the foreground so
# PM2 can supervise it (without it, `ray start` daemonizes and exits immediately).
pm2 start "ray start --head --port=6379 --block --disable-usage-stats" --name ray-head

# Wait for the Ray head to be ready. A robust solution would poll `ray status`.
sleep 5

# Start the BentoML server, pointing to the Ray head.
# The `BENTOML_CONFIG` environment variable is the proper way to configure BentoML.
export BENTOML_CONFIG=./config.production.yaml
pm2 start "bentoml serve ${BENTO_TAG} --production" --name bentoml-server

# Start the NestJS gateway.
# BENTOML_SERVICE_URL is read by our NestJS ConfigService.
export BENTOML_SERVICE_URL="http://127.0.0.1:3000"
export PORT=8080 # Port for the NestJS app
pm2 start "node dist/main.js" --name nestjs-gateway --cwd ./nestjs-gateway

# Save the PM2 process list so it can be resurrected on reboot
# (this assumes `pm2 startup` has been configured, e.g. baked into the AMI).
pm2 save
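Once PM2 reports all three processes as online, a quick end-to-end check from the instance exercises the whole chain (gateway to BentoML to Ray); test.jpg is any local sample image:
pm2 status
curl -F "image=@test.jpg" http://127.0.0.1:8080/v1/classify/image
# -> {"prediction":"cat"} or {"prediction":"dog"}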
The config.production.yaml file referenced in the script configures BentoML, in particular its connection to Ray.
# file: config.production.yaml
runners:
  feature_extractor:
    ray_options:
      address: "ray://127.0.0.1:10001" # Connect to the Ray cluster via the Ray client
The Mermaid diagram below shows the request flow inside the service unit we have just built:
graph TD
    subgraph "Client"
        C[User/Client Application]
    end

    subgraph "EC2 Instance (Packer AMI)"
        subgraph "NestJS Gateway (Port 8080)"
            A[Controller] --> |Input Validation, Auth| B(Service)
            B --> |HTTP POST /classify| D{BentoML API Server}
        end

        subgraph "BentoML Service (Port 3000)"
            D --> |async_run| E[feature_extractor_runner]
            D --> |async_run| F[classifier_runner]
        end

        subgraph "Ray Cluster (Local)"
            E -- schedules on --> G1[Ray Actor 1]
            E -- schedules on --> G2[Ray Actor 2]
            E -- schedules on --> G3[Ray Actor 3]
            G1 & G2 & G3 -- parallel execution --> E
        end

        E --> F
    end

    C -- HTTPS Request --> A
    F -- Final Prediction --> D
    D -- JSON Response --> B
    B -- JSON Response --> A
    A -- HTTPS Response --> C
The end result is a highly standardized deployment unit. Data scientists only care about their service.py and bentofile.yaml; the engineering team maintains the Packer configuration and the NestJS gateway template. To deploy a new service, the CI/CD pipeline pulls the relevant code, launches a new instance from our golden AMI, and runs the bootstrap script. That is the MLOps "highway" we set out to build.
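What that last pipeline step boils down to is roughly the following; the AMI ID, subnet, and security group are placeholders for values the pipeline resolves at run time:
# Launch one instance of the service unit from the golden AMI,
# passing the bootstrap script as user-data (cloud-init runs it as root).
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g4dn.xlarge \
  --subnet-id subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --user-data file://bootstrap.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ml-service-unit}]'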
The current implementation places every component (NestJS, BentoML, Ray head/workers) on the same instance, or a group of identical instances. This all-in-one deployment keeps the initial setup and networking simple, but gives up the ability to scale components independently. If the API gateway is CPU-bound while inference needs lots of GPU, we are forced to scale the whole instance vertically, wasting resources.
A natural evolution is to move the architecture to Kubernetes: the NestJS gateway becomes a regular Deployment, the BentoML service is deployed through Yatai or a custom operator, and the Ray cluster is managed by the KubeRay operator. Each component can then scale according to its own resource needs.
Another limitation is inter-component communication. NestJS and BentoML currently talk over HTTP, which is simple but not optimal for performance. For latency-critical scenarios, upgrading the protocol to gRPC is the necessary next step: the NestJS side gains a gRPC client and the BentoML service exposes a gRPC endpoint, which BentoML supports natively.
Finally, the current Packer image is a "kitchen sink" design. As more model frameworks need support (TensorFlow, JAX, and so on), the image will keep growing. A future optimization is a layered Packer build: a base image with the common dependencies (CUDA, Node.js), and on top of it lighter, framework-specific images (PyTorch, TF).