Our data scientists are shipping more and more models, but deployment efficiency on the engineering side has become the bottleneck. Every new model arrives with its own ad-hoc Flask API, a hand-written Dockerfile full of "best practices" that nobody maintains, and a set of deployment scripts that can't even be found on the wiki. The result: models iterate quickly, but time-to-production is long, and every live service differs from the others in subtle yet fatal ways, in both tech stack and operations. That sprawl translates directly into unstable services and high maintenance costs.
What we need is a standardized "AI service unit". The unit must be self-contained and reproducible, and it must shield data scientists from the complexity of the underlying infrastructure. The goal: taking a model from "development done" to a pre-production deployment should be an automated flow of pushing a commit and letting the CI pipeline take over, not a cross-team coordination meeting.
The initial design revolves around a "golden image". By pre-building a virtual machine image that already contains every runtime dependency, the security baseline, and the monitoring agent, we eliminate environment drift at the root. The service itself consists of two parts: an API gateway that handles business logic, authentication, and rate limiting, and a high-performance model serving layer with support for distributed computation. Together they form our "AI service unit" template.
After a few rounds of evaluation, we settled on the stack:
- Base image build: Packer. The obvious choice. It defines VM image builds as code, supports multiple cloud platforms, and integrates cleanly with our CI/CD tooling (Jenkins, GitLab CI).
- Model serving layer: BentoML + Ray. BentoML provides solid model packaging and serving abstractions, and its "Runner" concept is very flexible. For workloads that need parallelism or complex inter-model dependencies, the BentoML/Ray integration gives us distributed computation out of the box, which is exactly what our heavy feature engineering and model ensembles require.
- API gateway/BFF: NestJS. Simply exposing BentoML's API endpoints is nowhere near enough for a real production environment. We need a robust framework for authentication, API version management, request validation, data transformation, and all the other non-ML yet critical business logic. NestJS is TypeScript-based, has a clean architecture (inspired by Angular) and a mature ecosystem, and is well suited to building this reliable "control plane".
The core of the combination: Packer standardizes the environment, BentoML on Ray handles the scalable AI computation, and NestJS acts as the stable, outward-facing API facade. Time to wire them together.
Phase 1: Building Immutable Infrastructure with Packer
Everything rests on that "golden image". The goal is an AMI (Amazon Machine Image) that ships with the CUDA driver, a specific Python environment, the Ray cluster dependencies, and a Node.js runtime. In a real project, this image build is triggered by the CI/CD pipeline.
The heart of it is the packer.pkr.hcl file. We deliberately avoid baking model code or business logic into the image; those are pulled from git or mounted as data volumes when an instance boots. The image's only job is to provide a stable, reliable, secure, and consistent runtime environment.
// file: base-ml-service.pkr.hcl
packer {
  required_plugins {
    amazon = {
      version = ">= 1.2.1"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

variable "aws_access_key" {
  type      = string
  sensitive = true
}

variable "aws_secret_key" {
  type      = string
  sensitive = true
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "instance_type" {
  type    = string
  default = "g4dn.xlarge" // A GPU instance for ML workloads
}

// HCL2 does not interpolate the legacy {{timestamp}} template;
// use the timestamp() function through a local instead.
locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "ubuntu-ml" {
  access_key    = var.aws_access_key
  secret_key    = var.aws_secret_key
  region        = var.aws_region
  instance_type = var.instance_type

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = ["099720109477"] # Canonical's owner ID
  }

  ssh_username = "ubuntu"
  ami_name     = "bento-ray-nestjs-base-${local.timestamp}"

  tags = {
    OS_Version = "Ubuntu 20.04"
    Base_For   = "Standard ML Service Unit"
  }
}

build {
  name    = "ml-service-unit-build"
  sources = ["source.amazon-ebs.ubuntu-ml"]

  provisioner "shell" {
    inline = [
      "echo 'Waiting for cloud-init to finish...'",
      "cloud-init status --wait",
      "sudo apt-get update -y",
      "sudo apt-get install -y wget software-properties-common"
    ]
  }

  // Provisioner 1: Install NVIDIA drivers for the GPU
  provisioner "shell" {
    script = "./scripts/install_nvidia_drivers.sh"
  }

  // Provisioner 2: Install Miniconda and create a Python environment.
  // A common mistake is to install Python packages globally;
  // a dedicated conda environment keeps dependency management clean.
  provisioner "shell" {
    script = "./scripts/setup_python_env.sh"
  }

  // Provisioner 3: Install Node.js and PM2
  provisioner "shell" {
    inline = [
      "curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -",
      "sudo apt-get install -y nodejs",
      "sudo npm install pm2 -g" // PM2 for process management
    ]
  }

  // Provisioner 4: System hardening and cleanup
  provisioner "shell" {
    inline = [
      "sudo apt-get autoremove -y",
      "sudo apt-get clean",
      "sudo rm -rf /tmp/*"
    ]
  }
}
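The build references two helper scripts. A minimal sketch of install_nvidia_drivers.sh is shown below; it assumes Ubuntu's ubuntu-drivers tooling is acceptable for driver selection (teams that need to pin an exact CUDA toolkit version would pull from NVIDIA's apt repository instead):
#!/bin/bash
# file: scripts/install_nvidia_drivers.sh (sketch)
set -e

# Let Ubuntu select a driver compatible with the attached GPU (g4dn -> NVIDIA T4).
sudo apt-get update -y
sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall

# The driver only loads after a reboot. Packer can handle this with an extra
# shell provisioner that sets expect_disconnect = true and reboots, or the
# final AMI can simply be validated by running `nvidia-smi` on first boot.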
The companion script setup_python_env.sh is critical: it creates an isolated Python environment and pre-installs the core dependencies:
#!/bin/bash
set -e # Exit immediately if a command exits with a non-zero status.

# Install Miniconda.
# The Packer shell provisioner runs as the unprivileged "ubuntu" user, so we
# create /opt/conda with the right ownership first instead of writing to /opt as root.
sudo mkdir -p /opt/conda
sudo chown "$(whoami)" /opt/conda
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -u -p /opt/conda
rm /tmp/miniconda.sh
export PATH="/opt/conda/bin:$PATH"

# Create a dedicated environment
conda create --name ml_env python=3.9 -y
source /opt/conda/bin/activate ml_env

# Install core ML libraries.
# Note: we pin specific versions to ensure reproducibility.
pip install "ray[default]==2.6.3"
pip install "bentoml==1.1.10"
pip install "torch==2.0.1" "torchvision==0.15.2" --index-url https://download.pytorch.org/whl/cu118

# A common trap is forgetting to make the conda env accessible to all users,
# or not setting up PATH correctly in service files.
# We make the activation available system-wide.
echo "source /opt/conda/bin/activate ml_env" | sudo tee /etc/profile.d/conda.sh
With this phase done, we have a clean, reproducibly built runtime environment. Any instance launched from this image carries every piece of base software the service unit needs.
Phase 2: The BentoML Service and Distributed Execution with Ray
Next we define a slightly more involved BentoML service. Say the scenario is an image-processing pipeline with two stages: 1) a feature-extraction model, and 2) a classifier that works on those features. Feature extraction is the expensive, parallelizable step, and that is exactly where Ray comes in.
We wrap the models in BentoML's Runner abstraction and designate one of the runners to run on Ray (the RayRunner path) so it can leverage Ray's distributed execution.
# file: service.py
import typing as t

import bentoml
import numpy as np
from PIL.Image import Image


# For demonstration, we use dummy models.
# In a real project, these would be loaded from a model store.
class FeatureExtractor:
    def __init__(self):
        # Heavy initialization would go here
        pass

    def extract(self, img_batch: t.List[Image]) -> np.ndarray:
        # Simulate a heavy, parallelizable task
        print(f"FeatureExtractor running on batch of size: {len(img_batch)}")
        return np.random.rand(len(img_batch), 512)


class Classifier:
    def __init__(self):
        pass

    def predict(self, features: np.ndarray) -> t.List[str]:
        # Simulate classification
        print(f"Classifier predicting for features of shape: {features.shape}")
        return ["cat" if feat[0] > 0.5 else "dog" for feat in features]


# 1. Define Runners for our models.
# This runner will be executed in a distributed fashion by Ray.
feature_extractor_runner = bentoml.Runner(
    name="feature_extractor",
    runnable_class=FeatureExtractor,
    method_name="extract",
    # Here is the key part: specifying Ray options for the runner.
    # This creates a pool of Ray actors for parallel execution.
    ray_options=bentoml.ray.RayOptions(
        num_replicas=4,  # Number of parallel Ray actors
        num_cpus=1,      # CPUs per actor
    ),
)

classifier_runner = bentoml.Runner(
    name="classifier",
    runnable_class=Classifier,
)

# 2. Define the BentoML Service
svc = bentoml.Service(
    name="image_classification_service",
    runners=[feature_extractor_runner, classifier_runner],
)


# 3. Define the service API endpoint
@svc.api(input=bentoml.io.Image(), output=bentoml.io.JSON())
async def classify(image: Image) -> t.Dict[str, str]:
    """
    The core logic of our service.
    It orchestrates calls to different runners.
    """
    # Convert the single image into a batch for runner processing
    image_batch = [image]

    # The call to `feature_extractor_runner.async_run` is scheduled
    # on the Ray actor pool by BentoML automatically.
    features = await feature_extractor_runner.async_run(image_batch)

    # The result is then passed to the next stage.
    results = await classifier_runner.predict.async_run(features)

    return {"prediction": results[0]}
The bentofile.yaml declares the service's metadata and Python dependencies; BentoML uses it to build a portable "Bento".
# file: bentofile.yaml
service: "service:svc"
labels:
  owner: ml-platform-team
  stage: development
include:
  - "*.py"
python:
  packages:
    # These dependencies are already pre-installed in the Packer image,
    # but declaring them here keeps the Bento self-contained so it can
    # be run elsewhere if needed. It's a best practice.
    - numpy
    - Pillow
    - torch
    - bentoml
    - "ray[default]"
Running bentoml build packages the code above, plus any models, into a versioned Bento. That Bento is the ML asset we deploy.
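Concretely, the packaging step in CI (or on a laptop) is only a couple of commands; the export target below is a placeholder for wherever built Bentos are archived:
cd bentoml-service
bentoml build        # reads bentofile.yaml and prints the new version tag
bentoml list         # inspect the local Bento store
bentoml export image_classification_service:latest ./image_classification_service.bento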
Phase 3: NestJS as a Solid Control Plane
Now for NestJS. It is the single entry point of the service unit and handles every external request. We create an MlProxyModule that receives requests, applies the necessary processing (authentication, logging), and forwards them to the BentoML service running locally.
The benefits of this split are obvious:
- Separation of concerns: NestJS handles the generic web-service logic, while BentoML focuses on high-performance model inference.
- Strong typing and maintainability: TypeScript plus NestJS's modular structure keeps a large codebase manageable.
- Ecosystem: we can easily plug in class-validator for complex request validation, Passport.js for authentication, and so on.
// file: src/ml-proxy/ml-proxy.controller.ts
import {
  Controller,
  Post,
  UploadedFile,
  UseInterceptors,
  HttpException,
  HttpStatus,
  Logger,
} from '@nestjs/common';
import { FileInterceptor } from '@nestjs/platform-express';
import { MlProxyService } from './ml-proxy.service';

@Controller('v1/classify')
export class MlProxyController {
  private readonly logger = new Logger(MlProxyController.name);

  constructor(private readonly mlProxyService: MlProxyService) {}

  @Post('image')
  @UseInterceptors(FileInterceptor('image')) // 'image' is the field name in multipart/form-data
  async classifyImage(@UploadedFile() file: Express.Multer.File) {
    if (!file) {
      throw new HttpException('Image file is required', HttpStatus.BAD_REQUEST);
    }

    // In a real project, you would add more validation here:
    // - File size limits
    // - MIME type checking (e.g., only allow image/jpeg, image/png)
    this.logger.log(`Received image for classification: ${file.originalname}`);

    try {
      // Delegate the actual call to the service layer
      return await this.mlProxyService.forwardToBento(file.buffer);
    } catch (error) {
      this.logger.error(`Error forwarding request to BentoML: ${error.message}`, error.stack);
      // Here, we must decide what error to expose to the client.
      // Avoid leaking internal implementation details.
      throw new HttpException(
        'Failed to process the image due to an internal error.',
        HttpStatus.INTERNAL_SERVER_ERROR,
      );
    }
  }
}
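The validation noted as a TODO in the handler can be expressed declaratively with the file pipes that ship with recent NestJS releases (9+). A sketch of the changed handler signature, with an arbitrarily chosen size limit:
// Hypothetical refinement of classifyImage(); only the decorator changes.
import { ParseFilePipe, MaxFileSizeValidator, FileTypeValidator } from '@nestjs/common';

async classifyImage(
  @UploadedFile(
    new ParseFilePipe({
      validators: [
        new MaxFileSizeValidator({ maxSize: 5 * 1024 * 1024 }), // 5 MB
        new FileTypeValidator({ fileType: /image\/(jpeg|png)/ }),
      ],
    }),
  )
  file: Express.Multer.File,
) { /* ... body unchanged ... */ }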
MlProxyService uses NestJS's built-in HttpModule (backed by axios) to talk to the BentoML service.
// file: src/ml-proxy/ml-proxy.service.ts
import { Injectable, Logger } from '@nestjs/common';
import { HttpService } from '@nestjs/axios';
import { firstValueFrom } from 'rxjs';
import { ConfigService } from '@nestjs/config';
import * as FormData from 'form-data';

@Injectable()
export class MlProxyService {
  private readonly logger = new Logger(MlProxyService.name);
  private readonly bentoServiceUrl: string;

  constructor(
    private readonly httpService: HttpService,
    private readonly configService: ConfigService, // For managing config like URLs
  ) {
    // A common error is hardcoding URLs. Use a config service.
    this.bentoServiceUrl = this.configService.get<string>(
      'BENTOML_SERVICE_URL',
      'http://127.0.0.1:3000', // Default for local dev
    );
  }

  async forwardToBento(imageBuffer: Buffer): Promise<any> {
    const endpoint = `${this.bentoServiceUrl}/classify`;
    const form = new FormData();
    form.append('image', imageBuffer, { filename: 'upload.jpg', contentType: 'image/jpeg' });

    this.logger.log(`Forwarding request to BentoML endpoint: ${endpoint}`);

    // The key here is to correctly handle multipart/form-data proxying.
    // We need to pass the headers from the form-data library to axios.
    try {
      const response = await firstValueFrom(
        this.httpService.post(endpoint, form, {
          headers: form.getHeaders(),
        }),
      );
      return response.data;
    } catch (error) {
      if (error.response) {
        this.logger.error(`BentoML service responded with error: ${error.response.status}`, error.response.data);
      } else {
        this.logger.error(`Failed to connect to BentoML service: ${error.message}`);
      }
      // Re-throw to be caught by the controller's error handler
      throw new Error('Upstream BentoML service failed');
    }
  }
}
Finally, we combine the controller, the service, HttpModule, and ConfigModule into the complete MlProxyModule.
// file: src/ml-proxy/ml-proxy.module.ts
import { Module } from '@nestjs/common';
import { HttpModule } from '@nestjs/axios';
import { ConfigModule } from '@nestjs/config';
import { MlProxyController } from './ml-proxy.controller';
import { MlProxyService } from './ml-proxy.service';

@Module({
  imports: [
    HttpModule,
    ConfigModule, // Make sure ConfigModule is imported globally in AppModule
  ],
  controllers: [MlProxyController],
  providers: [MlProxyService],
})
export class MlProxyModule {}
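For completeness, a minimal AppModule that satisfies the "imported globally" note above could look like this (the file path is the NestJS default; the env-file location is an assumption):
// file: src/app.module.ts (sketch)
import { Module } from '@nestjs/common';
import { ConfigModule } from '@nestjs/config';
import { MlProxyModule } from './ml-proxy/ml-proxy.module';

@Module({
  imports: [
    // isGlobal makes ConfigService injectable everywhere, including MlProxyService
    ConfigModule.forRoot({ isGlobal: true, envFilePath: '.env' }),
    MlProxyModule,
  ],
})
export class AppModule {}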
Phase 4: Assembly and Startup
We now have the Packer image, the BentoML service code, and the NestJS application. The last step is to assemble and start everything when a VM instance boots, typically through cloud-init or the instance's user-data script.
This bootstrap script is the glue that holds the whole automated flow together.
#!/bin/bash
set -ex

# --- Configuration ---
export APP_HOME="/srv/app"
export GIT_REPO_URL="<your-git-repo-for-the-service-code>"
export BENTO_TAG="image_classification_service:latest" # This would be dynamic in CI/CD

# Activate the conda environment pre-built by Packer
source /etc/profile.d/conda.sh

# --- 1. Fetch Application Code ---
# In a real setup, use a specific commit hash, not main.
git clone ${GIT_REPO_URL} ${APP_HOME}
cd ${APP_HOME}

# --- 2. Setup NestJS Application ---
cd nestjs-gateway
npm install
npm run build
cd ..

# --- 3. Setup BentoML Service ---
# The Bento has normally already been built by a CI pipeline and stored
# in a Bento store (e.g. on S3). Here we would download and import it.
# For simplicity, we assume the BentoML project lives in the repo.
cd bentoml-service
# bentoml pull ${BENTO_TAG} --bento-store s3://my-bento-store
# For this example, we build it on the fly, which is not ideal for production.
bentoml build
cd ..

# --- 4. Start Services with PM2 ---
# PM2 lets us manage and monitor these different processes.

# Start the Ray head node. `--block` keeps the process in the foreground so
# PM2 can supervise it (without it, `ray start` daemonizes and exits immediately).
pm2 start "ray start --head --port=6379 --block --disable-usage-stats" --name ray-head

# Wait for the Ray head to be ready. A robust solution would poll `ray status`.
sleep 5

# Start the BentoML server, pointing to the Ray head.
# The `BENTOML_CONFIG` environment variable is the proper way to configure BentoML.
export BENTOML_CONFIG=./config.production.yaml
pm2 start "bentoml serve ${BENTO_TAG} --production" --name bentoml-server

# Start the NestJS gateway.
# BENTOML_SERVICE_URL is read by our NestJS ConfigService.
export BENTOML_SERVICE_URL="http://127.0.0.1:3000"
export PORT=8080 # Port for the NestJS app
pm2 start "node dist/main.js" --name nestjs-gateway --cwd ./nestjs-gateway

# Save the PM2 process list so it can be resurrected on reboot
# (this assumes `pm2 startup` has been configured, e.g. baked into the AMI).
pm2 save
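Once PM2 reports all three processes as online, a quick end-to-end check from the instance exercises the whole chain (gateway to BentoML to Ray); test.jpg is any local sample image:
pm2 status
curl -F "image=@test.jpg" http://127.0.0.1:8080/v1/classify/image
# -> {"prediction":"cat"} or {"prediction":"dog"}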
The config.production.yaml file referenced in the script configures BentoML, in particular its connection to Ray.
# file: config.production.yaml
runners:
  feature_extractor:
    ray_options:
      address: "ray://127.0.0.1:10001" # Connect to the Ray cluster via the Ray client
The Mermaid diagram below shows the request flow inside the service unit we have just built:
graph TD
    subgraph "Client"
        C[User/Client Application]
    end

    subgraph "EC2 Instance (Packer AMI)"
        subgraph "NestJS Gateway (Port 8080)"
            A[Controller] --> |Input Validation, Auth| B(Service)
            B --> |HTTP POST /classify| D{BentoML API Server}
        end

        subgraph "BentoML Service (Port 3000)"
            D --> |async_run| E[feature_extractor_runner]
            D --> |async_run| F[classifier_runner]
        end

        subgraph "Ray Cluster (Local)"
            E -- schedules on --> G1[Ray Actor 1]
            E -- schedules on --> G2[Ray Actor 2]
            E -- schedules on --> G3[Ray Actor 3]
            G1 & G2 & G3 -- parallel execution --> E
        end

        E --> F
    end

    C -- HTTPS Request --> A
    F -- Final Prediction --> D
    D -- JSON Response --> B
    B -- JSON Response --> A
    A -- HTTPS Response --> C
The end result is a highly standardized deployment unit. Data scientists only care about their service.py and bentofile.yaml; the engineering team maintains the Packer configuration and the NestJS gateway template. To deploy a new service, the CI/CD pipeline pulls the relevant code, launches a new instance from our golden AMI, and runs the bootstrap script. That is the MLOps "highway" we set out to build.
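What that last pipeline step boils down to is roughly the following; the AMI ID, subnet, and security group are placeholders for values the pipeline resolves at run time:
# Launch one instance of the service unit from the golden AMI,
# passing the bootstrap script as user-data (cloud-init runs it as root).
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g4dn.xlarge \
  --subnet-id subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --user-data file://bootstrap.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ml-service-unit}]'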
The current implementation places every component (NestJS, BentoML, Ray head/workers) on the same instance, or a group of identical instances. This all-in-one deployment keeps the initial setup and networking simple, but gives up the ability to scale components independently. If the API gateway is CPU-bound while inference needs lots of GPU, we are forced to scale the whole instance vertically, wasting resources.
A natural evolution is to move the architecture to Kubernetes: the NestJS gateway becomes a regular Deployment, the BentoML service is deployed through Yatai or a custom operator, and the Ray cluster is managed by the KubeRay operator. Each component can then scale according to its own resource needs.
Another limitation is inter-component communication. NestJS and BentoML currently talk over HTTP, which is simple but not optimal for performance. For latency-critical scenarios, upgrading the protocol to gRPC is the necessary next step: the NestJS side gains a gRPC client and the BentoML service exposes a gRPC endpoint, which BentoML supports natively.
Finally, the current Packer image is a "kitchen sink" design. As more model frameworks need support (TensorFlow, JAX, and so on), the image will keep growing. A future optimization is a layered Packer build: a base image with the common dependencies (CUDA, Node.js), and on top of it lighter, framework-specific images (PyTorch, TF).