qwen2.5-vl 7B: use OCR to return JSON that includes position information.
To prevent hallucinations: don't try to enforce this through the prompt. Instead, add response_format to the request parameters so the output is constrained to the required format and content.
response_format={"type": "json_object"}
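A minimal sketch of what such a call might look like, assuming an OpenAI-compatible endpoint is serving Qwen2.5-VL-7B; the base_url, model name, image URL, and the minimal user message are illustrative placeholders, not part of the original note:

```python
# Sketch only: endpoint, model name, and image URL are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "http://example.com/doc.png"}},
            {"type": "text", "text": "OCR"},  # keep the prompt minimal
        ],
    }],
    response_format={"type": "json_object"},  # constrain the output to valid JSON
)
print(json.loads(resp.choices[0].message.content))
```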
$(".chb:checked").map(
function(){
return $(this).val()
}
).get();
https://github.com/moeru-ai/airi

A model-driven container for souls, a desktop companion that can do a bit of everything: let virtual companions like Neuro-sama become part of our world!
[Join Discord] [Try it out] [English] [日本語]
Heavily inspired by Neuro-sama
Warning
Note: we have not issued any cryptocurrency or token associated with this project. Please judge information carefully and proceed with caution.
Note
We have a dedicated organization, @proj-airi, for all the sub-projects born from Project AIRI; come take a look!
RAG (retrieval-augmented generation), memory systems, embedded databases, icons, Live2D utilities, and more!
With the power of modern large language models such as ChatGPT and the well-known Claude, getting an LLM to role-play and chat with us has become extremely easy; anyone can do it. Platforms like Character.ai (a.k.a. c.ai) and JanitorAI, as well as local apps such as SillyTavern, are already solid solutions for chat- or text-adventure-style experiences.
But how do we give them the ability to play games? To see the code you are writing? To not only chat while playing games, but also watch videos and do much more?
You may already know Neuro-sama. She is currently the best AI VTuber / companion that can play games, chat, and interact with you and the participants (in the VTuber community); some people also call such a being a "digital human". Unfortunately, she is not open source, and once she goes offline from a stream you can no longer interact with her.
That's why this project, AIRI, offers another possibility: letting you easily own your own digital life, your own cyber life, anytime and anywhere.
Unlike other open-source AI- and LLM-driven VTuber projects, アイリ VTuber has supported a wide range of web technologies from the first day of development, covering APIs such as WebGPU, WebAudio, Web Workers, WebAssembly, and WebSocket, both widely adopted ones and ones still under heavy experimentation.
This means アイリ VTuber can run in modern browsers and on modern devices, even on mobile (PWA support is already done). This gives us (the contributors) more room to build and extend アイリ VTuber's external capabilities without losing configuration flexibility: features that require a TCP connection or other non-web technologies can be enabled selectively on different devices, for example joining a Discord voice channel to play together, or playing Minecraft or Factorio with friends.
Note
アイリ VTuber is still at an early stage of development, and we welcome skilled developers to join us and make it a reality together.
You don't need to be familiar with Vue.js, TypeScript, and the other development tools involved; artists, designers, and community/operations people are welcome too. You could even become the first streamer to go live with アイリ VTuber.
If you use React, Svelte, or even Solid, that's fine as well: create your own sub-directory and add the features you'd like to see in アイリ VTuber, or the ones you want to experiment with.
We would especially love to see people from the following areas join:
If you are already interested, why not come and say hi to everyone? Would you like to join us in building AIRI?

For a hands-on guide to developing this project, see CONTRIBUTING.md

```bash
pnpm i
pnpm dev
```
- unspeech: proxy server implementation for /audio/transcriptions and /audio/speech, like LiteLLM but for any ASR and TTS
- hfup: tools to help deploy and bundle projects to HuggingFace Spaces
- @proj-airi/drizzle-duckdb-wasm: Drizzle ORM driver for DuckDB WASM
- @proj-airi/duckdb-wasm: easy-to-use wrapper for @duckdb/duckdb-wasm
- @proj-airi/lobe-icons: Iconify JSON bundling of the beautiful lobe-icons AI & LLM icons, with Tailwind and UnoCSS support
- autorio: Factorio automation library
- tstl-plugin-reload-factorio-mod: hot-reloads Factorio mods during development
- demodel: easily speed up pulling/downloading models and datasets from various inference runtimes and model downloaders
- inventory: centralized model catalog and default provider configuration, exposed as a public API service
- @proj-airi/elevenlabs: type definitions for the ElevenLabs API
- xsai: implements a good number of packages for interacting with LLMs and models, like the Vercel AI SDK but much smaller

https://github.com/cnadler86/mp_esp_dl_models
This is a MicroPython binding for ESP-DL (Deep Learning) models that enables face detection, face recognition, human detection, cat detection, and image classification on ESP32 devices.
I spent a lot of time and effort to make this. If you find this project useful, please consider donating to support my work.
- FaceDetector: Detects faces in images and provides bounding boxes and facial features
- FaceRecognizer: Recognizes enrolled faces and manages a face database
- HumanDetector: Detects people in images and provides bounding boxes
- CatDetector: Detects cats in images and provides bounding boxes
- ImageNet: Classifies images into predefined categories
- CocoDetector: Detects objects in images using COCO dataset categories

You can find precompiled images in two ways:
```bash
git clone --recursive https://github.com/cnadler86/mp_esp_dl_models.git
git clone https://github.com/cnadler86/micropython-camera-API.git
git clone https://github.com/cnadler86/mp_jpeg.git
```
a) Using mpconfigvariant files (recommended): The models can be enabled in the board’s mpconfigvariant files (e.g., mpconfigvariant_FLASH_16M.cmake). The following flags are available:
b) Using command line flags: You can enable models directly through the idf.py command using -D flags:

```bash
idf.py -D MP_DL_FACE_RECOGNITION_ENABLED=1 -D MP_DL_CAT_DETECTOR_ENABLED=1 [other flags…]
```
Basic build command:

```bash
cd mp_esp_dl_models/boards/
idf.py -D MICROPY_DIR=<micropython-dir> -D MICROPY_BOARD=<BOARD_NAME> -D MICROPY_BOARD_VARIANT=<BOARD_VARIANT> -B build-<your-build-name> build
cd build-<your-build-name>
python ~/micropython/ports/esp32/makeimg.py sdkconfig bootloader/bootloader.bin partition_table/partition-table.bin micropython.bin firmware.bin micropython.uf2
```
All models support various input pixel formats including RGB888 (default), RGB565, and others supported by ESP-DL. You can use mp_jpeg to decode camera images to the correct format.
The pixel format can be set through the constructor’s pixel_format parameter. This value matches the ESP-DL image format definitions.
- espdl.RGB888 (default)
- espdl.RGB565
- espdl.GRAYSCALE
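As a quick sketch of switching formats, a detector can be constructed for RGB565 input so that frames coming straight from a camera configured for RGB565 need no JPEG decoding. The camera constructor arguments below are assumptions about the camera driver's API, not part of this README:

```python
# Sketch: feed RGB565 camera frames directly to a detector (no JPEG decode step).
# The camera setup below is an assumption about the camera driver's API.
from espdl import FaceDetector, RGB565
import camera

cam = camera.Camera(pixel_format=camera.PixelFormat.RGB565,
                    frame_size=camera.FrameSize.QVGA)  # 320x240 RGB565 frames
detector = FaceDetector(width=320, height=240, pixel_format=RGB565)

results = detector.run(cam.capture())
if results:
    for face in results:
        print(face["score"], face["box"])
```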
The FaceDetector module detects faces in images and can optionally provide facial feature points.

FaceDetector(width=320, height=240, pixel_format=espdl.RGB888, features=True)
Parameters:
- width (int, optional): Input image width. Default: 320
- height (int, optional): Input image height. Default: 240
- pixel_format (int, optional): Input image pixel format. Default: espdl.RGB888
- features (bool, optional): Whether to return facial feature points. Default: True

run() argument:

- framebuffer: image data (required)

Each detection result contains:

- score: Detection confidence (float)
- box: Bounding box coordinates [x1, y1, x2, y2]
- features: Facial feature points [(x,y) coordinates for: left eye, right eye, nose, left mouth, right mouth] if enabled, None otherwise

The FaceRecognizer module manages a database of faces and can recognize previously enrolled faces.
FaceRecognizer(width=320, height=240, pixel_format=espdl.RGB888, features=True, db_path="face.db", model=None)
Parameters:
- width (int, optional): Input image width. Default: 320
- height (int, optional): Input image height. Default: 240
- pixel_format (int, optional): Input image pixel format. Default: espdl.RGB888
- features (bool, optional): Whether to return facial feature points. Default: True
- db_path (str, optional): Path to the face database file. Default: "face.db"
- model (str, optional): Feature extraction model to use ("MBF" or "MFN"). Default: None (uses default model)

run() argument:

- framebuffer: image data (required)

Each detection result contains:

- score: Detection confidence
- box: Bounding box coordinates [x1, y1, x2, y2]
- features: Facial feature points (if enabled)
- person: Recognition result containing:
  - id: Face ID
  - similarity: Match confidence (0-1)
  - name: Person name (if provided during enrollment)

enroll() arguments:

- framebuffer: image data
- validate (bool, optional): Check if face is already enrolled. Default: False
- name (str, optional): Name to associate with the face. Default: None

To delete an enrolled face:

- id (int): ID of the face to delete

The HumanDetector module detects people in images; the CatDetector does the same for cats. Both modules provide bounding boxes for detected objects.
HumanDetector(width=320, height=240, pixel_format=espdl.RGB888) #For cats use CatDetector
Parameters:
- width (int, optional): Input image width. Default: 320
- height (int, optional): Input image height. Default: 240
- pixel_format (int, optional): Input image pixel format. Default: espdl.RGB888

run() argument:

- framebuffer: image data

Each detection result contains:

- score: Detection confidence
- box: Bounding box coordinates [x1, y1, x2, y2]
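A short usage sketch, mirroring the FaceDetector example further below; the camera and JPEG decoder setup are taken from that example and are assumptions about your capture pipeline:

```python
# Sketch: detect people in a camera frame and print their boxes.
from espdl import HumanDetector  # use CatDetector for cats
import camera
from jpeg import Decoder

cam = camera.Camera()
decoder = Decoder(pixel_format="RGB888")
detector = HumanDetector(width=320, height=240)

framebuffer = decoder.decode(cam.capture())  # JPEG -> RGB888
results = detector.run(framebuffer)
if results:
    for person in results:
        print(f"Person detected with confidence {person['score']} at {person['box']}")
```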
The ImageNet module classifies images into predefined categories.

ImageNet(width=320, height=240, pixel_format=espdl.RGB888)
Parameters:
- width (int, optional): Input image width. Default: 320
- height (int, optional): Input image height. Default: 240
- pixel_format (int, optional): Input image pixel format. Default: espdl.RGB888

run() argument:

- framebuffer: image data

Returns a flat list of classes and scores: [class1, score1, class2, score2, ...]
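A usage sketch following the same pattern as the detector examples below; the camera and decoder setup are assumptions about your capture pipeline:

```python
# Sketch: classify a camera frame and print class/score pairs from the flat result list.
from espdl import ImageNet
import camera
from jpeg import Decoder

cam = camera.Camera()
decoder = Decoder(pixel_format="RGB888")
classifier = ImageNet(width=320, height=240)

results = classifier.run(decoder.decode(cam.capture()))
if results:
    # results is [class1, score1, class2, score2, ...]
    for cls, score in zip(results[::2], results[1::2]):
        print(f"{cls}: {score}")
```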
The COCO detect module detects objects in images using the COCO dataset.

COCODetector(width=320, height=240, pixel_format=espdl.RGB888, model=CONFIG_DEFAULT_COCO_DETECT_MODEL)
Parameters:
- width (int, optional): Input image width. Default: 320
- height (int, optional): Input image height. Default: 240
- pixel_format (int, optional): Input image pixel format. Default: espdl.RGB888
- model (int, optional): COCO detection model to use. Default: CONFIG_DEFAULT_COCO_DETECT_MODEL

run() argument:

- framebuffer: image data

Each detection result contains:

- score: Detection confidence
- box: Bounding box coordinates [x1, y1, x2, y2]
- category: Detected object class id

```python
from espdl import FaceDetector
import camera
from jpeg import Decoder

# Initialize components
cam = camera.Camera()
decoder = Decoder(pixel_format="RGB888")
face_detector = FaceDetector()

# Capture and process image
img = cam.capture()
framebuffer = decoder.decode(img)  # Convert to RGB888

results = face_detector.run(framebuffer)
if results:
    for face in results:
        print(f"Face detected with confidence: {face['score']}")
        print(f"Bounding box: {face['box']}")
        if face['features']:
            print(f"Facial features: {face['features']}")
```
```python
from espdl import FaceRecognizer
import camera
from jpeg import Decoder

# Initialize components
cam = camera.Camera()
decoder = Decoder(pixel_format="RGB888")
recognizer = FaceRecognizer(db_path="/faces.db")

# Enroll a face
img = cam.capture()
framebuffer = decoder.decode(img)
face_id = recognizer.enroll(framebuffer, name="John")
print(f"Enrolled face with ID: {face_id}")

# Later, recognize faces
img = cam.capture()
framebuffer = decoder.decode(img)
results = recognizer.run(framebuffer)
if results:
    for face in results:
        if face['person']:
            print(f"Recognized {face['person']['name']} (ID: {face['person']['id']})")
            print(f"Similarity: {face['person']['similarity']}")
```
The following table shows the frames per second (fps) for different image sizes and models. The results are based on a test with a 2MP camera and an ESP32S3.
| Frame Size | FaceDetector | HumanDetector |
|---|---|---|
| QQVGA | 14.5 | 6.6 |
| R128x128 | 21 | 6.6 |
| QCIF | 19.7 | 6.5 |
| HQVGA | 18 | 6.3 |
| R240X240 | 16.7 | 6.1 |
| QVGA | 15.2 | 6.6 |
| CIF | 13 | 5.5 |
| HVGA | 11.9 | 5.3 |
| VGA | 8.2 | 4.4 |
| SVGA | 6.2 | 3.8 |
| XGA | 4.1 | 2.8 |
| HD | 3.6 | 2.6 |
- Use validate=True during enrollment to avoid duplicates.

The following setup was freshly verified in July 2025 on an Arm64 board in the "4×A55, 2 GB RAM" class (Allwinner T733/A733); you can copy it as-is.
```bash
# Example: Armbian (Debian 12 bookworm)
sudo apt update && sudo apt install -y \
  git cmake build-essential gcc g++ \
  wget ffmpeg alsa-utils
```
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
# Build the Arm64 executables and library
make -j$(nproc)
```
When the build finishes you should see three executables: main, stream, and bench; on the A733 the whole process takes about 3 minutes.
```bash
# The q5_0 quantized model is recommended: ~140 MB file, peak memory < 600 MB
bash ./models/download-ggml-model.sh small-q5_0
# The model is saved to ./models/ggml-small-q5_0.bin
```
```bash
# First convert to 16 kHz mono
ffmpeg -i meeting.mp3 -ar 16000 -ac 1 -c:a pcm_s16le meeting.wav
# Run transcription
./main \
  -m models/ggml-small-q5_0.bin \
  -f meeting.wav \
  -l zh --output-txt
# Result: meeting.wav.txt (Simplified Chinese); on the A733, 1 h of audio takes about 25 min.
```
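If you want to drive the same binaries from Python, for example to batch-transcribe a folder of recordings, a small wrapper might look like this. It is a sketch: it assumes the build and model download steps above, and the recordings/ directory is a placeholder.

```python
# Sketch: batch-transcribe every .mp3 in a folder by shelling out to ffmpeg and ./main.
# Assumes whisper.cpp was built as above and ggml-small-q5_0.bin has been downloaded.
import subprocess
from pathlib import Path

MODEL = "models/ggml-small-q5_0.bin"

for mp3 in Path("recordings").glob("*.mp3"):
    wav = mp3.with_suffix(".wav")
    # Convert to 16 kHz mono PCM, as whisper.cpp expects
    subprocess.run(["ffmpeg", "-y", "-i", str(mp3), "-ar", "16000", "-ac", "1",
                    "-c:a", "pcm_s16le", str(wav)], check=True)
    # Transcribe to <wav>.txt in Simplified Chinese
    subprocess.run(["./main", "-m", MODEL, "-f", str(wav), "-l", "zh", "--output-txt"],
                   check=True)
```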
```bash
# Requires ALSA.
# -t 4: 4 threads (2 threads uses less power); --step 500: emit text every 0.5 s
./stream \
  -m models/ggml-small-q5_0.bin \
  -t 4 \
  --step 500 \
  -l zh --print-colors
```
- Switching to the tiny-q5_0 model cuts latency to about 0.4 s and CPU usage to about 35 %, at the cost of roughly 2 % more transcription errors.
- Enable swap (swapon) before building: the compile stage peaks at about 1.8 GB of memory.
- To rebuild for a generic armv8-a target: make clean && make CMAKE_FLAGS="-DCMAKE_C_FLAGS='-march=armv8-a'"

Conclusion
The A733 can run whisper.cpp offline with no extra dependencies: file transcription at roughly 1× real time, streaming latency under 1 s, no GPU/NPU needed, and long-term operation at around 2 W of power, which is enough for meeting notes, local subtitles, and similar use cases. If you need higher accuracy, switch the model to medium-q5_0 (~300 MB); speed stays around 0.5× real time. Good luck with your deployment!