
Generating barcodes for an 18mm x 100mm label printer

Printing barcodes on 18mm tape with a Brother label printer.

Uses JsBarcode.all.min.js to generate the barcode.

Settings for a 12mm-tall barcode, 80mm long: height 12, bar width 2, fontSize 3;
label text 4mm tall, placed 1mm from the bottom edge.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
    <script type="text/javascript" src="https://code.jquery.com/jquery-2.1.3.min.js"></script>
    <script src="JsBarcode.all.min.js"></script>
</head>
<body>
    
</body>
<script>
function createPrintPage(ht_code) {
  // Create a hidden iframe to print from
  const iframe = document.createElement('iframe');
  iframe.style.display = 'none';
  document.body.appendChild(iframe);

  // Write the print page into the iframe
  const iframeContent = iframe.contentWindow.document;
  iframeContent.open();
  iframeContent.write(`
  <html>
    <head>
      <style>
        @media print {
          @page {
            size: 100mm 18mm;
            margin: 0;
          }
          body {
            margin: 0;
            padding: 0;
            width: 100mm;
            height: 18mm;
          }
          .barcode-container {
            width: 100%;
            height: 100%;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
          }
          .barcode {
            width: 80mm;
            height: 12mm;
          }
          .text {
            position: absolute;
            font-size: 4mm;
            text-align: center;
            width: 100%;
            bottom: 1mm;
            color: black;
          }
        }
      </style>
    </head>
    <body>
      <div class="barcode-container">
        <canvas id="barcode" class="barcode"></canvas>
        <div class="text" id="barcode-text">Sample Barcode</div>
      </div>
    </body>
    </html>
  `);
  iframeContent.close();
  // Generate the barcode with JsBarcode
  const barcodeData = ht_code; // barcode payload
  const barcodeElement = iframeContent.getElementById('barcode');
  const barcodeTextElement = iframeContent.getElementById('barcode-text');
  JsBarcode(barcodeElement, barcodeData, {
    format: "CODE128", // CODE128 symbology
    fontSize: 3,
    margin: 0,
    height: 12,
    width: 2,
    displayValue: false // hide JsBarcode's own label; we render our own text below
  });
  // Show the encoded value under the bars
  barcodeTextElement.textContent = barcodeData;
  // Trigger the print dialog
  iframe.contentWindow.print();
}

// Generate the page and print it
createPrintPage("HTA101WDCX0320250024");
</script>
</html>

AI SFT LoRA data fine-tuning

A: Data acquisition from literature. Use Kimi or another AI to process PDF documents into input/output pairs; if a chain of thought is included, the pairs become full SFT samples.

B: Data collection from the public internet. Scrape public data and post-process it into the format you need; an AI can also be used for this processing.

Fine-tuning the model with LoRA (SFT) on the data from A and B completes the project.
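As a minimal sketch of step A (the sample content and helper name are hypothetical; only the Question/Complex_CoT/Response field names come from the datasets referenced below), AI-extracted input/output pairs can be stored as SFT records like this:

```python
import json

# Hypothetical triples an AI assistant extracted from a PDF:
# (question, chain of thought, final answer).
extracted = [
    ("What does pulse diagnosis assess?",
     "The pulse is read at the wrist for rate, rhythm and strength, "
     "which TCM maps to qi, blood and organ function.",
     "It assesses the state of qi, blood and organ function."),
]

def to_sft_record(question, cot, answer):
    # Same three-field shape as medical-o1-reasoning-SFT.
    record = {"Question": question, "Response": answer}
    if cot:  # the chain of thought is optional
        record["Complex_CoT"] = cot
    return record

records = [to_sft_record(*t) for t in extracted]
print(json.dumps(records, ensure_ascii=False, indent=2))
```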

Example:

Traditional Chinese medicine (TCM): fine-tune on a knowledge framework from textbooks plus case data.

First, a TCM fine-tuning example: https://github.com/Zlasejd/HuangDI

TCM SFT LoRA data: SylvanL/Traditional-Chinese-Medicine-Dataset-SFT · Datasets at HF Mirror

Open-source SFT datasets

DeepSeek-R1 SFT data: Congliu/Chinese-DeepSeek-R1-Distill-data-110k

https://hf-mirror.com/datasets/Congliu/Chinese-DeepSeek-R1-Distill-data-110k

GitHub – chaoswork/sft_datasets: a curated collection of open-source SFT datasets, updated over time

Medical o1 SFT: https://hf-mirror.com/datasets/FreedomIntelligence/medical-o1-reasoning-SFT

| Dataset | Size | Lang | Task | Gen | Type | Source | Link |
| belle_cn | 1079517 | CN | TS/MT | SI | general instructions, math reasoning, dialogue | text-davinci-003 | download |
| firefly | 1649398 | CN | MT | COL | 23 NLP tasks | collected Chinese datasets, hand-written instruction templates | download |
| GAOKAO | 2785 | CN | MT | COL | multiple-choice and fill-in-the-blank questions from the Gaokao exam | collection of human-annotated datasets | download |
| COIG | 298428 | CN | MT | COL | exam, translation, and human-value instructions | dataset collection; knowledge-graph-based counterfactual dialogue via automated tools + human verification | download |
| pCLUE | 1200705 | CN | MT | | 73 prompts; 9 NLP tasks incl. classification, inference, keyword recognition, reading comprehension | | download |
| CSL | 396209 | CN | MT | | metadata of 400k Chinese papers, 26 prompts | | download |
| CNewSum | 304307 | CN | TS | | Chinese summarization dataset released by ByteDance and UCSB | | download |
| Coco-cn | | CN | TS | | image-text multimodal | | download |
| news_commentary | 69200 | EN/CN | TS | | Chinese-English translation data | | download |
| Chain of Thought | 74771 | EN/CN | MT | HG | CoT-related tasks | humans annotating CoT on existing datasets | download |
| HC3 | 37175 | EN/CN | TS | MIX | dialogue evaluation | gpt-3.5 or human | download |
| instinwild | 52191 | EN/CN | MT | SI | generation, open-domain QA, brainstorming | text-davinci-003 | download |
| Alpaca_GPT4 | 52002 | EN/CN | MT | SI | general instructions | Alpaca data generated by GPT-4 | download |
| MOSS | 1583595 | EN/CN | | SI | | | download |
| LLMZoo | | ML | | | | | download |
| Guanaco | 534610 | ML | MT | SI | various NLP tasks | text-davinci-003 | download |
| Natural Instructions | 5040134 | ML | MT | COL | various NLP tasks | collection of human-annotated datasets | download |
| xP3 | 78883588 | ML | MT | COL | various NLP tasks | collection of human-annotated datasets | download |
| alpaca | 52002 | EN | MT | SI | general instructions | text-davinci-003 | download |
| GPT4all | 806199 | EN | MT | COL | code, stories, dialogue | distilled from GPT-3.5-turbo | download |
| GPTeacher | 29013 | EN | MT | SI | general, role-play, and tool instructions | GPT-4 & toolformer | download |
| prosocial dialog | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions, humans reply | download |
| finance_en | 68912 | EN | TS | COL | financial QA | GPT-3.5 | download |
| instruct | 888969 | EN | MT | COL | augmentation of GPT4All, Alpaca, and open-source datasets | NLP augmentation tools from AllenAI | download |
| Code Alpaca | 20022 | EN | SI | SI | code generation, editing, optimization | text-davinci-003 | download |
| webGPT | 18994 | EN | TS | MIX | information-retrieval QA | fine-tuned GPT-3 + human evaluation | download |
| dolly 2.0 | 15015 | EN | TS | HG | seven task types: open/closed QA, information extraction, summarization, open-ended ideation, classification, creative writing | human-annotated | download |
| baize | 653699 | EN | MT | COL | Alpaca plus various QA tasks | collection of human-annotated datasets | download |
| hh-rlhf | 284517 | EN | TS | MIX | dialogue | RLHF models | download |
| OIG(part) | 49237 | EN | MT | COL | various NLP tasks | collection and augmentation of human-annotated datasets | download |
| camel | 760620 | EN | MT | SI | role-play dialogue in physics, biology, chemistry, programming, math, society, etc. | collection of human-annotated datasets; generated with gpt-3.5-turbo | download |
| FLAN-Muffin | 1764800 | EN | MT | COL | 60 NLP tasks | collection of human-annotated datasets | download |
| GPT4Tools | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | download |
| ShareChat | 1663241 | EN | MT | MIX | general instruct | collected from ShareGPT | download |
| Auto CoT | | EN | | | | | download |
| ultrachat | 28247446 | EN | | | | | download |
| StackLLaMA | todo | EN | | | | | |

MLX SFT format conversion

Download the SFT-format JSON data from hf-mirror.com.

Conversion script:


import json

class sft_data():
    """Convert a JSON SFT dataset into MLX-style JSONL: one {"text": ...} object per line."""
    def __init__(self, file_name, save_name):
        with open(file_name, "r", encoding="utf-8") as f:
            self.data = json.load(f)
        with open(save_name, "w", encoding="utf-8") as new_file:
            for sample in self.data:
                one_text = self.mlx_train_text(sample)
                # ensure_ascii=False keeps Chinese text readable in the output
                new_file.write(json.dumps(one_text, ensure_ascii=False) + "\n")

    def mlx_train_text(self, one_dic):
        # The medical-o1 samples carry exactly these three fields
        Question = one_dic['Question']
        Complex_CoT = one_dic['Complex_CoT']
        Response = one_dic['Response']
        text = ("Please reason step by step:\n\nQuestion:" + Question
                + "\n\nLet's solve this step by step:\n" + Complex_CoT
                + "\n\nFinal Answer:" + Response)
        return {"text": text}

sft_data("medical_o1_sft_Chinese.json", "mlx_sft.jsonl")


The MLX SFT data format

Please reason step by step:
# blank line
Question:Question
# blank line
Let's solve this step by step:
Complex_CoT
# blank line
Final Answer:Response

In JSONL format:

{"text": <the formatted SFT string above>}
{"text": <the formatted SFT string above>}
{"text": <the formatted SFT string above>}
{"text": <the formatted SFT string above>}
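A quick sanity check on the generated JSONL (the demo file name here is made up) verifies that every line is a standalone JSON object carrying a `text` field:

```python
import json

def check_jsonl(path):
    """Verify every line parses as a JSON object with a 'text' field; return the sample count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            obj = json.loads(line)
            assert "text" in obj, f"line {n} is missing 'text'"
            count = n
    return count

# Hypothetical demo file in the format above
with open("demo_sft.jsonl", "w", encoding="utf-8") as f:
    for q in ("demo question 1", "demo question 2"):
        f.write(json.dumps({"text": "Please reason step by step:\n\nQuestion:" + q}) + "\n")

print(check_jsonl("demo_sft.jsonl"))  # → 2
```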

MLX CoT training: LoRA SFT fine-tuning

There is a training example on GitHub: https://github.com/jbarnes850/deepseek-r1-finetune

Training data: https://hf-mirror.com/datasets/FreedomIntelligence/medical-o1-reasoning-SFT

Original file format:
{ "Question": "the patient's specific medical problem", "Complex_CoT": "detailed step-by-step reasoning", "Response": "the final answer" }

Format after processing:

Please reason step by step:

Question: {the sample's Question field}

Let's solve this step by step:
{the sample's Complex_CoT field}

Final Answer: {the sample's Response field}

An example:

{
  "Question": "A 45-year-old patient presents with sudden onset chest pain, shortness of breath, and anxiety. The pain is described as sharp and worsens with deep breathing. What is the most likely diagnosis and what immediate tests should be ordered?",
  "Complex_CoT": "The patient's symptoms suggest possible acute coronary syndrome, pulmonary embolism, or pneumothorax. Given the sharp chest pain worsened by deep breathing, pulmonary embolism is a strong consideration. Immediate tests should include ECG, troponin, D-dimer, and chest X-ray.",
  "Response": "The most likely diagnosis is pulmonary embolism. Immediate tests should include ECG, troponin, D-dimer, and chest X-ray."
}

# After processing
Please reason step by step:

Question: A 45-year-old patient presents with sudden onset chest pain, shortness of breath, and anxiety. The pain is described as sharp and worsens with deep breathing. What is the most likely diagnosis and what immediate tests should be ordered?

Let's solve this step by step:
The patient's symptoms suggest possible acute coronary syndrome, pulmonary embolism, or pneumothorax. Given the sharp chest pain worsened by deep breathing, pulmonary embolism is a strong consideration. Immediate tests should include ECG, troponin, D-dimer, and chest X-ray.

Final Answer: The most likely diagnosis is pulmonary embolism. Immediate tests should include ECG, troponin, D-dimer, and chest X-ray.

After processing, the data is a Hugging Face Dataset object. Exporting it yields TEXT-style LoRA JSONL records, for example:

{
"text": "Please reason step by step:\n\nQuestion: {Question}\n\nLet's solve this step by step:\n{Complex_CoT}\n\nFinal Answer: {Response}"
}

Plain TEXT records, one per line.

Related: https://el.psy.congroo.com/wp-admin/post.php?post=983 — MLX data formats

The code below converts the SFT data above to JSONL; it has not been tested.

import os
from datasets import load_dataset

def prepare_dataset(tokenizer):
    """Prepare the medical reasoning dataset and export it to JSONL"""
    # Load the raw dataset
    dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")
    
    # Split dataset (5% for training, 1% for testing)
    dataset = dataset["train"].train_test_split(
        train_size=0.05, 
        test_size=0.01, 
        seed=42
    )

    # Define formatting function
    def format_instruction(sample):
        return f"""Please reason step by step:

Question: {sample['Question']}

Let's solve this step by step:
{sample['Complex_CoT']}

Final Answer: {sample['Response']}"""

    # Create formatted text datasets
    text_train = dataset["train"].map(
        lambda x: {"text": format_instruction(x)},
        remove_columns=dataset["train"].column_names,
        num_proc=os.cpu_count()
    )
    
    text_test = dataset["test"].map(
        lambda x: {"text": format_instruction(x)},
        remove_columns=dataset["test"].column_names,
        num_proc=os.cpu_count()
    )

    # Export to JSONL (the key addition)
    text_train.to_json(
        "medical_train.jsonl",
        orient="records",
        lines=True,
        force_ascii=False  # keep non-ASCII characters (e.g. Chinese) readable
    )
    
    text_test.to_json(
        "medical_test.jsonl",
        orient="records",
        lines=True,
        force_ascii=False
    )

    # Tokenization (unchanged from the original pipeline)
    train_dataset = text_train.map(
        lambda x: tokenizer(
            x["text"],
            truncation=True,
            padding="max_length",
            max_length=1024,
            return_tensors=None,
        ),
        remove_columns=["text"],
        num_proc=os.cpu_count()
    )

    print("\nJSONL files generated:")
    print(f"- medical_train.jsonl ({len(text_train)} samples)")
    print(f"- medical_test.jsonl ({len(text_test)} samples)")
    
    return train_dataset

MLX CLI training command: SFT with a supervised loss

mlx-cli train \
    --stage sft \                  # fine-tuning stage: SFT (supervised fine-tuning)
    --do_train \                   # run training
    --model_name_or_path /path/to/pretrained/model \  # path to the pretrained model
    --dataset your_dataset_name \  # name or path of the SFT dataset
    --finetuning_type lora \       # use LoRA fine-tuning
    --output_dir ./output \        # output directory
    --learning_rate 5e-5 \         # learning rate
    --num_train_epochs 3 \         # number of training epochs
    --per_device_train_batch_size 8 \  # per-device training batch size
    --loss_function cross_entropy  # cross-entropy loss

In this command, the --loss_function parameter specifies the supervision objective, making the training explicitly supervised.

The earlier quick MLX LoRA fine-tune

LoRA fine-tuning also works by feeding the SFT data in directly, without specifying a supervision objective:

mlx_lm.lora --model ../../qwen2.5-0.5B --train --data ./data
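mlx_lm.lora expects the --data directory to contain train.jsonl and valid.jsonl files of {"text": ...} records. A sketch of splitting the converted file into that pair (the split fraction and file names are assumptions):

```python
import os
import random

def split_jsonl(src, out_dir="data", valid_frac=0.1, seed=42):
    """Split one JSONL file into the train.jsonl/valid.jsonl pair
    that mlx_lm.lora looks for inside its --data directory."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)  # shuffle reproducibly before splitting
    n_valid = max(1, int(len(lines) * valid_frac))
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "valid.jsonl"), "w", encoding="utf-8") as f:
        f.writelines(lines[:n_valid])
    with open(os.path.join(out_dir, "train.jsonl"), "w", encoding="utf-8") as f:
        f.writelines(lines[n_valid:])
    return len(lines) - n_valid, n_valid
```

For example, `split_jsonl("mlx_sft.jsonl")` should produce ./data/train.jsonl and ./data/valid.jsonl, which the mlx_lm.lora command above can then consume.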