
Generating barcodes for an 18mm x 100mm label printer

Printing barcodes on 18mm tape with a Brother label printer.

Uses JsBarcode.all.min.js to generate the barcode.

Settings for a 12mm-tall barcode, 80mm long: height 12, bar width 2, fontSize 3;
label text 4mm tall, placed 1mm from the bottom edge.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
    <script type="text/javascript" src="https://code.jquery.com/jquery-2.1.3.min.js"></script>
    <script src="JsBarcode.all.min.js"></script>
</head>
<body>
    
</body>
<script>
function createPrintPage(ht_code) {
  // Create a hidden iframe to print from
  const iframe = document.createElement('iframe');
  iframe.style.display = 'none';
  document.body.appendChild(iframe);

  // Write the print page into the iframe
  const iframeContent = iframe.contentWindow.document;
  iframeContent.open();
  iframeContent.write(`
  <html>
    <head>
      <style>
        @media print {
          @page {
            size: 100mm 18mm;
            margin: 0;
          }
          body {
            margin: 0;
            padding: 0;
            width: 100mm;
            height: 18mm;
          }
          .barcode-container {
            width: 100%;
            height: 100%;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
          }
          .barcode {
            width: 80mm;
            height: 12mm;
          }
          .text {
            position: absolute;
            font-size: 4mm;
            text-align: center;
            width: 100%;
            bottom: 1mm;
            color: black;
          }
        }
      </style>
    </head>
    <body>
      <div class="barcode-container">
        <canvas id="barcode" class="barcode"></canvas>
        <div class="text" id="barcode-text">Sample Barcode</div>
      </div>
    </body>
    </html>
  `);
  iframeContent.close();
  // Generate the barcode with JsBarcode
  const barcodeData = ht_code; // barcode payload
  const barcodeElement = iframeContent.getElementById('barcode');
  const barcodeTextElement = iframeContent.getElementById('barcode-text');
  JsBarcode(barcodeElement, barcodeData, {
    format: "CODE128", // CODE128 symbology
    fontSize: 3,
    margin: 0,
    height: 12,
    width: 2,
    displayValue: false // hide JsBarcode's own label; we render our own text below
  });
  // Show the encoded value under the bars
  barcodeTextElement.textContent = barcodeData;
  // Trigger the print dialog
  iframe.contentWindow.print();
}

// Generate the page and print it
createPrintPage("HTA101WDCX0320250024");
</script>
</html>

AI SFT LoRA data fine-tuning

A: Data acquisition from literature. Use Kimi or another AI to process PDF documents into input/output pairs; if a chain of thought is included, the pairs become full SFT samples.

B: Data collection from the public internet. Scrape public data and post-process it into the format you need; an AI can also be used for this processing.

Fine-tuning the model with LoRA (SFT) on the data from A and B completes the project.
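As a minimal sketch of step A (the sample content and helper name are hypothetical; only the Question/Complex_CoT/Response field names come from the datasets referenced below), AI-extracted input/output pairs can be stored as SFT records like this:

```python
import json

# Hypothetical triples an AI assistant extracted from a PDF:
# (question, chain of thought, final answer).
extracted = [
    ("What does pulse diagnosis assess?",
     "The pulse is read at the wrist for rate, rhythm and strength, "
     "which TCM maps to qi, blood and organ function.",
     "It assesses the state of qi, blood and organ function."),
]

def to_sft_record(question, cot, answer):
    # Same three-field shape as medical-o1-reasoning-SFT.
    record = {"Question": question, "Response": answer}
    if cot:  # the chain of thought is optional
        record["Complex_CoT"] = cot
    return record

records = [to_sft_record(*t) for t in extracted]
print(json.dumps(records, ensure_ascii=False, indent=2))
```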

Example:

Traditional Chinese medicine (TCM): fine-tune on a knowledge framework from textbooks plus case data.

First, a TCM fine-tuning example: https://github.com/Zlasejd/HuangDI

TCM SFT LoRA data: SylvanL/Traditional-Chinese-Medicine-Dataset-SFT · Datasets at HF Mirror

Open-source SFT datasets

DeepSeek-R1 SFT data: Congliu/Chinese-DeepSeek-R1-Distill-data-110k

https://hf-mirror.com/datasets/Congliu/Chinese-DeepSeek-R1-Distill-data-110k

GitHub – chaoswork/sft_datasets: a curated collection of open-source SFT datasets, updated over time

Medical o1 SFT: https://hf-mirror.com/datasets/FreedomIntelligence/medical-o1-reasoning-SFT

| Dataset | Size | Lang | Task | Gen | Type | Source | Link |
| belle_cn | 1079517 | CN | TS/MT | SI | general instructions, math reasoning, dialogue | text-davinci-003 | download |
| firefly | 1649398 | CN | MT | COL | 23 NLP tasks | collected Chinese datasets, hand-written instruction templates | download |
| GAOKAO | 2785 | CN | MT | COL | multiple-choice and fill-in-the-blank questions from the Gaokao exam | collection of human-annotated datasets | download |
| COIG | 298428 | CN | MT | COL | exam, translation, and human-value instructions | dataset collection; knowledge-graph-based counterfactual dialogue via automated tools + human verification | download |
| pCLUE | 1200705 | CN | MT | | 73 prompts; 9 NLP tasks incl. classification, inference, keyword recognition, reading comprehension | | download |
| CSL | 396209 | CN | MT | | metadata of 400k Chinese papers, 26 prompts | | download |
| CNewSum | 304307 | CN | TS | | Chinese summarization dataset released by ByteDance and UCSB | | download |
| Coco-cn | | CN | TS | | image-text multimodal | | download |
| news_commentary | 69200 | EN/CN | TS | | Chinese-English translation data | | download |
| Chain of Thought | 74771 | EN/CN | MT | HG | CoT-related tasks | humans annotating CoT on existing datasets | download |
| HC3 | 37175 | EN/CN | TS | MIX | dialogue evaluation | gpt-3.5 or human | download |
| instinwild | 52191 | EN/CN | MT | SI | generation, open-domain QA, brainstorming | text-davinci-003 | download |
| Alpaca_GPT4 | 52002 | EN/CN | MT | SI | general instructions | Alpaca data generated by GPT-4 | download |
| MOSS | 1583595 | EN/CN | | SI | | | download |
| LLMZoo | | ML | | | | | download |
| Guanaco | 534610 | ML | MT | SI | various NLP tasks | text-davinci-003 | download |
| Natural Instructions | 5040134 | ML | MT | COL | various NLP tasks | collection of human-annotated datasets | download |
| xP3 | 78883588 | ML | MT | COL | various NLP tasks | collection of human-annotated datasets | download |
| alpaca | 52002 | EN | MT | SI | general instructions | text-davinci-003 | download |
| GPT4all | 806199 | EN | MT | COL | code, stories, dialogue | distilled from GPT-3.5-turbo | download |
| GPTeacher | 29013 | EN | MT | SI | general, role-play, and tool instructions | GPT-4 & toolformer | download |
| prosocial dialog | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions, humans reply | download |
| finance_en | 68912 | EN | TS | COL | financial QA | GPT-3.5 | download |
| instruct | 888969 | EN | MT | COL | augmentation of GPT4All, Alpaca, and open-source datasets | NLP augmentation tools from AllenAI | download |
| Code Alpaca | 20022 | EN | SI | SI | code generation, editing, optimization | text-davinci-003 | download |
| webGPT | 18994 | EN | TS | MIX | information-retrieval QA | fine-tuned GPT-3 + human evaluation | download |
| dolly 2.0 | 15015 | EN | TS | HG | seven task types: open/closed QA, information extraction, summarization, open-ended ideation, classification, creative writing | human-annotated | download |
| baize | 653699 | EN | MT | COL | Alpaca plus various QA tasks | collection of human-annotated datasets | download |
| hh-rlhf | 284517 | EN | TS | MIX | dialogue | RLHF models | download |
| OIG(part) | 49237 | EN | MT | COL | various NLP tasks | collection and augmentation of human-annotated datasets | download |
| camel | 760620 | EN | MT | SI | role-play dialogue in physics, biology, chemistry, programming, math, society, etc. | collection of human-annotated datasets; generated with gpt-3.5-turbo | download |
| FLAN-Muffin | 1764800 | EN | MT | COL | 60 NLP tasks | collection of human-annotated datasets | download |
| GPT4Tools | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | download |
| ShareChat | 1663241 | EN | MT | MIX | general instruct | collected from ShareGPT | download |
| Auto CoT | | EN | | | | | download |
| ultrachat | 28247446 | EN | | | | | download |
| StackLLaMA | todo | EN | | | | | |

MLX SFT format conversion

Download the SFT-format JSON data from hf-mirror.com.

Conversion script:


import json

class sft_data():
    """Convert a JSON SFT dataset into MLX-style JSONL: one {"text": ...} object per line."""
    def __init__(self, file_name, save_name):
        with open(file_name, "r", encoding="utf-8") as f:
            self.data = json.load(f)
        with open(save_name, "w", encoding="utf-8") as new_file:
            for sample in self.data:
                one_text = self.mlx_train_text(sample)
                # ensure_ascii=False keeps Chinese text readable in the output
                new_file.write(json.dumps(one_text, ensure_ascii=False) + "\n")

    def mlx_train_text(self, one_dic):
        # The medical-o1 samples carry exactly these three fields
        Question = one_dic['Question']
        Complex_CoT = one_dic['Complex_CoT']
        Response = one_dic['Response']
        text = ("Please reason step by step:\n\nQuestion:" + Question
                + "\n\nLet's solve this step by step:\n" + Complex_CoT
                + "\n\nFinal Answer:" + Response)
        return {"text": text}

sft_data("medical_o1_sft_Chinese.json", "mlx_sft.jsonl")


The MLX SFT data format

Please reason step by step:
# blank line
Question:Question
# blank line
Let's solve this step by step:
Complex_CoT
# blank line
Final Answer:Response

In JSONL format:

{"text": <the formatted SFT string above>}
{"text": <the formatted SFT string above>}
{"text": <the formatted SFT string above>}
{"text": <the formatted SFT string above>}
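A quick sanity check on the generated JSONL (the demo file name here is made up) verifies that every line is a standalone JSON object carrying a `text` field:

```python
import json

def check_jsonl(path):
    """Verify every line parses as a JSON object with a 'text' field; return the sample count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            obj = json.loads(line)
            assert "text" in obj, f"line {n} is missing 'text'"
            count = n
    return count

# Hypothetical demo file in the format above
with open("demo_sft.jsonl", "w", encoding="utf-8") as f:
    for q in ("demo question 1", "demo question 2"):
        f.write(json.dumps({"text": "Please reason step by step:\n\nQuestion:" + q}) + "\n")

print(check_jsonl("demo_sft.jsonl"))  # → 2
```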

MLX CoT training: LoRA SFT fine-tuning

There is a training example on GitHub: https://github.com/jbarnes850/deepseek-r1-finetune

Training data: https://hf-mirror.com/datasets/FreedomIntelligence/medical-o1-reasoning-SFT

Original file format:
{ "Question": "the patient's specific medical problem", "Complex_CoT": "detailed step-by-step reasoning", "Response": "the final answer" }

Format after processing:

Please reason step by step:

Question: {the sample's Question field}

Let's solve this step by step:
{the sample's Complex_CoT field}

Final Answer: {the sample's Response field}

An example:

{
  "Question": "A 45-year-old patient presents with sudden onset chest pain, shortness of breath, and anxiety. The pain is described as sharp and worsens with deep breathing. What is the most likely diagnosis and what immediate tests should be ordered?",
  "Complex_CoT": "The patient's symptoms suggest possible acute coronary syndrome, pulmonary embolism, or pneumothorax. Given the sharp chest pain worsened by deep breathing, pulmonary embolism is a strong consideration. Immediate tests should include ECG, troponin, D-dimer, and chest X-ray.",
  "Response": "The most likely diagnosis is pulmonary embolism. Immediate tests should include ECG, troponin, D-dimer, and chest X-ray."
}

# After processing
Please reason step by step:

Question: A 45-year-old patient presents with sudden onset chest pain, shortness of breath, and anxiety. The pain is described as sharp and worsens with deep breathing. What is the most likely diagnosis and what immediate tests should be ordered?

Let's solve this step by step:
The patient's symptoms suggest possible acute coronary syndrome, pulmonary embolism, or pneumothorax. Given the sharp chest pain worsened by deep breathing, pulmonary embolism is a strong consideration. Immediate tests should include ECG, troponin, D-dimer, and chest X-ray.

Final Answer: The most likely diagnosis is pulmonary embolism. Immediate tests should include ECG, troponin, D-dimer, and chest X-ray.

After processing, the data is a Hugging Face Dataset object. Exporting it yields TEXT-style LoRA JSONL records, for example:

{
"text": "Please reason step by step:\n\nQuestion: {Question}\n\nLet's solve this step by step:\n{Complex_CoT}\n\nFinal Answer: {Response}"
}

Plain TEXT records, one per line.

Related: https://el.psy.congroo.com/wp-admin/post.php?post=983 — MLX data formats

The code below converts the SFT data above to JSONL; it has not been tested.

import os
from datasets import load_dataset

def prepare_dataset(tokenizer):
    """Prepare the medical reasoning dataset and export it to JSONL"""
    # Load the raw dataset
    dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")
    
    # Split dataset (5% for training, 1% for testing)
    dataset = dataset["train"].train_test_split(
        train_size=0.05, 
        test_size=0.01, 
        seed=42
    )

    # Define formatting function
    def format_instruction(sample):
        return f"""Please reason step by step:

Question: {sample['Question']}

Let's solve this step by step:
{sample['Complex_CoT']}

Final Answer: {sample['Response']}"""

    # Create formatted text datasets
    text_train = dataset["train"].map(
        lambda x: {"text": format_instruction(x)},
        remove_columns=dataset["train"].column_names,
        num_proc=os.cpu_count()
    )
    
    text_test = dataset["test"].map(
        lambda x: {"text": format_instruction(x)},
        remove_columns=dataset["test"].column_names,
        num_proc=os.cpu_count()
    )

    # Export to JSONL (the key addition)
    text_train.to_json(
        "medical_train.jsonl",
        orient="records",
        lines=True,
        force_ascii=False  # keep non-ASCII characters (e.g. Chinese) readable
    )
    
    text_test.to_json(
        "medical_test.jsonl",
        orient="records",
        lines=True,
        force_ascii=False
    )

    # Tokenization (unchanged from the original pipeline)
    train_dataset = text_train.map(
        lambda x: tokenizer(
            x["text"],
            truncation=True,
            padding="max_length",
            max_length=1024,
            return_tensors=None,
        ),
        remove_columns=["text"],
        num_proc=os.cpu_count()
    )

    print("\nJSONL files generated:")
    print(f"- medical_train.jsonl ({len(text_train)} samples)")
    print(f"- medical_test.jsonl ({len(text_test)} samples)")
    
    return train_dataset

MLX CLI training command: SFT with a supervised loss

mlx-cli train \
    --stage sft \                  # fine-tuning stage: SFT (supervised fine-tuning)
    --do_train \                   # run training
    --model_name_or_path /path/to/pretrained/model \  # path to the pretrained model
    --dataset your_dataset_name \  # name or path of the SFT dataset
    --finetuning_type lora \       # use LoRA fine-tuning
    --output_dir ./output \        # output directory
    --learning_rate 5e-5 \         # learning rate
    --num_train_epochs 3 \         # number of training epochs
    --per_device_train_batch_size 8 \  # per-device training batch size
    --loss_function cross_entropy  # cross-entropy loss

In this command, the --loss_function parameter specifies the supervision objective, making the training explicitly supervised.

The earlier quick MLX LoRA fine-tune

LoRA fine-tuning also works by feeding the SFT data in directly, without specifying a supervision objective:

mlx_lm.lora --model ../../qwen2.5-0.5B --train --data ./data
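mlx_lm.lora expects the --data directory to contain train.jsonl and valid.jsonl files of {"text": ...} records. A sketch of splitting the converted file into that pair (the split fraction and file names are assumptions):

```python
import os
import random

def split_jsonl(src, out_dir="data", valid_frac=0.1, seed=42):
    """Split one JSONL file into the train.jsonl/valid.jsonl pair
    that mlx_lm.lora looks for inside its --data directory."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)  # shuffle reproducibly before splitting
    n_valid = max(1, int(len(lines) * valid_frac))
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "valid.jsonl"), "w", encoding="utf-8") as f:
        f.writelines(lines[:n_valid])
    with open(os.path.join(out_dir, "train.jsonl"), "w", encoding="utf-8") as f:
        f.writelines(lines[n_valid:])
    return len(lines) - n_valid, n_valid
```

For example, `split_jsonl("mlx_sft.jsonl")` should produce ./data/train.jsonl and ./data/valid.jsonl, which the mlx_lm.lora command above can then consume.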