LlamaIndex BM25 Experiment

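There is no substitute for hands-on practice and reading the official documentation ( https://developers.llamaindex.ai/python/examples/retrievers/bm25_retriever/ ). I lost a lot of time earlier because the AI answers I relied on were full of hallucinations. The code below ran successfully on 2026-01-07 (llama-index 0.14.10; llama-index-retrievers-bm25 0.6.5).
The main pitfall is persisting and reloading the BM25 retriever: save it with retriever.persist("./bm25_retriever") and load it with retriever = bm25.BM25Retriever.from_persist_dir("./bm25_retriever").
The AI kept insisting that persist/from_persist_dir no longer worked and told me to use pickle instead, but the BM25 implementation cannot be pickled in full. My next idea was to rebuild the BM25 retriever directly from the nodes on every start. However, BM25 on Chinese text needs jieba for word segmentation; jieba handles roughly 1 million characters per second and also has a startup cost. So nodes built from 500 articles of about 2,000 characters each (about 1 million characters in total) would need at least around 2 seconds to become a BM25 retriever object. The AI did real harm here; it was all hallucination.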
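A rough back-of-the-envelope check of the 2-second figure, as a sketch only. It reads each article as about 2,000 characters and uses the 1 million characters per second rate quoted above. The 1-second startup cost is an assumption (jieba loads its dictionary lazily on first use, and the real cost depends on the machine and on whether its cache exists), and the helper name is made up for illustration.

```python
def estimate_rebuild_seconds(num_articles, chars_per_article,
                             chars_per_second, startup_seconds):
    """Estimate the time to re-tokenize a whole corpus from scratch."""
    total_chars = num_articles * chars_per_article
    return startup_seconds + total_chars / chars_per_second


estimate = estimate_rebuild_seconds(
    num_articles=500,
    chars_per_article=2_000,
    chars_per_second=1_000_000,  # throughput quoted in the text
    startup_seconds=1.0,         # ASSUMPTION: one-off jieba dictionary load
)
print(f"{estimate:.1f} s")  # 2.0 s
```

This only counts tokenization, so the real rebuild time would be somewhat higher once index construction is included, which is why loading a persisted retriever is attractive.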

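The persist/reload idea can be wrapped in a small "load if cached, otherwise build and save" helper. The sketch below is not part of LlamaIndex: load_or_build and the fake_* callables are hypothetical stand-ins so that it runs without llama-index installed. The comments show how the real retriever calls from this post would probably be plugged in.

```python
import os
import tempfile


def load_or_build(persist_dir, build, load, save):
    """Return (object, status): load from persist_dir if it exists, else build and save."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir), "loaded"
    obj = build()
    os.makedirs(persist_dir, exist_ok=True)
    save(obj, persist_dir)
    return obj, "built"


# Stand-in callables (hypothetical). With the real retriever they would roughly be:
#   build = lambda: bm25.BM25Retriever(nodes=new_nodes, similarity_top_k=10)
#   save  = lambda r, d: r.persist(d)
#   load  = bm25.BM25Retriever.from_persist_dir
def fake_build():
    return {"tokens": ["link", "camera"]}


def fake_save(obj, directory):
    with open(os.path.join(directory, "index.txt"), "w", encoding="utf-8") as f:
        f.write(" ".join(obj["tokens"]))


def fake_load(directory):
    with open(os.path.join(directory, "index.txt"), encoding="utf-8") as f:
        return {"tokens": f.read().split()}


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, "bm25_retriever")
    first, first_status = load_or_build(cache_dir, fake_build, fake_load, fake_save)
    second, second_status = load_or_build(cache_dir, fake_build, fake_load, fake_save)

print(first_status, second_status)  # built loaded
```

The first call pays the build cost once; later starts only pay the load cost, which avoids re-tokenizing the corpus every time.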
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers import bm25
import jieba
import time

# Load every file in the ./data directory.
documents = SimpleDirectoryReader("data").load_data()

print(documents)

# Split documents into chunks of up to 512 tokens with a 30-token overlap.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=30)

new_nodesx = splitter.get_nodes_from_documents(documents)

print(new_nodesx)

# Load a single extra file and attach custom metadata before splitting.
documents = SimpleDirectoryReader(input_files=["add.docx"]).load_data()

print(documents)

documents[0].metadata['MAC'] = 'adgfa-192ga'
documents[0].metadata['document_id'] = documents[0].id_

print(documents)
new_nodes = splitter.get_nodes_from_documents(documents)
print(new_nodes)

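class ChineseBM25Retriever(bm25.BM25Retriever):
    # Intended hook for Chinese word segmentation with jieba.
    # Unverified for 0.6.5: it only takes effect if BM25Retriever calls _tokenize.
    def _tokenize(self, text):
        return [w for w in jieba.cut(text) if w.strip()]

# Time how long it takes to build the retriever from the nodes.
start = time.time()
retriever = ChineseBM25Retriever(nodes=new_nodes, similarity_top_k=10)
print(time.time() - start, retriever)

# Alternative: build from the nodes of the ./data directory instead.
# start = time.time()
# retriever = ChineseBM25Retriever(nodes=new_nodesx, similarity_top_k=10)
# print(time.time() - start, retriever)

# Save the retriever to disk.
retriever.persist("./bm25_retriever")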


retrieved_nodes = retriever.retrieve("What is link?")
for node in retrieved_nodes:
    print(node)

# Drop the in-memory retriever and reload it from disk.
# Note: this returns a plain BM25Retriever, not ChineseBM25Retriever.
del retriever
retriever = bm25.BM25Retriever.from_persist_dir("./bm25_retriever")

print("Reload BM25 from disk")
retrieved_nodes = retriever.retrieve("What is link?")
for node in retrieved_nodes:
    print(node)

Output:

(base) EgoistdeMacBook-Pro:rag maysrp$ python doc.py
/opt/anaconda3/lib/python3.13/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
2026-01-07 22:47:06,574 - INFO - NumExpr defaulting to 12 threads.
[Document(id_='de9ebe13-a86b-4331-a6aa-8046de5263d1', embedding=None, metadata={'file_name': 'Suzhou.docx', 'file_path': '/Users/maysrp/rag/data/Suzhou.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10159, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Suzhou is Link’s best city in her live.', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}'), Document(id_='92e769b9-e293-48e2-a5ac-743a753b167a', embedding=None, metadata={'file_name': 'game.docx', 'file_path': '/Users/maysrp/rag/data/game.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10138, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Link love play Switch game!', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}'), Document(id_='7cc63b7a-19fb-4d27-b58d-266101e2c49f', embedding=None, metadata={'file_name': 'sutdent.docx', 'file_path': '/Users/maysrp/rag/data/sutdent.docx', 
'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10142, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Link is high school sutdent!', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}'), Document(id_='9d3ee148-060d-4752-9a31-504714eb43ae', embedding=None, metadata={'file_path': '/Users/maysrp/rag/data/test.txt', 'file_name': 'test.txt', 'file_type': 'text/plain', 'file_size': 25, 'creation_date': '2025-12-27', 'last_modified_date': '2025-10-08'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='i am a look!\nThanks alot\n', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}')]
[TextNode(id_='98437de2-9ad7-451d-aa2f-c72c778866f7', embedding=None, metadata={'file_name': 'Suzhou.docx', 'file_path': '/Users/maysrp/rag/data/Suzhou.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10159, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='de9ebe13-a86b-4331-a6aa-8046de5263d1', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_name': 'Suzhou.docx', 'file_path': '/Users/maysrp/rag/data/Suzhou.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10159, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, hash='22704cae161d866fe704741b55317fb4f2baf40fd456bfee0c030bd4a9f37e81')}, metadata_template='{key}: {value}', metadata_separator='\n', text='Suzhou is Link’s best city in her live.', mimetype='text/plain', start_char_idx=0, end_char_idx=39, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'), TextNode(id_='aebe5d8c-e8e0-4e91-8641-2287c045a062', embedding=None, metadata={'file_name': 'game.docx', 'file_path': '/Users/maysrp/rag/data/game.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10138, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: 
RelatedNodeInfo(node_id='92e769b9-e293-48e2-a5ac-743a753b167a', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_name': 'game.docx', 'file_path': '/Users/maysrp/rag/data/game.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10138, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, hash='a4ef473a6499b1e9a6d082cc6e859661785a05b99aeaf00c34acfb969565e11f')}, metadata_template='{key}: {value}', metadata_separator='\n', text='Link love play Switch game!', mimetype='text/plain', start_char_idx=0, end_char_idx=27, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'), TextNode(id_='dcf48ae6-1d63-4ccf-b21f-ba12852a59df', embedding=None, metadata={'file_name': 'sutdent.docx', 'file_path': '/Users/maysrp/rag/data/sutdent.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10142, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='7cc63b7a-19fb-4d27-b58d-266101e2c49f', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_name': 'sutdent.docx', 'file_path': '/Users/maysrp/rag/data/sutdent.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10142, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, hash='dce4f61a4f5f77866c8f3baa92c20997bad74b8ea4b00b6bdbcdca6b5fd240ce')}, metadata_template='{key}: {value}', metadata_separator='\n', text='Link is high school sutdent!', mimetype='text/plain', start_char_idx=0, end_char_idx=28, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'), 
TextNode(id_='affc51cf-137b-4306-a047-be08aa95ac62', embedding=None, metadata={'file_path': '/Users/maysrp/rag/data/test.txt', 'file_name': 'test.txt', 'file_type': 'text/plain', 'file_size': 25, 'creation_date': '2025-12-27', 'last_modified_date': '2025-10-08'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9d3ee148-060d-4752-9a31-504714eb43ae', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '/Users/maysrp/rag/data/test.txt', 'file_name': 'test.txt', 'file_type': 'text/plain', 'file_size': 25, 'creation_date': '2025-12-27', 'last_modified_date': '2025-10-08'}, hash='34192438b5e50e0927a1f41c8e40814c5c6e2078c7ab4ec91c732c9938944f50')}, metadata_template='{key}: {value}', metadata_separator='\n', text='i am a look!\nThanks alot', mimetype='text/plain', start_char_idx=0, end_char_idx=24, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}')]
[Document(id_='4221065b-b64d-4c8b-8807-94be61940f8e', embedding=None, metadata={'file_name': 'add.docx', 'file_path': 'add.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10121, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Link like  NIkon camera .he has a camre that is name is z8', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}')]
[Document(id_='4221065b-b64d-4c8b-8807-94be61940f8e', embedding=None, metadata={'file_name': 'add.docx', 'file_path': 'add.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10121, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03', 'MAC': 'adgfa-192ga', 'document_id': '4221065b-b64d-4c8b-8807-94be61940f8e'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Link like  NIkon camera .he has a camre that is name is z8', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}')]
[TextNode(id_='e225ff5c-48db-4e3e-a741-90194180af44', embedding=None, metadata={'file_name': 'add.docx', 'file_path': 'add.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10121, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03', 'MAC': 'adgfa-192ga', 'document_id': '4221065b-b64d-4c8b-8807-94be61940f8e'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='4221065b-b64d-4c8b-8807-94be61940f8e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_name': 'add.docx', 'file_path': 'add.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 10121, 'creation_date': '2026-01-03', 'last_modified_date': '2026-01-03', 'MAC': 'adgfa-192ga', 'document_id': '4221065b-b64d-4c8b-8807-94be61940f8e'}, hash='d907ec8b73619251ad897fc95f680465711bbec8cda67a3494f0d5e6e0e27686')}, metadata_template='{key}: {value}', metadata_separator='\n', text='Link like  NIkon camera .he has a camre that is name is z8', mimetype='text/plain', start_char_idx=0, end_char_idx=58, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}')]
2026-01-07 22:47:07,035 - DEBUG - Building index from IDs objects
2026-01-07 22:47:07,289 - WARNING - As bm25s.BM25 requires k less than or equal to number of nodes added. Overriding the value of similarity_top_k to number of nodes added.
0.267697811126709 <__main__.ChineseBM25Retriever object at 0x14b823620>
Finding newlines for mmindex: 100%|████████| 2.11k/2.11k [00:00<00:00, 19.1MB/s]
Node ID: e225ff5c-48db-4e3e-a741-90194180af44
Text: Link like  NIkon camera .he has a camre that is name is z8
Score:  0.115

Reload BM25 from disk
Node ID: e225ff5c-48db-4e3e-a741-90194180af44
Text: Link like  NIkon camera .he has a camre that is name is z8
Score:  0.115
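
The score of 0.115 is at least consistent with a Lucene-style BM25 formula, which can be checked by hand. This is a sketch under assumptions that were not confirmed against the library source here: k1 = 1.5 and b = 0.75, the index holds a single node (as the similarity_top_k warning in the log suggests), and only the term "link" matches, once. The function name and the doc_len value are made up for illustration; with one document, doc_len equals the average length, so its exact value does not matter.

```python
import math


def lucene_bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Score of one query term in one document, Lucene-style BM25."""
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: equals k1 when the document has average length.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + norm)


score = lucene_bm25_term_score(tf=1, df=1, n_docs=1, doc_len=10, avg_doc_len=10)
print(round(score, 3))  # 0.115
```

Here idf = ln(1 + 0.5 / 1.5), about 0.288, and the term-frequency part is 1 / (1 + 1.5) = 0.4, giving about 0.115. The identical score before and after reloading is what one would expect if the persisted index was restored intact.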
