Getting entities from pre-saved DocBin

I have around 700k documents that I want to process with spaCy and save into a DocBin for later use.

I wrote code to do a keyword search using PhraseMatcher, and it worked great. I'm now trying to build a knowledge graph from the DocBin, but I can't seem to access the entities to use them in the graph logic. I read somewhere that DocBins don't keep that information (?), but when I print DocBin.tokens I get some values rather than an empty output.

This might be a very stupid question but I'm quite lost and the documentation does not seem to be detailed enough for this.

import spacy
from spacy.tokens import DocBin
nlp = spacy.load('fr_dep_news_trf')
DocBinPath = r'C:\[Redacted]\FRdocBin.nlp'
# DocBin() takes attrs, not a vocab; the vocab is only needed in get_docs()
loadedDocBin = DocBin().from_disk(DocBinPath)
DocList = list(loadedDocBin.get_docs(nlp.vocab))
for doc in DocList:
    # Unique person names found in this doc
    People = list(set([ent.text for ent in doc.ents if ent.label_ == 'PERSON']))

This doesn't produce any errors but doc.ents is empty.

This is the code for saving the DocBin:

FRdoc_bin = DocBin(store_user_data=True, attrs=['ENT_TYPE', 'LEMMA', 'LIKE_EMAIL', 'LIKE_URL', 'LIKE_NUM', 'ORTH', 'POS', 'HEAD', 'DEP'])
doc = frNLP(text)
FRdoc_bin.add(doc)
FRdoc_bin.to_disk(CreatedModelPath + r'\FRdocBin' + '.nlp')

2 Answers

If you want to use custom attrs, you need both ENT_IOB and ENT_TYPE for entities.

Are you sure that you need custom attrs in the first place? Have you customized the values for LIKE_URL or other lexical attrs? If not, the default attrs for DocBin should be fine.
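To illustrate the point about attrs, here is a minimal sketch of a DocBin round trip. It assumes spaCy v3 and builds a tiny Doc by hand, with made-up words and entity spans, so it runs without downloading a pipeline. The helper name `round_trip` is hypothetical, not part of spaCy.

```python
from spacy.tokens import Doc, DocBin, Span
from spacy.vocab import Vocab


def round_trip(doc, attrs=None):
    """Serialize one Doc through a DocBin and read it back (hypothetical helper)."""
    # attrs=None means "use DocBin's default attribute list"
    doc_bin = DocBin() if attrs is None else DocBin(attrs=attrs)
    doc_bin.add(doc)
    restored = DocBin().from_bytes(doc_bin.to_bytes())
    return next(iter(restored.get_docs(doc.vocab)))


# Hand-built example document; the entity annotations are set manually
# because no NER component is involved here.
vocab = Vocab()
words = ["Marie", "Curie", "a", "travaillé", "à", "Paris", "."]
doc = Doc(vocab, words=words)
doc.ents = [Span(doc, 0, 2, label="PER"), Span(doc, 5, 6, label="LOC")]

# Default attrs: entity annotations are included.
default_ents = [(e.text, e.label_) for e in round_trip(doc).ents]

# Custom attrs: list both ENT_IOB and ENT_TYPE to keep entities.
custom_doc = round_trip(doc, attrs=["ORTH", "ENT_IOB", "ENT_TYPE"])
custom_ents = [(e.text, e.label_) for e in custom_doc.ents]

print(default_ents)
print(custom_ents)
```

If the entities survive in both cases but are missing from your own DocBin, the attrs list (or the pipeline that produced the docs) is the likely place to look.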


Edit: I figured out the issue from the spaCy discussion: quite simply, the fr model I was using (fr_dep_news_trf) doesn't include an NER component. I switched to fr_core_news_lg and it worked :)
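A quick way to catch this before processing 700k documents is to check the pipeline's component names. The sketch below assumes spaCy v3 and uses a blank French pipeline as a stand-in so it runs without a model download; with a real pipeline you would pass the object returned by spacy.load instead. The helper name `has_ner` is hypothetical. As far as I know, the French pipelines also label people as PER rather than PERSON, so it may be worth inspecting nlp.get_pipe("ner").labels on the loaded model before filtering on a label.

```python
import spacy


def has_ner(nlp):
    """Return True if the pipeline contains a component named 'ner' (hypothetical helper)."""
    return "ner" in nlp.pipe_names


# Stand-in for a pipeline without NER: a blank French pipeline (tokenizer only).
blank = spacy.blank("fr")
print(blank.pipe_names, has_ner(blank))

# Adding an (untrained) NER component only to show the check flipping;
# a real pipeline would ship a trained one.
blank.add_pipe("ner")
print(blank.pipe_names, has_ner(blank))
```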
