I’m building a Django web application to store documents and their associated metadata.
The bulk of the metadata will be stored in the underlying MySQL database, with the OCR’d document text indexed in Elasticsearch to enable full-text search. I’ve incorporated django-elasticsearch-dsl to connect and synchronize my data models, as I’m also indexing (and thus, double-storing) a few other fields found in my models. I had considered using Haystack, but it lacks support for the latest Elasticsearch versions.
When a document is uploaded via the application’s admin interface, a post_save signal triggers an asynchronous Celery background task that performs the OCR and ultimately indexes the extracted text into Elasticsearch.
Since I don’t have a full-text field defined in my model (and hope to avoid adding one, as I don’t want to store or search against CLOBs in the database), I’m looking for the best practice for updating my Elasticsearch documents from my tasks.py file. There doesn’t seem to be a way to do so using django-elasticsearch-dsl (but maybe I’m wrong?), so I’m wondering if I should either:
1. Try to interface with Elasticsearch via REST using the sister django-elasticsearch-dsl-drf package.
2. More loosely integrate my application with Elasticsearch by using the more vanilla elasticsearch-dsl-py package (based on elasticsearch-py). I’d lose some “luxury” with this approach, as I’d have to write a bit more integration code, at least if I want to wire up my models with signals.
Is there a best practice? Or another approach I haven’t considered?
Update 1: In trying to implement the answer from @Nielk, I'm able to persist the OCR'd text (result = "test" in tasks.py below) into Elasticsearch, but it's also being persisted in the MySQL database. I'm still confused about how to configure Submission.rawtext as a pass-through to Elasticsearch.
models.py:
class Submission(models.Model):
    rawtext = models.TextField(null=True, blank=True)
    ...

    def type_to_string(self):
        return ""
documents.py:
@registry.register_document
class SubmissionDocument(Document):
    rawtext = fields.TextField(attr="type_to_string")

    def prepare_rawtext(self, instance):
        # self.rawtext = None
        # instance.rawtext = "test"
        return instance.rawtext
    ...
tasks.py (called on Submission model post_save signal):
@shared_task
def process_ocr(my_uuid):
    result = "test"  # will ultimately be OCR'd text
    instance = Submission.objects.get(my_uuid=my_uuid)
    instance.rawtext = result
    instance.save()
Update 2 (Working Solution):
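models.py
from django.db import models

class Submission(models.Model):
    @property
    def rawtext(self):
        # a value set locally takes precedence over what is in the index
        if getattr(self, '_rawtext_local_change', False):
            return self._rawtext
        if not self.pk:
            return None
        # imported here to avoid a circular import with documents.py
        from .documents import SubmissionDocument
        try:
            return SubmissionDocument.get(id=self.pk).rawtext
        except Exception:
            # e.g. the document has not been indexed yet
            return None

    @rawtext.setter
    def rawtext(self, value):
        self._rawtext_local_change = True
        self._rawtext = value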
documents.py
@registry.register_document
class SubmissionDocument(Document):
    rawtext = fields.TextField()

    def prepare_rawtext(self, instance):
        return instance.rawtext
tasks.py
@shared_task
def process_ocr(my_uuid):
    result = "test"  # will ultimately be OCR'd text
    # note that you must do a save on property fields, can't do an update
    instance = Submission.objects.get(my_uuid=my_uuid)
    instance.rawtext = result
    instance.save()
You can add extra fields to the document definition linked to your model (see the attr="type_to_string" example in the documentation: https://django-elasticsearch-dsl.readthedocs.io/en/latest/fields.html#using-different-attributes-for-model-fields) and combine this with a 'prepare_xxx' method that returns an empty string when the instance is created and its current value on an update. Would that solve your problem?
Edit 1 - Here's what I meant:
models.py
class Submission(models.Model):
    @property
    def rawtext(self):
        # a value set locally takes precedence over what is in the index
        if getattr(self, '_rawtext_local_change', False):
            return self._rawtext
        if not self.pk:
            return None
        # imported here to avoid a circular import with documents.py
        from .documents import SubmissionDocument
        return SubmissionDocument.get(id=self.pk).rawtext

    @rawtext.setter
    def rawtext(self, value):
        self._rawtext_local_change = True
        self._rawtext = value
Edit 2 - fixed code typo
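For anyone who wants to try the property-based pass-through without a running Django project or Elasticsearch cluster, here is a minimal plain-Python sketch of the same idea. Note the assumptions: FakeIndex and INDEX are hypothetical in-memory stand-ins for the Elasticsearch index, this Submission is an ordinary class rather than a Django model, and save() only imitates the re-indexing that django-elasticsearch-dsl would normally trigger through its signal handling. It illustrates the read/write flow only and is not the library's actual behaviour.

```python
class FakeIndex:
    """Hypothetical in-memory stand-in for the Elasticsearch index."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, body):
        # Store a copy of the document body under its id
        self._docs[doc_id] = dict(body)

    def get(self, doc_id):
        # Raises KeyError when the document is missing,
        # loosely imitating a "not found" error from the search backend
        return self._docs[doc_id]


INDEX = FakeIndex()


class Submission:
    """Plain-Python stand-in for the model; only the pass-through is modelled."""

    def __init__(self, pk=None):
        self.pk = pk

    @property
    def rawtext(self):
        # A value set on this instance wins over whatever is in the index
        if getattr(self, "_rawtext_local_change", False):
            return self._rawtext
        # Unsaved instances have nothing to look up
        if self.pk is None:
            return None
        try:
            return INDEX.get(self.pk)["rawtext"]
        except KeyError:
            # Not indexed yet
            return None

    @rawtext.setter
    def rawtext(self, value):
        self._rawtext_local_change = True
        self._rawtext = value

    def save(self):
        # Assumption: stands in for model.save() plus the re-index step
        INDEX.index(self.pk, {"rawtext": self.rawtext})


# Usage: the text never lives on a stored model field, only in the index
submission = Submission(pk=1)
submission.rawtext = "test"
submission.save()

reloaded = Submission(pk=1)  # fresh instance, no local change
print(reloaded.rawtext)
```

The point of the local-change flag is that a value assigned in the task is what gets handed to the index on save, while a freshly loaded instance falls back to reading from the index instead of a database column.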