python, data-acquisition

How to separate data acquisition, processing, and visualization properly in Python?


I am working on a project where I want to perform data acquisition, data processing, and GUI visualization (using PyQt with pyqtgraph), all in Python. Each part is implemented in principle, but the parts are not well separated, which makes it difficult to benchmark and improve performance. So the question is:

Is there a good way to pass large amounts of data between the different parts of a program?

To put the scenario in concrete terms: by "large amounts of data" I mean arrays of roughly 2 million 16-bit samples per second that need to be processed and possibly also stored.

Is there any framework for Python that I can use to handle this amount of data properly? Perhaps in the form of a data server that I can connect to.


Solution

  • How much data?

    In other words, are you acquiring so much data that you cannot keep all of it in memory while you need it?

    For example, some measurements generate so much data that the only way to process it is after the fact:

    1. Acquire the data to storage (usually RAID0)
    2. Post-process the data
    3. Analyze the results
    4. Select and archive subsets

    Small Data

    If your computer system is able to keep pace with the generation of data, you can use a separate Python queue between each stage.
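    As a minimal sketch of this idea, each stage boundary gets its own `queue.Queue`, with one thread per stage. The stage names and the doubling "processing" step are illustrative placeholders, not part of the original answer:

    ```python
    import queue
    import threading

    # One queue per stage boundary: acquisition -> processing.
    acq_to_proc = queue.Queue()

    def acquire(n):
        # Stand-in for the real acquisition loop.
        for i in range(n):
            acq_to_proc.put(i)
        acq_to_proc.put(None)  # sentinel: acquisition finished

    def process(results):
        # Consume samples until the sentinel arrives.
        while True:
            sample = acq_to_proc.get()
            if sample is None:
                break
            results.append(sample * 2)  # stand-in for real processing

    results = []
    t1 = threading.Thread(target=acquire, args=(5,))
    t2 = threading.Thread(target=process, args=(results,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    # results == [0, 2, 4, 6, 8]
    ```

    Because `queue.Queue` handles locking internally, the stages never share mutable state directly, which is what makes each stage easy to benchmark in isolation.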

    Big Data

    If your measurements are creating more data than your system can consume, then you should start by defining a few tiers (maybe just two) of how important your data is:

    One analogy might be a video stream...

    • lossless -- gold-masters for archival
    • lossy -- YouTube, Netflix, Hulu might drop a few frames, but your experience doesn't significantly suffer

    From your description, the Acquisition and Processing must be lossless, while the GUI/visualization can be lossy.

    For lossless data, you should use queues. For lossy data, you can use deques.
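    A small sketch of the contrast (the frame counts are arbitrary): a `queue.Queue` retains every item until a consumer takes it, while a bounded `collections.deque` silently discards the oldest items, which is acceptable for a GUI that only needs the most recent frames:

    ```python
    import queue
    from collections import deque

    # Lossless: a Queue keeps every sample until it is consumed.
    lossless = queue.Queue()

    # Lossy: a bounded deque drops the oldest entries once full.
    lossy = deque(maxlen=3)

    for frame in range(10):
        lossless.put(frame)
        lossy.append(frame)

    # The queue kept all ten frames...
    assert lossless.qsize() == 10
    # ...while the deque kept only the newest three.
    assert list(lossy) == [7, 8, 9]
    ```

    Note that appending to a full `deque(maxlen=...)` never blocks, so a slow GUI can never stall the acquisition or processing stages.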

    Design

    Regardless of your data container, here are three different ways to connect your stages:

    1. Producer-Consumer: P-C mimics a FIFO -- one actor generates data and another consumes it. You can build a chain of producers/consumers to accomplish your goal.
    2. Observer: While P-C is typically one-to-one, the observer pattern can also be one-to-many. If you need multiple actors to react when one source changes, the observer pattern can give you that capability.
    3. Mediator: Mediators are usually many-to-many. If each actor can cause the others to react, then all of them can coordinate through the mediator.

    It seems like you just need a one-to-one relationship between each stage, so a producer-consumer design looks like it will suit your application.
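
    Such a chain of producer-consumer stages could be sketched as follows. The generic `stage` helper, the sentinel-based shutdown, and the two placeholder transforms are my assumptions for illustration; the real pipeline would plug in the actual acquisition, processing, and visualization code:

    ```python
    import queue
    import threading

    SENTINEL = object()  # signals end-of-stream down the chain

    def stage(inbox, outbox, func):
        # Generic producer-consumer stage: consume from inbox, apply func,
        # produce to outbox, and forward the sentinel when the stream ends.
        while True:
            item = inbox.get()
            if item is SENTINEL:
                outbox.put(SENTINEL)
                break
            outbox.put(func(item))

    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()

    # Two chained stages with placeholder transforms.
    threads = [
        threading.Thread(target=stage, args=(q1, q2, lambda x: x + 1)),
        threading.Thread(target=stage, args=(q2, q3, lambda x: x * 10)),
    ]
    for t in threads:
        t.start()

    for sample in [1, 2, 3]:
        q1.put(sample)
    q1.put(SENTINEL)
    for t in threads:
        t.join()

    out = []
    while True:
        item = q3.get()
        if item is SENTINEL:
            break
        out.append(item)
    # out == [20, 30, 40]
    ```

    Each stage only knows about its inbox and outbox, so a stage can be timed, replaced, or moved to its own process without touching the others.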