Setting up an architecture for analysis flows of time-series data from multiple sources [closed]

What would be the best practice in terms of AWS for the following:

  • Many IoT medical devices gather data at around 256 kB/s
  • The data is time-series data (a matrix of [Channels × Samples]; there can be millions of samples and dozens of channels)
  • Data is saved as files in S3, and each session is logged with some metadata in a database. So far we are using RDS for this.
  • Each dataset is around 5GB
  • We have access to the datasets and would like to run some analysis flow:
    • Access the data file
    • Analysis step:
      • Execute code (version managed) that accepts the data file and produces a result (another file or a JSON)
      • Register the analysis step in some database (which?) and register the result (if a file is produced, register its location)
    • Perform N more analysis steps in a similar manner. Analysis steps can depend on each other, but can also be parallel.
    • The result of the N’th step is the end result of the analysis flow.
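The steps above form a directed acyclic graph: some steps depend on earlier results, others can run in parallel. A minimal sketch of how an execution agent could resolve a valid run order, using only the standard library (the step names are made up for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical flow: step name -> set of steps it depends on.
flow = {
    "filter": set(),                              # no dependencies, runs first
    "label_artifacts": {"filter"},                # needs filtered data
    "statistics": {"filter"},                     # can run in parallel with labelling
    "report": {"label_artifacts", "statistics"},  # the flow's end result
}

def execution_order(flow):
    """Return the step names in an order that respects dependencies."""
    return list(TopologicalSorter(flow).static_order())

print(execution_order(flow))
```

Steps that appear between the same "layers" of the ordering (here `label_artifacts` and `statistics`) have no mutual dependency and could be dispatched to agents concurrently.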

The idea is to provide an easy way to run code on data in AWS without actually downloading the files, and to keep a log of what analysis was performed on the data.
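One way to avoid materialising a full 5 GB dataset locally is for step code to consume the file as a stream, chunk by chunk. The sketch below works on any file-like object; the same pattern applies to the streaming body returned by `boto3`'s S3 `get_object` (the chunk size and the per-chunk work here are illustrative placeholders):

```python
import io

def process_in_chunks(stream, chunk_size=1 << 20):
    """Consume a file-like object chunk by chunk without
    holding the whole dataset in memory at once."""
    total_bytes = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total_bytes += len(chunk)  # replace with real per-chunk analysis
    return total_bytes

# Stand-in for an S3 streaming body:
fake_s3_body = io.BytesIO(b"\x00" * 3_000_000)
print(process_in_chunks(fake_s3_body))  # 3000000
```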

Any ideas which services and databases to use? How to pass the data around?
What would be an easy-to-use interface for a data scientist who works with Python, for example?

I have the following idea in mind:

  • Analysis steps are version-managed code repos in CodeCommit (and can be packaged as containers)
  • Data scientists define flows (in JSON format)
  • When a data scientist triggers a run, their flow is executed
  • The flow is registered as an entry in a database
  • A flow manager distributes the flows between execution agents
  • An agent is a mechanism that gets the flow, pulls the data and containers, and executes the flow
  • Each agent registers each step in the flow into a database
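A JSON flow definition for this scheme could look like the sketch below (the schema and field names are invented for illustration): each step names its container image and upstream dependencies, which is what the flow manager needs to schedule steps and register them in a database.

```python
import json

# Illustrative flow a data scientist might submit (schema is hypothetical):
flow_json = """
{
  "flow_id": "session-analysis-v1",
  "dataset": "s3://example-bucket/sessions/session-123.dat",
  "steps": [
    {"name": "filter",     "image": "repo/filter:1.2.0", "depends_on": []},
    {"name": "artifacts",  "image": "repo/label:0.9.1",  "depends_on": ["filter"]},
    {"name": "statistics", "image": "repo/stats:2.0.0",  "depends_on": ["filter"]},
    {"name": "report",     "image": "repo/report:1.0.0", "depends_on": ["artifacts", "statistics"]}
  ]
}
"""

flow = json.loads(flow_json)
step_names = {step["name"] for step in flow["steps"]}

# Basic validation the flow manager could run before scheduling:
for step in flow["steps"]:
    unknown = [dep for dep in step["depends_on"] if dep not in step_names]
    if unknown:
        raise ValueError(f"step {step['name']!r} depends on unknown steps {unknown}")

print(f"{len(flow['steps'])} steps validated")
```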

Examples of analysis steps:

  1. Filtering
  2. Labelling of artifacts in the data (timestamps)
  3. Calculation of statistical parameters
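As a concrete illustration of steps 1 and 3, here is a minimal pure-Python sketch of a filtering step and a statistics step operating on one channel (a real step would run vectorised code, e.g. NumPy, over the full channels × samples matrix; the toy data and function names are mine):

```python
from statistics import mean, stdev

def moving_average(samples, window=3):
    """Filtering step: smooth one channel with a simple moving average."""
    return [mean(samples[i:i + window])
            for i in range(len(samples) - window + 1)]

def channel_stats(samples):
    """Statistics step: summary parameters for one channel."""
    return {"mean": mean(samples), "std": stdev(samples),
            "min": min(samples), "max": max(samples)}

channel = [1.0, 2.0, 9.0, 2.0, 1.0]  # toy data for one channel
print(moving_average(channel))
print(channel_stats(channel))
```

Each such step would be packaged as its own container, read its input from S3, and write its result (a file or JSON like the dict above) back, registering the output location in the database.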

Answer

It sounds like you want to use Elastic MapReduce (EMR) to do the analysis – it’s AWS’s managed big-data service. You should be able to use EMR Notebooks for the analysis itself. For getting the data in, something like Kinesis would probably be the best fit. There’s also a whole range of IoT-specific services, but those are not my area of expertise.

This is quite a large, wide-open question – effectively you’re asking “how do I build a big data analytics platform?”, which is a complicated one! I’d suggest you read up on the services listed above and see if they meet your needs, or have your company reach out to AWS for professional services. It doesn’t have to cost a fortune!

Attribution
Source : Link , Question Author : sstbrg , Answer Author : shearn89