Data Mill - A K8s-based infrastructure for analytics

Data What?

The availability of real-time data from sensors and processes lets businesses continuously monitor their assets, identify patterns in recurring failures, and recognize the warning signs needed to prevent them. This adds complexity that demands specific technologies and platforms for large-scale ingestion, persistent storage, and computation over diverse data sources, under a range of constraints and use cases.

So what's the problem?

Despite its enormous potential, Data Science is overhyped, let's face it. Driven by their ambitions, lots of companies started working on this topic, but only a few were really successful. The main barriers are the gap between stakeholders' expectations and the actual value delivered by models, and the lack of information about incoming data, in terms of both data quality and the processes producing it. Besides, analytics projects require a highly interdisciplinary team of system administrators, engineers, scientists, and domain experts. Succeeding takes a significant investment and a clear strategy.

Typically, projects are built on a so-called lambda architecture, which combines a streaming layer with a batch one.

This adds complexity. Continuous integration and deployment (CI/CD) can automate and greatly speed up the software development cycle through unit and integration tests and frequent releases. Data scientists, however, tend to follow a different workflow and often operate apart from the rest of the team. The result is information gaps, and unexpected behavior whenever the data they use or the models they produce change.
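To make the lambda architecture mentioned above more concrete, here is a minimal sketch in plain Python. It is not Data Mill code: the names `batch_view`, `SpeedView` and `serve`, and the sensor events, are hypothetical and only illustrate how a batch layer and a streaming layer are kept separate and merged at query time.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute per-sensor counts from the full history."""
    return Counter(event["sensor"] for event in master_events)


class SpeedView:
    """Streaming layer: count events not yet covered by a batch run."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        self.counts[event["sensor"]] += 1


def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch + speed.counts


# Hypothetical data: historical events plus two that just arrived.
history = [{"sensor": "pump-1"}, {"sensor": "pump-1"}, {"sensor": "pump-2"}]
speed = SpeedView()
for event in [{"sensor": "pump-2"}, {"sensor": "pump-3"}]:
    speed.update(event)

merged = serve(batch_view(history), speed)
print(dict(merged))
```

Even in this toy version the same counting logic exists twice, once per layer, which hints at where the extra complexity comes from.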

Waste of resources? The norm!

We have a solution:

Embracing DataOps practices

Scalable Data Processing

We set up a modular architecture based on Kubernetes that lets you take advantage of the best open source technologies for analytics. The architecture follows DataOps practices and provides everything for ingestion, persistent storage, distributed processing, data exploration, business intelligence dashboards, machine learning, model benchmarking and project management, as well as ML model serving and, of course, infrastructure monitoring. So just boot it up and start developing your application!

Innovative algorithms