How you ingest your data depends on the type of data and its purpose. For example, is it high complexity, high velocity or low complexity, low velocity? Are you depositing it into a staging layer to transform it before moving it, or are you ingesting it from a bunch of data vendors, trusting that it’s correct, and streaming it directly into your systems? (In which case, I admire your devil-may-care attitude.)
Here are questions you can ask to inform your data ingestion strategy:
- What quality parameters must we meet?
- What’s the complexity?
- What’s the velocity?
- What are the consumer needs?
- Are there standards regimes to follow? (for example, SOC 2, PCI DSS)
- Are there regulatory or compliance needs? (for example, HIPAA, GDPR, CCPA)
- Can this be automated?
The higher your need for data quality—here, or at any layer or location through which the data will pass—the greater your need for data ingestion observability. That is, the greater visibility you need into the quality of the data being ingested.
As we covered in the introduction, errors tend to cascade, and “garbage in” can quickly become “garbage everywhere.” Small efforts to clean up quality here will have a cumulative effect and save entire days or weeks of work.
As Fithrah Fauzan, Data Engineering Team Lead at Shipper, puts it, “It only takes one typo to change the schema and break the file. This used to happen for us on a weekly basis. Without IBM® Databand®, we often didn’t know that we had a problem until two or three days later, after which we’d have to backfill the data.”
What’s more, when you can observe your ingested data, you can set rules to automatically clean it, and ensure it stays clean. So, if a data vendor changes their API, you catch the schema change. And if there are human errors in your database that have caused some exceedingly wonky values, you notice—and pause—and address them.
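As an illustration, here is a minimal sketch of the kind of check an ingestion-time observability rule might run: catch a schema change from a vendor and flag wildly out-of-range values. The record structure, expected schema, and bounds are hypothetical examples, not a prescribed implementation; a real observability tool would manage these rules for you.

```python
# Minimal ingestion checks: schema drift and out-of-range values.
# The expected schema and bounds below are hypothetical examples.

EXPECTED_SCHEMA = {"order_id": str, "amount": float, "currency": str}
AMOUNT_BOUNDS = (0.0, 100_000.0)  # flag obviously wonky values


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in a single ingested record."""
    problems = []

    # Schema drift: missing, unexpected, or mistyped fields.
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    for field in record.keys() - EXPECTED_SCHEMA.keys():
        problems.append(f"unexpected field: {field}")

    # Value sanity: surface wildly out-of-range amounts so someone can pause and look.
    amount = record.get("amount")
    if isinstance(amount, float) and not (AMOUNT_BOUNDS[0] <= amount <= AMOUNT_BOUNDS[1]):
        problems.append(f"amount out of range: {amount}")

    return problems


if __name__ == "__main__":
    # A vendor renamed "amount" to "amt": the schema check catches it.
    print(check_record({"order_id": "A1", "amt": 19.99, "currency": "USD"}))
```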
When you can observe your data ingestion process, you can more reliably (a short sketch of a few of these steps follows the list):
- Aggregate: Gather the data all in one place
- Merge: Combine like data sets
- Divide: Separate unlike data sets
- Summarize: Produce metadata that describes the data set
- Validate: Verify that the data is high quality (is as expected)
- (Maybe) Standardize: Align schemas
- Cleanse: Remove incorrect data
- Deduplicate: Eliminate unintentional duplicates
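To make the list concrete, here is a small pandas sketch that aggregates two hypothetical sources, validates a column, deduplicates, and summarizes. The source data, column names, and validation rules are illustrative assumptions.

```python
import pandas as pd

# Two hypothetical source extracts with the same (like) schema.
vendor_a = pd.DataFrame({"order_id": ["A1", "A2"], "amount": [19.99, 5.00]})
vendor_b = pd.DataFrame({"order_id": ["B1", "A2"], "amount": [12.50, 5.00]})

# Aggregate / merge: combine like data sets in one place.
orders = pd.concat([vendor_a, vendor_b], ignore_index=True)

# Validate: verify the data is as expected (non-null, positive amounts).
assert orders["amount"].notna().all(), "null amounts found"
assert (orders["amount"] > 0).all(), "non-positive amounts found"

# Deduplicate: eliminate unintentional duplicates (A2 appears twice).
orders = orders.drop_duplicates(subset="order_id")

# Summarize: produce metadata describing the data set.
summary = {
    "row_count": len(orders),
    "total_amount": float(orders["amount"].sum()),
    "columns": list(orders.columns),
}
print(summary)
```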
From there, you can start to build the framework that works for you. Highly complex data demands equally capable observability: if your monitoring tool isn’t deeply customizable and can’t automatically detect anomalous values, automation doesn’t do you much good. You’ll always be fixing the system, tweaking rules, and refining things by hand.
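For instance, here is a minimal sketch of the kind of automatic anomaly check a customizable monitor might apply, using a simple z-score rule over a table’s daily row count. The metric history and threshold are hypothetical, and real tools learn these baselines rather than hard-coding them.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag a metric value that sits more than `threshold` standard
    deviations from its recent history (a simple z-score rule)."""
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[-1]
    return abs(latest - mean(history)) / sigma > threshold


# Hypothetical daily row counts for an ingested table; today's load collapsed.
row_counts = [10_120, 9_980, 10_240, 10_050, 10_190]
print(is_anomalous(row_counts, 1_200))  # True -> alert instead of hand-tweaking rules
```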
Highly complex data also makes you a good candidate for a hybrid ingestion approach, where you can route ingested data into either a batch or a streaming workflow. If you have any amount of high-velocity data that needs to stay very up to date, you’re a candidate for a streaming ingestion architecture or, of course, a hybrid one.
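As a rough illustration of what a hybrid approach can look like in code, the sketch below routes each data set to a batch or streaming path based on how fresh its consumers need it to be. The freshness cutoff, data set names, and handler labels are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    max_staleness_minutes: int  # how out-of-date consumers can tolerate this data being


def route(dataset: Dataset, freshness_cutoff_minutes: int = 15) -> str:
    """Send high-velocity, freshness-sensitive data to the streaming path,
    everything else to cheaper batch loads. The cutoff is a hypothetical policy."""
    if dataset.max_staleness_minutes <= freshness_cutoff_minutes:
        return "stream"  # e.g. topic -> stream processor
    return "batch"       # e.g. nightly file drop -> warehouse load


print(route(Dataset("clickstream", max_staleness_minutes=1)))           # stream
print(route(Dataset("monthly_invoices", max_staleness_minutes=1_440)))  # batch
```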
And from there, you’ll want to understand what your specific user needs are.
Here are a few data ingestion tools you may find useful for this:
- Ingestion tools: Apache Storm, Apache Flume, Apache Gobblin, Apache NiFi, Logstash
- Storage: Snowflake, Azure, AWS, Google Cloud Platform
- Streaming platform: Apache Kafka
- Observability platform: Databand
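If you go the streaming route with Kafka, producing an event can be as small as the sketch below. It uses the kafka-python client and assumes a broker running at localhost:9092; the topic name and payload are hypothetical.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumes a broker running locally; in production this points at your cluster.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical topic and event; real payloads come from your sources.
producer.send("orders", {"order_id": "A1", "amount": 19.99})
producer.flush()
```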