Introducing Dr. Sven: open source data health checker
I just created a new project on GitHub that I want to share with you all: https://github.com/MeStrak/dr-sven.
What follows is a brief description of Dr. Sven adapted from the project readme, and a demo of what it can do.
What is Dr. Sven?
Dr Sven is a very simple data health checker which performs a checkup on your data to give an indication of whether it's in good shape or not.
The doctor comes in the form of an AWS Lambda function written in Python. It scans a table in an AWS Athena database based on a set of simple data rules defined by you to check that the data is in good shape. When I say simple I mean it - right now the only rule that exists is a check that every day within a given time range contains at least the specified number of records. Specific dates or days can be ignored to avoid spam.
Dr. Sven practises KISS medicine
No the doc doesn't break the hypocratic oath, it just prescribes to the ideology that simple is better.
Motivation for creating Dr. Sven
Dr. Sven is designed with a data lake type situation in mind where data is transferred from one data source to another repository. Most data ingestion monitoring tools I've seen focus on whether individual processes have run without error at the expected time. In some complex scenarios they also try to check that elements of the source data match the finally processed data which can be very difficult to do.
Imagine this situation ...
- all dashboards are green π
- logs contain no errors π
- all appears well with the world π
... then a pesky human looks at the data and says
- "but there are no records from Sunday, that's not right" π°
or
- "there's something weird - it looks like there's a hole in this data from 1 month ago" π±.
This project aims to reduce dependence on that human by providing a naive yet effective way to detect data issues. Specify some data rules defining what you expect to see, then let our kind doctor check if the processed data meets the criteria.
Demo!
Let's take a look at a simple example of using Dr. Sven using the AWS Athena getting started dataset. We'll create a checkup for the following:
- Check every weekend contains at least a few records
- Check that every week day contains lots of records
- Specify a few dates to ignore for each of those rules, because we already know that there are some exceptions to the rule
Prerequisites if you want to play along
- Installation instructions are in the project readme, and it's better to leave them there for now as I improve them. Here I'll show what Dr. Sven can do once installed.
- We'll be using the dataset in the AWS Athena getting started guide: https://docs.aws.amazon.com/athena/latest/ug/getting-started.html.
Configuring Dr. Sven
Each dataset you want to check must have a specific configuration file. A configuration file can contain multiple rules as long as they are for the same dataset.
First let's look at the configuration file (this example is contained in the project on GitHub), then I'll explain the parts which aren't immediately obvious.
Datasource.query
- This is the basis of the dataset which will be inspected
- It must return data with 2 columns: date, and count, where count is the number of records on that day
- The query text must contain '{start_date}' and '{end_date}' which will be replaced by Dr. Sven based on the date range you specify.
- Date must be in the formay YYYY-mm-dd
Rules
- This is the list of the rules which will be executed on the data
- Currently only
[[rules.min_records]]
is implemented, which checks that the data contains the required number of records - Ignore dates is a list of dates which should be excluded from the rule, usually because there is a good reason why the data doesn't on that day meet the requirements of the rule
- Ignore days is a list of days, using the name of the day which should be excluded. For example if every Sunday there is system maintenance so there will be no records, add 'Sunday' to this list. You can also use the friendly names 'Weekends' instead of ['Saturday', 'Sunday'] and 'Weekdays' instead of ['Monday', 'Tuesday', ..., 'Friday']
- In this configuration we exclude 'Weekdays' to define a rule only for the weekend, and 'Weekends' to define a rule only applicable Monday - Friday
[general]
title = "Example config for Dr. Sven using AWS Athena getting started dataset"
output_location = "lovely-bucket"
output_region = "eu-west-1"
[datasource]
query = "SELECT date, COUNT(*) as count FROM cloudfront_logs WHERE date BETWEEN date '{start_date}' AND date '{end_date}' GROUP BY date"
database = "mydatabase"
start_date = 2014-07-01
end_date = 2014-08-08
[rules]
[[rules.min_records]]
name = "Weekends must have a few records"
ignore_dates = [2014-07-12, 2014-07-13]
ignore_days = ["Weekdays"]
min_records = 10
explanation = "The logs should always have some activity at weekends, just not as much as in the week"
[[rules.min_records]]
name = "Weekdays must have many records"
ignore_dates = [2014-07-06, 2014-07-07, 2014-07-06, 2014-07-25]
ignore_days = ["Weekends"]
min_records = 100
explanation = "There will always be a lot of log activity during the week unless there is some downtime"
Running Dr. Sven
Now we launch our dr-sven lambda with this cloudwatch event (replacing bucket name with the bucket you put the config file in specified).
{
"s3": {
"bucket": {
"name": "my-bucket"
},
"object": {
"key": "aws-athena-example.toml"
}
}
}
Checking the results
Dr. Sven will output two files with the timestamp, checkup ID and the
- Summary: .md file containing a summary of the results for each rule (number of rows checked, passed, failed and ignored)
- Results: .csv file with a line per date for each failed rule
The dataset that we tested actually only contains data for two dates. We configured Dr. Sven to check data over from 2014-07-01 to 2014-08-08, so we're expecting a lot of failures.
Summary.md
Contains a summary of all rules that were checked, importantly telling you how many rules failed which you can then check in results.csv.
Weekends must have a few records
The logs should always have some activity at weekends, just not as much as in the week
Total dates checked: 39
Ignored: 31
Passed: 1
Failed: 7
Weekdays must have many records
There will always be a lot of log activity during the week unless there is some downtime
Total dates checked: 39
Ignored: 12
Passed: 1
Failed: 26
Results.csv
The results .csv output lists every issue found, along with the name of the rule taken from the config file. I've truncated the output below because it was too long - the real output found 32 issues, and correctly excluded ignored dates and days.
count | day_of_week | symptom | failed_rule | |
---|---|---|---|---|
06/07/2014 | 0 | Sunday | Expected at least 10 records but found 0 | Weekends must have a few records |
19/07/2014 | 0 | Saturday | Expected at least 10 records but found 0 | Weekends must have a few records |
20/07/2014 | 0 | Sunday | Expected at least 10 records but found 0 | Weekends must have a few records |
(β¦) | ||||
04/08/2014 | 0 | Monday | Expected at least 100 records but found 0 | Weekdays must have many records |
06/08/2014 | 0 | Wednesday | Expected at least 100 records but found 0 | Weekdays must have many records |
07/08/2014 | 0 | Thursday | Expected at least 100 records but found 0 | Weekdays must have many records |
08/08/2014 | 0 | Friday | Expected at least 100 records but found 0 | Weekdays must have many records |
Wrap-up
So there you have it. Like I said Dr. Sven is a really simple way to check that your data obeys some basic rules. I don't know if it's useful to others but I'm very eager to hear your feedback!