Introducing Dr. Sven: open source data health checker

Introducing Dr. Sven: open source data health checker

7 0

I just created a new project on GitHub that I want to share with you all: https://github.com/MeStrak/dr-sven.

What follows is a brief description of Dr. Sven adapted from the project readme, and a demo of what it can do.

What is Dr. Sven?

Dr Sven is a very simple data health checker which performs a checkup on your data to give an indication of whether it's in good shape or not.

The doctor comes in the form of an AWS Lambda function written in Python. It scans a table in an AWS Athena database based on a set of simple data rules defined by you to check that the data is in good shape. When I say simple I mean it - right now the only rule that exists is a check that every day within a given time range contains at least the specified number of records. Specific dates or days can be ignored to avoid spam.

Dr. Sven practises KISS medicine

No the doc doesn't break the hypocratic oath, it just prescribes to the ideology that simple is better.

Motivation for creating Dr. Sven

Dr. Sven is designed with a data lake type situation in mind where data is transferred from one data source to another repository. Most data ingestion monitoring tools I've seen focus on whether individual processes have run without error at the expected time. In some complex scenarios they also try to check that elements of the source data match the finally processed data which can be very difficult to do.

Imagine this situation ...

  • all dashboards are green πŸ˜„
  • logs contain no errors 😁
  • all appears well with the world πŸš€

... then a pesky human looks at the data and says

  • "but there are no records from Sunday, that's not right" 😰

or

  • "there's something weird - it looks like there's a hole in this data from 1 month ago" 😱.

This project aims to reduce dependence on that human by providing a naive yet effective way to detect data issues. Specify some data rules defining what you expect to see, then let our kind doctor check if the processed data meets the criteria.

Demo!

Let's take a look at a simple example of using Dr. Sven using the AWS Athena getting started dataset. We'll create a checkup for the following:

  • Check every weekend contains at least a few records
  • Check that every week day contains lots of records
  • Specify a few dates to ignore for each of those rules, because we already know that there are some exceptions to the rule

Prerequisites if you want to play along

Configuring Dr. Sven

Each dataset you want to check must have a specific configuration file. A configuration file can contain multiple rules as long as they are for the same dataset.

First let's look at the configuration file (this example is contained in the project on GitHub), then I'll explain the parts which aren't immediately obvious.

Datasource.query

  • This is the basis of the dataset which will be inspected
    • It must return data with 2 columns: date, and count, where count is the number of records on that day
    • The query text must contain '{start_date}' and '{end_date}' which will be replaced by Dr. Sven based on the date range you specify.
  • Date must be in the formay YYYY-mm-dd

Rules

  • This is the list of the rules which will be executed on the data
  • Currently only [[rules.min_records]] is implemented, which checks that the data contains the required number of records
  • Ignore dates is a list of dates which should be excluded from the rule, usually because there is a good reason why the data doesn't on that day meet the requirements of the rule
  • Ignore days is a list of days, using the name of the day which should be excluded. For example if every Sunday there is system maintenance so there will be no records, add 'Sunday' to this list. You can also use the friendly names 'Weekends' instead of ['Saturday', 'Sunday'] and 'Weekdays' instead of ['Monday', 'Tuesday', ..., 'Friday']
  • In this configuration we exclude 'Weekdays' to define a rule only for the weekend, and 'Weekends' to define a rule only applicable Monday - Friday
[general]
title = "Example config for Dr. Sven using AWS Athena getting started dataset"
output_location = "lovely-bucket"
output_region = "eu-west-1"

[datasource]
query = "SELECT date, COUNT(*) as count FROM cloudfront_logs WHERE date BETWEEN date '{start_date}' AND date '{end_date}' GROUP BY date"
database = "mydatabase"
start_date = 2014-07-01
end_date = 2014-08-08

[rules]
  [[rules.min_records]]
  name = "Weekends must have a few records"
  ignore_dates = [2014-07-12, 2014-07-13]
  ignore_days = ["Weekdays"]
  min_records = 10
  explanation = "The logs should always have some activity at weekends, just not as much as in the week"

  [[rules.min_records]]
  name = "Weekdays must have many records"
  ignore_dates = [2014-07-06, 2014-07-07, 2014-07-06, 2014-07-25]
  ignore_days = ["Weekends"]
  min_records = 100
  explanation = "There will always be a lot of log activity during the week unless there is some downtime"
Enter fullscreen mode Exit fullscreen mode

Running Dr. Sven

Now we launch our dr-sven lambda with this cloudwatch event (replacing bucket name with the bucket you put the config file in specified).

 {
  "s3": {
    "bucket": {
      "name": "my-bucket"
    },
    "object": {
      "key": "aws-athena-example.toml"
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Checking the results

Dr. Sven will output two files with the timestamp, checkup ID and the

  • Summary: .md file containing a summary of the results for each rule (number of rows checked, passed, failed and ignored)
  • Results: .csv file with a line per date for each failed rule

The dataset that we tested actually only contains data for two dates. We configured Dr. Sven to check data over from 2014-07-01 to 2014-08-08, so we're expecting a lot of failures.

Summary.md

Contains a summary of all rules that were checked, importantly telling you how many rules failed which you can then check in results.csv.

Weekends must have a few records

The logs should always have some activity at weekends, just not as much as in the week
Total dates checked: 39
Ignored: 31
Passed: 1
Failed: 7

Weekdays must have many records

There will always be a lot of log activity during the week unless there is some downtime
Total dates checked: 39
Ignored: 12
Passed: 1
Failed: 26

Results.csv

The results .csv output lists every issue found, along with the name of the rule taken from the config file. I've truncated the output below because it was too long - the real output found 32 issues, and correctly excluded ignored dates and days.

count day_of_week symptom failed_rule
06/07/2014 0 Sunday Expected at least 10 records but found 0 Weekends must have a few records
19/07/2014 0 Saturday Expected at least 10 records but found 0 Weekends must have a few records
20/07/2014 0 Sunday Expected at least 10 records but found 0 Weekends must have a few records
(…)
04/08/2014 0 Monday Expected at least 100 records but found 0 Weekdays must have many records
06/08/2014 0 Wednesday Expected at least 100 records but found 0 Weekdays must have many records
07/08/2014 0 Thursday Expected at least 100 records but found 0 Weekdays must have many records
08/08/2014 0 Friday Expected at least 100 records but found 0 Weekdays must have many records

Wrap-up

So there you have it. Like I said Dr. Sven is a really simple way to check that your data obeys some basic rules. I don't know if it's useful to others but I'm very eager to hear your feedback!