Jon Tavernier

Analytical Data Engineering

Tips for populating and working with an analytical data store.

One of the main goals of analytical data engineering is to populate an analytical data store, such as Google BigQuery, with a copy of source system data. For example, your company may use 17 different third-party software services. You will want to copy some, and maybe all, of that data into an analytical data store so the business can:

1. Ask and answer questions of the data that the built-in reporting capabilities of those tools do not support.
2. Ask and answer questions that require combining data across those tools.

## Extracting Data

1. For incremental extracts, vendors must provide you with the ability to extract data that has been recently created, updated, or deleted.
2. For full extracts, vendors should provide you with the ability to move data at a rate of at least 10 MB per second.

## Processing Single Entities

You need only three pieces of information to process your incremental extracts:

1. The data key(s).
2. A timestamp indicating when you extracted the data.
3. The schema.

Processing the extracts is **(Existing Data + Newly Created Data + Recently Updated Data) - Deleted Data**. Pseudo code to keep the latest data for each entity is shown below.

```sql
-- Combine existing rows with newly staged rows, drop deleted entities,
-- and keep only the most recently extracted row per entity.
select *
from (
    select entity_key, elt_effective_timestamp, * from existing_data
    union all
    select entity_key, elt_effective_timestamp, * from staging_data
) ud
where not exists (
    select 1
    from deleted_data dd
    where dd.entity_key = ud.entity_key
)
qualify row_number() over (
    partition by ud.entity_key
    order by ud.elt_effective_timestamp desc
) = 1;
```

## Processing Sets of Data

You need only three pieces of information to process your incremental extracts:

1. The set key(s).
2. A timestamp indicating when you extracted the set, which must be the same for all rows in the set.
3. The schema.

Processing the extracts is **Existing Sets + Newly Extracted Sets**. Pseudo code to keep the latest data for each set is shown below.

```sql
-- Combine existing rows with newly staged rows, then keep every row
-- belonging to the most recently extracted version of each set.
select *
from (
    select set_key, elt_effective_timestamp, * from existing_data
    union all
    select set_key, elt_effective_timestamp, * from staging_data
) ud
qualify rank() over (
    partition by ud.set_key
    order by ud.elt_effective_timestamp desc
) = 1;
```

## Validating Your Copy

You have a few options here, such as:

1. Manual inspection of data in the vendor's user interface.
2. Sampling data.
3. High-level checksums.

I prefer using high-level checksums where possible. Here's how it works.

Let's say you copy the orders table into your analytical data store. How do you know your copy of the data accurately reflects that of the source system?

You can aggregate data in both systems and then compare the results. For example, you could run this query in both systems to perform a high-level check of whether you have the same number of orders on each date:

```sql
-- Count orders per day; run this in both systems and compare the results.
select
    date(o.created_at) as date_id,
    count(*) as order_count
from orders o
group by date_id
order by date_id desc;
```

You could further enhance this concept by calculating a checksum in both systems and comparing that as well. Unfortunately, many vendor systems lack the capability to perform such aggregation within their system.
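Where both sides can run this kind of SQL, the checksum could look something like the sketch below. It assumes BigQuery's hash functions and a few hypothetical order columns (`order_id`, `status`, `total_amount`): fingerprint the comparable columns of each row, then fold the fingerprints into one value per day.

```sql
-- Sketch only: per-day row count plus a per-day checksum built by
-- fingerprinting each row's comparable columns and XOR-folding them.
select
    date(o.created_at) as date_id,
    count(*) as order_count,
    bit_xor(farm_fingerprint(to_json_string(struct(
        o.order_id,
        o.status,
        o.total_amount
    )))) as order_checksum
from orders o
group by date_id
order by date_id desc;
```

XOR-folding keeps the checksum independent of row order and avoids the overflow risk of summing 64-bit fingerprints.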
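If you land both result sets in the warehouse, say as the hypothetical tables `source_order_checks` and `warehouse_order_checks`, a full outer join is one way to surface any dates where the two copies disagree:

```sql
-- Sketch only: list dates where counts or checksums differ between
-- the source system's aggregates and the analytical copy's aggregates.
select
    coalesce(s.date_id, w.date_id) as date_id,
    s.order_count as source_count,
    w.order_count as warehouse_count,
    s.order_checksum as source_checksum,
    w.order_checksum as warehouse_checksum
from source_order_checks s
full outer join warehouse_order_checks w
    on s.date_id = w.date_id
where s.date_id is null
    or w.date_id is null
    or s.order_count != w.order_count
    or s.order_checksum != w.order_checksum
order by date_id desc;
```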