Jon Tavernier

Analytical Data Engineering

Tips for populating and working with an analytical data store.

One of the main goals of analytical data engineering is to populate an analytical data store, such as Google BigQuery, with a copy of source system data. For example, your company may use 17 different third-party software services. You will want to copy some, and maybe all, of that data into an analytical data store so the business can:

1. Ask and answer questions of the data that the built-in reporting capabilities of those tools do not support.
2. Ask and answer questions that require combining data across those tools.

## Extracting Data

1. For incremental extracts, vendors must provide you with the ability to extract data that has been recently created, updated, or deleted.
2. For full extracts, vendors should provide you with the ability to move data at a rate of at least 10 MB per second.

## Processing Single Entities

You need only three pieces of information to process your incremental extracts:

1. The data key(s).
2. A timestamp indicating when you extracted the data.
3. The schema.

Processing the extracts is **(Existing Data + Newly Created Data + Recently Updated Data) - Deleted Data**. Pseudo code to keep the latest data for each entity is shown below.

```sql
-- Combine existing rows with newly staged rows, drop deleted entities,
-- and keep only the most recently extracted row per entity.
select *
from (
    select entity_key, elt_effective_timestamp, * from existing_data
    union all
    select entity_key, elt_effective_timestamp, * from staging_data
) ud
where not exists (
    select 1
    from deleted_data dd
    where dd.entity_key = ud.entity_key
)
qualify row_number() over (
    partition by ud.entity_key
    order by ud.elt_effective_timestamp desc
) = 1;
```

## Processing Sets of Data

You need only three pieces of information to process your incremental extracts:

1. The set key(s).
2. A timestamp indicating when you extracted the set, which must be the same for all rows in the set.
3. The schema.

Processing the extracts is **Existing Sets + Newly Extracted Sets**. Pseudo code to keep the latest data for each set is shown below.

```sql
-- Combine existing rows with newly staged rows, then keep every row
-- belonging to the most recently extracted version of each set.
select *
from (
    select set_key, elt_effective_timestamp, * from existing_data
    union all
    select set_key, elt_effective_timestamp, * from staging_data
) ud
qualify rank() over (
    partition by ud.set_key
    order by ud.elt_effective_timestamp desc
) = 1;
```

## Validating Your Copy

You have a few options here, such as:

1. Manual inspection of data in the vendor's user interface.
2. Sampling data.
3. High-level checksums.

I prefer using high-level checksums where possible. Here's how it works.

Let's say you copy the orders table into your analytical data store. How do you know your copy of the data accurately reflects that of the source system?

You can aggregate data in both systems and then compare the results. For example, you could run this query in both systems to perform a high-level check of whether you have the same number of orders on each date:

```sql
-- Count orders per day; run this in both systems and compare the results.
select
    date(o.created_at) as date_id,
    count(*) as order_count
from orders o
group by date_id
order by date_id desc;
```

You could further enhance this concept by calculating a checksum in both systems and comparing that as well. Unfortunately, many vendor systems lack the capability to perform such aggregation within their system.
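Where both sides can run this kind of SQL, the checksum could look something like the sketch below. It assumes BigQuery's hash functions and a few hypothetical order columns (`order_id`, `status`, `total_amount`): fingerprint the comparable columns of each row, then fold the fingerprints into one value per day.

```sql
-- Sketch only: per-day row count plus a per-day checksum built by
-- fingerprinting each row's comparable columns and XOR-folding them.
select
    date(o.created_at) as date_id,
    count(*) as order_count,
    bit_xor(farm_fingerprint(to_json_string(struct(
        o.order_id,
        o.status,
        o.total_amount
    )))) as order_checksum
from orders o
group by date_id
order by date_id desc;
```

XOR-folding keeps the checksum independent of row order and avoids the overflow risk of summing 64-bit fingerprints.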
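If you land both result sets in the warehouse, say as the hypothetical tables `source_order_checks` and `warehouse_order_checks`, a full outer join is one way to surface any dates where the two copies disagree:

```sql
-- Sketch only: list dates where counts or checksums differ between
-- the source system's aggregates and the analytical copy's aggregates.
select
    coalesce(s.date_id, w.date_id) as date_id,
    s.order_count as source_count,
    w.order_count as warehouse_count,
    s.order_checksum as source_checksum,
    w.order_checksum as warehouse_checksum
from source_order_checks s
full outer join warehouse_order_checks w
    on s.date_id = w.date_id
where s.date_id is null
    or w.date_id is null
    or s.order_count != w.order_count
    or s.order_checksum != w.order_checksum
order by date_id desc;
```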