Data Vault Automation at de Bijenkorf PRESENTED BY ROB WINTERS ANDREI SCORUS Presentation agenda ◦ Project objec*ves ◦ Architectural overview ◦ The data warehouse data model ◦ Automa*on in the data warehouse ◦ Successes and failures ◦ Conclusions About the presenters Rob Winters Andrei Scorus Head of Data Technology, the Bijenkorf BI Consultant, Incentro Project role: Project role: ◦ Project Lead ◦ Main ETL Developer ◦ Systems architect and administrator ◦ ETL Developer ◦ Data modeler ◦ Modeling support ◦ Source system expert ◦ Developer (ETL, predic;ve models, reports) ◦ Stakeholder manager ◦ Joined project September 2014 ◦ Joined project November 2014 Project objectives Technical Requirements ◦ Informa*on requirements ◦ Have one place as the source for all reports ◦ Security and privacy ◦ Informa*on management ◦ Integrate with produc*on • One environment to quickly generate customer insights • Then feed those insights back to produc;on ◦ Non-‐func*onal requirements ◦ System quality ◦ Extensibility ◦ Scalability ◦ Maintainability ◦ Security ◦ Flexibility ◦ Low Cost • Then measure the impact of those changes in near real ;m Source system landscape Source Type Number of Sources Examples Load Frequency Data Structure 2x/hour Par;al 3NF Oracle DB 2 Virgo ERP MySQL 3 Product DB, Web 10x/hour Orders, DWH 3NF (Web Orders), Improperly normalized Event bus 1 Web/email events 1x/minute Tab delimited with JSON fields Webhook 1 Transac;onal Emails 1x/minute JSON REST APIs 5+ GA, DotMailer 1x/hour-‐1x/day JSON SOAP APIs 5+ AdWords, Pricing 1x/day XML Architectural overview Tools AWS ◦ S3 ◦ Kinesis ◦ Elas;cache ◦ Elas;c Beanstalk ◦ EC2 ◦ DynamoDB Open Source ◦ Snowplow Event Tracker ◦ Rundeck Scheduler ◦ Jenkins Con;nuous Integra;on ◦ Pentaho PDI Other ◦ ◦ ◦ ◦ HP Ver;ca Tableau Github RStudio Server DWH internal architecture • Tradi;onal three ;er DWH • ODS generated automa;cally from staging • Ops mart reflects data in original source form • Helps offload queries from source systems • Business marts materialized exclusively from vault Bijenkorf Data Vault overview Aligned to Data Vault 2.0 Model contains • Hash keys • Sales transac;ons • Hashes used for CDC • Customer and corporate loca;ons • Parallel loading • Maximum u;liza;on of available resources • Data unchanged in to the vault • Customers Excluded from the vault ◦ Event streams ◦ Server logs ◦ Unstructured data • Products Some sta*s*cs • Payment methods 18 hubs • 34 loading scripts • E-‐mail • Phone • Product grouping • Campaigns • deBijenkorf card • Social media 27 links • 43 loading scripts 39 satellites • 43 loading scripts 13 reference tables • 1 script per table Data volumes • ~1 TB base volume • 10-‐12 GB daily • ~250 source tables Deep dive: Transactions in DV • Transac;ons Deep dive: Customers in DV • Same as link on customer Challenges encountered during data modeling Challenge Issue Details Resolu*on Source issues • Source systems and original data • Business keys rebuilt in staging unavailable for most informa;on prior to vault loading • Data ohen transformed 2-‐4 ;mes before access was available • Business keys (ex. SKU) typically replaced with sequences Modeling returns • Retail returns can appear in ERP • Original model showed sale in 1-‐3 ways across mul;ple tables state on line item satellite with inconsistent keys • Revised model recorded • Online returns appear as a state “nega;ve sale” transac;ons and change on original transac;on used a new link to connect to and may/may not appear in ERP original sale when possible Fragmented knowledge • Informa;on about the systems was being held by mul;ple people • Documenta;on was out-‐of-‐date • Talking to as many people as possible and tes;ng hypotheses on the data Targeted benefits of DWH automation Objec*ve Achievements Speed of development • Integra;on of new sources or data from exis;ng sources takes 1-‐2 steps • Adding a new vault dependency takes one step Simplicity • Five jobs handle all ETL processes across DWH Traceability • Every record/source file is traced in the database and every row automa;cally iden;fied by source file in ODS Code simplifica*on • Replaced most common key defini;ons with dynamic variable replacement File management • Every source file automa;cally archived to Amazon S3 in appropriate loca;ons sorted by source, table, and date • En;re source systems, periods, etc can be replayed in minutes Source loading automation o Design of loader focused on process abstrac;on, traceability, and minimiza;on of “moving parts” o Final process consisted of two base jobs working in tandem: one for genera;ng incremental extracts from source systems, one for loading flat files from all sources to staging tables o Replica;on was desired but rejected due to limited access to source systems Workflow of source integra*on Source tables duplicated in staging with addi;on of loadTs and sourceFile columns Metadata for source file added Loader automa;cally generates ODS, begins tracking source files for duplica;on and data quality Example: Add addi*onal table from exis*ng source Query generator automa;cally executes full duplica;on on first execu;on and incrementals aherward CREATE TABLE stg_oms.customer ( customerId int , customerName varchar(500) , customerAddress varchar(5000) , loadTs ;mestamp NOT NULL , sourceFile varchar(255) NOT NULL ) ORDER BY customerId PARTITION BY date(loadTs) ; INSERT INTO meta.source_to_stg_mapping (targetSchema, targetTable, sourceSystem, fileNamePapern, delimiter, nullField) VALUES ('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL') ; Vault loading automation o Loader is fully metadata driven with focus on horizontal scalability and management simplicity o To support speed of development and performance, variable-‐driven SQL templates used throughout All Staging Tables Checked for Changes • New sources automa;ca lly added • Last change epoch based on load stamps, advanced each ;me all dependenc ies execute successfull y • Dependenc ies declared at List of ;me of job Depende crea;on nt Vault • Load Loads priori;za; Iden;fied on possible but not u;lized Loads Planned in Hub, Link, Sat Order • Jobs parallelized across tables but serialized per job • Dynamic job queueing ensures appropriat e execu;on order Loads Executed • Variables automa;ca lly iden;fied and replaced • Each load records performan ce sta;s;cs and error messages Design goals for mart loading automation Requirement Solu;on Benefit Simple, standardized models Metadata-‐driven Pentaho PDI Easy development using parameters and variables Easily Extensible Plugin framework Rapid integra;on of new func;onality Rapid new job development Recycle standardized jobs and transforma;ons Limited moving parts, easy modifica;on Low administra;on overhead Leverage built in logging and tracking Easily integrated mart loading repor;ng with other ETL reports Data Information mart automation flow Retrieve commands Get dependencies • Each dimension and fact is processed independently • Based on defined transforma;on, get all related vault tables: links, satellites or hubs • From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension Retrieve changed • Store the data in the database un;l further processing data Execute transforma*ons Maintentance • Mul;ple Pentaho transforma;ons can be processed per command using the data captured in previous steps • Logging happens throughout the whole process • Cleanup aher all commands have been processed • DV drives recommenda;on engine and customer recommenda;ons (updated nightly) • Data pipeline supports near real ;me upda;ng of customer recommenda;ons based on web ac;vity Business Intelligence • Provided first unified data model of customer ac;vity • 80% reduc;on in unique customer keys • Allowed for segmenta;on of customers based on combina;on of in-‐store and online ac;vity Personaliza;on Customer Analysis Primary uses of Bijenkorf DWH • DV-‐based marts replace joining dozens of tables across mul;ple sources with single facts/ dimensions • IT-‐driven repor;ng being replaced with self-‐service BI Biggest drivers of success AWS Infrastructure Automa;on PDI framework Cost: En;re infrastructure for less than one server in the data center Speed: Enormous ;me savings aher ini;al investment Ease of use: Adding new commands takes at most 45 minutes Toolset: Most services available off the shelf, minimizing administra;on Simplicity: Able to run and monitor 40k+ queries per day with minimal effort Agile: Building the framework took 1 day Freedom: No dependency on IT for development support Auditability: Enforced tracking and archiving without developer involvement Low profile: Average memory usage of 250MB Scalability: Systems automa;cally scaled to match DWH demands Biggest mistakes along the way Reliance on documenta;on and requirements over expert users • Ini;al integra;on design was based on provided documenta;on/models which was rarely accurate • Current users of sources should have been engaged earlier to explain undocumented caveats Late u;liza;on of templates and variables • Variables were u;lized late in development, slowing progress significantly and crea;ng consistency issues • Good ini;al design of templates will significantly reduce development ;me in mid/long run Aggressive overextension of resources • We apempted to design and populate the en;re data vault prior to focusing on customer deliverables like reports (in addi;on to other projects) • We have shihed focus to con;nuous release of new informa;on rather than wai;ng for completeness Primary takeaways ◦ Sources are like cars: the older they are, the more idiosyncrasies. Be cau;ous with design automa;on! ◦ Automa;on can enormously simplify/accelerate data warehousing. Don’t be afraid to roll your own ◦ Balance stateful versus stateless and monolithic versus fragmented architecture design ◦ Cloud based architecture based on column store DBs is extremely scalable, cheap, and highly performant ◦ A successful vault can create a new problem: gewng IT to think about business processes rather than system keys! Rob Winters Andrei Scorus WintersRD@gmail.com andrei.scorus@incentro.com
© Copyright 2025