Three big migration projects brought Picnic’s self-service Business Intelligence to new heights — speeding up the development cycle with lightweight tech at scale
Welcome to the fourth part in a series of five blog posts about Data Engineering at Picnic. The series has covered a lot of ground, introducing the state of the art of Picnic’s Lakeless Data Warehouse in the first blog post. In the second, we took a close look at the Data Engineer’s role in the future of groceries, and in the third, we shared some war stories of how we built the Data Warehouse (DWH) in the early days of Picnic.
Now it’s time to take a deep dive into our Business Intelligence scalability challenges over the years. This article reveals how we handled a number of common challenges that businesses face when they need to scale analytics from a few dozen users to hundreds.
At the beginning of 2018, the Data Engineering team was in a challenging situation. We had hit a few scalability roadblocks that nearly halted any new development for weeks. To tackle them, we radically changed our tech stack with three large-scale migration projects:
- Adopted Snowflake as a DWH platform. Moved away from AWS Redshift, Redshift Spectrum, and Athena.
- Adopted Python as the language for Extract Load Transform (ELT). Moved away from the open-source Community Edition of Hitachi's Pentaho Data Integration (PDI). (A minimal sketch of such a Python job follows this list.)
- Adopted Argo for job scheduling and orchestration. Migrated away from Airflow, Cron jobs, and K8s scheduling.
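To give a feeling for what this stack looks like in practice, below is a minimal, purely illustrative sketch of a lightweight Python ELT job targeting Snowflake. It assumes the snowflake-connector-python client; the warehouse, database, stage, and table names (ANALYTICS_WH, DWH, raw_stage, raw_orders, orders_per_day) are hypothetical examples, not Picnic's actual objects.

```python
"""Illustrative sketch only: a lightweight Python ELT job in the spirit of the
new stack. Connection details and object names are hypothetical."""
import os

import snowflake.connector  # pip install snowflake-connector-python


def main() -> None:
    # Connect to the warehouse; credentials come from the environment.
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="ANALYTICS_WH",  # hypothetical warehouse
        database="DWH",            # hypothetical database
        schema="STAGING",
    )
    try:
        cur = conn.cursor()
        # Extract + Load: copy raw JSON files from an external stage into a
        # staging table (assumed to have a single VARIANT column named "raw").
        cur.execute(
            "COPY INTO raw_orders FROM @raw_stage/orders/ "
            "FILE_FORMAT = (TYPE = JSON)"
        )
        # Transform: build a derived table with plain SQL inside the warehouse,
        # which is the "T" of ELT happening after the load.
        cur.execute(
            "CREATE OR REPLACE TABLE orders_per_day AS "
            "SELECT raw:delivery_date::date AS delivery_date, COUNT(*) AS orders "
            "FROM raw_orders GROUP BY 1"
        )
    finally:
        conn.close()


if __name__ == "__main__":
    main()
```

The point of the sketch is not the specific SQL but the shape of the job: a small, reviewable script that pushes the heavy lifting into the warehouse instead of a drag-and-drop ETL tool.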
It was an ambitious plan, borderline impossible to achieve, but we were confident that we had the skills and motivation to see it through by living the words “The best way out is through.” Now, three years later, all three projects have turned out to be a great success. They were carefully selected to fulfill our vision for the Data Engineering discipline within Picnic. But it was a wild ride. It felt very much like we were replacing the engine and the cockpit of a plane in mid-air. As a team, we are proud of the extraordinary achievement of pulling it off.
The three-headed monster nemesis of the Data Engineer 🐲
Before we jump to the solutions, it is essential to understand what triggered the change. Here are some familiar challenges that every business faces at one point or another in its analytics journey:
- Under-performant DWH: slow queries, unable to fit all the data, and not resilient to heavy user queries.
- Sluggish ETL/ELT jobs: sequential runs, challenging to peer review, and failures under heavy load due to poor streaming support.
- Complex job orchestration tooling: no scheduling tool could simultaneously meet our requirements for dependency management, on-demand triggering, and Kubernetes-based deployment.
We have always strived to push our technology stack to its limits, and at that time it was evident that the stack had a hard time keeping up with the pace of the business. Treating the symptoms would not have been enough; to address the root causes, we needed to rethink our data platform. Let's take a look at the Picnic-specific drivers.
Scaling from one to two countries: a challenge like no other
Picnic started operations in the Netherlands in April 2015 and quickly expanded to Germany in late 2017. The good news for the DWH was that we had considered this in the data model of the Data Vault (DV) and the Dimensional Kimball from the very beginning of Picnic. Still, there was a lot to adapt when it came to market-specific requirements. For example, we had to adjust the way we implement our delivery areas. In the Netherlands, a postcode is very granular and comprises on average eight addresses, while in Germany it contains thousands. For any geospatial analysis, we therefore had to move from simple database joins on postcodes to queries based on latitude and longitude coordinates.
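To make that shift concrete, here is a small, purely illustrative sketch of coordinate-based matching. The haversine formula is standard; the delivery-area definitions (a centre point plus a radius) and all names in the example are hypothetical and not Picnic's actual model.

```python
"""Illustrative sketch only: matching addresses to delivery areas by
coordinates instead of by postcode. Area definitions are hypothetical."""
from math import asin, cos, radians, sin, sqrt
from typing import Optional

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# In the Netherlands a join on postcode was precise enough; in Germany a single
# postcode spans thousands of addresses, so addresses are matched on coordinates.
DELIVERY_AREAS = [  # hypothetical areas: (name, centre_lat, centre_lon, radius_km)
    ("area_duesseldorf_1", 51.2277, 6.7735, 5.0),
    ("area_duesseldorf_2", 51.1805, 6.8000, 5.0),
]


def assign_area(lat: float, lon: float) -> Optional[str]:
    """Return the first delivery area whose radius covers the given address."""
    for name, area_lat, area_lon, radius_km in DELIVERY_AREAS:
        if haversine_km(lat, lon, area_lat, area_lon) <= radius_km:
            return name
    return None


print(assign_area(51.2254, 6.7763))  # -> "area_duesseldorf_1"
```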
Besides that, the whole deployment setup needed a makeover. As a company, we decided that all countries would share the codebase while the country-specific deployments were entirely isolated. This included the DWH. We started building two separate DWHs side by side with the same structure. At first, managing all this data infrastructure was a full-time DBA job: deciding when to scale out to more nodes, rebuilding indices, monitoring queries, and lots of waiting in between.
Cross-country analysis (cross-cluster Redshift) was a must. Some of the systems, like ad campaigns, were global and it proved challenging to combine data from both clusters. After exploring bridges between two Redshift clusters, we ended up with the only sensible decision: we needed to duplicate all the data that was truly global and hence important for both markets.
The business grew rapidly; data grew even faster
Over five years, we grew from zero orders per day to tens of thousands, and from zero customers to millions. Data volume grew even faster due to the introduction of Snowplow for collecting event-based analytics data from our apps and back-end services. At first, we loaded this data into Redshift. After a few months of steadily scaling out our nodes, we couldn't expand any further on the tier of EC2 compute instances we were using. Our options were to either upgrade the tier, which was very expensive, or switch to storage-intensive instances, which compromised analytical performance. At the time, we also piloted the newly released Redshift Spectrum, but the technology turned out to require more maintenance effort than we could spare.
Increased business complexity means increased complexity of analytical queries
The analysis of the existing data became more complex by the day. In the first years of Picnic, to maintain good data governance and security, we only exposed DWH data to Business Analysts via Tableau. We had beautiful data sources in the Dimensional Kimball schema and were relentlessly working on publishing more aggregated data sources to keep the performance of the analyses manageable. Tableau's access control configuration also kept the data secure. At the same time, this gave analysts a straightforward data interface without exposing the inner workings of the DWH.
However, this was not enough to keep up with the rapidly expanding business, which became constrained by and dependent on an already over-stretched Data Engineering team. To solve this, we set up an initial pilot with Athena, giving SQL access to some of the larger data sources coming from the in-app analytics events. It served a handful of users and seemed to work well at first, with manageable costs. However, it became apparent that this would turn into a maintenance nightmare as more people needed access. There was high demand for raw data and minimal tooling around data protection for controlled access to sensitive attributes. This forced us to enforce security during data extraction by rerouting personal data, which is far more challenging than masking data on read: it makes the requirements phase heavier and introduces additional dependencies on the producer.
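As a rough illustration of what "enforcing security at extraction time" means, here is a minimal sketch that pseudonymizes sensitive attributes before anything is loaded into the analytical store. The event shape, the list of sensitive attributes, and the salt handling are hypothetical examples, not Picnic's actual implementation.

```python
"""Illustrative sketch only: protecting personal data at extraction time
rather than masking it on read. All names and fields are hypothetical."""
import hashlib
import os
from typing import Any, Dict

# Attributes that should never land in the analytical store in plain text.
SENSITIVE_ATTRIBUTES = {"email", "phone_number", "delivery_address"}  # hypothetical list


def pseudonymize(value: str, salt: str) -> str:
    """Replace a sensitive value with a stable, non-reversible token."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def extract_event(raw_event: Dict[str, Any], salt: str) -> Dict[str, Any]:
    """Pseudonymize sensitive fields during extraction; pass the rest through."""
    clean: Dict[str, Any] = {}
    for key, value in raw_event.items():
        if key in SENSITIVE_ATTRIBUTES and isinstance(value, str):
            clean[key] = pseudonymize(value, salt)
        else:
            clean[key] = value
    return clean


event = {"customer_id": 42, "email": "jane@example.com", "event": "order_placed"}
print(extract_event(event, salt=os.environ.get("PII_SALT", "dev-only-salt")))
```

The difficulty described above is visible even in this toy version: the extraction pipeline has to know, per producer, which attributes are sensitive before any data flows, whereas masking on read could be configured later and centrally in the warehouse.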
Query load resulted in Redshift cluster failures
On top of everything else, we started experiencing regular Redshift cluster restarts; in the worst period, at least once a week. This introduced downtime of up to 15 minutes a day during which the DWH was not available. It was a big issue: no production data capture processes could run, and we had to start them manually one by one.
The restarts were caused by heavy blending operations in Tableau that fired monstrous queries to the database. They could hit at any moment from any of the 50+ Tableau creators. We worked hard together with the AWS Redshift and the Tableau teams to get this resolved, but in the end, any solution seemed temporary.
Around the same time, we were also running into the limits of PDI. One drawback was that upgrades were nearly impossible: we attempted them a few times and ended up breaking more things than we were fixing, so we stayed on the same major version. Not being able to benefit from the ongoing development of the tool was a significant downside for us. For example, we made heavy use of metadata injection, even though newer versions offered better out-of-the-box alternatives. We were also heavy users of JSON sources, and because the built-in PDI component was slow, we had to customize the 'Fast JSON' component. All in all, we didn't see PDI scaling with Picnic.
Job orchestration
At first, we were quite content with just Cron jobs and Kubernetes schedules. However, as our use cases became more demanding, we started looking at Airflow. We liked the idea of an interface that gave us an overview of all jobs and a visualization of their dependencies. But when we set up a proof-of-concept (PoC) in Docker on AWS, we found it surprisingly tricky to get the UI responsive and running smoothly. We also found it difficult to manually trigger Directed Acyclic Graphs (DAGs), which was often required in our case, even after regular deployments. In the end, after the PoC in 2017, Airflow didn't seem to be the best orchestration tool for Picnic.
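For context, here is a minimal, illustrative DAG of the kind one would write in an Airflow PoC of that era (Airflow 1.x-style imports; the DAG, tasks, and callables are hypothetical, not our actual jobs). Defining dependencies in code like this is what made the overview and visualization appealing, while triggering such a DAG on demand (e.g. with the CLI command `airflow trigger_dag`) was the part we found cumbersome in practice.

```python
"""Illustrative sketch only: a minimal Airflow 1.x-era DAG with two tasks
and an explicit dependency. Names and schedules are hypothetical."""
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract() -> None:
    # Placeholder for an extraction step.
    print("extracting...")


def transform() -> None:
    # Placeholder for a transformation step.
    print("transforming...")


dag = DAG(
    dag_id="example_elt",          # hypothetical DAG name
    schedule_interval="@daily",    # run once a day
    start_date=datetime(2017, 1, 1),
    catchup=False,
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)

# The dependency the UI visualizes: transform runs only after extract succeeds.
extract_task >> transform_task
```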