
Building High-Performance, Future-Proof Data Pipelines
Our client, a leading bank in Europe, was exploring solutions to move their internal data workloads from an Oracle database to a modern, cloud-based data architecture. The new architecture establishes a foundation for high-performance data pipelines capable of handling varied data loads while providing mechanisms to carefully manage sensitive and critical business data.
Among other contributions, we led the complete design and development of the new data architecture, established standards that ease the onboarding of new data sources, enabled straightforward testing of both code and data quality, improved maintainability by breaking the previous monolith into microservices, and automated the entire release process to the corresponding environments.
Technologies used: AWS Glue, AWS S3, AWS RDS (PostgreSQL), Python, PySpark, SQL, GitHub Actions
Challenge
The existing solution, built around an Oracle database, operated as a single monolithic system. When executed, it initiated a data load through a dedicated module, followed by a large set of stored procedures that ran in a fixed linear order. Execution time had hit a lower bound of roughly 30 minutes, and further improvement was not feasible without a serious refactoring effort.
The business logic contained in the stored procedures was not properly tested. This lack of testing complicated any refactoring initiative, risked unintended side effects, and made it difficult to modify the system or add new functionality.
The data did not undergo quality checks before landing in the production environment. This opened the door for poor-quality data to reach consumers, which affected the accuracy of reports. Additionally, taking changes from concept to production involved multiple manual steps and human intervention, which made the process unnecessarily long.
Solution
For this initiative, we built the new data architecture on AWS. On the infrastructure side, we selected AWS RDS (PostgreSQL) for the database layer and Spark for data processing. We used AWS Glue to run the PySpark jobs on a schedule and AWS S3 to store the Glue job artifacts. Additionally, we used ECS to run the service responsible for migrating data to the cloud.
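For illustration, a minimal sketch of what one of these scheduled Glue jobs could look like is shown below; the argument names, the S3 path, the transaction columns, and the JDBC endpoint are hypothetical placeholders rather than the client's actual configuration, and credentials are omitted.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the arguments passed to the job by the Glue schedule/trigger.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "SOURCE_PATH", "TARGET_TABLE"])

# Standard Glue boilerplate: wrap the Spark context and initialise the job.
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw extract landed in S3 by the migration service
# (e.g. s3://bank-data-landing/transactions/ -- hypothetical path).
raw_df = spark.read.parquet(args["SOURCE_PATH"])

# Apply the PySpark transformations that replaced the stored procedures
# (illustrative logic only).
curated_df = raw_df.dropDuplicates(["transaction_id"]).filter("amount IS NOT NULL")

# Write the curated result into the RDS PostgreSQL layer over JDBC.
# Credentials are omitted here; in practice they would come from a Glue
# connection or a secrets store rather than being hard-coded.
(
    curated_df.write.format("jdbc")
    .option("url", "jdbc:postgresql://example-host:5432/analytics")  # placeholder endpoint
    .option("dbtable", args["TARGET_TABLE"])
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save()
)

job.commit()
```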
On the development side, we used Python as the primary programming language and converted the entire set of SQL stored procedures and functions to PySpark. We managed the project codebase with Poetry and Ruff, used pytest for unit and integration testing, and set up a CI/CD mechanism to automatically release changes to the appropriate environments.
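To give a flavour of that conversion work, the sketch below shows a hypothetical stored-procedure-style aggregation rewritten as a plain PySpark function together with a pytest unit test; the column names and the aggregation logic are illustrative assumptions, not the client's actual business rules.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def daily_account_balances(transactions: DataFrame) -> DataFrame:
    """Aggregate transactions into one balance row per account and day.

    Hypothetical example of logic that previously lived in a stored procedure.
    """
    return (
        transactions.groupBy("account_id", "booking_date")
        .agg(F.sum("amount").alias("balance"))
    )


def test_daily_account_balances():
    # A local Spark session is enough to unit-test the transformation.
    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()
    input_df = spark.createDataFrame(
        [
            ("A1", "2024-01-01", 100.0),
            ("A1", "2024-01-01", -40.0),
            ("A2", "2024-01-01", 10.0),
        ],
        ["account_id", "booking_date", "amount"],
    )

    result = {
        (row["account_id"], row["booking_date"]): row["balance"]
        for row in daily_account_balances(input_df).collect()
    }

    assert result == {("A1", "2024-01-01"): 60.0, ("A2", "2024-01-01"): 10.0}
```

Expressing the logic as ordinary functions like this is what makes it testable with pytest, in contrast to the untested stored procedures it replaced.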
Result
We delivered a data architecture fully operational on AWS, which not only fulfilled its intended purpose but also minimized or eliminated the administrative work required to handle database operational issues, orchestration, and release management, significantly reducing the time spent on these tasks.
We enforced best practices such as fully automated change management and delivery. This reduced the human input needed to release changes to production, cutting release times from hours to minutes. It also ensured that no change reached production without passing linters, tests, and data quality checks, which together covered more than 80% of the codebase.
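As a rough illustration of the kind of data quality gate involved, the sketch below shows a simple PySpark check that fails a run before data lands in production; the rules, parameters, and exception name are assumptions made for the example, not the checks actually deployed.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


class DataQualityError(Exception):
    """Raised when a dataset fails its quality gate before release."""


def check_quality(df: DataFrame, key_column: str, required_columns: list[str]) -> None:
    """Fail the pipeline run if basic quality rules are violated.

    Illustrative rules only: non-empty dataset, unique keys, and no nulls
    in required columns.
    """
    total = df.count()
    if total == 0:
        raise DataQualityError("dataset is empty")

    if df.select(key_column).distinct().count() != total:
        raise DataQualityError(f"duplicate values found in key column '{key_column}'")

    for column in required_columns:
        nulls = df.filter(F.col(column).isNull()).count()
        if nulls > 0:
            raise DataQualityError(
                f"{nulls} null values found in required column '{column}'"
            )
```

A gate like this would typically run as the final step of a job, so a failure stops the load before anything reaches the production schema.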
The foundational data architecture we built is easy to extend to support future initiatives and is designed to handle both the client's current and planned data loads. Finally, we delivered fully documented work that serves as a guide for further contributions and eases the onboarding of new team members to the project.