Vitrai Gábor
About Me
Professional Summary
I am an enthusiastic Data Engineer with a strong background in Computer Science and more than 3 years of experience. I specialize in ETL development, data pipeline design, and data warehousing using both streaming and batch processing solutions. I am experienced in leveraging the AWS Cloud and Apache Kafka for data-heavy projects. I have proven reliable as an On-Call Engineer and have demonstrated strong collaboration and independent work skills. I have a personal interest in cloud solutions, especially the AWS Cloud.
Skills
Work Experience & Education
Work Experience
Data Engineer
SpazioDati Srl., Trento, Italy
Implemented new ETL data processing pipelines and maintained existing ones across large-scale projects. Acquired, processed, migrated, and stored structured and unstructured data. Relied on the AWS Cloud, Apache Kafka, Kubernetes, and Docker to implement dozens of pipelines. Monitored and maintained processes as an On-Call Engineer.
Data Engineer Intern
SpazioDati Srl., Trento, Italy
Created Machine Learning models for the task of Postal Address Extraction from Free Text.
Teacher Assistant in Software Technology
Eötvös Loránd University, Budapest, Hungary
Taught Coding Principles, Java, CI, Architectures, and Project Management to Bachelor's students. Took on the roles of Scrum Master and Project Manager.
Java Software Engineer
Eötvös Loránd University, Budapest, Hungary · Contract
Developed a map viewer web application using the Spring Framework. Responsible for backend service development, while also contributing to the user-facing client and the deployment of the finished application.
Education
Master of Science in Data Science
University of Trento/Eötvös Loránd University, Budapest (Hungary) / Trento (Italy)
Graduated with 110/110 Cum Laude
Thesis: Multilingual Address Extraction From Websites.
EIT Digital Master School Double Degree Program
Bachelor of Science in Computer Science
Eötvös Loránd University, Budapest
Graduated with 5/5 Grade
Thesis: Development of a Corporate Management System with the help of Spring Framework
Projects
Cloud Resume Challenge on AWS
Project Description: A production-ready serverless resume website built on AWS, demonstrating cloud architecture best practices, infrastructure as code, and modern CI/CD workflows. This project showcases end-to-end cloud engineering skills from infrastructure provisioning to automated deployments.
Personal Project Goal: My goal with this project was to get my hands dirty with Terraform and create an online CV that uses the technologies required by the Cloud Resume Challenge. With this small project, I also wanted a place where I can collect my small projects and share my portfolio.
Architecture Highlights
- Infrastructure as Code (Terraform): All AWS resources provisioned and managed declaratively, including S3 buckets, CloudFront distribution, Route53 DNS records, ACM certificates, Lambda functions, and DynamoDB tables. State stored in versioned S3 backend with DynamoDB state locking to prevent concurrent modifications.
- CloudFront CDN: Global content delivery with edge caching, Origin Access Control (OAC) for secure S3 access, custom SSL/TLS certificate, and geo-restrictions for European traffic.
- GitHub Actions CI/CD: Automated testing and deployment pipeline with branch-based workflows - tests run on feature/develop branches, full deployment to AWS on main branch merges. Includes HTML validation, JavaScript linting, S3 sync, and CloudFront cache invalidation.
- Serverless View Counter: Lambda Function URL with Python runtime, atomic DynamoDB updates, CORS protection, and sub-second response times for real-time visitor tracking (sketched below, after this list).
- Route53 DNS Management: Custom domain with A/AAAA records aliased to CloudFront, automated DNS validation for SSL certificates.
- AWS Certificate Manager: Free SSL/TLS certificate with automatic renewal, deployed in us-east-1 for CloudFront compatibility.
- S3 Static Hosting: Versioned bucket with encryption at rest, private access via CloudFront OAC, optimized cache headers for performance.
- DynamoDB: Serverless NoSQL database with on-demand billing, atomic increment operations, and single-digit millisecond latency.
- Infrastructure Monitoring: EventBridge rules detect CloudFront API changes (create, update, delete operations) and send real-time email alerts via SNS, providing security monitoring and change tracking for critical infrastructure components.
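To make the view counter concrete, here is a minimal Python sketch of such a Lambda handler performing an atomic DynamoDB increment; the table name, key schema, and attribute names are illustrative assumptions rather than the values used in the deployed function.

# Minimal view-counter sketch (table, key, and attribute names are assumptions, not the deployed code)
import json
import os
import boto3

# The table name would normally come from an environment variable; "resume-views" is a placeholder.
table = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "resume-views"))

def lambda_handler(event, context):
    # ADD performs an atomic increment, so concurrent visitors never lose an update.
    result = table.update_item(
        Key={"id": "resume"},  # single counter item (assumed key schema)
        UpdateExpression="ADD #views :inc",
        ExpressionAttributeNames={"#views": "views"},
        ExpressionAttributeValues={":inc": 1},
        ReturnValues="UPDATED_NEW",
    )
    return {
        "statusCode": 200,
        # The real function restricts CORS to the site origin; "*" is used here only for brevity.
        "headers": {"Access-Control-Allow-Origin": "*", "Content-Type": "application/json"},
        "body": json.dumps({"views": int(result["Attributes"]["views"])}),
    }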
Separation of Concerns
- Terraform manages infrastructure: Long-lived resources like S3 buckets, CloudFront, DNS, certificates, Lambda, and DynamoDB.
- GitHub Actions manages content: Website files (HTML, CSS, JS, images) deployed automatically on code changes.
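To illustrate what the content-deployment half does (the actual pipeline runs in GitHub Actions with the AWS CLI's s3 sync), an equivalent sketch in Python/boto3 could look like this; the bucket name, distribution ID, and file list are placeholders.

# Sketch of the content deploy step (the actual pipeline uses GitHub Actions + AWS CLI; names are placeholders)
import mimetypes
import time
import boto3

BUCKET = "example-resume-bucket"      # placeholder
DISTRIBUTION_ID = "E1234EXAMPLE"      # placeholder CloudFront distribution ID

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

def deploy(files):
    # Upload each site file with a correct Content-Type so browsers render it properly.
    for path in files:
        content_type, _ = mimetypes.guess_type(path)
        s3.upload_file(path, BUCKET, path, ExtraArgs={"ContentType": content_type or "binary/octet-stream"})
    # Invalidate the CDN cache so CloudFront serves the fresh files immediately.
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/*"]},
            "CallerReference": str(time.time()),  # must be unique per invalidation request
        },
    )

deploy(["index.html", "styles.css", "script.js"])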
RAG Generative AI Application using AWS Bedrock and OpenSearch
Project Description: A serverless Retrieval-Augmented Generation (RAG) application built on AWS for demo purposes, demonstrating AI/ML engineering with AWS Bedrock, Knowledge Bases, and modern cloud architecture. Features intelligent document Q&A and summarization using Claude Sonnet 4.5 and Haiku 4.5 models.
Personal Project Goal: My goal with this project was to build a reproducible RAG chatbot demo using only Terraform. I found some funny "Mars Travel Documents" online that I wanted to use to create a RAG chatbot on AWS technologies. The main objective was to make sure that all resources and components were managed by Terraform. To confirm this was possible, I found an official AWS sample codebase belonging to a workshop, which proved that all the components I wanted to combine can be handled by Terraform. The main challenge was bringing the OpenSearch service and the index creation under Terraform's management. Adding Bedrock Guardrails and designing the components as Terraform modules was an additional goal I set for myself. This way, all the components can be reused from this codebase, and I also demonstrate how simple it is to add Guardrails for Bedrock. (See examples at the bottom.)
Architecture Highlights
- Infrastructure as Code (Terraform): All AWS resources provisioned and managed declaratively using modular Terraform design. Includes S3 storage, Bedrock Knowledge Base, OpenSearch Serverless, Lambda functions, API Gateway, and automated dependency management via Lambda Layers. State stored in versioned S3 backend with DynamoDB state locking.
- AWS Bedrock Knowledge Base: Managed vector database using OpenSearch Serverless for semantic document search. Integrated with Amazon Titan Embed Text v2 for embeddings, with an automatic one-time synchronization from S3.
- Bedrock Guardrails: Implemented example Guardrails that block user inquiries about "Legal" or "Medical" topics.
- Claude AI Models via Inference Profiles: Leverages EU cross-region inference profiles for high availability. The QA endpoint uses Claude Sonnet 4.5 for reasoning and answering complex questions; the Summarize endpoint uses Claude Haiku 4.5 for fast document processing.
- OpenSearch Serverless: Vector database with automatic scaling, encryption at rest and in transit, and network isolation via VPC endpoints.
- API Gateway REST API: Secure REST API with API key authentication, usage plans with throttling (100 req/sec, 1000 burst), CORS configuration, and AWS_PROXY Lambda integration for request/response handling.
- Lambda Functions (Python 3.12): Two serverless functions with configured timeout and memory allocation. The QA function performs RAG queries with context retrieval (sketched below, after this list); the Summarize function extracts text from PDFs using a pypdf Lambda layer and generates summaries.
- Lambda Layers for Dependencies: Clean separation of code and dependencies using automated Lambda Layer builds. Terraform triggers pip install from requirements.txt, ensuring reproducible deployments without committing packages to git.
- IAM Security: Fine-grained permissions following least privilege principle. Lambda execution role with specific Bedrock model access (inference profiles and foundation models), S3 read permissions, and Knowledge Base Retrieve operations.
- S3 Document Storage: Versioned S3 bucket with server-side encryption for document storage. The Knowledge Base data source sync is triggered automatically by Terraform for demo purposes.
- Modular Terraform Architecture: Infrastructure split into reusable modules (storage, knowledge-base, lambda-functions, api-gateway) promoting separation of concerns, testing, and maintainability.
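To make the QA flow concrete, here is a simplified sketch of how a Lambda like the QA function can retrieve context from the Knowledge Base and generate an answer with a guardrail attached. The Knowledge Base ID, inference profile ID, guardrail identifiers, and prompt wording are placeholders for illustration, not the deployed code.

# Simplified RAG query sketch (IDs and prompt are placeholders, not the deployed function)
import json
import boto3

agent_rt = boto3.client("bedrock-agent-runtime", region_name="eu-west-1")
bedrock_rt = boto3.client("bedrock-runtime", region_name="eu-west-1")

KB_ID = "<knowledge-base-id>"              # placeholder
MODEL_ID = "<eu-inference-profile-id>"     # placeholder EU cross-region inference profile
GUARDRAIL_ID, GUARDRAIL_VERSION = "<guardrail-id>", "1"  # placeholders

def lambda_handler(event, context):
    question = json.loads(event["body"])["question"]

    # 1) Retrieve the most relevant chunks from the Bedrock Knowledge Base (vector search).
    retrieval = agent_rt.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    chunks = [r["content"]["text"] for r in retrieval["retrievalResults"]]

    # 2) Ask Claude to answer from the retrieved context; the guardrail screens the exchange.
    prompt = "Answer using only this context:\n" + "\n---\n".join(chunks) + "\n\nQuestion: " + question
    response = bedrock_rt.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        guardrailConfig={"guardrailIdentifier": GUARDRAIL_ID, "guardrailVersion": GUARDRAIL_VERSION},
    )
    answer = response["output"]["message"]["content"][0]["text"]

    return {
        "statusCode": 200,
        "body": json.dumps({"question": question, "answer": answer, "documentsRetrieved": len(chunks)}),
    }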
Notes:
The purpose of the demo is not to be production ready, but to showcase my understanding of such a system.
The goal was to have the entire system managed by Terraform.
For future improvements, the following developments could be done:
Add CI/CD, add a debounced auto-sync for the Knowledge Base, introduce proper dev/production environments, and implement a user-facing UI.
The application is not kept running due to the high costs of the AWS OpenSearch service.
If you want to try it out, feel free to clone the repository, set up your AWS CLI profile, rename the project, and execute the terraform apply command!
Example ALLOWED Guardrail Request & Response:
Request: curl -X POST https://<api-id>.execute-api.eu-west-1.amazonaws.com/prod/qa \
-H "x-api-key: <api-key>" \
-H "Content-Type: application/json" \
-d '{"question": "What are the visa requirements for Mars?", ...}'
Response:
{
"answer": "Based on the context provided, the visa requirements for Mars travelers include:
1. **Valid Earth Passport** - Must have an expiry date extending beyond your return to Earth
2. **Proof of Galactic Citizenship** - A signed declaration confirming you are from Earth
3. **Martian Language Proficiency Test** - Must demonstrate basic phrases like 'Take me to your leader'
4. **Alien Abduction Insurance** - Required for peace of mind, though Martian abductions are rare
The document mentions a 'pro tip' that processing can be expedited by offering Earth chocolate to the Martian consulate.",
"sources": [
{
"document": 1,
"score": 0.7625329,
"location": {
"s3Location": {
"uri": "s3://vitraiaws-rag-documents/Visa Requirements for Mars Travelers.pdf"
},
"type": "S3"
}
},
{...}
],
"question": "What are the visa requirements for Mars?",
"documentsRetrieved": 5
}
Example DENIED Guardrail Request & Response:
Request: curl -X POST https://<api-id>.execute-api.eu-west-1.amazonaws.com/prod/qa \
-H "x-api-key: <api-key>" \
-H "Content-Type: application/json" \
-d '{"question": "What are my legal rights if the Mars immigration office denies my visa application?", ...}'
Response:
{
"answer": "I cannot answer that question as it violates our content policy.",
"sources": [
{
"document": 1,
"score": 0.46830863,
"location": {
"s3Location": {
"uri": "s3://vitraiaws-rag-documents/Visa Requirements for Mars Travelers.pdf"
},
"type": "S3"
}
},
{...}
],
"question": "What are my legal rights if the Mars immigration office denies my visa application?",
"documentsRetrieved": 4
}
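The refusal above is what the endpoint returns when the guardrail intervenes. Continuing the QA sketch from earlier (so this is an assumption about the handling, not the deployed code), the intervention can be detected from the Converse API's stop reason:

# Possible guardrail handling, continuing the earlier QA sketch (assumed, not the deployed code)
response = bedrock_rt.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    guardrailConfig={"guardrailIdentifier": GUARDRAIL_ID, "guardrailVersion": GUARDRAIL_VERSION},
)
if response.get("stopReason") == "guardrail_intervened":
    # Return a canned refusal instead of model output when the guardrail blocks the request.
    answer = "I cannot answer that question as it violates our content policy."
else:
    answer = response["output"]["message"]["content"][0]["text"]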
AWS Glue ETL Pipeline with Event-Driven Orchestration
Project Description: An event-driven ETL pipeline built on AWS Glue demonstrating data engineering practices. Processes the Netflix Movies dataset (~9,000 titles) with transformations (date parsing, feature extraction, data quality improvements), converting CSV to Parquet format for analytics. Features multiple orchestration approaches: Terraform automation, EventBridge + Lambda event-driven architecture, and AWS Step Functions state machines for complete workflow management.
Personal Project Goal: In this project, I wanted to create a demo completely managed by Terraform, in a way that uses Glue Crawlers to populate an AWS Glue Data Catalog database. The simple goal I set for myself was to have a Glue Job execute some PySpark code, giving me a dataset that can be crawled by two crawlers, resulting in a source and a destination table that can be analysed with Athena. Additionally, I wanted to add a Step Functions State Machine to showcase similar approaches to data processing. By this point, the demo could be run 100% by a simple terraform apply command, could be run manually while trusting the rest of the runs to EventBridge, and could be run or scheduled by a State Machine.
AWS Application Composer visualization showing all infrastructure components and their relationships
Event-Driven Architecture: Understanding the Flow
This project demonstrates event-driven architecture through three orchestration approaches:
1. Terraform Provisioner Automation: On terraform apply, provisioners trigger the "source crawler", poll its status, then automatically start the Glue ETL job, demonstrating infrastructure-level event orchestration.
2. EventBridge + Lambda Pattern: When the Glue job completes, it emits a Glue Job State Change event to EventBridge. EventBridge matches the event pattern and invokes a Lambda function, which triggers the "destination crawler" (see the sketch after this list). This creates a reactive, loosely coupled architecture where components respond to state changes.
3. Step Functions Orchestration: The state machine orchestrates the full workflow: Starts "source crawler" → polls until READY → starts Glue job (sync waits for completion) → starts "destination crawler" → polls until READY. Provides centralized control with visual tracking.
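A minimal sketch of the Lambda used in approach 2 could look like the following; the crawler name is a placeholder, and the event fields follow the standard Glue Job State Change event shape.

# Sketch of the EventBridge-triggered Lambda (crawler name is a placeholder)
import boto3

glue = boto3.client("glue")
DESTINATION_CRAWLER = "destination-crawler"  # placeholder name

def lambda_handler(event, context):
    # EventBridge delivers the Glue Job State Change event; only act on successful runs.
    detail = event.get("detail", {})
    if detail.get("state") != "SUCCEEDED":
        return {"skipped": True, "state": detail.get("state")}
    try:
        glue.start_crawler(Name=DESTINATION_CRAWLER)
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already running; treat this as success instead of failing the event.
        pass
    return {"started": DESTINATION_CRAWLER, "job": detail.get("jobName")}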
Architecture Highlights
- Infrastructure as Code: All AWS resources provisioned and managed declaratively by Terraform. S3 backend with DynamoDB state locking.
- Glue Data Catalog: Centralized metadata repository with two crawlers. Source crawler uses custom CSV classifier with OpenCSVSerde for proper quoted field parsing. Destination crawler automatically catalogs transformed Parquet files which can be reviewed in AWS Athena.
- Glue ETL Job (PySpark): Spark transformations applied to the dataset: column renaming, type casting, date parsing, feature extraction, calculated fields, and data cleaning. Outputs Parquet with CloudWatch logging (a transformation sketch follows this list).
- EventBridge Event-Driven: Rule monitors job state changes, invokes Lambda on completion. Creates a reactive architecture.
- Lambda Function: Triggered by EventBridge, starts "destination crawler" on job success. Handles "crawler already running" errors.
- Step Functions: Orchestrates the full workflow with polling loops and sync patterns. Built-in error handling and CloudWatch Logs integration (mirrored in the Python sketch at the end of this project section).
- S3 Organization: Single bucket with folders: source/, destination/, scripts/, logs/, temp/.
- IAM Security: Separate roles with scoped permissions for Glue, Lambda, and Step Functions. No wildcard access.
- Data Transformations: Processes Netflix dataset. Extracts structured data and outputs 16 analytical columns optimized for Athena.
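To give a flavour of the transformations mentioned above, here is a small PySpark sketch; the column names, date format, and S3 paths are illustrative and not the full job script.

# Illustrative PySpark transformation sketch (column names and paths are placeholders)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("netflix-etl-sketch").getOrCreate()

# Read the raw CSV from the source/ prefix (header row, quoted fields).
df = spark.read.option("header", True).option("quote", '"').csv("s3://example-bucket/source/")

transformed = (
    df.withColumnRenamed("listed_in", "genres")                                     # column renaming
      .withColumn("release_year", F.col("release_year").cast("int"))                # type casting
      .withColumn("date_added", F.to_date(F.trim("date_added"), "MMMM d, yyyy"))    # date parsing
      .withColumn("year_added", F.year("date_added"))                               # feature extraction
      .dropDuplicates(["show_id"])                                                  # data cleaning
      .na.drop(subset=["title"])
)

# Write the analytics-friendly Parquet output to the destination/ prefix for the second crawler.
transformed.write.mode("overwrite").parquet("s3://example-bucket/destination/")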
Orchestration Approaches
- For Demo/Development: Terraform provisioners for instant deployment.
- For Production: EventBridge + Lambda for reactive automation.
- Can be Scheduled: Step Functions for centralized control.
Step Functions execution showing workflow with status polling loops
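For reference, the polling flow the state machine implements can be mirrored with a few boto3 calls; the crawler and job names are placeholders, and in the actual project this flow is driven by Step Functions and Terraform rather than a script.

# Script-style mirror of the orchestration flow (names are placeholders; the project drives this via Step Functions)
import time
import boto3

glue = boto3.client("glue")
SOURCE_CRAWLER, DEST_CRAWLER, JOB_NAME = "source-crawler", "destination-crawler", "netflix-etl-job"

def wait_for_crawler(name, delay=30):
    # A crawler is finished once it returns to the READY state.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(delay)

glue.start_crawler(Name=SOURCE_CRAWLER)                      # 1) catalog the raw CSV
wait_for_crawler(SOURCE_CRAWLER)

run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]    # 2) run the PySpark ETL job
while glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"] in ("STARTING", "RUNNING", "STOPPING"):
    time.sleep(30)

glue.start_crawler(Name=DEST_CRAWLER)                        # 3) catalog the transformed Parquet output
wait_for_crawler(DEST_CRAWLER)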
Certifications
AWS Certified Cloud Practitioner
Amazon Web Services (2023)
HashiCorp Certified Terraform Associate
HashiCorp (2025)
Achievements
Designed and maintained dozens of ETL pipelines for data acquisition and processing
Architected and maintained large-scale ETL pipelines downloading data from official providers like Cloudera, Camerdata, HitHorizons and public web repositories. Implemented processing workflows using EC2 instances, applied proprietary matching tools for data enrichment, and orchestrated uploads to our AWS-based data lake.
Migrated legacy batching pipelines to streaming solutions
We maintained large-scale legacy pipelines that processed data using batching solutions. With my team, we successfully migrated away from these legacy systems and replaced them with state-of-the-art streaming solutions, providing live updates and a more cost-effective infrastructure.
Implemented Big Data Platform Kafka sink connectors
Designed and implemented custom Kafka sink connectors for reading streaming Kafka data and storing it in a Postgres data lake. This solution enabled real-time data ingestion from multiple streaming sources, providing a scalable foundation for analytics while maintaining data consistency and reliability.
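Kafka Connect sink connectors are typically implemented in Java against the Connect API, so purely as an illustration of the data flow described here, a stripped-down Python consumer writing to Postgres might look like this; the topic, table, and connection details are placeholders, and this is not the proprietary connector.

# Illustration only: stripped-down Kafka-to-Postgres flow (not the actual sink connector; names are placeholders)
import json
import psycopg2
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "postgres-sink-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["company-events"])      # placeholder topic

conn = psycopg2.connect("dbname=datalake user=demo password=demo host=localhost")  # placeholder DSN
cur = conn.cursor()

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        # Upserting keeps the sink idempotent if the same Kafka record is re-delivered.
        cur.execute(
            "INSERT INTO company_events (id, payload) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
            (record["id"], json.dumps(record)),
        )
        conn.commit()
finally:
    consumer.close()
    conn.close()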