Vitrai Gábor
About Me
Professional Summary
I am an enthusiastic Data Engineer with a strong background in Computer Science and more than 3 years of experience. I specialize in ETL development, data pipeline design, and data warehousing using both streaming and batch processing solutions. I am experienced in leveraging the AWS Cloud and Apache Kafka for data-heavy projects. I have proven reliable as an On-Call Engineer and have demonstrated strong collaboration and independent work skills. I have a personal interest in cloud solutions, especially the AWS Cloud.
Skills
Work Experience & Education
Work Experience
Data Engineer
SpazioDati Srl., Trento, Italy
Implemented new ETL data processing pipelines and maintained existing ones across large-scale projects. Acquired, processed, migrated, and stored structured and unstructured data. Relied on the AWS Cloud, Apache Kafka, Kubernetes, and Docker to implement dozens of pipelines. Monitored and maintained processes as an On-Call Engineer.
Data Engineer Intern
SpazioDati Srl., Trento, Italy
Created Machine Learning models for the task of Postal Address Extraction from Free Text.
Teacher Assistant in Software Technology
Eötvös Loránd University, Budapest, Hungary
Taught Coding Principles, Java, CI, Architectures, and Project Management to Bachelor's students. Took on the roles of Scrum Master and Project Manager.
Java Software Engineer
Eötvös Loránd University, Budapest, Hungary · Contract
Developed a map viewer web application using the Spring Framework. Responsible for backend service development, while also contributing to the user-facing client and the deployment of the finished application.
Education
Master of Science in Data Science
University of Trento/Eötvös Loránd University, Budapest (Hungary) / Trento (Italy)
Graduated with 110/110 Cum Laude
Thesis: Multilingual Address Extraction From Websites.
EIT Digital Master School Double Degree Program
Bachelor of Science in Computer Science
Eötvös Loránd University, Budapest
Graduated with 5/5 Grade
Thesis: Development of a Corporate Management System with the help of Spring Framework
Projects
Cloud Resume Challenge on AWS
Project Description: A production-ready serverless resume website built on AWS, demonstrating cloud architecture best practices, infrastructure as code, and modern CI/CD workflows. This project showcases end-to-end cloud engineering skills from infrastructure provisioning to automated deployments.
Personal Project Goal: My goal with this project was to get my hands dirty with Terraform and create an online CV that uses the technologies required by the Cloud Resume Challenge. With this small project, I also wanted a place where I can collect my small projects and share my portfolio.
Architecture Highlights
- Infrastructure as Code (Terraform): All AWS resources provisioned and managed declaratively, including S3 buckets, CloudFront distribution, Route53 DNS records, ACM certificates, Lambda functions, and DynamoDB tables. State stored in versioned S3 backend with DynamoDB state locking to prevent concurrent modifications.
- CloudFront CDN: Global content delivery with edge caching, Origin Access Control (OAC) for secure S3 access, custom SSL/TLS certificate, and geo-restrictions for European traffic.
- GitHub Actions CI/CD: Automated testing and deployment pipeline with branch-based workflows - tests run on feature/develop branches, full deployment to AWS on main branch merges. Includes HTML validation, JavaScript linting, S3 sync, and CloudFront cache invalidation.
- Serverless View Counter: Lambda Function URL with Python runtime, atomic DynamoDB updates, CORS protection, and sub-second response times for real-time visitor tracking (sketched below, after this list).
- Route53 DNS Management: Custom domain with A/AAAA records aliased to CloudFront, automated DNS validation for SSL certificates.
- AWS Certificate Manager: Free SSL/TLS certificate with automatic renewal, deployed in us-east-1 for CloudFront compatibility.
- S3 Static Hosting: Versioned bucket with encryption at rest, private access via CloudFront OAC, optimized cache headers for performance.
- DynamoDB: Serverless NoSQL database with on-demand billing, atomic increment operations, and single-digit millisecond latency.
- Infrastructure Monitoring: EventBridge rules detect CloudFront API changes (create, update, delete operations) and send real-time email alerts via SNS, providing security monitoring and change tracking for critical infrastructure components.
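To make the view counter concrete, here is a minimal Python sketch of such a Lambda handler performing an atomic DynamoDB increment; the table name, key schema, and attribute names are illustrative assumptions rather than the values used in the deployed function.

# Minimal view-counter sketch (table, key, and attribute names are assumptions, not the deployed code)
import json
import os
import boto3

# The table name would normally come from an environment variable; "resume-views" is a placeholder.
table = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "resume-views"))

def lambda_handler(event, context):
    # ADD performs an atomic increment, so concurrent visitors never lose an update.
    result = table.update_item(
        Key={"id": "resume"},  # single counter item (assumed key schema)
        UpdateExpression="ADD #views :inc",
        ExpressionAttributeNames={"#views": "views"},
        ExpressionAttributeValues={":inc": 1},
        ReturnValues="UPDATED_NEW",
    )
    return {
        "statusCode": 200,
        # The real function restricts CORS to the site origin; "*" is used here only for brevity.
        "headers": {"Access-Control-Allow-Origin": "*", "Content-Type": "application/json"},
        "body": json.dumps({"views": int(result["Attributes"]["views"])}),
    }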
Separation of Concerns
- Terraform manages infrastructure: Long-lived resources like S3 buckets, CloudFront, DNS, certificates, Lambda, and DynamoDB.
- GitHub Actions manages content: Website files (HTML, CSS, JS, images) deployed automatically on code changes.
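To illustrate what the content-deployment half does (the actual pipeline runs in GitHub Actions with the AWS CLI's s3 sync), an equivalent sketch in Python/boto3 could look like this; the bucket name, distribution ID, and file list are placeholders.

# Sketch of the content deploy step (the actual pipeline uses GitHub Actions + AWS CLI; names are placeholders)
import mimetypes
import time
import boto3

BUCKET = "example-resume-bucket"      # placeholder
DISTRIBUTION_ID = "E1234EXAMPLE"      # placeholder CloudFront distribution ID

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

def deploy(files):
    # Upload each site file with a correct Content-Type so browsers render it properly.
    for path in files:
        content_type, _ = mimetypes.guess_type(path)
        s3.upload_file(path, BUCKET, path, ExtraArgs={"ContentType": content_type or "binary/octet-stream"})
    # Invalidate the CDN cache so CloudFront serves the fresh files immediately.
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/*"]},
            "CallerReference": str(time.time()),  # must be unique per invalidation request
        },
    )

deploy(["index.html", "styles.css", "script.js"])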
RAG Generative AI Application using AWS Bedrock and OpenSearch
Project Description: A serverless Retrieval-Augmented Generation (RAG) application built on AWS for demo purposes, demonstrating AI/ML engineering with AWS Bedrock, Knowledge Bases, and modern cloud architecture. Features intelligent document Q&A and summarization using Claude Sonnet 4.5 and Haiku 4.5 models.
Personal Project Goal: My goal with this project was to build a reproducible RAG chatbot demo using only Terraform. I found some funny "Mars Travel Documents" online that I wanted to use to create a RAG chatbot on AWS technologies. The main objective was to make sure that all resources and components were managed by Terraform. To confirm this was possible, I found an official AWS sample codebase belonging to a workshop, which proved that all the components I wanted to combine can be handled by Terraform. The main challenge was bringing the OpenSearch service and the index creation under Terraform's management. Adding Bedrock Guardrails and designing the components as Terraform modules was an additional goal I set for myself. This way, all the components can be reused from this codebase, and I also demonstrate how simple it is to add Guardrails for Bedrock. (See examples at the bottom.)
Architecture Highlights
- Infrastructure as Code (Terraform): All AWS resources provisioned and managed declaratively using modular Terraform design. Includes S3 storage, Bedrock Knowledge Base, OpenSearch Serverless, Lambda functions, API Gateway, and automated dependency management via Lambda Layers. State stored in versioned S3 backend with DynamoDB state locking.
- AWS Bedrock Knowledge Base: Managed vector database using OpenSearch Serverless for semantic document search. Integrated with Amazon Titan Embed Text v2 for embeddings, with an automatic one-time synchronization from S3.
- Bedrock Guardrails: Implemented example Guardrails that block user inquiries about "Legal" or "Medical" topics.
- Claude AI Models via Inference Profiles: Leverages EU cross-region inference profiles for high availability. The QA endpoint uses Claude Sonnet 4.5 for reasoning and answering complex questions; the Summarize endpoint uses Claude Haiku 4.5 for fast document processing.
- OpenSearch Serverless: Vector database with automatic scaling, encryption at rest and in transit, and network isolation via VPC endpoints.
- API Gateway REST API: Secure REST API with API key authentication, usage plans with throttling (100 req/sec, 1000 burst), CORS configuration, and AWS_PROXY Lambda integration for request/response handling.
- Lambda Functions (Python 3.12): Two serverless functions with configured timeout and memory allocation. The QA function performs RAG queries with context retrieval (sketched below, after this list); the Summarize function extracts text from PDFs using a pypdf Lambda layer and generates summaries.
- Lambda Layers for Dependencies: Clean separation of code and dependencies using automated Lambda Layer builds. Terraform triggers pip install from requirements.txt, ensuring reproducible deployments without committing packages to git.
- IAM Security: Fine-grained permissions following least privilege principle. Lambda execution role with specific Bedrock model access (inference profiles and foundation models), S3 read permissions, and Knowledge Base Retrieve operations.
- S3 Document Storage: Versioned S3 bucket with server-side encryption for document storage. The Knowledge Base data source sync is triggered automatically by Terraform for demo purposes.
- Modular Terraform Architecture: Infrastructure split into reusable modules (storage, knowledge-base, lambda-functions, api-gateway) promoting separation of concerns, testing, and maintainability.
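To make the QA flow concrete, here is a simplified sketch of how a Lambda like the QA function can retrieve context from the Knowledge Base and generate an answer with a guardrail attached. The Knowledge Base ID, inference profile ID, guardrail identifiers, and prompt wording are placeholders for illustration, not the deployed code.

# Simplified RAG query sketch (IDs and prompt are placeholders, not the deployed function)
import json
import boto3

agent_rt = boto3.client("bedrock-agent-runtime", region_name="eu-west-1")
bedrock_rt = boto3.client("bedrock-runtime", region_name="eu-west-1")

KB_ID = "<knowledge-base-id>"              # placeholder
MODEL_ID = "<eu-inference-profile-id>"     # placeholder EU cross-region inference profile
GUARDRAIL_ID, GUARDRAIL_VERSION = "<guardrail-id>", "1"  # placeholders

def lambda_handler(event, context):
    question = json.loads(event["body"])["question"]

    # 1) Retrieve the most relevant chunks from the Bedrock Knowledge Base (vector search).
    retrieval = agent_rt.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    chunks = [r["content"]["text"] for r in retrieval["retrievalResults"]]

    # 2) Ask Claude to answer from the retrieved context; the guardrail screens the exchange.
    prompt = "Answer using only this context:\n" + "\n---\n".join(chunks) + "\n\nQuestion: " + question
    response = bedrock_rt.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        guardrailConfig={"guardrailIdentifier": GUARDRAIL_ID, "guardrailVersion": GUARDRAIL_VERSION},
    )
    answer = response["output"]["message"]["content"][0]["text"]

    return {
        "statusCode": 200,
        "body": json.dumps({"question": question, "answer": answer, "documentsRetrieved": len(chunks)}),
    }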
Notes:
The purpose of the demo is not to be production ready, but to showcase my understanding of such a system.
The goal was to have the entire system managed by Terraform.
For future improvements, the following developments could be done:
Add CI/CD, add a debounced auto-sync for the Knowledge Base, introduce proper dev/production environments, and implement a user-facing UI.
The application is not kept running due to the high costs of the AWS OpenSearch service.
If you want to try it out, feel free to clone the repository, set up your AWS CLI profile, rename the project, and execute the terraform apply command!
Example ALLOWED Guardrail Request & Response:
Request: curl -X POST https://<api-id>.execute-api.eu-west-1.amazonaws.com/prod/qa \
-H "x-api-key: <api-key>" \
-H "Content-Type: application/json" \
-d '{"question": "What are the visa requirements for Mars?", ...}'
Response:
{
"answer": "Based on the context provided, the visa requirements for Mars travelers include:
1. **Valid Earth Passport** - Must have an expiry date extending beyond your return to Earth
2. **Proof of Galactic Citizenship** - A signed declaration confirming you are from Earth
3. **Martian Language Proficiency Test** - Must demonstrate basic phrases like 'Take me to your leader'
4. **Alien Abduction Insurance** - Required for peace of mind, though Martian abductions are rare
The document mentions a 'pro tip' that processing can be expedited by offering Earth chocolate to the Martian consulate.",
"sources": [
{
"document": 1,
"score": 0.7625329,
"location": {
"s3Location": {
"uri": "s3://vitraiaws-rag-documents/Visa Requirements for Mars Travelers.pdf"
},
"type": "S3"
}
},
{...}
],
"question": "What are the visa requirements for Mars?",
"documentsRetrieved": 5
}
Example DENIED Guardrail Request & Response:
Request: curl -X POST https://<api-id>.execute-api.eu-west-1.amazonaws.com/prod/qa \
-H "x-api-key: <api-key>" \
-H "Content-Type: application/json" \
-d '{"question": "What are my legal rights if the Mars immigration office denies my visa application?", ...}'
Response:
{
"answer": "I cannot answer that question as it violates our content policy.",
"sources": [
{
"document": 1,
"score": 0.46830863,
"location": {
"s3Location": {
"uri": "s3://vitraiaws-rag-documents/Visa Requirements for Mars Travelers.pdf"
},
"type": "S3"
}
},
{...}
],
"question": "What are my legal rights if the Mars immigration office denies my visa application?",
"documentsRetrieved": 4
}
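The refusal above is what the endpoint returns when the guardrail intervenes. Continuing the QA sketch from earlier (so this is an assumption about the handling, not the deployed code), the intervention can be detected from the Converse API's stop reason:

# Possible guardrail handling, continuing the earlier QA sketch (assumed, not the deployed code)
response = bedrock_rt.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    guardrailConfig={"guardrailIdentifier": GUARDRAIL_ID, "guardrailVersion": GUARDRAIL_VERSION},
)
if response.get("stopReason") == "guardrail_intervened":
    # Return a canned refusal instead of model output when the guardrail blocks the request.
    answer = "I cannot answer that question as it violates our content policy."
else:
    answer = response["output"]["message"]["content"][0]["text"]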
AWS Glue ETL Pipeline with Event-Driven Orchestration
Project Description: An event-driven ETL pipeline built on AWS Glue demonstrating data engineering practices. Processes the Netflix Movies dataset (~9,000 titles) with transformations (date parsing, feature extraction, data quality improvements), converting CSV to Parquet format for analytics. Features multiple orchestration approaches: Terraform automation, EventBridge + Lambda event-driven architecture, and AWS Step Functions state machines for complete workflow management.
Personal Project Goal: In this project, I wanted to create a demo completely managed by Terraform, in a way that uses Glue Crawlers to populate an AWS Glue Data Catalog database. The simple goal I set for myself was to have a Glue Job execute some PySpark code, giving me a dataset that can be crawled by two crawlers, resulting in a source and a destination table that can be analysed with Athena. Additionally, I wanted to add a Step Functions State Machine to showcase similar approaches to data processing. By this point, the demo could be run 100% by a simple terraform apply command, could be run manually while trusting the rest of the runs to EventBridge, and could be run or scheduled by a State Machine.
AWS Application Composer visualization showing all infrastructure components and their relationships
Event-Driven Architecture: Understanding the Flow
This project demonstrates event-driven architecture through three orchestration approaches:
1. Terraform Provisioner Automation: On terraform apply, provisioners trigger the "source crawler", poll its status, then automatically start the Glue ETL job, demonstrating infrastructure-level event orchestration.
2. EventBridge + Lambda Pattern: When the Glue job completes, it emits a Glue Job State Change event to EventBridge. EventBridge matches the event pattern and invokes a Lambda function, which triggers the "destination crawler" (see the sketch after this list). This creates a reactive, loosely coupled architecture where components respond to state changes.
3. Step Functions Orchestration: The state machine orchestrates the full workflow: Starts "source crawler" → polls until READY → starts Glue job (sync waits for completion) → starts "destination crawler" → polls until READY. Provides centralized control with visual tracking.
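A minimal sketch of the Lambda used in approach 2 could look like the following; the crawler name is a placeholder, and the event fields follow the standard Glue Job State Change event shape.

# Sketch of the EventBridge-triggered Lambda (crawler name is a placeholder)
import boto3

glue = boto3.client("glue")
DESTINATION_CRAWLER = "destination-crawler"  # placeholder name

def lambda_handler(event, context):
    # EventBridge delivers the Glue Job State Change event; only act on successful runs.
    detail = event.get("detail", {})
    if detail.get("state") != "SUCCEEDED":
        return {"skipped": True, "state": detail.get("state")}
    try:
        glue.start_crawler(Name=DESTINATION_CRAWLER)
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already running; treat this as success instead of failing the event.
        pass
    return {"started": DESTINATION_CRAWLER, "job": detail.get("jobName")}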
Architecture Highlights
- Infrastructure as Code: All AWS resources provisioned and managed declaratively by Terraform. S3 backend with DynamoDB state locking.
- Glue Data Catalog: Centralized metadata repository with two crawlers. Source crawler uses custom CSV classifier with OpenCSVSerde for proper quoted field parsing. Destination crawler automatically catalogs transformed Parquet files which can be reviewed in AWS Athena.
- Glue ETL Job (PySpark): Spark transformations applied to the dataset: column renaming, type casting, date parsing, feature extraction, calculated fields, and data cleaning. Outputs Parquet with CloudWatch logging (a transformation sketch follows this list).
- EventBridge Event-Driven: Rule monitors job state changes, invokes Lambda on completion. Creates a reactive architecture.
- Lambda Function: Triggered by EventBridge, starts "destination crawler" on job success. Handles "crawler already running" errors.
- Step Functions: Orchestrates the full workflow with polling loops and sync patterns. Built-in error handling and CloudWatch Logs integration (mirrored in the Python sketch at the end of this project section).
- S3 Organization: Single bucket with folders: source/, destination/, scripts/, logs/, temp/.
- IAM Security: Separate roles with scoped permissions for Glue, Lambda, and Step Functions. No wildcard access.
- Data Transformations: Processes Netflix dataset. Extracts structured data and outputs 16 analytical columns optimized for Athena.
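To give a flavour of the transformations mentioned above, here is a small PySpark sketch; the column names, date format, and S3 paths are illustrative and not the full job script.

# Illustrative PySpark transformation sketch (column names and paths are placeholders)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("netflix-etl-sketch").getOrCreate()

# Read the raw CSV from the source/ prefix (header row, quoted fields).
df = spark.read.option("header", True).option("quote", '"').csv("s3://example-bucket/source/")

transformed = (
    df.withColumnRenamed("listed_in", "genres")                                     # column renaming
      .withColumn("release_year", F.col("release_year").cast("int"))                # type casting
      .withColumn("date_added", F.to_date(F.trim("date_added"), "MMMM d, yyyy"))    # date parsing
      .withColumn("year_added", F.year("date_added"))                               # feature extraction
      .dropDuplicates(["show_id"])                                                  # data cleaning
      .na.drop(subset=["title"])
)

# Write the analytics-friendly Parquet output to the destination/ prefix for the second crawler.
transformed.write.mode("overwrite").parquet("s3://example-bucket/destination/")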
Orchestration Approaches
- For Demo/Development: Terraform provisioners for instant deployment.
- For Production: EventBridge + Lambda for reactive automation.
- Can be Scheduled: Step Functions for centralized control.
Step Functions execution showing workflow with status polling loops
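For reference, the polling flow the state machine implements can be mirrored with a few boto3 calls; the crawler and job names are placeholders, and in the actual project this flow is driven by Step Functions and Terraform rather than a script.

# Script-style mirror of the orchestration flow (names are placeholders; the project drives this via Step Functions)
import time
import boto3

glue = boto3.client("glue")
SOURCE_CRAWLER, DEST_CRAWLER, JOB_NAME = "source-crawler", "destination-crawler", "netflix-etl-job"

def wait_for_crawler(name, delay=30):
    # A crawler is finished once it returns to the READY state.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(delay)

glue.start_crawler(Name=SOURCE_CRAWLER)                      # 1) catalog the raw CSV
wait_for_crawler(SOURCE_CRAWLER)

run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]    # 2) run the PySpark ETL job
while glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"] in ("STARTING", "RUNNING", "STOPPING"):
    time.sleep(30)

glue.start_crawler(Name=DEST_CRAWLER)                        # 3) catalog the transformed Parquet output
wait_for_crawler(DEST_CRAWLER)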
Certifications
AWS Certified Cloud Practitioner
Amazon Web Services (2023)
HashiCorp Certified Terraform Associate
HashiCorp (2025)
Achievements
Designed and maintained dozens of ETL pipelines for data acquisition and processing
Architected and maintained large-scale ETL pipelines downloading data from official providers like Cloudera, Camerdata, HitHorizons and public web repositories. Implemented processing workflows using EC2 instances, applied proprietary matching tools for data enrichment, and orchestrated uploads to our AWS-based data lake.
Migrated legacy batching pipelines to streaming solutions
We maintained large-scale legacy pipelines that processed data using batching solutions. With my team, we successfully migrated away from these legacy systems and replaced them with state-of-the-art streaming solutions, providing live updates and a more cost-effective infrastructure.
Implemented Big Data Platform Kafka sink connectors
Designed and implemented custom Kafka sink connectors for reading streaming Kafka data and storing it in a Postgres data lake. This solution enabled real-time data ingestion from multiple streaming sources, providing a scalable foundation for analytics while maintaining data consistency and reliability.
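Kafka Connect sink connectors are typically implemented in Java against the Connect API, so purely as an illustration of the data flow described here, a stripped-down Python consumer writing to Postgres might look like this; the topic, table, and connection details are placeholders, and this is not the proprietary connector.

# Illustration only: stripped-down Kafka-to-Postgres flow (not the actual sink connector; names are placeholders)
import json
import psycopg2
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "postgres-sink-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["company-events"])      # placeholder topic

conn = psycopg2.connect("dbname=datalake user=demo password=demo host=localhost")  # placeholder DSN
cur = conn.cursor()

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        # Upserting keeps the sink idempotent if the same Kafka record is re-delivered.
        cur.execute(
            "INSERT INTO company_events (id, payload) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
            (record["id"], json.dumps(record)),
        )
        conn.commit()
finally:
    consumer.close()
    conn.close()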