Vitrai Gábor
About Me
Professional Summary
I am an enthusiastic Data Engineer with a strong background in Computer Science and more than 3 years of experience. I specialize in ETL development, data pipeline design, and data warehousing using both streaming and batch processing solutions, and I have experience leveraging the AWS Cloud and Apache Kafka on data-heavy projects. I have proven reliable as an On-Call Engineer and work well both in a team and independently.
Skills
Work Experience & Education
Work Experience
Data Engineer
SpazioDati Srl., Trento, Italy
Implemented new and maintained existing ETL data processing pipelines on large-scale projects. Acquired, processed, migrated, and stored both structured and unstructured data. Relied on AWS Cloud, Apache Kafka, Kubernetes, and Docker to implement dozens of pipelines. Monitored and maintained processes as an On-Call Engineer.
Data Engineer Intern
SpazioDati Srl., Trento, Italy
Created Machine Learning models for the task of Postal Address Extraction from Free Text.
Teaching Assistant in Software Technology
Eötvös Loránd University, Budapest, Hungary
Taught Coding Principles, Java, CI, Architectures, and Project Management to Bachelor's students. Served as Scrum Master and Project Manager.
Education
Master of Science in Data Science
University of Trento, Trento (Italy) / Eötvös Loránd University, Budapest (Hungary)
Graduated with 110/110 Cum Laude
Thesis: Multilingual Address Extraction From Websites
EIT Digital Master School Double Degree Program
Bachelor of Science in Computer Science
Eötvös Loránd University, Budapest
Graduated with 5/5 Grade
Thesis: Development of a Corporate Management System Using the Spring Framework
Projects
Cloud Resume Challenge
A production-ready serverless resume website built on AWS, demonstrating cloud architecture best practices, infrastructure as code, and modern CI/CD workflows. This project showcases end-to-end cloud engineering skills from infrastructure provisioning to automated deployments.
Architecture Highlights
- Infrastructure as Code (Terraform): All AWS resources provisioned and managed declaratively, including S3 buckets, CloudFront distribution, Route53 DNS records, ACM certificates, Lambda functions, and DynamoDB tables. State stored in versioned S3 backend with DynamoDB state locking to prevent concurrent modifications.
- CloudFront CDN: Global content delivery with edge caching, Origin Access Control (OAC) for secure S3 access, custom SSL/TLS certificate, and geo-restrictions for European traffic.
- GitHub Actions CI/CD: Automated testing and deployment pipeline with branch-based workflows - tests run on feature/develop branches, full deployment to AWS on main branch merges. Includes HTML validation, JavaScript linting, S3 sync, and CloudFront cache invalidation.
- Serverless View Counter: Lambda Function URL with Python runtime, atomic DynamoDB updates, CORS protection, and sub-second response times for real-time visitor tracking.
- Route53 DNS Management: Custom domain with A/AAAA records aliased to CloudFront, automated DNS validation for SSL certificates.
- AWS Certificate Manager: Free SSL/TLS certificate with automatic renewal, deployed in us-east-1 for CloudFront compatibility.
- S3 Static Hosting: Versioned bucket with encryption at rest, private access via CloudFront OAC, optimized cache headers for performance.
- DynamoDB: Serverless NoSQL database with on-demand billing, atomic increment operations, and single-digit millisecond latency.
- Infrastructure Monitoring: EventBridge rules detect CloudFront API changes (create, update, delete operations) and send real-time email alerts via SNS, providing security monitoring and change tracking for critical infrastructure components.
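The view counter pattern above can be sketched as a small Lambda handler built around DynamoDB's atomic `ADD` update. This is a minimal illustration, not the project's actual code: the table key, attribute name, and allowed origin are placeholders.

```python
import json


def make_handler(table):
    """Wrap a DynamoDB Table resource (e.g. boto3.resource("dynamodb").Table("views"))
    in a Lambda handler that atomically increments and returns the view count."""

    def handler(event, context):
        # ADD is applied atomically server-side, so concurrent
        # invocations never lose an increment.
        resp = table.update_item(
            Key={"id": "resume"},
            UpdateExpression="ADD #v :one",
            ExpressionAttributeNames={"#v": "views"},
            ExpressionAttributeValues={":one": 1},
            ReturnValues="UPDATED_NEW",
        )
        return {
            "statusCode": 200,
            # CORS: only the resume's own origin may read the counter.
            "headers": {"Access-Control-Allow-Origin": "https://example.com"},
            "body": json.dumps({"views": int(resp["Attributes"]["views"])}),
        }

    return handler
```

Injecting the table resource keeps the handler unit-testable without AWS credentials.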
Separation of Concerns
- Terraform manages infrastructure: Long-lived resources like S3 buckets, CloudFront, DNS, certificates, Lambda, and DynamoDB.
- GitHub Actions manages content: Website files (HTML, CSS, JS, images) deployed automatically on code changes.
RAG Generative AI Application
A serverless Retrieval-Augmented Generation (RAG) application built on AWS for demo purposes, demonstrating AI/ML engineering with AWS Bedrock, Knowledge Bases, and modern cloud architecture. Features intelligent document Q&A and summarization using Claude Sonnet 4.5 and Haiku 4.5 models.
Architecture Highlights
- Infrastructure as Code (Terraform): All AWS resources provisioned and managed declaratively using modular Terraform design. Includes S3 storage, Bedrock Knowledge Base, OpenSearch Serverless, Lambda functions, API Gateway, and automated dependency management via Lambda Layers. State stored in versioned S3 backend with DynamoDB state locking.
- AWS Bedrock Knowledge Base: Managed vector database using OpenSearch Serverless for semantic document search. Integrated with Amazon Titan Embed Text v2 for embeddings, with an automatic one-time synchronization from S3.
- Claude AI Models via Inference Profiles: Leverages EU cross-region inference profiles for high availability. The QA endpoint uses Claude Sonnet 4.5 for reasoning and answering complex questions; the Summarize endpoint uses Claude Haiku 4.5 for fast document processing.
- API Gateway REST API: Secure REST API with API key authentication, usage plans with throttling (100 req/sec, 1000 burst), CORS configuration, and AWS_PROXY Lambda integration for request/response handling.
- Lambda Functions (Python 3.12): Two serverless functions with configured timeout and memory allocation. QA function performs RAG queries with context retrieval. Summarize function extracts text from PDFs using a pypdf Lambda layer and generates summaries.
- Lambda Layers for Dependencies: Clean separation of code and dependencies using automated Lambda Layer builds. Terraform triggers pip install from requirements.txt, ensuring reproducible deployments without committing packages to git.
- IAM Security: Fine-grained permissions following least privilege principle. Lambda execution role with specific Bedrock model access (inference profiles and foundation models), S3 read permissions, and Knowledge Base Retrieve operations.
- OpenSearch Serverless: Vector database with automatic scaling, encryption at rest and in transit, and network isolation via VPC endpoints.
- S3 Document Storage: Versioned S3 bucket with server-side encryption for document storage. The Knowledge Base data source sync is triggered automatically by Terraform for demo purposes.
- Modular Terraform Architecture: Infrastructure split into reusable modules (storage, knowledge-base, lambda-functions, api-gateway) promoting separation of concerns, testing, and maintainability.
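The QA function described above can be sketched roughly as follows, using Bedrock's `RetrieveAndGenerate` API, which performs vector retrieval and the model call in one step. This is a hedged sketch, not the project's code: the knowledge base ID, model ARN, and event shape are placeholders.

```python
import json


def make_qa_handler(bedrock, kb_id, model_arn):
    """Build a QA Lambda handler around a bedrock-agent-runtime client
    (e.g. boto3.client("bedrock-agent-runtime"))."""

    def handler(event, context):
        question = json.loads(event["body"])["question"]
        # RetrieveAndGenerate fetches relevant chunks from the Knowledge Base
        # and passes them as context to the configured Claude model.
        resp = bedrock.retrieve_and_generate(
            input={"text": question},
            retrieveAndGenerateConfiguration={
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": kb_id,
                    "modelArn": model_arn,
                },
            },
        )
        return {
            "statusCode": 200,
            "body": json.dumps({"answer": resp["output"]["text"]}),
        }

    return handler
```

As with the view counter, injecting the client makes the retrieval logic testable without a live Bedrock endpoint.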
Notes:
The purpose of the demo is not to be production-ready but to showcase my understanding of such a system; the goal was to have the entire system managed by Terraform. Possible future improvements: add CI/CD, add a debounced auto-sync for the Knowledge Base, introduce proper dev/production environments, and implement a user-facing UI. The application is not kept running due to the high cost of the AWS OpenSearch Serverless service. If you want to try it out, feel free to clone the repository, set up your AWS CLI profile, rename the project, and run terraform apply!
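If you do spin the stack up, a call against the QA endpoint could be built like this; the URL, route, API key, and request body shape are illustrative placeholders for whatever your own deployment outputs.

```python
import json
import urllib.request


def build_qa_request(endpoint_url, api_key, question):
    """Prepare a POST to the QA route; API Gateway validates the x-api-key
    header against the usage plan before invoking the Lambda."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Example (against a deployed stack):
# req = build_qa_request("https://abc123.execute-api.eu-west-1.amazonaws.com/prod/qa",
#                        "your-api-key", "What does the document say about retries?")
# answer = json.loads(urllib.request.urlopen(req).read())["answer"]
```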
Certifications
AWS Certified Cloud Practitioner
Amazon Web Services (2023)
HashiCorp Certified Terraform Associate
HashiCorp (2025)
Achievements
Migrated legacy batch pipelines to streaming solutions
We maintained large-scale legacy pipelines that processed data in batches. With my team, I successfully migrated away from these legacy systems, replacing them with state-of-the-art streaming solutions that provide live updates and a more cost-effective infrastructure.