AWS Glue Complete Tutorial: Serverless ETL Made Easy
Master AWS Glue from the basics to advanced patterns. Learn to build serverless ETL pipelines and use Glue Studio, crawlers, and the Data Catalog for data integration at any scale.
Introduction
AWS Glue has revolutionized how organizations handle data integration and ETL (Extract, Transform, Load) processes. As a fully managed, serverless data integration service, Glue eliminates the complexity of managing infrastructure while providing powerful capabilities for data discovery, cataloging, and transformation.
This comprehensive tutorial walks you through everything from basic concepts to advanced implementation patterns. Whether you’re new to ETL or migrating from traditional systems, it gives you the knowledge and practical examples you need.
Video Reference: For visual learners, this tutorial pairs well with the video it is based on: AWS Glue Complete Tutorial
What is AWS Glue?
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It provides all the capabilities needed for data integration without managing any infrastructure.
Key Benefits
- Serverless: No infrastructure to manage or provision
- Auto-scaling: Scales automatically based on workload
- Cost-effective: Pay only for what you use
- Integrated: Works seamlessly with other AWS services
- Visual Interface: Glue Studio provides drag-and-drop ETL design
AWS Glue Architecture Components
1. Data Catalog
The central metadata repository that stores information about your data sources, transformations, and targets.
```python
# Example: Programmatically accessing Data Catalog
import boto3

glue_client = boto3.client('glue', region_name='us-east-1')

# Get database information
databases = glue_client.get_databases()

print("Available databases:")
for db in databases['DatabaseList']:
    print(f"- {db['Name']}: {db.get('Description', 'No description')}")
```
2. Crawlers
Automated tools that scan your data sources and populate the Data Catalog with metadata.
3. ETL Jobs
The actual data processing jobs that extract, transform, and load data.
4. Triggers
Schedule or event-based mechanisms to start ETL jobs and crawlers.
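For example, a scheduled trigger can be created with boto3; the trigger and job names below are placeholder assumptions, not values used elsewhere in your account:

```python
import boto3

glue_client = boto3.client('glue')

# Scheduled trigger that starts an ETL job every night at 3 AM UTC
glue_client.create_trigger(
    Name='nightly-etl-trigger',                 # assumed trigger name
    Type='SCHEDULED',
    Schedule='cron(0 3 * * ? *)',
    Actions=[{'JobName': 'customer-etl-job'}],  # assumed job name
    StartOnCreation=True
)
```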
5. Connections
Secure connection information for accessing data sources.
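As a sketch, a JDBC connection can be registered with boto3; the endpoint, database, and credentials below are placeholders only:

```python
import boto3

glue_client = boto3.client('glue')

# Register a JDBC connection to a PostgreSQL database (placeholder values)
glue_client.create_connection(
    ConnectionInput={
        'Name': 'postgres-connection',
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:postgresql://my-db.example.com:5432/analytics',
            'USERNAME': 'glue_user',
            'PASSWORD': 'replace-with-secret'  # prefer Secrets Manager in practice
        }
    }
)
```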
Getting Started with AWS Glue
Prerequisites
Before you start, ensure you have:
- An AWS account
- Appropriate IAM permissions
- Data sources (S3, RDS, etc.) with sample data
Setting Up IAM Permissions
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "glue:*", "s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket", "iam:GetRole", "iam:PassRole", "ec2:DescribeVpcEndpoints", "ec2:DescribeRouteTables", "ec2:CreateNetworkInterface", "ec2:DeleteNetworkInterface", "ec2:DescribeNetworkInterfaces", "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets", "ec2:DescribeVpcAttribute", "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents", "logs:DescribeLogStreams" ], "Resource": "*" } ] } Working with Glue Studio
Glue Studio provides a visual interface for creating ETL jobs without writing code.
Creating Your First Visual ETL Job
1. Access Glue Studio
   - Go to AWS Console → Glue → Glue Studio
   - Click “Create job” → “Visual ETL”
2. Add Data Sources
   - Drag the “S3” source node onto the canvas
   - Configure the S3 bucket and file format
   - For CSV files: specify delimiter and header options
3. Add Transformations (a minimal script sketch of these transforms follows this list)
   - Use built-in transforms like:
     - ApplyMapping: Change column names/types
     - Filter: Remove unwanted rows
     - Join: Combine datasets
     - Aggregate: Group and summarize data
4. Configure Target
   - Add an S3 target node
   - Specify the output format (Parquet, ORC, etc.)
   - Configure the partitioning strategy
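Under the hood, these visual nodes map to Glue's built-in transform classes in the generated script. Below is a minimal sketch of ApplyMapping and Filter, assuming a hypothetical catalog database and table; the column mappings are illustrative only:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping, Filter

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read the source node (hypothetical catalog database and table)
source_frame = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders"
)

# ApplyMapping: rename and cast columns
mapped = ApplyMapping.apply(
    frame=source_frame,
    mappings=[
        ("id", "string", "customer_id", "string"),
        ("amt", "string", "order_total", "double"),
    ]
)

# Filter: keep only rows with a positive order total
filtered = Filter.apply(
    frame=mapped,
    f=lambda row: row["order_total"] is not None and row["order_total"] > 0
)
```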
Example: Simple CSV to Parquet Conversion
```yaml
# Job Configuration
JobName: csv-to-parquet-conversion
Role: AWSGlueServiceRole
GlueVersion: 3.0
WorkerType: G.1X
NumberOfWorkers: 2

# Source Configuration
Source:
  Type: S3
  Path: s3://my-bucket/input-data/
  Format: CSV
  Options:
    header: true
    delimiter: ","

# Target Configuration
Target:
  Type: S3
  Path: s3://my-bucket/output-data/
  Format: Parquet
  PartitionKeys: ["year", "month"]
```
Glue Crawlers: Automated Metadata Discovery
Crawlers automatically discover data structure and populate the Data Catalog.
Creating a Crawler
```python
import boto3

glue_client = boto3.client('glue')

# Create a crawler for S3 data
crawler_config = {
    'Name': 'customer-data-crawler',
    'Role': 'AWSGlueServiceRole',
    'DatabaseName': 'customer_analytics',
    'Description': 'Crawler for customer CSV files',
    'Targets': {
        'S3Targets': [
            {
                'Path': 's3://my-data-lake/raw/customer-data/',
                'Exclusions': ['*.tmp', '*.log']
            }
        ]
    },
    # create_crawler expects the schedule as a cron string
    'Schedule': 'cron(0 2 * * ? *)',  # Daily at 2 AM UTC
    'SchemaChangePolicy': {
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'DEPRECATE_IN_DATABASE'
    },
    'RecrawlPolicy': {
        'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'
    }
}

glue_client.create_crawler(**crawler_config)
```
Running and Monitoring Crawlers
```python
# Start crawler
glue_client.start_crawler(Name='customer-data-crawler')

# Check crawler status
response = glue_client.get_crawler(Name='customer-data-crawler')
print(f"Crawler state: {response['Crawler']['State']}")

# Get crawler metrics
metrics = glue_client.get_crawler_metrics()
for metric in metrics['CrawlerMetricsList']:
    print(f"Crawler: {metric['CrawlerName']}")
    print(f"  Tables created: {metric['TablesCreated']}")
    print(f"  Tables updated: {metric['TablesUpdated']}")
    print(f"  Last runtime (seconds): {metric['LastRuntimeSeconds']}")
```
Writing Glue ETL Jobs with Python
While Glue Studio provides visual job creation, you can also write jobs programmatically using Python and PySpark.
Basic Glue Job Structure
```python
import sys
from datetime import datetime

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Initialize contexts
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

print("Starting ETL job...")

# Read from catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="customer_analytics",
    table_name="raw_customers",
    transformation_ctx="datasource0"
)
print(f"Read {datasource0.count()} records from source")

# Apply transformations
# 1. Clean data: deduplicate via a Spark DataFrame round trip
#    (DynamicFrame has no dropDuplicates of its own)
cleaned = DynamicFrame.fromDF(
    datasource0.toDF().dropDuplicates(), glueContext, "cleaned"
)

# 2. Filter invalid records
filtered = cleaned.filter(
    lambda row: row["email"] is not None and "@" in row["email"]
)

# 3. Add derived columns
def add_derived_columns(row):
    spend = row["total_spend"] or 0
    row["customer_segment"] = (
        "High" if spend > 1000 else "Medium" if spend > 100 else "Low"
    )
    row["processed_date"] = datetime.now().isoformat()
    # calculate_quality_score is defined in the next section
    row["data_quality_score"] = calculate_quality_score(row)
    return row

transformed = filtered.map(add_derived_columns)

# Write to S3
output_path = "s3://my-data-lake/processed/customers/"
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={
        "path": output_path,
        "partitionKeys": ["customer_segment", "processed_date"]
    },
    format="parquet",
    transformation_ctx="output"
)

print("ETL job completed successfully")
job.commit()
```
Advanced Transformations
```python
from pyspark.sql import functions as F
from awsglue.dynamicframe import DynamicFrame


def calculate_quality_score(row):
    """Calculate a data quality score (0-100) for each record."""
    score = 0

    # Email validation
    if row.get("email") and "@" in row["email"]:
        score += 25

    # Phone validation
    if row.get("phone") and len(str(row["phone"])) >= 10:
        score += 25

    # Address completeness
    address_fields = ["street", "city", "state", "zipcode"]
    filled_fields = sum(1 for field in address_fields if row.get(field))
    score += (filled_fields / len(address_fields)) * 25

    # Purchase history
    if row.get("total_orders", 0) > 0:
        score += 25

    return score


# Complex join example
def enrich_customer_data(glueContext):
    """Enrich customer data with order metrics from multiple sources."""

    # Read customer data
    customers = glueContext.create_dynamic_frame.from_catalog(
        database="customer_analytics",
        table_name="customers",
        transformation_ctx="customers"
    ).toDF()

    # Read product data (not joined below, but available for further enrichment)
    products = glueContext.create_dynamic_frame.from_catalog(
        database="product_catalog",
        table_name="products",
        transformation_ctx="products"
    ).toDF()

    # Read order data
    orders = glueContext.create_dynamic_frame.from_catalog(
        database="sales_data",
        table_name="orders",
        transformation_ctx="orders"
    ).toDF()

    # Aggregate order metrics per customer
    order_metrics = orders.groupBy("customer_id").agg(
        F.sum("order_total").alias("lifetime_order_total"),
        F.count("*").alias("order_count"),
        F.max("order_date").alias("last_order_date")  # assumes an order_date column
    )

    # Left-join the metrics back onto the customer records
    enriched = customers.join(order_metrics, on="customer_id", how="left")

    # Convert back to a DynamicFrame for downstream Glue transforms
    return DynamicFrame.fromDF(enriched, glueContext, "enriched_customers")
```
Working with Different Data Sources
Amazon S3
```python
# Reading from S3
s3_source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/input-data/"],
        "recurse": True
    },
    format="csv",
    format_options={
        "withHeader": True,
        "separator": ","
    }
)
```
Relational Databases (RDS, Redshift)
```python
# Reading from RDS (via a catalog table backed by a JDBC connection)
rds_source = glueContext.create_dynamic_frame.from_catalog(
    database="production_db",
    table_name="users",
    additional_options={
        "hashexpression": "id",   # Splits the read across parallel JDBC queries
        "hashpartitions": "10"
    }
)

# Writing to Redshift (a temporary S3 directory is required for the COPY)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=transformed_data,
    catalog_connection="redshift-connection",
    connection_options={
        "dbtable": "processed_users",
        "database": "analytics"
    },
    redshift_tmp_dir="s3://my-bucket/temp/"  # staging location for Redshift loads
)
```
Streaming Data (Kinesis)
```python
# Reading from a Kinesis stream (Glue streaming job)
kinesis_source = glueContext.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/user-events",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true"
    }
)

# Process streaming data: running counts per user and event type
processed_stream = kinesis_source \
    .groupBy("user_id", "event_type") \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()
```
Glue Interactive Sessions
Interactive sessions allow you to develop and test Glue jobs interactively using notebooks.
Setting Up Interactive Sessions
```python
# In a Glue Studio notebook (or Jupyter with the Glue interactive sessions kernel),
# configure the session with magics, then create the contexts as usual
%idle_timeout 60
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Now you can work with Spark DataFrames interactively
df = spark.read.csv("s3://my-bucket/data/", header=True, inferSchema=True)
df.show()
df.printSchema()

# Test transformations
cleaned_df = df.dropna()
filtered_df = cleaned_df.filter(cleaned_df.amount > 0)
result_df = filtered_df.groupBy("category").sum("amount")
result_df.show()
```
Monitoring and Troubleshooting
CloudWatch Integration
```python
# Monitor job metrics
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client('cloudwatch')

# Get Glue job metrics
metrics = cloudwatch.get_metric_statistics(
    Namespace='Glue',
    MetricName='glue.driver.aggregate.elapsedTime',
    Dimensions=[
        {'Name': 'JobName', 'Value': 'my-etl-job'}
        # Glue also publishes JobRunId and Type dimensions;
        # add them if the query returns no datapoints
    ],
    StartTime=datetime.now() - timedelta(hours=24),
    EndTime=datetime.now(),
    Period=3600,
    Statistics=['Average', 'Maximum']
)
```
Job Bookmarks for Incremental Processing
```python
# Enable job bookmarks when creating or starting the job
job_arguments = {
    '--job-bookmark-option': 'job-bookmark-enable',
    '--enable-metrics': ''
}

# In your job script
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read with bookmark keys (bookmarks are tracked per transformation_ctx)
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    transformation_ctx="datasource",
    additional_options={
        "jobBookmarkKeys": ["last_updated"],
        "jobBookmarkKeysSortOrder": "asc"
    }
)
```
Performance Optimization
Choosing the Right Worker Type
| Worker Type | DPU | vCPU | Memory | Use Case |
|---|---|---|---|---|
| G.1X | 1 | 4 | 16 GB | Light transformations, small datasets |
| G.2X | 2 | 8 | 32 GB | Medium workloads, standard ETL |
| G.4X | 4 | 16 | 64 GB | Large datasets, complex transformations |
| G.8X | 8 | 32 | 128 GB | Very large datasets, memory-intensive jobs |
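The worker type and count are fixed when the job is created. As a reference, here is a minimal boto3 sketch of creating a Spark ETL job with an explicit worker type; the job name, script path, and role are placeholder assumptions:

```python
import boto3

glue_client = boto3.client('glue')

# Create a Spark ETL job pinned to a specific worker type
response = glue_client.create_job(
    Name='customer-etl-job',          # assumed job name
    Role='AWSGlueServiceRole',        # role used elsewhere in this tutorial
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/customer_etl.py',  # assumed path
        'PythonVersion': '3'
    },
    GlueVersion='3.0',
    WorkerType='G.2X',
    NumberOfWorkers=5,
    DefaultArguments={
        '--job-bookmark-option': 'job-bookmark-enable',
        '--enable-metrics': ''
    }
)
print(f"Created job: {response['Name']}")
```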
Partitioning Strategies
```python
# Write with partitioning
glueContext.write_dynamic_frame.from_options(
    frame=data,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/processed-data/",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet"
)

# Read with partition pruning (push_down_predicate works on catalog tables)
partitioned_data = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="processed_data",
    push_down_predicate="year='2024' AND month='10'"
)
```
Cost Optimization
Auto Scaling
```python
# Enable auto scaling (Glue 3.0+): set '--enable-auto-scaling' and treat
# NumberOfWorkers as the maximum; Glue scales idle workers down automatically
job_config = {
    'GlueVersion': '3.0',
    'WorkerType': 'G.2X',
    'NumberOfWorkers': 10,   # Maximum workers when auto scaling is enabled
    'Timeout': 2880,         # Minutes (48 hours)
    'DefaultArguments': {
        '--enable-auto-scaling': 'true'
    }
}
```
Right-Sizing Resources
```python
def optimize_job_config(data_size_gb, complexity):
    """Recommend optimal job configuration"""
    if data_size_gb < 1:
        return {'WorkerType': 'G.1X', 'NumberOfWorkers': 2}
    elif data_size_gb < 10:
        return {'WorkerType': 'G.2X', 'NumberOfWorkers': 3}
    elif data_size_gb < 100:
        return {'WorkerType': 'G.4X', 'NumberOfWorkers': 5}
    else:
        return {'WorkerType': 'G.8X', 'NumberOfWorkers': 10}
```
Real-World Use Cases
1. Data Lake Formation
```python
from pyspark.sql import functions as F
from awsglue.dynamicframe import DynamicFrame


def build_data_lake_layer(glueContext, source_table, target_layer):
    """Build a data lake layer (processed or curated) from the raw layer."""

    # Read from raw layer
    raw_data = glueContext.create_dynamic_frame.from_catalog(
        database="raw_data",
        table_name=source_table
    )

    if target_layer == "processed":
        # Clean and standardize: deduplicate, then resolve ambiguous column types
        # (replace 'column_name' with an actual column from your schema)
        output_frame = DynamicFrame.fromDF(
            raw_data.toDF().dropDuplicates(), glueContext, "processed"
        ).resolveChoice(specs=[('column_name', 'cast:string')])
    elif target_layer == "curated":
        # Join with reference data and aggregate revenue per country
        raw_df = raw_data.toDF()
        reference_df = glueContext.create_dynamic_frame.from_catalog(
            database="reference",
            table_name="country_codes"
        ).toDF()
        curated_df = (
            raw_df.join(reference_df, raw_df["country_code"] == reference_df["code"], "left")
                  .groupBy("country_name")
                  .agg(F.sum("revenue").alias("total_revenue"))
        )
        output_frame = DynamicFrame.fromDF(curated_df, glueContext, "curated")
    else:
        raise ValueError(f"Unknown layer: {target_layer}")

    # Write to the appropriate layer
    layer_path = f"s3://data-lake/{target_layer}/{source_table}/"
    glueContext.write_dynamic_frame.from_options(
        frame=output_frame,
        connection_type="s3",
        connection_options={"path": layer_path},
        format="parquet"
    )
```
2. CDC (Change Data Capture)
```python
def process_cdc_data(glueContext, source_table):
    """Process change data capture records landed in S3."""

    # Read CDC data (assuming it is delivered to S3 as JSON)
    cdc_data = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            "paths": [f"s3://cdc-bucket/{source_table}/"],
            "recurse": True
        },
        format="json"
    )

    # Separate by operation type (DynamicFrame.filter takes a predicate function)
    inserts = cdc_data.filter(lambda row: row["operation"] == "INSERT")
    updates = cdc_data.filter(lambda row: row["operation"] == "UPDATE")
    deletes = cdc_data.filter(lambda row: row["operation"] == "DELETE")

    # Apply changes to the target table here
    # This is a simplified example - real CDC processing is more complex
    return {
        'inserts': inserts.count(),
        'updates': updates.count(),
        'deletes': deletes.count()
    }
```
Security Best Practices
Data Encryption
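In Glue, encryption at rest for job output, logs, and bookmarks is configured through a security configuration that you attach to a job or crawler. A minimal boto3 sketch, assuming a hypothetical configuration name:

```python
import boto3

glue_client = boto3.client('glue')

# Create a security configuration that encrypts job output with SSE-S3
glue_client.create_security_configuration(
    Name='etl-encryption-config',  # hypothetical name
    EncryptionConfiguration={
        'S3Encryption': [{'S3EncryptionMode': 'SSE-S3'}],
        'CloudWatchEncryption': {'CloudWatchEncryptionMode': 'DISABLED'},
        'JobBookmarksEncryption': {'JobBookmarksEncryptionMode': 'DISABLED'}
    }
)
```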
```python
# Spark UI monitoring is enabled through job arguments; encryption of data
# at rest is handled by the security configuration shown above, attached to the job
job_config = {
    '--enable-spark-ui': 'true',
    '--spark-event-logs-path': 's3://my-bucket/logs/'
}
```
VPC Configuration
```python
import boto3

glue_client = boto3.client('glue')

# Run jobs in a VPC for enhanced security by attaching a NETWORK-type connection
# (subnet, security group, and AZ values below are placeholders)
connection_config = {
    'Name': 'vpc-connection',
    'ConnectionType': 'NETWORK',
    'ConnectionProperties': {},  # not required for NETWORK connections
    'PhysicalConnectionRequirements': {
        'SubnetId': 'subnet-0123456789abcdef0',
        'SecurityGroupIdList': ['sg-0123456789abcdef0'],
        'AvailabilityZone': 'us-east-1a'
    }
}

glue_client.create_connection(ConnectionInput=connection_config)
```
Conclusion
AWS Glue represents a paradigm shift in ETL processing, offering serverless, scalable, and cost-effective data integration capabilities. From visual job creation in Glue Studio to complex PySpark transformations, Glue provides the tools needed for modern data engineering.
Key Takeaways
- Start Visual: Use Glue Studio for initial job creation and learning
- Scale Gradually: Begin with small datasets and scale up as needed
- Monitor Always: Implement comprehensive monitoring and alerting
- Optimize Costs: Use auto-scaling and right-size your resources
- Secure by Design: Implement encryption and VPC configurations
Next Steps
- Explore Glue Interactive Sessions for development
- Learn about Glue Workflows for complex orchestrations
- Consider Glue DataBrew for visual data preparation
- Integrate with Lake Formation for fine-grained access control
This tutorial has covered the fundamentals and advanced concepts of AWS Glue. For hands-on practice, start with small datasets and gradually work up to production workloads. The visual learning from the referenced video combined with these practical examples will give you a solid foundation in AWS Glue ETL processing.
Remember: ETL is both an art and a science. Start simple, measure performance, and iterate based on your specific use cases and data patterns.
Happy data engineering with AWS Glue! 🔧📊