We first need to create data sources containing the application logs and the cost and usage reports. In this lab we provide sample files; it is recommended you use these files initially, then use your own files once you are familiar with the requirements and process.
We place both logs into S3, crawl them with AWS Glue, and then use Athena to confirm a usable database has been created.
We will create a bucket and folders in S3, then copy the sample application log files, and cost and usage reports into the folders.
NOTE: Please read the steps carefully, as the naming is critical; any mistakes will require you to rebuild the lab or make significant and repetitive changes.
Log into the AWS console via SSO:
Make sure you run everything in a single region
Go to the S3 console and create a new S3 bucket. It can have any name, but start it with costefficiencylab so it is easy to identify.
Create a folder in the new bucket named applogfiles_workshop.
NOTE: You MUST name the folder applogfiles_workshop
Upload the application log file to the folder: Step1_access_log.gz
Create a folder named costusagefiles_workshop, inside the same bucket.
NOTE: You MUST name the folder costusagefiles_workshop; this will make pasting the code faster.
Copy the sample file to your bucket into the costusagefiles_workshop folder:
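If you prefer to script these S3 steps, the sketch below does the same thing with boto3. The bucket name, region, and the local file name for the cost and usage sample are placeholders; substitute your own values.

```python
# Minimal boto3 sketch of the S3 setup done in the console above.
# Assumptions: the bucket name suffix, the region, and the local
# cost & usage sample file name are placeholders.
import boto3

region = "us-east-1"                      # assumption: the single region you use for the lab
bucket = "costefficiencylab-example-123"  # assumption: any unique name starting with costefficiencylab

s3 = boto3.client("s3", region_name=region)

# Create the bucket (us-east-1 does not accept a LocationConstraint)
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket)
else:
    s3.create_bucket(Bucket=bucket,
                     CreateBucketConfiguration={"LocationConstraint": region})

# "Folders" in S3 are just key prefixes; the uploads below create them implicitly
s3.upload_file("Step1_access_log.gz", bucket,
               "applogfiles_workshop/Step1_access_log.gz")
s3.upload_file("cost_usage_sample.csv.gz", bucket,          # hypothetical local file name
               "costusagefiles_workshop/cost_usage_sample.csv.gz")
```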
We will create a database from the uploaded application logs using AWS Glue. For the application log files, we show you how to write a custom classifier, so you can handle any log file format from any application.
For our sample application logs, we have supplied Apache web server log files. The built-in AWS Glue classifier COMBINEDAPACHELOG will recognize these files; however, it will read the timestamp as a single string. We will customize the classifier to break this up into a date column, a time column, and a timezone column. This will demonstrate how to write a custom classifier. The reference for classifiers is here: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
A sample log file line is:
10.0.1.80 - - [26/Nov/2019:00:00:07 +0000] "GET /health.html HTTP/1.1" 200 55 "-" "ELB-HealthChecker/2.0"
The original columns are:
- Client IP
- Ident
- Auth
- HTTP Timestamp*
- Request
- Response
- Bytes
- Referrer
- Agent
Using the custom classifier, we will make it build the following columns instead:
- Client IP
- Ident
- Auth
- Date*
- Time*
- Timezone*
- Request
- Response
- Bytes
- Referrer
- Agent
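To see what this split looks like before building anything in Glue, here is a small Python sketch that applies an equivalent regular expression (not Glue's Grok engine) to the sample log line above and prints the resulting columns:

```python
import re

line = ('10.0.1.80 - - [26/Nov/2019:00:00:07 +0000] '
        '"GET /health.html HTTP/1.1" 200 55 "-" "ELB-HealthChecker/2.0"')

# Rough equivalent of the custom Grok pattern used in this lab: the timestamp
# is split into separate logdate, logtime and tz groups instead of one string.
pattern = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<logdate>\d{2}/\w{3}/\d{4}):(?P<logtime>\d{2}:\d{2}:\d{2}) (?P<tz>[+-]\d{4})\] '
    r'"(?P<request>[^"]*)" (?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

match = pattern.match(line)
for column, value in match.groupdict().items():
    print(f"{column:10} {value}")
```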
Go to the Glue console and click Classifiers:
Click Add classifier and create it with the following details:
Classifier name: WebLogs
Classifier type: Grok
Classification: Logs
Grok pattern:
%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{DATE:logdate}\:%{TIME:logtime} %{INT:tz}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent}
Custom patterns:
DATE %{MONTHDAY}/%{MONTH}/%{YEAR}
Click Create
A classifier tells Glue how to interpret the log file lines, and how to create columns. Each column is contained within %{}, and has the pattern, the separator ':', and the column name.
By using the custom classifier, we have separated the column timestamp into 3 columns of logdate, logtime and tz. You can compare the custom classifier we wrote with the COMBINEDAPACHELOG classifier:
Custom - %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{DATE:logdate}\:%{TIME:logtime} %{INT:tz}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent}
Builtin - %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent}
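If you would rather script the classifier instead of using the console, a boto3 sketch is below; the Grok pattern and custom pattern are the same ones shown above, and the region is an assumption.

```python
# Sketch: create the WebLogs Grok classifier programmatically instead of in the console.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumption: the region used for the lab

grok_pattern = (
    '%{IPORHOST:clientip} %{USER:ident} %{USER:auth} '
    '\\[%{DATE:logdate}\\:%{TIME:logtime} %{INT:tz}\\] '
    '"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" '
    '%{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent}'
)

glue.create_classifier(
    GrokClassifier={
        "Name": "WebLogs",
        "Classification": "Logs",
        "GrokPattern": grok_pattern,
        "CustomPatterns": "DATE %{MONTHDAY}/%{MONTH}/%{YEAR}",
    }
)
```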
Next we will create a crawler to read the log files, and build a database. Click on Crawlers and click Add crawler:
Configure the crawler:
Crawler source type is Data stores, click Next:
Click the folder icon and expand the bucket you created above, then select the radio button next to the applogfiles_workshop folder. Do NOT select the actual file or the bucket; select the folder. Click Select.
Click Next
Select No when asked to add another data store, click Next
Create an IAM role named AWSGlueServiceRole-CostWebLogs and click Next:
Set the Frequency to Run on demand, then click Next
Click Add database, you MUST name it webserverlogs, click Create.
Click Next:
Click Finish
Select the crawler and click Run crawler; this will create a single database and table from our log files. We need to wait until the crawler has finished, which will take 1-2 minutes. Click refresh to check whether it has finished.
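As an alternative to the console steps above, the crawler can also be created and run with boto3. In the sketch below the crawler name, bucket name, and IAM role ARN are placeholders; adjust them to match your own setup.

```python
# Sketch: create and run the web log crawler with boto3 instead of the console.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="WebLogs",                              # assumption: any crawler name will do
    Role="arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole-CostWebLogs",  # placeholder ARN
    DatabaseName="webserverlogs",
    Classifiers=["WebLogs"],                     # attach the custom classifier created earlier
    Targets={"S3Targets": [{"Path": "s3://costefficiencylab-example-123/applogfiles_workshop/"}]},
)

glue.start_crawler(Name="WebLogs")

# Poll until the crawler returns to the READY state (typically 1-2 minutes)
while glue.get_crawler(Name="WebLogs")["Crawler"]["State"] != "READY":
    time.sleep(30)
print("Crawl finished")
```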
We will confirm the database has been built correctly. Click Databases on the left, and click on the database webserverlogs, you may need to click refresh:
Click Tables in webserverlogs, and click the table applogfiles_workshop
You can see the table has been created, with its Name, its Location, and a recordCount showing a large number of records (the number may differ from the image below):
Scroll down and you can see the columns, and that they are all typed as string. This is a small hurdle for numeric columns such as bytes if you want to perform mathematical functions on them; we will work around this with Athena in our example.
Go to the Athena service console, and select the webserverlogs database
Click the three dots next to the table applogfiles_workshop, and click Preview table:
View the results, which will show 10 lines of your log. Note the separate logdate, logtime, and tz columns that we created; the default classifier would have produced a single text column for the timestamp.
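Preview table simply runs a SELECT ... LIMIT 10 query. If you want to script the check, the sketch below runs a similar query through the Athena API and also shows casting the string-typed bytes column for numeric work; the query results location is a placeholder.

```python
# Sketch: query the new table via the Athena API instead of the console.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT logdate, logtime, tz,
       TRY_CAST(bytes AS bigint) AS bytes_int   -- columns are all strings, so cast before doing math
FROM applogfiles_workshop
LIMIT 10
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "webserverlogs"},
    ResultConfiguration={"OutputLocation": "s3://costefficiencylab-example-123/athena-results/"},  # placeholder
)

query_id = execution["QueryExecutionId"]
# Wait for the query to leave the QUEUED/RUNNING states (check it SUCCEEDED before relying on results)
while athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```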
To measure efficiency we need to know the cost of the workload, so we will use the Cost and Usage Report. We will follow the same process as above to create a database from the cost and usage files.
To use the files from this lab, follow the steps below:
Go into the Glue console, click Crawlers, and click Add crawler
Use the crawler name CostUsage and click Next
Select Data stores as the crawler source type, click Next
Click the folder icon, expand the S3 bucket created above (starting with costefficiencylab), and select the costusagefiles_workshop folder. Make sure you don't select the bucket or the file.
Click Select, then click Next
Select No when asked to add another data store, click Next
Create an IAM role named AWSGlueServiceRole-Costusage, click Next
Set the frequency to run on demand, click Next
Click Add database, it MUST be named CostUsage, and click Create
Click Next
Review and click Finish
Run the crawler CostUsage, then use Athena to check that the database costusage was created and has records in the table costusagefiles_workshop, following the same steps as for the application logs database above.
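The same SDK approach used for the web log crawler also works here; a condensed sketch with placeholder bucket and role values:

```python
# Sketch: create and run the CostUsage crawler with the same pattern as before.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="CostUsage",
    Role="arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole-Costusage",  # placeholder ARN
    DatabaseName="costusage",   # the lower-cased name you will see in Athena
    Targets={"S3Targets": [{"Path": "s3://costefficiencylab-example-123/costusagefiles_workshop/"}]},
)
glue.start_crawler(Name="CostUsage")
```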
You now have both the application and cost data sources set up, ready to create an efficiency metric.
Now that you have completed this lab, make sure to update your Well-Architected review if you have implemented these changes in your workload.
Click here to access the Well-Architected Tool