
Serverless querying and evaluation of logs using Athena

May 23rd 2022 · 4 min read

Logging and monitoring are important components of IT in any company. Without logging it's harder to troubleshoot technical issues and nearly impossible to gain operational insights and intelligence.

When using AWS it is best practice to store the logs, regardless of whether they are application or governance logs (such as config, firewall or management logs), centrally and securely in an account used only for this purpose.

But how do you cost-effectively search for and evaluate problems in all those logs? That's where Amazon Athena comes in! Athena is an interactive query service that facilitates the analysis of data in Amazon S3 using standard SQL. Athena does not require a server, so there is no infrastructure to manage and you only pay for the queries that are executed.

While implementing Athena with centralised logs in AWS Organizations you may stumble across some challenges:

  • Querying multiple accounts requires partitions to be created per account and day
  • When partitioning by date or time, new partitions must be added continuously - i.e. daily per account and region
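Without automation, that maintenance means registering every partition yourself. As a rough sketch (table name, bucket, account ID, region and date are all hypothetical placeholders), a scheduled job would have to issue statements like this for every account, region and day:

```sql
-- Hypothetical example: manually registering a single partition per
-- account, region and day - the maintenance that partition projection
-- makes unnecessary.
ALTER TABLE vpc_flow_logs ADD IF NOT EXISTS
  PARTITION (`account` = '111122223333',
             `region` = 'eu-central-1',
             `timestamp` = '2022/05/01')
  LOCATION 's3://S3BucketName/AWSLogs/111122223333/vpcflowlogs/eu-central-1/2022/05/01/';
```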

To speed up query processing of highly partitioned tables and fully automate partition management you can use partition projection in Athena. With partition projection, partition values and locations are calculated from the configuration rather than read from a repository such as the AWS Glue Data Catalog. Because in-memory operations are often faster than remote operations, partition projection can reduce query runtime for highly partitioned tables. Depending on the specific characteristics of the query and the underlying data, partition projection can significantly reduce query runtime for queries that are limited to retrieving partition metadata.

Blog Content

In the following, I will show you how to create Athena tables for some of the most important AWS services.

VPC FlowLogs

VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from network interfaces in your VPC. You can use these logs to examine network traffic patterns and identify threats and risks on your VPC network.

🚨 VPC FlowLogs use a non-Hive-compatible S3 prefix scheme, so you cannot add partitions using the MSCK REPAIR TABLE command.

💡 If you want to know how to query the VPC FlowLogs take a look at this documentation.

Here you can see how to create an Athena table for your central VPC FlowLogs:

CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs_pp (
  version int,
  account_id string,
  interfaceid string,
  sourceaddress string,
  destinationaddress string,
  sourceport int,
  destinationport int,
  protocol int,
  numpackets int,
  numbytes bigint,
  starttime int,
  endtime int,
  action string,
  logstatus string
)
PARTITIONED BY (
   `account` string,
   `region` string,
   `timestamp` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION
  's3://S3BucketName/AWSLogs/'
TBLPROPERTIES (
  'projection.account.values'='AccountIdsCommaSeparated',
  'projection.account.type'='enum',
  'projection.enabled'='true',
  'projection.timestamp.format'='yyyy/MM/dd',
  'projection.timestamp.interval'='1',
  'projection.timestamp.interval.unit'='DAYS',
  'projection.timestamp.range'='2020/01/01,NOW',
  'projection.timestamp.type'='date',
  'projection.region.type'='enum',
  'projection.region.values'='RegionsCommaSeparated',
  'storage.location.template'='s3://S3BucketName/AWSLogs/${account}/vpcflowlogs/${region}/${timestamp}',
  'skip.header.line.count'='1');
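Once the table exists, you can query it with standard SQL. As a hedged sketch (the account ID and date are hypothetical placeholders - replace them with values from your environment), the following lists the top rejected connections for a single account and day, pruning by the projected partition columns:

```sql
-- Hypothetical example: count rejected connections per source address
-- and destination port for one account and one day.
SELECT sourceaddress,
       destinationport,
       COUNT(*) AS rejected_connections
FROM vpc_flow_logs_pp
WHERE account = '111122223333'       -- partition column: account ID
  AND "timestamp" = '2022/05/01'     -- partition column: yyyy/MM/dd
  AND action = 'REJECT'
GROUP BY sourceaddress, destinationport
ORDER BY rejected_connections DESC
LIMIT 20;
```

Restricting the WHERE clause to the partition columns is what keeps the query cheap: Athena only scans the S3 prefixes for that account and day.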

AWS CloudTrail

AWS CloudTrail is a governance, compliance, operations and risk monitoring service. CloudTrail provides an event history of your AWS account activity, including actions performed via the AWS Management Console, AWS SDKs, command line tools and other AWS services. The event history simplifies security analysis, resource change tracking and troubleshooting. Moreover, you can use CloudTrail to detect anomalous activity in your AWS accounts.

💡 If you want to know how to query the AWS CloudTrail Logs take a look at this documentation.

Here you can see how to create an Athena table for your central CloudTrail Logs:

CREATE EXTERNAL TABLE cloudtrail_logs_pp(
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>>>,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestParameters STRING,
    responseElements STRING,
    additionalEventData STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcEndpointId STRING
  )
PARTITIONED BY (
   `account` string,
   `region` string,
   `timestamp` string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://S3BucketName/AWSLogs/OrgIdIfOrgTrailIsUsed'
TBLPROPERTIES (
  'projection.account.values'='AccountIdsCommaSeparated',
  'projection.account.type'='enum',
  'projection.enabled'='true',
  'projection.timestamp.format'='yyyy/MM/dd',
  'projection.timestamp.interval'='1',
  'projection.timestamp.interval.unit'='DAYS',
  'projection.timestamp.range'='2020/01/01,NOW',
  'projection.timestamp.type'='date',
  'projection.region.type'='enum',
  'projection.region.values'='RegionsCommaSeparated',
  'storage.location.template'='s3://S3BucketName/AWSLogs/OrgIdIfOrgTrailIsUsed/${account}/CloudTrail/${region}/${timestamp}');
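A typical use of this table is hunting for failed API calls across all accounts. As a rough sketch (the date and error codes are assumptions - adjust them to what you are investigating), the following surfaces the identities and API calls that were denied on a given day:

```sql
-- Hypothetical example: AccessDenied errors across all accounts for one day,
-- grouped by caller and API action.
SELECT useridentity.arn,
       eventsource,
       eventname,
       errorcode,
       COUNT(*) AS occurrences
FROM cloudtrail_logs_pp
WHERE "timestamp" = '2022/05/01'     -- partition column: yyyy/MM/dd
  AND errorcode IN ('AccessDenied', 'Client.UnauthorizedOperation')
GROUP BY useridentity.arn, eventsource, eventname, errorcode
ORDER BY occurrences DESC
LIMIT 20;
```

Because no account filter is set here, Athena scans that day's prefix in every projected account - still bounded, but broader than a single-account query.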

AWS WAF

AWS WAF Logs help you to get detailed information about traffic that is analyzed by your web ACL. Logged information includes the time that AWS WAF received a web request from your AWS resource, detailed information about the request and details about the rules that the request matched. The logs help you to identify false positives and detect anomalous activity and security issues, so you can write your own custom rules manually or via automation.

💡 If you want to know how to query the AWS WAF Logs take a look at this documentation.

Here you can see how to create an Athena table for your central AWS WAF Logs:

CREATE EXTERNAL TABLE `waf_logs`(
  `timestamp` bigint,
  `formatversion` int,
  `webaclid` string,
  `terminatingruleid` string,
  `terminatingruletype` string,
  `action` string,
  `terminatingrulematchdetails` array<struct<
                                    conditiontype:string,
                                    location:string,
                                    matcheddata:array<string>>>,
  `httpsourcename` string,
  `httpsourceid` string,
  `rulegrouplist` array<struct<
                        rulegroupid:string,
                        terminatingrule:struct<
                           ruleid:string,
                           action:string,
                           rulematchdetails:string>,
                        nonterminatingmatchingrules:
                        array<struct<
                            ruleid:string,
                            action:string,
                            rulematchdetails:array<struct<
                                conditiontype:string,
                                location:string,
                                matcheddata:array<string>>
                                >>>,
                        excludedrules:string>>,
  `ratebasedrulelist` array<struct<
                          ratebasedruleid:string,
                          limitkey:string,
                          maxrateallowed:int>>,
  `nonterminatingmatchingrules` array<struct<
                                    ruleid:string,
                                    action:string>>,
  `requestheadersinserted` string,
  `responsecodesent` string,
  `httprequest` struct<clientip:string,
                      country:string,
                      headers:array<struct<
                                  name:string,
                                  value:string>>,
                      uri:string,
                      args:string,
                      httpversion:string,
                      httpmethod:string,
                      requestid:string>,
  `labels` array<struct<name:string>>)
PARTITIONED BY
(day STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://S3BucketName/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.day.type'='date',
  'projection.day.range'='2021/01/01,NOW',
  'projection.day.format'='yyyy/MM/dd',
  'projection.day.interval'='1',
  'projection.day.interval.unit'='DAYS',
  'storage.location.template'='s3://S3BucketName/${day}');
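To spot which rules are doing the blocking - and to hunt for false positives - you can aggregate blocked requests per terminating rule and URI. A hedged sketch (the date is a hypothetical placeholder):

```sql
-- Hypothetical example: top terminating rules and URIs for blocked
-- requests on a single day.
SELECT terminatingruleid,
       httprequest.uri,
       COUNT(*) AS blocked_requests
FROM waf_logs
WHERE day = '2022/05/01'             -- partition column: yyyy/MM/dd
  AND action = 'BLOCK'
GROUP BY terminatingruleid, httprequest.uri
ORDER BY blocked_requests DESC
LIMIT 20;
```

If a legitimate URI shows up near the top, that is a candidate for an exclusion or a custom rule.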

Amazon S3 server access logs

S3 server access logging provides detailed records of the requests that are made to a bucket. You can use this information for security and access audits, and it can help you understand your AWS bill. Moreover, it can help you troubleshoot permission problems with your buckets.

💡 If you want to know how to query the AWS S3 Access Logs take a look at this documentation.

Here you can see how to create an Athena table for your central AWS S3 Access Logs:

CREATE EXTERNAL TABLE `s3_access_logs_db.mybucket_logs`(
  `bucketowner` STRING,
  `bucket_name` STRING,
  `requestdatetime` STRING,
  `remoteip` STRING,
  `requester` STRING,
  `requestid` STRING,
  `operation` STRING,
  `key` STRING,
  `request_uri` STRING,
  `httpstatus` STRING,
  `errorcode` STRING,
  `bytessent` BIGINT,
  `objectsize` BIGINT,
  `totaltime` STRING,
  `turnaroundtime` STRING,
  `referrer` STRING,
  `useragent` STRING,
  `versionid` STRING,
  `hostid` STRING,
  `sigv` STRING,
  `ciphersuite` STRING,
  `authtype` STRING,
  `endpoint` STRING,
  `tlsversion` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://ACCESSLOGSBUCKETNAME/PREFIX/';
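For the permission troubleshooting mentioned above, a simple starting point is to pull the requests that S3 denied. A minimal sketch against the table defined here:

```sql
-- Hypothetical example: list requests that were denied with HTTP 403,
-- to see who tried to access what and from where.
SELECT requester,
       remoteip,
       operation,
       key,
       requestdatetime
FROM s3_access_logs_db.mybucket_logs
WHERE httpstatus = '403'
LIMIT 20;
```

Note that this table is not partitioned, so Athena scans the whole log prefix; for large volumes you may want to add date partitioning as in the earlier examples.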

💡 Note

The approach applied to the examples for VPC FlowLogs, WAF, CloudTrail and S3 Access Logs can certainly be applied to other logs, such as Application Load Balancer logs. You just have to adjust the corresponding fields and the partitioning.
