404 Error Detection and Notification on Real-Time Data via AWS Kinesis

Rahulbhatia1998
10 min read · Aug 29, 2024


Kinesis Data Analytics is a service provided by AWS that allows you to process and analyze streaming data in real time. It enables you to build and run SQL queries on streaming data and generate actionable insights.

Kinesis Data Analytics supports both SQL and Apache Flink as the underlying processing engine for analyzing streaming data.

SQL Engine: The SQL engine is a fully managed option that lets you process streaming data using standard SQL. It is a good choice for simple streaming transformations and aggregations.

Apache Flink Engine: The Apache Flink engine is more powerful, supporting stateful stream processing, event-time windows, and custom operators. It is a good choice for complex streaming data processing tasks.

Kinesis Data Analytics simplifies building and managing Apache Flink workloads.

Coming to the project itself:

When clients hit the web server, they generate logs containing crucial information such as status codes. These logs are sent to a Kinesis Data Stream for processing, and Kinesis Data Analytics filters them in real time. Specifically, the system is designed to identify instances where a 404 status code occurs more than 10 times within a 1-minute window; Flink SQL queries are applied to the incoming logs to detect this condition.

When the filter condition is met, an automatic email notification is sent to a user via Amazon Simple Notification Service (SNS). However, since Kinesis Data Analytics cannot send emails directly, an additional Kinesis Data Stream is created to store the filtered logs. To automate the email-sending process, an AWS Lambda function is configured to trigger when new filtered data arrives in that dedicated stream. The Lambda function, integrated with SNS, generates and dispatches the email notification to the user.

Here are the steps to implement the project:

1. Create a Kinesis Data Stream to store your incoming web server logs.

Provide a name for your data stream.

Specify the number of shards for your data stream and choose the Provisioned capacity mode.

Leave the rest as it is and click on the “Create data stream” button to create the stream.
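If you prefer the CLI, the same stream can be created with one command; the name KDS_source matches the stream name the Kinesis agent and Flink source table use later, and the shard count should be adjusted to your expected throughput:

aws kinesis create-stream --stream-name KDS_source --shard-count 1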

2. Create another Kinesis Data Stream for the filtered output from the analytics application, following the same steps as in step 1.

Name it “KDS_destination” (this is the stream name the Flink sink table and the Lambda trigger reference later).

3. To send email notifications, use the AWS Simple Notification Service (SNS).

Set up an SNS topic: provide a name for your topic and click on the “Create topic” button.

Now configure an email subscription in SNS.

While creating a subscription on SNS, enter the SNS topic ARN. In the subscription creation dialog, select “Email” as the protocol and enter the email address that should receive the notifications.
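For reference, the same topic and subscription can be created from the AWS CLI; the topic name matches the one used later in the Lambda code, and the account ID and email address are placeholders:

aws sns create-topic --name rahul-email-notify
aws sns subscribe --topic-arn arn:aws:sns:ap-south-1:<ACCOUNT_ID>:rahul-email-notify --protocol email --notification-endpoint you@example.com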

Verify the email address: for each email address provided, AWS SNS sends a confirmation email.

Check your mail and follow the provided AWS link to start receiving notifications from SNS.

You should have received an email like this.

Click on Confirm subscription and you will see a verification message.

The subscription will also show as confirmed on the topic.

4. Set up the AWS Lambda function:

➢Provide a name for your Lambda function.

➢Select the runtime environment.

I am going to be using “Python 3.10” as my runtime version.

➢Under “Permissions,” create a new execution role that grants necessary permissions to your Lambda function.

➢Click on the “Create function” button to create the Lambda function.

➢When the Lambda function was created, a new IAM execution role was created with it. Now we have to edit that role so the Lambda function can read records from the Kinesis Data Stream and publish to the SNS service.

➢Go to the Configuration -> Permissions tab, then click on the role.

You can give full access to the following services for now:

-> Kinesis

-> SNS

-> CloudWatch

➢Finally save these permissions.

➢Set up the Lambda function to be triggered by the arrival of filtered log data in the dedicated Kinesis Data Stream.

➢Select the trigger type that matches your requirements. In this case, choose “Kinesis.”

➢Click on the “Add” button to add the trigger.
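Alternatively, the trigger can be created from the CLI as an event source mapping; the function name is a placeholder, and the ARN should point at the filtered-output stream:

aws lambda create-event-source-mapping --function-name my-404-notifier --event-source-arn arn:aws:kinesis:ap-south-1:<ACCOUNT_ID>:stream/KDS_destination --starting-position LATEST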

➢Write an AWS Lambda function that will send email notifications.

You can paste the following code, replacing the SNS topic ARN with your own.

import boto3

client = boto3.client('sns')

def lambda_handler(event, context):
    # Replace the TopicArn below with your own SNS topic ARN
    client.publish(TopicArn="arn:aws:sns:ap-south-1:211125438166:rahul-email-notify",
                   Message="Webpage is down, status 404", Subject="Critical - Home Page Web App Down")
    print("Lambda ran and connected to SNS...")

Deploy the code and test it once. If the proper IAM permissions are assigned, you should receive the email in your inbox.
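For a quick smoke test from the CLI (the function name is a placeholder; the handler ignores the event payload, so none is needed):

aws lambda invoke --function-name my-404-notifier response.json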

5. Open the EC2 console and set up the web server.

You can set up the machine using the Amazon Linux 2 AMI, and during security group creation, enable HTTP traffic from the internet.

Launch the instance with the default properties.

Once launched, run the following commands to set up the Apache HTTP web server.

Amazon Linux 2 uses the yum package manager.

yum install httpd -y

Configure the Web server in this directory.

/var/www/html

Create two files here: “index.html” and “login.html”.
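For example, with hypothetical placeholder content:

echo "<h1>Home Page</h1>" | sudo tee /var/www/html/index.html
echo "<h1>Login Page</h1>" | sudo tee /var/www/html/login.html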

Start the Web Server

systemctl start httpd
systemctl enable httpd

Give all other users read and execute permission on the Apache log directory, so the Kinesis agent user can read the log data.
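A minimal sketch of that permission change, assuming the default Apache log location on Amazon Linux 2:

chmod -R o+rx /var/log/httpd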

➢Install the Kinesis Agent:

yum install aws-kinesis-agent -y

➢Configure the Kinesis Agent:

vi /etc/aws-kinesis/agent.json

{
    "cloudwatch.emitMetrics": true,
    "kinesis.endpoint": "kinesis.ap-south-1.amazonaws.com",
    "firehose.endpoint": "",
    "flows": [
        {
            "filePattern": "/var/log/httpd/access_log*",
            "kinesisStream": "KDS_source",
            "partitionKeyOption": "RANDOM",
            "dataProcessingOptions": [
                {
                    "optionName": "LOGTOJSON",
                    "logFormat": "COMMONAPACHELOG"
                }
            ]
        }
    ]
}

For the Apache log formats the Kinesis agent supports, check the documentation here:
https://docs.aws.amazon.com/streams/latest/dev/writing-with-agents.html
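For illustration, here is a hypothetical access-log line and the JSON record the agent would emit for it after the LOGTOJSON conversion with COMMONAPACHELOG; the field names match the Flink table schema defined later:

192.168.1.10 - - [29/Aug/2024:10:15:32 +0000] "GET /login.html HTTP/1.1" 404 196

{"host": "192.168.1.10", "ident": null, "authuser": null, "datetime": "29/Aug/2024:10:15:32 +0000", "request": "GET /login.html HTTP/1.1", "response": "404", "bytes": "196"}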

Create an IAM role for EC2 and grant it full access to Kinesis.

Assign the new role to the EC2 instance.

Now you can start the Kinesis Agent.

service aws-kinesis-agent start

service aws-kinesis-agent status
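To confirm the agent is actually shipping records, tail its own log; it periodically reports how many records were parsed and sent:

tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log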

6. Create a Kinesis Data Analytics application using a Studio notebook.

The service has recently been renamed to Amazon Managed Service for Apache Flink.

Add the relevant details for the Flink Studio notebook.

Provide a name for your application and select the runtime environment.

While this is being created, open a new tab and go to AWS Glue.
-> Create a Glue database for storing metadata about the sources and destinations of the Kinesis Data Analytics application.
➢Go to the AWS Glue database console.

Click on Create database.
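The CLI equivalent, with an assumed database name:

aws glue create-database --database-input '{"Name": "kda_flink_db"}'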

Then configure this database back in your Apache Flink notebook.

➢Add the source of the Kinesis Data Analytics application.

Add the destination of the Kinesis Data Analytics application.

Leave the rest as it is and click Continue to create the Apache Flink notebook.

Records in KDS are in JSON format, according to the configuration of the Kinesis agent.

●Flink will receive the data in real time, but it arrives in JSON format, while Flink SQL operates on tables.

●In Kinesis Data Analytics with Apache Flink, you can process JSON data, but you have to map it into a table format.

●When working with JSON data in Kinesis Data Analytics, you can treat the key-value pairs in the JSON as fields and their corresponding values.

●By defining a schema that matches the structure of the JSON data, you can create a table with columns representing the fields in the JSON.

●When creating a table in Flink, you need to specify the connector used to read and write the table's data. The connector is responsible for translating between the external system and a format Flink can understand.

Now, go to the Zeppelin notebook.

●In Flink SQL, you create tables backed by the Kinesis Data Streams connector; the DDL statements below define one table for the incoming logs and one for the filtered output.

●The %flink.sql statement allows you to run a Flink SQL query in a streaming context. The query will be executed continuously, and results will be emitted as new data arrives. For continuously updating results, use the update output mode:

%flink.sql(type=update)

The query will be executed in real-time, and as soon as results arrive, they will be shown.

The “final_web_404_log1” table will store the aggregated counts that meet the filter condition.

We can use the TUMBLE window function to aggregate logs within a fixed time window; the queries below use a 30-second window (substitute INTERVAL '1' MINUTE for the one-minute window described earlier).

Create a table in Flink SQL and specify the connector as Kinesis Data Streams. The “my_web_log” table will store all of the incoming logs.

CREATE TABLE my_web_log (
    host VARCHAR(255),
    ident VARCHAR(255),
    authuser VARCHAR(255),
    datetime VARCHAR(255),
    request VARCHAR(255),
    response VARCHAR(255),
    ArrivalTime AS PROCTIME()
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'KDS_source',
    'aws.region' = 'ap-south-1',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
);

The following query continuously updates the number of 404 errors in your web log, grouped into 30-second intervals. The webtotal column contains the count of 404 errors in each interval.

SELECT CAST(COUNT(*) AS INTEGER) AS webtotal
FROM my_web_log
WHERE response = '404'
GROUP BY TUMBLE(ArrivalTime, INTERVAL '30' SECOND);

The “final_web_404_log1” table will store the aggregated counts that meet the condition.

CREATE TABLE final_web_404_log1 (
    mytotal INTEGER
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'KDS_destination',
    'aws.region' = 'ap-south-1',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
);

This query selects all rows from the my_web_log table where the response column equals '404', groups the results by the ArrivalTime column in 30-second intervals, and counts the rows in each group. Finally, it inserts only the rows where the mytotal column is greater than or equal to 10 into the final_web_404_log1 table.

INSERT INTO final_web_404_log1
SELECT * FROM (
    SELECT CAST(COUNT(*) AS INTEGER) AS mytotal
    FROM my_web_log
    WHERE response = '404'
    GROUP BY TUMBLE(ArrivalTime, INTERVAL '30' SECOND)
)
WHERE mytotal >= 10;
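To test the pipeline end to end, you can generate a burst of 404s against the instance (the public IP is a placeholder; any non-existent path works):

for i in $(seq 1 15); do curl -s -o /dev/null http://<EC2_PUBLIC_IP>/missing-page.html; done

Once the count in a window crosses the threshold, a record lands in KDS_destination, the Lambda fires, and the SNS email arrives.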

To conclude, implementing 404 error detection and notification on real-time data using AWS Kinesis provides an efficient and scalable solution for monitoring web traffic. By leveraging AWS services like Kinesis Data Streams, Lambda, and SNS, organizations can quickly detect 404 errors as they occur, enabling immediate responses to issues that could impact user experience. This approach not only enhances operational efficiency but also ensures that critical errors are addressed in real-time, minimizing potential downtime and improving overall website reliability. Adopting such a solution is crucial for businesses that prioritize seamless user interactions and require robust error monitoring systems.
