Over the past week we have been creating an infrastructure for long-term log data storage. This data could be any form of unstructured streaming text data, and in our case, we used Apache log data.
For this project, we have set up three machines: a web server and two machines running the Hortonworks stack (a namenode and a data node). In the Hadoop environment, we use:
- HBase to store raw log data.
- Hive to store modeled analytics data.
- Flume for log data ingestion.
- The core Hadoop ecosystem.
On the web server itself, we run Apache and Flume.
Flume listens to the logs generated by Apache. When Flume detects an entry, it sends a copy to the machines running HBase. The Flume agent on these machines receives the data and writes it to an HBase table and HDFS for long-term storage. When the data is sitting in HBase, it’s easy to write scripts in Python to extract all the information we need from the log entry.
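The ingestion pipeline described above is driven by the Flume agent configuration on the web server. A minimal sketch might look like the following, assuming an exec source tailing Apache's access log and an Avro sink pointed at the HBase machine; the agent name, log path, hostname, and port are all placeholder assumptions, not the actual setup:

```properties
# Web-server-side Flume agent: tail the Apache log, forward over Avro
agent1.sources = apache-logs
agent1.channels = mem-channel
agent1.sinks = to-hbase

agent1.sources.apache-logs.type = exec
agent1.sources.apache-logs.command = tail -F /var/log/httpd/access_log
agent1.sources.apache-logs.channels = mem-channel

agent1.channels.mem-channel.type = memory

agent1.sinks.to-hbase.type = avro
agent1.sinks.to-hbase.hostname = namenode.example.com
agent1.sinks.to-hbase.port = 4141
agent1.sinks.to-hbase.channel = mem-channel
```

A matching agent on the Hadoop side would use an Avro source and the HBase and HDFS sinks to land the data in both stores.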
Aggregate and Model Data
After extracting the data, we can aggregate and model it within our scripts. When the data is modeled to suit our use case, we can write it to Hive for structured storage and easy accessibility for our visualizations. We can perform all of this in Python by adding an open source package that can query HBase and Hive.
Next, it’s time to take a look at what we can do with standard Apache log data. We want to maximize the amount of information we can extract from the data we have.
Our raw data is received in the form of:
"Source_IP_Address - [Timestamp -Time_Zone_Identifier] 'Request' Status_Code Content_Size"
It would look something like this:
"127.0.0.1 - [4/MAY/2015:10:12:22 -0500] 'POST /index.html HTTP/1.0' 200 1024".
In our case, the unique identifier is the IP address. We can think of each IP address as a user. This data may not look like much, but considering we get an entry any time the page refreshes, we can get a good feel for what a user is doing.
Put Data into a Structured Form
The first thing we do with the data is break it apart into a structured form that we can insert into Hive. Using a script, we can first extract:
- The IP address
- Time zone
- Request type
- Page requested
- Status code
- Content size
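The extraction step above can be sketched with a regular expression. This is a minimal illustration, not the author's actual script; the pattern and field names are assumptions based on the raw format shown earlier:

```python
import re

# Regex for the raw entry format shown above; each named group maps
# to one of the fields we want to extract for Hive.
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) - \[(?P<timestamp>\S+) (?P<tz>[+-]\d{4})\] "
    r"'(?P<method>\S+) (?P<page>\S+) (?P<protocol>[^']+)' "
    r"(?P<status>\d{3}) (?P<size>\d+)"
)

def parse_entry(line):
    """Break one raw Apache log line into a dict of structured fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # malformed line; skip or log it
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    return fields

entry = parse_entry(
    "127.0.0.1 - [4/MAY/2015:10:12:22 -0500] 'POST /index.html HTTP/1.0' 200 1024"
)
```

Running the example line through `parse_entry` yields the IP address, time zone, request type, page requested, status code, and content size as separate fields.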
Then, we can write a few quick functions to determine:
- If a request was made in the morning or evening.
- The day of the week on which the user hit the page.
- How frequently the user returned.
- How long the user spent on each page.
- The location of the user.
- Exactly how the user requested the page.
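A couple of those quick functions might look like this. The timestamp format follows the raw entries shown earlier, and the noon cutoff for morning versus evening is an assumption for illustration:

```python
from datetime import datetime

def parse_timestamp(ts):
    # Timestamp format from the raw entry, e.g. "4/MAY/2015:10:12:22"
    return datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S")

def is_morning(dt):
    """True if the request arrived before noon."""
    return dt.hour < 12

def day_of_week(dt):
    """Name of the day on which the request was made."""
    return dt.strftime("%A")

dt = parse_timestamp("4/MAY/2015:10:12:22")
```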
Then, we can aggregate the data and build a profile of our user and—more importantly—of their digital footprint in our server.
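As a sketch of that aggregation step, a profile keyed by IP address could be built up like this; the profile fields are illustrative, not the author's actual model:

```python
from collections import defaultdict

def build_profiles(entries):
    """Roll parsed log entries up into one profile per IP address."""
    profiles = defaultdict(lambda: {"requests": 0, "pages": set()})
    for e in entries:
        profile = profiles[e["ip"]]
        profile["requests"] += 1
        profile["pages"].add(e["page"])
    return dict(profiles)

profiles = build_profiles([
    {"ip": "127.0.0.1", "page": "/index.html"},
    {"ip": "127.0.0.1", "page": "/about.html"},
    {"ip": "10.0.0.7", "page": "/index.html"},
])
```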
Store the Data
Now that we have a script to model the data with, it’s time to store the data in a structured database. Currently, Hive supports data insertion via HiveQL. But, many environments might still be running an older version of Hive. Even if you have the new version, you might find it a bit slow to insert millions of rows of data via INSERT queries. For my data ingestion into Hive, I add a function to my script. The function takes all the data I’ve collected, inserts it into a list in the correct order to match my Hive table, and writes the data to a CSV file.
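That CSV-writing function might look like the following sketch. The column names and file path are illustrative assumptions; the key point is that the list order must match the Hive table definition:

```python
import csv

# Column order must match the Hive table definition; these column
# names are illustrative, not the author's actual schema.
HIVE_COLUMNS = ["ip", "timestamp", "tz", "method", "page", "status", "size"]

def write_hive_csv(rows, path):
    """Write modeled rows to a CSV file ready for Hive's LOAD DATA."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([row[col] for col in HIVE_COLUMNS])

write_hive_csv(
    [{"ip": "127.0.0.1", "timestamp": "4/MAY/2015:10:12:22", "tz": "-0500",
      "method": "POST", "page": "/index.html", "status": 200, "size": 1024}],
    "modeled_logs.csv",
)
first_line = open("modeled_logs.csv").read().splitlines()[0]
```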
Load Data Statement
Next, we can use Hive’s LOAD DATA statement to insert millions of rows of data in trivial amounts of time. We create two tables in Hive. The first table contains a copy of every request with the aggregate data associated with the user appended. The second table contains one entry for each unique IP address and all of the related aggregate data.
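The statement itself can be built in the same script. This is a sketch with a hypothetical table name and path; in practice you would execute the statement through a Hive client (for example over JDBC) rather than print it:

```python
def load_statement(csv_path, table):
    """Build the HiveQL LOAD DATA statement for a modeled CSV file."""
    return "LOAD DATA LOCAL INPATH '{0}' INTO TABLE {1}".format(csv_path, table)

stmt = load_statement("/tmp/modeled_logs.csv", "requests_with_aggregates")
```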
You can improve performance in several ways, such as partitioning on the IP address. Some people might also save a little storage space by generating a unique ID as a pseudo-primary key on each table, keeping only the aggregate data in the second table and performing a join whenever pulling the full data. This procedure will vary based on your specific requirements and environment. After the data is sitting in Hive, it’s easy to query the data via JDBC on many visualization platforms.
Use the Data
You can use this data to train machine learning algorithms that detect all sorts of things. Currently, we’re trying to detect abnormalities and possible threats on the fly with minimal delay.
In a business system where you have the same users logging in at the same time week after week, you can use the log data they generate to determine if a user account has been hijacked. Or, you can use the average time between web requests to detect and mitigate denial of service attacks.
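The denial-of-service idea reduces to a simple statistic per IP: the average gap between consecutive requests. A minimal sketch, using epoch-second timestamps and an arbitrary threshold (real tuning would depend on the traffic profile):

```python
def mean_gap_seconds(request_times):
    """Average seconds between consecutive requests from one IP."""
    ordered = sorted(request_times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

def looks_like_flood(request_times, threshold=0.5):
    """Flag an IP whose requests arrive faster than the threshold, on average."""
    return len(request_times) > 1 and mean_gap_seconds(request_times) < threshold
```

Four requests a tenth of a second apart trip the flag; three requests a minute apart do not.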
Syntelli data scientists are passionate and are constantly seeking new and innovative solutions to employ in the field of Big Data & Analytics.
Data Science Consultant