Processing Large Data with PostgreSQL

In Section 3.2, a simple reporting system was developed that informs the interested parties about several aspects of the web platform. That might be enough for some businesses or organizations, but others might need a deeper understanding. To address this, a solution is needed that offers a common interface for executing analysis tasks on the data. There are a number of such solutions; however, PostgreSQL caters to most of the requirements. Here, PostgreSQL is used as our data processing engine for the following reasons:

● Open source, which means no software license fees

● Basic table partitioning

● Common table expressions

● Window functions for aggregating data

● Unstructured data handling

● Leverages multiple CPUs to answer queries faster

● Well known with a large community

In this section, we will work on the same data as in Section 3.2. It is assumed that the data is already parsed. Some scripting will still be written to load the data into PostgreSQL, though there is no need to revisit all the steps covered earlier.

3.3.1 Data Modeling

Previously, the data we get from request logs was described. Here, the goal is to create a model for that data in PostgreSQL. We will leverage PostgreSQL partitions to efficiently execute queries on given ranges.

Yet, before doing so, let us quickly visit PostgreSQL partitions. PostgreSQL partitioning splits one large table into smaller physical pieces. When PostgreSQL receives a query against a partitioned table, it uses different execution plans to take full advantage of the individual partitions. For example, when appropriate, PostgreSQL can sequentially scan a single partition instead of performing random access across the whole table. Partitions also increase the gains from indexing when a table grows large, because it is easier to fit a partition's index in memory than the whole index. Moreover, partitions come in handy for bulk operations: inserting or deleting an entire partition is possible without taking a massive lock on the table. Such operations are very common on large data sets; we may need to add new partitions for each day or hour or, on the flip side, delete hourly or daily partitions for retention purposes. Last but not least, partitions can be moved to other storage devices without affecting the rest:

CREATE TABLE dbdp.request_log (
    ts timestamp WITH TIME ZONE NOT NULL,
    host text,
    ip text,
    http_method text,
    unique_cookie text,
    path text,
    http_status_code int,
    content_length int,
    request_time int,
    http_referrer text,
    user_agent text
) PARTITION BY RANGE (ts);

Let us take a look at the data model. In the table, we have ts (shorthand for timestamp) as the partition column and the rest of the columns parsed from the request log. All of the columns except ts are nullable, as some of them might be missing for numerous reasons. Note that it might be a good idea to check the counts occasionally to catch incomplete data.
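As a quick sanity check of that kind, the number of rows and missing values per day can be counted. The query below is only an illustrative sketch, assuming the table definition above:

-- Rows and missing values per day for the last week (illustrative sketch)
SELECT date_trunc('day', ts) AS day,
       count(*) AS total_rows,
       count(*) - count(unique_cookie) AS missing_cookies,
       count(*) - count(http_status_code) AS missing_status_codes
FROM dbdp.request_log
WHERE ts >= current_date - 7
GROUP BY 1
ORDER BY 1;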

3.3.2 Copying Data

When the data were processed with Unix tooling, most of the work was done on the web server itself. In this case, however, it is ideal to let PostgreSQL do the heavy lifting. Thus, we have to transfer all parsed data to the PostgreSQL server and copy the data into PostgreSQL:

TODAY=$(date '+%Y-%m-%d')
YESTERDAY=$(date -d "yesterday" '+%Y-%m-%d')
TWO_WEEKS_AGO_PARTITION=$(date -d "2 weeks ago" '+%Y_%m_%d')
YESTERDAY_PARTITION=$(date -d "yesterday" '+%Y_%m_%d')
TABLENAME='dbdp.request_log'

# Only keep the last 2 weeks of data
# Add yesterday's partition
psql -h "${POSTGRESQL_SERVER}" -U "${POSTGRESQL_USER}" "${DATABASE}" -c "
    DROP TABLE IF EXISTS ${TABLENAME}_partition_${TWO_WEEKS_AGO_PARTITION};
    CREATE TABLE IF NOT EXISTS ${TABLENAME}_partition_${YESTERDAY_PARTITION}
        PARTITION OF dbdp.request_log
        FOR VALUES FROM ('${YESTERDAY}') TO ('${TODAY}');"

# Pull parsed request logs from every web server and copy them into PostgreSQL.
# ssh -n prevents ssh from consuming the hosts list on stdin.
while IFS= read -r host; do
    ssh -n "${SERVER_USER}@${host}" "zgrep request-log /data/logs/*.gz" |
        cut -c 14- |
        awk '{ s = ""
               # strip commas from fields; the first 10 fields become CSV columns,
               # the remaining fields (the user agent) are kept together
               for (i = 1; i <= NF; i++) { gsub(/,/, "", $i); s = i > 10 ? s" "$i : s$i"," }
               print s }' |
        psql -h "${POSTGRESQL_SERVER}" -U "${POSTGRESQL_USER}" "${DATABASE}" \
            -c "SET datestyle = 'ISO,DMY'; COPY ${TABLENAME} FROM STDIN DELIMITER ',';"
done <hosts

In the first part of the script above, some variables are initialized for later use. In the second part, the partition from two weeks ago is dropped for retention purposes, and a new partition is added for yesterday's data. In the loop, we go over each host and copy its data to the PostgreSQL server. Note that there is no need to perform the extra parsing that was done before for the user agent; ideally, the user agent string is kept as it is, since it might be valuable for other purposes. When connecting to PostgreSQL, it is assumed that we either have a .pgpass file or our IP is whitelisted by the PostgreSQL server for authentication. The authentication part will not be covered here as it is out of the scope of this book.
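Once the copy finishes, it can be worth confirming that yesterday's rows actually arrived from every host. The following query is only an illustrative sketch based on the table defined earlier:

-- Rows copied per web server for yesterday (illustrative sketch)
SELECT host, count(*) AS copied_rows
FROM dbdp.request_log
WHERE ts >= current_date - 1 AND ts < current_date
GROUP BY host
ORDER BY copied_rows DESC;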

SET max_parallel_workers_per_gather = 4;

SELECT unique_cookie, count(unique_cookie) AS view_count
FROM dbdp.request_log
WHERE ts >= current_date - 15 AND ts < current_date
GROUP BY unique_cookie;

The data is now loaded into PostgreSQL, and we can run SQL queries over the request log data. One important aspect is to limit queries to the relevant partitions; otherwise, PostgreSQL has to scan partitions that are not needed. This is a common concern for queries over Big Data. Careful writing of queries is advised and, in most cases, they should be limited to ranges and partitions. We do not want to put an undesirable load on resources due to heavy queries.
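One way to check that only the intended partitions are touched is to look at the query plan. The example below is not from the original text; the dates are placeholders, and it assumes that partitions covering those dates exist:

-- Partitions outside the date range are pruned and do not appear in the plan
EXPLAIN
SELECT count(*)
FROM dbdp.request_log
WHERE ts >= '2020-02-01' AND ts < '2020-02-03';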

In the view count query above, the views per user are calculated using the unique cookie. We also set max_parallel_workers_per_gather to speed up query execution by parallelizing the work among worker processes. And that is just a start.
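For instance, a common table expression combined with a window function, two of the features listed earlier as reasons for choosing PostgreSQL, can rank users by their daily activity. The query below is only an illustrative sketch, not taken from the original text:

-- Daily views per cookie, with a rank of the most active users per day
WITH daily_views AS (
    SELECT unique_cookie,
           date_trunc('day', ts) AS day,
           count(*) AS view_count
    FROM dbdp.request_log
    WHERE ts >= current_date - 7 AND ts < current_date
    GROUP BY unique_cookie, date_trunc('day', ts)
)
SELECT day,
       unique_cookie,
       view_count,
       rank() OVER (PARTITION BY day ORDER BY view_count DESC) AS activity_rank
FROM daily_views
ORDER BY day, activity_rank;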

Even more detailed queries can be written to understand user behavior. We can potentially connect a business intelligence tool to our table and bring in other tables from the production databases to join with the log data. Nevertheless, a single PostgreSQL server might not be enough for such complicated work. Thus, a brief look at a multi-server setup is worth considering.

3.3.3 Sharding in PostgreSQL

Although one PostgreSQL server might be enough for many workloads, it might not be sufficient for jobs that involve more data. We can still stick to PostgreSQL by adding a couple of nodes to distribute data over multiple servers. Luckily, PostgreSQL has a feature called foreign data wrappers that provides a mechanism to natively shard tables across multiple PostgreSQL servers. When we run a query on a node, the foreign data wrapper transparently queries the other nodes and returns the results as if they were coming from a table in the current database.

3.3.3.1 Setting up Foreign Data Wrapper

The foreign data wrapper comes with the standard PostgreSQL distribution. The wrapper postgres_fdw is an extension in the distribution and can be enabled through the following command. Note that it has to be run by a database admin:

CREATE EXTENSION postgres_fdw;
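The extension is installed per database, so it should be enabled in the database that will host the foreign tables. A quick way to verify it, shown here only as an illustrative sketch, is to query the system catalog:

-- Check that postgres_fdw is installed in the current database
SELECT extname, extversion FROM pg_extension WHERE extname = 'postgres_fdw';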

Once the extension is enabled, a server can be created. On the remote servers, we expect a database named dbdp, a user named dbdp, and a schema named dbdp to be set up. Two servers can be created as follows:

CREATE SERVER dbdp_server_one
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'one.pg.mycompany.com', dbname 'dbdp');

CREATE SERVER dbdp_server_two
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'two.pg.mycompany.com', dbname 'dbdp');

To enable connections from the local server to the remote servers, users on the local server must be mapped to users on the remote servers. The mapping can be done as follows:

CREATE USER MAPPING FOR CURRENT_USER SERVER dbdp_server_one
    OPTIONS (user 'dbdp', password '*****');

CREATE USER MAPPING FOR CURRENT_USER SERVER dbdp_server_two
    OPTIONS (user 'dbdp', password '*****');

An easy way to finish the setup is to import the desired schemas from the remote servers into the local server. The import can be done as follows:

IMPORT FOREIGN SCHEMA "dbdp" FROM SERVER dbdp_server_one INTO dbdp_one;

IMPORT FOREIGN SCHEMA "dbdp" FROM SERVER dbdp_server_two INTO dbdp_two;
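Assuming the remote dbdp schema contains the request_log table described earlier, and that the local target schemas dbdp_one and dbdp_two were created beforehand (IMPORT FOREIGN SCHEMA requires them to exist), the imported foreign tables can be queried locally right away. The names below are assumptions, shown only as a minimal sketch:

-- Query a foreign table that was imported into the local dbdp_one schema
SELECT count(*) FROM dbdp_one.request_log;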

Now we are done with pairing the databases. One could pair all databases together so that each of them can execute the same queries without any problem. Nevertheless, here a master database is selected that delegates query execution to the other servers. It might be preferable to set up a stronger machine in terms of CPU and memory for the master server. Next, we can figure out how the sharding is done.

3.3.3.2 Sharding Data over Multiple Nodes

Partitioning was already used when implementing the copy operation for the request logs. As discussed earlier, it provides advantages over a traditional table when it comes to large volumes of data. The next idea is to go even beyond partitioning on a single server and partition the data over multiple servers. Distributed partitioning is called sharding, since it involves scaling out horizontally. Sharding is required when the amount of data in the table gets close to the capacity of a single server. Besides, we might need parallel processing on multiple servers when answering analytics or reporting queries. There is just one caveat: always filter queries by the partition key. Otherwise, the queries will soon exhaust the database system:

CREATE FOREIGN TABLE dbdp.request_log_2020_01
    PARTITION OF dbdp.request_log
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
    SERVER dbdp_server_one;

CREATE FOREIGN TABLE dbdp.request_log_2020_02
    PARTITION OF dbdp.request_log
    FOR VALUES FROM ('2020-02-01') TO ('2020-03-01')
    SERVER dbdp_server_two;

Assume that we have created the respective tables dbdp.request_log_2020_01 and dbdp.request_log_2020_02 on the remote servers with the same definition discussed earlier. The statements above then link the remote servers to the local server by treating each remote table as a partition of the local table. PostgreSQL also supports sub-partitioning, which means partitions can be made even smaller. Following our example, daily partitions on the remote servers can be created as follows:

CREATE TABLE dbdp.request_log_2020_02_01
    PARTITION OF dbdp.request_log_2020_02
    FOR VALUES FROM ('2020-02-01') TO ('2020-02-02');

CREATE TABLE dbdp.request_log_2020_02_02
    PARTITION OF dbdp.request_log_2020_02
    FOR VALUES FROM ('2020-02-02') TO ('2020-02-03');

With sharding and partitioning in place, we have a distributed PostgreSQL cluster that can be scaled out horizontally. Nevertheless, it requires additional development and maintenance to support partitions and sharding. Plugins can be used for partition management, or partition creation can be automated through a scheduler; however, this requires additional effort. When that effort gets close to being on par with the complexity of a Big Data system, it might be a good idea to consider more scalable options.
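As a rough idea of what the scheduler-driven partition creation mentioned above might look like, the snippet below creates next month's partition if it does not exist yet. It is only an illustrative sketch, not from the original text, and would typically be run periodically by a scheduler such as cron:

-- Create next month's partition of dbdp.request_log ahead of time (illustrative sketch)
DO $$
DECLARE
    first_day date := date_trunc('month', current_date) + interval '1 month';
    last_day  date := date_trunc('month', current_date) + interval '2 months';
    part_name text := 'request_log_' || to_char(first_day, 'YYYY_MM');
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS dbdp.%I PARTITION OF dbdp.request_log
             FOR VALUES FROM (%L) TO (%L)',
        part_name, first_day, last_day);
END
$$;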
