Big Data Applications
The database administrator is usually an expert on SQL performance and tuning. With the added responsibility of supporting a big data application, the DBA is also expected to monitor and tune the business intelligence (BI) queries that users submit. One big difference is that BI queries are almost always ad hoc: the DBA encounters them in real time, so performance tuning tactics must be proactive.
One additional issue is that some of these ad hoc BI queries will be changed in minor (or major) ways by users and may end up as regular production reports. The result is that the DBA must not only monitor queries and make appropriate database or application changes (such as partitioning or indexing), but must also coordinate with user departments to determine how their queries may be used in the future.
These concerns should be reviewed regularly as DBAs do capacity planning for the federated data stores that the queries access: the big data application(s), one or more data warehouses, various data marts, and perhaps even some operational data.
Capacity Planning and Application Breaking Points
The first stage where the DBA should get involved is during application and database design. Important items include the following:
◉ Complexity. Some big data applications consist of hundreds of tables or more. Are there common queries that will Join these tables regularly? As more data is accumulated, do queries get more complex? For example, if the DBA decides to purge or archive older data to a separate table, are there user queries that must access all of the data, and hence will need to do an SQL Union operation of the current and purged data (see the first sketch following this list)?
◉ Speed. Query elapsed time is usually not an issue in big data applications, at least for queries whose data are stored entirely in the big data environment. However, there is usually common data in the data warehouse that must also be Joined to these tables, because the warehouse already contains most of the dimensional data used for aggregation and subsetting. Queries that require a breakdown by geographic area, sales territory, store number, product type, or other dimension will access the data warehouse and the big data application simultaneously, and this may lead the DBA to do some proactive performance tuning.
◉ Database design. Databases are usually designed to support one or more specific applications. As such, it is possible to embed business rules in the design rather than coding them in SQL or in an application program. Standard examples include parent-child relationships. Does an order require a valid customer? This verification can be coded in the database design as a foreign key relationship or a trigger. Must customer numbers be numeric? This data integrity check can be coded as a trigger or a column edit rule (see the second sketch following this list).
◉ Capacity growth. Big data applications start out big and tend to get bigger. There are several unexpected items that can affect the DBA’s storage and CPU calculations. One example is when the initial application specifications call for holding two years of data. After a few months, the business insists that seven years of data retention are necessary. Another example is when the DBA determines that several indexes are required for performance of critical SQL statements. These indexes take up storage space.
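To illustrate the archiving concern above, here is a minimal sketch assuming hypothetical ORDERS and ORDERS_ARCHIVE tables (names and exact syntax will vary by platform). A user query that must see the complete history has to combine the current and purged data with a Union:

    -- ORDERS holds current rows; ORDERS_ARCHIVE holds purged/archived rows.
    -- A query spanning the full history must read both tables.
    SELECT order_no, customer_no, order_date, order_total
      FROM ORDERS
     WHERE order_date >= '2015-01-01'
    UNION ALL
    SELECT order_no, customer_no, order_date, order_total
      FROM ORDERS_ARCHIVE
     WHERE order_date >= '2015-01-01'

Every indexing and partitioning decision now has to make sense for both halves of the Union, which is exactly the kind of growth in complexity the DBA should anticipate.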
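Similarly, here is a hedged sketch of embedding business rules in the database design, again with hypothetical CUSTOMER and ORDERS tables. The "order requires a valid customer" rule becomes a foreign key, and a simple column edit rule becomes a check constraint; declaring the customer number with a numeric data type is one way to guarantee it is numeric, and a trigger could enforce more elaborate edits:

    CREATE TABLE CUSTOMER (
        customer_no    INTEGER       NOT NULL PRIMARY KEY,
        customer_name  VARCHAR(100)  NOT NULL
    );

    CREATE TABLE ORDERS (
        order_no      INTEGER        NOT NULL PRIMARY KEY,
        customer_no   INTEGER        NOT NULL,   -- numeric by data type
        order_total   DECIMAL(11,2)  NOT NULL,
        -- "An order requires a valid customer" coded as a foreign key
        CONSTRAINT fk_valid_customer FOREIGN KEY (customer_no)
            REFERENCES CUSTOMER (customer_no),
        -- A simple column edit rule coded as a check constraint
        CONSTRAINT ck_total_not_negative CHECK (order_total >= 0)
    );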
Big Data Application Testing
As the users begin to test their first queries, the DBA has the opportunity to gather more information, including capturing SQL statements for later analysis and doing some preliminary performance monitoring. These are the first indications of the stress that will be placed on system resources when the application goes into production. In addition to estimates of production table sizes and an initial set of indexes, the DBA can begin documenting potential future issues.
Some examples of these are the following:
◉ Backup and recovery. Sometimes disasters occur. While complete site-wide catastrophes are rare, more common issues such as invalid data loads or disk storage failures are possible. Big data applications are usually too big to recover by making regular full backup copies of databases. It would take too long! More typically, the support staff implements either a replication scheme that posts copies of big data changes to a backup site, or a second big data environment that mirrors the first. In either case, the desired result is to have a standby big data application ready in some form in case issues arise.
◉ Object management. Commonly, databases and tables store data in a physical sequence. For example, an order table may store orders in sequence by order number. As inserts, updates and deletes happen, it is possible for the strict sequencing to be broken. The DBA can then run a reorganization utility to restore the sequencing. This is not an option for big data objects, as they are simply too big to reorganize quickly. Therefore, the DBA must design objects, data loads and data change logic such that reorganizations are not necessary (see the first sketch following this list).
◉ Resource usage. System resources include CPUs, memory, disk storage and network bandwidth. In addition, transaction elapsed times and physical I/O operations are usually considered resources as well. The DBA should use big data application testing as an opportunity to forecast what resources the application and its queries will require, with an eye towards developing options should one or more types of resources become constrained in the future. For example, if it appears that there is insufficient CPU power available to process queries in a timely manner, the DBA can predict when new CPUs must be added. Alternatively, the DBA may be able to implement database changes, such as additional indexes, that allow queries to use less CPU power (see the second sketch following this list).
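One common way to design objects so that reorganizations are unnecessary is to partition large tables by a steadily increasing key, such as a date, so that new rows always land in the newest partition and older partitions are never disturbed. The sketch below uses DB2-style range partitioning with hypothetical names; exact syntax varies by platform and version:

    -- New orders always land in the most recent monthly partition,
    -- so existing partitions stay in sequence and never need a reorganization.
    CREATE TABLE ORDERS (
        order_no     BIGINT         NOT NULL,
        customer_no  INTEGER        NOT NULL,
        order_date   DATE           NOT NULL,
        order_total  DECIMAL(11,2)  NOT NULL
    )
    PARTITION BY RANGE (order_date)
        (STARTING FROM ('2021-01-01') ENDING ('2021-12-31') EVERY (1 MONTH));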
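And as a hedged example of the indexing alternative: if testing shows a frequently run query scanning a large table, an index on its filtering and grouping columns can reduce CPU and I/O. The table and column names below are hypothetical:

    -- Query observed repeatedly during testing
    SELECT store_no, SUM(order_total)
      FROM ORDERS
     WHERE order_date BETWEEN '2021-01-01' AND '2021-03-31'
     GROUP BY store_no;

    -- A supporting index lets the query read only the needed columns in key order
    -- instead of scanning the entire table.
    CREATE INDEX ix_orders_date_store
        ON ORDERS (order_date, store_no, order_total);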
Big Data in Production
With the application in production, the DBA must now look to the future. Varied user communities will now have access to big data, the number of queries will increase dramatically, and the results will be used to make business decisions. The user community will then begin to diversify into several different categories, each with its own needs.
Sophisticated users with BI tools. Most big data applications are expensive investments, and business management will already have several analyses or use cases to get some immediate payback on their investment. These users will commonly be given one or more business intelligence tools that will build and submit queries using a graphical interface. These queries will be relatively complex in that they will access many tables and will contain requests for aggregations (sums, averages, minima and maxima) and dimensional breakdowns (by customer type, by product type, by time period, by geographical area, etc.).
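A representative query from these BI tools might look something like the following sketch (table and column names are hypothetical): it Joins the big data fact table to warehouse dimension tables and requests aggregations broken down along several dimensions:

    SELECT d.product_type,
           g.sales_region,
           SUM(f.sale_amount) AS total_sales,
           AVG(f.sale_amount) AS average_sale,
           MAX(f.sale_amount) AS largest_sale
      FROM SALES_FACT    f
      JOIN PRODUCT_DIM   d ON d.product_id = f.product_id
      JOIN GEOGRAPHY_DIM g ON g.geo_id     = f.geo_id
     WHERE f.sale_date BETWEEN '2021-01-01' AND '2021-12-31'
     GROUP BY d.product_type, g.sales_region;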
For the future, the DBA can expect a lot of performance tuning based on massive data access. This leads to strategies such as data partitioning, key clustering, and the like. Luckily, examples of the first queries should already be in the DBA’s hands from the testing phase mentioned earlier.
Simple users. This set of users is the second to arrive. They already know SQL and have a passing familiarity with enterprise data, and will begin by coding their own queries against the big data application. Looking ahead, these users will be making many SQL coding mistakes, with potential performance consequences. Implement SQL reviews for them and consider giving them some training in basic performance tools such as the DB2 Explain facility.
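One way such a review might work: before a query is allowed to run regularly, the DBA (or the user) populates the DB2 explain tables and checks the chosen access path. The sketch below uses DB2 for z/OS-style syntax and a hypothetical query; details differ on other DB2 platforms:

    -- Capture the access path for a candidate query
    EXPLAIN PLAN SET QUERYNO = 100 FOR
    SELECT customer_no, SUM(order_total)
      FROM ORDERS
     WHERE order_date >= '2021-01-01'
     GROUP BY customer_no;

    -- Review the result: was an index used, and how many columns matched?
    SELECT queryno, tname, accesstype, accessname, matchcols
      FROM PLAN_TABLE
     WHERE queryno = 100;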
DataMart users. These users are already accessing relational data extracted from the data warehouse or operational systems and stored in local data marts on local platforms. They will eventually see the need to take their specially constructed data and Join it to the big data application. This kind of multi-site access is fraught with performance problems as some queries may cause huge volumes of data to be transferred across the network before it can be determined if the data is really needed.
Here, the DBA must meet with these users as soon as possible to determine what options will work for them. One potential solution is to have multiple copies of certain data available in two or more physical locations. For example, data warehouse dimension tables can be copied into the big data application, preliminary data joins executed, and the results copied to the data mart location. The data mart users can then work with the pre-joined tables locally without forcing massive network data movement.
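Here is a hedged sketch of that approach, with hypothetical schema and table names: copy the needed warehouse dimension into the big data environment, build the pre-joined summary where the large fact data lives, and then ship only the much smaller result to the data mart:

    -- 1. Copy the warehouse dimension table into the big data environment
    --    (the actual data movement mechanism depends on the platforms involved).
    INSERT INTO BIGDATA.PRODUCT_DIM
    SELECT product_id, product_type, product_name
      FROM WAREHOUSE.PRODUCT_DIM;

    -- 2. Execute the preliminary Join and aggregation next to the fact data.
    CREATE TABLE BIGDATA.SALES_BY_PRODUCT AS
        (SELECT d.product_type,
                f.sale_date,
                SUM(f.sale_amount) AS total_sales
           FROM BIGDATA.SALES_FACT  f
           JOIN BIGDATA.PRODUCT_DIM d ON d.product_id = f.product_id
          GROUP BY d.product_type, f.sale_date)
        WITH DATA;

    -- 3. Move only the small, pre-joined result to the data mart.
    INSERT INTO DATAMART.SALES_BY_PRODUCT
    SELECT product_type, sale_date, total_sales
      FROM BIGDATA.SALES_BY_PRODUCT;

The data mart users then query the pre-joined table locally, and the large fact data never crosses the network.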