-
How to Monitor MySQL with Percona’s Nagios Plugins
In this post, I’ll cover the new MySQL monitoring plugins we created for Nagios, and explain their features and intended purpose.
I want to add a little context. What problem were we trying to solve with these plugins? Why yet another set of MySQL monitoring plugins?
The typical problem with Nagios monitoring (and indeed with most other monitoring systems) is the alert spam. I’ve spoken to a lot of users with monitoring systems over the years, and everyone says that they get a lot of false alarms, and that they have email filters set up for alerts. Recently I spoke to one person who gets about thirty thousand alerts per week (!!!) in his email. I’m sure you can see that this makes the monitoring system much less effective. The usual cause of such noisy monitoring is when you alert on things that don’t really indicate problems.
To avoid this problem as much as possible, I suggest the following:
Alert only on things that reliably indicate a real problem. Do not alert on a threshold that merely indicates something is abnormal; abnormal things happen constantly. It’s useful to spend time investigating abnormalities, but in general, you should try to make sure that doesn’t creep into your fault-detection alerts.
Alert only on things that are actionable. If you get an alert and you can’t do anything about it, that’s a certain recipe for a /dev/null email filter.
Avoid ratios and thresholds as much as possible, because they are rarely actionable or reliable indicators of problems, and it’s difficult or impossible to find the right threshold for all situations. This suggestion is a consequence of the previous two, but it’s worth calling out separately in my opinion. I’ll highlight a couple of threshold-based alerts later in this blog post, and explain the need to be careful with them.
Although we’re not finished with the plugins, we have a usable starting point, and we’ll build on it as time passes. The plugins we created are based on our analysis of several hundred customer downtime incidents, and their causes and preventions. You can read more about that research in our white papers, here and here.
If you read those white papers, you’ll see a pretty extensive list of suggestions for what to monitor in MySQL. Those are only suggestions; any given deployment might want to monitor only some of them, as appropriate, and might need to include others as well. Here’s how you can use our new monitoring plugins to check for some of the conditions that caused downtime for our customers:
Failed or full LVM snapshots. LVM backup strategies can fail to delete snapshots after finishing, sometimes due to the backup script failing before completion. Additionally, LVM snapshots that don’t have enough copy-on-write space can fail before the backup completes. The pmp-check-lvm-snapshots plugin will alert when a snapshot is failed or overly full.
Too many deadlocks. Deadlocks (and lock wait timeouts, covered in another plugin) are frequent causes of serious errors in applications because the applications often don’t handle deadlocks correctly, and in many cases also don’t log the resulting errors. The best place to fix these problems is in the application, but the DBA really needs to be monitoring it as well, because an application error can creep in easily and be deployed, and you will not find it anywhere but in production. The pmp-check-mysql-deadlocks plugin can help you detect such problems, in conjunction with Percona Toolkit’s pt-deadlock-logger tool.
Open but deleted filehandles. It’s quite easy for a bad logrotate script (such as the one that ships by default with many server installations) or a system administration mistake to delete one of MySQL’s log or data files from disk. The database server will continue to function, however, because the file isn’t really deleted. The MySQL server keeps an open filehandle to the file, and it won’t be deleted until the server restarts, so the problem remains undetected while it becomes more and more severe. This can cause two serious types of problems: the disk can fill up with no apparent cause or solution, or you can restart the server and lose your data permanently. The pmp-check-mysql-deleted-files plugin will alert when files have been deleted but remain open via a filehandle.
Wrong filesystem privileges. It’s very easy for a file to creep into MySQL’s data directory with wrong permissions, without causing problems until a later date. For example, if a database isn’t owned by MySQL, you won’t have any problems until you try to create or drop a table, or the database itself. Likewise, if a system administrator leaves a temporary file in the database directory and then you try to drop the database, it will fail. These things tend to happen at inconvenient times. The pmp-check-mysql-file-privs plugin checks whether MySQL owns every file and directory within its data directory.
InnoDB is getting stuck, has long-running transactions, or other problems. The pmp-check-mysql-innodb plugin checks for several indicators of severe internal problems within InnoDB, such as contention that crosses a tipping point and cripples the server. Sometimes such conditions are like credit card debt: easy to start, impossible to dig out afterwards. Another very common cause of serious incidents is application misbehavior, such as opening a transaction but never committing it, which creates a lot of bloat or locking in the database server. This plugin can alert on that and a couple other types of common problems, too.
Missing PID file. A variety of problems can delete MySQL’s PID file, such as a buggy init script that attempts to start the server when it’s already running, or executing the server by hand instead of through the init script. Overly zealous cron jobs to delete old files in /tmp can do the same thing. If the PID file is missing, then many init scripts will behave very badly. Sample problems include starting two instances of MySQL on the same files in the data directory, and refusing to shut down MySQL so that it is killed forcibly (i.e. crashed) instead of gracefully shutting down for a system reboot. The pmp-check-mysql-pidfile plugin verifies that MySQL’s PID file exists.
Evidence of contention in the processlist. The pmp-check-mysql-processlist plugin checks SHOW FULL PROCESSLIST for patterns that indicate similar problems to those discussed within InnoDB, but it can find evidence of problems in the server as well as the storage engine. For example, a pile-up of connections all trying to open or close tables can indicate LOCK_open contention.
Replication is delayed or stopped. Our plugins are structured to monitor as few things as possible, but as many as needed. Replication is one thing that needs to be monitored in two dimensions, which shouldn’t be combined: delay (Seconds_behind_master or the delay indicated by the pt-heartbeat tool, if you choose), and actual failure. The pmp-check-mysql-replication-delay plugin checks for excessive delay, and pmp-check-mysql-replication-running checks that replication is functioning correctly.
Replicas have different data than their master. In addition to delay or failure, replication can fail silently and data can diverge or drift on a replica. The pt-table-checksum tool from Percona Toolkit can generate table checksums, and the pmp-check-pt-table-checksum plugin can alert when those checksums indicate that the replica is not logically identical to its master.
The server is allocating too much memory. A runaway server process, or poorly optimized queries, can consume too much memory and cause the server to swap until it is unresponsive and has to be power cycled. In addition, other processes can be runaways, too, such as a poorly written script or daemon; it is not unheard of for a monitoring agent or RAID utility software to do this. The pmp-check-unix-memory plugin checks for a single process consuming too much memory and/or for too little free memory in general. This plugin requires careful configuration because it uses a threshold for which there is no good one-size-fits-all value. This plugin, perhaps more than any other, should only be used judiciously.
The server has been restarted recently. It is surprising how often a server crashes, perhaps even cycling through crashes, and is not detected until sometime later when the problem has grown much more serious. You can use the pmp-check-mysql-status plugin to alert when the server’s Uptime variable is too small.
The server is approaching max_connections. Overflowing the max_connections setting is a bit like a lack of disk space: it is possible for it to be slowly approached over time, and then to begin causing problems when connections are rejected. Similarly to deadlocks and timeouts, applications often don’t handle or log this error appropriately. The pmp-check-mysql-status plugin can be configured to alert when Threads_connected approaches max_connections too closely, which can give the system administrator time to address the situation before it becomes a problem. This check requires careful configuration, or it too can become a noise generator. However, if your application environment makes it difficult to handle max_connections problems (the use of fixed-size connection pooling, for example, which can sometimes require application downtime to resolve), then it might be worth considering to let you deal with the problem when it’s convenient instead of when it’s urgent.
In addition to the checks in that list, there are several others that can be performed with existing plugins, so there was no need to write new ones. For example, the plugins that ship with Nagios are suitable for checking that the server is alive and responds to a query.
The pmp-check-mysql-status is also very flexible and can be used to trigger an alert on almost anything from SHOW VARIABLES and SHOW STATUS that you want. If you want to monitor ratios and rates, for example, you can certainly do so; it also supports an incremental mode of operation, so you can (for example) get the incremental increase in a variable such as Queries and alert on a queries-per-second threshold. (I don’t suggest doing this; I’m only listing an example of what you can do. The documentation has several other examples.)
In future releases we will include more plugins to monitor more types of problems, but the current plugins already cover some of the most common and severe conditions that we’ve observed to cause actual production downtime. Please download and try out the plugins, and help us improve them. If you are a customer and you need help configuring any of the plugins, we are happy to assist. In addition, please ask our sales staff to include a free monitoring setup in a new support contract. I believe they will be happy to negotiate this with you.
-
Introduction to Tungsten Enterprise 1.5 – MySQL high availability from a local cluster to multi-site clusters in the cloud
Webcast: Thursday 3/1/12 @ 10 am PT/ 1 pm ETWebcast: Tuesday 3/6/12 @ 15:00 CET/14:00 GMTLearn how the next version of Tungsten Enterprise will help you push the envelope in multi-master, multi-site MySQL database clusters. We will be introducing the following features:Setting up MySQL clusters with a single installation commandAutomatic master failover without data lossBuilding disaster recovery
-
Tmpfs with IO_Direct
bad idea..
tmpdir..
..on tmpfs
upgrade..
..to 5.5
default engine..
..innodb
create..
..temp tables statements
optimizing my.cnf...
..io_direct sounds good..
tmpfs..
..and io_direct
do not..
..play well together
..lucky...
..just slave..
..had backup
-
MySQL monitoring: storing, not caching
I've followed with interest on Baron's Why don’t our new Nagios plugins use caching? and Sheeri's Caching for Monitoring: Timing is Everything. I wish to present my take on this, from mycheckpoint's point of view.
So mycheckpoint works in a completely different way. On one hand, it doesn't bother with caching. On the other hand, it doesn't bother with re-reads of data.
There are no staleness issues, the data is consistent as it can get (you can never get a completely atomic read of everything in MySQL), and you can issue as many calculations as you want at the price of one take of monitoring. As in Sheere's example, you can run Threads_connected/max_connections*100, mix status variables, system variables, meta-variables (e.g. Seconds_behind_master), user-created variables (e.g. number of purchases in your online shop) etc.
mycheckpoint's concept is to store data. And store it in relational format. That is, INSERT it to a table.
A sample-run generates a row, which lists all status, server, OS, user, meta variables. It's a huge row, with hundreds of columns. Columns like threads_connected, max_connections, innodb_buffer_pool_size, seconds_behind_master, etc.
mycheckpoint hardly cares about these columns. It identifies them dynamically. Have you just upgraded to MySQL 5.5? Oh, there's a new bunch of server and status variables? No problem, mycheckpoint will notice it doesn't have the matching columns and will add them via ALTER TABLE. There you go, now we have a place to store them.
Running a formula like Threads_connected/max_connections*100 is as easy as issuing the following query:
SELECT Threads_connected/max_connections*100 FROM status_variables WHERE id = ...
Hmmm. This means I can run this formula on the most recent row I've just added. But wait, this also means I can run this formula on any row I've ever gathered.
With mycheckpoint you can generate graphs retroactively using new formulas. The data is there, vanilla style. Any formula which can be calculated via SQL is good to go with. Plus, you get the benefit of cross referencing in fun ways: cross reference to the timestamp at which the sample was taken (so, for example, ignore the spikes generated at this and that timeframe due to maintenance. Don't alert me on these), to system issues like load average or CPU usage (show me the average Seconds_behind_master when load average is over 8, or the average load average when slow query rate is over some threshold. You don't do that all the time, but when you need it, well, you can get all the insight you ever wanted.
Actually storing the monitored data in an easy to access format allows one to query, re-query, re-formulate. No worries about caching, you only sample once.
For completeness, all the above is relevant when the data is of numeric types. Other types are far more complicated to manage (the list of running queries is a common example).
-
Auto-generated date dimension tables
It seems that whenever I have a cross-continent flight, Mondriangets a new feature. This particular flight was from Florida back hometo California, and this particular feature is a time-dimensiongenerator.I was on the way home from an all-hands at Pentaho's Orlando,Florida headquarters, where new CEO Quentin Gallivan had outlined hisstrategy for the company. I also got to spend time with the many smartfolks from all over the world who work for Pentaho, among them Roland Bouman, formerly an evangelist for MySQL, now with Pentaho, but stillpassionately advocating for open source databases, open sourcebusiness intelligence, and above all, keeping it simple.Roland and I got talking about how to map Mondrian onto operationalschemas. Though not designed as star schemas, some operational schemasnevertheless have a structure that can support a cube, with a centralfact table surrounded by star or snowflake dimension tables. Often theone thing missing is a time dimension table. Since these timedimension tables look very much the same, how easy would it be forMondrian to generate them on the fly? Not that difficult, I thought,as the captain turned off the "fasten seatbelts" sign and I opened mylaptop. Here's what I came up with.Here's how you declare a regular time dimension table in Mondrian4:<PhysicalSchema> <Table name='time_by_day'/> <!-- Other tables... --></PhysicalSchema>Mondrian sees the table name 'time_by_day', checks that it exists, and finds the column definitions from the JDBC catalog. The table can then be used in various dimensions in the schema.An auto-generated time dimension is similar:<PhysicalSchema> <AutoGeneratedDateTable name='time_by_day_generated' startDate='2012-01-01' endDate='2014-01-31'/> <!-- Other tables... --></PhysicalSchema>The first time Mondrian reads the schema, it notices that the table is not present in the schema, and creates and populates it. Here is the DDL and data it produces.CREATE TABLE `time_by_day_generated` ( `time_id` Integer NOT NULL PRIMARY KEY, `yymmdd` Integer NOT NULL, `yyyymmdd` Integer NOT NULL, `the_date` Date NOT NULL, `the_day` VARCHAR(20) NOT NULL, `the_month` VARCHAR(20) NOT NULL, `the_year` Integer NOT NULL, `day_of_month` VARCHAR(20) NOT NULL, `week_of_year` Integer NOT NULL, `month_of_year` Integer NOT NULL, `quarter` VARCHAR(20) NOT NULL)JULIANYYMMDDYYYYMMDDDATEDAY_OF_WEEK_NAMEMONTH_NAMEYEARDAY_OF_MONTHWEEK_OF_YEARMONTHQUARTER2455928120101201201012012-01-01SundayJanuary2012111Q12455929120102201201022012-01-02MondayJanuary2012211Q12455930120103201201032012-01-03TuesdayJanuary2012311Q1The columns present are all of the time-dimension domains:DomainDefault column nameDefault data typeExampleDescriptionJULIAN time_id Integer 2454115Julian day number (0 = January 1, 4713 BC). Additional attribute 'epoch', if specified, changes the date at which the value is zero.YYMMDD yymmdd Integer 120219Decimal date with two-digit yearYYYYMMDD yyyymmdd Integer 20120219Decimal date with four-digit yearDATE the_date Date 2012-12-31Date literalDAY_OF_WEEK_NAMEthe_day String FridayName of day of weekMONTH_NAME the_month String DecemberName of monthYEAR the_year Integer 2012YearDAY_OF_MONTH day_of_monthString 31Day ordinal within monthWEEK_OF_YEAR week_of_yearInteger 53Week ordinal within yearMONTH month_of_yearInteger 12Month ordinal within yearQUARTER quarter String Q4Name of quarterSuppose you wish to choose specific column names, or have morecontrol over how values are generated. You can do that by including a<ColumnDefs> element within the table, and <ColumnDef>elements within that — just like a regular <Table>element.For example,<PhysicalSchema> <AutoGeneratedDateTable name='time_by_day_generated' startDate='2008-01-01 endDate='2020-01-31'> <ColumnDefs> <ColumnDef name='time_id'> <TimeDomain role='JULIAN' epoch='1996-01-01'/> </ColumnDef> <ColumnDef name='my_year'> <TimeDomain role='year'/> </ColumnDef> <ColumnDef name='my_month'> <TimeDomain role='MONTH'/> </ColumnDef> <ColumnDef name='quarter'/> <ColumnDef name='month_of_year'/> <ColumnDef name='week_of_year'/> <ColumnDef name='day_of_month'/> <ColumnDef name='the_month'/> <ColumnDef name='the_date'/> </ColumnDefs> <Key> <Column name='time_id/> </Key> </AutoGeneratedDateTable> <!-- Other tables... --></PhysicalSchema>The first three columns have nested <TimeDomain>elements that tell the generator how to populate them.The other columns have the standard column name for a particular timedomain, and therefore the <TimeDomain> element can be omitted. Forinstance,<ColumnDef name='month_of_year'/>is shorthand for<ColumnDef name='month_of_year' type='int'> <TimeDomain role="month"/></ColumnDef>The nested <Key> element makes that column valid as thetarget of a link (from a foreign key in the fact table, for instance),and also declares the column as a primary key in the CREATE TABLEstatement. This has the pleasant side-effect, on all databases I knowof, of creating an index. If you need other indexes on the generatedtable, create them manually.The <TimeDomain> element could be extended further. For instance, wecould add a locale attribute. This would allow different translationsof month and weekday names, and also support locale-specificdifferences in how week-in-day and day-of-week numbers arecalculated.Note that this functionality is checked into the mondrian-lagunitasbranch, so will only be available as part of Mondrian version 4. Thatrelease is still pre-alpha. We recently started to regularly build thebranch using Jenkins, and you should seethe number of failing tests dropping steadily over the next weeks andmonths. Already over 80% of tests pass, so it's worth downloading thelatest build to kick the tires on your application.
|