When running replication, two new files appear in your data directory: master.info and relay-log.info. MySQL uses these files to save information about your replication state, and they’re used when MySQL starts up, if they’re available.
Caution MySQL considers the information contained in the master.info and relay-log.info filebefore looking for settings in the configuration files. This means that changes in your my.cnf file may beignored, if the information is stored in the info files.
Both of these files provide information about the configuration and status of your replica-tion, but shouldn’t be used as configuration files to make changes. The primary purpose forthese files is to keep state information between database restarts, and for database backups.The slave threads control both of these files, and you should only use them for information.You can make changes to these files with the CHANGE MASTER command, which is discussed inmore detail in the section “CHANGE MASTER.”
Now that you’re familiar with the binary log, and we’ve hinted at how the binary log might beused in replication, let’s look at the steps that happen when data is replicated to another server:
- INSERT, UPDATE, DELETE or some other data-changing query is issued to the masterdatabase.
- The query is parsed, executed, and written to the binary log on the master.
- The I/O thread on the slave asks for new queries from the I/O thread on the master,and pulls anything new from the binary log on the master into a log on the slave calledthe relay log.
- A processing thread on the slave reads the relay log and executes the query.
The simple explanation is that for any data change on the master, an entry is made intothe binary log. That statement is copied to the slave and executed there as well, making theexact change to the slave as was made on the master and thus keeping the data in sync.
We hinted at different threads in these four steps. Running a replication slave requiresthree threads, in addition to those already used to keep your database’s non-replication-related features running. Two threads run on the slave, the first to communicate with themaster for entries in the master’s binary log and to pull the statements onto the slavemachine. The second slave thread reads the queries in the relay log (which were pulled fromthe master) and processes them. The third thread runs on the master and is responsible forcommunicating changes in the master’s binary log. If you have multiple slaves pointing at asingle master, the master will run a separate thread to communicate with each of the slaves.
Some of our specific challenges of the denormalization process were:
1. Dozens of legacy data formats that evolved over years. Peter Ondruška, a Facebook summer intern, defined a custom language to concisely express our data format conversion rules and wrote a compiler to turn this into runnable PHP. Three “data archeologists” wrote the conversion rules.
2. Non-recent activity data had been moved to slow network storage. We hacked a read-only build of MySQL and deployed hundreds of servers to exert maximum IO pressure and copy this data out in weeks instead of months.
3. Massive join queries that did tons of random IO. We consolidated join tables into a tier of flash-only databases. Traditionally PHP can perform database queries on only one server at a time, so we wrote a parallelizing query proxy that allowed us to query the entire join tier in parallel.
4. Future-proofing the data schema. We adopted a data model that’s compatible with Multifeed. It’s more flexible and provides more semantic context around data with the added benefit of allowing more code reuse.
Despite the fact that it has become a laughingstock and cautionary tale among the digital elite, Yahoo’s properties are still visited by some 700 million normal people a month.
700 million!
This humongous audience is attracted by several things:
- Yahoo Mail (still one of the leading mail platforms worldwide)
- Yahoo content (Finance, Sports, Entertainment, videos, etc.)
- Yahoo’s brand
- Yahoo’s “products”
And here’s something else Yahoo has.
Revenue and cash flow.
Even in its current deflated state, Yahoo still generates $1.6 billion of revenue per quarter ($6 billion per year) and $250 million of free cash flow ($1 billion a year).
Replication can allow you to take advantage of multiple storage engines for a single table ordatabase. What does that mean? With replication it’s possible to use one table type on themaster and another table type on the slave. Perhaps you want to have foreign keys, which areonly allowed using the InnoDB and BDB table types, but you also want to be able to use thefull-text indexing feature of the MyISAM table type. Because replication simply executesqueries from one server on another server, it’s possible to have the master database useInnoDB tables, which provides referential integrity. You can alter the tables replicated to the slave to be MyISAM, including the definition of full-text indexes. If you wanted to run queriesagainst the full-text index, you would send those queries to the slave with the MyISAM tablesand full-text index. Presumably, you’d use the InnoDB features on the master to enforce dataintegrity, but get the advantages of the MyISAM performance, and so on.
We’ve hinted at it with the full-text indexes in our multiple-storage-engines example, butit’s worthy to note that a slave database can have a different set of indexes than the master.This can be helpful if you have fundamentally different methods for accessing the data thatrequire multiple indexes on a single table. Spreading those indexes across two different data-bases and sending the queries to the appropriate machine can mean reduced index sizes andimproved performance.