Database access

From Toolserver wiki
Jump to: navigation, search
Getting started

One of the key features of a Toolserver account is access to a replicated copy of the Wikimedia databases. This page explains how to access them.

Contents

[edit] Overview

There are seven clusters of databases, defined by the Wikimedia Foundation and numbered from s1 to s7. The Toolserver replicates the clusters as they are. At the Toolserver, the commons database is available at all clusters, though Commons is properly a part of s4.

There are two different types of server for each cluster:

The RR server 
This is the server you would usually use. It will connect you to a randomly selected database server which contains the requested cluster.
The user server 
This is the server you should connect to if you need user database access (explained below)

By default, your database credentials are stored in $HOME/.my.cnf. The code examples below show you how to read that file in, rather than putting passwords in your script directly, which is a security risk.

[edit] Database names

Each Wikimedia project has a specific database name. The general pattern is langcodeproject name_p. The full list is available here (though without the _p suffix).

[edit] Replication lag

The servers, for the most part, are real-time copies of the live databases, but at times there may be replication lag (replag). Replication lag causes updates to the copy of the database on the Toolserver to lag behind the master server, so the database represents the state of the master at some point in the past.

[edit] Connecting to a database

There are two methods of connecting to a database: by cluster, and by database name. Many tools still use the cluster method, since originally this was the only available method. However, the database name method is generally easier.

[edit] By database name

To connect to a database by name, use the hostname XX.rrdb.toolserver.org (for the RR server) or XX.userdb.toolserver.org (for the user server), where XX is the database name. (Example: enwiki-p.rrdb.toolserver.org.)

Please notice that XX.userdb.toolserver.org is used only for user databases that are stored in servers that also contain a copy of the Wikimedia databases stored on XX.rrdb.toolserver.org. User databases that do not require that access are typically stored not here (see #sql-user).

[edit] By cluster

To connect to a database by cluster, you first need to determine which cluster the database is located on. Toolserver database#wiki explains how to do that; in short, connect to the toolserver database on the sql-toolserver server, and use a query like:
SELECT server FROM toolserver.wiki WHERE dbname='enwiki_p';
This will return an integer indicating the cluster, e.g. "1" for enwiki_p.

Then, depending on whether you want the RR or user server, connect to the hostname sql-sX-rr.toolserver.org or sql-sX-user.toolserver.org, where X is the cluster number. (Example: sql-s1-rr.toolserver.org.)

[edit] Notes

You may sometimes see tools or examples which use the old-style aliases sql-sX.toolserver.org or XX.db.toolserver.org. These aliases are obsolete, and should not be used anymore.

[edit] sql-user

Beside the database-servers with wmf-databases the toolserver also has a cluster for user-databases that need no joins with the wmf-databases. The cluster is reachable by sql-user-x.toolserver.org (where x is a letter). x should be the first letter of your user-name for normal (so if your username is wiki it would be sql-user-w.toolserver.org). A obsolete way to reach the user-databases was sql.toolserver.org.

[edit] Command-line access

To connect to a specific database (for example, enwiki_p) from the Unix command line, run: sql enwiki_p. This will automatically look up the correct server and connect to it.

[edit] Batch access

If you plan to run long queries on the database, e.g. for scheduled report jobs, we recommend using batch job scheduling to access the database. This will ensure that the database doesn't become overloaded.

[edit] Program access

See also: Iterating over wikis

In most programming languages, it will be sufficient to tell MySQL to use the database credentials found in $HOME/.my.cnf

Below are various examples in a few common programming languages.

[edit] Bash

-- 2> /dev/null; date; echo '
/* Bash/SQL compatible test structure
 *
 * Run time: ? <SLOW_OK>
 */
SELECT 1
;-- ' | mysql -ch sql-s1-rr enwiki_p > ~/query_results-enwiki; date;

[edit] C

#include <my_global.h>
#include <mysql.h>
 
...
 
 char *host = "sql-s1-user";
 MYSQL *conn = mysql_init(NULL);
 
 mysql_options(conn, MYSQL_READ_DEFAULT_GROUP, "client");
 if (mysql_real_connect(conn, host, NULL, NULL, NULL, 0, NULL, 0) == NULL) {
    printf("Error %u: %s\n", mysql_errno(conn), mysql_error(conn));
    ...
 }

[edit] Perl

use User::pwent;
use DBI;
 
my $database = "enwiki_p";
my $host = "enwiki-p.rrdb.toolserver.org";
 
my $dbh = DBI->connect(
    "DBI:mysql:database=$database;host=$host;"
    . "mysql_read_default_file=" . getpwuid($<)->dir . "/.my.cnf",
    undef, undef) or die "Error: $DBI::err, $DBI::errstr";

[edit] Python

See also: Django

Using oursql (uses less memory):

import os, oursql
db = oursql.connect(db='enwiki_p',
        host="enwiki-p.rrdb.toolserver.org",
        read_default_file=os.path.expanduser("~/.my.cnf"),
        charset=None,
        use_unicode=False
)

Extra flags are required for oursql to force binary mode since VARCHAR fields on sql-s2 are mislabeled with latin-1. Manual decoding is required even after upgrading since the fields will be VARBINARY instead. Note: oursql is only installed on solaris, see jira:TS-760 and jira:TS-1452 for more information.

[edit] PHP

<?php
$ts_pw = posix_getpwuid(posix_getuid());
$ts_mycnf = parse_ini_file($ts_pw['dir'] . "/.my.cnf");
$db = mysql_connect('enwiki-p.rrdb.toolserver.org', $ts_mycnf['user'], $ts_mycnf['password']);
unset($ts_mycnf, $ts_pw);
 
mysql_select_db('enwiki_p', $db);
 
// YOUR REQUEST HERE
 
?>

[edit] PHP (using PDO)

<?php
$ts_pw = posix_getpwuid(posix_getuid());
$ts_mycnf = parse_ini_file($ts_pw['dir'] . "/.my.cnf");
$db = new PDO("mysql:host=dewiki-p.rrdb.toolserver.org;dbname=dewiki_p", $ts_mycnf['user'], $ts_mycnf['password']);
unset($ts_mycnf, $ts_pw);
 
$q = $db->prepare('select * from page where page_id = :id');
$q->execute(array(':id' => 843020));
print_r($q->fetchAll());
?>

[edit] Java

Class.forName("com.mysql.jdbc.Driver").newInstance();
Properties mycnf = new Properties();
mycnf.load(new FileInputStream(System.getProperty("user.home")+"/.my.cnf"));
String password = mycnf.getProperty("password");
password=password.substring((password.startsWith("\""))?1:0, password.length()-((password.startsWith("\""))?1:0));
mycnf.put("password", password);
mycnf.put("useOldUTF8Behavior", "true");
mycnf.put("useUnicode", "true");
mycnf.put("characterEncoding", "UTF-8");
mycnf.put("connectionCollation", "utf8_general_ci");
String url = "jdbc:mysql://enwiki-p.rrdb.toolserver.org:3306/enwiki_p";
Connection conn = DriverManager.getConnection(url, mycnf);

[edit] Slow queries and the query killer

Query kill times
Replag Normal SLOW_OK
<30s 3600s
<120s 600s
<600s 420s 2760s
<1800s 300s 1800s
>1800s 180s 600s
SELECT /* SLOW_OK LIMIT:600 */ * FROM table...

To prevent replication lag caused by long queries which put load on the database, the Toolserver uses an automated query killer to kill queries. Query-Killer only runs when replication lag (replag) is noticeable and kills queries based on how long they have been running.

There are two kinds of queries: Normal- and SLOW_OK-query. If you know that you query will run for some time, you can add SLOW_OK (see below for an example) to your query and Query-Killer will be more patient with your query; please add a LIMIT: (see next paragraph) too.

If and when Query-Killer kills a query depends on the actual database replication lag. See the following table.

SLOW_OK-example:

SELECT /* SLOW_OK */ * FROM table...

Non-SELECT-queries with SLOW_OK will not be killed until the replag reaches 1800s.

Sometimes, you might want to artificially restrict the runtime of a query. You can do this using the LIMIT:<n> directive in the query:

SELECT /* LIMIT:60 */ * FROM table...

This will kill the query automatically if it runs for longer than 60 seconds.

This is useful for web tools, where you want to restrict the runtime of a query and present an error to the user if it runs too long. (Rather than allowing the connection to time out, or the user to get bored and go away.) It can also be used to limit execution time of a query which is usually fast, but sometimes runs for longer than desired.

There is also the possibility to get no e-mails if a LIMIT: query is killed. The syntax would be

SELECT /* LIMIT:60 NM*/ * FROM table...
.

N.B. If connecting using the command line 'mysql' tool, comments are not passed to the server by default, and so will not be visible to the query killer. Use the '-c' flag when connecting to suppress this behaviour.

[edit] phpMyAdmin

phpMyAdmin is a browser-based GUI interface for MySQL. To access phpMyAdmin on the Toolserver, visit https://phpmyadmin.toolserver.org and use your Toolserver username and your LDAP password for the first dialog. When you see the phpMyAdmin login page use your username and your MySQL password (which can be found in the .my.cnf file in your home directory).

[edit] Database schema

The current MediaWiki database schema is available in the MediaWiki Git repository at maintenance/tables.sql. A (possibly out of date) description of the schema is available at mw:Manual:Database layout.

[edit] Page text access

Page content itself is not available in the database. If you need to access it, use the MediaWiki API. WikiProxy may be useful as well.

[edit] General caveats

FIXME

[edit] Toolserver database

The toolserver (meta-)database is a database called "toolserver", containing information about other databases on the toolserver. It can be read by anyone and is available on all database servers. The host name of the primary copy is sql-toolserver.

The toolserver database is updated automatically on a daily basis. It is maintained by User:Duesentrieb, but please report any problems to JIRA.

Tables in the toolserver database:

[edit] wiki

Information about the wikis for which the Toolserver has data.

dbname the name of the wiki's database
lang the wiki's content language code. See also the is_multilang field.
family the wiki's family name (such as "wikipedia", "wiktionary", "wikibooks", etc)
domain the wiki's domain name. May be null if the wiki is closed.
size the wiki's size, as indicated by the site_stats table. This is pretty meaningless as an absolute number, but useful for sorting wikis by size.
is_meta 1 if the wiki is not a primary content wiki, but rather intended for community management and organizational tasks.
is_closed 1 if the wiki is closed (or private)
is_multilang 1 if the wiki has multilingual content. Overrides the "lang" field.
is_sensitive 1 if the wiki uses fully case-sensitive page titles, as a wiktionary typically does (see mw:Manual:$wgCapitalLinks)
root_category the wiki's root category. this is currently unreliable and mostly unused. may be null.
server a numeric value indicating what server/cluster the wiki resides on. If this value is e.g. 2, this means the wiki is on the "S2" cluster, so you will find the database on sql-s2.toolserver.org.
script_path the wiki's script path, typically (but not always) /w/. This is useful for building URLs to wiki pages or to the wiki's api.php.

So, for example, to get SQL server, database name and the API URLs of the 10 largest Wikipedias, you could do this:

 SELECT 
   lang,
   CONCAT("sql-s", server) as dbserver,
   dbname,
   CONCAT("http://", domain, script_path, "api.php") as url
 FROM toolserver.wiki
 WHERE family = "wikipedia"
 ORDER BY size DESC LIMIT 10;

[edit] namespace (deprecated)

This table contains the primary namespace names for the wikis in the wiki table. Use of this table is deprecated, use the namespacename table instead, which covers all valid names for each namespace in each wiki, not just the primary names.

dbname the name of the wiki's database
domain the wiki's domain name. May be null if the wiki is closed.
ns_id the namespace's numeric ID, as used in the wiki database.
ns_name the namespace's primary, localized name for that wiki. Useful for building links.

[edit] namespacename

This table contains the all valid namespace names for the wikis in the wiki table. Use of this table is preferred over the namespace table, since it covers all valid names for each namespace in each wiki, not just the primary names.

dbname the name of the wiki's database
domain the wiki's domain name. May be null if the wiki is closed.
ns_id the namespace's numeric ID, as used in the wiki database.
ns_name the namespace's primary, localized name for that wiki. Useful for building links.
ns_type the namespace entry's type: "primary", "canonical" or "alias"
ns_is_favorite A namespace may appear multiple times in the table (e.g. primary, canonical and alias etc.). One may think that there is always one set to primary, but that isn't the case. The one that the wiki uses in practice, is usually localized, and will be redirected to from others, that main one can be found by looking for ns_is_favorite=1. For each namespace of each wiki, there will only be one ns_is_favorite=1 .

This can be used to build wiki-syntax links from information from the database, e.g.:

  SELECT CONCAT("[[", ns_name, ":", pl_title, "]]") AS link
  FROM dewiki_p.pagelinks
  JOIN toolserver.namespacename ON dbname = "dewiki_p" AND ns_is_favorite = 1 AND ns_id = pl_namespace
  WHERE pl_from = 145864;

[edit] language

This table contains language names, by code, in English and the respective language.

lang the language code, as used in the wiki table
english_name the language's name in English
native_name the language's name in that language

For example, to get the languages of the ten largest Wikipedias:

 SELECT 
   domain,
   english_name,
   native_name
 FROM toolserver.wiki as W
 JOIN toolserver.language as L ON L.lang = W.lang
 WHERE family = "wikipedia"
 ORDER BY size DESC LIMIT 10;

[edit] User databases

By default, once a user is created an empty u_username database is created as well. Users can create additional databases by following the u_username_ pattern. For example, if user dab wanted to create a database named images, he would name it u_dab_images.

You can allow other users access to your databases by adding the _p suffix to the end of the database names. For example, if you wanted to share the previously created image database, you could rename it: u_dab_images_p.

If you don't need a backup of one of your databases, please include the string _transient in the name (for example: u_dab_transient_foobar).

User databases are usually stored on the sql server, and this is where you should connect to use them. However, sometimes it is helpful to create a user database on the same server as one of the replicated databases, e.g. to run joins. To do this, connect to the user server instead of the RR server (described above), which will allow you to create user databases.

To estimate the size of a user database, use the query:

SELECT sum( DATA_LENGTH ) + sum( INDEX_LENGTH ) as 'dbsize_in_bytes'
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = database();

[edit] Database status files

There are status-files about each cluster (and the user databases) at /var/www/status_sclusternumber or http://toolserver.org/status_sclusternumber (e.g., http://toolserver.org/status_s1). In both cases, clusternumber can be an integer from 1 to 6 inclusive.

These files are simple ASCII text files. Lines beginning with "#" are comment lines and can safely be ignored. A non-comment-line "OK;" indicates that everything is alright and no known maintenance or service interruption is scheduled.

[edit] LOAD DATA LOCAL

By default, MySQL does not allow the LOAD DATA LOCAL command to be used for security reasons. If you want to allow this, you can edit your $HOME/.my.cnf file to add the local-infile option:

[client]
...
[mysql]
local-infile

For more information, see the MySQL manual.

In oursql/Python it can be set for an individual connection: oursql.connection([...], local_infile=True), see Python/MySQL bulk load.

[edit] External access

Creating a tunnel in Windows/PuTTY: In configuration, Connection > SSH > Tunnels. Source port: 3306, Destination: enwiki-p.userdb.toolserver.org:3306, check Local, click Add. Creating a tunnel in Unix/Linux:

ssh -L 3307:enwiki-p.userdb.toolserver.org:3306 username@nightshade.toolserver.org
Excel

Follow the steps above to for creating a tunnel, then follow Connect Excel to MySQL Database.

[edit] Managing

rmytop -h sql-s1-rr  -h sql-s1-rr -h sql-s2-rr -h sql-s2-rr -h sql-s3-rr -h sql-s3-rr
Administration
Personal tools