Code snippets

Calculate replag
NB: This actually shows how long ago the last edit to the wiki was. If the wiki hasn't been edited for a while, it might appear the replag is higher than it really is, because the last edit was some time ago.

Select a server
See Queries for a description of the "sql" script. To switch databases such commonwiki_p to enwiki_p,

Copying user database
You can dump your user database from sql and import it to sql-s1, 2 and 3 using: mysqldump -h sql u_username | mysql -h sql-s1 u_username

You can also produce a database dump during the process (can be useful to restore the database in case of problems): mysqldump -h sql u_username | tee u_username.sql | mysql -h sql-s1 u_username

Copy a table within a database
You can copy the structure of table "foo" to a new table "bar" with the following: CREATE TABLE bar LIKE foo; You can then copy the data with the following: INSERT INTO bar SELECT * FROM foo;

This produces a copy of the table foo.

Fix UTF-8 encoded as latin-1
Versions of MySQL before 5 did not have binary nor UTF-8 support (or at least nobody cared). A common workaround was to encode the text as a UTF-8 byte sequence and give that to MySQL which read the byte in as if they were latin-1 characters. This worked alright until MySQL and VARCHAR became more Unicode aware. Now depending on settings, it may convert the latin-1 bytes to Unicode code points then encode to your specified encoding. So 'è' (U+00E8, UTF-8: ) becomes 'Ã¨' (U+00C3 U+00A8, UTF-8:  ), the equivalent python transformation is. This double encoding is the reason why JOINs will fail with the newer VARBINARY fields used everywhere by MediaWiki now. Now how to convert your pseudo UTF-8 fields into actual utf8:


 * Notes
 * LIKE is case sensitive with VARBINARY fields which are standard now in MediaWiki
 * VARCHAR will treat input as Unicode code points, encoding is specified by the collation attribute. Multibyte encoding like utf8, VARCHAR(255) may be larger than 255 bytes.
 * Since one of the clusters is still running MySQL 4, it is recommended to avoid using VARCHAR in user databases
 * Information in part based on Fixing character sets in MySQL

Kill processes
To get the process ID (pid), see viewing processes. Once you have the pid, you can kill it: kill pid

If the process doesn't exit, kill it harder: kill -9 pid

But be aware that kill -9 doesn't give the process any opportunity to clean up.

If the process is running in the foreground (i.e. from the shell), you can also kill it with CTRL-C. If that doesn't work, try CTRL-\ (which like kill -9, gives the process no opportunity to clean up).

To kill all your processes at once, including any and all ssh processes: pkill -u username

Calculate size of home
du -hs ~

Find folders taking up space
du --max-depth=1 | sort -k 1 -nr | head -n 10

Sort log files
Say if you've accidently, or intentionally, concatenated a bunch of apache access logs together so they're not sorted anymore, the following sort will generally get you sorted. sort -t ' ' -k 4.9,4.12n -k 4.5,4.7M -k 4.2,4.3n -k 4.14,4.15n -k 4.17,4.18n -k 4.20,4.21n

Check uptime of the cluster
ruptime -a

Bash
These scripts and settings should be placed in your .bash_profile file. After updating the file, you need to source it in order for the changes to become active ($ source ~/.bash_profile).

The env command gives a list of variables with their results for your current environment.

Setting a default editor
The default editor on nightshade is currently nano</tt>. If you want to change your personal default editor (for example, to use joe</tt>), use: export EDITOR=joe in ~/.bash_profile

MySQL queries
If you primarily deal with running queries on one particular database, there are bash scripts that can make life easier. The "sql" script is great for selecting the appropriate server, however it requires a lot of typing for commonly needed functions. And when a query finishes, you don't know how many results were output.

The "query" script runs an .sql file on the particular database and then prints the number of lines returned from the query.

The "gquery" scripts run an .sql file on the particular database, prints the number of lines returned from the query, and then gzips the output.

Both scripts are invoked using the script name followed by the file name (without an extension), for example: query articles-ns-0

Symlinks
If you are constantly needing to make files public from a particular directory, it's possible to create a "mkpub" shortcut to symlink the files to your public directory (public_html/).

Search array
If you need to know whether something is in an array of things:

then use it:

Output to pastebin
Add this alias to : $ echo "alias pastebin=\"curl -F 'sprunge=<-' http://sprunge.us\"" >> ~/.bash_aliases Then, to easily send data to a pastebin, pipe it into that alias: $ ls | pastebin http://sprunge.us/ZJZZ

Untar a file
$ tar -zxvf yourfile.tar.gz

Simple regexp for addresses in lenient formats
This detects all IPv4 addresses made with 1 to 3 digits (including extra leading zeroes) per component:
 * ([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))</tt>

When there's a match, the four decimal components are returned in \1, \3, \5, \7 (only one substring for each). If the regexp engine supports counters and can return an array of matches, it can even be compacted into:


 * ([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))(\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))){3}</tt>

where the four components are returned in \1 (one substring), \3 (three substrings).

Note that this will unexpectedly match "000.000.000.000" or "010.01.001.1" (which are NOT usable as valid hostnames in IPv4 address format ; this means that such names will have to be resolved into some unknown address via a DNS hostname resolution query). Note that such pseudo-IP address formats are currently being used as valid user account names in Wikimedia sites, but NOT used by anonymous users just identified by their IPv4 address.

Improved regexp for addresses in shortest canonical format
This is a more restrictive version which detects only IPv4 addresses with a valid decimal format, without any extra leading zeroes in components, where the matched decimal 8-bit components are returned in \1, \3, \5, \7 (one substring for each):


 * (0|1[0-9]?[0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)\.(0|1[0-9]?[0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)\.(0|1[0-9]?[0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)\.(0|1[0-9]?[0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)</tt>

If the regexp engine supports counters and can return arrays for matches, you may also use this equivalent compact regexp where \1 matches the first decimal 8-bit component, and \3 is matched 3 times (once for each of the last three components):


 * (0|1[0-9]{0,2}|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)(\.(0|1[0-9]{0,2}|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)){3}</tt>

Quick and easy PHP boolean check
This is a faster, more compact, and easier to remember than using a regex with preg_match in PHP. It converts the given string to a valid Internet standard format string, and attempts to convert it back. If the string can be converted both ways, then it is a valid IP. If not, it is invalid.

Hexadecimal-only IPv6 address formats
The valid IPv6 address format is less restricted than the one for IPv4, because its use within URL’s is possible ONLY within delimiting [square brackets], with which such host address will never be confused with a valid DNS hostname. For this reason, extra leading zeroes are accepted (provided that each component doesn’t have more than 4 hex digits). Additionally, the letter case of hex digits is not significant.

Addresses in unabbreviated formats
Here is a simple deterministic regexp matching only unabbreviated IPv6 addresses, where the hex components are returned in \1 (one string) and \3 (seven strings):


 * ([0-9A-Fa-f]{1,4})(:([0-9A-Fa-f]{1,4})){7}</tt>

Addresses in canonical shortest unabbreviated format
Such unabbreviated address may still be canonicalized into uppercase and in its shortest form (without unnecessary leading zeroes), in which case the following regexp will only match this shortest unabbreviated form (which should be the one used for anonymous IPv6 users on MediaWiki sites where the first character of user account names is forced to uppercase), where the hex components are also returned in \1 (one string) and \3 (seven strings):


 * (0|[1-9A-F][0-9A-F]{0,3})(:(0|[1-9A-F][0-9A-F]{0,3})){7}</tt>

Addresses in abbreviated formats
However, the complexity of the standard notation is that one or more successive 16-bit hex components within the eight (separated by colons) may be dropped in the abbreviated notation (provided these components are all zero), by accepting that ONLY ONE of the specified components being empty (with such syntax, the total number of colons may be between 2 and 8, instead of just 7, and there may be between 0 and 8 hexadecimal 16-bit numbers) ; this will result in one possible occurence of double colons "::".

A deterministic regexp to match all abbreviated and unabbreviated IPv6 addresses follows (if you want to understand how it is structured, look at the wiki code source):


 * (::([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4}){0,6})?|[0-9A-Fa-f]{1,4}:(:([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4}){0,5})?|[0-9A-Fa-f]{1,4}:(:([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4}){0,4})?|[0-9A-Fa-f]{1,4}:(:([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4}){0,3})?|[0-9A-Fa-f]{1,4}:(:([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4}){0,2})?|[0-9A-Fa-f]{1,4}:(:([0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{1,4})?)?|[0-9A-Fa-f]{1,4}:(:([0-9A-Fa-f]{1,4})?|[0-9A-Fa-f]{1,4}:(:|[0-9A-Fa-f]{1,4}))))))))</tt>

For example, this regexp will accept "::", or "::0", or "0::", or "0::0", or the shortest unabbreviated format "0:0:0:0:0:0:0:0", or the full format "0000:0000:0000:0000:0000:0000:0000:0000" as they are all equivalent IPv6 addresses with a valid syntax. The strings "0:0:0:0:0:0:0:0::", or "00000000", or "::0::", or "0::0::0" will not be accepted as they are not in a valid IPv6 address format (too many colons, or missing colons with too many digits in a component, or ambiguous notation in the last two cases).

It is still preferable to canonicalize all abbreviated IPv6 addresses into the shortest unabbreviated form and with a single lettercase as shown above, notably if the address is used as the default user name (or talk page name) for contributing users not connected with their own named account, and that will be recorded in edit histories. (More information would be welcome from MediaWiki developers about which canonical format they will use as the default user name, for IPv6 users editing pages without being logged on, or for the private webserver logs.)

Dotted decimal or hexadecimal IPv4-mapped IPv6 address abbreviated formats
IPv4 addresses can also be mapped to IPv6 addresses to which they are implicitly bound on all network interfaces supporting the two routing protocols; for convenience, these can noted with the abbreviated IPv6 notation, by prefixing the "::" shorthand to the IPv4 decimal dotted notation. Although the addresses seem similar, they must be considered distinct because they may be routed differently, when datagram are sent to different routers with one of them not supporting both network adress protocols simultaneously with the same routes and similar QOS and security options.

Here also, such IPv6 address can only be used within [squared brackets] as a hostname in URLs, and due to this, the restriction on extra leading zeroes does not apply (but each decimal components must still be limited to at most 3 digits, and at least one); for example "::000.000.000.000" is a valid IPv6 address format, equivalent to the IPv6 address "::" (and it is mapped with the IPv4 "0.0.0.0" address):


 * ::(([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))(\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))){3}|([0-9A-Fa-f]{1,4}):([0-9A-Fa-f]{1,4}))</tt>

where the matches are returned either:
 * in \2 (one substring) and \4 (three substrings) for each decimal 8-bit component; or
 * in \5 (one substring) and \7 (seven substrings) for each hexadecimal 16-bit components.

Complete IPv6 address formats
The abbreviated notation for IPv4-mapped IPv6 addresses can also be used as a suffix that follows any 96-bit IPv6 prefix (other than just "::", i.e. the null prefix), and so should be combined within the IPv6 address syntax everywhere there can be at least two hexadecimal 16-bit segments (separated by a single colon) at end of the notation.

'Unit' testing
If you need to check various pieces of data against multiple tests and see what data passes each test a quick and dirty way to do that is like so

Dealing with UTF-8
One problem you might run into, especially when dealing with non-english wikis, is that page titles can contain utf8. Perl uses an internal flag to denote whether or not a string is utf8 or not. Concatenating or processing strings with mixed flags can often lead to strange side effects such as double encoding or getting mangled. Doing the following can help avoid most, if not all, problems: