Ghel

From Toolserver wiki
Jump to: navigation, search

ghel (GeoHack External Links) is a package consisting of a robust URL parser for links pointing to the GeoHack service, relational database of articles with geographic information, and an error log. It currently support over 70 language editions of Wikipedia and Wikimedia Commons. Several other programs are sometimes included, such as a "regular expression link matching" tool and an interwiki tool. The data is refreshed daily between 6:00 and 9:00 UTC.

Contents

[edit] Access and query examples

Toolserver users can access these databases by connecting to the server for their respective wikis. phpMyAdmin will helps introduce users to the layout of the tables. External users can either to use the locateCoord tool or download table dumps with the necessary WMF tables.

[edit] Querying an article

Select coordinates from Ridge Route (page_id: 809205). This featured article includes coordinates for all locations discussed.

SELECT * FROM u_dispenser_p.coord_enwiki WHERE gc_from=809205;

[edit] Pages containing location data without images

Select the first 100 pages that have geographical coordinates but lack images. Optimization and more sophisticated image analysis is left as an exercise for the reader.

SELECT page_id, gc_lat, gc_lon, gc_region, page_title
 
FROM page
 
JOIN u_dispenser_p.coord_enwiki ON gc_from = page_id
LEFT JOIN imagelinks ON il_from = page_id 
 
WHERE	page_namespace=0
AND	gc_primary = 1
AND	il_from IS NULL
 
LIMIT	100;

[edit] Featured articles by number of contained coordinates

Sort by number of coordinates of pages in Category:Featured articles.

SELECT COUNT(*) AS freq, page_title, GROUP_CONCAT(gc_primary)
 
FROM categorylinks
 
JOIN page ON page_id=cl_from
LEFT JOIN u_dispenser_p.coord_enwiki ON gc_from=cl_from
 
WHERE cl_to = "Featured_articles"
AND EXISTS (SELECT 1 FROM u_dispenser_p.coord_enwiki WHERE gc_from = cl_from AND gc_primary = 1 ) /* Make sure we are looking at articles with primary coordinates */
 
GROUP BY cl_from
ORDER BY freq;

[edit] Other languages of moon articles

Select articles with lunar coordinates and list the other languages they include.

SELECT page_title, GROUP_CONCAT(ll_lang)
FROM page
JOIN langlinks ON page_id = ll_from
JOIN u_dispenser_p.coord_enwiki ON page_id = gc_from
WHERE gc_globe = 'Moon'
AND page_namespace = 0
GROUP BY gc_from, gc_location

[edit] locateCoord tool

locateCoord.py (source code): a quick and simple tool to access and search the database. The this query will give coordinates that are 5 km near the center of New York City. A more sophsitcaed rewrite will be required as it is pretty limited and to add support for multiple formats such as JSON/XML/YMAL/etc. Do not query this tool faster than 10 times per minute.

[edit] To do

  • Develop an API capable of writing out in HTML, JSON, serialized PHP, KML, OSM, and XML.
  • Language independent article ranking table (length, incoming links, interwiki links)
  • Reset primary bit for multiple primary coordinates form the same article
  • WikiMiniAltas/OSM data integration under heavy load without killing the databases.
  • Documentation, source code should be documented so a novice could understand it.
  • Live updating, MySQL triggers functionality is required for this.

[edit] Database schema summary

mysql> describe u_dispenser_p.coord_enwiki;
+-------------+--------------------------+------+-----+---------+-------+
| Field       | Type                     | Null | Key | Default | Extra |
+-------------+--------------------------+------+-----+---------+-------+
| gc_from     | int(8) unsigned          | NO   | MUL | NULL    |       |
| gc_lat      | double(11,8)             | NO   |     | NULL    |       |
| gc_lon      | double(11,8)             | NO   |     | NULL    |       |
| gc_alt      | float                    | YES  |     | NULL    |       |
| gc_head     | float                    | YES  |     | NULL    |       |
| gc_dim      | float unsigned           | YES  |     | NULL    |       |
| gc_type     | varbinary(32)            | YES  |     | NULL    |       |
| gc_size     | float                    | YES  |     | NULL    |       |
| gc_region   | varbinary(127)           | YES  |     | NULL    |       |
| gc_globe    | enum('Earth','Moon',...) | NO   |     |         |       |
| gc_primary  | tinyint(1)               | NO   |     | 0       |       |
| gc_name     | varbinary(255)           | NO   |     | NULL    |       |
| gc_location | point                    | NO   | MUL | NULL    |       |
+-------------+--------------------------+------+-----+---------+-------+

[edit] Fields

  • gc_from is the page identifier of the referring wiki page.
  • gc_lat is the latitude portion of the location.
  • gc_lon is the longitude portion of the location.
  • gc_alt is the elevation typically in meters above the sea level of the location.
  • gc_head is the direction in degree from north for the object. While typically applies to photos, it be used on other objects with a notion of "forward" such as anchored ships, buildings, rivers, and statues.
  • gc_dim is user supplied approximate size of the object in meters. Typically the diameter of a circumscribed circle.
  • gc_type is a user supplied coordinate type. These may indicate things such a source or mouth of a lake. Types used by the English Wikipedia is given at w:WP:WikiProject Geographical coordinates/type:.
  • gc_size is the population size for the type parameter.
  • gc_region is the user supplied ISO 3166-1 alpha-2 code for the region the location is in. This may be deprecated in the future if no use is found for it.
  • gc_globe is the celestial body (such as Earth, Moon, Mars, Venus) the location belong to. English Wikipedia lists the ones they consider valid, although there is confusion which system to use on Jupiter.
  • gc_primary flags where the location should be considered the primary object(s) in the page. This flag is automatically unset on pages with multiple primary locations. Generally this should be one per article, but may be more with current corner cases with source and outlet of lakes and rivers.
  • gc_name is the user-defined name of the object. If not given, will default to &pagename= which is usually set to the page title. This behavior may change in the future. Do NOT JOIN to the field, use gc_from=page_id instead.
  • gc_location is the WkB POINT (lat, lon) representation of the location. Range is restricted to −90 ≤ x ≤ 90 and −180 ≤ y ≤ 180 for better compatibility with spatial implementations. By default this column has a MBR spatial index allowing O(log n) average case search times.

[edit] Sample code

[edit] Bash/SQL

-- ; date; echo '
SELECT CONCAT("* [[File:", page_title, "]]"), gc_lat, gc_lon
FROM page
JOIN u_dispenser_p.coord_commonswiki ON gc_from=page_id
WHERE page_namespace=6 /* File: namespace */
AND gc_primary=1
LIMIT 10;
-- ' | sql -u commonswiki_p; date

[edit] Python

# The database exists on the same servers  as the non-fast so JOIN operations are possible 
# This makes it easy to do queries like:
# * Articles with coordinates without images (Places needing photographs)
# * Category and coordinate intersection (Map Category:Skyscrapers in New York City)
# * Features on Mars
# * Finding articles of places which users edit often
 
dbName = "enwiki_p"
wgServer = "en.wikipedia.org"
 
# Setup connection
import oursql
connection = oursql.connect(db=dbName,
                             host=dbName.replace('_', '-')+'.userdb.toolserver.org',
                             read_default_file="/home/dispenser/.my.cnf",
                             charset=None,
                             use_unicode=False)
cursor = connection.cursor()
 
def output(s):
    "Print Unicode string as UTF-8"
    print s.encode('utf-8')
 
def getCoordinatesBox(top, right, bottom, left):
    """
    top/bottom are degrees of latitude
    right/left are degrees of longitude
 
    Uses global cursor object.  Wikipedia uses the WGS84 ellipsoid model.
    """
    ghel_table = "u_dispenser_p.coord_%s" % dbName[0:-2]
    cursor.execute("""
        SELECT gc_from, gc_lat, gc_lon, gc_name
        FROM """+ghel_table+"""
        WHERE MBRWithIn(gc_location, GeomFromText(?))
    """, (
        # MySQL unfortunately uses a simplified model where it only checks for points
        # within an MBR (i.e. inside a box), so it isn't possible to find coordinate
        # in Central Park without an additional step of filtering.
        'Polygon((%s %s, %s %s, %s %s, %s %s, %s %s))' % (top,left, top,right, bottom,right, bottom,left, top,left, ),
    ))
    return cursor
 
# Fort Greene Park, New York aligns nicely to the global grid.
# from 40.6933333 deg N, 73.97305556 deg W, 40.68972222 deg N, 73.9777778 deg W
results = getCoordinatesBox(40.6933333, 
                            -73.97305556,
                            40.68972222,
                            -73.9777778)
for (page_id, lat, lon, name) in results:
    output(u'Point %4.3f, %4.3f: "%s"' % (lat, lon, name,))
    # gc_name does not represent the article title, it represents the name 
    # of the point.  Use gc_from to call up the page as demonstrated below
    output(u'http://%s/w/index.php?curid=%s' % (wgServer, page_id,))

[edit] Dumps

The ghel database is dump every Thursdays at 9:40 UTC and is accessible from http://toolserver.org/~dispenser/dumps/ as compressed MySQL dumps.

This can imported into your database using the following command:

gunzip coord_enwiki.sql.gz -c | mysql --verbose -h SERVER -D DBNAME

As of September 2011, a Tab-separated values file (similar to csv file) is produced in the ETL process. Some work would be needed before it is useful to others.

[edit] Logs

Error and warning outputted from the tool are available at http://toolserver.org/~dispenser/logs/. Error are items ghel could not parse, while warning are things it could parse but should be corrected for other programs to read correctly.

[edit] Source code

  • geodbcompiler.py - Simple application to create and fill the database with the geographic data
  • ghel.py - GeoHack External Link parsing library
  • regioncheck.py - [OUTDATED] produces reports using Administrative Boundaries - First Level (ESRI) dataset retrieves all state boundary polygons and find the shortest distance to each one. If the point is found inside it skips it a moves to the next point. This way it gives the shortest distance to all points outside of the country.

[edit] Reusers

[edit] External links

greatessays.org

Administration
Personal tools