How To: Import 7 Million Locations from GeoNames to MySQL, then into MongoDB so they can be compared and optimized for Location Queries
1:57 am in Benchmarks, Location Queries, MongoDB, MySQL, Tutorials by Mark Smalley
It’s a long title I know, but it only comes half-close to the title I really wanted to use…
So, like everyone else who’s been paying any kind of attention to recent database technology trends, I’d heard about this No-SQL stuff, heard it was fast, and far more approachable to your average developer due to the JSON based formatting, but had never had the time to sit down and really investigate things. And then I heard about MongoDB, specifically about their superb geo-location functions – http://www.mongodb.org/display/DOCS/Geospatial+Indexing – and as you may know already, I’m a bit of a geo-location nut-case, so was instantly hooked when I also heard that Foursquare was using MongoDB too.
So where to start…? We first needed some geo-data, and where better to get that than from GeoNames – where you can download the allCountries.zip file (containing 7 Million+ records at over 200 MB), or simply grab the country-specific information for the country you reside.
We will need to get the data into MySQL, not only so that we can compare the speeds for proximity searches, but also so that we can learn a valuable lesson on how to export from MySQL into MongoDB, which really is both flexible and easy.
If we plan to use the GeoNames data, we MUST first create a table using the following SQL (with the field names being specifically important for the next stage):
CREATE TABLE `geonames` (
`geonameid` int(10) unsigned NOT NULL default '0',
`name` varchar(200) NOT NULL default '',
`ansiname` varchar(200) NOT NULL default '',
`alternatenames` varchar(2000) NOT NULL default '',
`latitude` double NOT NULL default '0',
`longitude` double NOT NULL default '0',
`feature_class` char(1) ,
`feature_code` varchar(10) ,
`country_code` char(2),
`cc2` varchar(60),
`admin1_code` varchar(20) default '',
`admin2_code` varchar(80) default '',
`admin3_code` varchar(20) default '',
`admin4_code` varchar(20) default '',
`population` bigint(11) default '0',
`elevation` int(11) default '0',
`gtopo30` int(11) default '0',
`timezone` varchar(40),
`modification_date` date default '0000-00-00',
PRIMARY KEY (`geonameid`)
) CHARACTER SET utf8 ;
With the geonames table created, we can import the data contained within the TXT file we just downloaded from GeoNames by using SQL similar to the following (where you will need to replace the location of the TXT file to the location you just extracted it to from the ZIP file you downloaded):
LOAD DATA INFILE 'C:/xampp/htdocs/allCountries.txt' INTO TABLE geonames(
geonameid, name, ansiname, alternatenames, latitude, longitude, feature_class, feature_code, country_code, cc2, admin1_code, admin2_code, admin3_code, admin4_code, population, elevation, gtopo30, timezone, modification_date
)
We should now have over 7 million places in our database (assuming you were brave enough to use the allCountries.txt file, which to be honest, may be too big for the next stages and cause you several problems like it did when running it on my personal PC at home). Once you have the records in your MySQL database, do not forget to add an index to the “latitude” and “longitude” fields in order to give the tests a fair chance later.
At this stage, we should probably start thinking about getting MongoDB set-up, and fortunately, the online documentation and general community-spirit for MongoDB is a jolly-fine one at that. For the purpose of this comparison, we will be using MongoDB and PHP to test the results, but in a future “How To”, I also hope to document my findings for NodeJS too!
Let’s download MongoDB – http://www.mongodb.org/downloads – then run-through the getting-started guides – http://www.mongodb.org/display/DOCS/Quickstart. I found set-up on the PC and Mac equally as easy. Simply download the package, save somewhere sensible, create the necessary “data” folder, then run the two recommended console executables “mongod.exe” and “mongo.exe” and if 3+3 equals 6 everything’s working properly! For the benefit of getting to the meat and bones of this How To, I’m going to assume that setting-up MongoDB was a breeze and that you were then capable of also copying the correct .dll (for windows) or .so (for Mac) into the correct “extension” folder as specified by your php.ini file and documented here – PHP Drivers for Mongo.
We are now ready to import the MySQL data into MongoDB, and although the method I am about to show you is not the preferred method for large data-sets (which should use the import function), showing you this way teaches you the most fundamental MongoDB functions, which includes adding new collections (tables) and objects (records), but also updating those objects and then retrieving the objects, all from PHP in syntax we as PHP developers should be very comfortable with. In fact, the following code provides everything you need in order to not only import the data from MySQL to MongoDB, but also to run comparative tests to see how fast each of the databases can run and return the same results.
Before you begin, please note that importing all 7 Million records using the methods below from MySQL to MongoDB is going to take some time, several hours, if not all day (and then some, especially using these methods), so if you are doing this as a way to get started and want some immediate feedback, you REALLY should consider importing one of the country specific .txt files from GeoNames, or using the country_code and (or) limit variables to migrate only a portion of the whole MySQL data into MongoDB.
Either way, feel free to use, improve and comment on the following code as required (as a reminder, this is not the best way to import large data-sets to MongoDB but have done it this way so that it’s easy to understand the fundamentals):
// We need this function for running stop-watch later...
function ms_microtime_float(){
list($usec, $sec) = explode(" ", microtime());
return ((float)$usec + (float)$sec);
}
// We need this function for making the most optimised MySQL proximity query available...
// Simply provide a database name, your current Lat / Lng and optional DB Limits and Maximum Distances...
function ms_get_locations_by_distance_using_sql($database_name='geonames',$lat,$lng,$limit=100,$distance=100,$country_specific=false){
if($country_specific){ $extra_where_clause = "AND country_code = 'MY'"; }else{ $extra_where_clause=''; }
if($limit>0){ $limit = 'LIMIT 0, '.$limit; }else{ $limit=NULL; }
$lng1 = $lng - $distance / abs(cos(deg2rad( $lat ) ) *69 );
$lng2 = $lng + $distance / abs(cos(deg2rad( $lat ) ) *69 );
$lat1 = $lat - ( $distance /69 );
$lat2 = $lat + ( $distance /69 );
/* THIS REALLY IS THE BEST WAY TO PERFORM PROXIMITY SEARCH IN MYSQL */
$nearest_locations_ordered_by_distance = "SELECT *,
( 6371 * acos( cos( radians( $lat ) ) * cos( radians( latitude ) ) * cos( radians( longitude ) - radians( $lng ) ) + sin( radians( $lat ) ) * sin( radians( latitude ) ) ) ) AS distance
FROM $database_name
WHERE longitude between $lng1 and $lng2
AND latitude between $lat1 and $lat2
$extra_where_clause
HAVING distance < $distance
ORDER BY distance
$limit";
return $nearest_locations_ordered_by_distance;
}
/*
* THESE ARE OUR OPTIONS
*
* In order to initially import data from MySQL to MongoDO use:
*
* $collect_mongo_results = false;
* $loop_through_sql = true; // SET TO FALSE IF ONLY INTERESTED IN MONGO RESULTS
* $add_to_mongo = true;
* $print_arrays = false;
*
* In order to compare times for collecting results in both MySQL and MongoDB use:
*
* $collect_mongo_results = true;
* $loop_through_sql = true; // SET TO FALSE IF ONLY INTERESTED IN MONGO RESULTS
* $add_to_mongo = false;
* $print_arrays = true;
*
*/
$collect_mongo_results = false;
$loop_through_sql = true;
$add_to_mongo = true;
$print_arrays = false; // Be careful, with 7 million results, this can take time
$limit = -1; // Set this to -1 if you wish to import everything from MySQL to MongoDB else add limit as number
$maximum_distance = 100; // This represents the maximum distance in KM for proximity search
$database_name = 'geonames'; // This will also be used for the MongoDB name
$table_name = 'countries'; // This will also be used as the MongoDB collection name
$mysql_host = 'localhost';
$mysql_username = 'root';
$mysql_password = '';
// These default Lat / Lng Coordinates are for Malaysia
$current_latitude = 3.190908; // Do not use a string!
$current_longitude = 101.682243; // Do not use a string!
// This country code is for Malaysia
// Change to false if wanting to run queries on all records...
$country_code = false; // This is used to perform sub-set queries on specific countries...
/* END OF OPTIONS */
// Connect to Mongo
$m = new Mongo();
// Select / Create a Database
$db = $m->$database_name;
// Select / Create a Collection (table)
$collection = $db->$table_name;
/* THIS ALLOWS SCRIPTS TO RUN WITHOUT TIMING OUT */
set_time_limit(0);
if($collect_mongo_results){
/* START TIMING MONGO */
$time_start = ms_microtime_float();
// Start building MongoDB query
/* PAY EXTRA ATTENTION TO ORDER OF LONGITUDE THEN LATITUDE */
/* THIS IS VERY IMPORTANT FOR MONGODB - OTHERWISE RESULTS WILL BE WRONG !!! */
$this_latlng = array('lng'=>$current_longitude,'lat'=>$current_latitude);
$mongo_near_query = array('geoNear'=>$table_name,'near'=>$this_latlng,'$spherical'=>true,'$maxDistance'=>$maximum_distance/6378,'num'=>$limit);
$cursor = $db->command($mongo_near_query);
/* THE SAME QUERY CAN BE RUN IN MONGO CONSOLE USING THE FOLLOWING */
//-> db.runCommand({geoNear:"countries",near:[101,3],spherical:true,maxDistance:100/6378,num:10})
if($print_arrays){
// iterate through the results
foreach ($cursor['results'] as $obj) {
echo '<br />This is MongoDB Array:<br />';
echo '<pre>';
print_r($obj);
echo '</pre>';
}
}
/* END MONGO TIMER AND DISPLAY RESULTS */
$time_end = ms_microtime_float();
$time = $time_end - $time_start;
$mongo_results = "<br />Collected and Printed using MongoDB in $time seconds<br />";
echo $mongo_results;
}
if($loop_through_sql){
/* START SQL TIMER */
$time_start = ms_microtime_float();
// Connect to MySQL
$link = mysql_connect($mysql_host, $mysql_username, $mysql_password);
if (!$link) {
die('Could not connect: ' . mysql_error());
}
// Connect to
if (!mysql_select_db($database_name)) {
die('Could not select database: ' . mysql_error());
}
/* WHICH QUERY TO RUN...? DEPENDS ON OPTIONS ABOVE */
if($add_to_mongo){
if($country_code){ $where_clause = "WHERE country_code = 'MY'"; }
$query = 'SELECT * FROM '.$table_name.$where_clause;
}else{
$query = ms_get_locations_by_distance_using_sql($table_name,$current_latitude,$current_longitude,$limit,$maximum_distance,$country_code);
}
// Perform Query
$result = mysql_query($query);
// Check result
// This shows the actual query sent to MySQL, and the error. Useful for debugging.
if (!$result) {
$message = 'Invalid query: ' . mysql_error() . "\n";
$message .= 'Whole query: ' . $query;
die($message);
}
// Use result
// Attempting to print $result won't allow access to information in the resource
// One of the mysql result functions must be used
// See also mysql_result(), mysql_fetch_array(), mysql_fetch_row(), etc.
while ($row = mysql_fetch_assoc($result)) {
if($print_arrays){
echo '<br />This is MySQL Array:<br />';
echo '<pre>';
print_r($row);
echo '</pre>';
}
/* END SQL TIMER */
$time_end = ms_microtime_float();
$time = $time_end - $time_start;
/* NOW NEED TO ADD DATA TO MONGO TOO */
if($add_to_mongo){
$obj = array();
foreach($row as $key => $value){
$obj[$key] = $value;
}
// This is the MongoDB PHP Function for inserting objects (records)
$collection->insert($obj);
}
// In order to use MongoDB Geo-Spatial Indexing we need a Lng / Lat field
// This shows us how to use the MongoDB PHP Update function
$collection->update(array("geonameid" => $row['geonameid']), array('$set' => array("latlng" => array('lng'=>(float)$row['longitude'],'lat'=>(float)$row['latitude']))));
// Create Geo-Spaital Index In Mongo Console (once this field has been added) using:
// db.countries.ensureIndex({latlng:'2d'})
}
/* THEN PRINT RESULTS */
echo "<br />Collected and Printed using MySQL in $time seconds<br />";
if($collect_mongo_results){
echo "WHERE AS ".$mongo_results;
}
// Free the resources associated with the result set
// This is done automatically at the end of the script
mysql_free_result($result);
mysql_close($link);
}
Several things to remember before running or after checking the results of the comparisons is whether you remembered to added the MySQL indexes to your latitude and longitudes, not to mention whether you added the geo-spatial index to your MongoDB, to do which, you will need to use the following command from within the Mongo Console (geeky I know – there are some GUIs, but this is much easier than it looks):
db.countries.ensureIndex({latlng:'2d'})
Some other simple yet useful commands for the console include dropping databases:
db.geonames.drop()
By default, opening the Mongo Console launches the “test” database, so to switch from that to geonames is as simple as typing:
use geonames
In conclusion, MongoDB is FAST. In searching through 65,411 records / objects (as I myself only imported the places for Malaysia in order to save time and get to the learning part), and displaying only those within a 100 KM radius, and then having those results limited to 10, ordered by distance with the closest first, and echoed out onto the page took 0.23555 seconds for MySQL, where as that very same query providing the very same results took only 0.00152 seconds with MongoDB – and on several occasions, it even came in around the 0.00095 mark! And to think that that’s with just one user accessing the query. The biggest problem with MySQL and the reason I initially sought alternatives was the fact that only one person can access the database at one time, where concurrent queries get added to a queue and need to wait their turn. With MongoDB, this is not the case, so really, truth be known, there is no comparison between them! But we will be running some concurrent tests soon, as well as running the same queries using the complete 7 Million record set…
But how do we import all 7 Million+ records and perform tests on those…?
By NOT using the methods above! I tried several times and several different ways to import all 7 million records, and the quickest most reliable method I could find was to first export the MySQL as CSV. I tried several times to use a mysqldump from console, but could never get the formatting right. Believe it or not, the most Mongo-Friendly method was to export from phpMyAdmin in CSV for MS Excel format (with the first row being field names) as this was able to then utilize the mongoimport function. However, whenever I tried to export the entire set, I ran into problems. What I ended-up doing was exporting a million records at a time as CSV.
Once you have all 8 CSV files, you can then import them into MongoDB using the following console commands:
mongoimport --db geonames --collection countries --type csv --file geonames_part1.csv --headerline --upsert
Do this for each part, and remember this must be done from a standard terminal console, not from the MongoDB console, which was my first mistake! After we have all 7 Million plus records into the database, we are going to need to add in the latlng / loc field, which does not yet exist and is needed by MongoDB for the geo-spatial indexing. To do this, the quickest method I found was to use a JS query. To do that, you will first need to create a new JavaScript with contents such as:
db.countries.find().forEach(function(data) {
db.countries.update({_id:data._id},{$set:{loc:{lng:parseFloat(data.longitude),lat:parseFloat(data.latitude)}}});
})
We then need to query the MongoDB using that JS file, which is perhaps my most favourite MongoDB feature and can be performed from a regular terminal (not the MongoDB console) using syntax such as:
mongo geonames add_latlng.js
Once we have that in place, we can add the Geo-Spatial Index (from the Mongo Console) by typing:
db.countries.ensureIndex({loc:"2d"})
The results are phenomenal:
Querying 7.8 Million records using MySQL and finding the 10 nearest locations ordered by distance took an average of 165 seconds!
With MongoDB, that same query took only 0.02 seconds!
Other than that, what’s next…?
MongoPress – for it is this that started the future of NoSQL CMS : -)
You may also be interested in MongoDB Vs MySQL – Benchmarks Re-Creating Typical CMS Functionality to Compare Performance of MySQL and MongoDB in PHP / Apache.
Thank you so much!
Can I translate this article to portuguese and post on my blog?
You most certainly may! Please let us know once you have done so – we can then provide a link to that article too!
Why not using the spatial extension on mysql as well?
Johan – we had done MySQL geo-spatial indexing and polygon-testing comparisons Vs the MySQL method above against each other just a week prior to this experiment, and for the particular set of circumstances we needed to address for our own application, both methods worked out at almost the exact same speed once all the indexing had been properly applied, and so for the benefit of speeding-up the process of this round of experiments, we simply used the same non-geo-spatial MySQL method to make the process of testing one that was more solely focused on the MogoDB side of things…
Nice write-up, but I am a bit reluctant to accept the end result (average of 165 seconds for MySQL). 8 mio records is nothing for RDBMS and with proper indexes your queries should be very fast. Did you try to EXPLAIN your query? I am quite sure you would find that indexes were not used, probably because you used “BETWEEN” (I think MySQL had problems with indexing BETWEEN conditions) instead of > and <.
Of course, one could argue that this just another proof why Mongo is so much better…
The bottom line is that with MongoDB it is at least a thousand times quicker to write to RAM than it is to HD, so it will always seem faster. In addition to that, the circumstances that we needed to recreate for our testing were especially heavy on MySQL as they were performing several passes, back and forth, and conducting a bunch of stuff outside the scope of MySQL itself, whereas MongoDB was able to handle the exact query we needed in a completely natural way – so like you say; “one could argue that this just another proof why Mongo is so much better” …
Hello!
Nice article, but I wanted to know a bit more about the storage engine used in MySQL and the indexes you’ve used…
In this particular case, if you were using MongoDB spatial indexes, which are (obviously) reducing lookup times from O(n*Log(n)) to O(1) by adding some overhead during writes or async index updating.
Did you try using a spatial index on MySQL? What about using point data?
If not, then I would not assert that this benchmark is valid, as we’re not comparing different data management system, but also different sorting and fetching algorithms… That makes the test inconsistent to my eyes, and it is obvious that you’re getting such crazy results (165s vs 0.02s).
I’m also interested in a PGSQL benchmark (as I’m not that big MySQL fan, even if I use it daily), but I don’t pretend you to do it for me
It would already be great to see how much time MySQL needs to find indexed data in a transaction.
I know that time is MongoDBFetchTime*SOME_CONSTANT, and not MongoDBFetchTime*n*Log(n).
Marco – as mentioned to Johan, we had done separate MySQL tests using geo-spatial indexing and POINTS and Polygons just a week or so prior to this test, and already concluded that for our particular case, both methods worked at almost exactly the same speed, but they simply were not capable of returning data in a natural way like MongoDB does, already arrayed and ready to use. I would also love to be able to spend more time and get more results, and will of course update this page if and when that happens. The thing is, we did all these tests ourselves, for ourselves, and this is just the final straw of those tests. The speed and lack of caching needed to achieve naturally geo-located objects from MongoDB is without question in a world of its own when compared to the horrors of geo-location and MySQL – but of course, that is just my opinion (and everyone else in my company who has also worked towards perfecting geo-location queries)…
alright. But could you please point me to a proof of this statement?
“The biggest problem with MySQL and the reason I initially sought alternatives was the fact that only one person can access the database at one time, where concurrent queries get added to a queue and need to wait their turn”
By user, I meant to say connection, as in for each write you make to the database, it gets locked-down (unless using InnoDB), but this is all about I/O at the end of the day and that MySQL needs to write to your hard-drive, whereas MongoDB first writes to RAM, this briefly gets touched upon here – http://highscalability.com/blog/2011/6/22/its-the-fraking-iops-1-ssd-is-44000-iops-hard-drive-is-180.html
Yes, but that is by design to get ACID compliance which I think is a good idea. And you were talking about only writes and then you have right about table lock (InnoDB – row locking, BDB – page locking). Even if MongoDB does write to memory initially it does take an global (server-wide?) lock when writing to persistence storage, short time lock perhaps but still.
I have tried the code here now (my testing machine is an CentOS 5 , 256MB memory, 1 core, VPS system) , and for mysql I get running time of 0.78 seconds….
Haven’t set up mongodb yet but will do that next
As “officially” mentioned below by Laulima, we were not trying to say that MongoDB is 8,00 times faster at all times, only that it was 8,000 times faster for the specific query we were running. MongoDB will always be faster than MySQL, due to the fact that it writes to RAM rather than HD, it’s just a matter of how much more faster, which all depends on what you need. For geo-location, it kicks-arse.
The GridFS media-storage (being used by MongoPress – http://mongopress.org) functionality of MongoDB is also reason enough to use it if you’re storing media and want it load-balanced and backed-up via sharding and replica-sets…
That seems like a much better estimate than 167 seconds… Johan, look forward to results of your tests if you intend on publishing them here.
And here is a quick summary of my tests
http://www.bjornalycke.se/articles/mysql-vs-mongodb-0
We did not intend for our article to claim that MongoDB is 8,000 times faster, only that the exact query we needed to conduct under the circumstances we needed to conduct them led to results that were over 8,000 times faster. The smaller the data-set and the more specific you get with the query (as you have done) makes the difference much smaller, but we wanted to test large queries and large data-sets using specific functionality that MySQL does not even offer and has to be conducted in PHP first as well as performing several passes at it, which when done on huge data-sets becomes very slow – unlike it did with MongoDB that conducts all of its magic in RAM at lightening fast speeds.
Nonetheless, thanks for the counter-post and started to explore the issue more closely. Hope you have fun with MongoDB, it really is an awesome new way to engage with data…
Well, that average I got was for 7.8 million rows, with the same default settings as you have used, so there should be no difference but still I get an average of 0.89 and you 165 seconds…
I have also tried to copy 85K rows from mysql to mongodb and that took 2 seconds. On this machine and this dataset I get an average of 0.0007688 seconds for mysql. (this is not the same machine as my prev. benchmark)
and for mongodb I get an average of 0.000078125 compared to mysql’s 0.0007688.
Hi Mark!
Thanks a lot for your explanation and good work. Thanks to it now I can have a hosted copy of the geonames database and grow my project up.
I am having some troubles for dumping/restoring… I can backup this way:
mongodump –db my_db –collection my_collection
Then restore with:
mongorestore –db other_db –collection my_collection
Then problem comes when trying to build the 2d index. It says:
point not in interval of [ -180, 180 )
Have you seen this error before?
Thanks again!
We’ve not seen this kind of error before, but do remember that the first problem we ever encountered with MongoDB geo-location was that they use LNG / LAT – rather than most other places, which uses LAT / LNG – that caused us much confusion and a bit of pain to begin with, but do not necessarily think that’s the problem you are having. It seems you have some suspect “points” that you’ll need to de-bug. Good luck, and thanks for taking the time to comment.
Hey! thanks for the reply.
Seems like i’ve found a bug:
http://groups.google.com/group/mongodb-user/browse_thread/thread/4b8da2157ff3b657/c0dfacce88b77479#c0dfacce88b77479
Sad I spent 2 days with this bug :-\
Seems that last comment really nails it on the head (at least for a temporary solution):
“You could change your indexes bound to be 181:”
———————————————————————————
db.places.ensureIndex( { loc : “2d” } , { min : -180 , max : 181 } )
———————————————————————————
Glad the mongo-group was able to help you get it answered!
Yes that’s true. This is the problem of being a mongodb novice