oqgraph-dev team mailing list archive
-
oqgraph-dev team
-
Mailing list archive
-
Message #00075
Re: IMDB tools
Hmm somehow a typo (or old syntax) was still present in the SQL file.
$ zcat actors.list.gz | pv | ./imdb2sql | mysql -u root --socket=/tmp/mysql.sock test
725MB 0:47:17
loading the actors requires 260263 pages (or K) in the keybuffer.
$ zcat actresses.list.gz | pv | ./imdb2sql | mysql -u root --socket=/tmp/mysql.sock test
427MB 0:24:42
After this, 380564 MyISAM index pages in the key buffer.
So key_buffer_size=400M for the above two operations is sufficient.
mysql> insert into actors_movies_actors select am1.actor_movie_id,am2.actor_id from actors_movies am1 join actors_movies am2 on (am1.movie_id=am2.movie_id AND am1.actor_id != am2.actor_id);
Query OK, 169301244 rows affected (1 hour 39 min 7.58 sec)
Records: 169301244 Duplicates: 0 Warnings: 0
MariaDB [test]> insert ignore into actors_graph_base (origid,destid) select am.actor_id,ama.actor_id from actors_movies_actors ama join actors_movies am using (actor_movie_id);
Query OK, 149853110 rows affected, 65535 warnings (1 hour 51 min 19.86 sec)
Records: 169301244 Duplicates: 19448134 Warnings: 19448134
Ok so a simple actor -> actor graph is 149 million vertices.
Somehow the system doesn't like the actors_graph table, getting errors (not found, error 2) no matter what I do. When I name it ag it's fine, with the basetable named agb.
Single underscores have been used in other tests, so I dunno what's going on there.
----- Original Message -----
> From: "Arjen Lentz" <arjen@xxxxxxxxxxxxx>
> To: oqgraph-dev@xxxxxxxxxxxxxxxxxxx
> Sent: Tuesday, 26 February, 2013 10:06:51 AM
> Subject: [Oqgraph-dev] IMDB tools
> Hi all
>
> Here is my old IMDB conversion tool (C code), and the schema I used.
> It uses only Actors and Movies, filtering out TV series and DVDs.
>
> 1484084 actors \_ actors
> 866799 actresses /
> 574006 movies
> 5900319 actors_movies
>
> actors.list.gz and actresses.list.gz files from
> ftp://ftp.fu-berlin.de/pub/misc/movies/database/
> (via http://www.imdb.com/interfaces)
>
> My laptop is still building the actors_movies_actors table so I don't
> yet have a total for that, but from memory I think it was around 45
> million edges.
>
> Currently I load using MyISAM with delay key writes and large key
> buffer size.
> We can possibly optimise the load methodology a bit further.
>
>
> Cheers,
> Arjen.
> --
> Arjen Lentz, Exec.Director @ Open Query (http://openquery.com)
> Australian peace of mind for your MySQL/MariaDB infrastructure.
>
> Follow us at http://openquery.com/blog/ & http://twitter.com/openquery
>
>
> --
> Mailing list: https://launchpad.net/~oqgraph-dev
> Post to : oqgraph-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~oqgraph-dev
> More help : https://help.launchpad.net/ListHelp
--
Arjen Lentz, Exec.Director @ Open Query (http://openquery.com)
Australian peace of mind for your MySQL/MariaDB infrastructure.
Follow us at http://openquery.com/blog/ & http://twitter.com/openquery
-- We use MyISAM here because it's just a repeatable bulk load
-- ACID commits would just delay stuff...
-- enabling delay key write, key_buffer_size=3G
CREATE TABLE actors (
actor_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;
CREATE TABLE movies (
movie_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(300) NOT NULL UNIQUE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;
CREATE TABLE actors_movies (
actor_movie_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
actor_id INT UNSIGNED NOT NULL,
movie_id INT UNSIGNED NOT NULL,
UNIQUE (actor_id,movie_id),
INDEX (movie_id)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;
CREATE TABLE `actors_movies_actors` (
`actor_movie_id` INT(10) UNSIGNED NOT NULL,
`actor_id` INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (`actor_movie_id`,`actor_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;
-- populate link table
-- insert into actors_movies_actors select am1.actor_movie_id,am2.actor_id from actors_movies am1 join actors_movies am2 on (am1.movie_id=am2.movie_id AND am1.actor_id != am2.actor_id);
-- populate graph table
-- insert into actors_graph (origid,destid) select am.actor_id,ama.actor_id from actors_movies_actors ama join actors_movies am using (actor_movie_id);
Follow ups
References