← Back to team overview

oqgraph-dev team mailing list archive

Re: IMDB tools

 

Hmm somehow a typo (or old syntax) was still present in the SQL file.

$ zcat actors.list.gz | pv | ./imdb2sql | mysql -u root --socket=/tmp/mysql.sock test
725MB 0:47:17
loading the actors requires 260263 pages (or K) in the keybuffer.

$ zcat actresses.list.gz | pv | ./imdb2sql | mysql -u root --socket=/tmp/mysql.sock test
 427MB 0:24:42
After this, 380564 MyISAM index pages in the key buffer.

So key_buffer_size=400M for the above two operations is sufficient.

mysql> insert into actors_movies_actors select am1.actor_movie_id,am2.actor_id from actors_movies am1 join actors_movies am2 on (am1.movie_id=am2.movie_id AND am1.actor_id != am2.actor_id);
Query OK, 169301244 rows affected (1 hour 39 min 7.58 sec)
Records: 169301244  Duplicates: 0  Warnings: 0

MariaDB [test]> insert ignore into actors_graph_base (origid,destid) select am.actor_id,ama.actor_id from actors_movies_actors ama join actors_movies am using (actor_movie_id);
Query OK, 149853110 rows affected, 65535 warnings (1 hour 51 min 19.86 sec)
Records: 169301244  Duplicates: 19448134  Warnings: 19448134

Ok so a simple actor -> actor graph is 149 million vertices.

Somehow the system doesn't like the actors_graph table, getting errors (not found, error 2) no matter what I do. When I name it ag it's fine, with the basetable named agb.
Single underscores have been used in other tests, so I dunno what's going on there.


----- Original Message -----
> From: "Arjen Lentz" <arjen@xxxxxxxxxxxxx>
> To: oqgraph-dev@xxxxxxxxxxxxxxxxxxx
> Sent: Tuesday, 26 February, 2013 10:06:51 AM
> Subject: [Oqgraph-dev] IMDB tools
> Hi all
> 
> Here is my old IMDB conversion tool (C code), and the schema I used.
> It uses only Actors and Movies, filtering out TV series and DVDs.
> 
> 1484084 actors \_ actors
> 866799 actresses /
> 574006 movies
> 5900319 actors_movies
> 
> actors.list.gz and actresses.list.gz files from
> ftp://ftp.fu-berlin.de/pub/misc/movies/database/
> (via http://www.imdb.com/interfaces)
> 
> My laptop is still building the actors_movies_actors table so I don't
> yet have a total for that, but from memory I think it was around 45
> million edges.
> 
> Currently I load using MyISAM with delay key writes and large key
> buffer size.
> We can possibly optimise the load methodology a bit further.
> 
> 
> Cheers,
> Arjen.
> --
> Arjen Lentz, Exec.Director @ Open Query (http://openquery.com)
> Australian peace of mind for your MySQL/MariaDB infrastructure.
> 
> Follow us at http://openquery.com/blog/ & http://twitter.com/openquery
> 
> 
> --
> Mailing list: https://launchpad.net/~oqgraph-dev
> Post to : oqgraph-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~oqgraph-dev
> More help : https://help.launchpad.net/ListHelp

-- 
Arjen Lentz, Exec.Director @ Open Query (http://openquery.com)
Australian peace of mind for your MySQL/MariaDB infrastructure.

Follow us at http://openquery.com/blog/ & http://twitter.com/openquery

-- We use MyISAM here because it's just a repeatable bulk load
-- ACID commits would just delay stuff...
-- enabling delay key write, key_buffer_size=3G

CREATE TABLE actors (
	actor_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
	name VARCHAR(100) NOT NULL UNIQUE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;

CREATE TABLE movies (
	movie_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
	title VARCHAR(300) NOT NULL UNIQUE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;

CREATE TABLE actors_movies (
	actor_movie_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
	actor_id INT UNSIGNED NOT NULL,
	movie_id INT UNSIGNED NOT NULL,
	UNIQUE (actor_id,movie_id),
	INDEX (movie_id)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;

CREATE TABLE `actors_movies_actors` (
  `actor_movie_id` INT(10) UNSIGNED NOT NULL,
  `actor_id` INT(10) UNSIGNED NOT NULL,
  PRIMARY KEY (`actor_movie_id`,`actor_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1;

-- populate link table
-- insert into actors_movies_actors select am1.actor_movie_id,am2.actor_id from actors_movies am1 join actors_movies am2 on (am1.movie_id=am2.movie_id AND am1.actor_id != am2.actor_id);

-- populate graph table
-- insert into actors_graph (origid,destid) select am.actor_id,ama.actor_id from actors_movies_actors ama join actors_movies am using (actor_movie_id);


Follow ups

References