KazMuzik.net
Music / Technology / Healthcare / Immigration / America

Kaz Muzik Blog Backup Project #29 - KazMuzik Blog
2007-11-17 23:25

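I had neglected my backups for more than four months, since 7/4/2007 (#25), so today I took a backup using Nutch and Derby. The procedure is almost the same as before, except that this time the list of URLs to inject into the Nutch crawldb was built from the monthly summary page(s) described in #27.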
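The URL list itself is simple: each entry ID gets the journal's base URL in front of it and `.html` after it. Here is a minimal sketch of that step on made-up input, assuming `LiveJournalMonthlyManager` prints tab-separated lines with the entry ID in the second field. The first-column values are invented placeholders, and `to_entry_urls` is a hypothetical helper name.

```shell
# Turn tab-separated lines (entry ID assumed to be in field 2) into entry URLs.
# to_entry_urls is a hypothetical helper, not one of the original tools.
to_entry_urls() {
  # Using | as the sed delimiter avoids escaping the slashes in the URL.
  cut -f 2 | sed -e 's|^|http://kazuomik.livejournal.com/|' -e 's|$|.html|'
}

# Sample input; the first column is a placeholder value.
printf '2007-11\t141077\n2007-11\t88261\n' | to_entry_urls
```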

$ cd ~/kazmuzikblog
$ mkdir /usr/local/nutch-0.9/kazmuzik-url-dir
$ java -classpath classes LiveJournalMonthlyManager \
| cut -f 2 \
| sed -e 's/^/http\:\/\/kazuomik.livejournal.com\//' -e 's/$/.html/' \
> /usr/local/nutch-0.9/kazmuzik-url-dir/20071116.txt
$ cd /usr/local/nutch-0.9
$ bin/nutch inject kazmuzik-crawldb kazmuzik-url-dir
...
$ mkdir kazmuzik-segments
$ bin/nutch generate kazmuzik-crawldb kazmuzik-segments
...
Generator: segment: kazmuzik-segments/20071117161617
...
$ bin/nutch fetch kazmuzik-segments/20071117161617
...
fetching http://kazuomik.livejournal.com/141077.html
fetching http://kazuomik.livejournal.com/88261.html
fetching http://kazuomik.livejournal.com/151961.html
...
fetching http://kazuomik.livejournal.com/36988.html
fetching http://kazuomik.livejournal.com/79224.html
fetching http://kazuomik.livejournal.com/81466.html
Fetcher: done
$ bin/nutch readseg -list -dir kazmuzik-segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20071117161617  609        2007-11-17T16:16:46  2007-11-17T17:11:29  609     594
$ touch kazmuzik-segments/20071117161617/fetcher.done
$ bin/nutch updatedb kazmuzik-crawldb -dir kazmuzik-segments -noAdditions
...
$ bin/nutch readdb kazmuzik-crawldb -stats
CrawlDb statistics start: kazmuzik-crawldb
Statistics for CrawlDb: kazmuzik-crawldb
TOTAL urls:     609
retry 0:        609
min score:      1.001
avg score:      1.015
max score:      1.114
status 1 (db_unfetched):        15
status 2 (db_fetched):  594
CrawlDb statistics: done
$ bin/nutch generate kazmuzik-crawldb kazmuzik-segments
...
Generator: segment: kazmuzik-segments/20071117184651
...
$ bin/nutch fetch kazmuzik-segments/20071117184651
...
fetching http://kazuomik.livejournal.com/152532.html
fetching http://kazuomik.livejournal.com/148363.html
fetching http://kazuomik.livejournal.com/34221.html
Fetcher: done
$ bin/nutch readseg -list -dir kazmuzik-segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20071117161617  609        2007-11-17T16:16:46  2007-11-17T17:11:29  609     594
20071117184651  15         2007-11-17T18:47:24  2007-11-17T18:48:40  15      15
$ bin/nutch updatedb kazmuzik-crawldb -dir kazmuzik-segments -noAdditions
...
$ bin/nutch readdb kazmuzik-crawldb -stats
CrawlDb statistics start: kazmuzik-crawldb
Statistics for CrawlDb: kazmuzik-crawldb
TOTAL urls:     609
retry 0:        609
min score:      1.002
avg score:      1.031
max score:      1.228
status 2 (db_fetched):  609
CrawlDb statistics: done
$ bin/nutch mergesegs kazmuzik-segments2 -dir kazmuzik-segments
...
$ bin/nutch readseg -list -dir kazmuzik-segments2
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20071117185404  609        2007-11-17T16:16:46  2007-11-17T18:48:40  609     609
$ cd ~/kazmuzikblog
$ java -classpath classes:/usr/java/jdk/db/lib/derby.jar:/usr/local/nutch-0.9/nutch-0.9.jar:\
/usr/local/nutch-0.9/lib/hadoop-0.12.2-core.jar:/usr/local/nutch-0.9/lib/commons-logging-1.0.4.jar:\
/usr/local/nutch-0.9/lib/log4j-1.2.13.jar \
LiveJournalEntryDatabaseInitializer /usr/local/nutch-0.9/kazmuzik-segments2/20071117185404
...
$ 
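The session above repeats generate, fetch, and updatedb until `readdb -stats` no longer reports any `db_unfetched` URLs. Here is a minimal sketch of how that check might be scripted, assuming the statistics keep the format shown above. `unfetched_count` is a hypothetical helper name.

```shell
# Print the db_unfetched count found in `readdb -stats` output on stdin,
# or 0 when there is no db_unfetched line (as in the second stats run above).
unfetched_count() {
  awk '/db_unfetched/ { n = $NF } END { print n + 0 }'
}

# Example using lines from the first stats run:
printf 'TOTAL urls:     609\nstatus 1 (db_unfetched):        15\nstatus 2 (db_fetched):  594\n' | unfetched_count
```

In a real run the input would come from `bin/nutch readdb kazmuzik-crawldb -stats`, provided the statistics reach standard output. A non-zero result would mean another generate/fetch/updatedb round is needed.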

LiveJournalEntryParser threw an Exception on three entries. Entry 152816 is #26 of this project, written on 11/10. The source code quoted in that entry contained a "<" that I had not escaped, and the parser tripped on it while parsing the mood, location, and tag(s). Entries 128843 and 129431 had no Subject, which the parser did not support. This time I worked around that by adding a Subject to each of those entries. I then made a URL list of these three entries, used the Nutch subcommand freegen to generate a Nutch segment directly from it, and resumed.
$ cd /usr/local/nutch-0.9
$ mkdir kazmuzik-url-dir2
$ cat > kazmuzik-url-dir2/20071116b.txt
http://kazuomik.livejournal.com/128843.html
http://kazuomik.livejournal.com/129431.html
http://kazuomik.livejournal.com/152816.html
^D
$ bin/nutch freegen kazmuzik-url-dir2 kazmuzik-segments2
...
$ bin/nutch fetch kazmuzik-segments2/20071117221341
...
$ bin/nutch mergesegs kazmuzik-segments3 -dir kazmuzik-segments2
...
$ bin/nutch readseg -list -dir kazmuzik-segments3
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED PARSED
20071117221515  609        2007-11-17T16:16:46  2007-11-17T22:14:35  609     609
$ cd ~/kazmuzikblog
$ java -classpath classes:/usr/java/jdk/db/lib/derby.jar:/usr/local/nutch-0.9/nutch-0.9.jar:\
/usr/local/nutch-0.9/lib/hadoop-0.12.2-core.jar:/usr/local/nutch-0.9/lib/commons-logging-1.0.4.jar:\
/usr/local/nutch-0.9/lib/log4j-1.2.13.jar \
LiveJournalEntryDatabaseInitializer /usr/local/nutch-0.9/kazmuzik-segments3/20071117221515
$ java -classpath classes:/usr/java/derby/lib/derby.jar LiveJournalHtmlCreator
$ 
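The failure on entry 152816 came from a literal "<" in pasted source code. Here is a minimal sketch of escaping such text before posting it, so that an HTML parser does not mistake it for markup. It uses plain sed rather than anything from the Java tools in this project, and `escape_html` is a hypothetical helper name.

```shell
# Escape the characters that are special in HTML text (hypothetical helper).
escape_html() {
  # Escape & first so the entities produced by the later rules are not re-escaped.
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Example: a line of code containing <, > and &.
printf '%s\n' 'if (a < b && c > d)' | escape_html
# prints: if (a &lt; b &amp;&amp; c &gt; d)
```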

Tags: programming