$ bin/nutch readseg -get crawl-1/segments/20070417062034 http://kazuomik.livejournal.com/?skip=20 \
-nocontent -nofetch -nogenerate -noparse -noparsetext
SegmentReader: get 'http://kazuomik.livejournal.com/?skip=20'
ParseData::
Version: 5
Status: success(1,0)
Title:
Outlinks: 5
outlink: toUrl: http://kazuomik.livejournal.com/kazuomik/__rpc_controlstrip?user=kazuomik anchor:
outlink: toUrl: http://stat.livejournal.com/ anchor:
outlink: toUrl: http://stat.livejournal.com/img anchor:
outlink: toUrl: http://www.livejournal.com/ anchor:
outlink: toUrl: http://kazuomik.livejournal.com/ anchor:
Content Metadata: nutch.content.digest=67e3f015e805111056350368408cab9c Date=Tue, 17 Apr 2007 13:21:04 GMT \
Vary=Accept-Encoding Content-Length=29932 Content-Encoding=gzip nutch.crawl.score=0.035714287 \
Set-Cookie=ljuniq=bxt7sDSFybXyplh:1176816064:pgstats0; expires=Saturday, 16-Jun-2007 13:21:04 GMT; \
domain=.livejournal.com; path=/ nutch.segment.name=20070417062034 Connection=close \
Content-Type=text/html; charset=utf-8 Server=Apache Cache-Control=private, proxy-revalidate
Parse Metadata: CharEncodingForConversion=UTF-8 caching.forbidden=content OriginalCharEncoding=utf-8
$