Wednesday, October 24, 2007

[Web] How to batch-download data from someone else's web pages

Taking the content and audio files of LiveABC's Daily Sentence (http://www.liveabc.com/site/daily_sentence/dailysentence_main.asp)
as an example:

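1. The listing has 82 pages, and the page to display is selected with the GET parameter page=?.
 For example, to get page 30, open the URL:
 http://www.liveabc.com/site/daily_sentence/dailysentence_main.asp?page=30
 So the following PHP program can fetch every page in one go: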
 <?php
  // Append every listing page to all.htm
  $fp = fopen("all.htm","a+");

  for ($i = 1; $i <= 82; $i++) {
   // file() returns the page as an array of lines; implode() joins them
   $html = implode('',
    file('http://www.liveabc.com/site/daily_sentence/dailysentence_main.asp?page='
     .$i));
   fwrite($fp, "\n\n");
   fwrite($fp, $html);
  }

  fclose($fp);
 ?>
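
For readers who would rather try step 1 outside PHP, here is a minimal Python sketch of the same loop. The function names (build_page_urls, fetch, save_pages) are hypothetical and not part of the original script. The fetcher parameter is an assumption added so the loop can be exercised without touching the network.

```python
from typing import Callable, Iterable, List
from urllib.request import urlopen

BASE_URL = "http://www.liveabc.com/site/daily_sentence/dailysentence_main.asp"


def build_page_urls(last_page: int, base_url: str = BASE_URL) -> List[str]:
    """Return the URL of every listing page from 1 to last_page."""
    return [f"{base_url}?page={n}" for n in range(1, last_page + 1)]


def fetch(url: str) -> bytes:
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def save_pages(urls: Iterable[str], out_path: str,
               fetcher: Callable[[str], bytes] = fetch) -> int:
    """Append each page to out_path, preceded by a blank line; return the page count."""
    count = 0
    with open(out_path, "ab") as out:
        for url in urls:
            out.write(b"\n\n")
            out.write(fetcher(url))
            count += 1
    return count
```

Calling save_pages(build_page_urls(82), "all.htm") would mirror the PHP loop above. Pages are written as raw bytes, so the site's original encoding is left untouched.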

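2. The step above produces all.htm. The following program then extracts the links to the full entries and the download links for the audio files.
 The program was actually run in several passes before the job was done,
 so use it as a reference for the functionality we need. The final output is sen.url and file.url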
 #!/usr/bin/perl -w
 #
 #

 # Input file; it was switched between passes (earlier inputs are commented out)
 #$txtfile = "audio.htm";
 #$txtfile = "all2.htm";
 #$txtfile = "wav.txt";
 $txtfile = "rdata2.sh";
 open(IN, $txtfile) || die "can't open $txtfile : $!";

 while (<IN>) {
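   # Earlier pass: keep only the lines that mention playAudio or PopUp
   #if (/playAudio/ or /PopUp/) {
   # print $_;
   #}

   # Earlier pass: cut the audio file path out of the playAudio call
   #$where = index($_,"playAudio");
   #if ($where >= 0) {
    # $af = substr($_,$where+11,43);
    # print $af,",";
   #}
   # Earlier pass: cut the entry link out of the PopUp call, up to the closing quote
   #$where = index($_,"PopUp");
   #if ($where >= 0) {
   # $a = substr($_,$where+7,40);
   # $where = index($a,"'");
   # $ag = substr($a,0,$where);
   # print $ag,"\n";
   #}

   # Current pass: strip leading and trailing whitespace from every line
   print trim($_),"\n";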
 }

 close(IN) || die "can't close $txtfile : $!";

 sub trim
 {
  $string = $_[0];
  $string =~ s/^\s+//;
  $string =~ s/\s+$//;
  return $string;
 }
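
The commented-out passes above pick the audio path and the entry link out by fixed offsets. As a rough alternative, the Python sketch below does both in one pass with regular expressions. It assumes the markup calls playAudio('...') and PopUp('...') with a quoted first argument, which is only inferred from the offsets in the Perl code. The function name extract_links is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed markup: playAudio('...') and PopUp('...') with a quoted first argument.
AUDIO_RE = re.compile(r"""playAudio\(\s*['"]([^'"]+)['"]""")
LINK_RE = re.compile(r"""PopUp\(\s*['"]([^'"]+)['"]""")


def extract_links(html: str) -> Tuple[List[str], List[str]]:
    """Return (audio_paths, entry_links), each in document order."""
    audio = [match.strip() for match in AUDIO_RE.findall(html)]
    links = [match.strip() for match in LINK_RE.findall(html)]
    return audio, links
```

Writing the first list to file.url and the second to sen.url, one item per line, would give the two files used in steps 3 and 4. Matching up to the closing quote avoids the fixed lengths (43 and 40) that the Perl version depends on.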
3. Step 2 above yields file.url.
 Use the Linux command wget to download the audio files in one batch.
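
With wget, the batch download can be done by passing the list with -i (for example, wget -i file.url). Where wget is not available, something like the Python sketch below could stand in. It assumes file.url holds one absolute URL per line, which the post does not state. The helper names are hypothetical, and the downloader parameter exists only so the loop can be tried offline.

```python
import os
import posixpath
from typing import Callable, Iterable, List
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def read_url_list(path: str) -> List[str]:
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def local_name(url: str) -> str:
    """Use the last path segment of the URL as the local file name."""
    return posixpath.basename(urlsplit(url).path)


def download_all(urls: Iterable[str], dest_dir: str,
                 downloader: Callable[[str, str], object] = urlretrieve) -> List[str]:
    """Download every URL into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        if not os.path.exists(target):  # skip files fetched on an earlier run
            downloader(url, target)
        saved.append(target)
    return saved
```

A call such as download_all(read_url_list("file.url"), "wav") would fetch the whole list. Skipping files that already exist makes it safe to rerun after an interrupted download.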
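4. Step 2 above also yields sen.url. The following program reads it to fetch each full entry and generate a database import file
 <?php
  $fpr = fopen("sen.url","r"); // links to each full entry, from step 2
  $fp = fopen("rdata.sql","a+"); // the generated database import file
  $fp2 = fopen("rdata2.txt","a+"); // audio file names, mapped to each record's database ID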

  $patterns[0] = '/ +/'; // collapse runs of spaces into a single space
  $patterns[1] = '/\r/';
  $patterns[2] = '/\n/';
  $patterns[3] = '/&nbsp;/'; // assumed to be the &nbsp; entity; a bare '/ /' would delete every space
  $replaces[0] = ' ';
  $replaces[1] = '';
  $replaces[2] = '';
  $replaces[3] = '';
  $lid = 817; // there are 816 records in total, so start one above and count down
  while ($r = fscanf($fpr, "%s\n")) {
    list ($url) = $r;
    $html = implode('', file($url)); // fetch the daily-sentence page for this line

    //Extract the title
    $html = strstr($html,"");
    $start = 0;
    $end = strpos($html,"");
    $len = $end - $start;
    $head = addslashes(preg_replace($patterns,
      $replaces,trim(strip_tags(substr($html,$start,$len)))));

    //Extract the audio file name
    $html = strstr($html,"db_dailysentence");
    $start = strpos($html,"db_dailysentence") + 25;
    $end = strpos($html,".wav") + 4;
    $len = $end - $start;
    $file = substr($html,$start,$len);

    //Extract the publication date
    $date = substr($file, 4,10);

    //Extract the English text
    $html = strstr($html,"");
    $start = 0;
    $end = strpos($html,"
");
    $len = $end - $start;
    $english = addslashes(preg_replace($patterns,$replaces,
      trim(strip_tags(substr($html,$start,$len)))));

    //Extract the Chinese text
    $html = substr($html,$len);
    $start = 0;
    $end = strpos($html,"");
    $len = $end - $start;
    $chinese = addslashes(preg_replace($patterns,$replaces,
      trim(strip_tags(substr($html,$start,$len)))));

    //Extract the explanation of the sentence
    $html = strstr($html,"");
    $start = 16;
    $end = strpos($html,"");
    $len = $end - $start;
    $tail = addslashes(preg_replace($patterns,$replaces,
        trim(strip_tags(substr($html,$start,$len)))));

    $lid--; //database ID, counting down

    //Build the database import statement
    $out = "INSERT INTO bm_learning VALUES($lid,1,'$head

$english
$chinese

$tail

資料來源 : LiveABC互動英語教學集團','$date');\n";

    fwrite($fp, $out); //write to rdata.sql
    $out2 = $lid.",".$file."\n";
    fwrite($fp2, $out2); // write to rdata2.txt to map database IDs to audio file names
  }

  fclose($fpr);
  fclose($fp);
  fclose($fp2);
 ?>
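
Each field in step 4 is cut out the same way: find a start marker, find the next end marker, then strip tags and tidy the whitespace. The marker strings in the listing above are empty, so the ones used in the Python sketch below are purely hypothetical, as are the helper names between and clean. It is meant only to illustrate the pattern, not to reproduce the original script.

```python
import re
from typing import Optional


def between(text: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the text between the first start_marker and the next end_marker."""
    begin = text.find(start_marker)
    if begin < 0:
        return None
    begin += len(start_marker)
    end = text.find(end_marker, begin)
    if end < 0:
        return None
    return text[begin:end]


def clean(fragment: str) -> str:
    """Drop HTML tags, turn &nbsp; into a space, and collapse all whitespace."""
    no_tags = re.sub(r"<[^>]*>", "", fragment)
    no_tags = no_tags.replace("&nbsp;", " ")
    return " ".join(no_tags.split())


# Hypothetical markers, for illustration only.
page = '<td class="title">  Hello&nbsp;<b>world</b>\r\n </td><p>rest</p>'
title = clean(between(page, '<td class="title">', "</td>") or "")
print(title)
```

Returning None when a marker is missing makes a layout change on the site easy to spot, whereas strpos() returns false, which is then treated as 0 in the length arithmetic.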
 5. These notes were written after the fact, following many trial runs, so treat the steps above as a reference only
