IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    [原]PHP读取doc,docx,xls,pdf,txt内容

    zhoubl668发表于 2016-12-19 16:43:38
    love 0

    我的一个客户有这样的需求:上传文件,可以是doc,docx,xls,pdf,txt格式,现需要用php读取这些文件的内容,然后计算文件里面字数.

    1.PHP读取DOC格式的文件

          PHP没有自带读取word文件的类,或者是库,这里我们使用antiword(http://www.winfield.demon.nl/)这个包来读取doc文件.

         首先介绍一下如何在windows下使用:

          1.打开http://www.winfield.demon.nl/(antiword下载页面),找到对应的windows版本(http://www.winfield.demon.nl/#Windows),下载antiword windows版本(antiword-0_37-windows.zip);

          2.将下载下来的文件解压到C盘根目录下;

    这里还有一点需要注意的:http://www.informatik.uni-frankfurt.de/~markus/antiword/00README.WIN这个连接里有windows下安装的说明文件.

      需要设置环境变量,我的电脑(右键)->高级->环境变量->在上面的用户变量里新建一个

      变量名:HOME

      变量值:c:\home这个目录应该是存在的,如果不存在就在C盘下创建一个home文件夹.

    然后在系统变量,修改Path,在Path变量的值最前面加上%HOME%\antiword.

     

          3.开始->运行->CMD 进入到antiword目录;

          输入 antiword -h 看看效果.

     

       4.然后我们使用antiword –t 命令读取一下doc文件内容;首先复制一个doc文件到c:\antiword目录,然后执行

       >antiword –t 文件名.doc

       就可以看到屏幕上输出word文件的内容了.

    可能你会问了,这和PHP读取word有什么关系呢?呵呵,别急,我们来看看如何在PHP里使用这个命令.

      <?php

      $file = “D:\xampp\htdocs\word_count\uploads\doc-english.doc”;

       $content = shell_exec(“c:\antiword\antiword –f $file”);

    ?>

    这样就把word里面的内容读取content里面了.

    至于如何在Linux下读取doc文件内容,就是下载linux版本的压缩包,里面有readme.txt文件,按照那种方式安装就可以了.

    $content = shell_exec ( "/usr/local/bin/antiword -f $file" );

    2.PHP读取PDF文件内容

        php也没有专门用来读取pdf内容的类库.这样我们采用第三方包(xpdf).还是先做windows下的操作,下载,将其解压到C盘根目录下.

       开始->运行->cmd->cd /d c:\xpdf 
    <?php

       $file = “D:\xampp\htdocs\word_count\uploads\pdf-english.pdf”;

        $content = shell_exec ( "c:\\xpdf\\pdftotext $file -" );

       ?>

    这样就可以把pdf文件的内容读取到php变量里了.

    Linux下的安装方法也很简单这里就不在一一列出

    <?php

    $content = shell_exec ( "/usr/bin/pdftotext $file -" );

    ?>

    3.PHP读取ZIP文件内容

    首先使用PHP zip解压zip文件,然后读取解压包里的文件,如果是word就采用antiword读取,如果是pdf就使用xpdf读取.

    <?php

    /** 
    * Read ZIP valid file 
    * 
    * @param string $file file path 
    * @return string total valid content 
    */ 
    function ReadZIPFile($file = '') { 
        $content = ""; 
        $inValidFileName = array (); 
        $zip = new ZipArchive ( ); 
        if ($zip->open ( $file ) === TR ) { 
            for($i = 0; $i < $zip->numFiles; $i ++) { 
                $entry = $zip->getNameIndex ( $i ); 
                if (preg_match ( '#\.(txt)|\.(doc)|\.(docx)|\.(pdf)$#i', $entry )) { 
                    $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array ( 
                            $entry 
                    ) ); 
                    $content .= CheckSystemOS ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry ); 
                } else { 
                    $inValidFileName [$i] = $entry; 
                } 
            } 
            $zip->close (); 
            rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) ); 
            /*if (file_exists ( $file )) { 
                unlink ( $file ); 
            }*/ 
            return $content; 
        } else { 
            return ""; 
        } 
    }

    ?>

    4.PHP读取DOCX文件内容

      docx文件其实是由很多XML文件组成,其中内容就存在于word/document.xml里面.

       我们找到一个docx文件,使用zip文件打开(或者把docx后缀名改为zip,然后解压)

     

    在word目录下有document.xml

    docx文件的内容就存在于document.xml里面,我们读取这个文件就可以了.

    <?php

    /** 
    * Read Docx File 
    * 
    * @param string $file filepath 
    * @return string file content 
    */ 
    function parseWord($file) { 
        $content = ""; 
        $zip = new ZipArchive ( ); 
        if ($zip->open ( $file ) === tr ) { 
            for($i = 0; $i < $zip->numFiles; $i ++) { 
                $entry = $zip->getNameIndex ( $i ); 
                if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") { 
                    $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array ( 
                            $entry 
                    ) ); 
                    $filepath = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry; 
                    $content = strip_tags ( file_get_contents ( $filepath ) ); 
                    break; 
                } 
            } 
            $zip->close (); 
            rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) ); 
            return $content; 
        } else { 
            return ""; 
        } 
    }

    ?>

    如果想要通过PHP创建docx文件,或者是把docx文件转为xhtml,pdf可以使用phpdocx,(http://www.phpdocx.com/)

     

    5.PHP读TXT

    直接使用PHP file_get_content函数就可以了.

    <?php

    $file = “D:\xampp\htdocs\word_count\uploads\eng.txt”;

    $content = file_get_content($file);

    ?>

    6.PHP读EXCEL

    http://phpexcel.codeplex.com/

     

    现在只是读取文件内容了,怎么计算单词的个数呢?

    PHP有一个自带的函数,str_word_count,这个函数可以计算出单词的个数,但是如果要计算antiword读取出来的doc文件的单词个数就会很大的误差.

    这里我们使用以下这个函数专门用来读取单词个数 
    <?php

    /** 
    * statistic word count 
    * 
    * @param string $content word content of the file 
    * @return int word count of the content 
    */ 
    function StatisticWordsCount($text = '') { 
        //    $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // remove extra spaces 
        $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more) 
        //    $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more) 
        $text = trim ( preg_replace ( '/\s+/', ' ', $text ) ); // remove extra spaces 
        $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row 
        $len = strlen ( $text ); 
        if (0 === $len) { 
            return 0; 
        } 
        $words = 1; 
        while ( $len -- ) { 
            if (' ' === $text [$len]) { 
                ++ $words; 
            } 
        } 
        return $words; 
    }

    ?>

    详细的代码如下:

    <?php 
    /** 
    * check system operation win or linux 
    * 
    * @param string $file contain file path and file name 
    * @return file content 
    */ 
    function CheckSystemOS($file = '') { 
        $content = ""; 
        //    $type = s str ( $file, strrpos ( $file, '.' ) + 1 ); 
        $type = pathinfo ( $file, PATHINFO_EXTENSION ); 
        //    global $UNIX_ANTIWORD_PATH, $UNIX_XPDF_PATH; 
        if (strtoupper ( s str ( PHP_OS, 0, 3 ) ) === 'WIN') { //this is a server using windows 
            switch (strtolower ( $type )) { 
                case 'doc' : 
                    $content = shell_exec ( "c:\\antiword\\antiword -f $file" ); 
                    break; 
                case 'docx' : 
                    $content = parseWord ( $file ); 
                    break; 
                case 'pdf' : 
                    $content = shell_exec ( "c:\\xpdf\\pdftotext $file -" ); 
                    break; 
                case 'zip' : 
                    $content = ReadZIPFile ( $file ); 
                    break; 
                case 'txt' : 
                    $content = file_get_contents ( $file ); 
                    break; 
            } 
        } else { //this is a server not using windows 
            switch (strtolower ( $type )) { 
                case 'doc' : 
                    $content = shell_exec ( "/usr/local/bin/antiword -f $file" ); 
                    break; 
                case 'docx' : 
                    $content = parseWord ( $file ); 
                    break; 
                case 'pdf' : 
                    $content = shell_exec ( "/usr/bin/pdftotext $file -" ); 
                    break; 
                case 'zip' : 
                    $content = ReadZIPFile ( $file ); 
                    break; 
                case 'txt' : 
                    $content = file_get_contents ( $file ); 
                    break; 
            } 
        } 
        /*if (file_exists ( $file )) { 
            @unlink ( $file ); 
        }*/ 
        return $content; 
    }

    /** 
    * statistic word count 
    * 
    * @param string $content word content of the file 
    * @return int word count of the content 
    */ 
    function StatisticWordsCount($text = '') { 
        //    $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // remove extra spaces 
        $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more) 
        //    $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more) 
        $text = trim ( preg_replace ( '/\s+/', ' ', $text ) ); // remove extra spaces 
        $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row 
        $len = strlen ( $text ); 
        if (0 === $len) { 
            return 0; 
        } 
        $words = 1; 
        while ( $len -- ) { 
            if (' ' === $text [$len]) { 
                ++ $words; 
            } 
        } 
        return $words; 
    }

    /** 
    * Read Docx File 
    * 
    * @param string $file filepath 
    * @return string file content 
    */ 
    function parseWord($file) { 
        $content = ""; 
        $zip = new ZipArchive ( ); 
        if ($zip->open ( $file ) === tr ) { 
            for($i = 0; $i < $zip->numFiles; $i ++) { 
                $entry = $zip->getNameIndex ( $i ); 
                if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") { 
                    $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array ( 
                            $entry 
                    ) ); 
                    $filepath = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry; 
                    $content = strip_tags ( file_get_contents ( $filepath ) ); 
                    break; 
                } 
            } 
            $zip->close (); 
            rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) ); 
            return $content; 
        } else { 
            return ""; 
        } 
    }

    /** 
    * Read ZIP valid file 
    * 
    * @param string $file file path 
    * @return string total valid content 
    */ 
    function ReadZIPFile($file = '') { 
        $content = ""; 
        $inValidFileName = array (); 
        $zip = new ZipArchive ( ); 
        if ($zip->open ( $file ) === TR ) { 
            for($i = 0; $i < $zip->numFiles; $i ++) { 
                $entry = $zip->getNameIndex ( $i ); 
                if (preg_match ( '#\.(txt)|\.(doc)|\.(docx)|\.(pdf)$#i', $entry )) { 
                    $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array ( 
                            $entry 
                    ) ); 
                    $content .= CheckSystemOS ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry ); 
                } else { 
                    $inValidFileName [$i] = $entry; 
                } 
            } 
            $zip->close (); 
            rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) ); 
            /*if (file_exists ( $file )) { 
                unlink ( $file ); 
            }*/ 
            return $content; 
        } else { 
            return ""; 
        } 
    }

    /** 
    * remove directory 
    * 
    * @param string $dir path dir 
    */ 
    function rrmdir($dir) { 
        if (is_dir ( $dir )) { 
            $objects = scandir ( $dir ); 
            foreach ( $objects as $object ) { 
                if ($object != "." && $object != "..") { 
                    if (filetype ( $dir . "/" . $object ) == "dir") { 
                        rrmdir ( $dir . "/" . $object ); 
                    } else { 
                        unlink ( $dir . "/" . $object ); 
                    } 
                } 
            } 
            reset ( $objects ); 
            rmdir ( $dir ); 
        } 
    }

     

     

    //调用方法

    $file = “D:\xampp\htdocs\word_count\uploads\pdf-german.zip”;

    $word_number = StatisticWordsCount ( CheckSystemOS ( $file) );

    ?>





    http://www.it300.com/article-15290.html



沪ICP备19023445号-2号
友情链接