Reading the text and images in a Word document with PHP, and saving them

I. Install PHPWord with Composer

composer require phpoffice/phpword

Package page: https://packagist.org/packages/phpoffice/phpword

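II. Reading a docx document with PHPWord (note: docx format only; doc format does not work)

If your file is in doc format, simply save it as a docx. If you have many doc files, you can download a batch conversion tool: http://www.batchwork.com/en/doc2doc/download.htm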

If you have not set up autoloading yet, do that first:

require './vendor/autoload.php';

Load the document:

$dir = str_replace('\\', '/', __DIR__) . '/';
$source = $dir . 'test.docx';
$phpWord = \PhpOffice\PhpWord\IOFactory::load($source);

III. Key points

1) Alignment: \PhpOffice\PhpWord\Style\Paragraph -> getAlignment()

2) Font name: \PhpOffice\PhpWord\Style\Font -> getName()

3) Font size: \PhpOffice\PhpWord\Style\Font -> getSize()

4) Bold or not: \PhpOffice\PhpWord\Style\Font -> isBold()

5) Reading an image: \PhpOffice\PhpWord\Element\Image -> getImageStringData()

6) Saving base64 image data as an image file: file_put_contents($imageSrc, base64_decode($imageData))
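
Point 6 can be wrapped in a small function. The sketch below is only an illustration: the name saveBase64Image and its parameters are hypothetical and not part of PHPWord. It assumes the data may carry a "data:...;base64," prefix or line breaks, strips both before decoding, and creates the target directory if it is missing.

```php
<?php
// Hypothetical helper (not part of PHPWord): decode base64 image data
// and write the resulting bytes to $targetPath.
function saveBase64Image(string $imageData, string $targetPath): bool
{
    // Drop an optional data-URI prefix such as "data:image/png;base64,"
    $commaPos = strpos($imageData, ',');
    if (strncmp($imageData, 'data:', 5) === 0 && $commaPos !== false) {
        $imageData = substr($imageData, $commaPos + 1);
    }

    // Remove line breaks and spaces, then decode strictly
    $binary = base64_decode(preg_replace('/\s+/', '', $imageData), true);
    if ($binary === false) {
        return false; // not valid base64
    }

    // Create the target directory when it does not exist yet
    $dir = dirname($targetPath);
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        return false;
    }

    return file_put_contents($targetPath, $binary) !== false;
}
```

With an Image element in $ele, a call could look like saveBase64Image($ele->getImageStringData(true), 'images/a.png'), where the file name is a placeholder.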

IV. Full code

 require './vendor/autoload.php';

 function docx2html($source)
 {
     $phpWord = \PhpOffice\PhpWord\IOFactory::load($source);
     $html = '';
     foreach ($phpWord->getSections() as $section) {
         foreach ($section->getElements() as $ele1) {
             // Not every element type has a paragraph style (tables, page breaks, ...)
             $paragraphStyle = method_exists($ele1, 'getParagraphStyle') ? $ele1->getParagraphStyle() : null;
             if ($paragraphStyle instanceof \PhpOffice\PhpWord\Style\Paragraph && $paragraphStyle->getAlignment()) {
                 $html .= '<p style="text-align:' . $paragraphStyle->getAlignment() . ';">';
             } else {
                 $html .= '<p>';
             }
             if ($ele1 instanceof \PhpOffice\PhpWord\Element\TextRun) {
                 foreach ($ele1->getElements() as $ele2) {
                     if ($ele2 instanceof \PhpOffice\PhpWord\Element\Text) {
                         $style = $ele2->getFontStyle();
                         // The HTML is built in GBK and converted back to UTF-8 on return
                         $fontFamily = mb_convert_encoding((string) $style->getName(), 'GBK', 'UTF-8');
                         $fontSize = $style->getSize();
                         $isBold = $style->isBold();
                         $styleString = '';
                         $fontFamily && $styleString .= "font-family:{$fontFamily};";
                         // PHPWord reports font sizes in points
                         $fontSize && $styleString .= "font-size:{$fontSize}pt;";
                         $isBold && $styleString .= "font-weight:bold;";
                         $html .= sprintf('<span style="%s">%s</span>',
                             $styleString,
                             mb_convert_encoding($ele2->getText(), 'GBK', 'UTF-8')
                         );
                     } elseif ($ele2 instanceof \PhpOffice\PhpWord\Element\Image) {
                         // The images/ directory must already exist and be writable
                         $imageSrc = 'images/' . md5($ele2->getSource()) . '.' . $ele2->getImageExtension();
                         $imageData = $ele2->getImageStringData(true);
                         // $imageData = 'data:' . $ele2->getImageType() . ';base64,' . $imageData;
                         file_put_contents($imageSrc, base64_decode($imageData));
                         $html .= '<img src="' . $imageSrc . '">';
                     }
                 }
             }
             $html .= '</p>';
         }
     }
     return mb_convert_encoding($html, 'UTF-8', 'GBK');
 }

 $dir = str_replace('\\', '/', __DIR__) . '/';
 $source = $dir . 'test.docx';
 echo docx2html($source);

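V. Additional notes

Clearly, this is a bare-bones example of reading a Word document. It reads only the paragraph alignment, the font name, size, and bold flag of the text, and the images; everything else, such as text color and line height, is ignored. If you need more, look through the PHPWord source and check which getter methods the classes under \PhpOffice\PhpWord\Style\xxx and \PhpOffice\PhpWord\Element\xxx provide.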
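
As one example of picking up an extra property, the sketch below folds the text color into the inline style string. The function fontToCss and its parameters are hypothetical. It assumes the values come from the getters getName(), getSize(), isBold(), and getColor() of \PhpOffice\PhpWord\Style\Font, and that the color arrives as a six-digit hex string without a leading "#"; any other value (for example "auto") is skipped.

```php
<?php
// Hypothetical helper: build an inline CSS string from font properties.
// The arguments are assumed to come from \PhpOffice\PhpWord\Style\Font getters.
function fontToCss(?string $name, $size, bool $bold, ?string $color): string
{
    $css = '';
    if ($name) {
        $css .= "font-family:{$name};";
    }
    if ($size) {
        // PHPWord font sizes are expressed in points
        $css .= "font-size:{$size}pt;";
    }
    if ($bold) {
        $css .= 'font-weight:bold;';
    }
    // Accept only a plain 6-digit hex color; ignore values such as "auto"
    if ($color !== null && preg_match('/^[0-9A-Fa-f]{6}$/', $color)) {
        $css .= "color:#{$color};";
    }
    return $css;
}
```

Inside the loop of the full code, a call could look like fontToCss($style->getName(), $style->getSize(), (bool) $style->isBold(), $style->getColor()).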

VI. Addendum, 2020-07-21

The complete HTML can be obtained directly like this:

$phpWord = \PhpOffice\PhpWord\IOFactory::load('xxx.docx');
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, "HTML");
$html = $xmlWriter->getContent();

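Note: the HTML includes the head section, so if you only need the style and body you have to extract them yourself. The images are embedded as base64, so if you want to save them as files you also have to handle that yourself.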
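
One way to pull out just the style and body is a pair of regular expressions over the generated string. This is a minimal sketch: the function name extractStyleAndBody is hypothetical, and it assumes the HTML contains at most one style element and one body element.

```php
<?php
// Hypothetical helper: return the contents of the first <style> and <body>
// elements found in a complete HTML document string.
function extractStyleAndBody(string $html): array
{
    $style = '';
    if (preg_match('~<style\b[^>]*>(.*?)</style>~is', $html, $m)) {
        $style = trim($m[1]);
    }

    $body = '';
    if (preg_match('~<body\b[^>]*>(.*?)</body>~is', $html, $m)) {
        $body = trim($m[1]);
    }

    return ['style' => $style, 'body' => $body];
}
```

Calling extractStyleAndBody($html) on the writer output gives an array with the keys style and body, each an empty string when the element is not found.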

For saving base64 data as an image file, see the code above.

If you only want the content of body, see the write method in \PhpOffice\PhpWord\Writer\HTML\Part\Body:

$phpWord = \PhpOffice\PhpWord\IOFactory::load('xxxx.docx');
$htmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, "HTML");
$content = '';
foreach ($phpWord->getSections() as $section) {
    $writer = new \PhpOffice\PhpWord\Writer\HTML\Element\Container($htmlWriter, $section);
    $content .= $writer->write();
}
echo $content;
exit;

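As for the images, for now there is no good way to handle them without modifying the library source. If you do modify it, the relevant code is in \PhpOffice\PhpWord\Writer\HTML\Element\Image: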

public function write()
 {
     if (!$this->element instanceof ImageElement) {
         return '';
     }
     $content = '';
     $imageData = $this->element->getImageStringData(true);
     if ($imageData !== null) {
         $styleWriter = new ImageStyleWriter($this->element->getStyle());
         $style = $styleWriter->write();
         // $imageData = 'data:' . $this->element->getImageType() . ';base64,' . $imageData;
         $imageSrc = 'images/' . md5($this->element->getSource()) . '.' . $this->element->getImageExtension();
         // Handle the image however you like here, e.g. upload it to OSS
         file_put_contents($imageSrc, base64_decode($imageData));
         $content .= $this->writeOpening();
         $content .= "<img border=\"0\" style=\"{$style}\" src=\"{$imageSrc}\"/>";
         $content .= $this->writeClosing();
     }
     return $content;
 }
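
If patching the library is undesirable, one possible alternative (not from the original post) is to post-process the generated HTML string and replace each base64 data URI with a file on disk. The sketch below is an assumption-laden illustration. The function name externalizeBase64Images is hypothetical, and it assumes the images appear as src="data:image/<type>;base64,..." in double quotes, as in the commented-out line of the code above. Very large images may hit PCRE limits, in which case the HTML is returned unchanged.

```php
<?php
// Hypothetical helper: save every base64-embedded image in $html under
// $imageDir and point the src attribute at the saved file instead.
function externalizeBase64Images(string $html, string $imageDir): string
{
    if (!is_dir($imageDir) && !mkdir($imageDir, 0777, true) && !is_dir($imageDir)) {
        throw new RuntimeException("Cannot create directory: {$imageDir}");
    }

    $pattern = '~src="data:image/([a-zA-Z0-9]+);base64,([^"]++)"~';
    $result = preg_replace_callback($pattern, function (array $m) use ($imageDir) {
        $binary = base64_decode(preg_replace('/\s+/', '', $m[2]), true);
        if ($binary === false) {
            return $m[0]; // leave an undecodable image untouched
        }
        $ext = strtolower($m[1]) === 'jpeg' ? 'jpg' : strtolower($m[1]);
        // Name the file after a hash of its bytes so duplicates share one file
        $path = rtrim($imageDir, '/') . '/' . md5($binary) . '.' . $ext;
        file_put_contents($path, $binary); // or upload to OSS here
        return 'src="' . $path . '"';
    }, $html);

    // preg_replace_callback returns null on a PCRE error
    return $result ?? $html;
}
```

For example, externalizeBase64Images($content, 'images') could be applied to the $content string built from the sections above.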