PHP examples: extracting specific page content with regular expressions

Author: 简简单单, 2014-06-30

The example code below is often useful for content scraping.

1. Get the page title

// Extract the title
preg_match('/<title[^>]*>(?<title>.*?)<\/title>/is', $html, $titleArr);
$title = isset($titleArr['title']) ? $titleArr['title'] : '';

2. Get the body content and replace image paths, including CSS background images, with another image URL

/**
 * Get the content of the BODY region
 * @param $html
 * @param $urlRoot
 * @return mixed
 */
function getBody($html, $urlRoot = null){
    // Extract the BODY region: prefer the <!--body--> markers, fall back to the <body> tag
    preg_match('/<!--body-->(.*?)<!--body-->/is', $html, $bodyArr);
    if(!$bodyArr){
        preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $bodyArr);
    }
    // Neither pattern matched: use an empty string
    $body = isset($bodyArr[1]) ? $bodyArr[1] : '';
    // Rewrite <img> src paths that point into the img folder
    $body = preg_replace('/(<img\b[^>]*?\bsrc=[\'"])(\.\.\/)*(img\/[^\'"]+)/i', '${1}' . $urlRoot . '${3}', $body);
    // Rewrite CSS background images inside the HTML
    $body = preg_replace('~\b(background(-image)?\s*:(.*?)\(\s*[\'"]?)(\.\.\/)*(img\/.*?)\s*\)~i', '${1}' . $urlRoot . '${5})', $body);
    return $body;
}

3. Extract the page description

function getDescription($html){
    // Get the 'content' attribute value in a <meta name="description" ... />
    $matches = array();

    // Search for <meta name="description" content="Buy my stuff" />
    // The backreferences (\1, \2) require the closing quote to match the opening one
    if (preg_match('/<meta\b[^>]*?\bname=(["\'])description\1[^>]*?\bcontent=(["\'])((?:(?!\2).)*)\2/is', $html, $matches)) {
        return trim($matches[3]);
    }

    // Order of attributes could be swapped around: <meta content="Buy my stuff" name="description" />
    if (preg_match('/<meta\b[^>]*?\bcontent=(["\'])((?:(?!\1).)*)\1[^>]*?\bname=(["\'])description\3/is', $html, $matches)) {
        return trim($matches[2]);
    }

    // No match
    return null;
}

4. Replace background-image URLs in a CSS file

/**
 * Get the CSS content
 * @param $cssCnt
 * @param $urlRoot
 * @return mixed
 */
function getCss($cssCnt, $urlRoot = null){
    // Match relative-path images under the img folder (absolute paths are not matched)
    // The replacement is not always accurate: it only turns paths containing ../ into a URL
    // and ignores how many levels (../../ and so on) the path climbs
    $css = preg_replace('~\b(background(-image)?\s*:(.*?)\(\s*[\'"]?)(\.\.\/)*(img\/.*?)\s*\)~i', '${1}' . $urlRoot . '${5})', $cssCnt);
    // Add a CSS prefix; operate on $css so the replacement above is not discarded
    // Rough heuristic: it also matches commas inside declarations, e.g. rgba(0,0,0)
    $css = preg_replace('/\b.(.*?)[,{]/', 'pat .$0', $css);
    // TODO: minify the CSS
    return $css;
}

As the examples above show, the approach is simple: pick tags that follow a regular pattern as the start and end markers, and the text captured between those two markers is the content we want to extract. The only extra work is handling the characters and whitespace that appear in between.
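The start-marker and end-marker idea can be sketched as one small reusable helper. This is only a minimal sketch and not part of the original article: the function name extractBetween and the sample HTML are made up for illustration, and it assumes the markers appear literally in the page.

```php
<?php
/**
 * Hypothetical helper (not from the original article): return the text
 * between the first $start marker and the next $end marker, or null
 * when the markers are not found.
 */
function extractBetween($html, $start, $end)
{
    // preg_quote() escapes regex metacharacters in the markers,
    // including the '~' delimiter used for this pattern.
    $pattern = '~' . preg_quote($start, '~') . '(.*?)' . preg_quote($end, '~') . '~is';

    if (preg_match($pattern, $html, $m)) {
        // Trim the whitespace that usually surrounds the captured text.
        return trim($m[1]);
    }
    return null;
}

// Made-up sample page for illustration only.
$html = "<html><head><title> Demo page </title></head>"
      . "<body><!--body-->\n<p>Hello</p>\n<!--body--></body></html>";

var_dump(extractBetween($html, '<title>', '</title>'));        // string(9) "Demo page"
var_dump(extractBetween($html, '<!--body-->', '<!--body-->')); // string(12) "<p>Hello</p>"
var_dump(extractBetween($html, '<h1>', '</h1>'));              // NULL
```

The lazy quantifier (.*?) stops at the first end marker, and the s modifier lets the match span line breaks. For markup that is nested or irregular, a real HTML parser is likely to be more reliable than this kind of pattern.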