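Design of a Lucene-Based Java Web Search Engine

This is a simple search engine for PDF documents. The user supplies keywords, the engine returns the matching documents to the user interface, and each result is a hyperlink that opens the full content of the document.

Design Overview

First, BuildIndex creates an index over the documents. The keywords are then passed to Search, which returns the query results, and Highlighter marks the matching passages of the document content.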
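
To make the index-then-search flow concrete, here is a toy inverted index in plain Java. It only illustrates the idea and is not how Lucene stores its index; the class name ToyIndex, the simple tokenizer, and the sample file names are assumptions made for this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each lowercased term to the documents that contain it.
// Hypothetical illustration only; Lucene's real index is far more elaborate.
final class ToyIndex {
	private final Map<String, Set<String>> postings = new HashMap<>();

	// "Index" step: split the text on non-letter/digit characters and record the path per term.
	void add(String path, String text) {
		for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
			if (!term.isEmpty()) {
				postings.computeIfAbsent(term, k -> new TreeSet<>()).add(path);
			}
		}
	}

	// "Search" step: look the keyword up and return the matching paths in sorted order.
	Set<String> search(String keyword) {
		Set<String> hits = postings.get(keyword.toLowerCase(Locale.ROOT));
		return hits == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(hits);
	}

	public static void main(String[] args) {
		ToyIndex index = new ToyIndex();
		index.add("a.pdf", "Lucene is a search library");
		index.add("b.pdf", "PDFBox extracts text from PDF files");
		index.add("c.pdf", "Search engines build an index first");
		System.out.println(index.search("search")); // prints [a.pdf, c.pdf]
		System.out.println(index.search("pdf"));    // prints [b.pdf]
	}
}
```

The real project delegates both steps to Lucene: IndexWriter plays the role of add, and IndexSearcher the role of search.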

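JAR Dependencies

The code in this document uses classes from Lucene 5.x (the core, analyzers-common, queryparser, and highlighter modules), IK Analyzer, Apache PDFBox, and the Java Servlet API.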

Building the Index (BuildIndex.java)

Building the index means converting each document into a Document object and then calling IndexWriter's addDocument method to add that object to the index.

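import java.io.File;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public static void run(String sdir, String indexPath) throws Exception {
	File[] flist = new File(sdir).listFiles();
	if (flist == null) { // sdir does not exist or is not a directory
		throw new IllegalArgumentException("Not a readable directory: " + sdir);
	}
	Analyzer analyzer = new IKAnalyzer();
	IndexWriterConfig conf = new IndexWriterConfig(analyzer);
	// CREATE discards any existing index at indexPath and builds a new one
	conf.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
	// try-with-resources closes the writer and the directory even if indexing fails
	try (Directory dir = FSDirectory.open(Paths.get(indexPath));
			IndexWriter idxWriter = new IndexWriter(dir, conf)) {
		for (File f : flist) {
			idxWriter.addDocument(buildDoc(f));
		}
	}
}

/**
 * Builds a Document object for a single file.
 */
private static Document buildDoc(File f) throws Exception {
	String fname = f.getPath();
	Document doc = new Document();
	// Store offsets in the postings so the contents field can be highlighted later
	FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
	offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
	// getContent is not shown here; it is expected to return the text of the file
	doc.add(new Field("contents", getContent(f), offsetsType));
	doc.add(new StringField("path", fname, Field.Store.YES));
	return doc;
}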
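
BuildIndex.run hands every entry of the source directory to buildDoc. A minimal sketch of a guard that keeps only regular files ending in .pdf is shown below, assuming the directory may also hold other files or subdirectories. PdfFiles, isPdf, and listPdfs are hypothetical names, not part of the original project.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: selects the PDF files of a directory before indexing.
final class PdfFiles {
	// True when the file name ends in .pdf, ignoring case.
	static boolean isPdf(String fileName) {
		return fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
	}

	// Returns the regular .pdf files directly inside dir; empty if dir cannot be listed.
	static List<File> listPdfs(File dir) {
		List<File> pdfs = new ArrayList<>();
		File[] entries = dir.listFiles(); // null when dir is missing or not a directory
		if (entries != null) {
			for (File f : entries) {
				if (f.isFile() && isPdf(f.getName())) {
					pdfs.add(f);
				}
			}
		}
		return pdfs;
	}

	public static void main(String[] args) {
		File dir = new File(args.length > 0 ? args[0] : ".");
		for (File f : listPdfs(dir)) {
			System.out.println(f.getPath());
		}
	}
}
```

The loop in run could then iterate over listPdfs(new File(sdir)) instead of the raw listFiles() result.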

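Extracting PDF Content (ReadPDF.java)

The content is extracted from the PDF documents with the PDFBox library.
Apache PDFBox is an open-source Java library for working with PDF documents: it can create new PDF documents, modify existing ones, and extract content from them.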

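import java.io.FileInputStream;
import org.apache.pdfbox.io.RandomAccessBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static String readPdf(String path) throws Exception {
	// try-with-resources closes the stream and the document even if parsing fails
	try (FileInputStream fis = new FileInputStream(path)) {
		PDFParser p = new PDFParser(new RandomAccessBuffer(fis));
		p.parse();
		try (PDDocument document = p.getPDDocument()) {
			PDFTextStripper ts = new PDFTextStripper();
			return ts.getText(document).trim(); // text content of the document
		}
	}
}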

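Keyword Search (Search.java)

Once the index is built, it can be used for keyword queries. Search creates an Analyzer and uses QueryParser to turn the keywords into a Query, so matching is done on analyzed terms rather than on the exact input string. The paths of the top 10 hits are returned.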

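import java.nio.file.Paths;
import java.util.ArrayList;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public static ArrayList<String> run(String indexPath, String queryVal) throws Exception {
	ArrayList<String> list = new ArrayList<String>();
	String field = "contents";
	// Errors propagate to the caller; System.exit would shut down the whole servlet container
	try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
		IndexSearcher is = new IndexSearcher(reader);
		// Note: the index is built with IKAnalyzer, so this analyzer may tokenize some input differently
		Analyzer analyzer = new StandardAnalyzer();
		QueryParser parser = new QueryParser(field, analyzer);
		Query query = parser.parse(queryVal);
		TopDocs docs = is.search(query, 10); // top 10 hits
		for (ScoreDoc sdoc : docs.scoreDocs) {
			Document doc = is.doc(sdoc.doc);
			list.add(doc.get("path"));
		}
	}
	return list;
}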
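
QueryParser treats characters such as +, -, (, ) and : as query syntax, so raw user input like C++ can make parser.parse throw a ParseException. Lucene provides the static method QueryParser.escape(String) for this case. The sketch below is a plain-Java stand-in that shows the idea without Lucene on the classpath; the class name QueryEscaper and the exact character set are assumptions for illustration.

```java
// Hypothetical stand-in for query escaping: prefixes query-syntax characters with a backslash.
final class QueryEscaper {
	// Characters treated here as query syntax (assumed set for this sketch).
	private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

	static String escape(String input) {
		StringBuilder sb = new StringBuilder(input.length() + 8);
		for (int i = 0; i < input.length(); i++) {
			char c = input.charAt(i);
			if (SPECIAL.indexOf(c) >= 0) {
				sb.append('\\'); // escape the special character
			}
			sb.append(c);
		}
		return sb.toString();
	}

	public static void main(String[] args) {
		System.out.println(escape("C++ (2016)")); // prints C\+\+ \(2016\)
	}
}
```

With Lucene available, parser.parse(QueryParser.escape(queryVal)) would be the equivalent call. The trade-off is that users can then no longer type query operators themselves.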

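Highlighting (Highlighter.java)

Highlighting is an important technique that search engines use routinely: marking the matched terms makes the results easy to scan at a glance.
The highlighter module of Lucene 5.x provides a simple highlighter, PostingsHighlighter. It reads term offsets from the postings, which is why BuildIndex indexes the contents field with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS. The steps are: create a PostingsHighlighter object, run the query to obtain the results, and request the highlighted snippets for those results.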

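import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
import org.apache.lucene.store.FSDirectory;

public static String[] run(String indexPath, String queryVal) throws Exception {
	String field = "contents";
	// Errors propagate to the caller; System.exit would shut down the whole servlet container
	try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
		IndexSearcher is = new IndexSearcher(reader);
		Analyzer analyzer = new StandardAnalyzer();
		QueryParser parser = new QueryParser(field, analyzer);
		Query query = parser.parse(queryVal);
		TopDocs docs = is.search(query, 10);
		PostingsHighlighter highlighter = new PostingsHighlighter();
		// One snippet per hit, in the same order as docs.scoreDocs, with up to 3 passages each
		return highlighter.highlight(field, query, is, docs, 3);
	}
}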
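
To show what a highlighted snippet looks like, the sketch below wraps each occurrence of a term in <b> tags with a regular expression. This is a naive stand-in and not the PostingsHighlighter algorithm, which works from the offsets stored in the index and selects passages; NaiveHighlighter is a hypothetical name used only here.

```java
import java.util.regex.Pattern;

// Naive illustration of highlighting: wraps every case-insensitive match of term in <b> tags.
final class NaiveHighlighter {
	static String highlight(String text, String term) {
		if (term.isEmpty()) {
			return text; // nothing to highlight
		}
		// Pattern.quote makes the term match literally, even if it contains regex characters
		Pattern p = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
		// $0 refers to the matched text, so the original casing is preserved
		return p.matcher(text).replaceAll("<b>$0</b>");
	}

	public static void main(String[] args) {
		String text = "Lucene indexes text. Searching with lucene is fast.";
		System.out.println(highlight(text, "lucene"));
		// prints <b>Lucene</b> indexes text. Searching with <b>lucene</b> is fast.
	}
}
```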

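User Interface

The user interface follows the B/S (browser/server) architecture, with the server side implemented as Java Servlets. The user enters the keywords in the browser and the server returns the matching documents. The client displays each result as a hyperlink showing the document path, together with the passages in which the keywords were found. Clicking a hyperlink shows the full content of the document.

Main Page (index.jsp)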

index.jsp provides the initial search page. It accepts the keywords the user wants to search for and passes them to the Result servlet for further processing.

<body>
<div style="height:150px;"></div>
<form action="Result" method="post">
	<div class="b_searchboxForm">	
		<input class="b_searchbox" id="sb_form_q" name="q" title="Enter search terms" value="" maxlength="100" type="search">
		<input class="b_searchboxSubmit" id="sb_form_go" title="Search" tabindex="0" name="go" value="Go" type="submit">
	</div>
</form>
</body>

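Search Results (Result.java)

The Result servlet reads the keywords with request.getParameter("q"), passes them to Highlighter and Search, and collects the snippets and paths they return. It then loops over the results and writes them to the response.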

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URLEncoder;
import java.util.ArrayList;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
	String sdir = "E:\\Program\\woskspace EE\\SearchBox\\www2016";
	String indexPath = "E:\\Program\\woskspace EE\\SearchBox\\index";
	// Decode the posted form as UTF-8 so that non-ASCII keywords survive
	request.setCharacterEncoding("UTF-8");
	String search = request.getParameter("q");
	String[] result = new String[0];
	ArrayList<String> path = new ArrayList<String>();
	try {
		// BuildIndex.run(sdir, indexPath); // only needed the first time the program runs
		result = Highlighter.run(indexPath, search);
		path = Search.run(indexPath, search);
	} catch (Exception e) {
		e.printStackTrace();
	}
	// The content type must be set before getWriter() for the charset to take effect
	response.setContentType("text/html;charset=UTF-8");
	PrintWriter out = response.getWriter();
	out.println("<html>");
	out.println("<body style=\"background-color: #ffc;\">");
	out.println("<div style=\"font-family: 'Segoe UI',Segoe,Tahoma,Arial,Verdana,sans-serif; font-size: 18px;\">");
	out.println("<form name=\"input\" action=\"Result\" method=\"post\">");
	out.println("<input type=\"text\" name=\"q\" id=\"s\" style=\"width:550px; max-height:40px; height:45px; margin-top:3px; border: 1px #e5e5e5 solid;\">");
	out.println("<input type=\"submit\" value=\"Go\" style=\"width:42px; height:42px; border: 1px #ccc solid;\">");
	out.println("</form>");
	out.println("<br><br>");
	out.println("</div>");
	out.println("<div>");
	// Paths and snippets come from the same query, so index i refers to the same hit in both
	for (int i = 0; i < path.size() && i < result.length; i++) {
		out.println("<p style=\"font-size:20px\">");
		// URL-encode the path so spaces and backslashes survive in the query string
		out.println("<a href='Document?path=" + URLEncoder.encode(path.get(i), "UTF-8") + "'>" + path.get(i) + "</a>");
		out.println("</p>");
		out.println("<p style=\"font-size:18px\">" + result[i] + "</p>");
		out.println("<br>");
	}
	out.println("</div>");
	out.println("</body>");
	out.println("</html>");
	out.flush();
	out.close();
}
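
File paths can contain characters that are significant in URLs and in HTML, such as spaces, backslashes, & and <. A hedged sketch of building one result entry follows, with the path URL-encoded inside the link and HTML-escaped in the visible text. ResultHtml, escapeHtml, and renderItem are hypothetical helpers, and the snippet is left unescaped on the assumption that it already carries the highlighter's <b> markup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that renders one search hit as HTML.
final class ResultHtml {
	// Replaces the five characters that are significant in HTML text and attributes.
	static String escapeHtml(String s) {
		StringBuilder sb = new StringBuilder(s.length() + 16);
		for (int i = 0; i < s.length(); i++) {
			char c = s.charAt(i);
			switch (c) {
				case '&': sb.append("&amp;"); break;
				case '<': sb.append("&lt;"); break;
				case '>': sb.append("&gt;"); break;
				case '"': sb.append("&quot;"); break;
				case '\'': sb.append("&#39;"); break;
				default: sb.append(c);
			}
		}
		return sb.toString();
	}

	// Builds the link and snippet paragraphs for one hit.
	static String renderItem(String path, String snippet) {
		String encoded;
		try {
			encoded = URLEncoder.encode(path, "UTF-8");
		} catch (UnsupportedEncodingException e) {
			throw new IllegalStateException(e); // UTF-8 is always supported
		}
		return "<p><a href=\"Document?path=" + encoded + "\">" + escapeHtml(path) + "</a></p>"
				+ "<p>" + (snippet == null ? "" : snippet) + "</p>";
	}

	public static void main(String[] args) {
		System.out.println(renderItem("E:\\docs\\a b.pdf", "a <b>search</b> engine"));
	}
}
```

The servlet container decodes the path parameter again when the Document servlet calls request.getParameter("path"), so no manual decoding is needed on the receiving side.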

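Displaying Document Content (Document.java)

The Document servlet reads the document path supplied by the Result servlet with request.getParameter("path"). It loads the file into a PDDocument object, reads the text with PDFTextStripper, obtains document information such as the title from PDDocumentInformation, and writes the content in reading order.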

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.text.PDFTextStripper;

protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
	// The content type must be set before getWriter() for the charset to take effect
	response.setContentType("text/html;charset=UTF-8");
	PrintWriter out = response.getWriter();
	String path = request.getParameter("path");
	out.println("<html>");
	out.println("<body style=\"background-color: #ffc;\">");
	out.println("<div style=\"font-family: 'Segoe UI',Segoe,Tahoma,Arial,Verdana,sans-serif; font-size: 18px;\">");
	out.println("<form name=\"input\" action=\"Result\" method=\"post\">");
	out.println("<input type=\"text\" name=\"q\" id=\"s\" style=\"width:550px; max-height:40px; height:45px; margin-top:3px; border: 1px #e5e5e5 solid;\">");
	out.println("<input type=\"submit\" value=\"Go\" style=\"width:42px; height:42px; border: 1px #ccc solid;\">");
	out.println("</form>");
	out.println("<br><br>");
	out.println("</div>");
	out.println("<div>");
	// try-with-resources closes the PDF even when loading or reading fails
	try (PDDocument document = PDDocument.load(new File(path))) {
		// Read the document information and the text content
		PDDocumentInformation info = document.getDocumentInformation();
		PDFTextStripper stripper = new PDFTextStripper();
		// Output the text in reading order
		stripper.setSortByPosition(true);
		String content = stripper.getText(document);
		String title = info.getTitle(); // null when the PDF has no title
		out.println(title == null ? "" : title);
		out.println("<br><br><br>");
		out.println(content);
	} catch (Exception e) {
		e.printStackTrace();
	}
	out.println("</div>");
	out.println("</body>");
	out.println("</html>");
	out.flush();
	out.close();
}
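
Because the Document servlet opens whatever file the path parameter names, a request could point it at files outside the document directory. One possible guard, sketched under the assumption that all PDFs live under a single root directory, normalizes both paths and checks containment. PathGuard and isInside are hypothetical names, and the check does not resolve symbolic links.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical guard: accepts a requested path only if it lies under the given root directory.
final class PathGuard {
	static boolean isInside(String root, String requested) {
		// normalize() removes "." and ".." segments so they cannot escape the root
		Path rootPath = Paths.get(root).toAbsolutePath().normalize();
		Path target = Paths.get(requested).toAbsolutePath().normalize();
		// Path.startsWith compares whole name elements, so "docs2" is not inside "docs"
		return target.startsWith(rootPath);
	}

	public static void main(String[] args) {
		System.out.println(isInside("docs", "docs/paper.pdf"));     // prints true
		System.out.println(isInside("docs", "docs/../secret.pdf")); // prints false
	}
}
```

In doGet, such a check would run before PDDocument.load, with the servlet returning an error page when it fails.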