Building a Local Search Engine Using Lucene

Apache Lucene (Lucene) is a library for full-text search. With its help we can build a local search engine. Lucene is very useful in fields that involve text processing, such as Information Retrieval and Data Mining. For more details, see Wikipedia: Lucene.

This article is a hands-on tutorial on using Lucene. The material is adapted, with a few modifications, from Lucene Tutorial. As the example case I provide a set of documents on the history of the kingdoms of the Indonesian archipelago (Nusantara); each file in the set represents one kingdom.

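REQUIREMENTS:

  • Linux operating system (because that is what I use)
  • Java (this tutorial originally asked for 1.6 or later; note that Lucene 4.8 and newer, including the 4.10.2 release used here, need Java 7 or later)
  • Lucene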

INSTALLATION STEPS:

Create a folder named “tutorial_lucene” inside your home directory (here /home/abdiansah):

  • $ cd /home/abdiansah (move to the home directory)
  • $ mkdir tutorial_lucene (create the folder)

Download the Lucene library (I use Lucene version 4.10.2) here:

  • Lucene
  • file to download: lucene-[VERSION].tgz (for example: lucene-4.10.2.tgz)

Place the file lucene-[VERSION].tgz in the following directory and extract it there (replace [VERSION] with the Lucene version you downloaded):

  • $ cd /home/abdiansah/tutorial_lucene
  • $ tar -xvzf lucene-[VERSION].tgz

The extracted lucene-[VERSION] folder contains many *.jar libraries, but this article uses only three of them:

  • lucene-core-[VERSION].jar (../lucene-4.10.2/core/)
  • lucene-analyzers-common-[VERSION].jar (../lucene-4.10.2/analysis/common/)
  • lucene-queryparser-[VERSION].jar (../lucene-4.10.2/queryparser/)

Copy these three files into the folder “/home/abdiansah/tutorial_lucene”. Then create a folder named “index” and a Java file named “TextFileIndexer.java”.

  • $ cd /home/abdiansah/tutorial_lucene/
  • $ mkdir index
  • $ cat > TextFileIndexer.java (type or paste the code, then press Ctrl+D to finish)

Download the documents, then extract them into the directory “../tutorial_lucene/”:

  • Documents
  • $ tar -xvzf doc.tar.tgz
  • After extraction there will be a “doc” folder inside the “tutorial_lucene” folder

JAVA CODE (TextFileIndexer.java):

  • You can retype the code, or
  • Download it here: code

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import java.io.*;
import java.util.ArrayList;

/**
 * This terminal application creates an Apache Lucene index in a folder and adds files into this index
 * based on the input of the user.
 */
public class TextFileIndexer {
  private static StandardAnalyzer analyzer = new StandardAnalyzer();
  private IndexWriter writer;
  // typed list, so that "for (File f : queue)" below compiles
  private ArrayList<File> queue = new ArrayList<File>();
  public static void main(String[] args) throws IOException {
    System.out.println("Enter the path where the index will be created: (e.g. /tmp/index or c:\\temp\\index)");
    String indexLocation = null;
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
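    // Hard-coded instead of read from the user; change it to your own "index" folder,
    // e.g. /home/abdiansah/tutorial_lucene/index/
    String s = "/home/abdiansah/Projects/Tutorial_Lucene/index/";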
    TextFileIndexer indexer = null;
    try {
      indexLocation = s;
      indexer = new TextFileIndexer(s);
    } catch (Exception ex) {
      System.out.println("Cannot create index..." + ex.getMessage());
      System.exit(-1);
    }

    //===================================================
    //read input from user until he enters q for quit
    //===================================================
    while (!s.equalsIgnoreCase("q")) {
      try {
        System.out.println("Enter the full path to add into the index (q=quit): (e.g. /home/ron/mydir or c:\\Users\\ron\\mydir)");
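        System.out.println("[Acceptable file types: .xml, .htm, .html, .txt]");
        // Hard-coded instead of read from the user; change it to the extracted "doc" folder,
        // e.g. /home/abdiansah/tutorial_lucene/doc
        s = "/home/abdiansah/Projects/Tutorial_Lucene/corpus";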
        if (s.equalsIgnoreCase("q")) {
          break;
        }
        //try to add file into the index
        indexer.indexFileOrDirectory(s); s = "q";
      } catch (Exception e) {
        System.out.println("Error indexing " + s + " : " + e.getMessage());
      }
    }

    //===================================================
    //after adding, we always have to call the
    //closeIndex, otherwise the index is not created
    //===================================================
    indexer.closeIndex();

    //=========================================================
    // Now search
    //=========================================================
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexLocation)));
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(5, true);

    s = "";
    while (!s.equalsIgnoreCase("q")) {
      try {
        System.out.println("Enter the search query (q=quit):");
        s = br.readLine();
        // readLine() returns null at end of input (e.g. Ctrl+D); treat that as quit
        if (s == null || s.equalsIgnoreCase("q")) {
          break;
        }
        Query q = new QueryParser("contents", analyzer).parse(s);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for(int i=0;i<hits.length;++i) {
          int docId = hits[i].doc;
          Document d = searcher.doc(docId);
          System.out.println((i + 1) + ". " + d.get("path") + " score=" + hits[i].score);
        }

      } catch (Exception e) {
        System.out.println("Error searching " + s + " : " + e.getMessage());
      } finally {
         collector = TopScoreDocCollector.create(5, true);
      }
    }

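    // release the index files before cleaning up
    reader.close();

    // delete all search index files, so the next run does not append
    // the same documents to an existing index
    File dir = new File(indexLocation);
    File[] indexFiles = dir.listFiles(); // null if the folder is missing
    if (indexFiles != null) {
      for (File f : indexFiles) {
        f.delete();
      }
    }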
  }

  /**
   * Constructor
   * @param indexDir the name of the folder in which the index should be created
   * @throws java.io.IOException when exception creating index.
   */
  TextFileIndexer(String indexDir) throws IOException {
    // With the default open mode (CREATE_OR_APPEND) the writer creates a new index
    // if none exists in indexDir, or appends to the one that is already there.
    FSDirectory dir = FSDirectory.open(new File(indexDir));
    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    writer = new IndexWriter(dir, config);
  }

  /**
   * Indexes a file or directory
   * @param fileName the name of a text file or a folder we wish to add to the index
   * @throws java.io.IOException when exception
   */
  public void indexFileOrDirectory(String fileName) throws IOException {
    //===================================================
    //gets the list of files in a folder (if the user has submitted
    //the name of a folder) or gets a single file name (if the user
    //has submitted only the file name)
    //===================================================
    addFiles(new File(fileName));
    int originalNumDocs = writer.numDocs();
    for (File f : queue) {
      FileReader fr = null;
      try {
        Document doc = new Document();
        //===================================================
        // add contents of file
        //===================================================
        fr = new FileReader(f);
        doc.add(new TextField("contents", fr));
        doc.add(new StringField("path", f.getPath(), Field.Store.YES));
        doc.add(new StringField("filename", f.getName(), Field.Store.YES));
        writer.addDocument(doc);
        System.out.println("Added: " + f);
      } catch (Exception e) {
        System.out.println("Could not add: " + f);
      } finally {
        // fr is still null if the FileReader could not be opened
        if (fr != null) {
          fr.close();
        }
      }
    }

    int newNumDocs = writer.numDocs();
    System.out.println("");
    System.out.println("************************");
    System.out.println((newNumDocs - originalNumDocs) + " documents added.");
    System.out.println("************************");
    queue.clear();
  }

  private void addFiles(File file) {
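    if (!file.exists()) {
      System.out.println(file + " does not exist.");
      return; // nothing to queue
    }
    if (file.isDirectory()) {
      File[] children = file.listFiles(); // null if the directory cannot be read
      if (children != null) {
        for (File f : children) {
          addFiles(f);
        }
      }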
    } else {
      String filename = file.getName().toLowerCase();
      //===================================================
      // Only index text files
      //===================================================
      if (filename.endsWith(".htm") || filename.endsWith(".html") ||
              filename.endsWith(".xml") || filename.endsWith(".txt")) {
        queue.add(file);
      } else {
        System.out.println("Skipped " + filename);
      }
    }
  }

  /**
   * Close the index.
   * @throws java.io.IOException when exception closing
   */
  public void closeIndex() throws IOException {
    writer.close();
  }
}
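
The file-selection step in addFiles can also be tried on its own, without Lucene. The following is only a minimal sketch, assuming Java 8 or later (for Files.walk); the class name IndexableFileCollector and its methods are hypothetical and are not part of the tutorial code. It walks a folder recursively and keeps only files with the same four extensions that TextFileIndexer accepts.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical helper, not part of TextFileIndexer: lists the files that would be indexed.
public class IndexableFileCollector {
  // The same extensions that addFiles accepts in TextFileIndexer.
  private static final List<String> EXTENSIONS =
      Arrays.asList(".htm", ".html", ".xml", ".txt");

  // Returns true if the file name ends with one of the accepted extensions (case-insensitive).
  public static boolean isIndexable(String fileName) {
    String lower = fileName.toLowerCase(Locale.ROOT);
    for (String ext : EXTENSIONS) {
      if (lower.endsWith(ext)) {
        return true;
      }
    }
    return false;
  }

  // Walks root recursively and collects every regular file that passes isIndexable.
  // Returns an empty list if root does not exist.
  public static List<Path> collect(Path root) throws IOException {
    List<Path> result = new ArrayList<>();
    if (!Files.exists(root)) {
      return result;
    }
    try (Stream<Path> paths = Files.walk(root)) {
      paths.filter(Files::isRegularFile)
           .filter(p -> isIndexable(p.getFileName().toString()))
           .forEach(result::add);
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    // Defaults to the "doc" folder from the tutorial when no argument is given.
    Path root = Paths.get(args.length > 0 ? args[0] : "doc");
    for (Path p : collect(root)) {
      System.out.println(p);
    }
  }
}
```

Running it inside “tutorial_lucene” (java IndexableFileCollector doc) prints the files that the indexer would pick up, which is a quick way to check the document folder before indexing.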

COMPILING:

  • $ javac -classpath lucene-analyzers-common-4.10.2.jar:lucene-core-4.10.2.jar:lucene-queryparser-4.10.2.jar TextFileIndexer.java

RUNNING:

  • $ java -classpath lucene-analyzers-common-4.10.2.jar:lucene-core-4.10.2.jar:lucene-queryparser-4.10.2.jar:. TextFileIndexer
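
The long -classpath value is easy to mistype, so it can help to generate it. The sketch below is an illustration only; ClasspathSketch and buildClasspath are hypothetical names, and it assumes Java 8 or later (for String.join) and that the three jars sit in the current directory, as in the steps above. It joins the jar names with the platform's path separator (":" on Linux, ";" on Windows) and appends "." so that the JVM can also find TextFileIndexer.class.

```java
import java.io.File;

// Hypothetical helper: prints the classpath string used in the commands above.
public class ClasspathSketch {
  // Builds "<analyzers jar><sep><core jar><sep><queryparser jar><sep>." for the given version.
  public static String buildClasspath(String version, String separator) {
    String[] jars = {
      "lucene-analyzers-common-" + version + ".jar",
      "lucene-core-" + version + ".jar",
      "lucene-queryparser-" + version + ".jar"
    };
    // "." adds the current directory, where TextFileIndexer.class is written by javac.
    return String.join(separator, jars) + separator + ".";
  }

  public static void main(String[] args) {
    // The version defaults to the one used in this tutorial.
    String version = args.length > 0 ? args[0] : "4.10.2";
    System.out.println(buildClasspath(version, File.pathSeparator));
  }
}
```

The printed string can be pasted after -classpath in both the javac and the java command; the trailing "." is harmless for javac and needed for java.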

RESULT:

  • You will be prompted for the words to search for: “Enter the search query (q=quit):”
  • Enter the name of a figure from one of the kingdoms, for example “gajah mada”, “raja sanjaya”, “kertanegara”, and so on.
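
To get a feeling for why a query such as “gajah mada” returns a list of files, here is a toy inverted index in plain Java. It is a heavily simplified illustration, not Lucene's implementation: there is no scoring or ranking, the tokenizer is a naive stand-in for StandardAnalyzer, and the class name TinyInvertedIndex and the two sample sentences are made up. Like the classic QueryParser with its default settings, it treats the words of a query as alternatives (OR), so a document matches if it contains any of them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration of an inverted index: term -> names of the documents containing it.
public class TinyInvertedIndex {
  private final Map<String, Set<String>> postings = new HashMap<>();

  // Lowercases the text and splits it on anything that is not a letter or digit.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!t.isEmpty()) {
        tokens.add(t);
      }
    }
    return tokens;
  }

  // Registers every term of the text under the given document name.
  public void add(String docName, String text) {
    for (String term : tokenize(text)) {
      postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
    }
  }

  // Returns the documents that contain at least one query term (OR semantics, unranked).
  public Set<String> search(String query) {
    Set<String> result = new TreeSet<>();
    for (String term : tokenize(query)) {
      result.addAll(postings.getOrDefault(term, Collections.<String>emptySet()));
    }
    return result;
  }

  public static void main(String[] args) {
    TinyInvertedIndex index = new TinyInvertedIndex();
    // Made-up one-line stand-ins for the real files in the "doc" folder.
    index.add("majapahit.txt", "Gajah Mada, mahapatih of Majapahit");
    index.add("singasari.txt", "Kertanegara, king of Singasari");
    System.out.println(index.search("gajah mada"));
    System.out.println(index.search("kertanegara"));
  }
}
```

Lucene does considerably more than this: it stores the index on disk (the “index” folder), scores every hit, and returns the best five, which is why the real program prints a score next to each path.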

Happy experimenting!
