Right now we are working on a new project using Apache Nutch 2.x, Apache Hadoop, Apache Solr 4 and a lot of other cool tools, modules, APIs, etc. After following the instructions found on http://nlp.solutions.asia/?p=180, I’ve successfully connected Apache Nutch, MySQL and Apache Solr.

In summary:
- Create a database to hold your data
- Use SQLDataStore and add configuration for your MySQL server
- Update Apache Nutch configuration
- Update Solr schema
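As a rough sketch of the data store step, the helper below prints the Gora SqlStore properties that typically go into Nutch's conf/gora.properties for a MySQL back end. The property names follow the linked tutorial as far as I remember it, and the function name, database name, user and password are all placeholders I made up for illustration, so check them against your own Nutch/Gora version before relying on them.

```shell
#!/bin/sh
# Print Gora SqlStore settings for a MySQL-backed Nutch 2.x crawl.
# The function name and the default arguments (database, user, password)
# are hypothetical placeholders, not values from the original tutorial.
print_gora_mysql_props() {
  db="${1:-nutch}"
  user="${2:-nutch_user}"
  pass="${3:-change_me}"
  cat <<EOF
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/${db}?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=${user}
gora.sqlstore.jdbc.password=${pass}
EOF
}
```

For example, `print_gora_mysql_props nutch myuser mypassword` prints the four lines to the terminal so you can review them before pasting them into your own configuration. Nutch also has to be told to use the SQL store in nutch-site.xml, which the tutorial covers.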
Now our Apache Nutch uses MySQL as its data store (the place where it keeps the results of the crawling process, such as URLs, text content, metadata, and so on). That’s grand, but there is one part missing in the Solr schema provided in the blog post.
Due to SOLR-3432, after following the tutorial and replacing the schema, we could no longer delete the whole index. After following the instructions in the bug comments and adding the following entry to schema.xml, it worked again.
[xml]<field name="_version_" type="long" indexed="true" stored="true"/>[/xml]
Restart Apache Solr, then run the following command to reset your index.
[shell]curl "http://localhost:8983/solr/collection1/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"[/shell]
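To check that the delete really took effect, one option is to ask Solr how many documents still match `*:*`. The sketch below assumes the same core URL as in the delete command and a JSON response containing a `numFound` field; the function names and the `SOLR_URL` variable are mine, not part of Solr.

```shell
#!/bin/sh
# Assumed default: the same core URL used in the delete command above.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/collection1}"

# Extract the numFound value from a Solr JSON response read on stdin.
parse_num_found() {
  sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p'
}

# Ask Solr how many documents match *:* without fetching any of them.
solr_doc_count() {
  curl -s "$SOLR_URL/select?q=*:*&rows=0&wt=json" | parse_num_found
}
```

After sourcing these functions, running `solr_doc_count` against a freshly reset index should report zero documents. If it does not, the `_version_` field is probably still missing or Solr was not restarted.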
Hope it helps if you are creating a similar setup. In upcoming posts we will explain how to set up the Apache Nutch 2.x branch in Eclipse, which is very helpful for writing and debugging plug-ins.
Laters! -B

Did you manage to index metatags? If so, how?
Hi, I can’t recall if it found any metatags, as I was using only parts of the body of the page. I’ll give it a try and will comment either here or in NUTCH-1478
Cheers, -B
Could you write out every step of connecting MySQL to Nutch? I tried but failed.
Thanks.
Sure thing, mate. This post is somewhat old, and the code has probably changed already. Could you tell me what errors you got and what your setup was, please? I’ll give it a try with your settings.
All the best, Bruno