Thursday, January 14, 2010

State of MySQL Search

Recently I took part in the first meeting of the MySQL Search Team. You can read more about the team here.

My task was to represent the customer interests regarding fulltext search. My report had a few main points which I made:

  • MySQL Fulltext Search (FTS) has several big issues with it compared to other solutions. I have listed these in the order of importance that I see. Numbers 1 and 2 are very close in importance however, so I can see those going either way.

    1. MyISAM only: Limits the usefulness in many very busy environments due to table level locking and crash safety

    2. Performance: The design of FTS makes performance hurt as the index size exceeds RAM. Combining multiple second queries with MyISAM table level locking is a recipe for failure.

    3. Features: There are a lot of features lacking in FTS that exist in other products. Stemming, synonyms, dictionary usage, CJK support, etc... are all potential avenues of investigation.

  • Third-party solutions are commonly used for searching data in MySQL, but does require extra work to deploy or implement. Tools like Lucene, Sphinx, etc... are very good at what they do, and are normally faster and more fully featured compared to MySQL FTS. The big issue is that normally there is some effort involved in implementing these, so the ease of use in implementing and maintaining is lower than MySQL FTS.



So the forward looking question is how to combine the best attributes of these two solutions to get one easy to deploy and use robust solution. Personally, I can see a few possible routes forward:
  • Improve MySQL FTS! There is a lot of research and known best practices that exist for search. This would leverage those in order to make the MySQL solution more robust and featureful. From the current state of FTS, it would be a pretty decent amount of work, but could have good returns.
  • Improve the glue code between the third parties and MySQL. This could take a variety of forms. One idea I had was in the the form of index plugins where as rows are inserted/updated/deleted, the external datastores can be automatically updated as well. This is very similar to triggers, but is a bit different since there can be more integration and easier deployment.
It's possible the best solution may be a mix of the two options to allow for more flexibility or even something completely new that I haven't thought of!

Regardless, I am happy that people at MySQL are finally taking a look at the poor state of the MySQL fulltext search environment and trying to figure out new good solutions for this space.

3 comments:

  1. FTS in MySQL requires a crash safe data store. MySQL could have done that with Maria, but it decided to be everything (OLTP, DW, FTS) and nobody uses it today. Crash safe doesn't require support for concurrent transactions. It is good enough to support 1 writer with N readers and even better if it can tell us the ID of the last commit to which it recovered after a crash to stay in sync with the binlog.

    FTS on MyISAM with current MyISAM behavior cannot be used by most of us.

    Regardless of what is done with FTS, I think
    MySQL should/can do more to play better with others.

    Row-based replication should make it easy to integrate MySQL with special-purpose data stores. But someone has to build the glue to make that easy to use and a library for parsing row-based binlog events into something that is easy for others to use.

    For many of us, MySQL is just a piece of the data management solution. It is unable to do everything that we need (cache == memcache, data warehouse == hive, text search == sphinx). This is an opportunity for MySQL to learn how to play better with others. It is the engine from which data originates.

    ReplyDelete
  2. The idea of making a row-based replication reader client which can then integrate into other searching solutions sounds neat to me.

    It is pretty common to want to have the FTS server separate from your main system to prevent overloading it and the asynchronous nature of the row based applier would work well with it as well.

    Also this could effectively be done for existing 5.1 servers without needing any code on the server side effectively which is always a plus.

    ReplyDelete
  3. I agree with Mark. The MySQL RDMS is a part of the solution and we have many things to join with its solutions to achive specific objectives.

    About the row-based replication, I have many difficults to use this because other problems over replication and, as Mark quoted in your answer, read bin-log with the mysqlbinlog app.

    Maybe, FTS is little child calling for help yet. I get used this resource in some cases, but, we don't have all desired features in it.

    ReplyDelete