
Getting probability density graph & k-means clustering with 300 million rows


The DBMS I use is MySQL (MariaDB).

The table schema is as follows:

CREATE TABLE MyTable (
    ID     INT PRIMARY KEY,
    TEXT   VARCHAR(200),
    VALUE  DECIMAL(15,2)
);

The table has 300 million rows or more.

I'd like to extract values by matching keywords in the text column (for example, SELECT VALUE FROM MyTable WHERE TEXT LIKE '%any keywords%';) and then run the following two processes on those values, with the results displayed on the web:

  1. Draw a probability density graph
  2. Cluster the values using the k-means algorithm
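
As a rough illustration of the first process, a normalized histogram is one simple way to approximate a probability density. The sketch below assumes the matching VALUE numbers have already been fetched into a Python list; the function name `density_histogram` and the `bins` parameter are hypothetical and not part of any existing system.

```python
from typing import List, Tuple


def density_histogram(values: List[float], bins: int = 10) -> List[Tuple[float, float, float]]:
    """Return (bin_start, bin_end, density) triples whose total area is 1."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    # Fall back to a unit width when every value is identical (hi == lo).
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    # Dividing by n * width turns raw counts into a density (area sums to 1).
    return [
        (lo + i * width, lo + (i + 1) * width, c / (n * width))
        for i, c in enumerate(counts)
    ]


if __name__ == "__main__":
    # Illustrative values only, standing in for the result of the keyword query.
    for start, end, density in density_histogram([1.0, 2.0, 2.5, 3.0, 9.0], bins=4):
        print(f"[{start:.1f}, {end:.1f}): {density:.3f}")
```

A histogram like this needs only one pass over the selected values, so the expensive part remains the keyword scan that produces them, not the density calculation.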

Is it possible to obtain the results above using only SQL? If so, how would it perform? (The required response time is less than 2 seconds.) If not, could you recommend a better way?

If there are 10 data nodes running a combination of NoSQL and Mahout, is it possible to get the result of each query within 2 seconds, especially at 5 queries per second? If not, how many data nodes are required?

So, please recommend a system architecture if you know of any solution to the problem I've run into.


Solution

  • This is a bit long for a comment.

    Your expectations are a bit extreme. It might be possible to meet the requirements using a lot of custom code and systems with lots of processors and lots of memory.

    First, you don't seem to understand how k-means works. What is the distance metric?

    Second, you don't explain why you need to re-cluster the records for each query. Typically, clustering is more of an offline activity and scoring (or assigning clusters) is online.

    Finally, I wouldn't recommend k-means clustering on raw text. There are other algorithms for clustering text that could very well be more appropriate for your actual problem. I would suggest you learn a bit about data mining (What is the k-means algorithm? What is it useful for? What is expectation-maximization clustering? What is singular value decomposition?). I would also suggest that you learn about text analysis (What is tokenization? What is stemming? What are bag-of-words approaches? What is semantic analysis?). Your question betrays a lack of understanding of both these subjects.
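
To make the offline/online split concrete, here is a minimal sketch. It assumes the clustering is done on the numeric VALUE column rather than on the text, and that the distance metric is the absolute difference between two values; both are assumptions, since the question does not say. The names `fit_kmeans_1d` and `assign_cluster` are hypothetical.

```python
from typing import List


def assign_cluster(value: float, centroids: List[float]) -> int:
    """Online step: index of the nearest centroid, using |value - centroid|."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def fit_kmeans_1d(values: List[float], k: int, iterations: int = 20) -> List[float]:
    """Offline step: Lloyd's algorithm on scalar values; returns sorted centroids."""
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k values")
    ordered = sorted(values)
    # Deterministic initialisation: take evenly spaced points from the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            groups[assign_cluster(v, centroids)].append(v)
        # An empty group keeps its previous centroid instead of dividing by zero.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return sorted(centroids)


if __name__ == "__main__":
    # Offline: fit once on illustrative values and store the centroids.
    centroids = fit_kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], k=2)
    # Online: scoring a new value only compares it against k centroids.
    print(centroids, assign_cluster(9.0, centroids))
```

In this arrangement the expensive fitting runs as a batch job, and each web request only performs the cheap `assign_cluster` lookup against stored centroids.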