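I (ㄚ琪) recently read the Slashdot story How Twitter Is Moving To the Cassandra Database, so I went looking for Chinese-language news about Twitter and found this item, which traces back to the MyNoSQL blog: "Twitter's user base grows 20-fold; plans to drop MySQL". So Twitter has been running on MySQL all along, which feels very familiar. Now it is reportedly switching to Cassandra, a database I really don't know. On the Cassandra website the slogan reads: 'A highly scalable, eventually consistent, distributed, structured key-value store.' It is described as a highly scalable second-generation distributed database. What I found on the Wiki is probably the first generation; as for the second generation, I have no real sense of it yet. The Cassandra site explains where it comes from: it combines Dynamo's fully distributed design with Bigtable's column-family-based data model.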
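Dynamo also has close ties to Facebook, because Cassandra was open-sourced by Facebook in 2008. It was designed by Avinash Lakshman and Prashant Malik, and Avinash Lakshman is one of the authors of Amazon's Dynamo, so in many ways we can think of Cassandra as Dynamo 2.0.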
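Bigtable is more familiar. My earlier notes on Google App Engine mentioned using this database. Because I am so used to MySQL, working with Bigtable takes a little getting used to, but that should be easy to get past!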
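The MyNoSQL conversation with Ryan King has some interesting parts. One is the question of which options Ryan King evaluated: 'Various databases: HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable and probably some others I’m forgetting.' The list never mentions the databases we all know, such as Oracle, MSSQL, or DB2, which struck me as an unusual way of thinking!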
Another is 'How do you plan to migrate existing data?' This is a big problem. It is like the time I had to move data from nuke into drupal, or moving data between different databases at work: it really is a big job.
Ryan King's approach:
1. Write code that can write to Cassandra in parallel to Mysql, but keep it disabled by the tool I mentioned above
2. Slowly turn up the writes to Cassandra (we can do this by user groups “turn this feature on for employees only” or by percentages “turn this feature on for 1.2% of users”)
3. Find a bug 🙂
4. Turn the feature off
5. Fix the bug and deploy
6. GOTO #2
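The steps above amount to a dual-write path guarded by a switch that can be opened per user group or per percentage of users. The interview does not show the tool Twitter used, so the following is only a minimal sketch under my own assumptions: `FeatureFlag`, `save_tweet`, and the two writer callables are hypothetical names, and hashing the user id into a bucket is just one common way to pick a stable percentage of users.

```python
import hashlib


class FeatureFlag:
    """Hypothetical rollout switch: on for listed groups, or for a percentage of users."""

    def __init__(self, name, percentage=0.0, groups=()):
        self.name = name
        self.percentage = percentage  # 0.0 to 100.0
        self.groups = set(groups)

    def enabled_for(self, user_id, user_groups=()):
        # Group rule, e.g. "turn this feature on for employees only"
        if self.groups.intersection(user_groups):
            return True
        # Percentage rule: hash flag name + user id so each user gets a stable bucket
        digest = hashlib.sha1(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 10000  # 0..9999, steps of 0.01%
        return bucket < self.percentage * 100


def save_tweet(user_id, text, write_mysql, write_cassandra, flag, user_groups=()):
    """Always write to the old store; mirror the write to the new one when the flag allows."""
    write_mysql(user_id, text)  # the existing store stays the source of truth
    if flag.enabled_for(user_id, user_groups):
        try:
            write_cassandra(user_id, text)
        except Exception:
            # A bug in the new path must not break the old one;
            # real code would log this so the bug can be found and fixed.
            pass
```

Setting `percentage` back to 0 and clearing `groups` is the "turn the feature off" step; raising it again after a fix is the "GOTO #2".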
Then:
- Take a backup from the mysql databases
- Run an importer that imports the data to cassandra
  Some side notes here about importing. We were originally trying to use the BinaryMemtable[2] interface, but we actually found it to be too fast; it would saturate the backplane of our network. We’ve switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster.
- Once the data is imported we start turning on real read traffic to Cassandra (in parallel to the mysql traffic), again by user groups and percentages.
- Once we’re satisfied with the new system (we’re using the real production traffic with instrumentation in our application to QA the new datastore) we can start turning down traffic to the mysql databases.
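A week is 168 hours, so against the quoted 7 hours the throttled import runs roughly 24 times slower than the cluster could manage with unlimited bandwidth. The interview does not say how the throttling is done; the sketch below is only one simple way to cap an import rate, and `throttled_import`, `write_row`, and `max_rows_per_sec` are hypothetical names of mine, not part of Cassandra or Thrift.

```python
import time


def throttled_import(rows, write_row, max_rows_per_sec,
                     clock=time.monotonic, sleep=time.sleep):
    """Write every row, pausing so the average rate stays at or below the cap.

    `write_row` stands in for whatever actually sends one row to the new store.
    `clock` and `sleep` are injectable so the pacing can be tested without waiting.
    """
    start = clock()
    written = 0
    for row in rows:
        write_row(row)
        written += 1
        # Earliest moment at which `written` rows are allowed to have been sent
        earliest = start + written / max_rows_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return written
```

For example, `throttled_import(backup_rows, send_to_cassandra, max_rows_per_sec=5000)` would pace a hypothetical `send_to_cassandra` callable; a real importer would more likely throttle by bytes on the wire than by row count.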
Just listening to people say how good Cassandra is doesn't do much; better to try it hands-on. Up and Running with Cassandra covers how to install it, so I'll give it a try too!
Installing Cassandra: Cassandra needs a Java 6 environment to run! The version available for download right now is 0.5.1.
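For the steps, see the article Cassandra Installation and Configuration. Running under Windows needs some extra attention to the cassandra.bat file, plus setting the CASSANDRA_HOME environment variable to D:\cassandra. The cassandra.bat file contains: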
subst P: "%CASSANDRA_HOME%\lib"
P:
set CLASSPATH=P:\
for %%i in (*.jar) do call :append %%i
goto okClasspath
:append
set CLASSPATH=%CLASSPATH%;P:\%*
SUBST associates a path with a drive letter.
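So check whether drive P: is already in use, and if it is, change it to another letter! Also, the line -Dcom.sun.management.jmxremote.port=8080^ looks like it will use port 8080, so if you already have a service on that port you need to stop that service or change this port number. I haven't tried that, though!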
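Both checks can be done before launching cassandra.bat. This is only a small sketch, not something from the Cassandra docs: `port_is_free` and `drive_in_use` are helper names I made up, the port test only tries the loopback address, and the drive test is only meaningful on Windows.

```python
import os
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port (checked by trying to bind)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def drive_in_use(letter):
    """Return True if a drive such as P: already exists (Windows only)."""
    return os.path.exists(f"{letter}:\\")


if __name__ == "__main__":
    print("port 8080 free:", port_is_free(8080))
    print("drive P: in use:", drive_in_use("P"))
```

If the port turns out to be taken, the number in the -Dcom.sun.management.jmxremote.port line is the one to change; if P: is taken, pick another letter in the subst line and the lines that use P:.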
I'll leave it here for now and keep playing with it tomorrow!
