Each node in spark cluster has an independent database. Can distributed statistical computation be realized?

  mongodb, question

I built spark on two machines, one of which is master and slave, and the other is slave. Both machines are equipped with independent mongodb databases. My main program lets them count the contents of their own databases and then aggregate the results into a database on a server. The current code is submitted on the master node. However, after I spark-submit, it seems that I only counted the data in mongodb on the master node, but not on the other worker node. What is the reason? The code is as follows:

val conf = new SparkConf().setAppName("Scala Word Count")
 val sc = new SparkContext(conf)
 val config = new Configuration()
 //The following code indicates that only the data on the local database are counted. It is assumed that this is the problem.
 config.set("mongo.input.uri", "mongodb://")
 //The statistical results are output to the server
 config.set("mongo.output.uri", "mongodb://")
 val mongoRDD = sc.newAPIHadoopRDD(config, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
 // Input contains tuples of (ObjectId, BSONObject)
 val countsRDD = mongoRDD.flatMap(arg => {
 var str = arg._2.get("type").toString
 str = str.toLowerCase().replaceAll("[.,!  ?  \n]", " ")
 str.split(" ")
 .map(word => (word, 1))
 .reduceByKey((a, b) => a + b)
 // Output contains tuples of (null, BSONObject) - ObjectId will be generated by Mongo driver if null
 val saveRDD = countsRDD.map((tuple) => {
 var bson = new BasicBSONObject()
 bson.put("word", tuple._1)
 bson.put("count", tuple._2.toString() )
 (null, bson)
 // Only MongoOutputFormat and config are relevant
 saveRDD.saveAsNewAPIHadoopFile("file:///bogus", classOf[Any], classOf[Any], classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]], config)

Ask yourself and answer. The reason may be this:

val mongoRDD = sc.newAPIHadoopRDD(config, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])

This line of code indicates that the driver reads the database and then loads the qualified data into RDD, because is set as the input, that is, reads the data from the driver’s mongodb. Since the driver is on the master, the data read is naturally the data on the master.