This is an edited version of http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureIntro_c.html (under Documentation > Home > Understanding the architecture > Architecture in brief) for feedback purposes.
In this edited version, I tried to explain things by what they’re supposed to do, rather than how they work. Ben Slade
Architecture in brief
An overview of Cassandra’s structure.
Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. Cassandra addresses the problem of failures by employing a peer-to-peer distributed system where all nodes are the same and data is distributed among all nodes in the cluster [can a cluster be defined across data centers? yes]. Each node exchanges information across the cluster every second [same for multiple data centers?]. A [sequentially written] commit log on each node captures write activity to ensure data durability. Data is also written to an in-memory structure, called a memtable, and then written to an [append-only] data file on disk, called an SSTable, once the memory structure is full [can the same data be in the memtable and the SSTable at the same time? i.e., can the memtable act like a cache?]. [Is the commit log truncated when data is written to the SSTable?] All writes are automatically partitioned and replicated throughout the cluster. Using a process called compaction, Cassandra periodically consolidates SSTables, discards tombstones (indicators that a column was deleted), and regenerates the index in the SSTable.
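The write path described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Cassandra's actual implementation: the class name ToyNode, the memtable_limit parameter, and the TOMBSTONE marker are all hypothetical, the "SSTables" are plain in-memory dictionaries, and commit log truncation and the grace period before tombstones are discarded are deliberately left out.

```python
from dataclasses import dataclass, field

# Hypothetical marker meaning "this key was deleted" (a tombstone).
TOMBSTONE = object()


@dataclass
class ToyNode:
    """Toy model of one node's write path; not Cassandra's real code."""
    memtable_limit: int = 2  # assumed: flush once the memtable holds this many keys
    commit_log: list = field(default_factory=list)  # append-only record of writes
    memtable: dict = field(default_factory=dict)    # in-memory view of recent writes
    sstables: list = field(default_factory=list)    # flushed tables, oldest first

    def write(self, key, value):
        # Append to the commit log first so the write is recorded sequentially.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just a write of a tombstone.
        self.write(key, TOMBSTONE)

    def flush(self):
        # Write the full memtable out as a new, never-modified table sorted by key.
        if self.memtable:
            ordered = sorted(self.memtable.items(), key=lambda kv: kv[0])
            self.sstables.append(dict(ordered))
            self.memtable = {}

    def read(self, key):
        # Check the newest data first: memtable, then tables from newest to oldest.
        for table in [self.memtable, *reversed(self.sstables)]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge all flushed tables (newer values win) and drop tombstones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [live] if live else []
```

In this toy, writing "a" and "b", then overwriting "a" and deleting "b", produces two flushed tables; after compact() a single table holding only the latest value of "a" remains.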
Cassandra is a row-oriented database. Cassandra’s architecture allows any authorized user to connect to any node in any data center and access data using the CQL language. For ease of use, CQL uses a similar syntax to SQL. From the CQL perspective the database consists of tables. Typically, a cluster has one keyspace per application. Developers can access CQL through cqlsh as well as via drivers for application languages.
Client read or write requests can go to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured. For more information, see Client requests.
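As a rough illustration of the coordinator role, the sketch below lets a client contact any node, which then forwards the request to the nodes that own the data. Everything here is an assumption made for illustration: the ToyCluster class, its method names, and especially the owners_of rule (a CRC32 checksum picking a starting node) are hypothetical stand-ins and do not reflect how Cassandra actually assigns ownership.

```python
import zlib


class ToyCluster:
    """Toy cluster in which any node can coordinate a request (illustrative only)."""

    def __init__(self, node_names, replication_factor=2):
        self.node_names = list(node_names)
        # Never ask for more copies than there are nodes.
        self.replication_factor = min(replication_factor, len(self.node_names))
        self.storage = {name: {} for name in self.node_names}

    def _check(self, contact_node):
        if contact_node not in self.storage:
            raise ValueError(f"unknown node: {contact_node}")

    def owners_of(self, key):
        # Placeholder ownership rule (assumption): a checksum of the key picks
        # the first owner, and the following nodes hold the extra copies.
        start = zlib.crc32(key.encode("utf-8")) % len(self.node_names)
        return [
            self.node_names[(start + i) % len(self.node_names)]
            for i in range(self.replication_factor)
        ]

    def write(self, contact_node, key, value):
        # contact_node acts as coordinator: it forwards the write to the owners.
        self._check(contact_node)
        owners = self.owners_of(key)
        for owner in owners:
            self.storage[owner][key] = value
        return {"coordinator": contact_node, "owners": owners}

    def read(self, contact_node, key):
        # The coordinator asks the owners, whether or not it holds the data itself.
        self._check(contact_node)
        for owner in self.owners_of(key):
            if key in self.storage[owner]:
                return self.storage[owner][key]
        return None
```

The point of the sketch is that the node a client contacts does not change where the data lives: a value written through one node can be read back through any other.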
Key components for configuring Cassandra
- Gossip: A peer-to-peer communication protocol to discover and share location and state information about the other nodes in a Cassandra cluster. Gossip information is also persisted locally by each node to use immediately when a node restarts. You may want to purge gossip history on node restart for various reasons, such as when the node’s IP address has changed.
- Partitioner: A partitioner determines how to distribute the data across the nodes in the cluster. Choosing a partitioner determines which node to place the first copy of data on. [Partitioners use various algorithms to assign the key value of a data row to an integer "token".] You must set the partitioner type and assign each node a num_tokens value (the more tokens assigned to a node, the more data will be stored there). [New configurations typically use virtual nodes to evenly spread tokens across (physical) nodes.] If not using virtual nodes (vnodes), use the initial_token setting instead.
- Replica placement strategy: Cassandra stores copies of data [a "copy of data" or "replica" is a group of data rows mapping to the same token value] on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines which [and how many redundant] nodes to place replicas on [within a data center, if applicable]. The first replica of data is simply the first copy; it is not unique in any sense. When you create a keyspace, you must define the replica placement strategy and the number of replicas you want. [The NetworkTopologyStrategy is highly recommended for most deployments because it is much easier to expand to multiple data centers when required by future expansion. If using data centers, you define the number of replicas you want in each data center. Each data center holds a copy of all data.]
- Snitch: A snitch defines the topology information [e.g., racks and data centers] that the replication strategy uses to place replicas and route requests efficiently. [By default, the snitch software monitors the performance of reads from the various replicas and chooses the best replica for reading based on this history.] You need to configure a snitch when you create a cluster. The snitch is responsible for knowing the location of nodes within your network topology and distributing replicas by grouping machines into data centers and racks.
- The cassandra.yaml file is the main configuration file for Cassandra. In this file, you set the initialization properties for a cluster, caching parameters for tables, properties for tuning and resource utilization, timeout settings, client connections, backups, and security.
- Cassandra stores table properties in the system keyspace. You set storage configuration attributes on a per-keyspace or per-table basis programmatically or using a client application, such as CQL. By default, a node is configured to store the data it manages in the /var/lib/cassandra directory. In a production cluster deployment, you change the commitlog_directory to a different disk drive from the data_file_directories.
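The partitioner and replica placement bullets above can be sketched as a token ring. The code below is a minimal illustration under stated assumptions, not Cassandra's algorithm: the function names token_for, build_ring, and replicas_for are hypothetical, MD5 is used only as a convenient way to turn a key into an integer, tokens for each node are derived from made-up labels, and placement simply walks the ring to the next distinct nodes without any awareness of racks or data centers (which a topology-aware strategy would add).

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a row key to an integer token (MD5 used purely for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes, num_tokens):
    """Give every node num_tokens positions on the ring, like virtual nodes."""
    ring = []
    for node in nodes:
        for i in range(num_tokens):
            # Hypothetical label used only to derive a repeatable token.
            ring.append((token_for(f"{node}-vnode-{i}"), node))
    ring.sort()
    return ring


def replicas_for(key, ring, replication_factor):
    """Walk the ring from the key's token, collecting distinct nodes."""
    tokens = [token for token, _ in ring]
    # First ring position whose token is >= the key's token, wrapping around.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    chosen = []
    for offset in range(len(ring)):
        node = ring[(start + offset) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == replication_factor:
            break
    return chosen
```

For example, build_ring(["n1", "n2", "n3", "n4"], num_tokens=8) yields 32 ring positions, and replicas_for("user:42", ring, 3) returns three distinct node names, the first of which holds the first copy. Giving a node more tokens gives it more ring positions and therefore, on average, more data.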
The article at cbsnews.com exposes the farcical effort to save money by reducing the “size” (number of employees) of the federal government:
“As Washington’s use of private contractors grows, the government is paying those contractors billions more than it would pay their government workers to do the same job, according to a study by the Project On Government Oversight (POGO).
In an attempt to verify frequently made claims that the government can save money by outsourcing its work, the nonprofit Project On Government Oversight (POGO) compared the total annual compensation for federal (and private sector) employees with federal contractor billing rates.
The group found that in 33 of the 35 occupational categories it reviewed, federal government employees were less expensive than contractors. On average, the federal government pays contractors 1.83 times more than it pays federal employees and two times more than what comparable workers in the private sector are paid.”
The following are excerpts from The Atlantic Monthly article Can We Trust Google With the Stratosphere?
Google’s balloons’ primary mission might be to deliver Internet service, but it’s their aviation technology that is the real innovation. These balloons operate in the stratosphere, 12 miles up. Unlike unmanned weather balloons, they are capable of staying afloat for months, maybe years at a time. Each Loon balloon is about 50 feet wide and 40 feet high, relying solely on helium for lift. The envelope, or “balloon” part of the balloon, is one-tenth of an inch thick polyethylene fabric, lightweight and relatively delicate, but strong enough to withstand the high pressure differential of great altitudes. Google’s super-pressure balloons each have dual automatic air vents, which a remote pilot at Google Mission Control uses to control altitude by adjusting outside air levels. Tracking their every move by GPS, Google Mission Control says they can not only make them hover to a certain extent, but effectively navigate the Loons around the globe for weeks on end.
Each Loon balloon has three radio frequency antennas (on 2.4 GHz and 5.8 GHz bands) and a ground-pointing WiFi antenna, which beams an Internet signal to Earth in a 12-mile radius.
Google’s Loon balloons can talk to each other, and control themselves. “We use a distributed mesh network, so each balloon is pretty autonomous and has pretty much the same hardware in it,” Sameera Ponda, a lead aerospace engineer at the Dos Palos site that day, said on the video stream. “As one balloon floats over a certain area that balloon is talking to the ground antennas, and as that balloon floats away, another balloon comes in and takes its place, so it’s a pretty seamless operation.”
The extreme height at which Google’s Loons can flexibly operate raises a lot of questions. Where will they go? To what jurisdictions are they subject? Who regulates the stratosphere? Are they subject to physical intervention? And what will it mean for the world when Google breaks precedent, and achieves a stable stratospheric communications platform where everyone else has failed?
It’s crucial to figure out who controls the open space where Google’s Loons fly, and this is more difficult than it would seem. In the U.S., there are four classes of controlled airspace. According to the Code of Federal Regulations, Classes B, C, D, and E are below 10,000 feet, and designed to control lower traffic around airports. Class A covers airspace between 18,000 feet and where flight level begins to max out, around 60,000 feet (roughly 12 miles). Above that is the stratosphere, where Earth’s atmosphere gradually dissipates into outer space and the Loon balloons will fly in droves. Though there’s no point where space “begins,” the Kármán Line (327,360 ft.) has typically served that marker. Between where planes can fly and the Kármán Line, though, there’s almost 19 miles of unregulated stratosphere. Though there’s been debate, the stratosphere is generally considered sovereign airspace, but for most countries, it is un-policeable. Not only legally, but physically; no one can get high enough to touch it.
Hardly a speck in the blue overhead, ‘unmanned free balloons’ are the least regulated class of aircraft. With its Project Loon, Google is venturing into not one but two vast open spaces — the law and the sky.
A disturbing review of the surveillance programs instituted by the Bush administration and continued by the Obama administration. From the NYTimes Op-Ed The Criminal N.S.A. (written by Jennifer Stisa Granick, director of civil liberties at the Stanford Center for Internet and Society, and Christopher Jon Sprigman, professor at the University of Virginia School of Law):
We may never know all the details of the [United States'] mass surveillance programs, but we know this: The administration has justified them through abuse of language, intentional evasion of statutory protections, secret, unreviewable investigative procedures and constitutional arguments that make a mockery of the government’s professed concern with protecting Americans’ privacy.
And from Al Gore (from an article in an IEEE journal):
[The NSA surveillance] in my view violates the Constitution…. The Fourth Amendment language is crystal clear. It isn’t acceptable to have a secret interpretation of a law that goes far beyond any reasonable reading of either the law or the Constitution and then classify as top secret what the actual law is.
In the NY Times article, Don’t Blame the Work Force, Peter Cappelli, a professor of management at the Wharton School, has noted sharply different opinions between corporate executives, who typically say that schools are failing to give workers the skills they need, and the people who actually do the hiring, who say the real obstacles are traditional ones like lack of on-the-job experience. In addition, when there are many more applicants than jobs, employers tend to impose overexacting criteria and then wait for the perfect match. They also offer tightfisted pay packages. What employers describe as talent shortages are often failures to agree on salary.
If a business really needed workers, it would pay up. That is not happening, which calls into question the existence of a skills gap as well as the urgency on the part of employers to fill their openings. Research from the National Bureau of Economic Research found that “recruiting intensity” — that is, business efforts to fill job openings — has been low in this recovery. Employers may be posting openings, but they are not trying all that hard to fill them, say, by increasing job ads or offering better pay packages.
The average sentence for a first-time nonviolent drug offender convicted under federal mandatory minimum sentencing laws is now longer than the average sentence for rape, child molestation, bank robbery, and manslaughter.
It’s just not reasonable. See:
Amtrak’s 15 little-traveled longer routes lose nearly $600 million each year. Northeast routes earn an operating profit of some $205.4 million.
From The Washington Post Wonkblog (by Brad Plumer):
The best way to think of Amtrak is that it’s essentially two different train systems rolled into one. One system is quite successful, the other isn’t.
First, there are Amtrak’s shorter passenger routes that run less than 400 miles and tend to connect major cities. Think of the Acela Express in the Northeast, or the Pacific Surfliner between San Diego and Los Angeles. These 26 routes carry four-fifths of Amtrak’s passengers, or 25.8 million riders per year. And they’re growing rapidly. Taken as a whole, these shorter routes are profitable to operate — mainly because the two big routes in the Northeast Corridor earn enough to cover losses elsewhere.
Then there are Amtrak’s 15 long-haul routes over 750 miles. Many of these were originally put in place to placate members of Congress all over the country, and they span dozens of states. This includes the California Zephyr route, which runs from Chicago to California and gets just 376,000 riders a year. All told, these routes lost $597.3 million in 2012.
Brookings has a neat interactive tool that lets you scrutinize each of Amtrak’s routes, looking at how many passengers they carry and how much money they make (or lose) each year.