SSL and internet security news

algorithms

Auto Added by WPeMatico

Data, Surveillance, and the AI Arms Race

According to foreign policy experts and the defense establishment, the United States is caught in an artificial intelligence arms race with China — one with serious implications for national security. The conventional version of this story suggests that the United States is at a disadvantage because of self-imposed restraints on the collection of data and the privacy of its citizens, while China, an unrestrained surveillance state, is at an advantage. In this vision, the data that China collects will be fed into its systems, leading to more powerful AI with capabilities we can only imagine today. Since Western countries can’t or won’t reap such a comprehensive harvest of data from their citizens, China will win the AI arms race and dominate the next century.

This idea makes for a compelling narrative, especially for those trying to justify surveillance — whether government- or corporate-run. But it ignores some fundamental realities about how AI works and how AI research is conducted.

Thanks to advances in machine learning, AI has flipped from theoretical to practical in recent years, and successes dominate public understanding of how it works. Machine learning systems can now diagnose pneumonia from X-rays, play the games of go and poker, and read human lips, all better than humans. They’re increasingly watching surveillance video. They are at the core of self-driving car technology and are playing roles in both intelligence-gathering and military operations. These systems monitor our networks to detect intrusions and look for spam and malware in our email.

And it’s true that there are differences in the way each country collects data. The United States pioneered “surveillance capitalism,” to use the Harvard University professor Shoshana Zuboff’s term, where data about the population is collected by hundreds of large and small companies for corporate advantage — and mutually shared or sold for profit The state picks up on that data, in cases such as the Centers for Disease Control and Prevention’s use of Google search data to map epidemics and evidence shared by alleged criminals on Facebook, but it isn’t the primary user.

China, on the other hand, is far more centralized. Internet companies collect the same sort of data, but it is shared with the government, combined with government-collected data, and used for social control. Every Chinese citizen has a national ID number that is demanded by most services and allows data to easily be tied together. In the western region of Xinjiang, ubiquitous surveillance is used to oppress the Uighur ethnic minority — although at this point there is still a lot of human labor making it all work. Everyone expects that this is a test bed for the entire country.

Data is increasingly becoming a part of control for the Chinese government. While many of these plans are aspirational at the moment — there isn’t, as some have claimed, a single “social credit score,” but instead future plans to link up a wide variety of systems — data collection is universally pushed as essential to the future of Chinese AI. One executive at search firm Baidu predicted that the country’s connected population will provide them with the raw data necessary to become the world’s preeminent tech power. China’s official goal is to become the world AI leader by 2030, aided in part by all of this massive data collection and correlation.

This all sounds impressive, but turning massive databases into AI capabilities doesn’t match technological reality. Current machine learning techniques aren’t all that sophisticated. All modern AI systems follow the same basic methods. Using lots of computing power, different machine learning models are tried, altered, and tried again. These systems use a large amount of data (the training set) and an evaluation function to distinguish between those models and variations that work well and those that work less well. After trying a lot of models and variations, the system picks the one that works best. This iterative improvement continues even after the system has been fielded and is in use.

So, for example, a deep learning system trying to do facial recognition will have multiple layers (hence the notion of “deep”) trying to do different parts of the facial recognition task. One layer will try to find features in the raw data of a picture that will help find a face, such as changes in color that will indicate an edge. The next layer might try to combine these lower layers into features like shapes, looking for round shapes inside of ovals that indicate eyes on a face. The different layers will try different features and will be compared by the evaluation function until the one that is able to give the best results is found, in a process that is only slightly more refined than trial and error.

Large data sets are essential to making this work, but that doesn’t mean that more data is automatically better or that the system with the most data is automatically the best system. Train a facial recognition algorithm on a set that contains only faces of white men, and the algorithm will have trouble with any other kind of face. Use an evaluation function that is based on historical decisions, and any past bias is learned by the algorithm. For example, mortgage loan algorithms trained on historic decisions of human loan officers have been found to implement redlining. Similarly, hiring algorithms trained on historical data manifest the same sexism as human staff often have. Scientists are constantly learning about how to train machine learning systems, and while throwing a large amount of data and computing power at the problem can work, more subtle techniques are often more successful. All data isn’t created equal, and for effective machine learning, data has to be both relevant and diverse in the right ways.

Future research advances in machine learning are focused on two areas. The first is in enhancing how these systems distinguish between variations of an algorithm. As different versions of an algorithm are run over the training data, there needs to be some way of deciding which version is “better.” These evaluation functions need to balance the recognition of an improvement with not over-fitting to the particular training data. Getting functions that can automatically and accurately distinguish between two algorithms based on minor differences in the outputs is an art form that no amount of data can improve.

The second is in the machine learning algorithms themselves. While much of machine learning depends on trying different variations of an algorithm on large amounts of data to see which is most successful, the initial formulation of the algorithm is still vitally important. The way the algorithms interact, the types of variations attempted, and the mechanisms used to test and redirect the algorithms are all areas of active research. (An overview of some of this work can be found here; even trying to limit the research to 20 papers oversimplifies the work being done in the field.) None of these problems can be solved by throwing more data at the problem.

The British AI company DeepMind’s success in teaching a computer to play the Chinese board game go is illustrative. Its AlphaGo computer program became a grandmaster in two steps. First, it was fed some enormous number of human-played games. Then, the game played itself an enormous number of times, improving its own play along the way. In 2016, AlphaGo beat the grandmaster Lee Sedol four games to one.

While the training data in this case, the human-played games, was valuable, even more important was the machine learning algorithm used and the function that evaluated the relative merits of different game positions. Just one year later, DeepMind was back with a follow-on system: AlphaZero. This go-playing computer dispensed entirely with the human-played games and just learned by playing against itself over and over again. It plays like an alien. (It also became a grandmaster in chess and shogi.)

These are abstract games, so it makes sense that a more abstract training process works well. But even something as visceral as facial recognition needs more than just a huge database of identified faces in order to work successfully. It needs the ability to separate a face from the background in a two-dimensional photo or video and to recognize the same face in spite of changes in angle, lighting, or shadows. Just adding more data may help, but not nearly as much as added research into what to do with the data once we have it.

Meanwhile, foreign-policy and defense experts are talking about AI as if it were the next nuclear arms race, with the country that figures it out best or first becoming the dominant superpower for the next century. But that didn’t happen with nuclear weapons, despite research only being conducted by governments and in secret. It certainly won’t happen with AI, no matter how much data different nations or companies scoop up.

It is true that China is investing a lot of money into artificial intelligence research: The Chinese government believes this will allow it to leapfrog other countries (and companies in those countries) and become a major force in this new and transformative area of computing — and it may be right. On the other hand, much of this seems to be a wasteful boondoggle. Slapping “AI” on pretty much anything is how to get funding. The Chinese Ministry of Education, for instance, promises to produce “50 world-class AI textbooks,” with no explanation of what that means.

In the democratic world, the government is neither the leading researcher nor the leading consumer of AI technologies. AI research is much more decentralized and academic, and it is conducted primarily in the public eye. Research teams keep their training data and models proprietary but freely publish their machine learning algorithms. If you wanted to work on machine learning right now, you could download Microsoft’s Cognitive Toolkit, Google’s Tensorflow, or Facebook’s Pytorch. These aren’t toy systems; these are the state-of-the art machine learning platforms.

AI is not analogous to the big science projects of the previous century that brought us the atom bomb and the moon landing. AI is a science that can be conducted by many different groups with a variety of different resources, making it closer to computer design than the space race or nuclear competition. It doesn’t take a massive government-funded lab for AI research, nor the secrecy of the Manhattan Project. The research conducted in the open science literature will trump research done in secret because of the benefits of collaboration and the free exchange of ideas.

While the United States should certainly increase funding for AI research, it should continue to treat it as an open scientific endeavor. Surveillance is not justified by the needs of machine learning, and real progress in AI doesn’t need it.

This essay was written with Jim Waldo, and previously appeared in Foreign Policy.

Powered by WPeMatico

MD5 and SHA-1 Still Used in 2018

Last week, the Scientific Working Group on Digital Evidence published a draft document — “SWGDE Position on the Use of MD5 and SHA1 Hash Algorithms in Digital and Multimedia Forensics” — where it accepts the use of MD5 and SHA-1 in digital forensics applications:

While SWGDE promotes the adoption of SHA2 and SHA3 by vendors and practitioners, the MD5 and SHA1 algorithms remain acceptable for integrity verification and file identification applications in digital forensics. Because of known limitations of the MD5 and SHA1 algorithms, only SHA2 and SHA3 are appropriate for digital signatures and other security applications.

This is technically correct: the current state of cryptanalysis against MD5 and SHA-1 allows for collisions, but not for pre-images. Still, it’s really bad form to accept these algorithms for any purpose. I’m sure the group is dealing with legacy applications, but I would like it to really push those application vendors to update their hash functions.

Powered by WPeMatico

Evidence for the Security of PKCS #1 Digital Signatures

This is interesting research: “On the Security of the PKCS#1 v1.5 Signature Scheme“:

Abstract: The RSA PKCS#1 v1.5 signature algorithm is the most widely used digital signature scheme in practice. Its two main strengths are its extreme simplicity, which makes it very easy to implement, and that verification of signatures is significantly faster than for DSA or ECDSA. Despite the huge practical importance of RSA PKCS#1 v1.5 signatures, providing formal evidence for their security based on plausible cryptographic hardness assumptions has turned out to be very difficult. Therefore the most recent version of PKCS#1 (RFC 8017) even recommends a replacement the more complex and less efficient scheme RSA-PSS, as it is provably secure and therefore considered more robust. The main obstacle is that RSA PKCS#1 v1.5 signatures use a deterministic padding scheme, which makes standard proof techniques not applicable.

We introduce a new technique that enables the first security proof for RSA-PKCS#1 v1.5 signatures. We prove full existential unforgeability against adaptive chosen-message attacks (EUF-CMA) under the standard RSA assumption. Furthermore, we give a tight proof under the Phi-Hiding assumption. These proofs are in the random oracle model and the parameters deviate slightly from the standard use, because we require a larger output length of the hash function. However, we also show how RSA-PKCS#1 v1.5 signatures can be instantiated in practice such that our security proofs apply.

In order to draw a more complete picture of the precise security of RSA PKCS#1 v1.5 signatures, we also give security proofs in the standard model, but with respect to weaker attacker models (key-only attacks) and based on known complexity assumptions. The main conclusion of our work is that from a provable security perspective RSA PKCS#1 v1.5 can be safely used, if the output length of the hash function is chosen appropriately.

I don’t think the protocol is “provably secure,” meaning that it cannot have any vulnerabilities. What this paper demonstrates is that there are no vulnerabilities under the model of the proof. And, more importantly, that PKCS #1 v1.5 is as secure as any of its successors like RSA-PSS and RSA Full-Domain.

Powered by WPeMatico

NIST Issues Call for “Lightweight Cryptography” Algorithms

This is interesting:

Creating these defenses is the goal of NIST’s lightweight cryptography initiative, which aims to develop cryptographic algorithm standards that can work within the confines of a simple electronic device. Many of the sensors, actuators and other micromachines that will function as eyes, ears and hands in IoT networks will work on scant electrical power and use circuitry far more limited than the chips found in even the simplest cell phone. Similar small electronics exist in the keyless entry fobs to newer-model cars and the Radio Frequency Identification (RFID) tags used to locate boxes in vast warehouses.

All of these gadgets are inexpensive to make and will fit nearly anywhere, but common encryption methods may demand more electronic resources than they possess.

The NSA’s SIMON and SPECK would certainly qualify.

Powered by WPeMatico

Two NSA Algorithms Rejected by the ISO

The ISO has rejected two symmetric encryption algorithms: SIMON and SPECK. These algorithms were both designed by the NSA and made public in 2013. They are optimized for small and low-cost processors like IoT devices.

The risk of using NSA-designed ciphers, of course, is that they include NSA-designed backdoors. Personally, I doubt that they’re backdoored. And I always like seeing NSA-designed cryptography (particularly its key schedules). It’s like examining alien technology.

Powered by WPeMatico

Lessons Learned from the Estonian National ID Security Flaw

Estonia recently suffered a major flaw in the security of their national ID card. This article discusses the fix and the lessons learned from the incident:

In the future, the infrastructure dependency on one digital identity platform must be decreased, the use of several alternatives must be encouraged and promoted. In addition, the update and replacement capacity, both remote and physical, should be increased. We also recommend the government to procure the readiness to act fast in force majeure situations from the eID providers.. While deciding on the new eID platforms, the need to replace cryptographic primitives must be taken into account — particularly the possibility of the need to replace algorithms with those that are not even in existence yet.

Powered by WPeMatico

Whistleblower Investigative Report on NSA Suite B Cryptography

The NSA has been abandoning secret and proprietary cryptographic algorithms in favor of commercial public algorithms, generally known as “Suite B.” In 2010, an NSA employee filed some sort of whistleblower complaint, alleging that this move is both insecure and wasteful. The US DoD Inspector General investigated and wrote a report in 2011.

The report — slightly redacted and declassified — found that there was no wrongdoing. But the report is an interesting window into the NSA’s system of algorithm selection and testing (pages 5 and 6), as well as how they investigate whistleblower complaints.

Powered by WPeMatico