Hadoop Weekly - March 24, 2014



Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

Cloudera and Platfora both reported new rounds of funding this week, and two new partnerships were announced: MapR with Jaspersoft, and Cloudera with Trifacta. In addition, Pivotal introduced a new version of its distribution, Pivotal HD, and Microsoft announced the general availability of the newest version of HDInsight, which includes Apache Hadoop 2.2. With several interesting technical articles, this week’s newsletter should have something for everyone.

In a tutorial that combines Apache Pig, Cloudera Impala, and Microsoft Power BI, you’ll load a dataset describing the on-time performance of flights in the US over the last 30 years. A Pig job joins the data describing each flight with carriers, planes, and airports. Next, Pig is used to do simple aggregate analysis. Finally, the tutorial walks through hooking up Microsoft Power BI to data retrieved through Cloudera Impala in order to do more advanced analysis.
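
The join-then-aggregate shape of that tutorial can be sketched in a few lines of plain Python. This is only an illustration, not the tutorial's Pig code: the sample rows, the field names (`carrier`, `origin`, `arr_delay`), and both helper functions are hypothetical, since the dataset's real schema isn't given here.

```python
from collections import defaultdict

# Hypothetical sample rows; the real dataset and its column names are assumptions.
flights = [
    {"carrier": "AA", "origin": "JFK", "arr_delay": 10},
    {"carrier": "AA", "origin": "ORD", "arr_delay": 30},
    {"carrier": "DL", "origin": "JFK", "arr_delay": -5},
]
carriers = {"AA": "American Airlines", "DL": "Delta Air Lines"}
airports = {"JFK": "New York", "ORD": "Chicago"}


def join_flights(flights, carriers, airports):
    """Attach carrier and airport names to each flight (a simple lookup join)."""
    joined = []
    for flight in flights:
        row = dict(flight)
        row["carrier_name"] = carriers.get(flight["carrier"])
        row["origin_city"] = airports.get(flight["origin"])
        joined.append(row)
    return joined


def mean_delay_by_carrier(rows):
    """Group the joined rows by carrier name and average the arrival delay."""
    totals = defaultdict(lambda: [0, 0])  # carrier name -> [sum of delays, count]
    for row in rows:
        bucket = totals[row["carrier_name"]]
        bucket[0] += row["arr_delay"]
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


joined = join_flights(flights, carriers, airports)
print(mean_delay_by_carrier(joined))
```

In Pig the same two steps would be a `JOIN` followed by a `GROUP` with an aggregate, run over far more data than fits in memory.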

William Butler Yeats © estate of Sir William Rothenstein
National Portrait Gallery, London
On Ignoria: The Wild Swans at Coole, bilingual


Los árboles son bellos en otoño,
las sendas de los bosques están secas;
bajo el crepúsculo de octubre, el agua
refleja un cielo inmóvil;
sobre el agua que brilla entre las piedras,
cincuenta y nueve cisnes. (…)


The trees are in their autumn beauty,   
The woodland paths are dry,
Under the October twilight the water   
Mirrors a still sky;
Upon the brimming water among the stones   
Are nine-and-fifty swans. (…)

The Little Story

"So, do you believe in God?"
"Of course. In fact, God was once physically here on earth."
"Whoa, for real?"
"Yes, he became a man and then died for all of us."
"Whoa. God became a man? Damn, that's hilarious. And how crazy, dying for everyone. What for, though? And what good did it do?"
"Yes, it was out of love. He saved us all by dying. But he rose on the third day and he's still alive."
"Whoa. What a story! And can I see him and talk to him?"
"No, it's just that shortly after he rose he flew off to Heaven."
"Oh, come on."


NOTE!!! - Many mistakenly believed I meant specifically that MH370 flew BEHIND SQ68. When I say shadow, I mean that it may have flown slightly above or below SQ68. Listening to ATC instructions would have allowed MH370 to stay current on SQ68’s next move.

Monday, March 17, 2014 - 12:01 AM…

Interesting. Something like this will become the key to solving the mystery.

My New Year’s Resolution for 2014 was to get more people started in User Experience (UX) Design. I posted one lesson every day in January, and thousands of people came to learn!

Below you will find links to all 31 daily lessons.

Basic UX Principles: How to get started

"Like anybody can tell you, I am not a very nice man. I don’t know the word. I have always admired the villain, the outlaw, the son of a bitch. I don’t like the clean-shaven boy with the necktie and the good job. I like desperate men, men with broken teeth and broken minds and broken ways. They interest me. They are full of surprises and explosions. I also like vile women, drunk cursing bitches with loose stockings and sloppy mascara faces. I’m more interested in perverts than saints. I can relax with bums because I am a bum. I don’t like laws, morals, religions, rules. I don’t like to be shaped by society."

— Charles Bukowski



Proper assessment of the harms caused by the misuse of drugs can inform policy makers in health, policing, and social care. We aimed to apply multicriteria decision analysis (MCDA) modelling to a range of drug harms in the UK.


Members of the Independent Scientific Committee on Drugs, including two invited specialists, met in a 1-day interactive workshop to score 20 drugs on 16 criteria: nine related to the harms that a drug produces in the individual and seven to the harms to others. Drugs were scored out of 100 points, and the criteria were weighted to indicate their relative importance.
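
As a rough illustration of how scored and weighted criteria combine into one number, here is a minimal sketch that assumes a plain additive weighted sum, a common MCDA form. The abstract does not spell out the aggregation rule, and the criterion names, weights, drug labels, and scores below are all hypothetical, not values from the study.

```python
def overall_score(scores, weights):
    """Weighted sum of criterion scores, with weights normalised to sum to 1.

    Assumes an additive MCDA model; scores are on a 0-100 scale.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights and scores, for illustration only.
weights = {"harm_to_user": 3, "harm_to_others": 1}
scores = {
    "drug_x": {"harm_to_user": 80, "harm_to_others": 40},
    "drug_y": {"harm_to_user": 20, "harm_to_others": 100},
}

# Rank the drugs from most to least harmful under these made-up numbers.
ranking = sorted(scores, key=lambda drug: overall_score(scores[drug], weights), reverse=True)
for drug in ranking:
    print(drug, overall_score(scores[drug], weights))
```

Changing the weights changes the ranking, which is why the weighting step matters as much as the scoring itself.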


MCDA modelling showed that heroin, crack cocaine, and metamfetamine were the most harmful drugs to individuals (part scores 34, 37, and 32, respectively), whereas alcohol, heroin, and crack cocaine were the most harmful to others (46, 21, and 17, respectively). Overall, alcohol was the most harmful drug (overall harm score 72), with heroin (55) and crack cocaine (54) in second and third places.


These findings lend support to previous work assessing drug harms, and show how the improved scoring and weighting approach of MCDA increases the differentiation between the most and least harmful drugs. However, the findings correlate poorly with present UK drug classification, which is not based simply on considerations of harm.

"Drug misuse and abuse are major health problems. Harmful drugs are regulated according to classification systems that purport to relate to the harms and risks of each drug. However, the methodology and processes underlying classification systems are generally neither specified nor transparent, which reduces confidence in their accuracy and undermines health education messages. We developed and explored the feasibility of the use of a nine-category matrix of harm, with an expert delphic procedure, to assess the harms of a range of illicit drugs in an evidence-based fashion. We also included five legal drugs of misuse (alcohol, khat, solvents, alkyl nitrites, and tobacco) and one that has since been classified (ketamine) for reference. The process proved practicable, and yielded roughly similar scores and rankings of drug harm when used by two separate groups of experts. The ranking of drugs produced by our assessment of harm differed from those used by current regulatory systems. Our methodology offers a systematic framework and process that could be used by national and international regulatory bodies to assess the harm of current and future drugs of abuse.

The results of this study do not provide justification for the sharp A, B, or C divisions of the current classifications in the UK Misuse of Drugs Act. Distinct categorisation is, of course, convenient for setting of priorities for policing, education, and social support, as well as to determine sentencing for possession or dealing. But neither the rank ordering of drugs nor their segregation into groups in the Misuse of Drugs Act classification is supported by the more complete assessment of harm described here. Sharply defined categories in any ranking system are essentially arbitrary unless there are obvious discontinuities in the full set of scores. Results show only a hint of such a transition in the spectrum of harm, in the small step in the very middle of the distribution, between buprenorphine and cannabis. Interestingly, alcohol and tobacco are both in the top ten, higher-harm group. There is a rapidly accelerating harm value from alcohol upwards. So, if a three-category classification were to be retained, one possible interpretation of our findings is that drugs with harm scores equal to that of alcohol and above might be class A, cannabis and those below might be class C, and drugs in between might be class B. In that case, it is salutary to see that alcohol and tobacco (the most widely used unclassified substances) would have harm ratings comparable with class A and B illegal drugs, respectively.

Our findings raise questions about the validity of the current Misuse of Drugs Act classification, despite the fact that it is nominally based on an assessment of risk to users and society. The discrepancies between our findings and current classifications are especially striking in relation to psychedelic-type drugs. Our results also emphasise that the exclusion of alcohol and tobacco from the Misuse of Drugs Act is, from a scientific perspective, arbitrary. We saw no clear distinction between socially acceptable and illicit substances. The fact that the two most widely used legal drugs lie in the upper half of the ranking of harm is surely important information that should be taken into account in public debate on illegal drug use. Discussions based on a formal assessment of harm rather than on prejudice and assumptions might help society to engage in a more rational debate about the relative risks and harms of drugs.

We believe that a system of classification like ours, based on the scoring of harms by experts, on the basis of scientific evidence, has much to commend it. Our approach provides a comprehensive and transparent process for assessment of the danger of drugs, and builds on the approach to this issue developed in earlier publications but covers more parameters of harm and more drugs, as well as using the delphic approach, with a range of experts. The system is rigorous and transparent, and involves a formal, quantitative assessment of several aspects of harm. It can easily be reapplied as knowledge advances. We note that a numerical system has also been described by MacDonald and colleagues to assess the population harm of drug use, an approach that is complementary to the scheme described here, but as yet has not been applied to specific drugs. Other organisations (eg, the European Monitoring Centre for Drugs and Drug Addiction and the CAM committee of the Dutch government) are currently exploring other risk assessment systems, some of which are also numerically based. Other systems use delphic methodology, although none uses such a comprehensive set of risk parameters and no other has reported on such a wide range of drugs as our method. We believe that our system could be developed to aid in decision-making by regulatory bodies (eg, the UK’s Advisory Council on the Misuse of Drugs and the European Medicines Evaluation Agency) to provide an evidence-based approach to drug classification."

—David Nutt, Leslie A King, William Saulsbury, Colin Blakemore

6 dataset lists curated by data scientists


Since we do a lot of experimenting with data, we’re always excited to find new datasets to use with Mortar. We’re saving bookmarks and sharing datasets with our team on a near-daily basis.

There are tons of resources throughout the web, but given our love for the data scientist community, we thought we’d pick out a few of the best dataset lists curated by data scientists.

Below is a collection of six great dataset lists from both famous data scientists and those who aren’t well-known:

The book, the man, and Jorge Luis Borges.