Data Science Foundations: Geometry and Topology of
Complex Hierarchic Systems and Big Data Analytics
Datasets and Software Accompanying This Book
Book Review, by Bill Shannon, in Biometrics, Journal of the International
Biometric Society, Volume 75, Issue 1, March 2019, page 361, first published 07 May
2019. Wiley Online Library:
Data, some R software and, in a few cases, other software
are provided with the following
objective: to facilitate learning by doing, i.e.
carrying out analyses and reproducing results and outcomes.
This may be both interesting and useful, in parallel with
the more methodology-related aspects, which can be, and ought
to be, revealing and insightful.
- Related to subsection 1.2: Casablanca Movie Script
- scenes01to77_textfiles.zip, 77 text
files, one for each scene of the Casablanca movie script.
all77fileswords.txt, full extracted word list.
xtabulate-scene77.txt, cross-tabulation of
77 scenes by a corpus of 2654 words.
- scene43_11textfiles.zip, 11 text files,
one for each subscene of scene 43.
all43fileswords.txt, full extracted word list.
xtabulate-scene77.txt, cross-tabulation of 11
subscenes, from movie scene 43, by a corpus of 210 words.
- readdatatables.txt, reading the two previous
data tables into R.
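The files above pair each set of texts with an extracted word list and a texts-by-words cross-tabulation. As a minimal sketch of how such a table can be formed (written in Python rather than R; the tokenisation rule, the function name `cross_tabulate` and the two toy texts are assumptions for illustration, not the procedure used to produce the files listed here):

```python
import re
from collections import Counter


def cross_tabulate(texts):
    """Return (words, table): the sorted word list and one row of counts per text."""
    # Assumed tokenisation: lower-case, then runs of letters and apostrophes.
    counts = [Counter(re.findall(r"[a-z']+", t.lower())) for t in texts]
    # The corpus is the union of all words seen in any text.
    words = sorted(set().union(*counts))
    # Row i, column j: number of occurrences of words[j] in texts[i].
    table = [[c[w] for w in words] for c in counts]
    return words, table


# Toy stand-ins for two scene text files (hypothetical content).
scenes = ["Play it, Sam. Play it.", "Here's looking at you, kid."]
words, table = cross_tabulate(scenes)
print(words)
for row in table:
    print(row)
```

With the 77 scene files in place of the toy texts, each row would correspond to a scene and each column to a word of the corpus; the exact word counts depend on the tokenisation chosen.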
- Related to subsection 1.3: Research Funding
- section1pt3.zip, research centres and clusters,
with descriptions in separate text files, the set of all words extracted, allwords.txt,
and the cross-tabulation of centres/clusters by years, institutes and themes.
data5678.txt, for the years 2005 to 2008, cross-tabulation
of Research Frontiers Programme (RFP) projects by years, institutes and themes.
- Related to subsection 2.1: Twitter
- ch2twitter.zip, containing 302 tweets as separate
text files, the list of 1787 words, allwords.txt, and the cross-tabulation of tweets
by words, xtab.txt.
- Related to subsection 2.4: CSI Las Vegas
- CSI101.zip, 50 scenes in individual text files, a list of
1679 words in allwords.txt, and two cross-tabulations of the scenes with word sets.
- Related to subsection 2.6: Commodity Fetishism
- commodityfetishm.zip, 21 paragraphs as
individual text files, allwords.txt list of 974 words extracted, and cross-tabulation
of paragraphs by words.
- commodityfetishm2.zip, seven selected
paragraphs, allwords.txt list of 482 words, and cross-tabulation of paragraphs by
- Related to subsection 3.3: Tables 3.1, 3.2 and Fig.
- R processing (text file), and
- Related to subsection 3.6.3: Haar wavelet transform
of a dendrogram
- haarum.r. Hierarchical Haar wavelet transform in R
(see commented lines at start for example of use, using Fisher's iris data), which
works on a hierarchy
produced by a hierarchical clustering program. This hierarchical Haar wavelet
transform carries out the following processing tasks: (i) from the data and a
hierarchy, produce the wavelet transform; (ii) filter the wavelet coefficients,
using a user-specified hard threshold; and (iii) reconstruct the data, i.e.
perform the inverse wavelet transform.
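The three processing tasks (i) to (iii) can be sketched as follows. This is a simplified Python illustration, not the code of haarum.r: it assumes that the smooth signal of a cluster is the unweighted mean of its two children and the detail signal is half their difference, and that the hierarchy is given as a list of merges in which indices below n denote observations and index n + k denotes the cluster formed by merge k. The function names, this merge encoding and the toy data are all hypothetical.

```python
def haar_forward(data, merges):
    """(i) From data and a hierarchy, return the root smooth and one detail per merge."""
    n = len(data)
    smooth = {i: list(row) for i, row in enumerate(data)}
    details = []
    for k, (a, b) in enumerate(merges):
        sa, sb = smooth[a], smooth[b]
        # Assumed definitions: smooth = mean of children, detail = half difference.
        smooth[n + k] = [(x + y) / 2 for x, y in zip(sa, sb)]
        details.append([(x - y) / 2 for x, y in zip(sa, sb)])
    return smooth[n + len(merges) - 1], details


def hard_threshold(details, t):
    """(ii) Set to zero every wavelet coefficient with absolute value below t."""
    return [[d if abs(d) >= t else 0.0 for d in vec] for vec in details]


def haar_inverse(root, details, merges, n):
    """(iii) Reconstruct the n observations from the root smooth and the details."""
    smooth = {n + len(merges) - 1: list(root)}
    for k in range(len(merges) - 1, -1, -1):
        a, b = merges[k]
        parent, d = smooth[n + k], details[k]
        smooth[a] = [p + x for p, x in zip(parent, d)]
        smooth[b] = [p - x for p, x in zip(parent, d)]
    return [smooth[i] for i in range(n)]


# Toy example: three observations; merge 0 joins 0 and 1, merge 1 joins that cluster and 2.
data = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
merges = [(0, 1), (3, 2)]
root, details = haar_forward(data, merges)
exact = haar_inverse(root, details, merges, len(data))
filtered = haar_inverse(root, hard_threshold(details, 1.2), merges, len(data))
print(exact)
print(filtered)
```

Without thresholding the inverse recovers the data exactly. With the threshold of 1.2, the small detail coefficients of the first merge are zeroed, so observations 0 and 1 are both replaced by their cluster mean.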
- Related to subsection 5.1: Identifying ultrametricity of
1D signals. (F. Murtagh, "Identifying the ultrametricity of time series", European Physical
Journal B, 43, 573-579, 2005.)
- The program used in this work:
equil-time-series.c. Also needed is
nrutil.h, for a few definitions.
Examples of use:
equil-time-series 1326 2 ftse1326.dat
(Input time series ftse1326.dat; number of values to be read, 1326, must
be equal to, or less than, the number of values in the file; and the value
of 2, number of categories used in the coding, should be fixed always at 2.)
equil-time-series 1326 2 r
(Use 1326 uniformly distributed random values.)
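As background to what ultrametricity means here, the sketch below measures how far a set of pairwise distances is from being ultrametric. It relies on the standard characterisation that, in an ultrametric, the two largest sides of every triangle are equal (every triangle is isosceles with small base, or equilateral). This is a generic Python illustration only: it does not reproduce the coding into categories, or any other step, of equil-time-series.c, and the function name `ultrametric_fraction` and the tolerance parameter are assumptions.

```python
from itertools import combinations


def ultrametric_fraction(dist, tol=1e-12):
    """Fraction of triplets whose two largest pairwise distances are equal."""
    total = ok = 0
    for i, j, k in combinations(range(len(dist)), 3):
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        total += 1
        # Strong triangle inequality holds for the triplet iff largest == middle.
        if largest - middle <= tol:
            ok += 1
    return ok / total if total else 1.0


# Distances between points on a line: no triplet is ultrametric.
values = [0.0, 1.0, 3.0, 7.0]
line_dist = [[abs(x - y) for y in values] for x in values]
print(ultrametric_fraction(line_dist))

# A small ultrametric: two close points, both equally far from a third.
tree_dist = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
print(ultrametric_fraction(tree_dist))
```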
- Related to subsection 5.2: Random projection, clustering, Baire metric
- Text file with a listing of all processing, including all background and R code.
This is the file to read for all software, data and processing.
- Program to extract data from the BCI-encoded chemical dataset:
Compile and link using: gcc ExtractBCI.c -o ExtractBCI -lm
- Used in the processing in R, to repeatedly carry out random projections and
determine their mean: RepeatRanProj.r
- Slimmed down Correspondence Analysis program:
- Just drawing in a factor plane projection: plaxes.r
- Output plots. (See full description in the background description above.)
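The idea of repeating random projections and taking their mean can be sketched as below. This is a Python illustration under stated assumptions, not a translation of RepeatRanProj.r: each projection is onto a single random axis with coordinates drawn uniformly from [0, 1] and scaled to unit norm, and the names `random_projection` and `mean_of_projections`, the number of repeats and the seed are hypothetical.

```python
import random


def random_projection(rows, rng):
    """Project every row onto one random unit-norm axis; return the 1-D positions."""
    m = len(rows[0])
    # Assumed choice of axis: uniform coordinates in [0, 1], normalised.
    axis = [rng.uniform(0.0, 1.0) for _ in range(m)]
    norm = sum(a * a for a in axis) ** 0.5
    axis = [a / norm for a in axis]
    return [sum(x * a for x, a in zip(row, axis)) for row in rows]


def mean_of_projections(rows, repeats=100, seed=0):
    """Average the projected position of each row over repeated random projections."""
    rng = random.Random(seed)
    sums = [0.0] * len(rows)
    for _ in range(repeats):
        for i, p in enumerate(random_projection(rows, rng)):
            sums[i] += p
    return [s / repeats for s in sums]


# Toy data: three observations in three dimensions.
rows = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0]]
means = mean_of_projections(rows)
print(means)
```

The mean positions give a one-dimensional ordering of the observations, which is the kind of output that a subsequent clustering step could work on.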
- Some Further Data Sets.
- Annex to Chapter 6: High Dimensional Simulations: Uniform,
- face-reconstruction.txt Generating data for the
faces, clustering them and wavelet transforming the dendrogram. Then reconstructing
the face data.
- haarum.r Haar wavelet transform of a dendrogram, with an option
to carry out filtering in the sparsified, wavelet transform space.
- faces.R, display data as Chernoff faces. (Ad hoc program, to be
- Section 8.1.3 Determining Depth of Emotion, and Tracking Emotion,
and some other data sets.
Dialogue between Ilsa and Rick in key scenes
in the Casablanca movie. CSV file.
From Chapters 9 to 12 of Flaubert's
Madame Bovary, 22 successive text segments. Zip file containing 22 text
Twitter flows from New York Times, Le Monde, Guardian,
Irish Times, Süddeutsche Zeitung. The original of each of these 1000-tweet
flows, and the tweet texts alone, saved from R as .RData files. A readme file
describes the setting up of these files.
- Subtitles of the film Casablanca, in English, French, German.
Additional Datasets and Software Analytics Environments.
Access all from this web page. [ Note: access currently requires a password.
The details are, name: Courses ; and the password consists of three letters,
a number, and then four letters, as one word: Wel 1 Come ]
- Big Data Analytics:
- Using Apache Lucene,
- Semantic Vectors,
- Solr, using:
- Classification Literature Automated Search Service (Here the data from Volume
23, with 1994 data, to Volume 41, with 2012 data, are available. In this data, there
are 93,191 bibliographic records.),
- Data Analysis of Cooking Recipes (152,998 recipes), availing of both R and Solr.
Author contact details: f m u r t a g h @ a c m . o r g
Most recent update: 2017 October 29. Updated were: (i) under "Chapter 10", the link
now has access details written here; (ii) under "Related to subsection 5.2", "Software",
a .c program and three .r scripts are now accessible; (iii) under "Chapter 5", the link
to nrutil.h is accessible.