- Dr. Shalin Hai-Jew, ITS, Kansas State University
Part 01: A Crowd-Sourced Definition of "Data Science"
“Data science,” both as a term-of-art and as many types of practices, is still being defined. In one crowd-written definition, it is “a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining” (“Data science,” Feb. 20, 2019).
Classic basic data structure...
A Basic Data Table Structure for Structured (Labeled) Data -> Data Arrays -> Matrices -> Cross Tabulations...
The definition of "data" has expanded--from structured data (data arrays, data tables, spreadsheets) to semi-structured/unstructured data (imagery, text sets, multimodal data sets), and others. This draws from the ideas in qualitative research that everything has some data value and can be informative in one way or another. Data also come from sensors and IoT (Internet of Things) devices and any number of other sources. An assumption is that some degree of "ground truth" is accessible.
What is knowable has expanded...in ways that may be invisible to most people.
Past research is put to the test of validity/invalidity based on current practices and knowledge.
Research is evaluated based on precision of available measures. It is evaluated based on explanatory depth, value to decision making, and insights.
What may be analyzed covers "all" time--past, current, and future. Past thinking may be studied and validated/invalidated. Learning from the past may be extended into the present and future.
"Computational thinking" is about...
- efficient problem solving
- based on a knowledge of "priors" and baselines
- with clear and logic-based hypothesizing
- recordable steps
- generalizable steps (to other contexts)
- available acquire-able data
- statistical methods
- computational efficiency, and
- parsimonious (as simple and pared-down as possible) modeling.
It is about pursuing disconfirming information against a hypothesis in order to achieve rigor.
Beyond computational thinking (thinking algorithmically, defining with hyper precision, applying statistical analysis), visual thinking is an important part of data science.
Data science enables new senses of complexity with prior unknown relationships and interrelationships between variables.
From one exemplar to an N = all...
From "n/N = One" to "n/N = All"
Learning can be done from an n=1 to an n=all. (An "N" is drawn from multiple populations.) Some qualified generalizations can be made from any of the above. The idea is to find sensitive "indicators" or "tells" of particular phenomena or identities or classifications in a sparse way. Simplicity is preferable over unnecessary complexity.
"All" really means the full set of available data. It is not "all" of reality.
As a "science"...
As a "science," there are defined procedures...professional ethics...domain standards...and accepted research and data collection and analytic approaches. There is a sense that exploration is continuous.
The 'Scientific Process' and Data Science
Ways to contribute and build a reputation in this space...
Making Personal Marks in this Space
People make their marks in this data science and analytics space by...
- how they achieve new learning from available data
- how they evolve new methodologies for acquiring data-based insights
- how they create usable conceptual constructs for reasoning about phenomena
- how they demonstrably solve (hard) real-world problems in practical and verifiable ways using data
- how they collaborate with others
Ultimately, they also make a mark by what data and tools they have access to and can wield effectively. This depends on their work and their data hobbyist interests.
Making a mark involves two basic aspects related to other-researcher attention: (1) attention-getting and (2) attention-keeping. The first comes from surprise or unexpectedness, and the latter comes from substance. The first is generally short-term and fleeting, and the latter is generally long-term and potentially more enduring. The aspects of attention-getting and attention-keeping may emerge from (1) the conceptualization of the research, (2) the execution or method of the work (its rigor, its reproducibility, its creativity, its transferability, its emulate-ability), and (3) the research findings (their relevance, their expectedness, their challenge to existing knowledge). An individual or a research team really only has control in terms of (1), the initial conceptualization, and (2) and (3) depend on other factors. Working on a trusted lean team can be helpful to fill in expertise gaps.
Part 02: Some Elements of Data Science
"Data Science" as a Venn Diagram
Methods for data pattern identification from datasets have expanded with "machine learning" and "deep learning" and other forms of artificial intelligence (AI).
- To refresh, some common data patterns may include...
- classic descriptive statistics in quant data indicating the center of a distribution (mean, median, mode), its spread (standard deviation), and the shape of its frequency distribution (symmetry, peaks, skewness/kurtosis, uniformity/non-uniformity, tails and outliers, and others);
- convergence or divergence (over time, over data, over methods...);
- (random) sampling power;
- similarities / dissimilarities;
- association between variables;
- relations between variables;
- variable roles in contributing to particular constructs (factor analyses, principal component analyses, and others);
- time relations and changes over time;
- geographical spatial relations;
- time-based oscillations / frequencies;
- classifier predictive analytics;
- data-to-latent (hidden) insights;
- object identification in imagery;
- sentiment in imagery;
- topic modeling;
- language patterns;
- probabilities, and others...
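As a refresher on the first item in the list above, here is a minimal sketch of descriptive statistics using Python's standard `statistics` module. The sample values are hypothetical and chosen only to show how an outlier pulls the mean away from the median.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 4, 5, 5, 5, 6, 9, 18]

mean = statistics.mean(values)      # center: arithmetic average
median = statistics.median(values)  # center: middle value
mode = statistics.mode(values)      # center: most frequent value
stdev = statistics.stdev(values)    # spread: sample standard deviation

# A mean above the median hints at a right (positive) skew,
# here caused by the single large value 18.
print("mean:", mean, "median:", median, "mode:", mode)
print("standard deviation:", round(stdev, 2))
```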
Data patterning and finitude. The assumption behind data patterning is that realities in the world express in ways that are somewhat observable. [Researchers are thought to be able to achieve "saturation" in their knowledge, ultimately, because virtually every phenomenon is somewhat limited and somewhat finite.]
Finitude or Infinitude re: Studied Phenomena?
Everything is patterned (pretty much). Virtually everything in the world is patterned and somewhat understandable through patterning. Actual randomness can be challenging to find in the wild. This is not to say that there are not chance effects on in-world phenomena. This is not to say that there is not complexity. This is not to say that there is not chaos.
Divergent voices help teams to consider different possibilities. While researchers may agree on the available data, they may come at the facts very differently, through different value systems, frameworks, interpretations, and lenses. Diversities of interpretation are important for advancing fields. The "tenth man" (person) approach is to purposefully assign a person to be a divergent voice in analyses and discussions in order to broaden the mental space around issues. Sometimes, this divergent voice is not only insightful but correct.
Data modeling refers to the use of "training data" to create predictive models to identify future data records / true examples of the same type. There are various ways to model data and then to test the accuracy of the model [often summarized by the "F1 score," the harmonic mean of the precision and recall of the test]. Often, the tradeoff is between high recall (finding as many of the target objects as possible) and precision (making sure that whatever is identified as being a particular thing is actually that thing). Excellent models are those that achieve the highest precision and recall possible in the particular space. In the ideal case, every example of the kind is identified accurately, and there are no false positives or false negatives (misidentifications). There are ways to test competing models and to tweak them computationally to get closer to accuracy. The tests are usually conducted against set-aside data from the original labeled set.
In a binary differentiating model, accuracy refers to the total of true positives and true negatives divided by the full set of true positives, true negatives, false positives, and false negatives. In other words, accuracy is how well the model works in properly identifying true positives and true negatives from the full set. "Precision," the positive predictive value, shows what share of the items identified as positive actually are positive. Precision is calculated as the number of true positives divided by the sum of true positives and false positives. (The higher the number of false positives, the weaker the precision and the noisier the results.) "Recall" refers to the number of correctly identified true positives divided by the sum of true positives and false negatives (actual positive values identified as non-positive values). Recall points to how many of the positives in the set are correctly identified as such. The testing of the model is done against a set-aside portion of the labeled data in order to see how well the model achieves accuracy of labeling. (The models I've seen applied to open-source datasets range from about 60% - 100% accurate.)
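The four measures described above can be sketched from the counts of a binary confusion matrix. This is a minimal illustration, not any particular tool's implementation. The function name `classification_metrics` and the counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing was predicted positive
    # or when there are no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts from a set-aside test set of 100 labeled records.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With these counts, precision is lower than recall: the hypothetical model finds most of the true positives but also flags ten records that are not positives.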
There are ensemble methods that enable combining multiple techniques to acquire improved predictive results.
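One of the simplest ensemble approaches is majority voting, in which several models label the same records and the most common label wins. The sketch below assumes three hypothetical classifiers whose outputs are already available as lists. It is only one of many ways to combine models.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions by simple majority vote per record."""
    combined = []
    # zip(*predictions) groups the votes that all models cast for one record.
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three classifiers on the same five records.
model_a = ["spam", "ham", "spam", "ham", "spam"]
model_b = ["spam", "spam", "spam", "ham", "ham"]
model_c = ["ham", "ham", "spam", "ham", "spam"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Using an odd number of voters with two possible labels avoids ties. With more labels or an even number of models, a tie-breaking rule would be needed.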
In "machine learning," various algorithms may be systematically applied to identify patterns in data (based on various statistical methods). In some cases, the machine learning is supervised (informed by human labeling of data into categories), and in other cases, it is unsupervised (not informed by human labels but by similarities between observed data, with sets created by similarity between item features or combinations of features). The types of problems solved with machine learning involve object identification / classification ("output variable"), the clustering of high-dimensional data, and others. Machine learning is often achieved through various iterations of exploration to solve human problems--in the context of human oversight and expertise.
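To make the supervised / unsupervised contrast concrete, here is a toy sketch on one-dimensional data. The supervised part uses a nearest-centroid rule built from human-provided labels. The unsupervised part uses a two-group version of k-means that sees no labels at all. Both techniques are swapped in for illustration, the data and function names are hypothetical, and `two_means` assumes the points contain at least two distinct values.

```python
def nearest_centroid_fit(points, labels):
    """Supervised: use human-provided labels to compute one centroid per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def two_means(points, iterations=10):
    """Unsupervised: split 1-D points into two groups by similarity alone."""
    lo, hi = min(points), max(points)  # initial centers at the extremes
    for _ in range(iterations):
        group_lo = [x for x in points if abs(x - lo) <= abs(x - hi)]
        group_hi = [x for x in points if abs(x - lo) > abs(x - hi)]
        lo = sum(group_lo) / len(group_lo)
        hi = sum(group_hi) / len(group_hi)
    return group_lo, group_hi

# Hypothetical measurements and human-assigned labels.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]

centroids = nearest_centroid_fit(points, labels)
label_for_new = predict(centroids, 4.5)   # supervised prediction
clusters = two_means(points)              # unsupervised grouping
print(label_for_new, clusters)
```

The unsupervised routine recovers the same two groups here, but it cannot name them. Naming the groups still takes a human.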
Advances are occurring in every sector of the prior "data science" Venn diagram, with (1) various types of data ("everything datafied"), (2) new analytical methods and technologies (and custom programs), and (3) creative applications of data science in a variety of domains / fields / disciplines by various researchers and research teams with a variety of software tools and custom programs.
Old datasets have new insights to give up through recombinations and new exploratory approaches. Former research may be validated/invalidated based on the newer enablements. (Data analyses have impacts in the past, present, and future.)
Part 03: Some Visual Senses of "Data" and "Data Science" (to Warm Up to the Topic)
#datascience hashtag network on Twitter and hundreds of close-in conversations
Identified Groups of Co-conversants in #datascience Hashtag Network on Twitter
Gists: #datascience is dynamic, and those on social media are busy engaging about it in a range of different large and small-cluster (social network) conversations. (What may be non-obvious in the graph above is that the social media accounts, the messaging, the time of the messaging, and other details are available in the dataset from Twitter. The data were extracted here using NodeXL, a free add-on to MS Excel. A third-party social media tool was used to link to Twitter. A legitimate Twitter account is needed to "whitelist" into Twitter for the data. The API is rate-limited...and there is a top limit of about 3500 - 4000 Tweets per capture, from most recent available messages.)
Tweets from the #datascience Hashtag Network on Twitter
Word Cloud of Tweets from #datascience Hashtag Network on Twitter
Gists: The Tweets in the #datascience hashtag network involve discussions about #machine learning, AI (artificial intelligence), #deeplearning, the Internet of Things, R, and other aspects. Active participants may be seen as including @iainljbrown and @kirkdborne and others.
DataScienceCtrl @DataScienceCtrl on Twitter
- based in Los Angeles, California
- joined in October 2011
- "The Online Resource for Big Data Practitioners"
- Focus on "data science, ML, AI, deep learning,dataviz, Hadoop, IoT, and BI"
- 42,025 Tweets at the time of data capture, 1,112 following, 137,163 followers, 4,553 likes
Location of Social Network
Word Cloud Summary of Tweet Messaging from 2,385 Original Messages
Activity by Week in Chronological Time
Word Tree around "Big Data" per @DataScienceCtrl Microblog Messaging
Word Frequencies and Popular Issues in Recent Data Science Articles
Word Cloud from 168 Academic Articles about Data Science (from a Word Frequency Count)
Word Frequency Counts > 1000 References in Data Science Article Text Set (pareto chart)
Word Cloud from 168 Academic Articles about Data Science (from a Word Frequency Count) / Without Parameters
Gists: The academic literature around "data science" engages a variety of issues. Some are specific to particular domains. Others are general. Based on a one-gram word frequency count, the main focuses in peer-reviewed publications are data, science, research, models, big data, and so on.
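A one-gram word frequency count of the kind behind the word clouds above can be sketched with the standard library. This is not the software used for the visuals. The two sample sentences and the small stopword list are hypothetical stand-ins for the 168-article text set.

```python
import re
from collections import Counter

def one_gram_counts(texts, stopwords=frozenset()):
    """Count single words (one-grams) across texts, ignoring case and stopwords."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts

# Hypothetical mini text set.
docs = [
    "Data science uses data and models.",
    "Big data research needs models and science.",
]
stop = frozenset({"and", "uses", "needs"})

counts = one_gram_counts(docs, stop)
top = counts.most_common(3)
print(top)
```

Note that a one-gram count splits "big data" into two separate words. Capturing it as a phrase would take a two-gram count.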
Related Issues with "Data Science" (on Wikipedia)
"Data_science" Article-Article Network on Wikipedia in a One-Directional (Outlinks) Directed Graph (1 deg.)
Gists: "Data science" as an article has some relevant outlinks to other article pages on Wikipedia. Notable individuals who've contributed to the field are linked to, like Nate Silver of 538 fame. There are ties to power centers for data science. There are links to related topics like "predictive modeling" and "open science".
"Data" Related Tags Network on Flickr (1.5 deg.)
Gists: A related tags network involves the identification of co-occurring tags (labels) between a seed tag ("data," in this case) and other "folk" tags applied by users who share imagery on an image-sharing social platform. Of late, auto-coded tags/labels are also applied to imagery. At a certain threshold of co-occurrences, those top-level co-occurring tags are listed together. In a 1.5 degree network, the tags may be seen at one degree (direct ties to the "data" tag) but also there is transitivity between these direct tags ("alters") in the "data" ego neighborhood. The above is a 1.5 degree related tags network. Note how the co-occurring tags in each group evoke a larger construct per group (in each container).
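The 1.5 degree idea described above can be sketched in a few lines: count tag co-occurrences, keep the pairs above a threshold, find the seed tag's direct ties (the alters), and then keep the ties among those alters. The tag sets, the threshold of 2, and the function names are all hypothetical. The actual platform data and thresholds behind the figure are not known from the text.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tagged_items):
    """Count how often each unordered pair of tags appears on the same item."""
    pairs = Counter()
    for tags in tagged_items:
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

def ego_network_1_5(pairs, seed, threshold=2):
    """Return the seed's alters and the alter-to-alter ties at a threshold."""
    strong = {p for p, n in pairs.items() if n >= threshold}
    # One degree: tags tied directly to the seed tag.
    alters = {a if b == seed else b for a, b in strong if seed in (a, b)}
    # The extra half degree: ties among the alters themselves.
    alter_ties = {(a, b) for a, b in strong if a in alters and b in alters}
    return alters, alter_ties

# Hypothetical tag sets for five shared images.
items = [
    ["data", "chart", "graph"],
    ["data", "chart", "graph"],
    ["data", "server"],
    ["data", "server", "cable"],
    ["chart", "cable"],
]

pairs = cooccurrence_counts(items)
alters, alter_ties = ego_network_1_5(pairs, "data")
print(sorted(alters), sorted(alter_ties))
```

In this toy set, "chart" and "graph" are tied to each other as well as to "data", which is the kind of grouping that suggests a larger construct.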
Autocoded Themes in Academic Data Science Article Set (in a Treemap Diagram)
Autocoded Themes in Academic Data Science Article Set (in a Sunburst Diagram)
Gists: Using a form of topic modeling, it is possible to pull high-level topics and literally related sub-topics from a set of texts (text corpora). The above shows some of the combined topical focuses of the academic articles around "data science." Of interest would also be the "long tail" of topics in the academic literature to understand one-off types of research being done in this field and researchers engaged in niche research.
[How Text Sets are Processed: Note that if the articles were run separately through the topic modeling, a different set of high-level topics and related sub-topics would be pulled...even though the underlying articles are the exact same ones. The above was attained from a combined set of texts from the 168 articles. Why would running the same articles separately result in somewhat different topics (in a more nuanced sense)?]
Top-Level Topics in the Data Science Article Set (in a 3D Bar Chart)
Gists: The autocoded topic modeling information may be visualized at just the high level topics based on the data science article set.
Part 04: Some Common Research Sequences Using "Data Science"
Some Common Research Sequences Using "Data Science"
A brief summary of the sequences (above)
- research design (and a priori hypothesizing)
- review of the literature
- instrument creation, pilot-testing, and revision (for construct validity through internal consistency, reliability or consistent results in test-retest conditions, and other features)
- data collection (and recording...to enable reviewability with different lenses and approaches)
- data cleaning and pre-processing (and "feature engineering" or the making of new variables by combining data)
- data aggregation / merging (if relevant)
- analyses (findings, confidence level, internal validity of model or instrument, external validity or how well the experimental results map to the world)
- post hoc hypothesizing (theorizing from the research results, with the new data-informed insights from the research work)
- write-up (for coherence, accuracy, critique-ability)
- presentation (and potential follow-on work)
There are differing orders of operations that are possible based on research objectives and contexts, but sequentiality does matter. "Data science" analysis can actually be applied at any point, including during the earliest stages of research design. Research is achieved in a (tightly-coupled or loosely-coupled) chain, with follow-on work based on earlier work.
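The data cleaning and "feature engineering" steps in the sequence above can be sketched as follows. The records, field names, and the derived body-mass-index variable are hypothetical examples. The cleaning works on a deep copy so the raw data stay untouched.

```python
import copy

# Hypothetical raw records, as they might arrive from a survey export.
raw_records = [
    {"height_cm": "170", "weight_kg": "65"},
    {"height_cm": " 180 ", "weight_kg": "81"},
    {"height_cm": "", "weight_kg": "70"},   # missing value
]

def clean(records):
    """Clean a deep copy of the records and drop rows with missing values."""
    cleaned = []
    for row in copy.deepcopy(records):
        values = {k: v.strip() for k, v in row.items()}
        if all(values.values()):
            cleaned.append({k: float(v) for k, v in values.items()})
    return cleaned

def add_bmi(records):
    """Feature engineering: derive a new variable by combining two existing ones."""
    for row in records:
        row["bmi"] = row["weight_kg"] / (row["height_cm"] / 100) ** 2
    return records

working = add_bmi(clean(raw_records))
print(len(working), "usable rows out of", len(raw_records))
```

Dropping incomplete rows is only one possible choice. Imputing missing values is another, and either choice should be documented in the write-up.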
You've got the data! You've got your discoveries! Yay!
So what? And?
The idea is to harness data and the discoverables from that for the "common good" (without incidental harms, in first-, second-, and other-order effects). This is a tall order but necessary for constructive human endeavors. People come together to debate these approaches and to create rules-based regimes to enable research that is beneficent.
Part 05: Some Tenets of "Data Science"
About data science research:
- Some fundamentals to research are the same. You can start with a research design (and testable hypotheses), or you can decide to go straight to available (found) data and explore. (This latter "exploratory" / "discovery research" approach is especially common with “big data” datasets, with researchers finding out what associations and correlations there may be between variables or elements that they may not have ever considered. They are letting the data "speak." Researchers do not have to posit a hypothesis first and then test that hypothesis with the data. This is why some refer to the "post-research" era of big data. If so much is known and so much data is available, who needs hypothesizing? Just cut to the chase and analyze the data...is the thinking.)
- The research and academic publications standards are probably higher with the assumptions of repeatability and reproducibility of findings…especially in an age of shared (cleaned, de-identified) quantitative datasets upon publication. (This means that if other researchers run the same statistical tests over the same datasets, they should find the same data outcomes as you have. No errors!)
- The research methods matter--whether they are quantitative, qualitative, mixed methods, or multi-methods. Research methods are designed to mitigate for human limits (subjectivities, cognitive biases), measurement instrument limits (reading "noise" for "signal"), and other limitations. Statistical analysis is applied to more precisely identify associations and relationships. Logic is applied inductively and deductively to enable reasoning from research results and data.
- Research participants should not experience any harm. They should be informed of the research purposes in honest ways. They should have opportunities to opt out if they do not want to continue.
- Data should not be acquired under false pretenses. (In rare cases, under the oversight of the IRB, limited deceptions may be approved for research that may not be achievable otherwise.)
- Also, data should not be used beyond the approved usages allowed under contract or within the workplace.
- "Data science," like all sciences, is imperfect. Constant exploration and learning are critical.
About data handling:
Example: Transcoding from Streaming Video To...
- The legal restrictions on the usage of data still apply. Data can be used in particular ways and not others; they cannot be broadly shared unless released to be shared.
- When transcoding data from one type to another (or creating data equivalencies across types), the effort should be accurate and comprehensive, with nothing lost in the transcoding (no lossiness).
- People's privacy should be protected at all times, from the point of capture to its analysis to its archival.
- De-identified datasets should be de-identified effectively. This means that the shared datasets should not be exploitable, with identities re-identified by others who have access to technologies, wherewithal, and other semi-related datasets.
- The rules for data handling also generally apply…including the need to maintain a pristine master dataset before any data cleaning is done, so there is a complete and pristine set to recopy for reuse in the research and data analysis process.
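One rough way to probe the re-identification risk mentioned above is to count how many records share each combination of "quasi-identifiers" (fields that could be matched against other datasets). This is the idea behind k-anonymity, a technique the text does not name and which is swapped in here for illustration. The records, field names, and function name are hypothetical, and passing such a check does not by itself make a dataset safe to share.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values on all quasi-identifier fields."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical de-identified records (names already removed).
records = [
    {"region": "A", "age_band": "30-39", "score": 71},
    {"region": "A", "age_band": "30-39", "score": 88},
    {"region": "B", "age_band": "40-49", "score": 64},
]

k = smallest_group_size(records, ["region", "age_band"])
print("smallest group size k =", k)
```

A result of k = 1 means at least one record is unique on those fields and could be singled out by someone holding a related dataset.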
About research standards:
- In terms of research, a research project can be spoiled / compromised at any stage: research design, IRB signoff and continuing oversight (as needed), setup, instrument design, research methodology, data collection, data curation (as in text corpora, social image sets), data analysis, logical reasoning, and so on. (This is why it is a good idea to have professional and kind colleagues pound on your work at every stage to make sure it’s rigorous. And when it is not, own it with qualifiers and delimitations in the writing.)
- Be able to be fully transparent about every step of the research work. Document in detail. Relay the work specifically. (Do not just engage in going through the motions. Avoid “scientism.”)
- When setting up research, set it up to capture knowledge that represents the world as it is, not how you want to see it. Make sure to design ways for you to see realities beyond what you might have initially imagined.
- Hypotheses should be theoretically (and practically) falsifiable (able to be "proven" or established to be inaccurate). There have to be frameworks, techniques, and standards...by which something may be at least tentatively decided. Assertions which are not testable are not scientifically relevant.
- Do not p-hack, that is, throw out data or change parameters to get to a level of “statistical significance” (p < .01 or p < .05) in order to have the grounds to reject the null hypothesis. Set the parameters during the research design, and go with those in an a priori way, not a post hoc one. (A p-value is the probability of seeing results at least as extreme as those observed if the null hypothesis were true.)
- Do not manipulate data towards your own desired ends (no self-dealing). Whatever standards you apply to others' research work (as a peer reviewer), apply also to yourself. (Apply grace and rigor to others and yourself.)
- Apply common sense and abductive logic to the interpretation of the findings. Humans are always in the loop at some point…and they must engage in sense-making. (Computational machines can make “machine sense” that makes no sense to humans, such as when computers are used to emulate a human “coding hand” with a Kappa coefficient of 1 or close to 1…but with all kinds of nonsensical results.)
- More data points are preferred over fewer. But sometimes a few established relevant indicators may be sufficient.
- The same data may be interpreted in different ways based on differing interpretive lenses and differing values. Researchers can agree on the facts but may differ on the meanings of those facts.
- Researchers have to be willing to "go there," even if the data seem counter-intuitive, even if the results are undesirable.
- Current artificial intelligence (AI) applications are still highly and purposively narrow, designed to solve certain defined problems and aid in certain types of decision making. Humans are in the loop in the decisions.
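The warning against p-hacking in the list above has a simple arithmetic basis. If a researcher keeps trying different tests or parameter settings, the chance of at least one "significant" result by luck alone grows quickly. The sketch below assumes the tests are independent and that every null hypothesis is actually true. The function name is hypothetical.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** num_tests

# How the risk grows as more tests are tried at the p < .05 threshold.
for num_tests in (1, 5, 20):
    rate = family_wise_error_rate(0.05, num_tests)
    print(num_tests, "tests ->", round(rate, 3))
```

This is why the significance threshold and the planned analyses are best fixed during research design rather than adjusted after seeing the data.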
About representing research findings:
Embargoes and Black Boxes
- Be as transparent as possible about representing research and research findings.
- If works are shared with the public under contract through academic and other publishers, the author contracts have to be respected.
- Embargoed data have to stay embargoed (not shared).
- Understand the conventions of data visualizations, and adhere to them, so as not to mislead users of the visuals (or underlying data). These conventions involve directions of interpreting data visuals and diagrams, shapes, lines, fill types, and others. Directionality is usually from top to bottom, left to right, clockwise (and not counter-clockwise), and so on. Particular types of visuals have more in-depth rules of engagement, many of them invisible and assumed (unless one pays closer attention). Design data visualizations in ways that are clear and precise (and labeled). This is especially important if the data visualizations are "non-consumptive." Auto-drawn data visualizations may require further explication.
- Enable access to the underlying data, so others may revisualize the data in different ways (if data de-identification is effective and professional). [This approach is common in quantitative research approaches but not qualitative research approaches.]
- Avoid overreach. Qualify assertions where necessary. Restrict generalizability of the findings where necessary. Acknowledge research limits and blind spots. Doubling-down on research assertability and relevance is a rookie move and should be avoided. Research works are stronger when the researchers can objectively see the limits of their work and identify gaps that other researchers may explore. If engaging in post hoc hypothesizing, indicate it as such.
About social and other online data:
- Online, on the Social Web and on social media platforms, it is possible to capture various text sets…various image sets…and summary data (like the related tags networks, #hashtag networks, social networks, mass search data, mass book data, and other data).
- Social data are structured and semi-structured. They are multimodal, such as combined imagery, audio, text, and motion video, in various combinations.
- There are automated ways to capture social data. There are manual ways to capture social data.
- How datasets are curated affects what is discoverable from data explorations.
- When people share information on social media, their agreed-on end user license agreements (EULAs) release such information for public use. However, it is important to be careful in using such public data, especially in terms of assertions and in terms of linking to individuals. Some human subjects review entities have a pro forma process in terms of approving or disapproving research using such data. However, many social media platforms themselves use such data for their analytics and advertising, and many sell or share such personal data to third-party entities. And public data on the Social Web has been used to train various AI.
About multiple natural languages:
- Any language that is represented on the Web and Internet can be analyzed in (most) qualitative data analytics software packages because of Unicode support (typically through the UTF-8 encoding).
- Most software analytics tools do not enable multiple language analyses simultaneously (usually a “base language” is required to be defined), but some packages are multi-lingual and include built-in dictionaries and other features in multiple languages. This means that after one language has been analyzed, the base language can be reset for other language-based analyses.
- A number of online survey tools have built in Google Translate, to enable versioning surveys in multiple languages. Optimally, native speakers would vet languages for accurate understandings and accurate representations.
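The point about UTF-8 in the list above can be illustrated briefly: one encoding represents text in very different writing systems, and the text survives a round trip to bytes and back. The sample words are arbitrary illustrations and are not drawn from any dataset in this presentation.

```python
# Arbitrary sample words in three writing systems.
samples = {
    "English": "data",
    "Spanish": "información",
    "Japanese": "データ",
}

round_trips = {}
for language, word in samples.items():
    encoded = word.encode("utf-8")             # text -> bytes
    round_trips[language] = encoded.decode("utf-8") == word
    print(language, len(word), "characters ->", len(encoded), "bytes")
```

Characters outside basic Latin take more than one byte each, which is why the character and byte counts differ for the last two words.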
About "non-consumptive" mass data:
- Some mass data may be studied in a non-consumptive way (think Google Books Ngram Viewer, Google Correlate, and some others). Non-consumptiveness suggests that the researcher can see summary data but cannot have access to the underlying datasets informing the findings. This may be because the datasets are under intellectual property protections...or the datasets are very large and technologically inaccessible...or there are costs to access.
- Some de-identified data snippets are available on data.gov. There are open-access databases, of various provenance. These are tested against re-identification.
- Commercial datasets are available for various topics. These are fairly expensive and have contractual restrictions on their usage. In terms of very large "big data" social datasets, these often have to be explored where the data resides.
About data analytics and statistics tools:
- Be careful because if a user goes in and is not clear what is going on statistically, it is still possible to output something that looks passable. Without sufficient oversight, researchers may push into publication something that has no empirical grounds and is not logically assertable.
- If every data point looks statistically significant or earthshaking findings have been discovered, do a reality ("sanity" or "gut") check because it is likely that there are mistakes in the data and / or data processing and / or researcher assumptions. User error is not uncommon. (If it’s too good to be true… If it is too improbable… ) A reality check is run against known "priors" and known states of the world...
- Be skeptical. A core science assumption is to be skeptical of everything, everyone, and (especially) oneself.
- Define and pressure-test assumptions and understandings. Back-track the entire process because there are limitations at every step and the possibility of the introduction of errors. The parameters of the data analytics have to be accurate…and make sense. Run the data with a fresh head. Run it multiple times. See what happens with different parameters when modeling the data. Consider a wide range of interpretive possibilities. Invite trusted persons to review the work.
- And if you have an amazing discovery, so be it--but pressure test the work before shouting from the rooftops. And qualify, qualify, qualify.
About taking the research into the world:
- Analyzing data should be done in professional ways and with a sophisticated understanding of the larger domain and society and world. Research has implications on the larger world, and those should be considered when setting up and conducting research and then sharing the findings. Post-research actions, decision-making, and other efforts should follow relevant ethical guidelines.
- Consider counterfactuals. Use a broad range of possible interpretations in the findings. Avoid tunnel vision.
- Human rights and human rights protections have to be upheld.
- Applicable laws--local to global--should be adhered to.
- Give credit where it is due.
- Do not plagiarise. Keep accurate documentation, and avoid sloppy citations.
About data scientists and professional skill sets:
- Those who work in this field need to know their topic domain area very well (without being "captured" by it). They should also know peripheral fields well, in an interdisciplinary sense.
- They need to understand data in its various forms.
- They need to be able to execute the processes of the research work thoroughly (and work with others with different augmentary expertise).
- They must be able to make defensible assertions, with the proper level of confidence. This is not about “proving” an assertion but about observations of the world with the best available information. Over-asserting beyond the available information should be avoided.
- It helps to be able to work with databases.
- Understanding statistical methods is important (although machines do the heavy lifting). Start with basic statistics, and acquire new understandings and skills piecemeal.
- It is important to understand the conventions of data visualizations, so information is conveyed accurately. This includes auto-drawn data visualizations (about which there has been an explosion of new types).
- They should have a healthy skepticism of their own perceptions, cognitive biases, subjectivities, and intelligence (because the research is fairly strong that people are systematically limited in a number of ways). They should have the finesse to express their skepticism of others' work diplomatically and constructively. They should be able to recruit other experts to "spot" them on their own work.
- They should have a sense of “askable questions” and knowledge about how to get to the “answers” through data and through statistical analysis.
- They should be able to recognize relevant discoveries in data explorations (even when the signal is weak or subtle)...eventually (if not initially). They should also avoid mistaking “noise” for signal.
- They should have a strong professional and ethical core in terms of how they conduct themselves as researchers and what they do with data and how they frame and “wield” research findings.
- They should keep an open mind to “black swans.” Just because history went one way does not mean that new kinds of events cannot occur. Novel and unexpected events with outsized impacts may occur in a dynamic world. (Think chaos theory.)
- They should hold research “truths” fairly lightly even though they may have empirical backing because the world is a dynamic place, and things change. And a data point is only "." big! And there is always more information to collect and understand. And there are risks to over-stating facts and under-understanding the world.
- They should have common sense in how they apply the learned information.
- They should not just explore what is most popular (as in frequency counts) but also “long tails.” They should not just engage a zoomed-out view but also zoomed-in ones. They should engage at macro, meso, and micro levels.
- They need to be dogged and persistent. They should be sufficiently willful to persevere but flexible to see differently and change directions and interpretations as needed.
- They should be humble about assertions because they rarely use a set of data where N = all. Even if they were using “all,” the full set does not represent the world fully. The research methods are not perfect, and they do not make researchers all-seeing. How the research findings are applied may be overly simplistic or too parsimonious (such as in mono-causal conceptual models)...or overly complex (such as in conceptual models with too many variables). (External validity would still be limited.)
- They should be able to use an exemplar of 1 (n = 1) and still be able to make valid assertions about the data.
- They should be as efficient (and reasonably speedy) in their work as possible while maintaining accuracy and data handling standards.
- Optimally, they should be able to de-identify a dataset against any sort of re-identification (a hard task). Or they should be able to work with an expert to achieve this.
- Optimally, they should be able to strip out unnecessary data against data leakage, in single objects, in sets of objects, and in full databases (only sometimes possible).
- They understand that there are legal liabilities in research and data analyses, and they color within the lines, not outside.
- They should be able to justify the time, expertise, and resources required for their work--in professional environments.
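The point above about de-identifying a dataset can be made a little more concrete. The following is a minimal sketch, not a complete method: it assumes a tiny made-up table with names already removed, treats ZIP code, birth year, and sex as "quasi-identifiers" (fields that can be combined to single someone out), and counts the smallest group of rows that share the same values. This is the "k" in k-anonymity. The function names, field names, and records are all hypothetical, and a reasonable k by itself does not guarantee protection against re-identification.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same quasi-identifier values (the 'k' in k-anonymity).
    Raises ValueError if records is empty."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


def generalize(record):
    """Coarsen a record (an illustrative choice, not a standard):
    keep only the first three ZIP digits and round birth year down
    to the decade."""
    coarse = dict(record)
    coarse["zip"] = record["zip"][:3] + "**"
    coarse["birth_year"] = record["birth_year"] // 10 * 10
    return coarse


# Hypothetical, made-up records for illustration only.
records = [
    {"zip": "66502", "birth_year": 1980, "sex": "F", "score": 71},
    {"zip": "66502", "birth_year": 1980, "sex": "F", "score": 88},
    {"zip": "66502", "birth_year": 1975, "sex": "M", "score": 64},
    {"zip": "66506", "birth_year": 1975, "sex": "M", "score": 90},
]
fields = ["zip", "birth_year", "sex"]

k_raw = smallest_group_size(records, fields)
coarse_records = [generalize(r) for r in records]
k_coarse = smallest_group_size(coarse_records, fields)

print("k before generalizing:", k_raw)
print("k after generalizing:", k_coarse)
```

In this toy table, a k of 1 before generalizing means at least one row is unique on those three fields and could be singled out; coarsening the fields raises k at the cost of detail. Real datasets usually call for an expert review in addition to a check like this.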
START Where You Are...
...and take little bites at a time...
About common data analytics tools at desktop level...
(1) Commercial Technologies
- Statistical Package for the Social Sciences (SPSS by IBM) for quantitative data analysis
- NVivo 12 Plus for qualitative and mixed methods data analysis
- ATLAS.ti for qualitative data analysis
- SAS for quantitative data analysis (and much more)
- Linguistic Inquiry and Word Count (LIWC) for basic linguistic analysis and psychometric analysis
- UCINET for network analysis and visualizations
(2) Free-ish Technologies...(some limited to educational and non-commercial use)
HIGH LEVEL SCRIPTING LANGUAGES WITH ANALYTICS PROGRAMS
- R (with data analytics packages for this high-level scripting language)
- Python (with data analytics packages for this high-level scripting language)
- Microsoft SQL Server Express (free with 10 GB size limit)
DATA ANALYTICS SOFTWARE (WITH MACHINE LEARNING CAPABILITIES)
- RapidMiner Studio: machine learning (free with registration and education email) (with a commercial version)
- Weka for machine learning
NETWORK ANALYSIS SOFTWARE
- Network Overview, Discovery and Exploration for Excel (NodeXL) add-on to Excel
- AutoMap / ORA-NetScenes / ORA-LITE and others (CASOS)
- Gephi software for graph analysis
- Cytoscape software for graph analysis
AGENT-BASED MODELING (FOR PREDICTIVE ANALYSIS, FOR SYSTEM INTERACTIVITY)
- NetLogo agent-based modeling (with extant model library)
and many others
A data science "worldview" (IMHO)...