
Final Project Luke Hartman

Research Question: Are the overlaps in the text patterns/word choice of these speeches definitively associated with gender, nationality, location, or date?

Process:

This question has evolved a good bit since I first interacted with this data corpus in the first two weeks of the semester. The first thing I did with the data was upload the entire corpus into Voyant to learn how to use the text analysis platform. The visualizations I was able to produce were interesting and informative, and I knew they could be useful for my final project if I did more work with them. One that particularly intrigued me was the “Cirrus” tool, which displays the most frequently used words in the corpus, with each word sized according to its relative frequency. It is the interactive image embedded on the “Most Common Words: Overall” slide in the overview section of the timeline. A screenshot is embedded below for reference, although the timeline view is preferable in my opinion as it is interactive.
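For readers curious what Cirrus is doing under the hood, here is a minimal Python sketch of the same idea: count word frequencies across a corpus, drop common stop words, and report the top terms (which Cirrus then scales by size). The `speeches` folder and the stop-word list are my own illustrative assumptions, not Voyant's actual internals.

```python
# A minimal sketch of a Cirrus-style frequency count, assuming a folder
# of plain-text speeches. The folder name and stop-word list are
# hypothetical placeholders, not what Voyant actually uses.
import re
from collections import Counter
from pathlib import Path

STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "is", "we", "our"}

counts = Counter()
for path in Path("speeches").glob("*.txt"):  # hypothetical corpus folder
    words = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
    counts.update(w for w in words if w not in STOP_WORDS)

# The 25 most frequent words; Cirrus scales font size by these counts.
for word, count in counts.most_common(25):
    print(f"{word:>12}  {count}")
```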

I knew from this early point that I wanted to continue working with this data, but I wasn’t sure how to formulate a meaningful research question. When we used Palladio in class, I created a visualization that displayed each speech’s author with a picture for easy identification, along with text snippets about the speech topic and the dates and locations of delivery. This, in my mind, could be very useful for my final project: it would give the viewer an overview of the corpus and the baseline knowledge to make deeper text analysis of each speech more meaningful. It would allow me to ask a fairly specific research question without leaving a viewer confused about where to start due to a lack of background on the topic. Below is a screenshot of that Palladio visualization.

That being said, my goal for this project was to produce what Johanna Drucker calls a “knowledge generator” in the form of an interactive learning experience for the reader. Because Palladio does not allow its interfaces to be embedded on a third-party website, and I felt that a screenshot would not be engaging enough, I had to go another route. Using Knight Lab’s Timeline.js template, I created a timeline that includes data points for all 20 speeches on the dates they were delivered (or published, for written statements). The timeline also includes 20 separate data points in another category containing bios of the speakers, lending a bit more background and giving the viewer context with which to interpret each person’s speech.

One of the initial problems I ran into on the Knight Lab platform was differentiating, on the timeline itself (see the bottom of the image), between data points that contained speech descriptions and visualizations and those that contained bios and pictures of the authors. Initially I tried to find new ways to name the slides so that they would be distinguishable, and I even contemplated putting all the information (bios and speech text analysis) on the same slide to avoid confusion. What I was able to do instead, however, was a best-of-both-worlds solution: I categorized the data points into groups using an added feature called media grouping/type, thus separating them on the timeline. A screenshot is visible below.
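To make the grouping mechanism concrete, here is a hedged sketch of the JSON that Timeline.js consumes, where a `group` field on each event separates slides into bands (the spreadsheet template exposes this as a Group column). The headlines and dates are illustrative stand-ins, not my actual project data.

```python
# A sketch of Timeline.js-style event JSON with two groups, one for
# speeches and one for speaker bios. All values are illustrative.
import json

events = [
    {
        "start_date": {"year": "1963", "month": "8", "day": "28"},
        "text": {"headline": "Example Speech",
                 "text": "Speech summary and Voyant link go here."},
        "group": "Speeches",
    },
    {
        "start_date": {"year": "1929", "month": "1", "day": "15"},
        "text": {"headline": "Example Speaker",
                 "text": "Speaker biography and portrait go here."},
        "group": "Speaker Bios",
    },
]

with open("timeline.json", "w", encoding="utf-8") as f:
    json.dump({"events": events}, f, indent=2)
```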

 

This fix was ideal in many ways, but it also presented a new set of problems. With two sets of data points, I needed a new set of images to convey the speech-analysis aspect of those 20 slides. I decided that a relative frequency graph, displaying the most frequently used words in each document and how their usage shifts over the course of the text, would provide a good standardized comparison between speeches. The issue was that the Voyant link for this visualization, when embedded, automatically reverts to the graphic for the entire corpus. I therefore linked the visualization to each speech slide and added a detailed note in the introduction section telling readers how to view the individual frequency graph for each speech and how to compare graphs using the drop-down menu. In the end, this was a frustrating setback, but it led to the creation of the intro slide, to which I then added other descriptive information about the nuances of the project. This information was needed for more than just the details of the Voyant interface, but I was unable to recognize that flaw until the “setback” made me aware of it. The intro page welcomes the reader, outlines the layout of the site, mentions technical snags the viewer may run into, and then presents the research question in a clear way that gets the ball rolling for the reader’s analysis of the data. A screenshot of the intro page is below.
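One possible way to avoid sending readers through the drop-down menu would be to generate a per-document link for each slide. The sketch below assumes Voyant's `docIndex` URL parameter selects a single document within a corpus; that behavior should be verified against Voyant's documentation before relying on it.

```python
# A hedged sketch: generate one Trends-tool link per document so each
# timeline slide could point at its own relative-frequency graph
# instead of the whole-corpus default. Treat docIndex as an assumption
# to verify against Voyant's docs.
CORPUS_ID = "196d419a39af8bde45a5cabb6afbf8da"  # this project's corpus

def trends_url(doc_index: int) -> str:
    return (f"https://voyant-tools.org/tool/Trends/"
            f"?corpus={CORPUS_ID}&docIndex={doc_index}")

for i in range(20):  # twenty speeches in the corpus
    print(trends_url(i))
```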

 

Personal Conclusion:

My biggest takeaway from this project has been how incredibly difficult it is to draw conclusions based on text analysis alone. Many variables have gone into each of these speeches: date, location, the speaker’s ideology, race, social context, cultural differences in tone, and more. One trend I did find more consistent than others was a tone of aggression from the African American speakers in the corpus (with the exception of King’s Nobel Peace Prize speech). I also realized that this tone was far more apparent to me in the text analysis visualizations after I had read the speeches, which led me to recognize Voyant’s limitations when it comes to “sentiment analysis.” A tool built for sentiment analysis would be a good next step for anyone wanting to delve further into this question. As far as location and date, I was unable to draw any significant conclusions from the textual similarities and differences of speeches with similar inputs.
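As one concrete way to test that impression of tone, here is a minimal sketch of automated sentiment scoring with NLTK's VADER analyzer. This is my suggestion of a possible next step, not something the project actually used, and the file path is a hypothetical placeholder.

```python
# A minimal sentiment-scoring sketch using NLTK's VADER lexicon.
# The speech file path is a hypothetical placeholder.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon fetch

analyzer = SentimentIntensityAnalyzer()
with open("speeches/example_speech.txt", encoding="utf-8") as f:
    text = f.read()

# "compound" ranges from -1 (most negative) to +1 (most positive)
print(analyzer.polarity_scores(text))
```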

Possible Improvements/Redesign/Reflection:

After completing a long project, I always like to reflect on the ways I could have improved it and note the things I realized I should have done differently once I was too far down the road to go back and change them. The first of those was that I wish I had defined my research question earlier, so that I would have been more aware of how the platforms we tried in class could answer it (the sentiment-analysis gap mentioned above, for example).

The other main aspect of redesign would be to keep much better track of ALL of my data from the beginning. For example, I gathered the locations of the speeches one by one very early in the year and then input them into a Google tool that translated them to latitude and longitude. This was very helpful, as it allowed for easy input into Palladio, but I lost the locations in text form. When I then decided for this project to do write-ups on each speech… you can see the issue, I’m sure. I naturally wanted to include location, as it was a defined category in my research question, but I had carelessly overwritten that data, so I had to go re-find it. When completing a long project like this, setbacks like that can be defeating. Unfortunately, this wasn’t the only such scenario: I did the same thing with the dates of the speeches when I chose to also use the authors’ birth dates for the separate bio and text-analysis data points. It’s a lesson I will not forget when doing data analysis again. Never write over anything; just make a new column. You never know when you might want the data in its old format, even if it seems useless now. The overwrite mistake can be seen in the screenshot below (notice the lat/lon, but no English-language location).
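The "add a column, don't overwrite" lesson is easy to show in code. The sketch below assumes a pandas DataFrame and a hypothetical `geocode()` helper standing in for whatever geocoding service is used; the point is simply that the human-readable location column survives alongside the coordinates.

```python
# A minimal sketch of the "never overwrite, add a column" lesson.
# geocode() is a hypothetical stand-in for a real geocoding call.
import pandas as pd

df = pd.DataFrame({
    "speech": ["Example Speech A", "Example Speech B"],
    "location": ["Washington, D.C.", "Cleveland, Ohio"],
})

def geocode(place: str) -> tuple[float, float]:
    # stand-in lookup instead of a real geocoding service call
    lookup = {"Washington, D.C.": (38.9072, -77.0369),
              "Cleveland, Ohio": (41.4993, -81.6944)}
    return lookup[place]

# New columns are added; the text-form location column is preserved.
df[["lat", "lon"]] = df["location"].map(geocode).tolist()
print(df)
```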

That being said, I am very pleased with how this ended up. I made a lot more progress than I thought I would, and my final product on Timeline.js is far more polished than I expected when you first introduced the idea of using this platform. Thanks for all your help, and I hope you enjoy this visualization!

 

Link to Timeline.js Site:

https://cdn.knightlab.com/libs/timeline3/latest/embed/index.html?source=1i7vS5SGGqiC2fRFcmWlSUnHQtYdSwv6X0iDHfsTtAXs&font=Default&lang=en&initial_zoom=2&height=650

Voyant Link:

https://voyant-tools.org/?corpus=196d419a39af8bde45a5cabb6afbf8da

Timeline Excel Template Link:

https://docs.google.com/spreadsheets/d/1i7vS5SGGqiC2fRFcmWlSUnHQtYdSwv6X0iDHfsTtAXs/edit#gid=0

Preliminary Corpus Info Link: 

https://docs.google.com/spreadsheets/d/1qVLefIlz_z_GGl8YJkAcJGqpx8lqMKinO_KNXFVnKxo/edit#gid=0

 

 


Assignment 5 Luke Hartman

The purpose of this assignment was to become capable of using Gephi through an analysis of the Baptized Indians Database. I created a worksheet in Gephi and input the 376 names as nodes, then created edges (82 in total) for the people with ID numbers 225-274. The edges represent connections within the database, with the edge source as the ID (225-274) and the target as the other related person. As the 82 total edges show, some of the 50 people had multiple connections and thus multiple source-target edges for their single ID.

I also distinguished inter-generational relationships by using directed vs. undirected edges. For example, if the source was the son, and the target a mother or father, the edge was directed to show a generational gap. If the source-target relationship showed brothers, sisters, spouses, etc., it would be undirected.
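For reference, here is a hedged sketch of the node and edge lists in the CSV shape Gephi's spreadsheet import expects, with an Id/Label column pair for nodes and Source/Target/Type columns for edges, where Type distinguishes directed (generational) from undirected (same-generation) ties. The IDs and relationships below are illustrative, not rows from the actual database.

```python
# A sketch of Gephi-importable node and edge CSVs. Type is "Directed"
# for child -> parent ties and "Undirected" for siblings/spouses.
# All IDs and relationships here are illustrative.
import csv

nodes = [(225, "Example Person A"), (226, "Example Person B"),
         (301, "Example Parent")]
edges = [(225, 226, "Undirected"),   # brothers
         (225, 301, "Directed")]     # child -> parent (generational)

with open("nodes.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["Id", "Label"])
    w.writerows(nodes)

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["Source", "Target", "Type"])
    w.writerows(edges)
```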

When I initially put the information into Gephi, I was lost to say the least. Below is a screenshot of what the default visualization showed.

As one can see, it is very jumbled and shows nothing discernible at this stage. The next step I took was to run the modularity statistic, which grouped nodes by community and allowed me to identify niches within the larger group. I then ran a layout called Force Atlas, which moved the communities toward the outside edges of the visualization, and I set node size to correspond to “degree,” the number of connections a specific person has to other members of the community. Node color also distinguishes the communities, and relative proximity within the graph shows overlap between groups. This produced a very interesting visualization, shown below.
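Those two Gephi steps have rough equivalents in the networkx library, which I'll use here as a sketch (it is not what Gephi itself runs): detect communities by modularity and read off each node's degree. The tiny edge list is an illustrative stand-in for the real data.

```python
# A rough networkx equivalent of Gephi's modularity communities and
# degree-based node sizing, on an illustrative toy edge list.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([(225, 226), (225, 227), (226, 227),  # one family
                  (240, 241), (241, 242)])             # another

communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")

# Degree = number of connections; Gephi maps this to node size.
print(dict(G.degree()))
```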

While recognizing that this graphic had value in its principal structure, I struggled a bit to draw more meaningful conclusions from it because of the overlapping nodes and the lack of visible edge connections. In light of this, I increased the distance between all the nodes in the graph for easier viewing, and then colored them based on closeness centrality, a measure of how close one node (or member of a community) is to all the other nodes in the network (here, all the other members of the community). Below is the result, followed by a zoom on one specific section of the graph.

(The zoomed view shows the bottom right of the graph)

This zoomed-in view has many of the qualities I hoped for when I began this project. First, node size visibly corresponds to the total number of connections each person has in the network. Node color corresponds to the closeness centrality values between 0 and 1 listed in the chart at the top left of the first picture above. Next, each edge is shown as a thin line connecting nodes, and the directed edges have arrows at the end representing a generational difference. This is extremely informative, as it allows the viewer to see the three brothers at the center of the community and then discern the relationships of all the other people in the network from the graphic alone. If edges were created for all 376 nodes, this would be a great way to visualize the many complex and interwoven connections within the larger data set.
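For comparison with the coloring described above, here is a short networkx sketch of closeness centrality on the same kind of toy graph: values near 1 mean a node can reach every other node in its component in few steps. Again, this is an illustration, not the actual Gephi computation on the real data.

```python
# Closeness centrality in networkx on an illustrative toy graph;
# Gephi maps these 0-1 values to node color.
import networkx as nx

G = nx.Graph([(225, 226), (225, 227), (226, 227), (227, 228)])
for node, score in nx.closeness_centrality(G).items():
    print(f"{node}: {score:.2f}")
```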

Overall, the ability to use Gephi is something I certainly value. I definitely struggled with it and got frustrated at times, but I learned a lot, and once I finally made some progress it wasn’t difficult to see the value in the tools the platform offers. I feel much more equipped to tussle with complex and layered data thanks to this assignment, and who knows what it will be useful for in my life and work going forward.


Luke Compare/Contrast Timelines

 


Assignment 3 Luke Hartman

My data set is a collection of text documents (mostly speeches, with a few recorded statements and orations) regarding civil rights, delivered by a variety of diverse authors. The authors include a Native American chief from the 1850s, Martin Luther King Jr., and even 2016 presidential candidate Hillary Clinton. In my data table I have included the following descriptor categories:

Date

Author

Speech Topic/Context

Gender of Author

Location of Speech

Image of Author

 

I felt that by using this combination of data, I could organize these texts in ways that would be meaningful for comparison and allow a reader to draw connections and recognize similarities and differences between related topics and speeches.

Above is a screenshot of a visualization I created in Palladio using the Graph tool. It links each speech with its latitude and longitude and shows which ones were given in the same location. It can also be filtered with the facet tool to show only certain speeches based on gender, topic, etc. The spatialization of this data provides a visual map with two controllable dimensions, letting a reader make inferences about whatever they wish to understand.

This next visualization is a mapping tool that shows each speech’s delivery point on a world map. While this is useful, in that it makes the spatial position of a specific location perceivable to the reader, I feel it has some drawbacks as well. First, the map is barely labeled, so without an extensive knowledge of geography it cannot stand alone. Also, the large scaled dots usefully communicate the volume of speeches given in a specific location, but they have no visible center and span too large an area to pinpoint the exact spot, even if one knew exactly where that was on an unlabeled map. Google Fusion Tables has a similar feature that I feel does a better job, although I have not completely figured out how to use it either.
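As one hedged alternative to the unlabeled map, the folium library renders the same lat/long points on labeled OpenStreetMap tiles with fixed-size, clickable markers. The coordinates below are illustrative, not my actual speech data.

```python
# A sketch of plotting speech locations on a labeled base map with
# folium. Titles and coordinates are illustrative placeholders.
import folium

speeches = [("Example Speech A", 38.9072, -77.0369),
            ("Example Speech B", 41.4993, -81.6944)]

m = folium.Map(location=[39, -95], zoom_start=4)
for title, lat, lon in speeches:
    folium.CircleMarker(location=[lat, lon], radius=6,
                        popup=title).add_to(m)
m.save("speech_map.html")  # open in a browser to explore
```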

Above is my favorite of the Palladio tools I used, the Gallery tool. It provides a template for displaying multiple pieces of information in a card-like format. As you can see, each card has a picture of the author of the text it represents, along with the topic/context of their speech listed beneath their name. The cards are also organized by date (which is listed visibly) to form a timeline of sorts within the larger visualization. Clicking on a card also links the reader to the full text of that author’s speech, which allows for further exploration and adds an interactive piece that promotes deeper research and understanding. There have been a vast number of times when it would have been useful to know how to use this tool, both to sort information for my own purposes and to make presenting it to other people easier.

As for Drucker’s distinction between “knowledge generators” and “representations,” I simply do not like these mutually exclusive classifications. I believe that something can be both at the same time, and I think these visualizations and this platform (Palladio) embody that duality very well. Am I representing a form of data that already exists through a series of templates? Yes, and in that sense it is a representation. But by compiling this data, formatting it in accessible, user-friendly ways, and then making it interactive to promote further examination and learning, I am also generating knowledge by way of access and opportunity. I find this very valuable and I think it is certainly a noble pursuit.

 

 


Assignment 2 Luke

a.) My corpus is one of the pre-packaged sets that Professor Faull gave us as an example. While this may seem like an easy way out, I have always had a vested interest in civil rights. Growing up in Birmingham, Alabama, with grandparents who lived there during the 1950s, I have always been interested in the civil rights movement in the United States. I am also half Palestinian. My grandfather on my mom’s side came to the U.S. from Ramallah, Palestine, when he was 15, after his family was removed from their house at gunpoint by militants. Having been told his story from a young age, I have always been passionate about the issue of human rights. All that being said, I am still working on adding to my corpus, but for now I spent a good amount of time analyzing what I have in Voyant and Jigsaw.

b.) Voyant provides many different ways to interact with a vast amount of text, and I had a lot of interesting thoughts while playing with my data. One of the first tools that appears when one uploads a corpus is “Cirrus.” It shows a puzzle-like picture of words in which size corresponds to frequency of mention within the entire corpus. It gives an idea of what the central words of a piece or a set of works may be, but it is not the final word on meaning, as it is simply a frequency representation. Below is the Cirrus for my entire corpus.

 

Another part of Voyant that I found interesting was the “Terms” visualization. It lists the most frequently used terms, and alongside each it shows relative frequency and trends: which documents the term was most often used in, and at what point in those documents. It is shown below as well. I found this very interesting, as it is not necessarily a takeaway one would have, or even contemplate, when reading the documents themselves.
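The "at what point in a document" part of that view boils down to relative frequency per segment. Here is a minimal sketch of that computation, assuming a plain-text speech file and a segment count of my own choosing; Voyant's exact segmentation may differ.

```python
# A sketch of per-segment relative frequency for one term in one
# document. File name, term, and segment count are illustrative.
import re

def relative_frequencies(text: str, term: str, segments: int = 10):
    words = re.findall(r"[a-z']+", text.lower())
    size = max(1, len(words) // segments)
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    # fraction of each chunk made up by the term, start to finish
    return [chunk.count(term) / len(chunk) for chunk in chunks]

with open("speeches/example_speech.txt", encoding="utf-8") as f:
    print(relative_frequencies(f.read(), "freedom"))
```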

 

c.) I found Jigsaw interesting in a very different way than Voyant. It pushed me to make connections between aspects of the texts, as well as between the individual texts themselves. The word tree tool was fascinating, as it gave me perspective on the contexts in which words were used across the corpus. For example, below is a screenshot of what happened when I searched the word “people,” which, as the Voyant views above show, is the most often used word throughout the corpus.

As can be seen, this image shows many different and unique uses of the word “people” in the corpus, and even this plethora is only 15% of the total usage. I also found the document grid viewer thought-stimulating, as it let me sort the documents by importance along a variety of factors.
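A word tree is essentially a branching concordance, so the same contexts can be pulled out with a plain keyword-in-context (KWIC) listing. The sketch below does this for "people" over a hypothetical folder of speech files; it shows the flat version of what Jigsaw draws as a tree.

```python
# A keyword-in-context (KWIC) sketch: every use of "people" with a
# few words of surrounding context. The corpus folder is hypothetical.
import re
from pathlib import Path

def kwic(text: str, term: str, window: int = 4):
    words = re.findall(r"[A-Za-z']+", text)
    for i, w in enumerate(words):
        if w.lower() == term:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            yield f"{left:>35}  [{w}]  {right}"

for path in Path("speeches").glob("*.txt"):  # hypothetical folder
    for line in kwic(path.read_text(encoding="utf-8"), "people"):
        print(line)
```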

d.) Maybe I am biased because I have spent more time with, and have a greater understanding of, Voyant, but in my opinion its interface is much more user-friendly than Jigsaw’s. It presents easy-to-read menus and tools with adjustable features, without making you close one window, search a word again, and open a new window to see a new visualization. There are some levels of detailed text analysis for which Jigsaw usefully supplements Voyant’s limitations, and perhaps those are magnified the deeper one goes, but for most of the things I wanted to do in Jigsaw, I could find similar data presentations in Voyant in more user-friendly ways.

e.) I think that working with these two platforms has greatly contributed to the “multidimensional viewpoint” as Clement put it. I feel that I have garnered insights about these sets of textual data that I could have never surmised simply from reading each individually. The ability to visualize large sets of the data in quantitative and qualitative graphs, tables, etc. allows for a more comprehensive understanding of the meta characteristics of the corpus. It also sheds light on what may be “plausible truths” about the texts and works that would otherwise go undiscovered.

 


Assignment #1 Luke Hartman

I found the two visualizations above fascinating for distinctly different reasons. On the left is a sample visualization from a website (http://rama.inescporto.pt/) that takes musical artists, identifies both the characteristics of their music and similar artists based on those characteristics, and then displays the links in a color-coded, digital web. It also has many interactive features, including the ability to click on related artists based on highlighted or selected genres, characteristics, styles, etc., and then see information about the selected choices. One can also create radio stations or see music playlists related to any of the items in the chart, which I thought was really cool. It reminds me of a visual analysis of essentially what Pandora tries to assess through one’s thumbs-up or thumbs-down preferences. I found it very effective at accomplishing Stefan Sinclair’s vision for a well-done visualization: “The humanities approach consists not of converging toward a single interpretation that cannot be challenged but rather of examining the objects of study from as many reasonable and original perspectives as possible to develop convincing interpretation.” This visualization does just that, with its user interactivity and lack of boundaries on how the data can be manipulated to show different pictures.

The second example above is a much more static visualization. It offers no way to interact with the data, but it does provide a very detailed and analytical layout for a large amount of qualitative data. As a Palestinian, the culture and stability of the Middle East have always been important to me, and, like most people, I find the subject very difficult to understand. While this visualization is not highly interactive like many others on the site, I feel it has a valuable place in promoting understanding of a complex topic. By addressing many topics and clearly showing links between related ideas, it provides a vast amount of knowledge for a reader to analyze. In the Sinclair reading, he proclaims that “a visualization that contributes to new and emergent ways of understanding the material is best.” He talks mostly about how interactive visualizations do this best, but I think this static visualization does it well here because it simply presents facts in an easy-to-follow way without drawing moral conclusions for the reader. It is a good example of knowing how to present data in the best way given its content.

 

The visualization I chose from the DH sample book is called Kindred Britain. A picture of the general layout is below, but to grasp the value of the visualization, it must be explored in its interactive capacity. It is a network of individuals of British descent, showing blood relations as well as marriage connections through a lens of historical context. I actually found this visualization very difficult to understand. Interesting material is being presented, but it is hard to sort out, and in my opinion the formatting and style could be cleaned up a good bit. As Manuel Lima pointed out in the TED talk we watched in class, a visualization is only as meaningful as how it speaks to those viewing it. I agree profoundly. If I can’t understand what is happening, then no matter how intricate and well put together the data is, I’m not going to get much out of it. I am not particularly experienced with in-depth data sets, and that makes it harder to interpret and make the most of a fairly complex data set such as this.

 


Interesting visualizations

I chose the following two visualizations because they stood out to me as poor visual representations of data. The first (“How India Eats”) has percentages that are supposed to correspond to fractions of a circle but are not at all proportionate. The second (“Peak Time for Sports and Leisure”) is incredibly confusing in its nature; although it proved rather clever once I finally deciphered it, that took almost ten minutes, which is far too long for most readers hoping to see and understand quickly.