Wednesday, 26 March 2014

A/V Woman

Newer posts are at http://avwoman.blogspot.co.uk/
Any feeds should continue as normal (unless they don't, in which case, I'll sort that out... hmm)

Sunday, 23 February 2014

Some more Excellent tricks

I've been playing around with Excel an awful lot lately, which is always guaranteed to cheer me up. Time, then, to add some more little discoveries to the pile of Excel-based bits and bobs I've put your way in the past (specifically here, here and here).

Blank skip

How many times have you been infuriated by empty rows getting in the way of good analysis? Perhaps the answer to that question is "none", in which case "clever you". Me, though; I'm always getting myself into situations where, for whatever reason, I find empty cells getting in my way. One example is to be found in my personal accounts, where I have daily entries, and then weekly and monthly summaries appearing down the side. These summaries are cast adrift amidst rows of emptiness, and that has implications when it comes to doing anything very involved (graphs, for instance, can be a pain). There are ways around this: I could produce running figures down the page, but that would be ugly and would defeat the idea of the summary; or I could copy and paste somewhere else, skipping the blanks or sorting them away. That would look better, but it would have dynamic repercussions: I want to perform calculations using live data rather than a copy-and-paste of a particular point in time. What I need is an array formula.

In this example I have three columns: A contains the days of the week (my x-axis, if you like), while B and C contain data entered on certain days. To abbreviate that table to just those rows that have data in column B, I've popped the following into D1: 

=INDEX($A:$C,SMALL(IF($B:$B<>"",ROW($B:$B)), 
ROWS(D$1:D1)),COLUMNS($D1:D1))

Most importantly (and this is the weird and counter-intuitive bit) you need to press control, shift and enter at the same time (rather than just enter) to make this formula stick properly. Doing this will add those curly brackets round things ({}) and make everything work as if by magic. Infuriatingly, just typing the curly brackets doesn't work. It's all about the ctrl-shift-return. 

Once you've ^⇧↵ed into D1, you can expand the formula across the rest of the cells in the usual way (and colour everything all rainbow-like, if that's how you roll). There are errors in D5-F7, because there's no data to feed them, but that can be sorted out with an =IFERROR in the right places if it bugs you. 

So what is the formula doing? Let's pull it apart:

The bones of the formula are a simple cell reference into a defined array (the first three columns):

=INDEX(array, row number, column number)

The column number bit is a simple column-counting formula that will increase as we drag across E and F. In D1, the value is 1, in E1 it is 2, and in F1 it is 3.
The row number is more involved, because this is the bit that's doing the hard work: a conditional formula inside a SMALL formula. The condition says to return the row number only if the corresponding cell in B is not blank (that's the <>"" bit: "greater than or less than" nothingness, if you like). The rest of the SMALL formula states that we're only interested in the kth returned value, where k corresponds to the row we're on (so if we're on row 1, which D1 is, then we're after the first row where B is non-blank). 
Doing all that with a normal formula won't get you very far, because you'd be unable to keep count of those successfully met conditions (you could keep a running tally, but we're trying to keep our sheet clean of that sort of thing), but turning it into a magic ^⇧↵ array formula lets Excel tot up various details across a whole range of cells.
Array calculations do make Excel have to think that little bit harder, so you might want to avoid whole-column arrays like the ones I've used.
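If it helps to see the logic outside the spreadsheet, here's a rough Python sketch of what the array formula is up to (the function and variable names are my own invention, not anything Excel provides):

```python
# A rough Python equivalent of the skip-the-blanks array formula:
# keep only the rows whose column-B entry is non-blank, preserving order.

def skip_blanks(rows):
    """rows is a list of (day, b_value, c_value) tuples, where b_value
    may be "" for the empty rows we want to leave out."""
    return [row for row in rows if row[1] != ""]

data = [
    ("Mon", 5, 10),
    ("Tue", "", ""),   # an empty row, like the gaps around my summaries
    ("Wed", 7, 14),
]

print(skip_blanks(data))  # the Tue row disappears
```

The list comprehension does in one pass what the IF/SMALL/INDEX combination does cell by cell: find the rows meeting the condition, then serve them up in order.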


Offset Mapping
You know what's really irritating? When you've got a lot of data in Sheet2 that's reliant on the data in Sheet1, you delete a row from Sheet1 and everything on Sheet2 goes all error-y. Likewise if you sort Sheet1 in a new way and Sheet2 becomes a befuddled mess. What you need is to break the link between a cell in Sheet1 and a cell in Sheet2 while still carrying over the data. What you need is to be able to map to a particular cell, regardless of its contents. One way of doing it is to use INDEX (like in the example above), for which you must establish an array range. But perhaps easier is to use OFFSET. While INDEX pinpoints a location within an established array, OFFSET pinpoints a location relative to an origin (e.g. cell A1). The difference is subtle but slightly more flexible.
If we want to reproduce the contents of any cell in Sheet1 in a corresponding cell location on Sheet2 (without tying the two together in a relative formula), we can use:
=OFFSET(Sheet1!$A$1,ROW()-1,COLUMN()-1)
The ROW() and COLUMN() bits get the current cell's coordinates, and these are applied as offsets from cell A1 on Sheet1 (the "-1"s are needed because ROW() and COLUMN() count from 1, whereas OFFSET's offsets count from 0). The origin can be anything we want it to be, as, for that matter, can the offsets (the example above is just a convenient starting-place). 
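To make the "-1" arithmetic concrete, here's the same mapping modelled on a toy Python grid (the grid and function names are mine; this is a sketch of the idea, not Excel's actual machinery):

```python
# Toy model of =OFFSET(Sheet1!$A$1, ROW()-1, COLUMN()-1):
# sheet1 is a grid indexed from zero; rows and columns in the formula
# world start at one, so we subtract one from each coordinate.

sheet1 = [
    ["a1", "b1"],
    ["a2", "b2"],
]

def offset_lookup(grid, row, column):
    """row/column are one-based, as Excel's ROW()/COLUMN() report them."""
    return grid[row - 1][column - 1]

print(offset_lookup(sheet1, 2, 1))  # "a2": Sheet2!A2 mirrors Sheet1!A2
```

Because the lookup goes by coordinates rather than by a cell reference, sorting or deleting rows on the source sheet can't leave the mirror full of #REF! errors; it simply shows whatever now lives at each position.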
I'm currently using this to make reading lists. I can use one sheet as a sandpit (inserting and deleting rows, and shuffling content about the place), then have a second sheet which takes a snapshot of that content and arranges it in a pretty way without me having to do anything. It's my favourite thing in Excel at the moment.

The Name Game
Ok, this one's a brag more than a formula...
I just mentioned that I'm using Excel to generate reading lists, and I thought it would be wise to come up with a set of formulae to standardise whatever I put in and turn it into a Harvard citation. On the face of it, this shouldn't be too tricky. We can all look at a citation and work out what's what, and how we might move the commas and brackets about to switch from one referencing style to another. Why a program like EndNote makes the process of importing citations such hard work I'm not entirely sure, given that, even allowing for the vast wealth of different formats and models, the basic codes are all variations on a theme, and surely a reference manager could work out which it was looking at if it could but really be bothered to try. But that's a different quibble. What I want to do is take a pre-existing citation of indeterminate style, dismantle it into its constituent (citation-manager-friendly) parts and reassemble them. Unsurprisingly, the hardest bit of that is the identification of the dismantled parts. The presence of numbers, capitals and various forms of punctuation is a clear help in this regard, whatever style is in play. Isolating the publication date is very easily done, and things start to unravel nicely from there, because there are only so many ways you can describe the elements of a citation (at least in the common forms and for the common formats, which is what I'm playing with). A title is a title and varies very little as a consequence. Even volume and issue numbers follow some sort of discernibly extractable logic. In fact, there is only one part of your general, generic, stylistically vague, common-or-garden reference that poses an especially pressing problem to a poorly programmed computer, and that is the author name. Having extracted every other necessary building block, I was not about to let myself be beaten by a few names.
There are a number of reasons why names cause trouble. They can be awash with commas and full stops, performing a range of different and often contradictory roles: commas between surnames and forenames, commas between collaborating authors; full-stops after each initial, or all initials, or none. Which brings us then to the swirling mess of first names, surnames, middle names, initials etc. But as humans we can look at these things and work our way through them with not too much trouble, so surely a computer can too. Admittedly, part of the human advantage is a reference database of typical forenames and surnames: something which could be emulated in Excel if needs be, but it seems an extreme length to have to go to. There must be a simpler way.
Well, reader, I managed it. Having already isolated the name portion of the original citation, I then dismantled it word by word, classing a comma as a word but eliminating full stops from the equation, to give something like:
Smith , A , Jones , B and McDonald , C
 Now to work out what each cell actually is. Commas and "and"s are easy to determine. Initials were identified using something like: 
=IF(OR(LEN(A1)=1,IFERROR(CODE(MID(A1,2,1))=CODE(UPPER(MID(A1,2,1))),FALSE)),"I1","")

...in other words, if the length of a cell is one character, or if the second character is upper-case, consider the cell to contain one or more initials.

This is a good starting point and will stand us in good stead. The use of "I1" is to indicate the first instance of initials. Later initials can be given "I2", "I3", etc. This will help when it comes to piecing things back together properly.
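If you fancy prototyping that test away from the spreadsheet, here's a rough Python equivalent (the function name is mine, and behaviour for non-letter characters may differ slightly from the CODE version, which treats anything non-lowercase as upper-case):

```python
def looks_like_initials(token):
    """Mirror of the Excel test: a token is treated as initials if it is
    a single character, or if its second character is upper-case
    (full stops having already been stripped out of the tokens)."""
    if len(token) == 1:
        return True
    return token[1].isupper()

print(looks_like_initials("A"))     # True  -> tag as I1, I2, ...
print(looks_like_initials("AB"))    # True  -> two run-together initials
print(looks_like_initials("Anne"))  # False -> a forename
```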

Initials are easy. Forenames are pretty easy too, if they immediately follow a comma. An entry that goes SURNAME COMMA FORENAME is easy to tag. SURNAME COMMA FORENAME COMMA SURNAME COMMA FORENAME... is slightly more troublesome, but here maths is our friend: if we assume some sort of internal consistency, then the first comma indicates a forename, the second a new surname, etc.

Sometimes, commas are used more sparingly, e.g. SURNAME INITIAL, SURNAME INITIAL, etc., but here we have the benefit of an initial to help us through the soup. Thankfully, SURNAME FORENAME is not an order that tends to appear in the occidental world, so we can probably apply the assumption that nothing like that will bother us.

From these foundations, it's quite easy to put together some formulae to draw out the names we need in the majority of cases. But some people have really awkward names. And some people aren't people at all. People with two forenames or two surnames are especially troublesome. Again, commas (and "and") are our friends here. And we can begin to string together some really messy formulae based on the relative positions of our respective bits of names. There's a lot of trial and error to this, but after some messing about it's possible to get something together that works almost universally (I've run my current version across about 400 citations and it seems to be working ok). Then it's just a case of gluing everything back together again.
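To give a flavour of the comma-counting, here's a deliberately over-simplified Python sketch of the tagging stage, handling only the SURNAME COMMA INITIALS pattern from the example above (the real spreadsheet juggles several patterns and is far messier):

```python
def tag_name_tokens(tokens):
    """Tag each token in a 'Surname , A , Surname , B and Surname , C'
    style list. Assumes the pattern SURNAME COMMA INITIALS throughout,
    with ',' and 'and' as separators -- a deliberate simplification."""
    tags = []
    expect = "surname"
    for tok in tokens:
        if tok in (",", "and"):
            tags.append("sep")
            continue
        if expect == "surname":
            tags.append("surname")
            expect = "initials"
        else:
            tags.append("initials")
            expect = "surname"
    return tags

tokens = "Smith , A , Jones , B and McDonald , C".split()
print(list(zip(tokens, tag_name_tokens(tokens))))
```

The alternation between expecting a surname and expecting initials is the same "assume internal consistency" trick described above: each separator flips us along to the next expected part.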

Here's a link to a very-much beta version. The gubbins is very messy trial-and-error stuff, and I'm sure you'll be able to break it quite easily if you try, but it might give you a place to start should you ever require something similar.

Sunday, 12 January 2014

The 40 most-visited-by-me sites of 2013

How much time do I spend on the internet when I'm at home, and what am I looking at? A few years ago I decided to set myself that very question, and to answer it I stuck a clock on my browser. Unfortunately, I forgot to collate the first couple of years' stats. But I've since tweaked the counter a few times, and this year I've actually remembered to crunch the stats therein. The list consists of my 50 most-visited domains, some of which I've combined because they amount to the same site, and one of which I've removed because it is buggy (I did not spend five days watching Channel 4, and the suggestion that I did seems to have stemmed from some corrupt data left in the file from the previous year, but their website should probably be in this list somewhere). Another note before I show you the list: for the first half of the year it was clocking stats from all open tabs, and in the second half it was limited to the active tab. Also omitted is browsing from any platform other than Opera (so my brief flirtation with Netflix is not listed, as it took place in Firefox owing to compatibility issues).

Here, then, are my 40 most-visited sites (caveats already noted). For more information about a site, hover over the name.



 #   Site                              Total (h:m)   Total (d:h)   Daily av. (m:s)   %
     OVERALL                           1254:00       52d 6h        3h 26:00
 1   Tweetdeck                         382:57        15d 23h       62:57             31
 2   Wikipedia                         72:02         3d            11:50             5.7
 3   Old Reader                        71:54         3d            11:49             5.7
 4   Blogger (my 'blogs)               59:37         2d 12h        9:48              4.8
 5   YouTube                           57:19         2d 9h         9:25              4.6
 6   Facebook                          52:41         2d 12h        8:40              4.2
 7   BBC                               41:36         1d 18h        6:50              3.3
 8   Google                            37:18         1d 13h        6:08              3
 9   Google Maps                       28:51         1d 12h        4:45              2.3
10   Twitter                           18:15                       3:00              1.5
11   (Local Host)                      17:02                       2:48              1.4
12   Historic Digimap                  9:26                        1:33              0.8
13   Splendid Chaps                    8:03                        1:19              0.6
14   A/V Woman                         7:49                        1:17              0.6
15   Alveley Historical Society        7:11                        1:11              0.6
16   The Guardian                      6:24                        1:03              0.5
17   jobs.york.ac.uk                   5:25                        53s               0.4
18   TVCatchup                         5:08                        51s               0.4
19   Bing Maps                         4:34                        45s               0.4
20   Sporcle                           4:25                        44s               0.4
21   RightMove                         4:10                        41s               0.3
22   (Router)                          3:38                        36s               0.3
23   subjectguides.york.ac.uk          3:20                        33s               0.3
24   Lawrence Miles ('blogs)           3:17                        32s               0.3
25   Sheffield Planning Applications   3:08                        31s               0.2
26   NRK                               2:50                        28s               0.2
27   Lady V London                     2:48                        28s               0.2
28   BBC Good Food                     2:47                        27s               0.2
29   Know Your Meme                    2:30                        25s               0.2
30   Pinterest                         2:27                        24s               0.2
31   Amazon                            2:21                        23s               0.2
32   Flickr                            1:52                        18s               0.1
33   Out Of This World ('blog)         1:47                        18s               0.1
34   The Golden Era of GP Racing       1:43                        17s               0.1
35   Tattuinardoelasaga                1:41                        17s               0.1
36   thetrainline.com                  1:40                        16s               0.1
37   IMDb                              1:39                        16s               0.1
38   Gizoogle                          1:36                        16s               0.1
39   University of York                1:28                        14s               0.1
40   Doctor Macro                      1:27                        14s               0.1

So there you go. The above list accounts for about 85% of my browsing (a stat that includes the otherwise omitted Channel 4 entry for necessary mathematical reasons). The colour-banding groups things together semi-thematically. What do we learn from all of this? That I like maps? We already knew that (or I did anyway). 

To get an idea of what this is telling us, we need another data-set. My other data-set is anecdotal. It is me remembering what sites I used most about a decade ago. About a decade ago I mainly used Google, Wikipedia, BBC and The Guardian. Those sites constituted a Big Four, with my own content probably filling in a fifth slot. The rest of my internet activity can be discovered from the array of bookmarks I still have in my browser from Them Days: home-made websites (largely replaced by 'blogs now, and consumed via my feed reader). Gone now are the days when I sat grazing the news at The Grauniad and the Beeb (International Version). Twitter feeds now deliver that news to me. But the difference is not as radical as if we roll back the clock another five years, when my most-used sites would've been AltaVista and Yahoo! We've come a long way since the Web Ring. We've gone full-circle and found ourselves back in Usenet, but with a 140 character limit.

I'm now going to clear the stat counter and hopefully come back in another 12 months to do some proper, more meaningful analysis of my browsing trends. Between those things, I need to do the pots and make some tea. 

Thursday, 2 January 2014

6,697 Tweets In : The 2013 Twitter Breakdown

I last did an analysis of my tweeting back in May 2012, and before that in July 2011. Those posts took a thorough look at the content of my first 100 and first 1,000 tweets from the @SaintEvelin account. My tweet tally now exceeds the 6,700 mark, which is too many to go through by hand. Fortunately, there are ways to crunch the content, and so I will, looking specifically at my twitter activity through the course of 2013.

First of all, let's take a look at the volume of my tweeting this last year: 3,787 tweets in all. On average I make ten tweets a day (up four from last year). Rounding as necessary, four of those tweets are @-led conversations, two are retweets, and one contains the word-stem libr*. The numbers round more reliably if we expand to a full week's twitter activity, so here we go:



             2012      2013
Tweets/week  43        73
@s           21 (49%)  29 (39%)
RTs          8 (18%)   17 (24%)
libr*        3 (8%)    4 (6%)
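For the curious, the weekly and daily figures fall straight out of the annual total; a quick back-of-the-envelope in Python:

```python
# Back-of-the-envelope for the 2013 column: 3,787 tweets in the year.
total_2013 = 3787

per_week = round(total_2013 / 52)    # ~73 tweets a week
per_day = round(total_2013 / 365)    # ~10 tweets a day

print(per_week, per_day)
```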

So I'm tweeting more, conversing more, retweeting twice as much as last year, and tweeting without the word-stem libr* about twice as much as last year. My use of the L-word is up by one tweet a week, but as a proportion of my output it is lower than in 2012.

Of course, it's not all about the L-word. If we include info* in the search, things go up by another tweet a week. Adding CILIP has the same effect, putting us at six library-y tweetings in seven days. This by no means tells the whole story, so let's add another one for luck. This puts us a little more honestly at the rounded-up figure of a library tweet a day. Let's call it a tenth of my twitter output. That sounds a fair proportion really. At the 1,000 tweets mark it was about a third, so there may be more to this story. Applying the same criteria to 2012 as to 2013 I'm up by a percent or so.

To get a more accurate view of what I tweet about, we can employ a word cloud. Here's one I made earlier:

The RT is for scale. We already know that that's about a quarter of my twittering. I also say "just" quite a lot, don't I just? Looking around the edges I note that I tweeted about Strictly Come Dancing more than I did about the Great British Bake-Off, which is entirely right and proper. But I think the real story can be found at the top of the nebula, with the inspirational legend: NOW EUROVISION LIBRARY! Nice to note that library is the biggest noun in there. I may mention libraries less than once a day, but I seem to mention them more than I mention other things. I think that's an appropriate balance, really. 

I notice that York is bigger than (pretty) Hull. But cake is bigger than both. The words that appear all seem rather positive, so that's nice (top left, between great and lovely).

To wrap up, here's a chart depicting pretty much all my tweeting ever, including on my other twitter account. It doesn't include my tweets from @HullUni_Library. What it does do is show how @SaintEvelin, an account initially intended as a "professional" networky sort of thing, has become my main account for most of what I have to say, be it about libraries or about Strictly. My original @KingConstance account still trickles along, mainly as a vessel for my angrier tweets, and, when I remember, to continue the pretence of a medieval kingdom in South Yorkshire. But the original reasoning behind the schism, a sort of separation of church and state (social versus professional, if you will), has been disavowed. Tweetdeck, and the raising of the limit on the number of members in a twitter list, allow me to monitor various spheres of interest at once. What is more, the social aspect of tweeting is, I believe, a vital component of the networking aspect: all work and no play makes @Jack a dull tweeter. There's no need to carry on a split personality. Not with librarians at any rate. Cake is an essential part of the job.

Yes, the scale is logarithmic.

My total number of tweets this year from both accounts was 4,369 (582 from @KingConstance). In 2012 it was 2,808 (573 from @KingConstance), so that's roughly one and a half times as much activity. In 2011 (the year I started tweeting as @SaintEvelin) it was 2,853 (2,178 of those from @KingConstance). 

The peak in summer 2012 corresponds with library conferences and unconferences, and with me looking for a full-time job. The near-death of @KingConstance in the last quarter of that year corresponds with me finding a full-time job. There's activity on both accounts in January of this year when I was out of work, then things go quieter when I'm in work again. Funny that. May sees Eurovision, and my biggest spike of activity (127 tweets in one day). This is my third busiest day of tweeting to date (Eurovision 2011 clocked 139, and I made 189 tweets in October 2010 when I "live-tweeted" the historical activities of the Sheffield police force). Since July, my tweet-rate has been fairly constant at around the 12-tweets-a-day mark. BBC television may have some part to play in this.

So that's the twitter analysis for another year. I may have slightly exceeded the 140 character limit. My average tweet length is 92.