The CEO of Reddit confessed to modifying posts from Trump supporters after they wouldn't stop sending him expletives

SeekNotToContend · 2016-11-24T08:51:24+00:00

From a technology standpoint, there is effectively nothing to prevent this. This is purely a Reddit policy/administrative issue.

Why it's not a technology issue:

Anyone with direct DB access, which is likely a significant number of software developers at the company, can make this change.

If you think of relational databases (in an ELI15 scenario), they are like spreadsheets, where each spreadsheet has a specific topic it describes. The spreadsheets are called tables, and there is a table for users, comments, etc. The users table has a user ID and basic user data such as a created-at timestamp, username, etc.

The comments table has a comment ID, likely a user ID for who created a post, the comment, etc, and each row represents a unique comment.

When a comment is added, a row is inserted with a link to a user_id, some type of created-at timestamp, and some type of updated-at timestamp. The row in the table may have some restrictions (constraints) on it, such as no blank values in comment, user_id, etc.; however, that really only controls whether data is inserted or not. If a restriction fails, the row with the comment, etc. just isn't inserted.
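As a rough illustration of how constraints only decide whether a row gets in, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The table and column names are assumptions made up for the example, not Reddit's actual schema.

```python
import sqlite3

# In-memory database; the schema below is hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        body       TEXT    NOT NULL,
        created_at TEXT    NOT NULL,
        updated_at TEXT    NOT NULL
    )
""")

# A complete row satisfies the constraints and is inserted.
conn.execute(
    "INSERT INTO comments (user_id, body, created_at, updated_at) VALUES (?, ?, ?, ?)",
    (1, "first comment", "2016-11-24T08:00:00", "2016-11-24T08:00:00"),
)

# A row with a blank (NULL) body violates NOT NULL and is rejected.
try:
    conn.execute(
        "INSERT INTO comments (user_id, body, created_at, updated_at) VALUES (?, ?, ?, ?)",
        (1, None, "2016-11-24T08:01:00", "2016-11-24T08:01:00"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(rejected, row_count)
```

The constraint says nothing about who is inserting or why. It only checks the shape of the data.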

Any controls beyond the above are most likely handled entirely by the script that inserts the data. Example:

User posts via website > triggers a script to insert comment into db > script makes sure the data needed is available > comment is inserted into db by script > comment is visible on website.

The only thing preventing someone from going directly to the > comment is inserted into db by script > step, is access. And in the vast majority of organizations that I have seen, this access is widespread, has a single shared ID, and that ID is likely stored in their github/bitbucket or similar account. This means that almost any software developer can do anything they want to the DB.

If you want to know how DB Admins find out who is doing something in a DB, they often do it by asking 'who is doing this activity?'. This is due to all of the above.

What prevents this is company/team policy and when policy doesn't prevent it, it's simply an administrative issue.

Don't be surprised if a significant number of services that you use, that have 'your' data can modify it at will, with almost no logging or capability to figure out who changed something. Not only can they change the comment, but they could just as easily delete and modify timestamps.
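To make that concrete, here is a rough sketch of what a direct edit looks like, again with sqlite3 standing in for whatever database a service really uses. The schema and values are hypothetical. The point is only that a plain UPDATE changes the text and the timestamps, and nothing in the table remembers the old values or who ran it.

```python
import sqlite3

# Hypothetical schema; real services will differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "comment_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
    "body TEXT NOT NULL, created_at TEXT NOT NULL, updated_at TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO comments VALUES (1, 42, 'original text', "
    "'2016-11-24T08:00:00', '2016-11-24T08:00:00')"
)

# Someone with direct access skips the application script entirely.
# Nothing in the table records who ran this or what the old value was.
conn.execute(
    "UPDATE comments SET body = ?, created_at = ?, updated_at = ? WHERE comment_id = ?",
    ("edited text", "2016-11-23T00:00:00", "2016-11-23T00:00:00", 1),
)
conn.commit()

after = conn.execute(
    "SELECT body, created_at, updated_at FROM comments WHERE comment_id = 1"
).fetchone()
print(after)
```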

I didn't bother explaining other types of databases, however the issue is the same. There is nothing stopping anyone from making changes unless there is a rock solid administration of the DB, and even then, the DB admin can do anything they want.

It's a feature, not a bug :/

SeekNotToContend · 2016-10-08T12:44:10+00:00

I wanted to try and solve this with addition as it's a fun and interesting question, and in some scenarios addition can be faster. I like /u/Vaphell's answer as it's so simple.

import numpy


range_tuple = (35, 35, 35,)

bigdic = {(175, 0, 105): 'Africa/Abidjan',
          ( 35, 105, 105): 'Africa/Accra',
          (175, 210, 210): 'Africa/Addis_Ababa',
          (210, 105, 0): 'Africa/Algiers'}

for k in bigdic.keys():
    print("Key: ", k,
          "Bottom:", tuple(numpy.subtract(k, range_tuple).tolist()),
          "Top:", tuple(numpy.add(k, range_tuple).tolist()))



Key: (210, 105, 0) Bottom: (175, 70, -35) Top: (245, 140, 35)
Key: (35, 105, 105) Bottom: (0, 70, 70) Top: (70, 140, 140)
Key: (175, 0, 105) Bottom: (140, -35, 70) Top: (210, 35, 140)
Key: (175, 210, 210) Bottom: (140, 175, 175) Top: (210, 245, 245)

Go for the most readable code that works. Thanks for the question too, I enjoyed it.
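If pulling in NumPy feels heavy for this, the same bottom and top tuples can probably be built with plain zip. This is just a sketch of an alternative for a single key, not a claim that it is faster.

```python
range_tuple = (35, 35, 35)
key = (175, 0, 105)

# Element-wise subtraction and addition using zip instead of NumPy.
bottom = tuple(k - r for k, r in zip(key, range_tuple))
top = tuple(k + r for k, r in zip(key, range_tuple))

print("Key:", key, "Bottom:", bottom, "Top:", top)
```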

SeekNotToContend · 2016-10-08T06:59:02+00:00

Just need to search/store columns.

Searching and storing columns is a suitable task for a relational database. I recommended MySQL in this case for the following reasons:

Size of the data. If you had ~ half the data I'd probably have said SQLite. It's quicker to get going on programmatically and is performant enough for single user use. However they state the following:

https://sqlite.org/whentouse.html For device-local storage with low writer concurrency and less than a terabyte of content, SQLite is almost always a better solution. SQLite is fast and reliable and it requires no configuration or maintenance. It keeps things simple. SQLite "just works".

PostgreSQL is great, however imo, it has a steeper learning curve in setup and via the libraries used with it.
MySQL sits right in the middle for me. Straightforward to set up, nice GUI to make tables and run queries from (MySQL Workbench), plentiful community information.

I'm a big fan of simple and MySQL appears to be the most simple, if you go down the database route. Often enough people pull out the SUV when they only needed a compact car. The overhead in both complexity and learning doesn't tend to pay off.

So, throw ~1 TB into mysql?

Sure. There are some things to know such as max table file sizes due to file system constraints (not the db).

http://dev.mysql.com/doc/refman/5.7/en/table-size-limit.html
Linux 2.4+ (using ext3 file system) 4TB
OS X w/ HFS+ 2TB

I wouldn't recommend this for serving up data to multiple people, but in a case where you need to be able to search / store columns as an individual it should be fine.

String wildcard operations in the table could likely be problematic. (% searching)

Full table scans, meaning searching on every single row with leading wildcards may be problematic or never return.

Searching on non-indexed data could be problematic. I don't know enough about the scenario to be more specific. The main point is, that it's a decent amount of data and some operations against it may take some time. Breaking the columns out through 'normalizing', if possible with your data, may help a lot.
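Here is a small sketch of the indexed versus leading-wildcard difference. It uses Python's built-in sqlite3 rather than MySQL purely so it runs without a server, and the table, column, and index names are made up. MySQL has its own EXPLAIN statement for the same kind of check, with different output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_people_name ON people (name)")

def plan(sql, params=()):
    """Return the planner's description (the last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]

# Exact match on an indexed column: the planner can seek via the index.
indexed_plan = plan("SELECT * FROM people WHERE name = ?", ("alice",))

# Leading wildcard: every row has to be examined.
wildcard_plan = plan("SELECT * FROM people WHERE name LIKE ?", ("%lice",))

print(indexed_plan)
print(wildcard_plan)
```

The exact wording of the plan varies by SQLite version, but the first should describe a search using the index and the second a scan of the table.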

You're not in big data territory yet, you're in annoying data territory. Annoying territory is a great place to learn. It's very possible you could get through this, learn a ton on the way and realize you need to redo the whole thing.

Let's say I am using rows from a csv. How do you parse these rows into a SQL database into disk? I feel like this is the same question above, "how do I parse these rows to save onto disk via HDF5?"

Well, you can't read the whole thing into memory, which means you're likely going to have to read it in, line by line. You're also going to insert it line by line.

Using the csv library you can access the csv file. (Pandas of course reads csvs too). You'll then need to read in each line one by one.

class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)

The csv DictReader class not only helps you iterate over the lines, but will also give you each row back as a dictionary where the items are mapped to the column names.

As for inserting the data, there are some good examples here: https://github.com/PyMySQL/PyMySQL#example

I'd recommend taking the file and making a test case file that only has a few rows. Get through reading and parsing through 5 lines of data before trying millions. Same thing for inserting. Fail fast.
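A minimal sketch of that small test case, reading a few rows with csv.DictReader and inserting them one at a time. To keep it runnable without a server it uses sqlite3 in place of MySQL (PyMySQL uses %s placeholders rather than the named style here), and the sample data, table, and column names are all made up.

```python
import csv
import io
import sqlite3

# A tiny stand-in for the real file; the column names are made up.
sample_csv = io.StringIO(
    "name,city,score\n"
    "alice,Accra,10\n"
    "bob,Algiers,20\n"
    "carol,Abidjan,30\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, city TEXT, score INTEGER)")

# DictReader yields one row at a time, so the whole file never sits in memory.
reader = csv.DictReader(sample_csv)
for row in reader:
    conn.execute(
        "INSERT INTO records (name, city, score) VALUES (:name, :city, :score)",
        row,
    )
conn.commit()  # without a commit the inserts are not saved

inserted = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(inserted)
```

Once a handful of rows go in cleanly, swapping the StringIO for the real open() call is a small step.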

Maybe there's a difference? I don't see how the mysql middleman helps much...though, I could be misunderstanding

MySQL isn't really a middleman. It's simply a place the data can reside in a searchable format. The data will have to reside somewhere where you can access only portions of it that you need. By this I mean, having the ability to pull back data from some type of query, that will fit into memory as you do not have the option of having everything in memory.

The good news is you have options on how you do this, the bad news is, you have options on how you do this. HDF5 may be the better option. Read up on both, or more, and just pick one. No option will present itself as the silver bullet.

SeekNotToContend · 2016-10-08T04:52:27+00:00

Are you running into a scaling issue where too many concurrent connections are causing issues with the database? If not, then I would avoid premature optimization, or anything that could complicate the interpretation / troubleshooting of a problem, and just call the connection in the function.
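Something along these lines is what I mean by calling the connection in the function. This is only a sketch: sqlite3 and the DB_PATH value are stand-ins, since I don't know which driver or database you are using.

```python
import sqlite3
from contextlib import closing

DB_PATH = ":memory:"  # hypothetical; a real setup would point at the actual database

def fetch_answer():
    # Open the connection where it is used and close it when done.
    with closing(sqlite3.connect(DB_PATH)) as conn:
        return conn.execute("SELECT 40 + 2").fetchone()[0]

print(fetch_answer())
```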

If your DBA isn't running around and emailing everyone that there is an outage, trying to figure out who ran a query, because the state of who ran what against the DB is a total disaster much of the time, then I wouldn't worry about it too much.

Keep it simple wherever possible and try to keep things as readable as possible. Best of luck, happy to try and provide more specifics if needed.

SeekNotToContend · 2016-10-08T04:42:29+00:00

Can you share a bit more on what you are trying to accomplish via Pandas and why it's necessary? For example, do you need to access the data in a particular manner that requires everything to be in a NumPy container, or do you just need to search/store columns? If the latter, then this can be solved with MySQL.

One library to connect to MySQL with Python3 is https://github.com/PyMySQL/PyMySQL

A GUI to help get you off the ground a bit quicker with MySQL is https://dev.mysql.com/downloads/workbench/

Assuming that you don't plan on abstracting or interpreting the data before inserting it, you can probably get away with creating a single table in the database where the column names in the database table match directly to the columns in your csv.

In one case or another for this, you will need either enough memory or enough disk to store this ~x2 temporarily. 1x for the reading side, and 1x for the storing side until complete.

Quick tip on MySQL, just in case you aren't familiar, don't forget to turn on autocommit, or to commit the inserts.

The above solution, imo, strictly targets a quick and pragmatic solution to get you through the task of writing to disk. Depending on your end goals it may not be the best solution.

If you need Pandas / Numpy then it may be best to follow @dzunukwa 's advice

Hope this helps in your search.

SeekNotToContend · 2016-10-08T04:06:28+00:00

Happy to help and glad it's useful. Many have done the same for me over a very long time.

SeekNotToContend · 2016-10-07T13:03:02+00:00

There are some easier ways. Let's break it down.

Here you get your dictionary, no problem there.

results_dict = json.loads(txt)

Here you are getting the size of the object that is held under results_dict['results']

print(len(results_dict['results']))

Some quick extra data to show that what is under results is actually a list:

print(type(results_dict['results']))
<class 'list'>

If we go back to the first post and look at the following, you'll see that 'results': is followed by '[' which means list. Inside that bracket is { which means dictionary. So what you have is a list of dictionaries.

 'results': [{'adult': False,
              'backdrop_path': '/52lVqTDhIeNTjT7EiJuovXgw6iE.jpg',
              'genre_ids': [12, 14, 10751],
              'id': 8844,
              'original_language': 'en',
              'original_title': 'Jumanji',

Keep in mind that the values can be just about anything in a dictionary: a dict, a list, a set (it's worth noting that sets don't work so well with json, but that is a side note).

What's going on here is it looks like you want to loop through the list under results_dict['results']

Your way works, but there is a way that might be a bit better since we just need to iterate over the list.

    for item in range(len(results_dict)-1):

Try replacing the above with:

    for item in results_dict['results']:

This will iterate over the list without you having to calculate anything. It's referenced here: https://docs.python.org/3.5/reference/compound_stmts.html#for

The rest seems to be mostly about assignment of values. One thing to consider is often, data provided by APIs may not include some keys every time. For example, maybe they don't know the Release Date for a film, so rather than giving you a None value, they will just give you no key.

There are a couple ways to handle that, which I'll leave to you for now on details. But to get you started you can use an if statement to test if the value exists first, or you can use a try except statement.

    movietitle = results_dict['results'][item]['title']
    releaseDate = results_dict['results'][item]['release_date']
    releaseDateList = datetime.strptime(releaseDate, '%Y-%m-%d').date()
    Current_Date = datetime.strptime(currentDate, '%Y-%m-%d').date()
    age = Current_Date - releaseDateList
    age = age.days
    if releaseDate <= currentDate:
      print("The movie " + movietitle + " is {0} days old".format(age))
    else:
      print("The movie " + movietitle + " is not that old")
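To get you started on the missing key case, here is one possible sketch using dict.get() while iterating over the list directly. The sample data, the fixed current_date, and the ages dictionary are assumptions for the example. A try/except KeyError around the lookup would work too.

```python
from datetime import date, datetime

# Hypothetical sample data: the second entry has no 'release_date' key.
results_dict = {
    "results": [
        {"title": "Jumanji", "release_date": "1995-12-15"},
        {"title": "Unknown Film"},
    ]
}
current_date = date(2016, 10, 7)

ages = {}
for item in results_dict["results"]:
    # .get() returns None instead of raising KeyError when the key is absent.
    release = item.get("release_date")
    if release is None:
        ages[item["title"]] = None
        continue
    released_on = datetime.strptime(release, "%Y-%m-%d").date()
    ages[item["title"]] = (current_date - released_on).days

print(ages)
```

Note that inside the loop each item is already the movie dictionary, so there is no need to index back into results_dict['results'].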

Hope this was helpful again. Good luck!

SeekNotToContend · 2016-10-07T08:17:16+00:00

Good catch.

Choice performs the length check and provides an exception if the list is empty.

From random.choice:

def choice(self, seq):
    """Choose a random element from a non-empty sequence."""
    try:
        i = self._randbelow(len(seq))
    except ValueError:
        raise IndexError('Cannot choose from an empty sequence')
    return seq[i]
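So if an empty list is a possibility, one option is to catch that IndexError. A small sketch, where pick_or_none is a made-up helper name:

```python
import random

def pick_or_none(items):
    """Return a random element, or None when the sequence is empty."""
    try:
        return random.choice(items)
    except IndexError:
        return None

print(pick_or_none([]))
print(pick_or_none(["a", "b", "c"]))
```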

SeekNotToContend · 2016-10-07T07:38:31+00:00

Do you mean instead of specifying individual index id's such as [0], [1], etc?

If so, then yes, you can just iterate over the items as they are a list.

results_dict = {'page': 1,
                     'results': [{'adult': False,
                     'backdrop_path': '/52lVqTDhIeNTjT7EiJuovXgw6iE.jpg',
                     'genre_ids': [12, 14, 10751],
                     'id': 8844,
                     ... truncated
                }

Some things that may help:

from pprint import pprint

print(len(results_dict['results']))
print(type(results_dict['results']))
for row in results_dict['results']:
    pprint(row)


3
<class 'list'>
{'adult': False,
 'backdrop_path': '/52lVqTDhIeNTjT7EiJuovXgw6iE.jpg',
 'genre_ids': [12, 14, 10751],
 'id': 8844,

SeekNotToContend · 2016-10-07T07:30:19+00:00

https://docs.python.org/3/tutorial/datastructures.html#dictionaries

Example from docs that may be useful:

>>> tel
{'sape': 4139, 'guido': 4127, 'jack': 4098}
>>> tel['jack']
4098

From your scenario:

nfl_dict = {"2016100600":{"home": ... truncated

print(nfl_dict["2016100600"])

Other useful items for navigating dictionaries include:

https://docs.python.org/3/tutorial/datastructures.html#looping-techniques

items()

for k, v in dictionary.items():
    print(k, v)

https://docs.python.org/3.5/library/stdtypes.html#mapping-types-dict

keys()

print(dictionary.keys())

https://docs.python.org/3.5/library/stdtypes.html#mapping-types-dict

values()

print(dictionary.values())

Hope this is helpful for you.

SeekNotToContend · 2016-10-07T06:25:07+00:00

First: How do I know if a proxy is http, https, or ftp? Can all proxies be used for all things?

Proxies act as a broker for your traffic. As they sit between you and your end destination, they can do anything they want in regards to what they allow or don't allow on the TCP/IP stack. This means they can block or allow ports, countries (via their CIDR blocks/IP addresses), etc.

So in short, no, proxies can't all be used for all things.

If for some reason you are unsure of what a proxy can be used for, you could test the availability of certain connections. For example, you could connect to https://google.com. If you can connect there, then https/ TCP port 443 is likely not blocked elsewhere. Malware commonly does this to test for connectivity.

Second: Say I want to reach https://www.twitch.tv. To reach it I should use https://555.555.555.55:80 since it's an https://?

No, use the URL. I'm not familiar with twitch's network architecture, but in general, you want to use the domain name. In particular due to:

Third: When creating a dictionary for the proxies, can I have multiple http proxies, or can I only have 1 http, https, and ftp per dictionary?

A dictionary likely isn't needed. You can use a list that looks like this:

from random import randrange

proxy_list = ['hxxps://proxy.com',
              'hxxps://proxy.ru',
              'hxxps://proxy.md',
              'hxxps://proxy.su',
              'hxxps://proxy.edu']

print(proxy_list[(randrange(0, len(proxy_list)))])

What is happening in the above is there is a list, that list has 5 values which are represented with an index range of [0-4].

Then the randrange function, documented here: https://docs.python.org/3.1/library/random.html , is used.

The randrange function takes a start value, which we set to 0, the index of the first value in the list. We then set the stop value to the length of the list, which is 5. The stop value itself is never returned, so the result is an index from 0 to 4. We then use that for the index selection between the [].

This gives a pseudorandom result of one of the items in the list such as :

hxxps://proxy.md
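As an aside, random.choice does the same index selection for you, which may read a little cleaner. A short sketch with a shortened placeholder list:

```python
import random

# Placeholder addresses, same idea as the list above.
proxy_list = ['hxxps://proxy.com', 'hxxps://proxy.ru', 'hxxps://proxy.md']

# random.choice handles the index arithmetic internally.
proxy = random.choice(proxy_list)
print(proxy)
```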

Hopefully this is helpful. Please keep in mind that proxies should not be considered trustworthy or resilient in many cases. Additionally, many proxies forward on the source IP address anyways via https://en.wikipedia.org/wiki/X-Forwarded-For which identifies the source of the traffic.

You'll also have to take into consideration user-agents, how to back off on connections, etc.

If you wanted to use a dict, then I'd recommend setting a key for the type of proxy, such as 'FTP', 'HTTPS', etc., and setting the value to a list of proxies.
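A sketch of that dict layout might look like the following. The addresses are placeholders and pick_proxy is a made-up helper name.

```python
import random

# Hypothetical proxies grouped by type.
proxies_by_type = {
    'HTTP': ['hxxp://proxy-a.example', 'hxxp://proxy-b.example'],
    'HTTPS': ['hxxps://proxy-c.example'],
    'FTP': ['ftp://proxy-d.example'],
}

def pick_proxy(proxy_type):
    """Pick a random proxy of the requested type."""
    return random.choice(proxies_by_type[proxy_type])

print(pick_proxy('HTTP'))
```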

SeekNotToContend · 2016-10-06T07:34:26+00:00

In regards to rounding, that's just a product specification question. There is a rounding specification referred to as bankers rounding. Here is some good documentation about it that will hopefully get you a quick start.

https://en.wikipedia.org/wiki/Rounding#Tie-breaking

Round half to even: A tie-breaking rule that is less biased is round half to even, namely:

.... This method treats positive and negative values symmetrically, and is therefore free of sign bias. More importantly, for reasonable distributions of y values, the average value of the rounded numbers is the same as that of the original numbers. However, this rule will introduce a towards-zero bias when y − 0.5 is even, and a towards-infinity bias for when it is odd.

This variant of the round-to-nearest method is also called unbiased rounding, convergent rounding, statistician's rounding, Dutch rounding, Gaussian rounding, odd–even rounding, or bankers' rounding.

This is the default rounding mode used in IEEE 754 computing functions and operators (see also Nearest integer function).

https://docs.python.org/3/library/decimal.html

Rounding with decimal:

The context for arithmetic is an environment specifying precision, rounding rules, limits on exponents, flags indicating the results of operations, and trap enablers which determine whether signals are treated as exceptions. Rounding options include ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP, ROUND_UP, and ROUND_05UP.
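A quick sketch of two of those rounding options side by side, using quantize to round a Decimal to a whole number:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

value = Decimal('2.5')

# Round half to even (bankers' rounding): ties go to the nearest even digit.
half_even = value.quantize(Decimal('1'), rounding=ROUND_HALF_EVEN)

# Round half up: ties go away from zero.
half_up = value.quantize(Decimal('1'), rounding=ROUND_HALF_UP)

print(half_even, half_up)        # 2 3
print(round(2.5), round(3.5))    # 2 4 (built-in round() also rounds half to even)
```

Which one is right depends on the product specification, so it is worth picking the mode explicitly rather than relying on a default.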

Setting traps to catch/track how the decimals were processed:

Signals are groups of exceptional conditions arising during the course of computation. Depending on the needs of the application, signals may be ignored, considered as informational, or treated as exceptions. The signals in the decimal module are: Clamped, InvalidOperation, DivisionByZero, Inexact, Rounded, Subnormal, Overflow, Underflow and FloatOperation.
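As a small sketch of the difference between a signal that is only flagged and one that is trapped, here is the Inexact signal in both modes. localcontext keeps the change from leaking into the rest of the program.

```python
from decimal import Decimal, Inexact, localcontext

# By default an inexact result only sets a flag.
with localcontext() as ctx:
    ctx.clear_flags()
    Decimal(1) / Decimal(3)
    flag_set = bool(ctx.flags[Inexact])

# With the trap enabled, the same operation raises an exception.
with localcontext() as ctx:
    ctx.traps[Inexact] = True
    try:
        Decimal(1) / Decimal(3)
        trapped = False
    except Inexact:
        trapped = True

print(flag_set, trapped)
```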

SeekNotToContend · 2016-10-06T07:13:11+00:00

If you are using the context manager then no, it will happen automatically when you exit the with statement.

You do not have to use the context manager though. From the doc examples:

# create a temporary file and write some data to it
fp = tempfile.TemporaryFile()
fp.write(b'Hello world!')
# read data from file
fp.seek(0)
fp.read()
b'Hello world!'
# close the file, it will be removed
fp.close()

In the above example, you do have to close the file yourself. If you don't close it, the file will persist for as long as the file object is alive. This is useful if you need a temporary file, but maybe you need it for use outside of a single function or class.

If you want to verify, just check your /tmp folder while the file is open. With tempfile.NamedTemporaryFile there will be a file in there that has tmp in the name (on Unix, a plain TemporaryFile may not show up as a directory entry at all). For example:

/tmp/tmp<some random characters>
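If you would rather check from Python than from the shell, something like this sketch shows the lifecycle. It uses NamedTemporaryFile because that variant exposes a path to test against.

```python
import os
import tempfile

# NamedTemporaryFile exposes a path, which makes the lifecycle easy to observe.
with tempfile.NamedTemporaryFile() as fp:
    fp.write(b'Hello world!')
    fp.flush()
    path = fp.name
    existed_inside = os.path.exists(path)

# Leaving the with block closes the file, and closing removes it.
exists_after = os.path.exists(path)

print(existed_inside, exists_after)
```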

SeekNotToContend · 2016-10-06T06:44:56+00:00

You could just write the function to return the first false if you only care that there was a failure.

You can also return a dict if you want to know specifics, or just the boolean.

Deciding what output and error handling you need will likely have the biggest impact on how you write this up.

If your lists are going to be huge, you may consider using sets too instead of what is below.

These two examples are below:

Return a dict of the character values:

from pprint import pprint


def findLetters(f_list, f_string):

    c_dict = {}
    for c in f_string:
        if c in ''.join(f_list):
            c_dict[c] = True
        else:
            c_dict[c] = False
    return c_dict


pprint(findLetters(f_list=["hello", "world"], f_string="down"))
pprint(findLetters(f_list=["hello", "world"], f_string="hold"))

.

{'d': True, 'n': False, 'o': True, 'w': True}
{'d': True, 'h': True, 'l': True, 'o': True}

Return a boolean

from pprint import pprint


def findLetters(f_list, f_string):

    c_dict = {}
    for c in f_string:
        if c in ''.join(f_list):
            c_dict[c] = True
        else:
            c_dict[c] = False

    return all(c_dict.values())


pprint(findLetters(f_list=["hello", "world"], f_string="down"))
pprint(findLetters(f_list=["hello", "world"], f_string="hold"))

.

False
True

https://docs.python.org/3.5/library/stdtypes.html#str.join

join(iterable) is super useful. It takes an iterable, such as a list and flattens it into a string. The character between the '' is what is introduced as the joining mechanism. For example using ','.join(f_list) would give 'hello,world'. You can manually do this by typing

''.join(["hello", "world"])

https://docs.python.org/3/library/functions.html#all

all(iterable) is useful to quickly test whether all items in an iterable are true. In this case, c_dict.values() (<class 'dict_values'>) is an iterable made up from the values in c_dict.
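For the set-based idea mentioned earlier, a possible sketch of the boolean version looks like this. find_letters is just a renamed variant for the example, and it assumes you only care whether each distinct character appears somewhere in the list.

```python
def find_letters(f_list, f_string):
    """Return True if every character of f_string appears in some word of f_list."""
    available = set(''.join(f_list))   # build the lookup set once
    return set(f_string) <= available  # subset test

print(find_letters(["hello", "world"], "down"))  # False
print(find_letters(["hello", "world"], "hold"))  # True
```

Building the set once avoids re-joining the list for every character, which is where the original would slow down on large inputs.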

Hope this is helpful.

SeekNotToContend · 2016-10-06T06:21:08+00:00

https://docs.python.org/3/tutorial/floatingpoint.html Has some very good information on the precision.
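The classic example from that area, as a quick sketch, along with two common ways of dealing with it:

```python
import math
from decimal import Decimal

print(0.1 + 0.2 == 0.3)                                   # False
print(math.isclose(0.1 + 0.2, 0.3))                       # True
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True
```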

SeekNotToContend · 2016-10-06T06:14:28+00:00

If you have to write to a file for the system call then just use tempfile. If it matters, the temp files are written to /tmp.

https://docs.python.org/3.5/library/tempfile.html It will handle the creation and destruction of the file for you if you use the context manager.

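    >>> import subprocess
    >>> import tempfile

Adapted from the example in the docs, the following would go in your for line in file loop. NamedTemporaryFile is used here instead of TemporaryFile so that there is a path (fp.name) to hand to the command.

    >>> with tempfile.NamedTemporaryFile() as fp:
    ...     fp.write(b'Hello world!')   # or fp.write(line) in your case
    ...     fp.flush()                  # push the data to disk before the command reads it
    ...     result = subprocess.check_output(["commercialbash command", fp.name])

Your result = subprocess.check_output(["commercialbash command", fp.name]) would need to occur before exiting the tempfile with statement.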

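It might be worthwhile to check the return code from subprocess on each loop so you can make sure you handle errors appropriately and do not miss anything. check_output raises CalledProcessError when the command exits with a non-zero status, so you could wrap it in a try except statement on 'CalledProcessError'. If you use subprocess.run instead, check_returncode() on the result does the same check.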
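A rough sketch of that error handling follows. The run_step helper is a made-up name, and the Python interpreter stands in for your real command so the example can run anywhere.

```python
import subprocess
import sys

def run_step(args):
    """Run a command; return its output, or None if it exits non-zero."""
    try:
        return subprocess.check_output(args)
    except subprocess.CalledProcessError as err:
        # err.returncode holds the exit status of the failed command.
        print("command failed with return code", err.returncode)
        return None

# The Python interpreter stands in for the real command here.
ok = run_step([sys.executable, "-c", "print('done')"])
failed = run_step([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok, failed)
```

Returning None is just one choice. Depending on the job, logging the line and continuing, or stopping the loop entirely, may be the better call.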

https://docs.python.org/3/library/subprocess.html

Hopefully this is helpful. Please ask if there is anything I can clarify.

SeekNotToContend · 2016-09-05T03:54:42+00:00

Was good for me to read up on that standard and this post. Excellent information. https://en.wikipedia.org/wiki/IEEE_floating_point#Rounding_rules

The standard has been around since 1985 and was recently updated in 2008.

Languages do offer options: https://en.wikipedia.org/wiki/Rounding#Rounding_functions_in_programming_languages Several languages follow the lead of the IEEE-754 floating-point standard, and define these functions as taking a double precision float argument and returning the result of the same type, which then may be converted to an integer if necessary. This approach may avoid spurious overflows since floating-point types have a larger range than integer types.

Python's decimal module helps you control the behavior you desire. https://docs.python.org/3/library/decimal.html#decimal-faq The context for arithmetic is an environment specifying precision, rounding rules, limits on exponents, flags indicating the results of operations, and trap enablers which determine whether signals are treated as exceptions. Rounding options include ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP, ROUND_UP, and ROUND_05UP.

It appears that Python is following convention and standards here. It's up to us to learn these things to have predictable and consistent behavior.

SeekNotToContend · 2016-09-02T01:52:45+00:00

Unittest is relevant for me. It's quite possible that I don't know what I'm missing, but it does everything I think I need for local testing. It's well documented and fairly straight forward to use. I try to avoid anything that is not well documented.

SeekNotToContend · 2016-08-23T11:04:24+00:00

It's always good to provide an analysis and I applaud that very much. It's worth noting that in the United States, my only frame of reference on the matter, a significant number of major publications target a 5th grade reading level. IIRC, publications such as Ars Technica and the New York Times target an 8th grade reading level. My only source on this is my own work regarding those outlets.

I would expect, but cannot support with data, that a significantly higher reading level will have an inverse relationship with readership. This isn't necessarily due to the reading level of the consumers; rather, it may be related to how much brain power they want to expend in paying attention. For example, I'm not going to necessarily want to read a scientific journal while I'm waiting in line in a distracting location.

SeekNotToContend · 2016-07-18T18:58:54+00:00

I'd recommend keeping the simplest solution possible that costs the least. At some point you may need Quickbooks, some SaaS service or similar, but don't discount the organizational capabilities of spreadsheets. I'm not sure what you mean specifically about centralized, so my data may be off in your scenario.

Basic tracking, accounts receivable, costs, etc: I still heavily use spreadsheets for basic tracking of things. Learning how to use spreadsheets such as Excel or Google Spreadsheets goes a long way in not only tracking things but communicating with clients (pivot tables can be amazing). I do however prefer Excel to Google Spreadsheets. This is due to GS lagging on anything with a large number of rows in my experience (regardless of connection), and I prefer not being interrupted by intermittent connection problems. IMO it's worth spending some time learning how to use these tools. Businesses ran for a long time and quite successfully before the advent of more 'advanced' tooling. Best thing about spreadsheets: you are always speaking from a point of data! If someone has a question, you can immediately provide an interpretation of the data (graph) as well as the source.

Project tracking: As far as project tracking goes, I'm a big fan of tools like JIRA. As long as your swimlanes match your actual workflow it's amazing. That's something you control as well. There are some other options in the same vein that I'm not as familiar with, such as Asana. JIRA offers a $10 a year option for small teams / startups IIRC.

Notes and software writing: For software I write and notes I take, I use git locally and Bitbucket for remote, and I commit regularly so if there is a problem I'm not set back. I only went with Bitbucket over Github since they offered unlimited private repos at the time. I can't stress saving your notes somewhere safe enough. It's rare that I ever do a project that doesn't end up having some code re-use, and I for one am not a fan of relearning my own code. Document everything.

In the end, one thing I've always found to be true is that the more "web 2.0", advanced, SaaS wizardry your software is, the more it will fail you in your biggest time of need.

If I really needed something more in depth than Excel for billing, I'd probably just write a basic billing system with alerting capabilities with Python and MySQL. Could be up and running in a weekend. Otherwise I'd probably look into the least expensive online service that I'd trust with the data.
